Methods of conducting statistical surveys | Methodology and Registers | GUS - Portal Informacyjny

Methods of conducting statistical surveys

Which surveys are conducted by Statistics Poland?

What are the methods of conducting statistical surveys? (including representative method)

Statistical surveys conducted by Statistics Poland are included in the Statistical Surveys Programme of Official Statistics, which the Council of Ministers establishes annually by regulation (link to the English version of the page: http://bip.stat.gov.pl/dzialalnosc-statystyki-publicznej/program-badan-statystycznych/).

Statistical surveys of official statistics differ from other statistical surveys in terms of their characteristics mentioned in the Act on Official Statistics, such as:

  • data collected in statistical surveys of official statistics are individual data covered by confidentiality provisions (statistical confidentiality); they therefore cannot be shared or published, but can be used for statistical studies, calculations and analyses, or to create a sampling frame for statistical surveys;
  • results of those surveys are considered as official statistics;
  • access to output statistical information, i.e. calculations, studies and analyses, is equal, equivalent and simultaneous.

Statistical surveys can be divided, depending on the number of units under observation covered by the survey, into censuses and partial surveys.

Census: it covers the entire general population, i.e. each unit of the population is covered by the survey.

Partial survey: it covers only selected units of the population (a sample), i.e. only a part of the general population is observed. Partial statistical surveys of official statistics include surveys with random sample selection (representative surveys), in which each sampling unit has a known probability of being included in the sample, and surveys with non-random selection (purposive sampling).

The basic census conducted by official statistics is the population and housing census. It provides information necessary to determine the state and structure of the phenomenon at a precisely defined moment; it is a special survey carried out usually every 10 years, based on collecting information on appropriately prepared census forms.

A representative survey means that, in order to examine the characteristics of the entire population, only a certain group of units is selected for the survey: a random sample from the general population. Probability theory makes it possible to determine the magnitude of the error made when generalizing the results from a random sample to the entire population.
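
As a minimal sketch of how probability theory quantifies this error, the Python example below draws a simple random sample without replacement and estimates the standard error of the sample mean. The population values, the function name and the 1.96 multiplier (an approximate 95% confidence level) are illustrative assumptions, not part of any Statistics Poland procedure.

```python
import math
import random
import statistics

def srs_mean_with_error(population, n, seed=0):
    """Estimate the population mean from a simple random sample of size n."""
    rng = random.Random(seed)
    sample = rng.sample(population, n)      # sampling without replacement
    N = len(population)
    mean = statistics.fmean(sample)
    s2 = statistics.variance(sample)        # sample variance (n - 1 denominator)
    fpc = 1 - n / N                         # finite population correction
    se = math.sqrt(fpc * s2 / n)            # estimated standard error of the mean
    return mean, se

# Hypothetical population: the values 1..1000
population = [float(x) for x in range(1, 1001)]
mean, se = srs_mean_with_error(population, n=100)
interval = (mean - 1.96 * se, mean + 1.96 * se)   # approximate 95% confidence interval
```

When the sample covers the whole population (n = N), the finite population correction is zero and so is the estimated standard error, which matches the census case described earlier.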

There are several stages of carrying out statistical surveys:

  • survey planning stage, including setting objectives of the survey;
  • sampling and data collection stage;
  • data processing and analysis stage;
  • data sharing stage, publishing and dissemination of output statistical information.

At each stage specific research methods are used.

Due to the data collection methods, surveys can be divided into:

  • surveys based on electronic forms: CAPI (Computer-Assisted Personal Interview); telephone interview - interview conducted by telephone by a statistical interviewer; CATI (Computer-Assisted Telephone Interview); CAWI or CAII (Computer-Assisted Web Interview or Computer-Assisted Internet Interview) - online survey supervised by computer system; surveys based on forms obtained through the Statistics Poland Reporting Portal;
  • surveys based on paper forms: PAPI (Paper-and-Pencil Interview);
  • surveys based on administrative registers.

In the case of representative surveys, the preparatory phase related to sampling consists of determining the sampling strategy. We define the sampling scheme and the sampling frame. As part of defining the sampling frame, we determine the sampling units and the scope of the sampling frame. By defining the sampling scheme, we determine the target population and the probabilities associated with possible samples.

A sampling strategy is a combination of a sampling scheme and specific estimators. The choice of strategy involves considering the estimators that can be constructed. Important aspects of the strategy are the systematic error, the variance and the mean square error (MSE) of the estimator. Systematic error (bias) occurs when the sampling frame and the target population do not completely match, when the estimator does not match the type of sampling scheme, or when there is non-random non-response. Efforts should be made to ensure that the estimator is accurate and precise, i.e. that both the variance and the mean square error are small.
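
The three quantities named above are linked by a standard identity, shown here as a reminder. The symbols are introduced only for illustration: θ denotes the population parameter and θ̂ its estimator.

```latex
\mathrm{MSE}(\hat{\theta})
  = \mathrm{E}\big[(\hat{\theta} - \theta)^2\big]
  = \mathrm{Var}(\hat{\theta}) + \big[\mathrm{Bias}(\hat{\theta})\big]^2,
\qquad
\mathrm{Bias}(\hat{\theta}) = \mathrm{E}(\hat{\theta}) - \theta .
```

For an unbiased estimator the mean square error equals the variance, so a small variance alone does not guarantee a small MSE when bias is present.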

In order to improve the organization of the survey and facilitate the work of interviewers, a two-stage sampling scheme is used for direct interviews (CAPI or PAPI): first, first-stage units are drawn, each of which contains a certain set of second-stage units, and then samples of second-stage units are drawn from the previously selected first-stage units. This is used, for example, in social surveys: the first-stage units are census clusters, enumeration areas or statistical areas, and the second-stage units are dwellings. This variant of the sampling scheme may be used for economic reasons, but it may reduce the precision of the survey results compared to a single-stage sampling scheme.
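
The two-stage scheme can be sketched in a few lines of Python. This is a simplified illustration only: it assumes that first-stage units are drawn with equal probabilities and that a fixed number of dwellings is drawn in each selected area, whereas real surveys often draw first-stage units with probabilities proportional to their size. The frame, identifiers and function name are hypothetical.

```python
import random

def two_stage_sample(areas, n_areas, n_dwellings, seed=0):
    """Draw areas first, then dwellings within each selected area.

    areas: dict mapping an area id to the list of dwelling ids it contains.
    """
    rng = random.Random(seed)
    # Stage 1: simple random sample of first-stage units (areas)
    selected_areas = rng.sample(sorted(areas), n_areas)
    sample = {}
    for area in selected_areas:
        dwellings = areas[area]
        # Stage 2: simple random sample of dwellings within the selected area
        k = min(n_dwellings, len(dwellings))
        sample[area] = rng.sample(dwellings, k)
    return sample

# Hypothetical frame: 20 areas with 30 dwellings each
frame = {
    f"area_{a:02d}": [f"area_{a:02d}/dw_{d:02d}" for d in range(30)]
    for a in range(20)
}
sample = two_stage_sample(frame, n_areas=5, n_dwellings=6)
```

Interviewers then visit only the five selected areas, which is what makes the scheme cheaper to run than a single-stage sample spread over all twenty.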

When choosing a sampling scheme and an estimation method, the time dimension should be taken into account, i.e. the fact that the survey is repeated according to a certain time pattern. Various sampling strategies can be adopted: the two extreme approaches are to draw a different sample each time or to use the same sample every time. Between these two extremes lie various survey patterns, the choice of which depends on the objectives of the survey.

After determining the sampling scheme and estimators, the sample size should be determined. Two aspects must be taken into account: costs and precision, and how they depend on each other. Generally, both precision and costs increase as the sample size increases.
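
As an illustration of the trade-off, the sketch below uses the textbook sample-size formula for estimating a mean under simple random sampling without replacement. The function name, the assumed standard deviation S, the margin of error e and the multiplier z = 1.96 are all illustrative assumptions; sample sizes for actual surveys depend on the full sampling scheme and on the budget.

```python
import math

def srs_sample_size(N, S, e, z=1.96):
    """Sample size needed to estimate a mean with margin of error e.

    N: population size, S: assumed standard deviation of the variable,
    e: acceptable margin of error, z: normal quantile (1.96 for about 95%).
    """
    n0 = (z * S / e) ** 2          # size required for an infinite population
    n = n0 / (1 + n0 / N)          # correction for a finite population
    return math.ceil(n)

n_wide = srs_sample_size(N=10_000, S=50, e=5)      # 370
n_narrow = srs_sample_size(N=10_000, S=50, e=2.5)  # halving e needs a much larger sample
```

Halving the margin of error roughly quadruples the required sample for large populations, which is why precision targets and costs have to be set together.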

After determining the sample size, we usually proceed to stratification of the sampling frame. Stratified sampling is widely used. In this method, we divide the population into strata that can be treated as separate subpopulations, defining strategies for them separately and selecting samples independently. At the beginning of stratification, we determine the characteristics according to which it will be carried out, which depends on the purposes of stratification. Possible goals are:

  • increasing precision,
  • creating estimates for separate strata or for subpopulations consisting of more than one stratum,
  • more efficient planning of field work,
  • use of different sampling frames for different parts of the population.

The purpose of stratification is to divide a diverse population into groups of units that are as homogeneous as possible, so that each of these groups is appropriately represented in the sample. This is particularly important in highly heterogeneous populations. The strata should differ from one another as much as possible and be internally homogeneous. We stratify in such a way that the final strata are disjoint and cover the entire population, i.e. each population unit belongs to one and only one stratum. One of the most important issues when using a stratified sampling scheme is the so-called sample allocation, i.e. the distribution of the sample across the individual strata: it must be specified how many units from each stratum are to be selected for the sample. We can distinguish the following solutions: proportional allocation, Neyman (optimal) allocation, uniform allocation.
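
The three allocation methods can be compared with a short Python sketch. The stratum sizes, the assumed within-stratum standard deviations and the function names are hypothetical, and the functions return unrounded values; rounding and minimum stratum sample sizes are left out for brevity.

```python
def proportional_allocation(n, N_h):
    """Allocate n in proportion to the stratum population sizes N_h."""
    total = sum(N_h.values())
    return {h: n * size / total for h, size in N_h.items()}

def neyman_allocation(n, N_h, S_h):
    """Allocate n in proportion to N_h * S_h (size times standard deviation)."""
    total = sum(N_h[h] * S_h[h] for h in N_h)
    return {h: n * N_h[h] * S_h[h] / total for h in N_h}

def uniform_allocation(n, N_h):
    """Allocate the same sample size to every stratum."""
    return {h: n / len(N_h) for h in N_h}

# Hypothetical strata of enterprises by size class
N_h = {"small": 8000, "medium": 1500, "large": 500}
S_h = {"small": 10.0, "medium": 40.0, "large": 200.0}   # assumed standard deviations

prop = proportional_allocation(1000, N_h)    # small: 800, medium: 150, large: 50
neyman = neyman_allocation(1000, N_h, S_h)   # shifts sample towards the variable "large" stratum
unif = uniform_allocation(1000, N_h)
```

Neyman allocation assigns more units to strata that are large or internally variable, which is why it needs prior information on within-stratum variability, whereas proportional allocation needs only the stratum sizes.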

These and other concepts used in official statistics are defined in the Glossary of terms available at:

Statistics Poland / Metainformation / Glossary / Terms used in official statistics.

After drawing the sample, the stage of data collection begins and then generalization of the results, i.e. generalizing information from the sample level to the population. This is achieved using the methods of weighting (assigning weights to units from the sample before the survey) and reweighting (weighting after conducting the survey, correction of weights). Weighting methods are also used to deal with missing values and achieve consistency with data from other sources - they can reduce bias and increase precision by using additional information.

When generalizing results from the sample to the population, the probabilities that a given unit is included in the sample under the established sampling scheme, the so-called inclusion probabilities (π_h), are of crucial importance. The inverse of the inclusion probability (1/π_h) is called the sampling weight, and assigning weights to the units under observation is called weighting. In stratified sampling in which units from each stratum are drawn by simple random sampling without replacement, π_h = n_h/N_h, so the sampling weight for each stratum is given by a simple formula: a_h = N_h/n_h (where N_h is the population size and n_h is the sample size in stratum h). If non-response occurs, the sampling weights must be modified. The modified weights are the inverse of the product of the inclusion probability (π_h) and the response probability. The response probability is estimated from the available data. The simplest way is to use so-called completeness indicators, i.e. the number of units actually examined divided by the number of units that should have been examined, where the latter is the planned number of units in the sample less the sampled units that turned out to be outside the scope of the survey.
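
The weight calculation described above can be sketched for a single stratum as follows. The function names and the numbers in the example are hypothetical; the completeness indicator is used as the estimate of the response probability, as in the simplest approach mentioned in the text.

```python
def design_weight(N_h, n_h):
    """Sampling weight for stratum h: inverse of the inclusion probability n_h / N_h."""
    return N_h / n_h

def completeness_indicator(respondents, n_h, out_of_scope=0):
    """Units examined divided by units that should have been examined."""
    return respondents / (n_h - out_of_scope)

def adjusted_weight(N_h, n_h, respondents, out_of_scope=0):
    """Weight corrected for non-response: 1 / (inclusion prob. * response prob.)."""
    pi_h = n_h / N_h
    r_h = completeness_indicator(respondents, n_h, out_of_scope)
    return 1 / (pi_h * r_h)

# Hypothetical stratum: 5000 units in the frame, 100 sampled,
# 4 found to be outside the scope of the survey, 72 responding
w_design = design_weight(5000, 100)                            # 50.0
w_adjusted = adjusted_weight(5000, 100, 72, out_of_scope=4)    # larger than the design weight
```

In this example the 72 respondents together represent about 4800 units, i.e. the estimated in-scope part of the stratum (5000 × 96/100), rather than all 5000 units in the frame.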

The stage of processing and analysis of the collected data includes data editing, imputation, estimation, integration and analysis. Editing data, simply put, means checking data to detect errors. First, data completeness is checked, i.e. whether answers to all the questions asked were obtained for all observations. Then, data validation can be performed, i.e. determining whether the responses collected are possible or acceptable; for this purpose, ranges of acceptable values are used, for example. We further examine whether the relationships between the data are acceptable by checking the proportions between variables and the correctness of arithmetic calculations, such as whether component variables add up to the total. We distinguish two types of missing values: a surveyed unit may give no answer at all (unit non-response), or answers to individual questions may be missing (item non-response). In the first case, we deal with the problem using data weighting methods, while in the second case, we use data imputation methods. Imputation means filling in missing data. There are many methods of imputation.
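
The editing checks described above (completeness, range validation and arithmetic consistency) can be illustrated with a small Python function. The field names, rules and error labels are hypothetical and are not taken from any Statistics Poland questionnaire.

```python
def edit_record(record, required, ranges, sum_rules):
    """Return a list of edit failures found in one survey record."""
    errors = []
    # Completeness: every required field must have an answer
    for field in required:
        if record.get(field) is None:
            errors.append(f"missing:{field}")
    # Validation: answered values must lie within the acceptable range
    for field, (low, high) in ranges.items():
        value = record.get(field)
        if value is not None and not (low <= value <= high):
            errors.append(f"out_of_range:{field}")
    # Arithmetic consistency: components must add up to the reported total
    for total_field, parts in sum_rules.items():
        total = record.get(total_field)
        values = [record.get(p) for p in parts]
        if total is not None and all(v is not None for v in values):
            if sum(values) != total:
                errors.append(f"sum_mismatch:{total_field}")
    return errors

record = {"employees_men": 12, "employees_women": 9,
          "employees_total": 25, "revenue": None}
errors = edit_record(
    record,
    required=["employees_total", "revenue"],
    ranges={"employees_total": (0, 10_000)},
    sum_rules={"employees_total": ["employees_men", "employees_women"]},
)
# errors == ["missing:revenue", "sum_mismatch:employees_total"]
```

A record that fails such checks is typically corrected, followed up with the respondent or passed on to imputation; the sketch only detects the problems.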

When using imputation, one should bear in mind that the assigned values are only artificially introduced substitutes for the answers.
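
One simple method, shown here only as an example, is class-mean imputation: a missing value is replaced by the mean of the observed values in the same class. The sketch below also flags every imputed value so that substitutes can later be told apart from real answers. The variable names and data are hypothetical, and this is not presented as the method used in any particular survey.

```python
from statistics import fmean

def impute_class_mean(records, variable, class_var):
    """Fill missing values of `variable` with the observed mean of the same class."""
    observed = {}
    for r in records:
        if r[variable] is not None:
            observed.setdefault(r[class_var], []).append(r[variable])
    class_means = {c: fmean(values) for c, values in observed.items()}

    result = []
    for r in records:
        new = dict(r)                      # leave the input records unchanged
        can_fill = r[variable] is None and r[class_var] in class_means
        new["imputed"] = can_fill          # flag marks artificially introduced values
        if can_fill:
            new[variable] = class_means[r[class_var]]
        result.append(new)
    return result

records = [
    {"area": "urban", "income": 3000},
    {"area": "urban", "income": 5000},
    {"area": "urban", "income": None},
    {"area": "rural", "income": 2000},
    {"area": "rural", "income": None},
]
completed = impute_class_mean(records, "income", "area")
# The missing urban income becomes 4000.0 and the missing rural income 2000.0
```

Mean imputation reduces the apparent variability of the variable, which is one reason why imputed values should be flagged and treated with care in later analysis.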

Each sample survey involves the need to determine the precision of the estimates. A precision estimator may take the form of a function of the data collected in the survey and the adopted sampling scheme. In simpler cases, we can estimate the variance of the estimator by substituting survey data into specific analytical formulas. In more complex cases, i.e. where we deal with a more complicated estimator (e.g. quantile thresholds or the Gini index), it may be necessary to use so-called resampling methods, in which subsamples are repeatedly drawn from the set of sample elements and the parameters of the distribution of the estimator (e.g. its variance) are estimated from the variation of the estimator across these subsamples. The most commonly used resampling method for variance estimation is the bootstrap method.
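
A minimal sketch of the bootstrap idea for the Gini index is given below. It assumes a simple random sample with equal weights and resamples individual units with replacement; bootstrap variants used for complex designs (strata, clusters, unequal weights) are more involved. The income data, the number of replicates and the function names are illustrative assumptions.

```python
import random
import statistics

def gini(values):
    """Gini index of non-negative values (unweighted; total must be positive)."""
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    ranked_sum = sum((i + 1) * x for i, x in enumerate(xs))
    return 2 * ranked_sum / (n * total) - (n + 1) / n

def bootstrap_variance(sample, estimator, replicates=500, seed=0):
    """Estimate the variance of `estimator` by resampling with replacement."""
    rng = random.Random(seed)
    n = len(sample)
    estimates = [estimator(rng.choices(sample, k=n)) for _ in range(replicates)]
    return statistics.variance(estimates)

# Hypothetical sample of 200 strictly positive incomes
data_rng = random.Random(1)
incomes = [data_rng.lognormvariate(8.0, 0.6) for _ in range(200)]

g = gini(incomes)
var_g = bootstrap_variance(incomes, gini)
se_g = var_g ** 0.5     # bootstrap standard error of the Gini estimate
```

The same `bootstrap_variance` function works for any estimator that takes a list of values, for example a median or another quantile, which is the main attraction of resampling when no analytical variance formula is available.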

There is quite an extensive literature describing sample survey techniques and methods for assessing their precision. Example manuals are:

  • Cochran, W.G. (1977), Sampling Techniques, 3rd ed., Wiley, New York
  • Särndal, C.E., Swensson, B., Wretman, J. (1992), Model Assisted Survey Sampling, Springer-Verlag, New York
  • Bracha, Cz. (1996), Teoretyczne podstawy metody reprezentacyjnej, PWN, Warszawa.

Statistics Poland carries out about 30 representative surveys and the most important ones are:

  • HBS: Household Budget Survey;
  • LFS: Labour Force Survey;
  • EU-SILC: European Union - Community Statistics on Income and Living Conditions.

The above surveys belong to the group of social surveys. Other surveys conducted by Statistics Poland include agricultural and enterprise surveys.

The most important sample surveys in the area of agriculture include:

  • Farm structure survey (R-SGR);
  • The population survey of cattle, sheep, poultry and livestock production (R-ZW-B);
  • Survey of pig population and pork production (R-ZW-S).

Examples of sample surveys belonging to the group of enterprise surveys include:

  • Report on the economic activity of enterprises (SP-3);
  • Quarterly survey on revenues in commercial enterprises (H-01/k);
  • Job Vacancy Survey (Z-05);
  • Structure of Earnings Survey (Z-12).

The most important elements related to the sampling process and the generalization of survey results, e.g. information on the sampling frame, sampling scheme, sample selection methodology and methodology for generalizing results, are included in the Statistics Poland publications on a given survey (usually in the chapter entitled ‘Methodological notes’). Electronic versions of the publications are available on the Statistics Poland Information Portal. Detailed information on specific survey topics is also presented there in the form of so-called methodological reports, available at:

Statistics Poland / Metainformation / Methodological reports.

Samples for most representative surveys are drawn once a year, but there are also surveys for which the sample is drawn four times a year or every few years.

The most time-consuming surveys for the network of statistical interviewers are:

  • HBS;
  • LFS;
  • Survey on participation of Polish residents in travel.

These are surveys carried out in all weeks of a given year/quarter.

Most surveys use sample rotation, i.e. in a given year, in addition to ‘new’ addresses, addresses drawn in previous years are also included in the sample.
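
Rotation can be illustrated with a small Python sketch. It assumes a purely hypothetical pattern in which a new subsample of equal size is drawn every year and each subsample stays in the survey for four consecutive years; actual rotation patterns differ between surveys.

```python
def subsamples_surveyed(year, rotation_length=4):
    """Return the draw years of the subsamples surveyed in a given year."""
    return [year - age for age in range(rotation_length)]

def overlap_share(year, rotation_length=4):
    """Share of this year's subsamples that were also surveyed last year."""
    current = set(subsamples_surveyed(year, rotation_length))
    previous = set(subsamples_surveyed(year - 1, rotation_length))
    return len(current & previous) / rotation_length

in_2024 = subsamples_surveyed(2024)    # [2024, 2023, 2022, 2021]
share = overlap_share(2024)            # 0.75
```

Under this assumed pattern, three quarters of the addresses are common to two consecutive years, which helps measure change over time, while one quarter is new each year.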

We distinguish primary and reserve samples. The reserve sample is used if no response is received from the unit in the primary sample. As a consequence, there are more responses, but the bias in the results may increase. Reserve samples are currently being drawn for the following surveys:

  • HBS;
  • EU-SILC;
  • Survey on information society;
  • Survey on participation of Polish residents in travel.

Own elaboration: Statistics Poland, Department for Innovation