Tabulates data from random surveys, including multistage surveys and surveys with unequal probabilities of selection (S.D. Langton).
Options
| Option | Description |
|---|---|
| PRINT = string token | Controls printed output (summary, stratumsummary, psusummary, totals, means, ratios, influence, wald, quantiles, monitor); default summ, tota, infl |
| PLOT = string token | Controls which high-resolution graphs are plotted (single, separate, weights, influence, diagnostic); default *, i.e. none |
| STRATUMFACTOR = factor | Stratification factor; default *, i.e. unstratified |
| NUNITS = table, scalar or variate | Numbers of units in each STRATUMFACTOR level (for a multistage design these will be the numbers of primary sampling units) |
| SAMPLINGUNITS = factor | Factor indicating the primary sampling units; default *, i.e. single-stage design |
| NSECONDARYUNITS = table, scalar or variate | Numbers of secondary sampling units for the levels of the SAMPLINGUNITS factor |
| CLASSIFICATION = factors | Domains for which separate estimates are required |
| NINFLUENCE = scalar | Number of influential points to print; default 10 |
| MRFACTOR = identifiers | Identifiers of factors to index the sets of multiple responses in the tables |
| WEIGHTS = variate | Survey weights |
| FPCOMIT = string token | Whether to omit the finite population correction from calculation of variances (yes, no); default no |
| METHOD = string token | Method of bootstrapping (simple, sarndal); default simp |
| NBOOT = scalar | Number of bootstrap samples to use; default 0 uses a Taylor series approximation |
| SEED = scalar | Seed for random number generator for bootstrap; default 0 |
| CIPROBABILITY = scalar | The probability level for the confidence intervals; default 0.95 |
| CIMETHOD = string token | Method for forming confidence intervals (automatic, tdistribution, percentile, logit); default auto |
| PERCENTQUANTILES = scalar or variate | Percentage points for which quantiles are required; default 50 (i.e. median) |
Parameters
| Parameter | Description |
|---|---|
| Y = variates | Response data |
| X = variates | Base data for ratio estimation |
| LABELS = variates or texts | Labels for influential points |
| OUTWEIGHTS = tables | Saves weights |
| TOTALS = tables | Saves total estimates |
| SETOTALS = tables | Saves standard errors of total estimates |
| VCTOTALS = symmetric matrices | Saves variance-covariance matrix of total estimates |
| MEANS = tables or scalars | Saves mean estimates |
| SEMEANS = tables | Saves standard errors of mean estimates |
| VCMEANS = symmetric matrices | Saves variance-covariance matrix of mean estimates |
| RATIOS = tables | Saves estimates of ratios |
| SERATIOS = tables | Saves standard errors of ratios |
| VCRATIOS = symmetric matrices | Saves variance-covariance matrix of ratio estimates |
| NOBSERVATIONS = tables | Saves numbers of (non-missing) observations |
| SUMWEIGHTS = tables | Saves sums of weights |
| FITTEDVALUES = variates | Supplies fitted values for each observation |
| INFLUENCE = variates | Saves influence statistics |
| WALD = variates | Saves Wald statistics |
| QUANTILES = tables or pointers | Saves quantiles: a table for a single PERCENTQUANTILES value, or a pointer to tables for several values |
| SEQUANTILES = tables or pointers | Saves standard errors of quantiles |
| VCQUANTILES = tables or pointers | Saves variance-covariance matrix of quantiles |
| LQUANTILES = tables or pointers | Saves lower confidence limits of quantiles |
| UQUANTILES = tables or pointers | Saves upper confidence limits of quantiles |
| LTOTALS = tables | Saves lower confidence limits of totals |
| UTOTALS = tables | Saves upper confidence limits of totals |
| LMEANS = tables | Saves lower confidence limits of means |
| UMEANS = tables | Saves upper confidence limits of means |
| LRATIOS = tables | Saves lower confidence limits of ratios |
| URATIOS = tables | Saves upper confidence limits of ratios |
| CELLINFLUENCE = variates | Saves influence statistics for individual cells |
Description
The SVTABULATE procedure calculates estimates from surveys, together with the correct asymptotic standard errors, allowing for the design of the survey. In particular, information about the numbers of sampling units in the survey population is needed, and this can be supplied in one of three ways.
1. The WEIGHTS option can be used to supply weights, which will generally be the inverse of the probability of selection (pi expansion weights, Sarndal et al. 1992). This is simple, but cannot convey the full design information for multi-stage surveys.
2. The NUNITS option can be used to list the number of primary sampling units per stratum, using a table or variate with one value for each stratum. Similarly, in a two-stage design, NSECONDARYUNITS indicates the number of secondary units in each primary sampling unit.
3. The dataset can contain the full survey population, with unsampled (or non-responding) units indicated by missing values for the response variables. This allows Genstat to deduce the numbers of units without the need to supply any further information; it is thus simple to use, but is not feasible with large or complex surveys. The NUNITS option (and NSECONDARYUNITS, if appropriate) should be set to -1 to indicate that this is required.
Other information on the survey design is provided using the STRATUMFACTOR and SAMPLINGUNITS options.
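To show how the first two routes relate, the sketch below derives pi expansion weights from stratum population counts and forms the usual expansion estimate of a total. It is written in Python rather than Genstat, purely to show the arithmetic, and uses the Orkney oats figures from the Example section. The function name `expansion_weights` is hypothetical, and the calculation is the textbook stratified estimator; it is assumed, not confirmed, to match what SVTABULATE does internally.

```python
from collections import Counter

def expansion_weights(strata, n_units):
    """Weight of each sampled unit = N_h / n_h, the inverse of its selection
    probability under simple random sampling within stratum h."""
    n_sampled = Counter(strata)
    return [n_units[h] / n_sampled[h] for h in strata]

# Orkney oats data from the Example section: 3 strata, 4 sampled units each.
oats = [15, 20, 18, 18, 23, 27, 25, 60, 28, 128, 69, 72]
strata = [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3]
n_units = {1: 12, 2: 12, 3: 11}   # population units per stratum (as given to NUNITS)

weights = expansion_weights(strata, n_units)
total = sum(w * y for w, y in zip(weights, oats))

print(weights[0], weights[-1])   # 3.0 2.75
print(total)                     # 1434.75
```

Supplying these weights alone reproduces the point estimate, but the stratum counts themselves are what a design-based variance calculation needs.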
The response variable is specified using the Y parameter. Estimated counts of the number of observations can be produced by leaving the parameter unset (this is equivalent to analysing a vector of 1's). The Y parameter can also be left unset if the procedure is used to calculate survey weights. The X parameter can be set in order to produce estimates of the ratio Y/X. By default, estimates of totals, means or ratios are for the whole population, but the CLASSIFICATION option can be set to one or more factors defining subsets of the data for which estimates are required. The list of CLASSIFICATION factors can also include pointers defined using the FMFACTORS procedure, each representing a multiple response factor. SVTABULATE generates an ordinary factor to classify the dimension of the tables corresponding to each set of multiple responses. You can supply identifiers for these factors (thus allowing them to be accessed outside the procedure) using the MRFACTOR option.
The FITTEDVALUES parameter is used when estimating population totals via a model-assisted approach. Variance estimates are then calculated using the residual deviation about the fitted values. This can be used in conjunction with the SVCALIBRATE procedure to provide estimates following calibration weighting.
Output is controlled by the PRINT and PLOT options. The latter produces various plots that are useful in identifying outliers and influential points which may require further investigation. The setting single of the PLOT option produces a scatterplot of values of Y against X, whilst separate produces a separate graph for each combination of levels of the CLASSIFICATION factors (excluding multiple response factors). The graphs are log-transformed, unless negative values are present. If the log-transformation is required and zeros are present, a small constant is added first. When X is unset, both single and separate produce a scatterplot of Y against CLASSIFICATION. The weights and influence settings produce histograms of the weights and influence statistics respectively. The setting diagnostic produces a scatterplot of influence statistics against weights; this plot tends to be more informative than the histograms with large datasets.
The influence statistic for an observation is defined as the absolute percentage change in the total estimate when the observation is replaced by a missing value and the associated weight is redistributed to other units in the same stratum. When CLASSIFICATION is set, influence statistics are printed for individual cells in the table of results, as well as for the grand total. When PRINT is set to influence, details are printed of the observations with the highest influence; the number printed can be controlled by the NINFLUENCE option. By default this output is labelled by the row number of the observation, but the LABELS parameter can be used to specify more meaningful identifiers in the form of a variate, text or factor.
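The definition of the influence statistic can be sketched as follows, in Python rather than Genstat and using the Orkney oats data from the Example section. The function names are hypothetical, and one detail is an assumption: the source does not say how the dropped unit's weight is shared out, so here the remaining weights in the stratum are scaled up proportionally so that the stratum's weight sum is preserved.

```python
def total_estimate(y, w):
    """Weighted (expansion) estimate of the population total."""
    return sum(wi * yi for wi, yi in zip(w, y))

def influence(y, w, strata):
    """Absolute percentage change in the estimated total when each observation
    in turn is dropped and its weight is shared among the other units of its
    stratum (ASSUMPTION: in proportion to their existing weights)."""
    base = total_estimate(y, w)
    result = []
    for i in range(len(y)):
        same = [j for j in range(len(y)) if strata[j] == strata[i] and j != i]
        if not same:                      # nothing to redistribute to
            result.append(float("nan"))
            continue
        rest = sum(w[j] for j in same)
        scale = (rest + w[i]) / rest      # keeps the stratum weight sum fixed
        new_w = list(w)
        new_w[i] = 0.0
        for j in same:
            new_w[j] = w[j] * scale
        result.append(100.0 * abs(total_estimate(y, new_w) - base) / abs(base))
    return result

oats = [15, 20, 18, 18, 23, 27, 25, 60, 28, 128, 69, 72]
strata = [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3]
weights = [3.0] * 8 + [2.75] * 4          # N_h / n_h for N = 12, 12, 11

infl = influence(oats, weights, strata)
top = max(range(len(infl)), key=infl.__getitem__)
print(top + 1, oats[top], round(infl[top], 2))   # row number, value, influence (%)
```

Under these assumptions the most influential row is the one holding the value 128, which moves the estimated total by a little under 14%.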
The FPCOMIT option is provided so that the finite population correction (see e.g. Sarndal et al. 1992) can be omitted. This is usually done when a simplified variance estimate is produced for multistage samples by ignoring the within-cluster component of variation (the ultimate cluster approach); since this is non-conservative, the omission of the FPC is sometimes advocated to counteract this and to ensure that standard errors are appropriate. Genstat will produce the ultimate cluster results if it is only provided with the survey weights (i.e. NUNITS and NSECONDARYUNITS left unset), but this approach is not recommended since the correct analysis can be produced with little extra effort.
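The effect of dropping the finite population correction can be seen in a small sketch. It is written in Python and, for simplicity, uses a single-stage stratified design (the Orkney oats data from the Example section) rather than the multistage case discussed above. The function `se_total` and its `fpc_omit` argument are hypothetical names, and the formula is the textbook variance of a stratified expansion total, assumed rather than confirmed to be what the procedure computes.

```python
from math import sqrt
from statistics import variance

def se_total(y_by_stratum, n_units, fpc_omit=False):
    """Standard error of the stratified expansion total:
    Var = sum_h N_h^2 * (1 - n_h/N_h) * s_h^2 / n_h,
    with the factor (1 - n_h/N_h) dropped when fpc_omit is True."""
    var = 0.0
    for h, y in y_by_stratum.items():
        n, big_n = len(y), n_units[h]
        fpc = 1.0 if fpc_omit else 1.0 - n / big_n
        var += big_n ** 2 * fpc * variance(y) / n   # variance() is the sample variance
    return sqrt(var)

oats = {1: [15, 20, 18, 18], 2: [23, 27, 25, 60], 3: [28, 128, 69, 72]}
n_units = {1: 12, 2: 12, 3: 11}

se_with = se_total(oats, n_units)
se_without = se_total(oats, n_units, fpc_omit=True)
print(se_with, se_without)
```

Because each correction factor is below 1 whenever part of a stratum has been sampled, omitting it can only increase the standard error.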
Results of the analysis can be saved using the parameters TOTALS, MEANS, RATIOS and QUANTILES, with the corresponding standard errors using SETOTALS, SEMEANS, SERATIOS and SEQUANTILES. Confidence limits are saved using LTOTALS, LMEANS, LRATIOS and LQUANTILES for the lower limits, and UTOTALS, UMEANS, URATIOS and UQUANTILES for the upper limits. By default, 95% confidence limits are produced, but this may be changed using the CIPROBABILITY option. When the Y parameter is unset, TOTALS, SETOTALS, LTOTALS and UTOTALS contain estimated counts of observations. Numbers of (non-missing) observations and the sums of the weights can be saved using the NOBSERVATIONS and SUMWEIGHTS parameters. These are set to tables classified by the CLASSIFICATION factors; if CLASSIFICATION is unset, they are set to a table with a single cell labelled 'All data'. The OUTWEIGHTS and INFLUENCE parameters allow you to save the weights and influences, respectively. CELLINFLUENCE saves the influence statistics with respect to the individual cells in the table of results, as opposed to the influence statistics for the grand total, which are saved by the INFLUENCE parameter. The WALD parameter can be used to save Wald statistics comparing means between the different levels of the CLASSIFICATION factors.
The simplest quantile, and the one produced by default, is the median (50% quantile), but the PERCENTQUANTILES option allows you to request any percentage point between 1 and 99. Moreover, by specifying a variate as the setting for PERCENTQUANTILES, you can obtain several quantiles at the same time. However, if you then want to save the results, the setting of the QUANTILES parameter must be a pointer with length equal to the required number of quantiles, instead of a single table.
By default, standard errors and confidence limits are based on Taylor-series approximations. However, bootstrap standard errors can be obtained by setting the NBOOT option to the desired number of bootstrap samples. For exploratory analyses a relatively low value (perhaps 20) may suffice, but where test statistics or confidence limits are required a value of at least 400 is recommended. The CIMETHOD option controls how the confidence limits are formed:

| Setting | How the limits are formed |
|---|---|
| percentile | uses simple percentiles of the bootstrapped distribution; |
| tdistribution | calculates a standard error from the bootstrapped estimates and then uses the t-distribution to form intervals; |
| logit | is for proportions, and ensures that the calculated limits lie between 0 and 1 (see Heeringa et al. 2010); |
| automatic | uses the percentile method when at least 400 bootstrap samples have been used, otherwise it uses the t-distribution method when Y is set, and the logit method when Y is not set. |

The default is CIMETHOD=automatic.
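The automatic rule and the idea behind the logit setting can be sketched as follows (in Python, with hypothetical function names). The interval is formed on the logit scale using a delta-method standard error and then back-transformed, which is why the limits cannot leave the range 0 to 1. One assumption to note: a normal quantile is used here for simplicity, whereas a t quantile with design-based degrees of freedom could equally be substituted, and the source does not say which the procedure uses.

```python
from math import exp, log
from statistics import NormalDist

def choose_ci_method(nboot, y_is_set):
    """The CIMETHOD=automatic rule as described above."""
    if nboot >= 400:
        return "percentile"
    return "tdistribution" if y_is_set else "logit"

def logit_interval(p, se, prob=0.95):
    """Confidence limits for a proportion p (0 < p < 1) with standard error se,
    formed on the logit scale and back-transformed.
    ASSUMPTION: normal quantile rather than a t quantile."""
    z = NormalDist().inv_cdf(0.5 + prob / 2.0)
    centre = log(p / (1.0 - p))
    half = z * se / (p * (1.0 - p))       # delta-method s.e. of logit(p), times z
    lower, upper = centre - half, centre + half
    return 1.0 / (1.0 + exp(-lower)), 1.0 / (1.0 + exp(-upper))

# A small proportion, where a symmetric interval p +/- z*se would go below 0.
lo, hi = logit_interval(0.02, 0.015)
print(choose_ci_method(0, False), lo, hi)
```

The resulting interval is asymmetric about the estimate, with the longer arm pointing away from the nearer boundary.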
The bootstrapping method is selected using the METHOD option. In a one-stage design the default, simple, forms each bootstrap sample by sampling with replacement from the original sample within each stratum. In a two-stage design (i.e. if SAMPLINGUNITS is set), primary sampling units are first sampled with replacement, and then secondary units are sampled with replacement within the selected primary units. Variance estimates from the bootstrapping process will be biased where there are very few sampling units in each stratum, and so the method is not recommended in this situation. The setting METHOD=sarndal constructs a "pseudo-population" by replicating each sampled unit by the rounded value of its weight, so that, for example, an observation with weight 16.1 is represented sixteen times in the pseudo-population (see Sarndal et al. 1992, page 442). The bootstrap sample is formed by sampling with replacement from this pseudo-population. The SEED option provides a seed for the random sampling.
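The two resampling schemes for a one-stage design can be sketched as below, in Python with hypothetical function names. Two details are assumptions not stated in the source: the bootstrap sample drawn from the pseudo-population is taken to have the same size as the original sample, and rounding uses Python's `round`, which sends exact halves to the nearest even integer.

```python
import random

def simple_bootstrap(indices_by_stratum, rng):
    """METHOD=simple, one-stage case: resample unit indices with replacement
    independently within each stratum, keeping each stratum's sample size."""
    sample = []
    for units in indices_by_stratum.values():
        sample.extend(rng.choices(units, k=len(units)))
    return sample

def pseudo_population(weights):
    """METHOD=sarndal: unit i appears round(w_i) times in the pseudo-population."""
    population = []
    for i, w in enumerate(weights):
        population.extend([i] * round(w))
    return population

def sarndal_bootstrap(weights, n, rng):
    """Draw n unit indices with replacement from the pseudo-population.
    ASSUMPTION: n is the original sample size."""
    return rng.choices(pseudo_population(weights), k=n)

rng = random.Random(12345)                # plays the role of the SEED option
weights = [16.1, 3.0, 2.75]
pop = pseudo_population(weights)
print(pop.count(0), len(pop))             # 16 22

strata_units = {1: [0, 1, 2, 3], 2: [4, 5, 6, 7]}
simple_sample = simple_bootstrap(strata_units, rng)
sarndal_sample = sarndal_bootstrap(weights, len(weights), rng)
print(simple_sample, sarndal_sample)
```

Each bootstrap sample would then be analysed in the same way as the original data, and the spread of the resulting estimates used for standard errors or percentile limits.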
Options: PRINT, PLOT, STRATUMFACTOR, NUNITS, SAMPLINGUNITS, NSECONDARYUNITS, CLASSIFICATION, NINFLUENCE, MRFACTOR, WEIGHTS, FPCOMIT, METHOD, NBOOT, SEED, CIPROBABILITY, CIMETHOD, PERCENTQUANTILES.
Parameters: Y, X, LABELS, OUTWEIGHTS, TOTALS, SETOTALS, VCTOTALS, MEANS, SEMEANS, VCMEANS, RATIOS, SERATIOS, VCRATIOS, NOBSERVATIONS, SUMWEIGHTS, FITTEDVALUES, INFLUENCE, WALD, QUANTILES, SEQUANTILES, VCQUANTILES, LQUANTILES, UQUANTILES, LTOTALS, UTOTALS, LMEANS, UMEANS, LRATIOS, URATIOS, CELLINFLUENCE.
Method
The procedure uses the methods for survey analysis described in most survey analysis textbooks; Sarndal et al. (1992) give the best account of these for the case where weights vary within a stratum or sampling unit. If the dataset contains the full population, as opposed to just sampled or responding units, the options NUNITS and/or NSECONDARYUNITS can be set to -1, in which case the procedure calculates the numbers using TABULATE.
When bootstrapping is used, bootstrap samples are formed using the SVBOOT procedure.
Action with RESTRICT
Restrictions of the Y variate or any of the CLASSIFICATION factors are used to define a subpopulation, and the estimates produced relate to that subpopulation. Any restrictions on SAMPLINGUNITS, STRATUMFACTOR or WEIGHTS are ignored.
References
Heeringa, S.G., West, B.T. & Berglund, P.A. (2010). Applied Survey Data Analysis. CRC Press, Boca Raton.
Lehtonen, R. & Pahkinen, E.J. (1994). Practical Methods for Design and Analysis of Complex Surveys. Wiley, Chichester.
Sarndal, C., Swensson, B. & Wretman, J. (1992). Model Assisted Survey Sampling. Springer-Verlag, New York.
See also
Procedures: SVBOOT, SVCALIBRATE, SVGLM, SVHOTDECK, SVREWEIGHT, SVSAMPLE, SVSTRATIFIED, SVWEIGHT.
Commands for: Survey analysis.
Example
CAPTION 'SVTABULATE example 1',!t(\
  'Orkney oats data (Sampford, Table 5.1, page 61).',\
  'Stratified random survey.'); STYLE=meta,plain
VARIATE Oats
READ Oats
15 20 18 18 23 27 25 60 28 128 69 72 :
FACTOR [LEVELS=3; VALUES=4(1,2,3)] Stratum
TABLE [CLASS=Stratum; VALUES=12,12,11] N
SVTABULATE [PRINT=summary,totals,influence; STRATUM=Stratum; NUNITS=N;\
  NINFLUENCE=10; FPCOMIT=no] Y=Oats

CAPTION 'SVTABULATE example 2',!t(\
  'Province 91 data (Lehtonen & Pahkinen, Table 3.7, page 88).',\
  'Two stage sample.'); STYLE=meta,plain
VARIATE UE91; DECIMALS=0
FACTOR CLU; DECIMALS=0
READ [SERIAL=yes] CLU,UE91
2 3 3 2 4 7 4 7 :
760 767 142 187 94 262 98 219 :
VARIATE [VALUES=4(4)] nsu
SVTABULATE [PRINT=summary,totals,influence; SAMPLINGUNITS=CLU; NUNITS=8;\
  NSECONDARYUNITS=nsu; NINFLUENCE=10; FPCOMIT=no] Y=UE91

CAPTION 'SVTABULATE example 3',!t(\
  'Province 91 data (Lehtonen & Pahkinen, Table 3.5, page 83).',\
  'One stage cluster sample, using full population format.');\
  STYLE=meta,plain
VARIATE UE91; DECIMALS=0
FACTOR CLU; DECIMALS=0
READ [SERIAL=yes] CLU,UE91
1 2 2 2 3 5 3 5 6 7 4 4 8 8 3 5 1 2 4 5 6 6 7 1 7 8 4 3 1 6 7 8 :
* 666 528 760 * * * * * * * * 129 128 * * * 187 * * * * * * * 331 * * * * * 568 :
SVTABULATE [PRINT=summary,totals,influence; SAMPLINGUNITS=CLU; NUNITS=-1;\
  NSECONDARYUNITS=-1; NINFLUENCE=10; FPCOMIT=no; METHOD=csimple] \
  Y=UE91