Performs hot-deck and model-based imputation for survey data (S.D. Langton).
Options
PRINT = string token | Controls printed output (summary, monitoring, check, list, regression); default summ
---|---
METHOD = string token | Imputation method (hotdeck, modelbased); default hotd
DMETHOD = string token | Method for calculating distances (mean, minimax, regression); default mini
%THRESHOLD = scalar | Percentage threshold for matches
THRESHOLD = scalar | Absolute threshold for matches
DVARIABLES = variates or factors | Variates or factors to use for the distance calculation
DRANGES = scalars | Ranges to use for distance calculations with each of the DVARIABLES; default * uses the observed range
LABELS = variate, factor or text | Provides labels for the cases
SEED = scalar | Seed for random numbers; default 0
IMPUTE = variate or scalar | The variate provides logical (0 or 1) values to indicate whether each unit is to be imputed; alternatively, the scalar specifies a number of rows to be selected at random to be imputed, to allow the effectiveness of the imputation process to be studied; default * imputes values for any units where an OLDSTRUCTURE contains a missing value
DONORS = variate | Logical variate indicating whether each unit can be used as a donor; default * implies that all units with complete data for each OLDSTRUCTURE are used
RSAVE = rsave | Regression analysis to use for METHOD=modelbased or DMETHOD=regression
URECEPTORS = variate | Saves unit numbers of receptor (imputed) cases
UDONORS = variate | Saves unit numbers of donor cases
DISTANCES = variate | Saves the distances for the chosen receptor-donor pairs
Parameters
OLDSTRUCTURE = variates or factors | Structure containing missing values
---|---
NEWSTRUCTURE = variates or factors | New structures with imputed values
OVERWRITE = string tokens | Whether to overwrite any existing data for imputed cases (yes, no); default no
Description
Survey data frequently contain missing values. When all the information is missing for a sample unit it is generally appropriate to allow for this by modifying the weights, but when only certain variables are missing (item non-response) imputation is often used to fill in the missing values. SVHOTDECK performs “hot-deck” imputation (see for example Korn & Graubard 1999), whereby replacement values are taken from another unit, chosen at random, usually from a list of suitable matches determined by a distance metric. The procedure can also be used for model-based imputation; in this case the imputed value is taken as the sum of the fitted value from a regression model and a residual chosen at random from another unit. In the description below, “donor” means a unit supplying data to a “receptor” that initially has a missing value.
The data are usually supplied by the OLDSTRUCTURE parameter, in variates and/or factors, containing missing values. The NEWSTRUCTURE parameter supplies new variates or factors to contain the values of each OLDSTRUCTURE variate or factor, but with the missing values replaced by the imputed values. By default, imputation is carried out for any row of data where an OLDSTRUCTURE contains missing values. Alternatively, the rows to be imputed can be specified by setting option IMPUTE. This can supply a logical variate, containing the value one in the units whose values are to be imputed and zero elsewhere, or it can supply a scalar specifying a number of rows to be selected at random to be imputed. The scalar setting is useful if you want to study the effectiveness of the imputation process.
By default, imputed values will be used only to replace the missing values in each OLDSTRUCTURE, unless the corresponding setting of the OVERWRITE parameter is yes. Imputed values are then inserted even if the original value is not missing. This would allow you, for example, to compare real and imputed data in order to check the efficiency of the imputation process. Alternatively, you might set OVERWRITE=yes for every OLDSTRUCTURE in order to preserve the correlations between the variables by taking all the values from each donor.
By default, any row of OLDSTRUCTURE with no missing values may be used as a donor, unless option DONORS is used to specify a logical variate to indicate the rows that are to act as potential donors.
The DVARIABLES option is used to supply one or more variables to use to determine the matching between donors and receptors. In the simplest case, if you set DVARIABLES to a single factor, the donor for each receptor is selected at random from the potential donors with the same factor value (e.g. to replace observations by others from the same stratum). For more complex matching, DVARIABLES can be set to a list of variates or factors which are then used to determine a distance between each receptor and the potential donors. By default the distance for a DVARIABLES variate is calculated as
d = |xi – xj| / r
where xi and xj are the values of the variate for the receptor and the potential donor, and r is the observed range of the data; an alternative value of r may be supplied using the DRANGES option. DRANGES should be set to 1 if no scaling of the distances is required. For a DVARIABLES factor a simple matching criterion is used, so d = 0 if xi and xj are the same, and d = 1 if they are not.
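The per-variable distances described above can be illustrated with a short Python sketch. This is not Genstat code and not the procedure's implementation; the function name `variable_distance` and its arguments are hypothetical, and the range is computed here from a handful of Crops values taken from the example rather than from a full data set.

```python
def variable_distance(xi, xj, r=None, is_factor=False):
    """Distance between two units on one matching variable (illustrative only)."""
    if is_factor:
        # Simple matching for a factor: 0 if the levels agree, 1 otherwise.
        return 0.0 if xi == xj else 1.0
    # Variate: absolute difference scaled by the range r.
    return abs(xi - xj) / r


# A few Crops values from the example; the observed range is max - min.
crops = [50, 60, 96, 110, 430]
observed_range = max(crops) - min(crops)

# Scaled by the observed range (the default behaviour described above).
d_scaled = variable_distance(96, 110, r=observed_range)

# Unscaled, as if DRANGES had been set to 1.
d_raw = variable_distance(96, 110, r=1)

# Factor levels that differ give a distance of 1 (levels are made up).
d_factor = variable_distance("north", "south", is_factor=True)
```

With a range of 1 the distance is simply the absolute difference in the original units, which is why the example sets DRANGES=1 to make the distances easy to interpret.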
Matches are then determined using these distances according to a “minimax” approach, where the best match is the one with the minimum value of the maximum absolute difference between any of the DVARIABLES. Alternatively you can set the DMETHOD option to mean to use the mean of the absolute differences, or to regression to request that the distances are determined on the basis of predictions from a regression.
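The difference between the minimax and mean settings can be seen in a small Python sketch. It is an illustration under stated assumptions, not the procedure's code: the helper names `combined_distance` and `best_donor` are hypothetical, all matching variables are treated as variates, and the data are invented so that the two methods choose different donors.

```python
def combined_distance(receptor, donor, ranges, method="minimax"):
    """Combine the scaled per-variable distances into one distance."""
    parts = [abs(a - b) / r for a, b, r in zip(receptor, donor, ranges)]
    if method == "minimax":
        # Largest scaled difference over the matching variables.
        return max(parts)
    if method == "mean":
        # Average scaled difference over the matching variables.
        return sum(parts) / len(parts)
    raise ValueError("method must be 'minimax' or 'mean'")


def best_donor(receptor, donors, ranges, method="minimax"):
    """Index of the donor with the smallest combined distance."""
    return min(
        range(len(donors)),
        key=lambda k: combined_distance(receptor, donors[k], ranges, method),
    )


# Invented data: two matching variates, each with range 10.
receptor = (10.0, 5.0)
donors = [(11.0, 9.0), (13.0, 8.0)]
ranges = (10.0, 10.0)

pick_minimax = best_donor(receptor, donors, ranges, method="minimax")
pick_mean = best_donor(receptor, donors, ranges, method="mean")
```

Donor 0 is close on the first variate but far on the second, so minimax rejects it in favour of donor 1, whereas the mean criterion prefers donor 0 because its average difference is smaller.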
The RSAVE option specifies the regression analysis to use when DMETHOD=regression. The terms in the model must include the DVARIABLES. If RSAVE is not specified, the most recent regression analysis is used. The calculation of the distances between units is then weighted by the appropriate regression coefficients: for example, if the slope of x1 is 0.24 and two units have x1 values of 10 and 20, the distance is
(20 – 10) × 0.24 = 2.4.
DRANGES are ignored when DMETHOD=regression.
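The worked figure above can be reproduced with a brief Python sketch. The single-variable case follows the text directly; how the procedure combines several weighted differences is not stated here, so taking the absolute difference between the two linear predictors is an assumption made for illustration, and `regression_distance` is a hypothetical name.

```python
def regression_distance(xi, xj, coefficients):
    """Absolute difference between two units' linear predictors.

    Assumption: with several variables the weighted differences are summed
    before taking the absolute value. The source only gives the
    single-variable case.
    """
    diff = sum(b * (a - c) for b, a, c in zip(coefficients, xi, xj))
    return abs(diff)


# Single variable x1 with slope 0.24 and values 20 and 10, as in the text.
d = regression_distance([20.0], [10.0], [0.24])
print(round(d, 6))
```

No range scaling appears in the calculation, which matches the statement that DRANGES are ignored for this method.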
Conventional hot-deck imputation is the default method. Alternatively, if you set option METHOD=modelbased, SVHOTDECK will do model-based imputation. Note, though, that this cannot be used if DMETHOD=regression. Model-based imputation uses a regression analysis, specified by the RSAVE option. If RSAVE is not specified, the most recent regression analysis is used. The method creates an imputed value by adding the residual of the selected donor to the fitted value from the regression. This method can be used only if the OLDSTRUCTURE is the same as the y-variate in the regression. DVARIABLES will frequently be left unset in this situation, so that the residuals are chosen totally at random. However, in some situations it may be preferable to select residuals from similar units, in which case DVARIABLES can be used to determine the matching, as with the hot-deck method.
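As a rough Python illustration of the model-based idea (fitted value plus a residual borrowed from a donor), consider the sketch below. The regression line y = 2 + 3x, the data and the name `model_based_impute` are all invented for the example; the sketch draws the residual completely at random, which corresponds to leaving DVARIABLES unset, and it does not reproduce Genstat's random number generator.

```python
import random


def fitted(x):
    """Fitted value from a hypothetical regression line y = 2 + 3x."""
    return 2.0 + 3.0 * x


def model_based_impute(fitted_receptor, donor_residuals, rng):
    """Fitted value for the receptor plus a residual drawn from a donor."""
    return fitted_receptor + rng.choice(donor_residuals)


# Invented donor data with complete x and y values.
donor_x = [1.0, 2.0, 3.0]
donor_y = [5.5, 7.0, 11.5]
residuals = [y - fitted(x) for x, y in zip(donor_x, donor_y)]

# Receptor with x = 4 and a missing y value.
rng = random.Random(600209)
imputed = model_based_impute(fitted(4.0), residuals, rng)
```

Because the imputed value carries a real residual rather than sitting exactly on the regression line, the imputed data retain some of the scatter seen in the observed data.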
By default, SVHOTDECK will determine the single best match for each unit, where possible. In many cases (e.g. when doing multiple imputation), it is required to select one at random from the closest matches. The %THRESHOLD option specifies the tolerance to use in these situations: for example, setting %THRESHOLD to 10 requests that the match is selected at random from amongst the donors with distance up to 10% greater than the minimum distance. The SEED option specifies the seed for the random numbers that are used for this operation (default 0). Alternatively, if it is desired to specify the distance relative to the minimum in absolute terms, the THRESHOLD option should be used instead. If both THRESHOLD and %THRESHOLD are set, both criteria must be met. The THRESHOLD value is normally interpreted relative to the minimum distance but, if it is set to a negative value, a match is selected at random from those with a distance less than the absolute value of the THRESHOLD. Thus, for example, if THRESHOLD is set to -0.2 and DMETHOD=mean, any units with a mean distance of less than 0.2 (after taking into account settings of DRANGES) from the unit to be imputed are considered matches, and one of these is selected at random. Alternatively, if THRESHOLD is set to 0.2 and the distance of the best match is, for example, 0.18, any units with a mean distance of less than 0.18 + 0.2 = 0.38 are considered matches, and one of these is selected at random.
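The threshold rules can be summarised in a Python sketch that builds the set of candidate donors from a list of distances. It is a reading of the description above rather than the procedure's code: `candidate_donors` is a hypothetical name, the treatment of distances falling exactly on a cut-off is an assumption, and Python's generator will not reproduce the donors that Genstat selects for a given SEED.

```python
import random


def candidate_donors(distances, pct_threshold=None, threshold=None):
    """Indices of donors that satisfy the threshold settings."""
    dmin = min(distances)
    if pct_threshold is None and threshold is None:
        # No thresholds: only the best match (or exact ties) qualifies.
        return [k for k, d in enumerate(distances) if d == dmin]
    keep = []
    for k, d in enumerate(distances):
        ok = True
        if pct_threshold is not None:
            # Within pct_threshold percent above the minimum distance.
            ok = ok and d <= dmin * (1 + pct_threshold / 100)
        if threshold is not None:
            if threshold < 0:
                # Negative value: absolute cut-off on the distance itself.
                ok = ok and d <= abs(threshold)
            else:
                # Positive value: cut-off relative to the minimum distance.
                ok = ok and d <= dmin + threshold
        if ok:
            keep.append(k)
    return keep


distances = [0.18, 0.19, 0.30, 0.50]

relative = candidate_donors(distances, threshold=0.2)    # cut-off near 0.38
absolute = candidate_donors(distances, threshold=-0.2)   # cut-off 0.2
percent = candidate_donors(distances, pct_threshold=10)  # cut-off near 0.198
both = candidate_donors(distances, pct_threshold=10, threshold=0.2)

# One candidate is then chosen at random; the seed makes the draw repeatable.
rng = random.Random(12345)
chosen = rng.choice(relative)
```

When both settings are supplied, a donor has to pass each test, so the candidate set is the intersection of the two individual sets.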
The URECEPTORS and UDONORS options can be used to save the unit numbers of the receptor (imputed) cases and the donor cases, respectively. Note that, if the IMPUTE option is set, the OLDSTRUCTURE and NEWSTRUCTURE parameters need not be set. The use of URECEPTORS and UDONORS then allows more complicated methods of replacement to be used than those provided directly by SVHOTDECK.
Printed output and plots are controlled by the PRINT option, with the settings:
monitoring | provides information about each match,
---|---
summary | provides a summary,
list | produces a list of receptors and donors,
check | prints correlations as well as giving a scatter plot of the predictions against the actual data, and
regression | gives details of the model used when DMETHOD is set to regression.
To use check it is necessary to impute for data values that are present. This can be achieved either by specifying these units using IMPUTE, or by setting IMPUTE to a scalar, in which case the appropriate number of rows will be selected at random.
Options: PRINT, METHOD, DMETHOD, %THRESHOLD, THRESHOLD, DVARIABLES, DRANGES, LABELS, SEED, IMPUTE, DONORS, RSAVE, URECEPTORS, UDONORS, DISTANCES.
Parameters: OLDSTRUCTURE, NEWSTRUCTURE, OVERWRITE.
Action with RESTRICT
SVHOTDECK takes restrictions from any OLDSTRUCTURE or DVARIABLES vectors. Only unrestricted units are used as either donors or receptors. However, restrictions on IMPUTE and DONORS are ignored.
References
Korn, E.L. & Graubard, B.I. (1999). Analysis of Health Surveys. Wiley, New York.
See also
Procedures: SVBOOT, SVCALIBRATE, SVGLM, SVREWEIGHT, SVSAMPLE, SVSTRATIFIED, SVTABULATE, SVWEIGHT, MULTMISSING, QMVREPLACE.
Commands for: Survey analysis.
Example
CAPTION 'SVHOTDECK example',\
        'Orkney oats data (Sampford, Table 5.1, page 61).';\
        STYLE=meta,plain
VARIATE Oats
READ Farm
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 :
READ Crops
50 50 52 58 60 60 62 65 65 68 71 74 78 90 91 92 96 110 140 140 156 156 190 198 209 240 274 300 303 311 324 330 356 410 430 :
READ Oats
17 17 10 16 6 15 20 18 14 20 24 18 23 0 27 34 25 24 43 48 44 45 60 63 70 28 62 59 66 58 128 38 69 72 103 :
"Insert some missing values to impute"
CALCULATE Oatsmiss = MVINSERT(Oats; Farm.IN.!(17,23,30))
"First nearest match. Set DRANGE to 1 to make distances easy to interpret"
SVHOTDECK [PRINT=summary,list,monitoring; DVARIABLES=Crops; DRANGES=1;\
          SEED=600209] Oatsmiss; NEWSTRUCTURE=Oatsimp1
"now pick at random from those within 20 acres of nearest match on crops"
SVHOTDECK [PRINT=summary,list,monitoring; DVARIABLES=Crops; DRANGES=1;\
          THRESHOLD=20; SEED=12345] Oatsmiss; NEWSTRUCTURE=Oatsimp2
"and at random from those differing in crop area by 20 hectares or less"
SVHOTDECK [PRINT=summary,list,monitoring; DVARIABLES=Crops; DRANGES=1;\
          THRESHOLD=-20; SEED=23456] Oatsmiss; NEWSTRUCTURE=Oatsimp3
PRINT Farm,Crops,Oats,Oatsmiss,Oatsimp1,Oatsimp2,Oatsimp3;\
      DECIMALS=0; FIELD=9