Improving Estimation Accuracy in Nonrandomized Response Questioning Methods by Multiple Answers

When private or stigmatizing characteristics are included in sample surveys, direct questions result in low cooperation of the respondents. To increase cooperation, indirect questioning procedures have been established in the literature. Nonrandomized response methods are one group of such procedures and have attracted much attention in recent years. In this article, we consider four popular nonrandomized response schemes and present a way to improve the estimation precision of these schemes. The basic idea is to require multiple indirect answers from each respondent. We develop a Fisher scoring algorithm for the maximum likelihood estimation in the new schemes and show that the new schemes are more efficient than the original designs.


Introduction
Surveys are important tools in many disciplines of science, for instance, social science and economics. Sometimes, variables which are viewed as private or stigmatizing are involved in the survey. Examples for such sensitive variables are financial situation, political views, cheating in examinations, undeclared work, insurance fraud, and discrimination. Direct questions on such characteristics will often yield low cooperation of the respondents, i.e., answer refusal and untruthful answers will often occur. Therefore, skilful questioning procedures that protect the interviewees' privacy and deliver data enabling statistical inference were developed in the literature. One group of procedures is the class of randomized response (RR) methods. In RR techniques, the respondent conducts a random experiment and gives a certain indirect answer depending on the result of the random experiment. For example, consider the following process with the sensitive attribute undeclared work and throwing a die as random experiment: If the die shows 1 or 2, the interviewee answers the question "Have you conducted work in the last year without declaring this to the relevant public authorities?". If the die shows 3-6, the opposite question "Did you declare all your work in the last year to the relevant public authorities?" must be answered. The interviewer does not observe the random experiment and hears only yes or no, but does not know the question that is answered. This protects the privacy. Based on the indirect answers of many respondents, the distribution of the sensitive variable can be estimated. The described procedure corresponds to the RR technique by Warner (1965). Various other RR methods are available today. See, for example, Fox and Tracy (1986), Chaudhuri (2011), Chaudhuri and Christofides (2013), or Chaudhuri, Christofides and Rao (2016) for overviews.
The random experiment in RR methods is a bit cumbersome and causes doubts about the suitability of RR methods for online surveys. This motivated diverse authors to introduce nonrandomized response (NRR) methods, for example, Yu, Tian, and Tang (2008), Tan, Tian, and Tang (2009), Tang, Tian, Tang, and Liu (2009), or Groenitz (2014). In NRR schemes, an indirect answer that depends on the respondent's outcome of an auxiliary characteristic must be given. The auxiliary characteristic is defined on the same population on which the sensitive characteristic is defined. Typically, the auxiliary characteristic is independent of the sensitive attribute and possesses a known distribution. To give an example, we mention the characteristic describing whether the respondent's birthday is in January-April or not. In NRR procedures, the respondent would give the same answer if he or she were asked again.
To improve the estimation efficiency of RR methods, some authors study repeated RR methods (Eriksson, 1973; Alavi & Tajodini, 2016; Groenitz, 2016). Here, the interviewee must repeat the random experiment multiple times. Say, we have two repetitions. Depending on the sensitive characteristic and the result of the first repetition of the experiment, the first indirect answer must be given. Depending on the sensitive characteristic and the result of the second repetition, the second indirect answer must be provided. That is, two indirect answers are necessary.
In this article, we present some repeated NRR techniques. We derive inference for these procedures and show that our repeated NRR methods improve the estimation efficiency of the original NRR techniques. The basic idea for repeated NRR techniques is to involve multiple different auxiliary characteristics in the procedure. For example, one can consider the characteristic describing whether the respondent's birthday is in January to April and the characteristic describing whether the respondent's telephone number ends in 0-6.
In Section 2, we explain the NRR methods considered in this paper. In Section 3, we describe the corresponding repeated NRR designs. The maximum likelihood (ML) estimation and the estimation variance for the multiple-trial NRR schemes are addressed in Section 4. The accuracy gains of repeated NRR techniques in comparison with single-trial NRR techniques are demonstrated in Section 5.

NRR Methods
In this section, four NRR methods are described: the crosswise method and the triangular method (both Yu et al., 2008), the multi-category design by Tang et al. (2009), and the diagonal technique by Groenitz (2014). Let the sensitive characteristic be denoted by X. We give some concrete examples for X:
(i) X ∈ {1, 2} with X = 1 if the person has paid the taxes for the last year correctly and X = 2 if he or she has evaded taxes last year.
(ii) X ∈ {1, 2} with X = 1 if the person's annual income exceeds a certain value and X = 2 else.
(iii) X ∈ {1, 2, 3} where X = 1 holds if the person never has conducted insurance fraud, X = 2 holds if the person has conducted insurance fraud once or twice, and X = 3 holds if the person has conducted insurance fraud three or more times.
For the crosswise and triangular method, X ∈ {1, 2}, i.e., X with two categories, is required. For the methods by Tang et al. (2009) and Groenitz (2014), X can have an arbitrary number of categories coded by 1, 2, ..., k. The triangular method and the Tang et al. (2009) method demand that the category X = 1 is nonsensitive. The crosswise method can be applied for the examples (i) and (ii). The triangular method can handle example (i). The technique of Tang et al. (2009) is suitable for (iii), and the diagonal technique can be applied for (iii) and (iv).
For each of the considered NRR designs, a nonsensitive auxiliary variable W is necessary. The respondents' individual values of W must not be known to the interviewer or the survey agency. W and X must be independent and W must possess a known distribution. For the crosswise and triangular method, W must have the categories W = 1 and W = 2. For the Tang et al. (2009) method and the diagonal technique, W must have the k categories W = 1, ..., W = k. Examples for W with two categories were already given in the Introduction. A W with k = 4 is as follows: Let W be based on the number formed by the last three digits of the interviewee's telephone number. If this number is ≤ 624, 625-749, 750-874, and 875-999, we define W = 1, W = 2, W = 3, and W = 4, respectively. For example, the telephone number 9478722 results in the number 722 and W = 2.
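The construction of such an auxiliary variable can be sketched in code. The following is a minimal sketch; the function name and the string-based input are our own illustrative choices, not part of any survey software:

```python
def aux_category(phone_number: str) -> int:
    """Map the number formed by the last three digits of a telephone
    number to the auxiliary category W in {1, 2, 3, 4}:
    <= 624 -> 1, 625-749 -> 2, 750-874 -> 3, 875-999 -> 4."""
    last3 = int(phone_number[-3:])
    if last3 <= 624:
        return 1
    if last3 <= 749:
        return 2
    if last3 <= 874:
        return 3
    return 4
```

For the telephone number 9478722 from the text, the last three digits form the number 722 and the function returns W = 2. If the last three digits are uniformly distributed over 000-999, the categories have the known probabilities 0.625, 0.125, 0.125, and 0.125.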
In the survey, the respondents provide an indirect answer A that depends on X and W. Giving an indirect answer A protects the privacy. The concrete answer schemes are:
- Crosswise method: For X = W = 1 or X = W = 2, the answer A = 1 must be given. For other combinations of X and W, the indirect answer is A = 2.
- Triangular method: For X = W = 1, we have A = 1. In the other cases, A = 2 is required.
- Tang et al. (2009) method: For X = 1, the answer is the value of the nonsensitive variable, that is, A = W. For X = i with i = 2, ..., k, the answer is the value of the sensitive characteristic, that is, A = X.
- Diagonal technique: The answer is given by the formula A = [(W − X) mod k] + 1; however, the respondents do not receive this mathematical formula. Instead, they receive a table that illustrates the answer to give. Table 1 is such a table.

Repeated NRR Designs
In this section, we introduce a repeated version for each of the NRR methods from Section 2. Here, every respondent gives multiple indirect answers. We consider the case of two indirect answers in particular.
As a preliminary consideration, let us fix some NRR scheme from Section 2 and assume that the respondent should give a first indirect answer A1 based on the sensitive X and the nonsensitive auxiliary characteristic W and a second indirect answer A2 also based on X and W. Then, A1 = A2 always follows. Consequently, the second indirect answer does not contain additional information. Thus, it does not work to base both indirect answers on X and W.
The solution is to utilize a separate nonsensitive auxiliary attribute for each repetition. Say the nonsensitive auxiliary characteristics for the first and second trial are denoted by W1 and W2, respectively. For a fixed NRR scheme from Section 2, the interview procedure for the two-trial version is as follows. The interviewee first gives the indirect answer A1 depending on X and W1 according to the fixed NRR scheme. Afterward, he or she gives the second indirect answer A2 depending on X and W2, also according to the selected NRR scheme.
For each NRR technique from Section 2, neither the respondent's value of W1 nor that of W2 may become known to the interviewer or the survey agency. For the crosswise and triangular method, W1, W2 ∈ {1, 2} is necessary. For the Tang et al. (2009) and Groenitz (2014) method, W1 and W2 both must have the categories 1, ..., k. We make three further assumptions: The vector (W1, W2) and X are independent, W1 and W2 are independent, and W1 and W2 possess known distributions (W1 and W2 are allowed to have different distributions). These three assumptions can usually be seen as fulfilled when the auxiliary characteristics are constructed, for example, from birthday periods, telephone numbers, or house numbers.
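To illustrate the two-trial interview flow, the following toy simulation sketches a respondent in the two-trial crosswise method. The function names and the parameters p_w1 and p_w2, denoting P(W1 = 1) and P(W2 = 1), are our own illustrative choices:

```python
import random

def crosswise_answer(x: int, w: int) -> int:
    # crosswise rule: A = 1 if X = W = 1 or X = W = 2, otherwise A = 2
    return 1 if x == w else 2

def two_trial_interview(x: int, p_w1: float, p_w2: float) -> tuple:
    # draw the two independent auxiliary characteristics W1 and W2
    w1 = 1 if random.random() < p_w1 else 2
    w2 = 1 if random.random() < p_w2 else 2
    # each indirect answer combines the same X with a different auxiliary
    return crosswise_answer(x, w1), crosswise_answer(x, w2)
```

Because each indirect answer uses a separate auxiliary characteristic, the pair (A1, A2) carries more information about X than a single answer would.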

Statistical Inference for Repeated NRR Designs
We define πi to be the proportion of persons in the population having X value equal to i (i = 1, ..., k) and set π = (π1, ..., πk−1)⊤. We now develop the ML estimation of π for the repeated NRR designs and a sample of size n drawn by simple random sampling with replacement. The estimation variance is also addressed. Fix one of the repeated NRR designs and define c1i and c2i to be the proportion of population units with W1 = i and W2 = i, respectively. Let the entry (i, j) of the k × k matrix C1 be given by P(A1 = i | X = j). Analogously, let the entry (i, j) of the k × k matrix C2 be given by P(A2 = i | X = j). For the crosswise method, we have

C1 = ( c11       1 − c11 )
     ( 1 − c11   c11     )

and C2 is obtained by replacing c11 with c21.
For the triangular method,

C1 = ( c11       0 )
     ( 1 − c11   1 )

and C2 is again obtained by replacing c11 with c21. For the technique by Tang et al. (2009), the first column of C1 equals (c11, ..., c1k)⊤. The jth column of C1 for j = 2, ..., k has entry 1 as jth component while the other components are 0. In the matrix C2, the first column is (c21, ..., c2k)⊤, and the jth column for j = 2, ..., k again has entry 1 as jth component and entry 0 for the other components. For the diagonal technique, each row of C1 is a left-cyclic shift of the row above and the first row is (c11, ..., c1k). Regarding C2, each row is again a left-cyclic shift of the row above where the first row is now (c21, ..., c2k).
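The circulant structure of the diagonal technique's answer matrices can be written down directly. The following is a small sketch assuming numpy; diagonal_C is a hypothetical helper name of our own:

```python
import numpy as np

def diagonal_C(c):
    """Answer matrix of the diagonal technique for an auxiliary
    distribution c = (c_1, ..., c_k): the first row is c itself and
    every further row is a left-cyclic shift of the row above."""
    k = len(c)
    return np.array([np.roll(c, -i) for i in range(k)])
```

Each column of the resulting matrix sums to 1, as conditional answer probabilities must. Entry (a, x) equals P(A = a | X = x) = P(W = ((a + x − 2) mod k) + 1) under the answer rule A = [(W − X) mod k] + 1.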
Consider a1, a2, x ∈ {1, ..., k} and denote the entry (p, q) of the matrix C1 and C2 by C1(p, q) and C2(p, q), respectively. Since A1 depends only on (X, W1), A2 depends only on (X, W2), and W1 and W2 are independent, we obtain

P(A1 = a1, A2 = a2 | X = x) = C1(a1, x) · C2(a2, x).

Consequently, A1 and A2 are conditionally independent given X. As next step, we define λij to be the joint proportion of population units with A1 = i and A2 = j (i, j = 1, ..., k). These joint proportions are arranged in the column vector λ of length k² where we first sort by the value of A1. For example, for k = 3, λ is given by λ = (λ11, λ12, λ13, λ21, λ22, λ23, λ31, λ32, λ33)⊤. It holds that

λ = C · (π1, ..., πk)⊤,

where C is a k² × k matrix and the jth column of C is given by C1(:, j) ⊗ C2(:, j). Here, C1(:, j) and C2(:, j) represent the jth column of C1 and C2, respectively, and the symbol ⊗ stands for the Kronecker matrix product. The Kronecker product of two matrices R ∈ ℝ^(r1×r2) and S ∈ ℝ^(s1×s2) is the block matrix whose (i, j) block equals R(i, j) · S; that is, R ⊗ S is a matrix of size r1s1 × r2s2. Thus, C is the columnwise Kronecker product of C1 and C2. For the following, it is advisable to number the k² answer categories by 1, ..., k² where we first sort by answer A1 and then by A2. For example, for k = 3, the numbering scheme is given by Table 2.
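The construction of C and the ML estimation via Fisher scoring (the algorithm mentioned in the abstract) can be sketched as follows. This is a minimal numpy sketch under our own naming; the two-trial crosswise method with k = 2 serves as example, and the data are expected counts so that the iteration recovers the true π:

```python
import numpy as np

def columnwise_kron(C1, C2):
    # jth column of C is C1(:, j) ⊗ C2(:, j)
    return np.column_stack([np.kron(C1[:, j], C2[:, j])
                            for j in range(C1.shape[1])])

def fisher_scoring(counts, C, theta0, iters=50):
    """ML estimate of (pi_1, ..., pi_{k-1}) from the k^2 answer counts."""
    theta = np.asarray(theta0, dtype=float)
    D = C[:, :-1] - C[:, [-1]]          # d lambda / d pi_j after eliminating pi_k
    n = counts.sum()
    for _ in range(iters):
        lam = C @ np.append(theta, 1.0 - theta.sum())
        score = D.T @ (counts / lam)     # gradient of the log-likelihood
        info = n * (D.T / lam) @ D       # expected Fisher information
        theta = theta + np.linalg.solve(info, score)
    return theta

# two-trial crosswise method, k = 2
c11, c21, pi1 = 0.7, 0.8, 0.6
C1 = np.array([[c11, 1 - c11], [1 - c11, c11]])
C2 = np.array([[c21, 1 - c21], [1 - c21, c21]])
C = columnwise_kron(C1, C2)
counts = 1000 * (C @ np.array([pi1, 1 - pi1]))  # expected answer counts
est = fisher_scoring(counts, C, theta0=[0.5])
```

Since λ is linear in π, the log-likelihood is concave and the scoring iteration converges quickly; with expected counts as input, the estimate equals the true π1 = 0.6.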

Precision Improvement
We quantify the estimation inaccuracy by the trace of the asymptotic variance matrix of the ML estimator for π = (π1, ..., πk−1)⊤. For this variance matrix, we refer to the end of the previous section. We start this section with a formal proof that the estimation inaccuracy of a two-trial NRR method is always less than or equal to the estimation inaccuracy of the single-trial process.
Denote by F = F(π1, ..., πk−1) the Fisher information matrix for one respondent who provides the two indirect answers (A11, A12), and write ∇ for the gradient with respect to (π1, ..., πk−1). We have

F = E[ ∇ log Pπ(A11, A12) · (∇ log Pπ(A11, A12))⊤ ]
  = E[ s(A11) s(A11)⊤ ] + E[ g(A11, A12) g(A11, A12)⊤ ] + E[ s(A11) g(A11, A12)⊤ ] + E[ g(A11, A12) s(A11)⊤ ],

where s(a1) = ∇ log Pπ(A11 = a1). In the following, we show that the last two summands are zero (zero matrix). We introduce the function g with

g(a1, a2) = ∇ log Pπ(A11 = a1, A12 = a2) − ∇ log Pπ(A11 = a1) = ∇ log Pπ(A12 = a2 | A11 = a1).

Since g(a1, ·) is the conditional score given A11 = a1, it is true that E( g(A11, A12) | A11 ) = 0. Consequently, E( E( g(A11, A12) | A11 ) ) = E( g(A11, A12) ) = 0 holds and

E[ s(A11) g(A11, A12)⊤ ] = E[ s(A11) · E( g(A11, A12)⊤ | A11 ) ] = 0.

That is, the third summand is zero. Regarding the fourth summand, we have that it is the transpose of the third summand and hence also zero. Thus, we obtain

F = G + E[ g(A11, A12) g(A11, A12)⊤ ].  (1)

The matrix G = G(π1, ..., πk−1) is the Fisher matrix if we only have observations on A11, ..., An1, that is, if we only require one indirect answer per respondent. It follows from (1) that F − G is positive semidefinite. By a known property of the Löwner order (Nordström, 1989, p. 4473), we obtain that G⁻¹ − F⁻¹ is positive semidefinite. Thus, the trace of G⁻¹ is larger than or equal to the trace of F⁻¹. G⁻¹ is the asymptotic variance matrix of the ML estimator for π for one indirect answer per interviewee and F⁻¹ is the asymptotic variance matrix of the ML estimator for two indirect answers per interviewee. Hence, we have shown that the estimation inaccuracy of a two-trial NRR method is always less than or equal to the estimation inaccuracy of the single-trial process.
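The inequality trace(G⁻¹) ≥ trace(F⁻¹) can be checked numerically, for example for the crosswise method with k = 2, where F and G reduce to scalars. The following is a small verification sketch with our own parameter choices:

```python
import numpy as np

def fisher_info(C, pi1):
    # per-respondent Fisher information for pi_1 (k = 2), computed as
    # sum over answer categories m of (d lambda_m / d pi_1)^2 / lambda_m
    lam = C @ np.array([pi1, 1.0 - pi1])
    d = C[:, 0] - C[:, 1]
    return np.sum(d ** 2 / lam)

c11, c21, pi1 = 0.7, 0.8, 0.6
C1 = np.array([[c11, 1 - c11], [1 - c11, c11]])   # first indirect answer
C2 = np.array([[c21, 1 - c21], [1 - c21, c21]])   # second indirect answer
# columnwise Kronecker product gives the joint answer matrix
C = np.column_stack([np.kron(C1[:, j], C2[:, j]) for j in range(2)])

G = fisher_info(C1, pi1)   # information from one indirect answer
F = fisher_info(C, pi1)    # information from two indirect answers
# a second answer never hurts: F >= G, hence 1/F <= 1/G
```

For these parameters the second answer roughly triples the information, so the asymptotic variance 1/F is clearly below 1/G.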
For numerical illustration, we now compute the estimation inaccuracy of our two-trial NRR techniques for concrete parameter specifications and make comparisons to the estimation inaccuracy of the single-trial versions. For the crosswise method, we set π1 = 0.8 and consider c11 ∈ {0.1, 0.2, ..., 0.9, 1} and c21 ∈ {0.1, 0.2, ..., 0.9, 1}. The quantity n times the asymptotic variance of the ML estimator for π1 for the two-trial crosswise method is presented for each combination of c11 and c21 in the middle of Table 3. In the right column of Table 3, we provide the quantity n times the asymptotic variance of the ML estimator for π1 for the single-trial crosswise method depending on the parameter c11.
Here, n times the asymptotic variance for the single-trial version equals λ1(1 − λ1) / (2c11 − 1)², where λ1 = c11π1 + (1 − c11)(1 − π1) is the proportion of population units with answer A = 1. For the triangular method, we again consider π1 = 0.8 and proceed analogously to the crosswise method. The computational results for the triangular method are given in Table 4. For the Tang et al. (2009) technique, we consider k = 3 categories, (π1, π2, π3) = (0.6, 0.3, 0.1), and 10 distributions of an auxiliary variable.
Note. This table shows the quantity n times the asymptotic variance of the ML estimator for π1 for the crosswise method.
For c11 = 0.5 in the single-trial procedure and c11 = c21 = 0.5 in the two-trial procedure, the log-likelihood does not depend on π, implying that ML estimation is not adequate in these cases.
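For the single-trial crosswise method, n times the asymptotic variance of the ML estimator for π1 equals λ(1 − λ)/(2c11 − 1)² with λ = c11π1 + (1 − c11)(1 − π1), the Warner-type formula; at c11 = 0.5 the denominator vanishes, which reflects the breakdown just described. A quick numeric check (our own sketch):

```python
def navar_single_crosswise(pi1: float, c11: float) -> float:
    """n times the asymptotic variance of the ML estimator for pi_1
    in the single-trial crosswise method; undefined for c11 = 0.5,
    where the answer distribution carries no information about pi_1."""
    lam = c11 * pi1 + (1 - c11) * (1 - pi1)   # P(A = 1)
    return lam * (1 - lam) / (2 * c11 - 1) ** 2
```

For π1 = 0.8, the value c11 = 1 gives 0.16 (direct questioning), c11 = 0.7 gives about 1.47, and c11 = 0.6 already gives 6.16; the variance blows up as c11 approaches 0.5.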
Note. This table presents n times the trace of the asymptotic variance matrix of the ML estimator for (π1, π2, π3) for the diagonal method by Groenitz (2014).

Summary
NRR designs for sensitive attributes have attracted much attention in the recent literature. In this article, we have considered two-trial versions of four NRR schemes. In a two-trial design, each person in the sample must provide two indirect answers. Each answer depends on a separate auxiliary characteristic. We have developed the maximum likelihood inference for the distribution of the sensitive variable and derived the asymptotic estimation variance. Moreover, we analyzed the gains in estimation precision from two indirect answers per respondent instead of one.

Table 1.
Required indirect answer A for the diagonal technique.

Table 4
This table provides the quantity n times the asymptotic variance of the ML estimator for π1 for the triangular method.
Note. This table shows n times the trace of the asymptotic variance matrix of the ML estimator for (π1, π2) for the Tang et al. (2009) method.