An Improved Bound for Security in an Identity Disclosure Problem

Identity disclosure of an individual from released data is a matter of concern, especially if the individual belongs to a category with low frequency in the data-set. Nayak et al. (2016) discussed this problem vividly in a census report and suggested a method of obfuscation which ensures that the probability of correctly identifying a unit from the released data does not exceed t for some 1/3 < t < 1. However, we observe that for the above method the level of security can be extended under certain conditions. In this paper, we discuss conditions under which one can achieve security for any 0 < t < 1.


Introduction
Many agencies release data to motivate statistical research and industrial work. But often these data-sets carry some information which may be sensitive to the individuals bearing it. Erasing the name or some identity number associated with an individual may not always be sufficient to hide the identity of the individual. For example, imagine a situation where a data-set of p variables corresponding to n individuals is released, and among these p variables there is a variable named "pin-code" (sometimes called zip-code). Now, "pin-code" is not supposed to be a sensitive variable, but it may happen that the intruder, who is trying to identify some individual in the data-set, has an idea about where the individual lives and thus can guess his "pin-code". In this case, if there is no other individual in the data-set having the same "pin-code", he can directly infer from this information which row in the data-set corresponds to the individual, and thus the identity is revealed. Hence, suppressing identity numbers or names is not always sufficient to prevent identity disclosure. In case there are a few variables with low-frequency cells, it is usually easy for the intruder to identify the individual.
Various articles, including [5], [4] and [2], have discussed this problem, and various authors have proposed different risk measures to evaluate the security of the released data. Here we follow the framework of Nayak et al. [2], where the intruder knows the variable category X(B) corresponding to his target unit B. If the variable X has k categories c_1, c_2, ..., c_k, then we assume without loss of generality that X(B) = c_1, and that the frequencies of the categories in the data-set are T_1, T_2, ..., T_k respectively.
If T_1 = 1, i.e., only X(B) has category c_1, the intruder can guess the row of his target unit with certainty. If T_1 is small, the intruder knows that his target unit is definitely one of these T_1 units, and then, taking other information into consideration, he may successfully identify the row of his target unit or make a correct guess. Thus, in this case, the variable information must be suppressed before releasing the data.
One way to do that is to erase the variable completely, but that is not desirable to the statistician. The usual practice is to perturb the data in such a way that the new data can be treated like the original data in making statistical inferences.
If Z denotes the perturbed data, then the transition matrix P = ((p_ij)) is given by p_ij = P(Z(A) = c_j | X(A) = c_i) for any unit A. This matrix is not released and is unknown to the statistician. This method of obfuscation is known as the post-randomization method (PRAM). Let Π = (π_1, ..., π_k)' denote the vector of true category probabilities and Λ the corresponding vector for the perturbed categories, so that Λ = P'Π. If we want to treat Z as the original data, we must have Π = Λ = P'Π. But Π is generally unknown to the one who is masking the data. However, he can estimate Π from the original data by T/n, where T = (T_1, ..., T_k)' and n is the total sample size. If S = (S_1, ..., S_k)' denotes the frequency vector of the perturbed data and we want S/n to be an unbiased estimator of Π, we must have

P'T = T.    (2)

Gouweleeuw et al. (1998) [6] defined a post-randomization method to be an invariant PRAM if P satisfies Equation (2). The error due to estimation after post-randomization has been studied in the literature by various authors, including Nayak et al. [7].
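The mechanics above can be made concrete with a minimal sketch. The function names (`pram`, `is_invariant`) are illustrative, and the invariance check encodes the condition P'T = T, our reading of Equation (2); the toy transition matrix is not an IFPR matrix, only a simple example whose rows are proper distributions.

```python
import numpy as np

def pram(x, P, rng):
    """Perturb each record independently: row P[c] holds P(Z = j | X = c)."""
    k = P.shape[0]
    return np.array([rng.choice(k, p=P[c]) for c in x])

def is_invariant(P, T):
    """Invariance condition (Equation (2)): the perturbation keeps the
    expected frequency vector unchanged, i.e. P'T = T."""
    return np.allclose(P.T @ T, T)

rng = np.random.default_rng(0)
x = np.repeat([0, 1, 2], 4)             # frequencies T = (4, 4, 4)
T = np.bincount(x, minlength=3)
P = np.array([[0.8, 0.1, 0.1],
              [0.1, 0.8, 0.1],
              [0.1, 0.1, 0.8]])
z = pram(x, P, rng)
print(is_invariant(P, T))               # here True, so S/n is unbiased
```

With equal frequencies and a symmetric P, the column sums are 1 and the condition holds; for unequal frequencies, P has to be tailored to T, which is exactly what the IFPR construction below does.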
One common technique to achieve an invariant PRAM is to use an Inverse Frequency Post Randomization (IFPR) block diagonal matrix, in which the set of categories is partitioned into a few groups and categories are interchanged only within each group. If it is not desirable to change the category of some variable, it can be made to form its own block. Thus, if there are m groups, p_ij > 0 only if c_i and c_j fall into the same group, and p_ij = 0 if c_i and c_j fall into different groups. Within each group, p_ij is given by

p_ij = θ / ((k′ − 1) T_i) for j ≠ i,  p_ii = 1 − θ / T_i,    (3)

where 0 < θ < 1 and k′ > 1 is the block size of the group that i and j fall into.
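A sketch of the IFPR construction follows, assuming the within-block form p_ij = θ/((k′−1)T_i), p_ii = 1 − θ/T_i, which is consistent with the numerical transition matrix in the Simulation Results section. The frequencies and block partition mirror that example; `ifpr_matrix` is an illustrative name.

```python
import numpy as np

def ifpr_matrix(T, blocks, theta):
    """Build an IFPR block-diagonal transition matrix: within a block of
    size k', p_ij = theta / ((k'-1) * T_i) for j != i and
    p_ii = 1 - theta / T_i; categories in different blocks never swap."""
    k = len(T)
    P = np.eye(k)
    for block in blocks:
        kp = len(block)
        if kp == 1:
            continue                      # singleton block: category kept as-is
        for i in block:
            off = theta / ((kp - 1) * T[i])
            for j in block:
                P[i, j] = off if j != i else 1.0 - theta / T[i]
    return P

T = np.array([2, 200, 400, 100, 240, 260, 602, 196])
blocks = [[0, 1, 3, 4, 5, 7], [2], [6]]   # the two largest categories untouched
P = ifpr_matrix(T, blocks, theta=1.656854)
print(np.allclose(P.sum(axis=1), 1.0))    # rows are proper distributions
print(np.allclose(P.T @ T, T))            # invariant PRAM: P'T = T
```

Both checks print `True`: each row sums to 1 by construction, and weighting row i by T_i makes the off-diagonal contributions within a block telescope to θ, so the frequency vector is exactly preserved.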
However, the parameter θ of the model should be carefully chosen to ensure that the perturbed data is secure from the intruder, at least up to a certain extent. To measure the risk of disclosure, Nayak et al. [2] suggested checking whether the probability of correctly identifying an individual, given any structure of T and any value of S_1, is bounded by some specified quantity 0 < ξ < 1. Moreover, they showed that for any 1/3 ≤ ξ < 1 there exists a θ⋆ with 0 < θ⋆ < 1 which gives the transition matrix P(θ⋆) = ((p⋆_ij)), 1 ≤ i, j ≤ k, where p⋆_ij is chosen according to Equation (3) with θ = θ⋆ for each i, j = 1, 2, ..., k_1, and k_1 is the block size of the group c_1 belongs to. Without loss of generality, we assume the block c_1 belongs to is the first block. This matrix P(θ⋆), when used to post-randomize X, guarantees P(CM) ≤ ξ, where CM denotes "Correct Match". However, if we extend the search range of θ from 0 < θ < 1 to 0 < θ < T_1 and can find enough categories for the first block satisfying T_j ≥ T_1 for all j ≠ 1, then the level of security can be extended to any 0 < ξ < 1. Note that under this extended range there is no harm to the transition probabilities, as they certainly lie between 0 and 1. However, the smaller the value of ξ, the larger the block size required. Therefore, we can extend the security as far as the frequency distribution permits.

Our Approach
As mentioned earlier, our framework is similar to that of Nayak et al. [2]. From the intruder's point of view, we assume that he gets access to the released data Z. Let S_1 be the total number of units having class c_1. If S_1 = 0, the intruder stops searching for his target unit B in the data-set. If S_1 = a for some a > 0, he selects one unit randomly among these a individuals and concludes it to be his target unit B. Under this assumption, we discuss how to choose the parameter θ of the IFPR block diagonal matrix (see Equation (3)), depending on T_1, so that the probability of correctly identifying unit B is less than some specified 0 < ξ < 1. Our method is described in the following paragraph.
If T_1 > 1/ξ, then there is no need for obfuscation: even if the intruder chooses one unit randomly and concludes it to be his target unit B, the probability of correctly identifying B in the original data is 1/T_1 < ξ. This is quite intuitive, since identification risk is a problem associated with low-frequency classes. If T_1 ≤ 1/ξ, then we find k_1 = K_1(ξ, T_1) classes (where the function K_1 is derived in the next section) with frequencies T_j ≥ T_1 to form the block containing c_1. Such an event is usually feasible for moderate values of ξ, as T_1 usually has small values. If such classes are available, we can have any desired level of security, i.e., for any fixed 0 < ξ < 1 there exists a corresponding θ⋆ such that if the data is perturbed with the matrix P(θ⋆), Equation (4) holds. If, however, such classes are not available, we find the integer n⋆ with 1/n⋆ ≤ ξ < 1/(n⋆ − 1) and first try for ξ_1 = 1/(n⋆ − 1); if that fails too, we next try for ξ_2 = 1/(n⋆ − 2), and so on, until we get a success for some ξ_l = 1/(n⋆ − l) and a θ⋆ such that if the data is perturbed with P(θ⋆), Equation (4) is satisfied for any 1/(n⋆ − l) ≤ ξ < 1. According to Nayak et al. [2], there is always a solution for ξ ≥ 1/3, which implies the procedure always succeeds by the time ξ_l reaches 1/3; in many cases, however, much smaller values of ξ are achievable.
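The search described above can be sketched as follows. The function `K1` passed in is a hypothetical stand-in for the actual K_1(ξ, T_1), which in the paper comes from solving ψ(θ, T_1) = ξ (and is tabulated in Table 1); the stub used in the demo is illustrative only.

```python
import math

def choose_security_level(xi, T, K1):
    """Iterative search for an achievable security level.
    T[0] is the frequency of the target category c_1; K1(level, T1) is
    assumed supplied and returns the minimum block size for that level."""
    T1 = T[0]
    if T1 > 1.0 / xi:
        return xi, []                      # no obfuscation needed: 1/T1 < xi
    # categories other than c_1 eligible to join the block: T_j >= T_1
    eligible = [j for j in range(1, len(T)) if T[j] >= T1]
    n_star = math.ceil(1.0 / xi)           # 1/n* <= xi < 1/(n*-1)
    level = xi
    while True:
        k1 = K1(level, T1)
        if len(eligible) >= k1 - 1:        # block = c_1 plus k1 - 1 others
            return level, [0] + eligible[:k1 - 1]
        n_star -= 1                        # relax: try xi_l = 1/(n*-l)
        if n_star < 3:
            raise ValueError("no feasible level found above 1/3")
        level = 1.0 / n_star

# hypothetical stand-in for K_1; the real one solves psi(theta, T_1) = xi
K1 = lambda level, T1: math.ceil(1.0 / (level * T1))
level, block = choose_security_level(0.1, [2, 5, 9, 1, 7], K1)
print(level, block)
```

In the demo, ξ = 0.1 is infeasible with only three eligible classes, so the routine falls back through 1/9 to 1/8, where the (stub) block-size requirement is met.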

Model, Assumptions and Results
As discussed earlier, the goal of the paper is to find a method by which data can be perturbed while ensuring as much security as possible. Since security is an abstract term, we limit ourselves to ensuring that the measure given by Equation (4) holds for low values of ξ. The smaller the value of ξ, the better the security of the data. Let us denote by R_1(a, t) the probability of correctly identifying the individual from the released data given S_1 = a and the frequency distribution t of X:

R_1(a, t) = P(CM | S_1 = a, T = t).    (5)

If R_1(a, t) is bounded by ξ for any t, then P(CM | S_1 = a, T = t) is bounded by ξ for any a ≥ 0, which signifies that the probability of correctly identifying an individual is less than ξ no matter how small or large the frequency of category c_1 is in the released data. R_1(a, t) is used instead of R_1(a) because the probability is hard to calculate if t is not known. Note that CM stands for "Correct Match" in Equations (5) and (6).
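The step from the conditional bound to the unconditional one is a routine application of the law of total probability, and can be written out as:

```latex
\begin{align*}
P(\mathrm{CM}\mid \mathbf{t})
  &= \sum_{a\ge 0} P(\mathrm{CM}\mid S_1=a,\,\mathbf{t})\,P(S_1=a\mid \mathbf{t})\\
  &= \sum_{a\ge 1} R_1(a,\mathbf{t})\,P(S_1=a\mid \mathbf{t})
  \;\le\; \xi \sum_{a\ge 1} P(S_1=a\mid \mathbf{t}) \;\le\; \xi,
\end{align*}
```

where the a = 0 term drops out because the intruder stops searching when S_1 = 0, so a correct match is impossible there.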
Recall that if we use an IFPR block diagonal matrix to perturb X, the category c_1 may get changed to one of {c_1, c_2, ..., c_{k_1}}, k_1 ≥ 2, with positive probability. Let β_i denote the odds that c_1 goes to category c_i rather than staying at c_1. By our assumption, the intruder searches for his target unit B only among the units whose released category is c_1, and, since he is assumed to choose randomly one unit among the a such units to be B, Equations (8) and (9) (with the sum appearing there denoted Σ_{a−1}), together with Equation (7), finally give the expression for R_1(a, t) in Equation (10). Nayak et al. [2] observed that although it seems intuitive that R_1(1, t) ≥ R_1(a, t) for any t and a > 1, there are certain cases in which this does not hold. However, they proved that if β_1, the odds that c_1 goes to any category other than c_1, is the highest, then the risk of disclosure is maximized at a = 1. We verify that this is indeed the case, which leads to our first result, stated in the following theorem; the proof is given in the Appendix.
Theorem 3.1. R_1(1, t) ≥ R_1(a, t) for any t and a > 1, where R_1(a, t) is given by Equation (10).
Assuming Theorem 3.1 holds, proving Equation (4) is equivalent to proving that R_1(1, t) ≤ ξ for any t. For this condition to hold, we must carefully choose the parameter θ in Equation (3).
Due to Nayak et al. [2], we have an explicit expression for R_1(1, t). To proceed further, we also need the following lemma, whose proof is deferred to the Appendix.
For Theorem 3.1 to hold in an IFPR block diagonal matrix, we must have T_1 − θ > 0, i.e., 0 < θ < T_1. Again, θ is chosen by solving ψ(θ, T_1) = ξ. Thus, for fixed ξ and T_1, we have a θ and a corresponding K_1(ξ, T_1), which is the largest integer contained in K(θ, T_1). K_1(ξ, T_1) is the minimum number of categories required to form the block containing c_1. For some possible choices of ξ and some possible values of T_1, the value of K_1(ξ, T_1) is calculated and given in Table 1. While choosing the block size, one must note that the block size k_1 must be larger than or at least equal to K_1(ξ, T_1) to ensure Equation (4).

Simulation Results
To illustrate the process, we simulate a sample of size n = 2000 from k = 8 categories such that the probability of falling into a category is given by the vector Π = (0.001, 0.1, 0.2, 0.05, 0.12, 0.13, 0.301, 0.098). The sample has the frequency distribution given in Table 2. Two units in the data-set have Category 1, one of which is unit B = 780. Since T_1 = 2, the probability of a Correct Match from the true data is 0.5, which is very high. We want this probability to be lower, say below ξ = 0.1. So, we transform the data to Z using the invariant PRAM method with a transition matrix P. To choose an ideal P, we apply the procedure of this paper. From Table 1, the required block size is 6. So, we apply the transition to the k_1 = 6 categories with the lowest probability of occurrence and do not alter the remaining 2 categories. Solving ψ(θ, T_1) = ξ, we get θ⋆ = 1.656854, which gives the transition matrix

P(θ⋆) =
[ 0.172  0.166  0      0.166  0.166  0.166  0      0.166 ]
[ 0.002  0.992  0      0.002  0.002  0.002  0      0.002 ]
[ 0      0      1      0      0      0      0      0     ]
[ 0.003  0.003  0      0.984  0.003  0.003  0      0.003 ]
[ 0.001  0.001  0      0.001  0.993  0.001  0      0.001 ]
[ 0.001  0.001  0      0.001  0.001  0.993  0      0.001 ]
[ 0      0      0      0      0      0      1      0     ]
[ 0.002  0.002  0      0.002  0.002  0.002  0      0.991 ]

Using this transition matrix, we ran 1000 simulations to get 1000 different Zs. The mean squared estimation error for each category is E = (4.9350e−07, 7.6125e−07, 0.0000e+00, 7.4300e−07, 8.8550e−07, 7.8375e−07, 0.0000e+00, 8.5550e−07), which is quite low, and the average probability of a correct match over the 1000 simulations is 0.07639286 < 0.1.
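A sketch of this experiment is below, under the assumption that the within-block transition probabilities have the form p_ij = θ/((k′−1)T_i), p_ii = 1 − θ/T_i (consistent with the printed matrix). The seed, target index, and `perturb` helper are illustrative, so the exact numbers differ from the paper's run; the per-simulation correct-match probability is (Z_B = c_1)/S_1, matching the random-pick intruder model.

```python
import numpy as np

def perturb(x, P, rng):
    """Apply the transition matrix row-by-row: all units in category c
    are redrawn at once from the distribution P[c]."""
    z = np.empty_like(x)
    k = P.shape[0]
    for c in range(k):
        idx = np.where(x == c)[0]
        if idx.size:
            z[idx] = rng.choice(k, size=idx.size, p=P[c])
    return z

rng = np.random.default_rng(1)
Pi = np.array([0.001, 0.1, 0.2, 0.05, 0.12, 0.13, 0.301, 0.098])
n, k, theta = 2000, 8, 1.656854

x = rng.choice(k, size=n, p=Pi)           # one draw of the "true" data
x[:2] = 0                                 # ensure T_1 >= 2 as in the paper
B = 0                                     # index of the target unit (illustrative)
T = np.bincount(x, minlength=k)

# block: the target category plus the 5 other lowest-frequency categories
block = np.argsort(T)[:6]
P = np.eye(k)
for i in block:
    for j in block:
        P[i, j] = theta / (5 * T[i]) if j != i else 1.0 - theta / T[i]

# Monte Carlo estimate of the correct-match probability: the intruder
# picks uniformly among the S_1 units released with the target category
cm = []
for _ in range(1000):
    z = perturb(x, P, rng)
    s1 = np.sum(z == 0)
    cm.append(float(z[B] == 0) / s1 if s1 > 0 else 0.0)
print(np.mean(cm))                        # stays below xi = 0.1 on average
```

The block selection (`np.argsort(T)[:6]`) mirrors the paper's choice of the six lowest-frequency categories, leaving the two most frequent ones in singleton blocks.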
The process thus seems to work well for simulated data.

Conclusion
The method works well in most practical cases because, since we want to obfuscate categories with low frequency, there will in general be a sufficient number of categories with higher frequencies than them. Accordingly, the security level can be increased.
However, the greatest drawback of this method of obfuscation is that we have assumed the intruder's strategy, namely that he randomly selects one of the units with the desired categorical value in the obfuscated data. But this is not expected to happen, since in most cases there will be many associated regressor variables, and the selection will, in general, not be random. This problem was also discussed in [4].
However, if the model assumptions hold true, the discussed method succeeds in giving better security.
Appendix
Proof of Theorem 3.1. We prove this result by a two-dimensional induction procedure. First, we show that the statement is true for k_1 = 2 for all a ∈ N; then we show that if the statement is true for k_1 = k_1^0, then it is true for k_1 = k_1^0 + 1 for all a.
Case k_1 = 2: Writing Σ_{a+1} similarly, we note that there are a + 2 terms in the expansion of Σ_{a+1} − Σ_a. In the last expression, let us denote the first term by Term(s, β_1) and the second term by Term(s − 1, β). Note that since β_1 ≥ β_i for all i, Term(s, β_1) − Term(s, β) ≥ 0.
Case k_1 = k_1^0 + 1: The general expression for Σ_a can be written in terms of Σ(s, k_1^0) = s! Σ_s computed for k_1^0 categories instead of k_1 = k_1^0 + 1 categories. As before, we write down the terms of Σ_{a+1} − Σ_a up to the (a + 2)-th term. Summing all the terms, we see that the statement is true for k_1 = k_1^0 + 1 if it is true for k_1^0, for any a ≥ 1. Thus (12) always holds, and hence the proof.

Table 1: Minimum block size required for some possible choices of the security level ξ and some possible values of the class frequency T_1

Table 2: Frequencies of the categories for the true data in the simulated data-set