ON THE PRIVACY AND UTILITY PROPERTIES OF TRIPLE MATRIX-MASKING

Privacy protection is an important requirement in many statistical studies. A recently proposed data collection method, triple matrix-masking, retains exact summary statistics without exposing the raw data at any point in the process. In this paper, we provide a theoretical formulation and proofs showing that a modified version of the procedure is strong collection obfuscating: no party in the data collection process is able to gain knowledge of the individual level data, even with access to some partially masked data in addition to the publicly published data. This provides a theoretical foundation for using such a procedure to collect masked data that allows exact statistical inference for linear models, while preserving a well-defined notion of privacy protection for each individual participant in the study. This paper fits into a line of work tackling the problem of how to create useful synthetic data without having a trustworthy data aggregator. We achieve this by splitting the trust between two parties, the “masking service provider” and the “data collector.”


Introduction
In the digital age, vast amounts of data become available for research. At the same time, there is increasing pressure to protect the privacy of study subjects when their data are used. For medical research, the Health Insurance Portability and Accountability Act of 1996 and subsequent rulings have imposed legal requirements for privacy protection on the collection and handling of health data. Among other things, basic privacy protection measures include the removal of all personal identifiers when releasing data for use. However, simply removing the personal identifier variables does not prevent possible identification of the individual from other variables. To prevent the identification of an individual record, researchers have shown that released data should be aggregated to satisfy privacy conditions such as k-anonymity [Sweeney, 2002], l-diversity [Machanavajjhala et al., 2007] and t-closeness [Li et al., 2007].
However, releasing data only at aggregated levels severely restricts its usefulness in many research studies. Alternatively, methods have been designed to release obfuscated micro-data that allow for the usual statistical analysis while preserving privacy at the individual level. Some examples of such obfuscated micro-data publishing are: noise addition [Brand, 2002], multiple imputation [Rubin, 1993, Drechsler and Reiter, 2010], information preserving statistical obfuscation [Burridge, 2003], random projection based perturbation [Liu et al., 2006], and random orthogonal matrix masking [Ting et al., 2008]. In particular, in the random orthogonal matrix masking scheme, a masked data set AX is published, where X denotes the data matrix of real responses and A is a random orthogonal matrix. The published data AX keeps the exact values for sufficient statistics of linear models, thus allowing exact statistical inference for many standard data analysis methods [Ting et al., 2008, Wu et al., 2017b] while protecting privacy by denying the user direct access to the raw data X. While the above methods all protect the privacy of individual entries through publishing only randomly perturbed micro-data, the privacy protection can be lost when multiple micro-data sets from multiple inquiries to the same database are combined. Differential privacy [Dwork and Naor, 2008] was proposed to quantify the effectiveness of privacy protection of random noise addition/perturbation schemes against multiple inquiries to the database. The noise level can then be adjusted to achieve a quantified tradeoff between inference efficacy and privacy preservation (measured by the differential privacy metric).
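As a concrete illustration of the orthogonal masking idea, the following Python sketch (using numpy; the variable names are ours, not from the cited papers) checks that a random orthogonal left mask preserves the matrix X T X exactly:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 6, 3
X = rng.normal(size=(n, p))               # raw data matrix

# Draw a left mask A from the orthogonal group via QR decomposition
Q, R = np.linalg.qr(rng.normal(size=(n, n)))
A = Q * np.sign(np.diag(R))               # sign fix removes QR-convention bias

masked = A @ X                            # the published data AX
# A is orthogonal, so (AX)^T (AX) = X^T X: the sufficient statistics
# for linear models survive the masking exactly
assert np.allclose(masked.T @ masked, X.T @ X)
```

Any quantity computable from X T X (regression coefficients, covariances) is therefore identical whether computed on X or on AX.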
Traditionally, there is a trustworthy data collector/manager that collects raw data and ensures privacy protection by releasing the data sets with random perturbations. Such procedures, however, do not protect against attacks where an unscrupulous party gains unauthorized access to the raw data set X kept by these centers. Such security breaches are becoming more common, as shown by recent well-publicized incidents involving hacking of databases at major retailers, banks and credit bureaus [Huffington Post, 2011, Reuters, 2015].
This paper fits into a line of work tackling the problem of how to create useful synthetic data without having a trustworthy data aggregator, and provides a theoretical study of the triple matrix-masking (TM 2 ) procedure [Wu et al., 2017b] that does not assume such a trustworthy data collector/manager. The TM 2 procedure is a multi-party collection and masking system that aims to collect and publish the random orthogonal masked data set AX. We prove that, assuming no collusion between parties, no party learns more than the orbit of the data matrix under the action of the orthogonal group. More specifically, given the view of a particular party, let 𝒮 be the set of data matrices that could possibly have resulted in that view. We show that 𝒮 contains the full orbit of the data matrix and that, given any prior on the data matrix, the party's posterior is simply their prior restricted to 𝒮. We call data collection procedures with such properties strong obfuscating, since any extra information beyond AX available to a party does not help in further identifying the individual level data.
In the differential privacy literature, the issue of an untrustworthy data collector can be dealt with using local differential privacy procedures [Kasiviswanathan et al., 2011], where noise is added to the individual data before it is passed to the data collector. The resulting synthetic data from differential privacy procedures, however, do not preserve exact statistics and hence require special inference procedures designed to achieve optimal statistical inference [Duchi et al., 2017]. Our TM 2 procedure provides an alternative where the published masked data exactly preserve any statistics of the data that are invariant under the action of the orthogonal group. This provides a useful utility: exact statistical inference for linear models is preserved, so standard linear statistical inference procedures can be applied directly to the resulting synthetic data from the TM 2 procedure. On the other hand, the TM 2 procedure supports only a one-shot collection of each individual's data. When the individual data providers are sampled in multiple independent collections by different data collectors, differential privacy procedures can measure and limit the privacy leakage for the composition of the multiple collections; the TM 2 procedure does not consider the privacy leakage for such compositions.
Section 2 describes the TM 2 procedure and two new modifications that make it strong obfuscating. The theoretical analysis is provided in Section 3. Section 4 provides a summary and a more detailed discussion of the relationship of the TM 2 procedure to differential privacy and multi-party computation methods.

The Masked Data Collection Procedure TM 2 and Its Modification
The privacy-preserving data collection scheme TM 2 was first proposed in Wu et al. [2017b] and later expanded by Wu et al. [2017a]. We describe our modified basic version of the TM 2 method here:
Step 1. The data collectors plan the data collection, create the database structure, and program the data collection system. They randomly generate a p × p orthogonal matrix B, which is distributed to the participants' data collection devices.
Step 2. Each participant's data x 1 (a vector of dimension p 1 ) is collected and merged with Gaussian noise x 2 (of dimension p 2 ) into a vector x = (x 1 , x 2 ) of dimension p = p 1 + p 2 . Then x is right multiplied by B on the participant's device, and only the resulting masked data xB leaves the device and is sent to the masking service provider.
Step 3. The masking service provider generates another n × n random orthogonal matrix A 2 . After receiving data from all participants, it combines the individual data xB into an n × p matrix XB, left multiplies by A 2 and sends the doubly masked data A 2 XB to the data collectors.
Step 4. The data collectors remove the right mask by computing A 2 XB B T = A 2 X (they know B from Step 1) and separate out A 2 X 1 , discarding the noise part A 2 X 2 . They then generate another n × n random orthogonal matrix A 1 , left multiply it to A 2 X 1 , and publish AX 1 (where A = A 1 A 2 ) which is accessible by all data users.
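The four steps above can be simulated end to end; the following is an illustrative Python sketch under our own naming conventions, not an implementation from Wu et al. [2017b]:

```python
import numpy as np

rng = np.random.default_rng(1)

def haar_orthogonal(k, rng):
    # Uniform random orthogonal matrix: QR of a Gaussian matrix
    # with a sign correction on the columns
    Q, R = np.linalg.qr(rng.normal(size=(k, k)))
    return Q * np.sign(np.diag(R))

n, p1, p2 = 5, 2, 6        # p2 >= n, so X is full rank almost surely
p = p1 + p2
sigma = 10.0               # noise standard deviation

# Step 1: data collector generates the right mask B, sent to devices
B = haar_orthogonal(p, rng)

# Step 2: each device appends Gaussian noise and sends only x B
X1 = rng.integers(0, 2, size=(n, p1)).astype(float)   # real responses
X2 = sigma * rng.normal(size=(n, p2))                  # device noise
XB = np.hstack([X1, X2]) @ B

# Step 3: masking service provider applies its left mask A2
A2 = haar_orthogonal(n, rng)
A2XB = A2 @ XB

# Step 4: collector removes B (it knows B), discards the noise columns,
# applies a second left mask A1, and publishes A X1 with A = A1 A2
A2X = A2XB @ B.T
A1 = haar_orthogonal(n, rng)
published = A1 @ A2X[:, :p1]

# The published data preserve the exact sufficient statistics
assert np.allclose(published.T @ published, X1.T @ X1)
```

Note that the raw X1 never leaves the devices unmasked, the masking service provider sees only XB, and the collector sees only A2X.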
A detailed theoretical analysis of the privacy guarantee of the TM 2 method has been missing. This paper fills that gap by proving theoretically that this modified version of the TM 2 method is strong obfuscating. We prove the strong obfuscating guarantee by showing that: (A) the extra information that any party in the process owns will not allow that party to reduce the data domain (the set of possible values of the data) to one small enough to identify individual level data; and (B) there is no statistical information leakage beyond the domain restriction considered in (A).
Compared to the original TM 2 scheme in Wu et al. [2017b], we make two modifications to the TM 2 procedure. For the first modification, we add random Gaussian noise in Step 2. The data collector wants to collect p 1 variables on n individuals, so the real response matrix is X 1 of dimensions n × p 1 . We ask each participant to generate p 2 pure Gaussian noise variables on his/her device according to a fixed variance parameter σ 2 . Hence, the full data matrix becomes X = (X 1 , X 2 ). For the privacy protection proved in later sections, we require that p 1 < n ≤ p = p 1 + p 2 . In Step 2 of this modified procedure, the Gaussian noise x 2 is mingled with the real response x 1 to provide protection in addition to the random mask B. In Step 4, after the collectors get back A 2 X = (A 2 X 1 , A 2 X 2 ), they separate the matrix and discard the noise. Therefore the published data set AX 1 with A = A 1 A 2 still gives the exact summary statistics, as it is masked only by A without containing the added noise.
For the second modification, instead of using a random invertible matrix for the right mask B as originally proposed by Wu et al. [2017b], we use a random orthogonal matrix. As we will see in the privacy analysis in the next section, using an invertible matrix does make one part of the privacy proof easier. However, the other part of the privacy proof depends on using a uniformly distributed random matrix to avoid information leakage that can lead to probabilistic attacks. While there is a natural uniform distribution on the orthogonal matrices that is well studied in the literature, there is no natural uniform distribution on the set of all invertible matrices. The uniformly distributed orthogonal matrix B does provide sufficient privacy protection when combined with the added noise X 2 .

Privacy Analysis of TM 2

Setup.
To rigorously study the privacy protection issues in this data collection process, we analyze the information that can be accessed by each party and analyze whether such information allows inference of the individual level data.
First, we illustrate how to analyze the privacy protection assuming that the adversary only has access to the publicly published left-masked data set M L = AX 1 where A is a random n × n orthogonal matrix. The issue becomes whether an adversary can identify individual level data knowing only M L = m L .
We consider the analysis in two stages. Knowing M L = m L restricts the possible values of X 1 and can thus reveal information. In the first stage, we consider whether this support restriction on X 1 (due to M L = m L ) enables the identification of individual data. Let 𝒳 1 denote the support of X 1 , and let 𝒳 1 | m L denote the restricted support of X 1 given that M L = m L . The privacy preservation depends on the size of 𝒳 1 | m L . For example, in an extreme case, if 𝒳 1 | m L contains only one matrix, then X 1 is known to everyone and data privacy cannot be protected. Generally, we show in the next section that this restricted support 𝒳 1 | m L is big enough that identification of individual data is impossible.
In the second stage, we consider whether the adversary can learn any information beyond the restriction on support analyzed in the first stage. Such information can enable adversaries to launch probabilistic attacks [Machanavajjhala et al., 2007, Fung et al., 2010]. Fortunately, due to the independence between the mask A and the raw data X 1 , we can show that the posterior density of X 1 given M L = m L is the same as the prior density of X 1 restricted to the support 𝒳 1 | m L . Thus any loss of privacy is through the support restriction already studied in stage one. Therefore, knowing M L = m L does not identify individual level data.
Next, we consider the privacy protection for all parties involved in the whole TM 2 data collection process. That is, we conduct the above two-stage privacy protection analysis given all information available to one party in the process. The data collector and the masking service provider each have access to some intermediate masked data in addition to the public data. Hence, we need to analyze privacy protection for an adversary knowing this intermediate masked data together with the public data set M L = AX 1 .
The data collector knows, in addition to M L = AX 1 , the doubly masked data A 2 XB. Since the data collector knows the masks A 1 and B, knowing the data A 1 A 2 X 1 and A 2 XB is simply equivalent to knowing A 2 X. Because X 2 is pure noise independent of the raw data X 1 , the theoretical privacy analysis for the data collector knowing A 2 X = (A 2 X 1 , A 2 X 2 ) yields essentially the same results as the analysis for a user with access only to M L = AX 1 .
The masking service provider has access to the right-masked data M R = XB in addition to the public left-masked data M L = AX 1 . This information results in the most severe restriction on the support, compared to the restrictions from the information held by the other parties. Thus, this is the weakest link for privacy preservation in the whole TM 2 data collection scheme. In Section 3, we present details of the two-stage privacy protection analysis when both M L and M R are known.

Notations, Formalizations and Technical Preliminaries.
We denote the probability densities of the random matrices X 1 , X 2 , A and B as π X 1 (x 1 ), π X 2 (x 2 ), π A (a) and π B (b) respectively. The supports of these distributions are denoted respectively as 𝒳 1 , 𝒳 2 , 𝒜 and ℬ. Given information INFO, we write 𝒳 1 | INFO for the support of X 1 restricted by INFO. For example, given only the public masked data INFO = M L , the restricted support is 𝒳 1 | M L = {U : aU = M L for some a ∈ 𝒜}. Let 𝒪 n denote the set consisting of all n × n orthogonal matrices. In the case of left masking with a random orthogonal matrix A, for any orthogonal P ∈ 𝒪 n we have (AP T )(PX 1 ) = AX 1 = M L with AP T ∈ 𝒪 n , so that 𝒳 1 | M L = 𝒪 n X 1 , the orbit of X 1 under 𝒪 n . Here and throughout this paper, we use T to denote the transpose of a matrix.
For the strong obfuscating guarantee, we wish to show that the extra information available to the parties in the process does not cause any privacy loss beyond the publicly released final left-masked data M L = AX 1 . We want to show that: stage one (i) the restricted support 𝒳 1 | INFO is the same as 𝒳 1 | M L = 𝒪 n X 1 ; stage two (ii) the conditional probability distribution of X 1 given INFO is the same as the prior distribution of X 1 restricted to the support 𝒳 1 | INFO , so that there is no privacy loss through probabilistic attacks beyond the loss from the support restriction considered in stage one.
We now formalize the precise mathematical statements to prove in stages one and two. More precisely, for stage one, we hope that the restricted support is the same as if only the public left-masked data were available:

(i) 𝒳 1 | INFO = 𝒳 1 | M L = 𝒪 n X 1 ,

for the INFO available to any one party in the process. For the second stage, we denote by π X 1 | INFO (x 1 | INFO) the posterior distribution of X 1 given INFO. The prior density π X 1 restricted on the support 𝒳 1 | INFO is

π X 1 | 𝒳 1 | INFO (x 1 ) = π X 1 (x 1 ) 1{x 1 ∈ 𝒳 1 | INFO } / ∫ 𝒳 1 | INFO π X 1 (u) du.

To show that there is no extra privacy loss beyond the support restriction considered in stage one, we prove that these two probability densities agree with each other. That is, we wish to prove

(ii) π X 1 | INFO (x 1 | INFO) = π X 1 | 𝒳 1 | INFO (x 1 ).

Definition 3.1. A data collection process is strong collection obfuscating if conditions (i) and (ii) hold for the information INFO available to any party in this process.
A slightly weaker version is that the above property holds with a high probability. Notice that the INFO available to any party in this process can be determined from the values of X 1 , X 2 , A and B, which are generated respectively from distributions with densities π X 1 (x 1 ), π X 2 (x 2 ), π A (a) and π B (b). Thus such INFO is generated from a probability distribution defined by π X 1 (x 1 ), π X 2 (x 2 ), π A (a) and π B (b). We want that, with high probability under this distribution, the generated values of INFO satisfy conditions (i) and (ii).
Definition 3.2.-A data collection process is ϵ-strong collection obfuscating if, with probability at least 1 − ϵ, conditions (i) and (ii) hold for the information INFO available to any party in this process.
Our definition of the strong collection obfuscating procedure ensures that there is no privacy loss, due to observations by any party in the process, beyond that contained in the publicly released final data. This definition separates the privacy protection in the collection process from the privacy protection in the public release of the final data M L = AX 1 . The theoretical analysis concentrates on the soundness of the collection process.
Given the public left-masked data M L = AX 1 , the statistic X 1 T X 1 = M L T M L is released to the user.
The user thus has the exact first two statistical moments, so statistical models, such as linear regression, can be fitted exactly as if the user had the raw data set X 1 . The residuals are known up to an orthogonal matrix multiplication, so the usual statistical model diagnostics can also be carried out as if done on the raw data set.
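The utility claim can be checked numerically. In the sketch below (illustrative numpy code; we assume the response column is released as one of the columns of X 1 ), ordinary least squares on the masked data reproduces the raw-data fit exactly:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 20
X1 = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])  # design
y = X1 @ np.array([1.0, 2.0, -0.5]) + rng.normal(size=n)     # response

# One orthogonal mask A applied to predictors and response alike
Q, R = np.linalg.qr(rng.normal(size=(n, n)))
A = Q * np.sign(np.diag(R))
Xm, ym = A @ X1, A @ y

beta_raw, *_ = np.linalg.lstsq(X1, y, rcond=None)
beta_masked, *_ = np.linalg.lstsq(Xm, ym, rcond=None)
assert np.allclose(beta_raw, beta_masked)   # identical OLS estimates

# Residual norms agree too, since orthogonal maps preserve norms
assert np.allclose(np.linalg.norm(y - X1 @ beta_raw),
                   np.linalg.norm(ym - Xm @ beta_masked))
```

The individual residuals themselves are rotated by A, which is why diagnostics based on the residual distribution (rather than on which individual produced which residual) remain valid.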
For continuous data, the user cannot recover the individual level data since the user only sees a linear combination of all individuals' data, and there is no utilizable statistical distributional information other than the prior (population) density π X 1 . This ensures the privacy of individual data.
In practice, the types of the elements in X 1 may also be known to the user. This can further restrict the support. We assume that the elements in the data matrix X 1 are all encoded as numerical values (e.g., a "yes/no" answer to a question may be encoded as 1 and 0). We consider the type of data in each column, either continuous, discrete or binary, to be public knowledge. Let 𝒯 j denote the set of values allowed by the type of data in the j-th column of X 1 . For example, if the data are continuous, then 𝒯 j = ℛ; if the data are binary, then 𝒯 j = {0, 1}; if the data are positive integers, then 𝒯 j = {1, 2, …} = ℤ + . Knowing the type of data in each column restricts the support of X 1 to 𝒳 1 type = {U : all entries of the j-th column of U are in 𝒯 j , j = 1, …, p 1 }.
Then with knowledge of both INFO and the types of data, the restricted support becomes the intersection 𝒳 1 type ∩ 𝒳 1 | INFO . Let 𝒫 n denote the set of all n × n permutation matrices P. Since all permutation matrices are orthogonal and a row permutation does not change the type of the elements in any column, we have the following Lemma:

Lemma 3.3. For any strong obfuscating data collection process, 𝒫 n X 1 ⊆ 𝒳 1 type ∩ 𝒳 1 | INFO .

Lemma 3.3 indicates that a strong collection obfuscating data collection process offers some privacy protection even when the data types are known. Since all row permutations of X 1 remain in the restricted support, an adversary cannot link any particular row to a specific individual. It is not clear whether the type can be combined with some side information (such as the fact that a particular individual is a smoker) to reveal other individual level data. However, notice that any weakness in this aspect is inherently due to releasing the public data AX 1 . Our strong collection obfuscating procedure ensures that no extra privacy loss is added during the process beyond the privacy loss in releasing AX 1 .
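The permutation invariance behind Lemma 3.3 can be checked numerically; the following is an illustrative numpy sketch:

```python
import numpy as np

rng = np.random.default_rng(7)
n = 4
# A random n x n permutation matrix: permuted rows of the identity
P = np.eye(n)[rng.permutation(n)]
assert np.allclose(P.T @ P, np.eye(n))      # permutations are orthogonal

X1 = rng.integers(0, 2, size=(n, 2)).astype(float)   # binary data columns
PX1 = P @ X1
# A row permutation preserves the multiset of rows, hence every
# column keeps its data type (binary stays binary)
assert sorted(map(tuple, PX1)) == sorted(map(tuple, X1))
```

Since PX 1 satisfies the same type constraints as X 1 and lies in the same orbit, type knowledge alone cannot pin a row to an individual.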
As we discussed in the previous section, the party with the most information during the TM 2 process is the masking service provider, who knows INFO = (M L , M R ). Therefore, in the next sections, we study when (i) and (ii) hold given this INFO. We end this section with two technical lemmas on the uniform distribution over orthogonal matrices. Since the set 𝒪 n is invariant under multiplication by a fixed orthogonal matrix, the uniform distribution on 𝒪 n is also invariant under such matrix multiplication.

Lemma 3.4. Let π 0 (·) denote the probability density function of the uniform distribution on 𝒪 n . Then for any orthogonal matrix A 0 ∈ 𝒪 n ,

π 0 (a) = π 0 (A 0 a) = π 0 (aA 0 ), for all a ∈ 𝒪 n . (3.1)

Also, the product of two independent uniformly distributed orthogonal matrices is uniformly distributed.
Lemma 3.5. If A 1 ~ π 0 and A 2 ~ π 0 are independent of each other, then their product A = A 1 A 2 also follows the uniform distribution π 0 on 𝒪 n .
The proof is straightforward and can be found in Chapter 2 of Zhang [2014].
In the TM 2 scheme, when the data collector and the masking service provider generate the random orthogonal matrices A 1 and A 2 respectively according to π 0 , the mask A = A 1 A 2 for the publicly released data set is also uniformly distributed. In practice, uniformly distributed random orthogonal matrices can be generated using algorithms described in Heiberger [1978], Anderson et al. [1987], Wu et al. [2017b].
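One standard way to generate such matrices (a sketch, not necessarily the algorithm of the cited references) is to take the QR decomposition of a Gaussian matrix and correct the column signs:

```python
import numpy as np

def haar_orthogonal(k, rng):
    """Uniform random orthogonal matrix: QR of a Gaussian matrix,
    with column signs fixed so the QR convention adds no bias."""
    Q, R = np.linalg.qr(rng.normal(size=(k, k)))
    return Q * np.sign(np.diag(R))

rng = np.random.default_rng(3)
A1 = haar_orthogonal(4, rng)
A2 = haar_orthogonal(4, rng)
A = A1 @ A2        # by Lemma 3.5, A is again uniformly distributed
assert np.allclose(A.T @ A, np.eye(4))
```

The sign correction matters: without it, the QR routine's sign convention biases the output away from the uniform distribution.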

Restricted Support Given Knowledge of Masked Data Sets.
We first prove that condition (i) holds for INFO = (M L , M R ) under the original choice of an invertible right mask.

Theorem 3.6. Suppose 𝒜 = 𝒪 n and ℬ = ℐ p , p 1 ≤ n ≤ p, and X is full rank (i.e., rank(X) = n). Then for any P ∈ 𝒪 n , PX 1 ∈ 𝒳 1 | M L , M R . In other words, 𝒳 1 | M L , M R ⊇ 𝒪 n X 1 .

Proof. Since X is full-rank and n ≤ p, there exists a (p − n) × p matrix X * such that the stacked matrix (X; X * ) is full-rank and thus invertible. Since P ∈ 𝒪 n , (PX; X * ) is also full-rank and invertible. Hence we can define the invertible matrix B̃ = (PX; X * ) −1 (X; X * )B. Also let Ũ = PX 2 , so that (PX 1 , Ũ) = PX. We then have (PX; X * )B̃ = (XB; X * B), and the first n rows in the last equation give (PX)B̃ = XB = M R . Together with the orthogonal matrix Ã = AP T , which satisfies Ã(PX 1 ) = AX 1 = M L , this shows that PX 1 ∈ 𝒳 1 | M L , M R . □

Theorem 3.6 states that condition (i) is satisfied when X is full rank. In the original TM 2 scheme proposal, the full rank condition may or may not be satisfied, because it is determined by the underlying probability distribution of X 1 , which is outside the control of the designer of the procedure. With the modification of the extra noise matrix X 2 , we can ensure the full rank condition by specifying the noise generation mechanism. In particular, we specify that each individual data provider generates a p 2 -dimensional noise vector with i.i.d. elements from a Gaussian distribution, with p 2 ≥ n. This ensures with probability one that X is indeed full rank.
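The full-rank guarantee supplied by the Gaussian noise columns can be seen in a quick numerical check (illustrative only):

```python
import numpy as np

rng = np.random.default_rng(4)
n, p1, p2 = 8, 3, 8          # p2 >= n, as the modification requires

# Even a degenerate response matrix (all rows identical, rank 1)
# is rescued by the appended Gaussian noise columns:
X1 = np.ones((n, p1))
X2 = rng.normal(size=(n, p2))
X = np.hstack([X1, X2])
assert np.linalg.matrix_rank(X) == n     # full row rank, almost surely
```

Here rank(X 1 ) = 1, so the full rank of X comes entirely from the noise block, which has rank n with probability one whenever p 2 ≥ n.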
Remark 1 (Size of the right mask). For privacy preservation, the size p of the right mask has to be at least as large as the number of individuals n, as assumed in Theorem 3.6. When p < n, some rows of M R are linearly dependent, which provides a further restriction on the support. We provide a counterexample in Appendix A to illustrate that such a restriction, together with knowledge of the data type, can reveal individual level data.
Above we considered the support restriction under the original TM 2 scheme proposal [Wu et al., 2017b] with an invertible right mask, ℬ = ℐ p . However, unlike 𝒪 p , ℐ p does not form a compact Hausdorff topological group. Therefore, there exists no uniform distribution on ℐ p .
Due to the non-uniformity of B, the posterior distribution of X 1 given (M L , M R ) leaks information beyond the support restriction, so the second stage condition (ii) no longer holds. This makes the usage of random invertible right masks in the TM 2 scheme very tricky: it is unclear what distribution on ℐ p should be used to generate the random invertible B.
Here, we consider the modification of the TM 2 scheme where the right mask B is a random orthogonal matrix generated from the uniform distribution π 0 on 𝒪 p . We show that if the random noise X 2 is large enough, then condition (i) still holds when the orthogonal right mask B is used.
Let λ min (M) and λ max (M) denote the minimum and the maximum eigenvalues of a positive semidefinite matrix M. The restricted support will remain large if the noise is large enough, as quantified by condition (3.5) on these extreme eigenvalues. Now we have a result similar to Theorem 3.6.
Theorem 3.7. Suppose 𝒜 = 𝒪 n and ℬ = 𝒪 p , p 1 ≤ n ≤ p. If condition (3.5) holds, then for any P ∈ 𝒪 n , PX 1 ∈ 𝒳 1 | M L , M R . The proof is provided in Appendix B.
Next, we show that condition (ii) also holds under condition (3.5). We then discuss how achievable the technical condition (3.5) is in practice.

Information Leakage Beyond the Support Restriction.
We now study the second stage condition (ii) by checking the amount of information an adversary can get from the posterior distribution of X 1 given (M L , M R ) beyond the restriction on the support of X 1 . Given INFO = (M L , M R ), the posterior density is denoted as π X 1 | M L , M R (x 1 | m L , m R ). The prior density π X 1 restricted on the support 𝒳 1 | m L , m R is denoted as π X 1 | 𝒳 1 | m L , m R (x 1 ).
Theorem 3.8. Let X 1 be a random matrix with probability density π X 1 . We assume that the elements in X 2 are generated i.i.d. from a Gaussian distribution with mean zero. When condition (3.5) holds, given M L and M R , the posterior density of X 1 is the same as the prior density restricted on 𝒳 1 | M L , M R . That is, π X 1 | M L , M R (x 1 | m L , m R ) = π X 1 | 𝒳 1 | m L , m R (x 1 ). The proof of Theorem 3.8 is provided in Appendix D.

ϵ-strong obfuscating TM 2 .
Theorems 3.7 and 3.8 state, respectively, that conditions (i) and (ii) hold under condition (3.5). Combining them, we have the following Theorem.

Theorem 3.9. If condition (3.5) holds with probability at least 1 − ϵ (condition (3.7)), then the modified TM 2 data collection process is ϵ-strong collection obfuscating.
The ϵ-strong collection obfuscating property ensures that there is at most probability ϵ for the process to leak any private information beyond the publicly released data AX 1 . TM 2 achieves this property when the technical condition (3.7) holds. To achieve condition (3.7), we generate the p 2 -dimensional noise vector x 2 with i.i.d. Gaussian elements of mean zero and a sufficiently large variance σ 2 . We present a technical probability bound in Appendix C, where the probability of violating condition (3.5) decreases exponentially, and we specify a σ 2 value which ensures condition (3.7). A larger variance σ 2 always increases the probability that condition (3.5) holds. In practice, the variance σ 2 is limited only by computational accuracy: σ should not exceed the raw data values by more than the orders of magnitude allowed by the machine precision.
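Without restating the exact form of condition (3.5), the following sketch illustrates the mechanism: increasing σ 2 drives up the ratio of the smallest noise eigenvalue λ min (X 2 X 2 T ) to the largest data eigenvalue λ max (X 1 X 1 T ), which is the kind of comparison such an eigenvalue condition makes (illustrative numpy code; the specific ratio is our assumption, not the paper's formula):

```python
import numpy as np

rng = np.random.default_rng(5)
n, p2 = 6, 10
# Toy data: an intercept column plus one binary column
X1 = np.column_stack([np.ones(n),
                      rng.integers(0, 2, size=n).astype(float)])

def eig_ratio(sigma):
    """Smallest eigenvalue of the noise Gram matrix over the largest
    eigenvalue of the data Gram matrix."""
    X2 = sigma * rng.normal(size=(n, p2))
    lam_min = np.linalg.eigvalsh(X2 @ X2.T).min()
    lam_max = np.linalg.eigvalsh(X1 @ X1.T).max()
    return lam_min / lam_max

# A larger noise variance sigma^2 makes the noise part dominate:
# the ratio scales roughly like sigma^2
assert eig_ratio(100.0) > eig_ratio(1.0)
```

Note that λ min (X 2 X 2 T ) is positive with probability one here because p 2 ≥ n, tying this back to the full-rank requirement.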

Extension to Alleviate Collusion Risks.
We have shown that the privacy of individual data can be protected when no party in the TM 2 scheme knows all the masks. However, there are also risks of collusion among different parties in the procedure. Since the right mask B is known to the data collector and all individual data providers, if one of them shares this information with the masking service provider, the privacy protection can be broken. Wu et al. [2017a] proposed ideas to protect against this collusion risk using multiparty computation. For each individual, the data vector x can be broken up into K 1 random components x (1) , …, x (K 1 ) where x = x (1) + … + x (K 1 ) . These components are sent to K 1 right masking service providers, one to each. The resulting masked data x (i) B i , i = 1, ..., K 1 , are then sent to the left masking service provider, where they are merged to create the doubly masked data AX (i) B i , i = 1, ..., K 1 . For further protection, they can be passed through K 2 left masking service providers, so that the left mask A is itself a product of K 2 random orthogonal matrices. Then the masked data AX (i) B i , i = 1, ..., K 1 , are sent back to the corresponding right masking service providers to remove the right masking. The resulting AX (i) , i = 1, ..., K 1 , are sent to the data collector to generate AX = AX (1) + … + AX (K 1 ) . Unless all K 1 right (or all K 2 left) masking service providers collude, they cannot find the values of all components X (1) , …, X (K 1 ) .
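The additive splitting step can be sketched as follows (illustrative Python; we use Gaussian shares for concreteness, which is exactly the design choice whose privacy implications remain open):

```python
import numpy as np

rng = np.random.default_rng(6)
p, K1 = 4, 3
x = rng.normal(size=p)                  # one participant's data vector

# Additive secret sharing: K1 - 1 random shares plus a balancing share
shares = [rng.normal(size=p) for _ in range(K1 - 1)]
shares.append(x - sum(shares))
assert np.allclose(sum(shares), x)      # all K1 shares reconstruct x

# Each share alone is just noise to its holder; only the sum of all
# K1 shares reveals x, so any K1 - 1 colluders learn nothing exact.
```

The balancing share makes the reconstruction exact, while each individual share carries no information about x on its own.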
The stage one theoretical analysis of this extended TM 2 scheme proceeds similarly as before, where the restricted support condition (i) holds given condition (3.5). The stage two analysis is more involved, as the posterior distribution of X 1 given some shares depends on the distribution of the shares. Which distributions the shares should be generated from, so as to incur no additional privacy loss, remains an open question and will be investigated in future work.

Discussions and Conclusions
This paper conducts a theoretical analysis of privacy preservation in a modified TM 2 scheme. Random noise is used together with uniformly distributed orthogonal matrix masks to hide individual data during the data collection process. The noise addition in the first step of the TM 2 scheme is similar to the idea of noise-perturbed response schemes. However, the critical difference is that our noise addition is only intended to help mask the data in transit, and it is in fact removed after the right mask removal. The resulting published data set is a left-masked data set with exact summary statistics, unlike many other noise addition schemes where the summary statistics are only randomly approximated.
This work aims to protect against unscrupulous access to the raw data X 1 traditionally held by a trusted operator. We would like to further clarify the relationship to differential privacy methods, which aim to provide a strong privacy protection and closure under composition of multiple accesses to the database. There are two types of differential privacy models. In the central model, a trusted database operator holds the raw data and releases noise-perturbed summary statistics for inquiries. In the local model [Evfimievski et al., 2003, Kasiviswanathan et al., 2011, Cormode et al., 2018], noise is added at the individual level based on the idea of randomized response methods [Warner, 1965, Blair et al., 2015]. The local differential privacy procedures similarly address the issue of an untrustworthy central database operator. In recent years, Google [Erlingsson et al., 2014], Apple [Thakurta et al., 2017] and Microsoft [Ding et al., 2017] have all developed and deployed local differential privacy procedures in data collection.
There are two types of possible unscrupulous access to the raw data X 1 to be addressed. The first is that the data collector is untrustworthy. The second is that an unscrupulous party might break into the server containing data collected by an honest data collector. In the differential privacy literature, the first type is handled by using local differential privacy procedures, while the second type is addressed via pan-private data analysis [Dwork et al., 2010]. Our TM 2 scheme protects against both types of unscrupulous access, but only allows for a one-shot collection of each individual's data.
While both the local differential privacy procedures and the TM 2 scheme can provide protection against unscrupulous access, the goals are somewhat different. The TM 2 scheme aims to collect a masked data set that preserves the first two statistical moments of the variables (note that X 1 T X 1 is knowable from the publicly available AX 1 ). This allows exact statistical inference on quantities depending on these statistical moments. The local differential privacy methods, on the other hand, aim to provide a stronger privacy protection under composition of multiple data collections/accesses.
The idea of the TM 2 scheme is similar to secure multi-party computation (SMC) procedures, in that this scheme tries to distribute information among parties so that each party does not get access to individual level data other than its own. There are also important differences between TM 2 and SMC. They differ in their designed purposes, even though both want parties to cooperate in a joint task while keeping privacy. SMC is designed to conduct joint statistical analysis without the parties revealing their data to each other. TM 2 wants to collect a masked data set, which enables statistical analysis, without parties revealing the actual data to the data collector. Operationally, SMC requires distributed storage of data as well as distributed computation. Specifically, if we require that the private data of parties never leave their devices, then SMC needs the parties to stand by, ready for any statistical analysis that may occur much later in the future. In contrast, the TM 2 method is distributed only in the data collection stage. The private data leaves the parties' devices in a masked form, and is later centrally stored in the masked form AX 1 . Since all future statistical analysis is conducted on the publicly released AX 1 , there is no need for the parties to be available for future analysis.
In this paper, we presented a privacy analysis that clearly separates the risks coming from support restriction from the risks of probabilistic attacks beyond the support restriction. With this analysis, we showed that the TM 2 scheme can safely collect a synthetic data set AX_1, a random orthogonal transformation of the raw data set X_1. All information exchanged during the data collection procedure is masked, and no one involved in the procedure can access the raw data set. This removes the need to trust a data record keeper and provides researchers with a new tool to collect data that allows exact statistical inference for linear models while providing privacy protection: no hacking attack against a party in the data collection procedure can recover real individual level data, since no party holds enough information to infer the private individual data.
with the first two rows as $X_a$ and the last row as $X_b$. Without loss of generality, we assume that $a_2 \neq 0$, so that $X_a = \begin{pmatrix} 1 & a_1 \\ 0 & a_2 \end{pmatrix}$ is non-singular, and we assume that $a_3/a_2$ is not an integer. Then the first column $X_1 = (1, 0, 0)^T$ can be uniquely determined from the masked data $M_R$.
To see this, we decompose $M_R$ into $M_a$ (its first two rows) and $M_b$ (its last row), in parallel with the decomposition of $X$ into $X_a$ and $X_b$. Since $M_R = XB$ with $B$ invertible, $M_b M_a^{-1} = X_b B (X_a B)^{-1} = X_b X_a^{-1}$ is known to anyone with access to $M_R$.
Using (A.1), this means $X_b X_a^{-1} = (0, a_3/a_2)$ is determined from the masked data $M_R$. Then the first column of the identity $X_b X_a^{-1} X_a = X_b$ gives $(0, a_3/a_2)(x_{11}, x_{21})^T = x_{31}$. That is, $(a_3/a_2)x_{21} = x_{31}$.
Since $x_{21}$ and $x_{31}$ are binary entries of $X_1$ and $a_3/a_2$ is not an integer, the attacker can infer from $(a_3/a_2)x_{21} = x_{31}$ that $x_{21} = x_{31} = 0$. Then we must have $x_{11} = 1$, since $M_R$ (and thus $X$) is full-rank. That is, every entry of $X_1 = (1, 0, 0)^T$ is now known from the masked data $M_R$.
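The attack above can be replicated numerically. The following numpy sketch (values $a_1, a_2, a_3$ chosen for illustration) shows that the right mask $B$ cancels out of $M_b M_a^{-1}$, leaking the ratio $a_3/a_2$ and hence the binary column:

```python
import numpy as np

rng = np.random.default_rng(1)

# Raw 3x2 data: first column is binary, a3/a2 is deliberately non-integer.
a1, a2, a3 = 0.7, 2.0, 3.0          # a3/a2 = 1.5, not an integer
X = np.array([[1.0, a1],
              [0.0, a2],
              [0.0, a3]])

# Right-mask with a random 2x2 orthogonal matrix B (unknown to the attacker).
B, _ = np.linalg.qr(rng.normal(size=(2, 2)))
M = X @ B

# The attacker splits M into M_a (first two rows) and M_b (last row).
M_a, M_b = M[:2], M[2:]

# B cancels: M_b @ inv(M_a) = X_b @ inv(X_a) = (0, a3/a2).
leak = M_b @ np.linalg.inv(M_a)
ratio = leak[0, 1]                   # recovers a3/a2

# Since (a3/a2) * x21 = x31 with binary x21, x31 and a non-integer ratio,
# the attacker concludes x21 = x31 = 0, hence x11 = 1 by full rank.
assert np.isclose(ratio, a3 / a2)
assert np.isclose(leak[0, 0], 0.0)
```

The masked data alone, combined with knowledge of the data type (binary entries), thus fully identifies the first column.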
Notice that, according to Lemma 3.3, a strong collection obfuscating procedure would not have allowed this identification of individual data from the masked data. There is indeed additional privacy loss without the assumption that $p \ge n$. In general, when $p < n$ and $M_R$ is full rank, the identity $M_b M_a^{-1} X_a = X_b$, together with knowledge of the data type, may leak sensitive information about the original data $X$.
This is equivalent to
$$M_R M_R^T = (P X_1 U)(P X_1 U)^T. \quad \text{(B.2)}$$
Now we apply a singular value decomposition $M_R = SDV$, where $S \in \mathbb{O}(n)$, $V \in \mathbb{O}(p)$, and $D$ is a diagonal matrix with nonincreasing nonnegative diagonal elements. Then, due to (B.2), the singular value decomposition of $P X_1 U$ is $S D \tilde{V}$ for some $\tilde{V} \in \mathbb{O}(p)$. Therefore $B = \tilde{V}^T V$ is an orthogonal matrix satisfying (3.4). □

We compute the posterior density
$$\pi_{X_1 \mid M_L, M_R}(x_1 \mid m_L, m_R) = \frac{\pi_{X_1, M_L, M_R}(x_1, m_L, m_R)}{\int_{\mathcal{X}_1(m_L, m_R)} \pi_{X_1, M_L, M_R}(x_1^*, m_L, m_R)\, dx_1^*}, \quad \text{(D.1)}$$
and compare it with the prior density $\pi_{X_1}(x_1)$ restricted to the support $\mathcal{X}_1(m_L, m_R)$.
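The constructive step of this proof, recovering an orthogonal matrix that aligns two matrices with equal left Gram matrices through their shared singular values, can be checked numerically. The numpy sketch below is illustrative (not the paper's code): it assumes $n > p$ with distinct singular values, uses a hidden mask B0 only to generate the data, and rebuilds an orthogonal B from the two SVDs:

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 6, 3

# Candidate matrix Y (playing the role of P X1 U) and a hidden orthogonal mask B0.
Y = rng.normal(size=(n, p))
B0, _ = np.linalg.qr(rng.normal(size=(p, p)))
M = Y @ B0                       # masked data: M M^T = Y Y^T

# Thin SVDs of M and Y. Because M M^T = Y Y^T, they share singular values,
# and (for distinct singular values) left singular vectors up to column signs.
S1, D1, V1t = np.linalg.svd(M, full_matrices=False)
S2, D2, V2t = np.linalg.svd(Y, full_matrices=False)
assert np.allclose(D1, D2)

# Resolve the per-column sign ambiguity of the left singular vectors.
signs = np.sign(np.sum(S1 * S2, axis=0))

# Construct an orthogonal B with Y @ B = M, as in the proof.
B = V2t.T @ np.diag(signs) @ V1t
assert np.allclose(B @ B.T, np.eye(p))
assert np.allclose(Y @ B, M)
```

This is exactly the non-identifiability mechanism used in the proof: any candidate with the same left Gram matrix as the masked data admits an orthogonal mask mapping it onto the masked data.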
Recall that the probability densities for $X_1$, $X_2$, $A$ and $B$ at the values $X_1 = x_1$, $X_2 = x_2$, $A = a$ and $B = b$ are denoted respectively as $\pi_{X_1}(x_1)$, $\pi_{X_2}(x_2)$, $\pi_A(a)$ and $\pi_B(b)$. Due to the independence of the generation mechanisms of these quantities, their joint density is
$$\pi_{X_1, X_2, A, B}(x_1, x_2, a, b) = \pi_{X_1}(x_1)\,\pi_{X_2}(x_2)\,\pi_A(a)\,\pi_B(b), \quad \text{(D.2)}$$
for $(x_1, x_2, a, b) \in \mathcal{X}_1 \times \mathcal{X}_2 \times \mathcal{A} \times \mathcal{B}$.
Since the elements of $X_2$ are i.i.d. from the Gaussian distribution $N(0, \sigma^2)$, we have $\pi_{X_2}(x_2) = f(\|x_2\|_F^2)$, where
$$f(x) = \frac{1}{(\sqrt{2\pi}\,\sigma)^{np_2}}\, e^{-\frac{x}{2\sigma^2}} \quad \text{and} \quad \|x_2\|_F^2 = \sum_{1 \le i \le n,\, 1 \le j \le p_2} x_{2,ij}^2,$$
with $\|\cdot\|_F$ denoting the Frobenius norm. Thus the joint density becomes
$$\pi_{X_1, X_2, A, B}(x_1, x_2, a, b) = \pi_{X_1}(x_1)\, f(\|x_2\|_F^2)\, \pi_A(a)\, \pi_B(b).$$
Notice that given $A = a$ and $M_L = m_L$, we have $X_1 = a^T m_L$. Also, given $B = b$ and $M_R = m_R$, we have $(X_1, X_2) = m_R b^T$.

Let $A_0$ be an orthogonal matrix with $A_0 x_1^* = x_1$. For any $a \in \mathcal{A}(x_1, m_L)$, we have $(aA_0)x_1^* = ax_1 = m_L$, i.e., $aA_0 \in \mathcal{A}(x_1^*, m_L)$. On the other hand, for any $a \in \mathcal{A}(x_1^*, m_L)$, $(aA_0^{-1})x_1 = ax_1^* = m_L$, i.e., $aA_0^{-1} \in \mathcal{A}(x_1, m_L)$. Taken together, we have a one-to-one mapping between the two sets $\mathcal{A}(x_1, m_L)$ and $\mathcal{A}(x_1^*, m_L)$. In particular, $\mathcal{A}(x_1^*, m_L) = \mathcal{A}(x_1, m_L)A_0$.
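The one-to-one correspondence between $\mathcal{A}(x_1, m_L)$ and $\mathcal{A}(x_1^*, m_L)$ can be verified numerically. The following numpy sketch (illustrative, with randomly generated orthogonal matrices) constructs $x_1^*$ with the same Gram matrix as $x_1$ and checks that composing with $A_0$ maps one mask set into the other:

```python
import numpy as np

rng = np.random.default_rng(2)
n, p1 = 5, 2

# Raw data x1 and an alternative candidate x1* with the same Gram matrix:
# x1* = R^T x1 for an orthogonal R, so A0 = R satisfies A0 x1* = x1.
x1 = rng.normal(size=(n, p1))
R, _ = np.linalg.qr(rng.normal(size=(n, n)))
x1_star = R.T @ x1
A0 = R
assert np.allclose(A0 @ x1_star, x1)

# Take any orthogonal mask a with a @ x1 = m_L ...
a, _ = np.linalg.qr(rng.normal(size=(n, n)))
m_L = a @ x1

# ... then a @ A0 maps x1* to the same masked value m_L,
# i.e., a @ A0 lies in the set A(x1*, m_L).
assert np.allclose((a @ A0) @ x1_star, m_L)

# Conversely, a* @ A0^{-1} maps x1 to m_L whenever a* maps x1* to m_L.
a_star = a @ A0                      # an element of A(x1*, m_L)
assert np.allclose((a_star @ A0.T) @ x1, m_L)
```

Both candidates are therefore masked to the same observation $m_L$ by equally large sets of masks, which is the key step in showing the posterior matches the restricted prior.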
Hence, for the uniform density $\pi_A = \pi_0$, (D.6) and Lemma 3.4 imply (D.7). Plugging (D.5) and (D.7) into (D.1) and canceling the common factors, we obtain the desired expression for the posterior density. Next, for any pair $x_1$ and $x_1^*$ that both belong to $\mathcal{X}_1(m_L, m_R)$, there exist $(x_2, b)$ and $(x_2^*, b^*)$ such that $(x_1, x_2)b = m_R = (x_1^*, x_2^*)b^*$.

[Figure: Each party's knowledge about the data and the masking matrices in the modified TM 2 method. Each party knows some masked version of the data: XB for the masking service provider, A_2 X for the data collector, and A_1 A_2 X_1 for everybody including the public. Nobody knows the original data X_1, with each data provider (participant) knowing only his/her own row x_1.]