Journal of Privacy and Confidentiality 2018-04-06T11:05:04-07:00 Lars Vilhuber Open Journal Systems <p>The <em>Journal of Privacy and Confidentiality</em>&nbsp;is an open-access multi-disciplinary journal whose purpose is to facilitate the coalescence of research methodologies and activities in the areas of privacy, confidentiality, and disclosure limitation. The JPC seeks to publish a wide range of research and review papers, not only from academia, but also from government (especially official statistical agencies) and industry, and to serve as a forum for exchange of views, discussion, and news.</p> How Will Statistical Agencies Operate When All Data Are Private? 2018-02-23T13:27:53-08:00 John M Abowd <p>The dual problems of respecting citizen privacy and protecting the confidentiality of their data have become hopelessly conflated in the “Big Data” era. There are orders of magnitude more data outside an agency’s firewall than inside it—compromising the integrity of traditional statistical disclosure limitation methods. And increasingly the information processed by the agency was “asked” in a context wholly outside the agency’s operations—blurring the distinction between what was asked and what is published. Already, private businesses like Microsoft, Google and Apple recognize that cybersecurity (safeguarding the integrity and access controls for internal data) and privacy protection (ensuring that what is published does not reveal too much about any person or business) are two sides of the same coin. This is a paradigm-shifting moment for statistical agencies.</p> 2017-05-30T00:00:00-07:00 Calibrating Noise to Sensitivity in Private Data Analysis 2018-02-23T13:27:52-08:00 Cynthia Dwork Frank McSherry Kobbi Nissim Adam Smith <p>We continue a line of research initiated in Dinur and Nissim (2003); Dwork and Nissim (2004); and Blum et al. 
(2005) on privacy-preserving statistical databases.</p> <p>Consider a trusted server that holds a database of sensitive information. Given a query function $f$ mapping databases to reals, the so-called <em>true answer</em> is the result of applying $f$ to the database. To protect privacy, the true answer is perturbed by the addition of random noise generated according to a carefully chosen distribution, and this response, the true answer plus noise, is returned to the user.</p> <p>Previous work focused on the case of noisy sums, in which $f = \sum_i g(x_i)$, where $x_i$ denotes the $i$th row of the database and $g$ maps database rows to $[0,1]$. We extend the study to general functions $f$, proving that privacy can be preserved by calibrating the standard deviation of the noise according to the <em>sensitivity</em> of the function $f$. Roughly speaking, this is the amount that any single argument to $f$ can change its output. The new analysis shows that for several particular applications substantially less noise is needed than was previously understood to be the case.</p> <p>The first step is a very clean definition of privacy---now known as differential privacy---and measure of its loss. 
We also provide a set of tools for designing and combining differentially private algorithms, permitting the construction of complex differentially private analytical tools from simple differentially private primitives.</p> <p>Finally, we obtain separation results showing the increased value of interactive statistical release mechanisms over non-interactive ones.</p> 2017-05-30T00:00:00-07:00 On the Meaning and Limits of Empirical Differential Privacy 2018-02-23T13:27:52-08:00 Anne-Sophie Charest Yiwei Hou <p>Empirical differential privacy (EDP) has been proposed as an alternative to differential privacy (DP), with the important advantages that the procedure can be applied to any Bayesian model and requires less technical work on the part of the user. While EDP has been shown to be easy to implement, little is known of its theoretical underpinnings. This paper proposes a careful investigation of the meaning and limits of EDP as a measure of privacy. We show that EDP cannot simply be considered an empirical version of DP, and that it could instead be thought of as a sensitivity measure on posterior distributions. We also show that EDP is not well-defined, in that its value depends crucially on the choice of discretization used in the procedure, and that it can be very computationally intensive to apply in practice. We illustrate these limitations with two simple conjugate Bayesian models: the beta-binomial model and the normal-normal model.</p> 2017-05-30T00:00:00-07:00 Practical Data Synthesis for Large Samples 2018-04-06T11:05:04-07:00 Gillian M Raab Beata Nowok Chris Dibben <p>We describe results on the creation and use of synthetic data that were derived in the context of a project to make synthetic extracts available for users of the UK Longitudinal Studies. A critical review of existing methods of inference from large synthetic data sets is presented. 
We introduce new variance estimates for use with large samples of completely synthesised data that do not require them to be generated from the posterior predictive distribution derived from the observed data and can be used with a single synthetic data set. We make recommendations on how to synthesise data based on these results. The practical consequences of these results are illustrated with an example from the Scottish Longitudinal Study.</p> 2018-02-02T00:00:00-08:00 A New Data Collection Technique for Preserving Privacy 2018-02-23T13:29:50-08:00 Samuel S Wu Shigang Chen Deborah L Burr Long Zhang <p>A major obstacle that hinders medical and social research is the lack of reliable data due to people's reluctance to reveal private information to strangers. Fortunately, statistical inference always targets a well-defined population rather than a particular individual subject and, in many current applications, data can be collected using web-based systems or mobile devices. These two characteristics enable us to develop a data collection method, called triple matrix-masking (TM$^2$), which offers strong privacy protection with an immediate matrix transformation so that even the researchers cannot see the data, and then further uses matrix transformations to guarantee that the data will still be analyzable by standard statistical methods. The entities involved in the proposed process are a masking service provider who receives the initially masked data and then applies another mask, and the data collectors who partially decrypt the now doubly masked data and then apply a third mask before releasing the data to the public. A critical feature of the method is that the keys to generate the matrices are held separately. 
This ensures that nobody sees the actual data, but because of the specially designed transformations, statistical inference on parameters of interest can be conducted with the same results as if the original data were used. Hence the TM$^2$ method hides sensitive data with no efficiency loss for statistical inference of binary and normal data, which improves over Warner's randomized response technique. In addition, we add several features to the proposed procedure: an error checking mechanism is built into the data collection process in order to make sure that the masked data used for analysis are an appropriate transformation of the original data; and a partial masking technique is introduced to grant data users access to non-sensitive personal information while sensitive information remains hidden.</p> 2018-02-02T00:00:00-08:00
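<p>The claim in the TM$^2$ abstract that inference on masked data can match inference on the original data can be illustrated with a much simpler, single-mask sketch. This is not the TM$^2$ protocol. It assumes the mask is one orthogonal matrix (a Householder reflection, chosen here only for convenience) applied to the rows of the data, and the names <code>householder_apply</code>, <code>ols_two_columns</code> and the key vector <code>v</code> are hypothetical. Because an orthogonal transformation preserves inner products between columns, the least-squares normal equations, and hence the fitted coefficients, are unchanged up to floating-point error.</p>

```python
import random


def householder_apply(v, x):
    """Return H x, where H = I - 2 v v^T / (v^T v) is an orthogonal reflection."""
    scale = 2.0 * sum(vi * xi for vi, xi in zip(v, x)) / sum(vi * vi for vi in v)
    return [xi - scale * vi for vi, xi in zip(v, x)]


def dot(a, b):
    """Inner product of two equal-length lists."""
    return sum(ai * bi for ai, bi in zip(a, b))


def ols_two_columns(c0, c1, y):
    """Least-squares coefficients (b0, b1) for y ~ b0*c0 + b1*c1 via the normal equations."""
    a00, a01, a11 = dot(c0, c0), dot(c0, c1), dot(c1, c1)
    r0, r1 = dot(c0, y), dot(c1, y)
    det = a00 * a11 - a01 * a01
    return ((a11 * r0 - a01 * r1) / det, (a00 * r1 - a01 * r0) / det)


rng = random.Random(0)
n = 50

# Simulated "sensitive" data: an intercept column, one covariate, one response.
ones = [1.0] * n
x = [rng.gauss(0.0, 1.0) for _ in range(n)]
y = [2.0 + 3.0 * xi + rng.gauss(0.0, 0.5) for xi in x]

# Hypothetical secret key: the vector defining the orthogonal mask.
v = [rng.gauss(0.0, 1.0) for _ in range(n)]

# Every column, including the intercept column, is masked with the same matrix.
masked_ones = householder_apply(v, ones)
masked_x = householder_apply(v, x)
masked_y = householder_apply(v, y)

beta_original = ols_two_columns(ones, x, y)
beta_masked = ols_two_columns(masked_ones, masked_x, masked_y)
max_diff = max(abs(a - b) for a, b in zip(beta_original, beta_masked))

print("original data:", beta_original)
print("masked data:  ", beta_masked)
print("largest coefficient difference:", max_diff)
```

<p>In this sketch whoever holds <code>v</code> can undo the mask, which is why the abstract stresses that the keys generating the matrices are held separately. The sketch makes no attempt to model that separation, the error-checking mechanism, or the partial masking.</p>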