A Privacy Preserving Algorithm to Release Sparse High-dimensional Histograms

Bai Li
Vishesh Karwa
Aleksandra Slavković
Rebecca Carter Steorts
https://orcid.org/0000-0003-0114-8181

Abstract

Differential privacy has emerged as a popular model to provably limit privacy risks associated with a given data release. However, releasing high-dimensional synthetic data under differential privacy remains a challenging problem. In this paper, we study the problem of releasing synthetic data in the form of a high-dimensional histogram under the constraint of differential privacy.
We develop an $(\epsilon, \delta)$-differentially private categorical data synthesizer called \emph{Stability Based Hashed Gibbs Sampler} (SBHG). SBHG works by combining a stability-based sparse histogram estimation algorithm with Gibbs sampling and feature selection to approximate the empirical joint distribution of a discrete dataset. SBHG offers a competitive alternative to state-of-the-art synthetic data generators while preserving the sparsity structure of the original dataset, which leads to improved statistical utility as illustrated on simulated data. Finally, to study the utility of the synthetic datasets generated by SBHG, we also perform logistic regression using the synthetic datasets and compare the resulting classification accuracy with that obtained using the original dataset.
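To make the stability-based histogram idea concrete, the following is a minimal sketch of the generic textbook mechanism, not the SBHG algorithm itself (the hashing, Gibbs sampling, and feature selection steps are omitted). It assumes unnormalized counts, Laplace noise of scale $2/\epsilon$, and a release threshold of $1 + 2\ln(2/\delta)/\epsilon$. The function name `stability_histogram` and its parameters are hypothetical and chosen only for illustration.

```python
import math
import random
from collections import Counter


def stability_histogram(records, epsilon, delta, rng=None):
    """Illustrative stability-based sparse histogram (not the paper's exact method).

    Only cells that actually occur in `records` are considered, so cells that
    are empty in the data are never released and sparsity is preserved.
    """
    rng = rng or random.Random()
    scale = 2.0 / epsilon  # assumed Laplace scale for unnormalized counts
    threshold = 1.0 + 2.0 * math.log(2.0 / delta) / epsilon  # assumed cutoff

    released = {}
    for cell, count in Counter(records).items():
        # The difference of two i.i.d. exponentials with mean `scale`
        # is a Laplace(0, scale) random variable.
        noise = rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)
        noisy_count = count + noise
        # Suppress cells whose noisy count falls at or below the threshold.
        if noisy_count > threshold:
            released[cell] = noisy_count
    return released


if __name__ == "__main__":
    # One heavily populated cell plus many singleton cells.
    data = [("a", 0)] * 1000 + [("b", i) for i in range(50)]
    print(stability_histogram(data, epsilon=1.0, delta=1e-6, rng=random.Random(0)))
```

Because the loop only visits observed cells, the running time depends on the number of non-empty cells and not on the size of the full high-dimensional domain, which is what makes this style of mechanism attractive for sparse histograms.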

Article Details

How to Cite
Li, Bai, Vishesh Karwa, Aleksandra Slavković, and Rebecca Steorts. 2018. “A Privacy Preserving Algorithm to Release Sparse High-Dimensional Histograms”. Journal of Privacy and Confidentiality 8 (1). https://doi.org/10.29012/jpc.657.
Author Biographies

Bai Li, Duke University

Bai Li is currently a first-year PhD student in the Department of Statistical Science at Duke University, where he received his M.S. degree under the supervision of Rebecca C. Steorts.

Vishesh Karwa, Temple University

Assistant Professor of Statistics

My research addresses the challenges in performing statistical inference using complex and/or massive data such as networks, high-dimensional contingency tables, and data that are missing or incomplete. My work is at the intersection of statistics, machine learning, and theoretical computer science and is motivated by many real-world problems with applications to social, political and behavioral sciences. Some of the problems that I currently work on include: (1) Statistical foundations of data privacy and confidentiality, (2) Causal inference under network interference, (3) Finite-sample inference for network models and high-dimensional contingency tables, and (4) Selective inference and adaptive data analyses. 

Vishesh Karwa joined the Department of Statistics in 2017. He is also a member of TDAI. Prior to joining Ohio State, Vishesh spent two years at Harvard in the Department of Statistics and the Department of Computer Science as a postdoctoral fellow and one year at CMU as a research scientist.

Aleksandra Slavković, Pennsylvania State University

Slavkovic is a professor of statistics who joined Penn State in 2004. She has served in various positions in the statistics department, including associate head for diversity and equity and associate head for graduate studies. Slavkovic has affiliated appointments in the Institute for CyberScience, the Department of Public Health Sciences, and the Penn State College of Medicine, and she serves on Penn State’s Clinical and Translational Sciences Director’s Council. She has also held visiting scholar positions at Cornell University, the University of Minnesota, and Utrecht University.

Slavkovic received master's degrees in human-computer interaction and in statistics and a doctoral degree in statistics from Carnegie Mellon University. Her current research interests include statistical data privacy with applications across different domains, algebraic statistics, causal inference, and more broadly the application of statistics to information sciences and social sciences.

Rebecca Carter Steorts, Duke University

Rebecca C. Steorts received her B.S. in Mathematics in 2005 from Davidson College, her MS in Mathematical Sciences in 2007 from Clemson University, and her PhD in 2012 from the Department of Statistics at the University of Florida under the supervision of Malay Ghosh, where she was a U.S. Census Dissertation Fellow and a recipient of an Honorable Mention (second place) for the 2012 Leonard J. Savage Thesis Award in Applied Methodology. Rebecca was a Visiting Assistant Professor from 2012 to 2015, during which time she worked closely with Stephen E. Fienberg.

Rebecca is currently an Assistant Professor in the Department of Statistical Science at Duke University. She is affiliated faculty in the Departments of Computer Science and Biostatistics and Bioinformatics, the Information Initiative at Duke (iiD), and the Social Science Research Institute.

Rebecca was named to MIT Technology Review's 35 Innovators Under 35 for 2015 as a humanitarian in the field of software. Her work was profiled in the September/October issue of MIT Technology Review and she was recognized with an invited talk at EmTech in November 2015. In addition, Rebecca is a recipient of an NSF CAREER award, a collaborative NSF award, a collaborative grant with the Laboratory for Analytic Sciences (LAS) at NC State University, a Metaknowledge Network Templeton Foundation Grant, the University of Florida (UF) Graduate Alumni Fellowship Award, the U.S. Census Bureau Dissertation Fellowship Award, and support from the UF Innovation through Institutional Integration Program (I-Cubed) and NSF for the development of an introductory Bayesian course for undergraduates. Her research interests are in large-scale clustering, record linkage (entity resolution or de-duplication), privacy, network analysis, and machine learning for computational social science applications.