INSPECTRE

Forked from HuanyuZhang/INSPECTRE

Published article:

Published code:

We provide a Python implementation of our differentially private estimators of entropy and support coverage. This is a tutorial for anyone who wants to use or build upon these estimators.

Prerequisites

This project is implemented in Python3, using Numpy and matplotlib. Before running the code, make sure Python3, Numpy and Matplotlib are installed. Note that our code for entropy estimation is incompatible with Python2.

Entropy estimators

In this project, we implement our estimators for private estimation of entropy, including a private version of the plug-in estimator, and a privatized estimator based on the method of best-polynomial approximation. We compare performance between these estimators and a number of non-private estimators, including the plug-in estimator, the Miller-Madow Estimator, and the sample optimal polynomial approximation estimator from the paper Minimax Rates of Entropy Estimation on Large Alphabets via Best Polynomial Approximation.

Some of our code (including most of entropy.py and coeffs.txt) is based off of the non-private estimators from the project entropy by Yihong Wu and Pengkun Yang.

Comprehensive script

We provide main_entropy.py as an example script for our private estimator. In this script, we compare performance for these entropy estimators on different distributions including uniform, a distribution with two steps, Zipf(1/2), a distribution with Dirichlet-1 prior, and a distribution with Dirichlet-1/2 prior. We use RMSE (root-mean-square error) to indicate the performance of the estimator.

Program arguments

k int: Set alphabet size.
eps float: Set privacy parameter eps.
l_degree int: Set polynomial degree for private poly. Default L=1.2 log k.
M_degree float: Set the right endpoint of approximation interval for private poly. Default M=2.0 log k.
N_degree int: Set the threshold to apply polynomial estimator for private poly. Default M=1.6 log k.

For the parameters of poly estimator, we just use the default values in their code, which are L=1.6 log k, M=3.5 log k, N=1.6 log k. Please see entropy for more information.

Support coverage estimator

In this project, we implement our estimator for private estimation of support coverage. This is a privatized version of the Smoothed Good-Toulmin (SGT) estimator of Alon Orlitsky, Ananda Theertha Suresh, and Yihong Wu, from their paper Optimal prediction of the number of unseen species. We compare the performance of the private and non-private statistics on both synthetic data and real-world data, including US Census name data and a text corpus from Shakespeare's Hamlet.

Some of our code is based off the SGT implementation of Orlitsky, Suresh, and Wu, graciously provided to us by Ananda Theertha Suresh. Specific files used are indicated in comments, and hamlet_total.csv is a reformatted version of a provided file. lastnames_total.csv is a subsampling of a file of Frequently Occurring Surnames from the Census 2000.

Synthetic data

We provide main_synthetic.py as an example for our private estimator. In this script, we compare performance for the estimators on different distributions including uniform, a distribution with two steps, Zipf(1/2), a distribution with Dirichlet-1 prior, and a distribution with Dirichlet-1/2 prior. We use RMSE (root-mean-square error) to indicate the performance of the estimator.

Program arguments

k int: Set alphabet size for the distribution.
eps_index list: Set privacy parameter for the private estimators.
n int: The number of the seen samples

Real data

We provide main_real.py as an example for our private estimator. In this script, we compare performance for the estimators on real data. We use RMSE (root-mean-square error) to indicate the performance of the estimator.

Program arguments

file_name string: The name of the histogram file. The histogram file must be in .csv, which has only one column and each row is the number of samples for each species. We provide hamlet_total.csv and lastnames_total.csv as some examples.
eps_index list: Set privacy parameter for the private estimators.

Reference

For detailed explanations and analysis of our estimators, please refer to

published article:

Jayadev Acharya, Gautam Kamath, Ziteng Sun, and Huanyu Zhang. 2020. "INSPECTRE: Privately Estimating the Unseen". Journal of Privacy and Confidentiality 10 (2). https://doi.org/10.29012/jpc.724

@article{jpc:724,
  author={Jayadev Acharya and Gautam Kamath and Ziteng Sun and Huanyu Zhang},
  title={{INSPECTRE}: Privately Estimating the Unseen},
  doi={10.29012/jpc.724},
  year={2020},
  volume={10},
  issue={2},
  journal={Journal of Privacy and Confidentiality},
}

arXiv paper INSPECTRE: Privately Estimating the Unseen, authored by Jayadev Acharya, Gautam Kamath, Ziteng Sun, and Huanyu Zhang.

Name		Name	Last commit message	Last commit date
Latest commit History 14 Commits
entropy		entropy
support coverage		support coverage
.gitignore		.gitignore
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

entropy

entropy

support coverage

support coverage

.gitignore

.gitignore

README.md

README.md

Repository files navigation

INSPECTRE

Table of contents

Prerequisites

Entropy estimators

Comprehensive script

Program arguments

Support coverage estimator

Synthetic data

Program arguments

Real data

Program arguments

Reference

About

Releases 1

Packages

Languages

journalprivacyconfidentiality/INSPECTRE

Folders and files

Latest commit

History

Repository files navigation

INSPECTRE

Table of contents

Prerequisites

Entropy estimators

Comprehensive script

Program arguments

Support coverage estimator

Synthetic data

Program arguments

Real data

Program arguments

Reference

About

Resources

Stars

Watchers

Forks

Languages