Synthetic Business Microdata: An Australian Example

Chien-Hung Chien
Alan Hepburn Welsh
John D Moore

Abstract

Enhancing microdata access is one of the strategic priorities of the Australian Bureau of Statistics (ABS) in its transformation program. However, balancing enhanced data access against the protection of confidentiality is a delicate act. The ABS could use synthetic data to make its business microdata more accessible to researchers, informing decision making while maintaining confidentiality. This study explores the synthetic data approach for the release and analysis of business data. Some Australian industries are characterised by oligopoly or duopoly, so existing microdata protection techniques such as information reduction or perturbation may not be as effective for business microdata as they are for household microdata. The research addresses the following questions: Can a synthetic data approach enhance access to longitudinal business microdata? What is the trade-off between utility and protection under the synthetic data approach? The study compares confidentialised input and confidentialised output approaches for protecting confidentiality and analysing Australian microdata from business survey or administrative data sources.

Article Details

How to Cite
Chien, Chien-Hung, Alan Hepburn Welsh, and John D Moore. 2020. “Synthetic Business Microdata: An Australian Example”. Journal of Privacy and Confidentiality 10 (2). https://doi.org/10.29012/jpc.733.
