Evaluating Synthetic Data — The Million Dollar Question
Part 1 — Some Simple Experiments
Part 2 — Real Datasets, Real Generators
Conclusion
References

Part 2 — Real Datasets, Real Generators

The dataset used in Part 1 is simple and can be easily modeled with just a mixture of Gaussians. Most real-world datasets, however, are far more complex. In this part of the story, we apply several synthetic data generators to some popular real-world datasets. Our main focus is on comparing the distributions of maximum similarities within and between the observed and synthetic datasets, to understand the extent to which they can be considered random samples from the same parent distribution.
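For concreteness, here is a minimal sketch of how such maximum intra- and cross-set similarities can be computed. It uses standardized numeric features and a simple 1/(1 + distance) transform as stand-ins for the mixed-type similarity measure actually used in the article; the function name and the scikit-learn-based implementation are assumptions, not the author's code.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler

def max_similarities(observed, synthetic):
    """Return, for each instance, the maximum intra-set (observed-to-observed)
    and cross-set (synthetic-to-observed) similarities.

    Both inputs are numeric arrays; similarity is taken as 1 / (1 + distance),
    a simple stand-in for whatever mixed-type similarity measure is used.
    """
    scaler = StandardScaler().fit(observed)
    obs = scaler.transform(observed)
    syn = scaler.transform(synthetic)

    # Intra-set: distance from each observed instance to its nearest
    # *other* observed instance (ask for 2 neighbours, drop the self-match).
    nn_obs = NearestNeighbors(n_neighbors=2).fit(obs)
    d_intra = nn_obs.kneighbors(obs)[0][:, 1]

    # Cross-set: distance from each synthetic instance to its nearest
    # observed instance.
    d_cross = nn_obs.kneighbors(syn, n_neighbors=1)[0][:, 0]

    to_sim = lambda d: 1.0 / (1.0 + d)
    return to_sim(d_intra), to_sim(d_cross)

# The two returned arrays can then be histogrammed and their means compared.
```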

The six datasets originate from the UCI repository² and are all popular datasets that have been widely used in the machine learning literature for decades. All are mixed-type datasets, and they were chosen because they vary in their balance of categorical and numerical features.

The six generators are representative of the major approaches used in synthetic data generation: copula-based, GAN-based, VAE-based, and approaches based on sequential imputation. CopulaGAN³, GaussianCopula, CTGAN³ and TVAE³ are all available from the Synthetic Data Vault libraries⁴, synthpop⁵ is available as an open-source R package, and ‘UNCRi’ refers to the synthetic data generation tool developed under the proprietary Unified Numeric/Categorical Representation and Inference (UNCRi) framework⁶. All generators were used with their default settings.
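As a rough illustration of how the SDV-based generators can be run with default settings, the sketch below assumes the SDV 1.x single-table API and a pandas DataFrame of observed data; the variable name `real_df` and the file name are hypothetical, and the same pattern applies to GaussianCopulaSynthesizer, TVAESynthesizer and CopulaGANSynthesizer.

```python
import pandas as pd
from sdv.metadata import SingleTableMetadata
from sdv.single_table import CTGANSynthesizer  # or GaussianCopula/TVAE/CopulaGAN synthesizers

# Hypothetical observed dataset loaded as a DataFrame.
real_df = pd.read_csv("observed_data.csv")

# Let SDV infer column types from the data.
metadata = SingleTableMetadata()
metadata.detect_from_dataframe(real_df)

# Fit with default hyperparameters and sample as many rows as the observed data.
synthesizer = CTGANSynthesizer(metadata)
synthesizer.fit(real_df)
synthetic_df = synthesizer.sample(num_rows=len(real_df))
```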

The table below shows the average maximum intra- and cross-set similarities for each generator applied to each dataset. Entries highlighted in red are those in which privacy has been compromised (i.e., the average maximum cross-set similarity exceeds the average maximum intra-set similarity on the observed data). Entries highlighted in green are those with the highest average maximum cross-set similarity (excluding those in red). The last column shows the result of a Train on Synthetic, Test on Real (TSTR) test, in which a classifier or regressor is trained on the synthetic examples and tested on the real (observed) examples. The Boston Housing dataset is a regression task, and the mean absolute error (MAE) is reported; all other tasks are classification tasks, and the reported value is the area under the ROC curve (AUC).

Average maximum similarities and TSTR results for six generators on six datasets. The values for TSTR are MAE for Boston Housing and AUC for all other datasets. [Image by Author]
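A TSTR evaluation can be sketched as follows. The choice of RandomForestClassifier, the column name `target`, and the assumption that features are already numerically encoded are placeholders for illustration, not details taken from the article.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

def tstr_auc(synthetic_df, real_df, target="target"):
    """Train on Synthetic, Test on Real: fit a classifier on the synthetic
    data and report AUC on the observed data. Assumes features are already
    numerically encoded and that `target` is a binary column."""
    X_syn, y_syn = synthetic_df.drop(columns=[target]), synthetic_df[target]
    X_real, y_real = real_df.drop(columns=[target]), real_df[target]

    clf = RandomForestClassifier(random_state=0).fit(X_syn, y_syn)
    return roc_auc_score(y_real, clf.predict_proba(X_real)[:, 1])
```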

The figures below display, for each dataset, the distributions of maximum intra- and cross-set similarities corresponding to the generator that attained the highest average maximum cross-set similarity (excluding those highlighted in red above).

Distribution of maximum similarities for synthpop on Boston Housing dataset. [Image by Author]
Distribution of maximum similarities for synthpop on Census Income dataset. [Image by Author]
Distribution of maximum similarities for UNCRi on Cleveland Heart Disease dataset. [Image by Author]
Distribution of maximum similarities for UNCRi on Credit Approval dataset. [Image by Author]
Distribution of maximum similarities for UNCRi on Iris dataset. [Image by Author]
Distribution of maximum similarities for TVAE on Wisconsin Breast Cancer dataset. [Image by Author]

From the table, we can see that for the generators that did not breach privacy, the average maximum cross-set similarity is very close to the average maximum intra-set similarity on the observed data. The histograms show the distributions of these maximum similarities, and we can see that in most cases the distributions are clearly similar — strikingly so for datasets such as Census Income. The table also shows that the generator achieving the highest average maximum cross-set similarity on each dataset (excluding those highlighted in red) also achieved the best performance on the TSTR test (again excluding those in red). Thus, while we can never claim to have discovered the ‘true’ underlying distribution, these results show that the most effective generator for each dataset has captured the important features of the underlying distribution.

Privacy

Only two of the six generators displayed issues with privacy: synthpop and TVAE. Each of these breached privacy on three of the six datasets. In two instances, namely TVAE on Cleveland Heart Disease and TVAE on Credit Approval, the breach was particularly severe. The histograms for TVAE on Credit Approval are shown below; they show that the synthetic examples are far too similar to each other, and also to their closest neighbors in the observed data. The model is a very poor representation of the underlying parent distribution. The reason may be that the Credit Approval dataset contains several numerical features that are extremely highly skewed.

Distribution of maximum similarities for TVAE on Credit Approval dataset. [Image by Author]

Other observations and comments

The two GAN-based generators — CopulaGAN and CTGAN — were consistently among the worst-performing generators. This was somewhat surprising given the immense popularity of GANs.

The performance of GaussianCopula was mediocre on all datasets except Wisconsin Breast Cancer, for which it attained the equal-highest average maximum cross-set similarity. Its unimpressive performance on the Iris dataset was particularly surprising, given that this is a very simple dataset that can easily be modeled using a mixture of Gaussians, and one we expected would be well matched to copula-based methods.

The generators that perform most consistently well across all datasets are synthpop and UNCRi, both of which operate by sequential imputation. This means they only ever need to estimate and sample from a univariate conditional distribution (e.g., P(x₇ | x₁, x₂, …)), which is generally much easier than modeling and sampling from a multivariate distribution (e.g., P(x₁, x₂, x₃, …)), which is (implicitly) what GANs and VAEs do. Whereas synthpop estimates distributions using decision trees (which are the source of the overfitting that synthpop is prone to), the UNCRi generator estimates distributions using a nearest-neighbor-based approach, with hyperparameters optimized using a cross-validation procedure that prevents overfitting.
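The sketch below illustrates the sequential-imputation idea on numeric features only, using CART-style “donor” sampling from leaf nodes. It is a simplification for illustration, not the actual synthpop or UNCRi implementation; the column ordering, model choice, and hyperparameters are arbitrary.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

def sequential_impute(observed: pd.DataFrame, n_rows: int, seed: int = 0) -> pd.DataFrame:
    """Generate synthetic rows one column at a time, each column sampled
    conditionally on the columns generated so far (numeric features only)."""
    rng = np.random.default_rng(seed)
    cols = list(observed.columns)

    # First column: sample from its marginal distribution (bootstrap).
    synth = pd.DataFrame({cols[0]: rng.choice(observed[cols[0]].to_numpy(), size=n_rows)})

    for j in range(1, len(cols)):
        prev, col = cols[:j], cols[j]
        tree = DecisionTreeRegressor(min_samples_leaf=5, random_state=seed)
        tree.fit(observed[prev], observed[col])

        # For each synthetic row, find its leaf in the fitted tree and sample
        # an observed value of `col` from that leaf (the "donor" idea).
        obs_leaves = tree.apply(observed[prev])
        syn_leaves = tree.apply(synth[prev])
        values = observed[col].to_numpy()
        synth[col] = [rng.choice(values[obs_leaves == leaf]) for leaf in syn_leaves]

    return synth
```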

Conclusion

Synthetic data generation is a new and evolving field, and while there are still no standard evaluation techniques, there is consensus that tests should cover fidelity, utility and privacy. But while each of these is important, they are not on an equal footing. For example, a synthetic dataset may achieve good performance on fidelity and utility but fail on privacy. This does not earn it a ‘two out of three’: if the synthetic examples are too close to the observed examples (thus failing the privacy test), the model has been overfitted, rendering the fidelity and utility tests meaningless. There has been a tendency among some vendors of synthetic data generation software to propose single-score measures of performance that combine results from a multitude of tests. This is essentially based on the same ‘two out of three’ logic.

If a synthetic dataset can be considered a random sample from the same parent distribution as the observed data, then we cannot do any better: we have achieved maximum fidelity, utility and privacy. The Maximum Similarity Test provides a measure of the extent to which two datasets can be considered random samples from the same parent distribution. It is based on the simple and intuitive notion that if an observed and a synthetic dataset are random samples from the same parent distribution, instances should be distributed such that a synthetic instance is, on average, as similar to its closest observed instance as an observed instance is to its closest observed instance.

We propose the following single-score measure of synthetic dataset quality: the ratio of the average maximum cross-set similarity to the average maximum intra-set similarity on the observed data.
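Expressed in code, and reusing the hypothetical `max_similarities` helper sketched earlier, this measure might look like the following:

```python
def quality_score(observed, synthetic):
    """Average maximum cross-set similarity divided by average maximum
    intra-set similarity on the observed data (values near, but not
    above, 1 are best)."""
    intra_sim, cross_sim = max_similarities(observed, synthetic)  # helper sketched above
    return cross_sim.mean() / intra_sim.mean()
```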

The closer this ratio is to 1 — without exceeding 1 — the better the quality of the synthetic data. It should, of course, be accompanied by a sanity check of the histograms.
