What Are the Benefits of Using Synthetic Data?
From a privacy perspective, the most compelling reason to use synthetic data is that it can mirror the patterns, high-level aggregates, and correlations of the original dataset without containing any sensitive information. This opens up numerous possibilities.
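As a minimal sketch of what “mirroring aggregates and correlations” can mean in practice, the toy Python example below (all names and numbers are invented for illustration) fits a multivariate normal distribution to some stand-in “real” records and samples fresh synthetic ones that match the original means and correlations without reproducing any actual row:

```python
import numpy as np

rng = np.random.default_rng(42)

# Stand-in for a sensitive dataset: 1,000 records, 3 numeric attributes
# (e.g. income, age, score). Entirely invented for this example.
real = rng.multivariate_normal(
    mean=[50_000, 35, 0.6],
    cov=[[1e8, 2e4, 50], [2e4, 100, 0.5], [50, 0.5, 0.04]],
    size=1_000,
)

# Fit a simple generative model: a multivariate normal with the
# sample mean and covariance of the real data.
mu, sigma = real.mean(axis=0), np.cov(real, rowvar=False)

# Draw synthetic records -- none of them is a real individual.
synthetic = rng.multivariate_normal(mu, sigma, size=1_000)

print("real means:     ", real.mean(axis=0).round(2))
print("synthetic means:", synthetic.mean(axis=0).round(2))
print("max correlation gap:",
      np.abs(np.corrcoef(real, rowvar=False)
             - np.corrcoef(synthetic, rowvar=False)).max().round(3))
```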
Synthetic datasets often serve as stand-ins for real test data, and are used to validate mathematical models and to train machine learning (ML) models.
In addition, synthetic data can reflect “what if” scenarios, situations that don’t exist in the real world. This enables data scientists to simulate outcomes, test hypotheses and mathematical models, and make strategic decisions. For instance, in the financial industry, synthetic data can model potential crises or trends, enabling institutions to prepare and make robust plans before they are actually needed.
Some of the best use cases for synthetic data include:
- Testing: By providing large datasets, synthetic data allows thorough assessments that mimic real-life scenarios without the need to use sensitive data.
- Model training: Synthetic data can be used to train AI models without incurring privacy issues, and to aid in fine-tuning models during development (see the sketch after this list).
- Product development: Synthetic data gives developers a controlled environment in which to fine-tune their products, helping them avoid the pitfalls of inaccurate or incomplete data.
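A common way to validate the model-training use case is the “train on synthetic, test on real” (TSTR) pattern: fit a model on synthetic records, then score it against real records that were never shown to the synthesiser. The sketch below, using scikit-learn and invented toy data, is illustrative rather than a production recipe:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Invented "real" dataset: the label depends on two numeric features.
X_real = rng.normal(size=(2_000, 2))
y_real = (X_real[:, 0] + 0.5 * X_real[:, 1] > 0).astype(int)

def synthesise(X, y, n_per_class, rng):
    """Toy synthesiser: one Gaussian fitted per class."""
    Xs, ys = [], []
    for label in np.unique(y):
        part = X[y == label]
        Xs.append(rng.multivariate_normal(
            part.mean(axis=0), np.cov(part, rowvar=False),
            size=n_per_class))
        ys.append(np.full(n_per_class, label))
    return np.vstack(Xs), np.concatenate(ys)

# Build the synthetic set from the first half of the real data only.
X_syn, y_syn = synthesise(X_real[:1_000], y_real[:1_000], 1_000, rng)

# Train on synthetic, test on real: the second half was never seen
# by the synthesiser, so the score reflects real-world usefulness.
model = LogisticRegression().fit(X_syn, y_syn)
print("accuracy on real hold-out:",
      round(model.score(X_real[1_000:], y_real[1_000:]), 3))
```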
What Are Synthetic Data’s Limitations?
Despite its potential, synthetic data comes with challenges. It can lose information, such as correlations that the generating model fails to capture.
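To make this concrete, consider a synthesiser that only matches means and linear covariance, like the multivariate normal sketched earlier. Any relationship it does not model is silently lost. In this invented example, a quadratic dependence between two columns is plainly visible in the real data but vanishes in the synthetic copy:

```python
import numpy as np

rng = np.random.default_rng(1)

# Real data with a nonlinear relationship: y is roughly x squared.
x = rng.normal(size=5_000)
y = x**2 + 0.1 * rng.normal(size=5_000)
real = np.column_stack([x, y])

# A synthesiser that only preserves means and linear covariance.
synthetic = rng.multivariate_normal(
    real.mean(axis=0), np.cov(real, rowvar=False), size=5_000)

def quad_corr(data):
    """Correlation between x**2 and y -- the signal we care about."""
    return np.corrcoef(data[:, 0] ** 2, data[:, 1])[0, 1]

print("real:     ", round(quad_corr(real), 2))       # close to 1
print("synthetic:", round(quad_corr(synthetic), 2))  # close to 0
```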
Typically, we need to know in advance which questions the data will be asked to answer, so that the synthesis can be tailored to them. If new questions arise later, the synthetic data might not be able to answer them. Furthermore, if the underlying data changes, the synthetic data must be regenerated.
Moreover, while synthetic data may reflect high-level aggregates accurately, individual-level records might not always make sense. Biases present in the original dataset can also carry over into the synthetic data, which makes ensuring fairness a challenge.
Synthetic data does not inherently provide privacy guarantees, and care must be taken to ensure a proper implementation using methods such as differential privacy. We will discuss this later in the article.
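To give a flavour of what such a guarantee involves, the sketch below applies the Laplace mechanism, the simplest building block of differential privacy, to release a private mean. The epsilon value, bounds, and data are illustrative assumptions, and a full differentially private synthetic-data pipeline involves considerably more than this:

```python
import numpy as np

rng = np.random.default_rng(7)

def dp_mean(values, lower, upper, epsilon, rng):
    """Epsilon-DP mean via the Laplace mechanism.

    Values are clipped to [lower, upper] so that one individual can
    change the mean by at most (upper - lower) / n -- the sensitivity
    that calibrates the noise.
    """
    clipped = np.clip(values, lower, upper)
    sensitivity = (upper - lower) / len(clipped)
    noise = rng.laplace(loc=0.0, scale=sensitivity / epsilon)
    return clipped.mean() + noise

ages = rng.integers(18, 90, size=10_000)   # invented data
print("true mean:", ages.mean())
print("DP mean (epsilon=0.5):", round(dp_mean(ages, 18, 90, 0.5, rng), 2))
```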
Training AI Models With Synthetic Data
One of the most common use cases for synthetic data is training AI models.
AI models, particularly Large Language Models (LLMs) such as OpenAI's ChatGPT and Google's Bard, have traditionally been trained on vast amounts of internet-scraped data. But as AI companies face growing difficulty accessing high-quality data, along with increasing scrutiny over privacy concerns, synthetic data has emerged as an alternative.
Sam Altman has commented: “[I'm] pretty confident that soon all data will be synthetic data”. That prospect raises challenges with consequences for both individuals and organisations.
Training AI systems on synthetic data can induce learning errors and misleading conclusions, because the inputs do not come from real data points. Furthermore, heavy reliance on synthetic data means that no new information is being added, diminishing the value of the AI's results.
When models are trained on their own outputs, which may contain inaccuracies or fabrications, there is a risk of producing “regurgitated knowledge” without any new insights.
Studies by researchers at Oxford and Cambridge have raised concerns about the risks of training AI models on their own unprocessed outputs. These outputs may include inaccuracies or false information which, over time, could degrade the technology, potentially resulting in “irreversible effects”.
Generative adversarial networks (GANs) generate synthetic data from relatively small samples, which means the algorithm might not capture the full scope of the underlying information. Though seemingly realistic at the individual level, the generated data can have major gaps at the group level, potentially leading to “model collapse” if the training mix is not supplemented with enough real-world data.
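The collapse dynamic is easy to reproduce in miniature: repeatedly fit a simple model to its own samples, and with no real data re-injected the estimation errors compound. In this toy loop (a Gaussian standing in for a GAN, with an intentionally small sample), the spread of the data tends to decay generation by generation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Generation 0: the "real" distribution a model would normally learn.
data = rng.normal(loc=0.0, scale=1.0, size=25)   # deliberately small sample

for generation in range(1, 61):
    # Fit a toy model (here just a mean and std) to the previous
    # generation's output, then sample the next generation from it.
    data = rng.normal(data.mean(), data.std(), size=25)
    if generation % 10 == 0:
        print(f"generation {generation:2d}: std = {data.std():.3f}")

# Each refit tends to underestimate the spread slightly (sampling
# error plus estimator bias), and with no real data re-injected the
# loss compounds: in expectation the variance decays toward zero,
# a toy version of model collapse.
```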
Transparency deficits also pose serious challenges, fostering uncertainty around data quality and validity. A synthetic dataset can omit nuanced information that you are unable to identify, and that risk shouldn't be overlooked when it has the potential to affect real-life decisions.