What is Synthetic Data?

Everything You Need to Know About Synthetic Data

Businesses rolling out artificial intelligence (AI) face a major hurdle: gathering enough data for their models. For many use cases, the right data simply is not available, or it is difficult and costly to obtain. Missing or incomplete data undermines AI models, and even big tech companies are not immune. For example, researchers found in 2018 that leading facial-recognition software could reliably recognize the faces of white men but produced error rates up to 34 percent higher when identifying people with darker skin tones. The data used to train these models lacked representation of an entire subset of the population. How, then, should companies respond?

Synthetic data offers a compelling solution. Synthetic data is data generated artificially by a computer program rather than collected from real events. Businesses can augment their training data with synthetic data to cover potential and edge use cases, save on data collection costs, or meet privacy requirements. With increased computing power and cloud data storage options, synthetic data is more accessible than ever. This is a welcome development: synthetic data drives AI solutions that better serve all end users.

Why use synthetic data?

Let’s say you have an AI problem to solve and you’re not sure whether you should invest in synthetic data to partially or fully satisfy your data needs. Here are a few reasons why synthetic data is a good fit for your project:

Improve model reliability

Get more diverse data for your models without having to collect more of it. With synthetic data, you can train your model on the same person with different hairstyles, facial hair, eyeglasses, and head poses, and create unique faces that vary in skin color, ethnicity, bone structure, freckles, and more. A greater diversity of faces makes the model more reliable.
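The idea of multiplying one subject into many training samples can be sketched with basic array transforms (a minimal NumPy illustration; production pipelines use dedicated augmentation libraries or 3D rendering tools):

```python
import numpy as np

def augment(image, rng):
    """Create simple variations of one image: horizontal flip,
    brightness shift, and small Gaussian pixel noise."""
    variants = [image]
    variants.append(image[:, ::-1])                                  # mirrored face
    variants.append(np.clip(image * rng.uniform(0.7, 1.3), 0, 1))    # lighting change
    variants.append(np.clip(image + rng.normal(0, 0.05, image.shape), 0, 1))  # sensor noise
    return variants

rng = np.random.default_rng(0)
face = rng.random((64, 64, 3))   # stand-in for a real face image
augmented = augment(face, rng)
print(len(augmented))            # 4 training samples from 1 original
```

Each transform preserves the identity of the subject while varying its appearance, which is exactly the kind of diversity described above.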

Faster than "real" data acquisition

Teams can generate large amounts of synthetic data in a short amount of time. This is especially helpful when real data depends on events that occur infrequently. For example, when collecting data for self-driving cars, teams may struggle to capture enough real-world data because extreme road conditions are rare. In addition, data scientists can set up algorithms to label synthetic data automatically as it is created, cutting down the time-consuming labeling process.
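The auto-labeling point follows from how synthetic data is made: labels are known by construction, so no manual annotation is needed. A toy two-class example (hypothetical cluster centers):

```python
import numpy as np

rng = np.random.default_rng(3)

def make_labeled_batch(n):
    """Synthetic points come labeled at creation time: class 0 is drawn
    around (0, 0), class 1 around (3, 3) -- no annotation step needed."""
    y = rng.integers(0, 2, n)                       # pick a class per sample
    centers = np.array([[0.0, 0.0], [3.0, 3.0]])
    X = centers[y] + rng.normal(0, 0.5, (n, 2))     # jitter around the class center
    return X, y

X, y = make_labeled_batch(1_000)
print(X.shape, y.shape)   # (1000, 2) (1000,)
```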

Cover edge cases

Machine learning algorithms prefer balanced datasets. Recall our facial recognition example. If those companies created synthetic data of darker-skinned faces to fill the gaps in their datasets, not only would the accuracy of their models increase (which is, in fact, what several of them have done), but the resulting models would also be more equitable. Synthetic data helps teams cover all use cases, including edge cases where data is scarce or non-existent.
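Topping up an underrepresented class with synthetic samples can be sketched as follows (toy 1-D example; `synthesize` stands in for whatever generator you use):

```python
import numpy as np

def balance_with_synthetic(X, y, synthesize, rng):
    """Top up every class to the size of the largest class by drawing
    synthetic samples from `synthesize(class_label, n)`."""
    classes, counts = np.unique(y, return_counts=True)
    target = counts.max()
    X_out, y_out = [X], [y]
    for cls, n in zip(classes, counts):
        if n < target:
            X_out.append(synthesize(cls, target - n))
            y_out.append(np.full(target - n, cls))
    return np.concatenate(X_out), np.concatenate(y_out)

# Toy example: class 1 is underrepresented; synthesize it by sampling
# around its known mean (a stand-in for a real generative model).
rng = np.random.default_rng(1)
X = np.concatenate([rng.normal(0, 1, 90), rng.normal(5, 1, 10)])
y = np.array([0] * 90 + [1] * 10)
X_bal, y_bal = balance_with_synthetic(
    X, y, lambda cls, n: rng.normal(5, 1, n), rng)
print(np.bincount(y_bal))   # both classes now have 90 samples
```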

Protect users' private data

Depending on the industry and the type of data, organizations may face security challenges when dealing with sensitive data. For example, in the healthcare industry, patient data often includes protected health information (PHI), which must meet strict security requirements before it can be used. Synthetic data can alleviate privacy concerns because it does not describe real people. If your team needs to meet certain data privacy requirements, consider synthetic data as an alternative.

Application Scenarios for Synthetic Data

From a business perspective, synthetic data has many applications: model validation, model training, new product testing data, etc. Several industries have pioneered the use of synthetic data in machine learning, and we'll highlight a few of them:

Automotive

Companies developing autonomous vehicles often rely on simulations to test performance. In some situations, such as extreme weather, obtaining real traffic data may be difficult or dangerous. Overall, driving involves too many variables to cover every possible experience through field testing with a real car on the road. Synthetic data is safer and faster to obtain than manually collected data.

Healthcare

Healthcare has been an early adopter of synthetic data because of the sensitivity of its data. Teams can leverage synthetic data to capture physiological information across all possible patient types, ultimately helping to diagnose disease more quickly and accurately. A vivid example is Google's melanoma detection model, which was trained with synthetic data representing darker-skinned individuals (a group for which, unfortunately, clinical data is scarce), making the model applicable to all skin types.

Security

Synthetic data helps organizations strengthen security. Going back to our facial recognition example, you may have heard the term "deepfakes," which refers to artificially created images or videos. Businesses can create deepfakes to test their own security systems and facial recognition platforms. Video surveillance systems also leverage synthetic data to train models faster and at lower cost.

Data portability

Businesses need safe and secure ways to share their training data with others. Another interesting use case for synthetic data is hiding personally identifiable information (PII) before making a dataset available to others. This is known as privacy-preserving synthetic data, and it can be used to share datasets in scientific research, medicine, sociology, and other fields where records may contain PII.
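As a simple illustration of the privacy idea, PII fields can be replaced with stable pseudonyms before a dataset is shared (a minimal sketch with the standard library; the record fields are hypothetical, and true privacy-preserving synthetic data involves far more than hashing):

```python
import hashlib

# Hypothetical patient records; the field names are illustrative only.
records = [
    {"name": "Alice Smith", "zip": "94107", "diagnosis": "flu"},
    {"name": "Bob Jones",   "zip": "10001", "diagnosis": "asthma"},
]

PII_FIELDS = {"name", "zip"}

def de_identify(record, salt="s3cret"):
    """Replace PII fields with stable pseudonyms so records can still be
    linked across datasets without exposing the original values."""
    out = {}
    for key, value in record.items():
        if key in PII_FIELDS:
            digest = hashlib.sha256((salt + value).encode()).hexdigest()[:8]
            out[key] = f"{key}_{digest}"
        else:
            out[key] = value
    return out

safe = [de_identify(r) for r in records]
print(safe[0]["diagnosis"])   # "flu" -- the analytic fields survive
```

The salted hash keeps pseudonyms consistent across records, while the fields a model actually learns from are untouched.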

How to create synthetic data

Teams can programmatically create synthetic data using machine learning techniques. Typically, they will use a set of sample data to create synthetic data; the synthetic data must preserve the statistical properties of the sample data. Synthetic data itself can be binary, numeric, or categorical. It should be randomly generated, of arbitrary length, and reliable enough to cover the desired use case. There are several techniques for generating synthetic data; the most common techniques are described below:

Sampling from a known distribution

If you don't have real data, but know the dataset distribution, you can generate synthetic data from the distribution. In this technique, you generate random samples from any distribution (normal, exponential, etc.) to create fake data.
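For instance, if domain knowledge tells you one feature is normally distributed and another is exponential, you can draw synthetic samples directly (the parameters here are purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(42)

# Suppose domain knowledge says: heights ~ normal, wait times ~ exponential.
heights = rng.normal(loc=170, scale=8, size=10_000)     # cm
wait_times = rng.exponential(scale=3.5, size=10_000)    # minutes

# The sample statistics track the chosen distribution parameters.
print(round(heights.mean(), 1), round(wait_times.mean(), 1))
```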

Fitting real data to a distribution

If you do have real data, you can use techniques such as Monte Carlo methods to find the best-fitting distribution for your data and use it to generate synthetic data.
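A minimal sketch of this idea, assuming the real data is roughly normal (NumPy only; in practice, libraries such as scipy.stats can fit many candidate distributions and compare goodness of fit):

```python
import numpy as np

rng = np.random.default_rng(7)

# "Real" observations (in practice, your collected dataset).
real = rng.normal(loc=50, scale=5, size=2_000)

# Fit a normal distribution by maximum likelihood (mean and std),
# then sample new synthetic points from the fitted distribution.
mu_hat, sigma_hat = real.mean(), real.std()
synthetic = rng.normal(mu_hat, sigma_hat, size=5_000)

# The synthetic data preserves the statistical properties of the sample.
print(round(float(synthetic.mean()), 1), round(float(synthetic.std()), 1))
```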

Deep learning

Deep learning models can generate synthetic data. For example:

  • Variational autoencoder (VAE) model: This unsupervised model learns a compressed latent representation of the training data with an encoder and reconstructs data from that representation with a decoder; sampling new points in the latent space yields new synthetic examples.
  • Generative adversarial network (GAN) model: A GAN consists of two networks. A generator turns random noise into synthetic samples, while a discriminator tries to distinguish synthetic samples from real ones; training the two against each other iteratively makes the synthetic data increasingly realistic.

A combination of the above approaches may be most beneficial, depending on how much real data you are starting with and what you are using the synthetic data for.  
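To make the GAN idea concrete, here is a minimal, illustrative 1-D sketch with manual gradients (NumPy only; real GANs use neural networks and a deep learning framework). A linear generator learns to map Gaussian noise onto data drawn from N(4, 1):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-np.clip(u, -60, 60)))

# Real data: 1-D samples from N(4, 1). The generator g(z) = a*z + b
# must learn to map standard-normal noise z onto that distribution.
a, b = 1.0, 0.0        # generator parameters
w, c = 0.1, 0.0        # discriminator parameters (logistic regression)
lr, batch = 0.02, 64

for _ in range(4000):
    x = rng.normal(4.0, 1.0, batch)      # real batch
    z = rng.normal(0.0, 1.0, batch)      # noise
    g = a * z + b                        # synthetic (fake) batch

    # Discriminator: gradient ascent on log D(x) + log(1 - D(g))
    d_real, d_fake = sigmoid(w * x + c), sigmoid(w * g + c)
    w += lr * np.mean((1 - d_real) * x - d_fake * g)
    c += lr * np.mean((1 - d_real) - d_fake)

    # Generator: gradient ascent on log D(g) (non-saturating loss)
    d_fake = sigmoid(w * g + c)
    a += lr * np.mean((1 - d_fake) * w * z)
    b += lr * np.mean((1 - d_fake) * w)

synthetic = a * rng.normal(0.0, 1.0, 10_000) + b
print(round(float(synthetic.mean()), 1))  # mean of generated samples moves toward the real mean (4)
```

The non-saturating generator loss (maximizing log D(g) rather than minimizing log(1 - D(g))) is the standard trick for keeping generator gradients usable early in training.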

The Future of Synthetic Data

Over the past decade, the use of synthetic data has accelerated dramatically. While it saves businesses time and money, it is not without challenges. Synthetic data lacks the outliers that occur naturally in real data, and for some models, outliers are critical for accuracy. The quality of synthetic data also depends heavily on the input data used for generation: biases present in the input data can easily propagate into the synthetic data, so the importance of starting from high-quality data cannot be overstated. Finally, synthetic data requires additional output control; it needs to be compared against human-annotated real data to ensure no inconsistencies arise. Despite these challenges, synthetic data remains an exciting field of opportunity. Even when real data is unavailable, synthetic data can power innovative AI solutions. Most importantly, synthetic data can help companies create products that are more inclusive and more representative of the diversity of their end users.

Expert Insights from Appen's Director of Data Science

Remember that synthetic data is a data augmentation technique; it is not a replacement for data capture and labeling. You cannot create a model that works well in the real world without any real data. You will probably cover most cases, but there will be many edge cases where the model fails. In our face recognition example, there may be rare lighting conditions, rare facial features, plastic surgery, and so on that you never thought about; if you started with only synthetic data, you would not know about these things, no matter how realistic the generated faces might be. Beyond that, there are a few more things to keep in mind when creating and using synthetic data:

  1. Understand your model's reliability requirements to define the synthetic data you need: Before you start generating synthetic data, figure out what your model really needs and create a set of functional requirements for the types of synthetic data required. Synthetic data that merely duplicates existing data adds little value to the model. Instead, you may want to improve diversity (e.g., faces with different facial features in a face recognition use case) and variation (e.g., slight deviations of the same person) through data augmentation. You may also want to identify rare or edge cases and prioritize them when generating synthetic data. Another approach is to derive your synthetic data needs from the false positives and false negatives your model produces on real-world training, validation, and test datasets, and generate data to reduce those errors.
  2. Learn what synthetic data can and can't do for your dataset and models: Data augmentation can greatly improve the accuracy of your models, but it doesn't make them perfect. Because the synthetic data distribution stays close to the real data we already know, the model cannot magically understand significantly different data produced in the real world, nor can it make predictions the training data could not have guided it toward. We also have to account for the source and conditions of the data (for example, the faces on ThisPersonDoesNotExist.com are generated from profile avatars; they will not help your model recognize faces in overcast outdoor scenes or very dark indoor rooms).
  3. Learn about the synthetic data tools available to you and what's coming soon: Common approaches to synthetic data either clone parts of real-world data and overlay them onto other real data, or use Unity or another 3D environment to generate photorealistic data. Thanks to advances in GAN and VAE technology, this field is developing rapidly. Instead of creating entirely new data, we can create variations of real-world data by synthesizing new components on top of it (e.g., adding freckles to real faces, changing shadow angles, etc.). As another example, overlaid data can be optimized to look more realistic. Many other tools are available, but you need to know about them first.
  4. Version your data: As synthetic data generation improves, so does the ability to produce better synthetic data. An image you generated last month may now be superseded by a newer, more realistic version (e.g., you discover a better skin texture for faces, or a new GPU enables ray tracing for more detailed effects). You don't want to keep training models on outdated versions of these images. Version management helps you understand which data has been replaced and lets you verify model improvements as you add new synthetic data or update old data.
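One lightweight way to implement point 4 is to fingerprint each generated dataset together with a generator tag, so regenerated batches can be told apart and retired (an illustrative sketch using the standard library; the tags and records are hypothetical):

```python
import hashlib
import json

def dataset_version(records, generator_tag):
    """Fingerprint a synthetic dataset so regenerated batches can be
    distinguished and old versions retired from training sets."""
    payload = json.dumps(records, sort_keys=True).encode()
    return f"{generator_tag}-{hashlib.sha256(payload).hexdigest()[:12]}"

v1 = dataset_version([{"id": 1, "label": "face"}], generator_tag="gan-2023-07")
v2 = dataset_version([{"id": 1, "label": "face"}], generator_tag="gan-2023-08")
print(v1 != v2)   # True: same records, newer generator, distinct version id
```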

In summary, synthetic data can improve your model's performance in the real world. Any approach you take or data you generate must make your model more reliable and help improve its performance. Clearly defining where your model falls short will help you focus on choosing the right tools and generating the right data.

Origin blog.csdn.net/Appen_China/article/details/132324734