Once seen as less desirable than real data, synthetic data is now seen by some as a panacea. Real data is messy and riddled with bias. New data privacy regulations make it hard to collect. By contrast, synthetic data is pristine and can be used to build more diverse data sets. You can produce perfectly labeled faces, for example, of different ages, shapes, and ethnicities to build a face-detection system that works across populations.
But synthetic data has its limitations. If it fails to reflect reality, it could end up producing even worse AI than messy, biased real-world data, or it could simply inherit the same problems. “What I don’t want to do is give the thumbs up to this paradigm and say, ‘Oh, this will solve so many problems,’” says Cathy O’Neil, a data scientist and founder of the algorithmic auditing firm ORCAA. “Because it will also ignore a lot of things.”
Realistic, not real
Deep learning has always been about data. But in recent years, the AI community has learned that good data is more important than big data. Even small amounts of the right, cleanly labeled data can do more to improve an AI system’s performance than 10 times the amount of uncurated data, or even a more advanced algorithm.
This changes the way companies should approach developing their AI models, says Datagen CEO and co-founder Ofir Chakon. Today, they start by acquiring as much data as possible and then tweak and tune their algorithms for better performance. Instead, they should do the opposite: keep the same algorithm while improving the composition of their data.
But collecting real-world data for this kind of iterative experimentation is too expensive and time-consuming. This is where Datagen comes in. With a synthetic data generator, teams can create and test dozens of new data sets a day to identify which one maximizes a model’s performance.
To ensure the realism of its data, Datagen gives its vendors detailed instructions on how many individuals to scan in each age group, BMI range, and ethnicity, as well as a set list of actions for them to perform, such as walking around a room or drinking a soda. Vendors send back both high-fidelity static images and motion-capture data of those actions. Datagen’s algorithms then expand this data into hundreds of thousands of combinations. The synthesized data is sometimes then checked again. Fake faces are plotted against real faces, for example, to see if they look realistic.
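The article does not describe how Datagen's algorithms expand a few scans into many combinations. As a rough illustration only, the sketch below crosses a handful of scanned subjects and actions with rendering conditions to show how quickly the number of variations grows. The subject IDs, lighting setups, and camera angles are hypothetical and are not taken from Datagen.

```python
from itertools import product

# Hypothetical inputs: a few scanned subjects and recorded actions,
# plus rendering conditions that can be varied in software.
subjects = ["subject_01", "subject_02", "subject_03"]
actions = ["walking", "drinking"]
lighting = ["daylight", "dim", "fluorescent"]
camera_angles_deg = [0, 45, 90, 135]

# Every combination of subject, action, lighting and camera angle
# becomes one synthetic scene to render.
variations = [
    {"subject": s, "action": a, "lighting": l, "camera_angle_deg": c}
    for s, a, l, c in product(subjects, actions, lighting, camera_angles_deg)
]

print(len(variations))  # 3 * 2 * 3 * 4 combinations
```

Each added condition multiplies the total, which is why a modest number of real scans can yield a very large synthetic data set.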
Datagen is generating facial expressions to monitor driver alertness in smart cars, body movements to track customers in cashierless stores, and irises and hand movements to improve the eye- and hand-tracking capabilities of VR headsets. The company says its data is already being used to develop computer vision systems that serve tens of millions of users.
It’s not just synthetic humans that are being mass-produced. Click-Ins is a startup that uses synthetic AI to perform automated vehicle inspections. Using design software, it re-creates all the makes and models of vehicles that its AI needs to recognize and renders them with different colors, damage, and deformations under different lighting conditions, against different backgrounds. This allows the company to update its AI when carmakers launch new models, and helps it avoid data privacy violations in countries where license plates are considered private information and so cannot appear in the photos used to train AI.
Mostly.ai works with financial, telecommunications, and insurance companies to provide spreadsheets of fake customer data that let companies share their customer database with outside vendors in a legally compliant way. Anonymization can reduce a data set’s richness yet still fail to adequately protect people’s privacy. Synthetic data, by contrast, can be used to generate detailed fake data sets that share the same statistical properties as a company’s real data. It can also be used to simulate data that the company doesn’t yet have, including a more diverse customer population or scenarios such as fraudulent activity.
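To make "sharing the same statistical properties" concrete, here is a minimal sketch that fits the means, spreads, and correlation of two numeric columns and then samples entirely new rows from that fit. It assumes the data is roughly Gaussian, which real customer data often is not, and it is not the method any vendor named here uses. The columns (age, monthly spend) and the stand-in "real" table are invented for illustration.

```python
import math
import random
import statistics

def fit_gaussian(rows):
    """Estimate means, standard deviations and correlation of two columns."""
    xs = [r[0] for r in rows]
    ys = [r[1] for r in rows]
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sx, sy = statistics.stdev(xs), statistics.stdev(ys)
    cov = sum((x - mx) * (y - my) for x, y in rows) / (len(rows) - 1)
    return mx, my, sx, sy, cov / (sx * sy)

def sample_synthetic(params, n, rng):
    """Draw n new rows from a bivariate normal with the fitted parameters."""
    mx, my, sx, sy, rho = params
    residual = math.sqrt(max(0.0, 1.0 - rho ** 2))
    rows = []
    for _ in range(n):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        x = mx + sx * z1
        y = my + sy * (rho * z1 + residual * z2)  # induces correlation rho
        rows.append((x, y))
    return rows

# Hypothetical stand-in for a real table: (age, monthly spend).
rng = random.Random(0)
real = []
for _ in range(2000):
    age = rng.uniform(20, 70)
    spend = 20 + 1.5 * age + rng.gauss(0, 10)
    real.append((age, spend))

params = fit_gaussian(real)
synthetic = sample_synthetic(params, 2000, rng)
synth_params = fit_gaussian(synthetic)

print("real:     ", [round(p, 2) for p in params])
print("synthetic:", [round(p, 2) for p in synth_params])
```

The synthetic rows match the real table in aggregate while no row is copied from it. That alone is not a privacy guarantee, since a fitted model can still leak information about the people it was trained on.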
Proponents of synthetic data say it can also help evaluate AI. In a recent paper presented at an AI conference, Suchi Saria, associate professor of machine learning and health care at Johns Hopkins University, and her co-authors demonstrated how data generation techniques could be used to extrapolate different patient populations from a single set of data. This could be useful if, for example, a company only had data from New York’s younger population but wanted to understand how its AI performs on an older population with a higher prevalence of diabetes. She has now started her own company, Bayesian Health, which will use this technique to help test medical AI systems.
The limits of faking it
But is synthetic data overhyped?
When it comes to privacy, “just because the data is ‘synthetic’ and doesn’t directly correspond to real user data doesn’t mean it doesn’t encode sensitive information about real people,” says Aaron Roth, professor of computer and information science at the University of Pennsylvania. Some data generation techniques have been shown to closely reproduce images or text found in the training data, for example, while others are vulnerable to attacks that make them regurgitate that data in full.
This might be fine for a company like Datagen, whose synthetic data is not meant to conceal the identities of the individuals who consented to be scanned. But it would be bad news for companies that offer their solution as a way to protect sensitive financial or patient information.
Research suggests that combining two synthetic data techniques in particular, differential privacy and generative adversarial networks, can produce the strongest privacy protections, says Bernease Herman, a data scientist at the University of Washington eScience Institute. But skeptics worry that this nuance may be lost in the marketing language of synthetic data vendors, who won’t always be forthcoming about which techniques they are using.
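For readers unfamiliar with differential privacy, the sketch below shows its most basic building block, the Laplace mechanism applied to a simple count. It is not the combination with generative adversarial networks that Herman describes, only the underlying idea: add calibrated random noise so that any one person's presence in the data changes the output very little. The ages, the query, and the `epsilon` value are made up for illustration.

```python
import random

def private_count(records, predicate, epsilon, rng):
    """Return a count with Laplace noise of scale 1/epsilon added.

    Adding or removing one record changes a count by at most 1, so
    Laplace noise with scale 1/epsilon gives epsilon-differential privacy.
    """
    true_count = sum(1 for r in records if predicate(r))
    scale = 1.0 / epsilon
    # The difference of two exponential draws is Laplace-distributed.
    noise = rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)
    return true_count + noise

rng = random.Random(42)
ages = [23, 35, 41, 58, 62, 67, 71, 29, 33, 80]  # hypothetical patient ages

true_over_60 = sum(1 for a in ages if a > 60)
noisy_over_60 = private_count(ages, lambda a: a > 60, epsilon=1.0, rng=rng)

print("true count: ", true_over_60)
print("noisy count:", round(noisy_over_60, 2))
```

A smaller `epsilon` means more noise and stronger privacy at the cost of accuracy. That trade-off is the kind of detail skeptics fear marketing language will gloss over.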