Companies at the forefront of the technology, like OpenAI, Meta, and Google, are scouring the internet and troves of books, podcasts, and videos searching for data to train their models.
Some industry leaders, however, worry this kind of “land grab” for publicly available data isn’t the right approach, especially since it puts companies at risk of copyright lawsuits. Instead, they’re calling for companies to train their models on synthetic data.
Synthetic data is artificially generated rather than collected from the real world. It can be generated by machine learning algorithms with little more than a seed of original data.
Business Insider chatted with Ali Golshan, CEO and cofounder of Gretel, who one might call an evangelist for synthetic data. Gretel allows companies to experiment and build with synthetic data. It is working with major players in the healthcare space, such as genomics company Illumina, consulting firms like Ernst & Young, and consumer companies like Riot Games.
Golshan says synthetic data is a safer and more private alternative to “messy” public data, and that it can shepherd most companies into the next era of generative AI development.
The following conversation has been edited for clarity.
Why is synthetic data better than raw public data?
Raw data is just that: raw. It’s often filled with holes, inconsistencies, and biases from the processes used to capture, label, and leverage it. Synthetic data, on the other hand, allows us to fill those gaps, expand into areas that can’t be captured in the wild, and intentionally design the data needed for specific applications.
This level of control, with humans in the loop designing and refining the data, is crucial for pushing GenAI to new heights in a responsible, transparent, and secure manner. Synthetic data enables us to create datasets that are more comprehensive, balanced, and tailored to specific AI training needs, which leads to more accurate and reliable models.
Great, are there any cons to synthetic data?
Where synthetic data falls short is that, at the end of the day, if you have no data or clarity about what you need, you can’t just have it create perfect data for you so you can experiment endlessly. That scope needs to be defined first.
The other part of it is that synthetic data is very good at privacy if you have enough data. If you have only a few hundred records and want ultimate privacy, that comes at a huge cost to utility and accuracy because the data is so limited. So, whether you have absolutely zero data and want a domain-specific task, or have very limited data and want both great privacy and great accuracy, those goals are just incompatible with these approaches.
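The trade-off between record count, privacy, and accuracy can be illustrated with a small sketch. It assumes differential privacy as the privacy model and the textbook Laplace mechanism for releasing a mean; neither is described in this answer as Gretel's actual method, and the function name and parameters are hypothetical. The point is only that, for a fixed privacy budget, the noise needed shrinks as the number of records grows.

```python
def laplace_noise_scale_for_mean(n_records, epsilon, value_range=1.0):
    """Laplace noise scale needed to release the mean of n_records bounded
    values under epsilon-differential privacy (illustrative assumption).

    Changing one record moves the mean by at most value_range / n_records,
    so that is the sensitivity; the Laplace scale is sensitivity / epsilon.
    """
    if n_records <= 0 or epsilon <= 0:
        raise ValueError("n_records and epsilon must be positive")
    sensitivity = value_range / n_records
    return sensitivity / epsilon


# Same strict privacy budget, very different dataset sizes.
for n in (300, 30_000, 3_000_000):
    scale = laplace_noise_scale_for_mean(n, epsilon=0.1)
    print(f"{n:>9} records -> typical noise on the mean: {scale:.6f}")
```

With a few hundred records, the noise is a meaningful fraction of the value range; with millions, it is negligible. A smaller epsilon (stronger privacy) increases the noise at every dataset size.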
What are the challenges of using public data?
Public data presents several challenges, especially for specialized use cases in healthcare. Imagine trying to train an AI model for predicting COVID-19 outcomes using only publicly available case count data — you’d be missing crucial specifics like patient comorbidities, treatment protocols, and detailed clinical progression. This lack of comprehensive data severely limits the model’s effectiveness and reliability.
Adding to this challenge is the growing regulatory pressure against data collection practices. The Federal Trade Commission and other regulatory bodies are increasingly pushing back against web scraping and unauthorized data access — and rightly so. As AI becomes more powerful, the risk of re-identifying individuals from supposedly anonymized data is higher than ever.
There’s also the critical issue of data freshness across all industries. In today’s fast-paced business environment, organizations need real-time data to remain competitive and train models that respond rapidly to changing market conditions, consumer behaviors, and emerging trends. Public domain data often lags by weeks, months, or even years, making it less valuable for cutting-edge AI applications that require up-to-the-minute insights.
What do you think about companies like Meta and OpenAI that are willing to risk copyright lawsuits to get access to public data?
The era of ‘move fast and break things’ is over, especially in the age of GenAI, where there’s too much at stake to operate in such a flippant manner. We’re advocating for an approach that leads with privacy. By prioritizing privacy from the start and embedding it into the customers’ AI products and services — by design — you get faster, more sustainable, and defensible AI development. That’s what our partners and, ultimately, their customers want. In this sense, privacy is a catalyst for GenAI innovation.
This privacy-first approach is why partners like Google, AWS, EY, and Databricks work with us. They know that current methods are unsustainable and the future of AI will be driven by consensual, licensed data and thoughtful data-driven design, not by grasping at every bit of public data available. It’s about creating a foundation of trust with your users and stakeholders, which is crucial for long-term success in AI development.
Companies are scrambling to build models that unlock insights from proprietary data. Where does synthetic data fit into that equation?
By some estimates, companies use only 1-10% of the data they collect. The rest is stored and siloed so that few can even access or experiment with it. This creates additional costs and data breach risks with no return value. Now, imagine if a company could safely open access to that remaining 90% of data. Cross-functional teams could collaborate and experiment with it to extract value without creating additional privacy or security risks. That level of knowledge sharing would be a huge boon for innovation.
It’s like moving beyond the parable of the blind men trying to describe an elephant to each other. Each grasps and understands only the part they can touch; the rest is a black box. Providing an entire organization with shared access to the ‘crown jewels’ and the opportunity to surface new insights from that data would be a paradigm shift in how companies and products are built. This is what people mean when they speak of ‘democratizing’ data.
There are already ways of training smaller models with a fraction of the data we may have once used that yield great results. Where are we headed regarding the amount of data we need for training generative AI?
The idea of throwing the kitchen sink, in terms of data, to train a large language model is part of the problem and reflects the old ‘move fast and break things’ mentality. It’s a land grab by companies with the means to do that, while AI regulations are still being hashed out.
Now that the dust is settling, people are realizing that the future lies in smaller, more specialized models targeted to very specific tasks and orchestrating the actions of these models through an agentic, systematic approach. This specialized model approach provides more transparency and removes much of the ‘black box’ nature of AI models since you’re designing the models from the ground up, piece by piece.
It’s also where regulation is heading. After all, how else will companies adhere to ‘risk-based’ regulations if we can’t even quantify AI risks for each task we apply them to?
This shift toward more focused, efficient models aligns perfectly with differential privacy and synthetic data. We can generate precisely the data needed for these narrow AI models, ensuring high performance without the ethical and practical issues of massive data collection. It’s about smart, targeted development rather than the brute-force approach companies have taken.