Mostly AI is moving to address a major AI training bottleneck for enterprises. The Austrian company, known for providing a platform for synthetic data generation, today announced the launch of synthetic text. This new functionality allows enterprises to unlock value from their proprietary datasets without worrying about privacy risks.
Starting today, the offering generates a synthetic version of an organization’s proprietary information, without including personally identifiable information (PII) or diversity gaps. This gives teams a way to train and fine-tune reliable large language models (LLMs) for faster innovation and better decision-making.
The capability comes at a time when AI training is hitting a plateau and enterprises are looking beyond public data for sources that could offer greater value and potential than what remains publicly available.
How does Synthetic Text work?
Synthetic, or artificially generated, data is often seen as the go-to alternative when real data is too expensive, unavailable, imbalanced or unusable. Enterprises have been producing and working with synthetic information (mostly images) for quite some time, but the rise of generative AI is expected to propel its application to a whole new level, covering wider data types. According to Gartner, by 2026, 75% of companies will use gen AI to create synthetic data, up from less than 5% in 2023.
However, even when AI is generating synthetic data, it may lack organization-specific context and insights. This could keep downstream models from learning and performing up to the expected mark.
To address this, Mostly AI provides enterprises with a platform to train their own AI generators that can produce synthetic data on the fly. The company started off by enabling the generation of structured tabular datasets, capturing nuances of transaction records, patient journeys and customer relationship management (CRM) databases. Now, as the next step, it is expanding to text data.
While proprietary text datasets – like emails, chatbot conversations and support transcriptions – are collected on a large scale, they are difficult to use because they contain PII (like customer information), have diversity gaps and are mixed, to some degree, with structured data.
With the new synthetic text functionality on the Mostly AI platform, users can train an AI generator using any proprietary text they have and then deploy it to produce a cleansed synthetic version of the original data, free from PII or diversity gaps. Just like the tabular data generator, it also captures the nuances and insights in the text (along with the context of accompanying structured data). Plus, users get a variety of language model options (including Mistral-7B and Viking-7B) to train the generator.
“The selected LLM is fine-tuned with the original text data on the Mostly AI Platform. This will take place in the context of additional structured data that is provided with text (e.g. specific customer information) to increase the quality of the created synthetic text. With the fine-tuned LLM in place, the Mostly AI Platform will create the synthetic text which can be downloaded or stored in a database for further processing,” Tobias Hann, the CEO of the company, told VentureBeat.
How will it help enterprises?
With the synthetic text generated from the platform’s generators, enterprises can power a range of analytics and gen AI use cases. Hann said there are no live applications yet, as the product has just been announced, but the company is looking at the generation of prompt-response pairs (like question-answer pairs) as the initial application, given that these pairs are widely used for fine-tuning LLMs such as those aimed at customer service.
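For readers unfamiliar with the format, prompt-response pairs for fine-tuning are commonly stored as JSON Lines, with one pair per line. The snippet below is only an illustrative sketch: the `prompt` and `response` field names, the sample records and the helper functions are assumptions made for this example, not a format or API documented by Mostly AI.

```python
import json

# Hypothetical sample records. The field names are illustrative only and
# are not taken from the Mostly AI platform.
synthetic_pairs = [
    {
        "prompt": "How do I reset my password?",
        "response": "Open Settings, choose Security, then select Reset password.",
    },
    {
        "prompt": "Where can I find my latest invoice?",
        "response": "Invoices are listed under Billing in your account dashboard.",
    },
]


def to_jsonl(pairs):
    """Serialize prompt-response pairs as JSON Lines, one pair per line."""
    return "\n".join(json.dumps(pair, ensure_ascii=False) for pair in pairs)


def from_jsonl(text):
    """Parse JSON Lines back into a list of dicts, skipping blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


jsonl_text = to_jsonl(synthetic_pairs)
```

A file written this way can be read back with `from_jsonl` and handed to whichever fine-tuning tool a team already uses, though the exact field names a given tool expects will vary.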
The new feature, with its ability to unlock value from proprietary text without privacy concerns, makes for an attractive offering for enterprises looking to strengthen their AI training efforts. The company claims that training a text classifier on its platform’s synthetic text resulted in a 35% performance improvement compared with data generated by prompting GPT-4o-mini.
However, it is important to note that this is still an apples-to-oranges comparison and there are no benchmarks yet comparing the performance of Mostly AI’s synthetic text generator with other synthetic generators like Gretel.
“The Mostly AI platform has been benchmarked against other companies and solutions in the past and has consistently demonstrated superior performance when it comes to the quality (accuracy, fidelity) and privacy of the created synthetic data,” Hann added.