Cohere for AI, the nonprofit research lab run by the artificial intelligence startup Cohere Inc., pushed the boundaries of multilingual frontier AI model research today with the release of Aya Expanse, a family of high-performance multilingual large language models that it says outperform other leading open rivals.
The new family includes two models with 8 billion and 32 billion parameters, released with open weights on the hosting sites Kaggle and Hugging Face. The models cover 23 languages, including English, Arabic, Chinese, Czech, Dutch, French, German, Greek and Hindi.
“Aya Expanse marks an important step to expand high-quality coverage of languages in LLMs,” said the Cohere research team. “Since we first launched the Aya initiative two years ago, we have collaborated with over 3,000 researchers from 119 countries to expand cutting-edge multilingual research.”
The Aya initiative is Cohere's effort to advance state-of-the-art multilingual AI, bridging the gap between people across the world who use the technology and expanding the number of languages covered by AI. It includes the Aya Collection, the largest multilingual dataset collection to date with 513 million examples, and Aya-101, an AI model covering more than 100 languages.
The team said several new core research innovations gave Aya Expanse its superior performance: the use of synthetic data, human feedback in late-stage training and model merging.
To train Aya Expanse, the lab turned to synthetic data for languages with limited datasets. Using data generated by larger "teacher" models for training is not an uncommon practice in the AI industry.

However, large language models can suffer from model collapse, or produce "gibberish," when trained on synthetic data. To avoid this, the lab used data arbitrage, sampling from a pool of teacher models, each with specialized skills in particular languages.
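To illustrate the idea, here is a minimal Python sketch of data arbitrage under stated assumptions: each prompt is routed to whichever teacher model is strongest in its language, and the teachers' completions become synthetic training data. The model names and the generate() helper are hypothetical placeholders, not anything Cohere has published.

```python
# Hypothetical sketch of "data arbitrage": rather than distilling from a single
# teacher, route each prompt to the teacher model that is strongest for that
# language, and collect its completions as synthetic training examples.

TEACHERS_BY_LANGUAGE = {
    "fr": "teacher-strong-in-french",   # placeholder model names
    "hi": "teacher-strong-in-hindi",
    "cs": "teacher-strong-in-czech",
}
FALLBACK_TEACHER = "general-multilingual-teacher"


def generate(model_name: str, prompt: str) -> str:
    """Stand-in for a call to a hosted or local teacher model."""
    return f"[completion from {model_name} for: {prompt}]"


def synthesize_example(prompt: str, lang: str) -> dict:
    # Pick the teacher with specialized skill in this language (the "arbitrage").
    teacher = TEACHERS_BY_LANGUAGE.get(lang, FALLBACK_TEACHER)
    return {"lang": lang, "prompt": prompt, "completion": generate(teacher, prompt)}


# Usage: build a small synthetic dataset across languages.
dataset = [synthesize_example("Translate 'hello' politely.", lang) for lang in ("fr", "hi", "el")]
```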
In the later stages of training, the company said, it began using human feedback to guide the model toward high-quality outputs. Many multilingual models tend to be biased toward Western cultures and settings, largely because of the countries of origin of their datasets and the companies that build them.
“Our work is one of the first that extends preference training to a massively multilingual setting, accounting for different cultural and linguistic perspectives,” the company said. “We find this leads to large gains both in general performance and safety.”
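The article does not detail Cohere's preference-training method, but direct preference optimization, or DPO, is one standard way to fold such human feedback into late-stage training. Below is a minimal, illustrative sketch of the DPO loss in Python, not a description of Cohere's actual recipe.

```python
# Illustrative DPO loss: given human-labeled (chosen, rejected) response pairs,
# nudge the policy toward the chosen response while staying anchored to a
# frozen reference model. This is a generic sketch, not Cohere's implementation.
import torch
import torch.nn.functional as F


def dpo_loss(
    policy_chosen_logps: torch.Tensor,    # log p_policy(chosen response)
    policy_rejected_logps: torch.Tensor,  # log p_policy(rejected response)
    ref_chosen_logps: torch.Tensor,       # same quantities under the reference model
    ref_rejected_logps: torch.Tensor,
    beta: float = 0.1,                    # strength of the preference signal
) -> torch.Tensor:
    # Implicit rewards are the policy's log-prob gain over the reference model.
    chosen_reward = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_reward = beta * (policy_rejected_logps - ref_rejected_logps)
    # Loss shrinks as the human-preferred response wins by a larger margin.
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()
```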
Finally, to increase performance, Cohere merged the model weights of multiple fine-tuned candidates at each stage of training to produce a single model. According to a study on the subject, merging can bring improvements of up to 8% in general performance and 10% in safety.
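As a rough illustration, one of the simplest merging recipes is uniform weight averaging across fine-tuned candidates of the same base model. The sketch below assumes plain PyTorch state dicts and should not be read as Cohere's exact scheme.

```python
# Minimal sketch of model merging via uniform weight averaging, one common
# merging recipe; treat this as an illustration, not Cohere's method.
import torch


def merge_checkpoints(state_dicts: list[dict[str, torch.Tensor]]) -> dict[str, torch.Tensor]:
    """Average the parameters of several fine-tuned candidates of one base model."""
    merged = {}
    for name in state_dicts[0]:
        # Stack the same tensor from every candidate and take the elementwise mean.
        merged[name] = torch.stack([sd[name].float() for sd in state_dicts]).mean(dim=0)
    return merged


# Usage: load candidates fine-tuned from the same base architecture,
# merge them, then load the result back into the model.
# merged = merge_checkpoints([torch.load(p) for p in ("cand_a.pt", "cand_b.pt")])
# model.load_state_dict(merged)
```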
The company said these innovations helped Aya Expanse 8B achieve a 60.4% simulated win rate against Google LLC's Gemma 2 9B LLM on the m-ArenaHard benchmark. The larger model, Aya Expanse 32B, outperforms Gemma 2 27B and Mixtral 8x22B with win rates of 51.8% and 76.6%, respectively. It also outperformed Meta Platforms Inc.'s Llama 3.1 70B, a model more than twice its size, with a 54% pair-wise win rate.
In addition to releasing the open weights for Aya Expanse 8B and 32B, Cohere said it is continuing to collaborate on wider multilingual AI research to broaden access to linguistic data, software and compute resources.