As the world continues to gush over the prowess of the all-new GPT-4o-mini, Apple has chosen to expand its family of small models. A few hours ago, the Apple research team working on the DataComp for Language Models project released a family of open DCLM models on Hugging Face.
The package includes two main models at the core: one with 7 billion parameters and the other with 1.4 billion parameters. They both perform pretty well on the benchmarks, especially the bigger one — which has outperformed Mistral-7B and is closing in on other leading open models, including Llama 3 and Gemma.
Vaishaal Shankar from the Apple ML team described these as the “best-performing” open-source models out there. Notably, the project is truly open source: the team released the model weights, the training code and the pretraining dataset.
What do we know about Apple DCLM models?
Led by a team of multidisciplinary researchers, including those at Apple, University of Washington, Tel Aviv University and Toyota Research Institute, the DataComp project can be described as a collaborative effort to design high-quality datasets for training AI models, particularly in the multimodal domain. The idea is pretty simple here: use a standardized framework – with fixed model architectures, training code, hyperparameters and evaluations – to run different experiments and figure out which data curation strategy works best for training a highly performant model.
The work on the project started a while ago and the experiments led the team to figure out that model-based filtering, where machine learning (ML) models automatically filter and select high-quality data from larger datasets, can be key to assembling a high-quality training set. To demonstrate the effectiveness of the curation technique, the resulting dataset, DCLM-Baseline, was used to train the new DCLM decoder-only transformer English language models with 7 billion and 1.4 billion parameters from scratch.
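To make the idea of model-based filtering concrete, here is a minimal Python sketch. It is not the DCLM pipeline: the scoring function below is a hypothetical toy heuristic standing in for a trained quality classifier, and the function names and the keep fraction are assumptions chosen for illustration.

```python
from typing import Callable, List


def toy_quality_score(doc: str) -> float:
    """Hypothetical stand-in for a learned quality classifier.

    Scores a document by the fraction of whitespace-separated tokens that
    are purely alphabetic. A real pipeline would call a trained model here.
    """
    tokens = doc.split()
    if not tokens:
        return 0.0
    return sum(t.isalpha() for t in tokens) / len(tokens)


def filter_by_model(
    docs: List[str],
    score_fn: Callable[[str], float],
    keep_fraction: float,
) -> List[str]:
    """Keep the top `keep_fraction` of documents ranked by `score_fn`.

    The surviving documents are returned in their original order.
    """
    if not 0.0 < keep_fraction <= 1.0:
        raise ValueError("keep_fraction must be in (0, 1]")
    if not docs:
        return []
    n_keep = max(1, int(len(docs) * keep_fraction))
    # Rank document indices by score, highest first.
    ranked = sorted(range(len(docs)), key=lambda i: score_fn(docs[i]), reverse=True)
    # Restore original ordering among the kept documents.
    return [docs[i] for i in sorted(ranked[:n_keep])]


if __name__ == "__main__":
    pool = [
        "the quick brown fox jumps",
        "$$$ 1234 !!! click",
        "plain readable text here",
        "@@ ## 99",
    ]
    for doc in filter_by_model(pool, toy_quality_score, keep_fraction=0.5):
        print(doc)
```

The design point is that the scorer is swappable: the selection logic stays the same whether the score comes from a simple heuristic or from a trained classifier.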
The 7B model, trained on 2.5 trillion tokens using pretraining recipes based on the OpenLM framework, comes with a 2K context window and delivers 63.7% 5-shot accuracy on MMLU. According to the researchers, this represents a 6.6 percentage point improvement on the benchmark compared to MAP-Neo — the previous state-of-the-art in the open-data language model category — while using 40% less compute for training.
More importantly, its MMLU performance is pretty close to that of leading open models – open weights but closed data – in the market, including Mistral-7B-v0.3 (62.7%), Llama3 8B (66.2%), Google’s Gemma (64.3%) and Microsoft’s Phi-3 (69.9%).
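For readers who want the gaps spelled out, the short Python sketch below computes how far each of the quoted models sits from DCLM-7B on 5-shot MMLU. The scores are the ones reported in this article; the dictionary and function names are illustrative only.

```python
# MMLU 5-shot scores (%) as quoted in the article.
MMLU_SCORES = {
    "DCLM-7B": 63.7,
    "Mistral-7B-v0.3": 62.7,
    "Llama3 8B": 66.2,
    "Gemma": 64.3,
    "Phi-3": 69.9,
}


def gaps_to(reference: str, scores: dict) -> dict:
    """Return each other model's score minus the reference score, in points."""
    base = scores[reference]
    return {
        name: round(score - base, 1)
        for name, score in scores.items()
        if name != reference
    }


if __name__ == "__main__":
    for name, gap in gaps_to("DCLM-7B", MMLU_SCORES).items():
        print(f"{name}: {gap:+.1f} points vs. DCLM-7B")
```

On these figures, DCLM-7B is 1.0 point ahead of Mistral-7B-v0.3 and trails Gemma by 0.6, Llama3 8B by 2.5 and Phi-3 by 6.2 points.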
The model’s performance across Core and Extended benchmarks (average of dozens of different tasks, including HellaSwag and ARC-E) saw further improvements when the researchers extended its context length to 8K by training on an additional 100B tokens from the same dataset, using the Dataset Decomposition technique. The MMLU result, however, remained unchanged.
“Our results highlight the importance of dataset design for training language models and offer a starting point for further research on data curation,” the researchers noted in a paper detailing the work on DataComp-LM.
Powerful smaller model
Just like DCLM-7B, the smaller 1.4B version of the model, trained jointly with Toyota Research Institute on 2.6 trillion tokens, also delivers impressive performance across MMLU, Core and Extended tests.
In the 5-shot MMLU test, it scored 41.9%, which is considerably higher than other models in the category, including Hugging Face’s recently released SmolLM. According to benchmarks, the 1.7B version of SmolLM has an MMLU score of 39.97%. Meanwhile, Qwen-1.5B and Phi-1.5B also follow behind with scores of 37.87% and 35.90%, respectively.
Currently, the larger model is available under Apple’s Sample Code License, while the smaller one has been released under Apache 2.0, allowing for commercial use, distribution and modification. Notably, there’s also an instruction-tuned version of the 7B parameter model in the HF library.
It is also important to note here that this is just early research, highlighting the effectiveness of data curation. The models are not for Apple devices and may exhibit certain biases from their training data or produce harmful responses.