Data Machina #239

The Power of Truly Open Source AI. The spin doctors of some big closed-AI companies have been busy inflating the “AGI is here soon, AGI will be an existential risk” bubble. But that thankfully that is deflating quickly, and backfiring somehow.

In the meantime, the open source AI community is stubbornly embarked upon releasing truly open source, efficient, smallish, powerful AI models that match or beat the closed AI models from big companies.

The reaction from these big closed AI companies: “Oh! open source AI models are dangerous, we need to regulate open source AI. And btw: We’re dropping the pricing trousers for using our closed models.” A recent report from Stanford HAI, totally debunks all the myths about dangerous open source AI, and the exaggerations coming from the closed AI companies.

Truly open source AI research and models is the only way forward to advance AI.

A new, truly open source language model. Two days ago, The Allen Institute for AI (AI2) released OLMo 7B, a truly open source SOTA language model which was trained with Databricks Mosaic Model Training. OLMo was released on Apache 2.0 license and comes with:

Full training data used, training code, training logs, and training metrics
Full model weights and 500+ model checkpoints
Fine-tuning code and adapted models

Checkout the blogpost, repo & tech report here: How to Get Started with OLMo SOTA truly open source LM.

A new, truly open source text embedding model. Also a few days ago, Nomic AI released Nomic Embed, a truly open source text embedding model, that is SOTA in 2 main benchmarks. Nomic Embed has a 8192 context-length, and beats Open AI text-embedding-3-small. The model is released under Apache 2.0 license and comes with the full training code, training data and model weights. Checkout the blogpost, repo and tech report here: Introducing Nomic Embed: A Truly Open Text Embedding Model.

Want to learn more on Nomic Embed? Checkout this vid from the guys at LangChain: How to build a long context RAG app with OSS components from scratch using Nomic Embed 8k, Mistral-instruct 32k and Ollama.

And speaking of text embedding models, Salesforce Research just released SFR-Embedding-Mistral model, now SOTA in the MTEB benchmark. The model was trained on top of 2 open source models: E5-mistral-7b-instruct and Mistral-7B-v0.1.

A new, fully open source SOTA multi-lingual model based on a RNN. Last week, a team of independent researchers backed by Stability AI and Eleuther AI, released Eagle 7B. The model beats all 7B open source models in the main multilingual benchmarks, and it’s super cheap compute-efficient. The beauty of this model is that it’s an attention-free, linear transformer built on the RWKV-v5 architecture, which is based on a RNN. Checkout the blogpost, repo, and demo here: Eagle 7B : Soaring past Transformers with 1 Trillion Tokens Across 100+ Languages (RWKV-v5.)

Yesterday, Hugging Face released HuggingChat Assistants (blogpost, demo), a nice alt to closed-model chat assistants, that uses 6 top open source models. Albeit rather basic yet, the idea is to have the open source community developing several powerful features already planned.

This is such a cool open source AI project! ADeus: An Open-Source AI Wearable Device for less that $100 (repo, sw/hw list.) It uses Ollama, Supabase, Coral AI microcontroller (soon to be replaced by Raspberry Py Zero.) Checkout the intro vid: