Large Language Models (LLMs) are changing how people interact with information, increasing global productivity and abruptly shifting markets. Companies have been integrating this technology into their products and business processes through third-party APIs, but the proof-of-concept era has ended. Now it’s time to differentiate LLM-powered products and provide added value to your customers.
Open source models are the best way to achieve transparency and secure setups. When fed with proprietary data, they provide a clear competitive advantage: LLMs fine-tuned on high-quality data are better adapted to the context they are intended to handle. With cloud platforms and services such as AWS Bedrock, Hugging Face, Scale.ai, and others, making your LLM domain-specific and easy to deploy is easier than ever. This balances the TCO (total cost of ownership) and reduces overall cost compared to using closed or proprietary models from providers such as OpenAI or Anthropic.
In this article, we will discuss which open source LLMs are the best in terms of adaptability (how well they can be adapted to different knowledge domains by fine-tuning), manageability (how easy the model is to prompt, fine-tune and deploy) and quality (the baseline metrics they provide).
Why are LLMs Open Sourced?
From a theoretical point of view, open source typically refers to software or a computational model available for anyone to access, use, modify, and share. It promotes transparency, collaboration, and accessibility. This approach allows collective problem-solving, with community members contributing to improvements, enhancements, and innovations.
Also, incumbents use open source software (OSS) as a competitive strategy. IBM used Linux against Microsoft, and Google used Android against Apple. Other communities have leveraged their open source contributions to build consultancy and support services and SaaS (software as a service) on top of them. In the Data field, we can see similar examples: Spark and Databricks, Langchain and LangSmith, Feast and Tecton, etc. These companies are often called ‘Open-Core’ or ‘Commercial Open Source Software’ (COSS) companies. They offer the core product as open-source, while premium features or services are sold commercially.
In the LLM field specifically, apart from the reasons mentioned above, some actors intend to promote the greater good by making the open source community part of the advances in the development of artificial intelligence. They believe that only through this openness will we achieve safe, human-aligned “AGI” (Artificial General Intelligence).
To consider an LLM fully open source, the model weights, model code, and training data need to be published. Training data is often the distinguishing factor, and it’s not always easy to share due to copyright and competitive reasons. This happened, for example, with the Books3 dataset. For this reason, in practice, LLMs that release their weights and code are generally referred to as open source.
Typical licenses for LLMs are:
- Non-commercial: only allows academic research or personal use.
- Copy-left: requires all modifications built on top of it to be released under the same license.
- Permissive: allows commercial usage and modification in proprietary applications. The most popular permissive licenses include Apache 2.0 and MIT.
Why would you use an Open Source LLM?
The main reasons for choosing an open source model are controllability and transparency. Cost is not necessarily a point in their favour: self-hosting is very expensive due to all the ad-hoc tooling and maintenance it requires, and managed services such as AWS Bedrock, OctoAI, Replicate, or similar do not yet match the performance and cost of proprietary offerings.
In general terms, open source models are better for debuggability, explainability, and the ability to extend their capabilities through fine-tuning. By doing this, you can steer the LLM towards the specific needs defined by your problem domain.
Criteria for Evaluating Open Source LLMs
Setting aside the bigger question of whether to use a proprietary or open source LLM, if you have already decided to pursue the open source track, let’s explore a decision framework for choosing the model that best suits your use case and strategy.
What factors should you consider when selecting open source LLMs?
- Cost: Includes inference or fine-tuning costs, which are directly dependent on the LLM size.
- TPOT (Time per output token): Speed of text generation as experienced by the end user. This is a typical baseline to compare LLMs.
- Task performance: Performance requirements of the task and the relevant metrics like precision or accuracy.
- Type of tasks: The kinds of tasks or interactions your use case requires the LLM to handle.
- Model Flavor: There are several types of models depending on how they have been trained (typically called “base” models) and whether they have undergone any further fine-tuning (the most common variants are the “instruct” and “chat” ones).
- Other factors include Licensing, Safety and Developer community activity.
Cost
The costs of building the associated scaffolding, monitoring, observability, deployment, and maintenance (LLMOps) depend primarily on the deployment option you choose: fully self-hosted or through a managed service. Managed services offer anywhere from no to full LLMOps utilities around their inference or fine-tuning services.
Managed services such as AWS Bedrock, Replicate, Octo.ai, Together.AI or Azure are typically the best first options if you haven’t yet found market fit for your use case or simply want to keep ownership and maintenance costs low.
TPOT (Time per output token)
For many tasks, the metric people care about is the time from when the instruction is sent to the LLM until the full response is provided (“latency”). However, that figure depends on the number of input and output tokens, which is why TPOT is a better baseline: it is proportional to the latency the end user ultimately perceives.
If what you’re looking to build is a streaming chat interface, then TTFT (Time to first token) is what you need. It measures how quickly users start seeing the model’s output after entering their query.
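To make these two metrics concrete, here is a minimal sketch of measuring TTFT and TPOT locally with Hugging Face transformers. The model name is only an example, and the streamer yields roughly one decoded chunk per generated token, so the numbers are approximations rather than a rigorous benchmark.

```python
import time
from threading import Thread

from transformers import AutoModelForCausalLM, AutoTokenizer, TextIteratorStreamer

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"  # example model
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

inputs = tokenizer("Explain TPOT in one sentence.", return_tensors="pt").to(model.device)
streamer = TextIteratorStreamer(tokenizer, skip_prompt=True)

# Run generation in a background thread so we can time the streamed chunks
thread = Thread(target=model.generate, kwargs={**inputs, "streamer": streamer, "max_new_tokens": 128})
start = time.perf_counter()
thread.start()

ttft, n_chunks = None, 0
for _ in streamer:                          # each chunk is roughly one decoded token
    if ttft is None:
        ttft = time.perf_counter() - start  # time to first token
    n_chunks += 1
total = time.perf_counter() - start

tpot = (total - ttft) / max(n_chunks - 1, 1)  # average time per output token
print(f"TTFT: {ttft:.2f}s  TPOT: {tpot:.3f}s/token")
```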
Task performance
It is worth mentioning that public benchmarks are not the absolute truth. Evaluation is always necessary and depends on our use case. Benchmarks can easily be gamed and are hard to interpret; therefore, we need to create our own evaluation datasets and processes to understand which LLM best solves our problem.
From a general point of view, we can rely on several public benchmarks. One of the most popular is the LM Evaluation Harness by EleutherAI, which supports over 400 benchmark tasks and can be used through the Hugging Face Hub (a minimal usage sketch follows the list below). The main tasks this benchmark contains are:
- MMLU (“Massive Multitask Language Understanding”): This benchmark evaluates knowledge-intensive tasks across history, biology, mathematics and more than 50 other subjects in a multiple-choice format.
- ARC (“AI2 Reasoning Challenge”): Multiple-choice grade-school science questions that require complex reasoning and world knowledge.
- HellaSwag: The LLM is evaluated for commonsense reasoning. It is asked to predict what might happen next out of the given choices based on common sense.
- TruthfulQA: Tests the LLM’s ability to provide answers that don’t contain falsehoods.
- Winogrande: The LLM is evaluated on fill-in-the-blank questions that test commonsense reasoning.
- GSM8K: Tests the ability to complete grade school mathematics problems involving a sequence of basic arithmetic operations.
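As a minimal sketch of how these tasks can be run against an open model, the harness exposes a Python entry point (this assumes the lm-eval package is installed and the task names match the current harness release; the model id, task list and batch size are illustrative):

```python
import lm_eval

# Evaluate an example open model on a handful of harness tasks
results = lm_eval.simple_evaluate(
    model="hf",                                          # Hugging Face backend
    model_args="pretrained=meta-llama/Meta-Llama-3-8B",  # example checkpoint
    tasks=["arc_challenge", "hellaswag", "truthfulqa_mc2", "gsm8k"],
    batch_size=8,
)

for task, metrics in results["results"].items():
    print(task, metrics)
```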
We also have the LMSYS leaderboard, which uses Elo ratings to assess LLMs by crowdsourcing votes in the so-called Chatbot Arena. This one has been presented several times as the standard, although some voices now claim that human biases might be steering the leaderboard to chat-focused models.
We also have the Hugging Face Open LLM Leaderboard, which is currently populated with open-source LLMs from the community but isn’t 100% reliable: some models are suspected of being contaminated with the test data.
Type of tasks
There are many standard tasks, such as summarisation, classification, creative writing, coding, question answering, chat interfaces, machine translation, sentiment analysis, etc. Unless your use case is very specific or niche, there is usually a public benchmark that assesses the intended task.
Licensing
As explained above, if you want to use the LLM for commercial purposes, make sure the license permits it, such as Apache 2.0, MIT, or the Llama 2/3 licenses (although the latter requires showing end users that the capability has been built with Meta’s Llama 3).
Safety
For most use cases, safety is a vital requirement: you don’t want the LLM to produce content that is toxic or misinformative. Benchmarks such as TruthfulQA, TrustGPT, and Latent Jailbreak aim to measure these negative behaviours.
Developer community activity
Committing to a specific LLM or family of LLMs means adapting your prompting and fine-tuning processes to the way the model was pre-trained and fine-tuned and to the data used for that. You should choose models whose progression is foreseeable.
For example, Meta’s Llama family is a very good example of progress within the open source LLM domain. They have consistently released better models, all based on the same foundational architecture. This evolution of their open source commitment provides confidence in investing in the Llama model family.
Depending on your requirements, some other companies might be more suited. For example, Mistral.AI or Cohere might be good candidates to rely on if you need support for European languages, as they are heavily developing and investing in multilingual LLMs.
Potential to improve
This dimension of LLM evaluation criteria is obviously influenced by the previous one, but a good way to assess it is to look at the number of successful applications of the different LLMs in industry or academia. The variety of ways in which an LLM has been adapted to specific domains (by fine-tuning, advanced prompting, or RAG) indicates its capacity to specialise or generalise to different topics and tasks.
Model Flavor
- Instruct models: Specialized in following instructions in human language. This can be achieved with techniques such as FLAN, a supervised fine-tuning (SFT) technique; better results can be achieved with Reinforcement Learning from Human Feedback (RLHF). Reinforcement Learning from AI Feedback (RLAIF) is also possible: instead of a human choosing among several options, an AI with human principles embedded in it (a “constitutional” LLM) makes the selection. It is almost always better to choose the instruct model over the base version (a minimal prompting sketch follows this list).
- Chat models: These are a type of instruction-tuned models suited for multi-turn dialogs.
- Long context models: Variants with a larger context window.
- Domain-adapted or task-adapted models: Models fine-tuned on specific tasks such as summarisation or financial sentiment analysis.
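As a minimal sketch of why the flavor matters in practice, instruct and chat models expect their prompts wrapped in the chat template they were fine-tuned with, while base models are prompted with plain text and simply continue it. The model name below is only an example:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mistralai/Mistral-7B-Instruct-v0.2"  # example instruct model
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

messages = [{"role": "user", "content": "Summarise the trade-offs of open source LLMs in two sentences."}]

# apply_chat_template wraps the conversation in the special tokens the model was tuned on
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(input_ids, max_new_tokens=128)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```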
The Best 5 Open Source LLMs
In general, AI practitioners and leaders shouldn’t focus only on the best LLM out there. Choosing an LLM or a family of LLMs for your use case is a strategic selection. Companies such as Meta, Google, Mistral AI, 01.ai, Microsoft, Alibaba and Alignment Lab AI (creators of OpenChat) are the most active groups releasing open source LLMs for commercial usage.
For this article, we are going to use one of the leaderboards that is supposedly more aligned with human judgment: the LMSYS leaderboard, which uses Elo ratings from crowdsourced anonymous evaluations to rank LLMs. As of June 2024, the top open source LLMs on that list are the ones covered in the sections below.
A highly valuable resource for comparing Large Language Models (LLMs) is Artificial Analysis. This site provides consistently up-to-date, detailed data points on cost, throughput, latency, and performance across various tasks, making it an essential tool for informed analysis. Complementing that, this section provides key data points about each LLM and insights into other dimensions.
Meta Llama 3 Family
Meta researchers presented the Llama 3 models in this whitepaper in April 2024. They provide state-of-the-art performance compared to all other LLMs at the 8B and 70B parameter scale. The architecture is a standard decoder-only transformer with a tokenizer vocabulary of 128k tokens that encodes language more efficiently, which is one reason for the substantial performance improvement.
The models were pre-trained on 15 trillion tokens collected from publicly available sources, of which around 5% covers more than 30 non-English languages.
Post-training was performed using a combination of supervised fine-tuning, rejection sampling, proximal policy optimisation (PPO), and direct preference optimisation (DPO). Human annotations in these datasets proved to be the main driver of highly aligned instruct models.
Llama 3 70B holds rank 11 on the LMSYS leaderboard, making it the top open source contestant against proprietary models such as GPT-3.5, GPT-4 or Claude Opus and one of the best options available. Even Llama 3 8B holds position 23 in the same ranking, offering great performance across different tasks at a much lower inference cost.
Llama 3 has already been widely adopted by academia and industry. Yale School of Medicine fine-tuned it and created Llama-3 8b Meditron v1.0, achieving impressive results on biomedical question answering. Interestingly, the healthcare sector has adopted this model, as NVIDIA showcases in this article.
As of today, there are more than 11,000 fine-tuned versions of the different Llama 3 models on Hugging Face. You can also find different use cases in this article, such as TherapistAI, and as of this writing, many companies are likely already developing applications on top of Llama 3 models.
Alibaba Cloud Qwen Family
Qwen 2 large language models were built by Alibaba Cloud and released in mid-2024. They have made a great splash on the scene, achieving strong results across benchmarks and rankings. The models have been released in different sizes, from 0.5B to 72B parameters, with base and instruction-tuned variants. These models excel in multilingual capabilities, supporting 29 languages, and leverage grouped-query attention, which provides good inference performance.
There are also multimodal extensions such as Qwen-VL and Qwen-Audio.
Alongside Qwen2-72B, the released models include Qwen2-0.5B, Qwen2-1.5B, Qwen2-7B, and Qwen2-57B-A14B, all four of which carry the Apache 2.0 license. In addition, Qwen2-7B and Qwen2-72B can handle 128k tokens of context.
From the point of view of public benchmarks, Qwen2-72B exhibits better results than Llama-3 70B, and it also has a fine-tuned version that is more aligned with humans, Qwen2-72B-Instruct. In the smaller tier, Qwen2-7B also presents better benchmark results than Llama-3 8B.
Alibaba is particularly proud of the success stories from the companies Rinna and Lightblue, both of which have fine-tuned Qwen-14B for Japanese, achieving great results.
01.AI Yi Family
Yi-1.5 models were introduced in May 2024, offering substantial improvements over their predecessors in areas such as coding, math, reasoning, and instruction-following capabilities. These models come in three sizes: 34B, 9B and 6B, each supporting context lengths of 4K, 16K and 32K tokens. These models use a modified version of the decoder-only transformer architecture based on the Llama implementation.
These models were pre-trained on a high-quality corpus of 3.6 trillion tokens sourced from both English and Chinese text, ensuring a bilingual, high-quality dataset. Pre-training was complemented by fine-tuning on 3 million diverse samples to enhance performance across various tasks. The training data was meticulously filtered using learned filters and a comprehensive deduplication pipeline to ensure the highest quality.
Yi-1.5 models have demonstrated exceptional performance in benchmarks, with the 34B variant outperforming many larger models in tasks such as language understanding, commonsense reasoning, and reading comprehension.
The 34B version is particularly strong on the “Needle-in-a-Haystack” test: Yi-34B-200K’s performance on it improved by 10.5 percentage points, rising from 89.3% to an impressive 99.8%.
Abacus.AI uses Yi-1.5 in its Smaug models, which come in 72B and 34B variants. Also, Upstage uses the Yi-1.5 model in its SOLAR-10.7B model, a pre-trained open-source base LLM that requires fine-tuning for specific requirements.
Moreover, its model card contains great detail and is a good source for learning, too.
Mistral AI Family
Mistral.ai has rapidly emerged as a leading player in the field of artificial intelligence by developing a suite of open-source large language models that balance high performance with efficiency.
The company has introduced other notable models like Mixtral 8x7B and Mixtral 8x22B, which utilize a sparse mixture of experts architecture to optimize performance and cost-efficiency. These models have shown superior performance on benchmarks compared to larger models, demonstrating the effectiveness of Mistral.ai’s innovative approaches. All these models are released under the Apache 2.0 license, ensuring they are fully open-source and can be freely used and modified. This commitment to open-source development allows researchers and developers worldwide to build on Mistral.ai’s cutting-edge work, fostering a collaborative environment for AI innovation.
Mistral Large and its counterparts have been adopted across different sectors, including academic research, enterprise applications, and tech startups. These models are used in tasks ranging from natural language processing and multilingual translation to complex reasoning and coding. Their availability on platforms like AWS and Azure ensures they are accessible to a wide range of users, supporting both commercial and open-source projects.
Some interesting use cases for Mistral models are the following: Brave uses Mixtral 8x7B for its browser AI assistant and for code-specific features in Brave Search, and Perplexity built some of its features and its pplx models on top of Mistral 7B. A wide range of partnerships have also been disclosed, such as those with SAP or Snowflake. Other companies are already using Mistral models, and although they haven’t disclosed whether they rely on the open source offerings, the potential is clearly there.
Microsoft Phi 3 Family
Microsoft’s Phi-3 family represents a series of small language models (SLMs) designed for high performance and efficiency. The Phi-3 models include Phi-3-mini (3.8 billion parameters), Phi-3-small (7 billion parameters), Phi-3-medium (14 billion parameters), and Phi-3-vision, a multimodal model integrating language and vision capabilities. These models are optimized for tasks requiring strong reasoning, coding, and math capabilities, outperforming other models of similar sizes in various benchmarks. The Phi-3 models are available on platforms such as Microsoft Azure and Hugging Face, making them accessible for diverse applications and deployments.
Phi-3-mini is optimized for mobile and compute-limited environments, making it ideal for smart sensors and remote diagnostics applications. Phi-3 models, integrated into Azure AI, support generative tasks such as text summarization and code generation, with Phi-3-vision adding visual data processing. These models generate medical reports and personalized learning content in healthcare and education. Enterprises use Phi-3 for data analytics and customer support, leveraging its ability to handle large context windows efficiently.
Given that the main goal of these models is to be deployed in local environments, latency is highly variable and depends on several factors, such as prompt length, the inference engine, the floating-point precision, and the hardware on which they run. More information can be found in this post from ONNX.
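One way to keep memory and latency manageable in such constrained environments is to load a quantized version of the model. Below is a minimal sketch using 4-bit quantization via bitsandbytes with transformers; the settings are illustrative, and the actual numbers will still depend on your hardware and inference engine:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "microsoft/Phi-3-mini-4k-instruct"
quant_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quant_config,  # 4-bit weights cut memory roughly 4x vs fp16
    device_map="auto",
    trust_remote_code=True,            # needed on older transformers releases
)

prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "List three uses of small language models on edge devices."}],
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

print(tokenizer.decode(model.generate(prompt, max_new_tokens=120)[0], skip_special_tokens=True))
```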
Some use cases of this model have already been published, especially in hardware-constrained environments, such as agriculture. ITC has been using Phi models since Microsoft launched them.
Bonus: Nvidia Nemotron-4 340B Family
The newest model, released by NVIDIA on June 14th, is aimed especially at synthetic data generation. It has been trained on more than 50 natural languages and more than 40 programming languages. These models aim to address the challenge of acquiring high-quality training data, which is often costly and hard to obtain. By offering a permissive open model license, Nemotron-4 340B provides developers with a free and scalable way to create synthetic data for LLM training. The Nemotron-4 340B family includes base, instruct, and reward models that collectively form a pipeline for generating and refining synthetic data. These models are optimized for use with NVIDIA NeMo, an open-source framework that facilitates end-to-end model training, including data curation and evaluation. Additionally, they are designed for efficient inference using the open-source NVIDIA TensorRT-LLM library.
In terms of performance, the Nemotron-4-340B models excel on the MMLU benchmark with a 0.78 score, a key measure of their accuracy and capability. Specifically, Nemotron-4-340B-Base is competitive with other leading open access models, such as Llama-3 70B and Mixtral 8x22B, on tasks like ARC-Challenge and BigBench Hard. Nemotron-4-340B-Instruct surpasses corresponding instruct models in terms of chat and instruction-following capabilities, while Nemotron-4-340B-Reward achieves top accuracy on RewardBench, outperforming even proprietary models like GPT-4o-0513 and Gemini 1.5 Pro-0514. This high level of performance, combined with the ability to generate over 98% of training data synthetically, underscores the potential of the Nemotron-4 340B family to advance AI research and applications across various domains.
We will have to wait until we see exciting applications of this model to further enhance open source large language models!
How to Operationalise Open Source LLMs
To get started with open-source large language models (LLMs), one should ideally explore platforms like Hugging Face first. It offers a collection of models for different tasks and is currently the industry standard. New models and weights are published there first. These platforms provide interfaces for interactive testing and evaluation, allowing users to input prompts and observe the outputs in real-time.
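A first interactive test of a Hub model can be as small as the sketch below; the checkpoint is only an example, and any hosted open model works the same way:

```python
from transformers import pipeline

# Pull an example open model from the Hugging Face Hub and try a prompt
generator = pipeline("text-generation", model="Qwen/Qwen2-1.5B-Instruct", device_map="auto")
print(generator("The main benefit of open source LLMs is", max_new_tokens=60)[0]["generated_text"])
```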
Operationalising large language models involves several mindset shifts if you are coming from traditional machine learning and MLOps. The size of the artifacts that need to be managed in this case really matters. The models are much more difficult to handle as they need utilities and tools for fetching weights, developing prompts or fine-tuning, iterating and evaluating changes, deploying them, managing all the surrounding components, and, lastly, monitoring them.
The total cost of ownership plays a pivotal role when working with open source models and making decisions about their feasibility. The return on investment should be projected very clearly for the different degrees of personalisation or customisation we want to apply: training a model from scratch, fine-tuning it with internal data for a specific use case, or simply doing in-context learning or prompt engineering to adapt it to the domain at hand all require different approaches. Companies should be very clear about the potential gains of each of these options, because the investments they require are drastically different.
On training
Training a large language model from scratch, even when leveraging open source architectures, is tremendously expensive. LLMs are typically trained on trillions of tokens, requiring vast GPU compute over several days. Acquiring data, cleaning it, and ensuring its quality are key jobs, as many research labs in this field have shared. Since this also requires a thoughtful strategy and vision, it is only within reach of a handful of companies and research labs with access to large pools of AI talent, enormous amounts of data, and scalable infrastructure.
On fine-tuning and prompt engineering
On the other hand, fine-tuning and/or in-context learning is a different story. By now it’s clear that fine-tuning is widely accessible. Platforms such as OpenAI, Predibase, Labelbox, or SageMaker Ground Truth hide the complexities of the fine-tuning process itself: developers upload their fine-tuning datasets while the platform does all the heavy lifting. That said, the real complexity of fine-tuning lies in the dataset’s quality. Research studies find that fine-tuned models outperform base or instruct ones with few (fewer than 5k or 10k, depending on the application) high-quality samples.
This is quite encouraging! Nevertheless, achieving a properly fine-tuned model requires a thorough, scientific evaluation process: a system to select the fine-tuning samples, curate them, fine-tune the model, and evaluate it with the appropriate mix of human judgment and automatic evaluations using custom-defined rubrics. Although the effort is far smaller than training a model from scratch, fine-tuning still requires engineers with knowledge of statistical evaluation and AI in general to build this system or procedure.
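To give a feel for what the fine-tuning step itself looks like when you run it yourself rather than through a managed platform, here is a minimal parameter-efficient (LoRA) sketch with Hugging Face peft. The dataset file, base model and hyperparameters are illustrative rather than a recipe:

```python
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer, DataCollatorForLanguageModeling,
                          Trainer, TrainingArguments)

model_id = "meta-llama/Meta-Llama-3-8B"            # example base model
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# Wrap the base model with low-rank adapters so only a tiny fraction of weights are trained
lora = LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM")
model = get_peft_model(model, lora)

# A small, curated set of high-quality samples (one "text" field per example)
dataset = load_dataset("json", data_files="finetune_samples.jsonl", split="train")
dataset = dataset.map(lambda x: tokenizer(x["text"], truncation=True, max_length=1024),
                      remove_columns=dataset.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="llama3-lora", per_device_train_batch_size=2,
                           num_train_epochs=1, learning_rate=2e-4),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
trainer.save_model("llama3-lora")  # saves the adapters only, orders of magnitude smaller than the base model
```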
In-context learning, also called prompt engineering, is the simplest technique to speed up implementing an LLM-powered application. It has shown a lot of potential value, and it is advisable to start with this form of adaptation until you have found the “market fit” of the solution or application. The effort required is much lower than training or fine-tuning, while still providing very good results on the final task.
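As a minimal sketch of in-context learning, a few labelled examples in the prompt can steer a general model toward a domain task (here a hypothetical support-ticket triage case) without touching its weights:

```python
from transformers import pipeline

few_shot_prompt = """Classify the support ticket as 'billing', 'technical' or 'other'.

Ticket: "I was charged twice this month."
Label: billing

Ticket: "The API returns a 500 error on every request."
Label: technical

Ticket: "{ticket}"
Label:"""

# Example open model; a larger instruct model would usually be more reliable here
classifier = pipeline("text-generation", model="Qwen/Qwen2-1.5B-Instruct", max_new_tokens=3)
completion = classifier(few_shot_prompt.format(ticket="My invoice PDF will not download."),
                        return_full_text=False)[0]["generated_text"]
print(completion.strip())
```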
Just before deploying your first LLM application into production, once its value has been proven, and whether you have done fine-tuning or just prompt engineering, there is a set of needs you have to meet. How will you iterate on the quality of the LLM outputs? How will you know that your LLM is not hallucinating and is actually providing the expected value? This is where applying LLMOps techniques helps.
A key factor is setting up automated evaluation systems using LLM critics or evals aligned with human judgment. This enables a more important goal: a development environment that allows for quick iteration-to-feedback cycles. With this setup, we can quickly see the impact of any change applied to the artifacts of the application (improvements to the fine-tuning dataset, prompt iterations, changes in temperature, the choice of decoding algorithm, the quantization technique applied, the mixture of LLMs, the RAG or knowledge base used, and the data points chosen). All in all, compared to more traditional software development, it is like building a CI/CD pipeline with unit and functional tests to assess the system’s accuracy.
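A minimal sketch of such an automated evaluation loop, using an open model as the critic with a simple rubric, could look like the following. The judge model, rubric and scoring scale are illustrative and should be calibrated against human judgments before the scores are trusted:

```python
from transformers import pipeline

judge = pipeline("text-generation", model="meta-llama/Meta-Llama-3-8B-Instruct", max_new_tokens=4)

RUBRIC = ("You are grading an answer for factual accuracy and relevance. "
          "Reply with a single score from 1 (poor) to 5 (excellent).\n\n"
          "Question: {q}\nAnswer: {a}\nScore:")

# Outputs of the application under test, paired with their input questions
eval_set = [
    {"q": "Which licenses allow commercial use of an open source LLM?",
     "a": "Permissive licenses such as Apache 2.0 or MIT."},
    # ... more samples from your evaluation dataset
]

scores = []
for sample in eval_set:
    out = judge(RUBRIC.format(q=sample["q"], a=sample["a"]),
                return_full_text=False)[0]["generated_text"]
    digits = [c for c in out if c.isdigit()]
    scores.append(int(digits[0]) if digits else 0)

# Track this number on every prompt, model or dataset change, like a CI test suite
print("mean judge score:", sum(scores) / len(scores))
```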
Conclusion
The burgeoning landscape of open-source large language models (LLMs) in 2024 marks a pivotal advancement in the integration and accessibility of AI technologies across various industries. This article has underscored the profound impact these models have on enhancing global productivity and the strategic advantages they offer companies seeking to innovate and stay ahead in competitive markets. Open-source LLMs promote a culture of transparency and collaboration and provide significant cost benefits and adaptability, which are crucial for companies aiming to tailor AI tools to their specific operational needs.
Furthermore, the discussion highlights the strategic benefits of open-source over proprietary models, particularly in terms of cost-effectiveness, flexibility, and the ability to fine-tune and manage AI solutions tailored to specific business outcomes. As the technology evolves, it becomes increasingly clear that adopting open source LLMs can propel businesses towards more innovative, efficient, and customizable AI implementations. This, coupled with the community-driven enhancements and the ethical advancement of AI, such as striving towards Artificial General Intelligence (AGI), positions open-source LLMs as not just tools of technological evolution but as catalysts for broader societal benefits in the digital age.
The next step in exploring open source LLM is to understand how to train and fine-tune an LLM for niche problem domains and with proprietary data. This article presents the key information AI practitioners and leaders should know when fine-tuning LLMs.
FAQs
- What are the benefits of open source large language models (LLMs)?
Open source LLMs offer several advantages, including transparency, collaboration, flexibility, and cost-effectiveness. They allow companies to customize and fine-tune the models for their specific use cases, providing a competitive edge. Additionally, open source models promote collective problem-solving and innovation within the AI community.
- How do I choose the right open source LLM for my project?
When selecting an open source LLM, consider factors such as cost (inference and fine-tuning), performance on relevant tasks (benchmarks like MMLU, ARC, HellaSwag), latency requirements, licensing terms, developer community activity, and potential for improvement through fine-tuning or prompting.
- What are the top open source LLMs in 2024?
Some of the leading open source LLMs in 2024 include Meta’s Llama 3 family, Alibaba Cloud’s Qwen 2 models, 01.AI’s Yi 1.5 family, Mistral AI’s models (Mistral 7B, Mixtral 8x7B, Mistral 8x22B), and Microsoft’s Phi 3 family. NVIDIA’s Nemotron-4 340B is also a notable addition for synthetic data generation.
- How can I operationalize open source LLMs in my organization?
Operationalizing open source LLMs involves considering the total cost of ownership, choosing between training from scratch, fine-tuning, or prompt engineering, setting up automated evaluation systems, and establishing a development environment for iterative improvements. Platforms like Hugging Face can help streamline this process.
- What are the potential use cases for open source LLMs?
Open source LLMs can be applied to a range of tasks, such as natural language processing, question answering, text summarization, creative writing, machine translation, sentiment analysis, coding, and complex reasoning. They are being adopted across industries like healthcare, education, customer support, and data analytics.