While many applications rely on LLM APIs, local deployment of LLMs is appealing due to potential cost savings and reduced latency. Privacy requirements or a lack of internet connectivity might even make it the only option.
The major obstacle to deploying LLMs on premises is their memory requirements, which can be reduced through optimization techniques like quantization and Flash Attention. If inference latency is not a concern, running LLMs on CPUs can be an attractive low-cost option.
Libraries and frameworks like Llama.cpp, Ollama, and Unsloth help set up and manage LLMs. Best practices for building local LLM applications include abstracting the model and using orchestration frameworks or routers.
While LLM APIs offer quick access to powerful large language models, they’re not always the best option, whether due to cost, privacy concerns, or the need for customization. In many cases, running a model locally is more appealing, or even the only way to meet your requirements. However, this introduces operational challenges like sourcing hardware, choosing the best way to run the models, and implementing the necessary observability and monitoring. How can you make the right decision in these situations?
In this post, we’ll explore strategies for selecting the most suitable local model and running it locally, even when resources are tight. We’ll cover how to optimize memory usage, accelerate inference, and leverage fine-tuning techniques like LoRA to maximize model performance for your specific application.
Why deploy LLMs locally?
Let’s say you’re developing an LLM system for a customer service platform. Just like in traditional software architecture, you’ll need to decide where to run your components. The main options would be running the system on-premise, in the cloud, or interfacing with a fully managed LLM service.
As with traditional systems, you must weigh the trade-offs of each option, focusing on cost, privacy, latency, and the complexity of the systems you need to implement and maintain. This is the process we will be going through in the following section to answer the question: When do I decide to run LLMs locally?
Cost
When using pre-trained LLMs like GPT-4o or Claude Sonnet through an API, the first reaction of many users is that it’s really expensive. But is it? The current pricing for GPT-4o is around $10 per million tokens. Initially, this seems like a lot. So, the logical alternative would be to rent an instance from a cloud service provider and host our own LLM, right?
Well, let’s run the math: when running Llama 3.1 70B on a highly optimized AWS runtime, you can expect to pay around $13 per million tokens, assuming you manage to keep your instance reasonably busy. If you have a lot of traffic and can constantly run large batches of samples, the price might go as low as $2 per million tokens. However, even if this is slightly cheaper than GPT-4o, Llama 3.1 70B’s accuracy is closer to that of the closed-source GPT-4o mini, which beats it in many cases and costs only 60 cents per million tokens to run!
Based on this back-of-the-envelope estimate, we see that while GPT-4o models are expensive, hosting your own models in the cloud is just as costly. The biggest issue seems to be the rental prices for hardware being too high. So, why not use your own hardware?
This is the real advantage of running LLMs locally: the hardware is already there, or it is a single purchase that will cost you nearly $0 to maintain (if you’re willing to invest the time and build the knowledge internally, that is). Your local GPU is also not subject to demand spikes that inflate cloud costs or GPU shortages that make virtual machines in the cloud not only expensive but sometimes wholly unavailable.
Privacy
A very obvious argument in favor of running LLMs locally is that, in some cases, there is no alternative. While we mostly hear about the fancy cloud machines and LLM APIs, many businesses looking to adopt LLMs have to run everything on-premise due to internal policies or even laws that forbid them from sending sensitive data to a remote service.
In this case, knowledge of how to run local LLMs is crucial, as it will form the foundation of the LLM infrastructure that needs to be built within the organization.
Latency
In traditional software architectures, to minimize latency, we usually aim to run the code as close to the end user as possible. We run apps on smartphones and render dynamic elements of websites client-side. It follows then that the same should be true for LLMs, right?
Well, it’s complicated. If you had a supercomputer at home running GPT-4 instead of relying on an OpenAI data center, your latency would indeed drop. However, network latency is such a small part of the overall latency in an LLM application that you would barely notice the difference.
Additionally, you probably don’t have a supercomputer at home, so the increase in latency from running on inferior hardware will probably be higher than whatever you gained from running locally.
Based on this rough analysis, we can already conclude that the only case in which running models locally makes sense in terms of latency is with very small models. Since they’re fast to compute, eliminating network latency can provide a noticeable improvement in the user experience. As an example of such a small model, Google recently released Gemini Nano on its Pixel smartphones.
What does it take to run LLMs locally?
The common perception regarding running LLMs is that this task requires powerful and expensive hardware. For the most part, this is true. However, recent advancements in optimization techniques, such as quantization and attention mechanism optimizations, have made it possible to run LLMs locally, even on a CPU.
Memory requirements
The one thing you won’t be able to optimize your way out of is memory. You need memory to store the model’s weights and the data you feed to the model. If you’re running on a CPU, you need RAM, and if you’re on a GPU, you need VRAM.
In most scenarios, the factor limiting the model size will be the memory you have available locally. But how much do you really need? The rule of thumb is to multiply the number of bytes per weight by the number of model weights.
To see this in action, let’s take a look at the memory requirements of loading the recent Llama 3.1 model family out of the box with no optimizations:
Number of parameters | Memory needed (32-bit / 4-byte inference)
8B | ~32 GB
70B | ~280 GB
405B | ~1,620 GB
Out of the box, no single consumer-grade GPU can run even the smallest 8-billion-parameter model (at the time of writing, the consumer Nvidia GPU with the most VRAM is the RTX 4090, with 24 GB).
Plus, the estimated numbers don’t even include the memory required for the context passed to the model. Transformer models use a self-attention mechanism that requires each token to attend to every other token in the context. This creates a quadratic memory complexity in relation to the context size.
As a result, the context memory can very quickly become larger than the model weights:
Model size | 1k tokens | 16k tokens | 128k tokens
The numbers in the table are taken from Hugging Face’s analysis of Llama 3.1. In this example, the measurements were taken when running in fp16 precision, meaning that for the 8B model with a full context window of 128 thousand tokens, the context takes up just as much memory as the loaded model weights!
Yet, you’ll often see people run and even fine-tune these models on 15GB of VRAM. So how do they do it?
Resource optimizations
As we’ve just seen, there are two main components to the amount of memory you need to run an LLM: model weights and context window. So, what can we do to minimize it?
Quantization
Quantization is a technique that reduces the precision of the model weights, allowing you to run LLM models with significantly less memory. By converting the model weights from full floating-point precision (that’s typically 32-bit) to lower precision formats (such as 16-bit or even 4-bit), we can decrease memory usage and increase the speed of computations. This allows us to fit larger models into our available memory.
However, this process is a lossy compression method, meaning there will be performance degradation. A good analogy that explains this process is JPEG compression. Similarly to quantizing model weights, we are performing lossy compression on a matrix of values. If we don’t overdo it, the final result will be almost indistinguishable from the original image.
Based on recent research at the University of Washington, it seems like it’s always preferable to use bigger models quantized to 4-bit precision rather than less compressed smaller models. In other words, it’s a good approach to default to 4-bit precision for inference.
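For illustration, here is a minimal sketch of loading a model in 4-bit precision with the bitsandbytes integration in Hugging Face Transformers. It assumes a CUDA GPU and the bitsandbytes package installed; the model ID is just an example:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Meta-Llama-3-8B"  # example model; any causal LM from the Hub works

# 4-bit NF4 quantization keeps the weights at roughly half a byte per parameter
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",  # place layers on available devices automatically
)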
Let’s now look back at the Llama 3.1 family models and their memory requirements from loading their weights alone with different levels of quantization:
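Since the rule of thumb is simple arithmetic, here is a quick sketch that prints these estimates (weight memory only, not accounting for context or other overhead):

# Back-of-the-envelope weight memory: parameters x bytes per weight
LLAMA_31_SIZES = {"8B": 8e9, "70B": 70e9, "405B": 405e9}
BYTES_PER_WEIGHT = {"fp32": 4, "fp16": 2, "int8": 1, "int4": 0.5}

for name, params in LLAMA_31_SIZES.items():
    estimate = ", ".join(
        f"{prec}: {params * nbytes / 1e9:.0f} GB"
        for prec, nbytes in BYTES_PER_WEIGHT.items()
    )
    print(f"{name} -> {estimate}")

# e.g., 8B -> fp32: 32 GB, fp16: 16 GB, int8: 8 GB, int4: 4 GB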
Flash attention
Flash Attention is an optimization technique that accelerates the self-attention mechanism in LLM models. It enables efficient computation of attention scores by reducing the memory footprint and speeding up the process.
Flash Attention leverages specialized algorithms to compute the attention scores in a way that minimizes the amount of data held in memory, allowing larger context windows to be processed without exhausting the available memory.
The Flash Attention optimization turns the quadratic scaling of memory requirements with context length into a more linear scaling, meaning that the bigger your context window, the more impactful this optimization becomes.
If you’re thinking about going over 4,000 to 8,000 tokens, whether or not to adopt Flash Attention isn’t even a question. The accuracy differences are negligible, and the benefits in memory footprint and speed are massive. In the newest Flash-Attention 3 release, the speedup is about three-fold, which is extremely useful, especially for larger models.
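In the Hugging Face Transformers library, for example, Flash Attention 2 can typically be enabled with a single argument. This is a sketch; it assumes the flash-attn package is installed and a sufficiently recent GPU:

import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B",             # example model
    torch_dtype=torch.bfloat16,               # Flash Attention requires fp16/bf16 weights
    attn_implementation="flash_attention_2",  # fall back to "sdpa" if flash-attn isn't installed
    device_map="auto",
)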
GPU vs. CPU
By default, if you’re running an LLM, you’ll always want to have a GPU. However, this might not always be possible. Maybe you can’t afford one, or the device you are developing your application for doesn’t have one. So, what should you know about CPU inference?
One thing is for sure: it will be slow. CPUs are simply not made for the scale of tensor operations that an LLM requires.
However, the biggest limiting factor for CPUs, too, is the amount of memory. Since RAM is extremely affordable, you can run bigger models very cheaply on a CPU, as long as you don’t care too much about latency.
To put this into perspective, at the time of writing, Nvidia’s H100 GPU with 80 GB of VRAM costs north of $25,000. You can buy a high-end server with 256 GB of RAM for less than that.
Finding the best LLM your hardware can handle
As we discussed, memory is the limiting factor when it comes to running LLMs locally. Hence, the first thing you need to do is find out how much of it you have. A good rule of thumb: take your available memory in gigabytes, double it, and subtract roughly 30% for the context and other model-related data. The result is a rough upper bound on the number of parameters (in billions) you can get away with running at 4-bit precision.
Let’s run through a few examples of the models you can run at different amounts of memory:
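As a rough sketch of this heuristic (the memory values and the 30% overhead are estimates, not hard limits):

def max_params_billion(memory_gb: float) -> float:
    # Double the memory (4-bit = half a byte per weight), then subtract ~30% overhead
    return memory_gb * 2 * 0.7

for mem in (8, 16, 24, 48, 80):
    print(f"{mem} GB of (V)RAM -> roughly {max_params_billion(mem):.0f}B parameters at 4-bit")

# 8 GB -> ~11B, 16 GB -> ~22B, 24 GB -> ~34B, 48 GB -> ~67B, 80 GB -> ~112B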
Keep in mind that these values are just estimates. You might find that slightly bigger models still fit, especially if you’re willing to spend time learning about, trying, and evaluating different optimizations.
The best libraries for running LLMs locally
Within the large language model inference ecosystem, many libraries have come to specialize in running these models on hardware that is as accessible as possible.
This is especially true for Apple Silicon chips, whose unified memory architecture treats RAM and VRAM very similarly, giving laptop GPUs access to surprisingly large amounts of memory.
Llama.cpp
Llama.cpp was one of the first libraries purpose-made for running LLMs locally. Its objective was to provide a native C/C++ implementation of an LLM inference server with no external dependencies.
Llama.cpp leans heavily on the previous work of its developer, Georgi Gerganov, on the GGML tensor library. Over time, its simplicity as a pure C implementation made GGML a good platform to support many different backends and operating systems. CUDA support was added for Nvidia GPUs, Metal for Apple’s chips, and many other platforms are supported as well. This makes it possible to run open-source models on basically any kind of accelerator.
Due to the usage of the GGML library, llama.cpp supports only a single model format, the “GPT-Generated Unified Format,” or GGUF for short. This binary format defines how to save model weights and has become just as popular as llama.cpp. Hence, you’ll struggle to find any model release that doesn’t include at least one version in this format. Usually, you’ll even be able to choose between different precision levels.
In terms of features, llama.cpp provides three main ways to interact with the LLMs: via a command line interface, through code (with bindings for many popular programming languages), and through a simple web server.
With a single command, you can launch a ChatGPT-like interface:
llama-cli -m your_model.gguf -p "You are a helpful assistant" -cnv
This command loads your GGUF model and starts a conversation command line interface with an initial “You are a helpful assistant” system prompt.
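If you’d rather use the web server mentioned above, a command along these lines (the port is just an example) serves the same model over HTTP, including an OpenAI-compatible endpoint:

llama-server -m your_model.gguf --port 8080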
Hugging Face Transformers
Hugging Face is a legend in the machine-learning space. With the rising popularity of LLMs, the Hugging Face Hub has become the place to train, run, evaluate, and share models.
Their Transformers Python library is particularly useful, especially if you’re a more advanced user looking to build fine-tuning or complex inference with the ability to customize the code that loads, runs, and trains your models. Many higher-level frameworks build on top of the Transformers library or pull models from the Hugging Face Hub.
With the high-level abstractions provided by the Transformers library, loading and using a model is straightforward:
import transformers
import torch

# Build a text-generation pipeline that loads the model in bfloat16
# and places it on the available device(s) automatically
model_id = "meta-llama/Meta-Llama-3-8B"
pipeline = transformers.pipeline(
    "text-generation",
    model=model_id,
    model_kwargs={"torch_dtype": torch.bfloat16},
    device_map="auto",
)

pipeline("Hey how are you doing today?")
Ollama
Ollama is built on top of llama.cpp and contains no inference code itself. It focuses on making the experience of running LLMs locally very simple and user-friendly. It comprises a CLI, a web server, and a well-documented model library that contains the most popular GGUF models together with their respective prompt templates.
Ollama gives you all you need to run and use a model. For example, running a Llama 3.2 model is as simple as running:
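ollama run llama3.2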
This command will handle the download, build a local cache, and run the model for you.
LM Studio
LM Studio is a user-friendly application designed to run LLMs locally. It offers a graphical interface that works across different platforms, making the tool accessible for both beginners and experienced users. Under the hood, LM Studio uses llama.cpp as its backend, providing all its features and flexibility.
Key features include easy model management, a chat interface for interacting with models, and the ability to run models as local API servers compatible with OpenAI’s API format. Users can download various LLMs, including open-source options, and adjust inference parameters to optimize performance.
Getting started with LM Studio is simple. You can download the application, choose a model from the built-in catalog, and start chatting within minutes. The intuitive interface guides you through the process, from model selection to launch, eliminating the need for complex command-line operations. This ease of use makes LM Studio an excellent choice for anyone looking to experiment with LLMs without diving deep into technical details.
Unsloth
Unsloth is a Python framework that focuses on running and training LLMs. It works with the Hugging Face Transformers library, reimplementing the low-level kernels necessary to run LLMs in more efficient ways to save GPU compute and memory. This makes Unsloth extremely useful when you need to run a larger model on limited hardware.
The gains from using this library vary, but typically, you can achieve an extra 20% memory saving on top of already well-optimized Hugging Face Transformers code. While this might not sound significant, it can make the difference between a model fitting on your hardware or not, or let you run a much bigger model on the same hardware, resulting in better model performance for your use cases.
As an impressive example, the developers provide a Colab notebook that allows you to fine-tune a Mistral 7B model on Google Colab’s free tier with only 15GB of VRAM. To put this into perspective, running Mistral 7B in a completely default and unoptimized way would require around 30 GB of VRAM.
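As a minimal sketch, loading a pre-quantized 4-bit model with Unsloth looks roughly like this (the model name and sequence length are illustrative; check Unsloth’s documentation for the checkpoints they currently provide):

from unsloth import FastLanguageModel

# Load a 4-bit quantized model with Unsloth's optimized kernels
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/mistral-7b-bnb-4bit",  # example pre-quantized checkpoint
    max_seq_length=2048,
    load_in_4bit=True,
)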
What really sets Unsloth apart is its focus on optimization and efficiency. The developers frequently update the code to improve performance. Each release introduces incremental optimizations that are always measured together with the accuracy of the models running with these optimizations to make sure that the performance of the model doesn’t degrade when it’s optimized for memory or inference speed.
When benchmarking common LLM workloads, Unsloth consistently achieves the lowest memory footprint and the best performance, making it the best option across the board. However, some features of this library (like multi-GPU support) are locked behind the closed-source “Pro” tier, which requires a subscription.
WebLLM
WebLLM brings LLM inference into the web browser, serving as a window into the future of running LLMs locally. It leverages the WebGPU API to interact with the local machine’s GPU and provides functionalities that allow developers to embed local LLM inference into their web applications. Currently, the WebGPU API is not yet supported by all popular browsers.
Besides cost savings due to not having to host LLMs in the cloud, this approach can be particularly valuable in scenarios where users might want to share sensitive information. Imagine a law firm creating a web application using WebLLM for document analysis. Clients could want sensitive legal documents to be processed by a local LLM running on their own machines. This approach ensures privacy, as confidential information never leaves the client’s computer.
Best practices for building LLM apps with local models
At first, building a production LLM application with the main model running on a local machine might not seem like a great idea. If your application has a lot of traffic, it probably isn’t, since you’ll very quickly run into issues with availability and scalability. However, it’s an avenue worth exploring in many other scenarios.
Hence, before we close, we’ll share some advice and tips for teams embarking on the journey of building an LLM application on top of a local model.
Abstract the model
Since OpenAI was the first large-scale LLM provider, a lot of people built apps around the OpenAI models. As a result, the OpenAI API specification became the de facto standard.
Most LLM libraries provide a way to host models behind an OpenAI-compatible API. For example, the Ollama developers provide a guide for setting up a local model as a drop-in replacement for an OpenAI model.
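As a sketch of what this looks like in practice, the standard OpenAI Python client can be pointed at a local Ollama server (this assumes Ollama is running on its default port and the model has already been pulled):

from openai import OpenAI

# Any non-empty API key works; Ollama ignores it
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

response = client.chat.completions.create(
    model="llama3.2",
    messages=[{"role": "user", "content": "Hey, how are you doing today?"}],
)
print(response.choices[0].message.content)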
Why is this important? Even though running models locally can be fun, you might want to switch to using an LLM hosted by a third party later to handle more requests. Or you might have a team developing the user-facing parts of an application with an API while a different team builds the LLM inference infrastructure separately.
Use orchestration frameworks or routers
Even though it might seem easiest to make direct calls to your local LLMs or access them through local web servers, orchestration frameworks or routers can be extremely valuable in many situations.
If you’re looking to build agents or RAG workflows, you’ll want to use a framework like LlamaIndex or LangChain, the latter of which ships with connectors for Ollama and llama.cpp.
If you don’t need complex orchestration logic, LLM routers might be all you need. Many of them also support the local libraries we discussed, and they’re extremely useful for detecting errors, formatting outputs, logging calls, and even switching to different models later on. Examples of this type of router are LiteLLM and LLMStudio.
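For example, with LiteLLM, a call to a local Ollama model goes through the same interface you would use for any hosted provider, which makes switching models later a small change (a sketch, assuming Ollama is running locally with the model pulled):

from litellm import completion

response = completion(
    model="ollama/llama3.2",            # swap in a hosted model later by changing this string
    messages=[{"role": "user", "content": "Hey, how are you doing today?"}],
    api_base="http://localhost:11434",  # local Ollama server
)
print(response.choices[0].message.content)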
Conclusion
Running LLMs locally has become increasingly accessible thanks to model optimizations, better libraries, and more efficient hardware utilization. Whether you choose to use llama.cpp for its simplicity, Ollama for its user-friendliness, LM Studio for its UI, or more advanced solutions like Unsloth for maximum optimization, there’s now a solution for almost every use case and hardware configuration.
Above any technical considerations when running LLMs locally, the most important thing is to use something that works for you and your hardware, and if you’re just running these models as a personal project to tinker with LLMs, don’t forget to have some fun!