What We’ve Learned From A Year of Building with LLMs


Recently, a couple of friends and I threw around the idea of writing about our experience with LLMs and AI Engineering (image below). One thing led to another, and that’s how this three-part series came about. Here, we share our hard-won lessons and advice to make building with LLMs easier. This is also cross-posted on O’Reilly. We hope you’ll find it useful!

Image: Behind the scenes of how this write-up started


It’s an exciting time to build with large language models (LLMs). Over the past year, LLMs have become “good enough” for real-world applications. The pace of improvements in LLMs, coupled with a parade of demos on social media, will fuel an estimated $200B investment in AI by 2025. LLMs are also broadly accessible, allowing everyone, not just ML engineers and scientists, to build intelligence into their products. While the barrier to entry for building AI products has been lowered, creating ones that are effective beyond a demo remains deceptively difficult.

We’ve identified some crucial, yet often neglected, lessons and methodologies informed by machine learning that are essential for developing products based on large language models (LLMs). Awareness of these ideas can give you a competitive advantage against most others in the field without requiring ML expertise! Over the past year, the six of us have been building real-world applications on top of LLMs. We realized that there was a need to distill these lessons in one place for the benefit of the community.

We come from a variety of backgrounds and play different roles, but we’ve all experienced firsthand the challenges that come with using this new technology. Two of us are independent consultants who’ve helped numerous clients take LLM projects from initial concept to successful product, and have seen the patterns that determine success or failure. One of us is a researcher studying how ML/AI teams work and how to improve their workflows. Two of us are leaders on applied AI teams: one at a tech giant and one at a startup. Finally, one of us has taught deep learning to thousands and now works on making AI tooling and infrastructure easier to use. Despite our differing experiences, we were struck by the consistent themes in the lessons we learned, and we’re surprised these insights aren’t more widely discussed.

We’ve spent the past year getting our hands dirty and gaining valuable lessons, often the hard way. While we don’t claim to speak for the entire industry, we want to share what we’ve learned to help you avoid missteps and stay on the path to success.

Here, we share some advice and lessons for anyone building products with LLMs, organized into three sections:

  • Tactical: The nuts and bolts of working with LLMs. We share best practices and common pitfalls around prompting, setting up retrieval-augmented generation, applying flow engineering, and evaluation and monitoring. Whether you’re a practitioner building with LLMs or a hacker working on weekend projects, this section was written for you.
  • Operational: Next, we take a step back and discuss the day-to-day concerns and organizational aspects of building with LLMs. We share how we think about data (a lot!), our mental model for working with models and designing products, and how to build a team that can wield LLMs effectively. If you’re a product/technical leader or a practitioner looking to deploy sustainably and reliably, this section is for you.
  • Strategic: Finally, we take a long-term view and consider where the business should invest. We share our early thinking on when to use model APIs vs. when to finetune and self-host models, how we think about the LLM product lifecycle, infrastructure investments, and how we think about risk and scaling from 1 to N. This is written for founders and senior leaders looking to the future.

Our goal is to make this a practical guide to building successful products around LLMs, drawing from our own experiences and pointing to examples from around the industry.

Ready to dive in? Let’s go.

Table of contents

Tactical

Prompting

  • Focus on getting the most out of fundamental prompting techniques
  • Structure your inputs and outputs
  • Have small prompts that do one thing, and only one thing, well
  • Craft your context tokens

Information Retrieval / RAG

  • RAG is only as good as the retrieved documents’ relevance, density, and detail
  • Don’t forget keyword search; use it as a baseline and in hybrid search
  • Prefer RAG over fine-tuning for new knowledge
  • Long-context models won’t make RAG obsolete

Tuning and optimizing workflows

  • Step-by-step, multi-turn “flows” can give large boosts
  • Prioritize deterministic workflows for now
  • Getting more diverse outputs beyond temperature
  • Caching is underrated
  • When to finetune

Evaluation & Monitoring

  • Create a few assertion-based unit tests from real input/output samples
  • LLM-as-Judge can work (somewhat), but it’s not a silver bullet
  • The “intern test” for evaluating generations
  • Overemphasizing certain evals can hurt overall performance
  • Simplify annotation to binary tasks or pairwise comparisons
  • (Reference-free) evals and guardrails can be used interchangeably
  • LLMs will return output even when they shouldn’t
  • Hallucinations are a stubborn problem

Operational

Data

  • Check for development-prod skew
  • Look at samples of LLM inputs and outputs every day

Working with models

  • Generate structured output to ease downstream integration
  • Migrating prompts across models is a pain in the ass
  • Version and pin your models
  • Choose the smallest model that gets the job done

Product

  • Involve design early and often
  • Design your UX for Human-In-The-Loop
  • Prioritize your hierarchy of needs ruthlessly
  • Calibrate your risk tolerance based on the use case

Team & Roles

  • Focus on Process, Not Tools
  • Always be experimenting
  • Empower everyone to use new AI technology
  • Don’t fall into the trap of “AI Engineering is all I need”

Strategic

No GPUs before PMF

  • Training from scratch (almost) never makes sense
  • Start with inference APIs, but don’t be afraid of self-hosting

Iterate to something great

  • The model isn’t the product, the system around it is
  • Build trust by starting small
  • Build LLMOps, but build it for the right reason: faster iteration

Start with prompting, evals, and data collection

  • Prompt engineering comes first
  • Build evals and kickstart a data flywheel
  • The high-level trend of low-cost cognition

Enough 0 to 1 demos, it’s time for 1 to N products


  • Tactical: Nuts & bolts of working with LLMs (pending)
  • Operational: Day-to-day and org concerns (pending)
  • Strategic: Long-term business strategy (pending)


Contact Us

We would love to hear your thoughts on this post. You can contact us at [email protected]. Many of us are open to various forms of consulting and advisory; if appropriate, we’ll route you to the right expert(s) when you reach out.

Acknowledgements

This series started as a conversation in a group chat, where Bryan quipped that he was inspired to write “A Year of AI Engineering”. Then, ✨magic✨ happened, and we were all inspired to chip in and share what we’ve learned so far.

The authors would like to thank Eugene for leading the bulk of the document integration and overall structure, in addition to a large proportion of the lessons, as well as for primary editing responsibilities and document direction. The authors would like to thank Bryan for the spark that led to this write-up, for restructuring it into tactical, operational, and strategic sections and their intros, and for pushing us to think bigger about how we could reach and help the community. The authors would like to thank Charles for his deep dives on cost and LLMOps, and for weaving the lessons together to make them more coherent and tighter—you have him to thank for this being 30 instead of 40 pages! The authors thank Hamel and Jason for their insights from advising clients and being on the front lines, for their broad, generalizable learnings from clients, and for their deep knowledge of tools. And finally, thank you Shreya for reminding us of the importance of evals and rigorous production practices, and for bringing her research and original results.

Finally, we would like to thank all the teams who so generously shared their challenges and lessons in their own write-ups, which we’ve referenced throughout this series, along with the AI communities for their vibrant participation and engagement with this group.

About the authors

Eugene Yan designs, builds, and operates machine learning systems that serve customers at scale. He’s currently a Senior Applied Scientist at Amazon, where he builds RecSys serving millions of customers worldwide and applies LLMs to serve customers better. Previously, he led machine learning at Lazada (acquired by Alibaba) and a Series A healthtech startup. He writes and speaks about ML, RecSys, LLMs, and engineering at eugeneyan.com and ApplyingML.com.

Bryan Bischof is the Head of AI at Hex, where he leads the team of engineers building Magic – the data science and analytics copilot. Bryan has worked all over the data stack leading teams in analytics, machine learning engineering, data platform engineering, and AI engineering. He started the data team at Blue Bottle Coffee, led several projects at Stitch Fix, and built the data teams at Weights and Biases. Bryan previously co-authored the book Building Production Recommendation Systems with O’Reilly, and teaches Data Science and Analytics in the graduate school at Rutgers. His Ph.D. is in pure mathematics.

Charles Frye teaches people to build AI applications. After publishing research in psychopharmacology and neurobiology, he got his Ph.D. at the University of California, Berkeley, for dissertation work on neural network optimization. He has taught thousands the entire stack of AI application development, from linear algebra fundamentals to GPU arcana and building defensible businesses, through educational and consulting work at Weights and Biases, Full Stack Deep Learning, and Modal.

Hamel Husain is a machine learning engineer with over 25 years of experience. He has worked with innovative companies such as Airbnb and GitHub, where his work included early LLM research used by OpenAI for code understanding. He has also led and contributed to numerous popular open-source machine-learning tools. Hamel is currently an independent consultant helping companies operationalize Large Language Models (LLMs) to accelerate their AI product journey.

Jason Liu is a distinguished machine learning consultant known for leading teams to successfully ship AI products. Jason’s technical expertise covers personalization algorithms, search optimization, synthetic data generation, and MLOps systems. His experience includes companies like Stitch Fix, where he created a recommendation framework and observability tools that handled 350 million daily requests. Additional roles have included Meta, NYU, and startups such as Limitless AI and Trunk Tools.

Shreya Shankar is an ML engineer and PhD student in computer science at UC Berkeley. She was the first ML engineer at two startups, building AI-powered products from scratch that serve thousands of users daily. As a researcher, her work focuses on addressing data challenges in production ML systems through a human-centered approach. Her work has appeared in top data management and human-computer interaction venues like VLDB, SIGMOD, CIDR, and CSCW.

If you found this useful, please cite this write-up as:

Yan et al. (May 2024). What We’ve Learned From A Year of Building with LLMs. eugeneyan.com. https://eugeneyan.com/writing/llm-lessons/.

or

@article{yan2024prompting,
  title   = {What We've Learned From A Year of Building with LLMs},
  author  = {Yan, Eugene and Bischof, Bryan and Frye, Charles and Husain, Hamel and Liu, Jason and Shankar, Shreya},
  journal = {eugeneyan.com},
  year    = {2024},
  month   = {May},
  url     = {https://eugeneyan.com/writing/llm-lessons/}
}
