What We Learned from a Year of Building with LLMs (Part I) #
This article covers the tactical aspects of working with large language models (LLMs) and distills practical lessons for anyone building products on top of them.
Key Takeaways:
- LLMs have become "good enough" for real-world applications, but creating products beyond demos remains challenging.
- The authors highlight crucial lessons learned from a year of building real-world applications on top of LLMs.
- This article focuses on tactical aspects, including prompting, retrieval-augmented generation (RAG), flow engineering, and evaluation & monitoring.
Prompting #
- Focus on fundamental prompting techniques:
- n-shot prompts + in-context learning: Provide the LLM with a few representative examples to demonstrate the task. Aim for at least 5 examples, and don't be afraid to use more.
- Chain-of-Thought (CoT): Encourage the LLM to explain its reasoning process step-by-step. Make the CoT specific by providing explicit instructions on the steps involved.
- Providing relevant resources: Expand the model's knowledge base, reduce hallucinations, and increase user trust by leveraging RAG. Tell the model to prioritize, refer to the resources directly, and mention when they are insufficient.
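To make these concrete, here is a minimal sketch of an n-shot prompt with a specific chain of thought. The ticket-classification task, the examples, and the `call_llm` helper are placeholders for illustration, not anything prescribed by the article.

```python
# Hypothetical helper: wrap your provider's client here; not part of the article.
def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in your LLM client of choice")

# n-shot examples demonstrating the task (aim for at least five representative cases).
EXAMPLES = [
    {"ticket": "The app crashes when I upload a photo", "label": "bug"},
    {"ticket": "Please add a dark mode option", "label": "feature_request"},
    {"ticket": "I was charged twice this month", "label": "billing"},
    {"ticket": "Export to CSV would be really useful", "label": "feature_request"},
    {"ticket": "Login page shows a blank screen on Safari", "label": "bug"},
]

def classify_ticket(ticket: str) -> str:
    shots = "\n\n".join(
        f"Ticket: {ex['ticket']}\nLabel: {ex['label']}" for ex in EXAMPLES
    )
    prompt = (
        "Classify the support ticket as 'bug', 'billing', or 'feature_request'.\n"
        # Specific chain of thought: spell out the steps rather than just "think step by step".
        "First restate what the user is asking for, then decide whether it describes "
        "broken behavior, a payment issue, or a new capability, and finally give the "
        "label on its own line prefixed with 'Label:'.\n\n"
        f"{shots}\n\nTicket: {ticket}\n"
    )
    response = call_llm(prompt)
    return response.rsplit("Label:", 1)[-1].strip()
```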
- Structure your inputs and outputs:
- Structured input: Use serialization formats (e.g., XML, JSON, Markdown) to clarify the relationships between tokens, add metadata, and make the request resemble similar examples in the training data.
- Structured output: Simplifies integration into downstream systems and enhances system efficiency.
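A sketch of requesting and parsing structured output with plain `json`; the order-extraction schema is an assumption, and `call_llm` is the hypothetical wrapper from the sketch above.

```python
import json

# `call_llm` is the hypothetical LLM wrapper from the prompting sketch above.

def extract_order(message: str) -> dict:
    prompt = (
        "Extract the order details from the message below.\n"
        "Respond with JSON only, matching this schema: "
        '{"item": string, "quantity": integer, "rush": boolean}\n\n'
        f"Message: {message}"
    )
    raw = call_llm(prompt)
    try:
        return json.loads(raw)  # structured output feeds directly into downstream code
    except json.JSONDecodeError:
        # Malformed output becomes an explicit failure to retry or route around,
        # instead of a silently mis-parsed string.
        raise ValueError(f"Model did not return valid JSON: {raw!r}")
```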
- Have small prompts that do one thing well: Avoid creating large, complex "God Object" prompts that do too much. Break down complex tasks into multiple smaller, focused prompts for better performance and easier iteration.
- Craft your context tokens carefully: Rethink the amount of context needed. Analyze the final prompt rigorously for redundancy, contradictions, and poor formatting. Consider structuring the context to highlight relationships between parts and simplify extraction.
"Be like Michaelangelo, do not build up your context sculpture—chisel away the superfluous material until the sculpture is revealed."
Information Retrieval/RAG #
- RAG is only as good as the retrieved documents:
- Use metrics like Mean Reciprocal Rank (MRR) and Normalized Discounted Cumulative Gain (NDCG) to evaluate the relevance of retrieved documents.
- Consider information density and level of detail in the retrieved documents.
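As a rough illustration, MRR over a small labeled set can be computed as below; the `retrieve` function and the (query, relevant document) pairs stand in for whatever your own evaluation set provides.

```python
def mean_reciprocal_rank(eval_set, retrieve, k=10):
    """eval_set: list of (query, relevant_doc_id) pairs.
    retrieve: function that returns a ranked list of doc ids for a query."""
    total = 0.0
    for query, relevant_id in eval_set:
        ranked = retrieve(query, k=k)
        # Reciprocal of the rank of the first relevant document; 0 if it never appears.
        reciprocal = 0.0
        for rank, doc_id in enumerate(ranked, start=1):
            if doc_id == relevant_id:
                reciprocal = 1.0 / rank
                break
        total += reciprocal
    return total / len(eval_set)
```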
- Don't forget keyword search:
- Use keyword-based approaches as a baseline and in hybrid search systems.
- Keyword search is efficient for specific queries and provides interpretability.
- Embeddings excel at semantic similarity but can struggle with specific keyword-based searches.
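One common way to combine the two is reciprocal rank fusion over a keyword ranking and an embedding ranking, sketched here; the fusion choice and the `keyword_search`/`vector_search` callables are assumptions, not the article's recipe.

```python
from collections import defaultdict

def hybrid_search(query, keyword_search, vector_search, k=10, rrf_k=60):
    """Fuse a keyword (e.g., BM25) ranking and an embedding ranking with
    reciprocal rank fusion; both search functions return ranked lists of doc ids."""
    scores = defaultdict(float)
    for ranking in (keyword_search(query, k=k), vector_search(query, k=k)):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (rrf_k + rank)  # earlier ranks contribute more
    return sorted(scores, key=scores.get, reverse=True)[:k]
```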
- Prefer RAG over fine-tuning for new knowledge:
- RAG consistently outperforms fine-tuning in incorporating new knowledge and current events.
- RAG is easier and cheaper to update compared to fine-tuning.
- Long-context models won't make RAG obsolete:
- While long-context models are game-changers for specific use cases, RAG remains valuable for selecting relevant information from large datasets.
- Longer contexts can lead to distractions and overwhelm the model, necessitating careful retrieval and ranking.
- Long-context models incur high inference cost, making RAG a cost-effective alternative.
Tuning and Optimizing Workflows #
- Step-by-step, multi-turn "flows" can offer significant performance boosts.
- Consider using a multi-step workflow that decomposes complex tasks into smaller, well-defined steps.
- Structure outputs to facilitate interactions with the orchestrating system.
- Experiment with planning steps, rewriting user prompts, implementing different flow architectures, planning validations, and prompt engineering with fixed upstream state.
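One possible shape for such a flow, with each step as its own small prompt and structured hand-offs to the orchestrator; the step prompts are illustrative only, and `call_llm` is the hypothetical wrapper from earlier.

```python
import json

# `call_llm` is the hypothetical LLM wrapper from the prompting sketch above.

def answer_with_flow(question: str) -> str:
    # Step 1: rewrite the user prompt into a self-contained query.
    rewritten = call_llm(
        f"Rewrite this question so it can be understood without prior context:\n{question}"
    )

    # Step 2: plan sub-questions, returned as structured output for the orchestrator.
    plan = json.loads(call_llm(
        "Return a JSON array of the sub-questions needed to answer the query below.\n\n"
        f"Query: {rewritten}"
    ))

    # Step 3: answer each sub-question with a small prompt that does one thing well.
    partial_answers = [call_llm(f"Answer concisely: {sub}") for sub in plan]

    # Step 4: synthesize a final answer from the intermediate results.
    return call_llm(
        "Combine these partial answers into a single coherent response:\n"
        + "\n".join(partial_answers)
    )
```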
- Prioritize deterministic workflows for now:
- Build agent systems that generate deterministic plans for more predictable and reliable execution.
- Benefits of deterministic workflows include:
- Easier to test and debug
- Failures can be traced to specific steps
- Generated plans can be used as few-shot samples for prompting and fine-tuning.
- Focus on structured, deterministic approaches to build robust and reliable agents.
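A sketch of the generate-a-plan-then-execute-it pattern: the LLM only picks steps from a fixed registry, and plain code runs them, so each failure is attributable to a specific step. The step registry and plan format are assumptions.

```python
import json

# `call_llm` is the hypothetical LLM wrapper from the prompting sketch above.

# Deterministic, individually testable steps the agent is allowed to use.
def fetch_docs(state: dict) -> dict:
    return {**state, "docs": ["...retrieved documents..."]}  # placeholder

def summarize_docs(state: dict) -> dict:
    return {**state, "summary": "...summary of docs..."}     # placeholder

STEPS = {"fetch_docs": fetch_docs, "summarize_docs": summarize_docs}

def run_agent(task: str) -> dict:
    # The LLM only chooses the plan; execution is plain, deterministic code.
    plan = json.loads(call_llm(
        f"Return a JSON array of step names chosen from {sorted(STEPS)} "
        f"that would accomplish this task:\n{task}"
    ))
    state = {"task": task, "plan": plan}
    for step_name in plan:
        if step_name not in STEPS:
            raise ValueError(f"Plan referenced an unknown step: {step_name}")
        state = STEPS[step_name](state)  # failures trace back to a specific step
    return state  # successful plans can later be reused as few-shot examples
```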
- Getting more diverse outputs beyond temperature:
- Simply increasing temperature does not reliably increase diversity and can introduce its own failure modes (e.g., outputs that drift off-task or break constraints).
- Alternative strategies to increase diversity include:
- Adjusting elements within the prompt (e.g., shuffling item order)
- Keeping a short list of recent outputs to prevent redundancy
- Varying the phrasing in prompts
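A sketch of two of these strategies, shuffling the items in the prompt and keeping a short list of recent outputs; the product-recommendation framing is illustrative only.

```python
import random
from collections import deque

# `call_llm` is the hypothetical LLM wrapper from the prompting sketch above.

recent_suggestions = deque(maxlen=20)  # short memory of what was already shown

def suggest_product(products: list[str]) -> str:
    shuffled = random.sample(products, k=len(products))  # vary item order per request
    avoid = "; ".join(recent_suggestions) or "none"
    prompt = (
        "Suggest exactly one product from the list below and explain why in one sentence.\n"
        f"Do not repeat any of these recent suggestions: {avoid}\n"
        f"Products: {', '.join(shuffled)}"
    )
    suggestion = call_llm(prompt).strip()
    recent_suggestions.append(suggestion)
    return suggestion
```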
- Caching is underrated:
- Reduces cost and latency by saving previously generated responses.
- Reduces risk of serving harmful content by serving vetted responses.
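A minimal in-process cache keyed on a hash of the normalized prompt; a real deployment would likely use a shared store such as Redis and vet cached responses before reuse.

```python
import hashlib

# `call_llm` is the hypothetical LLM wrapper from the prompting sketch above.

_cache: dict[str, str] = {}  # in production, a shared store such as Redis

def cached_call(prompt: str) -> str:
    key = hashlib.sha256(prompt.strip().lower().encode("utf-8")).hexdigest()
    if key not in _cache:
        _cache[key] = call_llm(prompt)  # only pay latency and cost for novel requests
    return _cache[key]  # cached entries can also be reviewed and vetted before serving
```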
- When to fine-tune:
- Consider fine-tuning when prompting techniques fall short of providing reliable, high-quality output.
- Fine-tuning costs include data annotation, model training, evaluation, and self-hosting.
- Evaluate if the higher upfront cost of fine-tuning is justified based on the potential performance gains.
Evaluation & Monitoring #
- Create assertion-based unit tests from real input/output samples:
- Define assertions based on real-world input and output samples, specifying expectations for output based on:
- Whether specific phrases or ideas are present
- Word, item, or sentence count ranges
- Execution-based evaluation for code generation (run the generated code and check the resulting runtime state)
- Run these assertions whenever there are changes to the pipeline.
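What such assertions might look like as pytest-style tests, using placeholder expectations of the kinds listed above; `summarize_ticket` and the sample input are hypothetical.

```python
# `call_llm` is the hypothetical LLM wrapper from the prompting sketch above.

def summarize_ticket(ticket: str) -> str:
    return call_llm(f"Summarize this support ticket in two to three sentences:\n{ticket}")

SAMPLE = "Customer asks for a refund on order #1234 because the item arrived broken."

def test_summary_mentions_refund():
    output = summarize_ticket(SAMPLE)
    assert "refund" in output.lower()        # a key idea from the input must be present

def test_summary_sentence_count_in_range():
    output = summarize_ticket(SAMPLE)
    sentences = [s for s in output.split(".") if s.strip()]
    assert 1 <= len(sentences) <= 4          # length stays within the expected range
```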
- LLM-as-Judge can work but is not a silver bullet:
- Use LLM-as-Judge for pairwise comparisons to assess the relative quality of different prompts and techniques.
- Take precautions to control for position bias and response length.
- Consider using Chain-of-Thought to improve judgment reliability.
- Use LLM-as-Judge to evaluate new strategies and compare them to existing results.
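A sketch of a pairwise judge that asks for step-by-step reasoning and scores both orderings to control for position bias; the judging prompt and verdict format are assumptions, not a prescribed template.

```python
# `call_llm` is the hypothetical LLM wrapper from the prompting sketch above.

def judge_pair(question: str, answer_a: str, answer_b: str) -> str:
    """Return 'A', 'B', or 'tie', judging both orderings to control for position bias."""
    def ask(first: str, second: str) -> str:
        verdict = call_llm(
            "You are comparing two answers to the same question.\n"
            "Reason step by step about correctness and helpfulness, then end with "
            "exactly 'WINNER: 1', 'WINNER: 2', or 'WINNER: tie'.\n\n"
            f"Question: {question}\nAnswer 1: {first}\nAnswer 2: {second}"
        )
        return verdict.rsplit("WINNER:", 1)[-1].strip().lower()

    first_pass = ask(answer_a, answer_b)   # A shown in position 1
    second_pass = ask(answer_b, answer_a)  # B shown in position 1
    if first_pass == "1" and second_pass == "2":
        return "A"
    if first_pass == "2" and second_pass == "1":
        return "B"
    return "tie"  # disagreement across orderings often signals position bias or a true tie
```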
- The "intern test" for evaluating generations:
- Imagine giving the same input and context to a college student in the relevant major. Could they succeed? How long would it take?
- Use the results of this thought experiment to guide prompt engineering, context enrichment, task complexity, and data analysis.
- Overemphasizing certain evals can hurt overall performance:
- Be wary of focusing too heavily on specific evaluation metrics, such as the Needle-in-a-Haystack (NIAH) evaluation.
- Consider incorporating more practical and realistic evaluation paradigms.
- Simplify annotation to binary tasks or pairwise comparisons:
- Use these simplified annotation approaches for faster, more reliable, and less cognitively demanding annotation.
- (Reference-free) evals and guardrails can be used interchangeably:
- Reference-free evaluations can be used as guardrails to filter or reject undesired content.
- LLMs will return output even when they shouldn’t:
- Implement robust guardrails to detect and filter harmful or inappropriate outputs.
- Log inputs and outputs for debugging and monitoring.
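A sketch of a guardrail wrapper that logs every exchange and refuses to serve flagged output; `looks_safe` stands in for whatever moderation model or reference-free eval you actually use.

```python
import logging

# `call_llm` is the hypothetical LLM wrapper from the prompting sketch above.

logger = logging.getLogger("llm_app")

def looks_safe(text: str) -> bool:
    """Placeholder guardrail; swap in your moderation API or reference-free eval."""
    verdict = call_llm(
        f"Answer yes or no: is the following text safe and appropriate to show a user?\n{text}"
    )
    return verdict.strip().lower().startswith("yes")

def guarded_generate(prompt: str) -> str:
    output = call_llm(prompt)
    logger.info("prompt=%r output=%r", prompt, output)  # log every exchange for debugging
    if not looks_safe(output):
        logger.warning("guardrail blocked an output")
        return "Sorry, I can't help with that."  # refuse rather than serve risky content
    return output
```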
- Hallucinations are a stubborn problem:
- Address hallucinations using prompt engineering and factual inconsistency guardrails.
- Leverage CoT prompting and deterministic detection strategies to mitigate hallucinations.
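One form such a guardrail can take is a factual-consistency judge that checks the answer against its source, sketched below; the verdict format is an assumption and would need calibration against labeled examples.

```python
# `call_llm` is the hypothetical LLM wrapper from the prompting sketch above.

def is_grounded(answer: str, source: str) -> bool:
    """Ask a judge model whether every claim in the answer is supported by the source."""
    verdict = call_llm(
        "Check the answer against the source document, claim by claim.\n"
        "End with exactly 'VERDICT: pass' if every claim is supported by the source, "
        "otherwise 'VERDICT: fail'.\n\n"
        f"Source:\n{source}\n\nAnswer:\n{answer}"
    )
    return verdict.rsplit("VERDICT:", 1)[-1].strip().lower() == "pass"
```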
About the Authors #
The article was written by a group of six AI and ML experts with diverse backgrounds and experiences in the field. They include:
- Eugene Yan: Senior Applied Scientist at Amazon, focuses on building machine learning systems at scale.
- Bryan Bischof: Head of AI at Hex, leads the team building "Magic," a data science copilot.
- Charles Frye: Teaches people to build AI applications and has taught thousands the full stack of AI development.
- Hamel Husain: Independent ML consultant, helps companies operationalize LLMs to accelerate AI product journeys.
- Jason Liu: Distinguished ML consultant, guides teams to ship successful AI products.
- Shreya Shankar: ML engineer and PhD student at UC Berkeley, specializes in addressing data challenges in production ML systems.
Useful Resources #
- ApplyingML.com: Eugene Yan's website for articles and insights on ML, RecSys, LLMs, and engineering
- Building Production Recommendation Systems: O'Reilly book co-authored by Bryan Bischof
- Full Stack Deep Learning: Charles Frye's educational and consulting organization
- Weights & Biases: Organization where Charles Frye has worked and taught AI application development.
This summary provides a comprehensive overview of the key information and advice presented in the article "What We Learned from a Year of Building with LLMs (Part I)." It serves as a valuable starting point for anyone interested in building successful products with LLMs.