LLM System Optimization

A collection of LinkedIn posts from practitioners on optimizing LLM-based systems.

  • Andrew Ng
    Founder of DeepLearning.AI; Managing General Partner of AI Fund; Exec Chairman of LandingAI

Last week, I described four design patterns for AI agentic workflows that I believe will drive significant progress: Reflection, Tool Use, Planning, and Multi-agent Collaboration. Instead of having an LLM generate its final output directly, an agentic workflow prompts the LLM multiple times, giving it opportunities to build step by step to higher-quality output. Here, I'd like to discuss Reflection. It's relatively quick to implement, and I've seen it lead to surprising performance gains.

You may have had the experience of prompting ChatGPT/Claude/Gemini, receiving unsatisfactory output, delivering critical feedback to help the LLM improve its response, and then getting a better response. What if you automate the step of delivering critical feedback, so the model automatically criticizes its own output and improves its response? This is the crux of Reflection.

Take the task of asking an LLM to write code. We can prompt it to generate the desired code directly to carry out some task X. Then, we can prompt it to reflect on its own output, perhaps as follows:

"Here’s code intended for task X: [previously generated code] Check the code carefully for correctness, style, and efficiency, and give constructive criticism for how to improve it."

Sometimes this causes the LLM to spot problems and come up with constructive suggestions. Next, we can prompt the LLM with context including (i) the previously generated code and (ii) the constructive feedback, and ask it to use the feedback to rewrite the code. This can lead to a better response. Repeating the criticism/rewrite process might yield further improvements.

This self-reflection process allows the LLM to spot gaps and improve its output on a variety of tasks including producing code, writing text, and answering questions. And we can go beyond self-reflection by giving the LLM tools that help evaluate its output; for example, running its code through a few unit tests to check whether it generates correct results on test cases, or searching the web to double-check text output. Then it can reflect on any errors it found and come up with ideas for improvement.

Further, we can implement Reflection using a multi-agent framework. I've found it convenient to create two agents, one prompted to generate good outputs and the other prompted to give constructive criticism of the first agent's output. The resulting discussion between the two agents leads to improved responses.

Reflection is a relatively basic type of agentic workflow, but I've been delighted by how much it improved my applications’ results. If you’re interested in learning more about Reflection, I recommend:
- Self-Refine: Iterative Refinement with Self-Feedback, by Madaan et al. (2023)
- Reflexion: Language Agents with Verbal Reinforcement Learning, by Shinn et al. (2023)
- CRITIC: Large Language Models Can Self-Correct with Tool-Interactive Critiquing, by Gou et al. (2024)

[Original text: https://lnkd.in/g4bTuWtU ]
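The loop below is a minimal sketch of that generate / critique / rewrite cycle. It assumes `chat` is any function that sends one prompt to your LLM and returns its text; it is an illustration of the pattern, not a particular library's API.

```python
# Minimal sketch of a Reflection loop: generate, self-critique, rewrite.
# `chat` is an assumed single-prompt LLM call supplied by the caller.

def reflect_and_rewrite(task: str, chat, rounds: int = 2) -> str:
    """Generate code for `task`, then alternate self-critique and rewrite steps."""
    code = chat(f"Write code to carry out this task:\n{task}")
    for _ in range(rounds):
        critique = chat(
            f"Here's code intended for task: {task}\n\n{code}\n\n"
            "Check the code carefully for correctness, style, and efficiency, "
            "and give constructive criticism for how to improve it."
        )
        code = chat(
            f"Task: {task}\n\nPrevious code:\n{code}\n\n"
            f"Constructive feedback:\n{critique}\n\n"
            "Use the feedback to rewrite the code."
        )
    return code
```

Adding tool feedback (for example, appending unit-test results to the critique prompt) fits naturally into the same loop.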

  • Sebastian Raschka, PhD
    ML/AI research engineer. Author of Build a Large Language Model From Scratch (amzn.to/4fqvn0D) and Ahead of AI (magazine.sebastianraschka.com), on how LLMs work and the latest developments in the field.

I just read the "Thinking LLMs: General Instruction Following With Thought Generation" paper (https://lnkd.in/gkzq_-iZ), which offers a simple yet effective way to improve the response quality of instruction-finetuned LLMs. Think of it as a very simple alternative to OpenAI's o1 model, which produces better answers via internal "thinking" yet only shows you the final response, not the thinking process.

The idea of the proposed Thought Preference Optimization (TPO) is to incorporate Chain-of-Thought-style prompting/reasoning into training. However,
a) just asking the model to "think" via Chain-of-Thought prompting can reduce response accuracy
b) training on Chain-of-Thought data would be hard because human thought processes are usually not included in instruction datasets

So, their idea is this (see figure below):
1) Modify the prompt with a Chain-of-Thought-style instruction: "think before responding."
2) Use an LLM judge to evaluate the responses (excluding the thoughts generated by the LLM)
3) Form preference pairs for DPO based on the rejected and preferred responses (these responses include the thoughts)

This way, the LLM implicitly learns to optimize its thinking process to produce better responses. (Note that the thinking process doesn't need to be shown to the user, similar to how it's not shown to the judge LLM.)

The results, based on Llama 3 8B Instruct, show that this TPO approach works quite well:
i) Interestingly, if the thought prompt is prepended but the Llama 3 8B Instruct base model doesn't undergo DPO finetuning on the preference pairs, the base model performs much worse than without the thought prompt
ii) Finetuning the model on the instruction data (direct response baseline) without the thought prompt already improves the base model's performance substantially, by about 27.6 percentage points on AlpacaEval and 17% on Arena-Hard; this shows how important finetuning is in general
iii) Adding the thought preference optimization on top further boosts performance by about 4%

Note that this method is applied to general instruction-response answering and is not specific to logic or math tasks.
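As a rough illustration of how such preference pairs could be assembled, here is a sketch that assumes the model wraps its reasoning in `<thought>` tags and that `generate` and `judge_score` are your own sampling and judging helpers; the names are illustrative, not the paper's code.

```python
# Sketch of forming TPO-style preference pairs: sample thought+response completions,
# judge only the visible response, and keep the full completions in the DPO pair.
import re

THOUGHT_PROMPT = "Think before responding. Write your thoughts, then your response."

def strip_thoughts(completion: str) -> str:
    # Assumption: the model delimits its reasoning with <thought>...</thought> tags.
    return re.sub(r"<thought>.*?</thought>", "", completion, flags=re.S).strip()

def build_preference_pair(instruction, generate, judge_score, num_samples=4):
    prompt = f"{THOUGHT_PROMPT}\n\n{instruction}"
    completions = [generate(prompt) for _ in range(num_samples)]
    # The judge sees only the response part, never the thoughts.
    ranked = sorted(completions, key=lambda c: judge_score(instruction, strip_thoughts(c)))
    rejected, chosen = ranked[0], ranked[-1]
    # The DPO pair keeps the full completions (thoughts included), so the model
    # implicitly learns which thinking leads to better responses.
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}
```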

  • Zain Hasan
    I build and teach AI | AI/ML @ Together AI | EngSci ℕΨ/PhD @ UofT | Previously: vector DBs, data scientist, lecturer & health tech founder | 🇺🇸🇨🇦🇵🇰

You don't need a 2 trillion parameter model to tell you the capital of France is Paris. Be smart and route between a panel of models according to query difficulty and model specialty!

A new paper proposes a framework to train a router that routes queries to the appropriate LLM to optimize the trade-off between cost and performance.

Overview: Model inference cost varies significantly. Per one million output tokens: Llama-3-70b ($1) vs. GPT-4-0613 ($60), Haiku ($1.25) vs. Opus ($75).

The RouteLLM paper proposes a router training framework based on human preference data and augmentation techniques, demonstrating over 2x cost savings on widely used benchmarks. They define the problem as having to choose between two classes of models:
(1) strong models - produce high-quality responses but at a high cost (GPT-4o, Claude 3.5)
(2) weak models - relatively lower quality and lower cost (Mixtral 8x7B, Llama3-8b)

A good router requires a deep understanding of the question’s complexity as well as the strengths and weaknesses of the available LLMs. They explore different routing approaches:
- Similarity-weighted (SW) ranking
- Matrix factorization
- BERT query classifier
- Causal LLM query classifier

Neat ideas to build from:
- Users can collect a small amount of in-domain data to improve performance for their specific use cases via dataset augmentation.
- The problem can be expanded from routing between a strong and a weak LLM to a multiclass model-routing approach with specialist models (language-vision model, function-calling model, etc.).
- Larger framework controlled by a router - imagine a system of 15-20 tuned small models and the router as the (n+1)th model responsible for picking the LLM that will handle a particular query at inference time.
- MoA architectures: routing to different architectures of a Mixture of Agents would be a cool idea as well. Depending on the query, you decide how many proposers there should be, how many layers in the mixture, what the aggregator models should be, etc.
- Route-based caching: if you get redundant queries that are slightly different, route the query + previous answer to a small model for light rewriting instead of regenerating the answer.
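A minimal sketch of the routing decision itself, assuming `predict_strong_win_rate` stands in for a trained router (for example, a BERT-style query classifier); the model names and threshold are illustrative, not values from the paper.

```python
# Minimal sketch of cost-aware routing between a weak and a strong model.
# `predict_strong_win_rate()` is a placeholder for whatever trained router you use.

STRONG_MODEL = "gpt-4"        # high quality, high cost
WEAK_MODEL = "llama-3-70b"    # lower quality, much cheaper per output token

def predict_strong_win_rate(query: str) -> float:
    """Estimate how likely the strong model is to give a meaningfully better answer."""
    raise NotImplementedError("plug in your trained router here")

def route(query: str, threshold: float = 0.3) -> str:
    """Send easy queries to the cheap model, hard ones to the strong model."""
    return STRONG_MODEL if predict_strong_win_rate(query) > threshold else WEAK_MODEL
```

The threshold is the knob that trades cost against quality: raise it to send more traffic to the cheap model, lower it when response quality matters most.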

  • Sahar Mor
    I help researchers and builders make sense of AI | ex-Stripe | aitidbits.ai | Angel Investor

In the last three months alone, over ten papers outlining novel prompting techniques were published, boosting LLMs’ performance by a substantial margin. Two weeks ago, a groundbreaking paper from Microsoft demonstrated how a well-prompted GPT-4 outperforms Google’s Med-PaLM 2, a specialized medical model, solely through sophisticated prompting techniques.

Yet, while our X and LinkedIn feeds buzz with ‘secret prompting tips’, a definitive, research-backed guide aggregating these advanced prompting strategies is hard to come by. This gap prevents LLM developers and everyday users from harnessing these novel frameworks to enhance performance and achieve more accurate results. https://lnkd.in/g7_6eP6y

In this AI Tidbits Deep Dive, I outline six of the best recent prompting methods:
(1) EmotionPrompt - inspired by human psychology, this method utilizes emotional stimuli in prompts to gain performance enhancements
(2) Optimization by PROmpting (OPRO) - a DeepMind innovation that refines prompts automatically, surpassing human-crafted ones. This paper discovered the “Take a deep breath” instruction that improved LLMs’ performance by 9%.
(3) Chain-of-Verification (CoVe) - Meta's novel four-step prompting process that drastically reduces hallucinations and improves factual accuracy
(4) System 2 Attention (S2A) - also from Meta, a prompting method that filters out irrelevant details prior to querying the LLM
(5) Step-Back Prompting - encouraging LLMs to abstract queries for enhanced reasoning
(6) Rephrase and Respond (RaR) - UCLA's method that lets LLMs rephrase queries for better comprehension and response accuracy

Understanding the spectrum of available prompting strategies and how to apply them in your app can mean the difference between a production-ready app and a nascent project with untapped potential.

Full blog post: https://lnkd.in/g7_6eP6y
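As a small illustration of one of these methods, here is a sketch of the four Chain-of-Verification steps (draft, plan verification questions, answer them, revise), with `chat` as a generic single-turn LLM call; this is an assumption-laden outline, not the paper's implementation.

```python
# Illustrative sketch of the four Chain-of-Verification (CoVe) steps.
# `chat` is an assumed single-prompt LLM call supplied by the caller.

def chain_of_verification(question: str, chat) -> str:
    draft = chat(f"Answer the question:\n{question}")
    plan = chat(
        f"Question: {question}\nDraft answer: {draft}\n"
        "List verification questions that would check the facts in the draft."
    )
    checks = chat(f"Answer each verification question independently:\n{plan}")
    return chat(
        f"Question: {question}\nDraft answer: {draft}\n"
        f"Verification Q&A:\n{checks}\n"
        "Write a final answer, correcting anything the verification contradicted."
    )
```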

  • Steve Nouri
    The largest AI Community 14 Million Members | Advisor @ Fortune 500 | Keynote Speaker

🚀 Google just dropped the blueprint for the future of agentic AI: Context Engineering, Sessions & Memory. If prompt engineering was about crafting good questions, context engineering is about building an AI’s entire mental workspace. Here’s why this paper matters 👇

What’s Context Engineering? LLMs are stateless; they forget everything between calls. Context engineering turns them into stateful systems by dynamically assembling:
• System instructions (the “personality” of the agent)
• External knowledge (RAG results, tools, and outputs)
• Session history (ongoing dialogue)
• Long-term memory (summaries and facts from past sessions)

It’s not prompt design anymore, it’s prompt orchestration.

Think of sessions as your workbench: messy but active. Sessions manage short-term context and working memory. Think of memory as your filing cabinet: organized, persistent, and searchable. Memories persist facts, preferences, and strategies across time and agents. Together, they make AI personal, consistent, and self-improving.

My takeaways:
- Context is the new compute: your system’s intelligence depends on what it sees, not just the model you use.
- Memory isn’t a vector DB, it’s an LLM-driven ETL pipeline that extracts, consolidates, and prunes knowledge.
- Multi-agent systems need shared memory layers, not shared prompts.
- Procedural memory (the how) is the next frontier: agents learning strategies, not just storing facts.

Building an “agent” today isn’t about chaining APIs together. It’s about context architecture that makes models actually think across time. The future of AI won’t belong to those who fine-tune models; it’ll belong to those who engineer context. “Stateful AI begins with context engineering.” This might just be the new foundation of agentic systems.
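A minimal sketch of that assembly step, showing the four ingredients being combined into one context before each call; the function and field names are illustrative, not any specific framework's API.

```python
# Illustrative sketch of "prompt orchestration": assemble system instructions,
# retrieved knowledge, long-term memory, and session history into one context.

def assemble_context(system_instructions: str,
                     retrieved_docs: list[str],
                     memories: list[str],
                     session_history: list[dict],
                     user_message: str) -> list[dict]:
    knowledge = "\n\n".join(retrieved_docs)
    facts = "\n".join(f"- {m}" for m in memories)
    system = (
        f"{system_instructions}\n\n"
        f"Relevant knowledge:\n{knowledge}\n\n"
        f"Known facts about the user:\n{facts}"
    )
    return (
        [{"role": "system", "content": system}]   # dynamic "mental workspace"
        + session_history                          # short-term working memory
        + [{"role": "user", "content": user_message}]
    )
```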

  • Alex Wang
    Learn AI Together - I share my learning journey into AI & Data Science here, 90% buzzword-free. Follow me and let's grow together!

LLMs are just stateless functions. If you want something dependable, it’s all about how you engineer the wrapper. In this issue, I share some lessons and patterns I’ve found (or learned from others) that help LLM-based agents go from flaky to functional:
- Why the “magic” is in the loop, not the model
- How to think about tool use, error handling, and context windows
- The value of owning your control flow
- Why smaller, focused agents usually win
- ...

I also included one of our open-sourced projects: GenAI Agents Infrastructure - a lightweight setup we’ve been using internally to run LLM agents. You’ll find the GitHub link inside; give it a try and let me know how it goes!

___________

Welcome to the Learn AI Together newsletter — 90% buzzword-free and focused on learning materials & news in AI. Let’s grow together! Alex Wang
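To make the "own your control flow" point concrete, here is a minimal agent loop where the surrounding code, not the model, decides when to call tools, how to surface errors, and when to stop; `llm_step` and `tools` are illustrative placeholders rather than any particular framework.

```python
# Minimal sketch of an agent loop you own yourself. `llm_step` returns a dict
# describing the next action; `tools` maps tool names to plain Python callables.

def run_agent(task: str, llm_step, tools, max_steps: int = 10) -> str:
    history = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        action = llm_step(history)           # e.g. {"type": "tool", "name": ..., "args": {...}}
        if action["type"] == "final":
            return action["answer"]
        try:
            result = tools[action["name"]](**action["args"])
        except Exception as exc:             # surface tool errors back to the model
            result = f"Tool error: {exc}"
        history.append({"role": "assistant", "content": str(action)})
        history.append({"role": "user", "content": f"Tool result: {result}"})
    return "Stopped: step budget exceeded"
```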

  • Sarthak Rastogi
    AI engineer | Posts on agents + advanced RAG | Experienced in LLM research, ML engineering, Software Engineering

Apple’s new Superposition Prompting method improves RAG accuracy by 43%.

Suppose your vector search retrieved 5 documents. Instead of processing them as one big unit, this approach lets the LLM consider each doc separately and process them in parallel. So it’s obvious how it improves speed. But how does it improve accuracy? A major problem with LLMs is that irrelevant info in the input context confuses the model. By considering each retrieved doc separately when answering the query, this problem is reduced.

Here’s how they actually do Superposition Prompting: they use a DAG structure where each query segment is a duplicate of the original query. This allows for parallel processing of the query segments. The model looks at each query segment and its retrieved docs independently, and uses path pruning to get rid of any irrelevant docs.

To make inference even faster, the paper uses path caching and path parallelisation techniques:
- Path caching precomputes KV embeddings for the docs
- Path parallelisation computes KV caches and logits for query segments in parallel

Paper: https://lnkd.in/g_j67TpY

#AI #RAG #LLMs
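A high-level sketch of the idea (not the paper's KV-cache implementation): score each query-document path independently and in parallel, prune low-relevance paths, and answer over the survivors. `score_relevance` and `answer_over` are assumed helpers you would back with LLM calls.

```python
# Rough sketch of per-document paths with pruning, inspired by the idea above.
from concurrent.futures import ThreadPoolExecutor

def answer_with_pruning(query, docs, score_relevance, answer_over, keep_fraction=0.5):
    # Score each (query, doc) path independently and in parallel.
    with ThreadPoolExecutor() as pool:
        scores = list(pool.map(lambda d: score_relevance(query, d), docs))
    ranked = sorted(zip(scores, docs), key=lambda pair: pair[0], reverse=True)
    # Path pruning: keep only the most relevant docs before answering.
    kept = [doc for _, doc in ranked[: max(1, int(len(docs) * keep_fraction))]]
    return answer_over(query, kept)
```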

  • Ross Dawson
    Futurist | Board advisor | Global keynote speaker | Humans + AI Leader | Bestselling author | Podcaster | LinkedIn Top Voice | Founder: AHT Group - Informivity - Bondi Innovation

LLMs are optimized for next-turn responses. This results in poor human-AI collaboration, as it doesn't help users achieve their goals or clarify intent. A new model, CollabLLM, is optimized for long-term collaboration. The paper "CollabLLM: From Passive Responders to Active Collaborators" by Stanford University and Microsoft researchers tests this approach to improving outcomes from LLM interaction. (link in comments)

💡 CollabLLM transforms AI from passive responders to active collaborators. Traditional LLMs focus on single-turn responses, often missing user intent and leading to inefficient conversations. CollabLLM introduces a "multiturn-aware reward" system and applies reinforcement fine-tuning on these rewards. This enables AI to engage in deeper, more interactive exchanges by actively uncovering user intent and guiding users toward their goals.

🔄 Multiturn-aware rewards optimize long-term collaboration. Unlike standard reinforcement learning that prioritizes immediate responses, CollabLLM uses forward sampling - simulating potential conversations - to estimate the long-term value of interactions. This approach improves interactivity by 46.3% and enhances task performance by 18.5%, making conversations more productive and user-centered.

📊 CollabLLM outperforms traditional models in complex tasks. In document editing, coding assistance, and math problem-solving, CollabLLM increases user satisfaction by 17.6% and reduces time spent by 10.4%. It ensures that AI-generated content aligns with user expectations through dynamic feedback loops.

🤝 Proactive intent discovery leads to better responses. Unlike standard LLMs that assume user needs, CollabLLM asks clarifying questions before responding, leading to more accurate and relevant answers. This results in higher-quality output and a smoother user experience.

🚀 CollabLLM generalizes well across different domains. Tested on the Abg-CoQA conversational QA benchmark, CollabLLM proactively asked clarifying questions 52.8% of the time, compared to just 15.4% for GPT-4o. This demonstrates its ability to handle ambiguous queries effectively, making it more adaptable to real-world scenarios.

🔬 Real-world studies confirm efficiency and engagement gains. A 201-person user study showed that CollabLLM-generated documents received higher quality ratings (8.50/10) and sustained higher engagement over multiple turns, unlike baseline models, which saw declining satisfaction in longer conversations.

It is time to move beyond the single-step LLM responses that we have been used to, toward interactions that lead to where we want to go. This is a useful advance toward better human-AI collaboration. It's a critical topic, and I'll be sharing a lot more on how we can get there.
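As a rough sketch of how a multiturn-aware reward could be estimated by forward sampling, the snippet below rolls out a few simulated future turns from a candidate response and averages an outcome score; `simulate_user`, `model_reply`, and `score_outcome` are illustrative stand-ins, not the paper's implementation.

```python
# Sketch of a multiturn-aware reward via forward sampling over simulated futures.
import statistics

def multiturn_reward(conversation, candidate_response,
                     simulate_user, model_reply, score_outcome,
                     num_rollouts=3, horizon=3):
    rewards = []
    for _ in range(num_rollouts):
        rollout = conversation + [("assistant", candidate_response)]
        for _ in range(horizon):                       # simulate the next few turns
            rollout.append(("user", simulate_user(rollout)))
            rollout.append(("assistant", model_reply(rollout)))
        rewards.append(score_outcome(rollout))         # e.g. task success / satisfaction
    return statistics.mean(rewards)                    # long-term value of the response
```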

  • Dan Harper
    Chief Technology Officer at AskYourTeam

When coding with agentic AI, it’s context that’s king. If the LLM makes a misstep, the best next steps are to:
1. Take a moment to think about what may be missing: is it a gap in context, not enough detail in the prompt, or something that was misinterpreted?
2. Ask your AI agent why it took the approach it did. This is not always accurate, but sometimes it can reveal what signals it took from your code to arrive at its conclusion.
3. Correct the context gap. This may be additional detail needed for the task, further prompting, or pointing to an existing example in the codebase it can follow. If the misstep can be caught by an automated test, make sure you ask your AI tool to create tests for it.
4. Improve context for future sessions. It’s likely that the same gap will cause future missteps, so whatever additional information you provided should be kept somewhere for future reference. That could be added directly to all future contexts (e.g. a CLAUDE/AGENT md file or rules files) or a separate markdown file that can be referenced for similar work in the future.

Some codebases can be more difficult for an LLM to understand, and additional context or different techniques will be needed to get a good output. It can take some hours to map out comprehensive context and automated tests. You’ll notice the difference, though. Once you reach a point where there’s enough guidance for the LLM, its decisions will improve dramatically.

  • Himanshu J.
    Building Aligned, Safe and Secure AI

LLMs/SLMs are inherently stateless, but the future of AI and AI agents is stateful, personalized, and persistent. The critical discipline enabling this shift is Context Engineering, and it is much more than just prompt engineering. Context Engineering is the process of dynamically assembling and managing all information within an LLM’s/SLM’s context window. Think of it as the ‘mise en place’ for your agent, ensuring it has only the most relevant, high-quality ingredients for every turn.

🏛️ The Two Pillars of Stateful AI:
1. Sessions: These govern the ‘now’. A session is the container for a single, continuous conversation, holding the chronological dialogue history and working memory. You can view it as the temporary workbench for a project.
2. Memory: This is the mechanism for long-term persistence across multiple sessions. Memory captures and consolidates key information, acting as an organized filing cabinet that provides a continuous, personalized experience.

🐒 The Production Challenge: Combating Context Rot
A major hurdle is managing the ever-growing conversation history, which increases cost and latency and leads to ‘context rot’ (the model's diminished ability to pay attention to critical information).

ℹ️ To solve this, Context Engineering employs compaction strategies:
• Token-Based Truncation: simply cutting off older messages to stay within a predefined token limit.
• Recursive Summarization: using an LLM to periodically summarize the oldest parts of the conversation, preserving context in a condensed form.

💡 The Key Production Insight: memory generation itself, the process of Extraction (distilling key facts) and Consolidation (integrating new facts, resolving conflicts, and deleting redundant data), must be run as an asynchronous background process. This ensures the agent is snappy and responsive, and doesn't keep the user waiting while it's ‘thinking’ about what to remember.

Context Engineering is the foundation for building trusted, adaptive assistants that truly learn and grow with the user. What are your biggest challenges in moving your LLM proof-of-concept into a stateful production environment?

#LLMOps #AIEngineering #ContextEngineering #GenAI #MachineLearning #LLMDevelopment
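A small sketch combining the two compaction strategies above: summarize the oldest turns first, then fall back to truncation if the history is still over budget. `count_tokens` and `summarize` (an LLM call) are assumed helpers, not a specific library.

```python
# Sketch of history compaction: recursive summarization plus token-based truncation.

def compact_history(history: list[str], max_tokens: int,
                    count_tokens, summarize) -> list[str]:
    if sum(count_tokens(turn) for turn in history) <= max_tokens:
        return history
    # Recursive summarization: fold the oldest half into one condensed turn.
    oldest, recent = history[: len(history) // 2], history[len(history) // 2:]
    compacted = ["[Summary of earlier conversation] " + summarize(oldest)] + recent
    # Token-based truncation: drop the oldest remaining raw turns if still over budget.
    while sum(count_tokens(t) for t in compacted) > max_tokens and len(compacted) > 1:
        compacted.pop(1)  # keep the summary at index 0, drop the next-oldest turn
    return compacted
```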
