Develop and Deploy Generative AI Applications on AWS with Eviden’s GenOps Framework - Part 4

Welcome to part 4 of our series on Eviden's AWS GenOps Framework. In our previous article, we explored the 4th and 5th steps of the Development Stage in detail. In this article, we'll be focusing on the 6th step: Recipe Stores. Recipe Stores are a crucial component of the framework, designed to leverage cutting-edge techniques to enhance model performance. This step involves exploring and implementing various advanced strategies, such as fine-tuning, prompt engineering, retrieval strategies, and even autonomous agents. By incorporating these techniques, organizations can significantly improve the effectiveness and efficiency of their generative AI models.

Figure 1 – Eviden’s GenOps Framework 10 Steps

Recipe Stores

Various approaches exist to enhance and tailor the output of language models for specific applications. These methods range from relatively simple techniques like prompt engineering, which involves crafting effective input prompts, to more sophisticated strategies such as Retrieval Augmented Generation (RAG), which combines the model's knowledge with external information sources. On the more advanced end of the spectrum, techniques that directly modify the model's parameters, such as fine-tuning and pre-training, offer powerful ways to adapt the model's behavior. However, these approaches differ significantly in their complexity, associated costs, required machine learning expertise, and ongoing maintenance demands. As a result, the choice of technique depends on the specific use case, available resources, and desired level of customization. Organizations and developers must carefully consider these factors to select the most appropriate method for optimizing their language model's performance in their particular context. The following figure describes the common approaches for customizing foundation models (FMs).

Figure 2 – Common approaches for customizing foundation models (FMs)

Additionally, as generative AI models become increasingly sophisticated and human-like, it is crucial to ensure they align with human values and exhibit desirable behavior. Techniques such as Reinforcement Learning from Human Feedback (RLHF) can be employed to guide these multimodal models towards being more helpful, honest, and harmless (HHH). This alignment process is essential for creating AI systems that not only perform tasks effectively but also operate within ethical boundaries and meet societal expectations.

Prompt engineering 

Prompt engineering has emerged as a powerful technique in the field of artificial intelligence, particularly in the realm of large language models. This iterative process involves carefully crafting and refining the input prompts given to AI models to elicit more accurate, relevant, and targeted responses. By systematically adjusting the wording, structure, and context of prompts, researchers and developers can significantly enhance a model's performance on specific tasks without the need for extensive retraining or modification of the underlying model architecture. This approach not only allows for improved results in existing applications but also enables the adaptation of models to entirely new tasks, effectively expanding their capabilities without altering their fundamental weights or parameters. As the field of AI continues to evolve, prompt engineering stands out as a versatile and efficient method for maximizing the potential of language models across a wide range of applications.

Figure 3 – Prompt Engineering

Designing effective prompts is a crucial skill in leveraging the power of language models. By following best practices, you can significantly enhance the quality and relevance of the model's responses. Clear and concise prompts are essential, avoiding ambiguity and using natural language. Including relevant context helps the model provide more accurate and tailored responses. Specifying the desired output format through directives ensures you receive the type of response you need. Placing the requested output at the end of the prompt helps maintain focus. Phrasing inputs as questions can be particularly effective, as can providing example responses to guide the model. For complex tasks, breaking them down into subtasks or asking the model to think step-by-step can yield better results. Experimentation and creativity are key to optimizing prompts, and it's important to evaluate and refine based on the model's responses. With practice, prompt engineering becomes an intuitive skill, allowing you to harness the full potential of language models for various applications.
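To make these guidelines concrete, the short Python sketch below assembles a prompt from the elements discussed above: context first, a clear directive that fixes the output format, an example response, and the actual question placed at the end. The scenario and wording are illustrative only, not a prescribed template.

```python
# A minimal illustration of the prompt-design guidelines above: context first,
# a directive with the desired output format, an example response, and the
# actual request last. All wording here is an example, not a recommended pattern.

context = (
    "You are a support assistant for an online bookstore. "
    "Customers ask about orders, shipping, and returns."
)

directive = (
    "Answer in no more than three sentences and finish with a bulleted list "
    "of the next steps the customer should take."
)

example = (
    "Example question: 'My order arrived damaged.'\n"
    "Example answer: 'I'm sorry your order arrived damaged. You can request a "
    "replacement free of charge.\n- Go to Orders > Returns\n- Select the damaged item'"
)

question = "Question: How do I change the delivery address of an order that has not shipped yet?"

# Separate the blocks clearly so the model can distinguish context from the task.
prompt = "\n\n".join([context, directive, example, question])
print(prompt)
```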

There are various techniques that can enhance your ability to craft and manipulate prompts, ultimately leading to more accurate and tailored responses from AI models. One such technique is zero-shot prompting, which involves presenting a task to the language model without providing any additional examples or context. This method relies on the model's inherent knowledge and capabilities to generate a response.

In contrast, few-shot prompting involves supplying the model with contextual information and examples of both the task and desired output, effectively guiding the model's response.

Another technique is chain-of-thought (CoT) prompting, which encourages the AI to break down complex problems into smaller, more manageable steps. By mastering these prompt engineering techniques, you can optimize your use of generative AI applications and achieve more precise and relevant results for your unique business objectives.
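The sketch below shows how the three styles might look in practice when sent through the Amazon Bedrock Converse API. The model ID, prompt wording, and response handling are example choices made for illustration, not part of the framework itself.

```python
# Sketch of zero-shot, few-shot, and chain-of-thought prompts sent through the
# Amazon Bedrock Converse API. The model ID and prompts are illustrative choices.
import boto3

bedrock = boto3.client("bedrock-runtime")

zero_shot = (
    "Classify the sentiment of this review as positive or negative: "
    "'The battery died after two days.'"
)

few_shot = (
    "Review: 'Great screen, fast delivery.' -> positive\n"
    "Review: 'Stopped working after a week.' -> negative\n"
    "Review: 'The battery died after two days.' ->"
)

chain_of_thought = (
    "A warehouse ships 240 parcels per day and each van carries 30 parcels. "
    "How many van trips are needed per day? Think step by step before giving the final number."
)

for prompt in (zero_shot, few_shot, chain_of_thought):
    response = bedrock.converse(
        modelId="anthropic.claude-3-haiku-20240307-v1:0",  # example model ID
        messages=[{"role": "user", "content": [{"text": prompt}]}],
    )
    print(response["output"]["message"]["content"][0]["text"])
```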

While basic prompts can be useful for general inquiries, they often fall short when addressing complex or nuanced tasks. Advanced prompt techniques offer a more sophisticated approach to generating AI responses tailored to specific business needs. The following are the most common advanced prompt techniques, each designed to enhance the precision and relevance of AI-generated content:

  • Self-consistency:

Self-consistency is an advanced technique that builds upon the chain-of-thought (CoT) prompting method. While CoT follows a linear path of reasoning, self-consistency encourages the model to explore multiple reasoning paths simultaneously. By aggregating results from various thought processes, this technique has shown significant improvements in arithmetic and common-sense reasoning tasks, as demonstrated in the paper by Xuezhi Wang and colleagues, Self-Consistency Improves Chain of Thought Reasoning in Language Models.
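As a rough illustration, the sketch below samples several chain-of-thought completions at a non-zero temperature, extracts each final answer, and keeps the majority vote. The `ask_model` helper is a hypothetical placeholder for whichever LLM call you use (for example a Bedrock or SageMaker endpoint).

```python
# A minimal sketch of self-consistency: sample several chain-of-thought
# completions, pull out each final answer, and keep the most frequent one.
from collections import Counter
import re

def ask_model(prompt: str, temperature: float = 0.8) -> str:
    """Hypothetical wrapper around an LLM call (e.g. Bedrock or SageMaker)."""
    raise NotImplementedError

def self_consistent_answer(question: str, samples: int = 5) -> str:
    prompt = (
        f"{question}\nThink step by step, then give the final answer "
        "on a line starting with 'Answer:'."
    )
    answers = []
    for _ in range(samples):
        completion = ask_model(prompt, temperature=0.8)
        match = re.search(r"Answer:\s*(.+)", completion)
        if match:
            answers.append(match.group(1).strip())
    # The most frequent final answer across the sampled reasoning paths wins.
    return Counter(answers).most_common(1)[0][0] if answers else ""
```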

  • Tree of Thoughts (ToT):

The Tree of Thoughts (ToT) technique, introduced in the paper Tree of Thoughts: Deliberate Problem Solving with Large Language Models by Shunyu Yao and colleagues, takes the concept of multiple reasoning paths even further. Instead of following a single sequential path, ToT creates a tree-like structure of thought processes. This approach is particularly effective for tasks that require strategic planning, exploration of multiple solutions, or important initial decisions. The authors have shown dramatic improvements in performance for tasks like creative writing, mini crosswords, and mathematical games when using ToT compared to traditional CoT prompting.

  • Retrieval Augmented Generation (RAG):

RAG is a technique that bridges the gap between pre-trained models and task-specific knowledge. By retrieving relevant information from external sources and incorporating it into the generation process, RAG allows LLMs to produce more informed and accurate responses. This method is particularly useful for handling frequently changing information and can be more cost-effective than fine-tuning models on specific datasets. RAG will be detailed in the next section of the article. 

  • Automatic Reasoning and Tool-use (ART)

ART (Automatic Reasoning and Tool-use), introduced in the paper ART: Automatic multi-step reasoning and tool-use for large language models by Bhargavi Paranjape and colleagues, is designed to tackle multi-step reasoning tasks by deconstructing them into manageable components. The technique combines few-shot learning with the use of external tools like search engines and code generators. By breaking down complex problems and leveraging appropriate tools, ART enhances the model's ability to solve intricate tasks.

  • ReAct

The ReAct framework, introduced in the paper ReAct: Synergizing Reasoning and Acting in Language Models by Shunyu Yao and colleagues, aims to combine reasoning capabilities with action-oriented tasks. This approach allows LLMs to generate both reasoning traces and specific actions based on external tools and resources. By incorporating external context, ReAct helps reduce errors such as fact hallucination and improves the overall accuracy and reliability of the model's output.
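The stripped-down sketch below shows the shape of a ReAct loop: the model alternates reasoning traces ("Thought") with tool calls ("Action"), and tool results are fed back as "Observation" lines until a final answer is produced. Both `ask_model` and the single `search` tool are hypothetical placeholders.

```python
# A stripped-down sketch of the ReAct loop with one hypothetical tool.
import re

def ask_model(prompt: str) -> str:
    """Hypothetical LLM call returning the next Thought/Action or final answer."""
    raise NotImplementedError

def search(query: str) -> str:
    """Hypothetical external tool, e.g. a web or knowledge-base search."""
    raise NotImplementedError

def react(question: str, max_steps: int = 5) -> str:
    transcript = (
        "Answer the question by interleaving Thought, Action and Observation lines.\n"
        "Available action: search[<query>]. Finish with 'Final Answer: <answer>'.\n"
        f"Question: {question}\n"
    )
    for _ in range(max_steps):
        step = ask_model(transcript)
        transcript += step + "\n"
        if "Final Answer:" in step:
            return step.split("Final Answer:")[-1].strip()
        # If the model requested the tool, run it and append the observation.
        action = re.search(r"Action:\s*search\[(.*?)\]", step)
        if action:
            transcript += f"Observation: {search(action.group(1))}\n"
    return ""
```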

Retrieval Augmented Generation (RAG)

Retrieval Augmented Generation (RAG) is an innovative approach in natural language processing that enhances the capabilities of language models by incorporating external knowledge sources. In this method, the initial prompt or context provided to the model is enriched with pertinent information retrieved from additional data repositories. This augmentation process allows the model to access and utilize a broader range of relevant data, leading to more informed and contextually accurate responses. RAG is particularly valuable when working with domain-specific information or when generating text that requires up-to-date knowledge. By leveraging this technique, developers can create more powerful and versatile applications that combine the generative capabilities of language models with the depth and specificity of curated data sources. This synergy results in outputs that are not only coherent and fluent but also grounded in accurate and relevant information, making RAG an increasingly popular choice for tasks requiring both creativity and factual precision.

Figure 4 – Retrieval Augmented Generation (RAG)

The architecture diagram in figure 5 below describes a typical RAG architecture, which consists of two main stages. In the first stage, a batch process converts existing knowledge documents into vector embeddings and stores them in a vector database, creating a searchable repository of information. In the second stage, when a user submits a query, the system performs a semantic search to retrieve relevant information from the vector database. The retrieved information, along with the original query, is then passed to a large language model (LLM) for processing and generation of a response. This approach allows the LLM to augment its knowledge with specific, relevant information, potentially improving the accuracy and relevance of its output.

Figure 5 – Retrieval Augmented Generation (RAG) Workflow
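The sketch below mirrors the two stages shown in figure 5. The `embed` and `generate` functions are hypothetical stand-ins for an embedding model and a text-generation model (for example Amazon Titan Embeddings and a Bedrock text model), and a simple in-memory list plays the role of the vector database.

```python
# A minimal sketch of the two-stage RAG workflow: batch ingestion, then
# retrieval and generation at query time.
import numpy as np

def embed(text: str) -> np.ndarray:
    """Hypothetical call to an embedding model; returns a dense vector."""
    raise NotImplementedError

def generate(prompt: str) -> str:
    """Hypothetical call to an LLM."""
    raise NotImplementedError

# Stage 1 (batch): embed the knowledge documents and store the vectors.
documents = [
    "Refunds are processed within 5 business days.",
    "Premium support is available 24/7 by phone.",
]
index = [(doc, embed(doc)) for doc in documents]  # replace with a real vector DB

# Stage 2 (query time): retrieve the closest documents and augment the prompt.
def answer(question: str, top_k: int = 2) -> str:
    q = embed(question)
    def cosine(v: np.ndarray) -> float:
        return float(np.dot(q, v) / (np.linalg.norm(q) * np.linalg.norm(v)))
    scored = sorted(index, key=lambda item: cosine(item[1]), reverse=True)
    context = "\n".join(doc for doc, _ in scored[:top_k])
    prompt = (
        f"Use only the context below to answer.\nContext:\n{context}\n\n"
        f"Question: {question}"
    )
    return generate(prompt)
```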

Traditional or "naive" RAG methods, which primarily depend on vector-based retrieval, have shown limitations in their ability to deeply understand context and perform complex reasoning tasks. By implementing sophisticated query construction, translation, and routing methods, RAG can achieve deeper contextual understanding and improved reasoning capabilities. These enhancements lead to more accurate and relevant information retrieval, ultimately resulting in higher-quality generated responses. Figure 6 below, from LangChain, details the most common advanced RAG techniques.

Figure 6 – Advanced RAG (Source: https://github.com/langchain-ai/rag-from-scratch)

1- Query Construction:

Described in figure 6's yellow rectangle, query construction is the first step in the RAG pipeline. It involves transforming natural language queries into formats compatible with various database types, including relational, graph, and vector databases. This step ensures accurate interpretation of user queries by the underlying data storage systems, setting the stage for effective retrieval.

2- Query Translation:

Described in figure 6's pink rectangle, query translation expands the scope of retrieval. It is a crucial process in the RAG pipeline that transforms the initial query into various forms or sub-queries to enhance retrieval effectiveness. Two key approaches in query translation are:

2.1- Query Decomposition:

Query decomposition involves breaking down complex queries into simpler sub-queries, making it easier to retrieve relevant information. This technique includes methods such as:

a. Multi-query: Generating multiple related queries to capture different aspects of the user's intent.

b. Step-back: Creating broader, more general queries to provide context for specific questions.

c. RAG-Fusion: A method that combines information from multiple retrieved documents to generate a comprehensive response. The RAG-Fusion process typically involves:

c.1- Query Generation: Creating multiple sub-queries from the user's input.

c.2- Sub-query Retrieval: Fetching relevant information for each sub-query.

c.3- Reciprocal Rank Fusion: Merging retrieved documents using a ranking algorithm to prioritize the most relevant results.
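The small sketch below illustrates the reciprocal rank fusion step used by RAG-Fusion: each sub-query produces its own ranked list of documents, and documents are merged by summing 1 / (k + rank) across the lists. The constant k = 60 is the value commonly used in practice; the document IDs are illustrative.

```python
# Reciprocal rank fusion: merge several ranked result lists into one ranking.
from collections import defaultdict

def reciprocal_rank_fusion(ranked_lists, k: int = 60):
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Example: three sub-queries returned overlapping result lists.
fused = reciprocal_rank_fusion([
    ["doc-3", "doc-1", "doc-7"],
    ["doc-1", "doc-3", "doc-9"],
    ["doc-7", "doc-1", "doc-2"],
])
print(fused)  # documents that several lists agree on rise to the top
```

Documents that appear near the top of several sub-query results accumulate the highest fused scores, which is why the technique rewards cross-query agreement rather than a single strong match.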

2.2- Pseudo-documents:

This technique involves creating hypothetical documents based on the query to help in the retrieval process. One notable method is HyDE (Hypothetical Document Embeddings), which generates pseudo-documents from queries and uses them to find similar, relevant real documents through similarity search.
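A compact sketch of HyDE follows: instead of embedding the raw query, the model first writes a hypothetical answer document, and that document's embedding drives the similarity search. The `generate`, `embed`, and `vector_search` helpers are hypothetical placeholders for your LLM, embedding model, and vector database.

```python
# HyDE sketch: embed a generated pseudo-document instead of the raw query.
def generate(prompt: str) -> str:
    raise NotImplementedError  # hypothetical LLM call

def embed(text: str):
    raise NotImplementedError  # hypothetical embedding model call

def vector_search(vector, top_k: int = 5):
    raise NotImplementedError  # hypothetical query against your vector database

def hyde_retrieve(query: str, top_k: int = 5):
    hypothetical_doc = generate(
        f"Write a short passage that would plausibly answer this question:\n{query}"
    )
    # Real documents similar to the hypothetical one feed the RAG prompt.
    return vector_search(embed(hypothetical_doc), top_k=top_k)
```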

3- Routing:

Described in figure 6's orange quadrant, routing in RAG systems involves determining the most appropriate database or information source to query based on the nature of the user's question. Two main approaches to routing are:

3.1- Logical Routing:

This method leverages large language models (LLMs) to analyze the query and choose the most suitable database or source for retrieving relevant information. The LLM considers the query's structure, content, and intent to make this decision.

3.2- Semantic Routing:

Unlike logical routing, semantic routing uses embeddings and similarity measures to understand the meaning of the query and select the best information source. This approach focuses on the semantic content of the question rather than relying on predefined rules or query structure.
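The sketch below shows one way semantic routing could work: each candidate source is described in a sentence, the descriptions and the incoming query are embedded, and the query is routed to the most similar source. The `embed` helper is a hypothetical embedding call and the route names are illustrative.

```python
# Semantic routing sketch: pick the information source whose description is
# most similar to the query in embedding space.
import numpy as np

def embed(text: str) -> np.ndarray:
    raise NotImplementedError  # hypothetical call to an embedding model

routes = {
    "hr_policies_kb": "Questions about vacation, payroll, benefits and HR policies.",
    "product_docs_kb": "Questions about product features, APIs and configuration.",
    "tickets_db": "Questions about the status of existing support tickets.",
}
route_vectors = {name: embed(desc) for name, desc in routes.items()}

def route(query: str) -> str:
    q = embed(query)
    def cosine(v: np.ndarray) -> float:
        return float(np.dot(q, v) / (np.linalg.norm(q) * np.linalg.norm(v)))
    return max(route_vectors, key=lambda name: cosine(route_vectors[name]))
```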

Fine-tuning

Instruction fine-tuning 

Instruction fine-tuning is a powerful technique in supervised machine learning that enhances model performance by iteratively comparing the model's output to the ground truth label for a given input. This approach leverages labeled examples, consisting of instructions paired with their expected responses, to improve the model's ability to handle downstream tasks effectively. By modifying the model weights, instruction fine-tuning significantly enhances the zero-shot performance of language models on previously unseen tasks. For those looking to implement this technique on AWS, Amazon SageMaker JumpStart offers an easy and efficient solution for fine-tuning powerful generative models. Using SageMaker JumpStart in conjunction with the SageMaker Python SDK enables rapid scaling of fine-tuning workloads across large distributed clusters of GPU instances. For users seeking maximum flexibility and configurability, the Hugging Face implementation of the Amazon SageMaker Estimator class provides a comprehensive solution. This class, part of the SageMaker Python SDK, facilitates end-to-end training job coordination using SageMaker's robust backend infrastructure, allowing for precise control over the fine-tuning process.

Figure 7 – Instruction Fine-tuning
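A hedged sketch of launching an instruction fine-tuning job with SageMaker JumpStart through the SageMaker Python SDK is shown below. The model ID, hyperparameter names, instance types, and S3 locations are examples only; the exact values supported depend on the JumpStart model you choose, so check its model card before running anything like this.

```python
# Sketch of an instruction fine-tuning job with SageMaker JumpStart.
from sagemaker.jumpstart.estimator import JumpStartEstimator

estimator = JumpStartEstimator(
    model_id="meta-textgeneration-llama-2-7b",  # example JumpStart model ID
    environment={"accept_eula": "true"},        # required for gated models
    instance_type="ml.g5.12xlarge",             # example GPU training instance
    hyperparameters={
        "instruction_tuned": "True",            # example: enable the instruction format
        "epoch": "3",
        "learning_rate": "0.0002",
    },
)

# The training channel points at instruction/response pairs prepared in S3.
estimator.fit({"training": "s3://my-bucket/instruction-dataset/"})  # example path

# After training, the tuned model can be deployed to a real-time endpoint.
predictor = estimator.deploy(initial_instance_count=1, instance_type="ml.g5.2xlarge")
```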

Domain adaptation fine-tuning

Domain adaptation fine-tuning is a powerful technique used to enhance the performance of language models in specific contexts. By leveraging proprietary or domain-specific unsupervised data, this approach allows developers to tailor a model's capabilities to a particular field or industry. The process involves fine-tuning the model on a carefully curated dataset, which helps it learn the nuances, terminology, and patterns unique to that domain. As a result, the model's response quality is significantly improved, incorporating industry-specific jargon and producing more relevant and accurate outputs. For those looking to implement this technique on AWS, Amazon SageMaker JumpStart supports domain adaptation fine-tuning of FMs.  This specialized training enables the model to better understand and generate content within the target domain, making it an invaluable tool for businesses and organizations seeking to deploy AI solutions tailored to their specific needs and knowledge areas.

Figure 8 – Domain Adaptation Fine-tuning

Parameter efficient fine-tuning

Parameter Efficient Fine-Tuning (PEFT) has emerged as a groundbreaking approach in the field of machine learning, offering a solution to the computational and storage challenges associated with traditional fine-tuning methods. By focusing on adjusting only a small subset of model parameters, PEFT achieves results comparable to full fine-tuning for downstream tasks while significantly reducing resource requirements. This technique allows for either freezing most of the model parameters and fine-tuning the remaining ones, or freezing all parameters and introducing new layers or parameters for fine-tuning. The versatility of PEFT is evident in the variety of techniques explored in the paper "Scaling Down to Scale Up: A Guide to Parameter-Efficient Fine-Tuning." While these techniques differ in implementation, they generally adhere to the principle of preserving the original model's parameters while extending or replacing layers through the training of a smaller set of additional parameters. Among the most widely adopted PEFT methods are those falling under the additive and reparameterization categories, including Prompt tuning, LoRA (Low-Rank Adaptation), and QLoRA (Quantized Low-Rank Adaptation). These approaches have gained popularity due to their effectiveness in balancing performance improvements with computational efficiency.

Figure 9 – Parameter Efficient Fine-Tuning (PEFT)

  • LoRA: Low-Rank Adaptation of Large Language Models:

LoRA, or Low-Rank Adaptation, represents a significant advancement in the fine-tuning of large language models. As detailed in the LoRA: Low-Rank Adaptation of Large Language Models paper, this innovative approach tackles the challenges of adapting extensive neural networks to specific tasks while minimizing computational and storage demands. By introducing pairs of rank-decomposition matrices and freezing the original model weights, LoRA dramatically reduces the number of trainable parameters. This technique not only decreases the storage requirements for task-specific adaptations but also facilitates efficient task-switching during deployment without introducing additional inference latency. Remarkably, LoRA has demonstrated superior performance compared to other adaptation methods, including adapters, prefix-tuning, and traditional fine-tuning.
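A minimal sketch of applying LoRA with the Hugging Face peft library is shown below: the base model's weights are frozen and small low-rank adapter matrices are added to selected projection layers. The base model name, target modules, and rank settings are illustrative choices rather than a tuned recipe.

```python
# LoRA sketch with the Hugging Face peft library: freeze the base weights and
# train only small rank-decomposition adapter matrices.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")  # example model

lora_config = LoraConfig(
    r=8,                                   # rank of the decomposition matrices
    lora_alpha=16,                         # scaling factor applied to the LoRA updates
    target_modules=["q_proj", "v_proj"],   # example: adapt the attention projections
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # typically well under 1% of the base parameters
```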

  • QLoRA: Efficient Finetuning of Quantized LLMs:

QLoRA (Quantized Low-Rank Adaptation) is an advanced fine-tuning technique for large language models (LLMs) that significantly reduces memory requirements without compromising performance. Detailed in the QLoRA: Efficient Finetuning of Quantized LLMs paper, it builds upon the LoRA method by quantizing the model's weight parameters from the standard 32-bit format down to 4-bit precision, drastically reducing the memory footprint and enabling fine-tuning on a single GPU. QLoRA introduces three key innovations: 4-bit NormalFloat, an optimal quantization data type for normally distributed data; Double Quantization, which further compresses the quantization constants; and Paged Optimizers, which leverage NVIDIA unified memory to manage gradient checkpointing. These advancements make it possible to run and fine-tune LLMs on less powerful hardware, including consumer GPUs, opening up new possibilities for researchers and developers with limited computational resources.
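The hedged sketch below shows a typical QLoRA setup using the Hugging Face stack: the base model is loaded in 4-bit NF4 with double quantization via bitsandbytes, prepared for k-bit training, and then LoRA adapters are trained on top. The model name and configuration values are illustrative, not a tuned recipe.

```python
# QLoRA sketch: 4-bit NF4 quantization of the base model plus LoRA adapters.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",           # 4-bit NormalFloat quantization
    bnb_4bit_use_double_quant=True,      # quantize the quantization constants too
    bnb_4bit_compute_dtype=torch.bfloat16,
)

base = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",          # example model
    quantization_config=bnb_config,
    device_map="auto",
)

base = prepare_model_for_kbit_training(base)
model = get_peft_model(base, LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM"))
```

Only the small LoRA adapters are kept in higher precision and updated during training, which is what allows the frozen 4-bit base model to fit on a single GPU.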

The figure below, from the QLoRA: Efficient Finetuning of Quantized LLMs paper, compares different fine-tuning methods and their memory requirements. QLoRA improves over LoRA by quantizing the transformer model to 4-bit precision and using paged optimizers to handle memory spikes.

Figure 10 – Different fine-tuning methods and their memory requirements (Source: https://arxiv.org/abs/2305.14314)

RLHF

Fine-tuning with instructions enhances a model's ability to comprehend and respond to human-like prompts, improving overall performance. However, this method alone does not guarantee the elimination of undesirable, false, or potentially harmful outputs. To address this limitation, Reinforcement Learning from Human Feedback (RLHF) is often employed as an additional fine-tuning step. RLHF utilizes human annotations to guide the model in aligning with human values and preferences, typically implemented after other fine-tuning techniques, including instruction-based methods. This combined approach aims to create more reliable and human-aligned AI models.

The following image shows an overview of the RLHF learning process.

Figure 11 – RLHF learning process (Source: https://aws.amazon.com/what-is/reinforcement-learning-from-human-feedback/)

Building and Customizing Gen AI using AWS services

When considering whether to use Amazon Bedrock or Amazon SageMaker for building and customizing generative AI applications, it's important to understand their distinct purposes. Bedrock is designed to be the most accessible entry point for generative AI, making it ideal for developers focused on building applications on top of existing AI models. On the other hand, SageMaker is the go-to platform for those who want to delve deeper into the models themselves, offering capabilities for fine-tuning, pushing the boundaries of open source and proprietary models, or even training models from scratch. For customers seeking a middle ground that combines SageMaker's performance with an easy starting point, Amazon offers SageMaker JumpStart. The following figure describes which service and approach to follow based on the use case.

Figure 12 – Building and customizing generative AI applications on AWS

When deciding how to customize foundation models for specific tasks, it's essential to consider the nature of the task and the data requirements. If the task requires context from external data, the next consideration is whether real-time data access is necessary. For relatively static data, such as FAQs or documents, retrieval augmented generation (RAG) is ideal. However, if real-time data changes and tool integration is needed, Agents for Amazon Bedrock is the best approach. Combining agents and knowledge bases can create powerful capabilities. For simpler tasks using historical data that perform well with pre-trained models, prompt engineering can be highly effective. Lastly, for more complex tasks with historical data that require specific training, model fine-tuning is the most appropriate method. It's important to note that these approaches are not mutually exclusive and can be combined to create robust solutions tailored to specific use cases. The following figure describes an example of the common techniques to use for customizing FM using Amazon Bedrock.

Figure 13 – Customizing FM using Amazon Bedrock

Amazon Bedrock's knowledge bases offer a powerful solution for implementing Retrieval Augmented Generation (RAG), a technique that enhances Large Language Model (LLM) responses with information from external data sources. By setting up a knowledge base with your specific data, applications can leverage this resource to generate more accurate and contextually relevant answers. This approach allows for the creation of natural language responses or direct quotations from the queried sources, providing a versatile tool for information retrieval and generation. The integration of knowledge bases in Amazon Bedrock streamlines the development process, offering an out-of-the-box RAG solution that significantly reduces application build time and accelerates time to market. Moreover, this method proves to be cost-effective by eliminating the need for continuous model retraining to incorporate private data, making it an efficient choice for businesses looking to harness the power of LLMs while maintaining control over their proprietary information.
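The hedged sketch below shows how an application might query a Bedrock knowledge base with the RetrieveAndGenerate API, where retrieval from the knowledge base and answer generation happen in a single call. The knowledge base ID and model ARN are placeholders you would replace with your own resources.

```python
# Sketch of querying an Amazon Bedrock knowledge base via RetrieveAndGenerate.
import boto3

client = boto3.client("bedrock-agent-runtime")

response = client.retrieve_and_generate(
    input={"text": "What is our refund policy for damaged items?"},
    retrieveAndGenerateConfiguration={
        "type": "KNOWLEDGE_BASE",
        "knowledgeBaseConfiguration": {
            "knowledgeBaseId": "KB1234567890",  # placeholder knowledge base ID
            "modelArn": "arn:aws:bedrock:us-east-1::foundation-model/anthropic.claude-3-sonnet-20240229-v1:0",  # example
        },
    },
)

print(response["output"]["text"])        # generated answer grounded in the retrieved chunks
for citation in response["citations"]:   # source passages used to ground the answer
    print(citation["retrievedReferences"])
```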

Conclusion

In conclusion, this article has provided an in-depth look at the Recipe Store, which forms the third phase of the Development Stage in Eviden's 10-step GenOps Framework. The Recipe Store plays a crucial role in the overall process, contributing to the efficiency and effectiveness of the framework. As we continue to explore the GenOps Framework, our next article will focus on the fourth and final phase of the Development Stage: the data stores phase.
