H2O Gen AI Ecosystem Overview - Level 1 - Slide Deck

H2O.ai Confidential
Intro to
Andreea Turcu
Head of Global Training @H2O.ai

Table of Contents
1. What are Large Language Models (LLMs)?
2. Steps in Building LLMs
3. Importance of Data Cleaning for LLMs
4. What is LLM DataStudio? (+ Interface Demo)
5. Generate a Clean Dataset from a PDF File (Doc2QA)
Quizzes
Q&As

1. Definition of LLMs
Large Language Models (LLMs) are sophisticated artificial intelligence models specifically designed to understand and generate human-like language on an extensive scale.

1. Training, Patterns and Parameters
Extensive Training Data: Trained on massive
textual datasets from diverse sources
Pattern and Meaning Learning: They absorb
knowledge of words, sentence patterns and
meanings
Significant Parameters: Referred to as "large" due
to a substantial number of parameters
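As a rough illustration of what "a substantial number of parameters" means, the sketch below estimates the size of a decoder-only transformer from its layer count and hidden width. The formula (about 12 x layers x width^2 for the transformer blocks, plus the token embedding table) is a common back-of-the-envelope approximation and is not stated in this deck. The GPT-3 configuration used (96 layers, width 12288, vocabulary 50257) comes from the published GPT-3 description, and the function name is hypothetical.

```python
def approx_transformer_params(n_layers: int, d_model: int, vocab_size: int) -> int:
    """Back-of-the-envelope parameter count for a decoder-only transformer.

    Assumption: each block holds about 12 * d_model^2 weights
    (4 * d_model^2 for attention, 8 * d_model^2 for the feed-forward part).
    Biases, layer norms and position embeddings are ignored.
    """
    block_params = 12 * n_layers * d_model ** 2
    embedding_params = vocab_size * d_model
    return block_params + embedding_params


# GPT-3 sized configuration (assumed values, see the note above).
gpt3_estimate = approx_transformer_params(n_layers=96, d_model=12288, vocab_size=50257)
print(f"Estimated parameters: {gpt3_estimate / 1e9:.1f} billion")
```

The estimate lands close to the 175 billion parameters usually quoted for GPT-3, which is the scale the word "large" refers to.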

1. Generative AI vs. LLMs
Source: gpt.h2o.ai

1. LLMs vs. Foundation Models
[Diagram. Foundation Models: unlabeled training data, transformer algorithm, foundation model. Large Language Models (LLMs): additional text-based data, transformer algorithm, LLM.]
Foundation Model:
- Large machine learning model trained on
unlabeled data.
- Enhanced through transformer algorithms and
fine-tuning.
- Adaptable to various applications.
Large Language Model (LLM):
- Specific type of foundation model.
- Tailored for natural language processing tasks.
- Examples include GPT models (e.g., GPT-3).

1. Benefits of LLMs
1. Natural Language Processing (NLP)
2. Versatility Across Diverse Domains
3. Elevated Creativity in Content Generation
4. Facilitating Global Communication Breakthroughs
5. Information Extraction Efficiency

1. Challenges of LLMs
1. Computational Resources
2. Energy Consumption
3. Fine-tuning Complexity
4. Data Privacy Concerns
5. Interpretability of Output

2. The LLMs Lifecycle
1. Data Collection
2. Preprocessing
3. Model Architecture Design
4. Training the Model
5. Fine-tuning
6. Validation and Evaluation
7. Deployment
8. Monitoring

2. Preprocessing is important

2. Gold in - Gold Out

2. Building Steps for LLMs
01 Foundation: Powerful language models trained on extensive text data, forming the basis for various language tasks.
02 DataPrep: Converting documents into instruction pairs, like QA pairs, facilitating fine-tuning and tasks.
03 Fine-tuning: Refining pre-trained models using task-specific data, enhancing their performance on targeted tasks.
04 Eval LLMs: Thoroughly assessing and comparing LLMs is increasingly vital due to their heightened significance and complexity.
05 Database & Applications: Optimize data usage by seamlessly integrating new PDFs into the database, eliminating the need for model retraining. Improve user experiences through advanced language comprehension and LLM-driven response generation.
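To make the DataPrep idea of "instruction pairs, like QA pairs" concrete, here is a minimal sketch of how such pairs might be stored for fine-tuning. The record keys (`instruction`, `output`), the function names, and the JSON Lines layout are assumptions chosen for illustration, not a format prescribed by this deck. In practice the questions and answers would be generated from the documents, whereas here they are written by hand.

```python
import json


def build_records(qa_pairs):
    """Turn (question, answer) tuples into instruction-style records.

    Pairs with an empty question or answer are skipped.
    """
    records = []
    for question, answer in qa_pairs:
        question, answer = question.strip(), answer.strip()
        if not question or not answer:
            continue
        records.append({"instruction": question, "output": answer})
    return records


def to_jsonl(records):
    """Serialize records as JSON Lines: one JSON object per line."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records)


pairs = [
    ("What is a foundation model?",
     "A large machine learning model trained on unlabeled data."),
    ("What does fine-tuning do?",
     "It refines a pre-trained model using task-specific data."),
    ("   ", "An answer without a question is dropped."),
]
records = build_records(pairs)
print(to_jsonl(records))
```

Keeping one record per line makes it easy to append new pairs and to filter out bad ones before training.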

2. Emphasis on the DataPrep Stage
DataPrep: Converting documents into instruction pairs, like QA pairs, facilitating fine-tuning and tasks.

3. Key Benefits of Data Cleanliness in Language Models
1. Improved Model Performance
2. Mitigated Bias and Unwanted Influences
3. Consistency and Coherence
4. Enhanced Generalization
5. Ethical Considerations
6. Improved User Experience and Trust

3. Key Aspects in DataPrep for LLMs

4. Definition of LLM DataStudio

H2O AI and GenAI Ecosystem
[Diagram. Recoverable labels: Documents; Data Sources; Alternative Datasets; LLM DataStudio; ETL for LLMs; Data to QA pairs; LLM Fine Tuning; myGPT; Custom GPT; LLM EvalStudio; Continuous Eval (feedback); Vector DB (Embeddings); Query + Documents (Context); Contextual Similarity; Talk to Your Data (Question Answers, Context Search, Doc Retrieval, Similar Doc, Personalization); Gen AI App Store; Datasets; AI Engines; AI Apps; LLM; Integration; Models; API; End User; Enterprise.]
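The "Vector DB (Embeddings)", "Query + Documents (Context)" and "Contextual Similarity" labels in the diagram point at retrieval by similarity: documents are embedded, the query is embedded the same way, and the closest documents are passed to the LLM as context. The following is a minimal sketch of that idea only. It assumes a toy bag-of-words embedding in place of a learned embedding model and an in-memory list in place of a vector database, and all function names are hypothetical.

```python
import math
from collections import Counter


def embed(text):
    """Toy embedding: word counts. Real systems use learned dense vectors."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a if token in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query, documents, k=1):
    """Return the k documents most similar to the query."""
    query_vec = embed(query)
    ranked = sorted(documents, key=lambda doc: cosine(query_vec, embed(doc)), reverse=True)
    return ranked[:k]


docs = [
    "fine-tuning refines a pre-trained model on task-specific data",
    "data cleaning removes noise before training",
    "deployment exposes the model through an api",
]
question = "how does fine-tuning use task-specific data"
context = retrieve(question, docs, k=1)

# The retrieved text is placed in the prompt so the LLM can answer from it.
prompt = f"Context: {context[0]}\nQuestion: {question}"
print(prompt)
```

Because the documents are looked up at query time, new documents can be added to the store without retraining the model.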

4. Enhancing LLM Data with LLM DataStudio
LLM DataStudio features:
● Q&A generation from text and audio data
● Text Cleaning
● Data Quality Issue Detection
● Tokenization
● Text Length Control

4. Interface Demo

4. Demo - Curate
H2O.ai LLM Studio Website
https://h2o.ai/platform/ai-cloud/make/llm-studio/

Structured Data Preparation
Workflow in LLM DataStudio
LLM DataStudio follows a structured data
preparation process.
The process includes several stages:
❏ Data intake
❏ Workflow construction
❏ Configuration
❏ Assessment
❏ Result generation

5. The Workflow Builder - Demo

5. Demo - Generate a Clean Dataset
from a PDF File (Doc2QA)
A Comprehensive Overview of Large Language Models
https://arxiv.org/pdf/2307.06435.pdf

Thank you!

LLM Studio
Overview
Andreea Turcu
Customer Data Scientist
@H2O.ai

Table of Contents
What are LLMs?
Foundation vs. Fine-tuning
LLM Studio Intro
Demo / Follow along:
Connect to LLM Studio
The LLM Studio GUI
Launching an Experiment
Monitoring the Experiment
Next Steps with LLM Studio (model export)

A large language model is a type of AI
algorithm trained on huge amounts of text
data that can understand and generate
text.

LLMs can be characterized by 4 parameters:
● size of the training dataset
● cost of training
● size of the model
● performance after training

Let’s follow along!

Intro to h2oGPT
by Andreea Turcu

Agenda
A bit of context
What are GPTs?
Why know what LLMs are?
LLMs origins
What is h2oGPT?
Boosting your productivity with h2oGPT
Limitations of Existing models
Benefits of Open Source models
Demo of h2oGPT

What are GPTs?

Why should I know what LLMs are?
Large language models like GPT have diverse business uses:
● automating content
● extracting insights from data,
● personalizing marketing,
● enabling virtual assistants,
● analyzing data,
● facilitating voice-based interactions and translations, etc.

What are LLMs?
- LLMs (Large Language Models) are computational models for understanding and generating human language.
- They are trained on vast amounts of text data.
- LLMs learn grammar, vocabulary, and contextual relationships.
- They can generate coherent and contextually relevant text based on given prompts.
- Collaboration with AI systems becomes more efficient.
- Responsible use and enhanced user experiences can be achieved.

LLM Origins
Transformers are deep feed-forward neural networks that leverage a machine learning mechanism called (self) attention and have seen wild success in natural language processing problems.
● 2023, h2oGPT: The world’s best completely open source LLM, permissible for commercial use
● 2022, ChatGPT: Interactive interface for users to interact directly with GPT3 and GPT4 modeling frameworks
● 2020, GPT: Auto-regressive language modeling where the goal is to predict the next token
● 2019, BERT: Bidirectional Encoder Representations from Transformers. Model designed to recover masked tokens
● 2017, Encoder-Decoder (Seq2Seq): Original Transformer architecture for machine translation or sequence-to-sequence problems
Reference: https://arxiv.org/pdf/2207.09238.pdf
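The (self) attention mechanism mentioned above can be sketched in a few lines. This is a simplified, pure-Python version of scaled dot-product attention: it omits the learned query, key and value projections and the multiple heads that real Transformers use, and the function names are hypothetical. The `causal` flag shows the difference between the two model families in the timeline: with the mask on, each position only sees earlier positions (auto-regressive, GPT style), and with it off every position sees the whole sequence (bidirectional, BERT style).

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(queries, keys, values, causal=False):
    """Scaled dot-product attention over lists of equal-length vectors."""
    dim = len(keys[0])
    outputs = []
    for i, query in enumerate(queries):
        scores = []
        for j, key in enumerate(keys):
            if causal and j > i:
                # Masked: a position may not look at later tokens.
                scores.append(float("-inf"))
            else:
                dot = sum(q * k for q, k in zip(query, key))
                scores.append(dot / math.sqrt(dim))
        weights = softmax(scores)
        # Output is the attention-weighted average of the value vectors.
        outputs.append([
            sum(w * value[c] for w, value in zip(weights, values))
            for c in range(len(values[0]))
        ])
    return outputs


tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
causal_out = self_attention(tokens, tokens, tokens, causal=True)
full_out = self_attention(tokens, tokens, tokens, causal=False)
print(causal_out[0], full_out[0])
```

With the causal mask, the first token can only attend to itself, so its output equals its own value vector. Without the mask, its output also mixes in the later tokens.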

What is h2oGPT?

AI Will Boost Productivity by 10x
Timeline: up to 2021, 2022, 2023, 2024, 2025
● Continuous but slow improvements in automation and productivity. Productivity in the US has increased by 250% in 70 years.*
● In addition to small specialized models, LLMs are supporting employees in their daily tasks: brainstorming, coding, summarization, analysis.
● No Code and AutoML enable all companies to build and use highly accurate models for specialized tasks.
● 1-Click to solve complex business goals.
● AI is used in automated mode. Employees are supervising their AI co-workers. Robotics leaps forward by incorporating LLMs.
*2020 | MIT Work of the Future

Limitations of Existing Models
Although popular models such as OpenAI's ChatGPT/GPT-4, Anthropic's Claude, Microsoft's Bing AI Chat, Google's Bard, and Cohere are powerful and effective, they have certain limitations compared to open-source LLMs:
1. Data Privacy and Security: Using hosted LLMs requires sending data to external servers. This can raise concerns about data privacy, security, and compliance, especially for sensitive information or industries with strict regulations.
2. Dependency and Customization: Hosted LLMs often limit the extent of customization and control, as users rely on the service provider's infrastructure and predefined models.
3. Cost and Scalability: Hosted LLMs usually come with usage fees, which can increase significantly with large-scale applications.
4. Access and Availability: Hosted LLMs may be subject to downtime or limited availability, affecting users' access to the models.

Benefits of Open Source Models
1. Cost-effective: Users can scale the models on their own infrastructure without incurring additional costs from the service provider.
2. Flexible: Models can be deployed on-premises or on private clouds, ensuring uninterrupted access and reducing reliance on external providers.
3. Tunable: Users can tailor the models to their specific needs, deploy on their own infrastructure, and even modify the underlying code.
Overall, open-source LLMs offer greater flexibility, control, and cost-effectiveness, while addressing data privacy and security concerns. They foster a competitive landscape in the AI industry and empower users to innovate and customize models to suit their specific needs.

h2oGPT
The world’s best open source GPT
● Released as open source under Apache-2.0 license
● Active development: h2oai/h2ogpt
● See a demo
○ gpt.h2o.ai
○ 🤗 Hugging Face Spaces
What is it?
● Commercially usable code, data, and models
● Prompt engineering: ability to prepare open-source datasets for tuning LLMs
● Tuning: Code for fine-tuning large language models (currently up to 20B parameters) on commodity hardware and enterprise GPU servers (single or multi node)
Optimizations
■ LoRA (low-rank adaptation)
■ 4-bit and 8-bit quantization for memory-efficient fine-tuning and generation
● Deployable: Chatbot with UI and Python API
● Evaluation: LLM performance evaluation
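The two optimizations named on this slide can be illustrated with small arithmetic sketches. For LoRA, a d x k weight matrix is kept frozen and only two thin matrices of rank r are trained, so the trainable parameter count drops from d * k to r * (d + k). For quantization, the example uses a simple absolute-maximum scheme that maps floats to signed integers. Both are simplified illustrations under assumed sizes (d = k = 4096, r = 8) and with hypothetical function names. They are not h2oGPT's actual implementation, which relies on dedicated libraries.

```python
def lora_trainable_params(d, k, r):
    """Trainable weights when a d x k matrix is adapted with rank-r factors."""
    return r * (d + k)


def quantize_absmax(weights, bits=8):
    """Map floats to signed integers using one scale for the whole list."""
    qmax = 2 ** (bits - 1) - 1
    largest = max(abs(w) for w in weights)
    if largest == 0:
        return [0] * len(weights), 1.0
    scale = largest / qmax
    return [round(w / scale) for w in weights], scale


def dequantize(quantized, scale):
    """Recover approximate float values from the integers."""
    return [q * scale for q in quantized]


full = 4096 * 4096
lora = lora_trainable_params(4096, 4096, r=8)
print(f"full fine-tuning: {full} weights, LoRA: {lora} weights")

weights = [0.5, -1.0, 0.25]
quantized, scale = quantize_absmax(weights, bits=8)
restored = dequantize(quantized, scale)
print(quantized, restored)
```

The design trade-off is the same in both cases: a small loss of precision or expressiveness in exchange for much lower memory use, which is what makes tuning feasible on commodity hardware.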

https://gpt-gm.h2o.ai/
https://gpt.h2o.ai/
Demo of h2oGPT!
Disclaimer:
subject to modification
and updates

Explore
H2O GenAI App Store
Andreea Turcu

Table of Contents
Introduction
1. Why Generative AI?
2. H2O Generative AI Ecosystem
3. Generative AI Applications
4. H2O GenAI App Store Demo
Wrapping Up

Introduction

Why Generative AI?
● Society
● Company
● Individual

Benefits of Generative AI
1. Content Creation
2. Creative Assistance
3. Natural Language Understanding
4. Personalization
5. Data Augmentation
6. Automation of Repetitive Tasks
7. Language Translation
etc.

H2O Generative AI
Ecosystem

H2O.ai Enterprise GenAI Platform
[Diagram. Recoverable labels: Documents as Data Sources; Ingestion; LLM DataStudio; ETL for LLMs; Data to QA pairs; LLM Fine Tuning; myGPT; EvalStudio; Continuous Eval (feedback); Vector DB (Embeddings); Contextual Similarity; R. A. G.; Talk to your Data (Question Answering, Context Search, Information Retrieval, Similar Documents, User Personalization); AI for Documents; GenAI AppStore; GenAI AppStudio; Prompt Studio; LLMs; Models; Integration; Training; Deployment; LLMOps; API; End Users.]

Generative AI
Applications

Possible Applications
● Content Generation
● Layout Design
● Image and Icon Generation
● Auto-Completion and Suggestions
● Personalization
● Chatbots and Conversational UIs
● Adaptive UIs
● Dynamic Theming
● Accessibility Features
● Prototyping and Design Exploration

H2O.ai Enterprise GenAI Platform
[Diagram detail. Recoverable labels: GenAI App Store; GenAI AppStudio; LLMs; Models; Integration; Training; Deployment.]

Why GenAI Apps?
What does it take to solve a specific problem?
● Custom inputs
● Custom prompts
● Custom LLMs (when needed)
● Custom data
● Management of all of the above

H2O GenAI App Store

Gen AI App Store
Apps Powered by LLMs (h2oGPT + myGPT) | Demos
● Investment, Scam Shield: LLM-based scam prevention service
● Investment, Virtual Advisor: LLM-based conversation support services
● Sales, Strategy Engine: LLM-based strategy generator
● Sales, Report Generator: LLM-based report generator
● Trading, Language Assist: LLM-based multilingual assistance
● Asset Management, Risk Manager: Gen AI risk assessment and allocation
● Asset Management, Recommender: Gen AI product recommendations
● Legal, Regulator: LLMs for regulatory filings
● Legal, Legal Assist: Automated regulatory reporting using GenAI
● Operations, Credit Scorer: LLMs for credit scoring and underwriting
● Operations, Transaction Monitor: LLM for transaction monitoring
● Security, Guard Rails: LLM in security
Gen AI applications powered by LLMs provide faster information retrieval and search from complex datasets, models, and their outputs.
LLMs are blended with typical statistical and traditional models to provide rich outputs enhanced by LLM capabilities. This includes summarization, question answering, Talk to your Data + Documents, and generating feature stories.

REPEATABLE AI / DATA USE-CASES
Packaged as AI Apps
Apps: Multiple AI Data Science Use Cases
Scam Shield
H2O.ai Entity Extraction in Legal Documents
Customer Churn Detection
Anomaly Detection
Fraud Analysis
Know Your Customer
H2O Document Insights
Next Best Conversation
Customer Profiling
Market Basket Analysis
Customer 360
GenAI App Store
powered by H2O AI Cloud
H2O GenAI App Store made public

Demo Time!
genai.h2o.ai

Wrapping Up

Explore
H2O LLM EvalGPT
Andreea Turcu

Table of Contents
1. What are LLMs?
2. Why Evaluate LLMs?
3. What is H2O EvalGPT?
4. H2O EvalGPT User Interface
5. Conclusion

What are LLMs?

Why are LLMs important?
1. Natural Language Understanding
2. Text Generation
3. Automation and Efficiency
4. Advancements in AI Research
5. Ethical Consideration

LLMs are reshaping society!
● Transforming Communication
● Augmenting Human Abilities
● Ethical and Societal Implications
● Economic Impact

Why Evaluate LLMs?

Key aspects of LLM evaluation (I)
1. Performance Metrics
2. Benchmarking
3. Fine-Tuning and Transfer Learning
4. Robustness and Generalization
5. Bias and Fairness

Key aspects of LLM evaluation (II)
6. Computational Efficiency
7. Interpretability and Explainability
8. Domain-Specific Evaluation
9. User Feedback and Human Evaluation

What is H2O
EvalGPT?

Foundations of a GenAI Ecosystem
1. Problem Definition
2. Data Collection
3. Data Preprocessing
4. Predictive ML
5. Fine Tuning
6. Evaluation
7. Integrations
8. GenAI Apps
[Diagram. Recoverable labels: Datasets; Unstructured Datasets; Documents; LLM Data Studio; ETL / Prep for LLMs; Documents → QA Pairs; Fine Tuning LLMs (& Prompts); Prompt Tuning; Prompt Engineering; Parsing, Chunking, Indexing, Embeddings; Vector DB (Embeddings); myGPT; R. A. G.; Talk to your Data; Document QA; Document Chat; Image/Video Chat; LLM Query; LLM Agents; LLM Workers; Chat / QA; AI Engines; EvalStudio; EvalGPT; Continuous Feedback; AI Apps + LLMs; GenAI Apps; GenAI AppStudio; Integration; LLMOps; MLOps; API; End Users.]

H2O EvalGPT:
● Assess and compare Large Language Models (LLMs) across tasks.
● Get detailed leaderboard results to streamline workflows.
We evaluate LLMs using business data and will offer model submissions soon.

Key Features
● Relevance
● Transparency
● Speed and Currency
● Scope
● Interactivity and Alignment

H2O EvalGPT User
Interface
evalgpt.ai

Elo Ranking
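The slide gives no detail beyond its title, so the following is only a sketch of the standard Elo update used for pairwise comparisons in general. Whether H2O EvalGPT uses exactly this formula is an assumption, as are the K-factor of 32, the starting rating of 1000, and the function names.

```python
def expected_score(rating_a, rating_b):
    """Probability that A beats B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(rating_a, rating_b, score_a, k=32.0):
    """Return new ratings after one comparison.

    score_a is 1.0 if A's response was preferred, 0.0 if B's was, 0.5 for a tie.
    """
    exp_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - exp_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - exp_a))
    return new_a, new_b


# Two models start level; model A's response wins one A/B comparison.
model_a, model_b = update_elo(1000.0, 1000.0, score_a=1.0)
print(model_a, model_b)
```

Repeating this update over many pairwise comparisons yields a leaderboard in which a win against a higher-rated model counts for more than a win against a lower-rated one.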

Evaluation Method

A/B Testing

Prompts

Responses

LLMs from A to Z
(data prep, building & deployment)
with H2O.ai
Audrey Létévé, Senior Customer Data Scientist
21st May 2024

Agenda
• Intro
– H2O Gen AI ecosystem and our AI-powered search assistant: Enterprise h2oGPTe
– When and why to fine-tune your own LLM?
• Preparing your data for fine-tuning
• Using open-source H2O LLM Studio to train your own LLM
• Deployment with H2O MLOps

[Diagram. Recoverable labels: Unstructured Datasets; Documents; LLM Data Studio; ETL / Prep for LLMs; Documents → QA Pairs; Fine Tuning LLMs; Prompt Engineering; Parsing, Chunking; Vector DB (Embeddings); myGPT; R. A. G. System; Talk to your Data; Document QA; Document Chat; Image/Video Chat; LLM Query; Chat / QA; LLM Workers; EvalStudio; Continuous Feedback; LLMOps; MLOps; API; End Users.]

Do you need to train an LLM for your task?
Rules of thumb:
1. Don’t train to memorise facts
2. Start by trying an off-the-shelf LLM
3. Train to improve on desired task, domain, or style
Options, in order of increasing technical effort:
1. Use “off-the-shelf” LLM as-is: In many cases, your use case may work well with an “off-the-shelf” LLM without any changes.
2. Use prompt engineering: Experiment with how you ask an LLM to solve your use case. Small tweaks in phrasing may boost task performance dramatically.
3. Fine-tune an LLM: Further train an LLM with a dataset of task prompt-answers. Start with a dataset size in the hundreds and increase if necessary.

Demo
Documentation

Demo
Documentation
LLM
Data Studio
