SlideShare a Scribd company logo
Leveraging the Power of ChatGPT
and Vector Databases in the
FreeBSD Expert System
Yan-Hao Wang, AsiaBSDCon 2024
Who Am I
My name is Yan-Hao Wang, a senior high student in Taiwan and FreeBSD
Taiwan intern since 2022.
I've been involved in various tasks such as
1. Developing an online document/man-page editor.
2. Crafting tests for command utilities like gunion(8) and printenv(1).
3. Translating FreeBSD documents.
GitHub Repository
All codes have been uploaded to the freebsd_data repository. The slide will also
be put on it. If you're interested, you can access them there.
Outline
1. Introduction of the Expert System
2. Introduction of ChatGPT
3. Development Process
a. Data Cleaning and Extraction
b. Embedded Model and Vector Database
c. Integration with ChatGPT
4. OpenAI GPTs as Potential Replacements
5. Summary
Expert System
Expert system is a system that can answer user questions accurately in a specific
domain. It consists of two parts
1. Knowledge Base: stores all the relevant information related to the domain of
expertise.
2. Rule Engine: Contain some predefined rules by the data scientist. It processes
the user's questions and applies rules to generate accurate responses.
Expert System
Modern expert systems use machine learning to simulate the behavior or
judgment of domain experts.
ML model
ChatGPT
ChatGPT (Chat Generative Pre-trained Transformer) is a chatbot developed by
OpenAI and launched on November 30, 2022. Based on a large language model
(LLM).
FreeBSD Expert System
There are multiple ways to build a FreeBSD expert system.
1. Train a new ML model with FreeBSD data.
No. I am not an ML expert and it costs a lot.
FreeBSD Expert System
There are multiple ways to build a FreeBSD expert system.
1. Train a new ML model with FreeBSD data.
No. I am not an ML expert and it costs a lot.
2. Use the existing model such as ChatGPT.
But we definitely won’t call ChatGPT a FreeBSD expert system.
FreeBSD Expert System
ChatGPT uses amount of data for training. So he can answer problems in every
domain though may not be correct. It's more like a general-purpose system.
FreeBSD Expert System
ChatGPT uses amount of data for training. So he can answer problems in every
domain though may not be correct. It's more like a general-purpose system.
The limitation of why ChatGPT can’t be called a FreeBSD expert system
1. Chatgpt may tendency to hallucinate answers when asked about unfamiliar
domains.
2. The data is not new enough (ChatGPT uses data before 2021 to train). So he
can’t answer the newest question.
FreeBSD Expert System
There are two ways to handle the limitation.
1. Fine-tune. fine-tuning is a process that takes a model that has already been
trained for one given task and then tunes or tweaks the model to make it
perform a second similar task.
FreeBSD Expert System
There are two ways to handle the limitation.
First way is, fine-tune. fine-tuning is a process that takes a model that has already
been trained for one given task and then tunes or tweaks the model to make it
perform a similar task.
OpenAI has supplied this API. For the open-source model, you should use Pytorch
and TensorFlow to handle it.
FreeBSD Expert System
However, fine-tuning is still hard for AI-unfamiliar developer. And It also cost a lot.
The second way is Retrieval Augmented Generation (RAG). Basically, it is just like
when you use ChatGPT, you can provide related info about your question, and it
can provide a much more accurate response.
This is an acceptable way, so we will use the embedded model and vector
database to achieve this.
Embedded Model
It is a type of ML model used to convert input data, such as words or sentences,
into numerical representations called embedding vector or vector.
These embeddings capture the semantic meaning or context of the input data in a
continuous vector space. It can work on tasks such as text classification and
sentiment analysis.
Vector Database
Vector databases are designed to store vectors efficiently. These databases
employ various search algorithms to find the most similar vectors, such as t.
Numerous open-source vector databases are available to choose from.
Development Process
Development - Architecture
Development - Data Extraction
Use the simple find command to extract data. The data sources are very different,
we need to convert it to plain text. We use “hs-pandoc” package to convert data.
Development - Data Cleaning
Remove unrelated info, simple find command to remove the unrelated data.
Unrelated text
Development - Data Cleaning
Actually, data cleaning is the most time-consuming step. Data scientists spend
60% of their time cleaning data rather than creating insights.
There are some tools that can help us clean the data.
OpenRefine
Development - Embedded Model
OpenAI has embedded model API, there are multiple open source embedded
models online too. In this project, we use the open source model (“gte-base”).
MTEB Leaderboard - From Hugging Face
OpenAI embedding model
Development - Embedded Model
[0.2, 0.3 … 2.3]
[0.3, 0.6 … 1.7]
[0.9, 0.1 … 3.1]
vector 1
vector 2
vector 3
Development - Embedded Model
We use “gte-base” as our model. Its model size is only 0.22 GB which my small
GPU (NVIDIA GeForce GTX 1050 Ti) can handle it.
It takes only 590MB of GPU memory and 67 minutes to embed all the documents.
Development - Embedded Model
Development - Embedded Model
Development - Embedded Model
There are multiple facts (hyperparameters) we can tune here. For example
1. The length of sentences.
2. What metadata should we leave?
3. What model should we use? Weather we need to tune the embedded model.
All these hyperparameters should be tried multiple times to get the best answer.
The answer will be different with different fields - NFL(No Free Lunch Theorems)
Development - Vector database
As previously said, we have different vector databases.
But in our local test, we just use a file to store the vector and a simple cosine
similarity algorithm. Because our data is not big (< 100 MB).
Development - Query
Question: How to use the gunion command in FreeBSD?
Query result:
1. Man page of gunion
2. Man page of gunion
3. FreeBSD status report (A New GEOM Facility, gunion)
4. Unrelated info …
Development - Query
TOP1
TOP2
TOP3
Development - Integration with ChatGPT
So we need to host an embedded model and vector database and have an open
API to let users use. Then integrate the API with ChatGPT
1. The first way is easy, we just write a Python code to use ChatGPT API and
our API. But this is not friendly to normal users.
Development - Integration with ChatGPT
So we need to host an embedded model and vector database and have an open
API to let users use. Then integrate the API with ChatGPT
1. The first way is easy, we just write a Python code to use ChatGPT API and
our API. But this is not friendly to normal users.
2. Develop ChatGPT plugin, ChatGPT plugin can let us set some API. While
asking questions ChatGPT, it will call the API and get the response.
This is the best practice of our project, the user just needs to enable the
plugin in ChatGPT.
Development - Integration with ChatGPT
OpenAI GPTs as Potential Replacements
GPTs was lauched at November 2023. It provides an easy way to generate a
custom GPT for any data you have. Which becomes a potential replacement for
our project. We only need to upload the data from step 1 and there is a custom
expert system.
On March 19, 2024, you will no longer be able to install new plugins or create new
conversations with existing plugins.
Wiki Future Audiences
The idea is inspired by Wiki. They actually already have developed a plugin. But
they also stopped the plan after the GPTs release.
This timing also coincides with OpenAI’s move away from the plugin marketplace
for ChatGPT, and towards no/low-code customizable GPTs. This shift has made
our plugin in its current form inaccessible to new users and largely redundant.
While we could repurpose this functionality towards being a GPT, we don’t believe
we would learn significantly more beyond how to create a product within the
OpenAI ecosystem.
Lessons learned, ChatGPT has not become the new information seeking
paradigm (yet?).
Summary
Solution RAG GPTs (Custom GPT) ChatGPT Plus (browse internet)
Cost Medium ~ Hard Small Small
Advantage ● Privacy
● Flexibility
● Fast ● Fast
Disadvantage ● Cost ● Privacy
● Flexibility
● Data source are different
● Flexibility
Summary
The significance of LLM is poised to exponentially increase in the future, marking
a pivotal shift in our technological landscape.
While we may not complete the production process in its entirety. But it is a good
thing to focus on any future trends and try to combine them with FreeBSD.
Reference
● What is an Expert System?
● Do data scientists spend 80% of their time cleaning data? Turns out, no?
● Wiki, Talk:Future Audiences

More Related Content

PDF
How to build your in-house ChatGPT
PPTX
Generative AI in CSharp with Semantic Kernel.pptx
PPTX
Introduction to Google App Engine with Python
DOCX
A Decision Tree based Recommendation System for Tourists.docx
PDF
Advanced Virtual Assistant Based on Speech Processing Oriented Technology on ...
PDF
Generative AI leverages algorithms to create various forms of content
PDF
LangChain Intro by KeyMate.AI
PDF
LanGCHAIN Framework
How to build your in-house ChatGPT
Generative AI in CSharp with Semantic Kernel.pptx
Introduction to Google App Engine with Python
A Decision Tree based Recommendation System for Tourists.docx
Advanced Virtual Assistant Based on Speech Processing Oriented Technology on ...
Generative AI leverages algorithms to create various forms of content
LangChain Intro by KeyMate.AI
LanGCHAIN Framework

Similar to wang-Leveraging-the-Power-of-ChatGPT-and-Vector-Databases-in-the-FreeBSD-Expert-System-slides.pdf (20)

PDF
ChatGPT usage in software development - curse or boon.pdf
DOCX
summer file - Copy
PDF
AI in Drupal: Evolution, Modules and Possibilities
PPTX
Machine Learning
PPTX
Past, Present and Future of Generative AI
PDF
All in AI: LLM Landscape & RAG in 2024 with Mark Ryan (Google) & Jerry Liu (L...
PDF
IRJET - A Study on Building a Web based Chatbot from Scratch
PDF
Igniting Next Level Productivity with AI-Infused Data Integration Workflows
PDF
Igniting Next Level Productivity with AI-Infused Data Integration Workflows
PDF
ChatGPT and AI for web developers - Maximiliano Firtman
PDF
Java and graal vm to easily deploy your machine learning services
PPTX
MuleSoft + Augmented Reality & ChatGPT
PDF
OSMC 2023 | Experiments with OpenSearch and AI by Jochen Kressin & Leanne La...
PDF
Distributed Tracing
PDF
Connector Corner: Extending LLM automation use cases with UiPath GenAI connec...
PPTX
SPOTLIGHT IGNITE (10 MINUTES): THE FUTURE OF DEVELOPER TOOLS: FROM STACKOVERF...
DOCX
Company Visitor Management System Report.docx
DOC
hari_duche_updated
PPTX
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
PPTX
Software Modeling and Artificial Intelligence: friends or foes?
ChatGPT usage in software development - curse or boon.pdf
summer file - Copy
AI in Drupal: Evolution, Modules and Possibilities
Machine Learning
Past, Present and Future of Generative AI
All in AI: LLM Landscape & RAG in 2024 with Mark Ryan (Google) & Jerry Liu (L...
IRJET - A Study on Building a Web based Chatbot from Scratch
Igniting Next Level Productivity with AI-Infused Data Integration Workflows
Igniting Next Level Productivity with AI-Infused Data Integration Workflows
ChatGPT and AI for web developers - Maximiliano Firtman
Java and graal vm to easily deploy your machine learning services
MuleSoft + Augmented Reality & ChatGPT
OSMC 2023 | Experiments with OpenSearch and AI by Jochen Kressin & Leanne La...
Distributed Tracing
Connector Corner: Extending LLM automation use cases with UiPath GenAI connec...
SPOTLIGHT IGNITE (10 MINUTES): THE FUTURE OF DEVELOPER TOOLS: FROM STACKOVERF...
Company Visitor Management System Report.docx
hari_duche_updated
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
Software Modeling and Artificial Intelligence: friends or foes?
Ad

Recently uploaded (20)

PDF
Adobe Illustrator 28.6 Crack My Vision of Vector Design
PDF
System and Network Administraation Chapter 3
PPTX
Oracle E-Business Suite: A Comprehensive Guide for Modern Enterprises
PPTX
Lecture 3: Operating Systems Introduction to Computer Hardware Systems
PDF
Design an Analysis of Algorithms I-SECS-1021-03
PPTX
Operating system designcfffgfgggggggvggggggggg
PPTX
assetexplorer- product-overview - presentation
PDF
SAP S4 Hana Brochure 3 (PTS SYSTEMS AND SOLUTIONS)
PDF
Addressing The Cult of Project Management Tools-Why Disconnected Work is Hold...
PDF
T3DD25 TYPO3 Content Blocks - Deep Dive by André Kraus
PDF
Which alternative to Crystal Reports is best for small or large businesses.pdf
PDF
Odoo Companies in India – Driving Business Transformation.pdf
PDF
Internet Downloader Manager (IDM) Crack 6.42 Build 41
PDF
Raksha Bandhan Grocery Pricing Trends in India 2025.pdf
PPTX
Agentic AI Use Case- Contract Lifecycle Management (CLM).pptx
PDF
Digital Systems & Binary Numbers (comprehensive )
PPTX
Computer Software and OS of computer science of grade 11.pptx
PDF
How to Migrate SBCGlobal Email to Yahoo Easily
PDF
Internet Downloader Manager (IDM) Crack 6.42 Build 42 Updates Latest 2025
PDF
How to Choose the Right IT Partner for Your Business in Malaysia
Adobe Illustrator 28.6 Crack My Vision of Vector Design
System and Network Administraation Chapter 3
Oracle E-Business Suite: A Comprehensive Guide for Modern Enterprises
Lecture 3: Operating Systems Introduction to Computer Hardware Systems
Design an Analysis of Algorithms I-SECS-1021-03
Operating system designcfffgfgggggggvggggggggg
assetexplorer- product-overview - presentation
SAP S4 Hana Brochure 3 (PTS SYSTEMS AND SOLUTIONS)
Addressing The Cult of Project Management Tools-Why Disconnected Work is Hold...
T3DD25 TYPO3 Content Blocks - Deep Dive by André Kraus
Which alternative to Crystal Reports is best for small or large businesses.pdf
Odoo Companies in India – Driving Business Transformation.pdf
Internet Downloader Manager (IDM) Crack 6.42 Build 41
Raksha Bandhan Grocery Pricing Trends in India 2025.pdf
Agentic AI Use Case- Contract Lifecycle Management (CLM).pptx
Digital Systems & Binary Numbers (comprehensive )
Computer Software and OS of computer science of grade 11.pptx
How to Migrate SBCGlobal Email to Yahoo Easily
Internet Downloader Manager (IDM) Crack 6.42 Build 42 Updates Latest 2025
How to Choose the Right IT Partner for Your Business in Malaysia
Ad

wang-Leveraging-the-Power-of-ChatGPT-and-Vector-Databases-in-the-FreeBSD-Expert-System-slides.pdf

  • 1. Leveraging the Power of ChatGPT and Vector Databases in the FreeBSD Expert System Yan-Hao Wang, AsiaBSDCon 2024
  • 2. Who Am I My name is Yan-Hao Wang, a senior high student in Taiwan and FreeBSD Taiwan intern since 2022. I've been involved in various tasks such as 1. Developing an online document/man-page editor. 2. Crafting tests for command utilities like gunion(8) and printenv(1). 3. Translating FreeBSD documents.
  • 3. GitHub Repository All codes have been uploaded to the freebsd_data repository. The slide will also be put on it. If you're interested, you can access them there.
  • 4. Outline 1. Introduction of the Expert System 2. Introduction of ChatGPT 3. Development Process a. Data Cleaning and Extraction b. Embedded Model and Vector Database c. Integration with ChatGPT 4. OpenAI GPTs as Potential Replacements 5. Summary
  • 5. Expert System Expert system is a system that can answer user questions accurately in a specific domain. It consists of two parts 1. Knowledge Base: stores all the relevant information related to the domain of expertise. 2. Rule Engine: Contain some predefined rules by the data scientist. It processes the user's questions and applies rules to generate accurate responses.
  • 6. Expert System Modern expert systems use machine learning to simulate the behavior or judgment of domain experts. ML model
  • 7. ChatGPT ChatGPT (Chat Generative Pre-trained Transformer) is a chatbot developed by OpenAI and launched on November 30, 2022. Based on a large language model (LLM).
  • 8. FreeBSD Expert System There are multiple ways to build a FreeBSD expert system. 1. Train a new ML model with FreeBSD data. No. I am not an ML expert and it costs a lot.
  • 9. FreeBSD Expert System There are multiple ways to build a FreeBSD expert system. 1. Train a new ML model with FreeBSD data. No. I am not an ML expert and it costs a lot. 2. Use the existing model such as ChatGPT. But we definitely won’t call ChatGPT a FreeBSD expert system.
  • 10. FreeBSD Expert System ChatGPT uses amount of data for training. So he can answer problems in every domain though may not be correct. It's more like a general-purpose system.
  • 11. FreeBSD Expert System ChatGPT uses amount of data for training. So he can answer problems in every domain though may not be correct. It's more like a general-purpose system. The limitation of why ChatGPT can’t be called a FreeBSD expert system 1. Chatgpt may tendency to hallucinate answers when asked about unfamiliar domains. 2. The data is not new enough (ChatGPT uses data before 2021 to train). So he can’t answer the newest question.
  • 12. FreeBSD Expert System There are two ways to handle the limitation. 1. Fine-tune. fine-tuning is a process that takes a model that has already been trained for one given task and then tunes or tweaks the model to make it perform a second similar task.
  • 13. FreeBSD Expert System There are two ways to handle the limitation. First way is, fine-tune. fine-tuning is a process that takes a model that has already been trained for one given task and then tunes or tweaks the model to make it perform a similar task. OpenAI has supplied this API. For the open-source model, you should use Pytorch and TensorFlow to handle it.
  • 14. FreeBSD Expert System However, fine-tuning is still hard for AI-unfamiliar developer. And It also cost a lot. The second way is Retrieval Augmented Generation (RAG). Basically, it is just like when you use ChatGPT, you can provide related info about your question, and it can provide a much more accurate response. This is an acceptable way, so we will use the embedded model and vector database to achieve this.
  • 15. Embedded Model It is a type of ML model used to convert input data, such as words or sentences, into numerical representations called embedding vector or vector. These embeddings capture the semantic meaning or context of the input data in a continuous vector space. It can work on tasks such as text classification and sentiment analysis.
  • 16. Vector Database Vector databases are designed to store vectors efficiently. These databases employ various search algorithms to find the most similar vectors, such as t. Numerous open-source vector databases are available to choose from.
  • 19. Development - Data Extraction Use the simple find command to extract data. The data sources are very different, we need to convert it to plain text. We use “hs-pandoc” package to convert data.
  • 20. Development - Data Cleaning Remove unrelated info, simple find command to remove the unrelated data. Unrelated text
  • 21. Development - Data Cleaning Actually, data cleaning is the most time-consuming step. Data scientists spend 60% of their time cleaning data rather than creating insights. There are some tools that can help us clean the data. OpenRefine
  • 22. Development - Embedded Model OpenAI has embedded model API, there are multiple open source embedded models online too. In this project, we use the open source model (“gte-base”). MTEB Leaderboard - From Hugging Face OpenAI embedding model
  • 23. Development - Embedded Model [0.2, 0.3 … 2.3] [0.3, 0.6 … 1.7] [0.9, 0.1 … 3.1] vector 1 vector 2 vector 3
  • 24. Development - Embedded Model We use “gte-base” as our model. Its model size is only 0.22 GB which my small GPU (NVIDIA GeForce GTX 1050 Ti) can handle it. It takes only 590MB of GPU memory and 67 minutes to embed all the documents.
  • 27. Development - Embedded Model There are multiple facts (hyperparameters) we can tune here. For example 1. The length of sentences. 2. What metadata should we leave? 3. What model should we use? Weather we need to tune the embedded model. All these hyperparameters should be tried multiple times to get the best answer. The answer will be different with different fields - NFL(No Free Lunch Theorems)
  • 28. Development - Vector database As previously said, we have different vector databases. But in our local test, we just use a file to store the vector and a simple cosine similarity algorithm. Because our data is not big (< 100 MB).
  • 29. Development - Query Question: How to use the gunion command in FreeBSD? Query result: 1. Man page of gunion 2. Man page of gunion 3. FreeBSD status report (A New GEOM Facility, gunion) 4. Unrelated info …
  • 31. Development - Integration with ChatGPT So we need to host an embedded model and vector database and have an open API to let users use. Then integrate the API with ChatGPT 1. The first way is easy, we just write a Python code to use ChatGPT API and our API. But this is not friendly to normal users.
  • 32. Development - Integration with ChatGPT So we need to host an embedded model and vector database and have an open API to let users use. Then integrate the API with ChatGPT 1. The first way is easy, we just write a Python code to use ChatGPT API and our API. But this is not friendly to normal users. 2. Develop ChatGPT plugin, ChatGPT plugin can let us set some API. While asking questions ChatGPT, it will call the API and get the response. This is the best practice of our project, the user just needs to enable the plugin in ChatGPT.
  • 34. OpenAI GPTs as Potential Replacements GPTs was lauched at November 2023. It provides an easy way to generate a custom GPT for any data you have. Which becomes a potential replacement for our project. We only need to upload the data from step 1 and there is a custom expert system. On March 19, 2024, you will no longer be able to install new plugins or create new conversations with existing plugins.
  • 35. Wiki Future Audiences The idea is inspired by Wiki. They actually already have developed a plugin. But they also stopped the plan after the GPTs release. This timing also coincides with OpenAI’s move away from the plugin marketplace for ChatGPT, and towards no/low-code customizable GPTs. This shift has made our plugin in its current form inaccessible to new users and largely redundant. While we could repurpose this functionality towards being a GPT, we don’t believe we would learn significantly more beyond how to create a product within the OpenAI ecosystem. Lessons learned, ChatGPT has not become the new information seeking paradigm (yet?).
  • 36. Summary Solution RAG GPTs (Custom GPT) ChatGPT Plus (browse internet) Cost Medium ~ Hard Small Small Advantage ● Privacy ● Flexibility ● Fast ● Fast Disadvantage ● Cost ● Privacy ● Flexibility ● Data source are different ● Flexibility
  • 37. Summary The significance of LLM is poised to exponentially increase in the future, marking a pivotal shift in our technological landscape. While we may not complete the production process in its entirety. But it is a good thing to focus on any future trends and try to combine them with FreeBSD.
  • 38. Reference ● What is an Expert System? ● Do data scientists spend 80% of their time cleaning data? Turns out, no? ● Wiki, Talk:Future Audiences