1. Leveraging the Power of ChatGPT
and Vector Databases in the
FreeBSD Expert System
Yan-Hao Wang, AsiaBSDCon 2024
2. Who Am I
My name is Yan-Hao Wang, a senior high student in Taiwan and FreeBSD
Taiwan intern since 2022.
I've been involved in various tasks such as
1. Developing an online document/man-page editor.
2. Crafting tests for command utilities like gunion(8) and printenv(1).
3. Translating FreeBSD documents.
3. GitHub Repository
All codes have been uploaded to the freebsd_data repository. The slide will also
be put on it. If you're interested, you can access them there.
4. Outline
1. Introduction of the Expert System
2. Introduction of ChatGPT
3. Development Process
a. Data Cleaning and Extraction
b. Embedded Model and Vector Database
c. Integration with ChatGPT
4. OpenAI GPTs as Potential Replacements
5. Summary
5. Expert System
Expert system is a system that can answer user questions accurately in a specific
domain. It consists of two parts
1. Knowledge Base: stores all the relevant information related to the domain of
expertise.
2. Rule Engine: Contain some predefined rules by the data scientist. It processes
the user's questions and applies rules to generate accurate responses.
6. Expert System
Modern expert systems use machine learning to simulate the behavior or
judgment of domain experts.
ML model
7. ChatGPT
ChatGPT (Chat Generative Pre-trained Transformer) is a chatbot developed by
OpenAI and launched on November 30, 2022. Based on a large language model
(LLM).
8. FreeBSD Expert System
There are multiple ways to build a FreeBSD expert system.
1. Train a new ML model with FreeBSD data.
No. I am not an ML expert and it costs a lot.
9. FreeBSD Expert System
There are multiple ways to build a FreeBSD expert system.
1. Train a new ML model with FreeBSD data.
No. I am not an ML expert and it costs a lot.
2. Use the existing model such as ChatGPT.
But we definitely won’t call ChatGPT a FreeBSD expert system.
10. FreeBSD Expert System
ChatGPT uses amount of data for training. So he can answer problems in every
domain though may not be correct. It's more like a general-purpose system.
11. FreeBSD Expert System
ChatGPT uses amount of data for training. So he can answer problems in every
domain though may not be correct. It's more like a general-purpose system.
The limitation of why ChatGPT can’t be called a FreeBSD expert system
1. Chatgpt may tendency to hallucinate answers when asked about unfamiliar
domains.
2. The data is not new enough (ChatGPT uses data before 2021 to train). So he
can’t answer the newest question.
12. FreeBSD Expert System
There are two ways to handle the limitation.
1. Fine-tune. fine-tuning is a process that takes a model that has already been
trained for one given task and then tunes or tweaks the model to make it
perform a second similar task.
13. FreeBSD Expert System
There are two ways to handle the limitation.
First way is, fine-tune. fine-tuning is a process that takes a model that has already
been trained for one given task and then tunes or tweaks the model to make it
perform a similar task.
OpenAI has supplied this API. For the open-source model, you should use Pytorch
and TensorFlow to handle it.
14. FreeBSD Expert System
However, fine-tuning is still hard for AI-unfamiliar developer. And It also cost a lot.
The second way is Retrieval Augmented Generation (RAG). Basically, it is just like
when you use ChatGPT, you can provide related info about your question, and it
can provide a much more accurate response.
This is an acceptable way, so we will use the embedded model and vector
database to achieve this.
15. Embedded Model
It is a type of ML model used to convert input data, such as words or sentences,
into numerical representations called embedding vector or vector.
These embeddings capture the semantic meaning or context of the input data in a
continuous vector space. It can work on tasks such as text classification and
sentiment analysis.
16. Vector Database
Vector databases are designed to store vectors efficiently. These databases
employ various search algorithms to find the most similar vectors, such as t.
Numerous open-source vector databases are available to choose from.
19. Development - Data Extraction
Use the simple find command to extract data. The data sources are very different,
we need to convert it to plain text. We use “hs-pandoc” package to convert data.
20. Development - Data Cleaning
Remove unrelated info, simple find command to remove the unrelated data.
Unrelated text
21. Development - Data Cleaning
Actually, data cleaning is the most time-consuming step. Data scientists spend
60% of their time cleaning data rather than creating insights.
There are some tools that can help us clean the data.
OpenRefine
22. Development - Embedded Model
OpenAI has embedded model API, there are multiple open source embedded
models online too. In this project, we use the open source model (“gte-base”).
MTEB Leaderboard - From Hugging Face
OpenAI embedding model
24. Development - Embedded Model
We use “gte-base” as our model. Its model size is only 0.22 GB which my small
GPU (NVIDIA GeForce GTX 1050 Ti) can handle it.
It takes only 590MB of GPU memory and 67 minutes to embed all the documents.
27. Development - Embedded Model
There are multiple facts (hyperparameters) we can tune here. For example
1. The length of sentences.
2. What metadata should we leave?
3. What model should we use? Weather we need to tune the embedded model.
All these hyperparameters should be tried multiple times to get the best answer.
The answer will be different with different fields - NFL(No Free Lunch Theorems)
28. Development - Vector database
As previously said, we have different vector databases.
But in our local test, we just use a file to store the vector and a simple cosine
similarity algorithm. Because our data is not big (< 100 MB).
29. Development - Query
Question: How to use the gunion command in FreeBSD?
Query result:
1. Man page of gunion
2. Man page of gunion
3. FreeBSD status report (A New GEOM Facility, gunion)
4. Unrelated info …
31. Development - Integration with ChatGPT
So we need to host an embedded model and vector database and have an open
API to let users use. Then integrate the API with ChatGPT
1. The first way is easy, we just write a Python code to use ChatGPT API and
our API. But this is not friendly to normal users.
32. Development - Integration with ChatGPT
So we need to host an embedded model and vector database and have an open
API to let users use. Then integrate the API with ChatGPT
1. The first way is easy, we just write a Python code to use ChatGPT API and
our API. But this is not friendly to normal users.
2. Develop ChatGPT plugin, ChatGPT plugin can let us set some API. While
asking questions ChatGPT, it will call the API and get the response.
This is the best practice of our project, the user just needs to enable the
plugin in ChatGPT.
34. OpenAI GPTs as Potential Replacements
GPTs was lauched at November 2023. It provides an easy way to generate a
custom GPT for any data you have. Which becomes a potential replacement for
our project. We only need to upload the data from step 1 and there is a custom
expert system.
On March 19, 2024, you will no longer be able to install new plugins or create new
conversations with existing plugins.
35. Wiki Future Audiences
The idea is inspired by Wiki. They actually already have developed a plugin. But
they also stopped the plan after the GPTs release.
This timing also coincides with OpenAI’s move away from the plugin marketplace
for ChatGPT, and towards no/low-code customizable GPTs. This shift has made
our plugin in its current form inaccessible to new users and largely redundant.
While we could repurpose this functionality towards being a GPT, we don’t believe
we would learn significantly more beyond how to create a product within the
OpenAI ecosystem.
Lessons learned, ChatGPT has not become the new information seeking
paradigm (yet?).
36. Summary
Solution RAG GPTs (Custom GPT) ChatGPT Plus (browse internet)
Cost Medium ~ Hard Small Small
Advantage ● Privacy
● Flexibility
● Fast ● Fast
Disadvantage ● Cost ● Privacy
● Flexibility
● Data source are different
● Flexibility
37. Summary
The significance of LLM is poised to exponentially increase in the future, marking
a pivotal shift in our technological landscape.
While we may not complete the production process in its entirety. But it is a good
thing to focus on any future trends and try to combine them with FreeBSD.
38. Reference
● What is an Expert System?
● Do data scientists spend 80% of their time cleaning data? Turns out, no?
● Wiki, Talk:Future Audiences