Getting Started with LangChain and ChromaDB | Creating a Focused AI Knowledge Base from Local Documents
Use Case: Creating a Focused AI Knowledge Base from Local Documents
Suppose you have a directory full of curated content such as PDFs, Word documents, slide decks, or text files, and you want to build an AI assistant that answers questions using only that material.
LangChain, combined with a vector store like ChromaDB, makes this possible. This script serves as a minimal working example: it loads a document, chunks it into digestible sections, converts those sections into embeddings, and stores them for fast semantic search. When a question is asked, only relevant chunks from your documents are retrieved and passed to the language model. The result is a tightly scoped, reliable AI interface that stays grounded in the knowledge you provided.
What is LangChain?
LangChain is an open-source framework designed to simplify building applications that use LLMs. It manages how text-based data flows to and from advanced language models, such as OpenAI's GPT models, enabling applications that can answer questions, summarize documents, and even assist in creative tasks like storytelling.
What is ChromaDB?
ChromaDB is a database specialized for storing vector embeddings. Embeddings are numerical representations of text that capture its semantic meaning. ChromaDB stores and retrieves these embeddings efficiently, making it possible to find relevant information based on meaning rather than exact text matches. This is what lets the system act like a librarian: it knows what each book is about, so when you ask for something it knows exactly where to look even if you don't use the exact words.
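The "meaning rather than exact match" idea comes down to comparing embedding vectors, most often with cosine similarity. Here is a minimal sketch using made-up three-dimensional vectors (real OpenAI embeddings have on the order of 1,536 dimensions):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means identical direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy "embeddings" invented for illustration
bird = [0.9, 0.1, 0.2]
sparrow = [0.85, 0.15, 0.25]  # semantically close to "bird"
truck = [0.1, 0.9, 0.8]       # unrelated concept

print(cosine_similarity(bird, sparrow))  # high, roughly 0.99
print(cosine_similarity(bird, truck))    # low, roughly 0.30
```

Two texts about similar things end up with vectors pointing in similar directions, so their cosine similarity is high even when they share no words.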
What is a Retriever?
from qa_system.py, line 63
retriever = vectorstore.as_retriever(search_kwargs={"k": 3})
A retriever works as the librarian for your vector database. When a user asks a question, it locates the most relevant pieces of information by comparing vector embeddings. This process, called k-nearest neighbors search, uses the k parameter to define how many chunks to retrieve; k = 3 means the retriever grabs the top three most similar text snippets. I set it to 3 here: enough to provide useful context without cluttering the model's focus or slowing it down. The system transforms the question into a vector, finds its three closest matches, and passes them to the language model to craft a response.
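What the retriever does with k = 3 can be simulated in plain Python: score every stored chunk against the question vector and keep the three best. The chunk texts and two-dimensional vectors below are invented for illustration:

```python
import math

def top_k(query_vec, docs, k=3):
    """Return the k document chunks whose vectors are most similar to the query."""
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))
    ranked = sorted(docs, key=lambda d: cos(query_vec, d["vector"]), reverse=True)
    return [d["text"] for d in ranked[:k]]

chunks = [
    {"text": "Arctic Terns travel 44,000 km round trip", "vector": [0.9, 0.1]},
    {"text": "Snow Geese fly in V-formations",           "vector": [0.7, 0.3]},
    {"text": "Habitat loss threatens wintering grounds", "vector": [0.2, 0.8]},
    {"text": "Spring migration occurs February-May",     "vector": [0.6, 0.4]},
]

query = [0.85, 0.15]  # pretend embedding of "How far do terns migrate?"
print(top_k(query, chunks, k=3))
```

The tern chunk ranks first because its vector points in nearly the same direction as the question's, while the habitat-loss chunk is dropped: with k = 3, only three chunks survive to be handed to the model.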
A Simple Question & Answer System
The source repository contains a simple project with a question-answering interface built with LangChain and ChromaDB. It provides accurate answers based on a knowledge base. If you'd like to work with it, follow the instructions in the project's README.md.
The primary files in use are src/qa_system.py and data/knowledge.txt.
Here's how it works.
Step 1: Creating a Knowledge Base
LangChain works best when your knowledge base is cleanly structured. Think short sections, clear headings, and bullet points. It expects consistently formatted text so it can chunk content effectively and generate embeddings that align with how language models retrieve context. Use markdown-style formatting, concise explanations, and logical groupings of information. You can include facts, how-tos, use cases, or even short FAQs. The goal isn’t rigid structure, but clarity: structure helps the retriever do its job, and clean chunks lead to better, more accurate responses.
In this case, I created a file called knowledge.txt and generated some information on migratory birds.
Migratory Birds: A Comprehensive Overview
Migratory birds are species that undertake regular seasonal movements between breeding and non-breeding grounds. These remarkable journeys can span thousands of kilometers and represent one of nature's most impressive phenomena.
Bird Migration Basics:
- Migration is primarily driven by food availability and breeding conditions
- Birds navigate using multiple methods including:
- Celestial cues (stars and sun)
- Earth's magnetic field
- Geographic landmarks
- Inherited genetic information
Key Types of Migration:
1. Latitudinal Migration
- Birds moving north and south between seasons
- Most common type of migration
- Example: Arctic Terns travel from Arctic to Antarctic annually
2. Longitudinal Migration
- Birds moving east to west
- Less common than latitudinal migration
- Example: European Rollers moving between Europe and Africa
3. Altitudinal Migration
- Birds moving between different elevations
- Common in mountainous regions
- Example: Mountain Quail in western North America
Notable Migratory Species:
- Arctic Tern: Longest known migration (44,000 km round trip)
- Bar-tailed Godwit: Longest non-stop flight (11,000 km)
- Rufous Hummingbird: Longest migration relative to body size
- Snow Goose: Travels in large, visible V-formations
Challenges Facing Migratory Birds:
- Habitat loss at breeding and wintering grounds
- Climate change affecting migration timing
- Light pollution disrupting navigation
- Human-made structures (buildings, wind turbines)
- Hunting and poaching
Conservation Efforts:
- International treaties protecting migratory birds
- Habitat preservation along migration routes
- Tracking programs to monitor populations
- Public education and awareness campaigns
Migration Timing and Seasons:
- Spring migration (northward) typically occurs February-May
- Fall migration (southward) typically occurs August-November
- Timing varies by species and region
- Some species migrate during night, others during day
This overview provides essential information about migratory birds.
Step 2: How the Knowledge Base is Used
When the script is first executed, the system loads the knowledge.txt file using LangChain’s TextLoader. Then, it automatically splits the content into smaller, manageable chunks to optimize search and accuracy. Each chunk is transformed into a numerical vector (an embedding) using OpenAI’s embedding model. Finally, these embeddings are stored in ChromaDB, making them ready for fast, similarity-based retrieval when a question is asked.
Here's what that looks like in the script:
from qa_system.py, lines 51-60
loader = TextLoader(str(knowledge_file))
documents = loader.load()
# Create vector embeddings and store them in ChromaDB
embedding = OpenAIEmbeddings(openai_api_key=api_key)
vectorstore = Chroma.from_documents(
    documents,
    embedding,
    persist_directory=str(db_dir)
)
Step 3: Using the Q&A Script
After starting the script, you can ask a question. The system turns it into an embedding and then compares it to the stored embeddings to find the most relevant chunks of information. These chunks are passed to OpenAI’s language model, which uses them as context to generate a clear, human-like answer.
Remember the retriever from earlier in the article?
from qa_system.py, line 63
retriever = vectorstore.as_retriever(search_kwargs={"k": 3})
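Once the retriever returns its three chunks, they are stitched into the prompt that grounds the model's answer. The exact prompt qa_system.py uses isn't shown in this article, so the sketch below only illustrates the general pattern:

```python
def build_prompt(question, context_chunks):
    """Assemble a grounding prompt: retrieved chunks first, then the user's question."""
    context = "\n\n".join(context_chunks)
    return (
        "Answer the question using ONLY the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )

prompt = build_prompt(
    "How far do Arctic Terns fly?",
    ["Arctic Tern: Longest known migration (44,000 km round trip)",
     "Spring migration (northward) typically occurs February-May"],
)
print(prompt)
```

This is what keeps the assistant "tightly scoped": the model is instructed to answer from the supplied chunks rather than from its general training data.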
Conclusion
This is just a minimal starting point. LangChain is highly extensible and supports a wide range of data sources beyond plain text. You can load and chunk PDFs, Word docs, HTML, CSVs, and more using LangChain’s various document loaders. These files don’t need to be manually converted; the framework can parse and process them into embeddings just like a text file. You can also fine-tune how the content is chunked, by page, section, or custom rules, and store those embeddings in ChromaDB or other vector stores. So instead of treating files as static references, you can turn entire folders of diverse documents into a searchable, semantic knowledge base. Whether it’s a single .txt or an archive of technical PDFs, LangChain lets you treat it all like a dynamic database.
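Turning a mixed folder into a knowledge base starts with discovering which files you can ingest and which loader handles each. A small sketch of that routing step (the loader names here are the ones commonly documented for LangChain, not taken from this project):

```python
from pathlib import Path

# Map file extensions to the LangChain loader class you'd hand each file to
# (loader names are assumptions based on LangChain's documented loaders)
LOADER_BY_EXT = {
    ".txt": "TextLoader",
    ".pdf": "PyPDFLoader",
    ".csv": "CSVLoader",
    ".html": "UnstructuredHTMLLoader",
}

def discover_documents(folder):
    """Return supported files under `folder`, each paired with its loader name."""
    return [
        (path, LOADER_BY_EXT[path.suffix.lower()])
        for path in sorted(Path(folder).rglob("*"))
        if path.suffix.lower() in LOADER_BY_EXT
    ]
```

From there, each loader's output feeds the same chunk-embed-store pipeline shown in Step 2, which is what makes a folder of mixed documents behave like one searchable database.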