Getting Started with LangChain and ChromaDB | Creating a Focused AI Knowledge Base from Local Documents
Use Case: Creating a Focused AI Knowledge Base from Local Documents
Suppose you have a directory full of curated content such as PDFs, Word documents, slide decks, or text files, and you want to build an AI assistant that answers questions using only that material.
LangChain, combined with a vector store like ChromaDB, makes this possible. This script serves as a minimal working example: it loads a document, chunks it into digestible sections, converts those sections into embeddings, and stores them for fast semantic search. When a question is asked, only relevant chunks from your documents are retrieved and passed to the language model. The result is a tightly scoped, reliable AI interface that stays grounded in the knowledge you provided.
What is LangChain?
LangChain is an open-source framework designed to simplify building applications that use LLMs. It manages how text-based data flows to and from advanced language models, such as OpenAI's GPT models, enabling applications that can answer questions, summarize documents, and even assist in creative tasks like storytelling.
What is ChromaDB?
ChromaDB is a database specialized for storing vector embeddings. Embeddings are numerical representations of text that capture its semantic meaning. ChromaDB stores and retrieves these embeddings efficiently, making it possible to find relevant information based on meaning rather than exact text matches. This is what lets the system act like a librarian: it knows what each book is about, so when you ask for something it knows exactly where to look even if you don't use the exact words.
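The "meaning rather than exact match" idea comes down to comparing embedding vectors, most often with cosine similarity. Here is a minimal sketch using made-up three-dimensional vectors (real OpenAI embeddings have on the order of 1,536 dimensions):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means identical direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy "embeddings" invented for illustration
bird = [0.9, 0.1, 0.2]
sparrow = [0.85, 0.15, 0.25]  # semantically close to "bird"
truck = [0.1, 0.9, 0.8]       # unrelated concept

print(cosine_similarity(bird, sparrow))  # high, roughly 0.99
print(cosine_similarity(bird, truck))    # low, roughly 0.30
```

Two texts about similar things end up with vectors pointing in similar directions, so their cosine similarity is high even when they share no words.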
What is a Retriever?
from qa_system.py, line 63
retriever = vectorstore.as_retriever(search_kwargs={"k": 3})
A retriever works as the librarian for your vector database. When a user asks a question, it locates the most relevant pieces of information by comparing vector embeddings. This process, called k-nearest neighbors search, uses the k parameter to define how many chunks to retrieve; k = 3 means the retriever grabs the top three most similar text snippets. I set it to 3 here: enough to provide useful context without cluttering the model's focus or slowing it down. The system transforms the question into a vector, finds its three closest matches, and passes them to the language model to craft a response.
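What the retriever does with k = 3 can be simulated in plain Python: score every stored chunk against the question vector and keep the three best. The chunk texts and two-dimensional vectors below are invented for illustration:

```python
import math

def top_k(query_vec, docs, k=3):
    """Return the k document chunks whose vectors are most similar to the query."""
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))
    ranked = sorted(docs, key=lambda d: cos(query_vec, d["vector"]), reverse=True)
    return [d["text"] for d in ranked[:k]]

chunks = [
    {"text": "Arctic Terns travel 44,000 km round trip", "vector": [0.9, 0.1]},
    {"text": "Snow Geese fly in V-formations",           "vector": [0.7, 0.3]},
    {"text": "Habitat loss threatens wintering grounds", "vector": [0.2, 0.8]},
    {"text": "Spring migration occurs February-May",     "vector": [0.6, 0.4]},
]

query = [0.85, 0.15]  # pretend embedding of "How far do terns migrate?"
print(top_k(query, chunks, k=3))
```

The tern chunk ranks first because its vector points in nearly the same direction as the question's, while the habitat-loss chunk is dropped: with k = 3, only three chunks survive to be handed to the model.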
A Simple Question & Answer System
The source repository contains a simple project with a question-answering interface built with LangChain and ChromaDB. It provides accurate answers based on a knowledge base. If you'd like to work with it, follow the instructions in the project's README.md.
The primary files in use are src/qa_system.py and data/knowledge.txt.
Here's how it works.
Step 1: Creating a Knowledge Base
LangChain works best when your knowledge base is cleanly structured. Think short sections, clear headings, and bullet points. It expects consistently formatted text so it can chunk content effectively and generate embeddings that align with how language models retrieve context. Use markdown-style formatting, concise explanations, and logical groupings of information. You can include facts, how-tos, use cases, or even short FAQs. The goal isn’t rigid structure, but clarity: structure helps the retriever do its job, and clean chunks lead to better, more accurate responses.
In this case, I created a file called knowledge.txt and generated some information on migratory birds.
Migratory Birds: A Comprehensive Overview
Migratory birds are species that undertake regular seasonal movements between breeding and non-breeding grounds. These remarkable journeys can span thousands of kilometers and represent one of nature's most impressive phenomena.
Bird Migration Basics:
- Migration is primarily driven by food availability and breeding conditions
- Birds navigate using multiple methods including:
- Celestial cues (stars and sun)
- Earth's magnetic field
- Geographic landmarks
- Inherited genetic information
Key Types of Migration:
1. Latitudinal Migration
- Birds moving north and south between seasons
- Most common type of migration
- Example: Arctic Terns travel from Arctic to Antarctic annually
2. Longitudinal Migration
- Birds moving east to west
- Less common than latitudinal migration
- Example: European Rollers moving between Europe and Africa
3. Altitudinal Migration
- Birds moving between different elevations
- Common in mountainous regions
- Example: Mountain Quail in western North America
Notable Migratory Species:
- Arctic Tern: Longest known migration (44,000 km round trip)
- Bar-tailed Godwit: Longest non-stop flight (11,000 km)
- Rufous Hummingbird: Longest migration relative to body size
- Snow Goose: Travels in large, visible V-formations
Challenges Facing Migratory Birds:
- Habitat loss at breeding and wintering grounds
- Climate change affecting migration timing
- Light pollution disrupting navigation
- Human-made structures (buildings, wind turbines)
- Hunting and poaching
Conservation Efforts:
- International treaties protecting migratory birds
- Habitat preservation along migration routes
- Tracking programs to monitor populations
- Public education and awareness campaigns
Migration Timing and Seasons:
- Spring migration (northward) typically occurs February-May
- Fall migration (southward) typically occurs August-November
- Timing varies by species and region
- Some species migrate during night, others during day
This overview provides essential information about migratory birds.
Step 2: How the Knowledge Base is Used
When the script is first executed, the system loads the knowledge.txt file using LangChain’s TextLoader. Then, it automatically splits the content into smaller, manageable chunks to optimize search and accuracy. Each chunk is transformed into a numerical vector (an embedding) using OpenAI’s embedding model. Finally, these embeddings are stored in ChromaDB, making them ready for fast, similarity-based retrieval when a question is asked.
Here's what that looks like in the script:
from qa_system.py, lines 51-60
loader = TextLoader(str(knowledge_file))
documents = loader.load()
# Create vector embeddings and store them in ChromaDB
embedding = OpenAIEmbeddings(openai_api_key=api_key)
vectorstore = Chroma.from_documents(
    documents,
    embedding,
    persist_directory=str(db_dir)
)
Step 3: Using the Q&A Script
After starting the script, you can ask a question. The system turns it into an embedding and then compares it to the stored embeddings to find the most relevant chunks of information. These chunks are passed to OpenAI’s language model, which uses them as context to generate a clear, human-like answer.
Remember the retriever from earlier in the article?
from qa_system.py, line 63
retriever = vectorstore.as_retriever(search_kwargs={"k": 3})
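Once the retriever returns its three chunks, they are stitched into the prompt that grounds the model's answer. The exact prompt qa_system.py uses isn't shown in this article, so the sketch below only illustrates the general pattern:

```python
def build_prompt(question, context_chunks):
    """Assemble a grounding prompt: retrieved chunks first, then the user's question."""
    context = "\n\n".join(context_chunks)
    return (
        "Answer the question using ONLY the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )

prompt = build_prompt(
    "How far do Arctic Terns fly?",
    ["Arctic Tern: Longest known migration (44,000 km round trip)",
     "Spring migration (northward) typically occurs February-May"],
)
print(prompt)
```

This is what keeps the assistant "tightly scoped": the model is instructed to answer from the supplied chunks rather than from its general training data.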
Conclusion
This is just a minimal starting point. LangChain is highly extensible and supports a wide range of data sources beyond plain text. You can load and chunk PDFs, Word docs, HTML, CSVs, and more using LangChain’s various document loaders. These files don’t need to be manually converted; the framework can parse and process them into embeddings just like a text file. You can also fine-tune how the content is chunked, by page, section, or custom rules, and store those embeddings in ChromaDB or other vector stores. So instead of treating files as static references, you can turn entire folders of diverse documents into a searchable, semantic knowledge base. Whether it’s a single .txt or an archive of technical PDFs, LangChain lets you treat it all like a dynamic database.
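Turning a mixed folder into a knowledge base starts with discovering which files you can ingest and which loader handles each. A small sketch of that routing step (the loader names here are the ones commonly documented for LangChain, not taken from this project):

```python
from pathlib import Path

# Map file extensions to the LangChain loader class you'd hand each file to
# (loader names are assumptions based on LangChain's documented loaders)
LOADER_BY_EXT = {
    ".txt": "TextLoader",
    ".pdf": "PyPDFLoader",
    ".csv": "CSVLoader",
    ".html": "UnstructuredHTMLLoader",
}

def discover_documents(folder):
    """Return supported files under `folder`, each paired with its loader name."""
    return [
        (path, LOADER_BY_EXT[path.suffix.lower()])
        for path in sorted(Path(folder).rglob("*"))
        if path.suffix.lower() in LOADER_BY_EXT
    ]
```

From there, each loader's output feeds the same chunk-embed-store pipeline shown in Step 2, which is what makes a folder of mixed documents behave like one searchable database.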