Building CSV Agents: Unlocking the Power of Gen AI for Real-World Data Analysis and Insights!
Unlocking Insights from Your Data with LangChain, ChatAnthropic, and Python.
In this tutorial, I will build a powerful CSV Agent capable of interacting with CSV data to extract insights and answer complex queries. This agent leverages the capabilities of the ChatAnthropic model and the LangChain framework to provide a seamless user experience for data analysis. By the end of this article, you will have a robust understanding of how to set up your own CSV Agent and its real-world applications.
Our Application Use Cases: Online Store Product Data
Architecture Overview
Tech Stack
Skills Learned
Step 1: Project Setup
Begin by creating a project folder on your local machine:
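For example (the folder name here is just a placeholder):

```bash
mkdir csv-agent
cd csv-agent
```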
Add the necessary libraries to requirements.txt, including pandas, streamlit, langchain, and langchain-anthropic. Also create a folder named csv-data for your CSV file, a .env file for your secret keys, and a .gitignore file to protect sensitive information; then set up a virtual environment:
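A minimal sketch of that setup, assuming a Unix-like shell (the commands differ slightly on Windows):

```bash
# requirements.txt should contain:
#   pandas
#   streamlit
#   langchain
#   langchain-anthropic

mkdir csv-data                        # holds your CSV file
touch .env                            # stores secret keys (e.g. ANTHROPIC_API_KEY=...)
printf '.env\nvenv/\n' >> .gitignore  # keep secrets and the venv out of version control
python3 -m venv venv                  # create the virtual environment
source venv/bin/activate              # activate it
pip install -r requirements.txt       # install the dependencies
```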
Step 2: Create csv_agent.py
In this step, we will create the main script that will interact with the ChatAnthropic model.
The code begins by importing the necessary libraries at the top and then setting up an environment to interact with an Anthropic language model. This is achieved using a .env file that stores your Anthropic API key. (Note: you can also use OpenAI's ChatGPT models as an alternative.) Keep in mind that LLM API keys require a credit card.
Next, I configured the model and used Pandas to read product data from a CSV file. This data, sourced from our online store CSV, will be used for further processing and analysis.
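Here is a minimal sketch of what csv_agent.py could look like at this point. The model name, environment variable, and CSV filename are assumptions, so adjust them to your setup; it also assumes python-dotenv is installed:

```python
import pandas as pd
from dotenv import load_dotenv
from langchain_anthropic import ChatAnthropic

# Load secrets from .env (expects a line like ANTHROPIC_API_KEY=sk-ant-...)
load_dotenv()

# Configure the Anthropic chat model; the model name here is an assumption
llm = ChatAnthropic(model="claude-3-5-sonnet-20240620", temperature=0)

# Read the online-store product data; the filename is a placeholder
df = pd.read_csv("csv-data/products.csv")

# Sanity check: print the first few rows to the console
print(df.head())
```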
Finally, once the code is ready, it's time to test it. Run the script from your VS Code terminal with python3 csv_agent.py and check that the print(df.head()) call outputs the extracted CSV data to the console.
If everything is set up correctly, you should see your product CSV data displayed in the terminal.
Step 3: Build the CSV Agent
Now that we can extract CSV data, let’s create the actual agent. Comment out any print statements that display raw data, as we will use this output programmatically.
Add the below code to your file:
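Here is a minimal sketch of that code. Note that create_pandas_dataframe_agent lives in the separate langchain-experimental package, so add it to requirements.txt if it isn't installed; the query string is just an example:

```python
from langchain_experimental.agents.agent_toolkits import create_pandas_dataframe_agent

# Create an agent that lets the LLM query and analyze the DataFrame
agent = create_pandas_dataframe_agent(
    llm,                        # the ChatAnthropic model configured earlier
    df,                         # the DataFrame loaded from our CSV
    verbose=True,               # print the agent's intermediate reasoning steps
    allow_dangerous_code=True,  # lets the agent execute generated Python (see below)
)

# Execute a query against the data; the question is just an example
res = agent.invoke("Which product has the highest price?")
print(res["output"])
```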
First, I imported create_pandas_dataframe_agent from LangChain's experimental package to enable interaction between a language model (LLM) and a Pandas DataFrame.
Next, I created an agent using the imported function, which allows me to query and analyze the data in the DataFrame (df). The function takes parameters specifying the language model to use and the DataFrame containing the data for analysis. I also enabled verbose=True to display detailed output during the agent's execution. Additionally, the allow_dangerous_code flag is set to True to allow the execution of arbitrary Python code, which I will briefly discuss below.
To execute the query, we call the agent with our query message. The result is then stored in res.
Feel free to try out different questions to explore how the agent responds!
Execution of arbitrary Python code: let's talk about it!
Above I mentioned that I set the flag allow_dangerous_code=True. But why? Well, it enables the agent to execute arbitrary Python code dynamically, which can be useful for handling complex queries that require direct computation or analysis.
However, this feature comes with significant security risks, especially if untrusted user input is involved. Arbitrary code execution can expose your application to vulnerabilities, such as malicious code injection or data breaches. To mitigate these risks, you must explicitly enable this feature and ensure that all inputs are thoroughly validated. For additional safety, consider running the agent in a sandboxed environment to contain potential security breaches. By understanding these implications and implementing strict input validation, you can use this functionality effectively while minimizing risks.
Step 4: Enhance Prompting Techniques
To improve our agent’s responses, we will implement custom prefixes and suffixes to guide its analysis.
Using custom prefixes and suffixes in prompts can significantly enhance the quality and relevance of the responses from LLMs. Prefixes provide clear instructions or set the context for the query, guiding the AI to focus on the desired tone, style, or subject matter. Suffixes reinforce the query by emphasizing key points or refining expectations for the response, ensuring alignment with user intentions. Together, these elements act as a framework that shapes the AI's understanding, reducing ambiguity and minimizing the risk of hallucination or irrelevant outputs. This structured approach helps the agent deliver accurate, context-aware, and meaningful results.
To improve your agent's responses, adjust the prompt to include a custom prefix, the user query, and a suffix. This structured input provides the agent with a clear framework and guardrails, reducing ambiguity and guiding it toward more accurate and relevant outputs.
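For example, a minimal sketch (the prefix and suffix wording below is my own illustration, not a required format):

```python
# Instructions that wrap every user query (wording is illustrative)
CSV_PROMPT_PREFIX = (
    "First, set the pandas display options to show all the columns "
    "and get the column names. Then answer the question: "
)
CSV_PROMPT_SUFFIX = (
    "\nExplain how you arrived at the result, then state the final "
    "answer clearly on its own line."
)

query = "What is the average price across all products?"

# Combine prefix, user query, and suffix into one structured prompt
res = agent.invoke(CSV_PROMPT_PREFIX + query + CSV_PROMPT_SUFFIX)
print(res["output"])
```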
When you run the updated code, the console window should display a more structured and detailed response. The agent will provide a breakdown of the context, the calculations performed, and a clear answer to the query, demonstrating how the prefix and suffix helped guide the response effectively.
Step 5: Streamlit for UI Visualization
For better user interaction, we will integrate Streamlit into our application.
By integrating Streamlit into our console-based application, we can significantly improve user interaction by providing a clean, browser-based interface for data exploration and AI-driven analysis. With Streamlit, users can input questions, interact with the AI agent, and view results in an organized and visually appealing format, making the application more user-friendly and accessible.
To get started, install Streamlit (if you haven't already) using the following command:
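```bash
pip install streamlit
```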
Let's create a simple web interface that allows users to input queries about their dataset:
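Below is a minimal sketch of that interface, reusing the agent, CSV_PROMPT_PREFIX, and CSV_PROMPT_SUFFIX defined earlier (the default question is just an example):

```python
import streamlit as st

# App title and a quick preview of the dataset
st.title("Database AI Agent with Langchain")
st.write("### Dataset Preview")
st.write(df.head())

# Text input for the user's question, with a default example
question = st.text_input(
    "Ask a question about the dataset:",
    "Which product has the highest price?",  # default example question
)

# Run the full query (prefix + question + suffix) when the button is clicked
if st.button("Run Query"):
    full_query = CSV_PROMPT_PREFIX + question + CSV_PROMPT_SUFFIX
    res = agent.invoke(full_query)
    st.write("### Final Answer")
    st.markdown(res["output"])
```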
The code begins by importing Streamlit, which is used to create a web-based interface for interacting with the AI agent. It then sets the title of the app as "Database AI Agent with Langchain" and displays a preview of the dataset by showing the first few rows. Next, the code creates a text input box where users can type in their questions about the dataset, with a default example question provided. There is also a "Run Query" button, which, when clicked, triggers the execution of the query. The code combines the dataset query with a prefix and suffix to form a complete query, which is then processed by the agent. The AI's response is displayed under a "Final Answer" heading, formatted in markdown for easy reading.
In summary, this code sets up an interactive app that allows users to preview the dataset, ask questions, and receive AI-generated answers.
To run your Streamlit app, use the following command in your terminal:
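```bash
streamlit run csv_agent.py
```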
Once the command executes successfully, your web browser should display the interactive app.
You will now have a user-friendly interface where users can preview the dataset, input questions, and receive AI-generated answers, all within the web application, enabling a smooth and engaging data exploration experience! Try prompting it with a few relevant questions to get a different analysis output.
Extra Notes:
Building Generative AI and Large Language Model (LLM) applications comes with its challenges, and one common issue is API overload. This happens when the LLM API experiences high traffic, causing delays or failures in processing your requests. These overloads are typically temporary and may resolve on their own after a short period. With this in mind, I've thought about how we can improve the code to handle such situations more effectively. One solution is to implement a retry mechanism that can automatically handle transient errors and retry the request if it fails. If you would like to see my solution, you can find my code on GitHub; I have shared a link below.
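As an illustration, here is a generic sketch of such a retry mechanism using the tenacity library (not necessarily the exact code in my repository):

```python
from tenacity import retry, stop_after_attempt, wait_exponential

# Retry the agent call up to 5 times with exponential backoff (2s, 4s, 8s, ...).
# In production you would restrict retries to transient errors such as
# rate-limit or overload responses, rather than retrying on every exception.
@retry(stop=stop_after_attempt(5), wait=wait_exponential(multiplier=2, min=2, max=30))
def run_query(query: str) -> str:
    res = agent.invoke(query)
    return res["output"]

print(run_query("Which product has the highest price?"))
```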
Conclusion
In this tutorial, we've built a robust CSV Agent capable of extracting insights from CSV files and other tabular data using LangChain and ChatAnthropic. We learned how to set up error handling, customize prompts for improved responses, and create an interactive UI with Streamlit.
If you found this article helpful, have suggestions for improvement, or are interested in creating Generative AI solutions, I'd love to hear from you! Feel free to share your thoughts in the comments or send me a DM with your project details. You can also check out my code on GitHub.
Until next time, keep improving and stay awesome. See you later!