Sri Krishnamurthy
Fall 2023
Projects
Project 1
PDF Summary Tool: Students are tasked with creating
a Streamlit app that can summarize PDF documents.
They must choose between using nougat or pypdf
libraries to process PDFs from the SEC. The app should
allow users to select the library, test various PDFs,
provide pros/cons for each tool, and recommend one.
Additionally, students need to create and integrate
architectural diagrams of the project within Streamlit.
Data Quality Evaluation Tool: This part involves
building a Streamlit tool using the Freddie Mac single-
family dataset. The tool, designed for data quality
evaluation, should allow users to upload CSV/XLS files
and specify their type (Origination/Monthly
Performance). The tool will use pandas-profiling to
summarize data and greatexpectations to validate data
schema, integrity, and completeness.
Architecture
Tools Used
• Streamlit
• Nougat or PyPDF libraries
• pandas-profiling
• greatexpectations
• Diagrams tool for architecture
Data Engineering and building tools to summarize
SEC and Freddie Mac datasets
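The three checks named for the Project 1 data quality tool (schema, integrity, completeness) can be illustrated with a minimal sketch that uses only the Python standard library. It is not the Great Expectations API, and the column names (`loan_id`, `credit_score`, `orig_upb`) and the 300 to 850 score range are hypothetical placeholders rather than the actual Freddie Mac layout.

```python
import csv
import io

# Hypothetical expected columns; not the real Freddie Mac layout.
EXPECTED_COLUMNS = ["loan_id", "credit_score", "orig_upb"]


def evaluate_quality(csv_text, expected_columns=EXPECTED_COLUMNS):
    """Return a small report covering schema, completeness, and integrity."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = list(reader)
    header = reader.fieldnames or []

    report = {
        # Schema: the header must match the expected columns, in order.
        "schema_ok": header == list(expected_columns),
        "row_count": len(rows),
        "missing_values": {},   # completeness: empty cells per column
        "duplicate_ids": [],    # integrity: repeated loan identifiers
        "invalid_scores": [],   # integrity: scores outside the assumed range
    }
    if not report["schema_ok"]:
        return report

    seen = set()
    for row in rows:
        for col in expected_columns:
            if not (row[col] or "").strip():
                counts = report["missing_values"]
                counts[col] = counts.get(col, 0) + 1
        loan_id = row["loan_id"]
        if loan_id in seen:
            report["duplicate_ids"].append(loan_id)
        seen.add(loan_id)
        score = (row["credit_score"] or "").strip()
        # Assumed valid range of 300-850, for illustration only.
        if score and not (score.isdigit() and 300 <= int(score) <= 850):
            report["invalid_scores"].append(loan_id)
    return report


SAMPLE = """loan_id,credit_score,orig_upb
A1,720,200000
A2,,150000
A2,910,175000
"""

if __name__ == "__main__":
    print(evaluate_quality(SAMPLE))
```

In a Streamlit app, a report like this could be rendered per uploaded file, with a separate set of expected columns for the Origination and Monthly Performance file types.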
Project 2
A system using Large Language Models to
summarize PDF documents from the SEC website.
The project for the Big Data and Intelligent Analytics
graduate course, as detailed in Assignment 2, involves
developing a tool for analysts to load PDF documents
and obtain summaries.
The project includes evaluating nougat and pypdf
libraries for processing PDFs from the SEC, replicating
a demo from the OpenAI Cookbook, and creating
Jupyter notebooks that can handle SEC PDF
documents. Additionally, students are tasked with
designing FastAPI endpoints for a Streamlit app, updating the
app with new functionalities, and revising design
documents and architectural diagrams to reflect the
updates.
Tools Used
• Streamlit
• Nougat or PyPDF libraries
• FastAPI
• OpenAI APIs
• greatexpectations
• Diagrams tool for architecture
visualization
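SEC filings are often longer than a language model can take in a single request, so summarization in a project like Project 2 typically splits the extracted text first. The following is a minimal sketch of that idea, assuming a word-based chunk size and overlap chosen arbitrarily. `summarize_fn` is a hypothetical stand-in for whatever LLM call the app makes, and `first_sentence` is only a placeholder so the example runs without network access.

```python
def split_into_chunks(text, max_words=200, overlap=20):
    """Split text into word-based chunks that overlap by `overlap` words."""
    if max_words <= 0 or not 0 <= overlap < max_words:
        raise ValueError("need max_words > 0 and 0 <= overlap < max_words")
    words = text.split()
    chunks = []
    step = max_words - overlap
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # the last chunk already reaches the end of the text
    return chunks


def summarize_document(text, summarize_fn, max_words=200, overlap=20):
    """Summarize each chunk, then summarize the combined chunk summaries."""
    chunks = split_into_chunks(text, max_words, overlap)
    partial = [summarize_fn(chunk) for chunk in chunks]
    if len(partial) <= 1:
        return partial[0] if partial else ""
    return summarize_fn("\n".join(partial))


def first_sentence(text):
    """Placeholder summarizer; a real app would call an LLM API here."""
    return text.split(".")[0].strip() + "."


if __name__ == "__main__":
    sample = "Revenue grew during the year. Costs were flat. " * 50
    print(len(split_into_chunks(sample)))
    print(summarize_document(sample, first_sentence))
```

The overlap keeps a sentence that straddles a chunk boundary visible in both neighbouring chunks, at the cost of sending a few repeated words to the model.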
This project focused on
automating the creation of
embeddings and populating a
vector database. Key components
include:
Automating Embedding
Creation and Database
Population:
Airflow Pipelines: Two distinct
Airflow pipelines for data
acquisition, embedding
generation, and inserting records
into Pinecone vector database
using SEC PDF files.
Data Processing and Validation:
Implement data validation,
generate embeddings, and save
file extracts.
Client-Facing Application
Development:
FastAPI and Streamlit: Develop
a user registration and login
system with JWT authentication.
Utilize a SQL database for storing
user credentials and application
logs.
Streamlit for User Interface:
Create a secure login page, a
question-answering interface, and
implement a search mechanism
using Pinecone vector database.
Deployment: Containerize each
microservice and deploy on a
public cloud platform.
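The retrieval step behind the question-answering interface can be sketched without a hosted vector database. The example below is an in-memory stand-in, not the Pinecone client API: `InMemoryIndex`, `upsert`, and `query` are hypothetical names, and `embed` is a toy word-count embedding used only so the code runs without an embedding model.

```python
import math
from collections import Counter


def embed(text):
    """Toy embedding: lowercase word counts. A real pipeline would call a model."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine similarity between two sparse vectors stored as Counters."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


class InMemoryIndex:
    """Hypothetical stand-in for a vector database index."""

    def __init__(self):
        self._records = {}

    def upsert(self, record_id, text):
        # Store the vector together with the source text it came from.
        self._records[record_id] = (embed(text), text)

    def query(self, question, top_k=2):
        """Return up to top_k (score, id, text) tuples, best match first."""
        q = embed(question)
        scored = [
            (cosine_similarity(q, vec), rid, text)
            for rid, (vec, text) in self._records.items()
        ]
        scored.sort(key=lambda item: item[0], reverse=True)
        return scored[:top_k]


if __name__ == "__main__":
    index = InMemoryIndex()
    index.upsert("doc1", "annual revenue increased this year")
    index.upsert("doc2", "the company faces litigation risk")
    index.upsert("doc3", "board of directors election")
    for score, rid, text in index.query("what was the revenue this year"):
        print(f"{rid}: {score:.3f} {text}")
```

In the pipeline described above, the Airflow tasks would play the role of `upsert` and the Streamlit question box would trigger the equivalent of `query`, with the retrieved extracts passed to the language model as context.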
Project 3
Using LLMs and RAG for document summarization of
SEC documents
Tools Used
• Airflow
• Pinecone
• FastAPI
• JWT (JSON Web Token)
• SQL Database
• Streamlit
• Docker for containerization
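To make the JWT authentication in Project 3 concrete, here is a minimal sketch of how an HS256 token is issued and checked, written with the standard library only. The function names `create_token` and `verify_token`, the `sub` claim, and the one-hour lifetime are assumptions for illustration. A deployed service would more likely rely on a maintained JWT library than on hand-rolled code like this.

```python
import base64
import hashlib
import hmac
import json
import time


def _b64url(data: bytes) -> str:
    """Base64url-encode without padding, as JWTs require."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def _b64url_decode(text: str) -> bytes:
    """Restore the stripped padding before decoding."""
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def _sign(signing_input: str, secret: str) -> bytes:
    return hmac.new(secret.encode("utf-8"), signing_input.encode("ascii"),
                    hashlib.sha256).digest()


def create_token(payload: dict, secret: str, ttl_seconds: int = 3600) -> str:
    """Build header.payload.signature with an expiry claim added."""
    header = {"alg": "HS256", "typ": "JWT"}
    claims = dict(payload, exp=int(time.time()) + ttl_seconds)
    signing_input = ".".join(
        _b64url(json.dumps(part, separators=(",", ":")).encode("utf-8"))
        for part in (header, claims)
    )
    return signing_input + "." + _b64url(_sign(signing_input, secret))


def verify_token(token: str, secret: str) -> dict:
    """Return the claims, or raise ValueError if the token is not acceptable."""
    try:
        header_b64, payload_b64, signature_b64 = token.split(".")
    except ValueError:
        raise ValueError("malformed token") from None
    expected = _sign(f"{header_b64}.{payload_b64}", secret)
    # Constant-time comparison avoids leaking how many bytes matched.
    if not hmac.compare_digest(expected, _b64url_decode(signature_b64)):
        raise ValueError("invalid signature")
    if json.loads(_b64url_decode(header_b64)).get("alg") != "HS256":
        raise ValueError("unexpected algorithm")
    claims = json.loads(_b64url_decode(payload_b64))
    if claims.get("exp", 0) < time.time():
        raise ValueError("token expired")
    return claims


if __name__ == "__main__":
    token = create_token({"sub": "alice"}, "change-me")
    print(verify_token(token, "change-me")["sub"])
```

A login endpoint would call something like `create_token` after checking the stored credentials, and each protected endpoint would call the equivalent of `verify_token` on the bearer token sent by the Streamlit client.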
Project 4
Using LLMs to interact with
Snowflake using natural language
Data Engineering with Snowpark Python: Students
individually reproduce steps in creating data pipelines
with Snowpark Python, showcasing their work in a
forked repository.
Dataset Analysis: Teams select datasets from
Snowflake's marketplace, creating thematic stories and
Proof of Concept (POC) to address specific problems.
They design architectural diagrams and implement SQL
processes and User-Defined Functions, integrating GitHub
Actions for deployment.
Streamlit and OpenAI Integration: The project
involves connecting Snowflake with Streamlit for
analytics, developing a text-based SQL query feature
using natural language processing, and integrating
OpenAI services for query generation and refinement.
Tools Used
• Snowpark Python
• Snowflake Marketplace
• Streamlit
• OpenAI Services
• SQL Database Management
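The natural-language query feature in Project 4 can be sketched as three steps: build a prompt from the table schema, ask a model for SQL, and check the result before running it. In the sketch below, `sqlite3` stands in for a Snowflake connection so the example is runnable, `generate_sql` is a hypothetical callable wrapping the model request, and the `sales` table is invented for the demo.

```python
import sqlite3


def build_prompt(question, schema):
    """Combine the schema and the user's question into one model prompt."""
    return (
        "Given the table definitions below, write one SQL SELECT statement "
        "that answers the question.\n\n"
        f"Schema:\n{schema}\n\nQuestion: {question}\nSQL:"
    )


def is_safe_select(sql):
    """Coarse guard: accept only a single statement that starts with SELECT.

    This is deliberately conservative (it also rejects a semicolon inside a
    string literal) and is not a substitute for a read-only database role.
    """
    statement = sql.strip().rstrip(";").strip()
    return statement.lower().startswith("select") and ";" not in statement


def answer_question(conn, question, schema, generate_sql):
    """Generate SQL for the question, validate it, and return the rows."""
    sql = generate_sql(build_prompt(question, schema))
    if not is_safe_select(sql):
        raise ValueError(f"refusing to run non-SELECT statement: {sql!r}")
    return conn.execute(sql).fetchall()


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")
    conn.executemany(
        "INSERT INTO sales VALUES (?, ?)",
        [("East", 100), ("East", 200), ("West", 50)],
    )

    def fake_model(prompt):
        # Placeholder for the model call; returns a fixed query.
        return ("SELECT region, SUM(amount) FROM sales "
                "GROUP BY region ORDER BY region;")

    schema = "sales(region TEXT, amount INTEGER)"
    print(answer_question(conn, "Total sales by region?", schema, fake_model))
```

Showing the generated SQL to the user before execution, and letting them edit it, is one way to support the query refinement step the project describes.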
The project involves a thorough review
of the existing architecture
(Assignment 3) and its redesign using
two distinct approaches:
Open Source Components: Utilizing
primarily open-source tools like
Huggingface, LLAMA from Meta,
Amazon Bedrock, etc. The focus is on
creating a flexible and customizable
stack that aligns with the dynamic
needs of the enterprise.
Enterprise Alternatives to OpenAI
Stack: Incorporating enterprise
solutions such as Google Bard,
Anthropic, Cohere, Perplexity, etc. This
approach is geared towards leveraging
the robust and reliable frameworks
offered by leading tech organizations.
Architecture Design: Both use cases
will have detailed architecture
diagrams showcasing preparation
pipelines and inference aspects.
A comparison of the technologies in
terms of hosting and as-a-service
capabilities.
Technology Suitability Analysis:
Justification of selected technologies
based on application suitability.
Evaluation of scalability, reliability,
and performance metrics.
Cost Analysis: Detailed breakdown of
fixed and variable costs for both
architectures.
Analysis includes hosting, annual
licenses, maintenance, API access,
and use-case specific costs (e.g.,
PDF processing).
Comparative study of cost
structures between the original and
new architectures.
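The fixed versus variable cost comparison can be expressed as a simple linear model per architecture. The sketch below is one possible framing, not the course's required method: `CostModel` and `break_even_volume` are hypothetical names, and every dollar figure in the demo is a made-up placeholder rather than a real vendor price.

```python
from dataclasses import dataclass


@dataclass
class CostModel:
    name: str
    fixed_annual: float       # hosting, licenses, maintenance per year
    cost_per_document: float  # API access and PDF processing per document

    def annual_cost(self, documents_per_year: int) -> float:
        """Total yearly cost: fixed part plus usage-driven part."""
        return self.fixed_annual + self.cost_per_document * documents_per_year


def break_even_volume(a: CostModel, b: CostModel):
    """Documents per year at which both architectures cost the same.

    Returns None when the two cost lines never cross at a positive volume.
    """
    rate_diff = a.cost_per_document - b.cost_per_document
    if rate_diff == 0:
        return None
    volume = (b.fixed_annual - a.fixed_annual) / rate_diff
    return volume if volume > 0 else None


if __name__ == "__main__":
    # Placeholder figures for illustration only.
    self_hosted = CostModel("self-hosted stack", 12000.0, 0.25)
    managed_api = CostModel("managed API stack", 2000.0, 0.75)
    for volume in (5_000, 20_000, 50_000):
        print(volume,
              self_hosted.annual_cost(volume),
              managed_api.annual_cost(volume))
    print("break-even:", break_even_volume(self_hosted, managed_api))
```

With these placeholder numbers the managed option is cheaper at low volume and the self-hosted option is cheaper at high volume, which is the kind of crossover a comparative cost study would look for.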
Project 5
Project redesign and rearchitecture
Tools Used
Huggingface: For machine learning and natural language
processing tasks.
LLAMA from Meta: A language model for various analytical
tasks.
Amazon Bedrock: A managed AWS service for accessing
foundation models through an API.
Enterprise Components:
Google Bard: AI-driven data analysis and predictive
modeling.
Anthropic: Advanced AI solutions for complex data tasks.
Cohere: Provides tools for natural language understanding.

Big Data projects.pdf
