Applied Data Analytics 
Building a real data product
Links to various resources
● Github Repository: http://bit.ly/1eLBzki
● Matrix Factorization: http://slidesha.re/15Qssf0
Goals for this Course 
● Apply the ideas and tools learned during all previous program courses 
● Use a real world data set with actionable prediction 
● Present a completed project to faculty and peers 
● Build a data project portfolio 
What are your goals? 
● Understand the Data Science Pipeline 
● Understand what a complete data product looks like 
● Be able to set up and implement a data product in Python
Some Logistics 
This is a small class, so I’m hoping for lots of participation!
Course materials can be found in three places:
● iPython: http://bit.ly/1gJ73Tt
● Github: https://github.com/DistrictDataLabs/science-bookclub
● Slides: on SlideShare or on Blackboard
Recommended Reading:
● Matrix Factorization: A simple tutorial and implementation
● http://www.quuxlabs.com/blog/2010/09/matrix-factorization-a-simple-tutorial-and-implementation-in-python/
Agenda - Day One 
● Review Data Products 
● Review Data Science Pipeline 
● Discuss architecture of the data product we’re going to build. 
● Setting up our project 
● Ingestion of Goodreads Data 
● Lunch 
● Creating a command line admin program 
● Wrangling of Goodreads Data 
● A computational data store
Agenda - Day Two 
● Review current state of recommender project 
● Matrix math review 
● Introduction to matrix factorization 
● Building a recommender system 
● Reporting with Jinja2 
● Lunch 
● Presentations of Capstone Projects 
● Course wrap-up
Building Data Products
“A data product is a product that is based on the combination of data and algorithms.”
Hilary Mason
Building Data Products with Python (Georgetown)
“A data application acquires its value from the data itself, and creates more data as a result. It’s not just an application with data; it’s a data product. Data science enables the creation of data products.”
Mike Loukides
The Data Science Pipeline
1. Data Ingestion
2. Data Munging and Wrangling
3. Computation and Analyses
4. Modeling and Application
5. Reporting and Visualization
Data Ingestion 
● There is a world of data out there. How do we get it? Web crawlers, APIs, sensors? Python and other web scripting languages are custom made for this task.
● The real question is: how can we deal with such a giant volume and velocity of data?
● Big Data and Data Science often require ingestion specialists!
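A minimal sketch of the ingestion idea above, using only the standard library's urllib: fetch a URL and store the payload untouched so it can be wrangled later. The names `fetch_bytes`, `ingest`, and `raw_dir` are hypothetical and chosen for illustration; this is not the Goodreads API or the course's actual ingestion module.

```python
import hashlib
import os
import tempfile
from urllib.request import urlopen


def fetch_bytes(url, timeout=10):
    """Download a URL and return the raw response body as bytes."""
    with urlopen(url, timeout=timeout) as response:
        return response.read()


def ingest(url, raw_dir, fetch=fetch_bytes):
    """Fetch `url` and store the untouched payload under `raw_dir`.

    The file name is a hash of the URL, so ingesting the same URL
    again overwrites the earlier copy instead of adding a duplicate.
    """
    os.makedirs(raw_dir, exist_ok=True)
    name = hashlib.sha1(url.encode("utf-8")).hexdigest() + ".raw"
    path = os.path.join(raw_dir, name)
    with open(path, "wb") as handle:
        handle.write(fetch(url))
    return path


if __name__ == "__main__":
    # A stand-in fetcher (an assumption) so the demo runs offline.
    fake = lambda url: b"<html><title>Example book</title></html>"
    with tempfile.TemporaryDirectory() as tmp:
        saved = ingest("http://example.com/book/1", tmp, fetch=fake)
        print(os.path.basename(saved))
```

Passing the fetcher in as an argument keeps the storage logic testable without network access; a real run would simply call `ingest(url, raw_dir)`.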
Data Wrangling 
● Warehousing the data means storing the data in as raw a form as possible.
● Extract, transform, and load operations move data to operational storage locations.
● Filtering, aggregation, normalization, and denormalization all ensure data is in a form it can be computed on.
● Annotated training sets must be created for ML tasks.
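The filter, normalize, and load steps above can be sketched in a few lines. This example uses the standard library's sqlite3 as a stand-in for the SQLAlchemy layer the course uses; the `reviews` table, its column names, and the 1-5 star rating scale are all assumptions made for illustration.

```python
import sqlite3


def wrangle(raw_reviews, conn):
    """Filter, normalize, and load raw review records into SQLite.

    raw_reviews: iterable of dicts with 'reader', 'title', and 'rating'
    keys, where rating is a number on an assumed 1-5 star scale or None.
    Returns the number of records that passed the filter.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS reviews ("
        "reader TEXT NOT NULL, title TEXT NOT NULL, rating REAL NOT NULL, "
        "PRIMARY KEY (reader, title))"
    )
    rows = []
    for record in raw_reviews:
        rating = record.get("rating")
        # Filtering: drop records with a missing or out-of-range rating.
        if rating is None or not 1 <= rating <= 5:
            continue
        # Normalization: tidy the names and rescale stars to [0, 1].
        reader = record["reader"].strip().lower()
        title = record["title"].strip()
        rows.append((reader, title, (rating - 1) / 4.0))
    # INSERT OR REPLACE keeps one rating per (reader, title) pair.
    conn.executemany("INSERT OR REPLACE INTO reviews VALUES (?, ?, ?)", rows)
    conn.commit()
    return len(rows)


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    sample = [
        {"reader": " Ann ", "title": "Dune", "rating": 5},
        {"reader": "bob", "title": "Dune", "rating": None},
    ]
    print(wrangle(sample, conn))
```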
Computation and Analyses 
● Hypothesis-driven computation includes the design and development of predictive models.
● Many models have to be trained or constrained into a computational form like a graph database, and this is time-consuming.
● Other data products like indices, relations, classifications, and clusters may be computed.
Modeling and Application 
This is the part we’re most familiar with: supervised classification and unsupervised clustering with Bayes, Logistic Regression, Decision Trees, and other models.
This is also where the money is.
Reporting and Visualization 
● Often overlooked, this part is crucial, even if we have data products.
● Humans recognize patterns better than machines. Human feedback is crucial in Active Learning and remodeling (error detection).
● Mashups and collaborations generate more data, and therefore more value!
Don’t forget feedback! 
(Active Learning for Data 
Products)
What we’re going to build today 
SCIENCE BOOKCLUB!! 
● A book club that chooses what to read via a recommender system.
● Ingests Goodreads data and returns feedback on books.
● The statistical model is a non-negative matrix factorization.
● Reporting using Jinja2 (almost a web app)
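To make the non-negative matrix factorization mentioned above concrete, here is a minimal numpy sketch. It uses multiplicative updates, one common way to fit such a model, and is not necessarily the method used in the course repository. The function name `nmf`, its parameters, and the sample ratings are assumptions. Note the simplification: zeros in the matrix are treated as real ratings rather than as missing values, which a production recommender would handle with a mask.

```python
import numpy as np


def nmf(V, rank, iterations=500, eps=1e-9, seed=0):
    """Approximate a non-negative matrix V (readers x books) as W @ H.

    W holds one row of latent preferences per reader and H holds one
    column of latent features per book; both stay non-negative.
    """
    rng = np.random.default_rng(seed)
    n_readers, n_books = V.shape
    W = rng.random((n_readers, rank)) + eps
    H = rng.random((rank, n_books)) + eps
    for _ in range(iterations):
        # Multiplicative updates rescale each entry by a non-negative
        # ratio, so W and H can never become negative.
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H


if __name__ == "__main__":
    # Made-up star ratings: rows are readers, columns are books.
    ratings = np.array([
        [5, 3, 0, 1],
        [4, 0, 0, 1],
        [1, 1, 0, 5],
        [1, 0, 0, 4],
        [0, 1, 5, 4],
    ], dtype=float)
    W, H = nmf(ratings, rank=2)
    print(np.round(W @ H, 2))
```

The product `W @ H` fills every cell of the matrix with a predicted score, so the books a reader has not rated can be ranked by their predicted values.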
Workflow 
1. Setting up a Python skeleton 
2. Creating and Running Tests 
3. Wading in with a configuration 
4. Ingestion with urllib and requests 
5. Creating a command line admin with argparse 
6. Wrangling with BeautifulSoup and SQLAlchemy 
7. Modeling with numpy 
8. Reporting with Jinja2
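Step 5 of the workflow above, a command line admin with argparse, might look roughly like the sketch below. The program name `bookclub-admin`, the `ingest` and `wrangle` subcommands, and their options are hypothetical; the handlers only return a message describing what they would do.

```python
import argparse


def build_parser():
    """Create a parser with one subcommand per pipeline stage."""
    parser = argparse.ArgumentParser(
        prog="bookclub-admin",
        description="Administrative commands for the book club project.",
    )
    subparsers = parser.add_subparsers(dest="command", required=True)

    ingest = subparsers.add_parser("ingest", help="download raw data")
    ingest.add_argument("urls", nargs="+", help="one or more URLs to fetch")
    ingest.add_argument("--raw-dir", default="raw", help="raw data directory")

    wrangle = subparsers.add_parser("wrangle", help="load raw data into the database")
    wrangle.add_argument("--database", default="bookclub.db", help="database file")

    return parser


def main(argv=None):
    """Parse the arguments and dispatch to a placeholder handler."""
    args = build_parser().parse_args(argv)
    if args.command == "ingest":
        return "would ingest {} url(s) into {}".format(len(args.urls), args.raw_dir)
    return "would wrangle into {}".format(args.database)


if __name__ == "__main__":
    # Explicit arguments keep the demo independent of the real command line.
    print(main(["ingest", "http://example.com/book/1"]))
```

Keeping `main(argv)` separate from the parser makes each command easy to exercise from tests, which fits step 2 of the workflow.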
Octavo Architecture (really clear DSP)
1. Ingestion Module (requests.py) → Raw Data Storage
2. Wrangling Module (BeautifulSoup, SQLAlchemy) → Computational Data Storage
3. Recommender Module (Numpy)
4. Reporting Module (Matplotlib, Jinja2)
Let’s dive into some code!
