Jonathan Dinu

Co-Founder, Zipfian Academy

jonathan@zipfianacademy.com

@clearspandex
@ZipfianAcademy
Data Engineering 101: Building your first data product
May 4th, 2014
Today
• whoami
• Nws Rdr (News Reader)
• The What, Why, and How of Data Products
• Data Engineering
• Building a Pipeline
• Productionizing the Products
• Q&A
Questions? tweet @zipfianacademy #pydata
Today Disclaimer:
All characters appearing in this presentation are
fictitious. Any resemblance to real persons, living
or dead, is purely coincidental.
Today Disclaimer:
This presentation contains strong opinions that
you may or may not agree with. All thoughts are
my own.
Today
• whoami
• Nws Rdr (News Reader)
• The What, Why, and How of Data Products
• Data Engineering
• Building a Pipeline
• Productionizing the Products
• Creating Value for Users
• Q&A
nwsrdr (News Reader)
Source: http://www.groovypost.com/wp-content/uploads/2013/05/Bookmark-Button.png
Get nwsrdr on your desktop
getnews.com/bookmarklet
When browsing the web, simply click +nwsrdr to save any page to nwsrdr
nwsrdr
It’s like Prismatic + Pocket + Google Reader (RIP) + Delicious!
• Auto-categorize Articles: Naive Bayes (classification)
• Find Similar Articles: Clustering (unsupervised learning)
• Recommend Articles: Collaborative Filtering
• Suggest Feeds to Follow: Triangle Closing
• No Ads: Real Business Model!
Data Products
Product Built on Data (that you sell)
Product that Generates Data (that you sell), e.g. Facebook
Source: http://gifgif.media.mit.edu/
Data Generating Products
Products that enhance a user’s experience the more “data” the user provides
Ex: Recommender Systems
Source: http://www.adamlaiacano.com/post/57703317453/data-generating-products
Data Science
i.e. solve more problems than you create
But.... How?!?!?!!?
Source: http://estoyentretenido.com/wp-content/uploads/2012/11/Jackie-Chan-Meme.jpg
Data Engineering
Source: http://www.schooljotter.com/imagefolders/lady/Class_3/Engineer-It-1350063721.PNG
Data Science
[Diagram: Prepared Data, Sampling, Training Set, Test Set, Train Model, Cross Validation, Evaluate]
Data Engineering
[Diagram: Raw Data, Scrubbing, Cleaned Data, Vectorization, Prepared Data, Sampling, Training Set, Test Set, Train Model, Cross Validation, Evaluate; New Data, Scrubbing, Cleaned Data, Vectorization, Prepared Data, Predict, Labels/Classes]
What
• Naive Bayes (classification)	

• Clustering (unsupervised learning)	

• Collaborative Filtering	

• Triangle Closing	

• Real Business Model
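Of the techniques listed, triangle closing is perhaps the quickest to sketch. The idea is that if two nodes in a graph share neighbors but are not yet connected, the missing edge is a candidate suggestion. The graph below and the function name are made up for illustration; the talk does not show how nwsrdr would implement this.

```python
from collections import Counter

def suggest_by_triangle_closing(graph, node, top_n=3):
    """Rank nodes two hops from `node` by how many neighbors they share with it."""
    neighbors = graph.get(node, set())
    counts = Counter()
    for friend in neighbors:
        for candidate in graph.get(friend, set()):
            # skip the node itself and anything it is already connected to
            if candidate != node and candidate not in neighbors:
                counts[candidate] += 1
    return [candidate for candidate, _ in counts.most_common(top_n)]

# hypothetical undirected graph of readers with overlapping feeds
graph = {
    "ann": {"bob", "cat"},
    "bob": {"ann", "cat", "dan"},
    "cat": {"ann", "bob", "dan", "eve"},
    "dan": {"bob", "cat"},
    "eve": {"cat"},
}
suggestions = suggest_by_triangle_closing(graph, "ann")
print(suggestions)
```

Here "dan" closes two triangles with "ann" (through "bob" and "cat") and "eve" closes one, so "dan" is ranked first.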
How
Abstraction (Cake)
(ABK)
Source: http://media.tumblr.com/tumblr_lakcynCyG31qbzcoy.jpg
Obligatory Name Drop
Stage | Locally | At Scale
Acquisition | requests | scrapy
Parse | BeautifulSoup4 | Hadoop Streaming (w/ BeautifulSoup4)
Storage | pymongo | Snakebite (HDFS)
Transform/Explore | pandas | mrjob or Mortar (w/ Python UDF)
Vectorization and Train Model | scikit-learn/NLTK | MLlib (pySpark)
Expose | yHat | yHat
Presentation | Flask | Flask
Pipeline
Iteration 0:
• Find out how much data	

• Run locally	

• Experiment
Acquire
Retrieve Meta-data for ALL NYT articles
import requests

api_key = 'xxxxxxxxxxxxx'

url = ('http://api.nytimes.com/svc/search/v2/articlesearch.json'
       '?fq=section_name.contains:("Arts" "Business Day" "Opinion" '
       '"Sports" "U.S." "World")&sort=newest&api-key=' + api_key)

# make an API request
api = requests.get(url)
# `collection` is a pymongo collection created earlier (not shown on the slides)
# parse the resulting JSON and insert into a MongoDB collection
for content in api.json()['response']['docs']:
    if not collection.find_one(content):
        collection.insert_one(content)

# the API only returns 10 documents per page
print("There are only %i documents returned 0_o" %
      len(api.json()['response']['docs']))
# there are many more than 10 articles, however
total_art = articles_left = api.json()['response']['meta']['hits']

print("There are currently %s articles in the NYT archive" % total_art)

#=> There are currently 15277775 articles in the NYT archive
Acquire
Gotchas!
• Rate Limiting	

• Page Limiting
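One common way to cope with the first gotcha is to retry with an increasing delay. The sketch below is not the code from the talk (which sleeps for a fixed two seconds). It swaps in exponential backoff, takes the request as a plain callable so it can be tried without network access, and treats both 403 and 429 as "rate limited", which is an assumption about the API. Page limiting is not covered here.

```python
import time

def fetch_with_backoff(fetch, max_retries=5, base_delay=1.0, sleep=time.sleep):
    """Call `fetch()` until it succeeds, doubling the wait after each rate-limited reply.

    `fetch` is a hypothetical callable returning (status_code, payload).
    """
    delay = base_delay
    for _ in range(max_retries):
        status, payload = fetch()
        if status == 200:
            return payload
        if status not in (403, 429):
            raise RuntimeError("unexpected status: %s" % status)
        sleep(delay)   # rate limited: wait, then try again
        delay *= 2
    raise RuntimeError("still rate limited after %d attempts" % max_retries)

# simulate two rate-limited replies followed by a success, recording the waits
responses = iter([(403, None), (403, None), (200, {"docs": []})])
waits = []
result = fetch_with_backoff(lambda: next(responses), sleep=waits.append)
print(result, waits)
```

Passing `sleep` as a parameter keeps the example fast to run; in real use the default `time.sleep` applies.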
Iterate
Iteration 1:
• (Meaningful) Sample of Data	

• Prototype — “Close the Loop”
Acquire (take 2)
Retrieve Meta-data for ALL NYT articles
import time
from dateutil import parser
from pymongo import errors

# page, page_count, max_pages, last_date, cursor_count and final_page
# are initialized before the loop (not shown on the slides)
# let us loop (and hopefully not hit our rate limit)
while articles_left > 0 and page_count < max_pages:
    more_articles = requests.get(url + "&page=" + str(page) + "&end_date=" + str(last_date))
    print("Inserting page " + str(page))
    # make sure it was successful
    if more_articles.status_code == 200:
        for content in more_articles.json()['response']['docs']:
            latest_article = parser.parse(content['pub_date']).strftime("%Y%m%d")
            if not collection.find_one(content) and content['document_type'] == 'article':
                print("No dups")
                try:
                    print("Inserting article " + str(content['headline']))
                    collection.insert_one(content)
                except errors.DuplicateKeyError:
                    print("Duplicates")
                    continue
            else:
                print("In collection already")
        # …
Iteration 0.5
# (continued) still inside the `if more_articles.status_code == 200:` branch
        articles_left -= 10
        page += 1
        page_count += 1
        cursor_count += 1
        final_page = max(final_page, page)
    else:
        if more_articles.status_code == 403:
            print("Sleepy...")
            # account for rate limiting
            time.sleep(2)
        elif cursor_count > 100:
            print("Adjusting date")
            # account for page limiting
            cursor_count = 0
            page = 0
            last_date = latest_article
        else:
            print("ERRORS: " + str(more_articles.status_code))
            cursor_count = 0
            page = 0
            last_date = latest_article
Acquire
Download HTML content of articles from NYT.com
(and store in MongoDB™)
from bs4 import BeautifulSoup

# now we can get some content!
#limit = 100
limit = 10000

for article in collection.find({'html' : {'$exists' : False} }):
    if limit and limit > 0:
        if 'html' not in article and article['document_type'] == 'article':
            limit -= 1
            print(article['web_url'])
            html = requests.get(article['web_url'] + "?smid=tw-nytimes")

            if html.status_code == 200:
                soup = BeautifulSoup(html.text, 'html.parser')

                # serialize html
                collection.update_one({ '_id' : article['_id'] }, { '$set' :
                    { 'html' : str(soup), 'content' : [] } })

                for p in soup.find_all('div', class_='articleBody'):
                    collection.update_one({ '_id' : article['_id'] }, { '$push' :
                        { 'content' : p.get_text() } })
Parse
Parse HTML with BeautifulSoup and Extract the article Body
(and store in MongoDB™)
from bs4 import BeautifulSoup

# parse HTML content of articles
for article in collection.find({'html' : {'$exists' : True} }):
    print(article['web_url'])
    soup = BeautifulSoup(article['html'], 'html.parser')
    arts = soup.find_all('div', class_='articleBody')

    # fall back to the other article layout when no body div is found
    if len(arts) == 0:
        arts = soup.find_all('p', class_='story-body-text')

    # …
Parse
Store
# (continued) inside the `for article in ...` loop from the Parse step
    for p in arts:
        collection.update_one({ '_id' : article['_id'] },
                              { '$push' : { 'content' : p.get_text() } })
Store
Explore
Exploratory Data Analysis with pandas
import matplotlib.pyplot as plt

# `articles` is a pandas DataFrame with `text` and `section` columns
articles.describe()
#          text  section
# count    1405     1405
# unique   1397       10

fig = plt.figure()
# histogram of section counts
articles['section'].value_counts().plot(kind='bar')
Explore
error with NYT API
api_key = 'xxxxxxxxxxxxx'

url = ('http://api.nytimes.com/svc/search/v2/articlesearch.json'
       '?fq=section_name.contains:("Arts" "Business Day" "Opinion" '
       '"Sports" "U.S." "World")&sort=newest&api-key=' + api_key)

# make an API request
api = requests.get(url)
Vectorize
Tokenize article text and create feature vectors with NLTK
import nltk
from nltk import tokenize
from nltk.corpus import stopwords

wnl = nltk.WordNetLemmatizer()
# build the stopword set once instead of on every word
stop_words = set(stopwords.words('english'))

def tokenize_and_normalize(chunks):
    words = [ tokenize.word_tokenize(sent) for sent in
              tokenize.sent_tokenize("".join(chunks)) ]
    flatten = [ inner for sublist in words for inner in sublist ]
    stripped = []

    for word in flatten:
        if word not in stop_words:
            try:
                stripped.append(word.encode('latin-1').decode('utf8').lower())
            except (UnicodeEncodeError, UnicodeDecodeError):
                print("Cannot encode: " + word)

    # drop single-character tokens (mostly punctuation)
    no_punks = [ word for word in stripped if len(word) > 1 ]
    return [wnl.lemmatize(t) for t in no_punks]
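The slides do not show how the token lists become the feature matrix `X` used for training. One simple option is a bag-of-words count vector per article. The sketch below uses only the standard library; the function names and the three tiny "articles" are invented for illustration, and the talk may well have used a scikit-learn vectorizer instead.

```python
def build_vocabulary(token_lists):
    """Map every distinct token to a column index (sorted for reproducibility)."""
    vocab = sorted({tok for tokens in token_lists for tok in tokens})
    return {tok: i for i, tok in enumerate(vocab)}

def vectorize(tokens, vocabulary):
    """Turn one token list into a row of per-term counts."""
    row = [0] * len(vocabulary)
    for tok in tokens:
        if tok in vocabulary:      # ignore tokens never seen when building the vocabulary
            row[vocabulary[tok]] += 1
    return row

# hypothetical output of a tokenizer for three short articles
docs = [["market", "stock", "market"], ["game", "team"], ["stock", "team"]]
vocabulary = build_vocabulary(docs)
X = [vectorize(tokens, vocabulary) for tokens in docs]
print(vocabulary)
print(X)
```

Each row of `X` has one column per vocabulary term, so every article becomes a vector of the same length regardless of how long its text is.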
Train
Train and score a model with scikit-learn
# hold out 30% of the data as a test set
# X is the feature matrix and labels holds the section names
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

xtrain, xtest, ytrain, ytest = train_test_split(X, labels, test_size=0.3)

# train a model
alpha = 1
multi_bayes = MultinomialNB(alpha=alpha)

multi_bayes.fit(xtrain, ytrain)
multi_bayes.score(xtest, ytest)
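To make the role of `alpha` concrete, here is a minimal multinomial Naive Bayes written from scratch. It is a sketch of the general technique with add-`alpha` (Laplace) smoothing, not scikit-learn's implementation, and the four count vectors and labels are made up.

```python
import math
from collections import Counter, defaultdict

def train_multinomial_nb(X, y, alpha=1.0):
    """Estimate log priors and smoothed log likelihoods from count vectors."""
    n_features = len(X[0])
    class_counts = Counter(y)
    feature_counts = defaultdict(lambda: [0] * n_features)
    for row, label in zip(X, y):
        for j, count in enumerate(row):
            feature_counts[label][j] += count
    log_prior = {c: math.log(n / len(y)) for c, n in class_counts.items()}
    log_likelihood = {}
    for c, counts in feature_counts.items():
        # alpha is added to every term count so unseen terms never get probability 0
        total = sum(counts) + alpha * n_features
        log_likelihood[c] = [math.log((k + alpha) / total) for k in counts]
    return log_prior, log_likelihood

def predict(row, log_prior, log_likelihood):
    """Pick the class with the highest log posterior (up to a constant)."""
    scores = {c: log_prior[c] + sum(x * w for x, w in zip(row, log_likelihood[c]))
              for c in log_prior}
    return max(scores, key=scores.get)

# hypothetical term counts for four articles over a four-term vocabulary
X_toy = [[2, 1, 0, 0], [1, 2, 0, 0], [0, 0, 2, 1], [0, 0, 1, 2]]
y_toy = ["Business", "Business", "Sports", "Sports"]
log_prior, log_likelihood = train_multinomial_nb(X_toy, y_toy, alpha=1.0)
print(predict([1, 1, 0, 0], log_prior, log_likelihood))
```

With `alpha=1`, a term that never appeared in "Sports" articles still gets a small non-zero probability there, which is why a single unfamiliar word cannot rule a class out entirely.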
Gotchas!
• Model only exists locally on Laptop	

• Not Automated for realtime prediction
Iteration 2:
• Expose your model	

• Automate your processes
Exposé
Getting that model off your lap(top)
Source: http://pixel.nymag.com/imgs/daily/vulture/2012/03/09/09_joan-taylor.o.jpg/a_560x0.jpg
A model is just a function
Inputs...
Outputs...
Serialize your model with pickle (or cPickle or joblib)
Persistence
Source: http://www.glogster.com/mrsallenballard/pickles-i-love-em-/g-6mevh13be74mgnc9i8qifa0
SerDes
• Disk	

• Database	

• Memory	
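The targets above can be sketched with the standard `pickle` module. The `model` dictionary is a stand-in for a trained estimator (any picklable object behaves the same way), and the file name is arbitrary. The in-memory bytes are also what you would store in a database field.

```python
import os
import pickle
import tempfile

# stand-in for a trained model; any picklable Python object works the same way
model = {"classes": ["Arts", "Sports"], "alpha": 1}

# memory (or a database field): serialize to bytes and back
blob = pickle.dumps(model)
from_memory = pickle.loads(blob)

# disk: write the pickle to a file and load it again
path = os.path.join(tempfile.mkdtemp(), "model.pkl")
with open(path, "wb") as f:
    pickle.dump(model, f)
with open(path, "rb") as f:
    from_disk = pickle.load(f)

print(from_memory == model, from_disk == model)
```

Only unpickle data you trust: loading a pickle can execute arbitrary code.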

Exposé
Deploy your Model to yHat
from yhat import Yhat, YhatModel, preprocess

# `vectorizer` and `multi_bayes` are the fitted objects from the earlier steps
class DocumentClassifier(YhatModel):
    @preprocess(in_type=dict, out_type=dict)
    def execute(self, data):
        featureBody = vectorizer.transform([data['content']])
        result = multi_bayes.predict(featureBody)
        list_res = result.tolist()
        return {"section_name": list_res}

clf = DocumentClassifier()
yh = Yhat("jonathan@zipfianacademy.com", "xxxxxx",
          "http://cloud.yhathq.com/")
yh.deploy("documentClassifier", DocumentClassifier, globals())
Present
Create a Flask application to expose your model on the web
from flask import Flask, request, jsonify
from yhat import Yhat

app = Flask(__name__)
yh = Yhat("<USERNAME>", "<API KEY>", "http://cloud.yhathq.com/")

@app.route('/')
def index():
    return app.send_static_file('index.html')

@app.route('/predict', methods=['POST'])
def predict():
    article = request.form['article']
    # the model name must match the name used in yh.deploy(...)
    results = yh.predict("documentClassifier", { 'content': article })
    return jsonify({"results": results})
Pipeline
Only Data should Flow
Data
Remember to Remember (Lineage)
Acquisition, Parse, Storage, Transform/Explore, Vectorization, Train Model, Expose, Presentation
Pipeline
Immutable, append-only set of Raw Data
Computation is a view on data
*Lambda Architecture by Nathan Marz
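A toy version of the two statements above: raw records are only ever appended, and anything derived from them is a function that can be recomputed from scratch. The record fields and function names are invented for illustration, and this is a single-process sketch rather than the Lambda Architecture itself.

```python
raw_log = []  # append-only: records are added, never updated or deleted

def record(event):
    """Append a copy of the event so later mutation cannot alter history."""
    raw_log.append(dict(event))

def section_counts(log):
    """A 'view': derived entirely from the raw records, so it can be rebuilt at any time."""
    counts = {}
    for event in log:
        counts[event["section"]] = counts.get(event["section"], 0) + 1
    return counts

record({"url": "a", "section": "Arts"})
record({"url": "b", "section": "Sports"})
record({"url": "c", "section": "Arts"})
view = section_counts(raw_log)
print(view)
```

If the counting logic turns out to be wrong, only `section_counts` changes; the raw data stays untouched and the view is simply recomputed.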
Pipeline
Functional Data Science
• Modularity	

• Define interfaces	

• Separate data from computation	

• Data Lineage
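The four points above can be sketched as a pipeline of small, pure functions whose intermediate results are recorded along the way. The step functions and the `run_pipeline` helper are hypothetical names chosen for this example, not something shown in the talk.

```python
def scrub(text):
    """Collapse whitespace and lowercase; returns a new string, input is untouched."""
    return " ".join(text.split()).lower()

def tokenize(text):
    """Split scrubbed text into tokens."""
    return text.split(" ")

def run_pipeline(data, steps):
    """Apply each step in order, recording which function produced each result."""
    lineage = [("raw", data)]
    for step in steps:
        data = step(data)
        lineage.append((step.__name__, data))
    return data, lineage

result, lineage = run_pipeline("  Data   Engineering 101 ", [scrub, tokenize])
print(result)
print([name for name, _ in lineage])
```

Because each step only depends on its input, steps can be swapped or tested in isolation, and the lineage list shows how any output was derived from the raw data.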
Need Robust and Flexible Pipeline!
Whatever you do, DO NOT cross the streams
Pipeline: Where we are
[Diagram: NYT API, MongoDB, BeautifulSoup, NLTK, Feature Matrix, scikit-learn, Model, Deploy, yHat, Web App, Heroku, POST, Predict, Predicted Section]
Gotchas!
• Only have a static subset of articles	

• Pipeline not automated for re-training
Iterate
Iteration 3:
Source: http://vninja.net/wordpress/wp-content/uploads/2013/03/KCaAutomate.png
Where we are
[Diagram: NYT API, MongoDB, cron, NLTK, Feature Matrix, scikit-learn, Model, Deploy, yHat, Web App, Heroku, POST, Predict, Predicted Section, Amazon EC2]
How to Scale
testing
Start small (data) and fast (development)
testing
Increase size of data set
Optimize and productionize
PROFIT! $$$
How to Scale
testing
Develop locally
testing
Distribute computation (run on cluster)
Tune parameters
PROFIT! $$$
Can also use a streaming algorithm or single-machine, disk-based “medium data” technologies (e.g. database or memory-mapped files)
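As a small example of the streaming option mentioned above, the mean of a value can be updated one record at a time, so the full data set never has to fit in memory. The generator below is a hypothetical stand-in for reading article word counts from a file or a database cursor.

```python
def running_mean(values):
    """Single pass, constant memory: update the mean incrementally."""
    count, mean = 0, 0.0
    for value in values:
        count += 1
        mean += (value - mean) / count
    return mean

def article_lengths():
    # hypothetical stand-in for reading word counts one at a time from disk or a cursor
    for length in (800, 1200, 1000):
        yield length

mean_length = running_mean(article_lengths())
print(mean_length)
```

The same pattern (consume an iterator, keep only a small running state) applies to counts, minimums and maximums, and many other summary statistics.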
Products
If you build it...
Source: http://nateemery.com/wp-content/uploads/2013/05/field-of-dreams.jpg
Q&A
Zipfian
Academy
@ZipfianAcademy
Data Science & Data Engineering	

12-week Bootcamp (May 12th & Sep 8th)
Weekend Workshops
http://zipfianacademy.com/apply
http://zipfianacademy.com/workshops
Next: Interactive Visualizations w/ d3.js (June 7th)
Thank You!
http://zipfianacademy.com
Appendix
Data Sources
Obtain
(ranked by ease of use)
1. DaaS -- Data as a service	

2. Bulk Download	

3. APIs	

4. Web Scraping
DaaS (Data as a Service)
• Time Series/Numeric: Quandl
• Financial Modeling: Quantopian
• Email Contextualization: Rapleaf
• Location and POI: Factual
Bulk Download (just like the good ol’ days)
• File Transfer Protocol (FTP): CDC
• Amazon Web Services: Public Datasets
• Infochimps: Data Marketplace
• Academia: UCI Machine Learning Repository
APIs (if it’s not RESTed, I’m not buying)
• Geographic: Foursquare
• Social: Facebook
• Audio: Rdio
• Content: Tumblr
• Realtime: Twitter
• Hidden: Yahoo Finance
Web Scraping (DIY for life)
1. wget and curl
2. Web Spider/Crawler
3. API scraping
4. Manual Download
Data Formats
• Delimited Values
  • TSV
  • CSV
  • WSV
• JSON
• XML
• Ad Hoc Formats (avoid these if you can)
• JSON is made up of hash tables and arrays
• Hash tables: { "foo" : 1, "bar" : 2, "baz" : "3" }
• Arrays: [1, 2, 3]
• Arrays of arrays: [[1, 2, 3], ["foo", "bar", "baz"]]
• Array of hashes: [{"foo": 1, "bar": 2}, {"baz": 3}]
• Hashes of hashes: {"foo": {"bar": 2, "baz": 3}}
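In Python these shapes map directly onto dicts and lists through the standard `json` module. The document below is a made-up example that nests an array inside a hash inside a hash.

```python
import json

doc = json.loads('{"foo": {"bar": 2, "baz": [1, 2, 3]}}')

bar = doc["foo"]["bar"]        # hash tables become dicts
last = doc["foo"]["baz"][-1]   # arrays become lists

# and back to text again
round_trip = json.dumps(doc, sort_keys=True)
print(bar, last, round_trip)
```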
{"widget": {
    "debug": "on",
    "window": {
        "title": "Sample Konfabulator Widget",
        "name": "main_window",
        "width": 500,
        "height": 500
    },
    "image": {
        "src": "Images/Sun.png",
        "name": "sun1",
        "hOffset": 250,
        "vOffset": 250,
        "alignment": "center"
    },
    "text": {
        "data": "Click Here",
        "size": 36,
        "style": "bold",
        "name": "text1",
        "hOffset": 250,
        "vOffset": 100,
        "alignment": "center",
        "onMouseUp": "sun1.opacity = (sun1.opacity / 100) * 90;"
    }
}}
• XML is a recursive self-describing container
<container>
  <item>Foo</item>
  <item>Bar</item>
  <item>
    <container>
      <item attr="SomethingAboutBaz">Baz</item>
    </container>
  </item>
</container>
<widget>
  <debug>on</debug>
  <window title="Sample Konfabulator Widget">
    <name>main_window</name>
    <width>500</width>
    <height>500</height>
  </window>
  <image src="Images/Sun.png" name="sun1">
    <hOffset>250</hOffset>
    <vOffset>250</vOffset>
    <alignment>center</alignment>
  </image>
  <text data="Click Here" size="36" style="bold">
    <name>text1</name>
    <hOffset>250</hOffset>
    <vOffset>100</vOffset>
    <alignment>center</alignment>
    <onMouseUp>
      sun1.opacity = (sun1.opacity / 100) * 90;
    </onMouseUp>
  </text>
</widget>
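A document like the widget example can be read with Python's standard `xml.etree.ElementTree` module. The snippet below parses a shortened copy of it and pulls out one attribute and two child elements; note that every value comes back as a string and has to be converted explicitly.

```python
import xml.etree.ElementTree as ET

xml_text = """
<widget>
  <debug>on</debug>
  <window title="Sample Konfabulator Widget">
    <name>main_window</name>
    <width>500</width>
  </window>
</widget>
""".strip()

root = ET.fromstring(xml_text)
window = root.find("window")

title = window.get("title")            # attributes are read with .get()
width = int(window.findtext("width"))  # element text is always a string
debug = root.findtext("debug")
print(title, width, debug)
```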
• Ad hoc data formats	

• Fixed-width (Census data)	

• Graph Edgelists
• Voting records
• etc.
• 7-5-5 format (fixed-width columns of 7, 5, and 5 characters)
Sam    foo  bar
Roger  baz  6
Jane   314  99
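A fixed-width format like this has no delimiters, so a parser has to slice each line at known offsets. The sketch below assumes the 7-5-5 layout means three columns of 7, 5, and 5 characters padded with spaces; the helper name is made up.

```python
def parse_fixed_width(line, widths=(7, 5, 5)):
    """Slice a line into fields at fixed column widths and strip the padding."""
    fields, start = [], 0
    for width in widths:
        fields.append(line[start:start + width].strip())
        start += width
    return fields

# build the sample lines with explicit padding so the column widths are exact
sample = [
    "Sam".ljust(7) + "foo".ljust(5) + "bar".ljust(5),
    "Roger".ljust(7) + "baz".ljust(5) + "6".ljust(5),
    "Jane".ljust(7) + "314".ljust(5) + "99".ljust(5),
]
rows = [parse_fixed_width(line) for line in sample]
print(rows)
```

Splitting on whitespace would also work for these three rows, but it breaks as soon as a field is empty or contains a space, which is why fixed-width data is sliced by position.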
• Directed Graph Format
1 2
1 3
1 4
2 3
4 4
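An edge list like this is usually turned into an adjacency structure before doing anything with it. A minimal sketch, assuming each line reads "source target" and that node ids are integers:

```python
from collections import defaultdict

edge_list = """1 2
1 3
1 4
2 3
4 4"""

def parse_edges(text):
    """Build {source: [targets]} from whitespace-separated 'source target' lines."""
    graph = defaultdict(list)
    for line in text.splitlines():
        source, target = line.split()
        graph[int(source)].append(int(target))  # directed: source -> target only
    return dict(graph)

graph = parse_edges(edge_list)
print(graph)
```

Node 3 has no outgoing edges, so it appears only as a target, and the last line (4 4) is a self-loop.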
Programming languages like Python, Ruby, and R have built-in or readily available parsers for data formats such as JSON and CSV. For other, more esoteric formats you will probably have to write your own.
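For example, Python's standard `csv` module handles delimited values, including tab-separated files, without any third-party code. The two data rows below are made up.

```python
import csv
import io

tsv_text = "name\tsection\nSam\tArts\nJane\tSports\n"

# DictReader uses the first row as field names; delimiter switches CSV to TSV
rows = list(csv.DictReader(io.StringIO(tsv_text), delimiter="\t"))
print(rows[0]["name"], rows[1]["section"])
```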
Data Engineering 101: Building your first data product by Jonathan Dinu PyData SV 2014