Data science at the command line
Rapid prototyping and reproducible science
Sharat Chikkerur
sharat@alum.mit.edu
Principal Data Scientist
Nanigans Inc.
1
Outline
Introduction
Motivation
Data science workflow
Obtaining data
Scrubbing data
Exploring data
Managing large workflows
Modeling using vowpal wabbit
2
Introduction
Setup
• Follow instructions at
https://github.com/sharatsc/cdse
• Install virtualbox virtualbox.org
• Install vagrant vagrantup.com
• Initialize virtual machine
mkdir datascience
cd datascience
vagrant init data-science-toolbox/data-science-at-the-command-line
vagrant up && vagrant ssh
3
About me
• UB Alumni, 2005 (M.S., EE Dept, cubs.buffalo.edu)
• MIT Alumni, 2010 (PhD., EECS Dept, cbcl.mit.edu)
• Senior Software Engineer, Google AdWords modeling
• Senior Software Engineer, Microsoft Machine learning
• Principal data scientist, Nanigans Inc
4
About the workshop
• Based on a book by Jeroen Janssens 1
• Vowpal wabbit 2
1 http://datascienceatthecommandline.com/
2 https://github.com/JohnLangford/vowpal_wabbit
5
POSIX command line
• Exposes operating system functionalities through a shell
• Example shells include bash, zsh, tcsh, fish etc.
• Comprises a large number of utility programs
• Examples: grep, awk, sed, etc.
• GNU user space programs
• Common API: Input through stdin and output to stdout
• stdin and stdout can be redirected during execution
• Pipes (|) allow composition (chaining): the output of one
command is redirected to become the input of another
ls -lh | more
cat file | sort | uniq -c
6
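The same composition idea scales to longer pipelines. A classic sketch using coreutils only — counting word frequencies in a stream of text — shows how each stage stays small and independently testable:

```shell
# Lowercase, split into one word per line, then count and rank
printf 'the cat and the dog\n' \
  | tr '[:upper:]' '[:lower:]' \
  | tr -s ' ' '\n' \
  | sort \
  | uniq -c \
  | sort -rn
```

The first output line is the most frequent word (`2 the` here); swapping any stage (e.g. `tr` for `sed`) leaves the rest of the pipeline untouched.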
Why command line ?
• REPL allows rapid iteration (through immediate feedback)
• Allows composition of scripts and commands using pipes
• Automation and scaling
• Reproducibility
• Extensibility
• R, python, perl, ruby scripts can be invoked like command
line utilities
7
Data science workflow
A typical workflow: the OSEMN model
• Obtaining data
• Scrubbing data
• Exploring data
• Modeling data
• Interpreting data
8
Workshop outline
• Background
• Getting data: Curl, scrape
• Scrubbing: jq, csvkit
• Exploring: csvkit
• Modeling: vowpal wabbit
• Scaling: parallel
Hands on exercises
• Obtaining data walkthrough
• JQ walkthrough
• CSVKit walkthrough
• Vowpal wabbit
9
Workflow example: Boston housing dataset
https://github.com/sharatsc/cdse/blob/master/
boston-housing
10
Boston housing dataset
Python workflow 3
import urllib
import pandas as pd
# Obtain data
urllib.urlretrieve('https://raw.githubusercontent.com/sharatsc/cdse/master/boston-housin
df = pd.read_csv('boston.csv')
# Scrub data
df = df.fillna(0)
# Model data
from statsmodels.formula import api as smf
formula = 'medv ~ ' + ' + '.join(df.columns.difference(['medv']))
model = smf.ols(formula=formula, data=df)
res = model.fit()
res.summary()
Command line workflow
URL="https://raw.githubusercontent.com/sharatsc/cdse/master/boston-housing/boston.csv"
curl $URL | Rio -e 'model=lm("medv~.", df); model'
3
https://github.com/sharatsc/cdse/blob/master/boston-housing
11
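Rio wraps R and may not be on every machine; a one-off statistic such as a column mean can still be computed with awk alone. A minimal sketch on a toy CSV (in the real boston.csv, medv is a later column, so the field index would change):

```shell
# Mean of the medv column (field 2 here), skipping the header row
printf 'id,medv\n1,10\n2,20\n3,30\n' \
  | awk -F, 'NR > 1 { sum += $2; n++ } END { print sum / n }'
# prints 20
```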
Obtaining data
Obtaining data from the web: curl
• CURL (curl.haxx.se)
• Cross platform command line tool that supports data
transfer using
HTTP, HTTPS, FTP, IMAP, SCP, SFTP
• Supports cookies, user+password authentication
• can be used to get data from RESTful APIs 4
• http GET
curl http://www.google.com
• ftp GET
curl ftp://catless.ncl.ac.uk
• scp COPY
curl -u username: --key key_file --pass password scp://example.com/~/file.txt
4
www.codingpedia.org/ama/how-to-test-a-rest-api-from-command-line-with-curl/
12
Scrubbing data
Scrubbing web data: Scrape
• Scrape is a Python command line tool for parsing HTML
documents
• Queries can be made in CSS selector or XPath syntax
htmldoc=$(cat << EOF
<div id=a>
<a href="x.pdf">x</a>
</div>
<div id=b>
<a href="png.png">y</a>
<a href="pdf.pdf">y</a>
</div>
EOF
)
# Select links that end with pdf and are within the div with id=b (CSS3 selector)
echo $htmldoc | scrape -e '#b a[href$=pdf]'
# Select all anchors (use Xpath)
echo $htmldoc | scrape -e "//a"
<a href="pdf.pdf">y</a>
13
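If scrape is not installed, a rough fallback for this particular query is grep over the href attributes — shown only to illustrate the pipe-friendly idea, not as a robust HTML parser (regexes and HTML do not mix in general):

```shell
htmldoc='<a href="x.pdf">x</a><a href="png.png">y</a><a href="pdf.pdf">y</a>'
# Extract href values that end in .pdf, one match per line
echo "$htmldoc" | grep -o 'href="[^"]*\.pdf"'
# prints href="x.pdf" then href="pdf.pdf"
```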
CSS selectors
.class selects all elements with class=’class’
div p selects all <p> elements inside div elements
div > p selects <p> elements where parent is <div>
[target=blank] selects all elements with target="blank"
[href^=https] selects urls beginning with https
[href$=pdf] selects urls ending with pdf
More examples at
https://www.w3schools.com/cssref/css_selectors.asp
14
XPath Query
author
selects all <author> elements at
the current level
//author
selects all <author> elements at
any level
//author[@class='x']
selects <author> elements with
class='x'
//book//author
All <author> elements that are
below a <book> element
//author/* All children of <author> nodes
More examples at
https://msdn.microsoft.com/en-us/library/ms256086
15
Getting data walkthrough
https:
//github.com/sharatsc/cdse/tree/master/curl
16
Scrubbing JSON data: JQ 5
• JQ is a portable command line utility to manipulate and
filter JSON data
• Filters can be defined to access individual fields, transform
records or produce derived objects
• Filters can be composed and combined
• Provides built-in functions and operators
Example:
curl ’https://api.github.com/repos/stedolan/jq/commits’ | jq ’.[0]’
5
https://stedolan.github.io/jq/
17
Basic filters
# Identity '.'
echo '"Hello world"' | jq '.'
"Hello world"
# Examples '.foo', '.foo.bar', '.foo|.bar', '.["foo"]'
echo '{"foo": 42, "bar": 10}' | jq '.foo'
42
echo '{"foo": {"bar": 10, "baz": 20}}' | jq '.foo.bar'
10
# Arrays '.[]', '.[0]', '.[1:3]'
echo '["foo", "bar"]' | jq '.[0]'
"foo"
echo '["foo", "bar", "baz"]' | jq '.[1:3]'
["bar", "baz"]
# Pipe '.foo|.bar', '.[]|.bar'
echo '[{"f": 10}, {"f": 20}, {"f": 30}]' | jq '.[] | .f'
10
20
30
18
Object construction
Object construction allows you to derive new objects out of
existing ones.
# Field selection
echo '{"foo": "F", "bar": "B", "baz": "Z"}' | jq '{"foo": .foo}'
{"foo": "F"}
# Array expansion
echo '{"foo": "A", "bar": ["X", "Y"]}' | jq '{"foo": .foo, "bar": .bar[]}'
{"foo": "A", "bar": "X"}
{"foo": "A", "bar": "Y"}
# Expression evaluation: both key and value can be substituted
echo '{"foo": "A", "bar": ["X", "Y"]}' | jq '{(.foo): .bar[]}'
{"A": "X"}
{"A": "Y"}
19
Operators
Addition
• Numbers are added by normal arithmetic.
• Arrays are added by being concatenated into a larger array.
• Strings are added by being joined into a larger string.
• Objects are added by merging, that is, inserting all the
key-value pairs from both objects into a single combined
object.
# Adding fields
echo '{"foo": 10}' | jq '.foo + 1'
11
# Adding arrays
echo '{"foo": [1,2,3], "bar": [11,12,13]}' | jq '.foo + .bar'
[1,2,3,11,12,13]
20
JQ walkthrough
https:
//github.com/sharatsc/cdse/tree/master/jq
21
Exploring data
CSV
• CSV (comma-separated values) is the common denominator
for data exchange
• Tabular data with ',' as the separator
• Can be ingested by R, Python, Excel, etc.
• No explicit specification of data types (ARFF supports type
annotation)
Example
state,county,quantity
NE,ADAMS,1
NE,BUFFALO,1
NE,THURSTON,1
22
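Even before reaching for csvkit, simple aggregates over a file like this fall out of coreutils — for example, counting rows per state:

```shell
# Drop the header, keep the first column, count occurrences per value
printf 'state,county,quantity\nNE,ADAMS,1\nNE,BUFFALO,1\nNE,THURSTON,1\n' \
  | tail -n +2 \
  | cut -d, -f1 \
  | sort | uniq -c
# prints "   3 NE"
```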
CSVKit (Groskopf and contributors [2016])
csvkit 6 is a suite of command line tools for converting to and
working with CSV, the de facto standard for tabular file formats.
Example use cases
• Importing data from Excel, SQL
• Selecting a subset of columns
• Reordering columns
• Merging multiple files (row- and column-wise)
• Summary statistics
6
https://csvkit.readthedocs.io/
23
Importing data
# Fetch data in XLS format
# (LESO) 1033 Program dataset, which describes how surplus military arms have been distributed
# This data was widely cited in the aftermath of the Ferguson, Missouri protests.
curl -L https://github.com/sharatsc/cdse/blob/master/csvkit/ne_1033_data.xls?raw=true -o ne_10
# Convert to csv
in2csv ne_1033_data.xls > data.csv
# Inspect the columns
csvcut -n data.csv
# Inspect the data in specific columns
csvcut -c county,quantity data.csv | csvlook
24
CSVKit: Examining data
csvstat provides a summary view of the data similar to
summary() function in R.
# Get summary for county, and cost
csvcut -c county,acquisition_cost,ship_date data.csv | csvstat
1. county
Text
Nulls: False
Unique values: 35
Max length: 10
5 most frequent values:
DOUGLAS: 760
DAKOTA: 42
CASS: 37
HALL: 23
LANCASTER: 18
2. acquisition_cost
Number
Nulls: False
Min: 0.0
Max: 412000.0
Sum: 5430787.55
Mean: 5242.072924710424710424710425
Median: 6000.0
Standard Deviation: 13368.07836799839045093904423
Unique values: 75
5 most frequent values:
6800.0: 304
25
CSVKit: searching data
csvgrep can be used to search the content of the CSV file.
Options include
• Exact match -m
• Regex match -r
• Invert match -i
• Search specific columns -c columns
csvcut -c county,item_name,total_cost data.csv | csvgrep -c county -m LANCASTER | csvlook
| county | item_name | total_cost |
| --------- | ------------------------------ | ---------- |
| LANCASTER | RIFLE,5.56 MILLIMETER | 120 |
| LANCASTER | RIFLE,5.56 MILLIMETER | 120 |
| LANCASTER | RIFLE,5.56 MILLIMETER | 120 |
26
CSVKit: Power tools
• csvjoin can be used to combine columns from multiple files
csvjoin -c join_column data.csv other_data.csv
• csvsort can be used to sort the file based on specific
columns
csvsort -c total_population | csvlook | head
• csvstack allows you to merge multiple files together
(row-wise)
curl -L -O https://raw.githubusercontent.com/wireservice/csvkit/master/examples/re
in2csv ne_1033_data.xls > ne_1033_data.csv
csvstack -g region ne_1033_data.csv ks_1033_data.csv > region.csv
27
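When csvkit is unavailable, a minimal row-wise stack (take the header from the first file, then the data rows of every file) can be sketched with head and tail — no -g grouping column, just concatenation:

```shell
dir=$(mktemp -d)
printf 'state,qty\nNE,1\n' > "$dir/a.csv"
printf 'state,qty\nKS,2\n' > "$dir/b.csv"
# One header, then the data rows of each file in turn
head -n 1 "$dir/a.csv" > "$dir/stacked.csv"
for f in "$dir/a.csv" "$dir/b.csv"; do tail -n +2 "$f" >> "$dir/stacked.csv"; done
cat "$dir/stacked.csv"
# prints: state,qty / NE,1 / KS,2
```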
CSVKit: SQL
• csvsql allows querying against one or more CSV files.
• The results of the query can be inserted back into a db
Examples
• Import from csv into a table
# Inserts into a specific table
csvsql --db postgresql:///test --table data --insert data.csv
# Inserts each file into a separate table
csvsql --db postgresql:///test --insert examples/*_tables.csv
• Regular SQL query
csvsql --query "select count(*) from data" data.csv
28
CSVKit Walkthrough
https:
//github.com/sharatsc/cdse/tree/master/csvkit
29
Managing large workflows
GNU parallel
• GNU parallel (Tange [2011]) is a tool for executing jobs in
parallel on one or more machines.
• It can be used to parallelize across arguments, lines and
files.
Examples
# Parallelize across lines
seq 1000 | parallel "echo {}"
# Parallelize across file content (comma-separated columns)
cat input.csv | parallel -C, "mv {1} {2}"
cat input.csv | parallel -C, --header : "mv {source} {dest}"
30
Parallel (cont.)
• By default, parallel runs one job per CPU core
• Concurrency can be controlled by --jobs or -j option
seq 100 | parallel -j2 "echo number: {}"
seq 100 | parallel -j200% "echo number: {}"
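Where GNU parallel itself is missing, xargs -P gives a minimal substitute for this kind of fan-out (far fewer features: no column splitting, no result logging); sorting makes the interleaved output deterministic:

```shell
# Run up to 2 jobs at a time; sort to stabilize interleaved output
seq 3 | xargs -P 2 -I {} echo "number: {}" | sort
# prints number: 1, number: 2, number: 3 on separate lines
```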
Logging
• Output from each parallel job can be captured separately
using --results
seq 10 | parallel --results data/outdir "echo number: {}"
find data/outdir
31
Parallel (cont.)
• Remote execution
parallel --nonall --slf instances hostname
# nonall - no argument command to follow
# slf - uses ~/.parallel/sshloginfile as the list of sshlogins
• Distributing data
# Split 1-1000 into sections of 100 and pipe it to remote instances
seq 1000 | parallel -N100 --pipe --slf instances "(wc -l)"
# transmit, retrieve and clean up
# sends jq to all instances
# transmits the input file, retrieves the results into {.}.csv and cleans up
ls *.gz | parallel -v --basefile jq --trc {.}.csv
32
Modeling using vowpal wabbit
Overview
• Fast, online, scalable learning system
• Supports out-of-core execution with an in-memory model
• Scalable (terascale)
• 1000 nodes
• Billions of examples
• Trillions of unique features
• Actively developed
https://github.com/JohnLangford/vowpal_wabbit
33
Swiss army knife of online algorithms
• Binary classification
• Multiclass classification
• Linear regression
• Quantile regression
• Topic modeling (online LDA)
• Structured prediction
• Active learning
• Recommendation (Matrix factorization)
• Contextual bandit learning (explore/exploit algorithms)
• Reductions
34
Features
• Flexible input format: Allows free form text, identifying tags,
multiple labels
• Speed: Supports online learning
• Scalable
• Streaming removes row limit.
• Hashing removes dimension limit.
• Distributed training: models merged using AllReduce
operation.
• Cross product features: Allows multi-task learning
35
Optimization
VW solves optimizations of the form

min_w Σ_i l(w^T x_i; y_i) + λR(w)

Here, l() is convex and R(w) = λ1 |w|_1 + λ2 ||w||_2^2.
VW supports a variety of loss functions:
Linear regression     (y − w^T x)^2
Logistic regression   log(1 + exp(−y w^T x))
SVM (hinge loss)      max(0, 1 − y w^T x)
Quantile regression   τ(w^T x − y) I(y < w^T x) + (1 − τ)(y − w^T x) I(y ≥ w^T x)
36
Detour: Feature hashing
• Feature hashing can be used to reduce the dimensionality of
sparse features.
• Unlike random projections, it retains sparsity
• Approximately preserves dot products (random projections
preserve distances).
• Model can fit in memory.
• Unsigned version: consider a hash function
h: [0 . . . N] → [0 . . . m], m << N. Then
φ_i(x) = Σ_{j: h(j)=i} x_j
• Signed version: consider additionally a sign hash
ξ: [0 . . . N] → {1, −1}. Then
φ_i(x) = Σ_{j: h(j)=i} ξ(j) x_j
37
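A toy illustration of the unsigned version in awk — with a deliberately trivial hash (letter positions summed mod m=4, not the hash VW actually uses). Features that collide in a bucket simply add their values:

```shell
# Hash each "name:value" feature into 4 buckets; collisions sum
printf 'aa:1\nab:2\nba:3\n' | awk -F: '
{
  h = 0
  for (i = 1; i <= length($1); i++)
    h = (h + index("abcdefghijklmnopqrstuvwxyz", substr($1, i, 1))) % 4
  bucket[h] += $2
}
END { for (b in bucket) print b, bucket[b] }' | sort
# "ab" and "ba" collide in bucket 3, so their values sum: prints "2 1" then "3 5"
```

Note the output has at most 4 entries regardless of how many distinct feature names appear — that is the dimension reduction, and empty buckets never materialize, which is how sparsity is retained.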
Detour: Generalized linear models
A generalized linear predictor specifies
• A linear predictor of the form η(x) = wT x
• A mean estimate µ
• A link function g(µ) such that g(µ) = η(x) that relates the
mean estimate to the linear predictor.
This framework supports a variety of regression problems
Linear regression     µ = w^T x
Logistic regression   log(µ/(1 − µ)) = w^T x
Poisson regression    log(µ) = w^T x
38
Input format
Label Importance [Tag] |namespace feature ... |namespace feature ...
namespace = String[:Float]
feature = String[:Float]
Examples:
• 1 | 1:0.01 32:-0.1
• example|namespace normal text features
• 1 3 tag|ad-features ad description |user-features name
address age
39
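As a bridge from the CSV tools earlier, a one-liner that rewrites label,feature,... rows into this format — the column names (crim, rm) and the namespace f are made up for illustration:

```shell
# Turn a header + label,crim,rm CSV into VW input lines
printf 'label,crim,rm\n1,0.1,6.5\n-1,0.2,5.0\n' \
  | tail -n +2 \
  | awk -F, '{ printf "%s |f crim:%s rm:%s\n", $1, $2, $3 }'
# prints: 1 |f crim:0.1 rm:6.5
#         -1 |f crim:0.2 rm:5.0
```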
Input options
• data file -d datafile
• network --daemon --port <port=26542>
• compressed data --compressed
• stdin cat <data> | vw
40
Manipulation options
• ngrams --ngram
• skips --skips
• quadratic interaction -q args. e.g -q ab
• cubic interaction --cubic args. e.g. --cubic ccc
41
Output options
• Examining feature construction --audit
• Generating prediction --predictions or -p
• Unnormalized predictions --raw_predictions
• Testing only --testonly or -t
42
Model options
• Model size --bit_precision or -b. Number of
coefficients limited to 2^b
• Update existing model --initial_regressor or -i.
• Final model destination --final_regressor or -f
• Readable model definition --readable_model
• Readable feature values --invert_hash
• Snapshot model every pass --save_per_pass
• Weight initialization
--initial_weight or --random_weights
43
Regression (Demo)
https://github.com/sharatsc/cdse/tree/master/
vowpal-wabbit/linear-regression
• Linear regression --loss_function square
• Quantile regression
--loss_function quantile --quantile_tau <tau=0.5>
44
Binary classification (Demo)
https://github.com/sharatsc/cdse/tree/master/
vowpal-wabbit/classification
• Note: a linear regressor can be used as a classifier as well
• Logistic loss
--loss_function logistic, --link logistic
• Hinge loss (SVM loss function)
--loss_function hinge
• Report binary loss instead of logistic loss --binary
45
Multiclass classification (Demo)
https://github.com/sharatsc/cdse/tree/master/
vowpal-wabbit/multiclass
• One against all --oaa <k>
• Error correcting tournament --ect <k>
• Online decision trees --log_multi <k>
• Cost sensitive one-against-all --csoaa <k>
46
LDA options (Demo)
https://github.com/sharatsc/cdse/tree/master/
vowpal-wabbit/lda
• Number of topics --lda
• Prior on per-document topic weights --lda_alpha
• Prior on topic distributions --lda_rho
• Estimated number of documents --lda_D
• Convergence parameter for topic estimation
--lda_epsilon
• Mini batch size --minibatch
47
Daemon mode (Demo)
https://github.com/sharatsc/cdse/tree/master/
vowpal-wabbit/daemon-respond
https://github.com/sharatsc/cdse/tree/master/
vowpal-wabbit/daemon-request
• Loads model and answers any prediction request coming
over the network
• Preferred way to deploy a VW model
• Options
• --daemon. Enables daemon mode
• --testonly or -t. Does not update the model in
response to requests
• --initial_regressor or -i. Model to load
• --port <arg>. Port to listen to the request
• --num_children <arg>. Number of threads listening to
requests
48
References
Christopher Groskopf and contributors. csvkit, 2016. URL
https://csvkit.readthedocs.org/.
O. Tange. GNU Parallel - the command-line power tool. ;login:
The USENIX Magazine, 36(1):42–47, Feb 2011. doi:
http://dx.doi.org/10.5281/zenodo.16303. URL
http://www.gnu.org/s/parallel.
48
More Related Content

PDF
Barcelona MUG MongoDB + Hadoop Presentation
PPTX
Conexión de MongoDB con Hadoop - Luis Alberto Giménez - CAPSiDE #DevOSSAzureDays
ODP
Aggregation Framework in MongoDB Overview Part-1
PDF
R statistics with mongo db
PPTX
The Aggregation Framework
PPTX
The Aggregation Framework
PDF
MongoDB Aggregation Framework
PDF
Using MongoDB and Python
Barcelona MUG MongoDB + Hadoop Presentation
Conexión de MongoDB con Hadoop - Luis Alberto Giménez - CAPSiDE #DevOSSAzureDays
Aggregation Framework in MongoDB Overview Part-1
R statistics with mongo db
The Aggregation Framework
The Aggregation Framework
MongoDB Aggregation Framework
Using MongoDB and Python

What's hot (20)

PPTX
MongoDB - Aggregation Pipeline
PDF
Data Processing and Aggregation with MongoDB
PDF
Aggregation Framework MongoDB Days Munich
PDF
Working With a Real-World Dataset in Neo4j: Import and Modeling
PDF
MongoDB and Python
PPTX
MongoDB World 2016 : Advanced Aggregation
KEY
Python Development (MongoSF)
PDF
Introduction to Pig & Pig Latin | Big Data Hadoop Spark Tutorial | CloudxLab
PDF
Analytics with MongoDB Aggregation Framework and Hadoop Connector
KEY
MongoDB Aggregation Framework
PDF
On Beyond (PostgreSQL) Data Types
PDF
PPTX
Introduction to MongoDB and Hadoop
PPTX
PostgreSQL 9.4 JSON Types and Operators
PDF
Webinar: Data Processing and Aggregation Options
PPTX
MongoDB Aggregation
PPTX
Webinar: Exploring the Aggregation Framework
PPTX
Conceptos básicos. Seminario web 5: Introducción a Aggregation Framework
PDF
Fast track to getting started with DSE Max @ ING
PDF
Spark cassandra integration 2016
MongoDB - Aggregation Pipeline
Data Processing and Aggregation with MongoDB
Aggregation Framework MongoDB Days Munich
Working With a Real-World Dataset in Neo4j: Import and Modeling
MongoDB and Python
MongoDB World 2016 : Advanced Aggregation
Python Development (MongoSF)
Introduction to Pig & Pig Latin | Big Data Hadoop Spark Tutorial | CloudxLab
Analytics with MongoDB Aggregation Framework and Hadoop Connector
MongoDB Aggregation Framework
On Beyond (PostgreSQL) Data Types
Introduction to MongoDB and Hadoop
PostgreSQL 9.4 JSON Types and Operators
Webinar: Data Processing and Aggregation Options
MongoDB Aggregation
Webinar: Exploring the Aggregation Framework
Conceptos básicos. Seminario web 5: Introducción a Aggregation Framework
Fast track to getting started with DSE Max @ ING
Spark cassandra integration 2016
Ad

Viewers also liked (20)

PPTX
Make Sense Out of Data with Feature Engineering
PDF
Work - LIGHT Ministry
PDF
6 rules of enterprise innovation
PPTX
Leverage Social Media for Employer Brand and Recruiting
PDF
Data Science Competition
PDF
Ethics in Data Science and Machine Learning
PDF
Doing your first Kaggle (Python for Big Data sets)
PPTX
Feature Hashing for Scalable Machine Learning: Spark Summit East talk by Nick...
PDF
Menstrual Health Reader - mEo
PDF
Open Innovation - A Case Study
PPTX
Managing Data Science | Lessons from the Field
PPTX
State of women in technical workforce
ODP
Wapid and wobust active online machine leawning with Vowpal Wabbit
PPTX
How to recruit excellent tech talent
PDF
How hackathons can drive top line revenue growth
PDF
No-Bullshit Data Science
PPTX
Kill the wabbit
PDF
USC LIGHT Ministry Introduction
PDF
Intra company hackathons using HackerEarth
PPTX
A Panorama of Natural Language Processing
Make Sense Out of Data with Feature Engineering
Work - LIGHT Ministry
6 rules of enterprise innovation
Leverage Social Media for Employer Brand and Recruiting
Data Science Competition
Ethics in Data Science and Machine Learning
Doing your first Kaggle (Python for Big Data sets)
Feature Hashing for Scalable Machine Learning: Spark Summit East talk by Nick...
Menstrual Health Reader - mEo
Open Innovation - A Case Study
Managing Data Science | Lessons from the Field
State of women in technical workforce
Wapid and wobust active online machine leawning with Vowpal Wabbit
How to recruit excellent tech talent
How hackathons can drive top line revenue growth
No-Bullshit Data Science
Kill the wabbit
USC LIGHT Ministry Introduction
Intra company hackathons using HackerEarth
A Panorama of Natural Language Processing
Ad

Similar to Data science at the command line (20)

PDF
Obtaining, Scrubbing, and Exploring Data at the Command Line by Jeroen Janssens
PPTX
Your data isn't that big @ Big Things Meetup 2016-05-16
PPTX
Web Scraping Basics
PPTX
Nix for etl using scripting to automate data cleaning & transformation
PDF
Data handling in r
PDF
How researchers and developers can benefit from the command line
DOCX
Ravi software faculty
PDF
Python for Data Analysis Data Wrangling with Pandas NumPy and IPython 1st Edi...
PDF
Computational Social Science, Lecture 09: Data Wrangling
PDF
Linux intro 3 grep + Unix piping
ODP
Love Your Command Line
PDF
Introduction to Linux with Focus on Raspberry Pi
PDF
Command line Data Tools
PDF
Linux Command Line - By Ranjan Raja
PDF
Wes McKinney - Python for Data Analysis-O'Reilly Media (2012).pdf
PDF
Data analystics with R module 3 cseds vtu
PPTX
Linux tech talk
PDF
Adaptive Query Processing on RAW Data
PDF
20180324 leveraging unix tools
PDF
One-Liners to Rule Them All
Obtaining, Scrubbing, and Exploring Data at the Command Line by Jeroen Janssens
Your data isn't that big @ Big Things Meetup 2016-05-16
Web Scraping Basics
Nix for etl using scripting to automate data cleaning & transformation
Data handling in r
How researchers and developers can benefit from the command line
Ravi software faculty
Python for Data Analysis Data Wrangling with Pandas NumPy and IPython 1st Edi...
Computational Social Science, Lecture 09: Data Wrangling
Linux intro 3 grep + Unix piping
Love Your Command Line
Introduction to Linux with Focus on Raspberry Pi
Command line Data Tools
Linux Command Line - By Ranjan Raja
Wes McKinney - Python for Data Analysis-O'Reilly Media (2012).pdf
Data analystics with R module 3 cseds vtu
Linux tech talk
Adaptive Query Processing on RAW Data
20180324 leveraging unix tools
One-Liners to Rule Them All

Recently uploaded (20)

PDF
Bridging biosciences and deep learning for revolutionary discoveries: a compr...
PDF
Shreyas Phanse Resume: Experienced Backend Engineer | Java • Spring Boot • Ka...
PDF
Approach and Philosophy of On baking technology
PDF
How UI/UX Design Impacts User Retention in Mobile Apps.pdf
PDF
The Rise and Fall of 3GPP – Time for a Sabbatical?
PDF
Dropbox Q2 2025 Financial Results & Investor Presentation
PPTX
Digital-Transformation-Roadmap-for-Companies.pptx
PDF
Architecting across the Boundaries of two Complex Domains - Healthcare & Tech...
PDF
Machine learning based COVID-19 study performance prediction
PDF
KodekX | Application Modernization Development
PDF
Peak of Data & AI Encore- AI for Metadata and Smarter Workflows
PDF
CIFDAQ's Market Insight: SEC Turns Pro Crypto
PPTX
Big Data Technologies - Introduction.pptx
PDF
Reach Out and Touch Someone: Haptics and Empathic Computing
PDF
Electronic commerce courselecture one. Pdf
PPTX
A Presentation on Artificial Intelligence
PPT
“AI and Expert System Decision Support & Business Intelligence Systems”
PPT
Teaching material agriculture food technology
PDF
7 ChatGPT Prompts to Help You Define Your Ideal Customer Profile.pdf
PDF
Per capita expenditure prediction using model stacking based on satellite ima...
Bridging biosciences and deep learning for revolutionary discoveries: a compr...
Shreyas Phanse Resume: Experienced Backend Engineer | Java • Spring Boot • Ka...
Approach and Philosophy of On baking technology
How UI/UX Design Impacts User Retention in Mobile Apps.pdf
The Rise and Fall of 3GPP – Time for a Sabbatical?
Dropbox Q2 2025 Financial Results & Investor Presentation
Digital-Transformation-Roadmap-for-Companies.pptx
Architecting across the Boundaries of two Complex Domains - Healthcare & Tech...
Machine learning based COVID-19 study performance prediction
KodekX | Application Modernization Development
Peak of Data & AI Encore- AI for Metadata and Smarter Workflows
CIFDAQ's Market Insight: SEC Turns Pro Crypto
Big Data Technologies - Introduction.pptx
Reach Out and Touch Someone: Haptics and Empathic Computing
Electronic commerce courselecture one. Pdf
A Presentation on Artificial Intelligence
“AI and Expert System Decision Support & Business Intelligence Systems”
Teaching material agriculture food technology
7 ChatGPT Prompts to Help You Define Your Ideal Customer Profile.pdf
Per capita expenditure prediction using model stacking based on satellite ima...

Data science at the command line

  • 1. Data science at the command line Rapid prototyping and reproducible science Sharat Chikkerur sharat@alum.mit.edu Principal Data Scientist Nanigans Inc. 1
  • 2. Outline Introduction Motivation Data science workflow Obtaining data Scrubbing data Exploring data Managing large workflows Modeling using vowpal wabbit 2
  • 4. Setup • Follow instructions at https://github.com/sharatsc/cdse • Install virtualbox virtualbox.org • Install vagrant vagrantup.com • Initialize virtual machine mkdir datascience cd datascience vagrant init data-science-toolbox/data-science-at-the-command-line vagrant up && vagrant ssh 3
  • 5. About me • UB Alumni, 2005 (M.S., EE Dept, cubs.buffalo.edu) • MIT Alumni, 2010 (PhD., EECS Dept, cbcl.mit.edu) • Senior Software Engineer, Google AdWords modeling • Senior Software Engineer, Microsoft Machine learning • Principal data scientist, Nanigans Inc 4
  • 6. About the workshop • Based on a book by Jeroen Janssens 1 • Vowpal wabbit 2 1 http://datascienceatthecommandline.com/ 2 https://github.com/JohnLangford/vowpal_wabbit 5
  • 7. POSIX command line • Exposes operating system functionalities through a shell • Example shells include bash, zsh, tcsh, fish etc. • Comprises of a large number of utility programs • Examples: grep, awk, sed, etc. • GNU user space programs • Common API: Input through stdin and output to stdout • stdin and stdout can be redirected during execution • Pipes (|) allows composition through pipes (chaining). • Allows redirection of output of one command as input to another ls -lh | more cat file | sort | uniq -c 6
  • 8. Why command line ? • REPL allows rapid iteration (through immediate feedback) • Allows composition of scripts and commands using pipes • Automation and scaling • Reproducibility • Extensibility • R, python, perl, ruby scripts can be invoked like command line utilities 7
  • 9. Data science workflow A typical workflow OSEMN model • Obtaining data • Scrubbing data • Exploring data • Modeling data • Interpreting data 8
  • 10. Workshop outline • Background • Getting data: Curl, scrape • Scrubbing: jq, csvkit • Exploring: csvkit • Modeling: vowpal wabbit • Scaling: parallel Hands on exercises • Obtaining data walkthrough • JQ walkthrough • CSVKit walkthrough • Vowpal wabbit 9
  • 11. Workflow example: Boston housing dataset https://github.com/sharatsc/cdse/blob/master/ boston-housing 10
  • 12. Boston housing dataset Python workflow 3 import urllib import pandas as pd # Obtain data urllib.urlretrieve(’https://raw.githubusercontent.com/sharatsc/cdse/master/boston-housin df = pd.read_csv(’boston.csv’) # Scrub data df = df.fillna(0) # Model data from statsmodels import regression from statsmodels.formula import api as smf formula = ’medv~’ + ’ + ’.join(df.columns - [’medv’]) model = smf.ols(formula=formula, data=df) res=model.fit() res.summary() Command line workflow URL="https://raw.githubusercontent.com/sharatsc/cdse/master/boston-housing/boston.csv" curl $URL| Rio -e ’model=lm("medv~.", df);model’ 3 https://github.com/sharatsc/cdse/blob/master/boston-housing 11
  • 14. Obtaining data from the web: curl • CURL (curl.haxx.se) • Cross platform command line tool that supports data transfer using HTTP, HTTPS, FTP, IMAP, SCP, SFTP • Supports cookies, user+password authentication • can be used to get data from RESTful APIs 4 • http GET curl http://www.google.com • ftp GET curl ftp://catless.ncl.ac.uk • scp COPY curl -u username: --key key_file --pass password scp://example.com/~/file.txt 4 www.codingpedia.org/ama/how-to-test-a-rest-api-from-command-line-with-curl/ 12
  • 16. Scrubbing web data: Scrape • Scrape is a python command line tool to parse html documents • Queries can be made in CSS selector or XPath syntax htmldoc=$(cat << EOF <div id=a> <a href="x.pdf">x</a> </div> <div id=b> <a href="png.png">y</a> <a href="pdf.pdf">y</a> </div> EOF ) # Select liks that end with pdf and are within div with id=b (Use CSS3 selector) echo $htmldoc | scrape -e "$b a[href$=pdf]" # Select all anchors (use Xpath) echo $htmldoc | scrape -e "//a" <a href="pdf.pdf">y</a> 13
  • 17. CSS selectors .class selects all elements with class=’class’ div p selects all <p> elements inside div elements div > p selects <p> elements where parent is <div> [target=blank] selects all elements with target="blank" [href^=https] selects urls beginning with https [href$=pdf] selects urls ending with pdf More examples at https://www.w3schools.com/cssref/css_selectors.asp 14
  • 18. XPath Query author selects all <author> elements at the current level //author selects all <author> elements at any level //author[@class=’x’] selects <author> elements with class=famous //book//author All <author> elements that are below <book> element" //author/* All children of <author> nodes More examples at https://msdn.microsoft.com/en-us/library/ms256086 15
  • 20. Scrubbing JSON data: JQ 5 • JQ is a portable command line utility to manipulate and filter JSON data • Filters can be defined to access individual fields, transform records or produce derived objets • Filters can be composed and combined • Provides builtin functions an operators Example: curl ’https://api.github.com/repos/stedolan/jq/commits’ | jq ’.[0]’ 5 https://stedolan.github.io/jq/ 17
  • 21. Basic filters # Identity ’.’ echo ’"Hello world"’ | jq ’.’ "Hello world" # Examples ’.foo’, ’.foo.bar’, ’.foo|.bar’, ’.["foo"] echo ’{"foo": 42, "bar": 10}’ | jq ’.foo’ 42 echo ’{"foo": {"bar": 10, "baz": 20}} | jq ’.foo.bar’ 10 #Arrays ’.[]’, ’.[0]’, ’.[1:3]’ echo ’["foo", "bar"]’ | jq ’.[0]’ "foo" echo ’["foo", "bar", "baz"]’ | jq ’.[1:3]’ ["bar", "baz"] # Pipe ’.foo|.bar’, ’.[]|.bar’ echo ’[{"f": 10}, {"f": 20}, {"f": 30}]’ | jq ’.[] | .f 10 20 30 18
  • 22. Object construction Object construction allows you to derive new objects out of existing ones. # Field selection echo {"foo": "F", "bar": "B", "baz": "Z"} | jq ’{"foo": .foo}’ {"foo": "F"} # Array expansion echo ’{"foo": "A", "bar": ["X", "Y"]}’ | jq ’{"foo": .foo, "bar": .bar[]}’ {"foo": "F", "bar": "X"} {"foo": "F", "bar": "Y"} # Expression evaluation, key and value can be substituted echo ’{"foo": "A", "bar": ["X", "Y"]}’ | jq ’{(.foo): .bar[]}’ {"A": "X"} {"A": "Y"} 19
  • 23. Operators Addition • Numbers are added by normal arithmetic. • Arrays are added by being concatenated into a larger array. • Strings are added by being joined into a larger string. • Objects are added by merging, that is, inserting all the key-value pairs from both objects into a single combined object. # Adding fields echo ’{"foo": 10}’ | jq ’.foo + 1’ 11 # Adding arrays echo ’{"foo": [1,2,3], "bar": [11,12,13]}’ | jq ’.foo + .bar’ [1,2,3,11,12,13] 20
• 26. CSV • CSV (comma-separated values) is the common denominator for data exchange • Tabular data with ',' as the separator • Can be ingested by R, Python, Excel, etc. • No explicit specification of data types (ARFF supports type annotation) Example state,county,quantity NE,ADAMS,1 NE,BUFFALO,1 NE,THURSTON,1 22
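Even before reaching for csvkit, such a table can be aggregated with plain POSIX tools; the sketch below recreates the sample data (the /tmp path is arbitrary):

```shell
# Recreate the sample CSV, then sum quantity per state with awk,
# skipping the header row (NR > 1)
cat > /tmp/leso.csv <<'EOF'
state,county,quantity
NE,ADAMS,1
NE,BUFFALO,1
NE,THURSTON,1
EOF
awk -F, 'NR > 1 { sum[$1] += $3 } END { for (s in sum) print s, sum[s] }' /tmp/leso.csv
# prints: NE 3
```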
• 27. CSVKit (Groskopf and contributors [2016]) csvkit 6 is a suite of command line tools for converting and working with CSV, the de facto standard for tabular file formats. Example use cases • Importing data from Excel, SQL • Selecting a subset of columns • Reordering columns • Merging multiple files (row- and column-wise) • Summary statistics 6 https://csvkit.readthedocs.io/ 23
• 28. Importing data # Fetch data in XLS format: the (LESO) 1033 Program dataset, which describes how surplus military arms # have been distributed. This data was widely cited in the aftermath of the Ferguson, Missouri protests. curl -L https://github.com/sharatsc/cdse/blob/master/csvkit/ne_1033_data.xls?raw=true -o ne_1033_data.xls # Convert to csv in2csv ne_1033_data.xls > data.csv # Inspect the columns csvcut -n data.csv # Inspect the data in specific columns csvcut -c county,quantity data.csv | csvlook 24
• 29. CSVKit: Examining data csvstat provides a summary view of the data, similar to the summary() function in R. # Get a summary for county, acquisition cost, and ship date csvcut -c county,acquisition_cost,ship_date data.csv | csvstat 1. county Text Nulls: False Unique values: 35 Max length: 10 5 most frequent values: DOUGLAS: 760 DAKOTA: 42 CASS: 37 HALL: 23 LANCASTER: 18 2. acquisition_cost Number Nulls: False Min: 0.0 Max: 412000.0 Sum: 5430787.55 Mean: 5242.07 Median: 6000.0 Standard Deviation: 13368.08 Unique values: 75 5 most frequent values: 6800.0: 304 25
  • 30. CSVKit: searching data csvgrep can be used to search the content of the CSV file. Options include • Exact match -m • Regex match -r • Invert match -i • Search specific columns -c columns csvcut -c county,item_name,total_cost data.csv | csvgrep -c county -m LANCASTER | csvlook | county | item_name | total_cost | | --------- | ------------------------------ | ---------- | | LANCASTER | RIFLE,5.56 MILLIMETER | 120 | | LANCASTER | RIFLE,5.56 MILLIMETER | 120 | | LANCASTER | RIFLE,5.56 MILLIMETER | 120 | 26
• 31. CSVKit: Power tools • csvjoin can be used to combine columns from multiple files csvjoin -c join_column data.csv other_data.csv • csvsort can be used to sort the file based on specific columns csvsort -c total_population | csvlook | head • csvstack allows you to merge multiple files together (row-wise) curl -L -O https://raw.githubusercontent.com/wireservice/csvkit/master/examples/re in2csv ne_1033_data.xls > ne_1033_data.csv csvstack -g region ne_1033_data.csv ks_1033_data.csv > region.csv 27
• 32. CSVKit: SQL • csvsql allows querying against one or more CSV files. • The results of the query can be inserted back into a database Examples • Import from csv into a table # Inserts into a specific table csvsql --db postgresql:///test --table data --insert data.csv # Inserts each file into a separate table csvsql --db postgresql:///test --insert examples/*_tables.csv • Regular SQL query csvsql --query "select count(*) from data" data.csv 28
• 35. GNU parallel • GNU parallel (Tange [2011]) is a tool for executing jobs in parallel on one or more machines. • It can be used to parallelize across arguments, lines, and files. Examples # Parallelize across lines seq 1000 | parallel "echo {}" # Parallelize across file content cat input.csv | parallel -C, "mv {1} {2}" cat input.csv | parallel -C, --header : "mv {source} {dest}" 30
  • 36. Parallel (cont.) • By default, parallel runs one job per cpu core • Concurrency can be controlled by --jobs or -j option seq 100 | parallel -j2 "echo number: {}" seq 100 | parallel -j200% "echo number: {}" Logging • Output from each parallel job can be captured separately using --results seq 10 | parallel --results data/outdir "echo number: {}" find data/outdir 31
• 37. Parallel (cont.) • Remote execution parallel --nonall --slf instances hostname # nonall - no-argument command to follow # slf - uses ~/.parallel/sshloginfile as the list of sshlogins • Distributing data # Split 1-1000 into sections of 100 and pipe them to remote instances seq 1000 | parallel -N100 --pipe --slf instances "(wc -l)" # transmit, retrieve and clean up: # sends jq to all instances, transmits the input file, retrieves the results into {.}.csv and cleans up ls *.gz | parallel -v --basefile jq --trc {.}.csv 32
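If GNU parallel is unavailable, xargs with -P (a different tool, used here as a stand-in) covers the simple one-argument-per-job case:

```shell
# Fallback sketch: xargs -P approximates `parallel` for simple per-line jobs.
# -I{} substitutes each input line; -P2 runs two jobs concurrently.
seq 1 5 | xargs -P2 -I{} echo "number: {}"
# With concurrency, output order is not guaranteed
```

Unlike parallel, xargs offers no {1}/{2} column splitting, header support, or remote execution.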
• 39. Overview • Fast, online, scalable learning system • Supports out-of-core execution with an in-memory model • Scalable (terascale) • 1000 nodes • Billions of examples • Trillions of unique features • Actively developed https://github.com/JohnLangford/vowpal_wabbit 33
  • 40. Swiss army knife of online algorithms • Binary classification • Multiclass classification • Linear regression • Quantile regression • Topic modeling (online LDA) • Structured prediction • Active learning • Recommendation (Matrix factorization) • Contextual bandit learning (explore/exploit algorithms) • Reductions 34
  • 41. Features • Flexible input format: Allows free form text, identifying tags, multiple labels • Speed: Supports online learning • Scalable • Streaming removes row limit. • Hashing removes dimension limit. • Distributed training: models merged using AllReduce operation. • Cross product features: Allows multi-task learning 35
• 42. Optimization VW solves an optimization problem of the form min_w Σ_i l(w^T x_i; y_i) + λR(w) where l() is convex and R(w) = λ1 |w|_1 + λ2 ||w||_2^2. VW supports a variety of loss functions: Linear regression (y − w^T x)^2 Logistic regression log(1 + exp(−y w^T x)) SVM (hinge) max(0, 1 − y w^T x) Quantile regression τ(w^T x − y) I(y < w^T x) + (1 − τ)(y − w^T x) I(y ≥ w^T x) 36
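These loss formulas can be sanity-checked numerically; the sketch below evaluates them in awk at arbitrary values y = 1 and w^T x = 0.5:

```shell
# Evaluate the squared, logistic, and hinge losses at y = 1, w^T x = 0.5
awk 'BEGIN {
  y = 1; wx = 0.5; m = y * wx           # m is the margin
  printf "square:   %.4f\n", (y - wx)^2
  printf "logistic: %.4f\n", log(1 + exp(-m))
  printf "hinge:    %.4f\n", ((1 - m > 0) ? 1 - m : 0)
}'
# square:   0.2500
# logistic: 0.4741
# hinge:    0.5000
```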
• 43. Detour: Feature hashing • Feature hashing can be used to reduce the dimensionality of sparse features. • Unlike random projections, it retains sparsity • Preserves dot products (random projection preserves distances) • The model can fit in memory • Unsigned: consider a hash function h(x): [0 . . . N] → [0 . . . m], m << N. Then φ_i(x) = Σ_{j: h(j)=i} x_j • Signed: consider additionally a sign hash ξ(x): [0 . . . N] → {1, −1}. Then φ_i(x) = Σ_{j: h(j)=i} ξ(j) x_j 37
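A toy sketch of the unsigned construction, using a stand-in hash h(j) = j mod m (real implementations use a proper hash such as MurmurHash, which VW itself relies on):

```shell
# phi_i(x) = sum of x_j over j with h(j) = i, with the toy hash h(j) = j mod m.
# The sparse input indices and values are invented.
awk 'BEGIN {
  m = 4
  x[3] = 1.0; x[7] = 2.0; x[10] = 0.5   # sparse input: index -> value
  for (j in x) phi[j % m] += x[j]        # indices 3 and 7 collide (both hash to 3)
  for (i = 0; i < m; i++) printf "phi[%d] = %.1f\n", i, phi[i]
}'
# phi[2] = 0.5 and phi[3] = 3.0; the other buckets stay 0.0
```

Note how the collision of indices 3 and 7 adds their values together, which is exactly the approximation the signed variant corrects in expectation.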
• 44. Detour: Generalized linear models A generalized linear predictor specifies • A linear predictor of the form η(x) = w^T x • A mean estimate µ • A link function g(µ) such that g(µ) = η(x), relating the mean estimate to the linear predictor. This framework supports a variety of regression problems: Linear regression µ = w^T x Logistic regression log(µ/(1 − µ)) = w^T x Poisson regression log(µ) = w^T x 38
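Inverting the logistic link gives µ = 1/(1 + exp(−η)); a quick numeric check in awk (the value of η is arbitrary):

```shell
# Invert the logit link: mu = 1 / (1 + exp(-eta)), then verify g(mu) = eta
awk 'BEGIN {
  eta = 0.5
  mu = 1 / (1 + exp(-eta))
  printf "mu = %.4f, log(mu/(1-mu)) = %.4f\n", mu, log(mu / (1 - mu))
}'
# mu = 0.6225, log(mu/(1-mu)) = 0.5000
```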
• 45. Input format Label Importance [Tag]|namespace Feature . . . | namespace Feature . . . namespace = String[:Float] feature = String[:Float] Examples: • 1 | 1:0.01 32:-0.1 • example|namespace normal text features • 1 3 tag|ad-features ad description |user-features name address age 39
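In practice this format is often generated from CSV with awk; the namespace name loc and the feature naming scheme below are made up for illustration:

```shell
# Emit VW-format lines (label | namespace features) from the sample CSV;
# quantity becomes the label, state and county become categorical features
printf 'state,county,quantity\nNE,ADAMS,1\nNE,BUFFALO,1\n' |
awk -F, 'NR > 1 { printf "%s |loc state_%s county_%s\n", $3, $1, $2 }'
# 1 |loc state_NE county_ADAMS
# 1 |loc state_NE county_BUFFALO
```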
  • 46. Input options • data file -d datafile • network --daemon --port <port=26542> • compressed data --compressed • stdin cat <data> | vw 40
  • 47. Manipulation options • ngrams --ngram • skips --skips • quadratic interaction -q args. e.g -q ab • cubic interaction --cubic args. e.g. --cubic ccc 41
  • 48. Output options • Examining feature construction --audit • Generating prediction --predictions or -p • Unnormalized predictions --raw_predictions • Testing only --testonly or -t 42
• 49. Model options • Model size --bit_precision or -b. Number of coefficients limited to 2^b • Update existing model --initial_regressor or -i. • Final model destination --final_regressor or -f • Readable model definition --readable_model • Readable feature values --invert_hash • Snapshot model every pass --save_per_pass • Weight initialization --initial_weight or --random_weights 43
• 50. Regression (Demo) https://github.com/sharatsc/cdse/tree/master/ vowpal-wabbit/linear-regression • Linear regression --loss_function square • Quantile regression --loss_function quantile --quantile_tau <tau=0.5> 44
  • 51. Binary classification (Demo) https://github.com/sharatsc/cdse/tree/master/ vowpal-wabbit/classification • Note: a linear regressor can be used as a classifier as well • Logistic loss --loss_function logistic, --link logistic • Hinge loss (SVM loss function) --loss_function hinge • Report binary loss instead of logistic loss --binary 45
• 52. Multiclass classification (Demo) https://github.com/sharatsc/cdse/tree/master/ vowpal-wabbit/multiclass • One against all --oaa <k> • Error correcting tournament --ect <k> • Online decision trees --log_multi <k> • Cost sensitive one-against-all --csoaa <k> 46
  • 53. LDA options (Demo) https://github.com/sharatsc/cdse/tree/master/ vowpal-wabbit/lda • Number of topics --lda • Prior on per-document topic weights --lda_alpha • Prior on topic distributions --lda_rho • Estimated number of documents --lda_D • Convergence parameter for topic estimation --lda_epsilon • Mini batch size --minibatch 47
• 54. Daemon mode (Demo) https://github.com/sharatsc/cdse/tree/master/ vowpal-wabbit/daemon-respond https://github.com/sharatsc/cdse/tree/master/ vowpal-wabbit/daemon-request • Loads a model and answers any prediction request coming over the network • Preferred way to deploy a VW model • Options • --daemon. Enables daemon mode • --testonly or -t. Does not update the model in response to requests • --initial_regressor or -i. Model to load • --port <arg>. Port to listen on for requests • --num_children <arg>. Number of threads listening for requests 48
  • 56. Christopher Groskopf and contributors. csvkit, 2016. URL https://csvkit.readthedocs.org/. O. Tange. Gnu parallel - the command-line power tool. ;login: The USENIX Magazine, 36(1):42–47, Feb 2011. doi: http://dx.doi.org/10.5281/zenodo.16303. URL http://www.gnu.org/s/parallel. 48