Data science at the command line
Rapid prototyping and reproducible science
Sharat Chikkerur
sharat@alum.mit.edu
Principal Data Scientist
Nanigans Inc.
1
Outline
Introduction
Motivation
Data science workflow
Obtaining data
Scrubbing data
Exploring data
Managing large workflows
Modeling using vowpal wabbit
2
Introduction
Setup
• Follow instructions at
https://github.com/sharatsc/cdse
• Install virtualbox virtualbox.org
• Install vagrant vagrantup.com
• Initialize virtual machine
mkdir datascience
cd datascience
vagrant init data-science-toolbox/data-science-at-the-command-line
vagrant up && vagrant ssh
3
About me
• UB Alumni, 2005 (M.S., EE Dept, cubs.buffalo.edu)
• MIT Alumni, 2010 (PhD., EECS Dept, cbcl.mit.edu)
• Senior Software Engineer, Google AdWords modeling
• Senior Software Engineer, Microsoft Machine learning
• Principal data scientist, Nanigans Inc
4
About the workshop
• Based on a book by Jeroen Janssens 1
• Vowpal wabbit 2
1 http://datascienceatthecommandline.com/
2 https://github.com/JohnLangford/vowpal_wabbit
5
POSIX command line
• Exposes operating system functionalities through a shell
• Example shells include bash, zsh, tcsh, fish etc.
• Comprises a large number of utility programs
• Examples: grep, awk, sed, etc.
• GNU user space programs
• Common API: Input through stdin and output to stdout
• stdin and stdout can be redirected during execution
• Pipes (|) allow composition (chaining): the output of one
command is redirected to become the input of another
ls -lh | more
cat file | sort | uniq -c
6
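The same composition idea scales to longer pipelines. A classic sketch using coreutils only — counting word frequencies in a stream of text — shows how each stage stays small and independently testable:

```shell
# Lowercase, split into one word per line, then count and rank
printf 'the cat and the dog\n' \
  | tr '[:upper:]' '[:lower:]' \
  | tr -s ' ' '\n' \
  | sort \
  | uniq -c \
  | sort -rn
```

The first output line is the most frequent word (`2 the` here); swapping any stage (e.g. `tr` for `sed`) leaves the rest of the pipeline untouched.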
Why command line ?
• REPL allows rapid iteration (through immediate feedback)
• Allows composition of scripts and commands using pipes
• Automation and scaling
• Reproducibility
• Extensibility
• R, python, perl, ruby scripts can be invoked like command
line utilities
7
Data science workflow
A typical workflow: the OSEMN model
• Obtaining data
• Scrubbing data
• Exploring data
• Modeling data
• Interpreting data
8
Workshop outline
• Background
• Getting data: Curl, scrape
• Scrubbing: jq, csvkit
• Exploring: csvkit
• Modeling: vowpal wabbit
• Scaling: parallel
Hands on exercises
• Obtaining data walkthrough
• JQ walkthrough
• CSVKit walkthrough
• Vowpal wabbit
9
Workflow example: Boston housing dataset
https://github.com/sharatsc/cdse/blob/master/
boston-housing
10
Boston housing dataset
Python workflow 3
import urllib
import pandas as pd
# Obtain data
urllib.urlretrieve('https://raw.githubusercontent.com/sharatsc/cdse/master/boston-housin
df = pd.read_csv('boston.csv')
# Scrub data
df = df.fillna(0)
# Model data
from statsmodels.formula import api as smf
formula = 'medv ~ ' + ' + '.join(df.columns.difference(['medv']))
model = smf.ols(formula=formula, data=df)
res = model.fit()
res.summary()
Command line workflow
URL="https://raw.githubusercontent.com/sharatsc/cdse/master/boston-housing/boston.csv"
curl $URL | Rio -e 'model=lm("medv~.", df); model'
3
https://github.com/sharatsc/cdse/blob/master/boston-housing
11
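Rio wraps R and may not be on every machine; a one-off statistic such as a column mean can still be computed with awk alone. A minimal sketch on a toy CSV (in the real boston.csv, medv is a later column, so the field index would change):

```shell
# Mean of the medv column (field 2 here), skipping the header row
printf 'id,medv\n1,10\n2,20\n3,30\n' \
  | awk -F, 'NR > 1 { sum += $2; n++ } END { print sum / n }'
# prints 20
```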
Obtaining data
Obtaining data from the web: curl
• CURL (curl.haxx.se)
• Cross platform command line tool that supports data
transfer using
HTTP, HTTPS, FTP, IMAP, SCP, SFTP
• Supports cookies, user+password authentication
• can be used to get data from RESTful APIs 4
• http GET
curl http://www.google.com
• ftp GET
curl ftp://catless.ncl.ac.uk
• scp COPY
curl -u username: --key key_file --pass password scp://example.com/~/file.txt
4
www.codingpedia.org/ama/how-to-test-a-rest-api-from-command-line-with-curl/
12
Scrubbing data
Scrubbing web data: Scrape
• Scrape is a Python command line tool for parsing HTML
documents
• Queries can be made in CSS selector or XPath syntax
htmldoc=$(cat << EOF
<div id=a>
<a href="x.pdf">x</a>
</div>
<div id=b>
<a href="png.png">y</a>
<a href="pdf.pdf">y</a>
</div>
EOF
)
# Select links that end with pdf and are within the div with id=b (CSS3 selector)
echo $htmldoc | scrape -e '#b a[href$=pdf]'
# Select all anchors (use Xpath)
echo $htmldoc | scrape -e "//a"
<a href="pdf.pdf">y</a>
13
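If scrape is not installed, a rough fallback for this particular query is grep over the href attributes — shown only to illustrate the pipe-friendly idea, not as a robust HTML parser (regexes and HTML do not mix in general):

```shell
htmldoc='<a href="x.pdf">x</a><a href="png.png">y</a><a href="pdf.pdf">y</a>'
# Extract href values that end in .pdf, one match per line
echo "$htmldoc" | grep -o 'href="[^"]*\.pdf"'
# prints href="x.pdf" then href="pdf.pdf"
```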
CSS selectors
.class selects all elements with class=’class’
div p selects all <p> elements inside div elements
div > p selects <p> elements where parent is <div>
[target=blank] selects all elements with target="blank"
[href^=https] selects urls beginning with https
[href$=pdf] selects urls ending with pdf
More examples at
https://www.w3schools.com/cssref/css_selectors.asp
14
XPath Query
author
selects all <author> elements at
the current level
//author
selects all <author> elements at
any level
//author[@class='x']
selects <author> elements with
class='x'
//book//author
All <author> elements that are
below a <book> element
//author/* All children of <author> nodes
More examples at
https://msdn.microsoft.com/en-us/library/ms256086
15
Getting data walkthrough
https:
//github.com/sharatsc/cdse/tree/master/curl
16
Scrubbing JSON data: JQ 5
• JQ is a portable command line utility to manipulate and
filter JSON data
• Filters can be defined to access individual fields, transform
records or produce derived objects
• Filters can be composed and combined
• Provides built-in functions and operators
Example:
curl ’https://api.github.com/repos/stedolan/jq/commits’ | jq ’.[0]’
5
https://stedolan.github.io/jq/
17
Basic filters
# Identity '.'
echo '"Hello world"' | jq '.'
"Hello world"
# Examples '.foo', '.foo.bar', '.foo|.bar', '.["foo"]'
echo '{"foo": 42, "bar": 10}' | jq '.foo'
42
echo '{"foo": {"bar": 10, "baz": 20}}' | jq '.foo.bar'
10
# Arrays '.[]', '.[0]', '.[1:3]'
echo '["foo", "bar"]' | jq '.[0]'
"foo"
echo '["foo", "bar", "baz"]' | jq '.[1:3]'
["bar", "baz"]
# Pipe '.foo|.bar', '.[]|.bar'
echo '[{"f": 10}, {"f": 20}, {"f": 30}]' | jq '.[] | .f'
10
20
30
18
Object construction
Object construction allows you to derive new objects out of
existing ones.
# Field selection
echo '{"foo": "F", "bar": "B", "baz": "Z"}' | jq '{"foo": .foo}'
{"foo": "F"}
# Array expansion
echo '{"foo": "A", "bar": ["X", "Y"]}' | jq '{"foo": .foo, "bar": .bar[]}'
{"foo": "A", "bar": "X"}
{"foo": "A", "bar": "Y"}
# Expression evaluation: both key and value can be substituted
echo '{"foo": "A", "bar": ["X", "Y"]}' | jq '{(.foo): .bar[]}'
{"A": "X"}
{"A": "Y"}
19
Operators
Addition
• Numbers are added by normal arithmetic.
• Arrays are added by being concatenated into a larger array.
• Strings are added by being joined into a larger string.
• Objects are added by merging, that is, inserting all the
key-value pairs from both objects into a single combined
object.
# Adding fields
echo '{"foo": 10}' | jq '.foo + 1'
11
# Adding arrays
echo '{"foo": [1,2,3], "bar": [11,12,13]}' | jq '.foo + .bar'
[1,2,3,11,12,13]
20
JQ walkthrough
https:
//github.com/sharatsc/cdse/tree/master/jq
21
Exploring data
CSV
• CSV (comma-separated values) is the common denominator
for data exchange
• Tabular data with ',' as the separator
• Can be ingested by R, Python, Excel, etc.
• No explicit specification of data types (ARFF supports type
annotation)
Example
state,county,quantity
NE,ADAMS,1
NE,BUFFALO,1
NE,THURSTON,1
22
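Even before reaching for csvkit, simple aggregates over a file like this fall out of coreutils — for example, counting rows per state:

```shell
# Drop the header, keep the first column, count occurrences per value
printf 'state,county,quantity\nNE,ADAMS,1\nNE,BUFFALO,1\nNE,THURSTON,1\n' \
  | tail -n +2 \
  | cut -d, -f1 \
  | sort | uniq -c
# prints "   3 NE"
```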
CSVKit (Groskopf and contributors [2016])
csvkit 6 is a suite of command line tools for converting to and
working with CSV, the de facto standard for tabular file formats.
Example use cases
• Importing data from Excel, SQL
• Selecting a subset of columns
• Reordering columns
• Merging multiple files (row- and column-wise)
• Summary statistics
6
https://csvkit.readthedocs.io/
23
Importing data
# Fetch data in XLS format
# (LESO) 1033 Program dataset, which describes how surplus military arms have been distributed
# This data was widely cited in the aftermath of the Ferguson, Missouri protests.
curl -L https://github.com/sharatsc/cdse/blob/master/csvkit/ne_1033_data.xls?raw=true -o ne_10
# Convert to csv
in2csv ne_1033_data.xls > data.csv
# Inspect the columns
csvcut -n data.csv
# Inspect the data in specific columns
csvcut -c county,quantity data.csv | csvlook
24
CSVKit: Examining data
csvstat provides a summary view of the data similar to
summary() function in R.
# Get summary for county, and cost
csvcut -c county,acquisition_cost,ship_date data.csv | csvstat
1. county
Text
Nulls: False
Unique values: 35
Max length: 10
5 most frequent values:
DOUGLAS: 760
DAKOTA: 42
CASS: 37
HALL: 23
LANCASTER: 18
2. acquisition_cost
Number
Nulls: False
Min: 0.0
Max: 412000.0
Sum: 5430787.55
Mean: 5242.072924710424710424710425
Median: 6000.0
Standard Deviation: 13368.07836799839045093904423
Unique values: 75
5 most frequent values:
6800.0: 304
25
CSVKit: searching data
csvgrep can be used to search the content of the CSV file.
Options include
• Exact match -m
• Regex match -r
• Invert match -i
• Search specific columns -c columns
csvcut -c county,item_name,total_cost data.csv | csvgrep -c county -m LANCASTER | csvlook
| county | item_name | total_cost |
| --------- | ------------------------------ | ---------- |
| LANCASTER | RIFLE,5.56 MILLIMETER | 120 |
| LANCASTER | RIFLE,5.56 MILLIMETER | 120 |
| LANCASTER | RIFLE,5.56 MILLIMETER | 120 |
26
CSVKit: Power tools
• csvjoin can be used to combine columns from multiple files
csvjoin -c join_column data.csv other_data.csv
• csvsort can be used to sort the file based on specific
columns
csvsort -c total_population | csvlook | head
• csvstack allows you to merge multiple files together
(row-wise)
curl -L -O https://raw.githubusercontent.com/wireservice/csvkit/master/examples/re
in2csv ne_1033_data.xls > ne_1033_data.csv
csvstack -g region ne_1033_data.csv ks_1033_data.csv > region.csv
27
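When csvkit is unavailable, a minimal row-wise stack (take the header from the first file, then the data rows of every file) can be sketched with head and tail — no -g grouping column, just concatenation:

```shell
dir=$(mktemp -d)
printf 'state,qty\nNE,1\n' > "$dir/a.csv"
printf 'state,qty\nKS,2\n' > "$dir/b.csv"
# One header, then the data rows of each file in turn
head -n 1 "$dir/a.csv" > "$dir/stacked.csv"
for f in "$dir/a.csv" "$dir/b.csv"; do tail -n +2 "$f" >> "$dir/stacked.csv"; done
cat "$dir/stacked.csv"
# prints: state,qty / NE,1 / KS,2
```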
CSVKit: SQL
• csvsql allows querying against one or more CSV files.
• The results of the query can be inserted back into a db
Examples
• Import from csv into a table
# Inserts into a specific table
csvsql --db postgresql:///test --table data --insert data.csv
# Inserts each file into a separate table
csvsql --db postgresql:///test --insert examples/*_tables.csv
• Regular SQL query
csvsql --query "select count(*) from data" data.csv
28
CSVKit Walkthrough
https:
//github.com/sharatsc/cdse/tree/master/csvkit
29
Managing large workflows
GNU parallel
• GNU parallel (Tange [2011]) is a tool for executing jobs in
parallel on one or more machines.
• It can be used to parallelize across arguments, lines and
files.
Examples
# Parallelize across lines
seq 1000 | parallel "echo {}"
# Parallelize across file content (comma-separated columns)
cat input.csv | parallel -C, "mv {1} {2}"
cat input.csv | parallel -C, --header : "mv {source} {dest}"
30
Parallel (cont.)
• By default, parallel runs one job per CPU core
• Concurrency can be controlled by --jobs or -j option
seq 100 | parallel -j2 "echo number: {}"
seq 100 | parallel -j200% "echo number: {}"
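Where GNU parallel itself is missing, xargs -P gives a minimal substitute for this kind of fan-out (far fewer features: no column splitting, no result logging); sorting makes the interleaved output deterministic:

```shell
# Run up to 2 jobs at a time; sort to stabilize interleaved output
seq 3 | xargs -P 2 -I {} echo "number: {}" | sort
# prints number: 1, number: 2, number: 3 on separate lines
```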
Logging
• Output from each parallel job can be captured separately
using --results
seq 10 | parallel --results data/outdir "echo number: {}"
find data/outdir
31
Parallel (cont.)
• Remote execution
parallel --nonall --slf instances hostname
# nonall - no argument command to follow
# slf - uses ~/.parallel/sshloginfile as the list of sshlogins
• Distributing data
# Split 1-1000 into sections of 100 and pipe it to remote instances
seq 1000 | parallel -N100 --pipe --slf instances "(wc -l)"
# transmit, retrieve and clean up
# sends jq to all instances
# transmits the input file, retrieves the results into {.}.csv and cleans up
ls *.gz | parallel -v --basefile jq --trc {.}.csv
32
Modeling using vowpal wabbit
Overview
• Fast, online, scalable learning system
• Supports out-of-core execution with an in-memory model
• Scalable (terascale)
• 1000 nodes
• Billions of examples
• Trillions of unique features
• Actively developed
https://github.com/JohnLangford/vowpal_wabbit
33
Swiss army knife of online algorithms
• Binary classification
• Multiclass classification
• Linear regression
• Quantile regression
• Topic modeling (online LDA)
• Structured prediction
• Active learning
• Recommendation (Matrix factorization)
• Contextual bandit learning (explore/exploit algorithms)
• Reductions
34
Features
• Flexible input format: Allows free form text, identifying tags,
multiple labels
• Speed: Supports online learning
• Scalable
• Streaming removes row limit.
• Hashing removes dimension limit.
• Distributed training: models merged using AllReduce
operation.
• Cross product features: Allows multi-task learning
35
Optimization
VW solves optimizations of the form

min_w Σ_i l(w^T x_i; y_i) + λR(w)

Here, l() is convex and R(w) = λ1 |w|_1 + λ2 ||w||_2^2.
VW supports a variety of loss functions:
Linear regression     (y − w^T x)^2
Logistic regression   log(1 + exp(−y w^T x))
SVM (hinge loss)      max(0, 1 − y w^T x)
Quantile regression   τ(w^T x − y) I(y < w^T x) + (1 − τ)(y − w^T x) I(y ≥ w^T x)
36
Detour: Feature hashing
• Feature hashing can be used to reduce the dimensionality of
sparse features.
• Unlike random projections, it retains sparsity
• Approximately preserves dot products (random projections
preserve distances).
• Model can fit in memory.
• Unsigned version: consider a hash function
h: [0 . . . N] → [0 . . . m], m << N. Then
φ_i(x) = Σ_{j: h(j)=i} x_j
• Signed version: consider additionally a sign hash
ξ: [0 . . . N] → {1, −1}. Then
φ_i(x) = Σ_{j: h(j)=i} ξ(j) x_j
37
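A toy illustration of the unsigned version in awk — with a deliberately trivial hash (letter positions summed mod m=4, not the hash VW actually uses). Features that collide in a bucket simply add their values:

```shell
# Hash each "name:value" feature into 4 buckets; collisions sum
printf 'aa:1\nab:2\nba:3\n' | awk -F: '
{
  h = 0
  for (i = 1; i <= length($1); i++)
    h = (h + index("abcdefghijklmnopqrstuvwxyz", substr($1, i, 1))) % 4
  bucket[h] += $2
}
END { for (b in bucket) print b, bucket[b] }' | sort
# "ab" and "ba" collide in bucket 3, so their values sum: prints "2 1" then "3 5"
```

Note the output has at most 4 entries regardless of how many distinct feature names appear — that is the dimension reduction, and empty buckets never materialize, which is how sparsity is retained.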
Detour: Generalized linear models
A generalized linear predictor specifies
• A linear predictor of the form η(x) = wT x
• A mean estimate µ
• A link function g(µ) such that g(µ) = η(x) that relates the
mean estimate to the linear predictor.
This framework supports a variety of regression problems
Linear regression     µ = w^T x
Logistic regression   log(µ/(1 − µ)) = w^T x
Poisson regression    log(µ) = w^T x
38
Input format
Label Importance [Tag] |namespace feature ... |namespace feature ...
namespace = String[:Float]
feature = String[:Float]
Examples:
• 1 | 1:0.01 32:-0.1
• example|namespace normal text features
• 1 3 tag|ad-features ad description |user-features name
address age
39
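As a bridge from the CSV tools earlier, a one-liner that rewrites label,feature,... rows into this format — the column names (crim, rm) and the namespace f are made up for illustration:

```shell
# Turn a header + label,crim,rm CSV into VW input lines
printf 'label,crim,rm\n1,0.1,6.5\n-1,0.2,5.0\n' \
  | tail -n +2 \
  | awk -F, '{ printf "%s |f crim:%s rm:%s\n", $1, $2, $3 }'
# prints: 1 |f crim:0.1 rm:6.5
#         -1 |f crim:0.2 rm:5.0
```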
Input options
• data file -d datafile
• network --daemon --port <port=26542>
• compressed data --compressed
• stdin cat <data> | vw
40
Manipulation options
• ngrams --ngram
• skips --skips
• quadratic interaction -q args. e.g -q ab
• cubic interaction --cubic args. e.g. --cubic ccc
41
Output options
• Examining feature construction --audit
• Generating prediction --predictions or -p
• Unnormalized predictions --raw_predictions
• Testing only --testonly or -t
42
Model options
• Model size --bit_precision or -b. Number of
coefficients limited to 2^b
• Update existing model --initial_regressor or -i.
• Final model destination --final_regressor or -f
• Readable model definition --readable_model
• Readable feature values --invert_hash
• Snapshot model every pass --save_per_pass
• Weight initialization
--initial_weight or --random_weights
43
Regression (Demo)
https://github.com/sharatsc/cdse/tree/master/
vowpal-wabbit/linear-regression
• Linear regression --loss_function square
• Quantile regression
--loss_function quantile --quantile_tau <tau=0.5>
44
Binary classification (Demo)
https://github.com/sharatsc/cdse/tree/master/
vowpal-wabbit/classification
• Note: a linear regressor can be used as a classifier as well
• Logistic loss
--loss_function logistic, --link logistic
• Hinge loss (SVM loss function)
--loss_function hinge
• Report binary loss instead of logistic loss --binary
45
Multiclass classification (Demo)
https://github.com/sharatsc/cdse/tree/master/
vowpal-wabbit/multiclass
• One against all --oaa <k>
• Error correcting tournament --ect <k>
• Online decision trees --log_multi <k>
• Cost sensitive one-against-all --csoaa <k>
46
LDA options (Demo)
https://github.com/sharatsc/cdse/tree/master/
vowpal-wabbit/lda
• Number of topics --lda
• Prior on per-document topic weights --lda_alpha
• Prior on topic distributions --lda_rho
• Estimated number of documents --lda_D
• Convergence parameter for topic estimation
--lda_epsilon
• Mini batch size --minibatch
47
Daemon mode (Demo)
https://github.com/sharatsc/cdse/tree/master/
vowpal-wabbit/daemon-respond
https://github.com/sharatsc/cdse/tree/master/
vowpal-wabbit/daemon-request
• Loads model and answers any prediction request coming
over the network
• Preferred way to deploy a VW model
• Options
• --daemon. Enables daemon mode
• --testonly or -t. Does not update the model in
response to requests
• --initial_regressor or -i. Model to load
• --port <arg>. Port to listen to the request
• --num_children <arg>. Number of threads listening to
requests
48
References
Christopher Groskopf and contributors. csvkit, 2016. URL
https://csvkit.readthedocs.org/.
O. Tange. GNU Parallel - the command-line power tool. ;login:
The USENIX Magazine, 36(1):42–47, Feb 2011. doi:
http://dx.doi.org/10.5281/zenodo.16303. URL
http://www.gnu.org/s/parallel.
48
More Related Content

PDF
Barcelona MUG MongoDB + Hadoop Presentation
PPTX
Conexión de MongoDB con Hadoop - Luis Alberto Giménez - CAPSiDE #DevOSSAzureDays
ODP
Aggregation Framework in MongoDB Overview Part-1
PDF
R statistics with mongo db
PPTX
The Aggregation Framework
PPTX
The Aggregation Framework
PDF
MongoDB Aggregation Framework
PDF
Using MongoDB and Python
Barcelona MUG MongoDB + Hadoop Presentation
Conexión de MongoDB con Hadoop - Luis Alberto Giménez - CAPSiDE #DevOSSAzureDays
Aggregation Framework in MongoDB Overview Part-1
R statistics with mongo db
The Aggregation Framework
The Aggregation Framework
MongoDB Aggregation Framework
Using MongoDB and Python

What's hot (20)

PPTX
MongoDB - Aggregation Pipeline
PDF
Data Processing and Aggregation with MongoDB
PDF
Aggregation Framework MongoDB Days Munich
PDF
Working With a Real-World Dataset in Neo4j: Import and Modeling
PDF
MongoDB and Python
PPTX
MongoDB World 2016 : Advanced Aggregation
KEY
Python Development (MongoSF)
PDF
Introduction to Pig & Pig Latin | Big Data Hadoop Spark Tutorial | CloudxLab
PDF
Analytics with MongoDB Aggregation Framework and Hadoop Connector
KEY
MongoDB Aggregation Framework
PDF
On Beyond (PostgreSQL) Data Types
PDF
PPTX
Introduction to MongoDB and Hadoop
PPTX
PostgreSQL 9.4 JSON Types and Operators
PDF
Webinar: Data Processing and Aggregation Options
PPTX
MongoDB Aggregation
PPTX
Webinar: Exploring the Aggregation Framework
PPTX
Conceptos básicos. Seminario web 5: Introducción a Aggregation Framework
PDF
Fast track to getting started with DSE Max @ ING
PDF
Spark cassandra integration 2016
MongoDB - Aggregation Pipeline
Data Processing and Aggregation with MongoDB
Aggregation Framework MongoDB Days Munich
Working With a Real-World Dataset in Neo4j: Import and Modeling
MongoDB and Python
MongoDB World 2016 : Advanced Aggregation
Python Development (MongoSF)
Introduction to Pig & Pig Latin | Big Data Hadoop Spark Tutorial | CloudxLab
Analytics with MongoDB Aggregation Framework and Hadoop Connector
MongoDB Aggregation Framework
On Beyond (PostgreSQL) Data Types
Introduction to MongoDB and Hadoop
PostgreSQL 9.4 JSON Types and Operators
Webinar: Data Processing and Aggregation Options
MongoDB Aggregation
Webinar: Exploring the Aggregation Framework
Conceptos básicos. Seminario web 5: Introducción a Aggregation Framework
Fast track to getting started with DSE Max @ ING
Spark cassandra integration 2016
Ad

Viewers also liked (20)

PPTX
Make Sense Out of Data with Feature Engineering
PDF
Work - LIGHT Ministry
PDF
6 rules of enterprise innovation
PPTX
Leverage Social Media for Employer Brand and Recruiting
PDF
Data Science Competition
PDF
Ethics in Data Science and Machine Learning
PDF
Doing your first Kaggle (Python for Big Data sets)
PPTX
Feature Hashing for Scalable Machine Learning: Spark Summit East talk by Nick...
PDF
Menstrual Health Reader - mEo
PDF
Open Innovation - A Case Study
PPTX
Managing Data Science | Lessons from the Field
PPTX
State of women in technical workforce
ODP
Wapid and wobust active online machine leawning with Vowpal Wabbit
PPTX
How to recruit excellent tech talent
PDF
How hackathons can drive top line revenue growth
PDF
No-Bullshit Data Science
PPTX
Kill the wabbit
PDF
USC LIGHT Ministry Introduction
PDF
Intra company hackathons using HackerEarth
PPTX
A Panorama of Natural Language Processing
Make Sense Out of Data with Feature Engineering
Work - LIGHT Ministry
6 rules of enterprise innovation
Leverage Social Media for Employer Brand and Recruiting
Data Science Competition
Ethics in Data Science and Machine Learning
Doing your first Kaggle (Python for Big Data sets)
Feature Hashing for Scalable Machine Learning: Spark Summit East talk by Nick...
Menstrual Health Reader - mEo
Open Innovation - A Case Study
Managing Data Science | Lessons from the Field
State of women in technical workforce
Wapid and wobust active online machine leawning with Vowpal Wabbit
How to recruit excellent tech talent
How hackathons can drive top line revenue growth
No-Bullshit Data Science
Kill the wabbit
USC LIGHT Ministry Introduction
Intra company hackathons using HackerEarth
A Panorama of Natural Language Processing
Ad

Similar to Data science at the command line (20)

PDF
Obtaining, Scrubbing, and Exploring Data at the Command Line by Jeroen Janssens
PPTX
Your data isn't that big @ Big Things Meetup 2016-05-16
PPTX
Web Scraping Basics
PPTX
Nix for etl using scripting to automate data cleaning & transformation
PDF
Data handling in r
PDF
How researchers and developers can benefit from the command line
DOCX
Ravi software faculty
PDF
Python for Data Analysis Data Wrangling with Pandas NumPy and IPython 1st Edi...
PDF
Computational Social Science, Lecture 09: Data Wrangling
PDF
Linux intro 3 grep + Unix piping
ODP
Love Your Command Line
PDF
Introduction to Linux with Focus on Raspberry Pi
PDF
Command line Data Tools
PDF
Linux Command Line - By Ranjan Raja
PDF
Wes McKinney - Python for Data Analysis-O'Reilly Media (2012).pdf
PDF
Data analystics with R module 3 cseds vtu
PPTX
Linux tech talk
PDF
Adaptive Query Processing on RAW Data
PDF
20180324 leveraging unix tools
PDF
One-Liners to Rule Them All
Obtaining, Scrubbing, and Exploring Data at the Command Line by Jeroen Janssens
Your data isn't that big @ Big Things Meetup 2016-05-16
Web Scraping Basics
Nix for etl using scripting to automate data cleaning & transformation
Data handling in r
How researchers and developers can benefit from the command line
Ravi software faculty
Python for Data Analysis Data Wrangling with Pandas NumPy and IPython 1st Edi...
Computational Social Science, Lecture 09: Data Wrangling
Linux intro 3 grep + Unix piping
Love Your Command Line
Introduction to Linux with Focus on Raspberry Pi
Command line Data Tools
Linux Command Line - By Ranjan Raja
Wes McKinney - Python for Data Analysis-O'Reilly Media (2012).pdf
Data analystics with R module 3 cseds vtu
Linux tech talk
Adaptive Query Processing on RAW Data
20180324 leveraging unix tools
One-Liners to Rule Them All

Recently uploaded (20)

PDF
Bridging biosciences and deep learning for revolutionary discoveries: a compr...
PDF
Shreyas Phanse Resume: Experienced Backend Engineer | Java • Spring Boot • Ka...
PDF
Approach and Philosophy of On baking technology
PDF
How UI/UX Design Impacts User Retention in Mobile Apps.pdf
PDF
The Rise and Fall of 3GPP – Time for a Sabbatical?
PDF
Dropbox Q2 2025 Financial Results & Investor Presentation
PPTX
Digital-Transformation-Roadmap-for-Companies.pptx
PDF
Architecting across the Boundaries of two Complex Domains - Healthcare & Tech...
PDF
Machine learning based COVID-19 study performance prediction
PDF
KodekX | Application Modernization Development
PDF
Peak of Data & AI Encore- AI for Metadata and Smarter Workflows
PDF
CIFDAQ's Market Insight: SEC Turns Pro Crypto
PPTX
Big Data Technologies - Introduction.pptx
PDF
Reach Out and Touch Someone: Haptics and Empathic Computing
PDF
Electronic commerce courselecture one. Pdf
PPTX
A Presentation on Artificial Intelligence
PPT
“AI and Expert System Decision Support & Business Intelligence Systems”
PPT
Teaching material agriculture food technology
PDF
7 ChatGPT Prompts to Help You Define Your Ideal Customer Profile.pdf
PDF
Per capita expenditure prediction using model stacking based on satellite ima...
Bridging biosciences and deep learning for revolutionary discoveries: a compr...
Shreyas Phanse Resume: Experienced Backend Engineer | Java • Spring Boot • Ka...
Approach and Philosophy of On baking technology
How UI/UX Design Impacts User Retention in Mobile Apps.pdf
The Rise and Fall of 3GPP – Time for a Sabbatical?
Dropbox Q2 2025 Financial Results & Investor Presentation
Digital-Transformation-Roadmap-for-Companies.pptx
Architecting across the Boundaries of two Complex Domains - Healthcare & Tech...
Machine learning based COVID-19 study performance prediction
KodekX | Application Modernization Development
Peak of Data & AI Encore- AI for Metadata and Smarter Workflows
CIFDAQ's Market Insight: SEC Turns Pro Crypto
Big Data Technologies - Introduction.pptx
Reach Out and Touch Someone: Haptics and Empathic Computing
Electronic commerce courselecture one. Pdf
A Presentation on Artificial Intelligence
“AI and Expert System Decision Support & Business Intelligence Systems”
Teaching material agriculture food technology
7 ChatGPT Prompts to Help You Define Your Ideal Customer Profile.pdf
Per capita expenditure prediction using model stacking based on satellite ima...

Data science at the command line

  • 1. Data science at the command line Rapid prototyping and reproducible science Sharat Chikkerur sharat@alum.mit.edu Principal Data Scientist Nanigans Inc. 1
  • 2. Outline Introduction Motivation Data science workflow Obtaining data Scrubbing data Exploring data Managing large workflows Modeling using vowpal wabbit 2
  • 4. Setup • Follow instructions at https://github.com/sharatsc/cdse • Install virtualbox virtualbox.org • Install vagrant vagrantup.com • Initialize virtual machine mkdir datascience cd datascience vagrant init data-science-toolbox/data-science-at-the-command-line vagrant up && vagrant ssh 3
  • 5. About me • UB Alumni, 2005 (M.S., EE Dept, cubs.buffalo.edu) • MIT Alumni, 2010 (PhD., EECS Dept, cbcl.mit.edu) • Senior Software Engineer, Google AdWords modeling • Senior Software Engineer, Microsoft Machine learning • Principal data scientist, Nanigans Inc 4
  • 6. About the workshop • Based on a book by Jeroen Janssens 1 • Vowpal wabbit 2 1 http://datascienceatthecommandline.com/ 2 https://github.com/JohnLangford/vowpal_wabbit 5
  • 7. POSIX command line • Exposes operating system functionalities through a shell • Example shells include bash, zsh, tcsh, fish etc. • Comprises of a large number of utility programs • Examples: grep, awk, sed, etc. • GNU user space programs • Common API: Input through stdin and output to stdout • stdin and stdout can be redirected during execution • Pipes (|) allows composition through pipes (chaining). • Allows redirection of output of one command as input to another ls -lh | more cat file | sort | uniq -c 6
  • 8. Why command line ? • REPL allows rapid iteration (through immediate feedback) • Allows composition of scripts and commands using pipes • Automation and scaling • Reproducibility • Extensibility • R, python, perl, ruby scripts can be invoked like command line utilities 7
  • 9. Data science workflow A typical workflow OSEMN model • Obtaining data • Scrubbing data • Exploring data • Modeling data • Interpreting data 8
  • 10. Workshop outline • Background • Getting data: Curl, scrape • Scrubbing: jq, csvkit • Exploring: csvkit • Modeling: vowpal wabbit • Scaling: parallel Hands on exercises • Obtaining data walkthrough • JQ walkthrough • CSVKit walkthrough • Vowpal wabbit 9
  • 11. Workflow example: Boston housing dataset https://github.com/sharatsc/cdse/blob/master/ boston-housing 10
  • 12. Boston housing dataset Python workflow 3 import urllib import pandas as pd # Obtain data urllib.urlretrieve(’https://raw.githubusercontent.com/sharatsc/cdse/master/boston-housin df = pd.read_csv(’boston.csv’) # Scrub data df = df.fillna(0) # Model data from statsmodels import regression from statsmodels.formula import api as smf formula = ’medv~’ + ’ + ’.join(df.columns - [’medv’]) model = smf.ols(formula=formula, data=df) res=model.fit() res.summary() Command line workflow URL="https://raw.githubusercontent.com/sharatsc/cdse/master/boston-housing/boston.csv" curl $URL| Rio -e ’model=lm("medv~.", df);model’ 3 https://github.com/sharatsc/cdse/blob/master/boston-housing 11
  • 14. Obtaining data from the web: curl • CURL (curl.haxx.se) • Cross platform command line tool that supports data transfer using HTTP, HTTPS, FTP, IMAP, SCP, SFTP • Supports cookies, user+password authentication • can be used to get data from RESTful APIs 4 • http GET curl http://www.google.com • ftp GET curl ftp://catless.ncl.ac.uk • scp COPY curl -u username: --key key_file --pass password scp://example.com/~/file.txt 4 www.codingpedia.org/ama/how-to-test-a-rest-api-from-command-line-with-curl/ 12
  • 16. Scrubbing web data: Scrape • Scrape is a python command line tool to parse html documents • Queries can be made in CSS selector or XPath syntax htmldoc=$(cat << EOF <div id=a> <a href="x.pdf">x</a> </div> <div id=b> <a href="png.png">y</a> <a href="pdf.pdf">y</a> </div> EOF ) # Select liks that end with pdf and are within div with id=b (Use CSS3 selector) echo $htmldoc | scrape -e "$b a[href$=pdf]" # Select all anchors (use Xpath) echo $htmldoc | scrape -e "//a" <a href="pdf.pdf">y</a> 13
  • 17. CSS selectors .class selects all elements with class=’class’ div p selects all <p> elements inside div elements div > p selects <p> elements where parent is <div> [target=blank] selects all elements with target="blank" [href^=https] selects urls beginning with https [href$=pdf] selects urls ending with pdf More examples at https://www.w3schools.com/cssref/css_selectors.asp 14
  • 18. XPath Query author selects all <author> elements at the current level //author selects all <author> elements at any level //author[@class=’x’] selects <author> elements with class=famous //book//author All <author> elements that are below <book> element" //author/* All children of <author> nodes More examples at https://msdn.microsoft.com/en-us/library/ms256086 15
  • 20. Scrubbing JSON data: JQ 5 • JQ is a portable command line utility to manipulate and filter JSON data • Filters can be defined to access individual fields, transform records or produce derived objets • Filters can be composed and combined • Provides builtin functions an operators Example: curl ’https://api.github.com/repos/stedolan/jq/commits’ | jq ’.[0]’ 5 https://stedolan.github.io/jq/ 17
  • 21. Basic filters # Identity ’.’ echo ’"Hello world"’ | jq ’.’ "Hello world" # Examples ’.foo’, ’.foo.bar’, ’.foo|.bar’, ’.["foo"] echo ’{"foo": 42, "bar": 10}’ | jq ’.foo’ 42 echo ’{"foo": {"bar": 10, "baz": 20}} | jq ’.foo.bar’ 10 #Arrays ’.[]’, ’.[0]’, ’.[1:3]’ echo ’["foo", "bar"]’ | jq ’.[0]’ "foo" echo ’["foo", "bar", "baz"]’ | jq ’.[1:3]’ ["bar", "baz"] # Pipe ’.foo|.bar’, ’.[]|.bar’ echo ’[{"f": 10}, {"f": 20}, {"f": 30}]’ | jq ’.[] | .f 10 20 30 18
  • 22. Object construction Object construction allows you to derive new objects out of existing ones. # Field selection echo {"foo": "F", "bar": "B", "baz": "Z"} | jq ’{"foo": .foo}’ {"foo": "F"} # Array expansion echo ’{"foo": "A", "bar": ["X", "Y"]}’ | jq ’{"foo": .foo, "bar": .bar[]}’ {"foo": "F", "bar": "X"} {"foo": "F", "bar": "Y"} # Expression evaluation, key and value can be substituted echo ’{"foo": "A", "bar": ["X", "Y"]}’ | jq ’{(.foo): .bar[]}’ {"A": "X"} {"A": "Y"} 19
  • 23. Operators Addition • Numbers are added by normal arithmetic. • Arrays are added by being concatenated into a larger array. • Strings are added by being joined into a larger string. • Objects are added by merging, that is, inserting all the key-value pairs from both objects into a single combined object. # Adding fields echo ’{"foo": 10}’ | jq ’.foo + 1’ 11 # Adding arrays echo ’{"foo": [1,2,3], "bar": [11,12,13]}’ | jq ’.foo + .bar’ [1,2,3,11,12,13] 20
• 26. CSV • CSV (comma-separated values) is the common denominator for data exchange • Tabular data with ',' as the separator • Can be ingested by R, Python, Excel, etc. • No explicit specification of data types (ARFF supports type annotation) Example state,county,quantity NE,ADAMS,1 NE,BUFFALO,1 NE,THURSTON,1 22
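Even before reaching for csvkit, such a table can be aggregated with plain POSIX tools; the sketch below recreates the sample data (the /tmp path is arbitrary):

```shell
# Recreate the sample CSV, then sum quantity per state with awk,
# skipping the header row (NR > 1)
cat > /tmp/leso.csv <<'EOF'
state,county,quantity
NE,ADAMS,1
NE,BUFFALO,1
NE,THURSTON,1
EOF
awk -F, 'NR > 1 { sum[$1] += $3 } END { for (s in sum) print s, sum[s] }' /tmp/leso.csv
# prints: NE 3
```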
• 27. CSVKit (Groskopf and contributors [2016]) csvkit 6 is a suite of command line tools for converting and working with CSV, the de facto standard for tabular file formats. Example use cases • Importing data from Excel, SQL • Selecting a subset of columns • Reordering columns • Merging multiple files (row- and column-wise) • Summary statistics 6 https://csvkit.readthedocs.io/ 23
• 28. Importing data # Fetch data in XLS format: the (LESO) 1033 Program dataset, which describes how surplus military arms # have been distributed. This data was widely cited in the aftermath of the Ferguson, Missouri protests. curl -L https://github.com/sharatsc/cdse/blob/master/csvkit/ne_1033_data.xls?raw=true -o ne_1033_data.xls # Convert to csv in2csv ne_1033_data.xls > data.csv # Inspect the columns csvcut -n data.csv # Inspect the data in specific columns csvcut -c county,quantity data.csv | csvlook 24
• 29. CSVKit: Examining data csvstat provides a summary view of the data, similar to the summary() function in R. # Get a summary for county, acquisition cost, and ship date csvcut -c county,acquisition_cost,ship_date data.csv | csvstat 1. county Text Nulls: False Unique values: 35 Max length: 10 5 most frequent values: DOUGLAS: 760 DAKOTA: 42 CASS: 37 HALL: 23 LANCASTER: 18 2. acquisition_cost Number Nulls: False Min: 0.0 Max: 412000.0 Sum: 5430787.55 Mean: 5242.07 Median: 6000.0 Standard Deviation: 13368.08 Unique values: 75 5 most frequent values: 6800.0: 304 25
  • 30. CSVKit: searching data csvgrep can be used to search the content of the CSV file. Options include • Exact match -m • Regex match -r • Invert match -i • Search specific columns -c columns csvcut -c county,item_name,total_cost data.csv | csvgrep -c county -m LANCASTER | csvlook | county | item_name | total_cost | | --------- | ------------------------------ | ---------- | | LANCASTER | RIFLE,5.56 MILLIMETER | 120 | | LANCASTER | RIFLE,5.56 MILLIMETER | 120 | | LANCASTER | RIFLE,5.56 MILLIMETER | 120 | 26
• 31. CSVKit: Power tools • csvjoin can be used to combine columns from multiple files csvjoin -c join_column data.csv other_data.csv • csvsort can be used to sort the file based on specific columns csvsort -c total_population | csvlook | head • csvstack allows you to merge multiple files together (row-wise) curl -L -O https://raw.githubusercontent.com/wireservice/csvkit/master/examples/re in2csv ne_1033_data.xls > ne_1033_data.csv csvstack -g region ne_1033_data.csv ks_1033_data.csv > region.csv 27
• 32. CSVKit: SQL • csvsql allows querying against one or more CSV files. • The results of the query can be inserted back into a database Examples • Import from csv into a table # Inserts into a specific table csvsql --db postgresql:///test --table data --insert data.csv # Inserts each file into a separate table csvsql --db postgresql:///test --insert examples/*_tables.csv • Regular SQL query csvsql --query "select count(*) from data" data.csv 28
• 35. GNU parallel • GNU parallel (Tange [2011]) is a tool for executing jobs in parallel on one or more machines. • It can be used to parallelize across arguments, lines, and files. Examples # Parallelize across lines seq 1000 | parallel "echo {}" # Parallelize across file content cat input.csv | parallel -C, "mv {1} {2}" cat input.csv | parallel -C, --header : "mv {source} {dest}" 30
  • 36. Parallel (cont.) • By default, parallel runs one job per cpu core • Concurrency can be controlled by --jobs or -j option seq 100 | parallel -j2 "echo number: {}" seq 100 | parallel -j200% "echo number: {}" Logging • Output from each parallel job can be captured separately using --results seq 10 | parallel --results data/outdir "echo number: {}" find data/outdir 31
• 37. Parallel (cont.) • Remote execution parallel --nonall --slf instances hostname # nonall - no-argument command to follow # slf - uses ~/.parallel/sshloginfile as the list of sshlogins • Distributing data # Split 1-1000 into sections of 100 and pipe them to remote instances seq 1000 | parallel -N100 --pipe --slf instances "(wc -l)" # transmit, retrieve and clean up: # sends jq to all instances, transmits the input file, retrieves the results into {.}.csv and cleans up ls *.gz | parallel -v --basefile jq --trc {.}.csv 32
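If GNU parallel is unavailable, xargs with -P (a different tool, used here as a stand-in) covers the simple one-argument-per-job case:

```shell
# Fallback sketch: xargs -P approximates `parallel` for simple per-line jobs.
# -I{} substitutes each input line; -P2 runs two jobs concurrently.
seq 1 5 | xargs -P2 -I{} echo "number: {}"
# With concurrency, output order is not guaranteed
```

Unlike parallel, xargs offers no {1}/{2} column splitting, header support, or remote execution.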
• 39. Overview • Fast, online, scalable learning system • Supports out-of-core execution with an in-memory model • Scalable (terascale) • 1000 nodes • Billions of examples • Trillions of unique features • Actively developed https://github.com/JohnLangford/vowpal_wabbit 33
  • 40. Swiss army knife of online algorithms • Binary classification • Multiclass classification • Linear regression • Quantile regression • Topic modeling (online LDA) • Structured prediction • Active learning • Recommendation (Matrix factorization) • Contextual bandit learning (explore/exploit algorithms) • Reductions 34
  • 41. Features • Flexible input format: Allows free form text, identifying tags, multiple labels • Speed: Supports online learning • Scalable • Streaming removes row limit. • Hashing removes dimension limit. • Distributed training: models merged using AllReduce operation. • Cross product features: Allows multi-task learning 35
• 42. Optimization VW solves an optimization problem of the form min_w Σ_i l(w^T x_i; y_i) + λR(w) where l() is convex and R(w) = λ1 |w|_1 + λ2 ||w||_2^2. VW supports a variety of loss functions: Linear regression (y − w^T x)^2 Logistic regression log(1 + exp(−y w^T x)) SVM (hinge) max(0, 1 − y w^T x) Quantile regression τ(w^T x − y) I(y < w^T x) + (1 − τ)(y − w^T x) I(y ≥ w^T x) 36
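These loss formulas can be sanity-checked numerically; the sketch below evaluates them in awk at arbitrary values y = 1 and w^T x = 0.5:

```shell
# Evaluate the squared, logistic, and hinge losses at y = 1, w^T x = 0.5
awk 'BEGIN {
  y = 1; wx = 0.5; m = y * wx           # m is the margin
  printf "square:   %.4f\n", (y - wx)^2
  printf "logistic: %.4f\n", log(1 + exp(-m))
  printf "hinge:    %.4f\n", ((1 - m > 0) ? 1 - m : 0)
}'
# square:   0.2500
# logistic: 0.4741
# hinge:    0.5000
```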
• 43. Detour: Feature hashing • Feature hashing can be used to reduce the dimensionality of sparse features. • Unlike random projections, it retains sparsity • Preserves dot products (random projection preserves distances) • The model can fit in memory • Unsigned: consider a hash function h(x): [0 . . . N] → [0 . . . m], m << N. Then φ_i(x) = Σ_{j: h(j)=i} x_j • Signed: consider additionally a sign hash ξ(x): [0 . . . N] → {1, −1}. Then φ_i(x) = Σ_{j: h(j)=i} ξ(j) x_j 37
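A toy sketch of the unsigned construction, using a stand-in hash h(j) = j mod m (real implementations use a proper hash such as MurmurHash, which VW itself relies on):

```shell
# phi_i(x) = sum of x_j over j with h(j) = i, with the toy hash h(j) = j mod m.
# The sparse input indices and values are invented.
awk 'BEGIN {
  m = 4
  x[3] = 1.0; x[7] = 2.0; x[10] = 0.5   # sparse input: index -> value
  for (j in x) phi[j % m] += x[j]        # indices 3 and 7 collide (both hash to 3)
  for (i = 0; i < m; i++) printf "phi[%d] = %.1f\n", i, phi[i]
}'
# phi[2] = 0.5 and phi[3] = 3.0; the other buckets stay 0.0
```

Note how the collision of indices 3 and 7 adds their values together, which is exactly the approximation the signed variant corrects in expectation.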
• 44. Detour: Generalized linear models A generalized linear predictor specifies • A linear predictor of the form η(x) = w^T x • A mean estimate µ • A link function g(µ) such that g(µ) = η(x), relating the mean estimate to the linear predictor. This framework supports a variety of regression problems: Linear regression µ = w^T x Logistic regression log(µ/(1 − µ)) = w^T x Poisson regression log(µ) = w^T x 38
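Inverting the logistic link gives µ = 1/(1 + exp(−η)); a quick numeric check in awk (the value of η is arbitrary):

```shell
# Invert the logit link: mu = 1 / (1 + exp(-eta)), then verify g(mu) = eta
awk 'BEGIN {
  eta = 0.5
  mu = 1 / (1 + exp(-eta))
  printf "mu = %.4f, log(mu/(1-mu)) = %.4f\n", mu, log(mu / (1 - mu))
}'
# mu = 0.6225, log(mu/(1-mu)) = 0.5000
```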
• 45. Input format Label Importance [Tag]|namespace Feature . . . | namespace Feature . . . namespace = String[:Float] feature = String[:Float] Examples: • 1 | 1:0.01 32:-0.1 • example|namespace normal text features • 1 3 tag|ad-features ad description |user-features name address age 39
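In practice this format is often generated from CSV with awk; the namespace name loc and the feature naming scheme below are made up for illustration:

```shell
# Emit VW-format lines (label | namespace features) from the sample CSV;
# quantity becomes the label, state and county become categorical features
printf 'state,county,quantity\nNE,ADAMS,1\nNE,BUFFALO,1\n' |
awk -F, 'NR > 1 { printf "%s |loc state_%s county_%s\n", $3, $1, $2 }'
# 1 |loc state_NE county_ADAMS
# 1 |loc state_NE county_BUFFALO
```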
  • 46. Input options • data file -d datafile • network --daemon --port <port=26542> • compressed data --compressed • stdin cat <data> | vw 40
  • 47. Manipulation options • ngrams --ngram • skips --skips • quadratic interaction -q args. e.g -q ab • cubic interaction --cubic args. e.g. --cubic ccc 41
  • 48. Output options • Examining feature construction --audit • Generating prediction --predictions or -p • Unnormalized predictions --raw_predictions • Testing only --testonly or -t 42
• 49. Model options • Model size --bit_precision or -b. Number of coefficients limited to 2^b • Update existing model --initial_regressor or -i. • Final model destination --final_regressor or -f • Readable model definition --readable_model • Readable feature values --invert_hash • Snapshot model every pass --save_per_pass • Weight initialization --initial_weight or --random_weights 43
• 50. Regression (Demo) https://github.com/sharatsc/cdse/tree/master/ vowpal-wabbit/linear-regression • Linear regression --loss_function square • Quantile regression --loss_function quantile --quantile_tau <tau=0.5> 44
  • 51. Binary classification (Demo) https://github.com/sharatsc/cdse/tree/master/ vowpal-wabbit/classification • Note: a linear regressor can be used as a classifier as well • Logistic loss --loss_function logistic, --link logistic • Hinge loss (SVM loss function) --loss_function hinge • Report binary loss instead of logistic loss --binary 45
• 52. Multiclass classification (Demo) https://github.com/sharatsc/cdse/tree/master/ vowpal-wabbit/multiclass • One against all --oaa <k> • Error correcting tournament --ect <k> • Online decision trees --log_multi <k> • Cost sensitive one-against-all --csoaa <k> 46
  • 53. LDA options (Demo) https://github.com/sharatsc/cdse/tree/master/ vowpal-wabbit/lda • Number of topics --lda • Prior on per-document topic weights --lda_alpha • Prior on topic distributions --lda_rho • Estimated number of documents --lda_D • Convergence parameter for topic estimation --lda_epsilon • Mini batch size --minibatch 47
• 54. Daemon mode (Demo) https://github.com/sharatsc/cdse/tree/master/ vowpal-wabbit/daemon-respond https://github.com/sharatsc/cdse/tree/master/ vowpal-wabbit/daemon-request • Loads a model and answers any prediction request coming over the network • Preferred way to deploy a VW model • Options • --daemon. Enables daemon mode • --testonly or -t. Does not update the model in response to requests • --initial_regressor or -i. Model to load • --port <arg>. Port to listen on for requests • --num_children <arg>. Number of threads listening for requests 48
  • 56. Christopher Groskopf and contributors. csvkit, 2016. URL https://csvkit.readthedocs.org/. O. Tange. Gnu parallel - the command-line power tool. ;login: The USENIX Magazine, 36(1):42–47, Feb 2011. doi: http://dx.doi.org/10.5281/zenodo.16303. URL http://www.gnu.org/s/parallel. 48