Boosting command line experience
Python meets AWK
Kirill Pavlov
Technical Recruiter, Terminal 1
October 22, 2017
Kirill Pavlov <k@p99.io> Boosting command line: python + awk October 22, 2017 1 / 20
README.md
A lot of Python/AWK examples here.
Source code and slides available online.
At the end: build a stock trading system and check NYSE:CS.
Table of contents
1. Problem Background
2. AWK Bootcamp in 5 min
3. Tabtools architecture and features
4. Stock example with NYSE:CS
Background
Yandex, year 2010. Hadoop was not widely adopted.
10 GB of archived ads data daily: time, ad_id, site_id, clicks.
Task: daily data aggregation (simple functions: group by, sum, join) and feature
generation for further machine learning classification.
Solution: released a set of command line scripts.
Example
This presentation uses UCI machine learning Higgs boson data: 11M objects, 28
attributes, 7.5 GB unarchived.
Questions:
1. What is the maximum value of lepton_eta?
2. What is the average value of lepton_phi by class 0 and 1?
3. Filter objects with m_jj > 0.75 (8.9M objects) and sort them by m_wbb.
Solutions:
1. In-memory Python with Pandas.
2. Database SQL queries (PostgreSQL and Docker).
3. Command line with AWK.
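To make the three questions concrete, here is a minimal sketch in plain Python on a handful of made-up rows. The tuple layout and all values are assumptions for illustration; they are not taken from the Higgs dataset or from the benchmark code.

```python
from collections import defaultdict

# Toy rows: (class label, lepton_eta, lepton_phi, m_jj, m_wbb).
# Values are invented for illustration only.
rows = [
    (1, 0.5, 1.0, 0.9, 1.2),
    (0, -0.3, 2.0, 0.6, 0.8),
    (1, 1.7, 3.0, 1.1, 0.7),
    (0, 0.2, 4.0, 0.8, 1.0),
]

# Q1: maximum of one column in a single pass.
max_eta = max(r[1] for r in rows)

# Q2: average of lepton_phi per class, using running sums and counts.
sums, counts = defaultdict(float), defaultdict(int)
for label, _, phi, _, _ in rows:
    sums[label] += phi
    counts[label] += 1
avg_phi = {label: sums[label] / counts[label] for label in sums}

# Q3: keep rows with m_jj > 0.75, then sort the survivors by m_wbb.
filtered = sorted((r for r in rows if r[3] > 0.75), key=lambda r: r[4])
```

Q1 and Q2 need only one pass and constant or per-group memory, so they stream well; Q3 has to hold the filtered rows in order to sort them.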
Demo Time
Reality Check
It’s not as agile as it seems. You work inside the company network.
1. You don’t have sudo rights and your admin does not want to install anything for you:
no database, no user privileges, etc.
2. The server does not have GitHub/Internet access and the only deployment
possible is Java JARs or C/C++/etc. So, no NodeJS/Python packages. And of course
no R/Matlab/Excel.
3. Get better at command line tools ;)
Basic concepts
1. AWK [1]: a language for streaming columnar data processing. Standard in Unix-like OSes.
2. The original AWK is outdated; use mawk (fast) or gawk (flexible).
3. Limited data structures: strings, associative arrays (hash maps) and regexps.
4. Built-in variables:
$1, $2, ... ($0 is the entire record)
NR - number of processed lines (records)
NF - number of columns (fields)
5. Use vars without declaration. Uninitialized values act as 0 (or the empty string). One liners. Hipster friendly.
[1] Tutorial by Bruce Barnett. Careful, he writes his blog in txt.
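For readers coming from Python, the processing model above can be sketched as a plain loop. This is only a rough analogy of how AWK walks records and fields, not how any AWK implementation works internally; the function name `awk_like` is made up for this sketch.

```python
from collections import defaultdict

def awk_like(lines):
    """Mimic awk '{w += NF} END {print NR, w}' on an iterable of lines."""
    var = defaultdict(int)       # undeclared variables start at 0, as in AWK
    nr = 0                       # NR: number of records seen so far
    for record in lines:         # $0 is the whole record
        nr += 1
        fields = record.split()  # $1..$NF, split on runs of whitespace
        nf = len(fields)         # NF: number of fields in this record
        var["w"] += nf           # the main block runs once per record
    return nr, var["w"]          # the END block runs after the last record
```

Calling `awk_like(open("codeconf.md"))` would return the line and word counts as a tuple.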
AWK Examples 1 & 2
1. Count number of words and lines at codeconf.hk:
cat codeconf.md | awk '{w += NF} END {print NR, w}'
370 1445
2. Most popular words on codeconf.hk website:
cat codeconf.md \
| awk '{for(i=1; i<=NF; i++) words[tolower($i)]++}
END{for(w in words) print w, words[w]}' \
| sort -k2 -nr
Most popular non stop-words: "Serverless" and "Android".
SEO winners: Davide Benvegnù and Richard Cohen.
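For comparison, the word-frequency pipeline from example 2 could look roughly like this in Python, with `collections.Counter` playing the role of the AWK associative array. The function name `word_counts` is an assumption of this sketch.

```python
from collections import Counter

def word_counts(lines):
    """Count lowercased whitespace-separated words, most frequent first."""
    counts = Counter()
    for line in lines:
        counts.update(word.lower() for word in line.split())
    # most_common() sorts by count in descending order, like `sort -k2 -nr`.
    return counts.most_common()
```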
AWK Examples 3
3. Find the longest line in the text (if-then-else example):
cat codeconf.md | awk '{
l = length($0) > length(l) ? $0 : l
}END{
print length(l), l
}'
146 * We believe that the Hong Kong developer community is skilled and
diverse, but that often these skills end up hidden away in big organisations.
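The same ternary logic, written as a small Python sketch for comparison (the function name `longest_line` is made up here):

```python
def longest_line(lines):
    """Return (length, text) of the longest line, keeping the first on ties."""
    longest = ""
    for line in lines:
        # Same rule as the AWK ternary: replace only on a strictly longer line.
        longest = line if len(line) > len(longest) else longest
    return len(longest), longest
```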
Demo Time
Basic concepts
1. Special file format: TSV + header (meta information). Easy to convert and
autogenerate headers.
# Date Open High Low Close Volume
2014-02-21 84.35 84.45 83.9 83.45 17275.0
2. A Python script manages file descriptors and headers, converts column names to column
numbers, and executes a command-line command, e.g. cat/tail/sort.
3. Heavy lifting goes to awk: tawk (map) and tgrp (map-reduce).
4. Based on command line expressions, it generates an awk command and executes it
on the incoming stream.
5. Visual sugar: tpretty and tplot.
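The name-to-number translation in points 2 and 4 can be illustrated with a short sketch. This is a hypothetical illustration of the idea, not the actual tabtools code: the functions `parse_header`, `to_awk_expression`, and `build_awk_program` are invented for this example.

```python
import re

def parse_header(header_line):
    """Return column names from a '# Date Open ...' header line."""
    return header_line.lstrip("#").split()

def to_awk_expression(expression, columns):
    """Replace each column name with its awk field reference ($1, $2, ...)."""
    index = {name: i for i, name in enumerate(columns, start=1)}
    pattern = r"\b(" + "|".join(re.escape(name) for name in columns) + r")\b"
    return re.sub(pattern, lambda m: "$" + str(index[m.group(1)]), expression)

def build_awk_program(header_line, expression):
    """Build an awk program that prints the expression for every record."""
    columns = parse_header(header_line)
    return "{print " + to_awk_expression(expression, columns) + "}"
```

With the header from the example above, `build_awk_program("# Date Open High Low Close Volume", "Close - Open")` gives `{print $5 - $2}`, which a wrapper could then hand to awk together with the data stream (header handling is left out of this sketch).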
Features
1. Streaming expressions: parametrized running/total sum/average/maximum [2].
2. Aggregators: first, last, min, max, count.
3. Modules: deque.
4. Builds to self-contained 2k LOC portable Python (2.7, 3.3+) scripts.
5. All together: zero-configuration extensible SQL in the command line. It is readable and
faster than generic Python/Cython code (even after Shed Skin) and Perl.
[2] Moving maximum in linear time with a deque implemented on top of awk associative arrays.
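The linear-time moving maximum mentioned in the footnote is the classic monotonic-deque technique. Below is a minimal Python sketch of that technique using `collections.deque`; tabtools implements its deque on top of awk associative arrays, so this shows the idea only, and the function name `moving_max` is an assumption.

```python
from collections import deque

def moving_max(values, window):
    """Yield the maximum of the last `window` values for each incoming value."""
    candidates = deque()  # (index, value) pairs with strictly decreasing values
    for i, x in enumerate(values):
        # Values not larger than x can never be the maximum again.
        while candidates and candidates[-1][1] <= x:
            candidates.pop()
        candidates.append((i, x))
        # Drop the front candidate once it falls out of the window.
        if candidates[0][0] <= i - window:
            candidates.popleft()
        yield candidates[0][1]
```

Each value is appended once and removed at most once, so the total work is linear in the length of the stream regardless of the window size.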
Solutions comparison
Dell XPS 15, 16 GB RAM, 8 CPUs (times in seconds):
                          Python  PostgreSQL    gawk    mawk  Tabtools
Read time                  104.4       180.3       0       0         0
Q1: "max" time                 0        15.2    22.8    12.2      12.8
Q2: "group + avg" time         0         5.8    30.5    12.6      26.6 [3]
Q3: "filter + sort" time    21.3        33.6   174.2    36.3      33.5
Total, sec.                125.7       243.9   227.5    61.1      72.9
[3] Uses Ω(n log(n)) complexity instead of Ω(n). Could be improved.
Data description
Credit Suisse (NYSE:CS) daily stock data from Yahoo Finance: 'CS.csv' + 'cs.tsv'.
cat cs.tsv | tgrp \
-k "Week=strftime(\"%U\", DateEpoch(Date))" \
-g "Date=FIRST(Date)" \
-g "Open=FIRST(Open)" \
-g "High=MAX(High)" \
-g "Low=MIN(Low)" \
-g "Close=LAST(Close)" \
-g "Volume=SUM(Volume)" \
| ttail \
| tsrt -k Date:desc \
| tpretty
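What the tgrp call computes can be sketched in Python as follows: daily rows are grouped by week number and reduced to one weekly bar each. The toy prices are invented for illustration, and the helper names `week_of` and `weekly_bars` are assumptions of this sketch, not part of tabtools.

```python
from datetime import datetime
from itertools import groupby

# Toy daily rows (Date, Open, High, Low, Close, Volume); the numbers are made up.
daily = [
    ("2014-02-20", 10.0, 11.0, 9.5, 10.5, 100.0),
    ("2014-02-21", 10.5, 12.0, 10.0, 11.5, 150.0),
    ("2014-02-24", 11.5, 11.8, 10.8, 11.0, 120.0),
    ("2014-02-25", 11.0, 11.2, 10.1, 10.2, 80.0),
]

def week_of(row):
    # Same key as the tgrp call: week of the year, weeks starting on Sunday (%U).
    return datetime.strptime(row[0], "%Y-%m-%d").strftime("%U")

def weekly_bars(rows):
    """Collapse consecutive rows sharing a week key into one bar per week."""
    bars = []
    for week, group in groupby(rows, key=week_of):
        group = list(group)
        bars.append({
            "Week": week,
            "Date": group[0][0],                 # FIRST(Date)
            "Open": group[0][1],                 # FIRST(Open)
            "High": max(r[2] for r in group),    # MAX(High)
            "Low": min(r[3] for r in group),     # MIN(Low)
            "Close": group[-1][4],               # LAST(Close)
            "Volume": sum(r[5] for r in group),  # SUM(Volume)
        })
    return bars
```

Like a streaming group-by, `itertools.groupby` only merges adjacent rows with the same key, so the input is assumed to be sorted by date.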
Demo Time
Demo: metrics
1. Moving average for window sizes 200 and 50.
2. Exponential moving average for window sizes 26 and 13.
3. MACD(26, 12, 9) histogram.
4. Moving maximum and minimum for window size 14.
5. Fast and Slow Stochastics.
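The first three metrics can be sketched in a few lines of Python. This is a minimal illustration under common textbook conventions, not the demo's tabtools expressions: the EMA here uses alpha = 2 / (window + 1) and is seeded with the first value, both of which are assumptions, and the function names are made up.

```python
from collections import deque

def sma(values, window):
    """Simple moving average over the last `window` values (shorter at the start)."""
    recent, total, out = deque(), 0.0, []
    for x in values:
        recent.append(x)
        total += x
        if len(recent) > window:
            total -= recent.popleft()
        out.append(total / len(recent))
    return out

def ema(values, window):
    """Exponential moving average, alpha = 2 / (window + 1), seeded with the first value."""
    alpha, out, current = 2.0 / (window + 1), [], None
    for x in values:
        current = x if current is None else current + alpha * (x - current)
        out.append(current)
    return out

def macd_histogram(values, slow=26, fast=12, signal=9):
    """MACD line (fast EMA minus slow EMA) minus its own signal-period EMA."""
    macd = [f - s for f, s in zip(ema(values, fast), ema(values, slow))]
    return [m - s for m, s in zip(macd, ema(macd, signal))]
```

All three are single-pass running computations, which is why they map naturally onto streaming expressions.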
Demo: plot (expected and actual)
Thank you!
Kirill Pavlov <k@p99.io>, Recruiter, Terminal 1.
GitHub: @pavlov99 | Presentation: 2017-10-22-codeconf | tabtools