Daniel Molnar @ Oberlo/Shopify / Data Natives @ Berlin @ 2018-11-22
Where I'm coming from
• senior data analytics engineer,
• head of data and analytics,
• senior applied and data scientist,
• data analyst,
• or just data janitor.
Perspective
• rounded, not complete,
• slow, old, stupid and lazy and
tl;dr (new)
• KISS is the philosophy,
• take the long view, invest in durable knowledge,
• strive for fast and good enough,
• just because you can doesn't mean you should,
• figure out what to worry about,
• you are not Google.
it used to be a hype
now this is a war
nobody's your friend
they want your money and data (preferably both locked in)
Things you worry about:
• machine learning,
• deep learning,
• GDPR.
Things you should really worry about:
• machine learning adblockers,
• deep learning ELT,
• GDPR, CRM (yes, CRM).
The Data Janitor Returns | Daniel Molnar | DN18
AGGREGATE
& LABEL
Don't skip
leg day.
Do
make
programmatic KPI definitions.
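One way to read "programmatic KPI definitions" is to keep each KPI as a named, version-controlled query rather than a sentence in a wiki. A minimal sketch, assuming a hypothetical `orders` table and two illustrative KPIs (the schema, names and figures are invented, not from the talk):

```python
import sqlite3

# Hypothetical KPI registry: each KPI is a name plus the exact SQL that defines it.
KPI_DEFINITIONS = {
    "paid_orders": "SELECT COUNT(*) FROM orders WHERE status = 'paid'",
    "average_order_value": (
        "SELECT ROUND(AVG(amount), 2) FROM orders WHERE status = 'paid'"
    ),
}


def compute_kpis(conn, definitions=KPI_DEFINITIONS):
    """Run every KPI query and return {kpi_name: value}."""
    return {name: conn.execute(sql).fetchone()[0] for name, sql in definitions.items()}


# Illustrative data only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, status TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "paid", 20.0), (2, "paid", 40.0), (3, "refunded", 15.0)],
)
kpis = compute_kpis(conn)
print(kpis)
```

Because the definition lives in code, a change to what counts as a "paid order" shows up as a diff instead of a debate.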
Look at the *** data
Toolset
Python,
(P)SQL,
Metabase.
Usual suspect: NPS
• one, simple number you can squint at,
• sampling is skewed,
• answer is unsure,
• easy to hack step function¹,
MONKEYPATCH: look at the change of the distro.
¹ Eve Rajca aka @EveTheAnalyst and Jacques Mattheij aka @jmattheij
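The monkeypatch can be sketched as follows: compute NPS the usual way (share of promoters scoring 9-10 minus share of detractors scoring 0-6), then also compare the full score distribution between two periods. The sample scores are invented, and total variation distance is only one possible way to summarize the shift; the slides do not prescribe a measure.

```python
from collections import Counter


def nps(scores):
    """Net Promoter Score: % promoters (9-10) minus % detractors (0-6)."""
    promoters = sum(1 for s in scores if s >= 9)
    detractors = sum(1 for s in scores if s <= 6)
    return 100.0 * (promoters - detractors) / len(scores)


def distribution(scores):
    """Share of answers for each score from 0 to 10."""
    counts = Counter(scores)
    return [counts.get(s, 0) / len(scores) for s in range(11)]


def distribution_shift(before, after):
    """Total variation distance between two score distributions (0 = identical)."""
    pairs = zip(distribution(before), distribution(after))
    return 0.5 * sum(abs(b - a) for b, a in pairs)


# Invented samples: identical NPS, very different shape.
last_month = [9, 9, 8, 8, 7, 7, 6, 6, 10, 5]
this_month = [10, 10, 10, 7, 8, 8, 0, 0, 0, 7]

print(nps(last_month), nps(this_month))
print(distribution_shift(last_month, this_month))
```

Both samples produce the same headline number, while the distribution comparison shows that lukewarm answers were replaced by extremes.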
Google
Analytics?
Hero of the day
Martin Loetzsch
@martin_loetzsch
-=-
KPIs for e-commerce startups
Data Science in Early Stage
Startups: the Struggle to Create
Value
https://github.com/mara
LEARN &
OPTIMIZE
Half of the time when companies
say they need "AI" what they really
need is a SELECT clause with
GROUP BY. You're welcome.
— Mat Velloso @matvelloso (Technical Advisor to CTO at Microsoft)
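For a sense of how far a SELECT with GROUP BY goes, here is a small sketch using Python's built-in `sqlite3` module. The `events` table, its columns and its rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (country TEXT, converted INTEGER)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [("DE", 1), ("DE", 0), ("DE", 0), ("HU", 1), ("HU", 1), ("LT", 0)],
)

# Visits and conversion rate per country: no model required.
rows = conn.execute(
    """
    SELECT country,
           COUNT(*)       AS visits,
           AVG(converted) AS conversion_rate
    FROM events
    GROUP BY country
    ORDER BY conversion_rate DESC, country
    """
).fetchall()

for country, visits, rate in rows:
    print(f"{country}: {visits} visits, {rate:.0%} converted")
```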
Don't do A/B tests
99% of the time it will not be worth doing
... conversion rate is 2% ... detecting
a relative change of 1% requires an
experiment with 12 million users ...
— Simon Jackson (Booking.com)
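The order of magnitude in the quote can be checked with the textbook sample-size approximation for comparing two proportions. The significance level (5%, two-sided) and power (80%) are assumptions; the quote does not say which settings produced the 12 million figure, so the result is not expected to match it exactly.

```python
from math import ceil
from statistics import NormalDist


def users_per_variant(baseline, relative_change, alpha=0.05, power=0.8):
    """Approximate sample size per variant for a two-sided two-proportion z-test."""
    p1 = baseline
    p2 = baseline * (1 + relative_change)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for the test
    z_power = NormalDist().inv_cdf(power)          # quantile for the desired power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_power) ** 2 * variance / (p2 - p1) ** 2)


# Assumed settings: 2% baseline conversion, 1% relative change (2.00% -> 2.02%).
n = users_per_variant(baseline=0.02, relative_change=0.01)
print(f"{n:,} users per variant, {2 * n:,} in total")
```

With these assumed settings the estimate comes out at several million users per variant, the same ballpark as the quoted number.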
R?Shiny.
Usual suspects
• Non-reproducible experiments and tests.
• R hodgepodge in production.
• Beliefs hidden as implicits in models.
ML~AI~DEEP*
You don't have (enough) data.
Make your own data points!
Deploy good enough fast?
Deep learn my ***
Do you really need it?
Tensorflow! ...
... so distributed deep learning
can compress porn on the end
device.
Hero of the day
Szilard [Deeper than Deep
Learning] @DataScienceLA
-=-
Better than Deep Learning:
Gradient Boosting Machines
(GBMs)
https://github.com/szilard/benchm-ml
Spark MLlibs GBM implementation is 10x
slower, uses 10x more memory and is buggy/
lower accuracy. Total fucking garbage!
— Szilard [Deeper than Deep Learning] @DataScienceLA
MOVE
STORE
EXPLORE
TRANSFORM
Q: Why are there so many
programmers from Eastern Europe?
A: Slavic pessimism. Everything that
can go wrong will go wrong. With
such a mindset programming comes
naturally.
— Martin Sustrik @sustrik (Creator of ZeroMQ, nanomsg, libdill.)
over
engineering
you get another machine
if you can use
one
Do embrace
dirty reality.
Get cloud agnostic!
• AWS still leads the pack by far,
• Azure will sell anyway, and all will cry,
• Google competes with the cheap and uncooked.
ETL is #solved OMG
• Airflow is an overengineered underperforming nightmare,
• metl for source mappings in magnitude,
• Mara for generic e-commerce,
• night-shift for explicit minimalism.
Showdown
Hero of the day
Mark Litwintschik @marklit82
Summary of the 1.1 Billion Taxi
Rides Benchmarks (500 GB
uncompressed CSV)
https://tech.marksblogg.com
Spark
| Setup | Query Median | QM per vCPU | Cost/hour |
|---|---|---|---|
| 11 x m3.xlarge + HDFS | 14.91 | 0.34 | 27.50 |
| 1 x i3.8xlarge + HDFS | 26.00 | 0.81 | 2.50 |
| 21 x m3.xlarge + HDFS | 32.00 | 0.38 | 5.67 |
| 5 x m3.xlarge + S3 | 466.50 | 23.33 | 1.35 |
| 3 x Raspberry Pi | 1738.00 | 144.83 | |
HDFS. RPi = 1/6 VCPU ~100 EUR. Linear scaling.
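The "QM per vCPU" column appears to be the query median divided by the total number of vCPUs in the cluster. A sketch that recomputes it for three of the Spark rows, assuming 4 vCPUs per m3.xlarge (taken from the AWS instance specification, not from the slides) and 32 per i3.8xlarge (as stated on the One Machine slide):

```python
# vCPUs per instance; the m3.xlarge value is an assumption from the AWS specs.
VCPUS = {"m3.xlarge": 4, "i3.8xlarge": 32}

# (instance count, instance type, query median) copied from the Spark table.
SPARK_SETUPS = [
    (11, "m3.xlarge", 14.91),
    (1, "i3.8xlarge", 26.00),
    (21, "m3.xlarge", 32.00),
]


def median_per_vcpu(count, instance_type, query_median):
    """Query median divided by the cluster's total vCPU count."""
    return query_median / (count * VCPUS[instance_type])


per_vcpu = [median_per_vcpu(*setup) for setup in SPARK_SETUPS]
for (count, instance_type, _), value in zip(SPARK_SETUPS, per_vcpu):
    print(f"{count} x {instance_type}: {value:.3f}")
```

The recomputed values agree with the slide's column to within rounding, which supports this reading of the metric.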
Presto
| Setup | Query Median | QM per vCPU | Cost/hour |
|---|---|---|---|
| 50 x n1-standard-4 | 7.00 | 0.04 | 9.50 |
| 21 x m3.xlarge | 11.50 | 0.14 | 5.67 |
| 10 x n1-standard-4 | 16.00 | 0.36 | 2.09 |
| 1 x i3.8xlarge + HDFS | 15.00 | 0.47 | 2.50 |
| 5 x m3.xlarge + HDFS | 51.50 | 0.26 | 1.35 |
| 50 x m3.xlarge + S3 | 43.50 | 0.22 | 13.50 |
Workhorse in favour. HDFS. 1 machine. Non-linear scaling.
Lazy Evaluation
| Setup | Query Median | Cost/hour |
|---|---|---|
| Redshift, 6 x ds2.8xlarge | 1.91 | 40.80 |
| BigQuery | 2.00 | |
| Amazon Athena | 6.30 | |
| Presto, 50 x n1-standard-4 | 7.00 | 9.50 |
| Spark, 11 x m3.xlarge + HDFS | 14.91 | 27.50 |
The human cost -- in both terms.
One Machine
| Setup | Query Median | QM per vCPU | Cost/hour |
|---|---|---|---|
| ClickHouse | 4.21 | 1.05 | |
| Elasticsearch tuned | 13.14 | 3.29 | |
| Presto, 1 x i3.8xlarge + HDFS | 15.00 | 0.47 | 2.50 |
| Spark, 1 x i3.8xlarge + HDFS | 26.00 | 0.81 | 2.50 |
| Vertica | 32.80 | 8.20 | |
| Elasticsearch | 48.89 | 12.22 | |
| PSQL 9.5 + cstore_fdw | 205.00 | 51.25 | |
Intel Core i5 4670K VS i3.8xlarge (32 VCPUs). Desktop example costs <600 EUR.
Do you
use adblocking?
Do you use
Google Analytics?
9% of the events are lost to ~all third-party trackers
due to adblocking.
Sink > Sieve > Sort
ELT aka SQL on flat files with the minimum amount of code written.
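A minimal sketch of ELT as "SQL on flat files", using only the Python standard library. Reading Sink > Sieve > Sort as "load raw, then filter, then order" is an interpretation, and the CSV content, table name and column names are hypothetical.

```python
import csv
import io
import sqlite3

# Stand-in for a flat file on disk (hypothetical content).
RAW_CSV = io.StringIO(
    "user_id,event,amount\n"
    "1,purchase,20\n"
    "2,pageview,\n"
    "1,purchase,5\n"
    "3,purchase,12\n"
)

conn = sqlite3.connect(":memory:")

# Sink: load the file untouched, every column as text.
conn.execute("CREATE TABLE raw_events (user_id TEXT, event TEXT, amount TEXT)")
conn.executemany(
    "INSERT INTO raw_events VALUES (:user_id, :event, :amount)",
    csv.DictReader(RAW_CSV),
)

# Sieve and sort: all transformation happens in SQL, after loading.
revenue_per_user = conn.execute(
    """
    SELECT user_id, SUM(CAST(amount AS REAL)) AS revenue
    FROM raw_events
    WHERE event = 'purchase'
    GROUP BY user_id
    ORDER BY revenue DESC
    """
).fetchall()
print(revenue_per_user)
```

The only non-SQL code is the loader; typing, filtering and aggregation stay in one query that can be rerun against the raw table at any time.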
BIRD
OF
PREY
Who are you?
• Lip service provider.
• Fake news producer.
• Kingmaker.
Are you the fool
or the grey eminence?
Don't believe the hype.
HR: good people leave.
Marketing
Will this ever get better?
• adblocking,
• CPA silver bullets are gone,
• conversion & attribution are hard nuts,
• FB and GO are not your friends (the 900% on videos),
• but CRM is.
GDPR
• road to hell is paved with good intentions,
• it's about the process, matey,
• mostly fair,
• yes, you have to clean up your mess,
• dunno, wouldn't buy programmatic shares¹.
¹ Eve Rajca aka @EveTheAnalyst and Jacques Mattheij aka @jmattheij
Thank you!
@soobrosa
We're hiring!
visuals: @mrogati, @xkcd, @DorsaAmir, ˙Cаvin 〄, thelearningcurvedotca, JD Hancock, Thomas Hawk,
jonolist, Kalexanderson, Shopify Burst
