A [Incomplete] Data Tools Landscape [for Hackers] in 2015
Wes McKinney @wesmckinn
Data^3 Meeting, Minneapolis, MN
© Cloudera, Inc. All rights reserved.
  
This talk
•  A partial look at different languages and tools
•  Limiting scope to either:
    • Permissively licensed open source software, e.g. Apache-licensed (OSS)
    • Non-dual-licensed copyleft OSS (e.g. GPL)
    • i.e. “do you [the community] have any incentive to create patches?”
•  Some trends (that I see, anyway)
•  Challenges and opportunities
  
Who am I?
•  Python data firestarter
•  Financial analytics in R / Python starting 2007
•  pandas project born of frustration in 2008
•  2010-2012
    • Hiatus from gainful employment
    • Make pandas ready for primetime
    • Write "Python for Data Analysis"
  
Who am I? (cont’d)
•  2013-2014: Co-founder/CEO of DataPad (analytics startup, with early pandas collaborator Chang She)
•  Late 2014: DataPad team joins Cloudera
•  Now: backend systems and all-things-Python @ Cloudera
  
SQL: Still a lingua franca
•  “SQL: the Fortran of Analytics”
•  Often a concise, declarative way to express data transforms, analytics, etc.
•  Relatively easy to parse, analyze
•  SQL recently has seen resurgence with focus on interactive-speed SQL engines, especially on top of HDFS/Hadoop
•  Relevant and impactful features (e.g. JSON support) still arriving in established RDBMS like PostgreSQL
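
To make the "concise, declarative" point concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table name, column names, and rows are hypothetical and chosen only for illustration; the same query shape should carry over to other SQL engines.

```python
import sqlite3

# Hypothetical table and data, made up for this illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trades (symbol TEXT, qty INTEGER, price REAL)")
conn.executemany(
    "INSERT INTO trades VALUES (?, ?, ?)",
    [("AAPL", 10, 100.0), ("AAPL", 5, 102.0), ("MSFT", 20, 40.0)],
)

# The query states *what* is wanted (total notional per symbol),
# not how to loop over rows or keep running totals.
rows = conn.execute(
    """
    SELECT symbol, SUM(qty * price) AS notional
    FROM trades
    GROUP BY symbol
    ORDER BY symbol
    """
).fetchall()
conn.close()

print(rows)
```

The engine is free to choose how to scan, group, and sort, which is part of why SQL is comparatively easy to analyze and optimize.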
  
Historical Python Context
•  Scientific / HPC computing focus in 1990s, 2000s
    • Python web community developed in parallel, matured faster!
•  NumPy became community standard in 2005, born from Numeric + Numarray
•  Pyrex, later Cython, easier C / C++ wrapping
•  f2py: easy Fortran wrapping
•  Anaconda distribution
    • Finally solving Python deployment for all
  
Essential Python stack
•  NumPy: low-level array processing
•  SciPy: essential computational algos
•  pandas: data wrangling
•  scikit-learn: machine learning
•  matplotlib (+ add-ons, like seaborn): visualization
•  numba: numeric hotspot LLVM compiler
•  Domain-specific toolkits: nltk, scikit-image, statsmodels, Theano, PyCUDA/PyOpenCL and many others
  
pandas
•  A Pythonic take on the classic R “data frame” data structure
•  Critical piece to make the Python stack useful in everyday work
•  Added axis metadata / labeling for representing multidimensional data
•  Focus on easy data wrangling, IO, plotting, and basic analytics
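
As a rough illustration of what axis labeling buys you, the sketch below assumes pandas is installed; the labels and values are made up. It shows arithmetic aligning on labels instead of position, followed by a basic group-and-sum wrangling step.

```python
import pandas as pd

# Hypothetical data; the labels are invented for illustration.
a = pd.Series([1.0, 2.0, 3.0], index=["x", "y", "z"])
b = pd.Series([10.0, 20.0], index=["z", "x"])

# Arithmetic aligns on axis labels, not on position; a label that is
# missing from one side yields NaN in the result.
total = a + b

df = pd.DataFrame({"key": ["a", "b", "a", "b"], "value": [1, 2, 3, 4]})

# Basic wrangling: split rows by key, then sum each group.
sums = df.groupby("key")["value"].sum()

print(total)
print(sums)
```

Here `total` carries the union of the two indexes, so no manual bookkeeping is needed to line up "x" and "z" before adding.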
  
Jeff Reback’s “pandas as PyData middleware” diagram
  
Newer / Up-and-coming Python projects
•  Bokeh: interactive / reactive visualization for the web
•  Blaze: uniform data expression API
•  Odo: easy data migration
  
R Project
•  Trusted base of statistics libraries
    • Latest and greatest stats research often hits R first
•  RStudio
•  The "Hadley stack"
    • Visualization: ggplot2 (static) and ggvis (interactive)
    • Data Wrangling: dplyr
    • legacy: plyr / reshape2
  
dplyr
•  Started late 2012 by Hadley Wickham, supported by RStudio
•  Composable / chainable analytics and data wrangling expressions
•  In-memory and SQL backends
•  Has attracted folks back to R from Python in a lot of cases
  
Some other great R stuff
•  shiny: interactive web apps in R
•  Rcpp
•  data.table
•  xts
  
IPython
•  IPython started out as a better interactive Python
•  Grew to include web-based computational notebook, GUI console, and other components
    • (Google even integrated into Google Drive!)
•  IPython Notebook architecture enabled “kernel” processes to be written in nearly any language (even bash!)
•  How to build community beyond Python?
  
Enter Jupyter
•  http://jupyter.org
•  Breaking out notebook machinery into a standalone non-Python-specific project
•  Enable project components to evolve at own pace, without large monolithic releases
•  JupyterHub: upcoming multi-user notebook server
  
A few words about Hadoop + Big Data
  
Apache Spark
•  Originated from Berkeley AMPLab
•  General purpose distributed memory-centric data processing framework
•  Official APIs: Scala, Java, Python
Source: databricks.com
  
Spark 1.3: DataFrames!
•  R/pandas-inspired API for tabular data manipulation in Scala, Python, etc.
•  Logical operation graphs rewritten internally in more efficient form
•  Good interop with Spark SQL
•  Some interoperability with pandas
•  Will help close the semantic gap between Spark and R/Python
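
The idea of rewriting a logical operation graph can be sketched with a toy in plain Python. This is not Spark's optimizer or API: the `Filter` and `Select` classes and the `optimize` and `execute` helpers are hypothetical names invented here, and the single rewrite rule (run a filter before a projection when the filtered column survives the projection) is only one simple example of such a rewrite.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass(frozen=True)
class Filter:
    """Keep rows whose value in `column` satisfies `predicate`."""
    column: str
    predicate: Callable[[object], bool]


@dataclass(frozen=True)
class Select:
    """Keep only the named columns."""
    columns: Tuple[str, ...]


def optimize(plan):
    """Toy rule: move a Filter ahead of the Select that precedes it,
    provided the Select keeps the column the Filter reads."""
    plan = list(plan)
    changed = True
    while changed:
        changed = False
        for i in range(len(plan) - 1):
            first, second = plan[i], plan[i + 1]
            if (isinstance(first, Select) and isinstance(second, Filter)
                    and second.column in first.columns):
                plan[i], plan[i + 1] = second, first
                changed = True
    return plan


def execute(plan, rows):
    """Run the plan eagerly over a list of dict rows."""
    for op in plan:
        if isinstance(op, Filter):
            rows = [r for r in rows if op.predicate(r[op.column])]
        else:
            rows = [{c: r[c] for c in op.columns} for r in rows]
    return rows


# Hypothetical rows, made up for illustration.
people = [
    {"name": "Ann", "age": 41, "city": "Minneapolis"},
    {"name": "Bob", "age": 25, "city": "St. Paul"},
]
plan = [Select(("name", "age")), Filter("age", lambda v: v > 30)]
optimized = optimize(plan)

print(execute(optimized, people))
```

Because the user only describes the operations, the system can reorder them as long as the result is unchanged; in this toy, the reordered plan projects fewer rows.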
  
Some problems in need of solving
•  A Shiny-like quick-and-dirty data app development framework for Python
•  IPython/Jupyter notebook collaboration
•  A community-standard, Apache-licensed C/C++ data frame library with best-in-class performance
•  Ubiquitous support for emerging analytical on-disk storage standards like Parquet
  
Other interesting stuff to look at
•  Torch7 / LuaJIT: high performance ML / deep learning on GPUs
    • Facebook AI group open sourced several ML modules
•  Apache Flink
    • Up-and-coming Scala-based data processing framework
    • Some overlap with Spark use cases
  
Some other interesting industry trends
•  Microsoft
    • Acquired Revolution Analytics, leading commercial R vendor
    • Launched Azure ML: R, Python, and more on Azure cloud
•  Dato (formerly GraphLab)
    • Faster, more scalable machine learning, with Python interface (paid commercial product, free for non-commercial/academic use)
    • Largest-ever VC investment in a data tools company betting big on Python
•  Databricks
    • Offering cloud Spark-notebook-as-a-service
  
Thank you
@wesmckinn
  
