Wes McKinney @wesmckinn
DATA SCIENCE WITHOUT BORDERS
WES MCKINNEY @WESMCKINN
JupyterCon | August 2017
ME
IMPORTANT LEGAL INFORMATION
• The information presented here is offered for informational purposes only and should not be used for any other purpose
(including, without limitation, the making of investment decisions). Examples provided herein are for illustrative purposes
only and are not necessarily based on actual data. Nothing herein constitutes: an offer to sell or the solicitation of any
offer to buy any security or other interest; tax advice; or investment advice. This presentation shall remain the property of
Two Sigma Investments, LP (“Two Sigma”) and Two Sigma reserves the right to require the return of this presentation at
any time.
• Some of the images, logos or other material used herein may be protected by copyright and/or trademark. If so, such
copyrights and/or trademarks are most likely owned by the entity that created the material and are used purely for
identification and comment as fair use under international copyright and/or trademark laws. Use of such image, copyright
or trademark does not imply any association with such organization (or endorsement of such organization) by Two Sigma,
nor vice versa.
• Copyright © 2017 TWO SIGMA INVESTMENTS, LP. All rights reserved
THINKING ON THE LAST 10 YEARS
2007 2017
CLOSED SOURCE OPEN SOURCE
A shared front-end
for data science
THE NEXT 10 YEARS AND BEYOND
2017 2027 …
THE AI ARMS RACE
CHANGING HARDWARE LANDSCAPE
DISK
PROCESSING
MEMORY
DATA SCIENCE LANGUAGE “SILOS”
FRONT-END
PYTHON R JVM JULIA …
WHAT’S IN A SILO?
• STORAGE / DATA ACCESS
• DATA STRUCTURES / IN-MEMORY FORMATS
• GENERAL COMPUTE ENGINE(S)
• ADVANCED ANALYTICS
pandas NumPy
pandas
NumPy
pandas
scikit-learn
RENOVATING PANDAS
MAKING THE SILOS “SMALLER”
FRONT-END
PYTHON R JVM JULIA
?
…
PROGRAMMING LANGUAGES
AS USER INTERFACES
GRAPHIC: Iceberg under sea (only top
part visible to naked eye)
R
df <- read_csv(…)
df %>% group_by(…) %>% summarise(…)
PYTHON
df = read_csv(…)
df.groupby(…).aggregate(…)
SAME ANALYSIS, DIFFERENT IMPLEMENTATION
A SHARED RUNTIME FOR DATA SCIENCE
FRONT-END
PYTHON R JVM JULIA
SHARED DATA SCIENCE RUNTIME
…
FROM IDEA TO ACTION
PART 1: STANDARD IN-MEMORY FORMAT
R
PYTHON
JVM
PORTABLE DATA FRAME
Non-Portable Data Frames
…
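One way to picture a portable data frame is as a set of named columns, each stored as a contiguous buffer of fixed-width values plus a bitmap that marks missing entries. The sketch below uses only the Python standard library. The `Column` class and the helpers `make_column`, `is_valid` and `column_sum` are hypothetical names invented for illustration; this is not the Arrow format or its API.

```python
from array import array
from dataclasses import dataclass


@dataclass
class Column:
    name: str
    typecode: str        # array typecode, e.g. "q" for int64, "d" for float64
    values: array        # contiguous buffer of fixed-width values
    validity: bytearray  # one bit per row: 1 = present, 0 = missing


def make_column(name, typecode, items):
    """Pack Python values (None = missing) into contiguous buffers."""
    # Missing slots still occupy space in the values buffer; 0 is a placeholder.
    values = array(typecode, [0 if v is None else v for v in items])
    validity = bytearray((len(items) + 7) // 8)
    for i, v in enumerate(items):
        if v is not None:
            validity[i // 8] |= 1 << (i % 8)
    return Column(name, typecode, values, validity)


def is_valid(col, i):
    """Return True if row i holds a value rather than a missing marker."""
    return bool(col.validity[i // 8] >> (i % 8) & 1)


def column_sum(col):
    """Sum the present values, skipping rows marked missing."""
    return sum(v for i, v in enumerate(col.values) if is_valid(col, i))


prices = make_column("price", "d", [1.5, None, 2.5])
print(column_sum(prices))
```

Because the layout is just typed buffers and a bitmap, any language that agrees on the typecodes and bit order could read the same bytes without going through Python objects.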
PART 2: ZERO COPY INTERCHANGE
R PYTHON JVM
SHARED MEMORY + STANDARD MEMORY FORMATS
…
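Zero-copy interchange means a consumer reads the producer's buffer in place instead of receiving a converted copy. The following is a minimal single-process sketch using Python's built-in buffer protocol. `export_column` and `total` are hypothetical names, and a real system would place the buffer in memory shared between processes or language runtimes rather than inside one interpreter.

```python
from array import array


def export_column(values):
    """Producer: hand out a view onto the existing buffer, not a copy."""
    return memoryview(values)


def total(view):
    """Consumer: read the values in place through the view."""
    return sum(view)


prices = array("q", [10, 20, 30])  # "q" = signed 64-bit integers
view = export_column(prices)

# A write through the producer's handle is visible through the view,
# which shows that both names refer to the same memory.
prices[0] = 100
print(total(view))
```

The view also carries the element format (`view.format`) and size (`view.nbytes`), which is the kind of metadata a standard memory format has to pin down so that every consumer interprets the bytes the same way.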
PART 3: HIGH PERFORMANCE DATA ACCESS
BINARY
COLUMNAR
CSV
SQL
PORTABLE DATA FRAME
Storage Formats / Databases
…
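The picture above has several storage formats feeding one in-memory representation. A small sketch of that idea, using only the standard library, is to have a CSV reader and a SQL reader both return the same column-oriented structure (here a plain dict of column name to list). The function names and the `sales` table are hypothetical, and this says nothing about performance; it only shows the shared target.

```python
import csv
import io
import sqlite3


def columns_from_rows(names, rows):
    """Turn an iterable of rows into a dict of column name -> list."""
    columns = {name: [] for name in names}
    for row in rows:
        for name, value in zip(names, row):
            columns[name].append(value)
    return columns


def read_csv_columns(text):
    """CSV source: the first row is the header; values stay as strings."""
    reader = csv.reader(io.StringIO(text))
    names = next(reader)
    return columns_from_rows(names, reader)


def read_sql_columns(conn, query):
    """SQL source: column names come from the cursor description."""
    cursor = conn.execute(query)
    names = [d[0] for d in cursor.description]
    return columns_from_rows(names, cursor)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (city TEXT, qty INTEGER)")
conn.executemany("INSERT INTO sales VALUES (?, ?)", [("NYC", 3), ("SF", 5)])

from_csv = read_csv_columns("city,qty\nNYC,3\nSF,5\n")
from_sql = read_sql_columns(conn, "SELECT city, qty FROM sales ORDER BY rowid")
print(from_csv)
print(from_sql)
```

Note that the CSV path yields strings while the SQL path yields typed values; a real reader would also have to settle on column types, which this sketch leaves out.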
PART 4: FLEXIBLE COMPUTATION ENGINE
• Zero-overhead User-defined Functions
• Portable Operator “Graphs”
• “Embeddable” in Larger Systems
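To make the "portable operator graph" bullet concrete, here is a minimal sketch in which a query plan is plain data (a list of operator nodes) and a tiny interpreter evaluates it over columns. The plan format, the operator names `filter_gt` and `group_sum`, and the `run` function are all assumptions made up for this illustration; user-defined functions and embedding are not covered.

```python
import json
from collections import defaultdict

# A hypothetical plan format: a list of operator nodes, each plain data.
PLAN = [
    {"op": "filter_gt", "column": "qty", "value": 0},
    {"op": "group_sum", "key": "city", "value": "qty"},
]


def run(plan, columns):
    """Evaluate the plan node by node over a dict of column name -> list."""
    for node in plan:
        if node["op"] == "filter_gt":
            source = columns[node["column"]]
            keep = [i for i, v in enumerate(source) if v > node["value"]]
            columns = {name: [vals[i] for i in keep] for name, vals in columns.items()}
        elif node["op"] == "group_sum":
            totals = defaultdict(int)
            for key, value in zip(columns[node["key"]], columns[node["value"]]):
                totals[key] += value
            columns = {node["key"]: list(totals), node["value"]: list(totals.values())}
        else:
            raise ValueError(f"unknown operator: {node['op']}")
    return columns


data = {"city": ["NYC", "SF", "NYC", "SF"], "qty": [1, 2, 3, -5]}
# The plan survives a JSON round trip, so another runtime could receive it.
result = run(json.loads(json.dumps(PLAN)), data)
print(result)
```

Keeping the plan as data rather than as Python calls is what would let an R or JVM front end build the same graph and hand it to one engine.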
APACHE ARROW
Language-agnostic Data Frame Format
Zero-Copy Interchange
BUILDING THE ARROW FORMAT
• “Superset” of representations supported by R, pandas, SQL engines
• Optimized for CPU cache affinity
• ASF Governance: Open + Transparent Community Project
FEATHER: MINIMALIST ARROW ON DISK
Some Arrow OSS Users
Feather Format
Ray Project
BUILDING THE FUTURE
Data Science Without Borders (JupyterCon 2017)
THANK YOU
WES MCKINNEY @WESMCKINN
Apache Arrow: http://arrow.apache.org

