PySpark for Time Series Analysis
David Palaitis
Two Sigma Investments
About Me
Important Legal Information
The information presented here is offered for recruiting purposes only and
should not be used for any other purpose (including, without limitation, the
making of investment decisions). Examples provided herein are for illustrative
purposes only and are not necessarily based on actual data. Nothing herein
constitutes an offer to sell or the solicitation of any offer to buy any security or
other interest. We consider this information to be confidential and not for
redistribution or dissemination. Some of the images, logos or other material
used herein may be protected by copyright and/or trademark. If so, such
copyrights and/or trademarks are most likely owned by the entity that created
the material and are used purely for identification and comment as fair use
under international copyright and/or trademark laws. Use of such image,
copyright or trademark does not imply any association with such organization (or
endorsement of such organization) by Two Sigma, nor vice versa.
New Directions in pySpark for Time Series Analysis: Spark Summit East talk by David Palaitis
Time Series
IoT feeds
sensor data
economic data
An ordered sequence of values of a variable
Time Series Analysis
Time Series at Two Sigma
Millions of Time Series
Big and Small (1GB – 1PB)
Narrow (10 columns) and Wide (1MM columns)
Evenly and Unevenly Spaced Observations
Let’s start from the beginning …
Examples!
What’s Missing?
You can’t even do “Word Count”
“Word Count”!
What’s missing? Time.
Windowed Aggregations
Temporal Joins
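A temporal join pairs each row of one series with the row of another series that is closest in time, rather than requiring exactly equal keys. The sketch below is a minimal single-machine illustration in plain Python, not Flint's API: the function name `asof_join`, the integer millisecond timestamps, and the sample data are all hypothetical, and it assumes both inputs are already sorted by time.

```python
from bisect import bisect_right
from typing import Any, List, Optional, Tuple

Row = Tuple[int, Any]  # (timestamp in milliseconds, value); an assumed layout


def asof_join(left: List[Row], right: List[Row]) -> List[Tuple[int, Any, Optional[Any]]]:
    """For each left row, attach the latest right value at or before its timestamp.

    Both inputs must be sorted by timestamp. If no right row is early enough,
    the joined value is None.
    """
    right_times = [t for t, _ in right]
    joined = []
    for t, a in left:
        # Number of right rows whose timestamp is <= t.
        i = bisect_right(right_times, t)
        b = right[i - 1][1] if i > 0 else None
        joined.append((t, a, b))
    return joined


# Hypothetical data: events joined against the most recent sensor reading.
events = [(1000, "open"), (2500, "close"), (4000, "open")]
readings = [(1500, 10.0), (2500, 10.5), (3000, 11.0)]
joined = asof_join(events, readings)
```

The first event has no earlier reading, so it joins to `None`; the others pick up the most recent reading at or before their own timestamp. An exact-match join on the same data would match only the row at 2500.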
w is a window specification e.g. 500ms, 5s, 3 business days
RDD[(K,V)] -> RDD[(K,Seq[V])]
reduceByWindow(f: (V, V) => V, w): RDD[(K, V)] => RDD[(K, V)]
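One way to read the `reduceByWindow` signature is as a fold over a trailing time window. The sketch below illustrates that reading in plain Python on a single machine. It is not the Flint implementation: the name `reduce_by_window`, the choice of a trailing window, and the sample data are assumptions made for illustration.

```python
from datetime import datetime, timedelta
from functools import reduce
from operator import add
from typing import Callable, List, Tuple, TypeVar

V = TypeVar("V")


def reduce_by_window(
    rows: List[Tuple[datetime, V]],
    f: Callable[[V, V], V],
    w: timedelta,
) -> List[Tuple[datetime, V]]:
    """For each row (t, v), fold f over that row and the earlier rows
    whose timestamps are no older than t - w (an assumed trailing window)."""
    rows = sorted(rows, key=lambda r: r[0])
    out = []
    start = 0  # index of the oldest row still inside the current window
    for i, (t, _) in enumerate(rows):
        # Advance past rows that have fallen out of the window.
        while rows[start][0] < t - w:
            start += 1
        window_values = [v for _, v in rows[start:i + 1]]
        out.append((t, reduce(f, window_values)))
    return out


# Hypothetical unevenly spaced observations, summed over a 5-second window.
t0 = datetime(2017, 2, 8, 9, 30, 0)
rows = [
    (t0, 1),
    (t0 + timedelta(seconds=1), 2),
    (t0 + timedelta(seconds=2), 3),
    (t0 + timedelta(seconds=10), 4),
]
result = reduce_by_window(rows, add, timedelta(seconds=5))
```

The output has one row per input row, which matches the `RDD[(K, V)] => RDD[(K, V)]` shape. The last observation sits more than five seconds after the others, so its window contains only itself.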
https://github.com/twosigma/flint
Getting Started …
Looking ahead.
Thank You.
Find me after the talk to see Flint in action.
