Sparksheet -
Transforming Spreadsheets
into Spark Data Frames
Oscar Castañeda-Villagrán
Universidad del Valle de Guatemala
About
• Researcher at Universidad del Valle de Guatemala.
• Research Interests:
• Program Transformation,
• Programming Education Research,
• Online Learning to Rank.
Prototyping Spark programs with Excel
http://bit.ly/2e5GmyY
http://bit.ly/2edYfMs
http://bit.ly/2e0TZA8
Agenda
• Problem Statement and Motivation
• Architecture
• Program Transformation
• Pipeline
• Code-to-Code Transformation
• Parsing Excel Formulas
• Grammar
• Parse Tree
• XLParser
• Excel as a DSL
• Generating Code
• Demo
• Q&A
Disclaimer(s)
• Ongoing research …
• We will focus on how to create a Program
Transformation Pipeline.
Problem Statement
Spark programs can be prototyped in Excel but manually
translating Excel formulas to Spark programs is tedious
and error-prone.
Motivation
• “Straight path” between column-oriented Excel “programs”
and Spark programs that make use of the DataFrame API.
• But, manually translating Excel formulas to Spark is tedious
and error-prone.
• What if: Excel compiler?
Problem Statement
Given that column-oriented Excel applications can be
manually translated to Spark programs, …
… find a way to automate translation of Excel formulas so
that data pipelines can be prototyped in Excel …
… and Scala/Python code generated to run in Spark.
Motivation
Automatically translate
Excel Formulas to …
Spark.
Data!
http://bit.ly/2expmoF
Architecture
[Architecture diagram. Labels: Excel Formulas, Columnar Data, Take ES snapshot, Restore ES snapshot, Black Box, Data!, Refine, Code]
Tomorrow: Spark Cluster with Elasticsearch Inside
http://bit.ly/2em6RUK
http://bit.ly/2e5H1jL
Program Transformation
“A program transformation is any
operation that takes a computer program
and generates another program.”
https://en.wikipedia.org/wiki/Program_transformation
Program Transformation Pipeline
http://bit.ly/2e0TZA8
http://bit.ly/2efaib4
http://bit.ly/2di0cFq
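A pipeline of this kind can be sketched as plain function composition. This is a minimal illustration under stated assumptions, not the talk's implementation: the stage names (`strip_equals`, `parse_call`, `emit_scala`) are hypothetical, the toy parser only understands a single `NAME(arg, ...)` call, and reading SUM(A,C) as a row-wise `col("A") + col("C")` is an assumption about the column-oriented interpretation of the formula.

```python
from functools import reduce
from typing import Any, Callable

Stage = Callable[[Any], Any]


def pipeline(*stages: Stage) -> Stage:
    """Compose stages left to right into one program transformation."""
    def run(program: Any) -> Any:
        return reduce(lambda acc, stage: stage(acc), stages, program)
    return run


def strip_equals(formula: str) -> str:
    # Stage 1 (hypothetical): normalise the formula text.
    return formula.lstrip("=").strip()


def parse_call(formula: str) -> tuple:
    # Stage 2 (hypothetical): toy parser for NAME(arg, arg, ...) only.
    name, _, rest = formula.partition("(")
    args = [arg.strip() for arg in rest.rstrip(")").split(",")]
    return (name.upper(), args)


def emit_scala(tree: tuple) -> str:
    # Stage 3 (hypothetical): emit a Scala Column expression as text.
    name, args = tree
    if name != "SUM":
        raise ValueError(f"unsupported function: {name}")
    return " + ".join(f'col("{arg}")' for arg in args)


excel_to_scala = pipeline(strip_equals, parse_call, emit_scala)
print(excel_to_scala("=SUM(A,C)"))  # col("A") + col("C")
```

Each stage consumes the output of the previous one, so a stage can be swapped (for example, a real parser in place of `parse_call`) without touching the others.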
Code-to-Code Transformation
“The input to the code generator typically
consists of a parse tree or an
abstract syntax tree.”
https://en.wikipedia.org/wiki/Code_generation_(compiler)
http://bit.ly/2dH0ybF
We need a Grammar!
We need a Parse Tree!
Parse Excel Formulas, then Generate Scala Code
http://bit.ly/2e0TZA8
http://bit.ly/2dH0ybF
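To make "grammar" and "parse tree" concrete, here is a small recursive-descent parser for a toy formula grammar. The grammar is an assumption made for illustration and is far smaller than a real Excel formula grammar; the node shapes (`"ref"`, `"number"`, `"binop"`, `"call"`) and the function names are hypothetical.

```python
import re

# Toy grammar (illustrative assumption, not a real Excel grammar):
#   expr   := term (('+' | '-') term)*
#   term   := factor (('*' | '/') factor)*
#   factor := NUMBER | NAME '(' [expr (',' expr)*] ')' | NAME | '(' expr ')'
TOKEN = re.compile(r"\s*(?:(\d+(?:\.\d+)?)|([A-Za-z_][A-Za-z0-9_]*)|(.))")


def tokenize(text):
    tokens = []
    for number, name, op in TOKEN.findall(text):
        if number:
            tokens.append(("NUMBER", number))
        elif name:
            tokens.append(("NAME", name))
        elif op.strip():  # skip stray whitespace
            tokens.append(("OP", op))
    return tokens


class Parser:
    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        if self.pos < len(self.tokens):
            return self.tokens[self.pos]
        return ("EOF", "")

    def take(self, kind=None, value=None):
        tok = self.peek()
        if (kind and tok[0] != kind) or (value and tok[1] != value):
            raise SyntaxError(f"unexpected token {tok!r}")
        self.pos += 1
        return tok

    def expr(self):
        node = self.term()
        while self.peek() in (("OP", "+"), ("OP", "-")):
            op = self.take()[1]
            node = ("binop", op, node, self.term())
        return node

    def term(self):
        node = self.factor()
        while self.peek() in (("OP", "*"), ("OP", "/")):
            op = self.take()[1]
            node = ("binop", op, node, self.factor())
        return node

    def factor(self):
        kind, value = self.peek()
        if kind == "NUMBER":
            self.take()
            return ("number", value)
        if kind == "NAME":
            self.take()
            if self.peek() != ("OP", "("):
                return ("ref", value)  # plain column reference
            self.take()
            args = []
            if self.peek() != ("OP", ")"):
                args.append(self.expr())
                while self.peek() == ("OP", ","):
                    self.take()
                    args.append(self.expr())
            self.take("OP", ")")
            return ("call", value.upper(), args)
        if (kind, value) == ("OP", "("):
            self.take()
            node = self.expr()
            self.take("OP", ")")
            return node
        raise SyntaxError(f"unexpected token {(kind, value)!r}")


def parse_formula(text):
    parser = Parser(tokenize(text.lstrip("=")))
    tree = parser.expr()
    if parser.peek()[0] != "EOF":
        raise SyntaxError(f"trailing input at {parser.peek()!r}")
    return tree


print(parse_formula("=SUM(A,C)"))
```

The nested tuples are the parse tree: the grammar rules decide its shape, and every later stage works on the tree rather than on the formula text.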
XLParser
"If I have seen further, it is by standing upon the shoulders of giants"
— Sir Isaac Newton
A Grammar for Spreadsheet Formulas Evaluated on Two Large Datasets – Efthimia Aivaloglou, David Hoepelman &
Felienne Hermans, Proceedings of SCAM ’15
Excel Formula: SUM(A,C)
Parse Tree!
http://xlparser.perfectxl.nl/demo
Excel as a DSL
• External DSL: parsed independently.
• XLParser gives us a Parse Tree from an
Excel Formula.
• Given the Parse Tree, generate code!
How do you generate code from
parsed Excel Formulas?
Generating Code
“An elegant way to generate code from an AST
is to write a class for each non-terminal node in
the tree, and then each node in the tree simply
generates the piece of code that it is
responsible for.”
http://www.codeproject.com/Articles/26975/Writing-Your-First-Domain-Specific-Language-Part
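The class-per-node approach from the quote might look like the sketch below. All class names are hypothetical, only `SUM` and binary arithmetic are handled, and the emitted text assumes the Scala target imports `col` and `lit` from `org.apache.spark.sql.functions`; translating SUM as a row-wise addition of columns is likewise an assumption.

```python
from dataclasses import dataclass
from typing import List


class Node:
    """Base class: every node knows how to generate its own piece of code."""

    def generate(self) -> str:
        raise NotImplementedError


@dataclass
class ColumnRef(Node):
    name: str

    def generate(self) -> str:
        return f'col("{self.name}")'


@dataclass
class Number(Node):
    value: str

    def generate(self) -> str:
        return f"lit({self.value})"


@dataclass
class BinOp(Node):
    op: str
    left: Node
    right: Node

    def generate(self) -> str:
        # Always parenthesise: verbose, but never changes meaning.
        return f"({self.left.generate()} {self.op} {self.right.generate()})"


@dataclass
class FunctionCall(Node):
    name: str
    args: List[Node]

    def generate(self) -> str:
        parts = [arg.generate() for arg in self.args]
        if self.name == "SUM" and parts:
            return "(" + " + ".join(parts) + ")"
        raise NotImplementedError(f"no translation for {self.name}")


tree = FunctionCall("SUM", [ColumnRef("A"), ColumnRef("C")])
print(tree.generate())  # (col("A") + col("C"))
```

Supporting another Excel function then means adding a branch or a new class, without touching the nodes that already work.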
Generating Code
A practical way to generate code
is to take a Parse Tree and write
a pretty printer for the target
language.
http://bit.ly/2em73DM
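A pretty printer in this sense is a single recursive function over the parse tree that prints target-language text. The sketch below is one possible shape, not the talk's code: the tuple-based tree format, the function names, and the `df.withColumn` wrapper with a DataFrame called `df` are all assumptions, and only `SUM` is translated.

```python
# Higher number = binds tighter, mirroring Scala's operator precedence.
PRECEDENCE = {"+": 1, "-": 1, "*": 2, "/": 2}


def pretty(node, parent_prec=0):
    """Print a tuple-based parse tree as a Scala Column expression."""
    kind = node[0]
    if kind == "ref":
        return f'col("{node[1]}")'
    if kind == "number":
        return f"lit({node[1]})"
    if kind == "binop":
        _, op, left, right = node
        prec = PRECEDENCE[op]
        # The right operand gets prec + 1 so that a - (b - c) keeps its parens.
        text = f"{pretty(left, prec)} {op} {pretty(right, prec + 1)}"
        return f"({text})" if prec < parent_prec else text
    if kind == "call" and node[1] == "SUM":
        if not node[2]:
            raise ValueError("SUM needs at least one argument")
        text = " + ".join(pretty(arg, 1) for arg in node[2])
        return f"({text})" if parent_prec > 1 else text
    raise NotImplementedError(f"no printer for {node!r}")


def emit_with_column(target, tree):
    """Wrap the expression in a withColumn call on a DataFrame named df."""
    return f'df.withColumn("{target}", {pretty(tree)})'


sum_tree = ("call", "SUM", [("ref", "A"), ("ref", "C")])
print(emit_with_column("D", sum_tree))
```

Compared with one class per node, all printing rules sit in one function, which makes it easy to add a second printer for another target language over the same tree.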
Generating Code from an AST
SUM(A,C)
Demo!
What have we seen?
• Column-Oriented Excel Applications as Prototypes for Spark programs
• Program Transformation.
• How to model it as a Pipeline.
• Why it is considered a Code-to-Code Transformation.
• How to Parse Excel Formulas.
• Grammar
• Parse Tree
• XLParser
• Excel as a DSL.
• How can we Generate Code?
• Demo.
Next Steps
• Translate ~500 Excel Formulas.
• Modeling Machine Learning in Excel.
• Prototype D|’s and ML|’s in Excel.
Q&A
Tomorrow: Spark Cluster with Elasticsearch Inside
THANK YOU.
Email: ofcastaneda@uvg.edu.gt
Twitter: @oscar_castaneda

Spark Summit EU talk by Oscar Castaneda