Big Data Science
in Scala
Anastasia Lieva
Data Scientist
@lievAnastazia
KDnuggets polls, 2014 to 2016: most popular tools in data science
1. R
2. Python
3. SQL
Context: Real Time Bidding
Raw requests: 100 000 requests per second
4 terabytes per day
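A quick back-of-the-envelope check of these figures, as a sketch. It assumes decimal terabytes and that the whole 4 TB is raw request payload; neither is stated on the slide.

```scala
object TrafficEstimate {
  val requestsPerSecond: Long = 100000L       // from the slide
  val bytesPerDay: Double = 4e12              // 4 TB, assuming decimal terabytes

  val secondsPerDay: Long = 24L * 60L * 60L   // 86 400 seconds
  val requestsPerDay: Long = requestsPerSecond * secondsPerDay

  // Average payload size, assuming all 4 TB are raw requests.
  val avgBytesPerRequest: Double = bytesPerDay / requestsPerDay

  def main(args: Array[String]): Unit = {
    println(s"requests per day: $requestsPerDay")
    println(f"average bytes per request: $avgBytesPerRequest%.0f")
  }
}
```

Under those assumptions this gives 8.64 billion requests per day and about 463 bytes per request.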
R
Python
SQL
Scala
Spark: ML / DataFrame / SQL
Smile
Saddle
Preprocessing
Machine Learning
Evaluation
Problem:
Optimize the click rate of delivered ads
We want to estimate the probability that an ad will be clicked, depending on:
● request configuration
● proposed creative
● user history
● third-party information
Algorithm:
Random Forest
Averaging the decisions from all the trees
[Figure: two example trees. One splits on os, then on category and city; the other splits on adType, then on adSize and weekDay. Leaves are Yes / No.]
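The averaging idea can be sketched in plain Scala. The two hand-written "trees" below are hypothetical; they only borrow feature names from the figure, and every split value is invented for illustration. A real forest learns its trees from data.

```scala
object ForestAveraging {
  // Hypothetical, simplified view of a bid request.
  final case class Request(os: String, category: String, city: String,
                           adType: String, adWidth: Int, weekDay: Int)

  // A toy "tree": returns 1.0 for "click" and 0.0 for "no click".
  type Tree = Request => Double

  val osTree: Tree = r =>
    if (r.os == "iOS") (if (r.category == "games") 1.0 else 0.0)
    else (if (r.city == "Paris") 1.0 else 0.0)

  val adTree: Tree = r =>
    if (r.adType == "banner") (if (r.adWidth >= 320) 1.0 else 0.0)
    else (if (r.weekDay >= 6) 1.0 else 0.0)

  // The forest's estimate is the mean of the individual tree votes.
  def clickProbability(trees: Seq[Tree], r: Request): Double =
    trees.map(tree => tree(r)).sum / trees.size

  def main(args: Array[String]): Unit = {
    val request = Request("iOS", "games", "Lyon", "banner", 300, 2)
    println(clickProbability(Seq(osTree, adTree), request))
  }
}
```

With more trees, the mean of the votes becomes a finer-grained probability estimate.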
Raw data
{
"time":"2016-06-09T0:25:28Z",
"bidfloor":2.88,
"appOrSite":"app",
"adType":"banner",
"categories":"games,news,football",
"carrier":"208-10",
"os":"iOS",
"connectionType":3,
"coords":[48.929256439208984, 2.4255824089050293],
"adSize":[320, 50],
"exchange":"xxxxx",
[...],
"clicked":true
}
Sample of 13 GB
Os            MaxPrice  Time                   Click
Android       7.3       2016-06-09T0:25:28Z    False
iOS           4.55      2016-05-09T14:23:12Z   True
WindowsPhone  2.89      2016-06-09T11:35:11Z   False

Encoded as numbers:
Os   MaxPrice  Time
3.0  6.0       1.0
5.0  3.0       5.0
1.0  2.0       3.0
Preprocessing: Spark ml
Extraction: extracting features from "raw" data
TF-IDF, SparkSQL
Transformation: scaling, converting, or modifying features
Bucketizer, StringIndexer, IndexToString, VectorAssembler
Selection: selecting a subset from a larger set of features
ChiSqSelector
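To make the "extraction" step concrete, here is a small plain-Scala sketch (not the Spark API) that derives a few features from a raw record. The case classes, the chosen fields and the adArea feature are assumptions for illustration. The example uses a zero-padded timestamp, since `Instant.parse` expects ISO-8601 form.

```scala
import java.time.{Instant, ZoneOffset}

object FeatureExtraction {
  // Hypothetical subset of the raw record.
  final case class RawRequest(time: String, categories: String, adSize: (Int, Int))

  // Hypothetical derived features.
  final case class Features(hour: Int, weekDay: Int, categories: List[String], adArea: Int)

  def extract(raw: RawRequest): Features = {
    val t = Instant.parse(raw.time).atZone(ZoneOffset.UTC)
    Features(
      hour = t.getHour,
      weekDay = t.getDayOfWeek.getValue, // ISO numbering: 1 = Monday ... 7 = Sunday
      categories = raw.categories.split(",").map(_.trim).toList,
      adArea = raw.adSize._1 * raw.adSize._2
    )
  }

  def main(args: Array[String]): Unit =
    println(extract(RawRequest("2016-05-09T14:23:12Z", "games,news,football", (320, 50))))
}
```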
Preprocessing: Saddle
array-backed, specialized data structures
Pandas-like operations:
dealing with missing values
index transformation tools
extracting, slicing, mapping
row-wise / column-wise
groupBy / join / concat
sorting / pivoting
Learning: Spark ml
DataFrame-based API
Pipeline interface
Classification
Regression
Linear Methods
Decision Trees
Tree ensembles
Pipeline: TF-IDF -> StringIndexer -> VectorAssembler -> Random Forest -> Evaluation
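A minimal sketch of such a pipeline with the Spark ml API, assuming Spark is on the classpath. The data is synthetic, the column names and hyperparameters are assumptions, and the TF-IDF stage is left out to keep the example short; this is not the code from the talk.

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.feature.{StringIndexer, VectorAssembler}
import org.apache.spark.sql.SparkSession

object ClickPipelineSketch {
  // Trains on synthetic rows and returns the area under ROC on a held-out split.
  def run(spark: SparkSession): Double = {
    import spark.implicits._

    // Synthetic toy rows (os, maxPrice, label); not the talk's data.
    val oses = Seq("Android", "iOS", "WindowsPhone")
    val rows = (0 until 300).map { i =>
      val os = oses(i % 3)
      val price = 1.0 + (i % 10)
      val label = if (os == "iOS" && price > 4.0) 1.0 else 0.0
      (os, price, label)
    }
    val data = rows.toDF("os", "maxPrice", "label")
    val Array(train, test) = data.randomSplit(Array(0.8, 0.2), seed = 42L)

    // Stage 1: categorical string -> numeric index.
    val indexer = new StringIndexer()
      .setInputCol("os").setOutputCol("osIdx").setHandleInvalid("keep")
    // Stage 2: gather the numeric columns into one feature vector.
    val assembler = new VectorAssembler()
      .setInputCols(Array("osIdx", "maxPrice")).setOutputCol("features")
    // Stage 3: the classifier.
    val forest = new RandomForestClassifier()
      .setLabelCol("label").setFeaturesCol("features").setNumTrees(20).setSeed(42L)

    val model = new Pipeline().setStages(Array(indexer, assembler, forest)).fit(train)
    new BinaryClassificationEvaluator().setLabelCol("label").evaluate(model.transform(test))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("click-pipeline-sketch").master("local[*]").getOrCreate()
    try println(s"area under ROC: ${run(spark)}")
    finally spark.stop()
  }
}
```

Chaining the stages in one Pipeline means the same transformations are applied to the training and test dataframes.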
Compare performance: Spark
Learning: Smile
Array-backed API
Classification
Regression
Linear Methods
Decision Trees
Tree ensembles
★ Visualisation
★ Missing Values Imputation
★ Association Rule Mining
★ Manifold learning
★ Multi-dimensional scaling
★ Feature selection and dimensionality reduction
Preprocessing: Saddle
Create dataframe and balance the data
Preprocessing: Spark ml
Create dataframe and balance the data
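The slides do not say how the data was balanced. One common option is to downsample the majority class, sketched here in plain Scala; the Row type and the choice of downsampling are assumptions.

```scala
import scala.util.Random

object BalanceClasses {
  // Hypothetical row type for illustration.
  final case class Row(os: String, maxPrice: Double, clicked: Boolean)

  // Keep every minority-class row plus an equally sized random sample
  // of the majority class.
  def downsample(rows: Seq[Row], seed: Long): Seq[Row] = {
    val (clicked, notClicked) = rows.partition(_.clicked)
    val (minority, majority) =
      if (clicked.size <= notClicked.size) (clicked, notClicked)
      else (notClicked, clicked)
    val rng = new Random(seed)
    minority ++ rng.shuffle(majority).take(minority.size)
  }
}
```

Clicks are usually far rarer than non-clicks, so without balancing a classifier can score well by always predicting "no click".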
Preprocessing: Spark ml
Index categorical data
timestamp   os             osIdx
1465037789  iOS            1
1464983457  Windows Phone  2
1465019529  Android        0
1464974567  iOS            1
1465018552  Android        0
Preprocessing: Saddle
Index categorical data
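The indexing in the os / osIdx table can be reproduced with a few lines of plain Scala. The sketch follows the frequency-descending ordering that Spark's StringIndexer uses by default; breaking ties alphabetically is an assumption made here so that the result is deterministic.

```scala
object CategoryIndexer {
  // The most frequent label gets index 0; ties are broken alphabetically.
  def fit(values: Seq[String]): Map[String, Int] =
    values.groupBy(identity).toSeq
      .map { case (label, occurrences) => (label, occurrences.size) }
      .sortBy { case (label, count) => (-count, label) }
      .map(_._1)
      .zipWithIndex
      .toMap

  def main(args: Array[String]): Unit = {
    val os = Seq("iOS", "Windows Phone", "Android", "iOS", "Android")
    val index = fit(os)
    os.foreach(value => println(s"$value -> ${index(value)}"))
  }
}
```

On the five rows of the table this yields Android 0, iOS 1 and Windows Phone 2.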
Preprocessing: Saddle
Split randomly into train and test sets
and convert to the input type required by Smile's Random Forest implementation
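A plain-Scala sketch of this step, without Saddle. It assumes the classifier expects a dense feature matrix as Array[Array[Double]] and integer labels as Array[Int]; the Sample type and the 0/1 label encoding are illustrative.

```scala
import scala.util.Random

object TrainTestSplit {
  // Hypothetical sample type: already-encoded features plus the click label.
  final case class Sample(features: Array[Double], clicked: Boolean)

  // Shuffle with a fixed seed, then cut at the requested fraction.
  def split(samples: Seq[Sample], trainFraction: Double, seed: Long): (Seq[Sample], Seq[Sample]) = {
    val shuffled = new Random(seed).shuffle(samples)
    val cut = math.round(shuffled.size * trainFraction).toInt
    shuffled.splitAt(cut)
  }

  // Dense arrays: one feature row per sample, labels as 0/1 integers.
  def toArrays(samples: Seq[Sample]): (Array[Array[Double]], Array[Int]) =
    (samples.map(_.features).toArray,
     samples.map(s => if (s.clicked) 1 else 0).toArray)
}
```

Fixing the seed keeps the split reproducible between runs, which matters when comparing two libraries on the same data.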
Preprocessing: Spark ml
Conversion and sampling
Learning: construct the classifier and set hyperparameters
Smile
Spark ml
Learning: train the model and predict on the test dataframe
Spark ml
Smile
Learning: Evaluate model
Spark ml
Smile
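The slides do not list which metrics were computed. As an illustration, the usual binary classification metrics can be derived from a confusion matrix in plain Scala; the choice of metrics and the 0/1 label encoding are assumptions.

```scala
object ClassificationMetrics {
  final case class Confusion(tp: Int, fp: Int, tn: Int, fn: Int) {
    def accuracy: Double  = (tp + tn).toDouble / (tp + fp + tn + fn)
    def precision: Double = if (tp + fp == 0) 0.0 else tp.toDouble / (tp + fp)
    def recall: Double    = if (tp + fn == 0) 0.0 else tp.toDouble / (tp + fn)
    def f1: Double = {
      val p = precision
      val r = recall
      if (p + r == 0.0) 0.0 else 2 * p * r / (p + r)
    }
  }

  // Labels are 1 for "clicked" and 0 for "not clicked".
  def confusion(predicted: Seq[Int], actual: Seq[Int]): Confusion = {
    require(predicted.size == actual.size, "predictions and labels must align")
    val pairs = predicted.zip(actual)
    Confusion(
      tp = pairs.count { case (p, a) => p == 1 && a == 1 },
      fp = pairs.count { case (p, a) => p == 1 && a == 0 },
      tn = pairs.count { case (p, a) => p == 0 && a == 0 },
      fn = pairs.count { case (p, a) => p == 0 && a == 1 }
    )
  }
}
```

Because the same function can score predictions from either library, it gives a like-for-like comparison.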
Compare Spark and Smile Random Forest: classification metrics
[Chart: classification metrics, grouped into "the higher the better" and "the lower the better"]
Compare Spark and Smile Random Forest: running time on 13 GB
[Chart: running time in minutes]
Compare preprocessing:
Spark vs Saddle
My List[tools] for THIS project:
Preprocessing: Spark
Machine Learning (Random Forest): Smile
Your Option[tools] for YOUR project:
Spark
Smile
Saddle
