MACHINE LEARNING IN SPARK
Константин Макарычев
secon 2017
Big Data: Volume, Velocity, Variety
Apache Spark
http://spark.apache.org/
val wordCounts = textFile
  .flatMap(line => line.split(" "))
  .map(word => (word, 1))
  .reduceByKey((a, b) => a + b)
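The same three steps can be tried without a cluster. The sketch below is plain Scala on an in-memory `Seq`, not Spark code: `groupBy` plus a local `reduce` stands in for `reduceByKey`, and the object and method names are hypothetical.

```scala
object WordCountSketch {
  // Same steps as the slide, on an in-memory Seq instead of an RDD.
  def wordCounts(lines: Seq[String]): Map[String, Int] =
    lines
      .flatMap(line => line.split(" "))   // one element per word
      .map(word => (word, 1))             // pair each word with a count of 1
      .groupBy { case (word, _) => word } // local stand-in for reduceByKey's grouping
      .map { case (word, pairs) => word -> pairs.map(_._2).reduce((a, b) => a + b) }
}
```

In Spark the grouping happens across executors; here everything runs in one JVM, which is enough to check the logic.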
executor executor executor executor executor
SQL, Streaming, GraphX, MLlib
Machine Learning: training + serving
Diagram (pipeline): preprocess, preprocess, train, model
apache spark 1
hadoop mapreduce 0
spark machine learning 1
tokenizer
[apache, spark] 1
[hadoop, mapreduce] 0
[spark, machine, learning] 1
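The tokenizer step above can be approximated in plain Scala. This is a minimal sketch, assuming the stage lowercases the text and splits it on whitespace; the object name is hypothetical and this is not Spark's own implementation.

```scala
object TokenizerSketch {
  // Lowercase the text, then split on runs of whitespace.
  def tokenize(text: String): Seq[String] =
    text.toLowerCase.split("\\s+").toSeq
}
```

For example, `TokenizerSketch.tokenize("spark machine learning")` yields the three words shown in the last row above.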
hashing tf
[apache, spark] 1
[hadoop, mapreduce] 0
[spark, machine, learning] 1
[105, 495], [1.0, 1.0] 1
[6, 638, 655], [1.0, 1.0, 1.0] 0
[105, 72, 852], [1.0, 1.0, 1.0] 1
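The hashing step turns each word list into a sparse vector of (indices, counts). A minimal sketch of the idea follows, assuming a generic string hash taken modulo the number of features. It uses `scala.util.hashing.MurmurHash3.stringHash`, so the indices it produces will not match the ones on the slide, and the object and method names are hypothetical.

```scala
import scala.util.hashing.MurmurHash3

object HashingTFSketch {
  // Map a term to a bucket in [0, numFeatures).
  def bucket(term: String, numFeatures: Int): Int = {
    val raw = MurmurHash3.stringHash(term) % numFeatures
    if (raw < 0) raw + numFeatures else raw
  }

  // Return (indices, counts), with indices sorted, as a sparse term-frequency vector.
  def transform(words: Seq[String], numFeatures: Int): (Seq[Int], Seq[Double]) = {
    val counts = words
      .groupBy(word => bucket(word, numFeatures))
      .map { case (index, ws) => index -> ws.size.toDouble }
      .toSeq
      .sortBy(_._1)
    (counts.map(_._1), counts.map(_._2))
  }
}
```

Because the index depends only on the word, the same word lands in the same slot in every row, which is why "spark" has one index in both rows that contain it. Two different words can collide in one slot; a larger `numFeatures` makes that less likely.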
logistic regression
[105, 495], [1.0, 1.0] 1
[6, 638, 655], [1.0, 1.0, 1.0] 0
[105, 72, 852], [1.0, 1.0, 1.0] 1
0 72 -2.7138781446090308
0 94 0.9042505436914775
0 105 3.0835670890496645
…
0 495 3.2071722417080766
0 722 0.9042505436914775
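To see how such coefficients turn a feature vector into a prediction, here is a minimal scoring sketch in plain Scala. It reads the lines above as (row, column, value) entries of the coefficient matrix, which is an assumption, and copies only the values that are shown. The intercept is not on the slide, so it defaults to 0.0 here, and any feature index that is not listed is treated as 0.0. All names are hypothetical.

```scala
object LogisticScoreSketch {
  // Coefficients copied from the slide, keyed by feature index.
  val coefficients: Map[Int, Double] = Map(
    72  -> -2.7138781446090308,
    94  -> 0.9042505436914775,
    105 -> 3.0835670890496645,
    495 -> 3.2071722417080766,
    722 -> 0.9042505436914775
  )

  def sigmoid(z: Double): Double = 1.0 / (1.0 + math.exp(-z))

  // Probability of label 1 for a sparse vector given as (indices, values).
  // Assumptions: intercept defaults to 0.0; unlisted indices contribute 0.0.
  def probability(indices: Seq[Int], values: Seq[Double], intercept: Double = 0.0): Double = {
    val z = indices.zip(values).map { case (i, v) => coefficients.getOrElse(i, 0.0) * v }.sum
    sigmoid(z + intercept)
  }
}
```

With these numbers, the vector for "apache spark" (indices 105 and 495) gets a large positive score, so its probability is above 0.5, consistent with its label of 1.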
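import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}

// Split the "text" column into lowercase words.
val tokenizer = new Tokenizer()
  .setInputCol("text")
  .setOutputCol("words")
// Hash each word into one of 1000 feature slots.
val hashingTF = new HashingTF()
  .setNumFeatures(1000)
  .setInputCol(tokenizer.getOutputCol)
  .setOutputCol("features")
// Reads the "features" and "label" columns by default.
val lr = new LogisticRegression()
  .setMaxIter(10)
  .setRegParam(0.001)
// Chain the three stages and fit them on the training DataFrame.
val pipeline = new Pipeline()
  .setStages(Array(tokenizer, hashingTF, lr))
val model = pipeline.fit(training)
model.write.save("/tmp/spark-model")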
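The slides do not show how the `training` DataFrame is built. A minimal sketch, assuming it holds the three labeled example rows from the earlier slides in "text" and "label" columns; the object and method names are hypothetical, and Spark SQL must be on the classpath.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object TrainingDataSketch {
  // Build a DataFrame with the "text" and "label" columns the pipeline reads.
  // The rows are the three example rows from the slides; labels are Doubles.
  def build(spark: SparkSession): DataFrame =
    spark.createDataFrame(Seq(
      ("apache spark", 1.0),
      ("hadoop mapreduce", 0.0),
      ("spark machine learning", 1.0)
    )).toDF("text", "label")
}
```

In `spark-shell`, where `spark` is already defined, `val training = TrainingDataSketch.build(spark)` would provide the input for `pipeline.fit`.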
Diagram (pipeline): preprocess, preprocess, model
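import org.apache.spark.ml.PipelineModel

// Wrap each string in Tuple1: createDataFrame needs a Seq of tuples or case classes.
val test = spark.createDataFrame(Seq(
  Tuple1("spark hadoop"),
  Tuple1("hadoop learning")
)).toDF("text")
val model = PipelineModel.load("/tmp/spark-model")
model.transform(test).collect()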
./bin/spark-submit …
Константин Макарычев (Software Engineer): Using Spark for Machine Learning
Diagram: data, data scientist, spark cluster, model, web app
Diagram: data, data scientist, spark cluster, model, web app, DB
Diagram: data, data scientist, spark cluster, model, web app, libs, deps, model, docker
Diagram: data, data scientist, spark cluster, model, web app, API
Diagram: data, data scientist, spark cluster, model, web app, API, serving, API
Hydrosphere Mist
https://github.com/hydrospheredata/mist
