Leonid Blokhin
Mist
Apache Spark analytics as a service
AS A CHILD HE WAS A WELL-BEHAVED
BOY, BUT HE TOOK UP PITCH-AND-TOSS,
FELL IN WITH PIRATES, AND IT ALL
WENT DOWNHILL... BIG DATA ENGINEER,
FIERCE TEMPERAMENT. MARRIED.
LEONID BLOKHIN
www.provectus.com
Talk outline
• Apache Spark
• Hydrosphere Mist
• FrontEnd and BigData
Resilient Distributed Dataset
A distributed dataset that knows how to compute itself.
An RDD can be evaluated lazily, on demand.
It can also be materialized: distributed, in memory or on disk
(or in memory with spill to disk).
It is split into partitions; a partition is the smallest unit of an RDD
that a single worker node processes.
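The properties above (lazy evaluation, optional materialization, partitioning) can be illustrated with a toy model in plain Python. This is not Spark's API: the `ToyRDD` class and its methods are hypothetical names invented for this sketch, and everything runs in a single process.

```python
from typing import Callable, List, Optional


class ToyRDD:
    """A toy, single-process stand-in for an RDD (hypothetical, not Spark's API)."""

    def __init__(self, compute: Callable[[], List[list]]):
        # `compute` is the recipe: it knows how to produce every partition.
        self._compute = compute
        self._cache: Optional[List[list]] = None
        self._persist = False

    @staticmethod
    def parallelize(data: list, num_partitions: int) -> "ToyRDD":
        # Distribute the elements round-robin over the partitions.
        def compute() -> List[list]:
            return [data[i::num_partitions] for i in range(num_partitions)]
        return ToyRDD(compute)

    def map(self, fn: Callable) -> "ToyRDD":
        # Lazy: only records the transformation, computes nothing yet.
        return ToyRDD(lambda: [[fn(x) for x in part] for part in self.partitions()])

    def persist(self) -> "ToyRDD":
        # Ask for the partitions to be kept in memory once computed.
        self._persist = True
        return self

    def partitions(self) -> List[list]:
        if self._cache is not None:
            return self._cache
        parts = self._compute()
        if self._persist:
            self._cache = parts
        return parts

    def collect(self) -> list:
        # Action: triggers the actual computation.
        return [x for part in self.partitions() for x in part]


calls = []

def square(x: int) -> int:
    calls.append(x)  # record each evaluation to make laziness visible
    return x * x

rdd = ToyRDD.parallelize([1, 2, 3, 4, 5, 6], num_partitions=3).map(square).persist()
assert calls == []              # nothing has been computed yet
result = sorted(rdd.collect())  # the first action computes and caches
rdd.collect()                   # the second action reuses the cached partitions
```

Because the mapped dataset is persisted, `square` runs once per element even though `collect` is called twice; without `persist()` the second action would recompute every partition from the recipe.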
Spark MLlib
● Classification: logistic regression, naive Bayes,...
● Regression: generalized linear regression, isotonic regression,...
● Decision trees, random forests, and gradient-boosted trees
● Recommendation: alternating least squares (ALS)
● Clustering: K-means, Gaussian mixtures (GMMs),...
● Topic modeling: latent Dirichlet allocation (LDA)
● Feature transformations: standardization, normalization, hashing,...
● Model evaluation and hyper-parameter tuning
● ML Pipeline construction
● ML persistence: saving and loading models and Pipelines
● Survival analysis: accelerated failure time model
● Frequent itemset and sequential pattern mining: FP-growth, association rules, PrefixSpan
● Distributed linear algebra: singular value decomposition (SVD), principal component analysis (PCA),...
● Statistics: summary statistics, hypothesis testing,...
Spark GraphX
● PageRank
● Connected components
● Label propagation
● SVD++
● Strongly connected components
● Triangle count
http://hydrosphere.io/
Hydrosphere: an open-source Big Data and analytics platform
built with DevOps culture in mind.
ANALYTICS AS A SERVICE
Make your applications smarter. Routine CRUD applications should be
deprecated. Add predictive and interactive features to your product to
improve engagement with your users.
Tap into big data analytics as a service. No more shared databases and
low-level APIs. Hydrosphere.io lets you work with predictive services at
the right level of abstraction and focus on customer happiness and usability.
Learn from data scientists. Working in a squad with data scientists and
big data engineers facilitates knowledge sharing and exposure to new
technologies.
Mist
Exposing Apache Spark jobs through REST API
Spark 2.1.0 support!
HTTP & Messaging (MQTT) API
Scala and Python Spark jobs support
Support for Spark SQL and Hive
High Availability and Fault Tolerance
Self Healing after driver program failure
Powerful logging
Clear end-user API
Why We Needed Mist
Spark context orchestration (a cluster of Spark clusters): manages multiple
Spark contexts in separate JVMs or Docker containers
Real-time, low-latency model serving/scoring
Running
docker run \
  -p 2003:2003 \
  -v /var/run/docker.sock:/var/run/docker.sock \
  -d hydrosphere/mist:master-2.1.0 mist
Configuration
mist.context-defaults.spark-conf = {
  spark.master = "local[*]"
  spark.default.parallelism = 128
  spark.driver.memory = "10g"
  spark.scheduler.mode = "FAIR"
}
mist.http.on = true
mist.http.host = "192.168.10.13"
mist.http.port = 2003
mist.mqtt.on = true
mist.mqtt.host = "192.168.10.33"
mist.mqtt.port = 1883
# Mist listens on this topic for incoming requests
mist.mqtt.subscribeTopic = "foo"
# Mist publishes the results to this topic
mist.mqtt.publishTopic = "foo"
# recovery job (off by default)
mist.recovery.on = true
mist.recovery.multilimit = 10
mist.recovery.typedb = "MapDb"
mist.recovery.dbfilename = "file.db"
# default settings for all contexts
mist.context-defaults
# unique settings for a single context
mist.context.<namespace>
# timeout for each job in a context
mist.contextDefaults.timeout = 100 days
# Mist can kill a context after its job has finished (off by default)
mist.contextDefaults.disposable = false
mist.context.<namespace>.worker-downtime = Inf
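One way to read these settings: every context starts from the `context-defaults` values, and a `mist.context.<namespace>` block overrides individual keys. The Python sketch below models that lookup with plain dictionaries. The merge rule, the `settings_for` helper, and the idea that `production-namespace` carries the `worker-downtime` override are assumptions made for illustration, not Mist's actual resolution code.

```python
# Values taken from the slide; the merge semantics are an assumption.
CONTEXT_DEFAULTS = {
    "timeout": "100 days",
    "disposable": False,
}

# Hypothetical per-namespace overrides, for illustration only.
CONTEXT_OVERRIDES = {
    "production-namespace": {"worker-downtime": "Inf"},
}

def settings_for(namespace: str) -> dict:
    """Return the defaults with any namespace-specific keys layered on top."""
    merged = dict(CONTEXT_DEFAULTS)
    merged.update(CONTEXT_OVERRIDES.get(namespace, {}))
    return merged

production = settings_for("production-namespace")
other = settings_for("some-other-namespace")
```

A namespace without its own block simply gets the defaults back unchanged.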
mist.context.<namespace>.streaming-duration = 1 sec
mist.context-settings.onstart = ["foo", "bar"]
mist.http.router-config-path = ./config/router.conf
Mist RESTificated Routes
forecast = {
  path = "hdfs://hdfs-host/jobs/forecast-job.jar",
  className = "FastForecast$",
  namespace = "production-namespace"
}
HTTP
POST /api/forecast
{
  "param": 1
}
MQTT
{
  "route": "forecast",
  "parameters": {
    "param": 1
  }
}
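The two request shapes carry the same information: over HTTP the route name is part of the URL path and the body holds the parameters, while over MQTT the route name and the parameters travel together in one JSON message. A minimal Python sketch of that mapping follows; the helper names `http_request_for` and `mqtt_message_for` are hypothetical.

```python
import json

def http_request_for(route: str, parameters: dict) -> dict:
    """Describe the HTTP call for a route: the route goes into the path."""
    return {
        "method": "POST",
        "path": f"/api/{route}",
        "body": json.dumps(parameters),
    }

def mqtt_message_for(route: str, parameters: dict) -> str:
    """Build the MQTT payload: the route travels inside the JSON message."""
    return json.dumps({"route": route, "parameters": parameters})

http_call = http_request_for("forecast", {"param": 1})
mqtt_payload = mqtt_message_for("forecast", {"param": 1})
```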
Spark Job at Mist
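// build.sbt
libraryDependencies += "io.hydrosphere" %% "mist-lib-spark2" % "0.11.0"

// job source
import io.hydrosphere.mist.lib.spark2._

object MyCoolMistJob extends MistJob {
  // Entry point that Mist calls; the returned map becomes the job result.
  def execute(): Map[String, Any] = {
    val rdd = context.parallelize() // input collection omitted on the slide
    ...
    result.asInstanceOf[Map[String, Any]]
  }
}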
Spark Job at Mist
// build.sbt
libraryDependencies += "io.hydrosphere" %% "mist-lib-spark2" % "0.11.0"

// job source
import io.hydrosphere.mist.lib.spark2._
import io.hydrosphere.mist.lib.spark2.ml._

object MyCoolMistJob extends MistJob {
  // Trains a model and saves it under savePath.
  def train(savePath: String): Map[String, Any] = {...}
  // Loads the model from modelPath and scores the given feature vectors.
  def serve(modelPath: String, features: List[Array[Double]]): Map[String, Any] = {...}
}
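The `train`/`serve` pair separates an offline step that writes a model to `savePath` from an online step that loads it from `modelPath` and scores incoming feature vectors. The Python sketch below mimics that split with a deliberately trivial model: per-feature means stored as JSON, with scoring by Euclidean distance to the mean. The model, the file format, the extra `samples` argument, and the returned keys are all assumptions chosen to keep the example self-contained; it does not use Spark ML or the Mist library.

```python
import json
import math
import os
import tempfile
from typing import Dict, List

def train(save_path: str, samples: List[List[float]]) -> Dict[str, object]:
    """Offline step: fit a toy model (per-feature means) and persist it."""
    dims = len(samples[0])
    means = [sum(row[i] for row in samples) / len(samples) for i in range(dims)]
    with open(save_path, "w") as f:
        json.dump({"means": means}, f)
    return {"savedTo": save_path}

def serve(model_path: str, features: List[List[float]]) -> Dict[str, object]:
    """Online step: load the persisted model and score each feature vector."""
    with open(model_path) as f:
        means = json.load(f)["means"]
    scores = [math.dist(row, means) for row in features]
    return {"scores": scores}

model_file = os.path.join(tempfile.mkdtemp(), "toy-model.json")
train(model_file, [[0.0, 0.0], [2.0, 4.0]])
response = serve(model_file, [[1.0, 2.0], [4.0, 6.0]])
```

The point of the split is that `serve` never retrains: it only needs the path to an already saved model, which is what makes low-latency scoring possible.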
HTTP or MQTT
POST http://localhost:2003/api/forecast
{
  "points": {"lat": 56.1241, "lng": 42.3457,...},
  "distance": 254
}
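From a client's point of view this is an ordinary JSON POST. A minimal Python sketch using only the standard library is shown below. It builds the request but sends it only if `send` is called, since that step assumes a Mist instance listening on `localhost:2003` with a `forecast` route configured. The payload keeps just the two coordinates visible on the slide (the remaining `points` fields are elided there), and `build_forecast_request` and `send` are hypothetical helper names.

```python
import json
import urllib.request

MIST_URL = "http://localhost:2003/api/forecast"  # host and port as on the slide

def build_forecast_request(lat: float, lng: float, distance: int) -> urllib.request.Request:
    """Build (but do not send) the POST request for the forecast route."""
    body = json.dumps({"points": {"lat": lat, "lng": lng}, "distance": distance})
    return urllib.request.Request(
        MIST_URL,
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def send(request: urllib.request.Request) -> dict:
    """Send the request and decode the JSON response (needs a running Mist)."""
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))

request = build_forecast_request(56.1241, 42.3457, 254)
```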
{
  "success": true,
  "payload": {
    "point": {"lat": 56.1241, "lng": 42.3457},
    "sun": 30,
    "cloud": 75,
    "temperature": -25,
    ...
  },
  ...
}
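A client would typically check the `success` flag before touching `payload`. The sketch below does that in Python for a response shaped like the one above. The sample string contains only the fields visible on the slide, and the error handling (raising `RuntimeError`) is an assumption, since the slide does not show what a failed response looks like.

```python
import json

def extract_payload(raw: str) -> dict:
    """Return the payload of a successful response, or raise on failure."""
    reply = json.loads(raw)
    if not reply.get("success"):
        raise RuntimeError(f"job failed: {reply}")
    return reply["payload"]

sample = """
{"success": true,
 "payload": {"point": {"lat": 56.1241, "lng": 42.3457},
             "sun": 30, "cloud": 75, "temperature": -25}}
"""
forecast = extract_payload(sample)
```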
Road Map
Apache Kafka support
Bi-directional streaming API
AMQP support
Your contributions are very welcome on GitHub!
https://github.com/Hydrospheredata/mist
Leonid Blokhin
Phone: 8 (917) 295 - 40 - 49
Email: lblokhin@provectus.com
www.provectus.com
Thank you for your attention!

[FrontDays'2017] Leonid Blokhin (Big Data Engineer): Mist. A service for working with Apache Spark