Konstantin Makarychev (Software Engineer): Using Spark for Machine Learning

MACHINE LEARNING IN SPARK
Konstantin Makarychev
SECON 2017

Big Data: Volume, Velocity, Variety

Apache Spark
http://spark.apache.org/

val wordCounts = textFile
  .flatMap(line => line.split(" "))
  .map(word => (word, 1))
  .reduceByKey((a, b) => a + b)
Diagram: five executors
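The word-count chain can be tried without a cluster. The sketch below is a local stand-in written against plain Scala collections, not Spark: `WordCountSketch` and `wordCounts` are hypothetical names, and `groupBy` followed by a per-key `reduce` takes the place of `reduceByKey`, which exists only on Spark pair RDDs.

```scala
object WordCountSketch {
  // Local analogue of the Spark chain: split lines into words,
  // pair each word with 1, then sum the ones per word.
  def wordCounts(lines: Seq[String]): Map[String, Int] =
    lines
      .flatMap(line => line.split(" "))
      .filter(_.nonEmpty)                 // drop empty tokens produced by repeated spaces
      .map(word => (word, 1))
      .groupBy(_._1)                      // stands in for the shuffle by key
      .map { case (word, pairs) =>
        (word, pairs.map(_._2).reduce((a, b) => a + b))
      }
}
```

In Spark the same per-key reduction runs in parallel on the executors, with partial sums combined after the shuffle; here everything happens in one process.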

Machine Learning: training + serving

pipeline: preprocess → preprocess → train → model

tokenizer

input (text, label):
  apache spark                  1
  hadoop mapreduce              0
  spark machine learning        1

output (words, label):
  [apache, spark]               1
  [hadoop, mapreduce]           0
  [spark, machine, learning]    1

hashing tf

input (words, label):
  [apache, spark]               1
  [hadoop, mapreduce]           0
  [spark, machine, learning]    1

output (feature indices, values, label):
  [105, 495], [1.0, 1.0]             1
  [6, 638, 655], [1.0, 1.0, 1.0]     0
  [105, 72, 852], [1.0, 1.0, 1.0]    1
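The hashing step maps every word to a bucket index and counts how many words land in each bucket, which is why "spark" gets the same index (105) in both rows where it appears. A minimal sketch of that idea in plain Scala follows. `HashingTfSketch`, `bucket` and `transform` are hypothetical names, and `String.hashCode` is an assumption standing in for the hash function: Spark's `HashingTF` uses a MurmurHash3-based hash, so the indices produced here will not match the ones on the slide.

```scala
object HashingTfSketch {
  // Map a term to a bucket in [0, numFeatures).
  // Assumption: String.hashCode replaces Spark's MurmurHash3-based hash.
  def bucket(term: String, numFeatures: Int): Int = {
    val raw = term.hashCode % numFeatures
    if (raw < 0) raw + numFeatures else raw   // keep the index non-negative
  }

  // Sparse term-frequency vector as (bucket index, count) pairs sorted by index.
  // Words that collide in the same bucket are counted together.
  def transform(words: Seq[String], numFeatures: Int): Seq[(Int, Double)] =
    words
      .groupBy(w => bucket(w, numFeatures))
      .map { case (idx, ws) => (idx, ws.size.toDouble) }
      .toSeq
      .sortBy(_._1)
}
```

The vector length is fixed up front (1000 in the pipeline code), so no vocabulary has to be built or stored; the price is that distinct words can collide in one bucket.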

logistic regression

input (feature indices, values, label):
  [105, 495], [1.0, 1.0]             1
  [6, 638, 655], [1.0, 1.0, 1.0]     0
  [105, 72, 852], [1.0, 1.0, 1.0]    1

output (fitted model, as printed on the slide):
  0   72   -2.7138781446090308
  0   94    0.9042505436914775
  0  105    3.0835670890496645
  …
  0  495    3.2071722417080766
  0  722    0.9042505436914775
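To make the fitted numbers concrete, here is a small sketch of how a logistic regression model turns a sparse feature vector into a probability. It rests on two assumptions that the slide does not confirm: that each three-column row is a (row, feature index, weight) entry of the coefficient matrix, and that the intercept is 0.0, since no intercept is shown. `LogisticScoreSketch` and its members are hypothetical names, and the rows elided by "…" are simply left out.

```scala
object LogisticScoreSketch {
  // Weights copied from the slide, keyed by feature index.
  // Assumption: the middle column is the feature index, the last one the weight.
  val coefficients: Map[Int, Double] = Map(
    72  -> -2.7138781446090308,
    94  -> 0.9042505436914775,
    105 -> 3.0835670890496645,
    495 -> 3.2071722417080766,
    722 -> 0.9042505436914775
  )

  // Assumption: the slide shows no intercept, so 0.0 is a placeholder.
  val intercept: Double = 0.0

  def sigmoid(z: Double): Double = 1.0 / (1.0 + math.exp(-z))

  // Probability of label 1 for a sparse vector of (index, value) pairs.
  // Indices without a known weight contribute nothing.
  def probability(features: Seq[(Int, Double)]): Double = {
    val margin = features.map { case (i, v) => coefficients.getOrElse(i, 0.0) * v }.sum
    sigmoid(margin + intercept)
  }
}
```

Under these assumptions the first training row, with indices 105 and 495, has a large positive margin and is scored well above 0.5, which agrees with its label of 1.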

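import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}

// Split the "text" column into lowercase words.
val tokenizer = new Tokenizer()
  .setInputCol("text")
  .setOutputCol("words")
// Hash each word into one of 1000 buckets and count occurrences.
val hashingTF = new HashingTF()
  .setNumFeatures(1000)
  .setInputCol(tokenizer.getOutputCol)
  .setOutputCol("features")
val lr = new LogisticRegression()
  .setMaxIter(10)
  .setRegParam(0.001)
// Chain the three stages into a single estimator.
val pipeline = new Pipeline()
  .setStages(Array(tokenizer, hashingTF, lr))
// `training` is a DataFrame with "text" and "label" columns (not shown on the slide).
val model = pipeline.fit(training)
model.write.save("/tmp/spark-model")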

pipeline: preprocess → preprocess → model

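import org.apache.spark.ml.PipelineModel

// Tuple1 wraps each string so that createDataFrame sees a single-column row.
val test = spark.createDataFrame(Seq(
  Tuple1("spark hadoop"),
  Tuple1("hadoop learning")
)).toDF("text")
// Load the pipeline saved during training and score the new rows.
val model = PipelineModel.load("/tmp/spark-model")
model.transform(test).collect()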

Diagram: data, Spark cluster, data scientist, model, web app

Diagram: data, Spark cluster, data scientist, model, web app, DB

Diagram: data, Spark cluster, data scientist, model, web app, libs, deps, model, docker

Diagram: data, Spark cluster, data scientist, model, web app, API

Diagram: data, Spark cluster, data scientist, model, web app, API, serving API

Hydrosphere Mist
https://github.com/hydrospheredata/mist
