Using Apache Spark for generating
ElasticSearch indices offline
Andrej Babolčai
ESET Database systems engineer
Apache: Big Data Europe 2016
Who am I
• Software engineer in database systems team
• Responsible for collecting, moving and providing access to data
Context
• Data pipeline built around Apache Kafka, Apache Hive/Impala and ElasticSearch
Agenda
• Approaches we tried and why they failed
• Solution used, Spark + ES
• Benchmark, summary and possible improvements
Indexing data to live cluster
• Failed because
  • Bulk loading slowed down search and the near real-time (NRT) import
  • Reducing ingestion speed avoided the slowdown, but made the load far too slow
Spark job with Lucene library
• Approach
  • Generate indices with plain Lucene and “import” them into ES
  • Indexing with Lucene is fast: hundreds of GB per hour
• Failed because
  • ES type mappings are hard to reproduce with raw Lucene
  • ES expects a translog and checksums that plain Lucene indices lack
Agenda
• Approaches tried and why they failed
• Solution used, Spark + ES
• Benchmark, summary and possible improvements
Goal
• Offload ES cluster and generate indices on Spark cluster
• We want indices to be “ready to use”
• When appropriate, copy them to the ES cluster
Spark + local ES
• Based on https://github.com/MyPureCloud/elasticsearch-lambda
• Similar in approach to Cloudera's Solr MapReduceIndexerTool
How do we generate indices offline
• Spark repartitions the input data into n partitions, one per target shard
• Each partition starts a local embedded ES node and creates the index with n shards
• Constant shard routing sends all of a partition's documents into a single shard, so each node ends up with one shard holding data and the rest empty
• Each node snapshots its index into the shared HDFS repository (Index/0, Index/1, …, Index/n); only the snapshot of the populated shard contains data
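For orientation, here is a minimal structural sketch of that flow; it is an assumed shape rather than the talk's actual code, and the function name, step comments and placeholder body are illustrative only.

import org.apache.spark.{SparkContext, TaskContext}
import org.apache.spark.rdd.RDD

// One Spark task per target shard: each task builds its shard in a local ES node
// and snapshots it into the shared HDFS repository.
def buildIndexOffline(sc: SparkContext, docs: RDD[Map[String, Object]], numShards: Int): Unit = {
  val partitioned = docs.repartition(numShards)   // partition i becomes shard i
  sc.runJob(partitioned, (ctx: TaskContext, it: Iterator[Map[String, Object]]) => {
    val shardId = ctx.partitionId()
    // In the real job each task would, at this point:
    //   1. start an embedded, HTTP-less ES node (see “Creating local ES node” below)
    //   2. create the index with numShards shards and the target mapping
    //   3. index every document with a constant routing key so all of them land in shard `shardId`
    //   4. snapshot the index into the HDFS repository; only shard `shardId` holds data
    it.foreach(_ => ())                           // placeholder for the actual indexing loop
  })
}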
HDFS snapshot repository layout
Dest dir/
  indices/
    idxname-2015/
    idxname-2016/
      0/
        __r
        __z
      1/
  meta-idxname-2016.dat
  snap-idxname-2016.dat
Creating local ES node
val nodeSettings: Settings = Settings.builder
  .put("http.enabled", false)                        // HTTP unnecessary, use the transport interface
  .put("processors", 1)
  .put("index.merge.scheduler.max_thread_count", 1)
  …
  .build

val node: Node =
  nodeBuilder().settings(nodeSettings).local(true).node() // local(true): JVM-local node discovery only
node.start

val client: Client = node.client
client.admin.indices. … .setSource(mapping).get          // same JSON mapping as the HTTP index API
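As a concrete illustration of that last (elided) call, here is a hedged sketch of creating the index through the embedded node's client with the ES 2.x Java API; the index name, type name and mapping JSON are made up for this example and reuse the client from the snippet above.

// Illustrative only: the same JSON body one would PUT to the HTTP create-index endpoint.
val indexSource =
  """{
    |  "settings": { "number_of_shards": 15, "number_of_replicas": 0 },
    |  "mappings": {
    |    "doc": { "properties": { "field1": { "type": "string", "index": "not_analyzed" } } }
    |  }
    |}""".stripMargin

client.admin.indices
  .prepareCreate("idxname-2016")
  .setSource(indexSource)
  .get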
RDD export like saveAsTextFile
rddToIndex
  .repartition(config.numShards)
  .saveToESSnapshot(
    config, …
We use implicit conversions
package object spark {                               // “infiltrate” the spark namespace
  implicit class DBSysSparkRDDFunctions
    [T <: Map[String,Object]]                        // input RDD type bound (Row)
    (val rdd: RDD[T]) extends AnyVal {
    def saveToESSnapshot(config: String, …): Unit = { // the indexing method
      …
      rdd.sparkContext.runJob(rdd, esWriter.processPartition _)
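A hedged usage sketch from the caller's side: the input path is a placeholder, the exact parameter list of saveToESSnapshot is elided on the slide, and the field extraction shown is an assumption about how a Row becomes a Map[String,Object].

import org.apache.spark.sql.SQLContext

// Hypothetical caller code; assumes the package object above is in scope.
def exportToSnapshot(sqlContext: SQLContext, config: String, numShards: Int): Unit = {
  val docs = sqlContext.read.parquet("/data/input")                  // compressed parquet input
    .rdd
    .map(row => row.getValuesMap[Object](row.schema.fieldNames))     // Row -> Map[String, Object]

  docs
    .repartition(numShards)                                          // one Spark partition per target shard
    .saveToESSnapshot(config /* , further parameters elided */)
}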
Useful ES commands
• Create HDFS snapshot repository:
curl -s -XPUT 'localhost:9200/_snapshot/<Repo name>' -d '{
"type": "hdfs",
"settings": {
"uri": "hdfs://namenode:8020/",
"path": "/user/<username>/<Snapshot repo hdfs path>",
"load_defaults": "false"
}
}'
Useful ES commands
• Start restore process:
curl -XPOST 'localhost:9200/_snapshot/<Repo name>/<snapshot name>/_restore'
• Monitor restore progress:
curl -s -XGET 'localhost:9200/_cat/recovery?v' |
awk '{print $1 " " $11}' |
fgrep -v " 0.0%" |
fgrep -v "100.0%"
Agenda
• Approaches tried and why they failed
• Solution used, Spark + ES
• Benchmark, summary and possible improvements
ES cluster configuration
Property                                 Value
Number of nodes                          24
ES heap size                             29 GB
CPUs                                     8 (/proc/cpuinfo)
HDD                                      2 x 3.5 TB per node
No. of indices                           130
No. of shards                            ~3900
Data size                                16 TB
No. of docs                              > 60 billion
Indexing speed (what we can handle…)     ~1000 docs/s
Offline indexing environment
Property                Value
Input size              135 GB compressed parquet
Number of docs          470 M
CPUs (indexing)         15 Spark workers
Memory                  4 GB per worker
Output index layout     20 string fields, 15 partitions
Job duration            ~3.5 h
Restore duration        ~20 min
Duration total          ~4 h
Indexing speed          > 30k docs/s
Future work
• Improve shard routing
• Index directly on HDFS instead of the local filesystem
• Speed up indexing
• Use the same approach for stream indexing
Summary
• Hard to compare real-time indexing directly with our offline approach
• What we wanted was to make historical data available to users without affecting production systems
https://github.com/andybab/OfflineESIndexGenerator
Thank you
Questions?
babolcai@eset.sk