Rome – 7 February 2017
presented by Alberto Paro, Seacom
Elastic & Spark.
Building A Search Geo Locator
Alberto Paro
 Degree in Computer Engineering (POLIMI)
 Author of 3 books on Elasticsearch, from 1.x to 5.x, plus 6 tech reviews
 I work mainly in Scala and with Big Data technologies (Akka, Spray.io, Play Framework, Apache Spark) and NoSQL stores (Accumulo, Cassandra, Elasticsearch and MongoDB)
 Evangelist for the Scala and Scala.js languages
Elasticsearch 5.x Cookbook
 Choose the best ElasticSearch cloud topology to deploy and power it up
with external plugins
 Develop tailored mapping to take full control of index steps
 Build complex queries through managing indices and documents
 Optimize search results through executing analytics aggregations
 Monitor the performance of the cluster and nodes
 Install Kibana to monitor the cluster, and extend it with plugins
 Integrate ElasticSearch in Java, Scala, Python and Big Data applications
Discount code for Ebook: ALPOEB50
Discount code for Print Book: ALPOPR15
Expiration Date: 21st Feb 2017
Goals
 Big Data architectures with ES
 Apache Spark
 GeoIngester
 Data collection
 Index optimization
 Ingestion via Apache Spark
 Searching for a place
 Overview of Big Data tools
Architecture
Hadoop / Spark
[Diagram: Hadoop MapReduce reads from and writes to HDFS between every iteration (Input → Iter 1 → HDFS → Iter 2 → HDFS), while Apache Spark keeps intermediate data in memory across Iter 1 and Iter 2.]
Hadoop MapReduce vs. Apache Spark
An evolution of the MapReduce model
Apache Spark
 Written in Scala, with APIs in Java, Python and R
 An evolution of the Map/Reduce model
 Powerful companion modules:
 Spark SQL
 Spark Streaming
 MLlib (machine learning)
 GraphX (graphs)
Geoname
GeoNames is a geographical database, freely downloadable under a Creative Commons license.
It contains about 10 million geographical names and consists of about 9 million unique features, of which 2.8 million are populated places and 5.5 million are alternate names.
It can easily be downloaded from http://download.geonames.org/export/dump as CSV files.
The code is available at:
https://github.com/aparo/elasticsearch-geonames-locator
Geoname – Structure
No. Attribute name Explanation
1 geonameid Unique ID for this geoname
2 name The name of the geoname
3 asciiname ASCII representation of the name
4 alternatenames Other forms of this name, generally in several languages
5 latitude Latitude in decimal degrees of the geoname
6 longitude Longitude in decimal degrees of the geoname
7 fclass Feature class, see http://www.geonames.org/export/codes.html
8 fcode Feature code, see http://www.geonames.org/export/codes.html
9 country ISO-3166 2-letter country code
10 cc2 Alternate country codes, comma separated, ISO-3166 2-letter country code
11 admin1 FIPS code (subject to change to ISO code)
12 admin2 Code for the second administrative division, e.g. a county in the US
13 admin3 Code for the third-level administrative division
14 admin4 Code for the fourth-level administrative division
15 population The population of the geoname
16 elevation The elevation in meters of the geoname
17 gtopo30 Digital elevation model
18 timezone The timezone of the geoname
19 moddate The date of the last change of this geoname
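A GeoNames dump row is one tab-separated line following the 19-column layout above. As a quick illustration, the sketch below splits such a line back into fields; the sample row is invented (loosely modeled on the Moscow entry), not taken from the real dump:

```scala
object GeonameRowDemo {
  // Invented sample row, tab-separated, following the 19-column layout above.
  // Empty strings stand for columns that are blank in the dump (e.g. elevation).
  val sample: String = Seq(
    "524901", "Moscow", "Moscow", "Moskva,Moscou", "55.75222", "37.61556",
    "P", "PPLC", "RU", "", "48", "", "", "",
    "10381222", "", "144", "Europe/Moscow", "2012-01-17"
  ).mkString("\t")

  def main(args: Array[String]): Unit = {
    // split with limit -1 so trailing empty columns are preserved
    val cols = sample.split("\t", -1)
    println(cols.length)               // number of columns (19)
    println(cols(1))                   // name
    println(cols(4).toFloat)           // latitude as a number
    println(cols(3).split(",").toList) // alternate names list
  }
}
```

Note the `-1` limit on `split`: without it, trailing empty columns would be dropped and the row would no longer line up with the schema.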
Index optimization – 1/2
Needed to:
 Remove fields that are not required
 Handle geo-point fields
 Optimize string fields (text, keyword)
 Pick the correct number of shards (11M records => 2 shards)
Benefits => performance / disk space / CPU
Index optimization – 2/2
{
  "mappings": {
    "geoname": {
      "properties": {
        "admin1": {
          "type": "keyword",
          "ignore_above": 256
        },
        …
        "alternatenames": {
          "type": "text",
          "fields": {
            "keyword": {
              "type": "keyword",
              "ignore_above": 256
            }
          }
        },
        …
        "location": {
          "type": "geo_point"
        },
        …
        "longitude": {
          "type": "float"
        },
        "moddate": {
          "type": "date"
        },
Ingestion via Spark – GeonameIngester – 1/7
Our ingester will perform the following steps:
 Initialize the Spark job
 Parse the CSV
 Define the indexing structure
 Populate the classes
 Write the data to Elasticsearch
 Run the Spark job
Ingestion via Spark – GeonameIngester – 2/7
Initializing a Spark job
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._
import org.elasticsearch.spark.rdd.EsSpark
import scala.util.Try

object GeonameIngester {
  def main(args: Array[String]) {
    val sparkSession = SparkSession.builder
      .master("local")
      .appName("GeonameIngester")
      .getOrCreate()
Ingestion via Spark – GeonameIngester – 3/7
Parsing the CSV
val geonameSchema = StructType(Array(
  StructField("geonameid", IntegerType, false),
  StructField("name", StringType, false),
  StructField("asciiname", StringType, true),
  StructField("alternatenames", StringType, true),
  StructField("latitude", FloatType, true), ….

val GEONAME_PATH = "downloads/allCountries.txt"
val geonames = sparkSession.sqlContext.read
  .option("header", false)
  .option("quote", "")
  .option("delimiter", "\t").option("maxColumns", 22)
  .schema(geonameSchema)
  .csv(GEONAME_PATH)
  .cache()
Ingestion via Spark – GeonameIngester – 4/7
Defining our classes for indexing
case class GeoPoint(lat: Double, lon: Double)

case class Geoname(geonameid: Int, name: String, asciiname: String,
  alternatenames: List[String], latitude: Float, longitude: Float,
  location: GeoPoint, fclass: String, fcode: String, country: String,
  cc2: String, admin1: Option[String], admin2: Option[String],
  admin3: Option[String], admin4: Option[String], population: Double,
  elevation: Int, gtopo30: Int, timezone: String, moddate: String)

implicit def emptyToOption(value: String): Option[String] = {
  if (value == null) return None
  val clean = value.trim
  if (clean.isEmpty) None else Some(clean)
}
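The emptyToOption conversion above turns null or blank strings into None, so empty CSV cells become proper Option values. A quick self-contained check of that behaviour (shown as a plain method rather than an implicit, for clarity):

```scala
object EmptyToOptionDemo {
  // Same logic as the slide's implicit conversion:
  // null and blank strings map to None, everything else is trimmed into Some
  def emptyToOption(value: String): Option[String] = {
    if (value == null) None
    else {
      val clean = value.trim
      if (clean.isEmpty) None else Some(clean)
    }
  }

  def main(args: Array[String]): Unit = {
    println(emptyToOption(null))   // None
    println(emptyToOption("   "))  // None
    println(emptyToOption(" RU ")) // Some(RU)
  }
}
```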
Ingestion via Spark – GeonameIngester – 6/7
Populating our classes
val records = geonames.map {
  row =>
    val id = row.getInt(0)
    val lat = row.getFloat(4)
    val lon = row.getFloat(5)
    Geoname(id, row.getString(1), row.getString(2),
      Option(row.getString(3)).map(_.split(",").map(_.trim).filterNot(_.isEmpty).toList).getOrElse(Nil),
      lat, lon, GeoPoint(lat, lon),
      row.getString(6), row.getString(7), row.getString(8), row.getString(9),
      row.getString(10), row.getString(11), row.getString(12), row.getString(13),
      row.getDouble(14), fixNullInt(row.get(15)), row.getInt(16), row.getString(17),
      row.getDate(18).toString
    )
}
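fixNullInt is called above but never shown on the slides. A plausible minimal sketch (an assumption, not the author's actual code) would map a null cell, such as a missing elevation, to 0 and pass numbers through:

```scala
object FixNullIntDemo {
  // Hypothetical helper: the Spark row may hold null for the nullable
  // elevation column, so null becomes 0 and numeric values pass through.
  def fixNullInt(value: Any): Int = value match {
    case null      => 0
    case i: Int    => i
    case n: Number => n.intValue
    case _         => 0
  }

  def main(args: Array[String]): Unit = {
    println(fixNullInt(null)) // 0
    println(fixNullInt(187))  // 187
  }
}
```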
Ingestion via Spark – GeonameIngester – 7/7
Writing to Elasticsearch
EsSpark.saveToEs(records.toJavaRDD, "geonames/geoname", Map("es.mapping.id" -> "geonameid"))
Running the Spark job
spark-submit --class GeonameIngester target/scala-2.11/elasticsearch-geonames-locator-assembly-1.0.jar
(~20 minutes on a single machine)
Searching for a place
curl -XPOST 'http://localhost:9200/geonames/geoname/_search' -d '{
  "query": {
    "bool": {
      "minimum_should_match": 1,
      "should": [
        { "term": { "name": "moscow" }},
        { "term": { "alternatenames": "moscow" }},
        { "term": { "asciiname": "moscow" }}
      ],
      "filter": [
        { "term": { "fclass": "P" }},
        { "range": { "population": { "gt": 0 }}}
      ]
    }
  },
  "sort": [ { "population": { "order": "desc" }}]
}'
NoSQL
Key-Value
 Redis
 Voldemort
 Dynomite
 Tokio*
BigTable Clones
 Accumulo
 HBase
 Cassandra
Document
 CouchDB
 MongoDB
 ElasticSearch
GraphDB
 Neo4j
 OrientDB
 …Graph
Message Queue
 Kafka
 RabbitMQ
 ...MQ
NoSQL - Evolution
MicroServices
Language – Scala vs Java
JAVA:
public class User {
  private String firstName;
  private String lastName;
  private String email;
  private Password password;

  public User(String firstName, String lastName, String email, Password password) {
    this.firstName = firstName;
    this.lastName = lastName;
    this.email = email;
    this.password = password;
  }

  public String getFirstName() { return firstName; }
  public void setFirstName(String firstName) { this.firstName = firstName; }
  public String getLastName() { return lastName; }
  public void setLastName(String lastName) { this.lastName = lastName; }
  public String getEmail() { return email; }
  public void setEmail(String email) { this.email = email; }
  public Password getPassword() { return password; }
  public void setPassword(Password password) { this.password = password; }

  @Override public String toString() {
    return "User [email=" + email + ", firstName=" + firstName + ", lastName=" + lastName + "]";
  }

  @Override public int hashCode() {
    final int prime = 31;
    int result = 1;
    result = prime * result + ((email == null) ? 0 : email.hashCode());
    result = prime * result + ((firstName == null) ? 0 : firstName.hashCode());
    result = prime * result + ((lastName == null) ? 0 : lastName.hashCode());
    result = prime * result + ((password == null) ? 0 : password.hashCode());
    return result;
  }

  @Override public boolean equals(Object obj) {
    if (this == obj) return true;
    if (obj == null) return false;
    if (getClass() != obj.getClass()) return false;
    User other = (User) obj;
    if (email == null) {
      if (other.email != null) return false;
    } else if (!email.equals(other.email)) return false;
    if (password == null) {
      if (other.password != null) return false;
    } else if (!password.equals(other.password)) return false;
    if (firstName == null) {
      if (other.firstName != null) return false;
    } else if (!firstName.equals(other.firstName)) return false;
    if (lastName == null) {
      if (other.lastName != null) return false;
    } else if (!lastName.equals(other.lastName)) return false;
    return true;
  }
}
SCALA:
case class User(
  var firstName: String,
  var lastName: String,
  var email: String,
  var password: Password)
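The point of the comparison is that the Scala case class gets toString, equals and hashCode generated by the compiler, replacing all the Java boilerplate above. A quick check, with Password stubbed as a String alias purely for this sketch:

```scala
object CaseClassDemo {
  // Password stubbed as a type alias just for this demo
  type Password = String

  case class User(
    var firstName: String,
    var lastName: String,
    var email: String,
    var password: Password)

  def main(args: Array[String]): Unit = {
    val a = User("Jane", "Doe", "jane@example.com", "secret")
    val b = User("Jane", "Doe", "jane@example.com", "secret")
    println(a == b)                   // structural equality, generated for us
    println(a.hashCode == b.hashCode) // consistent hashCode, also generated
    println(a)                        // readable toString, also generated
  }
}
```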
Thank you for your attention
Alberto Paro
Q&A


Editor's Notes

  • #5: Spark SQL (DB, JSON, case class)
  • #6: Advantages: 100% open source. Disadvantages: breaking modules.
  • #21: Key-Value: focus on scaling to huge amounts of data; designed to handle massive load; based on Amazon's Dynamo paper. Data model: a (global) collection of key-value pairs; Dynamo ring partitioning and replication.
BigTable clones: like column-oriented relational databases, but with a twist; tables similar to an RDBMS, but handling semi-structured data; based on Google's BigTable paper. Data model: columns → column families → ACL; datums keyed by row, column, time, index; row-range → tablet → distribution.
Document: similar to key-value stores, but the DB knows what the value is; inspired by Lotus Notes. Data model: collections of key-value collections; documents are often versioned.
GraphDB: focus on modeling the structure of data (interconnectivity); scales to the complexity of the data; inspired by mathematical graph theory (G=(E,V)). Data model: "property graph": nodes, relationships/edges between nodes (first class), key-value pairs on both, possibly edge labels and/or node/edge types.