@alexanderDeja @maxospiquante #TiaSparkSQL
SparkSQL for analyzing your Cassandra data
Who are we?
Alexander DEJANOVSKI
@alexanderDeja
Developer
Maxence LECOINTE
@maxospiquante
Developer
Cassandra
• Distributed NoSQL database
• Query language: CQL ~= SQL
• SELECT * FROM ze_table WHERE ze_key=1
• No joins, no GROUP BY, no INSERT/SELECT
Spark
• In-memory map/reduce
• 10x-100x faster than Hadoop
• Scala, Java, or Python
• Modules: Spark Streaming, MLlib, GraphX, SparkSQL
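The map/reduce style in question, sketched in plain Python (no Spark; the data is made up): a map phase emits (key, value) pairs, and a reduce phase merges them per key, much like Spark's reduceByKey:

```python
from functools import reduce
from collections import Counter

# "map" phase: turn each record into (key, value) pairs
records = ["spark cassandra", "spark sql", "cassandra"]
pairs = [(word, 1) for line in records for word in line.split()]

# "reduce" phase: combine values per key, like reduceByKey in Spark
def merge(acc, pair):
    key, value = pair
    acc[key] += value
    return acc

counts = reduce(merge, pairs, Counter())
print(dict(counts))  # {'spark': 2, 'cassandra': 2, 'sql': 1}
```

Spark runs the same two phases across a cluster, keeping intermediate results in memory instead of spilling to disk between stages as Hadoop MapReduce does.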
Goal
Cassandra << >> SparkSQL
Creating index tables
Computing (simple…) statistics
on the Devoxx FR talks from 2012 to 2015
Datastax Spark Cassandra Connector
Setup
• Spark 1.1 or 1.2 for Scala and Java
• Datastax connector:
http://github.com/datastax/spark-cassandra-connector
• Spark 1.1 for Python
• Calliope connector from TupleJump:
http://tuplejump.github.io/calliope/start-with-sql.html
To save you (some) trouble…
• Sources for this TIA:
https://github.com/adejanovski/devoxx2015
• Read the README
What is an RDD?
• Resilient Distributed Dataset
• A distributed, resilient collection of objects
• Can store data in any format
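A toy illustration of the name, in plain Python (not Spark's API): the data lives in partitions, and each transformation keeps the recipe that produced it, so a lost partition could be recomputed:

```python
# Toy RDD: a list of partitions plus the recipe that produced them.
class ToyRDD:
    def __init__(self, partitions, how=None):
        self.partitions = partitions  # distributed: data is split into chunks
        self.how = how                # resilient: lineage kept for recompute (unused in this toy)

    def map(self, f):
        # Derive new partitions, remembering how they were built
        return ToyRDD([[f(x) for x in p] for p in self.partitions], how=("map", f))

    def collect(self):
        # Gather every partition back into one local list
        return [x for p in self.partitions for x in p]

rdd = ToyRDD([[1, 2], [3, 4]])              # two "partitions"
print(rdd.map(lambda x: x * 10).collect())  # [10, 20, 30, 40]
```

Real RDDs are also lazy (transformations only run when an action like collect is called), which this toy does not show.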
Schema
Step 1
Scala-Fu
Scala-Fu: split by speaker
val rddTalk = cc.sql("select annee, titre, speakers, type_talk from devoxx.talk")
// Drop out of SparkSQL to rework the data
val splitBySpeakersRdd =
  rddTalk.flatMap(r => r(2).asInstanceOf[scala.collection.immutable.Set[String]]
    .map(m => (m, r)))
case class Talk(titre: String, speaker: String, annee: Int, type_talk: String)
val talksSchemaRdd = splitBySpeakersRdd.map(
  t => Talk(t._2.getString(1), t._1, t._2.getInt(0), t._2.getString(3)))
talksSchemaRdd.registerTempTable("talks_par_speaker")
Scala code
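The flatMap step above turns each talk row into one (speaker, row) pair per speaker. The same idea in plain Python, on made-up rows (no Spark needed):

```python
# Each row: (annee, titre, speakers, type_talk), mirroring the SELECT above
rows = [
    (2015, "SparkSQL et Cassandra", {"adejanovski", "mlecointe"}, "Tools in Action"),
    (2014, "Cassandra 101", {"adejanovski"}, "Conference"),
]

# flatMap equivalent: emit one (speaker, row) pair per speaker in the set
pairs = [(speaker, row) for row in rows for speaker in sorted(row[2])]

for speaker, row in pairs:
    print(speaker, "->", row[1])
# adejanovski -> SparkSQL et Cassandra
# mlecointe -> SparkSQL et Cassandra
# adejanovski -> Cassandra 101
```

This is the denormalization step: since Cassandra cannot join or group, the data is re-keyed by speaker before being written back to an index table.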
cc.sql("""insert into devoxx.talk_par_speaker
          select speaker, type_talk, titre, annee
          from talks_par_speaker""").collect()
Scala code: Cassandra insert
Scala code: Cassandra insert
val connector = CassandraConnector(sc.getConf)
talksSchemaRdd.foreachPartition(partition => {
  connector.withSessionDo { session =>
    partition.foreach(r => session.execute(
      "UPDATE devoxx.talk_par_speaker USING TTL ? " +
      "SET type_talk=?, titre=?, annee=? " +
      "WHERE id_speaker = ?",
      86400, r.type_talk, r.titre,
      r.annee.asInstanceOf[java.lang.Integer], r.speaker))
  }
})
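The point of foreachPartition plus withSessionDo is to open one session per partition rather than one per row. A toy illustration in plain Python (dummy Session class and made-up data, not the driver's API):

```python
# Dummy stand-ins for the connector's session (illustrative only)
class Session:
    def __init__(self):
        self.executed = []
    def execute(self, stmt, *args):
        self.executed.append((stmt, args))

def with_session_do(work):
    session = Session()   # opened once per partition, reused for every row
    work(session)
    return session

partitions = [[("talk A", "s1")], [("talk B", "s1"), ("talk C", "s2")]]
sessions = [
    with_session_do(lambda s, p=part: [s.execute("UPDATE ...", *row) for row in p])
    for part in partitions
]
print([len(s.executed) for s in sessions])  # one session per partition: [1, 2]
```

Amortizing the session over a whole partition is what makes per-row statements (and the TTL trick above) affordable.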
“Demo time”
Step 2
Java-Fu
SchemaRDD nbTalkParSpeaker = cassandraSQLContext.sql(
    "SELECT B.nom_speaker as nom_speaker, A.annee as annee, " +
    "A.id_speaker as id_speaker " +
    "FROM devoxx.talk_par_speaker A JOIN devoxx.speakers B " +
    "ON A.id_speaker = B.id_speaker");
nbTalkParSpeaker.registerTempTable("tmp_talk_par_speaker");
cassandraSQLContext.sql(
    "INSERT INTO devoxx.speaker_par_annee " +
    "SELECT nom_speaker, annee, count(*) as nb, id_speaker " +
    "FROM tmp_talk_par_speaker " +
    "GROUP BY nom_speaker, annee, id_speaker").collect();
Java code
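What the join and GROUP BY compute, sketched in plain Python on made-up rows (no Spark; table contents are illustrative):

```python
from collections import Counter

# devoxx.talk_par_speaker rows: (id_speaker, annee)
talks = [("s1", 2014), ("s1", 2014), ("s1", 2015), ("s2", 2015)]
# devoxx.speakers: id_speaker -> nom_speaker
speakers = {"s1": "Alexander", "s2": "Maxence"}

# JOIN on id_speaker, then GROUP BY (nom_speaker, annee) with count(*)
nb = Counter((speakers[sid], annee) for sid, annee in talks)
print(sorted(nb.items()))
# [(('Alexander', 2014), 2), (('Alexander', 2015), 1), (('Maxence', 2015), 1)]
```

SparkSQL provides exactly the join and aggregation that CQL lacks, then the result is written straight back into a Cassandra table.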
./spark-submit \
  --class devoxx.Devoxx….. \
  --master spark://127.0.0.1:7077 \
  devoxxSparkSql.jar
Submit Java
“Demo time”
Step 3
Python-Fu
def split_keywords(row):
    ## splits each title into individual words (body elided on the slide)
    ...

rddTalk = sqlContext.sql("SELECT titre, speakers, annee, categorie, type_talk FROM devoxx.talk")
splitByKeywordRdd = rddTalk.flatMap(lambda r: split_keywords(r))
splitByKeywordRdd_schema = sqlContext.inferSchema(
    splitByKeywordRdd.filter(lambda word: len(word[0]) > 1)
                     .map(lambda x: Row(keyword=x[0], annee=x[1])))
splitByKeywordRdd_schema.registerTempTable("tmp_keywords")
keyword_count = sqlContext.sql("""SELECT keyword, annee, count(*) as nb
                                  FROM tmp_keywords
                                  GROUP BY keyword, annee""")
keyword_count_schema = sqlContext.inferSchema(keyword_count.map(lambda x: Row(...)))
keyword_count_schema.registerTempTable("tmp_keywords_count")
sqlContext.sql("""INSERT INTO devoxx.keyword_par_annee SELECT keyword, annee, nb
                  FROM tmp_keywords_count""")
Python code
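What the pipeline computes end to end, sketched in plain Python on made-up titles (the real split_keywords body is not shown on the slide; this stand-in is an assumption):

```python
from collections import Counter

# (titre, annee) pairs standing in for devoxx.talk rows
talks = [("Spark et Cassandra", 2015), ("Cassandra en production", 2014)]

def split_keywords(row):
    # stand-in for the elided slide function: one (word, annee) pair per word
    titre, annee = row
    return [(word.lower(), annee) for word in titre.split()]

pairs = [kw for row in talks for kw in split_keywords(row)]
# drop one-letter words, then count per (keyword, annee), like the GROUP BY
counts = Counter(p for p in pairs if len(p[0]) > 1)
print(counts[("cassandra", 2015)], counts[("cassandra", 2014)])  # 1 1
```

The result plays the role of the keyword_par_annee index table: keyword frequency per year, queryable directly from Cassandra afterwards.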
“Demo time”
That's it! Any questions?
SparkSQL et Cassandra - Tool In Action Devoxx 2015