Apache Spark
AN ENGINE FOR LARGE-SCALE DATA PROCESSING
Introducing myself…
• Mylène Reiners
• Architect @Atos
• Focus: innovation
Sketching the context
• Big Data
• New insights
• Analytics
• Data discovery
Sketching the context
• Hadoop
• Storing and managing data
Apache Spark
• Speed
• General purpose
Short demo in Scala (shell)
• Simple data analysis
• Read “README.md”
• Count the number of lines
Role of SparkContext (sc)
RDD
• Resilient Distributed Dataset
• Creation
• Transformations
• Actions
RDD
• Lazy
• Recomputed
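The two bullets above can be illustrated without Spark. The following is a minimal plain-Java sketch, not Spark's API: the class LazyDataset and its methods are hypothetical names invented for this illustration. A transformation (map) only records what to do, an action (count) triggers the work, and because nothing is cached here, every action recomputes the whole chain.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;
import java.util.function.Supplier;

// Hypothetical toy class for illustration only; this is NOT Spark's RDD API.
public class LazyDataset<T> {
    // The "lineage": a recipe that produces the elements when asked.
    private final Supplier<List<T>> recipe;

    private LazyDataset(Supplier<List<T>> recipe) {
        this.recipe = recipe;
    }

    // Creation: wrap an in-memory list.
    public static <T> LazyDataset<T> of(List<T> source) {
        return new LazyDataset<>(() -> source);
    }

    // Transformation: returns a new dataset and does no work yet.
    public <R> LazyDataset<R> map(Function<T, R> f) {
        return new LazyDataset<>(() -> {
            List<R> out = new ArrayList<>();
            for (T item : recipe.get()) {
                out.add(f.apply(item));
            }
            return out;
        });
    }

    // Action: runs the whole recipe, every time it is called.
    public long count() {
        return recipe.get().size();
    }

    public static void main(String[] args) {
        AtomicInteger calls = new AtomicInteger();
        LazyDataset<Integer> lengths = LazyDataset.of(Arrays.asList("a", "bb", "ccc"))
            .map(s -> { calls.incrementAndGet(); return s.length(); });

        System.out.println(calls.get()); // prints 0: map has not run yet
        lengths.count();
        lengths.count();
        System.out.println(calls.get()); // prints 6: three elements, recomputed for each of the two actions
    }
}
```

In this toy model, avoiding the second computation would mean storing the result of the first one, which is the role caching plays for an RDD that is reused.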
Java example (accumulator)
JavaRDD<String> rdd = sc.textFile(args[1]);
// Accumulator that counts blank lines while the RDD is processed
final Accumulator<Integer> blankLines = sc.accumulator(0);
JavaRDD<String> callSigns = rdd.flatMap(
  new FlatMapFunction<String, String>() {
    public Iterable<String> call(String line) {
      if (line.equals("")) {
        blankLines.add(1);
      }
      return Arrays.asList(line.split(" "));
    }
  });
callSigns.saveAsTextFile("output.txt");
Apache Spark stack
Spark SQL
• Interface for working with (semi-)structured data
Hive example
// Import Spark SQL
import org.apache.spark.sql.hive.HiveContext;
// Or if you can't have the Hive dependencies
import org.apache.spark.sql.SQLContext;
// Import the SchemaRDD
import org.apache.spark.sql.SchemaRDD;
import org.apache.spark.sql.Row;
(...)
JavaSparkContext ctx = new JavaSparkContext(...);
SQLContext hiveCtx = new HiveContext(ctx);
Hive example (cont’d)
SchemaRDD input = hiveCtx.jsonFile(inputFile);
// Register the input schema RDD
input.registerTempTable("tweets");
// Select tweets based on the retweetCount
SchemaRDD topTweets = hiveCtx.sql(
  "SELECT text, retweetCount FROM tweets ORDER BY " +
  "retweetCount LIMIT 10");
Spark Streaming
• Acting on data as soon as it arrives
• DStreams
Example
// Create a StreamingContext with a 1-second batch size from a SparkConf
JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(1));
// Create a DStream from all the input on port 7777
JavaDStream<String> lines = jssc.socketTextStream("localhost", 7777);
// Filter our DStream for lines with "error"
JavaDStream<String> errorLines = lines.filter(new Function<String, Boolean>() {
public Boolean call(String line) {
return line.contains("error");
}});
// Print out the lines with errors
errorLines.print();
Example
// Start our streaming context and wait for it
// to "finish"
jssc.start();
// Wait for the job to finish
jssc.awaitTermination();
GraphX
• Graph processing
Example
// Load the edges as a graph
val graph = GraphLoader.edgeListFile(sc,
"followers.txt")
// Run PageRank
val ranks = graph.pageRank(0.0001).vertices
// Join the ranks with the usernames
val users = sc.textFile("users.txt").map { line =>
val fields = line.split(",")
(fields(0).toLong, fields(1))
}
Example
val ranksByUsername = users.join(ranks).map {
  case (id, (username, rank)) => (username, rank)
}
// Print the result
println(ranksByUsername.collect().mkString("\n"))
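To make the pageRank call in the example above less of a black box, here is a minimal plain-Java sketch of the textbook PageRank power iteration. It is an illustration under assumptions, not the GraphX implementation: the class PageRankSketch, the damping factor 0.85, and the tiny three-node graph are all made up for this example, and GraphX's own convergence handling and rank scaling may differ. The value 0.0001 is used here, as in the call above, as a convergence tolerance.

```java
import java.util.Arrays;

// Hypothetical illustration of textbook PageRank; NOT the GraphX implementation.
public class PageRankSketch {

    // edges[i] = {source, target}; nodes are numbered 0..n-1.
    // Assumes every node has at least one outgoing edge (no dangling nodes).
    public static double[] pageRank(int n, int[][] edges, double damping, double tol) {
        int[] outDegree = new int[n];
        for (int[] e : edges) {
            outDegree[e[0]]++;
        }

        // Start with the rank spread evenly over all nodes.
        double[] ranks = new double[n];
        Arrays.fill(ranks, 1.0 / n);

        double change = Double.MAX_VALUE;
        while (change > tol) {
            double[] next = new double[n];
            Arrays.fill(next, (1.0 - damping) / n);
            // Each node passes an equal share of its rank along every outgoing edge.
            for (int[] e : edges) {
                next[e[1]] += damping * ranks[e[0]] / outDegree[e[0]];
            }
            // The largest per-node change decides whether to stop.
            change = 0.0;
            for (int i = 0; i < n; i++) {
                change = Math.max(change, Math.abs(next[i] - ranks[i]));
            }
            ranks = next;
        }
        return ranks;
    }

    public static void main(String[] args) {
        // Made-up graph: 0 -> 1, 1 -> 2, 2 -> 0, 0 -> 2
        int[][] edges = { {0, 1}, {1, 2}, {2, 0}, {0, 2} };
        double[] ranks = pageRank(3, edges, 0.85, 0.0001);
        System.out.println(Arrays.toString(ranks));
    }
}
```

In this made-up graph, node 2 is pointed to by both other nodes and ends up with the highest rank, while node 1 receives only half of node 0's share and ends up with the lowest.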
MLlib
• Machine learning
Thank you

Apache Spark part of Eindhoven Java Meetup