You don't need Functional Programming for Fun!
Cassandra and SparkSQL
Russell (left) and Cara (right)
• Software Engineer
• Spark-Cassandra Integration since Spark 0.9
• Cassandra since Cassandra 1.2
• 2 Year Scala Convert
• Still not comfortable talking about Monads in public
@Evanfchan
A Story in 3 Parts
• Why SparkSQL?
• The Spark SQL Thrift Server
• Writing SQL for Spark
You have lots of options, so why Spark SQL?
• Scala?
• Java?
Spark is a Powerful Analytics Tool Built on Scala
Distributed Analytics Platform with In-Memory Capabilities
Lots of new concepts:

RDDs
DataSets
Streaming
Serialization
Functional Programming
Functional Programming Is Awesome
Side-effect Free Functions
Monads
Easy
Parallelization
Anonymous Functions
Scala
Async Models
Type Matching
rdd.map(y => y+1)
Endofunctors
Functional Programming can be Hard
blah-blah blah
Blah
Easy
blahilization
baaaaah
blahala
Asybc blah
Blahblahhing
rdd.map(y => y+1)
Aren't Endofunctors from Ghostbusters?
Endofunctors
Practical considerations when devoting time to a new project.
Compile Time Type Safety! Catalyst! Tungsten! We get to learn all sorts of fun new things! SBT is probably great!
Usually Me Less Excitable Dev
We ship next week
Spark SQL Provides A Familiar and Easy API
Use SQL to access the Power of Spark
Spark Sql Provides A Familiar and Easy API
Catalyst
Codegen!
Optimization!
Predicate Pushdowns
Distributed
Work
SQL
It still takes Scala/Java/Python/… Code.
import org.apache.spark.sql.cassandra._

val df = spark
  .read
  .cassandraFormat("tab", "ks")
  .load

df.createTempView("tab")

spark.sql("SELECT * FROM tab").show
+---+---+---+
|  k|  c|  v|
+---+---+---+
|  1|  1|  1|
|  1|  2|  2|
Let me color code that by parts I like vs parts I don't like.
It still takes Scala/Java/Python/… Code.
import org.apache.spark.sql.cassandra._

val df = spark
  .read
  .cassandraFormat("tab", "ks")
  .load

df.createTempView("tab")

spark.sql("SELECT * FROM tab").show
+---+---+---+
|  k|  c|  v|
+---+---+---+
|  1|  1|  1|
|  1|  2|  2|
Also, your import has an underscore in it…
For exploration we have the Spark-SQL Shell
spark-sql> SELECT * FROM ks.tab;
1  2  2
1  3  3
For exploration we have the Spark-SQL Shell
spark-sql> SELECT * FROM ks.tab;
1  2  2
1  3  3
SparkSession
For exploration we have the Spark-SQL Shell
spark-sql> SELECT * FROM ks.tab;
1  2  2
1  3  3
SparkSession
Executor Executor Executor Executor Executor
Not really good for multiple users
spark-sql> SELECT * FROM ks.tab;
1  2  2
1  3  3
SparkSession
Executor Executor Executor Executor Executor
Enter Spark Thrift Server
Spark Sql Thrift Server
Executor Executor Executor Executor Executor
JDBC Client JDBC Client JDBC Client
The Spark Sql Thrift Server is a Spark Application
• Built on HiveServer2
• Single Spark Context
• Clients Communicate with it via JDBC
• Can use all SparkSQL
• Fair Scheduling
• Clients can share Cached Resources
• Security
Fair Scheduling is Sharing
FIFO
Time
Fair Scheduling is Sharing
FIFO
FAIR
Time
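Scheduling mode is ordinary Spark configuration rather than anything Thrift-Server-specific. A minimal sketch of enabling it at launch (the property names are standard Spark settings; the pool file path is a placeholder):

```
--conf spark.scheduler.mode=FAIR
--conf spark.scheduler.allocation.file=/path/to/fairscheduler.xml
```

With FAIR mode on, a short query from one client is no longer stuck behind a long-running query submitted earlier by another.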
Single Context can Share Cached Data
Spark Sql Thrift Server
Executor Executor Executor Executor Executor
CACHE TABLE today SELECT * FROM ks.tab WHERE date = today;
Single Context can Share Cached Data
Spark Sql Thrift Server
Executor Executor Executor Executor Executor
CACHED CACHED CACHED CACHED CACHED
CACHE TABLE today SELECT * FROM ks.tab WHERE date = today;
Single Context can Share Cached Data
Spark Sql Thrift Server
Executor Executor Executor Executor Executor
CACHED CACHED CACHED CACHED CACHED
CACHE TABLE today SELECT * FROM ks.tab WHERE date = today;
SELECT * FROM today WHERE age > 5
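Written out as the SQL two different JDBC clients might issue against the shared context (the table, column, and view names are the slide's example, not a real schema):

```sql
-- Client 1: cache today's slice once, inside the shared Spark context
CACHE TABLE today AS SELECT * FROM ks.tab WHERE date = today;

-- Client 2: hits the cached view instead of re-reading Cassandra
SELECT * FROM today WHERE age > 5;
```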
How to use it
Starts from the command line and can use all
Spark Submit Args
• ./sbin/start-thriftserver.sh
• dse spark-sql-thriftserver start
starting org.apache.spark.sql.hive.thriftserver.HiveThriftServer2
How to use it
Starts from the command line and can use all
Spark Submit Args
• ./sbin/start-thriftserver.sh
• dse spark-sql-thriftserver start
Use with all of your favorite Spark Packages
like the Spark Cassandra Connector!
--packages com.datastax.spark:spark-cassandra-connector_2.11:2.0.2
--conf spark.cassandra.connection.host=127.0.0.1
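Putting the two halves together, a launch command might look like this sketch (the connector version and host are the slide's example values):

```
./sbin/start-thriftserver.sh \
  --packages com.datastax.spark:spark-cassandra-connector_2.11:2.0.2 \
  --conf spark.cassandra.connection.host=127.0.0.1
```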
Hive? Wait, I thought we were doing Spark
starting org.apache.spark.sql.hive.thriftserver.HiveThriftServer2
Why does it say Hive
everywhere?
• Built on HiveServer2
A Brief History of the Spark Thrift Server
• Thrift?
• Hive?
They are not the Same
Cassandra Thrift Hive Thrift
Have you heard of the "Ship of Theseus"?
Time for a quick history
More Greek stuff…
When you replace all the parts of a thing Does it
Remain the Same?
Greek Boat
When you replace all the parts of a thing Does it
Remain the Same?
SharkServer
Hive Parser
Hive Optimization
Map-Reduce
Spark Execution
JDBC Results
When you replace all the parts of a thing Does it
Remain the Same?
SharkServer

ThriftServer
Hive Parser
Map-Reduce
Spark Execution
JDBC Results
Hive Optimization
When you replace all the parts of a thing Does it
Remain the Same?
ThriftServer
Hive Parser
Catalyst
Schema RDDs
Spark Execution
JDBC Results
When you replace all the parts of a thing Does it
Remain the Same?
ThriftServer
Hive Parser
Catalyst
Dataframes
Spark Execution
JDBC Results
When you replace all the parts of a thing Does it
Remain the Same?
ThriftServer
Hive Parser
Catalyst
DataSets
Spark Execution
JDBC Results
When you replace all the parts of a thing Does it
Remain the Same?
ThriftServer
Spark Parser
Catalyst
DataSets
Spark Execution
JDBC Results
Almost all Spark now
ThriftServer
Spark Parser
Catalyst
DataSets
Spark Execution
JDBC Results
Connecting with Beeline (JDBC Client)
./bin/beeline
dse beeline
!connect jdbc:hive2://localhost:10000
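Since the server speaks the HiveServer2 protocol, beeline is just one client among many. A minimal Scala sketch of a programmatic client (assuming the Hive JDBC driver is on the classpath; the endpoint and table are the slide's examples):

```scala
import java.sql.DriverManager

// Connect to the Spark SQL Thrift Server over HiveServer2's JDBC protocol
val conn = DriverManager.getConnection("jdbc:hive2://localhost:10000", "", "")
val stmt = conn.createStatement()
val rs   = stmt.executeQuery("SELECT k, c, v FROM ks.tab")
while (rs.next()) {
  println(s"${rs.getInt("k")} ${rs.getInt("c")} ${rs.getInt("v")}")
}
conn.close()
```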
Even More Hive!
Connect Tableau to Cassandra
The Full JDBC/ODBC Ecosystem Can Connect to
ThriftServer
Incremental Collect - Because some BI Tools are
Mean
SELECT * FROM
TABLE
Spark Sql Thrift Server
ALL THE DATA
Incremental Collect - Because some BI Tools are
Mean
SELECT * FROM
TABLE
Spark Sql Thrift Server
ALL THE DATA
OOM
Incremental Collect - Because some BI Tools are
Mean
SELECT * FROM
TABLE
Spark Sql Thrift Server

spark.sql.thriftServer.incrementalCollect=true
ALL THE DATA
Spark Partition 1 Spark Partition 2 Spark Partition 3
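Since this is a server-side setting, one way to enable it is at startup, sketched here with the same launcher as before:

```
./sbin/start-thriftserver.sh \
  --conf spark.sql.thriftServer.incrementalCollect=true
```

With it on, results stream back to the client one Spark partition at a time instead of being collected all at once on the server.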
Getting things done with SQL
• Registering Sources
• Writing to Tables
• Examining Query Plans
• Debugging Predicate pushdowns
• Caching Views
Registering Sources using SQL
CREATE TEMPORARY VIEW words
    USING format.goes.here
    OPTIONS (
        key "value"
    )
Registering Sources using SQL
CREATE TEMPORARY VIEW words
    USING org.apache.spark.sql.cassandra
    OPTIONS (
        table "tab",
        keyspace "ks")
Not a single monad…
CREATE TEMPORARY VIEW words
    USING org.apache.spark.sql.cassandra
    OPTIONS (
        table "tab",
        keyspace "ks")
Registering Sources using SQL
CassandraSourceRelation
We Can Still Use a HiveMetaStore
DSE auto-registers C* tables in a C*-based Metastore
MetaStore Thrift Server
Writing DataFrames using SQL
INSERT INTO arrow SELECT * FROM words;
CassandraSourceRelation
words
read
CassandraSourceRelation
arrow

write
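End to end, this assumes the write target was registered as a source first, e.g. (names follow the slide's example; the OPTIONS keys are the connector's table/keyspace settings):

```sql
CREATE TEMPORARY VIEW arrow
    USING org.apache.spark.sql.cassandra
    OPTIONS (table "arrow", keyspace "ks");

-- Both the read side (words) and the write side (arrow) are CassandraSourceRelations
INSERT INTO arrow SELECT * FROM words;
```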
Explain to Analyze Query Plans
EXPLAIN SELECT * FROM arrow WHERE c > 2;

Scan org.apache.spark.sql.cassandra.CassandraSourceRelation@6069193a
[k#18,c#19,v#20]
  PushedFilters: [IsNotNull(c), GreaterThan(c,2)],
  ReadSchema: struct<k:int,c:int,v:int>
We can analyze the inside of Catalyst just like with Scala/Java/…
Predicates get Pushed Down Automatically
EXPLAIN SELECT * FROM arrow WHERE c > 2;

Scan org.apache.spark.sql.cassandra.CassandraSourceRelation@6069193a
[k#18,c#19,v#20]
  PushedFilters: [IsNotNull(c), GreaterThan(c,2)],
  ReadSchema: struct<k:int,c:int,v:int>
CassandraSourceRelation Filter [GreaterThan(c,2)]
EXPLAIN SELECT * FROM arrow WHERE c > 2;

Scan org.apache.spark.sql.cassandra.CassandraSourceRelation@6069193a
[k#18,c#19,v#20]
  PushedFilters: [IsNotNull(c), GreaterThan(c,2)],
  ReadSchema: struct<k:int,c:int,v:int>
CassandraSourceRelation
Filter [GreaterThan(c,2)]
Internal Request to Cassandra: CQL
SELECT * FROM ks.bat WHERE C > 2
Automatic
Pushdowns!
Predicates get Pushed Down Automatically
Common Cases where Predicates Don't Push
SELECT * FROM troubles WHERE c < '2017-05-27'

*Filter (cast(c#76 as string) < 2017-05-27)
+- *Scan CassandraSourceRelation@53e82b30 [k#75,c#76,v#77]
      PushedFilters: [IsNotNull(c)],
      ReadSchema: struct<k:int,c:date,v:int>
Why is my date clustering column not being pushed down?
Common Cases where Predicates Don't Push
CassandraSourceRelation Filter [LessThan(c,'2017-05-27')]
SELECT * FROM troubles WHERE c < '2017-05-27'

*Filter (cast(c#76 as string) < 2017-05-27)
+- *Scan CassandraSourceRelation@53e82b30 [k#75,c#76,v#77]
      PushedFilters: [IsNotNull(c)],
      ReadSchema: struct<k:int,c:date,v:int>
Common Cases where Predicates Don't Push
CassandraSourceRelation

ReadSchema:
struct<k:int,c:date,v:int>
Filter [LessThan(c,'2017-05-27')]
Date != String
SELECT * FROM troubles WHERE c < '2017-05-27'

*Filter (cast(c#76 as string) < 2017-05-27)
+- *Scan CassandraSourceRelation@53e82b30 [k#75,c#76,v#77]
      PushedFilters: [IsNotNull(c)],
      ReadSchema: struct<k:int,c:date,v:int>
Make Sure we Cast Correctly
EXPLAIN SELECT * FROM troubles WHERE c < cast('2017-05-27' as date);

*Scan C*Relation PushedFilters: [IsNotNull(c), LessThan(c, 2017-05-27)]
CassandraSourceRelation

ReadSchema:
struct<k:int,c:date,v:int>
Filter [LessThan(c,Date('2017-05-27'))]
Date == Date
Make Sure we Cast Correctly
EXPLAIN SELECT * FROM troubles WHERE c < cast('2017-05-27' as date);

*Scan C*Relation PushedFilters: [IsNotNull(c), LessThan(c, 2017-05-27)]
CassandraSourceRelation

ReadSchema:
struct<k:int,c:date,v:int>
Filter [LessThan(c,Date('2017-05-27'))]
Automatic
Pushdowns!
DSE Search Automatic Pushdowns!
EXPLAIN SELECT * FROM troubles WHERE v < 6;

*Scan C*Relation PushedFilters: [IsNotNull(v), LessThan(v, 6)]
CassandraSourceRelation

ReadSchema:
struct<k:int,c:date,v:int>
Solr_Query
DSE Search Automatic Pushdowns!
DSE Search Automatic Pushdowns!
Count Happens in the Index
DSE

Continuous Paging
Cache a whole table
CassandraSourceRelation InMemoryRelation
CACHE TABLE ks.tab;

explain SELECT * FROM ks.tab;
== Physical Plan ==
InMemoryTableScan [k#0, c#1, v#2]
:  +- InMemoryRelation StorageLevel(disk, memory, deserialized, 1 replicas), `ks`.`tab`
:     :  +- *Scan CassandraSourceRelation
Uncache
CassandraSourceRelation
UNCACHE TABLE ks.tab;
explain SELECT * FROM ks.tab;
== Physical Plan ==
*Scan CassandraSourceRelation
Cache a fraction of Data
CassandraSourceRelation
CACHE TABLE somedata SELECT * FROM ks.tab WHERE c > 2;

explain SELECT * from somedata;
== Physical Plan ==
InMemoryTableScan
:  +- InMemoryRelation `somedata`
:     :  +- *Scan CassandraSourceRelation PushedFilters: [IsNotNull(c), GreaterThan(c,2)]
Filter [GreaterThan(c,2)]
InMemoryRelation
somedata
Let this be a starting point
• https://github.com/datastax/spark-cassandra-connector
• https://github.com/datastax/spark-cassandra-connector/blob/master/doc/14_data_frames.md
• https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/spark-sql-thrift-server.html
• http://docs.datastax.com/en/dse/5.1/dse-dev/datastax_enterprise/spark/sparkSqlThriftServer.html
• https://spark.apache.org/docs/latest/sql-programming-guide.html#distributed-sql-engine
• https://www.datastax.com/dev/blog/dse-5-1-automatic-optimization-of-spark-sql-queries-using-dse-search
• https://www.datastax.com/dev/blog/dse-continuous-paging-tuning-and-support-guide
Thank You.
http://www.russellspitzer.com/

@RussSpitzer
Come chat with us at DataStax Academy: 

https://academy.datastax.com/slack
