Spark SQL

Big Data Warehousing - January 7, 2015
Hosted by:

Agenda
7:00 Networking
Grab some food and drink... Make some friends.
7:15 Leslie Linsner, Talent Manager, Caserta Concepts
Welcome + Intro + Swag
• About the Meetup
• About Caserta Concepts
7:30 Elliott Cordo, Chief Architect, Caserta Concepts
• Introduction and Overview of Spark
• Deep dive into SparkSQL
• Demo of SparkSQL!
8:15 Q&A
Ask questions, share your experience with SparkSQL
8:45 More Networking
Don’t leave until you make at least one new Data Nerd friend!

About the BDW Meetup
• Big Data is a complex, rapidly changing landscape
• We want to share our stories and hear about yours
• Great networking opportunity for like-minded data nerds
• Founded by Caserta Concepts on November 10, 2012
• Next BDW Meetup: January 27th
  • Topic: Graph Databases for MDM
  • Location: TBD – Can you host us?
Twitter: #BDWmeetup #maximizeDataValue @CasertaConcepts

About Caserta Concepts
• Award-winning technology innovation consulting with
expertise in:
• Big Data Solutions
• Data Warehousing
• Business Intelligence
• Core focus in the following industries:
• eCommerce / Retail / Marketing
• Financial Services / Insurance
• Healthcare / Ad Tech / Higher Ed
• Established in 2001:
• Increased growth year-over-year
• Industry recognized work force
• Strategy, Implementation
• Writing, Education, Mentoring
• Data Science & Analytics
• Data on the Cloud
• Data Interaction & Visualization

Help Wanted
Does this word cloud excite you?
[Word cloud, including: Storm, Big Data Architect, HBase, Cassandra]
Speak with us about our open positions: leslie@casertaconcepts.com

Free T-Shirts
Courtesy of Caserta Concepts
RAFFLE!!!

About SPARK!
• General Cluster Computing
• Born in the UC Berkeley AMPLab around 2009
• Open sourced in 2010
• Donated to the Apache Software Foundation in 2013
• Became a top-level Apache project in early 2014

More about Spark
• A Swiss army knife!
• Streaming, batch, and interactive
• RDD – Resilient Distributed Dataset
• APIs for Java, Scala, Python

Spark processing
• Lots of processing options:
• APIs in Java, Scala, Python
• GraphX
• Streaming
• MLlib
• SQL goodness

Current State of Spark
• Now in version 1.2
• ~175 active contributors
• Most Hadoop distros now support Spark or are in the process of integrating it
• Databricks is offering commercial support and fully managed Spark clusters
• Large number of organizations using Spark

Caserta Active Spark Projects
• Interactive SQL on large datasets → financial services
• Big ETL: JSON-crunching ETL pipelines → ad tech
• Several others in R&D

Deploying Spark
• On-premise
• Databricks
• AWS EMR and EC2

Spark on Elastic MapReduce
• Not currently a packaged application (coming soon?). Maybe AWS has other plans for Spark?
• Easily bootstrapped:
• https://github.com/awslabs/emr-bootstrap-actions
aws emr create-cluster --name SparkCluster --ami-version 3.2.1 \
  --instance-type m3.xlarge --instance-count 3 \
  --ec2-attributes KeyName=caserta-1 --applications Name=Hive \
  --bootstrap-actions Path=s3://support.elasticmapreduce/spark/install-spark,Args=["-v1.2.0.a"]
• Latest version is turn-key
• Just need to copy hive-site.xml if accessing Hive tables
• Minor issues with the metastore when Impala and Parquet are installed

Spark can be run locally too!
• Easy for development
• Local development is exactly the same as submitting work on a
cluster!
• IPython Notebook, or your favorite IDE
• Install on your Mac with one command
brew install apache-spark

IPython Notebook
• Great interactive environment for performing analysis in Python.
• Cloudera has good documentation on configuring it for YARN:
http://blog.cloudera.com/blog/2014/08/how-to-use-ipython-notebook-with-apache-spark/
• Hint: if you are installing locally with Homebrew, SPARK_HOME will be:
/usr/local/Cellar/apache-spark/1.2.0/libexec/

So why talk about Spark?
• Many competing big data processing platforms, query engines, etc.
• Hadoop Map Reduce is fairly mature

..about Hadoop Map Reduce
• We can process very large datasets
• Split processing across a large number of machines
• High recoverability/high safety → intermediate data is written to disk
• Efficient and generally fast – move processing to the data
• SQL on Hadoop via Hive
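
The map and reduce phases described above can be sketched in a few lines of plain Python. This is a toy model only, not Hadoop code: the sample documents, the function names, and the temp-file handoff are assumptions made up to show why persisting intermediate data between phases is safe but costly.

```python
import json
import os
import tempfile
from collections import defaultdict

# Invented sample input: one "document" per line.
documents = ["spark sql is fast", "hive sql is mature", "spark is in memory"]


def map_phase(lines):
    """Emit (word, 1) pairs, as a word-count mapper would."""
    return [(word, 1) for line in lines for word in line.split()]


def reduce_phase(pairs):
    """Group pairs by key (the shuffle) and sum the counts for each word."""
    grouped = defaultdict(int)
    for word, count in pairs:
        grouped[word] += count
    return dict(grouped)


with tempfile.TemporaryDirectory() as workdir:
    intermediate = os.path.join(workdir, "map-output.json")

    # The map output is written to disk before the reduce starts. That makes
    # recovery simple, and it is also what slows down multi-step jobs.
    with open(intermediate, "w") as fh:
        json.dump(map_phase(documents), fh)

    with open(intermediate) as fh:
        word_counts = reduce_phase(json.load(fh))
```

In a real cluster the map and reduce functions run on many machines and the handoff goes through local disk and HDFS; the single temp file here only stands in for that step.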

But MapReduce has its downsides
• SLOW – disk-based intermediate steps (local disk and HDFS)
• Especially inefficient for iterative processing → like machine learning
• Challenging to conduct interactive analysis → run job – go get coffee

..about Spark
• In-memory – eliminates intermediate disk-based storage
• Performs a generalized form of map-reduce → splits processing across a large number of machines
• Fast enough for interactive analysis
• Fault tolerant via lineage tracking
• SparkSQL!
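
Fault tolerance via lineage can be illustrated with a toy model in plain Python. This is a sketch only, not Spark's implementation or API: the ToyDataset class and its methods are hypothetical names invented for the example. Each dataset remembers its parent and the function that derived it, so a lost in-memory result can be rebuilt by replaying the chain rather than restored from a disk copy.

```python
class ToyDataset:
    """Hypothetical stand-in for an RDD: data plus the recipe that produced it."""

    def __init__(self, source=None, parent=None, transform=None):
        self._source = source        # only set for the root dataset
        self._parent = parent        # lineage: where this dataset came from
        self._transform = transform  # lineage: how it was derived
        self._cache = None           # in-memory result, may be lost

    def map(self, fn):
        return ToyDataset(parent=self, transform=lambda rows: [fn(r) for r in rows])

    def filter(self, pred):
        return ToyDataset(parent=self, transform=lambda rows: [r for r in rows if pred(r)])

    def collect(self):
        if self._cache is None:  # nothing in memory: recompute from lineage
            if self._parent is None:
                self._cache = list(self._source)
            else:
                self._cache = self._transform(self._parent.collect())
        return self._cache

    def lose_cache(self):
        """Simulate a failed worker dropping its in-memory data."""
        self._cache = None


numbers = ToyDataset(source=range(1, 11))
even_squares = numbers.map(lambda x: x * x).filter(lambda x: x % 2 == 0)

first = even_squares.collect()
even_squares.lose_cache()            # simulated failure
recovered = even_squares.collect()   # rebuilt by replaying the recorded steps
```

Only the lost result is recomputed; the parent datasets that are still in memory are reused as they are.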

So do we still need Hadoop?
No, but yes
• Why Hadoop?
  • YARN
  • HDFS
  • Hadoop MapReduce is mature and will still be appropriate for certain workloads!
  • Other services!
• But you can use other resource managers too:
  • Mesos
  • Spark Standalone
• And Spark can work with other distributed file systems, including:
  • S3
  • Gluster
  • Tachyon

About SparkSQL
• Spark's SQL engine
• Brand new: emerged as an alpha component in Spark 1.0, ~1 year old
• Converts SQL into RDD operations
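
To make "converts SQL into RDD operations" concrete, here is a rough sketch in plain Python of how a simple aggregate query maps onto filter, map, and reduce-by-key steps. It does not use Spark and does not show what Catalyst actually generates: the table, the column names, and the reduce_by_key helper are assumptions made up for illustration, and Python's built-in sqlite3 module is used only to cross-check the result.

```python
import sqlite3

# Invented sample data: (department, amount)
sales = [("books", 12.0), ("games", 30.0), ("books", 8.0), ("games", 5.0), ("music", 20.0)]

QUERY = "SELECT dept, SUM(amount) FROM sales WHERE amount > 6 GROUP BY dept ORDER BY dept"


def reduce_by_key(pairs, fn):
    """Combine values that share a key, in the spirit of an RDD reduceByKey."""
    acc = {}
    for key, value in pairs:
        acc[key] = fn(acc[key], value) if key in acc else value
    return acc


# The same query expressed as dataset operations:
filtered = [row for row in sales if row[1] > 6]          # WHERE amount > 6
keyed = [(dept, amount) for dept, amount in filtered]    # choose key and value
totals = reduce_by_key(keyed, lambda a, b: a + b)        # GROUP BY dept, SUM(amount)
ops_result = sorted(totals.items())                      # ORDER BY dept

# Cross-check against a SQL engine (sqlite3 as a stand-in, not Spark SQL).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (dept TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)", sales)
sql_result = conn.execute(QUERY).fetchall()
conn.close()
```

Both paths give the same per-department totals; the point is only that each SQL clause has a counterpart among the dataset operations.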

What happened to Shark?
• Spark SQL replaces the Shark query engine
• All-new Catalyst optimizer – Shark leveraged the Hive optimizer
• Hadoop MapReduce optimization rules were not applicable
• Writing optimization rules made easy → more community participation

We love SQL!
• Huge population of highly skilled developers and analysts
• Compatible with tooling
• Many operations can easily and efficiently be expressed in SQL
  • Filters
  • Joins
  • Group bys
  • Aggregates
But sometimes SQL is not the best tool!
• Some operations do not fit SQL well
  • Iteration
  • Row-by-row processing
  • Other operations that are not set-based/SQL oriented
Spark can help!
• Spark API
• MLlib – machine learning
Blend Spark SQL with other code in the same program
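
The idea of blending SQL with other code can be sketched as follows. This is plain Python with sqlite3 standing in for the SQL engine, not Spark: the sample data and the longest_growth_streak helper are invented for illustration. The set-based aggregate is done in SQL, and the stateful row-by-row step is done in ordinary code within the same program.

```python
import sqlite3

# Invented sample data: (day, store, revenue)
rows = [
    (1, "A", 100.0), (1, "B", 50.0),
    (2, "A", 120.0), (2, "B", 40.0),
    (3, "A", 90.0), (3, "B", 40.0),
    (4, "A", 130.0), (4, "B", 80.0),
    (5, "A", 140.0), (5, "B", 90.0),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (day INTEGER, store TEXT, revenue REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)

# Set-based step: a group-by aggregate is a natural fit for SQL.
daily_totals = conn.execute(
    "SELECT day, SUM(revenue) FROM sales GROUP BY day ORDER BY day"
).fetchall()
conn.close()


def longest_growth_streak(totals):
    """Row-by-row step: longest run of consecutive day-over-day increases."""
    best = current = 0
    previous = None
    for _, total in totals:
        if previous is not None and total > previous:
            current += 1
            best = max(best, current)
        else:
            current = 0
        previous = total
    return best


streak = longest_growth_streak(daily_totals)
```

In a Spark program the query result would stay distributed and the follow-up step would use the Spark API or MLlib instead of a local loop.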

How can you leverage Spark SQL?
• Batch ETL development
• Interactive
• Spark Shell (PySpark)
• Spark SQL CLI
• Thrift Server (JDBC)
• Beeline
• Query Platforms
• BI Tools

Spark SQL can leverage the Hive metastore
• Hive Metastore can also be leveraged by a wide array of
applications
• Spark
• Hive
• Impala
• Pig
• Available from HiveContext

The Basics of the Spark SQL API
• SparkContext – a connection to the Spark execution engine
• SchemaRDD – contains rows of data with named columns (think spreadsheet)!
• HiveContext (superset of SQLContext) – SQL on Spark, access to the Hive Metastore
• inferSchema – apply a schema to an RDD of dictionaries
• jsonRDD/jsonFile – load JSON data as a SchemaRDD
• registerTempTable – register a SchemaRDD as a temp table for SQL fun

Inferring Schema and Querying JSON
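
A rough illustration of the workflow in plain Python: infer column names and types from JSON records, register them as a temporary table, and query the table with SQL. This is not the Spark SQL API. The infer_schema helper is hypothetical, the sample records are invented, and sqlite3 stands in for the SQL engine.

```python
import json
import sqlite3

# Invented sample input: one JSON object per line, fields may be missing.
json_lines = [
    '{"name": "Ann", "age": 34, "city": "NYC"}',
    '{"name": "Bob", "age": 27}',
    '{"name": "Cy", "age": 41, "city": "Boston"}',
]

SQL_TYPES = {str: "TEXT", int: "INTEGER", float: "REAL", bool: "INTEGER"}


def infer_schema(records):
    """Hypothetical helper: union of all keys, typed by the first value seen."""
    schema = {}
    for record in records:
        for key, value in record.items():
            schema.setdefault(key, SQL_TYPES.get(type(value), "TEXT"))
    return schema


records = [json.loads(line) for line in json_lines]
schema = infer_schema(records)

conn = sqlite3.connect(":memory:")
columns = ", ".join(f"{name} {sql_type}" for name, sql_type in schema.items())
conn.execute(f"CREATE TEMP TABLE people ({columns})")

# Fields missing from a record are stored as NULL.
placeholders = ", ".join("?" for _ in schema)
conn.executemany(
    f"INSERT INTO people VALUES ({placeholders})",
    [tuple(r.get(name) for name in schema) for r in records],
)

older_than_30 = conn.execute(
    "SELECT name, city FROM people WHERE age > 30 ORDER BY name"
).fetchall()
conn.close()
```

The steps line up with the API terms from the previous slide: schema inference, registering a temp table, then running SQL against it.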

Another method – load directly in SQL

And what about other data sources?
Out of the box:
• Parquet
• JDBC
Spark 1.2 brings a data sources API:
• Much easier to develop new integrations
• New integrations underway → Cassandra, CSV, Avro

Where do we think SparkSQL is headed?
• Spark in general will continue to gain momentum
• Increasing number of integrated data stores, file types, etc.
• Optimizer improvements → Catalyst should allow it to evolve very quickly!
• Subsequently, improvements for interactive SQL: better performance, concurrency

Resources
Awesome collection of AWS-developed bootstrap actions:
https://github.com/awslabs/emr-bootstrap-actions
Will provide notebook and helpful scripts soon!
Remember → Jan 27 – Graph Databases

Thank You
Elliott Cordo
Principal Consultant, Caserta Concepts
P: (855) 755-2246 x267
E: elliott@casertaconcepts.com
info@casertaconcepts.com
1 (855) 755-2246
www.casertaconcepts.com
