Big Data Warehousing - January 7, 2015
Hosted by:
7:00 Networking
Grab some food and drink... Make some friends.
7:15 Leslie Linsner
Talent Manager
Caserta Concepts
Welcome + Intro + Swag
About the Meetup
About Caserta Concepts
7:30 Elliott Cordo
Chief Architect
Caserta Concepts
Introduction and Overview of Spark
Deep dive into SparkSQL
Demo of SparkSQL!
8:15 Q&A Ask Questions, Share your experience with
SparkSQL
8:45 More Networking
Don’t leave until you make at least one new Data Nerd friend!
Agenda
• Big Data is a complex, rapidly changing
landscape
• We want to share our stories and hear
about yours
• Great networking opportunity for like
minded data nerds
• Founded by Caserta Concepts
• November 10, 2012
• Next BDW Meetup:
• January 27th
• Topic: Graph Databases for MDM
• Location: TBD – Can you host us?
About the BDW Meetup Twitter: #BDWmeetup
#maximizeDataValue
@CasertaConcepts
About Caserta Concepts
• Award-winning technology innovation consulting with
expertise in:
• Big Data Solutions
• Data Warehousing
• Business Intelligence
• Core focus in the following industries:
• eCommerce / Retail / Marketing
• Financial Services / Insurance
• Healthcare / Ad Tech / Higher Ed
• Established in 2001:
• Increased growth year-over-year
• Industry recognized work force
• Strategy, Implementation
• Writing, Education, Mentoring
• Data Science & Analytics
• Data on the Cloud
• Data Interaction & Visualization
Does this word cloud excite you?
Speak with us about our open positions: leslie@casertaconcepts.com
Help Wanted
Storm
Big Data Architect Hbase
Cassandra
Free T-Shirts
Courtesy of Caserta Concepts
RAFFLE!!!
About SPARK!
• General Cluster Computing
• Born in UC Berkeley AMPLab around 2009
• Open sourced in 2010
• Donated to the Apache Software Foundation in 2013
• Became a top-level project in early 2014
More about Spark
• A Swiss army knife!
• Streaming, batch, and interactive
• RDD – Resilient Distributed Dataset
• APIs for Java, Scala, Python
Spark processing
• Lots of processing options:
• APIs in Java, Scala, Python
• GraphX
• Streaming
• MLlib
• SQL goodness
Current State of Spark
• Now in version 1.2
• ~175 active contributors
• Most Hadoop distros now support Spark, or are in the process of
integrating it
• Databricks is offering commercial support and fully
managed Spark Clusters
• Large number of Organizations using Spark
Recent Achievement: Gray Sort
Caserta Active Spark Project
• Interactive SQL on large datasets → financial services
• Big ETL - JSON crunching ETL pipelines → ad-tech
• Several others in R&D
Deploying Spark
• On-premise
• Databricks
• AWS EMR and EC2
SPARK on Elastic Map Reduce
• Not currently a packaged application (coming soon?) →
maybe AWS has other plans for Spark?
• Easily bootstrapped:
• https://github.com/awslabs/emr-bootstrap-actions
aws emr create-cluster --name SparkCluster --ami-version 3.2.1 \
  --instance-type m3.xlarge --instance-count 3 \
  --ec2-attributes KeyName=caserta-1 --applications Name=Hive \
  --bootstrap-actions Path=s3://support.elasticmapreduce/spark/install-spark,Args=["-v1.2.0.a"]
• Latest version is turn-key
• Just need to copy hive-site.xml if accessing Hive tables
• Minor issues with metastore when Impala and Parquet installed.
Spark can be run locally too!
• Easy for development
• Local development is exactly the same as submitting work on a
cluster!
• IPython Notebook, or your favorite IDE
• Install on your Mac with one command
brew install apache-spark
IPython Notebook
• Great interactive environment for performing analysis in
Python.
• Cloudera has good documentation on configuring for
YARN.
http://blog.cloudera.com/blog/2014/08/how-to-use-ipython-notebook-with-apache-spark/
• Hint - if you are installing locally with Homebrew,
“SPARK_HOME” will be:
/usr/local/Cellar/apache-spark/1.2.0/libexec/
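For a local notebook, one common way to make PySpark importable is to put the directories under SPARK_HOME on the Python path. The sketch below assumes the Homebrew location quoted in the hint and the usual Spark layout (a python directory plus a py4j archive under python/lib). The helper name pyspark_paths is hypothetical, and the py4j archive name varies by Spark release, so it is found with a glob rather than hard-coded.

```python
import glob
import os


def pyspark_paths(spark_home):
    """Return the entries to add to sys.path so that `import pyspark` works.

    Assumes the usual Spark layout: <SPARK_HOME>/python holds the pyspark
    package and <SPARK_HOME>/python/lib holds a py4j-*-src.zip archive.
    """
    python_dir = os.path.join(spark_home, "python")
    # The py4j version differs between Spark releases, so search for it.
    py4j_zips = sorted(glob.glob(os.path.join(python_dir, "lib", "py4j-*-src.zip")))
    return [python_dir] + py4j_zips


# Fall back to the Homebrew path from the hint (Spark 1.2.0) if unset.
spark_home = os.environ.get(
    "SPARK_HOME", "/usr/local/Cellar/apache-spark/1.2.0/libexec/")
paths = pyspark_paths(spark_home)
```

In a notebook cell, running `sys.path[:0] = paths` before `import pyspark` would then pick up the local install.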
So Why talk about Spark
• Many competing big data processing platforms, query
engines, etc
• Hadoop Map Reduce is fairly mature
..about Hadoop Map Reduce
• We can process very large datasets
• split processes across a large number of machines
• High recoverability/high safety → intermediate data is
written to disk
• Efficient and generally fast – move processing to data
• SQL on Hadoop via Hive
But map reduce has its downsides
• SLOW – disk based intermediate steps (local disk and
HDFS)
• Especially inefficient for iterative processing → like
machine learning
• Challenging to conduct interactive analysis → run job –
go get coffee
..about Spark
• In-memory – eliminates intermediate disk based storage
• Performs a generalized form of map-reduce → split
processes across a large number of machines
• Fast enough for interactive analysis
• Fault tolerant via lineage tracking
• SparkSQL!
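To make "fault tolerant via lineage tracking" concrete, here is a toy, single-machine sketch in plain Python. It is not Spark code: the ToyRDD class and its methods are hypothetical stand-ins. Each dataset remembers its parent and the function that produced it, so a lost in-memory result can be recomputed instead of being restored from a replica. Real Spark tracks lineage per partition and recomputes only the lost partitions.

```python
class ToyRDD:
    """A toy dataset that records how it was derived (its lineage)."""

    def __init__(self, source=None, parent=None, transform=None, name="source"):
        self._source = source        # only set on the root dataset
        self._parent = parent        # dataset this one was derived from
        self._transform = transform  # function applied to the parent's rows
        self._cache = None           # in-memory result, may be lost
        self.name = name

    def map(self, fn):
        return ToyRDD(parent=self, name="map",
                      transform=lambda rows: [fn(r) for r in rows])

    def filter(self, pred):
        return ToyRDD(parent=self, name="filter",
                      transform=lambda rows: [r for r in rows if pred(r)])

    def lineage(self):
        """Names of the steps from the root down to this dataset."""
        steps = [] if self._parent is None else self._parent.lineage()
        return steps + [self.name]

    def collect(self):
        if self._cache is None:
            # Nothing in memory (first use, or the copy was lost):
            # replay the recorded transformation on the parent.
            if self._parent is None:
                self._cache = list(self._source)
            else:
                self._cache = self._transform(self._parent.collect())
        return self._cache

    def lose_cache(self):
        """Simulate losing the in-memory copy, e.g. a failed worker."""
        self._cache = None


numbers = ToyRDD(source=range(10))
even_squares = numbers.map(lambda x: x * x).filter(lambda x: x % 2 == 0)

first = even_squares.collect()
even_squares.lose_cache()            # simulate a failure
recovered = even_squares.collect()   # rebuilt from lineage, not from a backup
```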
So do we still need Hadoop
No, but yes
• Why Hadoop?
• YARN
• HDFS
• Hadoop Map Reduce is mature and will still be appropriate for certain
workloads!
• Other services!
• But you can use other resource managers too:
• Mesos
• Spark Standalone
• And can work with other distributed file systems including:
• S3
• Gluster
• Tachyon
About SparkSQL
• Spark's SQL engine
• Brand new - emerged as alpha in Spark 1.0, so less than a year old
• Converts SQL into RDD operations
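As a rough illustration of what "converts SQL into RDD operations" means, the sketch below runs one query two ways: as SQL, and as the filter, map, and reduce-by-key steps such a query roughly corresponds to. It uses Python's built-in sqlite3 module and plain lists as stand-ins so that it runs without a cluster. The table, the data, and the function names are made up, and the plan Spark SQL actually produces will differ.

```python
import sqlite3
from collections import defaultdict

# Made-up (region, amount) rows.
sales = [("east", 120.0), ("west", 80.0), ("east", 30.0),
         ("west", 200.0), ("north", 5.0)]

QUERY = """
    SELECT region, SUM(amount) AS total
    FROM sales
    WHERE amount >= 30
    GROUP BY region
    ORDER BY region
"""


def run_as_sql(rows):
    """Evaluate the query with a SQL engine (sqlite3 as a stand-in)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
    conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
    result = conn.execute(QUERY).fetchall()
    conn.close()
    return result


def run_as_operations(rows):
    """Evaluate the same query as a chain of collection operations."""
    # WHERE -> filter
    kept = [r for r in rows if r[1] >= 30]
    # SELECT region, amount -> map to (key, value) pairs
    pairs = [(region, amount) for region, amount in kept]
    # GROUP BY + SUM -> reduce by key
    totals = defaultdict(float)
    for region, amount in pairs:
        totals[region] += amount
    # ORDER BY -> sort by key
    return sorted(totals.items())


sql_result = run_as_sql(sales)
ops_result = run_as_operations(sales)
```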
What happened to Shark
• Replaces the former Shark query engine
• All new Catalyst optimizer – Shark leveraged the Hive
optimizer
• Hadoop Map Reduce optimization rules were not
applicable
• Writing optimization rules made easy → more community
participation
We love SQL!
• Huge population of highly skilled developers and analysts
• Compatible with Tooling
• Many operations can easily and efficiently be expressed
in SQL
• Filters
• Joins
• Group by’s
• Aggregates
But sometimes SQL is not the best tool!
• Some operations do not fit SQL well
• Iteration
• Row-by-row processing
• Other operations that are not set-based/SQL oriented
Spark can help!
• Spark API
• MLlib – machine learning
Blend Spark SQL with other code in the same program
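The split between set-based and row-by-row work can be sketched as follows. SQL handles the set-based step (ordering the events), then ordinary code handles the step where each row depends on the previous one (assigning session numbers). sqlite3 stands in for Spark SQL here so the example runs anywhere. The event data, the 30-minute gap, and the function names are assumptions made for illustration.

```python
import sqlite3

# Made-up click events: (user_id, timestamp in minutes).
events = [("u1", 0), ("u1", 10), ("u1", 55), ("u2", 5), ("u2", 50), ("u1", 60)]
SESSION_GAP = 30  # assumed: minutes of inactivity that start a new session


def ordered_events(rows):
    """Set-based step: let SQL sort the events per user."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE events (user_id TEXT, ts INTEGER)")
    conn.executemany("INSERT INTO events VALUES (?, ?)", rows)
    result = conn.execute(
        "SELECT user_id, ts FROM events ORDER BY user_id, ts").fetchall()
    conn.close()
    return result


def assign_sessions(rows, gap=SESSION_GAP):
    """Row-by-row step: a row's session depends on the row before it."""
    out = []
    prev_user, prev_ts, session = None, None, 0
    for user_id, ts in rows:
        if user_id != prev_user:
            session = 1          # first event seen for this user
        elif ts - prev_ts > gap:
            session += 1         # long pause: start a new session
        out.append((user_id, ts, session))
        prev_user, prev_ts = user_id, ts
    return out


sessions = assign_sessions(ordered_events(events))
```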
How can you leverage SPARK SQL
• Batch ETL development
• Interactive
• Spark Shell (PySpark)
• Spark SQL CLI
• Thrift Server (JDBC)
• Beeline
• Query Platforms
• BI Tools
SPARK SQL can leverage the Hive
metastore
• Hive Metastore can also be leveraged by a wide array of
applications
• Spark
• Hive
• Impala
• Pig
• Available from HiveContext
The Basics of the Spark SQL API
• SparkContext – a connection to the Spark execution engine
• SchemaRDD – contains rows of data with named columns
(think spreadsheet)!
• HiveContext (superset of SQLContext) – SQL on Spark, access
to Hive Metastore
• inferSchema – apply a schema to an RDD of dictionary type
• jsonRDD/jsonFile – load JSON (an RDD of strings or a file) as a SchemaRDD
• registerTempTable – register a SchemaRDD as a temp table for
SQL fun
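The calls listed above fit together roughly as follows. This is a sketch against the Spark 1.2-era PySpark API named on the slide (SQLContext, inferSchema, registerTempTable). Later Spark releases replaced several of these calls, so check the documentation for the version in use. The function names, the sample lines, and the "people" table are hypothetical.

```python
def parse_line(line):
    """Turn one 'name, age' flat-file line into a dict record."""
    name, age = line.split(",")
    return {"name": name.strip(), "age": int(age)}


def adults_with_spark(lines, min_age=18):
    """Query the records with Spark SQL (Spark 1.2-era PySpark API).

    pyspark is imported here, not at module level, so the parsing code
    above can still be used on a machine without Spark installed.
    """
    from pyspark import SparkContext
    from pyspark.sql import SQLContext, Row

    sc = SparkContext("local", "spark-sql-basics")
    try:
        sql_context = SQLContext(sc)
        # Build an RDD of Row objects; inferSchema derives names and types.
        rows = sc.parallelize(lines).map(parse_line).map(lambda d: Row(**d))
        people = sql_context.inferSchema(rows)
        # Expose the SchemaRDD to SQL under a table name.
        people.registerTempTable("people")
        result = sql_context.sql(
            "SELECT name FROM people WHERE age >= %d" % min_age)
        # A SchemaRDD is still an RDD, so ordinary RDD calls can follow SQL.
        return result.map(lambda row: row.name).collect()
    finally:
        sc.stop()


# The parsing step runs without Spark.
sample_lines = ["ann, 34", "bob, 12", "cy, 18"]
records = [parse_line(line) for line in sample_lines]
```

With a Spark 1.2 installation on the Python path, `adults_with_spark(sample_lines)` would run the query on a local master.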
Spark SQL from Flat File
Spark SQL Loves JSON
Inferring Schema and Querying JSON
Another method – load directly in SQL
Spark SQL + Hive
And what about other data sources
Out of the box:
• Parquet
• JDBC
Spark 1.2 brings a data sources API:
• Much easier to develop new integrations
• New integrations underway → Cassandra, CSV, Avro
Where do we think SparkSQL is headed
• Spark in general will continue to gain momentum
• Increasing number of integrated data stores, file types etc
• Optimizer improvements → Catalyst should allow it to
evolve very quickly!
• Subsequent - Improvements for interactive SQL – better
performance, concurrency
DEMO Some Spark!
…
Awesome collection of AWS developed bootstrap actions:
https://github.com/awslabs/emr-bootstrap-actions
Will provide notebook and helpful scripts soon!
Remember → Jan 27 – Graph Databases
Resources
Elliott Cordo
Principal Consultant, Caserta
Concepts
P: (855) 755-2246 x267
E: elliott@casertaconcepts.com
info@casertaconcepts.com
1(855) 755-2246
www.casertaconcepts.com
Thank You
