Spark
Spark 
- Summit 
- News 
- Basics 
- Advanced 
- Subprojects 
- Use Cases 
- Resources
Summit 
- 1,164 participants from over 453 companies attended
- Spark Training sold out at 300 participants 
- 31 organizations sponsored the event 
- 12 keynotes and 52 community presentations were given
News 
- Project 
- Databricks
Project 
- 1.0.0 release 
- Graduated from the Apache Incubator
- Very active community
Very active community 
- Top three Apache projects 
- Most active Big Data project 
- > 50 companies 
- > 250 contributors 
- > 175,000 LOC
Databricks 
- Certification 
- Cloud
Certification 
- Every certified app will run on every certified distribution
- Distribution Partners 
- App Partners
Distribution Partners 
- Cloudera 
- MapR 
- Hortonworks 
- Pivotal 
- IBM 
- Amazon Web Services 
- SAP
App Partners 
- Alteryx 
- DataStax
- 0xdata 
- Typesafe 
- Zoomdata
Cloud 
- Vision: Make Big Data Easy! 
- Product: Badass 
- Hosted Platform 
- Cluster Management 
- Interactive Workspace
Interactive Workspace 
- Notebooks 
- Dashboards 
- Jobs
Dashboards 
- WYSIWYG Builder 
- Interactive plots 
- One-click publishing
Spark Basics 
- Execution 
- RDDs 
- Caching 
- Broadcast 
- Languages
Execution 
- Apply Functional Operators across Distributed Collections
- Master / Worker 
- Lazy 
- Parallelize with Threads first
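The execution model on this slide can be sketched in plain Python. This is a minimal illustration only, not Spark's API: `LazyCollection` and its methods are hypothetical names, each inner list stands in for one partition, and threads stand in for workers. Transformations (`map`, `filter`) only record a step; the action (`collect`) runs the recorded pipeline on every partition.

```python
from concurrent.futures import ThreadPoolExecutor


class LazyCollection:
    """Hypothetical stand-in for a distributed collection (not Spark's RDD class)."""

    def __init__(self, partitions, ops=()):
        self.partitions = partitions  # list of lists, one per "worker"
        self.ops = ops                # recorded transformations, not yet run

    def map(self, fn):
        # Transformation: record the step and return a new collection.
        return LazyCollection(self.partitions, self.ops + (("map", fn),))

    def filter(self, pred):
        # Transformation: record the step and return a new collection.
        return LazyCollection(self.partitions, self.ops + (("filter", pred),))

    def _run_partition(self, part):
        # Apply the recorded steps, in order, to a single partition.
        items = iter(part)
        for kind, fn in self.ops:
            items = map(fn, items) if kind == "map" else filter(fn, items)
        return list(items)

    def collect(self):
        # Action: run the pipeline with one thread per partition.
        workers = max(1, len(self.partitions))
        with ThreadPoolExecutor(max_workers=workers) as pool:
            results = pool.map(self._run_partition, self.partitions)
        return [x for part in results for x in part]


calls = []


def square(x):
    calls.append(x)  # lets us observe when the function actually runs
    return x * x


numbers = LazyCollection([[1, 2, 3], [4, 5, 6]])
evens_squared = numbers.map(square).filter(lambda v: v % 2 == 0)
calls_before_action = list(calls)   # still empty: nothing has executed yet
result = evens_squared.collect()    # the action triggers the work
```

Nothing is computed until `collect()` is called, which is the "Lazy" bullet in miniature.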
RDDs 
- Interface for dataset 
- Backed by anything 
- Any InputFormat class 
- HDFS default
Caching 
- Store intermediate results in memory
- Partition-locality
- Significant speed-up for iterative algorithms
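Why caching helps iterative algorithms can be shown with a small plain-Python sketch. The names here (`CachedDataset`, `expensive_load`, `iterate`) are hypothetical and this is not Spark's caching implementation; it only counts how often a partition has to be recomputed when the same data is visited once per iteration.

```python
class CachedDataset:
    """Hypothetical helper: computes partitions on demand, optionally keeping them in memory."""

    def __init__(self, load_partition, num_partitions, cache=False):
        self.load_partition = load_partition
        self.num_partitions = num_partitions
        self.cache = cache
        self._store = {}   # partition index -> materialized data
        self.loads = 0     # how many times a partition was recomputed

    def partition(self, index):
        if self.cache and index in self._store:
            return self._store[index]
        self.loads += 1
        data = self.load_partition(index)
        if self.cache:
            self._store[index] = data
        return data

    def sum(self):
        return sum(sum(self.partition(i)) for i in range(self.num_partitions))


def expensive_load(index):
    # Stand-in for reading and parsing one input split.
    return [index * 10 + offset for offset in range(10)]


def iterate(dataset, rounds):
    # An iterative algorithm touches the same data once per round.
    return [dataset.sum() for _ in range(rounds)]


uncached = CachedDataset(expensive_load, num_partitions=4)
cached = CachedDataset(expensive_load, num_partitions=4, cache=True)
totals_uncached = iterate(uncached, rounds=5)
totals_cached = iterate(cached, rounds=5)
```

Both runs produce the same totals, but the uncached dataset reloads every partition in every round, while the cached one loads each partition once.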
Broadcast 
- Send immutable object to all workers
- Similar to DistributedCache in MapReduce
Languages 
- Scala 
- Python 
- Java 7 
- Java 8 
- R 
- Clojure
Advanced 
- Partitioning 
- Persistence Options 
- Checkpointing 
- Accumulators 
- Optimizations
Subprojects 
- SparkSQL 
- Tachyon 
- Spark Streaming 
- MLlib
- GraphX 
- BlinkDB 
- Spark Job Server
SparkSQL 
- Replaces Shark 
- Core 
- Catalyst 
- Libraries
Core 
- SchemaRDDs 
- Query Execution 
- Caching
Catalyst 
- Relational algebra 
- Expressions / UDFs 
- Query Planning 
- Optimizer
Libraries 
- POJOs 
- JDBC 
- JSON 
- Parquet 
- Hive
Hive 
- Catalog info from Metastore 
- Helps connect UIs like MicroStrategy / Tableau
- Wrappers for UDFs, UDAFs, UDTFs
- Supports TRANSFORM 
- Supports SerDes
Tachyon 
- In Memory (Off-Heap) Distributed Datastore
- Change URI from hdfs:// to tachyon://
- Share datasets between jobs without HDFS
- Helps scaling by off-loading allocation responsibility and GC pauses from executor processes
Spark Streaming 
- Real-time streams 
- Micro-batching 
- Windowed Computations
- Lambda Architecture
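Micro-batching and windowed computations can be illustrated without Spark. The sketch below is plain Python with hypothetical function names (`micro_batches`, `windowed_counts`): events are assumed to arrive as `(timestamp_in_seconds, value)` pairs, get cut into fixed-length batches, and a count is then kept over a sliding window of the most recent batches.

```python
from collections import Counter, deque


def micro_batches(events, batch_seconds):
    """Group (timestamp, value) events into consecutive fixed-length batches."""
    batches = {}
    for timestamp, value in events:
        batches.setdefault(int(timestamp // batch_seconds), []).append(value)
    if not batches:
        return []
    first, last = min(batches), max(batches)
    # Emit empty batches too, so every interval produces a result.
    return [batches.get(i, []) for i in range(first, last + 1)]


def windowed_counts(batches, window_batches):
    """Count values over a sliding window covering the last `window_batches` batches."""
    window = deque(maxlen=window_batches)  # old batches fall out automatically
    results = []
    for batch in batches:
        window.append(Counter(batch))
        total = Counter()
        for counts in window:
            total.update(counts)
        results.append(dict(total))
    return results


events = [(0.2, "a"), (0.9, "b"), (1.1, "a"), (3.5, "a"), (3.7, "b")]
batches = micro_batches(events, batch_seconds=1)
counts = windowed_counts(batches, window_batches=2)
```

Each entry of `counts` is the result emitted at the end of one batch interval, covering that batch and the one before it.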
MLlib 
- Summary statistics 
- Regression 
- Classification 
- Clustering 
- Collaborative Filtering 
- Optimization 
- Dimensionality Reduction
GraphX 
- Graph, VertexRDD, EdgeRDD objects and operations
- Pregel API 
- mapReduceTriplets List<V,E,V> 
- Graph analytics libraries
Graph analytics libraries 
- ConnectedComponents 
- PageRank 
- TriangleCount 
- ShortestPaths 
- SVDPlusPlus
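To make the Pregel-style "think like a vertex" model concrete, here is a minimal plain-Python sketch of connected components. It is not GraphX code; the function name and the message-passing layout are assumptions made for illustration. In each superstep a vertex reads its incoming messages, keeps the smallest vertex id it has seen, and notifies its neighbors only if its label changed.

```python
def connected_components(vertices, edges, max_supersteps=50):
    """Label every vertex with the smallest vertex id reachable from it."""
    neighbors = {v: set() for v in vertices}
    for src, dst in edges:
        neighbors[src].add(dst)
        neighbors[dst].add(src)

    labels = {v: v for v in vertices}
    # Superstep 0: every vertex sends its own id to its neighbors.
    inbox = {v: [labels[n] for n in neighbors[v]] for v in vertices}

    for _ in range(max_supersteps):
        outbox = {v: [] for v in vertices}
        changed = False
        for v, messages in inbox.items():
            # Vertex program: adopt the smallest id received, if it is smaller.
            if messages and min(messages) < labels[v]:
                labels[v] = min(messages)
                changed = True
                for n in neighbors[v]:
                    outbox[n].append(labels[v])
        if not changed:
            break  # no vertex updated, so the computation has converged
        inbox = outbox
    return labels


components = connected_components(
    vertices=[1, 2, 3, 4, 5, 6],
    edges=[(1, 2), (2, 3), (4, 5)],
)
```

Vertices that end up with the same label belong to the same component; an isolated vertex keeps its own id.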
BlinkDB 
- Get estimated results 
- Time bound 
- Error bound
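The idea of trading exactness for speed with a stated error bound can be sketched as follows. This is an illustration in plain Python, not BlinkDB's query engine: it assumes a simple uniform random sample and a normal-approximation 95% confidence interval, and `approximate_mean` is a hypothetical helper.

```python
import math
import random
import statistics


def approximate_mean(values, sample_size, seed=0, z=1.96):
    """Estimate the mean from a uniform random sample.

    Returns (estimate, error_bound), where error_bound is the half-width of an
    approximate 95% confidence interval under a normal approximation.
    """
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    sample = rng.sample(values, sample_size)
    estimate = statistics.mean(sample)
    std_error = statistics.stdev(sample) / math.sqrt(sample_size)
    return estimate, z * std_error


population = list(range(1, 10001))
estimate, error_bound = approximate_mean(population, sample_size=1000)
exact = statistics.mean(population)  # full scan, for comparison only
```

Reading 10% of the rows yields an answer of the form `estimate ± error_bound`; a larger sample tightens the bound at the cost of more time.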
Spark Job Server 
- Runs multiple jobs / contexts in same process
- Allows for RDD Caching / Sharing between jobs
- Job Persistence
Use Cases 
- Spotify 
- Real-time Auctions - Sharethrough
- Real-time Recommendations - Graphflow 
- Cancer Genomics - AMPLab 
- Malware Detection - F-Secure 
- Media Distribution Analytics - NBC Universal 
- Personal Fitness - Jawbone 
- Neuroscience - HHMI
Resources 
- Code 
- Event 
- Technology 
- Videos
Code 
- https://github.com/apache/spark
Event 
- spark-summit.org 
- http://arjon.es/2014/06/30/spark-summit-2014-day-1/
- https://www.crowdchat.net/chat/c3BvdF9vYmpfODc=
- https://nathanbrixius.wordpress.com/2014/07/02/spark-summit-keynote-notes/
- http://thomaswdinsmore.com/2014/07/03/spark-summit-2014-roundup/
Technology 
- Learning Spark (O'Reilly eBook) 
- www.spark-stack.org 
- ampcamp.berkeley.edu 
- https://amplab.cs.berkeley.edu/2013/10/23/got-a-minute-spin-up-a-spark-cluster-on-your-laptop-with-docker/
YouTube 
- AMPLab https://www.youtube.com/channel/UCWudC4d9i-2yxR5tuen-Nuw
- Databricks https://www.youtube.com/channel/UC3q8O3Bh2Le8Rj1-Q-_UUbA
- Apache Spark https://www.youtube.com/channel/UCRzsq7k4-kT-h3TDUBQ82-w

Austin Data Meetup 092014 - Spark
