INTRO TO APACHE
SPARK
BIG DATA FOR THE BUSINESS ANALYST
Created by Gus Cavanaugh @GusCavanaugh
WHY ARE WE HERE?
Business analysts use data to inform business decisions.
Spark is one of many tools that can help you do that.
SO LET'S DIVE RIGHT IN
val input = sc.textFile("file:///test.csv")
input.collect().foreach(println)
This code just loads a file and prints it out to the screen
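The same two lines in PySpark look almost identical - a rough sketch, assuming a SparkContext named sc and a local test.csv:

# Load the file into an RDD and print each line
lines = sc.textFile("file:///test.csv")
for line in lines.collect():
    print(line)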
BIG CAVEAT
We will be coding
No, there is no other way
Yes, it will be hard
But you can do it
HERE'S HOW I KNOW...
Excel formulas are super hard
=VLOOKUP(B2,'Raw Data'!$B$1:$D$2,3,FALSE)
=SUMPRODUCT((A1:A10="Ford")*(B1:B10="June")*(C1:C10))
If you learned how to write VLOOKUPs, you can learn to
code
DISTINCTION: WE ARE NOT
ENGINEERS
We are not building production applications
We just want to answer questions with data rather than with
speculation
WE MAY SHARE TOOLS WITH
ENGINEERS, BUT OUR PROCESS IS
DIFFERENT
Principally, we emphasize interactive analysis
This means we want the flexibility to change the questions
we ask as we work
AND THE ABILITY TO STOP OUR
ANALYSIS AT ANY POINT
We are not doing analysis for the sake of doing analysis
Good may be the enemy of great, but better is the enemy of
done
IN BUSINESS LANGUAGE
We want the highest analytic return for our time investment
OUR ANALYTIC PROCESS
Don't measure, just cut
Google is your best friend
You don't have to know how to do anything
You just have to be able to find out
WHAT IS SPARK?
Spark is an open-source processing framework designed for
cluster computing
WHY IS IT POPULAR?
Super fast...
Plays well with Hadoop
Native APIs for analyst-friendly languages like Python and
R
WAIT...I'VE HEARD THIS BEFORE
Sounds like the original promise of Hadoop...
How is Spark different?
FAST REVIEW OF HADOOP
Google was indexing the web every day
They wrote some custom software to store and process
those documents (web pages)
The open source version of that software is called Hadoop
HADOOP CONSISTS OF TWO MAIN
PIECES
The Hadoop Distributed File System: HDFS
And a processing framework called MapReduce
HDFS enabled fault-tolerant storage on commodity servers
at scale
And MapReduce allowed you to process what you stored in
parallel
THIS IS A BIG DEAL...
Companies storing ever-increasing amounts of data could:
Do so much cheaper
With more flexibility
HADOOP CAME WITH A COST
Parallel processing, but not necessarily fast (batch
processing)
Difficult to program
package org.myorg;
 
import java.io.IOException;
import java.util.*;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.util.*;
public class WordCount {
   public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {
   private final static IntWritable one = new IntWritable(1);
   private Text word = new Text();
NOT INTERACTIVE
Writing MapReduce jobs in Java is an inefficient way for
business analysts to process data in parallel
We get the parallel processing speed, but the development
time is long (or the time spent asking a dev to write it...)
BUT WHAT ABOUT PIG..?
Pig is a sort of scripting language for Hadoop with friendly
syntax that lets you read from any data source
A = load './input.txt';
B = foreach A generate flatten(TOKENIZE((chararray)$0)) as word;
C = group B by word;
D = foreach C generate COUNT(B), group;
store D into './wordcount';
While it works well, it's another language to learn and it is
only used in Hadoop
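For comparison, roughly the same word count in PySpark - just a sketch, assuming a SparkContext named sc and a local input.txt:

# Split lines into words, count each word, and save the results
words = sc.textFile("./input.txt").flatMap(lambda line: line.split())
counts = words.map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)
counts.saveAsTextFile("./wordcount")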
BUT WHAT ABOUT SQL-ON-HADOOP?
A few options: Hive, Impala, Big SQL
If you have these options, use them
But they all involve substantial ETL and (maybe) additional
hardware
In D.C. we know what that means: you get it on next year's
contract
WHAT IS ETL? AND WHY WOULD WE
NEED IT?
Because unlike in most Hadoop tutorials, the data analysts
actually access is not in flat files
For analytics, it is very likely you'll want data from your
Hadoop application's database
But what is your Hadoop application's database?
HBASE - THE HADOOP DATABASE
One big freakin' table
No joins - row keys are everything
Great for applications, terrible for analysts
WHY AM I TALKING ABOUT HBASE
DURING A SPARK PRESENTATION?
Because I want you to know that your data will not be in the
format you want
ETL (Extract, Transform, Load) is a real process that
engineers will have to spend time on to get your data into a
SQL-friendly environment
This will not be an application feature, but an analytics one
(so don't be surprised if this gets skipped)
MY RAMBLING POINT IS THAT YOU
WILL HAVE MESSY DATA
Neither Hadoop, Spark, Tableau, nor anything else will solve that
You still have to rely on the tools you use for data wrangling
Like Python and R
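As a rough illustration (the file name and column layout here are hypothetical), basic wrangling in PySpark might look like:

# Drop malformed rows and parse the fields we care about
raw = sc.textFile("file:///messy_export.csv")
rows = raw.map(lambda line: line.split(","))
clean = rows.filter(lambda r: len(r) == 3 and r[2].isdigit())
parsed = clean.map(lambda r: (r[0], r[1], int(r[2])))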
TOOL COMPARISON
Tool      | Powerful? | Friendly?
Excel     | No        | Hell Yes
Python/R  | Meh...    | Yes
Hadoop    | Yes       | Hell no
Spark     | Hell yes  | Just right
IDEAL SCENARIO
I want to write the same Python scripts I use to process data
on my local machine, just against much bigger data
SPARK IS OUR BEST ANSWER
You can write Python, and iterative computations are
processed in memory, so jobs are easier to write and run much
faster than MapReduce
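A minimal sketch of why that matters: cache a dataset in memory once, then keep asking new questions against it (assuming a SparkContext named sc and a hypothetical sales.csv):

# Cache the parsed data so repeated questions don't re-read from disk
sales = sc.textFile("file:///sales.csv").map(lambda line: line.split(",")).cache()

# Ask several questions against the same cached data
total_rows = sales.count()
ford_rows = sales.filter(lambda r: r[0] == "Ford").count()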
HOW YOU CAN GET STARTED
Big Data University
Spark on Bluemix
EXTRAS
My video on Docker install
Spark paper
