INTRO TO APACHE
SPARK
BIG DATA FOR THE BUSINESS ANALYST
Created by Gus Cavanaugh @GusCavanaugh
WHY ARE WE HERE?
Business analysts use data to inform business decisions.
Spark is one of many tools that can help you do that.
SO LET'S DIVE RIGHT IN
val input = sc.textFile("file:///test.csv")
input.collect().foreach(println)
This code just loads a file and prints it out to the screen
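The same two lines in PySpark look almost identical - a rough sketch, assuming a SparkContext named sc and a local test.csv:

# Load the file into an RDD and print each line
lines = sc.textFile("file:///test.csv")
for line in lines.collect():
    print(line)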
BIG CAVEAT
We will be coding
No, there is no other way
Yes, it will be hard
But you can do it
HERE'S HOW I KNOW...
Excel formulas are super hard
=VLOOKUP(B2,'Raw Data'!$B$1:$D$2,3,FALSE)
=SUMPRODUCT((A1:A10="Ford")*(B1:B10="June")*(C1:C10))
If you learned how to write VLOOKUPs, you can learn to
code
DISTINCTION: WE ARE NOT
ENGINEERS
We are not building production applications
We just want to answer questions with data rather than with
speculation
WE MAY SHARE TOOLS WITH
ENGINEERS, BUT OUR PROCESS IS
DIFFERENT
Principally, we emphasize interactive analysis
This means we want the flexibility to change the questions
we ask as we work
AND THE ABILITY TO STOP OUR
ANALYSIS AT ANY POINT
We are not doing analysis for the sake of doing analysis
Good may be the enemy of great, but better is the enemy of
done
IN BUSINESS LANGUAGE
We want the highest analytic return for our time investment
OUR ANALYTIC PROCESS
Don't measure, just cut
Google is your best friend
You don't have to know how to do anything
You just have to be able to find out
WHAT IS SPARK?
Spark is an open-source processing framework designed for
cluster computing
WHY IS IT POPULAR?
Super fast...
Plays well with Hadoop
Native APIs for analyst-friendly languages like Python and
R
WAIT...I'VE HEARD THIS BEFORE
Sounds like the original promise of Hadoop...
How is Spark different?
FAST REVIEW OF HADOOP
Google was indexing the web every day
They wrote some custom software to store and process
those documents (web pages)
The open source version of that software is called Hadoop
HADOOP CONSISTS OF TWO MAIN
PIECES
The Hadoop Distributed File System: HDFS
And a processing framework called MapReduce
HDFS enabled fault-tolerant storage on commodity servers
at scale
And MapReduce allowed you to process what you stored in
parallel
THIS IS A BIG DEAL...
Companies storing ever-increasing amounts of data could:
Do so much cheaper
With more flexibility
HADOOP CAME WITH A COST
Parallel processing, but not necessarily fast (batch
processing)
Difficult to program
package org.myorg;
 
import java.io.IOException;
import java.util.*;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.util.*;
public class WordCount {
   public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {
   private final static IntWritable one = new IntWritable(1);
   private Text word = new Text();
NOT INTERACTIVE
Writing MapReduce jobs in Java is an inefficient way for
business analysts to process data in parallel
We get the parallel processing speed, but the development
time is long (or the time spent asking a dev to write it...)
BUT WHAT ABOUT PIG..?
Pig is a sort of scripting language for Hadoop with friendly
syntax that lets you read from any data source
A = load './input.txt';
B = foreach A generate flatten(TOKENIZE((chararray)$0)) as word;
C = group B by word;
D = foreach C generate COUNT(B), group;
store D into './wordcount';
While it works well, it's another language to learn and it is
only used in Hadoop
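For comparison, roughly the same word count in PySpark - just a sketch, assuming a SparkContext named sc and a local input.txt:

# Split lines into words, count each word, and save the results
words = sc.textFile("./input.txt").flatMap(lambda line: line.split())
counts = words.map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)
counts.saveAsTextFile("./wordcount")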
BUT WHAT ABOUT SQL-ON-HADOOP?
A few options: Hive, Impala, Big SQL
If you have these options, use them
But they all involve substantial ETL and (maybe) additional
hardware
In D.C. we know what that means: you get it on next year's
contract
WHAT IS ETL? AND WHY WOULD WE
NEED IT?
Because unlike in most Hadoop tutorials, the data analysts
actually access is not in flat files
For analytics, it is very likely you'll want data from your
Hadoop application's database
But what is your Hadoop application's database?
HBASE - THE HADOOP DATABASE
One big freakin' table
No joins - row keys are everything
Great for applications, terrible for analysts
WHY AM I TALKING ABOUT HBASE
DURING A SPARK PRESENTATION?
Because I want you to know that your data will not be in the
format you want
ETL (Extract, Transform, Load) is a real process that
engineers will have to spend time on to get your data into a
SQL-friendly environment
This will not be an application feature, but an analytics one
(so don't be surprised if this gets skipped)
MY RAMBLING POINT IS THAT YOU
WILL HAVE MESSY DATA
Neither Hadoop, Spark, Tableau, nor anything else will solve that
You still have to rely on the tools you use for data wrangling
Like Python and R
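As a rough illustration (the file name and column layout here are hypothetical), basic wrangling in PySpark might look like:

# Drop malformed rows and parse the fields we care about
raw = sc.textFile("file:///messy_export.csv")
rows = raw.map(lambda line: line.split(","))
clean = rows.filter(lambda r: len(r) == 3 and r[2].isdigit())
parsed = clean.map(lambda r: (r[0], r[1], int(r[2])))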
TOOL COMPARISON
Tool      | Powerful? | Friendly?
Excel     | No        | Hell Yes
Python/R  | Meh...    | Yes
Hadoop    | Yes       | Hell no
Spark     | Hell yes  | Just right
IDEAL SCENARIO
I want to write the same Python scripts I use to process data
on my local machine, just against much bigger data
SPARK IS OUR BEST ANSWER
You can write Python, and iterative computations are
processed in memory, so jobs are easier to write and run much
faster than MapReduce
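A minimal sketch of why that matters: cache a dataset in memory once, then keep asking new questions against it (assuming a SparkContext named sc and a hypothetical sales.csv):

# Cache the parsed data so repeated questions don't re-read from disk
sales = sc.textFile("file:///sales.csv").map(lambda line: line.split(",")).cache()

# Ask several questions against the same cached data
total_rows = sales.count()
ford_rows = sales.filter(lambda r: r[0] == "Ford").count()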
HOW YOU CAN GET STARTED
Big Data University
Spark on Bluemix
EXTRAS
My video on Docker install
Spark paper
