tech 3camp presentation

© 2013 Acxiom Corporation. All Rights Reserved.
Hadoop – a distributed analytical platform
Jakub Wszolek (jwszol@acxiom.com)
TECH 3camp - 2015

Big Data is not only Hadoop

Hadoop galactic

ETL processes
Hadoop Streaming
Hive
MRJOB
Data Loading
Hive Tables (internal/external)
Data Science

Worth checking
• MRJOB - https://pythonhosted.org/mrjob/
- Built on Hadoop Streaming
- Keeps all MapReduce code for one job in a single class
- Lets you run your code without Hadoop at all
- Makes debugging much easier
• Snakebite - https://github.com/spotify/snakebite
- Pure Python HDFS client
- Uses protobuf to communicate with the NameNode
- Provides a CLI for HDFS
- Extremely fast!
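The "single class, runnable without Hadoop" idea can be illustrated with a small local word count. This is a sketch in plain Java, not mrjob itself and not Hadoop's API; the class name LocalWordCount and its methods are made up for illustration. It keeps the map step, the reduce step and a tiny in-memory "shuffle" together so the job logic can be tested on a laptop.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration: one class holding the whole job, run locally.
public class LocalWordCount {

    // Map step: emit one (word, 1) pair per whitespace-separated token.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String token : line.trim().toLowerCase().split("\\s+")) {
            if (!token.isEmpty()) {
                pairs.add(Map.entry(token, 1));
            }
        }
        return pairs;
    }

    // Reduce step: sum all counts emitted for a single word.
    static int reduce(String word, List<Integer> counts) {
        int total = 0;
        for (int c : counts) {
            total += c;
        }
        return total;
    }

    // Local driver: map every line, group values by key (the "shuffle"),
    // then reduce each group. No cluster is involved.
    static Map<String, Integer> run(List<String> lines) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String line : lines) {
            for (Map.Entry<String, Integer> pair : map(line)) {
                grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                       .add(pair.getValue());
            }
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            result.put(e.getKey(), reduce(e.getKey(), e.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> lines = List.of("hadoop hive hadoop", "hive mrjob");
        // Print tab-separated key/value pairs, the convention Hadoop Streaming uses.
        run(lines).forEach((word, count) -> System.out.println(word + "\t" + count));
    }
}
```

Because the mapper and reducer are ordinary methods, they can be unit-tested directly before the same logic is moved to a cluster.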

Still under heavy loading
[Chart: Data Loads [TB] per month, July to November, with an exponential trend line; vertical axis from 0 to 4 TB]

Complex analysis
• RevR + RStudio
• Data science
• Trend analysis, advanced clustering
• Predictive models
• Classifiers

Apache Mahout
• Library of scalable machine-learning algorithms
• Implemented on top of Apache Hadoop
• Using the MapReduce paradigm
• Provides data science tools that automatically
find meaningful patterns in big data sets
• http://mahout.apache.org/

What Mahout Does
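• Mahout supports four main data science use
cases:
- Collaborative filtering – mines user behavior and
makes product recommendations (e.g. Amazon
recommendations)
- Clustering – takes items in a particular class and
organizes them into naturally occurring groups of
similar items
- Classification – learns from existing categorizations
and then assigns unclassified items to the best category
- Frequent itemset mining – analyzes items in a group
(e.g. a shopping cart) and identifies which items
typically appear together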
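To make the last use case concrete, here is a minimal sketch of frequent pair counting in plain Java. It is not Mahout's implementation (Mahout runs this kind of analysis as MapReduce jobs); the class name FrequentPairs, the "+" key separator and the sample baskets are assumptions made for illustration only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical illustration of frequent itemset mining, restricted to pairs.
public class FrequentPairs {

    // Count how many baskets contain each unordered pair of items,
    // then keep only the pairs that reach the minimum support.
    static Map<String, Integer> frequentPairs(List<List<String>> baskets, int minSupport) {
        Map<String, Integer> counts = new TreeMap<>();
        for (List<String> basket : baskets) {
            // Sort and de-duplicate so (a, b) and (b, a) share one key.
            List<String> items = new ArrayList<>(new TreeSet<>(basket));
            for (int i = 0; i < items.size(); i++) {
                for (int j = i + 1; j < items.size(); j++) {
                    counts.merge(items.get(i) + "+" + items.get(j), 1, Integer::sum);
                }
            }
        }
        // Drop pairs seen in fewer than minSupport baskets.
        counts.values().removeIf(c -> c < minSupport);
        return counts;
    }

    public static void main(String[] args) {
        List<List<String>> baskets = List.of(
                List.of("milk", "bread", "butter"),
                List.of("bread", "milk"),
                List.of("milk", "eggs"));
        System.out.println(frequentPairs(baskets, 2));
    }
}
```

Counting all pairs in memory only works for small data; at scale the pair counts would be produced by a distributed job.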

Clustering - business use case
• Helps marketers improve their customer base
and focus on target segments.
• Groups people by criteria such as willingness to buy
or purchasing power, based on how similar they are
with respect to the product under consideration.
• Helps identify groups of houses by value, type and
geographical location.

K-means
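As a rough illustration of what K-means does, the sketch below alternates an assignment step (each point goes to its nearest centroid) and an update step (each centroid moves to the mean of its points) until nothing changes. It is plain single-machine Java, not Mahout's MapReduce implementation. The class name KMeansSketch, the squared Euclidean distance and the naive seeding from the first k points are assumptions chosen to keep the example short and deterministic; it also assumes k is no larger than the number of points.

```java
import java.util.Arrays;

// Hypothetical single-machine K-means sketch (Lloyd's iteration).
public class KMeansSketch {

    static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    // Index of the centroid closest to the given point.
    static int nearest(double[] point, double[][] centroids) {
        int best = 0;
        for (int c = 1; c < centroids.length; c++) {
            if (squaredDistance(point, centroids[c]) < squaredDistance(point, centroids[best])) {
                best = c;
            }
        }
        return best;
    }

    // Returns the cluster index assigned to each point.
    static int[] cluster(double[][] points, int k, int maxIter) {
        int dim = points[0].length;
        double[][] centroids = new double[k][];
        for (int c = 0; c < k; c++) {
            centroids[c] = points[c].clone(); // assumption: seed with the first k points
        }
        int[] assignment = new int[points.length];
        Arrays.fill(assignment, -1);
        for (int iter = 0; iter < maxIter; iter++) {
            // Assignment step: attach every point to its nearest centroid.
            boolean changed = false;
            for (int p = 0; p < points.length; p++) {
                int c = nearest(points[p], centroids);
                if (c != assignment[p]) {
                    assignment[p] = c;
                    changed = true;
                }
            }
            if (!changed) {
                break; // converged: no point switched cluster
            }
            // Update step: move each centroid to the mean of its points.
            double[][] sums = new double[k][dim];
            int[] sizes = new int[k];
            for (int p = 0; p < points.length; p++) {
                sizes[assignment[p]]++;
                for (int d = 0; d < dim; d++) {
                    sums[assignment[p]][d] += points[p][d];
                }
            }
            for (int c = 0; c < k; c++) {
                if (sizes[c] > 0) {
                    for (int d = 0; d < dim; d++) {
                        centroids[c][d] = sums[c][d] / sizes[c];
                    }
                }
            }
        }
        return assignment;
    }

    public static void main(String[] args) {
        double[][] points = {{1, 1}, {5, 5}, {1.2, 0.8}, {5.1, 4.9}, {0.9, 1.1}};
        System.out.println(Arrays.toString(cluster(points, 2, 20)));
    }
}
```

Real deployments usually pick initial centroids more carefully, since the result of K-means depends on the seeding.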

Hadoop data preparation

Sequences and Vectors
• Hadoop SequenceFile
- Flat file consisting of binary key/value pairs
- Extensively used in MapReduce as an input/output
format
- Each record is a <key,value> pair
- For text input, key and value are both instances of
org.apache.hadoop.io.Text
- KEY = record name/filename/unique ID
- VALUE = content as a UTF-8 encoded String
• Vectors
- Typical vector representation, as in Weka or Matlab
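The step from a text record to a named vector can be sketched in plain Java. The record layout "id,v1,v2,..." is a hypothetical format chosen for illustration, as is the class name RecordToVector; Mahout's own converters and vector classes are not used here.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration: turn "id,v1,v2,..." text records into named dense vectors.
public class RecordToVector {

    // The first field becomes the key (record ID); the rest become the vector values.
    static Map<String, double[]> toVectors(List<String> records) {
        Map<String, double[]> vectors = new LinkedHashMap<>();
        for (String record : records) {
            String[] fields = record.split(",");
            double[] values = new double[fields.length - 1];
            for (int i = 1; i < fields.length; i++) {
                values[i - 1] = Double.parseDouble(fields[i].trim());
            }
            vectors.put(fields[0].trim(), values);
        }
        return vectors;
    }

    public static void main(String[] args) {
        Map<String, double[]> vectors = toVectors(List.of("item-1,0.1,0.2,0.5", "item-2,0.4,0.1,0.9"));
        vectors.forEach((name, values) -> System.out.println(name + " -> " + Arrays.toString(values)));
    }
}
```

The key/value shape mirrors a SequenceFile record: a unique ID as the key and the numeric content as the value.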

HDFS data file to Vector
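import java.util.LinkedList;
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.mahout.math.DenseVector;
import org.apache.mahout.math.NamedVector;
import org.apache.mahout.math.VectorWritable;

// Build the list of named vectors to store
List<NamedVector> vectors = new LinkedList<NamedVector>();
NamedVector v1 = new NamedVector(new DenseVector(new double[] {0.1, 0.2, 0.5}), "Item number one");
vectors.add(v1);
Configuration config = new Configuration();
FileSystem fs = FileSystem.get(config);
Path path = new Path("datasamples/data");
// Write a SequenceFile from the vectors: key = vector name, value = vector
SequenceFile.Writer writer = new SequenceFile.Writer(fs, config, path, Text.class, VectorWritable.class);
VectorWritable vec = new VectorWritable();
for (NamedVector v : vectors) {
    vec.set(v);
    // Append the VectorWritable wrapper; the value class must match VectorWritable.class
    writer.append(new Text(v.getName()), vec);
}
writer.close();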

K-means clustering in action
• Place the file on HDFS
• Convert the file into a sequence file of vectors
- mahout arff.vector
-d /home/cloudera/Mahout/input_data
-o /user/cloudera/mahout/arff/vec_data
-t /home/cloudera/Mahout/arff/dict
• Run mahout kmeans
- mahout kmeans --input <hdfs_data_files> --output
<kmeans-output> --numClusters 3 --clusters
<clusters-0-final> --maxIter 20 --method mapreduce

K-means clustering in action
• View the clusters as a text file
- mahout clusterdump
-i <hdfs_input>
-o <output_file>
-p <clusteredPoints>
• View the clusters as a GraphML file
- add -of GRAPH_ML to the clusterdump command

Results

Acxiom DSSH
• Data Science Safe Haven (DSSH)
• Detailed measurements that show how digital
marketing is driving purchasing behaviors
• Actionable recommendations on how to adjust
your digital marketing to reach your goals
• Insights on how your key customer segments
are engaging in digital channels
• http://www.acxiom.com/data-science-safe-haven/
