© 2013 Acxiom Corporation. All Rights Reserved.
Hadoop – a distributed analytical platform
Jakub Wszolek (jwszol@acxiom.com)
TECH 3camp - 2015
Big Data is not only Hadoop
Hadoop galactic
ETL  processes
Hadoop Streaming
Hive
MRJOB
Data Loading
Hive Tables (internal/external)
Data Science
Worth checking out
• mrjob - https://pythonhosted.org/mrjob/
- Runs on top of Hadoop Streaming
- Keeps all MapReduce code for one job in a single class
- Lets you run your code without Hadoop at all
- Makes debugging much easier
• Snakebite - https://github.com/spotify/snakebite
- Pure Python HDFS client
- Uses protobuf to communicate with the NameNode
- Includes a command-line interface (CLI) for HDFS
- Extremely fast!
Still  under  heavy  loading
[Chart: Data Loads [TB] per month, July to November, with an exponential trendline; vertical axis from 0 to 4 TB]
Complex analysis
• RevR + RStudio
• Data science
• Trend analysis, advanced clustering
• Predictive models
• Classifiers
Apache Mahout
• Library of scalable machine-learning algorithms
• Implemented on top of Apache Hadoop
• Uses the MapReduce paradigm
• Provides data science tools that automatically find meaningful patterns in big data sets
• http://mahout.apache.org/
What Mahout Does
• Mahout supports four main data science use cases:
- Collaborative filtering – mines user behavior and makes product recommendations (e.g. Amazon recommendations)
- Clustering – takes items in a particular class and organizes them into naturally occurring groups of similar items
- Classification – learns from existing categorizations and then assigns unclassified items to the best category
- Frequent itemset mining – analyzes items in a group and identifies which items typically appear together
Clustering - business use case
• Helps marketers improve their customer base and work on target areas.
• Groups people by criteria such as willingness to buy and purchasing power, based on how similar they are with respect to the product under consideration.
• Helps identify groups of houses on the basis of their value, type and geographical location.
K-means
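The K-means slides carried only diagrams. As a rough illustration of the algorithm itself, the following is a minimal single-machine sketch in plain Java of Lloyd's iteration: assign every point to its nearest centroid, then move each centroid to the mean of its points, and stop when nothing changes. The class name `KMeansSketch`, the sample points, and the choice of the first k points as initial centroids are illustrative assumptions; this is not Mahout's implementation or API.

```java
import java.util.Arrays;

// Minimal single-machine K-means (Lloyd's algorithm) sketch; not Mahout's implementation.
public class KMeansSketch {

    // Squared Euclidean distance between two points of equal dimension.
    static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    // Index of the centroid closest to the given point.
    static int nearest(double[] point, double[][] centroids) {
        int best = 0;
        for (int c = 1; c < centroids.length; c++) {
            if (squaredDistance(point, centroids[c]) < squaredDistance(point, centroids[best])) {
                best = c;
            }
        }
        return best;
    }

    // Runs at most maxIter iterations and returns the cluster index of each point.
    // Assumption: the first k points serve as the initial centroids.
    static int[] cluster(double[][] points, int k, int maxIter) {
        double[][] centroids = new double[k][];
        for (int c = 0; c < k; c++) {
            centroids[c] = points[c].clone();
        }
        int[] assignment = new int[points.length];
        Arrays.fill(assignment, -1);

        for (int iter = 0; iter < maxIter; iter++) {
            // Assignment step: attach every point to its nearest centroid.
            boolean changed = false;
            for (int p = 0; p < points.length; p++) {
                int c = nearest(points[p], centroids);
                if (c != assignment[p]) {
                    assignment[p] = c;
                    changed = true;
                }
            }
            if (!changed) {
                break; // converged: no point switched cluster
            }
            // Update step: move each centroid to the mean of its assigned points.
            int dim = points[0].length;
            double[][] sums = new double[k][dim];
            int[] counts = new int[k];
            for (int p = 0; p < points.length; p++) {
                counts[assignment[p]]++;
                for (int d = 0; d < dim; d++) {
                    sums[assignment[p]][d] += points[p][d];
                }
            }
            for (int c = 0; c < k; c++) {
                if (counts[c] > 0) { // keep the old centroid if a cluster ends up empty
                    for (int d = 0; d < dim; d++) {
                        centroids[c][d] = sums[c][d] / counts[c];
                    }
                }
            }
        }
        return assignment;
    }

    public static void main(String[] args) {
        // Two well-separated groups of 2-D points (made-up sample data).
        double[][] points = {
            {1.0, 1.0}, {9.0, 9.0}, {1.2, 0.8}, {8.8, 9.1}, {0.9, 1.1}, {9.2, 8.9}
        };
        int[] assignment = cluster(points, 2, 20);
        System.out.println(Arrays.toString(assignment)); // prints [0, 1, 0, 1, 0, 1]
    }
}
```

The parameters `k` and `maxIter` in this sketch play the same role as the `--numClusters` and `--maxIter` options of the `mahout kmeans` command; the distributed version splits the assignment and update steps across mappers and reducers instead of running them in one loop.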
Hadoop  data  preparation
Sequences and Vectors
• Hadoop SequenceFile
- Flat file consisting of binary key/value pairs
- Extensively used in MapReduce as an input/output format
- Each record is a <key,value> pair
- For text input, key and value must both be of class org.apache.hadoop.io.Text (a SequenceFile in general accepts any Writable type)
- KEY = record name/filename/unique ID
- VALUE = content as a UTF-8 encoded String
• Vectors
- Typical vector representation, as used in e.g. Weka or Matlab
HDFS  data  file  to  Vector
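import java.util.LinkedList;
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.mahout.math.DenseVector;
import org.apache.mahout.math.NamedVector;
import org.apache.mahout.math.VectorWritable;

// Build the in-memory list of named vectors to store
List<NamedVector> vectors = new LinkedList<NamedVector>();
NamedVector v1 = new NamedVector(new DenseVector(new double[] {0.1, 0.2, 0.5}), "Item number one");
vectors.add(v1);

Configuration config = new Configuration();
FileSystem fs = FileSystem.get(config);
Path path = new Path("datasamples/data");

// Write a SequenceFile from the vectors: key = vector name, value = vector
SequenceFile.Writer writer = new SequenceFile.Writer(fs, config, path, Text.class, VectorWritable.class);
VectorWritable vec = new VectorWritable();
for (NamedVector v : vectors) {
    vec.set(v);
    writer.append(new Text(v.getName()), vec); // append the VectorWritable wrapper, not the raw vector
}
writer.close();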
K-means clustering in action
• Place the file on HDFS
• Convert the file into sequence and vector format
- mahout arff.vector -d /home/cloudera/Mahout/input_data -o /user/cloudera/mahout/arff/vec_data -t /home/cloudera/Mahout/arff/dict
• Run mahout kmeans
- mahout kmeans --input <hdfs_data_files> --output <kmeans-output> --numClusters 3 --clusters <clusters-0-final> --maxIter 20 --method mapreduce
• See the clusters as a text file
- mahout clusterdump -i <hdfs_input> -o <output_file> -p <clusteredPoints>
• See the clusters as a GraphML file
- Add -of GRAPH_ML to the clusterdump command
Results
Acxiom  DSSH
• Data Science Safe Haven (DSSH)
• Detailed measurements that show how digital marketing is driving purchasing behaviors
• Actionable recommendations on how to adjust your digital marketing to reach your goals
• Insights on how your key customer segments are engaging in digital channels
• http://www.acxiom.com/data-science-safe-haven/
Questions?
Thank you!
