Amazon-style shopping cart analysis using MapReduce on a Hadoop cluster

Dan Şerban
Agenda
:: Introduction
   - Real-world uses of MapReduce
   - The origins of Hadoop
   - Hadoop facts and architecture
:: Part 1
   - Deploying Hadoop
:: Part 2
   - MapReduce is machine learning
:: Q&A
Why shopping cart analysis is useful to amazon.com
LinkedIn and Google Reader
The origins of Hadoop
:: Hadoop got its start in Nutch
:: A few enthusiastic developers were attempting
   to build an open source web search engine
   and having trouble managing computations
   running on even a handful of computers
:: Once Google published their GoogleFS and MapReduce
   whitepapers, the way forward became clear
:: Google had devised systems to solve precisely
   the problems the Nutch project was facing
:: Thus, Hadoop was born
Hadoop facts
:: Hadoop is a distributed computing platform
   for processing extremely large amounts of data
:: Hadoop is divided into two main components:
   - the MapReduce runtime
   - the Hadoop Distributed File System (HDFS)
:: The MapReduce runtime allows the user to submit
   MapReduce jobs
:: HDFS is a distributed file system that provides
   a logical interface for persistent, redundant
   storage of large data sets
:: Hadoop also provides the Hadoop Streaming utility,
   which uses STDIN and STDOUT so you can write
   mappers and reducers in your programming language
   of choice
Hadoop facts
:: Hadoop is based on the principle
   of moving computation to where the data is
:: Data stored on the Hadoop Distributed File System
   is broken up into chunks (blocks) and replicated across
   the cluster, providing fault-tolerant parallel processing
   and redundancy for both the data and the jobs
:: Computation takes the form of a job which consists
   of a map phase and a reduce phase
:: Data is initially processed by map functions which run
   in parallel across the cluster
:: Map output is in the form of key-value pairs
:: The reduce phase then aggregates the map results
:: The reduce phase typically happens in multiple
   consecutive waves until the job is complete
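The map, key-value and reduce steps described above can be sketched in a single process. This is only an illustration of the data flow, not how Hadoop itself is implemented: the function names (`map_phase`, `shuffle`, `reduce_phase`, `count_items`, `total`) and the sample carts are assumptions made up for this example.

```python
from collections import defaultdict


def map_phase(records, map_fn):
    """Apply map_fn to every record and collect the emitted (key, value) pairs."""
    pairs = []
    for record in records:
        pairs.extend(map_fn(record))
    return pairs


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups, reduce_fn):
    """Aggregate the values of each key into a single result."""
    return {key: reduce_fn(key, values) for key, values in groups.items()}


def count_items(cart_line):
    # Hypothetical map function: emit (item, 1) for each item ID in a cart.
    return [(item, 1) for item in cart_line.split()]


def total(key, values):
    # Hypothetical reduce function: sum the counts collected for one key.
    return sum(values)


carts = ["A B C", "A C", "B C"]  # made-up sample data
result = reduce_phase(shuffle(map_phase(carts, count_items)), total)
print(sorted(result.items()))  # [('A', 2), ('B', 2), ('C', 3)]
```

On a real cluster the map calls run in parallel on the nodes holding the data, and the grouping step is performed by the framework rather than by user code.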
Hadoop architecture
Part 1: Configuring and deploying the Hadoop cluster
Hands-on with Hadoop
core-site.xml - before
core-site.xml - after
hdfs-site.xml - before
hdfs-site.xml - after
mapred-site.xml - before
mapred-site.xml - after
Setting up SSH
:: hadoop@hadoop-master needs to be able to ssh* into:
   - hadoop@hadoop-master
   - hadoop@chunkserver-a
   - hadoop@chunkserver-b
   - hadoop@chunkserver-c
:: hadoop@job-tracker needs to be able to ssh* into:
   - hadoop@job-tracker
   - hadoop@chunkserver-a
   - hadoop@chunkserver-b
   - hadoop@chunkserver-c

* Without being prompted for a password or a key passphrase
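One common way to get such logins is key-based authentication with a key that has an empty passphrase. The sketch below only prints the commands one might run on each source machine; the hostnames come from the slide, while the key type, the key path `~/.ssh/id_rsa` and the helper name `key_setup_commands` are assumptions for illustration.

```python
# Login pairs from the slide: source host -> hosts it must reach as user hadoop.
LOGINS = {
    "hadoop-master": ["hadoop-master", "chunkserver-a",
                      "chunkserver-b", "chunkserver-c"],
    "job-tracker": ["job-tracker", "chunkserver-a",
                    "chunkserver-b", "chunkserver-c"],
}


def key_setup_commands(targets, user="hadoop"):
    """Return shell commands that create a passphrase-less key and install it."""
    # Assumption: an RSA key at the default path; -N "" sets an empty passphrase.
    commands = ['ssh-keygen -t rsa -N "" -f ~/.ssh/id_rsa']
    # ssh-copy-id appends the public key to authorized_keys on each target.
    commands.extend("ssh-copy-id %s@%s" % (user, host) for host in targets)
    return commands


for source, targets in sorted(LOGINS.items()):
    print("# run as hadoop@%s" % source)
    for command in key_setup_commands(targets):
        print(command)
```

`ssh-copy-id` still asks for the account password once per target; after that, logins from the source host should no longer prompt.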
Hands-on with Hadoop
Part 2: MapReduce is machine learning
Rolling your own self-hosted alternative to ...
Hands-on with MapReduce
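mapper.py
#!/usr/bin/python

import sys

# Each input line is assumed to hold the product IDs of one shopping cart,
# separated by whitespace.
for line in sys.stdin:
  line = line.strip()
  IDs = line.split()
  # Emit every pair once, with the smaller ID first (string comparison).
  for firstID in IDs:
    for secondID in IDs:
      if secondID > firstID:
        # Key and value are separated by a tab character.
        print('%s_%s\t%s' % (firstID, secondID, 1))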
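To see what the mapper emits for one cart, the pairing logic can be written as a function and called directly. This is a variant, not the slide's code: it uses `itertools.combinations` and, as an added assumption, de-duplicates the IDs so that an item listed twice in a cart is not paired more than once. The name `cart_pairs` and the sample cart are made up for this example.

```python
from itertools import combinations


def cart_pairs(line):
    """Return the 'first_second<TAB>1' records for one cart line."""
    # Assumption: drop repeated IDs; sorting puts the smaller ID first,
    # using the same string comparison as the mapper above.
    ids = sorted(set(line.split()))
    return ['%s_%s\t%d' % (a, b, 1) for a, b in combinations(ids, 2)]


for record in cart_pairs("milk bread eggs"):
    print(record)
# bread_eggs	1
# bread_milk	1
# eggs_milk	1
```

Because IDs are compared as strings, numeric IDs order lexicographically ("10" sorts before "9"). That is harmless here, since all that matters is that both orderings of a pair map to the same key.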
reducer.py
#!/usr/bin/python

import sys

# Sum the counts emitted by the mappers for each product pair.
subTotals = {}
for line in sys.stdin:
  line = line.strip()
  # Each record has the form "<pair><TAB><count>".
  word = line.split('\t')[0]
  count = int(line.split('\t')[1])
  subTotals[word] = subTotals.get(word, 0) + count
for k, v in subTotals.items():
  print("%s\t%s" % (k, v))
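The reducer above keeps one dictionary entry per distinct pair in memory. Since the framework sorts map output by key before handing it to a reducer, an alternative is to sum each run of identical keys as it streams past. The sketch below is such a variant, not the slide's code; the names `parse` and `reduce_sorted` and the sample records are assumptions.

```python
from itertools import groupby


def parse(lines):
    """Yield (key, count) from 'key<TAB>count' lines, skipping blank lines."""
    for line in lines:
        line = line.strip()
        if line:
            key, count = line.split('\t', 1)
            yield key, int(count)


def reduce_sorted(lines):
    """Sum counts per key; relies on equal keys being adjacent in the input."""
    for key, group in groupby(parse(lines), key=lambda kv: kv[0]):
        yield key, sum(count for _, count in group)


# Made-up, already sorted sample; in a streaming job, pass sys.stdin instead.
sample = ["a_b\t1", "a_b\t1", "a_c\t1"]
for key, total in reduce_sorted(sample):
    print('%s\t%d' % (key, total))
```

When testing locally without Hadoop, the mapper output has to be piped through `sort` before it reaches this reducer; otherwise the same key can appear in several separate runs.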
Hands-on with MapReduce
Other MapReduce use cases
:: Google Suggest
:: Video recommendations (YouTube)
:: Clickstream analysis (large web properties)
:: Spam filtering and contextual advertising (Yahoo)
:: Fraud detection (eBay, credit card companies)
:: Firewall log analysis to discover exfiltration and
   other undesirable (possibly malware-related) activity
:: Finding patterns in social data, analyzing "likes" and
   building a search engine on top of them (Facebook)
:: Discovering microblogging trends and opinion leaders,
   analyzing who follows whom (Twitter)
:: Plain old supermarket shopping basket analysis
:: The semantic web
Questions / Feedback
Bonus slide: Making of SQLite DB
