Amazon-style shopping cart analysis using MapReduce on a Hadoop cluster

Dan Şerban

Agenda
:: Introduction
- Real-world uses of MapReduce
- The origins of Hadoop
- Hadoop facts and architecture
:: Part 1
- Deploying Hadoop
:: Part 2
- MapReduce is machine learning
:: Q&A

Why shopping cart analysis
is useful to amazon.com

The origins of Hadoop
:: Hadoop got its start in Nutch
:: A few enthusiastic developers were attempting
to build an open source web search engine
and having trouble managing computations
running on even a handful of computers
:: Once Google published their GoogleFS and MapReduce
whitepapers, the way forward became clear
:: Google had devised systems to solve precisely
the problems the Nutch project was facing
:: Thus, Hadoop was born

Hadoop facts
:: Hadoop is a distributed computing platform
for processing extremely large amounts of data
:: Hadoop is divided into two main components:
- the MapReduce runtime
- the Hadoop Distributed File System (HDFS)
:: The MapReduce runtime allows the user to submit
MapReduce jobs
:: HDFS is a distributed file system that provides
a logical interface for persistent and redundant
storage of large data sets
:: Hadoop also provides the Hadoop Streaming utility,
which passes data through STDIN and STDOUT so you can write
mappers and reducers in your programming language
of choice
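The STDIN/STDOUT contract can be sketched in a few lines. By default, Hadoop Streaming treats everything up to the first tab on a line as the key and the rest as the value. The helper names below are illustrative and not part of Hadoop.

```python
def split_key_value(line):
    """Split one streaming line into (key, value) at the first tab."""
    line = line.rstrip("\n")
    key, sep, value = line.partition("\t")
    # Without a tab the whole line is the key and there is no value.
    return key, (value if sep else None)


def format_key_value(key, value):
    """Build the tab-separated line a mapper or reducer writes to STDOUT."""
    return "%s\t%s" % (key, value)


print(split_key_value(format_key_value("1001_1002", 1)))
```

Values always arrive as text, so a reducer has to convert counts back to integers itself.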

Hadoop facts
:: Hadoop is based on the principle
of moving computation to where the data is
:: Data stored on the Hadoop Distributed File System
is broken up into chunks and replicated across
the cluster, providing fault-tolerant parallel processing
and redundancy for both the data and the jobs
:: Computation takes the form of a job which consists
of a map phase and a reduce phase
:: Data is initially processed by map functions which run
in parallel across the cluster
:: Map output is in the form of key-value pairs
:: The reduce phase then aggregates the map results
:: The reduce phase typically happens in multiple
consecutive waves until the job is complete
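The map and reduce phases described above can be imitated in a single process. This is only a minimal sketch of the data flow, not how Hadoop schedules work across a cluster; `map_reduce`, `count_words` and `total` are made-up names.

```python
from collections import defaultdict


def map_reduce(records, map_fn, reduce_fn):
    """Run map, group by key, then reduce, all in one process."""
    groups = defaultdict(list)
    # Map phase: each record yields zero or more (key, value) pairs.
    for record in records:
        for key, value in map_fn(record):
            groups[key].append(value)
    # Reduce phase: each key is aggregated independently of the others.
    return {key: reduce_fn(key, values) for key, values in groups.items()}


def count_words(line):
    """Example map function: emit (word, 1) for every word."""
    for word in line.split():
        yield word, 1


def total(key, values):
    """Example reduce function: add up the values of one key."""
    return sum(values)


counts = map_reduce(["a b a", "b c"], count_words, total)
print(sorted(counts.items()))
```

Because each key is reduced on its own, the reduce work can be spread over many machines just like the map work.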

Part 1:
Configuring and deploying
the Hadoop cluster

Setting up SSH
:: hadoop@hadoop-master needs to be able to ssh* into:
- hadoop@hadoop-master
- hadoop@chunkserver-a
- hadoop@chunkserver-b
- hadoop@chunkserver-c
:: hadoop@job-tracker needs to be able to ssh* into:
- hadoop@job-tracker
- hadoop@chunkserver-a
- hadoop@chunkserver-b
- hadoop@chunkserver-c

*Without being prompted for a password or a passphrase
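One way to verify the setup is to ask ssh to fail rather than prompt. The sketch below assumes the OpenSSH client is installed and uses the host names from this slide; `ssh_check_command` and `can_ssh` are hypothetical helpers.

```python
import subprocess

# Hosts taken from the slide; adjust to your own cluster.
MASTER_TARGETS = ["hadoop-master", "chunkserver-a",
                  "chunkserver-b", "chunkserver-c"]


def ssh_check_command(host, user="hadoop"):
    """Build an ssh command that fails instead of prompting for input."""
    return ["ssh", "-o", "BatchMode=yes", "-o", "ConnectTimeout=5",
            "%s@%s" % (user, host), "true"]


def can_ssh(host, user="hadoop"):
    """Return True if login to host works without any prompt."""
    try:
        return subprocess.call(ssh_check_command(host, user)) == 0
    except OSError:
        return False  # ssh client not found on this machine


# Usage, run as hadoop@hadoop-master:
#   for host in MASTER_TARGETS:
#       print(host, can_ssh(host))
```

With `BatchMode=yes`, a host whose key has not been accepted yet is also likely to fail the check, so a first manual login to each machine may still be needed.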

Part 2:
MapReduce
is machine learning

Rolling your own
self-hosted alternative to ...

mapper.py
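#!/usr/bin/python

import sys

# Each input line is one shopping cart: product IDs separated by whitespace.
for line in sys.stdin:
    line = line.strip()
    IDs = line.split()
    # Emit each pair of distinct IDs once, smaller ID first, with a count of 1.
    for firstID in IDs:
        for secondID in IDs:
            if secondID > firstID:
                print('%s_%s\t%s' % (firstID, secondID, 1))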
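To see which keys the mapper produces for one cart, the pairing step can be pulled out into a function. This is a variant, not the slide's code: it uses `itertools.combinations` and drops repeated IDs, so a product that appears twice in a cart counts once, whereas the nested loops would emit its pairs twice. `cart_pairs` and the sample IDs are made up.

```python
from itertools import combinations


def cart_pairs(line):
    """Return the 'idA_idB' keys for one cart line, smaller ID first."""
    # set() drops repeated IDs; sorted() orders them with the same
    # string comparison the mapper uses.
    ids = sorted(set(line.split()))
    return ["%s_%s" % pair for pair in combinations(ids, 2)]


print(cart_pairs("1003 1001 1002"))
# prints ['1001_1002', '1001_1003', '1002_1003']
```

IDs are compared as strings, so "9" sorts after "10". That is harmless as long as every cart is ordered by the same rule, because a given pair then always maps to the same key.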

reducer.py
#!/usr/bin/python

import sys

# Add up the counts for each 'idA_idB' key, then print the totals.
subTotals = {}
for line in sys.stdin:
    line = line.strip()
    word = line.split('\t')[0]
    count = int(line.split('\t')[1])
    subTotals[word] = subTotals.get(word, 0) + count
for k, v in subTotals.items():
    print("%s\t%s" % (k, v))
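The reducer above keeps every key in a dictionary until the input ends. Since Hadoop hands the reducer its input sorted by key, an alternative sketch can total one key at a time with `itertools.groupby`. `reduce_sorted` is a made-up name, and `sorted()` stands in for Hadoop's sort step here.

```python
from itertools import groupby


def reduce_sorted(lines):
    """Yield 'key<TAB>total' for input lines already sorted by key."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines)
    # groupby only merges adjacent equal keys, which is why sorting matters.
    for key, group in groupby(pairs, key=lambda kv: kv[0]):
        yield "%s\t%d" % (key, sum(int(count) for _, count in group))


mapper_output = ["1001_1002\t1", "1002_1003\t1", "1001_1002\t1"]
for out in reduce_sorted(sorted(mapper_output)):
    print(out)
```

The same pipeline can be tried locally without a cluster by piping a cart file through the mapper, `sort`, and the reducer.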

Other MapReduce use cases
:: Google Suggest
:: Video recommendations (YouTube)
:: Clickstream analysis (large web properties)
:: Spam filtering and contextual advertising (Yahoo)
:: Fraud detection (eBay, credit card companies)
:: Firewall log analysis to discover exfiltration and
other undesirable (possibly malware-related) activity
:: Finding patterns in social data, analyzing “likes” and
building a search engine on top of them (Facebook)
:: Discovering microblogging trends and opinion leaders,
analyzing who follows whom (Twitter)
:: Plain old supermarket shopping basket analysis
:: The semantic web

Bonus slide: Making of SQLite DB
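The content of this slide is not part of the transcript. As a rough guess at what such a step could look like, the sketch below loads reducer output into SQLite with Python's `sqlite3` module and looks up the products most often bought together with a given one. The table name, column names and sample counts are all hypothetical, and the key is assumed to contain exactly one underscore.

```python
import sqlite3


def load_pairs(conn, reducer_lines):
    """Store 'idA_idB<TAB>count' lines in a hypothetical pair_counts table."""
    conn.execute("CREATE TABLE IF NOT EXISTS pair_counts "
                 "(first_id TEXT, second_id TEXT, n INTEGER)")
    for line in reducer_lines:
        key, count = line.strip().split("\t")
        first_id, second_id = key.split("_")
        conn.execute("INSERT INTO pair_counts VALUES (?, ?, ?)",
                     (first_id, second_id, int(count)))
    conn.commit()


def also_bought(conn, product_id, limit=3):
    """Return the IDs most often seen in the same cart as product_id."""
    rows = conn.execute(
        "SELECT CASE WHEN first_id = ? THEN second_id ELSE first_id END, n "
        "FROM pair_counts WHERE first_id = ? OR second_id = ? "
        "ORDER BY n DESC LIMIT ?",
        (product_id, product_id, product_id, limit))
    return [other for other, _ in rows]


conn = sqlite3.connect(":memory:")
load_pairs(conn, ["1001_1002\t5", "1001_1003\t2", "1002_1003\t9"])
print(also_bought(conn, "1002"))
# prints ['1003', '1001']
```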
