Social Network Mining
    Solutions using Google App Engine Map Reduce




     J Singh, DataThinks.org



                                        October 19, 2011
MapReduce: A Genealogical Perspective
• Roots
   – Lisp, Scheme
   – APL


• Google OS papers, 2004
   – Exploit extreme parallelism of data


• Apache Top Level Project (Hadoop)

• MapReduce on GAE borrows from all of these




© J Singh, 2011
Social Network Mining
• Finding people based on data in social networks
   –   Love and Romance
   –   Common interests
   –   Similar buying habits
   –   Similar voting propensities
   –   Location


• It's not a new problem
   – We have additional solutions for the old problem
        • Examples based on proprietary data: eHarmony, etc.
        • Early examples based on social network data: ShoutFlow,
          WhoIsJustLikeMe.



Based on clustering algorithms
• On-line demo of clustering

• Resource intensive.
   – Best done in batch mode

• Exploit data parallelism of the algorithm
   – App Engine Map Reduce, employing one map job for each cluster
   – App Engine Pipeline API, employing one stage of the pipeline for
     each 'step'

• But first, a detour into Map Reduce…
MapReduce Conceptual Underpinnings
• Based on Functional Programming model
   – From Lisp / Scheme
        • (map square '(1 2 3 4)) → (1 4 9 16)
        • (reduce plus '(1 4 9 16)) → 30
   – From APL
        • +/ N → 10, where N ← 1 2 3 4


• Easy to distribute: each element of the vector can be processed independently

• New for Map/Reduce: Nice failure/retry semantics
   – Needed because hundreds or thousands of low-end servers run at the
     same time, so some failures are expected
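
A minimal Python sketch of the same idea, not taken from the slides. `map` and `reduce` mirror the Lisp calls above. `map_with_retry` and `flaky_square` are hypothetical helpers, and the simulated failure is an assumption made only to show why independent per-element work is easy to retry.

```python
from functools import reduce
from operator import add


def square(x):
    return x * x


# Functional roots: apply a function to each element, then fold the results.
squares = list(map(square, [1, 2, 3, 4]))
total = reduce(add, squares)
apl_sum = reduce(add, [1, 2, 3, 4])  # the equivalent of APL's +/ N


def map_with_retry(fn, items, attempts=3):
    """Apply fn to each item, retrying an item that raises.

    Each element is independent, so a failure only costs one element's work.
    """
    results = []
    for item in items:
        for attempt in range(attempts):
            try:
                results.append(fn(item))
                break
            except RuntimeError:
                if attempt == attempts - 1:
                    raise
    return results


_failed_once = set()


def flaky_square(x):
    """Simulated unreliable worker: fails the first time it sees each input."""
    if x not in _failed_once:
        _failed_once.add(x)
        raise RuntimeError("simulated worker failure")
    return x * x


retried = map_with_retry(flaky_square, [1, 2, 3, 4])
```

Because `square` has no side effects, rerunning it for a failed element is safe, which is the property a retry scheme relies on.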



MapReduce Flow




MapReduce Components in GAE 2011
• Input Reader
   – Several provided by GAE, can write your own

• Map function: Written by Programmer

• Shuffle function:
   – Provided by GAE, can write your own

• Reduce function: Written by Programmer

• Output Writer
   – Several provided by GAE, can write your own
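
To make the five components concrete, here is a toy in-memory sketch in plain Python. It does not use the GAE library. All function names and the (user, interest) records are hypothetical, chosen to echo the "common interests" use case from earlier in the deck.

```python
from collections import defaultdict


def input_reader(records):
    """Input reader: hand records to the mapper one at a time."""
    for record in records:
        yield record


def map_fn(record):
    """Map (written by the programmer): emit (interest, user) pairs."""
    user, interest = record
    yield interest, user


def shuffle(pairs):
    """Shuffle: group every emitted value under its key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_fn(key, values):
    """Reduce (written by the programmer): one result per key."""
    return key, sorted(values)


def output_writer(results):
    """Output writer: here, simply collect results into a dict."""
    return dict(results)


def run_job(records):
    pairs = (pair for record in input_reader(records) for pair in map_fn(record))
    groups = shuffle(pairs)
    return output_writer(reduce_fn(key, values) for key, values in groups.items())


people_by_interest = run_job(
    [("ann", "hiking"), ("bob", "chess"), ("cat", "hiking")]
)
```

In a real job the reader, shuffle and writer would be the distributed pieces supplied by the platform. Only `map_fn` and `reduce_fn` correspond to the parts the slide marks as written by the programmer.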




Invoking GAE Map Reduce
class MapreducePipeline(…):
    def run(self,
            job_name,             # A string
            mapper_spec,          # Dotted name of the mapper function
            reducer_spec,         # Dotted name of the reducer function
            input_reader_spec,    # Dotted name of the input reader
            output_writer_spec,   # Dotted name of the output writer
            mapper_params,        # A dictionary
            reducer_params,       # A dictionary
            shards):              # An int
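
As an illustration of how these parameters might be filled in, the sketch below builds the arguments as a plain dictionary so it runs without App Engine. The job name, handler names and entity kind are hypothetical. The reader and writer paths are quoted from memory of the 2011-era library and should be treated as assumptions to check against the installed version.

```python
def build_mapreduce_args(entity_kind, shards=4):
    """Return keyword arguments named after the run() parameters above."""
    return {
        "job_name": "cluster_users",                # hypothetical job name
        "mapper_spec": "main.assign_to_centroid",   # hypothetical handler
        "reducer_spec": "main.move_centroid",       # hypothetical handler
        # Assumed library paths; verify before use.
        "input_reader_spec": "mapreduce.input_readers.DatastoreInputReader",
        "output_writer_spec": "mapreduce.output_writers.BlobstoreOutputWriter",
        "mapper_params": {"entity_kind": entity_kind},
        "reducer_params": {"mime_type": "text/plain"},
        "shards": shards,
    }


# "main.UserProfile" is a hypothetical datastore kind.
job_args = build_mapreduce_args("main.UserProfile", shards=8)
```

On App Engine these values would be handed to `MapreducePipeline`. The functions are passed by name rather than as objects, so that any worker can look them up.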


GAE Pipeline API
• Based on Python Generator functions

• The old Unix idea on steroids:
   – Perform complex operations by piping data between primitives
   – But the primitives are not so primitive
   – Data lives in permanent storage between pipeline stages


• MapreducePipeline (previous slide) is just one type of pipeline




Pipeline API Example Code
Split and Merge example


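  # Assumes the App Engine Pipeline library is importable as `pipeline`
  import pipeline
  from pipeline import common

  class aPipe(pipeline.Pipeline):
      def run(self, e_kind, prop_name, *value_list):
          all_bs = []
          for v in value_list:
              # Split: one child bPipe (defined elsewhere) per value
              stage = yield bPipe(e_kind, prop_name, v)
              all_bs.append(stage)
          # Merge: combine the children's outputs into one list
          yield common.Append(*all_bs)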
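
The generator mechanics behind the split-and-merge pattern can be tried without App Engine. The sketch below is a deliberately simplified stand-in, not the Pipeline API: `run_pipeline`, `square` and `append` are hypothetical names, children run synchronously in one process, and each `yield` receives the child's actual result rather than a placeholder for a value computed later.

```python
from functools import partial


def run_pipeline(gen_fn, *args):
    """Drive a generator: run each yielded child and send its result back."""
    gen = gen_fn(*args)
    result = None
    try:
        child = next(gen)
        while True:
            result = child()
            child = gen.send(result)
    except StopIteration:
        # The result of the last child is the pipeline's output.
        return result


def square(v):
    """Stand-in for the per-value child stage."""
    return v * v


def append(*parts):
    """Stand-in for the merge stage: combine the arguments into a list."""
    return list(parts)


def split_and_merge(values):
    parts = []
    for v in values:
        part = yield partial(square, v)   # split: one child per value
        parts.append(part)
    yield partial(append, *parts)         # merge: combine child outputs


merged = run_pipeline(split_and_merge, [1, 2, 3])
```

The real library differs in the important ways the slides point out: stages can run on different machines, and their data lives in permanent storage between stages.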




Pause and Assess
• Assertion:
   – GAE Map/Reduce is a complete solution for social network mining
   – We know it will scale; the question is how far.


• Working on one Proof of Concept for Social Network Mining
   – Recruiting a second test case


• Will report back in 3-4 months with data on
   – Performance
   – Cost
   – Limits of scalability


Adapting the algorithm to M/R
• Clustering Algorithm

   1. Create k randomly placed centroids

   2. Find the centroid (1-k) closest to each data point
      (Map: each data point)

   3. Move each centroid to the average of its members
      (Reduce: each centroid)

   4. Repeat 2 and 3 until there is no more change
      (Connect to next stage using Pipeline API)
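
The four steps above can be sketched in plain Python as a map step and a reduce step inside a loop. This is an in-memory illustration under stated assumptions, not the GAE job itself: all function names are hypothetical, distance is squared Euclidean, and the starting centroids are passed in rather than placed randomly (step 1) so that the run is repeatable.

```python
from collections import defaultdict


def closest(point, centroids):
    """Index of the centroid nearest to point (squared Euclidean distance)."""
    return min(
        range(len(centroids)),
        key=lambda i: sum((p - c) ** 2 for p, c in zip(point, centroids[i])),
    )


def map_point(point, centroids):
    """Map (step 2): emit (index of closest centroid, point)."""
    return closest(point, centroids), point


def reduce_centroid(members):
    """Reduce (step 3): average the member points, coordinate by coordinate."""
    n = len(members)
    return tuple(sum(coords) / n for coords in zip(*members))


def kmeans(points, centroids, max_iters=100):
    centroids = list(centroids)
    for _ in range(max_iters):
        groups = defaultdict(list)  # shuffle: group points by centroid index
        for point in points:
            key, value = map_point(point, centroids)
            groups[key].append(value)
        # A centroid with no members stays where it is.
        new_centroids = [
            reduce_centroid(groups[i]) if i in groups else centroids[i]
            for i in range(len(centroids))
        ]
        if new_centroids == centroids:  # step 4: stop when nothing changes
            break
        centroids = new_centroids
    return centroids


points = [(0.0, 0.0), (0.0, 2.0), (10.0, 10.0), (10.0, 12.0)]
final_centroids = kmeans(points, [(0.0, 0.0), (10.0, 10.0)])
```

Each pass of the loop corresponds to one map/reduce job, and chaining passes until convergence is the part the slide assigns to the Pipeline API.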

About Us
• Involved with Map/Reduce and NoSQL technologies on several
  platforms
   – Google App Engine, MongoDB


• DataThinks.org is a new service of Early Stage IT
   – Building and operating “Big Data” analytics services




                           Thanks

