Mapreduce and (in) Search Amund Tveit (PhD) [email_address] http://atbrox.com
TOC
- Brief about my background
- Mapreduce Overview
- Mapreduce in Search
- Advanced Mapreduce Example
Brief about my background
- PhD in Computer Science - http://amundtveit.info/publications/
- Googler for 4 years: Cluster Infrastructure, Nordic Search (and Maps) Quality, Google News for iPhone - details: http://no.linkedin.com/in/amundtveit
- Interested in how mapreduce is used (algorithmic patterns) - http://mapreducepatterns.org
- Working on projects using mapreduce in search and other large-scale problems
- Passionate about search and search technology
- Lesser-known trivia: shared an office with your professor back in 2000
Part 1  mapreduce
What is Mapreduce? Mapreduce is a concept and method for typically batch-based large-scale parallelization, inspired by functional programming's map() and reduce() functions. Nice features of mapreduce systems include:
- reliably processing a job even when machines die
- parallelization at large scale, e.g. thousands of machines for terasort
Mapreduce was invented at Google (by Jeff Dean and Sanjay Ghemawat).
Mapper Processes one key/value pair at a time, e.g.
word count:
    map(key: uri, value: text):
        for word in tokenize(value)
            emit(word, 1) // found 1 occurrence of word
inverted index:
    map(key: uri, value: text):
        for word in tokenize(value)
            emit(word, key)
Reducers A reducer processes one key and all values that belong to it, e.g.
word count:
    reduce(key: word type, value: list of 1s):
        emit(key, sum(value))
inverted index:
    reduce(key: word type, value: list of URIs):
        emit(key, value) // e.g. to a distributed hash table
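To make the word-count pair concrete, here is a minimal runnable sketch for Hadoop Streaming (the file names mapper.py and reducer.py are illustrative):

    #!/usr/bin/env python
    # mapper.py - read raw text lines from stdin, emit one "word <TAB> 1" per token
    import sys

    for line in sys.stdin:
        for word in line.strip().split():
            print("%s\t%d" % (word, 1))

    #!/usr/bin/env python
    # reducer.py - Hadoop Streaming sorts mapper output by key, so all counts
    # for a given word arrive on consecutive lines
    import sys

    current_word, current_count = None, 0
    for line in sys.stdin:
        word, count = line.rstrip("\n").split("\t")
        if word != current_word:
            if current_word is not None:
                print("%s\t%d" % (current_word, current_count))
            current_word, current_count = word, 0
        current_count += int(count)
    if current_word is not None:
        print("%s\t%d" % (current_word, current_count))

A shell pipeline mimics the shuffle for local testing: cat input.txt | python mapper.py | sort | python reducer.py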
Combiner Combiner functions are the subset of reducer functions that can be written as recurrence equations, e.g. sum_{n+1} = sum_n + x_{n+1}. This property holds (surprisingly) often and can be used to speed up mapreduce jobs (dramatically) by running the combiner function as an "afterburner" on the map function's tail. But sometimes, e.g. for advanced machine learning algorithms, reducers are more complex and can't be used as combiners.
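A sketch of the same idea as "in-mapper combining", assuming word count: since summing is an associative recurrence, the mapper can emit local partial sums and the reducer above still works unchanged:

    # in-mapper combining sketch for word count: pre-aggregate locally,
    # then emit partial sums; the reducer sums partials exactly as before
    import sys
    from collections import defaultdict

    counts = defaultdict(int)
    for line in sys.stdin:
        for word in line.strip().split():
            counts[word] += 1          # local recurrence: sum = sum + 1
    for word, count in counts.items():
        print("%s\t%d" % (word, count))

For sum-style jobs the reducer script itself can also be reused as the combiner, since summing partial sums gives the same total.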
Shuffler - the silent leader When the mapper emits a key/value pair, the shuffler does the actual work of shipping it to a reducer. The addressing is a function of the key, e.g.
    chosen reducer = hash(key) % num reducers
but it could basically be any routing function.
Q. When is changing the shuffle function useful?
A. When the data distribution is skewed, e.g. according to Zipf's law. Examples of this are stop words in text (overly frequent) and skewness in sales numbers (where bestsellers massively outnumber items in the long tail).
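A sketch of the default routing plus a hypothetical skew-aware variant (HOT_KEYS and the fan-out are made-up parameters; downstream consumers must merge the salted partials):

    import random
    import zlib

    def default_partition(key, num_reducers):
        # stable hash so the same key always routes to the same reducer
        return (zlib.crc32(key.encode("utf-8")) & 0xffffffff) % num_reducers

    HOT_KEYS = {"the", "and", "of"}     # hypothetical list of known-hot keys

    def skew_aware_partition(key, num_reducers, fanout=8):
        # spread an overly frequent key over `fanout` salted buckets so that
        # no single reducer drowns; partials must be merged afterwards
        if key in HOT_KEYS:
            key = "%s#%d" % (key, random.randrange(fanout))
        return (zlib.crc32(key.encode("utf-8")) & 0xffffffff) % num_reducers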
Part 2  mapreduce in search
What is important in search?
- Precision: results match the query, but primarily the user intent
- Recall: not missing important results
- Freshness: timely recall, e.g. for news
- Responsiveness: time is money, but search latency is minus-money
- Ease-of-use: few input widgets and intelligent query understanding
- Safety: not providing malware or web spam
Recap: What is Search?
- Getting data
- Processing data
- Indexing data
- Searching data
- Search Quality (maintain and improve)
- Ads Quality (which pays for the fun)
But: not necessarily in that order, and with feedback loops - searching might lead to getting usage data used in both processing and indexing.
Getting data - SizeOf(the Web)
- 2005 - Google and Yahoo competing with ~20 billion (20*10^9) documents in their indices
- 2008 - Google seeing 1 trillion (1*10^12) simultaneously existing unique urls
=> PetaBytes (i.e. thousands of harddisks)
Bibliography:
http://www.ysearchblog.com/?p=172
http://googleblog.blogspot.com/2005/09/we-wanted-something-special-for-our.html
http://googleblog.blogspot.com/2008/07/we-knew-web-was-big.html
http://www.chato.cl/research/crawling_thesis
http://research.google.com/people/jeff/WSDM09-keynote.pdf
http://www.morganclaypool.com/doi/abs/10.2200/S00193ED1V01Y200905CAC006
MR+Search: Get Data - Dups I Detect exact duplicates, near-duplicates, and scraped/combined content with mapreduce. Naively an O(N^2) problem - compare all N documents with each other - but it can be pruned by:
- comparing only documents which share substrings
- comparing only documents of similar size
MR+Search: Get data - Dups II
mapreduce job 1:
    map(key: uri, value: text)
        create hashes (shingles) of substrings of the content
        foreach hash in hashes: emit(key, hash)
    reduce(key: uri, values: list of hashes)
        size = number of hashes in values
        outputvalue = concat(key, size)
        foreach hash in values: emit(hash, outputvalue)
mapreduce job 2:
    map(key: hash, value: uri-numhashes)
        emit(key, value)
    reduce(key: hash, values: list of uri-numhashes)
        emit all unique pairs of uri-numhashes in values
        // e.g. (uri0-4, uri1-7), (uri0-4, uri13-5), ..
MR+Search: Dups III
mapreduce job 3:
    map(key: uri-numhashes, value: uri-numhashes)
        emit(key, value)
    reduce(key: uri-numhashes, values: list of uri-numhashes)
        // e.g. values for key uri0-4 could be
        // uri1-7, uri1-7, uri13-5, uri33-7
        for each unique uri-numhashes h in values
            alpha = count of h occurrences in values
            doc_similarity = alpha / (key's numhashes + h's numhashes - alpha)
            output_key = key's uri concatenated with h's uri
            emit(output_key, doc_similarity)
References:
http://www.cs.princeton.edu/courses/archive/spr05/cos598E/bib/Princeton.pdf
http://www.cs.uwaterloo.ca/~kmsalem/courses/CS848W10/presentations/Hani-proj.pdf
http://uwspace.uwaterloo.ca/bitstream/10012/5750/1/Khoshdel%20Nikkhoo_Hani.pdf
http://glinden.blogspot.com/2008/04/detecting-near-duplicates-in-big-data.html
http://infolab.stanford.edu/~manku/papers/07www-duplicates.ppt
http://simhash.googlecode.com/svn/trunk/paper/SimHashWithBib.pdf
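The doc_similarity above is the Jaccard coefficient over shingle sets. A minimal local sketch of the same computation (the shingle width k=4 and md5 as the shingle hash are arbitrary choices):

    import hashlib

    def shingles(text, k=4):
        # hash every k-word window of the text
        words = text.split()
        return {hashlib.md5(" ".join(words[i:i + k]).encode("utf-8")).hexdigest()
                for i in range(max(1, len(words) - k + 1))}

    def doc_similarity(text_a, text_b):
        a, b = shingles(text_a), shingles(text_b)
        alpha = len(a & b)                       # shared shingles
        union = len(a) + len(b) - alpha
        return alpha / float(union) if union else 0.0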
MR+Search: Processing data I E.g. a sequence of structurally similar map(reduce) steps where each one updates a package with new fields:
Content Normalization (to text and remove boilerplate)
    map(key: URI, value: rawcontent)
        create new documentpackage
        documentpackage.setRawContent(rawcontent)
        documentpackage.setText(stripBoilerPlate(toText(rawcontent)))
        emit(key, documentpackage)
Entity Extraction
    map(key: URI, value: documentpackage)
        names = findPersonNameEntities(documentpackage.getText())
        documentpackage.setNames(names)
        emit(key, documentpackage)
Sentiment Analysis
    map(key: URI, value: documentpackage)
        sentimentScore = calculateSentiment(documentpackage.getText())
        documentpackage.setSentimentScore(sentimentScore)
        emit(key, documentpackage)
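A sketch of the chaining pattern with a plain dict standing in for the document package; the helpers are trivial stand-ins (a real pipeline would plug in actual boilerplate stripping and sentiment scoring), and each function is the map() of one streaming job, chained via intermediate files:

    import json

    def strip_boilerplate(raw):
        return raw                 # stand-in: real code strips markup/boilerplate

    def calculate_sentiment(text):
        return 0.0                 # stand-in: real code scores the text

    def normalize_stage(line):
        uri, _, raw = line.rstrip("\n").partition("\t")
        package = {"raw": raw, "text": strip_boilerplate(raw)}
        return "%s\t%s" % (uri, json.dumps(package))

    def sentiment_stage(line):
        uri, _, encoded = line.rstrip("\n").partition("\t")
        package = json.loads(encoded)
        package["sentiment"] = calculate_sentiment(package["text"])
        return "%s\t%s" % (uri, json.dumps(package))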
MR+Search: Processing data II Creating a simple hypothetical ranking signal with mapreduce => ratio of vowels per page:
    map(key: URI, value: documentpackage)
        numvowels, numnonvowels = 0
        for each character in value:
            if isVowel(character)
                ++numvowels
            else
                ++numnonvowels
        vowelratio = numvowels / (numvowels + numnonvowels)
        documentpackage.setVowelRatio(vowelratio)
        emit(key, documentpackage)
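The same signal as a runnable stand-alone function (counting only alphabetic characters, and only ASCII vowels, are assumptions):

    VOWELS = set("aeiouAEIOU")

    def vowel_ratio(text):
        # ratio of vowels among alphabetic characters; 0.0 for empty input
        letters = [c for c in text if c.isalpha()]
        if not letters:
            return 0.0
        return sum(1 for c in letters if c in VOWELS) / float(len(letters))

    # e.g. vowel_ratio("mapreduce") -> 4/9 ~ 0.44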
MR+Search: Indexing data I The inverted index is among the "classics" of mapreduce examples; it resembles word count but outputs the URI instead of the occurrence count:
    map(key: URI, value: text)
        for word in tokenize(value)
            emit(word, key)
    reduce(key: wordtype, value: list of URIs)
        output(key, value) // e.g. to a distributed hash
MR+Search: Indexing data II How about positional information?
    map(key: URI, value: text)
        wordposition = 0
        foreach word in tokenize(value)
            outputvalue = concat(key, wordposition)
            ++wordposition
            emit(word, outputvalue)
    reduce(key: wordtype, value: list of URIs combined with word positions)
        create a positional index for the given word (key)
        output(key, value) // e.g. to a distributed hash
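A minimal runnable sketch of positional postings (the "uri:position" encoding is an arbitrary choice):

    from collections import defaultdict

    def map_positions(uri, text):
        # emit (word, "uri:position") pairs
        for position, word in enumerate(text.split()):
            yield word, "%s:%d" % (uri, position)

    def reduce_positions(word, values):
        # group the postings per URI: word -> {uri: [positions]}
        postings = defaultdict(list)
        for value in values:
            uri, position = value.rsplit(":", 1)
            postings[uri].append(int(position))
        return word, dict(postings)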
MR+Search: Indexing data III How about bigram or n-gram indices?
    map(key: URI, value: text)
        bigramposition = 0
        foreach bigram in bigramtokenizer(value)
            outputvalue = concat(key, bigramposition)
            ++bigramposition
            emit(bigram, outputvalue)
    reduce(key: bigramtype, value: list of URIs combined with bigram positions)
        create a positional bigram index for the given bigram (key)
        output(key, value) // e.g. to a distributed hash
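One possible bigramtokenizer, assuming whitespace tokenization:

    def bigram_tokenizer(text):
        # sliding window of two adjacent words
        words = text.split()
        return [" ".join(words[i:i + 2]) for i in range(len(words) - 1)]

    # e.g. bigram_tokenizer("new york city") -> ["new york", "york city"]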
MR+Search: Indexing data IV How about indexing of stems or synonyms?
    map(key: URI, value: text)
        foreach word in tokenizer(value)
            synonyms = findSynonyms(word)
            stem = createStem(word)
            emit(word, key)
            emit(stem, key)
            foreach synonym in synonyms
                emit(synonym, key)
    reduce(key: wordtype, value: list of URIs)
        output(key, value) // e.g. to a distributed hash
MR+Search: Searching data I A lot typically happens at search time: the query is "dissected", classified, interpreted and perhaps rewritten before querying the index. This generates clickthrough log data, from which we can find the most clicked-on uris:
mapreduce job 1:
    map(key: query, value: URI clicked on)
        emit(value, key)
    reduce(key: URI clicked on, value: list of queries)
        emit(count(value), key) // clicks per URI
mapreduce job 2:
    map(key: uri count, value: uri)
        emit(key, value)
    reduce(key: uri count, value: list of uris)
        emit(key, value) // generate a list of uris per popularity count
MR+Search: Searching data II Or we can find the number of clicks per query (note: queries with relatively few clicks might have issues with ranking):
mapreduce job 1:
    map(key: query, value: URI clicked on)
        emit(value, key)
    reduce(key: URI clicked on, value: list of queries)
        foreach unique query q in value // get all clicks per query for this uri
            outputvalue = count number of occurrences (clicks) of q in value
            emit(q, outputvalue)
mapreduce job 2:
    map(key: query, value: count)
        emit(key, value)
    reduce(key: query, value: list of counts)
        emit(sum(value), key) // total clickthroughs per query over all urls
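A local sketch of job 1's reduce step, with collections.Counter doing the per-query occurrence counting:

    from collections import Counter

    def reduce_clicks_per_query(uri, queries):
        # queries: every query whose result click landed on this uri
        for query, clicks in Counter(queries).items():
            yield query, clicks

    # e.g. list(reduce_clicks_per_query("uri0", ["q1", "q1", "q2"]))
    #   -> [("q1", 2), ("q2", 1)]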
MR+Search: Quality Changing how the processing and indexing mapreduce jobs work is likely to affect search quality (e.g. precision and recall). This can itself be evaluated with mapreduce, e.g. comparing the old and new sets of indices by querying both with the same (large) query log and comparing the differences in results.
Task: how would you do that with mapreduce?
References:
http://www.cs.northwestern.edu/~pardo/courses/mmml/papers/collaborative_filtering/crowdsourcing_for_relevance_evaluation_SIGIR08.pdf
MR+Search: Ads Quality Ads are commercial search results. They should meet similar relevance requirements as the "organic" results, but have less text "to rank with" themselves (~twitter tweet size or less) - fortunately there is a lot of metadata (about the advertiser, landing pages, keywords bid for etc.) that can be used to measure, improve and predict their relevance.
References:
http://www.wsdm-conference.org/2010/proceedings/docs/p361.pdf
http://web2py.iiit.ac.in/publications/default/download/techreport.pdf.a373bbf4a5b76063.4164436c69636b5468726f7567685261746549494954485265706f72742e706466.pdf
http://pages.stern.nyu.edu/~narchak/wfp0828-archak.pdf
http://research.yahoo.com/files/cikm2008-search%20advertising.pdf
MR+Search: Other Topics Plenty of mapreduce+search related topics haven't been covered; some to consider looking at:
- machine translation
- clustering
- graph algorithms
- spam/malware detection
- personalization
References:
http://www.google.com/research/pubs/och.html
http://www.usenix.org/event/hotbots07/tech/full_papers/provos/provos.pdf
https://cwiki.apache.org/confluence/display/MAHOUT/Algorithms
http://code.google.com/edu/submissions/mapreduce-minilecture/listing.html
http://www.youtube.com/watch?v=1ZDybXl212Q (clustering)
http://www.youtube.com/watch?v=BT-piFBP4fE (graphs)
Part 3  advanced mapreduce example
Adv. Mapreduce Example I This gives an example of how to port a machine learning classifier to mapreduce, with a discussion of optimizations.
The Classifier: Tikhonov Regularization with a square loss function (this family includes: proximal support vector machines, ridge regression, shrinkage regression and regularized least-squares classification)
    (omega, gamma) = (I/mu + E^T*E)^-1 * (E^T*D*e)
where:
    A - a matrix with feature vectors, e.g. [[2.9, 3.3, 11.1, 2.4], .. ]
    e - a vector filled with ones, e.g. [1.0, 1.0, .., 1.0]
    E = [A -e]
    mu - scalar constant used to tune the classifier
    D - a diagonal matrix of training classes with -1.0 or +1.0 values (depending on the class), e.g. diag([-1.0, 1.0, 1.0, .. ])
Adv. Mapreduce Example II How to classify an incoming feature vector x:
    class = x^T*omega - gamma
The classifier expression in Python (numerical python):
    (omega, gamma) = (I/mu + E.T*E).I*(E.T*D*e)
The expression has a (nice) additive property such that:
    (omega, gamma) = (I/mu + E_1.T*E_1 + E_2.T*E_2).I*(E_1.T*D_1*e + E_2.T*D_2*e)
By induction this becomes:
    (omega, gamma) = (I/mu + E_1.T*E_1 + .. + E_i.T*E_i).I*(E_1.T*D_1*e + .. + E_i.T*D_i*e)
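A quick numerical check of the additive property with modern numpy arrays (random block sizes; E is generic here rather than the [A -e] form):

    import numpy

    numpy.random.seed(0)
    mu = 0.1
    E1, E2 = numpy.random.randn(5, 3), numpy.random.randn(4, 3)
    D1 = numpy.diag([1.0, -1.0, 1.0, 1.0, -1.0])
    D2 = numpy.diag([1.0, 1.0, -1.0, -1.0])
    e1, e2 = numpy.ones((5, 1)), numpy.ones((4, 1))
    I = numpy.eye(3)

    # the sum of per-block partials ...
    partial = numpy.linalg.solve(I / mu + E1.T @ E1 + E2.T @ E2,
                                 E1.T @ D1 @ e1 + E2.T @ D2 @ e2)
    # ... equals the solution over the stacked data
    E = numpy.vstack([E1, E2])
    D = numpy.block([[D1, numpy.zeros((5, 4))],
                     [numpy.zeros((4, 5)), D2]])
    e = numpy.ones((9, 1))
    stacked = numpy.linalg.solve(I / mu + E.T @ E, E.T @ D @ e)
    assert numpy.allclose(partial, stacked)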
Adv. Mapreduce Example III Brief Recap:
    A - a matrix with feature vectors, e.g. [[2.9, 3.3, 11.1, 2.4], .. ]
    e - a vector filled with ones, e.g. [1.0, 1.0, .., 1.0]
    E = [A -e]
    mu - scalar constant used to tune the classifier
    D - a diagonal matrix of training classes with -1.0 or +1.0 values (depending on the class)
A and D represent distributed training data, e.g. spread out on many machines or on a distributed file system. Given the additive nature of the expression we can parallelize the calculation of E.T*E and E.T*D*e.
Adv. Mapreduce Example IV
Adv. Mapreduce Example V
    import base64
    import pickle

    import numpy

    def map(key, value):
        # input key = class for one training example, e.g. "-1.0"
        classes = [float(item) for item in key.split(",")]  # e.g. [-1.0]
        D = numpy.diag(classes)
        # input value = feature vector for one training example, e.g. "3.0, 7.0, 2.0"
        featurematrix = [float(item) for item in value.split(",")]
        A = numpy.matrix(featurematrix)
        # create matrix E and vector e
        e = numpy.matrix(numpy.ones(len(A)).reshape(len(A), 1))
        E = numpy.matrix(numpy.append(A, -e, axis=1))
        # create a tuple with the values to be used by the reducer, and encode it
        # with base64 to avoid potential trouble with '\t' and '\n' used as
        # default separators in Hadoop Streaming
        producedvalue = base64.b64encode(pickle.dumps((E.T * E, E.T * D * e)))
        # note: a single constant key "producedkey" sends everything to one reducer -
        # somewhat "atypical" due to the low degree of parallelism on the reducer side
        print "producedkey\t%s" % (producedvalue)
Adv. Mapreduce Example VI
    def reduce(key, values, mu=0.1):
        sumETE = None
        sumETDe = None
        # key isn't used, so ignoring it with _ (underscore)
        for _, value in values:
            # unpickle values
            ETE, ETDe = pickle.loads(base64.b64decode(value))
            if sumETE is None:
                # create the I/mu term with the correct dimensions
                sumETE = numpy.matrix(numpy.eye(ETE.shape[1]) / mu)
            sumETE += ETE
            if sumETDe is None:
                # create sumETDe with the correct dimensions
                sumETDe = ETDe
            else:
                sumETDe += ETDe
        # note: omega = result[:-1] and gamma = result[-1],
        # but printing the entire vector as output
        result = sumETE.I * sumETDe
        print "%s\t%s" % (key, str(result.tolist()))
Adv. Mapreduce Example VII Mapper increment size really makes a difference: the sizes of E.T*E and E.T*D*e given to the reducer are independent of the number of training examples(!) ==> put as much training data as possible into each E.T*E and E.T*D*e package.
Example: assume 1000 mappers with 1 billion training examples each (web scale :) and 100 features per training example.
- Putting all billion examples into one E.T*E and E.T*D*e package: the reducer needs to sum 1000 101x101 matrices (not bad)
- Sending 1 example per E.T*E and E.T*D*e package: the reducer needs to sum 1 trillion (10^12) 101x101 matrices (intractable)
Adv. Mapreduce Example VIII Avoid stressing the reducer
Mapreduce Patterns Map() and reduce() methods typically follow patterns. A recommended way of grouping code with such patterns is extracting and generalizing fingerprints based on:
- loops: e.g. "do-while", "while", "for", "repeat-until" => "loop"
- conditions: e.g. "if" and "switch" => "condition"
- emits and emitted data types (if available)
The fingerprint of the map() method for both word count and inverted index would be:
    map(key, value)
        loop
            emit(key:string, value:string)
