Randomized Clustering
Text Retrieval Task

•   Text viewed as a sequence of terms in fields

•   Document and position for each term are indexed

•   Query is a sequence of terms (typically many more than
    the user actually types)
Text Retrieval

•   Scores computed by merging occurrences of terms in
    query

•   Only top scoring documents are kept

•   Deletions and document edits are done by adding new
    documents and keeping a deletion list

•   (this is all standard ... Lucene is the best-known example)
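
As a rough illustration of the two slides above, here is a minimal Python sketch of a positional index with a deletion list and top-k scoring. The class name `TinyIndex`, its methods, and the scoring rule (one point per matching term occurrence) are assumptions made for this example; they are not Lucene's API or its scoring.

```python
import heapq
from collections import defaultdict


class TinyIndex:
    """Toy positional inverted index with a deletion list (illustrative only)."""

    def __init__(self):
        # term -> list of (doc_id, position) postings
        self.postings = defaultdict(list)
        self.deleted = set()   # deletion list: doc ids to skip at query time
        self.next_id = 0

    def add(self, text):
        doc_id = self.next_id
        self.next_id += 1
        for pos, term in enumerate(text.lower().split()):
            self.postings[term].append((doc_id, pos))
        return doc_id

    def delete(self, doc_id):
        # Postings are never rewritten; the document is only masked.
        self.deleted.add(doc_id)

    def update(self, doc_id, text):
        # An edit is a delete of the old version plus an add of the new one.
        self.delete(doc_id)
        return self.add(text)

    def search(self, terms, k=3):
        # Merge the occurrences of every query term into per-document scores.
        scores = defaultdict(int)
        for term in terms:
            for doc_id, _pos in self.postings.get(term.lower(), []):
                if doc_id not in self.deleted:
                    scores[doc_id] += 1
        # Keep only the top-scoring documents (ties go to the older document).
        return heapq.nlargest(k, scores.items(), key=lambda kv: (kv[1], -kv[0]))
```

Because edits only append postings and grow the deletion list, an index built this way can be treated as immutable once written, which is the property the later slides rely on.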
Traditional Scaling
                             Sharding
               1..n/4   n/4+1...n/2   n/2+1...3n/4   3n/4+1...n
                                                                    ?
              shard     shard         shard          shard        shard
                1         2             3              4            5
Replication




              shard     shard         shard          shard        shard
                1         2             3              4            5

              shard     shard         shard          shard        shard
                1         2             3              4            5
Consistent Hashing
  [Figure: consistent hashing diagram with positions labeled 0 and 1]
Problems

•   Presumes objects can be moved individually

•   Has very high insertion/deletion rate

•   Has disordered access patterns

•   Often exhibits content/placement correlations
Micro Sharding
    map:
        for (t in types)
           yield [key:(t, h(key)%shardCnt), value:doc]

    reduce:
        Retrieval Indexer #1, Retrieval Indexer #2, ... Retrieval Indexer #n
        Content Indexer #1, Content Indexer #2, ... Content Indexer #m

    output:
        hdfs

  n,m >> number of search nodes
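
The map step on this slide can be sketched in Python as follows. This is only an illustration under stated assumptions: the type names in `INDEX_TYPES`, the SHA-256 based `stable_hash`, and the `group_by_shard` helper that stands in for the shuffle are all hypothetical, and the real job runs as Hadoop map-reduce rather than in one process.

```python
import hashlib
from collections import defaultdict

INDEX_TYPES = ("retrieval", "content")  # assumed names for the two indexer kinds


def stable_hash(key):
    # Python's built-in hash() is salted per process, so use a digest instead.
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def map_doc(doc_key, doc, shard_count):
    """Map step: emit the document once per index type, keyed by (type, shard)."""
    for t in INDEX_TYPES:
        yield (t, stable_hash(doc_key) % shard_count), doc


def group_by_shard(docs, shard_count):
    """Stand-in for the shuffle: collect the documents each indexer would see."""
    shards = defaultdict(list)
    for doc_key, doc in docs.items():
        for key, value in map_doc(doc_key, doc, shard_count):
            shards[key].append(value)
    return shards
```

Choosing `shard_count` much larger than the number of search nodes is what makes these micro-shards: each node serves many small shards, so load can be moved in small units.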
Search Architecture
                            Retrieval Engine #1

presentation   federator    Retrieval Engine #2
       layer    federator

                            Retrieval Engine #n


                            Content Engine #1

                            Content Engine #m
Control Architecture

 federator       Retrieval Engine #2



                 zookeeper

        katta
                                       HDFS
        master
                    indexer
Scenario: Node Start
●   Node starts, tells ZK it exists and has no shards

●   Master notified by ZK, looks at shard placement

●   Imbalance exists so Master assigns shards to new node

●   Node notified by ZK, downloads shard, tells ZK

●   Master notified by ZK, looks at shard placement,
    unassigns shard somewhere
Scenario: Node Crash
●   ZK detects node connection loss and session expiration

●   Master is notified by ZK that node ephemeral file has
    vanished, looks at shard placements

●   If under-replication exists, Master assigns shards to
    other nodes

●   Nodes are notified by ZK, download shards, tell ZK

●   Master is notified by ZK, no action needed
Summary of Master
●   Master is notified of node set or shard set change

●   Master examines current state of cluster

●   If shards are under-replicated, add assignments

●   If shards are over-replicated, delete assignments

●   If cluster is imbalanced, add assignments

●   Rinse, repeat
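
The loop above can be sketched as a pure function over the cluster state. This is a simplified model, not Katta's master: the function name `reconcile`, the dictionary layout, and the least-loaded placement rule are assumptions. It also collapses the "add an assignment, then delete the surplus copy on the next notification" sequence into a single pass.

```python
def reconcile(assignments, live_nodes, shards, replication):
    """One pass of the master: returns a new {node: set(shards)} placement.

    assignments: {node: set of shard names} as last recorded
    live_nodes:  nodes currently registered
    """
    # Assignments held by nodes that are no longer registered are dropped.
    placement = {n: set(assignments.get(n, ())) for n in live_nodes}

    def holders(shard):
        return [n for n in placement if shard in placement[n]]

    def least_loaded(exclude):
        candidates = [n for n in placement if n not in exclude]
        if not candidates:
            return None
        return min(candidates, key=lambda n: (len(placement[n]), n))

    for shard in sorted(shards):
        # Under-replicated: add assignments on the least loaded nodes.
        while len(holders(shard)) < replication:
            target = least_loaded(holders(shard))
            if target is None:
                break  # fewer live nodes than the replication factor
            placement[target].add(shard)
        # Over-replicated: delete assignments from the most loaded holders.
        while len(holders(shard)) > replication:
            victim = max(holders(shard), key=lambda n: (len(placement[n]), n))
            placement[victim].discard(shard)

    # Imbalanced: move shards from the fullest node to the emptiest one.
    while placement:
        hi = max(placement, key=lambda n: (len(placement[n]), n))
        lo = min(placement, key=lambda n: (len(placement[n]), n))
        movable = sorted(placement[hi] - placement[lo])
        if len(placement[hi]) - len(placement[lo]) <= 1 or not movable:
            break
        shard = movable[0]
        placement[lo].add(shard)      # add the new assignment first ...
        placement[hi].discard(shard)  # ... then drop the now surplus copy
    return placement
```

Calling `reconcile` again with a shorter `live_nodes` list models the node crash scenario, and calling it with an extra empty node models node start; in both cases the master itself never serves queries or moves data.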
Quick Results
  •   No deletion/insertion in indexes at runtime

  •   Reloading micro-shards allows large sequential transfers

  •   Multiple shards allow very simple threading of search

  •   Random placement guided by balancing policy gives
      near-optimal motion

  •   Node addition and failure are simple, reliable

  •   Random sharding also near optimal
        local = global statistics, 2x query time improvement
        load balancing
        uniform management
Building Blocks
•   EC2 - elastic compute

•   Zookeeper - reliable coordination

•   Katta - shard and query management

•   Hadoop - map-reduce, RPC for Katta

•   Lucene - candidate set retrieval, index file storage

•   Deepdyve search algorithms - segment scoring
Zookeeper
  • Replicated key-value in-memory store

  • Minimal semantics
      create, read, replace specified version
      sequential and ephemeral files
      notifications

  • Very strict correctness guarantees
      strict ordering
      quorum writes
      no blocking operations
      no race conditions

  • High speed
      50,000 updates per second, 200,000 reads per second
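
To show why "replace specified version" is enough to avoid race conditions without blocking, here is a toy in-memory model in Python. It is not ZooKeeper's client API: `VersionedStore`, `add_shard`, and the path used in the usage note are hypothetical, and a real ZooKeeper client signals a stale version with an error rather than a `False` return value.

```python
class VersionedStore:
    """Toy in-memory model of version-checked replace (illustrative only)."""

    def __init__(self):
        self._data = {}  # path -> (value, version)

    def create(self, path, value):
        if path in self._data:
            raise KeyError(f"{path} already exists")
        self._data[path] = (value, 0)

    def read(self, path):
        return self._data[path]  # (value, version)

    def replace(self, path, value, expected_version):
        _, version = self._data[path]
        if version != expected_version:
            return False  # someone else updated first; caller re-reads and retries
        self._data[path] = (value, version + 1)
        return True


def add_shard(store, path, shard):
    """Optimistic read-modify-write loop built on replace-with-version."""
    while True:
        value, version = store.read(path)
        if store.replace(path, value | {shard}, version):
            return
```

A writer that read version 0 and then tries to replace after another writer has already moved the entry to version 1 simply fails and retries, so no lock is ever held across the network.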
Katta Interface
   • Simple Interface
       Client - horizontal broadcast for query, vertical broadcast for update
       InodeManaged - add/removeShard

   • Pluggable Application Interface

   • Pluggable Return Policy
       Given current return state
       return < 0 => done
       return 0 => return result, allow updates
       return n => wait at most n milliseconds

   • Comprehensive Results
       Results, exceptions, arrival times
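
The return policy convention on this slide can be sketched in Python as below. The names `simple_policy` and `wait_for_results`, the shape of the `state` dictionary, and the `poll` callback are assumptions for illustration; Katta's own interface is Java and may differ in detail.

```python
import time


def simple_policy(state):
    """Hypothetical policy: done when all shards report or the deadline passes."""
    if state["received"] >= state["expected"]:
        return -1                      # < 0: done, close the result set
    remaining_ms = state["deadline_ms"] - state["elapsed_ms"]
    if remaining_ms <= 0:
        return 0                       # 0: return what we have, updates allowed
    return remaining_ms                # n > 0: wait at most n milliseconds


def wait_for_results(poll, policy, expected, deadline_ms, clock=time.monotonic):
    """Drive a policy. `poll(timeout_s)` returns a list of newly arrived results."""
    start = clock()
    results = []
    while True:
        state = {
            "received": len(results),
            "expected": expected,
            "deadline_ms": deadline_ms,
            "elapsed_ms": (clock() - start) * 1000.0,
        }
        verdict = policy(state)
        if verdict < 0:
            return results, True       # complete
        if verdict == 0:
            return results, False      # partial; late results may still arrive
        results.extend(poll(verdict / 1000.0))
```

Keeping the policy separate from the waiting loop lets an application trade completeness for latency, for example by returning as soon as most shards have answered.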
Horizontal/Vertical Broadcast
               1..n/4   n/4+1...n/2   n/2+1...3n/4   3n/4+1...n


              shard     shard         shard          shard
                1         2             3              4
Replication




              shard     shard         shard          shard
                1         2             3              4

              shard     shard         shard          shard
                1         2             3              4
Operations

federator       Retrieval Engine #2



                zookeeper

       katta
                                      HDFS
       master
                   indexer
Impact of Cloud Approach
•   Scale-free programming

•   Deployed in EC2 (test) or in private farm (production)

•   No single point of failure

•   Real-time scale up/down

•   Extensible to real-time index updates
Lessons

  • Random is good

      no correlations in document to shard assignments

        ⇒ strong bounds on node variations in search time
        ⇒ local statistics are as good as global statistics

      no structure in shard to node assignments

        ⇒ node failure is not correlated to documents
        ⇒ load balancing and rebalancing is trivial
        ⇒ threaded search is trivial
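
The claim that local statistics are as good as global ones under random assignment can be checked with a small simulation. This sketch assumes a synthetic corpus in which one term occurs only in the first 5,000 of 100,000 documents (a stand-in for correlated content); the function name `local_doc_freqs` and all parameters are hypothetical.

```python
import random


def local_doc_freqs(assign, num_docs=100_000, num_shards=16, hot_docs=5_000):
    """Per-shard fraction of documents containing a term that occurs only in
    the first `hot_docs` documents (e.g. a burst of related content)."""
    docs = [0] * num_shards
    hits = [0] * num_shards
    for doc_id in range(num_docs):
        shard = assign(doc_id)
        docs[shard] += 1
        hits[shard] += doc_id < hot_docs
    return [h / d for h, d in zip(hits, docs)]


num_docs, num_shards = 100_000, 16
rng = random.Random(1)  # fixed seed so the run is repeatable

# Range sharding: consecutive documents share a shard.
by_range = local_doc_freqs(lambda d: d * num_shards // num_docs)
# Random sharding: every document picks a shard independently.
by_random = local_doc_freqs(lambda d: rng.randrange(num_shards))

print("range  sharding: min %.3f max %.3f" % (min(by_range), max(by_range)))
print("random sharding: min %.3f max %.3f" % (min(by_random), max(by_random)))
```

With range sharding the term's document frequency is concentrated in one shard and zero elsewhere, so local scores disagree with the global frequency of 0.05. With random sharding every shard sees a frequency close to the global one.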
More Lessons

  •    Randomized clustering requires good coordination
         Zookeeper makes that easy

  •    Good coordination means not having to say you’re sorry
         Masters coordinate but don’t participate
Resources
●   My blog
    – http://tdunning.blogspot.com/

●   The web-site
    – www.deepdyve.com

●   Source code
    – Katta   (sourceforge)
    – Hadoop    (Apache)
    – Lucene    (Apache)

HPTS talk on micro-sharding with Katta
