Search data store for the world's largest
                            biometric identity system


                    Regunath Balasubramanian         Shashikant Soni
                      regunathb@gmail.com      soni.shashikant@gmail.com
                       twitter @regunathb




CONFIDENTIAL: For limited circulation only                                 Slide 1
India
● 1.2 billion residents
   ● 640,000 villages, ~60% live on under $2/day
   ● ~75% literacy, <3% pay income tax, <20% banking
   ● ~800 million mobile, ~200-300 mn migrant workers

● Govt. spends about $25-40B on direct subsidies
   ● Residents have no standard identity document
   ● Most programs are plagued by ghost and multiple identities, causing
     leakage of 30-40%




                                                                        Slide 2
Aadhaar
● Create a common ‘national identity’ for every ‘resident’
   ●Biometric backed identity to eliminate duplicates
   ●‘Verifiable online identity’ for portability
● Applications ecosystem using open APIs
   ●Aadhaar enabled bank account and payment platform
   ●Aadhaar enabled electronic, paperless KYC (Know Your
     Customer)




                                                             Slide 3
Search Requirements
● Multi-attribute query like:
   name contains ‘regunath’ AND city = ‘bangalore’ AND
   address contains ‘J P Nagar’ AND YearOfBirth = ……


● Search 1.2B resident data with photo, history
   ●35 KB - average record size
● Response times in milliseconds
● Open scale out
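
To make the scale concrete, the figures on this slide can be multiplied out. This is only a rough sketch: it assumes the average record size means 35 kilobytes (decimal), and it ignores indexes, replication, and storage overhead, none of which the slide quantifies.

```python
# Back-of-envelope storage estimate from the figures on this slide.
RESIDENTS = 1_200_000_000          # 1.2B residents
AVG_RECORD_BYTES = 35 * 1000       # assumption: average record read as 35 kilobytes (decimal)

raw_bytes = RESIDENTS * AVG_RECORD_BYTES
raw_terabytes = raw_bytes / 10**12

print(f"Raw data, one copy, no indexes: {raw_terabytes:.0f} TB")
```

Under these assumptions a single un-replicated copy comes to roughly 42 TB, which is why open scale-out appears in the requirements.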


                                                         Slide 4
Why MongoDB
● Auto-sharding
● Replication
● Failover
   … Essentially an AP (slaveOk) data store in CAP parlance

● Evolving schema
● Map-Reduce for analysis
● Full text search
   ●Compound (or) multi-keys


                                                              Slide 5
Design

   [Diagram: the Client App calls the Search API with
    Name=‘abcde’, Address=‘some place’, Year=1980]

   1. Solr indexes:  Name: ‘abcde’, Address: ‘some place’, year: 1980
   2. MongoDB:       { _id:123456789, name: ‘abcde’, year:1980, ….. }



● Read/Search
   ●Sharded Solr indexes for search
   ●Keyed document read from MongoDB
● Write
   ●Eventual consistency (across data sources) driven by
    application
   ●Composite MongoDB-Solr app persistence handler
                                                                           Slide 6
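
The read and write paths on this slide can be sketched with in-memory stand-ins. Nothing here is the authors' handler or a real MongoDB or Solr client: `CompositeStore`, `flush_index`, and the `pending` queue are hypothetical names, a dict plays the role of MongoDB, and a token-to-ids map plays the role of the Solr indexes. The sketch only illustrates the pattern of searching an index for ids, then reading the documents by key, with the index catching up after the write.

```python
from collections import defaultdict


class CompositeStore:
    """Toy stand-in for a composite document-store plus search-index handler."""

    def __init__(self):
        self.documents = {}            # stands in for MongoDB, keyed by _id
        self.index = defaultdict(set)  # stands in for Solr: (field, token) -> ids
        self.pending = []              # ids written but not yet indexed

    def write(self, doc):
        # The document store is written first; indexing happens later,
        # so the two data sources are only eventually consistent.
        self.documents[doc["_id"]] = doc
        self.pending.append(doc["_id"])

    def flush_index(self):
        # Application-driven step that brings the search index up to date.
        # Simplification: old tokens of an updated document are not removed.
        for doc_id in self.pending:
            for field, value in self.documents[doc_id].items():
                if field == "_id":
                    continue
                for token in str(value).lower().split():
                    self.index[(field, token)].add(doc_id)
        self.pending.clear()

    def search(self, **criteria):
        # Step 1: intersect the id sets of every queried token (AND semantics).
        id_sets = [
            self.index.get((field, token), set())
            for field, value in criteria.items()
            for token in str(value).lower().split()
        ]
        ids = set.intersection(*id_sets) if id_sets else set()
        # Step 2: keyed read of the full documents.
        return [self.documents[i] for i in sorted(ids)]


store = CompositeStore()
store.write({"_id": 123456789, "name": "abcde", "address": "some place", "year": 1980})
before = store.search(name="abcde", year=1980)   # index has not caught up yet
store.flush_index()
after = store.search(name="abcde", address="some place", year=1980)
print(len(before), len(after))
```

The search returns nothing until `flush_index` runs, which is the window of inconsistency the application has to manage.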
Implementation and Deployment
   ● Start - 4M records in 2 shards
   ● Current - 250M records in 8 shards (8 x ~2 TB x 3 replicas)
   ● Performance, Reliability & Durability
      ●SlaveOk
      ●getLastError, Write Concern: availability vs durability
          j = journaling
          w = nodes-to-write
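   ● Replica-sets / Shards – how?
       [Diagram: three columns, each holding one member of each replica set,
        a config server, and a router]
            RS 1              RS 1              RS 1
            RS 2              RS 2              RS 2
            Config 1          Config 2          Config 3
            Router            Router            Router
       Legend: Primary / Secondary / Arbiter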
                                                                           Slide 7
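
The availability versus durability trade-off behind the `w` setting can be illustrated with a toy model. This is not MongoDB code: `Node` and `write_acknowledged` are hypothetical names, and the model only counts reachable members. The `j` option (waiting for the journal) is not modelled.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    reachable: bool = True


def write_acknowledged(nodes, w):
    """Return True if at least w reachable nodes could confirm the write."""
    return sum(node.reachable for node in nodes) >= w


# A three-member replica set with one secondary down.
replica_set = [
    Node("primary"),
    Node("secondary-1"),
    Node("secondary-2", reachable=False),
]

results = {w: write_acknowledged(replica_set, w) for w in (1, 2, 3)}
print(results)
```

With one member down, a write requiring `w = 3` cannot be acknowledged, while `w = 1` still succeeds but is confirmed by only a single node. A higher `w` buys durability at the cost of availability.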
Monitoring and Troubleshooting
● Monitoring tools evaluated
   ●MMS
   ●munin
● Manual approach - daily ritual
   ●RS, DB, config, router - health and stats
● Problem analysis stats
   ●mongostat, iostat, currentOps, logs
   ●Client connections
● Stats for storage, shards addition
   ●Data file size
   ●Shard data distribution
   ●Replication
                                                Slide 8
Key Learnings on MongoDB
● Indexing 32 fields
   ●Compound indexes
   ●Multi-key indexes
       {…"indexes" : [{ "email":"john.doe@email.com", "phone":"123456789" }] }
       db.coll.find ({ "indexes.email" : "john.doe@email.com" })
   ●Indexes use b-tree
   ●Many fields to index
   ●Performs well up to 1-2M documents
   ●Best if index fits in memory
● Data replication, RS failover
   ●Rollback when RS goes out of sync
       Manual restore (physical data copy)
       Restarting a very stale node
                                                                            Slide 9
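
The multi-key pattern on this slide, an array of sub-documents queried by a dotted path such as `indexes.email`, can be sketched in memory. This is a simplified illustration, not MongoDB's implementation: it uses a hash map rather than a b-tree, `build_multikey_index` is a hypothetical helper, and the `_id` values and the second document are made up for the example.

```python
from collections import defaultdict


def build_multikey_index(docs, array_field, key):
    """Map each value of `key` found in any array element to the owning _ids."""
    index = defaultdict(list)
    for doc in docs:
        # One index entry per array element that carries the key.
        for element in doc.get(array_field, []):
            if key in element:
                index[element[key]].append(doc["_id"])
    return index


docs = [
    {"_id": 1, "indexes": [{"email": "john.doe@email.com", "phone": "123456789"}]},
    {"_id": 2, "indexes": [{"email": "jane.roe@email.com"}, {"phone": "987654321"}]},
]

email_index = build_multikey_index(docs, "indexes", "email")
# Rough equivalent of: db.coll.find({ "indexes.email": "john.doe@email.com" })
matches = email_index.get("john.doe@email.com", [])
print(matches)
```

Because every array element contributes its own entry, the index grows with the number of elements rather than the number of documents, which bears on the slide's point that indexes work best when they fit in memory.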
Questions?



                    Regunath Balasubramanian               Shashikant Soni
                      regunathb@gmail.com            soni.shashikant@gmail.com
                       twitter @regunathb




                                                                                Slide 10
