Big Data and Me Bhupesh Bansal Feb 3, 2012
Relational Model Architecture Reference: http://www.slideshare.net/adorepump/voldemort-nosql
Linkedin 2006 Reference: http://www.slideshare.net/linkedin/linked-in-javaone-2008-tech-session-comm
Relational model The relational model is a triumph of computer science: general, concise, well understood. But then again: SQL is a pain, it is hard to build reusable data structures, and it hides performance issues and details.
Specialized Systems Architecture Reference: http://www.slideshare.net/adorepump/voldemort-nosql
Linkedin 2007 Reference: http://www.slideshare.net/linkedin/linked-in-javaone-2008-tech-session-comm
Specialized systems Specialized systems are efficient (10-100x). Search: inverted index. Offline: Hadoop, Teradata, Oracle DWH. Memcached: in-memory systems (social graph). Specialized systems are scalable. New data and problems: graphs, sequences, and text.
Batch Driven Architecture Reference: http://www.slideshare.net/bhupeshbansal/hadoop-user-group-jan2010
Motivation I: Big Data 02/06/12 Reference: algo2.iti.kit.edu/.../fopraext/index.html
Motivation II: Data Driven Features
Motivation III: Makes Money  02/06/12 Proprietary & Confidential
Motivation IV: Big Data is cool
Reference: http://www.slideshare.net/BenSiscovick/the-business-of-big-data-ia-ventures-8577588
Big Data Challenges Large-scale data processing: use all available signals, e.g. weblogs and social signals (Twitter/Facebook/LinkedIn). Data-driven applications: refine data and push it back to the user for consumption. Near-real-time feedback loop: keep continuously improving.
Why is this hard? Large-scale data processing: TB/PB of data; traditional storage systems cannot handle the scale. Data-driven applications: need to run complex machine-learning algorithms at this data scale. Near-real-time analysis improves application performance and usage.
Some good news!! Hadoop: the biggest single driver of the large-scale data economy; it scales, works, and is easy to use. Memcached: works, scales, and is fast. The open-source world: lots of awesome people working on awesome systems, e.g. HBase, Memcached, Voldemort, Kafka, Mahout, etc. Sharing across companies: common practices and knowledge shared across companies.
What works!! Simplicity: go with the simplest design possible. Near real time: async/batch processing; put computation in the background as much as possible. Duplicate data everywhere: build a customized solution for each problem, duplicating data as needed. Data river: publish events and let all systems consume at their own pace. Monitoring/alerting: keep a close eye on things and build a strong dev-ops team.
What doesn’t work!! Magic systems (auto-configure, auto-tuning): very hard to get right; instead have easy configuration and better monitoring. Open source, if not supported by a strong engineering team internally: be ready to have folks spend 30-40% of their time understanding and helping open-source components. Silver bullets: one system to solve all scaling problems, e.g. HBase; build separate systems for separate problems. Central data source: don’t lock your data, let it flow; use Kafka, Scribe, or any publish/subscribe system.
Open source Very important for any company today. Do not reinvent the wheel; do not write a line of code if not needed. 90/10% rule: pick up open-source solutions and fix what is broken. Big plus for hiring. Stand on the shoulders of the crowd.
Open source: Storage Problem: you want to store TBs of data for user consumption in real time; latency < 50 ms; scale of 10,000+ QPS. Solutions: Bigtable design, e.g. HBase; Amazon Dynamo design, e.g. Voldemort; cache with persistence, e.g. Membase; document-based storage, e.g. MongoDB.
Open source: Publish/Subscribe Problem: a data river for all other systems to get their feed. Solutions: strong data guarantees, e.g. ActiveMQ, RabbitMQ, HornetQ; log feeds, e.g. Scribe, Flume; Kafka, a great mix of both worlds.
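The "data river" idea behind log-based systems such as Kafka can be sketched in a few lines: producers append events to a shared log, and each consumer tracks its own offset so it can read at its own pace. The classes and names below are illustrative only, not a real Kafka API.

```python
# Minimal log-based publish/subscribe sketch: one append-only log,
# many consumers, each owning its own read position (offset).

class Log:
    def __init__(self):
        self.entries = []

    def publish(self, event):
        self.entries.append(event)

class Consumer:
    def __init__(self, log):
        self.log = log
        self.offset = 0  # each consumer owns its position in the log

    def poll(self):
        """Return all events published since the last poll."""
        new = self.log.entries[self.offset:]
        self.offset = len(self.log.entries)
        return new

river = Log()
fast, slow = Consumer(river), Consumer(river)

river.publish("profile-update")
assert fast.poll() == ["profile-update"]   # fast consumer keeps up

river.publish("page-view")
river.publish("search-query")
# the slow consumer catches up on its own schedule, missing nothing
assert slow.poll() == ["profile-update", "page-view", "search-query"]
```

The key design point is that the log, not the broker, is the source of truth: consumers can fall behind or replay without the publisher ever knowing.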
Open source: Real-time analysis Problem: analyze a stream of data and do simple analysis/reporting. Solutions: Splunk, a general-purpose but high-maintenance, expensive analysis tool; OpenTSDB, simple but scalable metrics reporting; Yahoo S4/Twitter Storm, online map-reduce-ish; new systems will need lots of love and care.
Open source: Search Problem: unstructured queries on data. Solutions: Lucene, the most tested common search library (but just a library); Solr, an old system with lots of users but bad design; ElasticSearch, very well designed but new; LinkedIn's open-source search systems, SenseiDB and Zoie.
Open source: Batch computation Problem: you want to process TBs of data. The solution is simple: use Hadoop. Hadoop workflow managers: Azkaban, Oozie. Query: native Java code, Cascading, Hive, Pig.
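The canonical batch computation is word count, and it fits the Hadoop Streaming shape: a mapper emits (word, 1) pairs and a reducer sums counts per word. In a real Streaming job these would read stdin and write stdout; here they are plain functions so the data flow is easy to follow, and the `sorted()` call stands in for Hadoop's shuffle/sort phase.

```python
# Word-count sketch in the Hadoop Streaming style.
from itertools import groupby
from operator import itemgetter

def mapper(lines):
    # emit (word, 1) for every word in every input line
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def reducer(pairs):
    # Hadoop sorts mapper output by key before the reduce phase;
    # sorted() + groupby mimics that shuffle here.
    for word, group in groupby(sorted(pairs, key=itemgetter(0)),
                               key=itemgetter(0)):
        yield (word, sum(count for _, count in group))

counts = dict(reducer(mapper(["big data and me", "big data is cool"])))
assert counts["big"] == 2 and counts["cool"] == 1
```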
Open source: Other Serialization: Avro, Thrift, Protocol Buffers. Compression: Snappy, LZO. Monitoring: Ganglia.
My personal picks!! Storage: pure key-value lookup: Voldemort; range queries and Hadoop job support: HBase; batch-generated read-only data serving: Voldemort. Publish/Subscribe: HornetQ or Kafka. Search: ElasticSearch. Hadoop: Azkaban, Hive, and native Java code.
Jeff Dean’s Thoughts Very practical advice on building good, reliable distributed systems. Highlights: back-of-the-envelope calculations; understand your base numbers well; scale for 10X, not 100X; embrace chaos/failure and design around it; monitor/status hooks at all levels; important not to try to be all things for everybody. Reference: http://www.slideshare.net/xlight/google-designs-lessons-and-advice-from-building-large-distributed-systems
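A back-of-the-envelope calculation in this spirit might look like the following. The latency constants are rough 2012-era orders of magnitude of the kind quoted in Jeff Dean's talk, not benchmarks, and the service-call figures come from the speaker notes later in this deck.

```python
# Back-of-the-envelope latency arithmetic (orders of magnitude only).

DISK_SEEK_S     = 10e-3    # ~10 ms per disk seek (for reference)
DISK_READ_1MB_S = 30e-3    # ~30 ms to read 1 MB sequentially from disk
MEM_READ_1MB_S  = 250e-6   # ~0.25 ms to read 1 MB from memory
DC_ROUND_TRIP_S = 500e-6   # ~0.5 ms round trip within a datacenter

# Reading 1 GB sequentially:
disk_s = 1024 * DISK_READ_1MB_S   # ~30 s from disk
mem_s  = 1024 * MEM_READ_1MB_S    # ~0.25 s from memory
assert disk_s / mem_s > 100       # disk is ~2 orders of magnitude slower

# 30 sequential service calls at ~20 ms each on one page blow a
# typical sub-second latency budget:
assert abs(30 * 20e-3 - 0.6) < 1e-9   # 600 ms
```

Knowing these base numbers is what makes "scale for 10X, not 100X" an actual calculation rather than a slogan.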
How was Voldemort born? Reference: 1) http://www.slideshare.net/bhupeshbansal/hadoop-user-group-jan2010 2) http://www.slideshare.net/adorepump/voldemort-nosql
Why NoSQL? TBs of data: sharding is the only way to scale, and no joins are possible (data is split across machines). Specialized systems, e.g. search and the network feed, break the relational model: constraints, triggers, etc. disappear, and there is lots of denormalization. Latency is key: relational DBs depend on a caching layer to achieve high throughput and low latency.
Inspired by Amazon Dynamo & Memcached Amazon's Dynamo storage system: works across data centers, eventual consistency, commodity hardware. Memcached: actually works, really fast, really simple.
ACID vs. CAP ACID: great for a single centralized server. CAP theorem: Consistency (strict), Availability, Partition tolerance; impossible to achieve all three at the same time on a distributed platform, so you can choose 2 out of 3. Dynamo chooses high availability and partition tolerance by relaxing strict consistency to eventual consistency.
Consistent Hashing The key space is partitioned into many small partitions. Partitions never change, but partition ownership can change. Replication: each partition is stored by ‘N’ nodes.
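The partition scheme can be sketched as follows: keys hash into a fixed set of partitions, nodes own partitions, and rebalancing moves whole partitions. The hash function and node names here are illustrative; this omits replication (the same partition owned by N nodes).

```python
# Fixed key-to-partition mapping; only partition ownership moves.
import hashlib

NUM_PARTITIONS = 8  # fixed forever; real clusters use hundreds or more

def partition(key):
    digest = hashlib.md5(key.encode()).digest()
    return int.from_bytes(digest[:4], "big") % NUM_PARTITIONS

# partition -> owning node; ownership can change, the partitioning cannot
ownership = {p: "node%d" % (p % 2) for p in range(NUM_PARTITIONS)}

def node_for(key):
    return ownership[partition(key)]

before = {k: node_for(k) for k in ("alice", "bob", "carol")}
moved = partition("alice")
ownership[moved] = "node2"  # rebalance: hand one partition to a new node

# only keys living in the moved partition change owners
for k, owner in before.items():
    if partition(k) != moved:
        assert node_for(k) == owner
assert node_for("alice") == "node2"
```

This is why the number of partitions is chosen once and generously: growth is handled by reshuffling ownership, never by rehashing keys.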
R + W > N N: the replication factor. R: the number of blocking reads. W: the number of blocking writes. If R + W > N then we have a quorum-like algorithm that guarantees we will read the latest write or fail. R, W, and N can be tuned for different use cases: W = 1 for highly available writes, R = 1 for read-intensive workloads. Knobs to tune performance, durability, and availability.
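The quorum condition above can be stated, and sanity-checked, as code: when R + W > N, any set of R replicas must intersect any set of W replicas, so a quorum read always touches at least one node holding the latest successful write.

```python
# The R + W > N quorum condition and a brute-force overlap check.
from itertools import combinations

def is_quorum(n, r, w):
    return r + w > n

# Typical 3-replica configurations:
assert is_quorum(n=3, r=2, w=2)          # balanced: reads see writes
assert not is_quorum(n=3, r=1, w=1)      # fast, but may read stale data
assert is_quorum(n=3, r=3, w=1)          # W=1: highly available writes
assert is_quorum(n=3, r=1, w=3)          # R=1: read-intensive workloads

# Sanity check: with N=3, R=2, W=2, every read set intersects
# every write set, so stale reads are impossible.
nodes = range(3)
for reads in combinations(nodes, 2):
    for writes in combinations(nodes, 2):
        assert set(reads) & set(writes)
```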
Versioning & Conflict Resolution Eventual consistency allows multiple versions of a value, so we need a way to tell which value is latest, and a way to say that two values are not comparable. Solutions: timestamps; vector clocks. Provides an ordering of writes with no locking or blocking necessary.
Vector Clock A vector clock [Lamport] provides a way to order events in a distributed system. A vector clock is a tuple {t1, t2, ..., tn} of counters. Each value update has a master node; when data is written with master node i, it increments ti, and all replicas receive the same version. This helps resolve consistency between writes on multiple replicas. Under a network partition you can have a case where two vector clocks are not comparable; in this case Voldemort returns both values to clients for conflict resolution.
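The comparison rule is short enough to sketch directly. Clocks are represented here as dicts of per-node counters (node names are illustrative): one clock dominates another if every counter is at least as large, and two clocks are concurrent when neither dominates, which is exactly the case where both values go back to the client.

```python
# Vector clock comparison: dominance and concurrency.

def dominates(a, b):
    """True if clock a reflects every event that clock b has seen."""
    nodes = set(a) | set(b)
    return all(a.get(n, 0) >= b.get(n, 0) for n in nodes)

def concurrent(a, b):
    return not dominates(a, b) and not dominates(b, a)

v1 = {"node0": 2, "node1": 1}       # a later write mastered on node0...
v2 = {"node0": 1, "node1": 1}       # ...so v2 is an older version
assert dominates(v1, v2) and not dominates(v2, v1)

# Writes on different masters during a partition are not comparable:
left  = {"node0": 2, "node1": 1}
right = {"node0": 1, "node1": 2}
assert concurrent(left, right)      # client must resolve the conflict
```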
Client API Data is organized into “stores”, i.e. tables. Key-value only, but values can be arbitrarily rich or complex: maps, lists, nested combinations. Four operations: PUT (Key K, Value V), GET (Key K), MULTI-GET (Iterator<Key> K), DELETE (Key K) / (Key K, Version ver). No range scans.
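The shape of that four-operation surface can be illustrated with a toy store backed by a plain dict. This is a stand-in for the idea, not the real Voldemort client: versioned deletes and vector clocks are omitted, and all names are hypothetical.

```python
# Toy "store" illustrating the key-value client surface described above.

class Store:
    def __init__(self, name):
        self.name = name          # stores play the role of tables
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def get(self, key):
        return self._data.get(key)

    def multi_get(self, keys):
        return {k: self._data[k] for k in keys if k in self._data}

    def delete(self, key):
        self._data.pop(key, None)

members = Store("members")
# values can be arbitrarily rich: maps, lists, nested combinations
members.put("m1", {"name": "Ann", "groups": ["nosql", "hadoop"]})
members.put("m2", {"name": "Bob", "groups": []})

assert members.get("m1")["name"] == "Ann"
assert sorted(members.multi_get(["m1", "m2"])) == ["m1", "m2"]
members.delete("m2")
assert members.get("m2") is None   # key lookups only, no range scans
```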
Voldemort Physical Deployment
 
Read-only storage engine Throughput vs. latency. Index building is done in Hadoop; fully parallel transfer; very efficient on-disk structure; heavy reliance on the OS page cache. Rollback! Reference: http://www.slideshare.net/bhupeshbansal/hadoop-user-group-jan2010
What do we use Hadoop/Voldemort for?
Batch Driven Architecture Reference: http://www.slideshare.net/bhupeshbansal/hadoop-user-group-jan2010
Data Flow Driven Architecture Reference: http://sna-projects.com/blog/2011/08/kafka/
Questions



Editor's Notes

  • #4: Example: member data; it does not make sense to repeatedly join positions, emails, groups, etc. Explain joins. How to model this better in Java? A JSON-like data model.
  • #7: Example: member data; it does not make sense to repeatedly join positions, emails, groups, etc. Explain joins. How to model this better in Java? A JSON-like data model.
  • #11: Statistical learning as the ultimate agile development tool (Peter Norvig), “business logic” through data rather than code
  • #30: No joins: across data domains due to APIs, within data domains due to performance. Natural operation: getAll(id…). Latency: if you want to call 30 services on your main pages, they had better be quick (30 * 20ms = 600ms).
  • #32: Strong consistency: all clients see the same view, even in the presence of updates. High availability: all clients can find some replica of the data, even in the presence of failures. Partition tolerance: the system properties hold even when the system is partitioned. High availability is the mantra for websites: better to deal with inconsistencies, because their primary need is to scale well to allow for a smooth user experience.
  • #33: Hashing: why do we need it? Basic problem: clients need to know which data lives where. Many ways of solving it: central configuration, or hashing. Linear hashing works, but the issue is a dynamic cluster: the key-hash to node-ID mapping changes for a lot of entries when you add new slots. Consistent hashing preserves the key-to-node mapping for most keys and changes only the minimal amount needed. How to do it? Number of partitions: arbitrary; each node is allocated many partitions (better load balancing and fault tolerance), from a few hundred to a few thousand. The key-to-partition mapping is fixed and only ownership of partitions can change.
  • #35: Give an example of reads and writes with vector clocks. Pros and cons vs. Paxos and 2PC. The user can supply a strategy for handling cases where v1 and v2 are not comparable.
  • #36: A fancy way of doing optimistic locking.
  • #37: Very simple APIs. No range scans: no iterator over the key set / entry set, because it is very hard to fix performance. There are plans to provide such an iterator.
  • #38: Explain partitions. Make things fast by removing slow things, not by tuning. The HTTP client is not performant; separate caching layer.
  • #40: Transfer time: 30 minutes. It can max out a Gb network, so be careful.
  • #43: Example: member data; it does not make sense to repeatedly join positions, emails, groups, etc. Explain joins. How to model this better in Java? A JSON-like data model.
  • #44: Questions, comments, etc