Big Data and Me Bhupesh Bansal Feb 3, 2012
Relational Model Architecture Reference: http://www.slideshare.net/adorepump/voldemort-nosql
Linkedin 2006 Reference: http://www.slideshare.net/linkedin/linked-in-javaone-2008-tech-session-comm
Relational model The relational model is a triumph of computer science: general, concise, well understood. But then again: SQL is a pain, it is hard to build reusable data structures, and it hides performance issues and details.
Specialized Systems Architecture Reference: http://www.slideshare.net/adorepump/voldemort-nosql
Linkedin 2007 Reference: http://www.slideshare.net/linkedin/linked-in-javaone-2008-tech-session-comm
Specialized systems Specialized systems are efficient (10-100x). Search: inverted index. Offline: Hadoop, Teradata, Oracle DWH. Memcached: in-memory systems (social graph). Specialized systems are scalable. New data and problems: graphs, sequences, and text.
Batch Driven Architecture Reference: http://www.slideshare.net/bhupeshbansal/hadoop-user-group-jan2010
Motivation I: Big Data 02/06/12 Reference: algo2.iti.kit.edu/.../fopraext/index.html
Motivation II: Data Driven Features
Motivation III: Makes Money  02/06/12 Proprietary & Confidential
Motivation IV: Big Data is cool
Reference: http://www.slideshare.net/BenSiscovick/the-business-of-big-data-ia-ventures-8577588
Big Data Challenges Large-scale data processing: use all available signals, e.g. weblogs and social signals (Twitter/Facebook/LinkedIn). Data-driven applications: refine data and push it back to the user for consumption. Near-real-time feedback loop: keep continuously improving.
Why is this hard? Large-scale data processing: TB/PB of data; traditional storage systems cannot handle the scale. Data-driven applications: need to run complex machine-learning algorithms at this data scale. Near-real-time analysis improves application performance and usage.
Some good news!! Hadoop: the biggest single driver of the large-scale data economy; it scales, works, and is easy to use. Memcached: works, scales, and is fast. The open-source world: lots of awesome people working on awesome systems, e.g. HBase, Memcached, Voldemort, Kafka, Mahout, etc. Sharing across companies: common practices and knowledge shared across companies.
What works!! Simplicity: go with the simplest design possible. Near real time: async/batch processing; put computation in the background as much as possible. Duplicate data everywhere: build a customized solution for each problem, duplicating data as needed. Data river: publish events and let all systems consume at their own pace. Monitoring/alerting: keep a close eye on things and build a strong dev-ops team.
What doesn’t work!! Magic systems (auto-configure, auto-tuning): very hard to get right; instead have easy configuration and better monitoring. Open source, if not supported by a strong engineering team internally: be ready to have folks spend 30-40% of their time understanding and helping open-source components. Silver bullets: one system to solve all scaling problems, e.g. HBase; build separate systems for separate problems. Central data source: don’t lock your data, let it flow; use Kafka, Scribe, or any publish/subscribe system.
Open source Very important for any company today. Do not reinvent the wheel; do not write a line of code if not needed. 90/10% rule: pick up open-source solutions and fix what is broken. Big plus for hiring. Stand on the shoulders of the crowd.
Open source: Storage Problem: you want to store TBs of data for user consumption in real time; latency < 50 ms; scale of 10,000+ QPS. Solutions: Bigtable design, e.g. HBase; Amazon Dynamo design, e.g. Voldemort; cache with persistence, e.g. Membase; document-based storage, e.g. MongoDB.
Open source: Publish/Subscribe Problem: a data river for all other systems to get their feed. Solutions: strong data guarantees, e.g. ActiveMQ, RabbitMQ, HornetQ; log feeds, e.g. Scribe, Flume; Kafka, a great mix of both worlds.
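The "data river" idea behind log-based systems such as Kafka can be sketched in a few lines: producers append events to a shared log, and each consumer tracks its own offset so it can read at its own pace. The classes and names below are illustrative only, not a real Kafka API.

```python
# Minimal log-based publish/subscribe sketch: one append-only log,
# many consumers, each owning its own read position (offset).

class Log:
    def __init__(self):
        self.entries = []

    def publish(self, event):
        self.entries.append(event)

class Consumer:
    def __init__(self, log):
        self.log = log
        self.offset = 0  # each consumer owns its position in the log

    def poll(self):
        """Return all events published since the last poll."""
        new = self.log.entries[self.offset:]
        self.offset = len(self.log.entries)
        return new

river = Log()
fast, slow = Consumer(river), Consumer(river)

river.publish("profile-update")
assert fast.poll() == ["profile-update"]   # fast consumer keeps up

river.publish("page-view")
river.publish("search-query")
# the slow consumer catches up on its own schedule, missing nothing
assert slow.poll() == ["profile-update", "page-view", "search-query"]
```

The key design point is that the log, not the broker, is the source of truth: consumers can fall behind or replay without the publisher ever knowing.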
Open source: Real-time analysis Problem: analyze a stream of data and do simple analysis/reporting. Solutions: Splunk, a general-purpose but high-maintenance, expensive analysis tool; OpenTSDB, simple but scalable metrics reporting; Yahoo S4/Twitter Storm, online map-reduce-ish; new systems will need lots of love and care.
Open source: Search Problem: unstructured queries on data. Solutions: Lucene, the most tested common search library (but just a library); Solr, an old system with lots of users but bad design; ElasticSearch, very well designed but new; LinkedIn's open-source search systems, SenseiDB and Zoie.
Open source: Batch computation Problem: you want to process TBs of data. The solution is simple: use Hadoop. Hadoop workflow managers: Azkaban, Oozie. Query: native Java code, Cascading, Hive, Pig.
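The canonical batch computation is word count, and it fits the Hadoop Streaming shape: a mapper emits (word, 1) pairs and a reducer sums counts per word. In a real Streaming job these would read stdin and write stdout; here they are plain functions so the data flow is easy to follow, and the `sorted()` call stands in for Hadoop's shuffle/sort phase.

```python
# Word-count sketch in the Hadoop Streaming style.
from itertools import groupby
from operator import itemgetter

def mapper(lines):
    # emit (word, 1) for every word in every input line
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def reducer(pairs):
    # Hadoop sorts mapper output by key before the reduce phase;
    # sorted() + groupby mimics that shuffle here.
    for word, group in groupby(sorted(pairs, key=itemgetter(0)),
                               key=itemgetter(0)):
        yield (word, sum(count for _, count in group))

counts = dict(reducer(mapper(["big data and me", "big data is cool"])))
assert counts["big"] == 2 and counts["cool"] == 1
```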
Open source: Other Serialization: Avro, Thrift, Protocol Buffers. Compression: Snappy, LZO. Monitoring: Ganglia.
My personal picks!! Storage: pure key-value lookup: Voldemort; range queries and Hadoop job support: HBase; batch-generated read-only data serving: Voldemort. Publish/Subscribe: HornetQ or Kafka. Search: ElasticSearch. Hadoop: Azkaban, Hive, and native Java code.
Jeff Dean’s Thoughts Very practical advice on building good, reliable distributed systems. Highlights: back-of-the-envelope calculations; understand your base numbers well; scale for 10X, not 100X; embrace chaos/failure and design around it; monitor/status hooks at all levels; important not to try to be all things for everybody. Reference: http://www.slideshare.net/xlight/google-designs-lessons-and-advice-from-building-large-distributed-systems
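A back-of-the-envelope calculation in this spirit might look like the following. The latency constants are rough 2012-era orders of magnitude of the kind quoted in Jeff Dean's talk, not benchmarks, and the service-call figures come from the speaker notes later in this deck.

```python
# Back-of-the-envelope latency arithmetic (orders of magnitude only).

DISK_SEEK_S     = 10e-3    # ~10 ms per disk seek (for reference)
DISK_READ_1MB_S = 30e-3    # ~30 ms to read 1 MB sequentially from disk
MEM_READ_1MB_S  = 250e-6   # ~0.25 ms to read 1 MB from memory
DC_ROUND_TRIP_S = 500e-6   # ~0.5 ms round trip within a datacenter

# Reading 1 GB sequentially:
disk_s = 1024 * DISK_READ_1MB_S   # ~30 s from disk
mem_s  = 1024 * MEM_READ_1MB_S    # ~0.25 s from memory
assert disk_s / mem_s > 100       # disk is ~2 orders of magnitude slower

# 30 sequential service calls at ~20 ms each on one page blow a
# typical sub-second latency budget:
assert abs(30 * 20e-3 - 0.6) < 1e-9   # 600 ms
```

Knowing these base numbers is what makes "scale for 10X, not 100X" an actual calculation rather than a slogan.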
How was Voldemort born? Reference: 1) http://www.slideshare.net/bhupeshbansal/hadoop-user-group-jan2010 2) http://www.slideshare.net/adorepump/voldemort-nosql
Why NoSQL? TBs of data: sharding is the only way to scale, and no joins are possible (data is split across machines). Specialized systems, e.g. search and the network feed, break the relational model: constraints, triggers, etc. disappear, and there is lots of denormalization. Latency is key: relational DBs depend on a caching layer to achieve high throughput and low latency.
Inspired by Amazon Dynamo & Memcached Amazon's Dynamo storage system: works across data centers, eventual consistency, commodity hardware. Memcached: actually works, really fast, really simple.
ACID vs. CAP ACID: great for a single centralized server. CAP theorem: Consistency (strict), Availability, Partition tolerance; impossible to achieve all three at the same time on a distributed platform, so you can choose 2 out of 3. Dynamo chooses high availability and partition tolerance by relaxing strict consistency to eventual consistency.
Consistent Hashing The key space is partitioned into many small partitions. Partitions never change, but partition ownership can change. Replication: each partition is stored by ‘N’ nodes.
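The partition scheme can be sketched as follows: keys hash into a fixed set of partitions, nodes own partitions, and rebalancing moves whole partitions. The hash function and node names here are illustrative; this omits replication (the same partition owned by N nodes).

```python
# Fixed key-to-partition mapping; only partition ownership moves.
import hashlib

NUM_PARTITIONS = 8  # fixed forever; real clusters use hundreds or more

def partition(key):
    digest = hashlib.md5(key.encode()).digest()
    return int.from_bytes(digest[:4], "big") % NUM_PARTITIONS

# partition -> owning node; ownership can change, the partitioning cannot
ownership = {p: "node%d" % (p % 2) for p in range(NUM_PARTITIONS)}

def node_for(key):
    return ownership[partition(key)]

before = {k: node_for(k) for k in ("alice", "bob", "carol")}
moved = partition("alice")
ownership[moved] = "node2"  # rebalance: hand one partition to a new node

# only keys living in the moved partition change owners
for k, owner in before.items():
    if partition(k) != moved:
        assert node_for(k) == owner
assert node_for("alice") == "node2"
```

This is why the number of partitions is chosen once and generously: growth is handled by reshuffling ownership, never by rehashing keys.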
R + W > N N: the replication factor. R: the number of blocking reads. W: the number of blocking writes. If R + W > N then we have a quorum-like algorithm that guarantees we will read the latest write or fail. R, W, and N can be tuned for different use cases: W = 1 for highly available writes, R = 1 for read-intensive workloads. Knobs to tune performance, durability, and availability.
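The quorum condition above can be stated, and sanity-checked, as code: when R + W > N, any set of R replicas must intersect any set of W replicas, so a quorum read always touches at least one node holding the latest successful write.

```python
# The R + W > N quorum condition and a brute-force overlap check.
from itertools import combinations

def is_quorum(n, r, w):
    return r + w > n

# Typical 3-replica configurations:
assert is_quorum(n=3, r=2, w=2)          # balanced: reads see writes
assert not is_quorum(n=3, r=1, w=1)      # fast, but may read stale data
assert is_quorum(n=3, r=3, w=1)          # W=1: highly available writes
assert is_quorum(n=3, r=1, w=3)          # R=1: read-intensive workloads

# Sanity check: with N=3, R=2, W=2, every read set intersects
# every write set, so stale reads are impossible.
nodes = range(3)
for reads in combinations(nodes, 2):
    for writes in combinations(nodes, 2):
        assert set(reads) & set(writes)
```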
Versioning & Conflict Resolution Eventual consistency allows multiple versions of a value, so we need a way to tell which value is latest, and a way to say that two values are not comparable. Solutions: timestamps; vector clocks. Provides an ordering of writes with no locking or blocking necessary.
Vector Clock A vector clock [Lamport] provides a way to order events in a distributed system. A vector clock is a tuple {t1, t2, ..., tn} of counters. Each value update has a master node; when data is written with master node i, it increments ti, and all replicas receive the same version. This helps resolve consistency between writes on multiple replicas. Under a network partition you can have a case where two vector clocks are not comparable; in this case Voldemort returns both values to clients for conflict resolution.
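The comparison rule is short enough to sketch directly. Clocks are represented here as dicts of per-node counters (node names are illustrative): one clock dominates another if every counter is at least as large, and two clocks are concurrent when neither dominates, which is exactly the case where both values go back to the client.

```python
# Vector clock comparison: dominance and concurrency.

def dominates(a, b):
    """True if clock a reflects every event that clock b has seen."""
    nodes = set(a) | set(b)
    return all(a.get(n, 0) >= b.get(n, 0) for n in nodes)

def concurrent(a, b):
    return not dominates(a, b) and not dominates(b, a)

v1 = {"node0": 2, "node1": 1}       # a later write mastered on node0...
v2 = {"node0": 1, "node1": 1}       # ...so v2 is an older version
assert dominates(v1, v2) and not dominates(v2, v1)

# Writes on different masters during a partition are not comparable:
left  = {"node0": 2, "node1": 1}
right = {"node0": 1, "node1": 2}
assert concurrent(left, right)      # client must resolve the conflict
```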
Client API Data is organized into “stores”, i.e. tables. Key-value only, but values can be arbitrarily rich or complex: maps, lists, nested combinations. Four operations: PUT (Key K, Value V), GET (Key K), MULTI-GET (Iterator<Key> K), DELETE (Key K) / (Key K, Version ver). No range scans.
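The shape of that four-operation surface can be illustrated with a toy store backed by a plain dict. This is a stand-in for the idea, not the real Voldemort client: versioned deletes and vector clocks are omitted, and all names are hypothetical.

```python
# Toy "store" illustrating the key-value client surface described above.

class Store:
    def __init__(self, name):
        self.name = name          # stores play the role of tables
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def get(self, key):
        return self._data.get(key)

    def multi_get(self, keys):
        return {k: self._data[k] for k in keys if k in self._data}

    def delete(self, key):
        self._data.pop(key, None)

members = Store("members")
# values can be arbitrarily rich: maps, lists, nested combinations
members.put("m1", {"name": "Ann", "groups": ["nosql", "hadoop"]})
members.put("m2", {"name": "Bob", "groups": []})

assert members.get("m1")["name"] == "Ann"
assert sorted(members.multi_get(["m1", "m2"])) == ["m1", "m2"]
members.delete("m2")
assert members.get("m2") is None   # key lookups only, no range scans
```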
Voldemort Physical Deployment
 
Read-only storage engine Throughput vs. latency. Index building is done in Hadoop; fully parallel transfer; very efficient on-disk structure; heavy reliance on the OS page cache. Rollback! Reference: http://www.slideshare.net/bhupeshbansal/hadoop-user-group-jan2010
What do we use Hadoop/Voldemort for?
Batch Driven Architecture Reference: http://www.slideshare.net/bhupeshbansal/hadoop-user-group-jan2010
Data Flow Driven Architecture Reference: http://sna-projects.com/blog/2011/08/kafka/
Questions



Editor's Notes

  • #4: Example: member data; it does not make sense to repeatedly join positions, emails, groups, etc. Explain joins. How to model this better in Java? A JSON-like data model.
  • #7: Example: member data; it does not make sense to repeatedly join positions, emails, groups, etc. Explain joins. How to model this better in Java? A JSON-like data model.
  • #11: Statistical learning as the ultimate agile development tool (Peter Norvig), “business logic” through data rather than code
  • #30: No joins: across data domains due to APIs, within data domains due to performance. Natural operation: getAll(id…). Latency: if you want to call 30 services on your main pages, they had better be quick (30 * 20ms = 600ms).
  • #32: Strong consistency: all clients see the same view, even in the presence of updates. High availability: all clients can find some replica of the data, even in the presence of failures. Partition tolerance: the system properties hold even when the system is partitioned. High availability is the mantra for websites: better to deal with inconsistencies, because their primary need is to scale well to allow for a smooth user experience.
  • #33: Hashing: why do we need it? Basic problem: clients need to know which data lives where. Many ways of solving it: central configuration, or hashing. Linear hashing works, but the issue is a dynamic cluster: the key-hash to node-ID mapping changes for a lot of entries when you add new slots. Consistent hashing preserves the key-to-node mapping for most keys and changes only the minimal amount needed. How to do it? Number of partitions: arbitrary; each node is allocated many partitions (better load balancing and fault tolerance), from a few hundred to a few thousand. The key-to-partition mapping is fixed and only ownership of partitions can change.
  • #35: Give an example of reads and writes with vector clocks. Pros and cons vs. Paxos and 2PC. The user can supply a strategy for handling cases where v1 and v2 are not comparable.
  • #36: A fancy way of doing optimistic locking.
  • #37: Very simple APIs. No range scans: no iterator over the key set / entry set, because it is very hard to fix performance. There are plans to provide such an iterator.
  • #38: Explain partitions. Make things fast by removing slow things, not by tuning. The HTTP client is not performant; separate caching layer.
  • #40: Transfer time: 30 minutes. It can max out a Gb network, so be careful.
  • #43: Example: member data; it does not make sense to repeatedly join positions, emails, groups, etc. Explain joins. How to model this better in Java? A JSON-like data model.
  • #44: Questions, comments, etc