Databases in the Cloud
     Seminar: Big Data Analytics
                Winter Semester 2012-13



             Moshfiqur Rahman
Agenda
       Cloud Computing Introduction
       Big Data
       RDBMS and Cloud Databases
       Scalability, Elasticity, Availability – New attributes for
        databases
           In RDBMS
           In Cloud Databases
       Challenges in Cloud Databases
       Big Data Analytics in Cloud
       Conclusion

Cloud Computing Introduction
       What is in the Cloud?
           Application as a service
           Hardware and system software
       Public and Private Cloud
       Cloud Computing attributes
           Virtually infinite computing resources
           Start small and grow as needed
           Pay-per-use scheme




Cloud Computing Introduction




                      Source: http://en.wikipedia.org/wiki/Cloud_computing

Big Data
       Large and complex data sets
           Exponential growth
           Structured, semi-structured, unstructured
           Hard to process in a traditional database system
       Challenges with Big Data
           Capture, scrutinization, storage, search, sharing, analysis…
       Big Data sources
           Mobile devices
           Sensors
           Software/server logs
           Cameras and so on…

Big Data Attributes
       Volume
           Many factors contribute to the growth of data, for example text streams
            from social networks
           Hidden relationships in data
           Data storage costs have decreased, but data analysis challenges have increased!
       Variety
           Data can be of all possible formats
           Structured/semi-structured data from RDBMS
           Unstructured data from documents, emails, video, audio, sensors
       Velocity
           Data processing speed must keep up with data production speed
           Streams of real-time data from sensors and social media
           Requires reacting quickly as data velocity increases

Relational Database
       A relational database is
           Collection of tables (entities)
               Multiple columns
               Multiple rows (tuples)
       Accessed by SQL
       Join multiple tables to get related data
       Normalization is used to minimize redundancy and
        dependency
       Referential Integrity is used to ensure data consistency
       Managed by Relational Database Management System
        (RDBMS)
       Oracle, MS SQL Server, MySQL, etc.

RDBMS - A misfit for cloud?
       RDBMS has
           Simplicity
           Robustness
           Flexibility
           Performance
           Compatibility
           (Limited) Scalability
       Cloud databases require
           Scalability
           Elasticity
           Availability

Cloud Databases
       Key/value store – a new kind of database management
        system
           Stores data as key/value pairs
           Targeted at specialized applications where an RDBMS is not
            suitable
       Also known as
           Document-oriented database
           Internet-facing database
           Attribute-oriented database
           Distributed database, etc.



Cloud Databases - Advantages
    Stores data as items
        Customer items, Order items in an e-commerce system
        A single item contains all the relevant data
    Relationships are not discarded, just simplified
        An Order item contains the keys of the associated Customer item
         and Product items
    Able to scale easily and dynamically
        Allows the user to pay only for the resources used
        Allows the vendor to scale its infrastructure depending on
         the size of its entire platform
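
The item model above can be sketched with a plain dictionary standing in for the key/value store. This is only an illustration: the key naming scheme ("customer:42"), the field names and the helper functions are assumptions, not part of any particular cloud database.

```python
# Hypothetical in-memory key/value store; keys and fields are illustrative only.
store = {
    "customer:42": {"name": "Alice", "city": "Berlin"},
    "product:7": {"title": "Keyboard", "price": 30.0},
    "product:9": {"title": "Mouse", "price": 15.0},
    # The order item carries the keys of its customer and products
    # instead of relying on foreign keys and joins.
    "order:1001": {
        "customer_key": "customer:42",
        "product_keys": ["product:7", "product:9"],
    },
}


def order_total(store, order_key):
    """Follow the product keys embedded in an order item and sum the prices."""
    order = store[order_key]
    return sum(store[key]["price"] for key in order["product_keys"])


def order_customer_name(store, order_key):
    """Resolve the customer key stored inside the order item."""
    order = store[order_key]
    return store[order["customer_key"]]["name"]
```

Each lookup is a direct access by key, which is what lets a store of this kind spread items over many machines.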



Cloud Databases - Advantages
    Reduces development time
        Less time is spent on object-relational data
         mapping
        Easier to map application objects to key/value database items




Cloud Databases - Disadvantages
    Relationships are not defined in the data model
        The DBMS cannot enforce data integrity
        Deleting an item from a set of related items can leave the data
         inconsistent
    No shared standard
        Each vendor offers a totally different set of APIs
        An application developed for one cloud vendor is hard to port to
         another cloud vendor
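
The integrity problem can be shown with a small sketch, again using a dictionary as a stand-in store. The item layout and the `dangling_keys` helper are hypothetical; the point is only that the check has to live in application code because the store itself knows nothing about the relationship.

```python
# Hypothetical items; the store has no notion of foreign keys.
store = {
    "customer:42": {"name": "Alice"},
    "product:7": {"title": "Keyboard"},
    "order:1001": {"customer_key": "customer:42", "product_keys": ["product:7"]},
}


def dangling_keys(store, order_key):
    """Return the keys an order item references that are missing from the store."""
    order = store[order_key]
    referenced = [order["customer_key"], *order["product_keys"]]
    return [key for key in referenced if key not in store]


before = dangling_keys(store, "order:1001")

# The delete succeeds without complaint; nothing protects the order item.
del store["product:7"]

after = dangling_keys(store, "order:1001")
```

An RDBMS with a foreign-key constraint would reject or cascade the delete; here the order silently keeps a key that points nowhere.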




Scalability
    The ability of a system to accommodate a growing
     amount of work
        By adding more hardware to a single machine
        By adding more machines (a.k.a. nodes)
    Two ways to scale a system
        Vertically or Scale Up
            New hardware is added to a single node in a system
            Adding more processors or memory to a single machine
        Horizontally or Scale Out
            Add more nodes to a system
            For example, scaling out from a one-web-server system to a
             three-web-server system

Elasticity
    Ability to spread the workloads dynamically over the
     available resources
        Automatically adds more resources when workload increases
        Automatically shrinks back and removes the unneeded
         resources when workload decreases
    Very important in a cloud environment
        Pay-per-use scheme
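
A minimal sketch of the grow-and-shrink decision, assuming load is measured in requests per second and every node has the same capacity. The function name, parameters and limits are illustrative assumptions, not a real autoscaling API.

```python
import math


def desired_nodes(load_rps, capacity_per_node_rps, min_nodes=1, max_nodes=100):
    """Return how many nodes the current load calls for.

    Assumes identical nodes and a single load metric; real systems also
    consider warm-up time, cooldown periods and cost.
    """
    needed = math.ceil(load_rps / capacity_per_node_rps)
    # Never drop below the minimum or exceed the configured maximum.
    return max(min_nodes, min(max_nodes, needed))
```

Called periodically, a function like this adds nodes when the workload rises and releases them when it falls, so the bill follows actual usage.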




Availability
    Allows users to read and write data at any time without
     being blocked
    Response time is virtually constant and does not depend
     on
        Number of concurrent users
        Database size
        Any other system parameter
    Automatic data backups and failover management




Scalability in RDBMS
    RDBMS provides limited scalability
        Scale up on a single node
        Scale out with relatively small numbers of nodes
    Scale-up has a hard limit, but the increase in workload can be
     virtually infinite
    Scale-out becomes unmanageable in systems with hundreds or
     thousands of nodes




Elasticity in RDBMS
    RDBMS allows very limited elasticity at the storage and
     web/application server layers
        Add a web server when the workload increases and rebalance
         so that part of the load goes to the new server
        When the workload decreases, detach the server from the system
         and use it for other purposes
        At the storage layer, more disks can be added
    Replacing the overloaded database server with a bigger
     machine
        Expensive investment
        Unnecessary investment for a seasonal peak


Availability in RDBMS
    Employs storage redundancy by performing data
     replication
        Also improves performance for concurrent users
        Provides resiliency in case of a failure
    Data replication is not an easy process
        Replicas must be kept synchronized
        Often the whole database is replicated to make synchronization easier




Scalability, Elasticity and Availability in
Cloud Databases
    A new breed of databases focusing on scalability, elasticity
     and availability
        Key/value stores support nearly limitless scalability
        At the expense of other benefits that come with an RDBMS
    Data is accessed by a single key
        Provides the basis for scalability
        A data item is contained in a single object and handled by a
         single node
    Some modern applications need atomic access to multiple
     key/value pairs
        Online multi-player games, Google Drive
        Hence multi-key atomicity is required
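
Why single-key access scales can be sketched as follows: if the node is derived from the key alone, every request for that key lands on exactly one node and no coordination between nodes is needed. The function and node names below are assumptions for illustration.

```python
import hashlib


def node_for_key(key: str, nodes: list[str]) -> str:
    """Pick the single node responsible for a key by hashing the key.

    Simple modulo placement; it remaps most keys when the node list
    changes, which is why real stores often use consistent hashing.
    """
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(nodes)
    return nodes[index]


nodes = ["node-a", "node-b", "node-c"]
home = node_for_key("player:1337", nodes)
```

Two keys may hash to different nodes, so updating both atomically needs extra machinery; that is the multi-key atomicity problem the systems on the following slides address.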

Scalability, Elasticity and Availability in
Cloud Databases
    Different database implementations
        Google’s MegaStore
        G-Store
        Relational Cloud
        ElasTras




MegaStore
    Uses Bigtable as the underlying system
    Provides multi-key atomicity
        Data Fusion
        Groups multiple key/value pairs into a single collection
        Write-ahead logging
        Two-phase commit to support ACID transactions on a
         collection
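
The idea of write-ahead logging for a grouped update can be sketched in a few lines. This is a toy model, not MegaStore's implementation: the class name, the in-memory list standing in for a durable log, and the replay method are all assumptions.

```python
class EntityGroup:
    """Hypothetical collection of key/value pairs updated as one unit."""

    def __init__(self):
        self.data = {}
        self.log = []  # append-only list standing in for a durable log

    def commit(self, updates):
        # 1. Record the whole batch in the log before touching the data.
        self.log.append(dict(updates))
        # 2. Apply the batch; after a crash, replay() can redo it from the log.
        self.data.update(updates)

    def replay(self):
        """Rebuild the data from the log, as recovery would after a crash."""
        rebuilt = {}
        for entry in self.log:
            rebuilt.update(entry)
        self.data = rebuilt
        return rebuilt
```

Because each log entry holds the complete batch, recovery applies either all of a multi-key update or none of it.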




MegaStore
    Advantages
        Allows entities to be arbitrarily distributed over multiple nodes
        Better performance when an entity group is co-located on a single
         node
    Disadvantages
        Exhibits performance issues when an entity group is distributed
         across multiple nodes




G-Store
    Provides transactional multi-key access over dynamic,
     non-overlapping groups of keys
    Created groups are transient in nature
    Creates an abstract group for on-demand transactional access
        Leader key, follower keys
        Ownership of read/write access transfers to the node hosting
         the key group
        No key may be claimed by multiple groups, and no key may
         be without an owner
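
The non-overlap rule for key groups can be sketched as a small bookkeeping class. This is an illustration under assumptions, not G-Store's protocol: the class and method names are invented, and a key absent from the `owner` map is taken to be owned by its default home node.

```python
class KeyGroupManager:
    """Tracks which transient group, identified by its leader key, owns each key."""

    def __init__(self):
        # key -> leader key of the owning group; keys not listed here are
        # assumed to remain owned by their default home node.
        self.owner = {}

    def create_group(self, leader, followers):
        keys = [leader, *followers]
        taken = [k for k in keys if k in self.owner]
        if taken:
            # A key may belong to at most one group at a time.
            raise ValueError(f"keys already grouped: {taken}")
        for k in keys:
            self.owner[k] = leader

    def delete_group(self, leader):
        """Dissolve the group; its keys fall back to their home nodes."""
        for k in [k for k, l in self.owner.items() if l == leader]:
            del self.owner[k]
```

Checking all keys before assigning any of them keeps group creation all-or-nothing, so a failed request never leaves a half-formed group behind.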




G-Store
    Advantages
        Transactions are efficient because a key group resides on a single node
    Disadvantages
        A group must be small enough to reside on a single node




Relational Cloud
    Focuses heavily on elasticity
    Uses a graph-based partitioning method to split large
     databases across multiple machines
        Workload-aware partitioning strategy
    Frontend transaction trace component
        Keeps track of transactions
        Analyzes the transactions to determine the sets of tuples
         accessed together
        Creates a graph from the transaction trace
        Edges are weighted to denote how often the corresponding
         transactions are executed
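
One way to build such a graph from a trace is sketched below. It assumes that nodes are tuples, that an edge links two tuples accessed by the same transaction, and that the edge weight counts those transactions; the tuple ids and the trace format are made up for the example.

```python
from collections import Counter
from itertools import combinations


def co_access_graph(transactions):
    """Count, for every pair of tuples, how many transactions touched both."""
    edges = Counter()
    for tuple_ids in transactions:
        # Sort so that each undirected edge has one canonical (a, b) form.
        for a, b in combinations(sorted(set(tuple_ids)), 2):
            edges[(a, b)] += 1
    return edges


# Hypothetical trace: each inner list is the set of tuples one transaction accessed.
trace = [["t1", "t2"], ["t1", "t2", "t3"], ["t3", "t4"]]
graph = co_access_graph(trace)
```

A partitioner would then try to cut this graph so that heavy edges stay inside one partition, keeping frequently co-accessed tuples on the same machine.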

Relational Cloud
    Advantages
        Uses MySQL and PostgreSQL as backend databases
        Migrates database partitions without causing downtime
        Replicates the data for availability
    Disadvantages
        Scaling the graph representation is difficult as it leads to a
         graph with N nodes and up to N^2 edges for an N-tuple
         database




ElasTras
    A cloud database under research providing better
     scalability and elasticity with transactional data access




                        Figure: System overview of ElasTras

ElasTras
    Two level Transaction Manager (TM)
        Higher level TM (HTM)
        Owning TM (OTM)
    When a transaction request arrives
        The load balancer applies a load-balancing policy and forwards the
         request to an appropriate HTM
        The HTM decides whether to execute the transaction locally or
         forward it to an OTM
        An OTM has exclusive access rights to the data accessed by a
         single transaction
    System state information and database metadata are managed
     by the Metadata Manager

ElasTras
    Two approaches to partitioning the database
        Static Partitioning
        Dynamic Partitioning
    Static Partitioning
        The database designer defines the partitioning
        ElasTras is responsible for mapping the partitions to their
         specific OTMs
        Also reassigns partitions if the workload increases
        The application has knowledge of the partitions
        ElasTras can provide ACID transactional guarantees because
         transactions execute locally within a partition


ElasTras
    Dynamic Partitioning
        Basis for the elasticity of the data store
        Uses range or hash based partitioning scheme
        Applications are not aware of the partitions
        Transactions are not guaranteed to be limited to a single
         partition
        Provides mini transactions with restricted transactional
         semantics to ensure scalability and avoid distributed
         transactions
        Mini transactions ensure recovery but provide no global
         synchronization
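
The two partitioning schemes named above can be contrasted in a short sketch. It is a generic illustration, not ElasTras code: the function names, the choice of SHA-256 and the boundary list are assumptions.

```python
import bisect
import hashlib


def hash_partition(key: str, num_partitions: int) -> int:
    """Hash partitioning: spreads keys evenly but scatters neighbouring keys."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions


def range_partition(key: str, upper_bounds: list[str]) -> int:
    """Range partitioning over sorted split points.

    Partition 0 holds keys below upper_bounds[0], partition i holds keys
    from upper_bounds[i-1] up to (not including) upper_bounds[i], and the
    last partition holds everything from the final split point onward.
    """
    return bisect.bisect_right(upper_bounds, key)


bounds = ["g", "n", "t"]  # three split points give four partitions
```

Range partitioning keeps adjacent keys together, which suits scans; hash partitioning balances load better. In neither case does the application choose the partition, so a transaction touching several keys may span partitions.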


ElasTras
    Advantages
        Provides transactional guarantees in a scalable manner
        Reassigning partitions to OTMs as the workload changes
         ensures elasticity and scalability
        Provides ACID transactions when transactions are limited to a
         single partition
    Disadvantages
        In dynamic partitioning, ElasTras only supports mini
         transactions with restricted transactional semantics to avoid
         distributed transactions
        Mini transactions ensure recovery but provide no global
         synchronization

Challenges in Cloud Databases
    Importing data
        Data transport is complex and may incur huge costs
    Auto failover management
        Server crashes, hardware malfunction
        The database must be replicated so that a replica automatically
         takes over and starts working if any failure occurs
    Auto scalability and elasticity management
        Scale both throughput and size instantly and automatically
        Fine-grained growth and shrinkage of resources




Big Data Analytics
    Big Data Analytics
        Process of analyzing huge amounts of structured,
         semi-structured and unstructured data of various types
        Discover the hidden patterns and unknown correlations in
         data
    Companies are interested in big data analytics to achieve
     competitive advantages over rival companies
        Through effective marketing
        By offering innovative new services




Big Data Analytics
    Big data analytics help companies make better business
     decisions
    Traditional analytics software is available for data analysis
        Advanced technologies such as predictive analysis, data mining, etc.
    But traditional analytics software
        is not suitable for big data with semi-structured and/or unstructured
         data
        cannot supply the processing power needed to
         analyze such big data
    A new class of big data analytics environments has emerged
        NoSQL databases
        Hadoop
        MapReduce


Big Data Analytics in Cloud
    Database-as-a-service offerings available in the cloud
        Amazon SimpleDB
        Google AppEngine
        Microsoft SQL Azure
        and so on…
    Limitations in the cloud
        Limits on query execution time, for example, Amazon
         SimpleDB does not allow any query that takes more than 5 seconds
        Limits on result set size, for example, Google
         AppEngine does not allow users to retrieve more than 1000
         items for any query
    These limits make such services impractical for big data analytics

Big Data Analytics in Cloud
    Specialized solution for big data analytics in cloud
        Google BigQuery
        Amazon Elastic MapReduce (EMR)




Google BigQuery
    Cloud-based interactive query service for big data
    An implementation of Dremel, a parallel query engine
    Queries execute on a small number of very large
     append-only tables
    Two core technologies
        Columnar storage
            Records are split into column values
            The values of each column are placed on a different storage volume
        Tree architecture
            Queries are pushed down through the branches of the tree
            Results are aggregated back up from the leaves
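
Both ideas can be illustrated in miniature. The sketch below is not how BigQuery or Dremel is implemented; the function names and sample records are assumptions, and every record is assumed to have the same fields.

```python
def to_columns(records):
    """Turn a list of row dicts into one list of values per column."""
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


def tree_sum(shards):
    """Leaves compute partial sums over their shard; the root combines them."""
    partials = [sum(shard) for shard in shards]  # work done at the leaves
    return sum(partials)  # aggregation at the root


rows = [
    {"user": "a", "country": "DE", "bytes": 120},
    {"user": "b", "country": "US", "bytes": 300},
    {"user": "c", "country": "DE", "bytes": 80},
]
columns = to_columns(rows)

# An aggregate over one field reads only that column, not whole records.
total_bytes = sum(columns["bytes"])
```

Reading a single column touches far less data than scanning full rows, and splitting the column across leaves lets the partial sums run in parallel.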


Amazon Elastic MapReduce (EMR)
    A hosted Hadoop framework
    Provides a web service to process huge amounts of data
    Contains a MapReduce framework
        Subdivides the data into smaller chunks and processes them in
         parallel (the “map” function)
        Recombines the results into the final solution (the “reduce” function)
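
The map and reduce steps can be sketched with a word count, the customary first example. This version runs sequentially in one process and only mimics the programming model; it does not use Hadoop or the EMR service, and the function names are illustrative.

```python
from collections import defaultdict


def map_phase(chunk):
    """Map: emit a (word, 1) pair for every word in one chunk of input."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(pairs):
    """Group the emitted values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: combine the values of each key into a final count."""
    return {key: sum(values) for key, values in groups.items()}


chunks = ["big data in the cloud", "data in the cloud database"]
mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
counts = reduce_phase(shuffle(mapped))
```

In a real cluster each chunk would be mapped on a different machine and the grouped keys would be reduced in parallel as well.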




Google BigQuery vs. Amazon EMR
Head to Head

Google BigQuery | Amazon EMR
Interactive data analysis tool for large data sets | A programming framework for processing big data
Comparable to Hive but claims to be faster | Accessible by data analysis applications developed in Pig, Hive or other programming languages using Amazon’s SDK
Designed to run queries faster and to be user-friendly even for non-programmers, with a built-in GUI | Supports implementing complex data processing logic
Good for ad-hoc and trial-and-error interactive queries on large data sets for quick analysis and troubleshooting | Good for batch processing of large data sets involving time-consuming data conversion and aggregation
Provides a regular expression engine to structure unstructured data | Structuring data depends entirely on application logic
Supports neither large result sets nor joins of large tables | Supports both large result sets and table joins
Does not support updating existing data; only appending data is possible | Supports updating existing data

Conclusion
    End of RDBMS?
    Cloud databases for big data
        Finding relationships in data
        Solving the problems of scalability, elasticity and availability
    Further emerging issues
        Efficient multi-tenancy
        Data privacy




Thanks for your attention





Presentation on Databases in the Cloud

  • 1. Databases in the Cloud Seminar: Big Data Analytics Winter Semester 2012-13 Moshfiqur Rahman
  • 2. Agenda  Cloud Computing Introduction  Big Data  RDBMS and Cloud Databases  Scalability, Elasticity, Availability – New attributes for databases  In RDBMS  In Cloud Databases  Challenges in Cloud Databases  Big Data Analytics in Cloud  Conclusion 2
  • 3. Cloud Computing Introduction  What is in the Cloud?  Application as a service  Hardware and system software  Public and Private Cloud  Cloud Computing attributes  Virtually infinite computing resources  Start small and grow as needed  Pay-per-use scheme 3
  • 4. Cloud Computing Introduction Source: http://en.wikipedia.org/wiki/Cloud_computing 4
  • 5. Big Data  Large and complex data sets  Exponential growth  Structured, semi-structured, unstructured  Hard to process in traditional database system  Challenges with Big Data  Capture, scrutinization, storage, search, sharing, analysis…  Big Data sources  Mobile devices  Sensors  Software/server logs  Cameras and so on… 5
  • 6. Big Data Attributes  Volume  Factors contribute to the increase of data, for example, text streams from social networks  Hidden relationships in data  Data storage cost decreased but data analysis issues increased!  Variety  Data can be of all possible formats  Structured/semi-structured data from RDBMS  Unstructured data from documents, emails, video, audio, sensors  Velocity  Keep up the data processing speed with data production speed  Streams of real time data from sensors and social media  Reacting quickly to the increase of data velocity 6
  • 7. Relational Database  A relational database is  Collection of tables (entities)  Multiple columns  Multiple rows (tuples)  Accessed by SQL  Join multiple tables to get related data  Normalization is used to minimize redundancy and dependency  Referential Integrity is used to ensure data consistency  Managed by Relational Database Management System (RDBMS)  Oracle, MS SQL Server, MySQL, etc. 7
  • 8. RDBMS - A misfit for cloud?  RDBMS has  Simplicity  Robustness  Flexibility  Performance  Compatibility  (Limited) Scalability  Cloud databases require  Scalability  Elasticity  Availability 8
  • 9. Cloud Databases  Key/value store – a new kind of database management system  Store data as key/value pair  Targeted for specialized applications where a RDBMS is not suitable  Also known as  Document-oriented database  Internet-facing database  Attribute-oriented database  Distributed database, etc. 9
  • 10. Cloud Databases - Advantages  Stores data in format of items  Customer items, Order items in an e-commerce system  A single item contains all the relevant data  Relationships are not deprecated, just simplified  Order items contains the keys of associated Customer item and Product items  Able to scale easily and dynamically  Allows the user to pay only for used resources  Allows the vendor to scale their infrastructure depending on their entire platform size 10
  • 11. Cloud Databases - Advantages  Reduce the development time  By decreasing developing time with object relational data mapping  Easier to map application object to key/value database items 11
  • 12. Cloud Databases - Disadvantages  Relationships are not defined in data models  DBMS cannot enforce data integrity  Deleting item from a set of related items will make data inconsistent  No shared standard  Totally different set of APIs  Application developed for one cloud vendor is hard to port to another cloud vendor 12
  • 13. Scalability  Desired property of a system to accommodate growing amounts of work  By adding more hardware in single machine  By adding more machines (a.k.a. node)  Two ways to scale a system  Vertically or Scale Up  New hardware is added to a single node in a system  Adding more processors or memory to a single machine  Horizontally or Scale Out  Add more nodes to a system  Scaling out from one web-server system to a three web-server system 13
  • 14. Elasticity  Ability to spread the workloads dynamically over the available resources  Automatically adds more resources when workload increases  Automatically shrinks back and removes the unneeded resources when workload decreases  Very important for cloud environment  Pay-per-scheme 14
  • 15. Availability  Allows the user read and write data at any time without blocking them  Response time is virtually constant and does not depend on  Number of concurrent users  Database size  Any other system parameter  Automatic data backups and failover management 15
  • 16. Scalability in RDBMS  RDBMS provides limited scalability  Scale up on a single node  Scale out with relatively small numbers of nodes  Scale up is not infinite but increase in workload can be virtually infinite  Scale out is overwhelming in system with hundreds or thousands of nodes 16
  • 17. Elasticity in RDBMS  RDBMS allows very limited elasticity at storage and web/application server layers  Add a web server when the workload increases and adjust the throughput to dissipate the loads to the new server  When workload decreases, detach the server from the system, use it for different purposes  At storage layers, more disks can be added  Adding a bigger machine and replace the overloaded database server  Expensive investment  Unnecessary investment for a seasonal hype 17
  • 18. Availability in RDBMS  Employs storage redundancy by performing data replication  Also ensure improved performance for concurrent users  Provides resiliency in case of a failure  Data replication is not so easy process  Synchronization  Replicate the whole database to make synchronization easier 18
  • 19. Scalability, Elasticity and Availability in Cloud Databases  New breed of databases focusing on scalability, elasticity and availability  Key/value store supports nearly limitless scalability  In the expense of other benefits come with RDBMS  Data accessed by a single key  Provides the basis for scalability  Data item is contained in a single object and handled by a single node  Some modern applications need multiple key/value pair access atomically  Online multi-player games, Google Drive  Hence required multi-key atomicity 19
  • 20. Scalability, Elasticity and Availability in Cloud Databases  Different database implementations  Google’s MegaStore  G-Store  Relational Cloud  ElasTras 20
  • 21. MegaStore  Uses Bigtable as the underlying system  Provides multi-key atomicity  Data Fusion  Group multiple key/value pair as single collection  Write/ahead logging  Two-phase commit to support ACID transactions on a collection 21
  • 22. MegaStore  Advantages  Allows entities to be arbitrarily distributed over multiple nodes  Better performance when entity group co-located in a single node  Disadvantages  Exhibits performance issues when entity group is distributed across multiple nodes 22
  • 23. G-Store  Provides transactional multi-key access over dynamic, non-overlapping groups of keys  Created groups are transient in nature  Creates abstract group for on-demand transaction access  Leader key, follower keys  Ownership of read/write access transfers to the node hosting the key group  No key should not be claimed by multiple group, no key should be without a owner 23
  • 24. G-Store  Advantages  Transactions are efficient for key group resides on single node  Disadvantages  A group must be small enough to reside on a single node 24
  • 25. Relational Cloud  Works on Elasticity extensively  Uses a graph-based partitioning method to split large databases across multiple machines  Workload aware partitioning strategy  Frontend transaction trace component  keeps track of transactions  Analyze the transactions to determine the set of tuples accessed together  Creates a graph of transactions  Weight is given to the edges to denote how often a transactions are executed 25
  • 27. Relational Cloud  Advantages  Uses the MySQL, Postgre-SQL as backend databases  Migrate the database partitions without causing downtime  Replicate the data for availability  Disadvantages  Scaling the graph representation is difficult as it leads to a graph with N nodes and up to N2 edges for an N-tuple database 27
  • 28. ElasTras  A cloud database under research providing better scalability and elasticity with transactional data access Figure: System overview of ElasTras 28
  • 29. ElasTras  Two level Transaction Manager (TM)  Higher level TM (HTM)  Owning TM (OTM)  When any transaction request arrives  Load balancer uses some load balancing policy and forward the request to appropriate HTM  HTM decides whether to execute the transaction locally or forward to OTM  OTM has exclusive access rights to the data accessed by a single transaction  System state information and database metadata managed by Metadata Manager 29
  • 30. ElasTras  Two approaches to partitioning the database  Static Partitioning  Dynamic Partitioning  Static Partitioning  The database designer defines the partitioning  ElasTras is responsible for mapping the partitions to their specific OTMs  It also reassigns partitions if the workload increases  The application has knowledge of the partitions  ElasTras can provide ACID transactional guarantees because transactions execute locally within a partition
  • 31. ElasTras  Dynamic Partitioning  Basis for the elasticity of the data store  Uses a range- or hash-based partitioning scheme  Applications are not aware of the partitions  Transactions are not guaranteed to be limited to a single partition  Provides mini transactions with restricted transactional semantics to ensure scalability and avoid distributed transactions  Mini transactions ensure recovery but not global synchronization
  • 32. ElasTras  Advantages  Provides transactional guarantees in a scalable manner  Reassigning partitions to OTMs as the workload changes ensures elasticity and scalability  Provides ACID transactions when transactions are limited to a single partition  Disadvantages  In dynamic partitioning, ElasTras only supports mini transactions with restricted transactional semantics to avoid distributed transactions  Mini transactions only ensure recovery, not global synchronization
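The range- and hash-based schemes mentioned for dynamic partitioning can be sketched generically as below. This is not ElasTras code: the function names, the use of SHA-256 as a stable hash, and the boundary list are assumptions chosen for illustration.

```python
import bisect
import hashlib


def hash_partition(key, num_partitions):
    """Map a string key to a partition index using a stable hash."""
    # hashlib is used instead of the built-in hash(), which is randomized
    # per process for strings and so would not give a stable placement.
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions


def range_partition(key, upper_bounds):
    """Map a key to a partition index using sorted, exclusive upper bounds.

    Partition i holds keys below upper_bounds[i]; keys at or above the last
    bound fall into the final partition, len(upper_bounds).
    """
    return bisect.bisect_right(upper_bounds, key)


def partitions_touched(keys, locate):
    """Return the set of partitions a transaction would access."""
    return {locate(key) for key in keys}


bounds = ["h", "p"]  # hypothetical split points for three partitions
touched = partitions_touched(["apple", "banana"], lambda k: range_partition(k, bounds))
```

If `partitions_touched` returns a single partition, the transaction can run locally with full ACID guarantees; if it returns several, the slides indicate that only a mini transaction with restricted semantics is available.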
  • 33. Challenges in Cloud Databases  Importing data  Data transport is complex and may incur high costs  Auto failover management  Server crashes, hardware malfunction  The database must be replicated so that a replica automatically takes over and keeps working if a failure occurs  Auto scalability and elasticity management  Scale both throughput and size instantly and automatically  Fine-grained growth and shrinkage of resources
  • 34. Big Data Analytics  Big Data Analytics  The process of analyzing huge amounts of structured, semi-structured and unstructured data of various types  Discovers hidden patterns and unknown correlations in data  Companies are interested in big data analytics to gain competitive advantages over rival companies  Through effective marketing  By proposing new, innovative services
  • 35. Big Data Analytics  Big data analytics helps companies make better business decisions  Traditional analytics software is available for data analysis  Advanced technologies such as predictive analysis, data mining, etc.  But traditional analytics software  is not suitable for big data with semi-structured and/or unstructured data  cannot supply the processing power needed to analyze such big data  A new class of big data analytics environments has emerged  NoSQL databases  Hadoop  MapReduce
  • 36. Big Data Analytics in Cloud  Database-as-a-service offerings available in the cloud  Amazon SimpleDB  Google AppEngine  Microsoft SQL Azure  and so on…  Limitations in the cloud  Limits on query execution time, for example, Amazon SimpleDB restricts any query that takes more than 5 seconds  Limits on result dataset size, for example, Google AppEngine does not allow users to retrieve more than 1000 items for any query  Impractical for big data analytics
  • 37. Big Data Analytics in Cloud  Specialized solutions for big data analytics in the cloud  Google BigQuery  Amazon Elastic MapReduce (EMR)
  • 38. Google BigQuery  Cloud-based interactive query service for big data  Implementation of Dremel, a parallel query engine  Queries execute on a small number of very large append-only tables  Two core technologies  Columnar storage  Records are separated into column values  The values of each column are stored on a different storage volume, forming a tree  Tree architecture  Queries are pushed down the branches of the tree  Results are aggregated from the leaves
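The two ideas on this slide can be illustrated in a few lines of plain Python. This is a conceptual sketch only, not Dremel's implementation or any BigQuery API: `to_columns`, `tree_sum` and the sample records are hypothetical, and every record is assumed to have the same fields.

```python
def to_columns(records):
    """Turn a list of row dictionaries into one list of values per column."""
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


def tree_sum(shards, fanout=2):
    """Aggregate a sum bottom-up through a tree with the given fanout.

    Each shard plays the role of a leaf that computes a partial sum; every
    level above combines up to `fanout` partial results until one remains.
    """
    partials = [sum(shard) for shard in shards]
    while len(partials) > 1:
        partials = [
            sum(partials[i:i + fanout]) for i in range(0, len(partials), fanout)
        ]
    return partials[0] if partials else 0


records = [
    {"user": "a", "bytes": 10},
    {"user": "b", "bytes": 20},
    {"user": "c", "bytes": 30},
]
columns = to_columns(records)
# A query such as SUM(bytes) only needs to read the "bytes" column.
total = tree_sum([columns["bytes"][:2], columns["bytes"][2:]])
```

Reading one column instead of whole records is what keeps scans over very large tables cheap, and the tree lets each level work on small partial results instead of raw data.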
  • 39. Amazon Elastic MapReduce (EMR)  A hosted Hadoop framework  Provides a web service to process huge amounts of data  Contains a MapReduce framework  Subdivides the data into smaller chunks and processes them in parallel (the “map” function)  Recombines them into the final solution (the “reduce” function)
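The map and reduce steps can be sketched with the classic word count, written here in plain Python rather than against Hadoop or the EMR service. The function names are illustrative, and the chunks are processed sequentially for simplicity; in a real framework each chunk would be handled by a separate worker.

```python
from collections import defaultdict


def map_chunk(chunk):
    """Map step: emit a (word, 1) pair for every word in one chunk of text."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(pairs):
    """Group the emitted values by key so each key can be reduced on its own."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_counts(grouped):
    """Reduce step: combine the values of each key into a final count."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(chunks):
    # Each chunk is mapped independently, which is what allows parallelism.
    mapped = [pair for chunk in chunks for pair in map_chunk(chunk)]
    return reduce_counts(shuffle(mapped))


counts = word_count(["big data", "Big cloud"])
```

Because `map_chunk` never looks at other chunks, the map phase scales out by adding workers; only the shuffle step needs to move data between them.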
  • 40. Google BigQuery vs. Amazon EMR: Head to Head
      Google BigQuery | Amazon EMR
      Interactive data analysis tool for large datasets | A programming framework to process big data
      Comparable to Hive, but claimed to be faster | Accessible from data analysis applications developed in Pig, Hive or other programming languages using Amazon’s SDK
      Designed to run queries faster and to be user-friendly even for non-programmers, with a built-in GUI | Supports implementing complex data processing logic
      Good for ad-hoc and trial-and-error interactive queries on large datasets for quick analysis and troubleshooting | Good for batch processing of large datasets and for time-consuming data conversion and aggregation
      Provides a regular expression engine to structure unstructured data | Structuring of data depends entirely on application logic
      Supports neither large result sets nor joins of large tables | Supports both large result sets and joins of tables
      Does not support updating existing data; only appending data is possible | Supports updating existing data
  • 41. Conclusion  End of RDBMS?  Cloud databases for big data  Finding relationships in data  Solving the problems of scalability, elasticity and availability  Further emerging issues  Efficient multi-tenancy  Data privacy
  • 42. Thanks for your attention