Optimizing Right tool for Data
Pipeline
Parimala Killada
Agenda
• Motivation for Data Engineering
• Data Engineering Ecosystem
• Data Stores
• Amazon Redshift & Example project
• Redshift vs NoSQL databases
• Conclusion
Motivation for Data Engineering
• Able to get meaningful insights from data
• Acts as chief problem solver
• Able to build data infrastructures
• Get access to new technologies
Data Engineering Ecosystem
Data Stores
• Provides easy access to end user or data scientists
• Stores data in organized forms
• Varieties of data stores:
• Relational Databases
• Key-value data stores
• Document stores
• Wide-column stores
• Graph stores
• Multi model data stores
Pipeline
MedicostGuider
Amazon Redshift
Redshift
• Data stored in columnar format
• Distribution keys: Determine how rows are distributed among nodes
• Sort keys: Determine the order in which rows are stored on each node
• Compression encodings: Reduce storage space
• Shared-nothing architecture: Each node runs queries against its own set of data
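The bullets above can be illustrated with a small simulation. This is only a sketch of the idea, not Redshift's actual implementation: the hash function (`zlib.crc32`), the column names, and the sample rows are all assumptions made for illustration.

```python
import zlib
from collections import defaultdict


def distribute(rows, dist_key, sort_key, num_nodes):
    """Place each row on a node by hashing its distribution key,
    then keep every node's rows ordered by the sort key."""
    nodes = defaultdict(list)
    for row in rows:
        # crc32 is a stand-in hash chosen because it is deterministic;
        # it is not the hash Redshift uses.
        node_id = zlib.crc32(str(row[dist_key]).encode()) % num_nodes
        nodes[node_id].append(row)
    for node_rows in nodes.values():
        node_rows.sort(key=lambda r: r[sort_key])
    return dict(nodes)


def total_cost(nodes):
    """Shared-nothing aggregation: each node sums its own slice,
    then the partial sums are combined."""
    partial_sums = [sum(r["cost"] for r in node_rows) for node_rows in nodes.values()]
    return sum(partial_sums)


# Hypothetical sample data.
rows = [
    {"provider_id": 101, "visit_date": "2016-03-02", "cost": 250},
    {"provider_id": 102, "visit_date": "2016-01-15", "cost": 400},
    {"provider_id": 101, "visit_date": "2016-01-20", "cost": 125},
    {"provider_id": 103, "visit_date": "2016-02-11", "cost": 300},
]

nodes = distribute(rows, dist_key="provider_id", sort_key="visit_date", num_nodes=2)
print(total_cost(nodes))
```

Rows that share a distribution key value always land on the same node, which is why a well-chosen distribution key lets joins and aggregations on that key run without moving data between nodes.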
Sample Database Creation
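As a rough illustration of what a table definition with a distribution key, sort key, and per-column compression encodings can look like, the sketch below assembles a Redshift-style `CREATE TABLE` statement as a string. The table name `medical_costs`, its columns, and the chosen encodings are hypothetical and are not taken from the original project.

```python
def build_create_table(table, columns, dist_key, sort_keys):
    """Assemble a Redshift-style CREATE TABLE statement.

    columns is a list of (name, sql_type, encoding) tuples.
    """
    col_lines = [
        f"    {name} {sql_type} ENCODE {encoding}"
        for name, sql_type, encoding in columns
    ]
    return (
        f"CREATE TABLE {table} (\n"
        + ",\n".join(col_lines)
        + "\n)\n"
        + "DISTSTYLE KEY\n"
        + f"DISTKEY ({dist_key})\n"
        + f"SORTKEY ({', '.join(sort_keys)});"
    )


# Hypothetical schema, for illustration only.
ddl = build_create_table(
    table="medical_costs",
    columns=[
        ("provider_id", "INTEGER", "az64"),
        ("visit_date", "DATE", "raw"),
        ("procedure_name", "VARCHAR(64)", "zstd"),
        ("cost", "DECIMAL(10,2)", "az64"),
    ],
    dist_key="provider_id",
    sort_keys=["visit_date"],
)
print(ddl)
```

The statement would then be run through any SQL client connected to the cluster; the right keys and encodings depend on the actual query patterns.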
Attribute | Cassandra | HBase | Redshift
Description | Wide-column store based on ideas of Bigtable and Dynamo | Wide-column store based on Apache Hadoop and on concepts of Bigtable | Large-scale data warehouse service for use with business intelligence tools
Database model | Wide-column store | Wide-column store | Relational DBMS
License | Open source | Open source | Commercial
Cloud-based | no | no | yes
Implementation language | Java | Java | C
Server operating systems | BSD, Linux, OS X, Windows | Linux, Unix, Windows | hosted
Attribute | Cassandra | HBase | Redshift
Data scheme | schema-free | schema-free | yes
Typing | yes | no | yes
XML support | no | no | (not given)
Secondary indexes | restricted | no | restricted
SQL | no | no | yes
Server-side scripts (stored procedures) | no | yes (coprocessors in Java) | user-defined functions
APIs and other access methods | Proprietary protocol (CQL) | Java API, RESTful HTTP API, Thrift | JDBC, ODBC
Triggers | yes | yes | no
Attribute | Cassandra | HBase | Redshift
Partitioning methods | Sharding | Sharding | Sharding
Replication methods | selectable replication factor | selectable replication factor | yes
MapReduce functions support | yes | yes | no
Consistency concepts | Eventual consistency or immediate consistency (tunable) | Immediate consistency | Immediate consistency
Foreign keys | no | no | yes (declared but not enforced)
Transaction concepts | no | no | ACID
Concurrency | yes | yes | yes
Durability | yes | yes | yes
In-memory capabilities | no | yes | (not given)
User concepts (access control) | Access rights for users can be defined per object | Access Control Lists (ACL); implementation based on Hadoop and ZooKeeper | Fine-grained access rights according to the SQL standard
What's the difference between Cassandra and HBase?
• HBase:
• Key characteristics:
· Distributed and scalable big data store
· Strong consistency
· Built on top of Hadoop HDFS
· CP in CAP (Consistency, Availability, Partition tolerance)
• Good for:
· Read-optimized workloads
· Range-based scans
· Strict consistency
· Fast reads and writes with scalability
• Not good for:
· Classic transactional applications or relational analytics
· Applications that need full table scans
· Data that must be aggregated, rolled up, or analyzed across rows
• Use case: Facebook Messages
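HBase keeps rows ordered by row key, which is what makes range-based scans cheap. The toy store below sketches that idea in memory; the class name `SortedRowStore` and the `user#date` key layout are assumptions for illustration and are not HBase's API.

```python
import bisect


class SortedRowStore:
    """Toy in-memory store that keeps row keys sorted, like an HBase table."""

    def __init__(self):
        self._keys = []   # row keys in sorted order
        self._rows = {}   # row key -> value

    def put(self, row_key, value):
        if row_key not in self._rows:
            bisect.insort(self._keys, row_key)
        self._rows[row_key] = value

    def scan(self, start_key, stop_key):
        """Return rows with start_key <= key < stop_key, in key order."""
        lo = bisect.bisect_left(self._keys, start_key)
        hi = bisect.bisect_left(self._keys, stop_key)
        return [(key, self._rows[key]) for key in self._keys[lo:hi]]


store = SortedRowStore()
store.put("user42#2016-03-05", "second message")
store.put("user07#2016-01-01", "other user")
store.put("user42#2016-03-01", "first message")
store.put("user43#2016-03-02", "another user")

# '$' is the character right after '#', so this range covers every key
# that starts with "user42#".
messages = store.scan("user42#", "user42$")
print(messages)
```

Because the lookup is two binary searches plus a contiguous slice, reading one user's messages never touches other users' rows. A query that needs every row, by contrast, gets no help from the ordering, which matches the "not good for full table scans" point.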
• Cassandra:
• Key characteristics:
· High availability
· Incremental scalability
· Eventually consistent
· Trade-offs between consistency and latency
· Minimal administration
· No SPOF (single point of failure): all nodes are the same in Cassandra
· AP in CAP
• Good for:
· Simple setup and maintenance
· Fast random reads and writes
· Flexible parsing/wide-column requirements
· Workloads that do not need multiple secondary indexes
• Not good for:
· Secondary indexes
· Relational data
· Transactional operations (rollback, commit)
· Primary and financial records
· Data with stringent security and authorization requirements
· Dynamic queries or searches on column data
· Low latency
• Use case: Twitter, travel portals
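Cassandra's trade-off between consistency and latency is usually reasoned about with the replica-overlap rule: a read is guaranteed to see the latest acknowledged write when the number of replicas written plus the number read exceeds the replication factor. A minimal sketch of that arithmetic follows; the function names are illustrative, not part of any Cassandra driver.

```python
def quorum(replication_factor):
    """Smallest majority of replicas."""
    return replication_factor // 2 + 1


def reads_see_latest_write(replication_factor, write_replicas, read_replicas):
    """True when every read set must overlap every write set (R + W > N)."""
    return write_replicas + read_replicas > replication_factor


rf = 3
print(quorum(rf))                                         # 2
print(reads_see_latest_write(rf, quorum(rf), quorum(rf)))  # True
print(reads_see_latest_write(rf, 1, 1))                    # False
```

Writing and reading at quorum gives the overlap at the price of waiting for more replicas, while reading and writing a single replica is faster but only eventually consistent.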
Cassandra/HBase vs Redshift
Redshift pros:
• Columnar storage. Redshift stores data in a columnar format, which makes
operations on a single column very fast. Aggregates such as MIN, MAX, SUM
and AVG can run over billions of rows in seconds.
• Sorted table format. Tables in Redshift are sorted according to the sort key
declared in the CREATE TABLE statement. Sorted data allows very dense
compression and fast retrieval.
• SQL-92 compliant. Redshift is based on PostgreSQL, so most PostgreSQL tools
and drivers work out of the box. This allows complex BI solutions to be
integrated with little technical overhead.
• Easy administration. Redshift follows the Amazon approach of making things
simple to administer. Within a few minutes you can have a 100-node cluster
running that is fully monitored and has backups and point-in-time recovery,
all at the click of a few buttons. It is also simple to scale up or down as
required, and there is extensive instrumentation for every part of the
cluster, including all queries ever run.
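The columnar storage and sorted table points for Redshift can be sketched in a few lines. The example below is only an illustration of the general technique: the sample rows are made up, and run-length encoding stands in for whichever compression encoding a real table would use.

```python
from itertools import groupby

# Hypothetical rows, already sorted by state.
rows = [("CA", 120), ("CA", 80), ("NY", 200), ("NY", 50), ("NY", 75), ("TX", 90)]


def to_columns(rows, names):
    """Turn a list of row tuples into one list per column."""
    return {name: [row[i] for row in rows] for i, name in enumerate(names)}


def run_length_encode(values):
    """Collapse runs of equal values into (value, run_length) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]


columns = to_columns(rows, ["state", "cost"])

# An aggregate over one column reads only that column's values.
total = sum(columns["cost"])

# Sorted data produces long runs, so the column compresses well.
encoded_states = run_length_encode(columns["state"])

print(total)           # 615
print(encoded_states)  # [('CA', 2), ('NY', 3), ('TX', 1)]
```

In a row-oriented layout the same SUM would have to read every field of every row, and an unsorted state column would yield many short runs instead of three.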
Cassandra/HBase vs Redshift
Redshift cons:
• DML operations are costly and slow. Because of the sorted columnar format,
single-row DML operations such as INSERT, UPDATE and DELETE are very
expensive. Data therefore needs to be loaded in large batches, typically via
an ETL/ELT process.
• One region, one availability zone. All nodes are in a single EC2 Region and
Availability Zone. For multi-DC setups and full AZ fault tolerance, one or
more additional clusters need to be administered and kept in sync.
• Complex query optimization. Most queries in Redshift are very fast, but when
processing billions of rows every optimization can have a big impact, so
queries and table design need careful tuning.
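Because single-row DML is expensive in Redshift, loaders typically group records into large batches before sending them to the warehouse. The sketch below shows only the batching step, turning records into CSV chunks; the record layout and batch size are assumptions, and staging the chunks (for example in S3) and loading them with a bulk command such as COPY is left out.

```python
import csv
import io
from itertools import islice


def batched(records, batch_size):
    """Yield lists of at most batch_size records."""
    iterator = iter(records)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def to_csv_chunk(batch):
    """Serialize one batch of record tuples as CSV text."""
    buffer = io.StringIO()
    csv.writer(buffer).writerows(batch)
    return buffer.getvalue()


# Hypothetical records: (id, provider, cost).
records = [(i, f"provider_{i}", i * 10) for i in range(10)]
chunks = [to_csv_chunk(batch) for batch in batched(records, batch_size=4)]

print(len(chunks))  # 3
```

In practice the batch size would be far larger than four rows; the point is that the warehouse sees a handful of bulk loads rather than one statement per record.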
Conclusions
• Data stores are chosen based on:
• CAP theorem trade-offs
• The organization's existing architecture
• Price and security
• Choosing the right tool for a project remains challenging and depends
on the use case.
About me
• BS and MS in Computer Science.
• Interested in #Data, #Distributed Systems, #Machine Learning
• Loves to cook, paint and travel.
