Data engineering

Optimizing Right Tool for Data Pipeline
Parimala Killada

Agenda
• Motivation for Data Engineering
• Data Engineering Ecosystem
• Data Stores
• Amazon Redshift & Example project
• Redshift vs NoSQL databases
• Conclusion

Motivation for Data Engineering
• Able to get meaningful insights from data
• Acts as chief problem solver
• Able to build data infrastructures
• Get access to new technologies

Data Stores
• Provide easy access for end users and data scientists
• Store data in organized forms
• Varieties of data stores:
· Relational databases
· Key-value data stores
· Document stores
· Wide-column stores
· Graph stores
· Multi-model data stores
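
As a rough illustration of how these models differ, the sketch below expresses one hypothetical user record in several of the shapes listed above, using plain Python structures. The record, the field names, and the helper function are invented for illustration and do not correspond to any particular product.

```python
import json

# One hypothetical record expressed in several data-store shapes.
# All names and values here are illustrative assumptions.

# Relational: fixed columns, one tuple per row.
relational_columns = ("user_id", "name", "city")
relational_row = (42, "Ada", "London")

# Key-value: an opaque value looked up by a single key.
key_value = {"user:42": '{"name": "Ada", "city": "London"}'}

# Document: a nested, self-describing structure.
document = {"_id": 42, "name": "Ada", "address": {"city": "London"}}

# Wide-column: row key -> column family -> column -> value.
wide_column = {"42": {"profile": {"name": "Ada", "city": "London"}}}

# Graph: nodes plus labeled edges between them.
graph_nodes = {42: {"name": "Ada"}, "London": {"type": "city"}}
graph_edges = [(42, "LIVES_IN", "London")]


def city_from_each_model():
    """Read the same fact (the user's city) out of every representation."""
    return {
        "relational": relational_row[relational_columns.index("city")],
        "key_value": json.loads(key_value["user:42"])["city"],
        "document": document["address"]["city"],
        "wide_column": wide_column["42"]["profile"]["city"],
        "graph": next(dst for src, label, dst in graph_edges
                      if src == 42 and label == "LIVES_IN"),
    }
```

The same fact is reachable in every model; what changes is the access path, which is one reason the choice of store depends on the queries the project needs.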

Redshift
• Data is stored in a columnar format
• Distribution keys: determine how rows are distributed among nodes
• Sort keys: determine the order in which rows are stored on each node
• Compression encodings: reduce storage space
• Shared-nothing architecture: each node runs queries against its own set of data
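
The interplay of distribution keys, sort keys, and the shared-nothing model can be sketched in a few lines of Python. This is a toy simulation under stated assumptions, not how Redshift is implemented: the `distribute` and `total_per_node` helpers, the sample rows, and the use of `zlib.crc32` as the hash are all hypothetical.

```python
import zlib
from collections import defaultdict


def distribute(rows, dist_key, sort_key, num_nodes):
    """Assign each row to a node by hashing dist_key, then sort each node's rows."""
    nodes = defaultdict(list)
    for row in rows:
        # crc32 is used only because it is deterministic across runs;
        # it is an assumption, not the hash function Redshift uses.
        node_id = zlib.crc32(str(row[dist_key]).encode()) % num_nodes
        nodes[node_id].append(row)
    for node_rows in nodes.values():
        node_rows.sort(key=lambda r: r[sort_key])
    return dict(nodes)


def total_per_node(nodes, column):
    """Shared-nothing style: each node aggregates only its own rows."""
    return {node_id: sum(r[column] for r in node_rows)
            for node_id, node_rows in nodes.items()}


# Hypothetical sample data.
rows = [
    {"customer_id": 1, "order_date": "2024-03-02", "amount": 30},
    {"customer_id": 2, "order_date": "2024-01-15", "amount": 20},
    {"customer_id": 1, "order_date": "2024-01-05", "amount": 10},
    {"customer_id": 3, "order_date": "2024-02-20", "amount": 40},
]

nodes = distribute(rows, "customer_id", "order_date", num_nodes=2)
partials = total_per_node(nodes, "amount")   # one partial sum per node
grand_total = sum(partials.values())         # combined at the end
```

Rows with the same distribution key value always land on the same node, so per-key work stays local, and the final result is assembled from per-node partial results.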

Property | Cassandra | HBase | Redshift
Description | Wide-column store based on ideas of Bigtable and DynamoDB | Wide-column store based on Apache Hadoop and on concepts of Bigtable | Large-scale data warehouse service for use with business intelligence tools
Database model | Wide-column store | Wide-column store | Relational DBMS
License | Open source | Open source | Commercial
Cloud-based | no | no | yes
Implementation language | Java | Java | C
Server operating systems | BSD, Linux, OS X, Windows | Linux, Unix, Windows | hosted
Data scheme | schema-free | schema-free | yes
Typing | yes | no | yes
XML support | no | no | (not given)
Secondary indexes | restricted | no | restricted
SQL | no | no | yes
Server-side scripts (stored procedures) | no | yes (coprocessors in Java) | user-defined functions
APIs and other access methods | Proprietary protocol like CQL | Java API, RESTful HTTP API, Thrift | JDBC, ODBC
Triggers | yes | yes | no
Partitioning methods | Sharding | Sharding | Sharding
Replication methods | selectable replication factor | selectable replication factor | yes
MapReduce functions support | yes | yes | no
Consistency concepts | Eventual Consistency, Immediate Consistency | Eventual Consistency | Immediate Consistency
Foreign keys | no | no | yes
Transaction concepts | no | no | ACID
Concurrency | yes | yes | yes
Durability | yes | yes | yes
In-memory capabilities | no | yes | (not given)
User concepts (access control) | Access rights for users can be defined per object | Access Control Lists (ACL), implementation based on Hadoop and ZooKeeper | Fine-grained access rights according to the SQL standard

What's the difference between Cassandra and HBase?
• HBase:
• Key characteristics:
· Distributed and scalable big data store
· Strong consistency
· Built on top of Hadoop HDFS
· CP in CAP (Consistency, Availability, Partition tolerance)
• Good for:
· Read-optimized workloads
· Range-based scans
· Strict consistency
· Fast reads and writes with scalability
• Not good for:
· Classic transactional applications or relational analytics
· Applications that need full table scans
· Data that must be aggregated, rolled up, or analyzed across rows
• Use case: Facebook Messages
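
HBase keeps rows ordered by row key, which is why range-based scans suit it. The sketch below imitates that idea with a sorted list and binary search. It is a toy stand-in, not the HBase client API: `SortedRowStore`, its methods, and the row-key layout are hypothetical.

```python
import bisect


class SortedRowStore:
    """Toy store that keeps row keys sorted, as a stand-in for an HBase table."""

    def __init__(self):
        self._keys = []   # row keys, always kept in sorted order
        self._rows = {}   # row key -> value

    def put(self, row_key, value):
        if row_key not in self._rows:
            bisect.insort(self._keys, row_key)
        self._rows[row_key] = value

    def scan(self, start_key, stop_key):
        """Yield (key, value) for start_key <= key < stop_key, in key order."""
        lo = bisect.bisect_left(self._keys, start_key)
        hi = bisect.bisect_left(self._keys, stop_key)
        for key in self._keys[lo:hi]:
            yield key, self._rows[key]


# Hypothetical row keys of the form "<user>#<sequence>".
store = SortedRowStore()
store.put("user2#001", "hi")
store.put("user1#002", "second message")
store.put("user1#001", "first message")

# All rows for user1: start key inclusive, stop key exclusive.
user1_rows = list(store.scan("user1#", "user2"))
```

Because the keys are sorted, the scan touches only the contiguous slice it needs instead of the whole table, which is the property that makes row-key design important for this kind of store.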

• Cassandra
• Key characteristics:
· High availability
· Incremental scalability
· Eventually consistent
· Trade-offs between consistency and latency
· Minimal administration
· No SPOF (single point of failure): all nodes are the same in Cassandra
· AP in CAP
• Good for:
· Simple setup and maintenance
· Fast random reads and writes
· Flexible parsing and wide-column requirements
· Workloads that do not need multiple secondary indexes
• Not good for:
· Secondary indexes
· Relational data
· Transactional operations (rollback, commit)
· Primary and financial records
· Data that needs stringent security and authorization
· Dynamic queries or searches on column data
· Low-latency requirements
• Use cases: Twitter, travel portals
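
The trade-off between consistency and latency is commonly described with three numbers: the replication factor N, the number of replicas W that must acknowledge a write, and the number of replicas R consulted on a read. When R + W > N, every read set overlaps every acknowledged write set. The sketch below only encodes that arithmetic; the function names are hypothetical and nothing here calls Cassandra.

```python
def quorum(replication_factor):
    """Smallest majority of replicas, e.g. 2 of 3 or 3 of 5."""
    return replication_factor // 2 + 1


def replicas_overlap(replication_factor, write_acks, read_acks):
    """Return True when every read must see at least one up-to-date replica (R + W > N)."""
    for acks in (write_acks, read_acks):
        if not 1 <= acks <= replication_factor:
            raise ValueError("acks must be between 1 and the replication factor")
    return read_acks + write_acks > replication_factor


n = 3
fast_but_eventual = replicas_overlap(n, write_acks=1, read_acks=1)
quorum_both_ways = replicas_overlap(n, write_acks=quorum(n), read_acks=quorum(n))
```

Waiting for fewer replicas lowers latency but allows stale reads; waiting for a quorum on both reads and writes costs latency but guarantees overlap.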

Cassandra/HBase vs Redshift
Redshift pros:
• Columnar storage. Redshift stores data in a columnar format, so operations on a single column are extremely fast. Aggregates such as MIN, MAX, SUM, and AVG can be computed over billions of rows in seconds.
• Sorted table format. Tables in Redshift are sorted according to the sort key defined in the CREATE TABLE statement. Sorted data allows very dense compression and fast retrieval.
• SQL-92 compliant. Redshift is based on PostgreSQL, so most PostgreSQL tools and drivers work out of the box. This allows complex BI solutions to be integrated with little technical overhead.
• Easy administration. Redshift stays true to the Amazon approach of making things simple to administer. Within a few minutes you can have a 100-node cluster running that is fully monitored and has backups and point-in-time recovery, all at the click of a few buttons. It is also simple to scale up or down as required, and there is extensive instrumentation for every part of the cluster, including a record of the queries that have been run.
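
Two of these points, fast single-column aggregates and denser compression on sorted data, can be illustrated with a small Python sketch. It is a toy model under stated assumptions: the sample rows and the `to_columns` and `run_length_encode` helpers are hypothetical, and run-length encoding stands in for compression in general.

```python
from itertools import groupby


def to_columns(rows):
    """Pivot row dicts into one list per column (assumes all rows share the same keys)."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


def run_length_encode(values):
    """Collapse consecutive repeats into (value, count) pairs."""
    return [(value, sum(1 for _ in group)) for value, group in groupby(values)]


# Hypothetical sample data.
rows = [
    {"region": "east", "sales": 10},
    {"region": "west", "sales": 5},
    {"region": "east", "sales": 7},
    {"region": "west", "sales": 3},
]

columns = to_columns(rows)
total_sales = sum(columns["sales"])   # reads only the "sales" column

unsorted_runs = run_length_encode(columns["region"])
sorted_columns = to_columns(sorted(rows, key=lambda r: r["region"]))
sorted_runs = run_length_encode(sorted_columns["region"])
```

The aggregate never touches the `region` values, and sorting by `region` shrinks the encoded column from four runs to two, which is the effect the sorted table format relies on.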

Redshift cons:
• DML operations are costly and slow. Because of the sorted columnar format, single-row DML operations such as INSERT, UPDATE, and DELETE are very expensive. Data therefore needs to be loaded in large batches, typically through an ETL/ELT process.
• One region, one availability zone. All nodes of a cluster are in a single EC2 Region and Availability Zone. For multi-DC setups and full AZ fault tolerance, one or more additional clusters need to be administered and kept in sync.
• Complex query optimization. Most queries in Redshift are very fast, but when processing billions of rows, every optimization can have a big impact, so query tuning takes effort.
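
Since single-row DML is expensive, loaders usually buffer rows and write them in batches. The sketch below shows only the buffering pattern. `BatchLoader` and its `flush_fn` callback are hypothetical; in a real pipeline the callback would perform a bulk load (for Redshift, typically the COPY command), which is not shown here.

```python
class BatchLoader:
    """Buffer rows and hand them to flush_fn in batches instead of one at a time."""

    def __init__(self, flush_fn, batch_size):
        if batch_size < 1:
            raise ValueError("batch_size must be at least 1")
        self._flush_fn = flush_fn   # hypothetical bulk-load callback
        self._batch_size = batch_size
        self._buffer = []

    def add(self, row):
        self._buffer.append(row)
        if len(self._buffer) >= self._batch_size:
            self.flush()

    def flush(self):
        """Send any buffered rows; safe to call when the buffer is empty."""
        if self._buffer:
            self._flush_fn(list(self._buffer))
            self._buffer.clear()


# Usage: collect the batches in a list in place of a real bulk load.
loaded_batches = []
loader = BatchLoader(loaded_batches.append, batch_size=2)
for i in range(5):
    loader.add({"id": i})
loader.flush()   # push the final partial batch
```

Five rows with a batch size of two result in three load calls instead of five single-row writes; with realistic batch sizes the reduction is far larger.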

Conclusions
• Data stores are chosen based on:
· CAP theorem trade-offs
· The organization's existing architecture
· Price and security
• Choosing the right tool for the project remains challenging and depends on the use case.

About me
• BS and MS in Computer Science.
• Interested in #Data, #DistributedSystems, #MachineLearning
• Loves to cook, paint, and travel.
