Comparison - RDBMS vs Hadoop vs Apache Spark

Relational Database Management System
• An RDBMS, or relational database management system, is
software that allows users to update, query, and manage
relational databases. Structured Query Language (SQL) is the
most common language used to access a relational
database. The SQL standard has been extended (in SQL:2016)
to allow for the storage, retrieval, and publication of JSON
data within a relational database, providing greater flexibility.
• The most fundamental RDBMS functions are the create,
read, update, and delete operations, which are referred to
collectively as CRUD. They serve as the foundation for a
well-organized system that promotes consistent data treatment.
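
The four CRUD operations can be sketched with Python's built-in sqlite3 module, used here only as a small stand-in for a full RDBMS. The table name, column names, and sample values are made up for illustration, and the JSON document is simply stored as text rather than through any SQL/JSON feature.

```python
import json
import sqlite3

# In-memory database, so the sketch leaves nothing behind on disk.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT NOT NULL, profile TEXT)"
)

# Create: insert one row; the JSON document is serialized to text (an assumption
# of this sketch, not a statement about how any particular RDBMS stores JSON).
conn.execute(
    "INSERT INTO customers (id, name, profile) VALUES (?, ?, ?)",
    (1, "Ada", json.dumps({"city": "London"})),
)

# Read: fetch the row back and decode the JSON column.
row = conn.execute(
    "SELECT name, profile FROM customers WHERE id = ?", (1,)
).fetchone()
name_after_create = row[0]
city = json.loads(row[1])["city"]

# Update: change the name in place.
conn.execute("UPDATE customers SET name = ? WHERE id = ?", ("Ada Lovelace", 1))
name_after_update = conn.execute(
    "SELECT name FROM customers WHERE id = ?", (1,)
).fetchone()[0]

# Delete: remove the row and confirm the table is empty.
conn.execute("DELETE FROM customers WHERE id = ?", (1,))
remaining = conn.execute("SELECT COUNT(*) FROM customers").fetchone()[0]

conn.close()
```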

Hadoop
• Apache Hadoop is a set of open-source software utilities that
allows you to solve problems involving massive amounts of data
and computation by utilizing a network of many computers. It
provides a software framework for distributed big data storage
and processing based on the MapReduce programming model.
• The core of Apache Hadoop is made up of a storage component
known as Hadoop Distributed File System (HDFS) and a
processing component that uses the MapReduce programming
model. Hadoop divides files into large blocks and distributes
them across cluster nodes. It then distributes packaged code to
the nodes so that the data is processed in parallel. This
approach takes advantage of data locality: each node works
on the data it already stores.
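
The MapReduce model described above can be sketched in plain Python as a word count. This is a single-process illustration, not Hadoop's API: the function names, the tiny block size, and the sample lines are all assumptions made for the example.

```python
from collections import defaultdict
from itertools import chain


def split_into_blocks(lines, block_size):
    """Split the input into fixed-size blocks, as a stand-in for HDFS blocks."""
    return [lines[i:i + block_size] for i in range(0, len(lines), block_size)]


def map_phase(block):
    """Map: emit a (word, 1) pair for every word in one block."""
    return [(word.lower(), 1) for line in block for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce: sum the values collected for each key."""
    return {key: sum(values) for key, values in grouped.items()}


lines = ["big data big cluster", "data locality", "big blocks"]
blocks = split_into_blocks(lines, block_size=2)

# On a real cluster, each block would be mapped on the node that stores it;
# here the blocks are simply processed one after another.
mapped = chain.from_iterable(map_phase(block) for block in blocks)
word_counts = reduce_phase(shuffle(mapped))
```

The map step only ever looks at one block, which is what lets a cluster run it in parallel next to the data.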

Spark
• Apache Spark is a free and open-source unified analytics
engine for processing large amounts of data. Spark provides a
programming interface for entire clusters with implicit data
parallelism and fault tolerance.
• Apache Spark requires a cluster manager and a distributed
storage system. For cluster management, Spark supports
standalone mode (a native Spark cluster, launched either
manually or with the launch scripts provided by the install
package; its daemons can also be run on a single machine
for testing), Hadoop YARN, Apache Mesos, or Kubernetes.
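
Each cluster manager is selected through a master URL. The sketch below builds those URL strings; `master_url` is a hypothetical helper written for this example, the host names are placeholders, and the default ports (7077 for standalone, 5050 for Mesos) are the usual defaults rather than anything stated in the slides.

```python
def master_url(manager, host=None, port=None):
    """Return a Spark master URL string for the given cluster manager.

    Hypothetical helper for illustration; only the URL formats come from Spark.
    """
    if manager == "local":
        # Single machine, using all available cores (handy for testing).
        return "local[*]"
    if manager == "yarn":
        # The cluster location is read from the Hadoop configuration.
        return "yarn"
    if manager == "standalone":
        return f"spark://{host}:{port or 7077}"
    if manager == "mesos":
        return f"mesos://{host}:{port or 5050}"
    if manager == "kubernetes":
        if port is None:
            raise ValueError("the Kubernetes API server port must be given")
        return f"k8s://https://{host}:{port}"
    raise ValueError(f"unknown cluster manager: {manager}")


local = master_url("local")
standalone = master_url("standalone", host="node1")
yarn = master_url("yarn")
kubernetes = master_url("kubernetes", host="api.example.com", port=6443)
```

The resulting string is what would be passed to `spark-submit --master` or to `SparkSession.builder.master(...)`.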

RDBMS Vs Hadoop Vs Spark

Data Variety
• RDBMS: Used for structured data only
• Hadoop: Used for semi-structured, unstructured, and structured data
• Spark: Used for semi-structured, unstructured, and structured data

Data Storage
• RDBMS: Used for average data sets (in GBs)
• Hadoop: Used for large data sets (TBs and PBs)
• Spark: Used for large data sets (TBs and PBs)

Querying
• RDBMS: SQL
• Hadoop: HQL (Hive Query Language)
• Spark: Spark SQL

Schema
• RDBMS: Required on write (static schema)
• Hadoop: Required on read (dynamic schema)
• Spark: Required on read (dynamic schema)

Cost
• RDBMS: License
• Hadoop: Free
• Spark: Free

Speed
• RDBMS: Reads are fast
• Hadoop: Both reads and writes are fast
• Spark: More than 100 times faster than Hadoop in some cases

Data Objects
• RDBMS: Works on relational tables
• Hadoop: Works on key-value pairs
• Spark: Resilient Distributed Datasets (RDDs)

Hardware Profile
• RDBMS: High-end profiles
• Hadoop: Commodity/utility hardware
• Spark: High-end profiles

Use Cases
• RDBMS: OLTP (online transaction processing)
• Hadoop: Analytics (audio, video, logs, etc.), data discovery
• Spark: Streaming data, machine learning, fog computing, interactive analyses
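
The difference between a schema required on write and a schema required on read can be sketched as follows. This is a minimal illustration under stated assumptions: Python's built-in sqlite3 stands in for an RDBMS, a list of JSON strings stands in for raw files on a distributed file system, and the `orders` table, its fields, and `read_amounts` are invented for the example.

```python
import json
import sqlite3

# Schema on write: the table definition is enforced when a row is inserted.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders ("
    " id INTEGER PRIMARY KEY,"
    " amount REAL NOT NULL CHECK (typeof(amount) = 'real'))"
)
conn.execute("INSERT INTO orders (id, amount) VALUES (?, ?)", (1, 19.99))
try:
    # A non-numeric amount violates the CHECK constraint and never gets stored.
    conn.execute("INSERT INTO orders (id, amount) VALUES (?, ?)", (2, "n/a"))
    rejected_on_write = False
except sqlite3.IntegrityError:
    rejected_on_write = True
stored_rows = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
conn.close()

# Schema on read: raw records are stored untouched, whatever their shape.
raw_records = [
    '{"id": 1, "amount": 19.99}',
    '{"id": 2, "amount": "n/a"}',
    '{"id": 3, "note": "no amount field"}',
]


def read_amounts(records):
    """Apply a schema at read time: keep only records with a numeric amount."""
    amounts = []
    for line in records:
        record = json.loads(line)
        amount = record.get("amount")
        if isinstance(amount, (int, float)):
            amounts.append(amount)
    return amounts


amounts = read_amounts(raw_records)
```

With schema on write the bad record is rejected up front; with schema on read all three records are kept and each reader decides how to interpret them.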

Benefits
RDBMS
• Maintainability: lets database
administrators maintain, control,
and update data in the database
easily
• Flexibility: saves a lot of time, as
updating data in one place is
enough
• Data Structure: stores data in a
tabular format that is organized
and easily understood by users
• Privileges: allows database
administrators to control
activities over the database
• Data Safety: data is protected by
authorization codes and other
security layers, and stays safe
when a program crashes
HADOOP
• Scalable: it can store and
distribute very large data sets
• Cost-Effective: makes it affordable
to keep raw data that would
otherwise be deleted as too
cost-prohibitive to store
• Flexible: easy access to new
data sources and the ability to
tap into different types of data
• Fast: its storage method is
based on a distributed file
system that ‘maps’ data to where
it is located on the cluster
• Resilient to failure: in the event
of a failure, there is another copy
of the data available for use
SPARK
• Speed: up to 100 times faster than
Hadoop for large-scale data
processing in some cases
• Ease of use: easy-to-use APIs
for operating on large datasets
• Advanced Analytics: supports
machine learning (ML), graph
algorithms, streaming data,
SQL queries, etc.
• Dynamic: easy to develop
parallel applications
• Multilingual: supports many
languages for writing code, such
as Python, Java, Scala, etc.
• Powerful: can handle many
analytics challenges

Limitations
RDBMS
• Software is expensive
• Complex software requires
expensive hardware, which
increases the overall cost of
running the RDBMS
• It requires skilled human
resources to implement
• Certain applications are slow in
processing
• Lost data is difficult to recover
HADOOP
• Performs poorly when it has to
handle a large number of small files
• It is a framework written in Java,
which can make it a target that
cyber-criminals are able to exploit
• Its efficiency decreases when
working with small data sets
• It uses Kerberos for security,
which is not easy to manage.
Storage and network encryption
are not part of Kerberos itself,
which is a further concern
SPARK
• Apache Spark has no file
management system of its own,
so it needs to be integrated with
other platforms
• Does not fully support real-time
data stream processing
• Keeping data in memory is
expensive, which works against
cost-efficient processing of big data
• There is a problem with small files
when Spark is used with Hadoop
• Latency is comparatively high,
which results in lower throughput
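
The small-files limitation listed for Hadoop can be made concrete with some rough arithmetic. Both constants below are assumptions, not figures from the slides: a 128 MB HDFS block size and a commonly quoted rule of thumb of about 150 bytes of NameNode memory per file or block object. The function names are invented for this sketch.

```python
import math

BLOCK_SIZE_MB = 128      # assumed HDFS block size
BYTES_PER_OBJECT = 150   # assumed NameNode memory per file or block object


def namenode_objects(file_count, file_size_mb):
    """Count the file and block objects the NameNode has to track."""
    blocks_per_file = max(1, math.ceil(file_size_mb / BLOCK_SIZE_MB))
    return file_count * (1 + blocks_per_file)


def namenode_memory_mb(file_count, file_size_mb):
    """Estimate NameNode memory in MB for the given set of equal-sized files."""
    return namenode_objects(file_count, file_size_mb) * BYTES_PER_OBJECT / 1_000_000


# The same 1,000,000 MB of data stored two ways.
small_files_mb = namenode_memory_mb(file_count=1_000_000, file_size_mb=1)
large_files_mb = namenode_memory_mb(file_count=1_000, file_size_mb=1_000)
```

Under these assumptions the million 1 MB files need roughly 300 MB of NameNode memory, while the thousand 1,000 MB files need about 1.35 MB for the same volume of data.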
