3. Data Processing
• The process of collecting, manipulating, and managing data to generate
meaningful information for the end user.
• Data may originate from diverse sources in the form of transactions,
observations, and so forth – this is data capture.
• Once data is captured, data processing begins.
• There are two basic types of data processing: centralized and distributed
data processing.
5. Parallel Data Processing
• Simultaneous execution of multiple sub-tasks that
collectively comprise a larger task.
• The goal is to reduce the execution time by dividing a
single larger task into multiple smaller tasks that run
concurrently.
• Typically achieved within the confines of a single machine with multiple
processors (see the sketch below).
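A minimal sketch of this idea, assuming Python's standard multiprocessing module (the function and variable names are illustrative only):

# Parallel data processing on one multi-processor machine: the larger task
# (summing a big list) is divided into smaller sub-tasks that run concurrently
# in a pool of worker processes.
from multiprocessing import Pool

def partial_sum(chunk):
    """Sub-task: process one slice of the data independently."""
    return sum(chunk)

if __name__ == "__main__":
    data = list(range(1_000_000))                 # input of the larger task
    n_workers = 4
    chunk_size = len(data) // n_workers
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]

    with Pool(processes=n_workers) as pool:
        partial_results = pool.map(partial_sum, chunks)   # sub-tasks run in parallel

    print(sum(partial_results))                   # combine the sub-task results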
6. Distributed Data Processing
• Similar to parallel data processing in that it follows the same
"divide-and-conquer" principle.
• However, distributed data processing is always achieved through physically
separate machines that are networked together as a cluster.
7. Cluster
• Clusters are horizontally scalable storage solutions.
• Clusters also provide the mechanism to enable distributed data processing
with linear scalability.
• Since clusters are highly scalable, they are well suited to Big Data
processing: large datasets can be divided into smaller datasets and then
processed in parallel in a distributed manner.
• When leveraging a cluster, Big Data datasets can be processed either in
batch mode or in realtime mode.
• Ideally, a cluster will be comprised of low-cost commodity
nodes that collectively provide increased processing
capacity.
• Other benefits: redundancy and fault tolerance
– Clusters consist of physically separate nodes.
– Redundancy and fault tolerance allow resilient processing and
analysis to occur if a network or node failure occurs.
9. Cluster
• There are two major types of clusters, namely,
– high-availability cluster and
– load-balancing cluster
• High availability clusters are designed to minimize downtime and provide
uninterrupted service when nodes fail.
• High availability makes the system highly fault tolerant with many
redundant nodes, which sustain faults and failures.
– Such systems also ensure high reliability and scalability.
– The higher the redundancy, the higher the availability.
• Load-balancing clusters are designed to distribute workloads across
different cluster nodes to share the service load among the nodes.
– If a node goes down, the load from that node is switched over to another node.
• The main objective of load balancing is to
– optimize the use of resources,
– minimize response time,
– maximize throughput, and
– avoid overload on any one of the resources.
10. Processing Workloads
• A processing workload is defined as the amount and nature of data that is
processed within a certain amount of time.
• Workloads are usually divided into two types:
– Batch
– Transactional
11. Processing Workloads
Batch
• Offline processing
• Data is processed in batches, which usually imposes delays and in turn
results in high-latency responses.
• Batch workloads typically involve large quantities of data with sequential
reads/writes and comprise groups of read or write queries.
• Queries can be complex and involve multiple joins.
• OLAP systems commonly process workloads in batches.
• Strategic BI and analytics are batch-oriented as they are
highly read-intensive tasks involving large volumes of data.
13. Processing Workloads
Transactional
• Online processing/real-time processing
• Transactional workload processing follows an approach whereby data is
processed interactively without delay, resulting in low-latency responses.
• Transaction workloads involve small amounts of data with random
reads and writes.
• Examples include OLTP and operational systems.
• Although these workloads contain a mix of read/write queries, they are
generally more write-intensive than read-intensive.
• Transactional workloads comprise random reads/writes that
involve fewer joins than business intelligence and reporting
workloads.
14. Processing Workloads
Real-time processing
• Data is processed in-memory due to the requirement to analyze it while it
is streaming.
16. MapReduce
• Batch processing framework.
• Highly scalable and reliable
• Principle of divide-and-conquer – provides
built-in fault tolerance and redundancy.
• Has roots in both distributed and parallel
computing.
• Can process schema-less datasets.
• A dataset is broken down into multiple smaller parts, and operations are
performed on each part independently and in parallel.
17. Map and Reduce Task
• A single processing run of the MapReduce processing engine is known as a
MapReduce job.
• Each MapReduce job is composed of a map task and a
reduce task
• Each task consists of multiple stages.
18. Map and Reduce Task
• The JobTracker runs on the master node, and a TaskTracker runs on each
slave node.
• There is only one TaskTracker per slave node.
– A TaskTracker and a DataNode run together on each slave machine, making
each slave node perform both computing and storage tasks,
– while the JobTracker runs alongside the NameNode on the master node.
20. Map and Reduce Task
• Step 1: The file is taken as input for processing. Any file consists of a
group of lines, and these lines carry the key-value pairs of data. The whole
file is read in this way.
• Step 2: Next, the file is "split". Splitting divides the file into key-value
pairs; here the key is the byte offset of each line and the value is the
line's content. Each line is read individually, so there is no need to split
the data manually.
• Step 3: The next step is to process the value of each line: every word
(separated by a space) is counted, and that number is written with each key.
This is the "mapping" logic the programmer needs to write.
• Step 4: Shuffling is then performed, so that each key becomes associated
with the group of numbers emitted for it during the mapping stage. Each key is
now a string whose value is a list of numbers; this goes as input to the
reducer.
• Step 5: In the reduce phase, the numbers for each key are summed; each key's
final count is the sum of all of its numbers, which leads to the final result.
• Step 6: The output of the reduce phase is the final result: the count of
each individual word. (A plain-Python sketch of this flow follows.)
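The flow above can be sketched in plain Python as a stand-alone simulation of the map, shuffle, and reduce stages (illustrative only, not actual Hadoop code; the sample lines are made up):

# A plain-Python simulation of the word-count MapReduce flow described above.
from collections import defaultdict

lines = ["the quick brown fox", "the lazy dog", "the fox"]  # stand-in for the input file

# Steps 2-3: "map" - each line arrives as (offset, line); emit (word, 1) pairs.
mapped = []
for offset, line in enumerate(lines):      # offset plays the role of the key
    for word in line.split():
        mapped.append((word, 1))

# Step 4: "shuffle" - group all emitted values by key.
grouped = defaultdict(list)
for word, count in mapped:
    grouped[word].append(count)

# Steps 5-6: "reduce" - sum the grouped values to get the final word counts.
result = {word: sum(counts) for word, counts in grouped.items()}
print(result)   # e.g. {'the': 3, 'quick': 1, 'brown': 1, 'fox': 2, 'lazy': 1, 'dog': 1}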
21. MapReduce Algorithms
• Task Parallelism
– Parallelization of data processing by dividing a task into sub-tasks
and running each sub-task on a separate processor, generally on a
separate node in a cluster.
– Each sub-task generally executes a different algorithm, with its own
copy of the same data or different data as its input, in parallel.
– Generally, the output from multiple sub-tasks is joined together to
obtain the final set of results.
• Data Parallelism
– Parallelization of data processing by dividing a dataset into multiple
datasets and processing each sub-dataset in parallel.
– The sub-datasets are spread across multiple nodes and are all
processed using the same algorithm.
– Generally, the output from each processed sub-dataset is joined together to
obtain the final set of results (both forms are sketched below).
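A minimal sketch of both forms of parallelism, assuming Python's concurrent.futures module (the functions mean and maximum are illustrative stand-ins for "different algorithms"):

# Contrasting task parallelism and data parallelism on a small in-memory dataset.
from concurrent.futures import ProcessPoolExecutor

data = list(range(10_000))

def mean(xs):
    return sum(xs) / len(xs)     # algorithm A

def maximum(xs):
    return max(xs)               # algorithm B

if __name__ == "__main__":
    with ProcessPoolExecutor() as pool:
        # Task parallelism: different algorithms run in parallel on copies of the same data.
        mean_future = pool.submit(mean, data)
        max_future = pool.submit(maximum, data)
        task_result = (mean_future.result(), max_future.result())  # join sub-task outputs

        # Data parallelism: the dataset is split and each part is processed with
        # the same algorithm; the partial results are then joined.
        halves = [data[:5000], data[5000:]]
        partial_sums = list(pool.map(sum, halves))
        data_result = sum(partial_sums)

    print(task_result, data_result)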
22. Realtime Processing
• In realtime mode, data is processed in-memory as it is captured, before
being persisted to disk.
• Response time generally ranges from a sub-second to under a
minute.
• Realtime mode addresses the velocity characteristic.
• Also called event or stream processing as the data either
arrives continuously (stream) or at intervals (event).
• The individual event/stream datum is generally small in
size, but its continuous nature results in very large datasets.
• Another related term is interactive mode, which generally refers to query
processing in realtime.
• Operational BI/analytics are generally conducted in realtime
mode.
23. Distributed Data Processing Principle
• Fundamental principle - the Speed, Consistency and Volume (SCV) principle:
a distributed data processing system can support at most two of these three
properties at the same time.
• Speed – refers to how quickly the data can be processed once it is
generated.
– In the case of realtime analytics, data is processed comparatively faster than
batch analytics.
– This generally excludes the time taken to capture data and focuses only on
the actual data processing, such as generating statistics or executing an
algorithm.
• Consistency – refers to the accuracy and precision of results.
- Results are deemed accurate if they are close to the correct value and precise if
close to each other.
• Volume – refers to the amount of data that can be processed.
- Big Data involves huge volumes of data that need to be processed in a
distributed manner.
- Processing such voluminous data in its entirety while ensuring speed and
consistency is not possible.
24. Realtime Processing - Spark
• for data processing
• designed to be fast and general purpose
• based on a cluster computing platform
• uses in-memory distributed computing
• can run on top of existing Hadoop environment / can also run as
standalone
• provides shell support (interactive programming environment)
• supports different types of workloads (batch and streaming data)
25. Realtime Processing - Spark
Spark Architecture
• Apache Spark uses a master/worker architecture.
• A driver program runs on the master node and talks to executors on the worker nodes.
• Spark applications run as independent sets of processes, coordinated by the
SparkContext object created by the driver program.
• Spark can run in standalone mode or on a cluster of many nodes.
• SparkContext talks to the cluster manager to allocate resources across applications.
• Spark acquires executors on nodes in the cluster, and then sends application code to
them.
• Executors are processes that actually run the application code and store data
for these applications (see the driver-program sketch below).
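A minimal PySpark driver-program sketch of this architecture, assuming pyspark is installed; "local[*]" stands in for a real cluster manager (e.g., standalone or YARN), and the application name is made up:

# The driver program creates the SparkContext, which coordinates executors.
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("driver-sketch")
        .setMaster("local[*]"))      # in a real cluster this points to the cluster manager
sc = SparkContext(conf=conf)         # driver creates the SparkContext

# The SparkContext asks the cluster manager for executors and ships the
# application code (the lambda below) to them for execution.
rdd = sc.parallelize(range(100), numSlices=4)
print(rdd.map(lambda x: x * x).sum())

sc.stop()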
26. Realtime Processing - Spark
• Spark's core concept is Resilient Distributed Dataset (RDD).
• RDD is a fault-tolerant collection of elements.
• RDD can be operated on in parallel.
• It is a resilient and distributed collection of records that can span one or
more partitions, depending on the configuration.
• An RDD is an immutable distributed collection of objects, which implies that
you cannot change the data in an RDD, but you can apply transformations to one
RDD to get another one as a result.
• It abstracts away the complexity of working in parallel.
• Resilient: Fault tolerant, able to re-compute when it has missing
records or damaged partitions due to node failures.
• Distributed: Data resides on multiple nodes in a cluster.
• Dataset: A collection of partitioned data holding key-value pairs or
primitive values (tuples), representing the records of data you work with.
28. Realtime Processing - Spark
Additional traits:
• Immutable: RDDs never change once created; they are read-only and can only
be transformed into another RDD using transformation operations.
• Lazy evaluated: transformations are not computed immediately; for example,
you first load data into an RDD, then apply a filter on it, and the actual
work happens only when you ask for a count (an action).
• In-memory: Spark keeps as much of an RDD in memory as it can, for as long
as possible.
• Typed: RDD records are strongly typed, like Int in RDD[Int] or tuple (Int, String)
in RDD[(Int, String)].
• Cacheable: an RDD can hold its data in persistent storage such as memory
(preferred) or disk.
• Partitioned: Data is split into a number of logical partitions based on multiple
parameters, and then distributed across nodes in a cluster.
• Parallel: RDDs are normally distributed across multiple nodes, and after
partitioning they are acted upon in parallel; this parallelism is part of the
abstraction the RDD provides.
• Location aware: RDDs have location preferences; Spark tries to create them
as close to the data as possible, provided resources are available (data
locality). (A short PySpark sketch of these traits follows.)
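A short PySpark sketch of these traits, assuming an existing SparkContext named sc (for example, the one created in the driver sketch above, or the one provided by the Spark shell):

# Distributed and partitioned: the data is split into two partitions.
numbers = sc.parallelize(range(10), numSlices=2)

# Immutable: transformations never modify `numbers`; they return new RDDs.
evens = numbers.filter(lambda x: x % 2 == 0)
squares = evens.map(lambda x: x * x)

# Lazy evaluated: nothing has been computed yet; the chain of transformations
# runs only when an action such as count() or collect() is called.
print(squares.count())      # triggers the actual computation
print(squares.collect())    # [0, 4, 16, 36, 64]

# Cacheable: keep the RDD in memory for reuse across later actions.
squares.cache()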
29. MapReduce vs Spark
MapReduce
• A significant amount of overhead is associated with MapReduce job creation
and coordination.
• Designed for batch-oriented processing of large amounts of data that have
been stored to disk.
• Cannot process data incrementally and can only process complete
datasets.
• It therefore requires all input data to be available in its entirety before
the execution of the data processing job.
Spark
• Realtime processing
• Suitable for smaller datasets
• In-memory processing (not permanent, but faster)
• Uses parallel computing BUT NOT a mapper & reducer (a PySpark word count is
sketched below for comparison)
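For comparison with the MapReduce word-count flow described earlier, the same computation expressed as Spark RDD transformations (a sketch; assumes a SparkContext sc and an illustrative input file "input.txt" on an accessible filesystem):

# Word count in Spark: the map and reduce steps become RDD transformations.
counts = (sc.textFile("input.txt")
            .flatMap(lambda line: line.split())    # emit individual words
            .map(lambda word: (word, 1))           # key-value pairs (word, 1)
            .reduceByKey(lambda a, b: a + b))      # sum the counts per word
print(counts.collect())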
30. Real-World Use Case Examples
MapReduce
• Log Analysis: Many companies generate huge volumes of log data from various systems and applications.
MapReduce can be used to efficiently analyze these logs to extract valuable insights, such as identifying system
failures, user behavior patterns, or security threats.
• Search Engine Indexing: Search engines like Google and Bing use MapReduce to process and index web pages
efficiently. MapReduce can distribute the task of crawling and indexing web pages across multiple nodes in a cluster,
allowing search engines to handle massive amounts of data.
• Social Media Analytics: Social media platforms analyze vast amounts of user-generated content to extract insights,
such as trending topics, sentiment analysis, and user behavior analysis. MapReduce can be used to process this data
in parallel, enabling real-time or near-real-time analytics.
• E-commerce Recommendation Systems: E-commerce companies utilize MapReduce to analyze customer behavior,
purchase history, and product interactions to generate personalized recommendations for users.
• Genomics and Bioinformatics: In genomics research, scientists use MapReduce to analyze large DNA sequencing
datasets for tasks such as genome assembly, variant calling, and gene expression analysis.
Spark
• Real-time Stream Processing: Spark Streaming enables real-time processing of streaming data from sources like IoT
devices, sensors, social media feeds, and financial transactions. Industries such as finance, telecommunications, and
online advertising use Spark Streaming for real-time analytics and decision-making.
• Graph Processing: Spark's GraphX library allows companies to analyze and process large-scale graphs and networks
efficiently. Use cases include social network analysis, fraud detection, recommendation systems, and network
infrastructure optimization.
• Data Warehousing and ETL: Spark SQL provides a unified interface for querying structured data using SQL queries,
making it suitable for data warehousing and ETL (Extract, Transform, Load) tasks. Companies use Spark SQL to query
and analyze data stored in various formats and data sources, such as HDFS, HBase, and relational databases.
• Large-scale Data Processing: Spark's general-purpose data processing capabilities make it suitable for a wide range
of big data processing tasks, including data cleansing, transformation, aggregation, and analysis. Industries such as
finance, healthcare, retail, and manufacturing leverage Spark for processing and analyzing large volumes of data
efficiently.