DSC650: Data Technology and
Future Emergence
Lecture 4: Data Processing
Lecturer: Dr Jasber Kaur
Lecture 4: Data Processing
Different Types of Data Processing: Parallel, Distributed, Batch, Transactional, Cluster, etc.
MapReduce Framework, Algorithm and Process Data
Real-Time Data Analysis using Apache Spark
Scalability and Fault Tolerance
Optimization and Data Locality
Real World Cases
At the end of the lecture, students should be able to:
• CLO1: Demonstrate an understanding of the basic
concepts and practices of big data technology
Big Data: Concepts, Technology, and Architecture, First Edition. Balamurugan Balusamy, Nandhini Abirami. R, Seifedine Kadry, and
Amir H. Gandomi. © 2021 John Wiley & Sons, Inc. Published 2021 by John Wiley & Sons, Inc
Data Processing
• The process of collecting, processing, manipulating, and
managing data to generate meaningful information
for the end user.
• Data may originate from diverse sources in the
form of transactions, observations, and so forth – data
capture
• Once data is captured, data processing begins.
• There are basically two different types of data
processing, namely, centralized and distributed data
processing.
Data Processing
Parallel Data Processing
• Simultaneous execution of multiple sub-tasks that
collectively comprise a larger task.
• The goal is to reduce the execution time by dividing a
single larger task into multiple smaller tasks that run
concurrently.
• Typically achieved within the confines of a single
machine with multiple processors.
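As a minimal illustration (not from the lecture), a single machine's processor pool can run the smaller sub-tasks concurrently; here Python's standard `concurrent.futures` divides one large summation across worker processes:

```python
from concurrent.futures import ProcessPoolExecutor

def partial_sum(chunk):
    # Each sub-task sums its own slice of the data.
    return sum(chunk)

def parallel_sum(data, workers=4):
    # Divide the single larger task into smaller chunks, one per worker.
    size = max(1, len(data) // workers)
    chunks = [data[i:i + size] for i in range(0, len(data), size)]
    with ProcessPoolExecutor(max_workers=workers) as pool:
        # Sub-tasks execute concurrently on separate processors;
        # their partial results are combined at the end.
        return sum(pool.map(partial_sum, chunks))

if __name__ == "__main__":
    print(parallel_sum(list(range(1, 101))))  # 5050
```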
Distributed Data Processing
• Similar to parallel data
processing in that it follows
the same principle of
“divide-and-conquer”.
• However, distributed
data processing is
always achieved
through physically
separate machines
that are networked
together as a cluster.
Cluster
• Horizontally scalable storage solutions.
• Clusters also provide the mechanism to enable distributed
data processing with linear scalability.
• Since clusters are highly scalable, they suit Big Data processing:
large datasets can be divided into smaller datasets and then
processed in parallel in a distributed manner.
• When leveraging a cluster, Big Data datasets can be
processed in either batch mode or real-time mode.
• Ideally, a cluster is comprised of low-cost commodity
nodes that collectively provide increased processing
capacity.
• Other benefits: redundancy and fault tolerance
– Clusters consist of physically separate nodes.
– Redundancy and fault tolerance allow resilient processing and
analysis to occur if a network or node failure occurs.
Cluster
Multiple stand-alone PCs connected
together through a dedicated switch.
The login node acts as the gateway into the cluster.
Cluster
• There are two major types of clusters, namely,
– high-availability cluster and
– load-balancing cluster
• High availability clusters are designed to minimize downtime and provide
uninterrupted service when nodes fail.
• High availability makes the system highly fault tolerant with many
redundant nodes, which sustain faults and failures.
– Such systems also ensure high reliability and scalability.
– The higher the redundancy, the higher the availability.
• Load-balancing clusters are designed to distribute workloads across
different cluster nodes to share the service load among the nodes.
– If a node goes down, the load from that node is switched over to another node
• The main objective of load balancing is to
– optimize the use of resources,
– minimize response time,
– maximize throughput, and
– avoid overload on any one of the resources.
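As a minimal sketch of the load-balancing objectives above (purely illustrative; class and node names are invented, not from the lecture), a round-robin dispatcher spreads requests across nodes and, when a node goes down, its load is switched over to the remaining healthy nodes:

```python
from itertools import cycle

class RoundRobinBalancer:
    """Toy load balancer: rotates requests across healthy nodes."""

    def __init__(self, nodes):
        self.nodes = list(nodes)
        self.healthy = set(self.nodes)
        self._ring = cycle(self.nodes)

    def mark_down(self, node):
        # A failed node's share of the load moves to the others.
        self.healthy.discard(node)

    def route(self, request):
        # Pick the next healthy node in rotation.
        for _ in range(len(self.nodes)):
            node = next(self._ring)
            if node in self.healthy:
                return node, request
        raise RuntimeError("no healthy nodes available")

lb = RoundRobinBalancer(["node-a", "node-b", "node-c"])
print([lb.route(i)[0] for i in range(3)])   # each node takes a turn
lb.mark_down("node-b")
print([lb.route(i)[0] for i in range(2)])   # node-b is skipped from now on
```

Rotating evenly optimizes resource use and throughput, while skipping failed nodes implements the failover behavior described above.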
Processing Workloads
• A processing workload is defined as the amount and
nature of data that is processed within a
certain amount of time.
• Workloads are usually divided into two types:
– Batch
– Transactional
Processing Workloads
Batch
• Offline processing
• Processing data in batches usually imposes delays,
which in turn result in high-latency responses.
• Batch workloads typically involve large quantities of data
with sequential reads/writes and comprise groups of
read or write queries.
• Queries can be complex and involve multiple joins.
• OLAP systems commonly process workloads in batches.
• Strategic BI and analytics are batch-oriented as they are
highly read-intensive tasks involving large volumes of data.
Processing Workloads
Batch processing
Jobs aggregate the data and keep it
available for analysis when required
Processing Workloads
Transactional
• Online processing/real-time processing
• Transactional workload processing follows an approach whereby
data is processed interactively without delay, resulting in
low-latency responses.
• Transactional workloads involve small amounts of data with random
reads and writes.
• OLTP and operational systems are generally write-intensive.
• Although these workloads contain a mix of read/write queries,
they are generally more write-intensive than read-intensive.
• Transactional workloads comprise random reads/writes that
involve fewer joins than business intelligence and reporting
workloads.
Processing Workloads
Real time processing
processed in-memory due to the requirement
to analyze the data while it is streaming
Processing Workloads
Examples - Real time and batch computation systems
MapReduce
• Batch processing framework.
• Highly scalable and reliable
• Principle of divide-and-conquer – provides
built-in fault tolerance and redundancy.
• Has roots in both distributed and parallel
computing.
• Process schema-less datasets.
• A dataset is broken down into multiple smaller
parts, and operations are performed on
each part independently and in parallel.
Map and Reduce Task
• A single processing run of the MapReduce processing engine
is known as a MapReduce job.
• Each MapReduce job is composed of a map task and a
reduce task.
• Each task consists of multiple stages.
Map and Reduce Task
• The JobTracker runs on the master node, and a TaskTracker runs on each
slave node.
• There is only one TaskTracker per slave node.
– JobTracker and NameNode run on the master machine, while
– TaskTracker and DataNode run together on each slave machine, making each
slave node perform both computing and storage tasks.
Map and Reduce Task
Map and Reduce Task
• Step 1: The input file is taken for processing. A file consists of a group of
lines, and the whole file is read this way so its lines can be turned into
key-value pairs of data.
• Step 2: Next, the file enters the "splitting" phase, which divides it into
key-value pairs: the key is the byte offset of each line and the value is the
line's content. Each line is read individually, so there is no need to split the
data manually.
• Step 3: The next step processes the value of each line. Each word, separated
by a space, is emitted as a key paired with a count. This is the "mapping"
logic that the programmer needs to write.
• Step 4: After that, shuffling is performed, whereby each key produced in the
mapping phase becomes associated with the group of numbers emitted for it. The
result — each key (a word) with its list of counts — goes as input to the reducer.
• Step 5: In the reduce phase, the numbers for each key are summed; each key's
final count is the sum of all the numbers in its list.
• Step 6: The output of the reduce phase is the final result: the
count of each individual word.
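The six steps above can be sketched in plain Python (an emulation of the stages for teaching purposes, not actual Hadoop code):

```python
from collections import defaultdict

def run_wordcount(text):
    # Steps 1-2: read the input and split it into (offset, line)
    # key-value pairs; the key is the line's byte offset.
    offset, splits = 0, []
    for line in text.splitlines(keepends=True):
        splits.append((offset, line))
        offset += len(line)

    # Step 3: map — emit (word, 1) for every space-separated word.
    mapped = []
    for _, line in splits:
        for word in line.split():
            mapped.append((word, 1))

    # Step 4: shuffle — group the emitted 1s under each word.
    grouped = defaultdict(list)
    for word, one in mapped:
        grouped[word].append(one)

    # Steps 5-6: reduce — sum each word's list to get its final count.
    return {word: sum(ones) for word, ones in grouped.items()}

print(run_wordcount("deer bear river\ncar car river\ndeer car bear"))
# {'deer': 2, 'bear': 2, 'river': 2, 'car': 3}
```

In real MapReduce the map, shuffle, and reduce stages run on different nodes in parallel; here they run sequentially only to make each stage's input and output visible.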
MapReduce Algorithms
• Task Parallelism
– Parallelization of data processing by dividing a task into sub-tasks
and running each sub-task on a separate processor, generally on a
separate node in a cluster.
– Each sub-task generally executes a different algorithm, with its own
copy of the same data or different data as its input, in parallel.
– Generally, the output from multiple sub-tasks is joined together to
obtain the final set of results.
• Data Parallelism
– Parallelization of data processing by dividing a dataset into multiple
datasets and processing each sub-dataset in parallel.
– The sub-datasets are spread across multiple nodes and are all
processed using the same algorithm.
– Generally, the output from each processed sub-dataset is joined
together to obtain the final set of results.
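The two forms of parallelism can be contrasted in a short sketch (illustrative only; a thread pool stands in for separate cluster nodes):

```python
from concurrent.futures import ThreadPoolExecutor

data = list(range(1, 11))

# Task parallelism: DIFFERENT algorithms (sum, max) run on the
# same data in parallel; their outputs are joined at the end.
with ThreadPoolExecutor() as pool:
    total = pool.submit(sum, data)
    largest = pool.submit(max, data)
    task_results = (total.result(), largest.result())

# Data parallelism: the dataset is divided into sub-datasets and the
# SAME algorithm (sum) is applied to each sub-dataset in parallel.
halves = [data[:5], data[5:]]
with ThreadPoolExecutor() as pool:
    partials = list(pool.map(sum, halves))
data_result = sum(partials)  # partial outputs joined into the final result

print(task_results, data_result)  # (55, 10) 55
```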
Realtime Processing
• In realtime mode, data is processed in-memory as it is
captured, before being persisted to disk.
• Response time generally ranges from a sub-second to under a
minute.
• Realtime mode addresses the velocity characteristic.
• Also called event or stream processing as the data either
arrives continuously (stream) or at intervals (event).
• The individual event/stream datum is generally small in
size, but its continuous nature results in very large datasets.
• Another related term, interactive mode.
• Interactive mode generally refers to query processing in
realtime.
• Operational BI/analytics are generally conducted in realtime
mode.
Distributed Data Processing Principle
• Fundamental principle - Speed, Consistency and Volume (SCV) principle.
• Speed – refers to how quickly the data can be processed once it is
generated.
– In the case of realtime analytics, data is processed comparatively faster than
batch analytics.
– This generally excludes the time taken to capture data and focuses only on
the actual data processing, such as generating statistics or executing an
algorithm.
• Consistency – refers to the accuracy and precision of results.
- Results are deemed accurate if they are close to the correct value and precise if
close to each other.
• Volume – refers to the amount of data that can be processed.
- Big Data environments involve huge volumes of data that need to be processed
in a distributed manner.
- Processing such voluminous data in its entirety while ensuring both speed and
consistency is not possible.
Realtime Processing - Spark
• for data processing
• designed to be fast and general purpose
• based on cluster computing platform
• uses in-memory distributed computing
• can run on top of existing Hadoop environment / can also run as
standalone
• provides shell support (interactive programming environment)
• supports different types of workloads (batch and streaming data)
Realtime Processing - Spark
Spark Architecture
• Apache Spark uses a master/slave (driver/worker) architecture.
• A driver program runs on the master node and talks to executors on the worker nodes.
• Spark applications run as independent sets of processes, coordinated by the
SparkContext object, which is created by the driver program.
• Spark can run in standalone mode or on a cluster of many nodes.
• SparkContext talks to the cluster manager to allocate resources across applications.
• Spark acquires executors on nodes in the cluster, and then sends application code to
them.
• Executors are processes that actually run the application code and store data for these
applications.
Realtime Processing - Spark
• Spark's core concept is Resilient Distributed Dataset (RDD).
• RDD is a fault-tolerant collection of elements.
• RDD can be operated on in parallel.
• It is a resilient and distributed collection of records, which can
span one or more partitions, depending on the configuration.
• RDD is an immutable distributed collection of objects, which
implies that you cannot change data in RDD but you can apply
transformation on one RDD to get another one as a result.
• It abstracts away the complexity of working in parallel.
• Resilient: Fault tolerant, able to re-compute when it has missing
records or damaged partitions due to node failures.
• Distributed: Data resides on multiple nodes in a cluster.
• Dataset: A collection of partitioned data with a key-value pair or
primitive values called tuples. Represents the records of data you
work with.
Realtime Processing - Spark
Example
Realtime Processing - Spark
Additional traits:
• Immutable: RDDs never change once created; they are read-only and can only
be transformed into another RDD using transformation operations.
• Lazy evaluated: transformations are not computed until an action requires them;
for example, you load data into an RDD and apply a filter, but nothing is
executed until you ask for a count.
• In-memory: Spark keeps as much of an RDD in memory as it can, for as long
as possible.
• Typed: RDD records are strongly typed, like Int in RDD[Int] or tuple (Int, String)
in RDD[(Int, String)].
• Cacheable: an RDD can store data in persistent storage.
• Partitioned: data is split into a number of logical partitions based on multiple
parameters and then distributed across nodes in a cluster.
• Parallel: RDDs are normally distributed on multiple nodes, which is the
abstraction they provide; after partitioning, they are acted upon in parallel.
• Location aware: RDDs have location preferences; Spark tries to create them as
close to the data as possible, provided resources are available (data locality).
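To make the immutable and lazily evaluated traits concrete, here is a tiny toy class (purely illustrative, not the real PySpark API): each transformation returns a new object and merely records the operation, and nothing executes until an action such as count() is called:

```python
class ToyRDD:
    """Minimal stand-in for an RDD: immutable and lazily evaluated."""

    def __init__(self, data, ops=()):
        self._data = data      # source data is never mutated
        self._ops = ops        # recorded, not-yet-executed transformations

    def filter(self, pred):
        # Transformation: returns a NEW ToyRDD; self is unchanged.
        return ToyRDD(self._data, self._ops + (("filter", pred),))

    def map(self, fn):
        return ToyRDD(self._data, self._ops + (("map", fn),))

    def count(self):
        # Action: only now is the recorded pipeline actually evaluated.
        items = iter(self._data)
        for kind, fn in self._ops:
            items = filter(fn, items) if kind == "filter" else map(fn, items)
        return sum(1 for _ in items)

base = ToyRDD(range(10))
evens = base.filter(lambda x: x % 2 == 0)   # nothing computed yet
print(evens.count())   # 5  — evaluation happens only here
print(base.count())    # 10 — the original RDD is untouched (immutable)
```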
MapReduce vs Spark
MapReduce
• Significant overhead is associated with MapReduce job creation and
coordination.
• Batch-oriented processing of large amounts of data that has been
stored to disk.
• Cannot process data incrementally; can only process complete
datasets.
• It therefore requires all input data to be available in its entirety before
the execution of the data processing job.
Spark
• Realtime processing
• Suitable for smaller datasets
• In-memory processing (not permanent, but faster)
• Uses parallel computing, BUT NOT the mapper & reducer model
Real Use Cases Examples
MapReduce
• Log Analysis: Many companies generate huge volumes of log data from various systems and applications.
MapReduce can be used to efficiently analyze these logs to extract valuable insights, such as identifying system
failures, user behavior patterns, or security threats.
• Search Engine Indexing: Search engines like Google and Bing use MapReduce to process and index web pages
efficiently. MapReduce can distribute the task of crawling and indexing web pages across multiple nodes in a cluster,
allowing search engines to handle massive amounts of data.
• Social Media Analytics: Social media platforms analyze vast amounts of user-generated content to extract insights,
such as trending topics, sentiment analysis, and user behavior analysis. MapReduce can be used to process this data
in parallel, enabling real-time or near-real-time analytics.
• E-commerce Recommendation Systems: E-commerce companies utilize MapReduce to analyze customer behavior,
purchase history, and product interactions to generate personalized recommendations for users.
• Genomics and Bioinformatics: In genomics research, scientists use MapReduce to analyze large DNA sequencing
datasets for tasks such as genome assembly, variant calling, and gene expression analysis.
Spark
• Real-time Stream Processing: Spark Streaming enables real-time processing of streaming data from sources like IoT
devices, sensors, social media feeds, and financial transactions. Industries such as finance, telecommunications, and
online advertising use Spark Streaming for real-time analytics and decision-making.
• Graph Processing: Spark's GraphX library allows companies to analyze and process large-scale graphs and networks
efficiently. Use cases include social network analysis, fraud detection, recommendation systems, and network
infrastructure optimization.
• Data Warehousing and ETL: Spark SQL provides a unified interface for querying structured data using SQL queries,
making it suitable for data warehousing and ETL (Extract, Transform, Load) tasks. Companies use Spark SQL to query
and analyze data stored in various formats and data sources, such as HDFS, HBase, and relational databases.
• Large-scale Data Processing: Spark's general-purpose data processing capabilities make it suitable for a wide range
of big data processing tasks, including data cleansing, transformation, aggregation, and analysis. Industries such as
finance, healthcare, retail, and manufacturing leverage Spark for processing and analyzing large volumes of data
efficiently.

More Related Content

PPT
Basic premise for hadoop's architectures
PPTX
unit 1 big data.pptx
PPTX
Scalable Data Analytics: Technologies and Methods
PPT
HDFS_architecture.ppt
PPTX
MOD-2 presentation on engineering students
PPTX
Data warehouse
PPTX
Cloud computing
Basic premise for hadoop's architectures
unit 1 big data.pptx
Scalable Data Analytics: Technologies and Methods
HDFS_architecture.ppt
MOD-2 presentation on engineering students
Data warehouse
Cloud computing

Similar to DSC650 : DATA TECHNOLOGY AND FUTURE EMERGENCE (CHAPTER 4) (20)

PPTX
Cloud Computing - Geektalk
PDF
Development of concurrent services using In-Memory Data Grids
PPTX
TASK AND DATA PARALLELISM in Computer Science pptx
PPTX
Unit II - Data Science (3) VI semester SRMIST
PPTX
Module-2_HADOOP.pptx
PPTX
BIg Data Analytics-Module-2 vtu engineering.pptx
PPTX
Module 3 - DBMS System Architecture Principles
PPTX
BIg Data Analytics-Module-2 as per vtu syllabus.pptx
PPTX
DBMS.pptx
PPT
SecondPresentationDesigning_Parallel_Programs.ppt
PPTX
NOSQL introduction for big data analytics
PPTX
Lectures 9-HCE 311.pptx;parallel systems
PDF
DataIntensiveComputing.pdf
PPTX
Lecture 3.31 3.32.pptx
PPTX
Hadoop and Mapreduce for .NET User Group
PPTX
Batch Processing vs Stream Processing Difference
PDF
Building Big Data Streaming Architectures
PDF
Data structures and algorithms Module-1.pdf
PDF
Unit 5 Advanced Computer Architecture
PPT
Cloud Computing - Geektalk
Development of concurrent services using In-Memory Data Grids
TASK AND DATA PARALLELISM in Computer Science pptx
Unit II - Data Science (3) VI semester SRMIST
Module-2_HADOOP.pptx
BIg Data Analytics-Module-2 vtu engineering.pptx
Module 3 - DBMS System Architecture Principles
BIg Data Analytics-Module-2 as per vtu syllabus.pptx
DBMS.pptx
SecondPresentationDesigning_Parallel_Programs.ppt
NOSQL introduction for big data analytics
Lectures 9-HCE 311.pptx;parallel systems
DataIntensiveComputing.pdf
Lecture 3.31 3.32.pptx
Hadoop and Mapreduce for .NET User Group
Batch Processing vs Stream Processing Difference
Building Big Data Streaming Architectures
Data structures and algorithms Module-1.pdf
Unit 5 Advanced Computer Architecture
Ad

Recently uploaded (20)

PDF
How UI/UX Design Impacts User Retention in Mobile Apps.pdf
PDF
Machine learning based COVID-19 study performance prediction
PDF
Chapter 3 Spatial Domain Image Processing.pdf
PPTX
Detection-First SIEM: Rule Types, Dashboards, and Threat-Informed Strategy
PDF
Modernizing your data center with Dell and AMD
PPTX
20250228 LYD VKU AI Blended-Learning.pptx
PDF
KodekX | Application Modernization Development
PDF
Mobile App Security Testing_ A Comprehensive Guide.pdf
PDF
Empathic Computing: Creating Shared Understanding
PDF
cuic standard and advanced reporting.pdf
PPT
Teaching material agriculture food technology
PDF
Building Integrated photovoltaic BIPV_UPV.pdf
PDF
Per capita expenditure prediction using model stacking based on satellite ima...
PDF
Shreyas Phanse Resume: Experienced Backend Engineer | Java • Spring Boot • Ka...
PDF
Electronic commerce courselecture one. Pdf
PPTX
A Presentation on Artificial Intelligence
PDF
Review of recent advances in non-invasive hemoglobin estimation
PDF
Unlocking AI with Model Context Protocol (MCP)
PDF
Encapsulation_ Review paper, used for researhc scholars
PDF
Diabetes mellitus diagnosis method based random forest with bat algorithm
How UI/UX Design Impacts User Retention in Mobile Apps.pdf
Machine learning based COVID-19 study performance prediction
Chapter 3 Spatial Domain Image Processing.pdf
Detection-First SIEM: Rule Types, Dashboards, and Threat-Informed Strategy
Modernizing your data center with Dell and AMD
20250228 LYD VKU AI Blended-Learning.pptx
KodekX | Application Modernization Development
Mobile App Security Testing_ A Comprehensive Guide.pdf
Empathic Computing: Creating Shared Understanding
cuic standard and advanced reporting.pdf
Teaching material agriculture food technology
Building Integrated photovoltaic BIPV_UPV.pdf
Per capita expenditure prediction using model stacking based on satellite ima...
Shreyas Phanse Resume: Experienced Backend Engineer | Java • Spring Boot • Ka...
Electronic commerce courselecture one. Pdf
A Presentation on Artificial Intelligence
Review of recent advances in non-invasive hemoglobin estimation
Unlocking AI with Model Context Protocol (MCP)
Encapsulation_ Review paper, used for researhc scholars
Diabetes mellitus diagnosis method based random forest with bat algorithm
Ad

DSC650 : DATA TECHNOLOGY AND FUTURE EMERGENCE (CHAPTER 4)

  • 1. DSC650: Data Technology and Future Emergence Lecture 4: Data Processing Lecturer: Dr Jasber Kaur
  • 2. Lecture 4: Data Processing Different Type of Data Processing: Parallel, Distributed, Batch, Transactional, Cluster and etc MapReduce Framework, Algorithm and Process Data Real-Time Data Analysis using Apache Spark Scalability and Fault Tolerance Optimization and Data Locality Real World Cases At the end of the lecture, students should be able to; • CLO1: Demonstrate an understanding on the basic concepts and practices of big data technology Big Data: Concepts, Technology, and Architecture, First Edition. Balamurugan Balusamy, Nandhini Abirami. R, Seifedine Kadry, and Amir H. Gandomi. © 2021 John Wiley & Sons, Inc. Published 2021 by John Wiley & Sons, Inc
  • 3. Data Processing • process of collecting, processing, manipulating, and managing the data to generate meaningful information to the end user. • Data may be originated from diversified sources in the form of transactions, observations, and so forth – data capture • Once data is captured, data processing begins. • There are basically two different types of data processing, namely, centralized and distributed data processing.
  • 5. Parallel Data Processing • Simultaneous execution of multiple sub-tasks that collectively comprise a larger task. • The goal is to reduce the execution time by dividing a single larger task into multiple smaller tasks that run concurrently. • Typically achieved within the confines of a single machine with multiple processors.
  • 6. Distributed Data Processing • Similar to parallel data processing in that the same principle of “divide-and-conquer”. • However, distributed data processing is always achieved through physically separate machines that are networked together as a cluster.
  • 7. Cluster • Horizontally scalable storage solutions. • Clusters also provides the mechanism to enable distributed data processing with linear scalability. • Since clusters are highly scalable, Big Data processing as large datasets can be divided into smaller datasets and then processed in parallel in a distributed manner. • When leveraging a cluster, Big Data datasets can either be processed in batch mode or real-time model. • Ideally, a cluster will be comprised of low-cost commodity nodes that collectively provide increased processing capacity. • Other benefits; redundancy and fault tolerance – Consist of physically separate nodes. – Redundancy and fault tolerance allow resilient processing and analysis to occur if a network or node failure occurs.
  • 8. Cluster Multiple stand-alone PCs connected together through a dedicated switch login node acts as the gateway into the cluster
  • 9. Cluster • There are two major types of clusters, namely, – high-availability cluster and – load-balancing cluster • High availability clusters are designed to minimize downtime and provide uninterrupted service when nodes fail. • High availability makes the system highly fault tolerant with many redundant nodes, which sustain faults and failures. – Such systems also ensure high reliability and scalability. – The higher the redundancy, the higher the availability. • Load-balancing clusters are designed to distribute workloads across different cluster nodes to share the service load among the nodes. – If a node goes down, the load from that node is switched over to another node • The main objective of load balancing is to – optimize the use of resources, – minimize response time, – maximize throughput, and – avoid overload on any one of the resources.
  • 10. Processing Workloads • Processing workload defined as the amount and nature of data that is processed within a certain amount of time. • Workloads are usually divided into two types: – Batch – Transactional
  • 11. Processing Workloads Batch • Offline processing • Processing data in batches and usually imposes delays, which in turn results in high-latency responses. • Batch workloads typically involve large quantities of data with sequential read/writes and comprise of groups of read or write queries. • Queries can be complex and involve multiple joins. • OLAP systems commonly process workloads in batches. • Strategic BI and analytics are batch-oriented as they are highly read-intensive tasks involving large volumes of data.
  • 12. Processing Workloads Batch processing Jobs - aggregate the data and keep them available for analysis when required
  • 13. Processing Workloads Transactional • Online processing/real-time processing • Transactional workload processing follows an approach whereby data is processed interactively without delay, resulting in low– latency responses. • Transaction workloads involve small amounts of data with random reads and writes. • OLTP and operational systems, which are generally write-intensive. • Although these workloads contain a mix of read/write queries, they are generally more write-intensive than read-intensive. • Transactional workloads comprise random reads/writes that involve fewer joins than business intelligence and reporting workloads.
  • 14. Processing Workloads Real time processing processed in-memory due to the requirement to analyze the data while it is streaming
  • 15. Processing Workloads Examples - Real time and batch computation systems
  • 16. MapReduce • Batch processing framework. • Highly scalable and reliable • Principle of divide-and-conquer – provides built-in fault tolerance and redundancy. • Has roots in both distributed and parallel computing. • Process schema-less datasets. • A dataset is broken down into multiplesmaller parts, and operations are performed on each part independently and in parallel.
  • 17. Map and Reduce Task • A single processing run of the MapReduce processing engine is known as MapReduce job. • Each MapReduce job is composed of a map task and a reduce task • Each task consists of multiple stages.
  • 18. Map and Reduce Task • Job tracker runs on the master node, and TaskTracker runs on the slave node. • only one TaskTracker per slave node. – TaskTracker and NameNode run in one machine while – JobTracker and DataNode run in another machine, making each node perform both computing and storage tasks.
  • 20. Map and Reduce Task • Step 1: Take the file as input for processing purpose. Any file will consist of group of lines. These lines containing key-value pair of data. Whole file can be read out with this method. • Step 2: In next step file will be in "splitting" mode. This mode will divide file into key, value pair of data. This time key will be offset and data will be value part of program. Each line will be read individually so there is no need to split data manually. • Step 3: Further step is to process the value of each line with associate from counting number. Each individual that is separated from a space counted with number and that number is written with each key. This is the logic of "mapping" that programmer need to write. • Step 4: After that shuffling is performed and with this each key get associated with group of numbers that involved in mapping section. Now scenario become key with string and value will be list of numbers. This will go as input to reducer. • Step 5: In reducer phase whole numbers are counted and each key associated with final counting is the sum of all numbers which leads to final result. • Step 6: Output of reducer phase will lead to final result. This final result will have counting of individual wordcount.
  • 21. MapReduce Algorithms • Task Parallelism – Parallelization of data processing by dividing a task into sub-tasks and running each sub-task on a separate processor, generally on a separate node in a cluster. – Each sub-task generally executes a different algorithm, with its own copy of the same data or different data as its input, in parallel. – Generally, the output from multiple sub-tasks is joined together to obtain the final set of results. • Data Parallelism – Parallelization of data processing by dividing a dataset into multiple datasets and processing each sub-dataset in parallel. – The sub-datasets are spread across multiple nodes and are all processed using the same algorithm. – Generally, the output from each processed sub-dataset is joined together to obtain the final set of results.
  • 22. Realtime Processing • In realtime mode, data is processed in-memory - it is captured before being persisted to the disk. • Response time generally ranges from a sub-second to under a minute. • Realtime mode addresses the velocity characteristic. • Also called event or stream processing as the data either arrives continuously (stream) or at intervals (event). • The individual event/stream datum is generally small in size, but its continuous nature results in very large datasets. • Another related term, interactive mode. • Interactive mode generally refers to query processing in realtime. • Operational BI/analytics are generally conducted in realtime mode.
  • 23. Distributed Data Processing Principle • Fundamental principle - Speed, Consistency and Volume (SCV) principle. • Speed – refers to how quickly the data can be processed once it is generated. – In the case of realtime analytics, data is processed comparatively faster than batch analytics. – This generally excludes the time taken to capture data and focuses only on the actual data processing, such as generating statistics or executing an algorithm. • Consistency – refers to the accuracy and precision of results. - Results are deemed accurate if they are close to the correct value and precise if close to each other. • Volume – refers to the amount of data that can be processed. - huge volumes of data that need to be processed in a distributed manner. - Processing such voluminous data in its entirety while ensuring speed and consistency is not possible.
  • 24. Realtime Processing - Spark • for data processing • designed to be fast and general purpose • based on cluster computing platform • uses in-memory distributed computing • can run on top of existing Hadoop environment / can also run as standalone • provides shell support (interactive programming environment) • supports different types of workloads (batch and streaming data)
  • 25. Realtime Processing - Spark Spark Architecture • Apache Spark uses a master/slave/worker architecture. • A driver program runs on the master node and talks to an executor on worker node. • Spark applications run as independent sets of processes, which is coordinated by the SparkContext object and created by the driver program. • Spark can run in standalone mode or on a cluster of many nodes. • SparkContext talks to the cluster manager to allocate resources across applications. • Spark acquires executors on nodes in the cluster, and then sends application code to them. • Executors are processes that actually run the application code and store data for these applications.
  • 26. Realtime Processing - Spark • Spark's core concept is Resilient Distributed Dataset (RDD). • RDD is a fault-tolerant collection of elements. • RDD can be operated on in parallel. • It is a resilient and distributed collection of records, which can be at one partition or more, depending on the configuration. • RDD is an immutable distributed collection of objects, which implies that you cannot change data in RDD but you can apply transformation on one RDD to get another one as a result. • It abstracts away the complexity of working in parallel. • Resilient: Fault tolerant, able to re-compute when it has missing records or damaged partitions due to node failures. • Distributed: Data resides on multiple nodes in a cluster. • Dataset: A collection of partitioned data with a key-value pair or primitive values called tuples. Represents the records of data you work with.
  • 27. Realtime Processing - Spark Example
  • 28. Realtime Processing - Spark Additional traits;
  • Immutable: an RDD never changes once created; it is read-only and can only be transformed into another RDD using transformation operations.
  • Lazy evaluated: transformations are not executed when declared; for example, you load data into an RDD and apply a filter, but no work happens until you ask for a count (an action).
  • In-memory: Spark keeps as much of an RDD in memory as it can, for as long as possible.
  • Typed: RDD records are strongly typed, like Int in RDD[Int] or the tuple (Int, String) in RDD[(Int, String)].
  • Cacheable: an RDD can be cached in memory or persisted to storage for reuse.
  • Partitioned: data is split into a number of logical partitions based on multiple parameters and then distributed across nodes in the cluster.
  • Parallel: after partitioning, an RDD is acted upon in parallel across the nodes, which is the abstraction it provides.
  • Location aware: RDDs have location preferences; Spark tries to create partitions as close to the data as possible, provided resources are available (data locality).
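The "lazy evaluated" trait can be demonstrated with Python generators, which share the same deferred-execution model (this is a simplified analogy, not Spark code): building the pipeline records *what* to do, and nothing is computed until an action forces it.

```python
# Sketch of lazy evaluation: the side-effect list proves when the
# "load" work actually happens.
evaluated = []

def load():
    # Stand-in for loading data into an RDD.
    for x in [1, 2, 3, 4, 5]:
        evaluated.append(x)       # records that this element was processed
        yield x

# "Transformation": declare a filter, but do not run it yet.
pipeline = (x for x in load() if x % 2 == 0)
print(evaluated)                  # [] -- nothing has been computed

# "Action": asking for a count triggers the whole pipeline.
count = sum(1 for _ in pipeline)
print(count)                      # 2
print(evaluated)                  # [1, 2, 3, 4, 5]
```

Spark works the same way at cluster scale: calling `filter` on an RDD just extends the lineage graph, and only an action such as `count()` or `collect()` launches a job.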
  • 29. MapReduce vs Spark MapReduce
  • Significant overhead is associated with each MapReduce job – job creation and coordination.
  • Batch-oriented processing of large amounts of data that has been stored to disk.
  • Cannot process data incrementally; it can only process complete datasets, and therefore requires all input data to be available in its entirety before the data processing job executes.
  Spark
  • Realtime processing.
  • Well suited to iterative and interactive workloads on datasets that fit in cluster memory.
  • In-memory processing (not persistent, but faster).
  • Uses parallel computing, but is not restricted to the rigid mapper-and-reducer model.
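The rigid map → shuffle → reduce shape that the comparison refers to can be hand-rolled for a word count in plain Python (a sketch of the MapReduce model, not Hadoop code); Spark would express the same job as a chain of transformations on an RDD instead of a fixed mapper/reducer pair.

```python
# Word count in the classic MapReduce shape: map, shuffle, reduce.
from collections import defaultdict

lines = ["big data", "big spark", "data data"]

# Map phase: emit a (word, 1) pair for every word.
mapped = [(word, 1) for line in lines for word in line.split()]

# Shuffle phase: group the emitted values by key.
groups = defaultdict(list)
for word, one in mapped:
    groups[word].append(one)

# Reduce phase: sum the values for each key.
counts = {word: sum(ones) for word, ones in groups.items()}
print(counts)   # {'big': 2, 'data': 3, 'spark': 1}
```

In Hadoop each phase runs as separate processes with intermediate results written to disk, which is where the per-job overhead comes from; Spark keeps the intermediate data in memory between stages.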
  • 30. Real Use Cases Examples MapReduce
  • Log Analysis: Many companies generate huge volumes of log data from various systems and applications. MapReduce can be used to efficiently analyze these logs to extract valuable insights, such as identifying system failures, user behavior patterns, or security threats.
  • Search Engine Indexing: Search engines like Google and Bing use MapReduce to process and index web pages efficiently. MapReduce can distribute the task of crawling and indexing web pages across multiple nodes in a cluster, allowing search engines to handle massive amounts of data.
  • Social Media Analytics: Social media platforms analyze vast amounts of user-generated content to extract insights, such as trending topics, sentiment analysis, and user behavior analysis. MapReduce can be used to process this data in parallel, enabling real-time or near-real-time analytics.
  • E-commerce Recommendation Systems: E-commerce companies utilize MapReduce to analyze customer behavior, purchase history, and product interactions to generate personalized recommendations for users.
  • Genomics and Bioinformatics: In genomics research, scientists use MapReduce to analyze large DNA sequencing datasets for tasks such as genome assembly, variant calling, and gene expression analysis.
  Spark
  • Real-time Stream Processing: Spark Streaming enables real-time processing of streaming data from sources like IoT devices, sensors, social media feeds, and financial transactions. Industries such as finance, telecommunications, and online advertising use Spark Streaming for real-time analytics and decision-making.
  • Graph Processing: Spark's GraphX library allows companies to analyze and process large-scale graphs and networks efficiently. Use cases include social network analysis, fraud detection, recommendation systems, and network infrastructure optimization.
  • Data Warehousing and ETL: Spark SQL provides a unified interface for querying structured data using SQL queries, making it suitable for data warehousing and ETL (Extract, Transform, Load) tasks. Companies use Spark SQL to query and analyze data stored in various formats and data sources, such as HDFS, HBase, and relational databases.
  • Large-scale Data Processing: Spark's general-purpose data processing capabilities make it suitable for a wide range of big data processing tasks, including data cleansing, transformation, aggregation, and analysis. Industries such as finance, healthcare, retail, and manufacturing leverage Spark for processing and analyzing large volumes of data efficiently.