Ch. 5
Data-Intensive Technologies for Cloud Computing
SUPERVISED BY: DR. ABBAS AL-BAKRY
BY: HUDA HAMDAN ALI
Contents:
 Introduction
 Data-Intensive Computing Applications.
 Data-Parallelism.
 The “Data Gap”.
 Processing Approach.
 Common Characteristics of data-intensive computing systems.
 Grid Computing.
 Data-Intensive System Architectures.
 Google MapReduce.
 Hadoop.
 LexisNexis HPCC.
 Architecture Comparison
Introduction
 Data-intensive computing represents a new computing paradigm which can address the data gap
using scalable parallel processing, allowing government agencies, commercial organizations, and
research environments to process massive amounts of data and implement applications previously
thought to be impractical or infeasible.
 Cloud computing provides the opportunity for organizations with limited internal resources to
implement large-scale data-intensive computing applications in a cost-effective manner.
 Data-intensive computing can be implemented in a public cloud or as a private cloud.
Data-Intensive Computing Applications
 Parallel processing approaches can be generally classified as either compute-intensive or data-intensive.
 Compute-intensive is used to describe application programs that are compute bound. In
compute-intensive applications, multiple operations are performed simultaneously, with each
operation addressing a particular part of the problem.
 Data-intensive is used to describe applications that are I/O bound or that need to process large
volumes of data.
Cont..
 Parallel processing of data-intensive applications typically involves partitioning or subdividing
the data into multiple segments, each of which can be processed independently using the same
executable application program in parallel on an appropriate computing platform; the results are
then reassembled to produce the completed output data (see the sketch at the end of this list).
 The fundamental challenges for data-intensive computing are managing and processing
exponentially growing data volumes, significantly reducing the associated data-analysis cycles to
support practical, timely applications, and developing new algorithms which can scale to search
and process massive amounts of data.
 Cloud computing can address these challenges with the capability to provision new computing
resources or extend existing resources to provide parallel computing capabilities which scale to
match growing data volumes.
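A minimal single-machine sketch of this partition, process-in-parallel, and reassemble pattern, using Python's multiprocessing module rather than a cluster framework; the dataset and the per-segment function are hypothetical stand-ins:

```python
from multiprocessing import Pool

def process_segment(segment):
    # The same executable logic is applied to every segment independently;
    # counting even numbers stands in for a real analysis step.
    return sum(1 for x in segment if x % 2 == 0)

def partition(data, n_segments):
    # Subdivide the data into roughly equal, independent segments.
    step = max(1, len(data) // n_segments)
    return [data[i:i + step] for i in range(0, len(data), step)]

if __name__ == "__main__":
    data = list(range(1_000_000))                # stand-in for a large dataset
    segments = partition(data, n_segments=8)
    with Pool(processes=8) as pool:
        partial = pool.map(process_segment, segments)  # processed in parallel
    print(sum(partial))                          # reassemble: prints 500000
```

On a real data-intensive platform the segments would live on the disks of separate cluster nodes, but the structure of the computation is the same.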
Data-Parallelism
 Computer system architectures which can support data-parallel applications are a potential solution to
terabyte and petabyte scale data processing requirements.
 We can define data-parallelism as a computation applied independently to each data item of a set of
data which allows the degree of parallelism to be scaled with the volume of data.
 The most important reason for developing data-parallel applications is the potential for scalable
performance, which may result in several orders of magnitude performance improvement.
 The key issues with developing applications using data-parallelism are the choice of the algorithm, the
strategy for data decomposition, load balancing on processing nodes, message passing
communications between nodes, and the overall accuracy of the results.
The “Data Gap”
 Data gaps can be created in a number of ways. The data might not exist, it might not be
accessible, it might not be complete, or it might not be evaluated and studied adequately,
according to the Open Group, a not-for-profit consortium that stresses business efficiency.
 Such gaps can cause problems for the data management and analysis systems an organization
relies on, and make its work even more difficult and time-consuming than it already is.
 How will the “Data Gap” be addressed and bridged?
 The answer is scalable computer hardware and software architectures designed for data-
intensive computing applications which can scale to processing billions of records per second
(BORPS).
Processing Approach
 Current data-intensive computing platforms use a “divide and conquer” parallel processing
approach, combining multiple processors and disks in large computing clusters connected using
high-speed communications switches and networks, which allows the data to be partitioned
among the available computing resources and processed independently to achieve performance
and scalability based on the amount of data (Fig. 5.1).
 We can define a cluster as “a type of parallel and distributed system, which consists of a collection
of inter-connected stand-alone computers working together as a single integrated computing
resource.”
 This approach to parallel processing is often referred to as a “shared nothing” approach since
each node, consisting of processor, local memory, and disk resources, shares nothing with other
nodes in the cluster.
Common Characteristics of data-intensive computing systems
 To achieve high performance in data-intensive computing, it is important to minimize the
movement of data. Instead of moving the data, the program or algorithm is transferred to the
nodes with the data that needs to be processed.
 An important characteristic of data-intensive computing systems is the focus on reliability and
availability.
Data-intensive computing systems are designed to be fault resilient. This includes redundant copies
of all data files on disk, storage of intermediate processing results on disk, automatic detection of
node or processing failures, and selective re-computation of results.
 Another important characteristic of data-intensive computing systems is the inherent scalability of the
underlying hardware and software architecture. The number of nodes and processing tasks
assigned for a specific application can be variable or fixed depending on the hardware, software,
communications, and distributed file system architecture.
Grid Computing
 A computing grid is typically heterogeneous in nature (nodes can have different processor,
memory, and disk resources) and consists of multiple disparate computers distributed across
organizations, and often geographically, using wide-area networking communications that
usually have relatively low bandwidth.
 Grids are typically used to solve complex computational problems which are compute-intensive
and require only small amounts of data for each processing node.
 In contrast, data-intensive computing systems are typically homogeneous in nature (nodes in the
computing cluster have identical processor, memory, and disk resources), use high-bandwidth
communications between nodes such as gigabit Ethernet switches, and are located in close
proximity in a data center using high-density hardware such as rack-mounted blade servers.
 Geographically dispersed grid systems are more difficult to manage, less reliable, and less secure
than data-intensive computing systems which are usually located in secure data center
environments.
Data-Intensive System Architectures
 A variety of system architectures have been implemented for data-intensive and large-scale data
analysis applications including parallel and distributed relational database management systems
which have been available to run on shared nothing clusters of processing nodes for more than
two decades.
 Although these systems have the ability to run parallel applications and queries expressed in the
SQL language, they are typically not general-purpose processing platforms.
 Internet companies such as Google, Yahoo, Microsoft, Facebook, and others required a new
processing approach to effectively deal with the enormous amount of Web data for applications
such as search engines and social networking.
 Several solutions have emerged, including the MapReduce architecture pioneered by Google and
now available in an open source implementation called Hadoop, used by Yahoo, Facebook, and
others.
Google MapReduce
 The MapReduce architecture and programming model pioneered by Google is an example of a
modern systems architecture designed for processing and analyzing large datasets and is being
used successfully by Google in many applications to process massive amounts of raw Web data.
 Since the system automatically takes care of details like partitioning the input data, scheduling and
executing tasks across a processing cluster, and managing the communications between nodes,
programmers with no experience in parallel programming can easily use a large distributed
processing environment.
 The programming model for the MapReduce architecture is a simple abstraction where the
computation takes a set of input key-value pairs associated with the input data and produces a set
of output key-value pairs.
 The overall model for this process is shown in Fig. 5.2.
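A minimal single-process sketch of this key-value programming model using word count, the canonical MapReduce example; run_mapreduce, map_fn, and reduce_fn are illustrative names, and nothing here is distributed:

```python
from collections import defaultdict

def map_fn(key, value):
    # Map: (document_name, document_text) -> list of (word, 1) pairs.
    return [(word, 1) for word in value.split()]

def reduce_fn(key, values):
    # Reduce: (word, [1, 1, ...]) -> (word, total_count).
    return (key, sum(values))

def run_mapreduce(inputs, map_fn, reduce_fn):
    intermediate = defaultdict(list)
    for k, v in inputs:                      # Map phase
        for k2, v2 in map_fn(k, v):
            intermediate[k2].append(v2)      # shuffle: group values by key
    return [reduce_fn(k2, vs) for k2, vs in intermediate.items()]  # Reduce phase

docs = [("doc1", "the quick brown fox"), ("doc2", "the lazy dog")]
print(run_mapreduce(docs, map_fn, reduce_fn))
# [('the', 2), ('quick', 1), ('brown', 1), ('fox', 1), ('lazy', 1), ('dog', 1)]
```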
 In the Google implementation of MapReduce, functions are coded in the C++ programming
language.
 Underlying the MapReduce architecture is the Google File System (GFS).
GFS was designed to be a high-performance, scalable distributed file system for very large
data files and data-intensive applications providing fault tolerance and running on clusters of
commodity hardware.
 GFS has proven to be highly effective for data-intensive computing on very large files, but is
less effective for small files which can cause hot spots if many MapReduce tasks are accessing
the same file.
Hadoop
 Hadoop is an open source software project sponsored by The Apache Software Foundation.
 The Hadoop MapReduce architecture is functionally similar to the Google implementation except
that the base programming language for Hadoop is Java instead of C++.
 The implementation is intended to execute on clusters of commodity processors (Fig. 5.4) utilizing
Linux as the operating system environment.
 Hadoop clusters also utilize the “shared nothing” distributed processing paradigm linking
individual systems with local processor, memory, and disk resources using high-speed
communications switching capabilities.
 The flexibility of Hadoop configurations allows small clusters to be created for testing and
development using desktop systems or any Unix/Linux system that provides a JVM environment.
 The Hadoop MapReduce architecture is similar to the Google implementation creating fixed-size
input splits from the input data and assigning the splits to Map tasks.
 The local output from the Map tasks is copied to Reduce nodes where it is sorted and merged for
processing by Reduce tasks which produce the final output as shown in Fig. 5.5.
 Hadoop implements a distributed data processing scheduling and execution environment and
framework for MapReduce jobs.
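Hadoop's native API is Java, but the Hadoop Streaming utility included with Hadoop lets any executable that reads stdin and writes stdout serve as the Map and Reduce tasks, which allows a compact word-count sketch in Python (script names are illustrative):

```python
#!/usr/bin/env python3
# mapper.py: emit one "word<TAB>1" line per word in the input split.
import sys

for line in sys.stdin:
    for word in line.split():
        print(f"{word}\t1")
```

```python
#!/usr/bin/env python3
# reducer.py: input arrives sorted by key, so counts can be
# accumulated until the key changes.
import sys

current_word, count = None, 0
for line in sys.stdin:
    word, value = line.rstrip("\n").split("\t", 1)
    if word != current_word:
        if current_word is not None:
            print(f"{current_word}\t{count}")
        current_word, count = word, 0
    count += int(value)
if current_word is not None:
    print(f"{current_word}\t{count}")
```

A typical (version-dependent) invocation passes these scripts to the streaming jar, along the lines of hadoop jar hadoop-streaming.jar -input in -output out -mapper mapper.py -reducer reducer.py; the framework performs the input splitting and the shuffle and sort between the two scripts.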
Cont..
 Input data splits are assigned to Map tasks on the nodes where the data resides whenever possible, a process called data locality optimization.
 The number of Reduce tasks is determined independently and can be user-specified.
 As with the Google MapReduce implementation, all Map tasks must complete before the
shuffle and sort phase can occur and Reduce tasks can be initiated.
 The Hadoop framework also supports Combiner functions which can reduce the amount of
data movement in a job.
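For word count, the Reduce logic itself can serve as the Combiner, pre-aggregating each Map task's local output before it crosses the network. The same data-movement saving can be sketched as "in-mapper combining" in Python (hypothetical input):

```python
from collections import Counter

def map_with_local_combining(lines):
    # Aggregate counts locally so the node emits one (word, n) pair per
    # distinct word instead of n separate (word, 1) pairs.
    local_counts = Counter()
    for line in lines:
        local_counts.update(line.split())
    return list(local_counts.items())     # far fewer pairs to shuffle

print(map_with_local_combining(["the quick fox", "the lazy dog the"]))
# [('the', 3), ('quick', 1), ('fox', 1), ('lazy', 1), ('dog', 1)]
```

In Hadoop itself, a Combiner class is configured on the job in the Java API, and Hadoop Streaming accepts a -combiner option.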
HDFS
 Hadoop includes a distributed file system called HDFS which is analogous to GFS in the
Google MapReduce implementation.
 HDFS also follows a master/slave architecture, with a single master server, called the Namenode,
that manages the distributed filesystem namespace and regulates access to files by clients. In
addition, there are multiple Datanodes, one per node in the cluster, which manage the disk
storage attached to the nodes and assigned to Hadoop.
 The Namenode determines the mapping of blocks to Datanodes. The Datanodes are
responsible for serving read and write requests from filesystem clients such as MapReduce
tasks, and they also perform block creation, deletion, and replication based on commands
from the Namenode.
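A toy model of the Namenode's role, purely illustrative (real HDFS tracks block sizes, heartbeats, rack placement, and much more; all names here are hypothetical):

```python
import random

class ToyNamenode:
    # Holds only metadata: which Datanodes store each block of each file.
    def __init__(self, datanodes, replication=3):
        self.datanodes = datanodes
        self.replication = replication
        self.block_map = {}   # (filename, block_index) -> [datanode, ...]

    def allocate_block(self, filename, block_index):
        # Pick `replication` distinct Datanodes to hold redundant copies;
        # the client then writes the block data to those nodes directly.
        replicas = random.sample(self.datanodes, self.replication)
        self.block_map[(filename, block_index)] = replicas
        return replicas

    def locate_block(self, filename, block_index):
        # Clients such as Map tasks ask where replicas live, then read
        # the block data directly from a Datanode, not through the master.
        return self.block_map[(filename, block_index)]

nn = ToyNamenode(datanodes=[f"dn{i}" for i in range(1, 7)])
print(nn.allocate_block("/data/input.txt", 0))   # e.g. ['dn4', 'dn1', 'dn6']
print(nn.locate_block("/data/input.txt", 0))
```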
LexisNexis HPCC
 LexisNexis, an industry leader in data content, data aggregation, and information services,
independently developed and implemented a solution for data-intensive computing called the
HPCC (High-Performance Computing Cluster), which is also referred to as the Data Analytics
Supercomputer (DAS).
 LexisNexis developers recognized that meeting all the requirements of data-intensive computing
applications in an optimum manner required the design and implementation of two distinct
processing environments: system configurations to support both parallel batch data processing
(Thor) and high-performance online query applications using indexed data files (Roxie).
 The HPCC platform also includes a data-centric declarative programming language for parallel
data processing called ECL.
 The LexisNexis vision for this computing platform is depicted in Fig. 5.9.
Cont..
 The first of these platforms is a data refinery whose overall purpose is the general processing of
massive volumes of raw data of any type for any purpose, but which is typically used for data
cleansing and hygiene; extract, transform, load (ETL) processing of the raw data; record linking
and entity resolution; large-scale ad-hoc complex analytics; and creation of keyed data and
indexes to support high-performance structured queries and data warehouse applications. The
data refinery is also referred to as Thor. A Thor cluster is similar in its function, execution
environment, filesystem, and capabilities to the Google and Hadoop MapReduce platforms.
 The second of the parallel data processing platforms is called Roxie and functions as a rapid data
delivery engine. Roxie utilizes a distributed indexed filesystem to provide parallel processing of
queries using an optimized execution environment and filesystem for high-performance online
processing.
ECL
 The ECL programming language is a key factor in the flexibility and capabilities of the HPCC
processing environment.
 ECL was designed to be a transparent and implicitly parallel programming language for data-intensive applications.
 It is a high-level, declarative, non-procedural dataflow-oriented language that allows the programmer
to define what the data processing result should be and the dataflows and transformations that are
necessary to achieve the result.
 It combines data representation with algorithm implementation, and is the fusion of both a query
language and a parallel data processing language.
 The ECL language includes extensive capabilities for data definition, filtering, data management, and
data transformation, and provides an extensive set of built-in functions to operate on records in
datasets which can include user-defined transformation functions.
Hadoop vs. HPCC Architecture Comparison
 Hadoop MapReduce and the LexisNexis HPCC platform are both scalable architectures directed
towards data-intensive computing solutions.
 Hadoop is an open source platform which increases its flexibility and adaptability to many problem
domains since new capabilities can be readily added by users adopting this technology.
 The LexisNexis HPCC platform is an integrated set of systems, software, and other architectural
components designed to provide data-intensive computing capabilities from raw data processing,
to high-performance query processing and data mining.
 The LexisNexis HPCC is a mature, reliable, well proven, commercially supported system platform
used in government installations, research labs, and commercial enterprises.
 The HPCC architecture offers a higher level of integration of system components, an execution
environment not limited by a specific computing paradigm such as MapReduce, and high
programmer productivity utilizing the ECL programming language and tools.