ISSN 2348-1196 (print)
International Journal of Computer Science and Information Technology Research ISSN 2348-120X (online)
Vol. 3, Issue 4, pp: (159-161), Month: October - December 2015, Available at: www.researchpublish.com
A Data Aware Caching (Dache) for Big-Data
Applications Using the MapReduce Framework
Utkarsh Honey 1, Yogesh More 2, Prasad Wandhekar 3, Santosh Wayal 4, Prof. Jayashree Chaudhari 5
1, 2, 3, 4, 5 Dr. D Y Patil School of Engineering, Pune, India 411047
Abstract: Big data refers to large-scale distributed data-processing applications that operate on exceptionally large amounts of data. Google's MapReduce and Apache Hadoop, its open-source implementation, are the standard software systems for big-data applications. One observation about the MapReduce framework is that it generates a large amount of intermediate data, which the framework cannot reuse and therefore discards after use. We propose Dache, a data-aware cache framework for big-data applications. In Dache, tasks submit their intermediate results to the cache manager and query the cache manager before executing the actual computing work. A novel cache description scheme and a cache request and reply protocol are designed.
Keywords: Big data, MapReduce, Hadoop, caching.
I. INTRODUCTION
Google MapReduce is a programming model and a software framework for large-scale distributed computing on large amounts of data. Fig. 1 illustrates the high-level workflow of a MapReduce job. Application developers specify the computation in terms of a map function and a reduce function, and the underlying MapReduce task-scheduling system automatically parallelizes the computation across a cluster of machines. MapReduce has gained popularity for its simple programming interface and excellent performance in implementing a large spectrum of applications. Since most such applications take a huge amount of input data, they are called "big-data applications".
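
To make the programming model concrete, the following minimal word-count example is written against the standard Hadoop MapReduce API. It is our illustration, not code from this paper: the map function emits a (word, 1) pair for every word, and the reduce function sums the counts after the shuffle and sort.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map phase: emit (word, 1) for every word in an input record.
// (Each public class would live in its own .java file.)
public class WordCountMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        for (String token : value.toString().split("\\s+")) {
            if (token.isEmpty()) continue;
            word.set(token);
            context.write(word, ONE);
        }
    }
}

// Reduce phase: sum the counts for each word after shuffle and sort.
public class WordCountReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        context.write(key, new IntWritable(sum));
    }
}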
Input data is first split and then fed to workers in the map phase. Individual data items are called records. The MapReduce system parses the input split given to each worker and produces records. After the map phase, the intermediate results generated in the map phase are shuffled and sorted by the MapReduce system and then fed to the workers in the reduce phase. Final results are computed by multiple reducers and written back to disk.

Hadoop is an open-source implementation of the Google MapReduce programming model. Hadoop includes Hadoop Common, which provides access to the file systems supported by Hadoop. The Hadoop Distributed File System (HDFS) provides distributed file storage and is optimized for large, immutable blocks of data. A small Hadoop cluster consists of a single master and multiple worker (slave) nodes. The master node runs multiple processes, including a JobTracker and a NameNode: the JobTracker is responsible for managing the running jobs in the Hadoop cluster, while the NameNode manages the HDFS. The JobTracker and the NameNode are typically collocated on the same physical machine. The other servers in the cluster run TaskTracker and DataNode processes. A MapReduce job is divided into a number of tasks, which are managed by the TaskTrackers. The TaskTrackers and the DataNodes are collocated on the same servers to provide data locality.

MapReduce provides a standardized framework for implementing large-scale distributed computations, known as big-data applications. Still, the system has a restriction: inefficiency in incremental processing. Incremental processing refers to applications that incrementally grow the input data and continuously apply computations to this input to produce output. Potentially duplicate computations are performed in this process, and MapReduce does not have any technique to identify such duplicate computations and accelerate job execution. Motivated by this observation, in this paper we propose Dache, a data-aware cache scheme for big-data applications using the MapReduce framework, which aims at extending the MapReduce framework by provisioning a cache layer for efficiently identifying and accessing cache items in a MapReduce job.
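
The paper describes the cache manager only at the protocol level, so the following Java sketch merely illustrates one plausible shape for the cache description and the request/reply exchange; all names here (CacheDescription, CacheManager, request, submit) are our own illustrative assumptions, not interfaces from Dache itself.

import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

// A cache item is described by the input split it was computed from and
// the operation applied to it; together they identify a cached result.
final class CacheDescription {
    final String inputSplit; // e.g. an HDFS path plus a byte range
    final String operation;  // identifier of the map/reduce function applied

    CacheDescription(String inputSplit, String operation) {
        this.inputSplit = inputSplit;
        this.operation = operation;
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof CacheDescription)) return false;
        CacheDescription d = (CacheDescription) o;
        return inputSplit.equals(d.inputSplit) && operation.equals(d.operation);
    }

    @Override
    public int hashCode() {
        return 31 * inputSplit.hashCode() + operation.hashCode();
    }
}

// Minimal cache manager: tasks query it before computing (request) and
// publish the location of their intermediate results afterwards (submit).
final class CacheManager {
    private final Map<CacheDescription, String> items = new ConcurrentHashMap<>();

    // Reply to a cache request: the location of a cached result, if any.
    Optional<String> request(CacheDescription desc) {
        return Optional.ofNullable(items.get(desc));
    }

    // Tasks submit the location of their intermediate output.
    void submit(CacheDescription desc, String resultLocation) {
        items.put(desc, resultLocation);
    }
}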
II. OBJECTIVES AND SCOPE
The scope of the work covers the following:
 Requires minimal changes to the original MapReduce programming model.
 Application code requires only slight changes in order to utilize Dache.
 Implements Dache in Hadoop by extending the relevant components.
 Experiments show that it can eliminate the duplicate tasks in incremental MapReduce jobs.
 Minimizes execution time and CPU utilization.
Problem Statement:
The problem with the current Hadoop MapReduce framework is that it generates a large flow of intermediate data which it cannot save, so the data is deleted after use. In our system we introduce a cache that holds these intermediate results, so that data execution is faster than in the old system; both the time consumed and the repetition of data processing are reduced. A task-level sketch of this control flow is given below.
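
To make the intended control flow concrete, the sketch below reuses the illustrative CacheManager from the introduction: a task consults the cache first and computes (and publishes) its result only on a miss. The runMapTask helper is hypothetical and stands in for the actual map computation.

// Task-side use of the illustrative cache layer sketched earlier.
final class CachedTask {
    private final CacheManager cache;

    CachedTask(CacheManager cache) {
        this.cache = cache;
    }

    // Query the cache manager before executing the actual computing work;
    // on a miss, compute, then submit the result for future jobs to reuse.
    String processSplit(String split, String op) {
        CacheDescription desc = new CacheDescription(split, op);
        return cache.request(desc).orElseGet(() -> {
            String result = runMapTask(split, op);
            cache.submit(desc, result);
            return result;
        });
    }

    // Hypothetical placeholder for the real map computation.
    private String runMapTask(String split, String op) {
        return "output-of-" + op + "-on-" + split;
    }
}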
MapReduce Architecture:
Fig. 1: MapReduce Architecture
Operations:
 Clients submit jobs to the JobTracker (a minimal driver is sketched after this list).
 The JobTracker talks to the NameNode.
 The JobTracker creates an execution plan.
 The JobTracker submits work to the TaskTrackers.
 TaskTrackers report progress via heartbeats.
 The JobTracker manages the phases.
 The JobTracker updates the status.
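
The first step above, job submission, is what a standard Hadoop driver performs; the minimal driver below reuses the word-count mapper and reducer from the introduction. The input and output paths are illustrative, and in classic (pre-YARN) Hadoop this submission goes to the JobTracker.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Driver: configures a word-count job and submits it to the cluster.
public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path("/input"));    // illustrative
        FileOutputFormat.setOutputPath(job, new Path("/output")); // illustrative
        // waitForCompletion submits the job and polls its status until done.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}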
Software & Hardware Requirements:
Software Requirements:
Operating System : Windows XP and above
Database : MySQL, Hadoop
Language : Java
Hardware Requirements:
System : Pentium IV, 2.4 GHz
Hard Disk : 40 GB
Monitor : 15-inch VGA color
RAM : 512 MB
III. CONCLUSIONS
This paper presents the design and evaluation of a data-aware cache framework that requires minimal changes to the original MapReduce programming model in order to provision incremental processing for big-data applications using the MapReduce model. We propose a data-aware cache description scheme, architecture, and protocol. The presented method requires only a slight modification to the input-format processing and task management of the MapReduce framework; as a result, application code requires only slight changes in order to use data-aware caching. We implement the scheme in Hadoop by extending the relevant components. In the future, we propose to adapt our framework to more general application scenarios and to implement the scheme in the Hadoop project.