BY
B.REVATHI REDDY
(19/11/2016)
BIG DATA
What is Big Data?
Big data refers to the huge amounts of digital information collected from many different sources.
Big Data is transforming the way we do everyday things: almost everything we do leaves a digital trace that can be collected and analyzed.
Big Data also refers to our ability to make use of ever-increasing volumes of data, with the aim of solving new problems, or old problems in a better way.
Data generated by us:
Mobile devices
Conversation data
Photo and video image data
Social network data
Satellites
Internet of Things data
Big Data is characterized by the 3 V's:
Volume – data quantity
Velocity – data speed
Variety – types of data
Storing Big Data
Analyzing data characteristics
Selecting data sources for analysis
Eliminating redundant data
Processing Big Data
Mapping data to the programming framework
Connecting and extracting data from storage
Transforming data for processing
Subdividing data for Hadoop MapReduce
Creating the components of Hadoop MapReduce jobs
Executing Hadoop MapReduce Jobs
Monitoring the progress of the job flows
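As a rough illustration of the "subdividing data" step, the sketch below cuts a list of records into fixed-size splits so that each split could be handed to a separate map task. The function name `split_records` and the split size are assumptions made for this example; Hadoop itself derives its input splits from file blocks rather than from record counts.

```python
def split_records(records, split_size):
    """Yield consecutive slices of at most split_size records."""
    for start in range(0, len(records), split_size):
        yield records[start:start + split_size]


# Ten hypothetical records, cut into splits of at most four records each.
records = [f"record-{i}" for i in range(10)]
splits = list(split_records(records, split_size=4))

for number, split in enumerate(splits):
    print(number, split)
```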
The Structure of Big Data
Structured – traditional data sources; data stored in fields in a database.
Semi-structured – a form of structured data that does not conform to the formal structure of relational database data models, but has tags or other markers to separate semantic elements within the data.
Unstructured – video data, audio data, and other data that does not reside in a traditional row-column database.
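The three categories can be made concrete with a small sketch. The sample values below (names, cities, tags) are invented for illustration only: a CSV row stands in for structured data, a JSON record for semi-structured data, and a free-text sentence for unstructured data.

```python
import csv
import io
import json

# Structured: fixed fields, as in a database table or a CSV export.
structured = io.StringIO("id,name,city\n1,Asha,Hyderabad\n")
rows = list(csv.DictReader(structured))

# Semi-structured: keys/tags mark the elements, but records need not share one schema.
semi_structured = json.loads('{"id": 2, "name": "Ravi", "tags": ["mobile", "video"]}')

# Unstructured: free text (or audio/video bytes) with no field layout at all.
unstructured = "Had a great time at the match today!"

print(rows[0]["city"])          # field access by column name
print(semi_structured["tags"])  # nested element found by its key
print(len(unstructured.split()))  # only generic processing, such as word counting
```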
How is Big Data actually used?
Some examples…
Better understand and target customers
Understand and optimize business processes
Improving health
Improving security
Improving sports performance
Improving and optimizing cities and countries
There are endless applications of Big Data. Any business that doesn't seriously consider the implications of big data runs the risk of being left behind!
Infrastructure of Big Data
To handle the different dimensions of big data (volume, velocity, and variety), an effective and efficient design is needed to process large amounts of data arriving at high speed from different sources. Several phases are involved:
Multi-source Big data generation
Big data Storage
Big data Processing
Cloud Computing and Big Data
Big Data needs massive amounts of memory or storage space to hold all the data. This is where Cloud Computing comes into the picture: it is cost-saving and scalable, and it provides a variety of services such as huge processing power and high storage capacity.
Survey paper on Big Data (IEEE)
Ms. Vibhavari Chavan, Prof. Rajesh N. Phursule (IJCSIT paper)
Big Data usually includes data sets with sizes beyond the ability of commonly used software tools to capture, manage, and process the data within a tolerable elapsed time.
The size of big data is a constantly moving target.
Big Data is a set of techniques and technologies that require new forms of integration to uncover large hidden values from large data sets.
A big data environment is used to organize and analyze various types of data.
The MapReduce framework generates a lot of intermediate data.
Hadoop
Hadoop is an open source framework
The Hadoop framework is written in Java
Response time varies depending on the complexity of the process
Massive scalability is the key advantage
Currently used for web search indexing, email spam detection, prediction in financial services, etc.
Hadoop consists of 2 components:
HDFS (stores the data) and MapReduce (processes it)
HDFS
HDFS is the file system component of the Hadoop framework, designed and optimized to store large amounts of data on low-cost hardware. The architecture of HDFS has:
NameNode – the master node, which holds the metadata: the addresses of all DataNodes, their free space, whether each DataNode is active or passive, and where the stored data resides. The JobTracker is associated with this master node.
DataNode – a slave node in Hadoop that stores the data. A TaskTracker runs on each DataNode; it tracks the jobs in progress on that node and the jobs assigned to it by the master node.
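The division of labour between the NameNode (metadata) and the DataNodes (actual data) can be sketched as a toy model. Everything here is an assumption for illustration: the tiny block size, the replication factor, the node names, and the round-robin placement, which is much simpler than the placement policy real HDFS uses.

```python
from itertools import cycle

BLOCK_SIZE = 8    # bytes; deliberately tiny (real HDFS blocks are far larger)
REPLICATION = 2   # copies kept of each block; an assumed value for this sketch


def place_blocks(data, data_nodes):
    """Return NameNode-style metadata: block index -> DataNodes holding a replica."""
    nodes = cycle(data_nodes)
    n_blocks = (len(data) + BLOCK_SIZE - 1) // BLOCK_SIZE  # ceiling division
    return {i: [next(nodes) for _ in range(REPLICATION)] for i in range(n_blocks)}


# The master only records where blocks live; the bytes themselves sit on the DataNodes.
metadata = place_blocks(b"hello world hello moon", ["dn1", "dn2", "dn3"])
for block, holders in metadata.items():
    print(block, holders)
```

Because the master holds only this mapping, losing it means losing the ability to find the data, which is why NameNode availability matters.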
MapReduce Framework
Two input files:
file1: “hello world hello moon”
file2: “goodbye world goodnight moon”
Three operations:
Map
Combine
Reduce
Map
First map:          Second map:
< hello, 1 >        < goodbye, 1 >
< world, 1 >        < world, 1 >
< hello, 1 >        < goodnight, 1 >
< moon, 1 >         < moon, 1 >
Combine
First map:          Second map:
< moon, 1 >         < goodbye, 1 >
< world, 1 >        < world, 1 >
< hello, 2 >        < goodnight, 1 >
                    < moon, 1 >
Reduce
< goodbye, 1 >
< goodnight, 1 >
< moon, 2 >
< world, 2 >
< hello, 2 >
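The word-count walkthrough above can be reproduced in a few lines of plain Python. This is a single-process sketch of the three operations, not Hadoop code; the function names `map_phase`, `combine_phase`, and `reduce_phase` are chosen for this example.

```python
from collections import Counter


def map_phase(text):
    """Emit a (word, 1) pair for every word in one input file."""
    return [(word, 1) for word in text.split()]


def combine_phase(pairs):
    """Pre-aggregate one mapper's output locally before it is sent on."""
    counts = Counter()
    for word, n in pairs:
        counts[word] += n
    return list(counts.items())


def reduce_phase(all_pairs):
    """Sum the counts for each word across all mappers."""
    totals = Counter()
    for word, n in all_pairs:
        totals[word] += n
    return dict(totals)


files = {
    "file1": "hello world hello moon",
    "file2": "goodbye world goodnight moon",
}
combined = [combine_phase(map_phase(text)) for text in files.values()]
result = reduce_phase(pair for pairs in combined for pair in pairs)
print(result)
```

The combine step is what turns the two `< hello, 1 >` pairs of the first map into a single `< hello, 2 >`, so less intermediate data has to reach the reducer.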
PIG
Pig, initially developed by Yahoo!, is a programming language that can handle any kind of data.
Pig has two components:
the first is the language itself, called “PigLatin”
the second is the runtime environment in which PigLatin programs are executed.
Writing in this language is easier than writing mapper and reducer programs by hand:
• The first step is to LOAD the data to be manipulated from HDFS
• Then run the data through a set of TRANSFORMations (which are in turn converted into mapper and reducer tasks)
• Finally, DUMP the data to the screen or STORE the results elsewhere.
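The LOAD, TRANSFORM, DUMP/STORE flow can be mimicked in plain Python to show its shape. This is an analogy, not Pig Latin: the in-memory CSV stands in for a file in HDFS, and the column names and the threshold of 50 are made up for the example.

```python
import csv
import io

# LOAD: read the records (an in-memory CSV stands in for a file in HDFS).
source = io.StringIO("user,bytes\nasha,120\nravi,80\nasha,30\n")
records = list(csv.DictReader(source))

# TRANSFORM: filter, then group and sum, roughly what FILTER and GROUP do in Pig Latin.
large = [r for r in records if int(r["bytes"]) >= 50]
totals = {}
for r in large:
    totals[r["user"]] = totals.get(r["user"], 0) + int(r["bytes"])

# DUMP: print to the screen (STORE would write the result to a file instead).
for user, total in sorted(totals.items()):
    print(user, total)
```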
HIVE
Initially developed by Facebook and now an Apache project, Hive is a data warehouse infrastructure built on top of Hadoop for querying, data summarization, and analysis.
Supports analysis of datasets stored in Hadoop’s HDFS and other compatible file systems
Different storage types: plain text, HBase, and others
Metadata storage in an RDBMS, which reduces the time for semantic checks
Operates on compressed data stored in Hadoop
Built-in user-defined functions (UDFs)
SQL-like queries (“HiveQL”) that are implicitly converted into MapReduce jobs
Provides indexes, including bitmap indexes, to speed up queries.
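To give a feel for how a SQL-like query can become a MapReduce job, the sketch below evaluates a GROUP BY aggregate as explicit map, shuffle, and reduce steps. It is a conceptual illustration only, not how Hive's planner actually works; the `orders` table, its columns, and the values are hypothetical.

```python
from collections import defaultdict

# Query being imitated (HiveQL-style; table and column names are hypothetical):
#   SELECT city, AVG(amount) FROM orders GROUP BY city;
orders = [("Hyderabad", 200.0), ("Pune", 50.0), ("Hyderabad", 100.0)]

# Map: emit the GROUP BY column as the key and the aggregated column as the value.
mapped = [(city, amount) for city, amount in orders]

# Shuffle: bring all values that share a key together.
groups = defaultdict(list)
for city, amount in mapped:
    groups[city].append(amount)

# Reduce: apply the aggregate function to each group.
averages = {city: sum(values) / len(values) for city, values in groups.items()}
print(averages)
```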
HBase
HBase is a column-oriented database, whereas HDFS is a file system
HBase has a table format with rows and columns, and each table should have a primary key defined that is used for all accesses to that HBase table. It allows many attributes to be grouped into column families.
The table schema, including its column families, should be predefined, but new columns can be added to a family at any time, which keeps the schema flexible.
Just as HDFS has a NameNode and slave nodes, and MapReduce has a JobTracker and TaskTracker slave nodes, HBase has a master node and region servers.
Availability of the master node is a concern here too, just as NameNode availability is in HDFS: HBase is sensitive to the loss of its master node.
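The row key and column family model described above can be pictured as nested dictionaries. This is a toy in-memory model, not the HBase client API; the helper names `put` and `get`, the family names, and the sample values are assumptions for illustration.

```python
# Layout: row key -> column family -> column qualifier -> value
users = {}
COLUMN_FAMILIES = {"info", "activity"}  # fixed when the table is created


def put(table, row_key, family, qualifier, value):
    """Write one cell; the family must already exist, the column need not."""
    if family not in COLUMN_FAMILIES:
        raise KeyError(f"unknown column family: {family}")
    table.setdefault(row_key, {}).setdefault(family, {})[qualifier] = value


def get(table, row_key):
    """All access goes through the row key."""
    return table.get(row_key, {})


put(users, "user#1", "info", "name", "Asha")
put(users, "user#1", "activity", "last_login", "2016-11-19")  # new column, no schema change
print(get(users, "user#1"))
```

Adding the `last_login` column needed no schema change, whereas writing to a family that was never declared is rejected.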
Conclusion
Hadoop MapReduce is an open source framework for data-intensive processing that is reliable, fault tolerant, and scalable; it has many implementation options and allows algorithms to be rewritten in MapReduce form.
The framework breaks up large data into smaller chunks and processes them.
A data-aware cache framework can be designed and evaluated that requires minimal change to the original MapReduce programming model, in order to provide incremental processing for Big Data applications using the MapReduce model.
References
www.quoble.com
www.insidebigdata.com
www.ibmbigdatahub.com
www.data-magnum.com
Survey paper on Big Data and Hadoop by Varsha B. Bobade, IRJET, volume 3
Survey paper on Big Data by Ms. Vibhavari Chavan and Prof. Rajesh N. Phursule, IJCSIT, vol. 5
