Data Analytics
A Lecture Series
by
Dr. Chitra A. Dhawale
P.R. Pote College of Engg. and Mgmt.
Data Analytics: HDFS with Big Data: Issues and Applications
Data Analytics (MCA19304)
COURSE OUTCOMES
AT THE END OF THE COURSE THE STUDENT SHOULD BE ABLE TO:
1. DEVELOP AND MAINTAIN RELIABLE, SCALABLE SYSTEMS USING APACHE HADOOP
2. WRITE MAPREDUCE-BASED APPLICATIONS
3. DIFFERENTIATE BETWEEN CONVENTIONAL SQL AND NOSQL
4. ANALYZE AND DEVELOP BIG DATA SOLUTIONS USING HIVE AND PIG
Data Analytics (MCA19304)
UNIT I
• DISTRIBUTED FILE SYSTEM AND ITS ISSUES
• INTRODUCTION TO BIG DATA
• BIG DATA CHARACTERISTICS
• TYPES OF BIG DATA
• TRADITIONAL VS. BIG DATA APPROACH
• BIG DATA APPLICATIONS
Distributed file system and its issues
• A single machine with 4 hard disks holding 1 TB of data (4 I/O channels at 100 MB/s each) needs about 45 minutes to process the data.
• For faster processing:
• Divide the data and store it on 5 machines with the same configuration as above.
– Assuming all machines process their share of the data in parallel, processing will take
45/5 = 9 minutes.
• Processing is 5 times faster than on a single machine.
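The arithmetic above can be checked with a short sketch (the 45-minute baseline and the 5-machine split are taken from the example; this idealized model ignores coordination and network overhead):

```python
# Naive data-parallel speedup: total time divided by the number of machines,
# assuming the work splits evenly and machines run fully in parallel.
def processing_time(single_machine_minutes, num_machines):
    return single_machine_minutes / num_machines

baseline = 45  # minutes on one machine
for n in (1, 5, 10):
    print(f"{n} machine(s): {processing_time(baseline, n)} min")
```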
Distributed file system and its issues
Each machine has its own local file system (the physical file system) where you store data, i.e., create folders, subfolders, and so on.
A distributed file system is not physical; it is a virtual or logical file system.
Hadoop uses a DFS.
Libraries are installed on every machine and run as a separate process on each machine.
Together they create a virtual layer over the physical file systems beneath them.
This virtual layer is called the distributed file system.
Distributed File System
Distributed file system and its issues
• A virtual file system is software, i.e., a set of programs exposing a set of commands.
• Ex. dfs -copy source_file destination_file
• dfs -copy file1 file2
• It reads file1, which is distributed across 5 machines, say (A, B, C, D, E); the user has
no idea where each part of the file is.
• The path is a virtual path; it does not exist anywhere physically.
• Any DFS follows a master-slave architecture.
DFS
Master Machine
Slave Machines
 The upper machine is the master machine and the lower 5
are the slaves.
 Data is split and stored on the slave machines.
 The master does not store any data. It stores only
metadata.
 The master alone knows how a file is divided into
blocks (file-to-block mapping) and how the blocks are
distributed over the slave machines (block-to-slave
mapping).
 Data can only be accessed via the master, as only the
master knows the actual location of the data on each
slave.
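The master's two metadata tables can be sketched as a pair of lookup structures (the file, block, and machine names below are hypothetical; a real HDFS NameNode keeps far richer metadata):

```python
# File -> blocks, and block -> slave machines holding each block.
file_to_blocks = {"file1": ["blk_1", "blk_2", "blk_3", "blk_4", "blk_5"]}
block_to_slaves = {
    "blk_1": ["A"], "blk_2": ["B"], "blk_3": ["C"],
    "blk_4": ["D"], "blk_5": ["E"],
}

def locate(filename):
    """Resolve a virtual file path to the slaves holding each of its blocks."""
    return [(blk, block_to_slaves[blk]) for blk in file_to_blocks[filename]]

print(locate("file1"))
```

This is why the client must go through the master first: only these tables say where each block physically lives.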
HDFS
• While reading data, if any node fails, the client may get only partial data.
• To overcome this, a replication factor is set when configuring HDFS; e.g., if the
replication factor = 2, every block is replicated (copied to two places), i.e., 2
copies are maintained for each block.
• If one node fails, the block can be accessed from another node. The data is
transmitted to the machine (server) where the program is running.
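The failover idea can be sketched as follows (block, node, and helper names are hypothetical; with a replication factor of 2, each block lists two candidate nodes and the reader falls back to the second if the first is down):

```python
block_replicas = {"blk_1": ["A", "D"], "blk_2": ["B", "E"]}  # replication factor = 2
down_nodes = {"A"}  # simulate one node failure

def read_block(block_id):
    """Try each replica in turn; fail only if every copy is unreachable."""
    for node in block_replicas[block_id]:
        if node not in down_nodes:
            return f"data of {block_id} from {node}"
    raise IOError(f"all replicas of {block_id} unavailable")

print(read_block("blk_1"))  # served from the surviving replica D
```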
Features of DFS
Transparency :
 Structure transparency –
There is no need for the client to know about the number or locations of file servers and the
storage devices.
 Access transparency –
Both local and remote files should be accessible in the same manner.
 Naming transparency –
Once a name is given to a file, it should not change when the file is moved from one
node to another.
Features of DFS
• Replication transparency –
If a file is copied onto multiple nodes, the copies of the file and their
locations should be hidden from the clients.
 User mobility :
It automatically brings the user’s home directory to the node where the
user logs in.
• Performance :
Performance is measured by the average amount of time needed to service
client requests.
• This time covers the CPU time + time taken to access secondary storage +
network access time.
Features of DFS
 Simplicity and ease of use :
The user interface of a file system should be simple, and the number of commands should be
small.
 High availability :
A distributed file system should be able to continue operating in the face of partial failures like a link failure, a
node failure, or a storage drive crash.
A highly available and adaptable distributed file system should have multiple independent file servers
controlling multiple independent storage devices.
 Scalability :
Since growing the network by adding new machines or joining two networks together is routine, the
distributed system will inevitably grow over time. As a result, a good distributed file system should be
built to scale quickly as the number of nodes and users in the system grows. Service should not be
substantially disrupted as the number of nodes and users grows.
Features of DFS
 High reliability :
A file system should create backup copies of key files that can be used if the originals are lost.
Many file systems employ stable storage as a high-reliability strategy.
 Data integrity :
 Multiple users frequently share a file system.
 The integrity of data saved in a shared file must be guaranteed by the file system.
 That is, concurrent access requests from many users who are competing for access to the
same file must be correctly synchronized using a concurrency control method.
 Atomic transactions are a high-level concurrency management mechanism for data
integrity that is frequently offered to users by a file system.
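The synchronization requirement above can be illustrated with a lock guarding a shared counter that stands in for a shared file (a generic Python sketch, not an HDFS-specific mechanism):

```python
import threading

counter = 0
lock = threading.Lock()

def append_record():
    """Simulate a write to a shared file; the lock serializes concurrent updates."""
    global counter
    with lock:
        counter += 1

threads = [threading.Thread(target=append_record) for _ in range(100)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)  # all 100 updates preserved
```

Without the lock, concurrent read-modify-write cycles could interleave and lose updates, which is exactly the hazard a concurrency control method prevents.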
Features of DFS
 Security :
 To safeguard the information contained in the file system from unwanted and unauthorized
access, security mechanisms must be implemented.
 A distributed file system should be secure so that its users may trust that their data will be
kept private.
 Heterogeneity :
Users of heterogeneous distributed systems have the option of using multiple computer
platforms for different purposes.
Issues with DFS
 In a distributed file system, nodes and connections need to be secured; therefore
security is a concern.
 Messages and data may be lost in the network while moving
from one node to another.
 Database connectivity in a distributed file system is complicated.
 Handling the database is also harder in a distributed file system than
in a single-user system.
 Overloading may occur if all nodes try to send data at once.
Factors- Big Data Generation
Evolution of Technology
Factors- Big Data Generation
IoT
Factors- Big Data Generation
Social Media
Factors- Big Data Generation
Others
What is Big Data?
Characteristics – Big Data
FIVE V’S OF BIG DATA : 1. VOLUME
Characteristics – Big Data
FIVE V’S OF BIG DATA : 2. VARIETY
Characteristics – Big Data
FIVE V’S OF BIG DATA : 3. VELOCITY
Characteristics – Big Data
FIVE V’S OF BIG DATA : 4. VALUE
Characteristics – Big Data
FIVE V’S OF BIG DATA : 5. VERACITY
Characteristics of Big Data at a glance
Types of Big Data
Types of Big Data
• Structured
Structured data includes all data that can be stored in tabular form, in rows and columns.
Relational databases are examples of structured data.
It is easy to make sense of relational databases.
Most modern computer programs can make sense of structured data.
Types of Big Data
Unstructured
• Unstructured data refers to data that lacks any specific form or structure.
• It cannot be stored in a spreadsheet or fit into tabular databases.
• Examples of unstructured data include audio, video, and e-mail, which together comprise a large share
of big data today.
Types of Big Data
Semi-structured
• Semi-structured data lies between structured and unstructured data.
• Such data sets have some structure, but it still might not be possible
to sort or process the data due to some constraints.
• This type of data includes XML data, JSON files, and others.
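A small example of semi-structured data: a JSON record has structure (keys and nesting) but no fixed schema, so different records can carry different fields (the sample record below is invented for illustration):

```python
import json

# One JSON record; another record in the same stream might omit "age"
# or add new fields, which a fixed relational schema would not allow.
raw = '{"user": "alice", "tags": ["big-data", "hdfs"], "age": 30}'
record = json.loads(raw)  # parsed into a nested Python structure
print(record["user"], record["tags"])
```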
Traditional Vs. Big Data
• 1. Traditional data
• Traditional data is the structured data maintained by businesses of all types,
from very small firms to big organizations.
• In a traditional database system, a centralized database architecture is used to store and maintain the data in
a fixed format or in fixed fields in a file.
• Structured Query Language (SQL) is used for managing and accessing the data.
• 2. Big data :
Big data deals with data sets too large or complex to manage with traditional data-processing
application software.
• It deals with large volumes of structured, semi-structured, and unstructured data, characterized by volume,
velocity, variety, veracity, and value.
• Big data refers not only to a large amount of data but also to extracting meaningful information by analyzing
huge, complex data sets.
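SQL access to traditional structured data can be sketched with Python's built-in sqlite3 module (the table and rows below are invented for illustration):

```python
import sqlite3

# A centralized relational store with a fixed schema, queried via SQL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (item TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("pen", 10.0), ("book", 55.5)])
total = conn.execute("SELECT SUM(amount) FROM sales").fetchone()[0]
print(total)  # 65.5
conn.close()
```

The fixed schema and declarative querying shown here are exactly what breaks down at big-data scale, motivating the distributed approach in the comparison table that follows.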
S.No. TRADITIONAL DATA | BIG DATA
01. Traditional data is generated at the enterprise level. | Big data is generated both inside and outside the enterprise.
02. Its volume ranges from gigabytes to terabytes. | Its volume ranges from petabytes to exabytes or zettabytes.
03. Traditional database systems deal with structured data. | Big data systems deal with structured, semi-structured, and unstructured data.
04. Traditional data is generated per hour, per day, or less often. | Big data is generated far more frequently, often every second.
05. The data source is centralized and managed in centralized form. | The data source is distributed and managed in distributed form.
06. Data integration is very easy. | Data integration is very difficult.
07. A normal system configuration can process traditional data. | A high-end system configuration is required to process big data.
08. The size of the data is comparatively small. | The size is far larger than traditional data.
09. Ordinary database tools suffice for any database operation. | Special database tools are required for database operations.
10. Normal functions can manipulate the data. | Special functions are needed to manipulate the data.
11. Its data model is strictly schema-based and static. | Its data model is flat-schema-based and dynamic.
12. Traditional data is stable, with known inter-relationships. | Big data is volatile, with unknown relationships.
13. Traditional data is of manageable volume. | Big data comes in huge volumes that become unmanageable.
14. It is easy to manage and manipulate the data. | It is difficult to manage and manipulate the data.
15. Its data sources include ERP transaction data, CRM transaction data, financial data, organizational data, web transaction data, etc. | Its data sources include social media, device data, sensor data, video, images, audio, etc.
Applications of Big Data
•Big data in retail
•Big data in healthcare
•Big data in education
•Big data in e-commerce
•Big data in media and entertainment
•Big data in finance
•Big data in travel industry
•Big data in telecom
•Big data in automobile