BIG DATA: ISSUES, CHALLENGES, TOOLS AND GOOD PRACTICES
MOTIVATION
• Data stores are growing by 50% each year, and that rate of increase
is accelerating[8]
• The type of data is also changing: over 80% of it will be
unstructured data, which does not work well with relational
databases[8]
• The main difficulty is that data volume is growing much faster than
the available computing resources
DEFINING BIG DATA
• Big data is a large amount of data that requires new
technologies and architectures so that it becomes possible to
extract value from it through capture and analysis.
• It is an emerging field that can bring large benefits to
business organizations.
PROPERTIES OF BIG DATA
• Variety: Data being produced is not only traditional structured data
but also semi-structured data, and it comes from various sources.
• Volume: Data volumes are expected to reach zettabytes in the near future.
• Velocity: The speed at which data arrives from various sources.
• Variability: Inconsistencies in the data flow.
• Complexity: It is difficult to link, match, cleanse, and transform
data coming from various sources across systems.
• Value: Queries can be run against the stored data to deduce
important results.
RELATED WORK
• Collaborative research on methodologies for big data analysis and
design [1]
• Databases required for big data [2]
• Architectural considerations for big data [3]
• Concept of big data with market solutions [4]
• Scientific Data Infrastructure (SDI) generic architectural model [5]
• How big data analytics is different from traditional analytics [6]
• Analysis of social media sites like Facebook, Flickr, and Google+ [7]
IMPORTANCE OF BIG DATA
• Log Storage in IT Industries
– IT companies store large amounts of data as logs so that they can
diagnose problems that occur only rarely.
– Big data analytics is applied to these logs to pinpoint the points of
failure.
– Traditional systems are not able to handle logs at this volume.
• Sensor Data
– The massive amount of data produced by sensors is itself a major
big data challenge.
• Risk Analysis
– Financial institutions need to model data in order to
calculate risk.
– A lot of potentially useful data is underutilized because of its volume;
it should be integrated to determine risk patterns more accurately.
• Social Media
– The largest use of big data is for social media and customer sentiment.
– Keeping an eye on what customers are saying is a way of getting
feedback.
– That feedback can then be used to make decisions and add value
to the business.
BIG DATA CHALLENGES AND ISSUES
• Privacy and Security
– The most important issue with big data; it has conceptual,
technical, and legal significance.
– Personal information about a person, when combined with large
external data sets, can lead to the inference of new private facts
about that person.
– Big data used by law enforcement increases the chance that
certain tagged people suffer adverse consequences.
• Data Access and Sharing of Information
– If data is to be used to make accurate decisions in time, it must be
available in an accurate, complete, and timely manner.
• Storage and Processing Issues
– Many companies are struggling to store the large amount of data they
are producing.
• Outsourcing storage to the cloud may seem like an option, but long
upload times and constant updates to the data preclude it.
– Processing a large amount of data also takes a lot of time.
• Analytical Challenges
– What if data volume gets so large that we don’t know how to
deal with it?
– Does all data need to be stored?
– Does all data need to be analyzed?
– Which data points are really important?
– How can the data be used to best advantage?
• Skill Requirement: As a new and emerging technology, big data needs
to attract organizations and young people with diverse new skill sets.
• Technical Challenges
– Fault Tolerance
– Scalability
– Quality of Data
– Heterogeneous Data
TOOLS AND TECHNIQUES AVAILABLE
• Hadoop is an open source project hosted by the Apache Software
Foundation for managing big data
• Hadoop consists of two main components:
– Hadoop Distributed File System (HDFS), a distributed file
system that stores the data on multiple separate servers
(each of which has its own processor(s))
– MapReduce, the framework that understands and assigns
work to the nodes in a cluster[9]
ADVANTAGES OF HADOOP
• Hadoop provides the following advantages[9]
– Data read/write performance is increased by distributing the
data across the cluster, allowing each processor to work in
parallel
– It’s scalable: new nodes can be added as needed without making
changes to the existing system
– It’s cost effective because it brings parallel computing to
commodity servers
– It’s flexible: it can absorb any type of data, structured or not,
from any number of sources
– It’s fault tolerant: it handles failures intrinsically by always
storing multiple copies of the data and automatically switching to
another copy when a fault is detected
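The fault-tolerance point above can be illustrated with a toy sketch. It is not how HDFS actually places or reads blocks (real placement is rack-aware and managed by the NameNode); the node names, the round-robin placement, and the helper functions `place_replicas` and `read_block` are all assumptions made up for this illustration. The replication factor of 3 is likewise just a choice for the sketch.

```python
from typing import Dict, List, Set

REPLICATION_FACTOR = 3  # assumption chosen for this sketch


def place_replicas(block_ids: List[str], nodes: List[str],
                   replication: int = REPLICATION_FACTOR) -> Dict[str, List[str]]:
    """Assign each block to `replication` distinct nodes, round-robin (toy policy)."""
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    placement: Dict[str, List[str]] = {}
    for i, block in enumerate(block_ids):
        placement[block] = [nodes[(i + r) % len(nodes)] for r in range(replication)]
    return placement


def read_block(block: str, placement: Dict[str, List[str]],
               failed_nodes: Set[str]) -> str:
    """Return the first node that still holds a healthy copy of the block."""
    for node in placement[block]:
        if node not in failed_nodes:
            return node
    raise RuntimeError(f"all replicas of {block} are unavailable")


nodes = ["node-a", "node-b", "node-c", "node-d"]  # hypothetical cluster
placement = place_replicas(["block-0", "block-1"], nodes)

# node-a has failed, so the read falls over to the next replica holder.
served_by = read_block("block-0", placement, failed_nodes={"node-a"})
print(served_by)  # node-b
```

Because every block lives on several nodes, losing one node only changes which copy is read; data is lost only if all holders of a block fail at once.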
HADOOP
• How do you use Hadoop?
– The developer writes a program that conforms to the MapReduce
programming model
– The developer specifies the format of the data to be processed in
their program
• How does MapReduce work?[10]
– Each Hadoop program performs two tasks:
• Map - Breaks all of the data down into key/value pairs
• Reduce - Takes the output from the map step as input and
combines those key/value pairs into a smaller set of
key/value pairs
MAP REDUCE - EXAMPLE
• MapReduce example[10]: Assume you have five files, and each file
contains two columns that represent a city and the temperature
recorded in that city on various measurement days
– Toronto, 20; New York, 22; Rome, 32; Toronto, 4; Rome, 33;
New York, 18
• We want to find the maximum temperature for each city across all of
the data files
• We create five map tasks, where each mapper works on one of the
five files, goes through its data, and returns the maximum
temperature for each city in that file
– For the file above, this results in: (Toronto, 20) (New York, 22) (Rome, 33)
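The map step for the single file shown above can be sketched in plain Python. This is only an illustration of the idea, not Hadoop code: the function name `map_max_temperature` and the in-memory list standing in for a file are assumptions.

```python
from typing import Dict, Iterable, Tuple


def map_max_temperature(records: Iterable[Tuple[str, int]]) -> Dict[str, int]:
    """Return the highest temperature seen for each city in one file."""
    local_max: Dict[str, int] = {}
    for city, temperature in records:
        if city not in local_max or temperature > local_max[city]:
            local_max[city] = temperature
    return local_max


# The (city, temperature) records of the one file shown in the example.
sample_file = [
    ("Toronto", 20), ("New York", 22), ("Rome", 32),
    ("Toronto", 4), ("Rome", 33), ("New York", 18),
]

mapper_output = map_max_temperature(sample_file)
print(mapper_output)  # {'Toronto': 20, 'New York': 22, 'Rome': 33}
```

Each of the five mappers would run the same function on its own file, independently of the others, which is what lets the work proceed in parallel.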
• Let’s assume the other four mapper tasks (working on the other four
files not shown here) produced the following intermediate results:
– (Toronto, 18) (New York, 32) (Rome, 37)
– (Toronto, 32) (New York, 33) (Rome, 38)
– (Toronto, 22) (New York, 20) (Rome, 31)
– (Toronto, 31) (New York, 19) (Rome, 30)
• All five of these output streams are fed into the reduce tasks,
which combine the input results and output a single value for each
city, producing the final result set:
– (Toronto, 32) (New York, 33) (Rome, 38)
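The reduce step over the five mapper outputs can be sketched the same way. Again this is a plain-Python illustration under assumptions, not the Hadoop API: in a real job the framework groups the pairs by key and calls the reducer once per city, whereas the hypothetical `reduce_max_temperature` below does the grouping itself.

```python
from typing import Dict, Iterable, List, Tuple


def reduce_max_temperature(
    mapper_outputs: Iterable[List[Tuple[str, int]]]
) -> Dict[str, int]:
    """Combine per-file maxima into one overall maximum per city."""
    overall: Dict[str, int] = {}
    for output in mapper_outputs:
        for city, temperature in output:
            overall[city] = max(temperature, overall.get(city, temperature))
    return overall


# Intermediate results of the five mappers, as listed in the example.
mapper_outputs = [
    [("Toronto", 20), ("New York", 22), ("Rome", 33)],
    [("Toronto", 18), ("New York", 32), ("Rome", 37)],
    [("Toronto", 32), ("New York", 33), ("Rome", 38)],
    [("Toronto", 22), ("New York", 20), ("Rome", 31)],
    [("Toronto", 31), ("New York", 19), ("Rome", 30)],
]

final_result = reduce_max_temperature(mapper_outputs)
print(final_result)  # {'Toronto': 32, 'New York': 33, 'Rome': 38}
```

Taking a maximum works well here because the maximum of per-file maxima equals the maximum over all records, so the mappers can shrink the data before anything is sent to the reducers.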
BIG DATA – GOOD PRACTICES
• Creating dimensions for all the data being stored is good practice.
• All dimensions should have durable surrogate keys that are unique
and can’t be changed.
• Expect to integrate structured and unstructured data.
• Generality of technology is needed. Building it around key/value pairs
works.
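As a rough illustration of the surrogate-key practice above, the sketch below assigns each natural key (here a city name) a system-generated integer that never changes once issued. The class `SurrogateKeyRegistry` and its in-memory dictionary are assumptions for this example; a real system would persist the mapping.

```python
from itertools import count
from typing import Dict


class SurrogateKeyRegistry:
    """Hand out a unique, never-reassigned integer key per natural key."""

    def __init__(self) -> None:
        self._next_key = count(1)          # 1, 2, 3, ...
        self._keys: Dict[str, int] = {}    # natural key -> surrogate key

    def key_for(self, natural_key: str) -> int:
        # Issue a new key only on first sight; afterwards always return
        # the same one, which is what makes the key durable.
        if natural_key not in self._keys:
            self._keys[natural_key] = next(self._next_key)
        return self._keys[natural_key]


cities = SurrogateKeyRegistry()
toronto = cities.key_for("Toronto")
rome = cities.key_for("Rome")
print(toronto, rome, cities.key_for("Toronto"))  # 1 2 1
```

Because the surrogate key carries no meaning of its own, records from different sources can be joined on it even if the natural identifier is later renamed or reformatted.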
• As the value of big data becomes more apparent, privacy concerns grow.
• Data quality needs to be better.
• Limit on scalability of records.
• Business and IT leaders should work together to create more value
from data.
• Investment in data quality and metadata reduces processing time.
CONCLUSIONS
• Big data is a new concept; its importance and existing projects were reviewed.
• Many challenges and issues exist that need to be raised.
• Big data will help businesses grow.
• Hadoop Tool
REFERENCES
• [1] Stephen Kaisler, Frank Armour, J. Alberto Espinosa, William
Money, “Big Data: Issues and Challenges Moving Forward”, IEEE, 46th
Hawaii International Conference on System Sciences, 2013.
• [2] Sam Madden, “From Databases to Big Data”, IEEE, Internet
Computing, May-June 2012.
• [3] Kapil Bakshi, “Considerations for Big Data: Architecture and
Approach”, IEEE, Aerospace Conference, 2012.
• [4] Sachchidanand Singh, Nirmala Singh, “Big Data Analytics”,
IEEE, International Conference on Communication, Information &
Computing Technology (ICCICT), Oct. 19-20, 2012.
• [5] Yuri Demchenko, Zhiming Zhao, Paola Grosso, Adianto Wibisono,
Cees de Laat, “Addressing Big Data Challenges for Scientific Data
Infrastructure”, IEEE, 4th International Conference on Cloud
Computing Technology and Science, 2012.
• [6] Martin Courtney, “The Larging-up of Big Data”, IEEE, Engineering
& Technology, September 2012.
• [7] Matthew Smith, Christian Szongott, Benjamin Henne, Gabriele von
Voigt, “Big Data Privacy Issues in Public Social Media”, IEEE, 6th
International Conference on Digital Ecosystems Technologies (DEST),
18-20 June 2012.
• [8] “Why Every Database Must Be Broken Soon”,
https://blogs.vmware.com/vfabric/2013/03/why-every-database-must-be-broken-soon.html
• [9] “What is Hadoop?”,
http://www-01.ibm.com/software/data/infosphere/hadoop/
• [10] “What is MapReduce?”,
http://www-01.ibm.com/software/data/infosphere/hadoop/mapreduce
THANK YOU.