SlideShare a Scribd company logo
Scientific Journal Impact Factor (SJIF): 1.711
International Journal of Modern Trends in Engineering
and Research
www.ijmter.com
@IJMTER-2014, All rights Reserved 252
e-ISSN: 2349-9745
p-ISSN: 2393-8161
A Survey on Big Data Mining Challenges
Irin Ani John1
, Tintu Alphonsa Thomas2
1,2
Department Of Computer Science And Engineering
Amal Jyothi College Of Engineering
Kanjirappally , Kottayam, India
Abstract— Big Data is the new technology or science to make the well informed decision in
business or any other science discipline with huge volume of data from new sources of
heterogeneous data. . Such new sources include blogs, online media, social network, sensor network,
image data and other forms of data which vary in volume, structure, format and other factors. Big
Data applications are increasingly adopted in all science and engineering domains, including space
science, biomedical sciences and astronomic and deep space studies. The major challenges of big
data mining are in data accessing and processing, data privacy and mining algorithms. This paper
includes the information about what is big data, data mining with big data, the challenges in big data
mining and what are the currently available solutions to meet those challenges.
Keywords—Big Data, Big Data Mining Algorithms, MPI, MapReduce, Dryad.
I. INTRODUCTION
Recent years have witnessed a dramatic increase in our ability to collect data from different sensors,
instruments, in different formats from independent or connected applications. This data volume has
outgrown our ability to process, analyse, store and understand these datasets. Consider the internet
data, the web pages indexed by Google were around one million in 1998, but quickly reached 1
billion in 2000 and exceeded 1 trillion in 2008. This rapid expansion is accelerated by the dramatic
increase in acceptance of social networking application such as Facebook, Twitter, etc, that allows
user to create contents freely and amplify the already huge web volume.
Furthermore with mobile phones becoming the sensory gateway to get real time data in people from
different aspects, the vast amount of data that mobile carried can potentially process to improve our
daily life has significantly outpaced our past call data record based processing for billing purposes
only. People and devices (from home, to cars, to buses, railway stations and airport) are all loosely
connected. Trillions of such connected components will generate a huge data ocean, and valuable
information must be discovered from the data to help improve quality of life and make our world a
better place.
II. CHARACTERISTICS OF BIG DATA
The major characteristics can be defined by HACE theorem:[1] Big Data starts with large-volume
heterogeneous, autonomous sources with distributed and decentralized control and seeks to explore
complex and evolving relationships among data.
These characteristics make it an extreme challenge for discovering useful knowledge from the Big
Data. The heterogeneous characteristics of the Big Data is because different information collectors
prefer their own schemata or protocols for data recording, and the nature of different applications
International Journal of Modern Trends in Engineering and Research (IJMTER)
Volume 02, Issue 01, [January - 2015] e-ISSN: 2349-9745, p-ISSN: 2393-8161
@IJMTER-2014, All rights Reserved 253
also results in diverse data representations. Being autonomous, each data source is able to generate
and collect information without involving (or relying on) any centralised control. While the volume
of the data grows, the complexity and the relationships underneath the data will also increases
exponentially.
III. CHALLENGES OF BIG DATA
With the development of Internet services, indexes and queried contents were rapidly growing.
Therefore, search engine companies had to face the challenges of handling such big data. Google
created GFS[3] and MapReduce programming models[2] to cope with the challenges brought about
by data management and analysis at the Internet scale. Some obstacles[4][5][6] in the development
of big data applications are :
A. Data representation: Many datasets have certain levels of heterogeneity in type, structure,
semantics, organization, granularity, and accessibility. Data representation aims to make data
more meaningful for computer analysis and user interpretation. Efficient data representation shall
reflect data structure, class, and type, as well as integrated technologies, so as to enable efficient
operations on different datasets.
B. Redundancy reduction and data compression: Generally, there is a high level of redundancy in
datasets. Redundancy reduction and data compression is effective to reduce the indirect cost of the
entire system on the premise that the potential values of the data are not affected. For example,
most data generated by sensor networks are highly redundant, which may be filtered and
compressed at orders of magnitude.
C. Data life cycle management: Compared with the relatively slow advances of storage systems,
pervasive sensing and computing are generating data at unprecedented rates and scales. We are
confronted with a lot of pressing challenges, one of which is that the current storage system could
not support such massive data.
D. Analytical mechanism: The analytical system of big data shall process masses of heterogeneous
data within a limited time. However, traditional RDBMSs are strictly designed with a lack of
scalability and expandability, which could not meet the performance requirements. Non-relational
databases have shown their unique advantages in the processing of unstructured data and started
to become mainstream in big data analysis. Even so, there are still some problems of non-
relational database in their performance and particular applications. We shall find a compromising
solution between RDBMSs and non-relational databases. For example, some enterprises have
utilized a mixed database architecture that integrates the advantages of both types of databases.
E. Data Confidentiality: Most big data service providers or owners at present could not effectively
maintain and analyze such huge datasets because of their limited capacity. They must rely on
professionals or tools to analyze such data. These data contains details of the lowest granularity
and some sensitive information such as credit card numbers. Therefore, analysis of big data may
be delivered to a third party for processing only when proper preventive measures are taken to
protect such sensitive data, to ensure its safety.
F. Expendability and scalability: The analytical system of big data must support present and future
datasets. The analytical algorithm must be able to process increasingly expanding and more
complex datasets.
International Journal of Modern Trends in Engineering and Research (IJMTER)
Volume 02, Issue 01, [January - 2015] e-ISSN: 2349-9745, p-ISSN: 2393-8161
@IJMTER-2014, All rights Reserved 254
The major challenges[1] of data mining with big data are in Big data mining platform,in information
sharing and data privacy and Mining algorithms.
A. Big Data Mining Platform
The mining platform is needed to have efficient access to the data and computing processors.For Big
Data mining, because data scale is far beyond the capacity that a single personal computer (PC) can
handle, a typical Big Data processing framework will rely on cluster computers with a high-
performance computing platform, with a data mining task being deployed by running some parallel
programming tools, such as MapReduce or Enterprise Control Language (ECL),DryadLINQ[7], on a
large number of computing nodes.
B. Information Sharing and Data Privacy
A real-world concern is that Big Data applications are related to sensitive information, such as
banking transactions and medical records. Simple data exchanges or transmissions do not resolve
privacy concerns. To protect privacy, two common approaches are to 1) restrict access to the data,
such as adding certification or access control to the data entries, so sensitive information is accessible
by a limited group of users only, and 2) anonymize data fields such that sensitive information cannot
be pinpointed to an individual record[8]. The most common k-anonymity privacy measure is to
ensure that each individual in the database must be indistinguishable from k-1 others. Common
anonymization approaches are to use suppression, generalization, perturbation, and permutation to
generate an altered version of the data.
C. Mining Algorithms
As Big Data applications are featured with autonomous sources and decentralized controls,
aggregating distributed data sources to a centralized site for mining is systematically prohibitive due
to the potential transmission cost and privacy concerns. A Big Data mining system has to enable an
information exchange and fusion mechanism to ensure that all distributed sites can work together to
achieve a global optimization goal. Spare, uncertain, and incomplete data are defining features for
Big Data applications. The rise of Big Data is driven by the rapid increasing of complex data and
their changes in volumes and in nature. Inspired by the above challenges, many data mining methods
have been developed to find interesting knowledge from Big Data with complex relationships and
dynamically changing volumes. This has also spawned new computer architectures for real-time
data-intensive processing, such as the open source Apache Hadoop[9] project that runs on high-
performance clusters.
IV. BIG DATA ANALYTIC METHODS
At present, the main processing methods of big data are shown as follows.
A. Bloom Filter: Bloom Filter consists of a series of Hash functions. The principle of Bloom Filter is
to store Hash values of data other than data itself by utilising a bit array, which is in essence a
bitmap index that uses Hash functions to conduct lossy compression storage of data. It has such
advantages as high space efficiency and high query speed, but also has some disadvantages in
misrecognition and deletion.
B. Hashing: it is a method that essentially transforms data into shorter fixed-length numerical values
or index values. Hashing has such advantages as rapid reading, writing, and high query speed, but
it is hard to find a sound Hash function.
International Journal of Modern Trends in Engineering and Research (IJMTER)
Volume 02, Issue 01, [January - 2015] e-ISSN: 2349-9745, p-ISSN: 2393-8161
@IJMTER-2014, All rights Reserved 255
C. Index: index is always an effective method to reduce the expense of disk reading and writing, and
improve insertion, deletion, modification, and query speeds in both traditional relational databases
that manage structured data, and other technologies that manage semi-structured and unstructured
data. However, index has a disadvantage that it has the additional cost for storing index files
which should be maintained dynamically when data is updated.
D. Triel: also called trie tree, a variant of Hash Tree. It is mainly applied to rapid retrieval and word
frequency statistics. The main idea of Triel is to utilise common prefixes of character strings to
reduce comparison on character strings to the greatest extent, so as to improve query efficiency.
E. Parallel Computing: compared to traditional serial computing, parallel computing refers to
simultaneously utilising several computing resources to complete a computation task. Its basic
idea is to decompose a problem and assign them to several separate processes to be independently
completed, so as to achieve coprocessing. Presently, some classic parallel computing models
include MPI (Message Passing Interface), MapReduce, and Dryad .
Although the parallel computing systems or tools, such as MapReduce or Dryad, are useful for big
data analysis, they are low levels tools that are hard to learn and use. Therefore, some high-level
parallel programming tools or languages are being developed based on these systems. Such high-
level languages include Sawzall, Pig, and Hive used for MapReduce, as well as Scope and
DryadLINQ used for Dryad.
MapReduce: MapReduce is a simple but powerful programming model for large-scale computing
using a large number of clusters of commercial PCs to achieve automatic parallel processing and
distribution. In MapReduce, computing model only has two functions, i.e., Map and Reduce, both of
which are programmed by users. The Map function processes input key-value pairs and generates
intermediate key-value pairs. Then,MapReduce will combine all the intermediate values related to
the same key and transmit them to the Reduce function, which further compress the value set into a
smaller set. MapReduce has the advantage that it avoids the complicated steps for developing
parallel applications, e.g., data scheduling, fault-tolerance, and inter-node communications. The user
only needs to program the two functions to develop a parallel application. The initial MapReduce
framework did not support multiple datasets in a task, which has been mitigated by some recent
enhancements.
Over the past decades, programmers are familiar with the advanced declarative language of SQL,
often used in a relational database, for task description and dataset analysis. However, the succinct
MapReduce framework only provides two nontransparent functions, which cannot cover all the
common operations. Therefore, programmers have to spend time on programming the basic
functions, which are typically hard to be maintained and reused. In order to improve the
programming efficiency, some advanced language systems have been proposed, e.g., Sawzall of
Google, Pig Latin of Yahoo, Hive of Facebook, and Scope of Microsoft.
Dryad: Dryad is a general-purpose distributed execution engine for processing parallel applications
of coarse-grained data. The operational structure of Dryad is a directed acyclic graph, in which
vertexes represent programs and edges represent data channels. Dryad executes operations on the
vertexes in clusters and transmits data via data channels, including documents, TCP connections, and
shared-memory FIFO. During operation, resources in a logic operation graph are automatically map
to physical resources.
International Journal of Modern Trends in Engineering and Research (IJMTER)
Volume 02, Issue 01, [January - 2015] e-ISSN: 2349-9745, p-ISSN: 2393-8161
@IJMTER-2014, All rights Reserved 256
The operation structure of Dryad is coordinated by a central program called job manager, which can
be executed in clusters or workstations through network. A job manager consists of two parts: 1)
application codes which are used to build a job communication graph, and 2) program library codes
that are used to arrange available resources. All kinds of data are directly transmitted between
vertexes. Therefore, the job manager is only responsible for decision-making, which does not
obstruct any data transmission.
In Dryad, application developers can flexibly choose any directed acyclic graph to describe the
communication modes of the application and express data transmission mechanisms. In addition,
Dryad allows vertexes to use any amount of input and output data, while MapReduce supports only
one input and output set. DryadLINQ is the advanced language of Dryad and is used to integrate the
aforementioned SQL-like language execution environment.
V. CONCLUSION
This paper surveys various challenges in the storage and processing of big data. The major areas of
challenges are big data mining platform, information sharing and privacy, and mining algorithms.
The analysis of different big data mining platforms includes analysis about MPI, MapReduce and
Dryad. We regard Big Data as an emerging trend and the need for Big Data mining is arising in all
science and engineering domains.
REFERENCES
[1] X.Wu, X.Zhu, G.Wu, and Wei Ding, “Data Mining with Big Data,” IEEE Trans. Knowledge and Data Eng., vol.
26, no.1, Jan 2014.
[2] Jeffrey Dean and Sanjay Ghemawat, “MapReduce: Simplified Data Processing on Large Cluster,” Google Inc.
[3] Ghemawat S, Gobioff H, Leung S-T (2003) The google file system. In: ACM SIGOPS Operating Systems Review,
vol 37. ACM, pp 29–43
[4] Labrinidis A, Jagadish HV (2012) Challenges and opportunities with big data. Proc VLDB Endowment
5(12):2032–2033.
[5] Chaudhuri S, Dayal U, Narasayya V (2011) An overview of business intelligence technology. Commun ACM
54(8):88–98.
[6] Agrawal D, Bernstein P, Bertino E, Davidson S, Dayal U, FranklinM, Gehrke J, Haas L, Halevy A, Han J et al
(2012) Challenges and opportunities with big data. A community white paper developed by leading researches
across the United States.
[7] Isard M, Budiu M, Yu Y, Birrell A, Fetterly D (2007) Dryad: distributed data-parallel programs from sequential
building blocks.ACM SIGOPS Oper Syst Rev 41(3):59–72.
[8] G. Cormode and D. Srivastava, “Anonymized Data: Generation, Models, Usage,” Proc. ACM SIGMOD Int’l Conf.
Management Data, pp. 1015-1018, 2009.
[9] Mark Kerzner and Sujee Maniyam, “ Hadoop Illuminated,” eBook.
A Survey on Big Data Mining Challenges
A Survey on Big Data Mining Challenges

More Related Content

PDF
Efficient Data Filtering Algorithm for Big Data Technology in Telecommunicati...
PDF
Efficient Data Filtering Algorithm for Big Data Technology in Telecommunicati...
PPT
Research issues in the big data and its Challenges
PDF
wireless sensor network
DOCX
JPJ1417 Data Mining With Big Data
PPTX
Data mining on big data
PDF
Data minig with Big data analysis
PPTX
Efficient Data Filtering Algorithm for Big Data Technology in Telecommunicati...
Efficient Data Filtering Algorithm for Big Data Technology in Telecommunicati...
Research issues in the big data and its Challenges
wireless sensor network
JPJ1417 Data Mining With Big Data
Data mining on big data
Data minig with Big data analysis

What's hot (18)

PDF
BIG DATA IN CLOUD COMPUTING REVIEW AND OPPORTUNITIES
PDF
Data Mining and Big Data Challenges and Research Opportunities
PDF
Big data survey
PPT
Big DataParadigm, Challenges, Analysis, and Application
PPTX
Data mining with big data implementation
PPT
Elementary Concepts of data minig
PDF
An Investigation on Scalable and Efficient Privacy Preserving Challenges for ...
PDF
RESEARCH IN BIG DATA – AN OVERVIEW
PPTX
Data mining & big data presentation 01
PDF
Big data privacy issues in public social media
PDF
06. 9534 14985-1-ed b edit dhyan
PDF
Big data Mining Using Very-Large-Scale Data Processing Platforms
PPTX
Big data seminor
PDF
kambatla2014.pdf
PPT
Data mining with big data
PDF
Overcomming Big Data Mining Challenges for Revolutionary Breakthroughs in Com...
DOCX
data mining with big data
PDF
Mining Big Data to Predicting Future
BIG DATA IN CLOUD COMPUTING REVIEW AND OPPORTUNITIES
Data Mining and Big Data Challenges and Research Opportunities
Big data survey
Big DataParadigm, Challenges, Analysis, and Application
Data mining with big data implementation
Elementary Concepts of data minig
An Investigation on Scalable and Efficient Privacy Preserving Challenges for ...
RESEARCH IN BIG DATA – AN OVERVIEW
Data mining & big data presentation 01
Big data privacy issues in public social media
06. 9534 14985-1-ed b edit dhyan
Big data Mining Using Very-Large-Scale Data Processing Platforms
Big data seminor
kambatla2014.pdf
Data mining with big data
Overcomming Big Data Mining Challenges for Revolutionary Breakthroughs in Com...
data mining with big data
Mining Big Data to Predicting Future
Ad

Viewers also liked (12)

PDF
Semantic Web Mining of Un-structured Data: Challenges and Opportunities
PDF
Big data Paper
PPT
Text and Data Mining Using Cultural Heritage Data
PPTX
Public Data and Data Mining Competitions - What are Lessons?
PPTX
"Big Data for Development: Opportunities & Challenges” - UN Global Pulse
PDF
Ibm big data-analytics research paper(non-online)
PPTX
Big data (4Vs,history,concept,algorithm) analysis and applications #bigdata #...
PPTX
Mining Big Data in Real Time
PPTX
Big Data for Development: Opportunities and Challenges, Summary Slidedeck
PPTX
2013 py con awesome big data algorithms
PDF
A Survey Paper on Data Mining With Big Data
Semantic Web Mining of Un-structured Data: Challenges and Opportunities
Big data Paper
Text and Data Mining Using Cultural Heritage Data
Public Data and Data Mining Competitions - What are Lessons?
"Big Data for Development: Opportunities & Challenges” - UN Global Pulse
Ibm big data-analytics research paper(non-online)
Big data (4Vs,history,concept,algorithm) analysis and applications #bigdata #...
Mining Big Data in Real Time
Big Data for Development: Opportunities and Challenges, Summary Slidedeck
2013 py con awesome big data algorithms
A Survey Paper on Data Mining With Big Data
Ad

Similar to A Survey on Big Data Mining Challenges (20)

PDF
CASE STUDY ON METHODS AND TOOLS FOR THE BIG DATA ANALYSIS
PDF
UNIT 1 -BIG DATA ANALYTICS Full.pdf
PDF
Sameer Kumar Das International Conference Paper 53
PDF
A Theoretical Exploration of Data Management and Integration in Organisation ...
PDF
A Theoretical Exploration of Data Management and Integration in Organisation ...
PPTX
DEVOLSAFGSDFHGKJHJGHFGDFSDFDSDASFDGFUC.pptx
PDF
Identifying and analyzing the transient and permanent barriers for big data
PDF
A SURVEY OF BIG DATA ANALYTICS
PDF
A SURVEY OF BIG DATA ANALYTICS..........
PDF
big data Big Things
DOCX
Encroachment in Data Processing using Big Data Technology
PDF
Big Data: Issues and Challenges
PDF
IRJET- Scope of Big Data Analytics in Industrial Domain
PDF
Research paper on big data and hadoop
PDF
RESEARCH IN BIG DATA – AN OVERVIEW
PDF
Research in Big Data - An Overview
PDF
RESEARCH IN BIG DATA – AN OVERVIEW
PDF
Real World Application of Big Data In Data Mining Tools
PPTX
CASE STUDY ON METHODS AND TOOLS FOR THE BIG DATA ANALYSIS
UNIT 1 -BIG DATA ANALYTICS Full.pdf
Sameer Kumar Das International Conference Paper 53
A Theoretical Exploration of Data Management and Integration in Organisation ...
A Theoretical Exploration of Data Management and Integration in Organisation ...
DEVOLSAFGSDFHGKJHJGHFGDFSDFDSDASFDGFUC.pptx
Identifying and analyzing the transient and permanent barriers for big data
A SURVEY OF BIG DATA ANALYTICS
A SURVEY OF BIG DATA ANALYTICS..........
big data Big Things
Encroachment in Data Processing using Big Data Technology
Big Data: Issues and Challenges
IRJET- Scope of Big Data Analytics in Industrial Domain
Research paper on big data and hadoop
RESEARCH IN BIG DATA – AN OVERVIEW
Research in Big Data - An Overview
RESEARCH IN BIG DATA – AN OVERVIEW
Real World Application of Big Data In Data Mining Tools

More from Editor IJMTER (20)

PDF
A NEW DATA ENCODER AND DECODER SCHEME FOR NETWORK ON CHIP
PDF
A RESEARCH - DEVELOP AN EFFICIENT ALGORITHM TO RECOGNIZE, SEPARATE AND COUNT ...
PDF
Analysis of VoIP Traffic in WiMAX Environment
PDF
A Hybrid Cloud Approach for Secure Authorized De-Duplication
PDF
Aging protocols that could incapacitate the Internet
PDF
A Cloud Computing design with Wireless Sensor Networks For Agricultural Appli...
PDF
A CAR POOLING MODEL WITH CMGV AND CMGNV STOCHASTIC VEHICLE TRAVEL TIMES
PDF
Sustainable Construction With Foam Concrete As A Green Green Building Material
PDF
USE OF ICT IN EDUCATION ONLINE COMPUTER BASED TEST
PDF
Textual Data Partitioning with Relationship and Discriminative Analysis
PDF
Testing of Matrices Multiplication Methods on Different Processors
PDF
Survey on Malware Detection Techniques
PDF
SURVEY OF TRUST BASED BLUETOOTH AUTHENTICATION FOR MOBILE DEVICE
PDF
SURVEY OF GLAUCOMA DETECTION METHODS
PDF
Survey: Multipath routing for Wireless Sensor Network
PDF
Step up DC-DC Impedance source network based PMDC Motor Drive
PDF
SPIRITUAL PERSPECTIVE OF AUROBINDO GHOSH’S PHILOSOPHY IN TODAY’S EDUCATION
PDF
Software Quality Analysis Using Mutation Testing Scheme
PDF
Software Defect Prediction Using Local and Global Analysis
PDF
Software Cost Estimation Using Clustering and Ranking Scheme
A NEW DATA ENCODER AND DECODER SCHEME FOR NETWORK ON CHIP
A RESEARCH - DEVELOP AN EFFICIENT ALGORITHM TO RECOGNIZE, SEPARATE AND COUNT ...
Analysis of VoIP Traffic in WiMAX Environment
A Hybrid Cloud Approach for Secure Authorized De-Duplication
Aging protocols that could incapacitate the Internet
A Cloud Computing design with Wireless Sensor Networks For Agricultural Appli...
A CAR POOLING MODEL WITH CMGV AND CMGNV STOCHASTIC VEHICLE TRAVEL TIMES
Sustainable Construction With Foam Concrete As A Green Green Building Material
USE OF ICT IN EDUCATION ONLINE COMPUTER BASED TEST
Textual Data Partitioning with Relationship and Discriminative Analysis
Testing of Matrices Multiplication Methods on Different Processors
Survey on Malware Detection Techniques
SURVEY OF TRUST BASED BLUETOOTH AUTHENTICATION FOR MOBILE DEVICE
SURVEY OF GLAUCOMA DETECTION METHODS
Survey: Multipath routing for Wireless Sensor Network
Step up DC-DC Impedance source network based PMDC Motor Drive
SPIRITUAL PERSPECTIVE OF AUROBINDO GHOSH’S PHILOSOPHY IN TODAY’S EDUCATION
Software Quality Analysis Using Mutation Testing Scheme
Software Defect Prediction Using Local and Global Analysis
Software Cost Estimation Using Clustering and Ranking Scheme

Recently uploaded (20)

PDF
Mitigating Risks through Effective Management for Enhancing Organizational Pe...
DOCX
573137875-Attendance-Management-System-original
PPTX
CARTOGRAPHY AND GEOINFORMATION VISUALIZATION chapter1 NPTE (2).pptx
PPTX
web development for engineering and engineering
PPTX
Internet of Things (IOT) - A guide to understanding
PPTX
Lecture Notes Electrical Wiring System Components
PPTX
UNIT-1 - COAL BASED THERMAL POWER PLANTS
PPTX
Welding lecture in detail for understanding
PDF
Enhancing Cyber Defense Against Zero-Day Attacks using Ensemble Neural Networks
PPTX
bas. eng. economics group 4 presentation 1.pptx
PPTX
M Tech Sem 1 Civil Engineering Environmental Sciences.pptx
PPTX
Construction Project Organization Group 2.pptx
PPTX
Infosys Presentation by1.Riyan Bagwan 2.Samadhan Naiknavare 3.Gaurav Shinde 4...
PDF
The CXO Playbook 2025 – Future-Ready Strategies for C-Suite Leaders Cerebrai...
PPT
Project quality management in manufacturing
PDF
keyrequirementskkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkk
PDF
Model Code of Practice - Construction Work - 21102022 .pdf
PDF
PRIZ Academy - 9 Windows Thinking Where to Invest Today to Win Tomorrow.pdf
PPTX
Recipes for Real Time Voice AI WebRTC, SLMs and Open Source Software.pptx
PDF
Mohammad Mahdi Farshadian CV - Prospective PhD Student 2026
Mitigating Risks through Effective Management for Enhancing Organizational Pe...
573137875-Attendance-Management-System-original
CARTOGRAPHY AND GEOINFORMATION VISUALIZATION chapter1 NPTE (2).pptx
web development for engineering and engineering
Internet of Things (IOT) - A guide to understanding
Lecture Notes Electrical Wiring System Components
UNIT-1 - COAL BASED THERMAL POWER PLANTS
Welding lecture in detail for understanding
Enhancing Cyber Defense Against Zero-Day Attacks using Ensemble Neural Networks
bas. eng. economics group 4 presentation 1.pptx
M Tech Sem 1 Civil Engineering Environmental Sciences.pptx
Construction Project Organization Group 2.pptx
Infosys Presentation by1.Riyan Bagwan 2.Samadhan Naiknavare 3.Gaurav Shinde 4...
The CXO Playbook 2025 – Future-Ready Strategies for C-Suite Leaders Cerebrai...
Project quality management in manufacturing
keyrequirementskkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkk
Model Code of Practice - Construction Work - 21102022 .pdf
PRIZ Academy - 9 Windows Thinking Where to Invest Today to Win Tomorrow.pdf
Recipes for Real Time Voice AI WebRTC, SLMs and Open Source Software.pptx
Mohammad Mahdi Farshadian CV - Prospective PhD Student 2026

A Survey on Big Data Mining Challenges

  • 1. Scientific Journal Impact Factor (SJIF): 1.711 International Journal of Modern Trends in Engineering and Research www.ijmter.com @IJMTER-2014, All rights Reserved 252 e-ISSN: 2349-9745 p-ISSN: 2393-8161 A Survey on Big Data Mining Challenges Irin Ani John1 , Tintu Alphonsa Thomas2 1,2 Department Of Computer Science And Engineering Amal Jyothi College Of Engineering Kanjirappally , Kottayam, India Abstract— Big Data is the new technology or science to make the well informed decision in business or any other science discipline with huge volume of data from new sources of heterogeneous data. . Such new sources include blogs, online media, social network, sensor network, image data and other forms of data which vary in volume, structure, format and other factors. Big Data applications are increasingly adopted in all science and engineering domains, including space science, biomedical sciences and astronomic and deep space studies. The major challenges of big data mining are in data accessing and processing, data privacy and mining algorithms. This paper includes the information about what is big data, data mining with big data, the challenges in big data mining and what are the currently available solutions to meet those challenges. Keywords—Big Data, Big Data Mining Algorithms, MPI, MapReduce, Dryad. I. INTRODUCTION Recent years have witnessed a dramatic increase in our ability to collect data from different sensors, instruments, in different formats from independent or connected applications. This data volume has outgrown our ability to process, analyse, store and understand these datasets. Consider the internet data, the web pages indexed by Google were around one million in 1998, but quickly reached 1 billion in 2000 and exceeded 1 trillion in 2008. This rapid expansion is accelerated by the dramatic increase in acceptance of social networking application such as Facebook, Twitter, etc, that allows user to create contents freely and amplify the already huge web volume. Furthermore with mobile phones becoming the sensory gateway to get real time data in people from different aspects, the vast amount of data that mobile carried can potentially process to improve our daily life has significantly outpaced our past call data record based processing for billing purposes only. People and devices (from home, to cars, to buses, railway stations and airport) are all loosely connected. Trillions of such connected components will generate a huge data ocean, and valuable information must be discovered from the data to help improve quality of life and make our world a better place. II. CHARACTERISTICS OF BIG DATA The major characteristics can be defined by HACE theorem:[1] Big Data starts with large-volume heterogeneous, autonomous sources with distributed and decentralized control and seeks to explore complex and evolving relationships among data. These characteristics make it an extreme challenge for discovering useful knowledge from the Big Data. The heterogeneous characteristics of the Big Data is because different information collectors prefer their own schemata or protocols for data recording, and the nature of different applications
  • 2. International Journal of Modern Trends in Engineering and Research (IJMTER) Volume 02, Issue 01, [January - 2015] e-ISSN: 2349-9745, p-ISSN: 2393-8161 @IJMTER-2014, All rights Reserved 253 also results in diverse data representations. Being autonomous, each data source is able to generate and collect information without involving (or relying on) any centralised control. While the volume of the data grows, the complexity and the relationships underneath the data will also increases exponentially. III. CHALLENGES OF BIG DATA With the development of Internet services, indexes and queried contents were rapidly growing. Therefore, search engine companies had to face the challenges of handling such big data. Google created GFS[3] and MapReduce programming models[2] to cope with the challenges brought about by data management and analysis at the Internet scale. Some obstacles[4][5][6] in the development of big data applications are : A. Data representation: Many datasets have certain levels of heterogeneity in type, structure, semantics, organization, granularity, and accessibility. Data representation aims to make data more meaningful for computer analysis and user interpretation. Efficient data representation shall reflect data structure, class, and type, as well as integrated technologies, so as to enable efficient operations on different datasets. B. Redundancy reduction and data compression: Generally, there is a high level of redundancy in datasets. Redundancy reduction and data compression is effective to reduce the indirect cost of the entire system on the premise that the potential values of the data are not affected. For example, most data generated by sensor networks are highly redundant, which may be filtered and compressed at orders of magnitude. C. Data life cycle management: Compared with the relatively slow advances of storage systems, pervasive sensing and computing are generating data at unprecedented rates and scales. We are confronted with a lot of pressing challenges, one of which is that the current storage system could not support such massive data. D. Analytical mechanism: The analytical system of big data shall process masses of heterogeneous data within a limited time. However, traditional RDBMSs are strictly designed with a lack of scalability and expandability, which could not meet the performance requirements. Non-relational databases have shown their unique advantages in the processing of unstructured data and started to become mainstream in big data analysis. Even so, there are still some problems of non- relational database in their performance and particular applications. We shall find a compromising solution between RDBMSs and non-relational databases. For example, some enterprises have utilized a mixed database architecture that integrates the advantages of both types of databases. E. Data Confidentiality: Most big data service providers or owners at present could not effectively maintain and analyze such huge datasets because of their limited capacity. They must rely on professionals or tools to analyze such data. These data contains details of the lowest granularity and some sensitive information such as credit card numbers. Therefore, analysis of big data may be delivered to a third party for processing only when proper preventive measures are taken to protect such sensitive data, to ensure its safety. F. Expendability and scalability: The analytical system of big data must support present and future datasets. The analytical algorithm must be able to process increasingly expanding and more complex datasets.
  • 3. International Journal of Modern Trends in Engineering and Research (IJMTER) Volume 02, Issue 01, [January - 2015] e-ISSN: 2349-9745, p-ISSN: 2393-8161 @IJMTER-2014, All rights Reserved 254 The major challenges[1] of data mining with big data are in Big data mining platform,in information sharing and data privacy and Mining algorithms. A. Big Data Mining Platform The mining platform is needed to have efficient access to the data and computing processors.For Big Data mining, because data scale is far beyond the capacity that a single personal computer (PC) can handle, a typical Big Data processing framework will rely on cluster computers with a high- performance computing platform, with a data mining task being deployed by running some parallel programming tools, such as MapReduce or Enterprise Control Language (ECL),DryadLINQ[7], on a large number of computing nodes. B. Information Sharing and Data Privacy A real-world concern is that Big Data applications are related to sensitive information, such as banking transactions and medical records. Simple data exchanges or transmissions do not resolve privacy concerns. To protect privacy, two common approaches are to 1) restrict access to the data, such as adding certification or access control to the data entries, so sensitive information is accessible by a limited group of users only, and 2) anonymize data fields such that sensitive information cannot be pinpointed to an individual record[8]. The most common k-anonymity privacy measure is to ensure that each individual in the database must be indistinguishable from k-1 others. Common anonymization approaches are to use suppression, generalization, perturbation, and permutation to generate an altered version of the data. C. Mining Algorithms As Big Data applications are featured with autonomous sources and decentralized controls, aggregating distributed data sources to a centralized site for mining is systematically prohibitive due to the potential transmission cost and privacy concerns. A Big Data mining system has to enable an information exchange and fusion mechanism to ensure that all distributed sites can work together to achieve a global optimization goal. Spare, uncertain, and incomplete data are defining features for Big Data applications. The rise of Big Data is driven by the rapid increasing of complex data and their changes in volumes and in nature. Inspired by the above challenges, many data mining methods have been developed to find interesting knowledge from Big Data with complex relationships and dynamically changing volumes. This has also spawned new computer architectures for real-time data-intensive processing, such as the open source Apache Hadoop[9] project that runs on high- performance clusters. IV. BIG DATA ANALYTIC METHODS At present, the main processing methods of big data are shown as follows. A. Bloom Filter: Bloom Filter consists of a series of Hash functions. The principle of Bloom Filter is to store Hash values of data other than data itself by utilising a bit array, which is in essence a bitmap index that uses Hash functions to conduct lossy compression storage of data. It has such advantages as high space efficiency and high query speed, but also has some disadvantages in misrecognition and deletion. B. Hashing: it is a method that essentially transforms data into shorter fixed-length numerical values or index values. Hashing has such advantages as rapid reading, writing, and high query speed, but it is hard to find a sound Hash function.
  • 4. International Journal of Modern Trends in Engineering and Research (IJMTER) Volume 02, Issue 01, [January - 2015] e-ISSN: 2349-9745, p-ISSN: 2393-8161 @IJMTER-2014, All rights Reserved 255 C. Index: index is always an effective method to reduce the expense of disk reading and writing, and improve insertion, deletion, modification, and query speeds in both traditional relational databases that manage structured data, and other technologies that manage semi-structured and unstructured data. However, index has a disadvantage that it has the additional cost for storing index files which should be maintained dynamically when data is updated. D. Triel: also called trie tree, a variant of Hash Tree. It is mainly applied to rapid retrieval and word frequency statistics. The main idea of Triel is to utilise common prefixes of character strings to reduce comparison on character strings to the greatest extent, so as to improve query efficiency. E. Parallel Computing: compared to traditional serial computing, parallel computing refers to simultaneously utilising several computing resources to complete a computation task. Its basic idea is to decompose a problem and assign them to several separate processes to be independently completed, so as to achieve coprocessing. Presently, some classic parallel computing models include MPI (Message Passing Interface), MapReduce, and Dryad . Although the parallel computing systems or tools, such as MapReduce or Dryad, are useful for big data analysis, they are low levels tools that are hard to learn and use. Therefore, some high-level parallel programming tools or languages are being developed based on these systems. Such high- level languages include Sawzall, Pig, and Hive used for MapReduce, as well as Scope and DryadLINQ used for Dryad. MapReduce: MapReduce is a simple but powerful programming model for large-scale computing using a large number of clusters of commercial PCs to achieve automatic parallel processing and distribution. In MapReduce, computing model only has two functions, i.e., Map and Reduce, both of which are programmed by users. The Map function processes input key-value pairs and generates intermediate key-value pairs. Then,MapReduce will combine all the intermediate values related to the same key and transmit them to the Reduce function, which further compress the value set into a smaller set. MapReduce has the advantage that it avoids the complicated steps for developing parallel applications, e.g., data scheduling, fault-tolerance, and inter-node communications. The user only needs to program the two functions to develop a parallel application. The initial MapReduce framework did not support multiple datasets in a task, which has been mitigated by some recent enhancements. Over the past decades, programmers are familiar with the advanced declarative language of SQL, often used in a relational database, for task description and dataset analysis. However, the succinct MapReduce framework only provides two nontransparent functions, which cannot cover all the common operations. Therefore, programmers have to spend time on programming the basic functions, which are typically hard to be maintained and reused. In order to improve the programming efficiency, some advanced language systems have been proposed, e.g., Sawzall of Google, Pig Latin of Yahoo, Hive of Facebook, and Scope of Microsoft. Dryad: Dryad is a general-purpose distributed execution engine for processing parallel applications of coarse-grained data. The operational structure of Dryad is a directed acyclic graph, in which vertexes represent programs and edges represent data channels. Dryad executes operations on the vertexes in clusters and transmits data via data channels, including documents, TCP connections, and shared-memory FIFO. During operation, resources in a logic operation graph are automatically map to physical resources.
  • 5. International Journal of Modern Trends in Engineering and Research (IJMTER) Volume 02, Issue 01, [January - 2015] e-ISSN: 2349-9745, p-ISSN: 2393-8161 @IJMTER-2014, All rights Reserved 256 The operation structure of Dryad is coordinated by a central program called job manager, which can be executed in clusters or workstations through network. A job manager consists of two parts: 1) application codes which are used to build a job communication graph, and 2) program library codes that are used to arrange available resources. All kinds of data are directly transmitted between vertexes. Therefore, the job manager is only responsible for decision-making, which does not obstruct any data transmission. In Dryad, application developers can flexibly choose any directed acyclic graph to describe the communication modes of the application and express data transmission mechanisms. In addition, Dryad allows vertexes to use any amount of input and output data, while MapReduce supports only one input and output set. DryadLINQ is the advanced language of Dryad and is used to integrate the aforementioned SQL-like language execution environment. V. CONCLUSION This paper surveys various challenges in the storage and processing of big data. The major areas of challenges are big data mining platform, information sharing and privacy, and mining algorithms. The analysis of different big data mining platforms includes analysis about MPI, MapReduce and Dryad. We regard Big Data as an emerging trend and the need for Big Data mining is arising in all science and engineering domains. REFERENCES [1] X.Wu, X.Zhu, G.Wu, and Wei Ding, “Data Mining with Big Data,” IEEE Trans. Knowledge and Data Eng., vol. 26, no.1, Jan 2014. [2] Jeffrey Dean and Sanjay Ghemawat, “MapReduce: Simplified Data Processing on Large Cluster,” Google Inc. [3] Ghemawat S, Gobioff H, Leung S-T (2003) The google file system. In: ACM SIGOPS Operating Systems Review, vol 37. ACM, pp 29–43 [4] Labrinidis A, Jagadish HV (2012) Challenges and opportunities with big data. Proc VLDB Endowment 5(12):2032–2033. [5] Chaudhuri S, Dayal U, Narasayya V (2011) An overview of business intelligence technology. Commun ACM 54(8):88–98. [6] Agrawal D, Bernstein P, Bertino E, Davidson S, Dayal U, FranklinM, Gehrke J, Haas L, Halevy A, Han J et al (2012) Challenges and opportunities with big data. A community white paper developed by leading researches across the United States. [7] Isard M, Budiu M, Yu Y, Birrell A, Fetterly D (2007) Dryad: distributed data-parallel programs from sequential building blocks.ACM SIGOPS Oper Syst Rev 41(3):59–72. [8] G. Cormode and D. Srivastava, “Anonymized Data: Generation, Models, Usage,” Proc. ACM SIGMOD Int’l Conf. Management Data, pp. 1015-1018, 2009. [9] Mark Kerzner and Sujee Maniyam, “ Hadoop Illuminated,” eBook.