DATA MINING IN AGRICULTURE
PRESENTED BY:-
JHADE SUNIL
M.Sc. Agri stats.
PALB-6175
UAS GKVK
FLOW OF SEMINAR
 INTRODUCTION
 HISTORY
 STEPS IN DATA MINING
 GOALS OF DATA MINING
 TECHNIQUES USED IN DATA MINING
 DATA MINING METHODOLOGIES
 ROLE OF DATA MINING IN AGRICULTURE
 CASE STUDIES
 CONCLUSION
 REFERENCES
INTRODUCTION
 Agriculture is the backbone of the Indian nation.
 Demand for food is increasing while agricultural productivity remains low.
 Agricultural researchers are putting extra effort into raising production.
 Agricultural data are being acquired at a rapidly increasing rate, and useful information has to be extracted from them when needed.
 Data mining can be used to predict future trends in agricultural processes.
HISTORY
Evolution of Database Technology
Evolutionary step             Enabling technologies                               Product providers
Data collection (1960s)       Computers, tapes                                    IBM
Data access (1980s)           Relational databases                                Oracle, Informix, IBM, Microsoft
Data warehousing (1990s)      OLAP, multidimensional databases, data warehouses   Pilot, Microstrategy
Data mining (emerging today)  Advanced algorithms, multiprocessor computers       Pilot, IBM, others
Measurement of data
This data occupies
Terabytes - 10^12 bytes
Petabytes - 10^15 bytes
Exabytes - 10^18 bytes
Zettabytes - 10^21 bytes
Yottabytes - 10^24 bytes
Knowledge discovery in databases (KDD)
1. Data cleaning:
To remove noise and inconsistent data.
2. Data integration:
To integrate (combine) multiple data sources.
3. Data selection:
To select the data relevant to the task.
4. Data transformation:
Summary, normalization, and aggregation operations are
performed to consolidate the data and convert it into a two-dimensional (tabular) form.
Steps of KDD (Cont..)
5. Data mining:
Intelligent methods are applied to the data to discover
knowledge or patterns.
6. Pattern evaluation:
The interesting patterns are identified, for example by applying thresholds.
7. Knowledge presentation:
Visualization and presentation methods are used to
present the mined knowledge to the user.
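The first four KDD steps can be sketched in a few lines of Python. This is only an illustration: the records, the field names (`year`, `rain_cm`, `tons`) and the helper functions are assumptions made for the example, not part of the seminar.

```python
# Hypothetical records from two sources; all names and values are illustrative.
rainfall = [{"year": 2000, "rain_cm": 110.0},
            {"year": 2001, "rain_cm": None},      # missing value (noise)
            {"year": 2002, "rain_cm": 95.0}]
production = [{"year": 2000, "tons": 680.0},
              {"year": 2002, "tons": 550.0}]

def clean(rows):
    """Step 1: drop rows that contain missing values."""
    return [r for r in rows if all(v is not None for v in r.values())]

def integrate(left, right, key):
    """Step 2: join two sources on a shared key."""
    index = {r[key]: r for r in right}
    return [{**a, **index[a[key]]} for a in left if a[key] in index]

def select(rows, fields):
    """Step 3: keep only the task-relevant fields."""
    return [{f: r[f] for f in fields} for r in rows]

def normalize(rows, field):
    """Step 4: min-max scale one field to the range [0, 1]."""
    values = [r[field] for r in rows]
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1.0  # avoid division by zero for constant fields
    return [{**r, field: (r[field] - lo) / span} for r in rows]

merged = integrate(clean(rainfall), production, "year")
prepared = normalize(select(merged, ["rain_cm", "tons"]), "rain_cm")
print(prepared)
```

The prepared table is what a mining method (step 5) would then receive.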
VISUAL DATA MINING
What Is Data Mining?
Data mining is the process of extracting important, hidden, and
predictive information from large sets of data.
Data mining is…
 Knowledge discovery from databases.
 The analysis of huge amounts of data.
 A combination of statistics and probability analysis.
 An application of statistical methods.
Figure: data mining draws on statistics, machine learning, database technology, information sciences, artificial intelligence, visualization, and other disciplines.
Goals of Data Mining
 Identification: identify the existence of an item, an event, or an activity.
 Classification: partition the data into categories.
 Optimization: optimize the use of limited resources.
 Prediction: forecast how certain attributes within the data will behave in the future.
Why mine data? Scientific view point
 Data collected and stored at
enormous speeds (GB/hour)
o remote sensors on a satellite
o telescopes scanning the skies
o microarrays generating gene
expression data
o scientific simulations
generating terabytes of data
 Traditional techniques are infeasible for raw data at this scale.
 Data mining may help scientists
o in classifying and segmenting data
o in forming hypotheses
TECHNIQUES USED IN DATA MINING
PATTERN RECOGNITION using MACHINE LEARNING techniques
 CLASSIFICATION
 CLUSTERING
 ASSOCIATION RULES
 PREDICTION
1. CLASSIFICATION
 Classification techniques are designed for classifying
unknown samples using information provided by a set of
already classified samples.
Ex: classify countries based on climate, or classify cars based
on gas mileage.
Presentation: decision tree, classification rule, neural network,
support vector machine.
2. CLUSTERING
Clustering partitions a data set into meaningful, homogeneous
sub-classes called clusters.
• Clustering is unsupervised classification: there are no predefined classes.
• Classification proper is supervised: the classes are predefined.
A good clustering method will produce high quality clusters in
which:
• the intra-class similarity is high.
• the inter-class similarity is low.
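One way to check these two criteria is to compare average distances within and between clusters (small distance meaning high similarity). The sketch below assumes Euclidean distance and two hypothetical clusters `A` and `B`; both choices are illustrative.

```python
from itertools import combinations
from math import dist

# Hypothetical 2-D points already assigned to two clusters.
clusters = {"A": [(1.0, 1.0), (1.0, 2.0)],
            "B": [(8.0, 8.0), (9.0, 8.0)]}

def mean_intra_distance(clusters):
    """Average distance between points that share a cluster."""
    pairs = [dist(p, q)
             for pts in clusters.values()
             for p, q in combinations(pts, 2)]
    return sum(pairs) / len(pairs)

def mean_inter_distance(clusters):
    """Average distance between points that lie in different clusters."""
    pairs = [dist(p, q)
             for (_, a), (_, b) in combinations(clusters.items(), 2)
             for p in a for q in b]
    return sum(pairs) / len(pairs)

intra = mean_intra_distance(clusters)
inter = mean_inter_distance(clusters)
print(intra, inter)
```

A good clustering gives a small `intra` value and a large `inter` value.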
3. ASSOCIATION
 Association rule mining searches for unseen or desired patterns in large amounts of data.
 Association rules find the elements that co-occur repeatedly within a dataset consisting of many independent selections of elements, and express those co-occurrences as rules.
 Applications of association rule mining include market basket analysis and pathogen analysis (downy mildew and powdery mildew).
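Association rules are usually scored by support and confidence. The slides do not define these measures, so the sketch below uses the standard definitions on a hypothetical list of baskets.

```python
# Hypothetical market baskets; each basket is one independent selection of items.
baskets = [{"bread", "milk"},
           {"bread", "butter"},
           {"bread", "milk", "butter"},
           {"milk"}]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Support of the whole rule divided by support of its antecedent."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)

rule_support = support({"bread", "milk"}, baskets)
rule_confidence = confidence({"bread"}, {"milk"}, baskets)
print(rule_support, rule_confidence)
```

Here the rule "bread implies milk" holds in half of all baskets and in two thirds of the baskets that contain bread.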
4. PREDICTION
 Regression is a data mining (machine learning) technique used to fit an equation to a dataset.
 The simplest form, linear regression, uses the straight-line equation y = mx + c and determines approximate values for m and c, so that y can be calculated for a particular value of x.
 Multiple regression uses more than one input variable and allows more complex models to be fitted.
 Bootstrap algorithms can be used for small samples.
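A minimal sketch of fitting y = mx + c by ordinary least squares. The slides do not say how m and c are estimated, so least squares is an assumption here, and the data points are made up.

```python
def fit_line(xs, ys):
    """Return slope m and intercept c of the least-squares line y = m*x + c."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    m = sxy / sxx
    c = mean_y - m * mean_x
    return m, c

# Hypothetical data lying exactly on y = 2x + 1.
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]
m, c = fit_line(xs, ys)
prediction = m * 5 + c  # value of y predicted for x = 5
print(m, c, prediction)
```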
5.MACHINE LEARNING
 Ability to automatically learn to recognize complex
patterns and make intelligent decisions based on data.
6.PATTERN RECOGNITION
 Branch of machine learning that focuses on the
recognition of patterns and regularities in data.
MACHINE LEARNING METHODS
 Decision tree analysis
 K-nearest neighbor
 K-means
 Neural networks
 Bayes' theorem and Bayesian networks
 Support vector machine
 Fuzzy logic
 Weka tool
DECISION TREE ANALYSIS
 A decision tree is a decision support tool that uses a tree-like graph of decisions and their outcomes.
 It is relatively fast compared to other classification models.
 It obtains similar and sometimes better accuracy compared to other models.
 It can be converted into simple and easy-to-understand classification rules.
A decision tree is constructed in two phases:
Tree building (growing) phase:
Repeatedly partition the training data until all the
examples in each partition belong to one class or the
partition is sufficiently small.
Tree pruning phase:
Remove dependency on statistical noise or variation that
may be particular only to the training set.
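One partitioning step of the building phase can be sketched as a search for the best threshold on a single numeric attribute. The slides name no splitting criterion, so Gini impurity is an assumption here, and the rainfall values and class labels are hypothetical.

```python
def gini(labels):
    """Gini impurity: 0.0 when every label in the partition is the same."""
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_split(values, labels):
    """Return (threshold, weighted impurity) of the best split value <= threshold."""
    best = (None, gini(labels))
    for t in sorted(set(values))[:-1]:  # the largest value cannot split the data
        left = [l for v, l in zip(values, labels) if v <= t]
        right = [l for v, l in zip(values, labels) if v > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best[1]:
            best = (t, score)
    return best

# Hypothetical training data: seasonal rainfall and a yield class.
rain = [40, 55, 60, 90, 100, 120]
yield_class = ["low", "low", "low", "high", "high", "high"]
threshold, impurity = best_split(rain, yield_class)
print(threshold, impurity)
```

Growing a full tree repeats this step on each resulting partition until the partitions are pure or small enough.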
NEAREST NEIGHBOR and K-NEAREST NEIGHBOR
 The nearest neighbour rule achieves consistently high performance without a priori assumptions about the distributions from which the training examples are drawn.
 It is a supervised learning technique.
 A new sample is classified by the votes, optionally weighted, of its nearest neighbors.
 K-nearest neighbor is considered a lazy learning algorithm: it classifies a sample based on its similarity to its neighbors in the training data.
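A minimal k-nearest-neighbor sketch. It assumes Euclidean distance and a plain (unweighted) majority vote, which is a simplification of the weighted voting mentioned above; the soil samples and labels are hypothetical.

```python
from collections import Counter
from math import dist

def knn_predict(train, query, k=3):
    """Label query by majority vote among its k nearest training samples."""
    neighbors = sorted(train, key=lambda item: dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Hypothetical soil samples: two measured features and a soil type.
train = [((1.0, 1.0), "sandy"), ((1.2, 0.8), "sandy"), ((0.9, 1.1), "sandy"),
         ((5.0, 5.0), "clay"), ((5.2, 4.9), "clay"), ((4.8, 5.1), "clay")]

print(knn_predict(train, (1.1, 1.0)))
print(knn_predict(train, (5.0, 4.8)))
```

No model is built in advance; all the work happens at query time, which is why the method is called lazy.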
K-MEANS
This is a popular method for cluster analysis in data mining.
k-means clustering aims to partition n observations
into k clusters in which each observation belongs to
the cluster with the nearest mean, which serves as a prototype of
the cluster. This results in a partitioning of the data space
into Voronoi cells.
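The assign-then-update loop behind k-means can be sketched as follows. The starting centroids, the fixed iteration count and the points are assumptions chosen to keep the example short.

```python
from math import dist

def kmeans(points, centroids, iterations=10):
    """Alternate between assigning points to the nearest mean and updating the means."""
    for _ in range(iterations):
        groups = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: dist(p, centroids[i]))
            groups[nearest].append(p)
        # New centroid = coordinate-wise mean; an empty cluster keeps its old centroid.
        centroids = [tuple(sum(c) / len(g) for c in zip(*g)) if g else old
                     for g, old in zip(groups, centroids)]
    return centroids, groups

# Hypothetical 2-D observations forming two visible groups.
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0),
          (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
centroids, groups = kmeans(points, centroids=[(0.0, 0.0), (10.0, 10.0)])
print(centroids)
```

The result depends on the starting centroids, so in practice the procedure is often run from several starting points.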
NEURAL NETWORKS
An information processing paradigm that is inspired by the way
biological nervous systems, such as the brain, process information.
BAYESIAN NETWORK
 Originated from Bayes' theorem.
 Used to compute posterior probabilities.
For example: a Bayesian network could represent the
probabilistic relationships between diseases and symptoms.
Given symptoms, the network can be used to compute the
probabilities of the presence of various diseases.
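The disease and symptom example can be sketched with Bayes' rule for a single observed symptom. This is not a full Bayesian network, only the posterior calculation for one node, and every probability below is a made-up illustration.

```python
def posterior(priors, likelihoods):
    """Return p(h | x) for each hypothesis h.

    priors[h] is p(h); likelihoods[h] is p(x | h) for the observed symptom x.
    """
    evidence = sum(likelihoods[h] * priors[h] for h in priors)  # p(x)
    return {h: likelihoods[h] * priors[h] / evidence for h in priors}

# Hypothetical prior disease probabilities and symptom likelihoods.
priors = {"downy mildew": 0.1, "powdery mildew": 0.2, "healthy": 0.7}
likelihoods = {"downy mildew": 0.8, "powdery mildew": 0.5, "healthy": 0.05}

result = posterior(priors, likelihoods)
most_likely = max(result, key=result.get)
print(most_likely, result[most_likely])
```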
$$p(h_1 \mid x_i) = \frac{p(x_i \mid h_1)\, p(h_1)}{p(x_i)}, \qquad p(x_i) = \sum_{j=1}^{m} p(x_i \mid h_j)\, p(h_j)$$
SUPPORT VECTOR MACHINE
 Support Vector Machines (SVM) are a supervised learning technique for solving classification and regression problems.
 An SVM aims to find the hyperplane that best separates two classes of data.
SVM FOR CLASSIFICATION:
1 Linear Support Vector Machine for Separable Data
2 Linear Support Vector Machine for Non-Separable Data
3 Nonlinear Support Vector Machine
Linear Classifiers
f(x, w, b) = sign(w · x - b)
(Figure: scatter of data points, one marker denoting +1 and the other -1.)
How would you classify this data?
Linear Classifiers
Any of these candidate boundaries would be fine...
...but which is best?
Classifier Margin
Define the margin of a linear classifier as the width that the
boundary could be increased by before hitting a datapoint.
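The linear classifier f(x, w, b) = sign(w · x - b) and its margin can be sketched as below. Two assumptions are made: a score of exactly 0 is assigned to the +1 class, and the margin width is taken as twice the distance from the boundary to the nearest point, which holds when the boundary sits midway between the classes. The weights and points are hypothetical.

```python
from math import sqrt

def classify(x, w, b):
    """f(x, w, b) = sign(w . x - b); a score of exactly 0 is treated as +1."""
    score = sum(wi * xi for wi, xi in zip(w, x)) - b
    return 1 if score >= 0 else -1

def margin_width(points, w, b):
    """Twice the distance from the boundary w . x = b to the nearest point."""
    norm = sqrt(sum(wi * wi for wi in w))
    nearest = min(abs(sum(wi * xi for wi, xi in zip(w, p)) - b) / norm
                  for p in points)
    return 2 * nearest

# Hypothetical data: two points on each side of the line x1 + x2 = 2.
points = [(2.0, 2.0), (3.0, 3.0), (0.0, 0.0), (-1.0, 0.0)]
w, b = (1.0, 1.0), 2.0

labels = [classify(p, w, b) for p in points]
width = margin_width(points, w, b)
print(labels, width)
```

A maximum margin classifier would choose `w` and `b` so that this width is as large as possible.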
Maximum Margin
The maximum margin linear classifier is the linear classifier
with the maximum margin. This is the simplest kind of SVM,
called a linear SVM (LSVM).
Why Maximum Margin?
Support vectors are those datapoints that the margin pushes up against.
1. Intuitively this feels safest.
2. If we’ve made a small error in the location of the boundary (it’s been jolted in its perpendicular direction), this gives us the least chance of causing a misclassification.
3. LOOCV (leave-one-out cross-validation) is easy, since the model is immune to removal of any non-support-vector data points.
4. There’s some theory (using VC dimension) that is related to (but not the same as) the proposition that this is a good thing.
5. Empirically it works very well.
FUZZY LOGIC
 Fuzzy logic is an approach to computing based on “degrees of truth” rather than the usual “true or false”.
 In Boolean logic the truth values of variables may only be the integer values 0 or 1; in fuzzy logic a truth value may be any real number between 0 and 1.
 It quantifies a parameter by its degree of membership in a set, which is not the same as a probability.
Ex: if cold is a fuzzy set, exact temperature values might be mapped to
the fuzzy set as follows:
15 degrees → 0.2 (slightly cold)
10 degrees → 0.5 (quite cold)
0 degrees → 1 (totally cold)
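The mapping above can be turned into a membership function by interpolating between the three listed values. The linear interpolation and the extra anchor at 20 degrees (membership 0) are assumptions added for the sketch; the slide gives only the three points.

```python
# (temperature, membership) anchors. The first three come from the example;
# the last one (20 degrees -> 0.0) is an assumption so the function reaches zero.
ANCHORS = [(0.0, 1.0), (10.0, 0.5), (15.0, 0.2), (20.0, 0.0)]

def cold_membership(temp):
    """Degree to which temp belongs to the fuzzy set 'cold', between 0 and 1."""
    if temp <= ANCHORS[0][0]:
        return ANCHORS[0][1]
    if temp >= ANCHORS[-1][0]:
        return ANCHORS[-1][1]
    # Linear interpolation between the two anchors that bracket temp.
    for (t0, m0), (t1, m1) in zip(ANCHORS, ANCHORS[1:]):
        if t0 <= temp <= t1:
            return m0 + (m1 - m0) * (temp - t0) / (t1 - t0)

for t in (-5, 5, 10, 15, 25):
    print(t, cold_membership(t))
```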
WEKA SOFTWARE
Waikato Environment for Knowledge Analysis
o Machine learning software written in Java.
o Free software.
o Can analyze data from agricultural domains.
o Provides visualization tools and algorithms for data analysis and predictive modeling.
ADVANTAGES OF WEKA
 Runs on almost any modern computing platform.
 A comprehensive collection of data preprocessing and
modeling techniques.
 Ease of use due to graphical user interfaces.
 Supports several standard data mining tasks, more
specifically data preprocessing, clustering, classification,
regression, visualization, and feature selection.
COMMERCIAL TOOLS
• Oracle Data Miner
- http://www.oracle.com
• Data to Knowledge
- http://alg.nsca.uluc.edu
• SAS
- http://www.sas.com
• Clementine
- http://spss.com/clemetine
• Intelligent Miner
- http://www.306.ibm.com/software
ROLE OF DATA MINING IN AGRICULTURE
o Crop yield estimation.
o Estimation of damage caused by pests.
o Mushroom grading.
o Spatial data mining to reveal interesting patterns related to agriculture.
o Crop price prediction.
o Characterization of agricultural soil profiles.
ROLE IN AGRICULTURE DOMAIN
Data mining methodology   Application
Neural networks           Weather forecasts, prediction of rainfall.
K-means                   Classifying soil in combination with GPS, the wine fermentation problem, yield prediction.
Fuzzy sets                Detecting weeds in precision agriculture.
K-nearest neighbor        Simulating daily precipitation and other weather conditions.
TYPICAL APPLICATION OF DATA MINING
Case study -1
• Data mining techniques were adopted in order to predict crop production.
• The values estimated by density-based clustering were compared with the values estimated by multiple linear regression.
Hyderabad    Ramesh et al., 2013
OVERVIEW OF DATA
• The data cover the years 1955 to 2009 for East Godavari district of Andhra Pradesh in India.
• The information was gathered from three government units: the Indian Meteorological Department, the Statistical Institution, and the Agricultural Department.
• Each area in this collection is identified by the longitude and latitude of the region.
• The estimation of crop production is analyzed with respect to eight parameters, namely Year, Rainfall, Area of Sowing, Production, Yield, and Fertilizers (nitrogen, phosphorus, and potassium).
The Year attribute specifies the year for which the data are available.
The Rainfall attribute specifies the rainfall in East Godavari in the specified year, in centimeters.
The Area of Sowing attribute specifies the total area sown in East Godavari district in the specified year, in hectares.
The Production attribute specifies the production of the crop in East Godavari district in the specified year, in metric tons.
The Yield attribute is specified in kilograms per hectare.
The Fertilizers attributes are specified in tons for the specified year.
METHODOLOGY
• A statistical method, multiple linear regression, and a data mining method, density-based clustering, were taken up for the estimation of crop production.
Multiple Linear Regression:
Multiple linear regression (MLR) is the method used to model
the linear relationship between a dependent variable and one or
more independent variables. The dependent variable is
sometimes termed the predictand, i.e. rainfall, and the independent
variables are called predictors, i.e. Year, Area of sowing,
Production.
$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_p X_p + \varepsilon$$
METHODOLOGY (Cont’d……)
Density-based clustering technique:
 Density is usually defined as the number of objects in a particular neighborhood of a data object.
 The key idea of density-based clustering techniques is that, for each point of a cluster, the neighborhood of a given unit distance contains at least a minimum number of points.
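The density condition can be sketched by counting neighbors within a fixed distance. The parameter names `eps` and `min_pts`, the choice to count a point as its own neighbor, and the sample points are all assumptions made for the illustration; the slides do not name a specific algorithm.

```python
from math import dist

def dense_points(points, eps, min_pts):
    """Return the points whose eps-neighborhood holds at least min_pts points.

    Each point counts as a member of its own neighborhood.
    """
    dense = []
    for p in points:
        neighbors = [q for q in points if dist(p, q) <= eps]
        if len(neighbors) >= min_pts:
            dense.append(p)
    return dense

# Hypothetical observations: three close together and one isolated.
points = [(1.0, 1.0), (1.2, 1.1), (0.9, 1.3), (5.0, 5.0)]
dense = dense_points(points, eps=0.5, min_pts=3)
print(dense)
```

Points that satisfy the condition can seed a cluster, while isolated points such as (5.0, 5.0) are left out as noise.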
Table 1: Exact production and estimated values using the Multiple Linear
Regression technique (40 years interval).
Observation Year   Production (Exact)   Production (Estimation)   Percentage of difference
2000               683423               592461                    13
2001               579850               566050                    2
2002               551115               579433                    -5
2003               762453               722638                    5
2004               743614               742752                    0
2005               348727               399062                    -14
2006               547716               551541                    -1
2007               691069               691069                    3
2008               716609               697227                    3
2009               616567               633494                    -3
The estimates obtained with the Multiple Linear Regression technique
differ from the exact values by between -14% and +13% for the 40 years interval.
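The percentage column of Table 1 can be reproduced as a rounded relative difference. The formula (exact minus estimate, divided by exact) is an assumption inferred from the tabulated values; it is not stated in the slides. Only the first five rows are used here.

```python
# First five rows of Table 1 (years 2000 to 2004).
exact = [683423, 579850, 551115, 762453, 743614]
estimated = [592461, 566050, 579433, 722638, 742752]

def percent_difference(exact_value, estimate):
    """Relative difference in percent, rounded to the nearest integer."""
    return round((exact_value - estimate) / exact_value * 100)

differences = [percent_difference(e, m) for e, m in zip(exact, estimated)]
print(differences)
```

A positive value means the model underestimated production; a negative value means it overestimated.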
Table 2: Exact production and estimated values using the Density-based
clustering technique (6 clusters).
Observation Year   Production (Exact)   Production (Estimation)   Percentage of difference
2000               683423               666011                    3
2001               579850               651103                    -12
2002               551115               566972                    -3
2003               762453               703914                    8
2004               743614               737897                    1
2005               348727               392770                    -13
2006               547716               534709                    2
2007               691069               791589                    -11
2008               716609               676321                    6
2009               616567               695574                    -13
The estimates obtained with the Density-based clustering technique
differ from the exact values by between -13% and +8% for the 6-cluster
approximation.
Table 3: Comparison between exact production and the values estimated
using the Multiple Linear Regression technique and the Density-based
clustering technique.
Observation Year   Production (Exact)   Estimation (Multiple Linear Regression)   Estimation (Density-based clustering)
2000               683423               592461                                    666011
2001               579850               566050                                    651103
2002               551115               579433                                    566972
2003               762453               722638                                    703914
2004               743614               742752                                    737897
2005               348727               399062                                    392770
2006               547716               551541                                    534709
2007               691069               691069                                    791589
2008               716609               697227                                    676321
2009               616567               633494                                    695574
RESULT
CONCLUSION
Initially, the statistical model, the Multiple Linear Regression technique, was
applied to the existing data. The results so obtained were verified and
analyzed using a data mining technique, namely Density-based
clustering.
In this procedure the results of the two methods were compared
for a specific region, i.e. East Godavari district of Andhra
Pradesh in India. A similar process was adopted for all the districts of
Andhra Pradesh to improve and authenticate the validity of yield
prediction, which is useful to the farmers of Andhra Pradesh for
predicting the yield of a specific crop.
CONCLUSION
 Data mining is a boon for handling large data in agriculture.
 Extraction of knowledge is a big challenge.
 Many data mining techniques have been developed
to tackle this challenge.
 Skill is also required to handle the tools and
techniques.
REFERENCE
BHARGAVI, P. AND JYOTHI, S., 2009, Applying Naive Bayes data mining technique
for classification of agricultural land soils. International Journal of Computer Science
and Network Security, 9(8): 117–22.
FAYYAD, U., PIATETSKY-SHAPIRO, G. AND SMYTH, P., 1996, From data mining
to knowledge discovery in databases. AI magazine, 17(3): 37–54.
PATEL, H. AND PATEL, D., 2014, A Brief survey of data mining techniques applied to
agricultural data. International Journal of Computer Applications, 95(9): 1–3.
PATIL, T. R. AND SHEREKAR, S. S., 2013, Performance analysis of Naive Bayes and
J48 classification algorithm for data classification. International Journal of Computer
Science and Applications, 6(2): 2561–61.
RAMESH, V. AND RAMAR, K., 2011, Classification of agricultural land soils: a
data mining approach. Agricultural Journal, 6(3): 82–6.
RAMESH, D. AND VARHADAN, V. B., 2013, Data Mining Techniques and
Applications to Agricultural Yield Data. International Journal of Advanced Research
in Computer and Communications Engineering, 2 (9): 3477-3482.
SHARMA, L. AND MEHTA, N., 2012, Data Mining Techniques: A Tool for
Knowledge Management System in Agriculture. International Journal of Scientific
and Technology Research, 1(5): 67-73.
VEENADHARI, S., MISRA, B. AND SINGH, C. D., 2011, Data mining Techniques
for Predicting Crop Productivity – A review article. International Journal of
Computer Science and Technology, 2(1): 114-118.

Data mining

  • 1. DATA MINING IN AGRICULTURE PRESENTED BY:- JHADE SUNIL M.Sc. Agri stats. PALB-6175 UAS GKVK
  • 2. FLOW OF SEMINAR  INTRODUCTION  HISTORY  STEPS IN DATA MINING  GOALS OF DATA MINING  TECHNIQUES USED IN DATA MINING  DATA MINING METHODOLOGIES  ROLE OF DATA MINING IN AGRICULTURE  CASE STUDIES  CONCLUSION  REFERENCES
  • 3. INTRODUCTION  Agriculture is the backbone of the Indian nation.  Demand of food is increasing due to low Productivity of agriculture production.  Agricultural researches put extra effort for more production.  Agricultural data acquisition rapidly increases these data to be extracted when needed .  Data mining can be used for predicting the future trends of agricultural processes.
  • 5. Evolution of Database Technology Evolutionary steps Enabling technologies Product providers Data collection (1960s) Computers, tapes IBM Data access (1980s) Relational databases Oracle, Informix, IBM, Microsoft Data warehousing (1990s) OLAP, multi dimensional databases, data warehouses Pilot, Microstrategy Data mining (emerging today) Advanced algorithms, multiprocessor computers Pilot, IBM, others
  • 6. Measurement of data This data occupies Terabytes - 1012 bytes Petabytes - 1015 bytes Exabytes- 1018 bytes Zettabytes - 1021 bytes Yottabytes - 1024 bytes
  • 7. knowledge discovery in databases 1. Data cleaning: To remove noise and inconsistent data 2. Data integration: To integrate (combine) multiple data SOURCES 3. Data selection: The process of selecting task relevant data. 4. Data transformation: Summary normalization aggregation operations are Performed(convert data into two dimension form) and consolidate the data.
  • 8. Steps of KDD (Cont..) 5. Data mining: Intelligent methods are applied to the data to discover knowledge or patterns 6. Pattern evaluation: Evaluation of the interesting patterns by thresholding 7. Knowledge Discovery: Visualization and presentation methods are used to present the mined knowledge to the user.
  • 10. What Is Data Mining? Data mining is a process of extraction important and hidden predictive useful information from large sets of data is called data mining.
  • 11. Data mining is…  Knowledge discovery from database  Do analyze huge amounts of data.  Combination of statistics, probability analysis.  Employs statistical methods. DATA MINING STATISTICS MACHINE LEARNING DATABASE TECHNOLOGY INFORMATION SCIENCES ARTIFICAL INTELLIGENCE VISUALIZATION OTHER DISCIPLINES
  • 12. Goals of Data Mining  Identification: identify the existence of an item, an event, an activity.  Classification: partition the data into categories.  Optimization: optimize the use of limited resources.  Prediction: how certain attributes within the data will behave in the future.
  • 13. Why mine data? Scientific view point  Data collected and stored at enormous speeds (GB/hour) o remote sensors on a satellite o telescopes scanning the skies o microarrays generating gene expression data o scientific simulations generating terabytes of data  Traditional techniques infeasible for raw data  Data mining may help scientists o in classifying and segmenting data o in Hypothesis Formation
  • 14. TECHNIQUES USED IN DATA MINING PATTERN RECOGNITION using MACHINE LEARNING techniques  CLASSIFICATION  CLUSTERING  ASSOCIATION RULES  PREDICTION
  • 15. 1.CLASSIFICATION  Classification techniques are designed for classifying unknown samples using information provided by set of classified samples. Ex : classify countries based on climate, or classify cars based on gas mileage Presentation : decision tree, classification rule, neural network, support vector machine
  • 16. 2. CLUSTERING Clustering is a partitioning a homogeneous data set of meaningful sub-classes called clusters. • Clustering: unsupervised classification: no predefined classes. : supervised classification: predefined classes A good clustering method will produce high quality clusters in which: • the intra-class similarity is high. • the inter-class similarity is low.
  • 17. 3. ASSOCIATION  Association rule of data mining to search unseen or desired pattern among the extent amount of data.  Association rule used to find out the elements co-occur repeatedly within a dataset consisting of many independent selections of elements, and discover rules.  Application of association rules mining is the market basket analysis, Pathogen analysis(downy mildew and powdery mildew).
  • 18.  Regression is a data mining (machine learning) technique used to fit an equation to a dataset.  A straight line is given by the equation y = mx + c and determines the approximate values for m and c to calculate the value of y based on a particular value of x.  Multiple regression, uses more than one input variable and allows for the fitting of more complex models.  bootstrap algorithms for small sample 4. Prediction
  • 19. 5.MACHINE LEARNING  Ability to automatically learn to recognize complex patterns and make intelligent decisions based on data.
  • 20. 6.PATTERN RECOGNITION  Branch of machine learning that focuses on the recognition of patterns and regularities in data.
  • 21.  Decision tree analysis  K-nearest neighbor  K-means  Neural networks  Bayesian theorem and network  Support vector machine  Fuzzy logic  Weka Tool MACHINE LEARNING METHODS
  • 22. DECISION TREE ANALYSIS  A decision tree is a decision support tool that uses a tree-like graph  Relatively fast compared to other classification models.  Obtain similar and sometimes better accuracy compared to other models.  Can be converted into simple and easy to understand classification rules
  • 23. A decision tree is constructed in two phases: Tree Building (growing)Phase: Repeatedly partition the training data until all the examples in each partition belong to one class or the partition is sufficiently small Tree Pruning Phase: Remove dependency on statistical noise or variation that may be particular only to the training set
  • 24. NEAREST NEIGHBOR and K-NEAREST NEIGHBOR  The Nearest Neighbour rule achieves consistently high performance, without a priori assumptions about the distributions from which the training examples are drawn  It supervised learning technique  Classified based on weighted average votes.  K-Nearest Neighbor is considered a lazy learning algorithm that classifies data sets based on their similarity with neighbors.
  • 25. K-MEANS This is a popular for cluster analysis in data mining. k- means clustering aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean, serving as a prototype of the cluster. This results in a partitioning of the data space into Voronoi cells.
  • 26. NEURAL NETWORKS An information processing paradigm that is inspired by the way biological nervous systems, such as the brain process information
  • 27. BAYESIAN NETWORK  Originated from Bayes’ theorm  Also known as posterior probability For example: a Bayesian network could represent the probabilistic relationships between diseases and symptoms. Given symptoms, the network can be used to compute the probabilities of the presence of various diseases. )p(x )p(h)h|p(x )x|p(h ).p(h)h|p(x)p(x i 11i i1 m 1j jjii  
  • 28. SUPPORT VECTOR MACHINE  Support Vector Machines (SVM) supervised learning technique for solving classification and regression problem.  Svm aims to find the hyperplane that best separates two classes of data SVM FOR CLASSIFICATION: 1 Linear Support Vector Machine for Separable Data 2 Linear Support Vector Machine for Non-Separable Data 3 Nonlinear Support Vector Machine
  • 29. Linear Classifiers fx a yest denotes +1 denotes -1 f(x,w,b) = sign(w. x - b) How would you classify this data?
  • 30. Linear Classifiers fx a yest denotes +1 denotes -1 f(x,w,b) = sign(w. x - b) How would you classify this data?
  • 31. Linear Classifiers fx a yest denotes +1 denotes -1 f(x,w,b) = sign(w. x - b) How would you classify this data?
  • 32. Linear Classifiers fx a yest denotes +1 denotes -1 f(x,w,b) = sign(w. x - b) Any of these would be fine.. ..but which is best?
  • 33. Classifier Margin fx a yest denotes +1 denotes -1 f(x,w,b) = sign(w. x - b) Define the margin of a linear classifier as the width that the boundary could be increased by before hitting a datapoint.
  • 34. Maximum Margin fx a yest denotes +1 denotes -1 f(x,w,b) = sign(w. x - b) The maximum margin linear classifier is the linear classifier with the, um, maximum margin. This is the simplest kind of SVM (Called an LSVM) Linear SVM
  • 35. Why Maximum Margin? denotes +1 denotes -1 f(x,w,b) = sign(w. x - b) The maximum margin linear classifier is the linear classifier with the, um, maximum margin. This is the simplest kind of SVM (Called an LSVM) Support Vectors are those datapoints that the margin pushes up against 1. Intuitively this feels safest. 2. If we’ve made a small error in the location of the boundary (it’s been jolted in its perpendicular direction) this gives us least chance of causing a misclassification. 3. LOOCV is easy since the model is immune to removal of any non- support-vector data points. 4. There’s some theory (using VC dimension) that is related to (but not the same as) the proposition that this is a good thing. 5. Empirically it works very very well.
  • 37. FUZZY LOGIC
      o Fuzzy logic is an approach to computing based on "degrees of truth" rather than the usual "true or false".
      o In Boolean logic the truth value of a variable may only be 0 or 1; in fuzzy logic it may be any real number between 0 and 1.
      o It uses a degree of membership to quantify a parameter.
      Ex: if "cold" is a fuzzy set, exact temperature values might be mapped to the fuzzy set as follows:
        15 degrees → 0.2 (slightly cold)
        10 degrees → 0.5 (quite cold)
        0 degrees → 1 (totally cold)
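A membership function for the "cold" example can be sketched as a piecewise-linear curve through the three values given on the slide. The function name `cold_membership` and the upper anchor (20 degrees → 0) are assumptions added for illustration; the slide does not say where membership reaches zero.

```python
from bisect import bisect_right

# (temperature in degrees, membership in the fuzzy set "cold").
# The first three anchors come from the slide; (20, 0.0) is an assumed end point.
ANCHORS = [(0.0, 1.0), (10.0, 0.5), (15.0, 0.2), (20.0, 0.0)]

def cold_membership(temp):
    """Degree (between 0 and 1) to which `temp` belongs to the fuzzy set 'cold'."""
    if temp <= ANCHORS[0][0]:
        return ANCHORS[0][1]
    if temp >= ANCHORS[-1][0]:
        return ANCHORS[-1][1]
    # Find the pair of anchors that brackets `temp` and interpolate linearly.
    i = bisect_right([t for t, _ in ANCHORS], temp)
    (t0, m0), (t1, m1) = ANCHORS[i - 1], ANCHORS[i]
    return m0 + (m1 - m0) * (temp - t0) / (t1 - t0)

for t in (0, 5, 10, 15, 25):
    print(t, round(cold_membership(t), 2))
```

Unlike a Boolean test such as `temp < 10`, the result changes gradually with temperature, which is the "degree of truth" idea.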
  • 38. WEKA SOFTWARE (Waikato Environment for Knowledge Analysis)
      o Machine learning software written in Java.
      o Free software.
      o Used to analyze data from agricultural domains.
      o Provides visualization tools and algorithms for data analysis and predictive modeling.
  • 39. ADVANTAGES OF WEKA
      o Runs on almost any modern computing platform.
      o Offers a comprehensive collection of data preprocessing and modeling techniques.
      o Easy to use because of its graphical user interfaces.
      o Supports several standard data mining tasks: data preprocessing, clustering, classification, regression, visualization, and feature selection.
  • 40. COMMERCIAL TOOLS
      o Oracle Data Miner - http://www.oracle.com
      o Data to Knowledge - http://alg.nsca.uluc.edu
      o SAS - http://www.sas.com
      o Clementine - http://spss.com/clemetine
      o Intelligent Miner - http://www.306.ibm.com/software
  • 41. ROLE OF DATA MINING IN AGRICULTURE
      o Crop yield estimation.
      o Estimation of damage caused by pests.
      o Mushroom grading.
      o Spatial data mining, which reveals interesting patterns related to agriculture.
      o Crop price prediction.
      o Characterization of agricultural soil profiles.
  • 42. ROLE IN AGRICULTURE DOMAIN
      Data mining methodology: application
      o Neural networks: weather forecasting and prediction of rainfall.
      o K-means: classifying soil in combination with GPS, the wine fermentation problem, yield prediction.
      o Fuzzy sets: detecting weeds in precision agriculture.
      o k-nearest neighbor: simulating daily precipitation and other weather conditions.
  • 43. TYPICAL APPLICATION OF DATA MINING
  • 44. Case study 1 (Ramesh et al., 2013, Hyderabad)
      o Data mining techniques were adopted in order to predict crop production.
      o The values estimated by density-based clustering were compared with the values estimated by multiple linear regression.
  • 45. OVERVIEW OF DATA
      o The data cover the years 1955 to 2009 for East Godavari district of Andhra Pradesh, India.
      o The information was gathered from three government units: the Indian Meteorological Department, the Statistical Institution, and the Agricultural Department.
      o Each area in this collection is identified by the longitude and latitude of the region.
      o Crop production is estimated with respect to eight parameters: Year, Rainfall, Area of Sowing, Production, Yield, and Fertilizers (Nitrogen, Phosphorus, and Potassium).
      Cont...
  • 46. Attributes:
      o Year: the year for which the data are available.
      o Rainfall: the rainfall in East Godavari district in the specified year, in centimeters.
      o Area of sowing: the total area sown in East Godavari district in the specified year, in hectares.
      o Production: the production of the crop in East Godavari district in the specified year, in metric tons.
      o Yield: specified in kilograms per hectare.
      o Fertilizers: specified in tons for the specified year.
  • 47. METHODOLOGY
      o A statistical method, multiple linear regression, and a data mining method, density-based clustering, were taken up for the estimation of crop production.
      Multiple Linear Regression: Multiple linear regression (MLR) is the method used to model the linear relationship between a dependent variable and one or more independent variables. The dependent variable is sometimes termed the predictand, i.e. rainfall, and the independent variables are called predictors, i.e. Year, Area of sowing, Production.
      $$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_p X_p + \varepsilon$$
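The regression coefficients are typically estimated by ordinary least squares. The sketch below fits them by solving the normal equations in plain Python; the helper names `fit_mlr` and `predict` and the small data set are hypothetical and are not the East Godavari data or the authors' code.

```python
def fit_mlr(X, y):
    """Least-squares fit of y = b0 + b1*x1 + ... + bp*xp.

    Solves the normal equations (A^T A) beta = A^T y by Gauss-Jordan
    elimination with partial pivoting. Assumes the predictors are not
    perfectly collinear (otherwise a pivot would be zero).
    """
    A = [[1.0] + list(row) for row in X]  # prepend the intercept column
    k = len(A[0])
    # Augmented matrix [A^T A | A^T y]
    M = [
        [sum(r[i] * r[j] for r in A) for j in range(k)]
        + [sum(r[i] * yi for r, yi in zip(A, y))]
        for i in range(k)
    ]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(k):
            if r != col:
                factor = M[r][col] / M[col][col]
                M[r] = [a - factor * c for a, c in zip(M[r], M[col])]
    return [M[i][k] / M[i][i] for i in range(k)]

def predict(beta, row):
    """Evaluate the fitted model for one observation."""
    return beta[0] + sum(b * x for b, x in zip(beta[1:], row))

# Hypothetical data generated from y = 1 + 2*x1 + 3*x2 (no noise).
X = [(1, 2), (2, 1), (3, 4), (4, 3), (5, 5)]
y = [9, 8, 19, 18, 26]

beta = fit_mlr(X, y)
print([round(b, 6) for b in beta])
print(round(predict(beta, (6, 2)), 6))
```

With real data the error term ε is not zero, so the fitted coefficients only approximate the underlying relationship.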
  • 48. METHODOLOGY (Cont'd...)
      Density-based clustering technique:
      o Density is usually defined as the number of objects in a particular neighborhood of a data object.
      o The idea of density-based clustering is that, for each point of a cluster, the neighborhood of a given radius must contain at least a minimum number of points.
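The neighborhood rule above can be illustrated with a small DBSCAN-style routine. DBSCAN is one common density-based algorithm; the slides do not say which algorithm the study used, so this is a sketch under that assumption, and the parameter names `eps` and `min_pts` and the sample points are hypothetical.

```python
import math

def dbscan(points, eps, min_pts):
    """Return one label per point: a cluster id (0, 1, ...) or -1 for noise."""
    def neighbours(i):
        # All points within distance eps of point i (including i itself).
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = -1              # provisionally noise
            continue
        cluster += 1                    # i is a core point: start a new cluster
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster     # noise reached from a core point becomes a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nb = neighbours(j)
            if len(nb) >= min_pts:      # j is also a core point: keep expanding
                queue.extend(nb)
    return labels

# Hypothetical 2-D data: two dense groups and one isolated point.
pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
print(dbscan(pts, eps=1.5, min_pts=3))  # → [0, 0, 0, 1, 1, 1, -1]
```

Points whose neighborhood is too sparse are left as noise (-1), which is what distinguishes density-based clustering from methods such as K-means that assign every point to a cluster.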
  • 49. Table 1: Exact production and estimated values using the Multiple Linear Regression technique (40 years interval).
      Observation year    Production (exact)    Production (estimation)    Percentage of difference
      2000                683423                592461                      13
      2001                579850                566050                       2
      2002                551115                579433                      -5
      2003                762453                722638                       5
      2004                743614                742752                       0
      2005                348727                399062                     -14
      2006                547716                551541                      -1
      2007                691069                691069                       3
      2008                716609                697227                       3
      2009                616567                633494                      -3
      The results estimated with the Multiple Linear Regression technique range between -14% and +13% for the 40 years interval.
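The "Percentage of difference" column appears consistent with (exact − estimated) / exact × 100, rounded to the nearest integer. That formula is inferred from the table rather than stated in the source, and the function name `pct_difference` is hypothetical. A minimal check against three rows of Table 1:

```python
def pct_difference(exact, estimated):
    """Signed difference between exact and estimated values, as a percentage of exact."""
    return (exact - estimated) / exact * 100.0

# (year, exact production, MLR estimate) copied from Table 1
rows = [(2000, 683423, 592461), (2005, 348727, 399062), (2009, 616567, 633494)]

for year, exact, est in rows:
    print(year, round(pct_difference(exact, est)))
# → 2000 13
# → 2005 -14
# → 2009 -3
```

A positive value means the model underestimated production; a negative value means it overestimated.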
  • 50. Table 2: Exact production and estimated values using the Density-based clustering technique (6 clusters).
      Observation year    Production (exact)    Production (estimation)    Percentage of difference
      2000                683423                666011                       3
      2001                579850                651103                     -12
      2002                551115                566972                      -3
      2003                762453                703914                       8
      2004                743614                737897                       1
      2005                348727                392770                     -13
      2006                547716                534709                       2
      2007                691069                791589                     -11
      2008                716609                676321                       6
      2009                616567                695574                     -13
      The results estimated with the Density-based clustering technique range between -13% and +8% for the 6-cluster approximation.
  • 51. RESULT
      Table 3: Comparison between exact production and the values estimated using the Multiple Linear Regression technique and the Density-based clustering technique.
      Observation year    Production (exact)    Estimation (Multiple Linear Regression)    Estimation (Density-based clustering)
      2000                683423                592461                                      666011
      2001                579850                566050                                      651103
      2002                551115                579433                                      566972
      2003                762453                722638                                      703914
      2004                743614                742752                                      737897
      2005                348727                399062                                      392770
      2006                547716                551541                                      534709
      2007                691069                691069                                      791589
      2008                716609                697227                                      676321
      2009                616567                633494                                      695574
  • 53. CONCLUSION Initially, the statistical model, the Multiple Linear Regression technique, was applied to the existing data. The results so obtained were verified and analyzed using a data mining technique, Density-based clustering. In this procedure the results of the two methods were compared for a specific region, East Godavari district of Andhra Pradesh, India. A similar process was adopted for all the districts of Andhra Pradesh to improve and validate yield prediction, which is useful to the farmers of Andhra Pradesh for predicting the production of a specific crop.
  • 54. CONCLUSION
      o Data mining is a boon for handling large data sets in agriculture.
      o Extraction of knowledge from such data is a big challenge.
      o Many data mining techniques have been developed to tackle this challenge.
      o Skill is also required to handle the tools and techniques.
  • 55. REFERENCE
      BHARGAVI, P. AND JYOTHI, S., 2009, Applying Naive Bayes data mining technique for classification of agricultural land soils. International Journal of Computer Science and Network Security, 9(8): 117–22.
      FAYYAD, U., PIATETSKY-SHAPIRO, G. AND SMYTH, P., 1996, From data mining to knowledge discovery in databases. AI Magazine, 17(3): 37–54.
      PATEL, H. AND PATEL, D., 2014, A brief survey of data mining techniques applied to agricultural data. International Journal of Computer Applications, 95(9): 1–3.
      PATIL, T. R. AND SHEREKAR, S. S., 2013, Performance analysis of Naive Bayes and J48 classification algorithm for data classification. International Journal of Computer Science and Applications, 6(2): 2561–61.
  • 56. REFERENCE (Cont...)
      RAMESH, V. AND RAMAR, K., 2011, Classification of agricultural land soils: a data mining approach. Agricultural Journal, 6(3): 82–6.
      RAMESH, D. AND VARHADAN, V. B., 2013, Data mining techniques and applications to agricultural yield data. International Journal of Advanced Research in Computer and Communications Engineering, 2(9): 3477–3482.
      SHARMA, L. AND MEHTA, N., 2012, Data mining techniques: a tool for knowledge management system in agriculture. International Journal of Scientific and Technology Research, 1(5): 67–73.
      VEENADHARI, S., MISRA, B. AND SINGH, C. D., 2011, Data mining techniques for predicting crop productivity – a review article. International Journal of Computer Science and Technology, 2(1): 114–118.