Web Content Mining
Web content mining extracts useful information from the content of
web pages: text, images, audio, video, metadata, and hyperlinks.
It examines both the content of the web itself and the results
returned by search engines.
Web mining helps businesses understand customer behavior and
evaluate the performance of a web site, so research in web content
mining indirectly helps to boost business.
Web content mining also examines the results returned by search
engines. Doing this manually consumes a lot of time, and when the
data to be analyzed is large, it is hard to find the relevant
parts. As in most other fields, manual work on the internet has
been replaced by technology. In its early stages the Web held only
a small amount of data, so there was no need for web mining tools.
As the years passed, the Web accumulated a large amount of data,
and retrieving data that matches a user's need became a hard task.
Web mining tools were developed to address this problem.
Web content mining can be further classified into:
● Web page content mining: the traditional search of web pages by
their content.
● Search result mining: a further search of the pages returned by
a previous search.
Two approaches are used in web content mining:
1) Agent-based approach
2) Database approach
1) Agent-based approach
The agent-based approach uses three types of agents:
● Intelligent search agents
● Information filtering/categorizing agents
● Personalized web agents
Intelligent search agents automatically search for
information matching a particular query, using
domain characteristics and user profiles.
Information filtering/categorizing agents use a number of
techniques to filter data according to predefined instructions.
Personalized web agents learn user preferences and
discover documents related to those user profiles.
2) Database approach
The database approach works with a well-formed
database containing schemas and attributes with
defined domains.
Web content mining becomes complicated when it
has to mine unstructured, structured, semi-structured,
and multimedia data.
[Figure: web content mining techniques]
Unstructured Data Mining Techniques
Content mining can be performed on unstructured data
such as text.
Mining unstructured data yields previously unknown
information.
Text mining is the extraction of previously unknown
information from different text sources. Content mining
requires the application of both data mining and text
mining techniques.
Basic content mining is a type of text mining.
Some of the techniques used in text mining are:
● Information Extraction
● Topic Tracking
● Summarization
● Categorization
● Clustering
● Information Visualization
Information Extraction (IE)
Information extraction uses pattern matching to extract
information from unstructured data. It identifies keywords and
phrases and then finds the connections between those keywords
within the text. This technique is very useful when there is a
large volume of text. IE is the basis of many other techniques
used for mining unstructured data. Because IE transforms
unstructured text into more structured data, its output can be
passed to a KDD (knowledge discovery in databases) module. First,
information is mined from the extracted data; then different
types of rules are used to find the information that was missed.
Extractions that lead to incorrect predictions on the data are
discarded.
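The pattern-matching step can be illustrated with a minimal sketch. The pattern, the company names, and the `extract_acquisitions` helper are assumptions made for illustration; they are not taken from the slides, and practical IE systems use far richer rules.

```python
import re

# Hypothetical pattern: "<Buyer> acquired <Target> for $<amount> million".
ACQUISITION = re.compile(
    r"(?P<buyer>[A-Z][A-Za-z]+) acquired (?P<target>[A-Z][A-Za-z]+) "
    r"for \$(?P<amount>\d+(?:\.\d+)?) million"
)

def extract_acquisitions(text):
    """Turn free text into one structured record (dict) per pattern match."""
    return [
        {
            "buyer": m["buyer"],
            "target": m["target"],
            "amount_musd": float(m["amount"]),
        }
        for m in ACQUISITION.finditer(text)
    ]

# Invented sample text, used only to exercise the pattern.
sample = (
    "Analysts noted that Acme acquired Globex for $120 million last year. "
    "Later, Initech acquired Hooli for $45.5 million."
)
records = extract_acquisitions(sample)
```

Each record is a small structured row, which is the kind of output a downstream mining step could consume.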
Topic Tracking
Topic tracking checks the documents viewed by a user and studies
the user's profile. For each user, it predicts other documents
related to that user's interests. In the topic tracking service
offered by Yahoo, a user can supply a keyword and is notified
whenever something related to that keyword appears. The same idea
can be applied to mining unstructured data. For example, a company
can register a competitor's name and be informed whenever that
name comes up in the news.
Topic tracking can be applied in many fields, two of which are
medicine and education. In medicine, doctors can easily learn
about the latest treatments. In education, topic tracking can be
used to find the latest references for research work. Topic
tracking also helps to follow all subsequent stories in a news
stream.
A disadvantage of topic tracking is that a search for a topic may
return information that is unrelated to the user's interest. For
example, if a user sets an alert for ‘web mining’, the results may
include topics related to mineral mining, which are of no use to
that user.
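A keyword alert of this kind, together with the false-positive problem, can be sketched as follows. The `matching_alerts` helper and the headlines are invented for illustration, and a naive whole-word match is assumed.

```python
import re

def matching_alerts(alert_keywords, headlines):
    """Return the headlines that contain every alert keyword as a whole word."""
    wanted = [k.lower() for k in alert_keywords]
    hits = []
    for headline in headlines:
        words = set(re.findall(r"[a-z]+", headline.lower()))
        if all(k in words for k in wanted):
            hits.append(headline)
    return hits

# Invented headlines; the second is about mineral mining, not web mining.
headlines = [
    "New web mining toolkit released",
    "Coal mining output falls as web orders rise",
    "Hospital adopts new treatment",
]
hits = matching_alerts(["web", "mining"], headlines)
```

The second headline matches both keywords even though it is unrelated to web mining, which is the disadvantage described above.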
Summarization
Summarization reduces the length of a document while retaining
its main points. It helps users decide whether or not to read the
full document. The tool takes less time to summarize a document
than a user takes to read its first paragraph. The challenge in
summarization is to teach software to analyze semantics and
interpret meaning. Summarization software statistically weighs
each sentence and then extracts the important sentences from the
document.
To identify the key points, a summarization tool searches for
headings and subheadings. The tool also lets the user choose what
percentage of the total text to extract as the summary. It can
work together with other tools, such as topic tracking and
categorization, to summarize a document. An example of text
summarization is the AutoSummarize feature of Microsoft Word.
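A minimal sketch of statistical sentence weighting with a user-chosen extraction ratio is shown below. The scoring rule (average word frequency), the stopword list, and the sample text are assumptions for illustration; this is not how AutoSummarize or any particular product works.

```python
import re
from collections import Counter

# Small illustrative stopword list (an assumption, not a standard list).
STOPWORDS = {"the", "a", "an", "of", "to", "and", "is", "in", "it", "that", "for", "on"}

def content_words(text):
    """Lowercased words of the text with stopwords removed."""
    return [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOPWORDS]

def summarize(text, ratio=0.5):
    """Keep the highest-scoring share of sentences, in their original order."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    freq = Counter(content_words(text))

    def score(sentence):
        tokens = content_words(sentence)
        return sum(freq[t] for t in tokens) / len(tokens) if tokens else 0.0

    keep = max(1, round(len(sentences) * ratio))
    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    return " ".join(sentences[i] for i in sorted(ranked[:keep]))

text = (
    "Web mining extracts knowledge from web data. "
    "The weather was pleasant. "
    "Web content mining analyzes web pages and web documents. "
    "Lunch was served at noon."
)
summary = summarize(text, ratio=0.5)
```

The `ratio` parameter plays the role of the percentage the user selects; sentences dominated by frequent terms score highest.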
Categorization
Categorization identifies the main themes of documents by placing
them into a predefined set of groups. The technique counts the
words in a document rather than processing the actual
information, and decides the main topic from those counts. It
then ranks the documents by topic: documents whose content is
mostly about a particular topic are ranked first.
Categorization can be used in business and industry, for example
to provide customer support.
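Count-based categorization into predefined groups might look like the sketch below. The two categories and their vocabularies are hypothetical, chosen to echo the customer-support use case.

```python
import re
from collections import Counter

# Hypothetical predefined categories with hand-picked vocabularies.
CATEGORIES = {
    "billing": {"invoice", "refund", "payment", "charge"},
    "technical": {"error", "crash", "install", "login"},
}

def categorize(document):
    """Score each category by counting its vocabulary words in the document."""
    counts = Counter(re.findall(r"[a-z]+", document.lower()))
    scores = {
        name: sum(counts[word] for word in vocab)
        for name, vocab in CATEGORIES.items()
    }
    best = max(scores, key=scores.get)
    return best, scores

ticket = "I was charged twice; please refund the payment and resend the invoice."
label, scores = categorize(ticket)
```

Only word counts are used, so the word "charged" does not match "charge"; a fuller version would normalize word forms first.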
Clustering
Clustering is a technique used to group similar documents.
Unlike categorization, the grouping is not based on predefined
topics; it is done on the fly. The same document can appear in
different groups, so useful documents are not omitted from the
search results. Clustering helps the user to select the topic of
interest easily. Clustering technology is useful in management
information systems.
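On-the-fly grouping without predefined topics can be sketched with a single pass over the documents. Jaccard similarity on word sets and the 0.25 threshold are assumptions for illustration; a document joins every group whose first member is similar enough, so it may appear in more than one group.

```python
def jaccard(a, b):
    """Share of terms two term sets have in common."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def cluster(docs, threshold=0.25):
    """Single-pass clustering: join all similar groups, else start a new one."""
    clusters = []  # list of (leader_terms, member_indices)
    for index, doc in enumerate(docs):
        terms = set(doc.lower().split())
        joined = False
        for leader_terms, members in clusters:
            if jaccard(terms, leader_terms) >= threshold:
                members.append(index)
                joined = True
        if not joined:
            clusters.append((terms, [index]))
    return [members for _, members in clusters]

docs = [
    "web mining tools",
    "web mining agents",
    "color histogram matching",
    "color histogram smoothing",
]
groups = cluster(docs)
```

No topic labels are supplied in advance; the groups emerge from the documents themselves.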
Information Visualization
Visualization uses feature extraction and key term indexing to
build a graphical representation, through which similar
documents can be found. Large bodies of text are represented as a
visual hierarchy or map that the user can browse. This helps the
user to analyze the contents visually. The user can interact with
the graph by zooming, creating submaps, and scaling. This
technique is useful for finding related topics in a very large
number of documents.
Structured Data Mining Techniques
Web Crawler
Crawlers are computer programs that traverse the hypertext
structure of the web. There are two types of web crawler:
external and internal. An external crawler crawls through
unknown web sites. An internal crawler crawls through the
internal pages of the web sites returned by the external crawler.
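The internal-crawler idea can be sketched as a breadth-first traversal that only follows links on the same host. To keep the example self-contained, pages come from an in-memory dictionary through a hypothetical `fetch` callable instead of the network; the URLs are invented.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

class LinkParser(HTMLParser):
    """Collects the href value of every <a> tag."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl_internal(start_url, fetch):
    """Breadth-first crawl restricted to pages on the start URL's host."""
    host = urlparse(start_url).netloc
    seen, queue, order = {start_url}, deque([start_url]), []
    while queue:
        url = queue.popleft()
        order.append(url)
        parser = LinkParser()
        parser.feed(fetch(url))
        for href in parser.links:
            target = urljoin(url, href)
            if urlparse(target).netloc == host and target not in seen:
                seen.add(target)
                queue.append(target)
    return order

# Invented site used in place of real HTTP requests.
PAGES = {
    "http://example.test/": '<a href="/a">A</a> <a href="http://other.test/">out</a>',
    "http://example.test/a": '<a href="/b">B</a> <a href="/">home</a>',
    "http://example.test/b": "",
}
visited = crawl_internal("http://example.test/", lambda url: PAGES.get(url, ""))
```

The link to `other.test` is skipped because it leaves the site; an external crawler would follow it instead.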
Wrapper Generation
Wrapper generation provides information on the capabilities of
sources: which queries they will answer and what types of output
they produce. Web pages are already ranked by traditional search
engines, and pages are retrieved for a query using their page
rank values. Wrappers also provide a variety of meta-information
about the sources, e.g. domains, statistics, and index lookup.
Page Content Mining
Page content mining is a structured data extraction technique
that works on the pages ranked by traditional search engines. It
classifies the pages by comparing their page content rank.
Semi-Structured Data Mining Techniques
Object Exchange Model (OEM)
Relevant information is extracted from semi-structured data,
grouped together, and stored in the Object Exchange Model (OEM).
This helps the user to understand the structure of information on
the web more accurately. OEM is best suited to heterogeneous and
dynamic environments. Its main feature is that it is
self-describing: there is no need to describe the structure of an
object in advance.
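The self-describing property can be illustrated with a small sketch in which every object carries its own label and type, so no schema is declared beforehand. The `OEMObject` class and its field names are an illustrative simplification, not a reference implementation of OEM.

```python
from dataclasses import dataclass
from typing import List, Union

@dataclass
class OEMObject:
    """An object that carries its own label and type alongside its value."""
    label: str
    type: str  # "string", "integer", or "set" for a complex object
    value: Union[str, int, List["OEMObject"]]

def describe(obj, depth=0):
    """Render an object tree using only the labels and types stored in it."""
    indent = "  " * depth
    if obj.type == "set":
        lines = [f"{indent}{obj.label}: set"]
        for child in obj.value:
            lines.extend(describe(child, depth + 1))
        return lines
    return [f"{indent}{obj.label}: {obj.type} = {obj.value!r}"]

# Invented example object.
book = OEMObject("book", "set", [
    OEMObject("title", "string", "Web Mining"),
    OEMObject("year", "integer", 2001),
])
outline = describe(book)
```

`describe` needs no prior knowledge of what a "book" contains, which is the point of a self-describing model.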
Top-Down Extraction
Top-down extraction extracts complex objects from a set of rich
web sources and breaks them into less complex objects until
atomic objects have been extracted.
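Breaking a complex object down until only atomic values remain can be sketched as a recursive walk. Nested dictionaries and lists stand in for the complex objects here, which is an assumption; the path notation is also invented for illustration.

```python
def extract_atomic(obj, path=""):
    """Yield (path, value) pairs for every atomic value inside a nested object."""
    if isinstance(obj, dict):
        for key, value in obj.items():
            yield from extract_atomic(value, f"{path}/{key}")
    elif isinstance(obj, list):
        for position, item in enumerate(obj):
            yield from extract_atomic(item, f"{path}[{position}]")
    else:
        # Anything that is not a dict or list is treated as atomic.
        yield path, obj

# Invented complex object.
source = {"book": {"title": "Web Mining", "authors": ["A", "B"]}}
atoms = list(extract_atomic(source))
```

Each step descends one level, so the most complex object is handled first and the atomic values come out last.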
Web Data Extraction Language
A web data extraction language converts web data to structured
data and delivers it to end users. It stores the data in the form
of tables.
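The conversion of web data into rows of a table can be sketched in Python with the standard `html.parser` module. This is not a web data extraction language itself, only an illustration of the result such a language produces; the `TableParser` class and the sample markup are assumptions.

```python
from html.parser import HTMLParser

class TableParser(HTMLParser):
    """Collects the cell text of each <tr> into a list of rows."""
    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_data(self, data):
        if self._in_cell:
            self._row[-1] += data.strip()

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None

# Invented markup.
html = (
    "<table><tr><th>Page</th><th>Visits</th></tr>"
    "<tr><td>/home</td><td>120</td></tr></table>"
)
parser = TableParser()
parser.feed(html)
rows = parser.rows
```

The first row can serve as the column header and the remaining rows as records.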
Multimedia Data Mining Techniques
SKICAT
SKICAT is a successful astronomical data analysis and cataloging
system that produces a digital catalog of sky objects. It uses
machine learning techniques to assign these objects to
human-usable classes. It integrates techniques for image
processing and data classification, which helps to classify very
large classification sets.
Color Histogram Matching
Color histogram matching consists of color histogram
equalization and smoothing. Equalization tries to find the
correlation between color components. It suffers from the sparse
data problem, which shows up as unwanted artifacts in the
equalized images. This problem is solved by smoothing.
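The role of smoothing on a sparse histogram can be sketched as below. This is a simplified illustration, not the equalization procedure itself: it builds a normalized histogram of a single channel, applies a 3-bin moving average, and compares histograms by intersection. The bin count, the moving-average window, and the pixel values are all assumptions.

```python
def histogram(pixels, bins=4, max_value=256):
    """Normalized histogram of one color channel (values in 0..max_value-1)."""
    counts = [0] * bins
    for value in pixels:
        counts[value * bins // max_value] += 1
    return [count / len(pixels) for count in counts]

def smooth(hist):
    """3-bin moving average; edge bins average over the neighbors they have."""
    smoothed = []
    for i in range(len(hist)):
        window = hist[max(0, i - 1): i + 2]
        smoothed.append(sum(window) / len(window))
    return smoothed

def intersection(h1, h2):
    """Histogram intersection: 1.0 for identical normalized histograms."""
    return sum(min(a, b) for a, b in zip(h1, h2))

# Invented channel values producing one empty bin.
pixels = [0, 10, 70, 200]
hist = histogram(pixels)
smoothed = smooth(hist)
```

The empty third bin becomes non-zero after smoothing, which shows how smoothing fills gaps left by sparse data. Note that this simple moving average does not keep the histogram summing to 1.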
Multimedia Miner
Multimedia Miner comprises four major components: an image
excavator that extracts images and videos; a preprocessor that
extracts image features and stores them in a database; a search
kernel that matches queries against the images and videos in the
database; and a discovery module that performs image information
mining routines to find patterns in images.
Shot Boundary Detection
Shot boundary detection is a technique that automatically
detects the boundaries between shots in a video.
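One simple way to detect shot boundaries is to threshold the difference between consecutive frames. The sketch below assumes frames are flat lists of grayscale values and uses mean absolute difference with an arbitrary threshold of 50; both choices are illustrative, and real detectors use more robust features.

```python
def shot_boundaries(frames, threshold=50.0):
    """Return indices of frames that start a new shot."""
    boundaries = []
    for i in range(1, len(frames)):
        previous, current = frames[i - 1], frames[i]
        # Mean absolute pixel difference between consecutive frames.
        diff = sum(abs(a - b) for a, b in zip(previous, current)) / len(current)
        if diff > threshold:
            boundaries.append(i)
    return boundaries

# Invented frames: two dark frames followed by two bright frames.
frames = [
    [10, 10, 10, 10],
    [12, 11, 10, 9],
    [200, 210, 190, 205],
    [201, 209, 191, 204],
]
cuts = shot_boundaries(frames)
```

Small differences within a shot stay below the threshold, while the jump between the second and third frame is reported as a cut.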
Web Content Mining Tools
Web content mining tools are software that helps users download
the essential information. They collect the information that is
appropriate and fits the user's need. Examples include Web Info
Extractor, Mozenda, Screen-Scraper, Web Content Extractor, and
Automation Anywhere 5.5.
Web content mining is used in a variety of areas, including:
● Mining online news sites
● Distance learning
Problems faced by web content mining include:
● Extracting information from a heterogeneous environment
● Redundancy
● The linked nature of the web
● The dynamic and noisy nature of the web
Web content mining can also be integrated with web usage mining.
The textual content of web pages is extracted as frequent word
sequences, which are then combined with web server logs to study
association rules in users' behavior. The results help with
better recommendation, web personalization, web construction,
and web user profiling.
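Combining page content with server logs can be sketched by mapping each visited page to its terms and counting which terms occur together in a session. The page terms, the sessions, and the use of simple pair support (rather than full association rules over word sequences) are assumptions for illustration.

```python
from collections import Counter
from itertools import combinations

# Hypothetical terms extracted from each page's text.
PAGE_TERMS = {
    "/crawler": {"crawler", "spider"},
    "/wrapper": {"wrapper"},
    "/oem": {"oem"},
}

# Hypothetical sessions reconstructed from web server logs.
SESSIONS = [
    ["/crawler", "/wrapper"],
    ["/crawler", "/wrapper", "/oem"],
    ["/oem"],
]

def term_pair_support(sessions, page_terms):
    """Fraction of sessions in which each pair of content terms co-occurs."""
    pair_counts = Counter()
    for pages in sessions:
        terms = set()
        for page in pages:
            terms |= page_terms.get(page, set())
        pair_counts.update(combinations(sorted(terms), 2))
    return {pair: count / len(sessions) for pair, count in pair_counts.items()}

support = term_pair_support(SESSIONS, PAGE_TERMS)
```

Pairs with high support, such as ("crawler", "wrapper") here, suggest content that users tend to view together, which could feed a recommendation step.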
Web content mining can also be connected with web structure
mining. In this approach, the content of a web page is compared
with the information defined by the structure of the web site.
Each web page is described by a set of keywords. This information
is combined with the link structure to generate a context-based
description. The comparison helps to find semantic information
about a web page and its neighborhood.
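A context-based description can be sketched by joining a page's own keywords with the keywords of the pages it links to. The keyword sets, the link map, and the `context_description` helper are hypothetical.

```python
# Hypothetical keyword sets per page and outgoing links per page.
KEYWORDS = {
    "/home": {"mining"},
    "/tools": {"crawler", "wrapper"},
    "/about": {"team"},
}
LINKS = {
    "/home": ["/tools"],
    "/tools": ["/home"],
    "/about": [],
}

def context_description(page, keywords, links):
    """Describe a page by its own keywords and those of its linked neighbors."""
    own = set(keywords.get(page, set()))
    neighborhood = set()
    for target in links.get(page, []):
        neighborhood |= keywords.get(target, set())
    return {"own": own, "neighborhood": neighborhood - own}

description = context_description("/home", KEYWORDS, LINKS)
```

Comparing the two sets shows how far a page's content agrees with what its position in the link structure suggests.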
