Data Mining Module 5 Business Analytics.pdf

Copyright © 2024 Jayanti Rajdevendra Pande. All rights reserved.
RASHTRASANT TUKDOJI MAHARAJ NAGPUR UNIVERSITY
MBA
SEMESTER: 3
SPECIALIZATION
BUSINESS ANALYTICS (BA 2)
SUBJECT
DATA MINING
MODULE NO : 5
WEB MINING & TEXT MINING
- Jayanti R Pande
DGICM College, Nagpur

Q1. What is web mining? Explain its process.
Web mining is the process of using data-mining techniques to automatically extract information from various sources on the
web. It involves discovering insights from web documents and services, encompassing a range of tasks beyond just applying
standard data-mining tools.
PROCESS OF WEB MINING
1. Resource finding: This involves retrieving data from multimedia sources available online or offline, such as news articles,
forums, blogs, and HTML documents. It includes extracting text content from HTML documents by removing HTML tags.
2. Information selection and pre-processing: In this step, the original data obtained in the previous subtask undergoes
transformations. These transformations may include pre-processing tasks like removing stop words, stemming, or
restructuring the data to achieve the desired representation, such as identifying phrases in the training corpus or converting
text into first-order logic form.
3. Generalization: Generalization is the process of discovering general patterns within individual websites as well as across
multiple sites. Various machine-learning techniques, data-mining methods, and specialized web-oriented approaches are
utilized to identify these patterns.
4. Analysis: This final task involves validating and interpreting the patterns mined from the data. It includes assessing the
significance and relevance of the discovered patterns in relation to the objectives of the web mining process.
Figure: Web mining process (1 Resource finding, 2 Information selection and pre-processing, 3 Generalization, 4 Analysis)
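
The first two steps of the process (extracting text from HTML, then removing stop words and stemming) can be sketched in Python as below. This is only a minimal illustration: the stop-word list, the sample page, and the suffix-stripping rule in naive_stem are assumptions made for the example, and a real project would use a proper stemmer and a much larger stop-word list.

```python
from html.parser import HTMLParser

# Hypothetical, deliberately tiny stop-word list (illustration only).
STOP_WORDS = {"a", "an", "the", "is", "are", "of", "and", "to", "in", "on"}


class TextExtractor(HTMLParser):
    """Collects the text content of an HTML document and ignores the tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_tags(html):
    """Resource finding: extract plain text from an HTML document."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # Join the text fragments and collapse repeated whitespace.
    return " ".join(" ".join(parser.parts).split())


def naive_stem(word):
    """Crude suffix stripping; a stand-in for a real stemming algorithm."""
    for suffix in ("ing", "ed", "s"):
        # Only strip when at least four characters remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 4:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    """Pre-processing: lowercase, tokenize, drop stop words, stem."""
    tokens = [t.strip(".,;:!?()\"'") for t in text.lower().split()]
    return [naive_stem(t) for t in tokens if t and t not in STOP_WORDS]


# Made-up sample page for the example.
page = "<html><body><h1>Mining the Web</h1><p>Users are clicking links.</p></body></html>"
text = strip_tags(page)
tokens = preprocess(text)
print(text)    # Mining the Web Users are clicking links.
print(tokens)  # ['mining', 'web', 'user', 'click', 'link']
```

The resulting token list is the kind of cleaned representation that the generalization step would then mine for patterns.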

Q2. Compare Web Content, Web Structure and Web Usage.
Web Content
- Involves unstructured data, primarily text documents.
- Analysis typically involves machine learning, statistical methods, and natural language processing (NLP).
- Data representation often includes models like bag of words and n-gram terms.
- Applications include categorization, clustering, and pattern identification within textual data.
- It focuses on extracting meaningful information from unstructured text data on the web.
- Techniques such as sentiment analysis and topic modeling are commonly applied.
Web Structure
- Deals with semi-structured data found in hypertext documents.
- Utilizes proprietary algorithms for analysis.
- Representation is often depicted as edge-labeled graphs.
- Main applications include identifying frequent substructures within web documents and discovering website schemas.
- It deals with the organization and relationships between web elements like pages and links.
- Graph-based algorithms are often used to analyze connectivity and relationships.
Web Usage
- Focuses on interactive aspects of web data, including link structures and server/browser logs.
- Analysis methods encompass machine learning and statistical techniques, especially association rules.
- Data represented using graphs and relational tables.
- Applications range from categorization and clustering to site construction and rule extraction from user behavior.
- It emphasizes understanding user behavior and interaction patterns on the web.
- Usage patterns are analyzed to improve website design, content delivery, and marketing strategies.
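
The bag-of-words and n-gram representations mentioned for web content mining can be illustrated with a short Python sketch. The sample sentence and the helper names (tokenize, bag_of_words, ngrams) are made up for this example; real systems usually rely on a text-processing library.

```python
from collections import Counter


def tokenize(text):
    """Lowercase the text, split on whitespace, and trim punctuation."""
    tokens = (t.strip(".,;:!?()\"'") for t in text.lower().split())
    return [t for t in tokens if t]


def bag_of_words(tokens):
    """Bag of words: how often each term occurs, ignoring word order."""
    return Counter(tokens)


def ngrams(tokens, n):
    """All runs of n consecutive tokens, so some word order is kept."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Made-up sample text for the example.
doc = "Web mining finds patterns. Web usage mining finds user patterns."
tokens = tokenize(doc)
bow = bag_of_words(tokens)
bigrams = Counter(ngrams(tokens, 2))

print(bow["web"], bow["usage"])      # 2 1
print(bigrams[("mining", "finds")])  # 2
```

A bag of words discards word order entirely, while n-grams (here bigrams, n = 2) retain short sequences such as "mining finds".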

Q3. Explain the working of HITS Algorithm.
HITS, short for Hyperlink-Induced Topic Search, is an algorithm used for ranking web pages based on their authority and hub
scores. It evaluates the importance of a web page by considering both its authority score, which measures how relevant the
page's content is to a specific topic, and its hub score, which indicates how well it links to other authoritative pages on
the same topic. The HITS algorithm operates by analysing the link structure of the web and iteratively computing authority
and hub scores for web pages.
Figure: HITS Algorithm Steps (1 Root Set Retrieval, 2 Base Set Construction, 3 Authority and Hub Computation, 4 Iteration Process, 5 Score Normalization, 6 Repeat Iterations)

The HITS (Hyperlink-Induced Topic Search) algorithm operates in several steps:
1 Root Set Retrieval: Initially, the most relevant pages to the search query are retrieved. This set is termed the root set and is
typically obtained using a text-based search algorithm.
2 Base Set Construction: The root set is expanded by including all pages linked from it and some pages that link to it. This
augmented set forms the base set, ensuring that a substantial number of strong authorities are included. This base set and the
hyperlinks among its pages constitute a focused subgraph.
3 Authority and Hub Computation: Authority and hub values are computed iteratively in a mutually recursive manner. An
authority value is calculated as the sum of the scaled hub values of the pages that point to it, while a hub value is determined as
the sum of the scaled authority values of the pages it points to. Some implementations also consider the relevance of the linked
pages.
4 Iteration Process: Each iteration applies two updates.
- Authority Update: Each node's authority score is updated to be the sum of the hub scores of each node pointing to it. This
implies that a node gains a high authority score by being linked from pages recognized as hubs for information.
- Hub Update: Each node's hub score is updated to be the sum of the authority scores of each node it points to. This means that a
node earns a high hub score by linking to nodes considered authorities on the subject.
These two updates are repeated over a series of iterations.
5 Score Normalization: After each iteration, the hub and authority scores are normalized by dividing each hub score by the
square root of the sum of the squares of all hub scores, and each authority score by the square root of the sum of the squares
of all authority scores. This normalization process ensures that the scores remain comparable across iterations.
6 Repeat Iterations: The iterations continue until convergence, where the scores stabilize or until a predefined stopping criterion
is met.
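
Steps 3 to 6 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not a reference implementation. The base set is a made-up toy graph with hypothetical page names P1, P2, P3, A, and B. The stopping tolerance and iteration cap are arbitrary choices. The hub update uses the authority scores computed in the same pass, which is one common ordering.

```python
import math


def hits(graph, max_iter=100, tol=1e-8):
    """Compute hub and authority scores.

    graph maps each page to the list of pages it links to.
    """
    nodes = set(graph)
    for targets in graph.values():
        nodes.update(targets)
    nodes = sorted(nodes)

    # Every page starts with hub and authority scores of 1.
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}

    # Reverse adjacency: for each page, the pages that point to it.
    incoming = {n: [] for n in nodes}
    for src, targets in graph.items():
        for dst in targets:
            incoming[dst].append(src)

    for _ in range(max_iter):
        # Authority update: sum of hub scores of pages pointing to n.
        new_auth = {n: sum(hub[m] for m in incoming[n]) for n in nodes}
        # Hub update: sum of (updated) authority scores of pages n points to.
        new_hub = {n: sum(new_auth[m] for m in graph.get(n, [])) for n in nodes}

        # Normalization: divide by the square root of the sum of squares.
        a_norm = math.sqrt(sum(v * v for v in new_auth.values())) or 1.0
        h_norm = math.sqrt(sum(v * v for v in new_hub.values())) or 1.0
        new_auth = {n: v / a_norm for n, v in new_auth.items()}
        new_hub = {n: v / h_norm for n, v in new_hub.items()}

        # Stop once the scores have stabilized.
        change = sum(abs(new_auth[n] - auth[n]) + abs(new_hub[n] - hub[n])
                     for n in nodes)
        auth, hub = new_auth, new_hub
        if change < tol:
            break
    return hub, auth


# Hypothetical base set: P1 to P3 act as hubs, A and B as authorities.
base_set = {"P1": ["A", "B"], "P2": ["A", "B"], "P3": ["A"], "A": [], "B": []}
hub, auth = hits(base_set)
for page in sorted(base_set):
    print(page, round(hub[page], 3), round(auth[page], 3))
```

In this toy graph, A ends up with the highest authority score because all three hub pages link to it, and P1 and P2 score higher as hubs than P3 because they link to both authorities.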

Q4. Write about Text Mining.
Text mining, also known as text data mining, is the process of transforming unstructured text into a structured format to identify
meaningful patterns and new insights. By applying analytical techniques such as Naïve Bayes, Support Vector Machines (SVM),
and deep learning algorithms, companies are able to explore and discover hidden relationships within their unstructured data.
Text data can be organized into three main formats within databases:
1. Structured Data: This data is standardized into a tabular format with numerous rows and columns, making it easier to store
and process for analysis and machine learning algorithms. Structured data can include inputs such as names, addresses, and
phone numbers.
2. Unstructured Data: This data does not have a predefined format. It can include text from sources like social media or product
reviews, as well as rich media formats like video and audio files.
3. Semi-Structured Data: Semi-structured data is a blend between structured and unstructured formats. While it has some
organization, it lacks enough structure to meet the requirements of a relational database. Examples of semi-structured data
include XML, JSON, and HTML files.
Given that roughly 80% of data in the world resides in an unstructured format, text mining is an extremely valuable practice
within organizations. Text mining tools and natural language processing (NLP) techniques, such as information extraction, enable
the transformation of unstructured documents into a structured format for analysis and the generation of high-quality insights.
This, in turn, improves the decision-making of organizations, leading to better business outcomes.
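
As an illustration of one technique named above, the following Python sketch trains a very small Naïve Bayes classifier on unstructured review text. It is a simplified sketch under stated assumptions: the four training reviews and their labels are made up, add-one (Laplace) smoothing is used, and words not seen during training are ignored. The function names tokenize, train, and predict are hypothetical helpers, not part of any library.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Turn raw text into lowercase word tokens without punctuation."""
    tokens = (t.strip(".,;:!?()\"'") for t in text.lower().split())
    return [t for t in tokens if t]


def train(labelled_docs):
    """Count documents per class and word occurrences per class."""
    class_docs = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in labelled_docs:
        class_docs[label] += 1
        for tok in tokenize(text):
            word_counts[label][tok] += 1
            vocab.add(tok)
    return class_docs, word_counts, vocab


def predict(text, model):
    """Return the class with the highest log-probability for the text."""
    class_docs, word_counts, vocab = model
    total_docs = sum(class_docs.values())
    best_label, best_score = None, -math.inf
    for label in class_docs:
        # Log prior: share of training documents with this label.
        score = math.log(class_docs[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for tok in tokenize(text):
            if tok not in vocab:
                continue  # ignore words never seen in training
            # Add-one smoothing avoids zero probabilities.
            score += math.log((word_counts[label][tok] + 1)
                              / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Made-up product reviews for the example.
reviews = [
    ("great product, works well", "positive"),
    ("excellent quality and great value", "positive"),
    ("terrible product, stopped working", "negative"),
    ("poor quality, very disappointing", "negative"),
]
model = train(reviews)
print(predict("great quality", model))               # positive
print(predict("terrible and disappointing", model))  # negative
```

The training step is where unstructured text becomes structured: each review is reduced to word counts per class, which the classifier can then use like any tabular data.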

Q5. Explain PageRank Algorithm and its implementation steps.
• The PageRank algorithm, developed by Google founders Larry Page and Sergey Brin, is a method used to determine the importance of
web pages in search engine results. Named after Larry Page, it measures the significance of a webpage based on the quantity
and quality of links pointing to it.
• Google describes PageRank as a system that assesses a webpage's importance by considering the number and quality of links it
receives from other pages. The underlying assumption is that more important websites are likely to attract more links from other
websites.
• The algorithm generates a probability distribution to represent the likelihood that a random surfer clicking on links will land on any
particular page. It can be applied to collections of documents of any size, assuming an even distribution of importance among all
documents at the start of the computation.
• In the PageRank computation, a series of iterations, or passes through the collection, are required to adjust the approximate
PageRank values to better reflect the true value. This involves transferring PageRank from a page to the targets of its outbound links,
with the transfer evenly divided among all outbound links.
• For example, suppose four pages A, B, C, and D each start with a PageRank of 0.25. If B, C, and D each link only to A, each link
transfers its page's full 0.25 to A in the next iteration, for a total of 0.75.
• In another scenario where page B links to pages C and A, page C links to page A, and page D links to all three pages, each linking
page splits its value evenly among its outbound links. In the first iteration, B transfers half of its value (0.125) to A, C transfers
all of its value (0.25), and D transfers a third (about 0.083), giving A about 0.458. In the general case, the PageRank value for any
page u is the sum, over the pages linking to u, of each page's PageRank divided by its number of outbound links. The calculation also
involves a damping factor, which scales the transferred value down, somewhat like a tax on each transfer, and helps keep the
computation stable.
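
The arithmetic of the two scenarios can be checked with a few lines of Python. This sketch assumes four pages A, B, C, and D that each start at 0.25 (the starting value implied by the 0.75 total) and leaves out the damping factor. The helper name transfer_to is hypothetical.

```python
def transfer_to(target, links, rank):
    """PageRank passed to target in one iteration, without damping.

    Each linking page v contributes rank[v] divided by its number of
    outbound links.
    """
    return sum(rank[v] / len(out) for v, out in links.items() if target in out)


# Assumption: four pages with an even starting distribution.
rank = {page: 0.25 for page in "ABCD"}

# Scenario 1: B, C, and D each link only to A.
scenario_1 = {"B": ["A"], "C": ["A"], "D": ["A"]}

# Scenario 2: B links to C and A, C links to A, D links to A, B, and C.
scenario_2 = {"B": ["C", "A"], "C": ["A"], "D": ["A", "B", "C"]}

s1 = transfer_to("A", scenario_1, rank)
s2 = transfer_to("A", scenario_2, rank)
print(s1)            # 0.75
print(round(s2, 3))  # 0.458
```

In the second scenario A receives 0.25/2 from B, 0.25/1 from C, and 0.25/3 from D.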

The implementation of the PageRank algorithm involves several steps:
1. Data Collection: Gather information about web pages and their links. This typically involves crawling the web to create a web graph, where
nodes represent web pages and edges represent links between them.
2. Initialization: Assign an initial PageRank value to each web page, with all pages given an equal value. In the original formulation each page
started with a value of 1; later versions assume a probability distribution between 0 and 1, so each of N pages starts at 1/N.
3. Iteration: Perform a series of iterations to update the PageRank values. In each iteration, calculate the PageRank for each page based on the
PageRank values of the pages linking to it.
4. Damping Factor: Apply a damping factor, commonly set to about 0.85. The damping factor represents the probability that a random surfer
will continue clicking on links rather than jumping to a new page, and it helps the iteration settle on stable values.
5. Convergence: Repeat the iteration process until the PageRank values converge to stable values. Convergence occurs when the PageRank values
no longer change significantly between iterations.
6. Normalization: Normalize the PageRank values to ensure they sum up to 1. This step ensures that the PageRank values represent a probability
distribution.
7. Implementation Considerations: Implement efficient data structures and algorithms to handle large-scale web graphs. This may involve
distributed computing techniques and optimizations to improve performance and scalability.
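
Steps 2 to 6 can be sketched in Python for a small in-memory graph. This is a minimal illustration, not a production implementation. The damping value of 0.85, the tolerance, and the iteration cap are assumed defaults. Spreading the rank of pages without outbound links evenly over all pages is one common convention adopted here as an assumption. The four-page graph is a made-up example.

```python
def pagerank(links, damping=0.85, tol=1e-10, max_iter=200):
    """Iteratively compute PageRank.

    links maps each page to the list of pages it links to.
    """
    pages = set(links)
    for out in links.values():
        pages.update(out)
    pages = sorted(pages)
    n = len(pages)

    # Initialization: equal values that sum to 1.
    rank = {p: 1.0 / n for p in pages}

    for _ in range(max_iter):
        # Assumption: rank held by pages with no outbound links is
        # redistributed evenly over all pages.
        dangling = sum(rank[p] for p in pages if not links.get(p))
        new_rank = {p: (1.0 - damping) / n + damping * dangling / n
                    for p in pages}

        # Each page passes a damped, equal share to every outbound link.
        for src in pages:
            out = links.get(src, [])
            for dst in out:
                new_rank[dst] += damping * rank[src] / len(out)

        # Normalization: keep the values a probability distribution.
        total = sum(new_rank.values())
        new_rank = {p: v / total for p, v in new_rank.items()}

        # Convergence: stop when values no longer change significantly.
        change = sum(abs(new_rank[p] - rank[p]) for p in pages)
        rank = new_rank
        if change < tol:
            break
    return rank


# Made-up graph: B links to C and A, C links to A, D links to A, B, and C.
links = {"A": [], "B": ["C", "A"], "C": ["A"], "D": ["A", "B", "C"]}
ranks = pagerank(links)
for page, value in sorted(ranks.items(), key=lambda kv: -kv[1]):
    print(page, round(value, 4))
```

In this graph A ranks highest because every other page links to it, and D ranks lowest because no page links to it. For web-scale graphs the dictionaries would be replaced by sparse matrix or distributed representations, as step 7 notes.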

Copyright © 2024 Jayanti Rajdevendra Pande.
All rights reserved.
This content may be printed for personal use only. It may not be copied, distributed, or used for any other purpose
without the express written permission of the copyright owner.
This content is protected by copyright law. Any unauthorized use of the content may violate copyright laws and
other applicable laws.
For any further queries contact on email: jayantipande17@gmail.com
