Copyright © 2024 Jayanti Rajdevendra Pande. All rights reserved.
RASHTRASANT TUKDOJI MAHARAJ NAGPUR UNIVERSITY
MBA
SEMESTER: 3
SPECIALIZATION
BUSINESS ANALYTICS (BA 2)
SUBJECT
DATA MINING
MODULE NO : 5
WEB MINING & TEXT MINING
- Jayanti R Pande
DGICM College, Nagpur
Q1. What is web mining? Explain its process.
Web mining is the process of using data-mining techniques to automatically extract information from various sources on the
web. It involves discovering insights from web documents and services, encompassing a range of tasks beyond just applying
standard data-mining tools.
PROCESS OF WEB MINING
1. Resource finding: This involves retrieving data from multimedia sources available online or offline, such as news articles,
forums, blogs, and HTML documents. It includes extracting text content from HTML documents by removing HTML tags.
2. Information selection and pre-processing: In this step, the original data obtained in the previous subtask undergoes
transformations. These may include pre-processing tasks like removing stop words and stemming, or
restructuring the data to achieve the desired representation, such as identifying phrases in the training corpus or converting
text into first-order logic form.
3. Generalization: Generalization is the process of discovering general patterns within individual websites as well as across
multiple sites. Various machine-learning techniques, data-mining methods, and specialized web-oriented approaches are
used to identify these patterns.
4. Analysis: This final task involves validating and interpreting the patterns mined from the data. It includes assessing the
significance and relevance of the discovered patterns in relation to the objectives of the web mining process.
Process flow: 1 Resource finding -> 2 Information selection and pre-processing -> 3 Generalization -> 4 Analysis
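The first two steps (extracting text by removing HTML tags, then removing stop words and stemming) can be sketched in a few lines of Python. This is a minimal illustration, not a production pipeline: the stop-word list, the `crude_stem` helper and the sample page are assumptions made for the example, and a real system would use a proper stemmer such as Porter's.

```python
from html.parser import HTMLParser

# Illustrative stop-word list (an assumption); real lists are much longer.
STOP_WORDS = {"a", "an", "and", "the", "is", "are", "of", "to", "in", "on", "for"}


class TextExtractor(HTMLParser):
    """Collects the text found between tags and discards the tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def strip_tags(html):
    """Resource finding: keep only the text content of an HTML document."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def crude_stem(word):
    """Very rough suffix stripping, a stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(html):
    """Pre-processing: lowercase, tokenize, drop stop words, stem."""
    text = strip_tags(html).lower()
    tokens = [t.strip(".,;:!?\"'()") for t in text.split()]
    return [crude_stem(t) for t in tokens if t and t not in STOP_WORDS]


# Hypothetical sample page.
page = "<html><body><h1>Web Pages</h1><p>Crawlers are collecting linked pages.</p></body></html>"
tokens = preprocess(page)
print(tokens)
```

The resulting token list is the kind of cleaned representation that the generalization step would then mine for patterns.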
Q2. Compare Web Content, Web Structure and Web Usage.

| Web Content | Web Structure | Web Usage |
|---|---|---|
| Involves unstructured data, primarily text documents. | Deals with the link structure of semi-structured hypertext documents. | Focuses on interactive aspects of web data, captured in server and browser logs. |
| Analysis typically involves machine learning and statistical techniques, including NLP. | Utilizes proprietary algorithms for analysis. | Analysis methods encompass machine learning and statistical techniques, especially association rules. |
| Data representation often includes models like bag of words and n-gram terms. | Data is often represented as edge-labeled graphs. | Data is represented using graphs and relational tables. |
| Applications include categorization, clustering, and pattern identification within textual data. | Main applications include identifying frequent substructures within web documents and discovering website schemas. | Applications range from categorization and clustering to site construction and rule extraction from user behavior. |
| It focuses on extracting meaningful information from unstructured text data on the web. | It deals with the organization and relationships between web elements like pages and links. | It emphasizes understanding user behavior and interaction patterns on the web. |
| Techniques such as sentiment analysis and topic modeling are commonly applied. | Graph-based algorithms are often used to analyze connectivity and relationships. | Usage patterns are analyzed to improve website design, content delivery, and marketing strategies. |
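Association rules, mentioned for web usage mining, can be illustrated with a small sketch. The sessions below are hypothetical, and `pair_rules` is an illustrative helper rather than a standard library function; it only considers rules between pairs of pages, which is a simplification of full association-rule mining.

```python
from collections import Counter
from itertools import combinations

# Hypothetical usage data: the set of pages viewed in each visitor session.
sessions = [
    {"home", "products", "cart"},
    {"home", "products"},
    {"home", "blog"},
    {"products", "cart"},
]


def pair_rules(sessions, min_support=0.5):
    """Return {(a, b): (support, confidence)} for rules "a -> b" between page pairs."""
    n = len(sessions)
    item_counts = Counter(page for s in sessions for page in s)
    pair_counts = Counter(
        pair for s in sessions for pair in combinations(sorted(s), 2)
    )
    rules = {}
    for (a, b), count in pair_counts.items():
        support = count / n  # share of sessions containing both pages
        if support < min_support:
            continue
        # Confidence of "a -> b": sessions with both, divided by sessions with a.
        rules[(a, b)] = (support, count / item_counts[a])
        rules[(b, a)] = (support, count / item_counts[b])
    return rules


rules = pair_rules(sessions)
for (a, b), (support, confidence) in sorted(rules.items()):
    print(f"{a} -> {b}: support={support:.2f}, confidence={confidence:.2f}")
```

In this toy data every session that reaches the cart also visited the products page, so the rule "cart -> products" has confidence 1.0; a site owner might use such rules when deciding which pages to link together.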
Q3. Explain the working of HITS Algorithm.
HITS, short for Hyperlink-Induced Topic Search, is an algorithm used for ranking web pages based on their authority and hub
scores. It evaluates the importance of a web page by considering both its authority, which is a measure of its relevance to a
specific topic, and its hub score, which indicates its capacity to link to other authoritative pages on the same topic. The HITS
algorithm operates by analyzing the link structure of the web and iteratively computing authority and hub scores for web
pages.
HITS Algorithm Steps: 1 Root Set Retrieval -> 2 Base Set Construction -> 3 Authority and Hub Computation -> 4 Iteration Process -> 5 Score Normalization -> 6 Repeat Iterations
The HITS (Hyperlink-Induced Topic Search) algorithm operates in several steps:
1 Root Set Retrieval: Initially, the most relevant pages to the search query are retrieved. This set is termed the root set and is
typically obtained using a text-based search algorithm.
2 Base Set Construction: The root set is expanded by including all pages linked from it and some pages that link to it. This
augmented set forms the base set, ensuring that a substantial number of strong authorities are included. This base set and the
hyperlinks among its pages constitute a focused subgraph.
3 Authority and Hub Computation: Authority and hub values are computed iteratively in a mutually recursive manner. An
authority value is calculated as the sum of the scaled hub values of the pages that point to it, while a hub value is determined as
the sum of the scaled authority values of the pages it points to. Some implementations also consider the relevance of the linked
pages.
4 Iteration Process:
Authority Update: Each node's authority score is updated to be the sum of the hub scores of each node pointing to it. This
implies that a node gains a high authority score by being linked from pages recognized as hubs for information.
Hub Update: Each node's hub score is updated to be the sum of the authority scores of each node it points to. This means that a
node earns a high hub score by linking to nodes considered authorities on the subject.
These two updates are repeated over a series of iterations.
5 Score Normalization: After each iteration, the hub and authority scores are normalized by dividing each hub score by the
square root of the sum of the squares of all hub scores, and each authority score by the square root of the sum of the squares
of all authority scores. This normalization keeps the scores from growing without bound and makes them comparable across iterations.
6 Repeat Iterations: The iterations continue until convergence, where the scores stabilize, or until a predefined stopping criterion
is met.
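Steps 3 to 6 can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the function name `hits`, the dictionary-based graph format and the sample base set are made up for the example, and a fixed number of iterations stands in for a convergence test.

```python
import math


def hits(links, iterations=50):
    """Compute hub and authority scores.

    links maps each page to the list of pages it links to
    (the focused subgraph built from the base set).
    """
    nodes = set(links) | {t for targets in links.values() for t in targets}
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}

    for _ in range(iterations):
        # Authority update: sum of the hub scores of pages pointing to the node.
        auth = {n: 0.0 for n in nodes}
        for src, targets in links.items():
            for dst in targets:
                auth[dst] += hub[src]

        # Hub update: sum of the authority scores of pages the node points to.
        hub = {n: sum(auth[dst] for dst in links.get(n, [])) for n in nodes}

        # Normalization: divide by the square root of the sum of squares.
        a_norm = math.sqrt(sum(v * v for v in auth.values())) or 1.0
        h_norm = math.sqrt(sum(v * v for v in hub.values())) or 1.0
        auth = {n: v / a_norm for n, v in auth.items()}
        hub = {n: v / h_norm for n, v in hub.items()}

    return hub, auth


# Hypothetical base set: two hub-like pages pointing at two candidate authorities.
links = {"hub1": ["auth1", "auth2"], "hub2": ["auth1"]}
hub, auth = hits(links)
print(max(auth, key=auth.get), max(hub, key=hub.get))
```

In this toy graph `auth1` is linked from both hubs and ends up with the highest authority score, while `hub1` links to both authorities and ends up with the highest hub score.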
Q4. Write about Text Mining.
Text mining, also known as text data mining, is the process of transforming unstructured text into a structured format to identify
meaningful patterns and new insights. By applying analytical techniques such as Naïve Bayes, Support Vector
Machines (SVM), and deep learning algorithms, companies are able to explore and discover hidden relationships within
their unstructured data.
Text data can be organized into three main formats within databases:
1. Structured Data: This data is standardized into a tabular format with numerous rows and columns, making it easier to store
and process for analysis and machine learning algorithms. Structured data can include inputs such as names, addresses, and
phone numbers.
2. Unstructured Data: This data does not have a predefined format. It can include text from sources like social media or product
reviews, as well as rich media formats like video and audio files.
3. Semi-Structured Data: Semi-structured data is a blend between structured and unstructured formats. While it has some
organization, it lacks enough structure to meet the requirements of a relational database. Examples of semi-structured data
include XML, JSON, and HTML files.
Because, by a commonly cited estimate, roughly 80% of the world's data is unstructured, text mining is an extremely valuable practice
within organizations. Text mining tools and natural language processing (NLP) techniques, such as information extraction, enable
the transformation of unstructured documents into a structured format for analysis and the generation of high-quality insights.
This, in turn, improves the decision-making of organizations, leading to better business outcomes.
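The core idea of turning unstructured text into a structured format can be illustrated with a small term-count table. The reviews and the `tokenize` helper below are assumptions made for the example; a classifier such as Naïve Bayes or an SVM would take a matrix like this as its input.

```python
from collections import Counter

# Hypothetical unstructured input: free-text product reviews.
reviews = [
    "Great battery, great screen",
    "Battery died quickly",
    "Screen is bright",
]


def tokenize(text):
    """Split on whitespace, strip basic punctuation, and lowercase."""
    words = (w.strip(".,!?").lower() for w in text.split())
    return [w for w in words if w]


# One word-count dictionary per document.
counts = [Counter(tokenize(r)) for r in reviews]

# Structured output: one row per document, one column per vocabulary term.
vocabulary = sorted(set().union(*counts))
matrix = [[c[term] for term in vocabulary] for c in counts]

print(vocabulary)
for row in matrix:
    print(row)
```

Each row is now a fixed-length numeric record, which is the tabular form that standard analysis and machine learning tools expect.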
Q5. Explain PageRank Algorithm and its implementation steps.
• The PageRank algorithm, developed by Larry Page and Sergey Brin, the founders of Google, is a method used to determine the importance of web pages in search engine
results. Named after Larry Page, PageRank measures the significance of a webpage based on the quantity
and quality of links pointing to it.
• Google describes PageRank as a system that assesses a webpage's importance by considering the number and quality of links it
receives from other pages. The underlying assumption is that more important websites are likely to attract more links from other
websites.
• The algorithm generates a probability distribution to represent the likelihood that a random surfer clicking on links will land on any
particular page. It can be applied to collections of documents of any size, assuming an even distribution of importance among all
documents at the start of the computation.
• In the PageRank computation, a series of iterations, or passes through the collection, are required to adjust the approximate
PageRank values to better reflect the true value. This involves transferring PageRank from a page to the targets of its outbound links,
with the transfer evenly divided among all outbound links.
• For example, consider four pages A, B, C, and D, each starting with a PageRank of 0.25. If B, C, and D each link only to A, each of them
transfers its full 0.25 to A upon the next iteration, for a total of 0.75.
• In another scenario, page B links to pages C and A, page C links to page A, and page D links to all three pages. Each page divides its
PageRank evenly among its outbound links, so in the first iteration B transfers 0.125 to A, C transfers 0.25, and D transfers about 0.083,
for a total of about 0.458. In the general case, the PageRank value for any page u is the sum, over every page v linking to u, of the
PageRank of v divided by the number of outbound links from v. The full calculation also applies a damping factor, commonly set to around
0.85, which represents the probability that the random surfer keeps following links instead of jumping to a random page.
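The general case and the two scenarios can be written out as a short worked calculation. The symbols are assumptions introduced here for illustration: d is the damping factor, N the number of pages, B_u the set of pages linking to u, and L(v) the number of outbound links from v. The two numeric examples assume four pages that each start at 0.25 and ignore damping.

```latex
% General form with damping factor d
PR(u) = \frac{1 - d}{N} + d \sum_{v \in B_u} \frac{PR(v)}{L(v)}

% Scenario 1: B, C and D each link only to A
PR(A) = 0.25 + 0.25 + 0.25 = 0.75

% Scenario 2: B links to C and A, C links to A, D links to A, B and C
PR(A) = \frac{0.25}{2} + \frac{0.25}{1} + \frac{0.25}{3} \approx 0.458
```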
The implementation of the PageRank algorithm involves several steps:
1. Data Collection: Gather information about web pages and their links. This typically involves crawling the web to create a web graph, where
nodes represent web pages and edges represent links between them.
2. Initialization: Assign an initial PageRank value to each web page. In the original formulation every page started with a value of 1.
Later versions treat the values as a probability distribution between 0 and 1, so each of N pages starts at 1/N.
3. Iteration: Perform a series of iterations to update the PageRank values. In each iteration, calculate the PageRank for each page based on the
PageRank values of the pages linking to it, divided by their numbers of outbound links.
4. Damping Factor: Apply a damping factor within each iteration. The damping factor, commonly set to around 0.85, represents the
probability that a random surfer will continue clicking on links rather than jumping to a random page.
5. Convergence: Repeat the iteration process until the PageRank values converge to stable values. Convergence occurs when the PageRank values
no longer change significantly between iterations.
6. Normalization: Normalize the PageRank values to ensure they sum up to 1. This step ensures that the PageRank values represent a probability
distribution.
7. Implementation Considerations: Implement efficient data structures and algorithms to handle large-scale web graphs. This may involve
distributed computing techniques and optimizations to improve performance and scalability.
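The steps above can be sketched as a small iterative implementation. This is an illustration under stated assumptions, not Google's implementation: the function name `pagerank`, the dictionary-based graph format, the default damping value of 0.85 and the tolerance are choices made for the example, and spreading the rank of pages without outbound links evenly over all pages is one common convention among several.

```python
def pagerank(links, damping=0.85, tol=1e-10, max_iter=200):
    """Iteratively compute PageRank.

    links maps each page to the list of pages it links to.
    """
    nodes = sorted(set(links) | {t for ts in links.values() for t in ts})
    n = len(nodes)

    # Initialization: an even probability distribution over all pages.
    rank = {p: 1.0 / n for p in nodes}

    for _ in range(max_iter):
        # Every page receives the "random jump" share first.
        new_rank = {p: (1.0 - damping) / n for p in nodes}

        for page in nodes:
            targets = links.get(page, [])
            if targets:
                # Transfer rank evenly across the page's outbound links.
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Page with no outbound links: spread its rank over all pages.
                for t in nodes:
                    new_rank[t] += damping * rank[page] / n

        # Convergence: stop when the values barely change between iterations.
        change = sum(abs(new_rank[p] - rank[p]) for p in nodes)
        rank = new_rank
        if change < tol:
            break

    return rank


# Example graph: B links to C and A, C links to A, D links to A, B and C.
links = {"B": ["C", "A"], "C": ["A"], "D": ["A", "B", "C"]}
ranks = pagerank(links)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 4))
```

Because the random-jump share and the handling of pages without outbound links conserve the total, the values in this sketch already sum to 1, so no separate normalization pass is needed here.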
Copyright © 2024 Jayanti Rajdevendra Pande.
All rights reserved.
This content may be printed for personal use only. It may not be copied, distributed, or used for any other purpose
without the express written permission of the copyright owner.
This content is protected by copyright law. Any unauthorized use of the content may violate copyright laws and
other applicable laws.
For any further queries, contact by email: jayantipande17@gmail.com
