Text Mining
Submitted to:
Ms. Mala Kalra
Dr. Rakesh Kumar
Assistant Professor
Department of CSE
NITTTR Chandigarh
Submitted by:
Pankaj Thakur
MECSE (Modular)
RN 171408
Contents
• Introduction & Need
• Information Retrieval and its Methods
• Approaches
• Process
• Techniques used
• Merits & Demerits
• Challenges
• Applications
• Text Mining Computer Programs
• Demo using Python
• Latest Research work
• References
• Query
Introduction
• Data means known facts that can be recorded and that have implicit meaning.[1]
• Database means a collection of related data.[1]
• Data Warehouse is a repository of information collected from multiple sources, stored under a
unified schema, and usually residing at a single site.[2]
• Data Mining means knowledge mining from data, i.e. extracting knowledge from large amounts of data.[2]
• Text databases (document databases) are large collections of documents from various sources:
news articles, research papers, books, digital libraries, e-mail messages, web pages, etc.
They may be:
  • highly unstructured (some web pages on the WWW)
  • semi-structured (e-mail messages)
  • structured (a library catalogue database)
• Text databases with highly regular structures can typically be implemented
using relational database systems.
• Text Mining is the analysis of data contained in natural language text.
• Regular data mining vs. text mining: in text mining the patterns are extracted from
natural language text rather than from structured databases of facts.[3]
Text Mining Vs. Data Mining
Aspect | Data Mining | Text Mining
Data object | Numerical & categorical data | Textual data
Data structure | Structured | Unstructured & semi-structured
Data representation | Straightforward | Complex
Space dimension | < tens of thousands | > tens of thousands
Methods | Data analysis, machine learning, statistics, neural networks | Data mining, information retrieval, NLP, ...
Maturity | Broad implementation since 1994 | Broad implementation starting 2000
Market | 10^5 analysts at large and mid-size companies | 10^8 analysts, corporate workers and individual users
Need of Text Mining
• The amount of new information being created is massive and doubles every 18 months.
• 80-90% of all data is held in various unstructured formats.
• Useful information can be derived from this unstructured data.
• Information falls into two groups: unstructured or semi-structured information (news articles,
research papers, books, digital libraries, e-mail messages, and web pages) and structured,
numerical or coded information.
• Text databases are growing rapidly because of the increasing amount of information available in
electronic form, such as electronic publications, various kinds of electronic documents, e-mail,
and the WWW.
• Most of the information in government, industry, business, and other institutions is stored
electronically in the form of text databases.
Information Retrieval[2]
Information retrieval (IR) is a field that has been developing in parallel with database systems.
It is concerned with the retrieval of information from a large number of text-based documents.
Precision and recall are two basic measures for assessing the quality of text retrieval.
Precision is the percentage of retrieved documents
that are in fact relevant to the query:
$$\text{precision} = \frac{|\{\text{Relevant}\} \cap \{\text{Retrieved}\}|}{|\{\text{Retrieved}\}|}$$
Recall is the percentage of documents that are relevant
to the query and were, in fact, retrieved:
$$\text{recall} = \frac{|\{\text{Relevant}\} \cap \{\text{Retrieved}\}|}{|\{\text{Relevant}\}|}$$
where {Relevant} is the set of documents relevant to a query and
{Retrieved} is the set of documents retrieved.
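The two measures can be computed directly from the {Relevant} and {Retrieved} sets. The following is a minimal sketch; the function name `precision_recall` and the document ids are hypothetical and chosen only for illustration.

```python
def precision_recall(relevant, retrieved):
    """Return (precision, recall) for two collections of document ids."""
    relevant, retrieved = set(relevant), set(retrieved)
    hits = relevant & retrieved  # relevant documents that were actually retrieved
    # Guard against empty sets so the ratios are always defined.
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical document ids, for illustration only.
relevant = {"d1", "d2", "d3", "d4"}
retrieved = {"d2", "d3", "d5"}
p, r = precision_recall(relevant, retrieved)
print(f"precision={p:.2f}, recall={r:.2f}")
```

Here two of the three retrieved documents are relevant, and two of the four relevant documents were retrieved, so precision and recall differ even though the intersection is the same.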
Information Retrieval Methods[2]
IR methods fall into two categories:
• Document selection methods
  • treat retrieval as a document selection problem
  • example: Boolean retrieval model
• Document ranking methods
  • treat retrieval as a document ranking problem
  • example: vector space model
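As an illustration of document selection, the sketch below answers an AND query with a small inverted index. It is one simple way to realize a Boolean retrieval model, not the specific formulation in [2]; the helper names and the three-document collection are hypothetical.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each lowercase term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def boolean_and(index, terms):
    """Return the ids of documents that contain every query term."""
    postings = [index.get(t.lower(), set()) for t in terms]
    return set.intersection(*postings) if postings else set()


# Hypothetical mini-collection, for illustration only.
docs = {
    "d1": "text mining extracts patterns from text",
    "d2": "data mining works on structured data",
    "d3": "information retrieval ranks text documents",
}
index = build_inverted_index(docs)
print(sorted(boolean_and(index, ["text", "mining"])))
print(sorted(boolean_and(index, ["mining"])))
```

A document is either selected or not; there is no score, which is the main contrast with ranking methods.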
Vector Space Model[2]
• Represent both a document and a query as vectors in a high-dimensional space
corresponding to all the keywords, and use an appropriate similarity measure to
compute the similarity between the query vector and the document vector.
• The similarity values can then be used for ranking documents.
• Let freq(d, t) = term frequency = number of occurrences of term t in document d.
• TF(d, t) = entry of the term frequency matrix; it measures the association of a term t with
the given document d.
Term Frequency:
$$TF(d, t) = \begin{cases} 0 & \text{if } freq(d, t) = 0 \\ 1 + \log(1 + \log(freq(d, t))) & \text{otherwise} \end{cases}$$
Inverse Document Frequency (represents the scaling factor, or the importance, of term t):
$$IDF(t) = \log \frac{1 + |d|}{|d_t|}$$
Here, d is the document collection and
d_t is the set of documents containing term t.
TF-IDF:
$$TF\text{-}IDF(d, t) = TF(d, t) \times IDF(t)$$
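The slide leaves the similarity measure open. A common choice is cosine similarity, which is assumed in the sketch below; the vectors, document ids, and function names are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_documents(query_vec, doc_vecs):
    """Return (doc_id, score) pairs sorted by decreasing similarity to the query."""
    scores = {doc_id: cosine_similarity(query_vec, vec) for doc_id, vec in doc_vecs.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical weights over three keywords, for illustration only.
doc_vecs = {"d1": [1.0, 0.0, 2.0], "d2": [0.0, 3.0, 0.0], "d3": [1.0, 1.0, 1.0]}
query_vec = [1.0, 0.0, 1.0]
ranking = rank_documents(query_vec, doc_vecs)
for doc_id, score in ranking:
    print(doc_id, round(score, 3))
```

In practice the vector entries would be TF-IDF weights rather than the hand-picked numbers used here.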
Vector Space Model[2]
(Example)
d/t  t1  t2  t3  t4  t5  t6  t7
d1    0   4  10   8   0   5   0
d2    5  19   7  16   0   0  32
d3   15   0   0   4   9   0  17
d4   22   3  12   0   5  15   0
d5    0   7   0   9   2   4  12
A Term Frequency Matrix
For t6 in d4 we have (logarithms to base 10; t6 occurs in 3 of the 5 documents):
TF(d4, t6) = 1 + log(1 + log(15)) = 1.3377
IDF(t6) = log((1 + 5)/3) = 0.301
TF-IDF(d4, t6) = 1.3377 × 0.301 = 0.403
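The worked example can be checked with a few lines of Python. This sketch assumes base-10 logarithms, which is what the slide's numbers imply; the function names are hypothetical.

```python
import math

# Term frequency matrix from the example: rows d1..d5, columns t1..t7.
freq = [
    [0, 4, 10, 8, 0, 5, 0],
    [5, 19, 7, 16, 0, 0, 32],
    [15, 0, 0, 4, 9, 0, 17],
    [22, 3, 12, 0, 5, 15, 0],
    [0, 7, 0, 9, 2, 4, 12],
]


def tf(f):
    """Dampened term frequency: 0 if the term is absent, else 1 + log(1 + log(f))."""
    return 0.0 if f == 0 else 1 + math.log10(1 + math.log10(f))


def idf(matrix, t):
    """log((1 + number of documents) / number of documents containing term t)."""
    n_docs = len(matrix)
    n_containing = sum(1 for row in matrix if row[t] > 0)
    if n_containing == 0:
        return 0.0  # assumption: a term that never occurs gets zero weight
    return math.log10((1 + n_docs) / n_containing)


def tf_idf(matrix, d, t):
    return tf(matrix[d][t]) * idf(matrix, t)


d4, t6 = 3, 5  # zero-based indices of document d4 and term t6
print(round(tf(freq[d4][t6]), 4))
print(round(idf(freq, t6), 3))
print(round(tf_idf(freq, d4, t6), 3))
```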
Text Mining Approaches[2]
Text mining approaches can be grouped by the input they take:
1. Keyword-based approach
• Input: a set of keywords or terms in the documents
• May only discover relationships between co-occurring terms,
e.g. “database” & “system”, “terrorist” & “explosion”
• May not bring deep understanding to the text
2. Tagging approach
• Input: a set of tags
• May rely on manual tagging (costly and not feasible
for large collections of documents)
3. Information extraction approach
• Input: semantic information (events, facts, etc.)
• More advanced
• May lead to the discovery of some deep knowledge
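To make the keyword-based approach concrete, the sketch below counts how often pairs of given keywords occur in the same document. It is only one simple illustration of keyword co-occurrence; the function name and the sample documents are hypothetical.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(docs, keywords):
    """Count, for each pair of keywords, the number of documents containing both."""
    keywords = {k.lower() for k in keywords}
    counts = Counter()
    for text in docs:
        # Sort so each pair is always stored in the same (alphabetical) order.
        present = sorted(keywords & set(text.lower().split()))
        counts.update(combinations(present, 2))
    return counts


# Hypothetical documents, for illustration only.
docs = [
    "the database system stores records",
    "a database system needs a query language",
    "the system crashed",
]
counts = cooccurrence_counts(docs, ["database", "system", "query"])
print(counts.most_common(1))
```

Such counts reveal that two terms tend to appear together, but say nothing about why, which is the limitation the slide points out.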
Text Mining Process[4]
1. Text documents are collected from different sources
2. Preprocessing
3. A text mining technique is applied
4. Analysis of text
5. Discovery of knowledge
Technologies such as information extraction, categorization, clustering, visualization, and summarization
are used in the text mining process.
Techniques Used in Text Mining[4]
1. Information extraction:
involves tokenization, identification of named entities, sentence segmentation, and part-of-speech assignment.
2. Text categorization:
the procedure of assigning a text to one of the categories predefined by users.
3. Text clustering:
the procedure of segmenting texts into several clusters according to their substantial relevance.
4. Visualization:
improves and simplifies the discovery of relevant information.
5. Text summarization:
the procedure of automatically extracting partial content of a text that reflects its whole content.
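Two of the information extraction steps, sentence segmentation and tokenization, can be sketched with the standard library alone. The regular expressions below are deliberately naive assumptions (they mishandle abbreviations such as "e.g."), and the function names are hypothetical.

```python
import re


def split_sentences(text):
    """Naive sentence segmentation: split after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def tokenize(sentence):
    """Naive tokenization: lowercase runs of letters, digits and apostrophes."""
    return re.findall(r"[a-z0-9']+", sentence.lower())


text = "Text mining analyses natural language text. It needs preprocessing first!"
sentences = split_sentences(text)
tokens = [tokenize(s) for s in sentences]
print(sentences)
print(tokens)
```

Named entity identification and part-of-speech assignment need trained models or lexicons and are outside the scope of this sketch.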
Merits and Demerits of Text Mining[4]
Merits:
i) The names of different entities and the relationships between them can easily be
found in a corpus of documents (using techniques such as
information extraction).
ii) Text mining solves the challenging problem of managing large amounts of unstructured
information in order to extract patterns.
Demerits:
i) The information that is initially needed is nowhere written down.
ii) No program can be written that analyzes unstructured text directly
to mine it for information or knowledge.
Challenges in Text Mining
(Representation issues)
• Each word has a dictionary meaning, or meanings:
Run – (1) the verb; (2) the noun, in cricket
Cricket – (1) the game; (2) the insect
Apple – (1) the company; (2) the fruit
• Ambiguity and context sensitivity: each word is used in various “senses”:
Tendulkar made 100 runs.
Because of an injury, Tendulkar cannot run and will need a runner between the
wickets.
• Capturing the “meaning” of sentences is an important issue as well.
(Grammar, parts of speech, and time sense could be easy!)
• The order of words in the query matters:
hot dog stand in the amusement park
hot amusement stand in the dog park
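The word order example can be made concrete: under a bag-of-words representation the two queries are indistinguishable, while word bigrams tell them apart. This is a minimal sketch with hypothetical helper names.

```python
from collections import Counter


def bag_of_words(text):
    """Word counts with order discarded."""
    return Counter(text.lower().split())


def bigrams(text):
    """Set of adjacent word pairs, which keeps local word order."""
    words = text.lower().split()
    return set(zip(words, words[1:]))


q1 = "hot dog stand in the amusement park"
q2 = "hot amusement stand in the dog park"

same_bag = bag_of_words(q1) == bag_of_words(q2)
shared = bigrams(q1) & bigrams(q2)
print("identical bags of words:", same_bag)
print("shared bigrams:", sorted(shared))
```

Both queries contain exactly the same words, so any purely keyword-based representation treats them as the same query.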
Text Mining Applications[5]
1. Security applications
(monitoring and analysis of online plain-text sources such as Internet news, blogs, etc.
for national security purposes)
2. Biomedical applications
(studies in protein docking, protein interactions, and protein-disease associations)
3. Software applications
(within the public sector, much effort has been concentrated on creating software for
tracking and monitoring terrorist activities)
4. Online media applications
(the Tribune Company uses text mining to clarify information and to provide readers
with greater search experiences, which in turn increases site "stickiness" and revenue)
5. Business and marketing applications
(CRM, improving predictive analytics models for customers, stock returns prediction)
6. Sentiment analysis
(analysis of movie reviews, detection of emotions, etc.)
7. Scientific literature mining and academic applications
Text Mining Computer Programs[5]
Demo
• Text Mining using Python
(Twitter, WhatsApp chats)
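The slides do not include the demo code itself. As a rough idea of what a first step on chat or tweet text might look like, the sketch below counts the most frequent non-stopword terms. The messages, the stopword list, and the function name are all hypothetical and are not taken from the original demo.

```python
import re
from collections import Counter

# Small hypothetical stopword list; real projects usually use a fuller one.
STOPWORDS = {"a", "an", "the", "is", "are", "to", "of", "and", "in", "for", "on", "at"}


def top_terms(messages, n=3):
    """Return the n most frequent non-stopword terms across all messages."""
    counts = Counter()
    for msg in messages:
        words = re.findall(r"[a-z']+", msg.lower())
        counts.update(w for w in words if w not in STOPWORDS)
    return counts.most_common(n)


# Hypothetical chat messages, for illustration only.
messages = [
    "Meeting at 5 for the text mining demo",
    "Is the demo on text mining ready?",
    "Demo slides are ready",
]
print(top_terms(messages))
```

Real chat exports would first need parsing of timestamps and sender names, which depends on the export format and is not shown here.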
Latest Research Work on Text Mining[6]
1. Sunil Kumar, Maninder Singh, “Big data analytics for healthcare industry: impact,
applications, and tools”, DOI: 10.26599/BDMA.2018.9020031
2. Bing Li, Xiaochun Yang, Rui Zhou, Bin Wang, Chengfei Liu, Yanchun Zhang, “An
Efficient Method for High Quality and Cohesive Topical Phrase Mining”, DOI:
10.1109/TKDE.2018.2823758
3. Steven H. H. Ding, Benjamin C. M. Fung, Farkhund Iqbal, William K. Cheung,
“Learning Stylometric Representations for Authorship Analysis”, DOI:
10.1109/TCYB.2017.2766189
4. Mohammed Nasri, Younes Jaafar, Karim Bouzoubaa, “Semantic Analysis of Arabic
Texts Within SAFAR Framework”, DOI: 10.1109/CIST.2018.8596491
5. Jayesh Choudhari, Anirban Dasgupta, Indrajit Bhattacharya, Srikanta Bedathur,
“Discovering Topical Interactions in Text-Based Cascades Using Hidden Markov
Hawkes Processes”, DOI: 10.1109/ICDM.2018.00112
6. Yong Luo, Huaizheng Zhang, Yongjie Wang, Yonggang Wen, Xinwen Zhang,
“ResumeNet: A Learning-Based Framework for Automatic Resume Quality
Assessment”, DOI: 10.1109/ICDM.2018.00046
7. Si-Yu Ding, Xu-Ying Liu, Min-Ling Zhang, “Imbalanced Augmented Class Learning with
Unlabeled Data by Label Confidence Propagation”, DOI: 10.1109/ICDM.2018.00023
References
[1] Ramez Elmasri and Shamkant B. Navathe, “Fundamentals of Database Systems”, 6th
edition.
[2] Jiawei Han and Micheline Kamber, “Data Mining: Concepts and Techniques”, 2nd
edition.
[3] http://people.ischool.berkeley.edu/~hearst/text-mining.html
[4] Sonali Vijay Gaikwad, Archana Chaugule, Pramod Patil, “Text Mining Methods and
Techniques”, International Journal of Computer Applications (0975 – 8887),
Volume 85 – No 17, January 2014
[5] http://www.wikipedia.org
[6] https://ieeexplore.org
Questions?
Thanks!
Data Warehouse
[Figure: data sources in Delhi, Mumbai, Kolkata, and Chennai → Clean, Integrate, Transform, Load, Refresh → Data Warehouse → Query and Analysis Tools → Clients]