Text Mining
Maurice Masih
13030141093
04/02/15 1
Topics of Discussion
• Introduction
• Text Mining: Comparison with Other Mining Fields
• Text Mining Process
• How an Algorithm Is Derived for Text Mining
• Text Analysis for Google Sheets
• Conclusion
Introduction
• Text mining is the process of deriving high-quality information:
– non-trivial information
– from unstructured text.
• It is also called text data mining or text analytics.
Need
Biotech industry:
• 80% of biological knowledge exists only in research papers (unstructured text).
• If a scientist manually reads 50 research papers per week and only 10% of the data are useful, he or she effectively covers the equivalent of only 5 research papers per week (50 × 10% = 5).
Text mining Comparison with…
[Diagram: text mining and the related fields it is compared with]
• Information Retrieval
• Web Mining
• Data Mining
• Statistics
• Computational Linguistics & Natural Language Processing
Text Mining Process
1. Text
• Document clustering
• Text characteristics
2. Text Preprocessing
• Text cleanup
• Tokenization
3. Text Transformation
• Text representation
• Feature selection
4. Attribute Selection
• Reduce dimensionality
• Remove irrelevant attributes
5. Data Mining / Pattern Discovery
• Structured database
• Application dependent
• Classic data mining techniques
6. Interpretation / Evaluation
• Terminate or iterate
1. Text
Document Clustering
• Large volumes of textual data.
• No clear picture of which documents suit the application.
• A common technique is k-means clustering.
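To make the clustering step concrete, here is a minimal sketch of k-means over simple word-count vectors. It is an illustration only, not the method used in these slides: the sample documents, the function names (`vectorize`, `kmeans`), and the fixed starting centroids (`init_indices`) are all assumptions chosen to keep the example small and repeatable.

```python
import math
from collections import Counter

def vectorize(docs):
    """Turn each document into a word-count vector over a shared vocabulary."""
    vocab = sorted({w for d in docs for w in d.lower().split()})
    vectors = []
    for d in docs:
        counts = Counter(d.lower().split())
        vectors.append([counts[w] for w in vocab])
    return vocab, vectors

def distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def kmeans(vectors, k, init_indices, max_iter=100):
    """Plain k-means; init_indices picks the starting centroids (assumption)."""
    centroids = [list(vectors[i]) for i in init_indices]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each document goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: distance(v, centroids[c]))
            for v in vectors
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

# Hypothetical toy corpus: two biology snippets and two sports snippets.
docs = [
    "gene protein cell",
    "protein cell dna",
    "match goal team",
    "team goal score",
]
vocab, vectors = vectorize(docs)
labels = kmeans(vectors, k=2, init_indices=[0, 2])
print(labels)
```

In practice the starting centroids are usually chosen at random and the algorithm is run several times; fixing them here only keeps the sketch deterministic.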
Text Characteristics
• Dependency
• Ambiguity
• Noisy data
• Unstructured data
2. Text Preprocessing
Text Cleanup
• Remove ads from pages
• Convert from binary formats
• Normalize text
• Deal with tables, figures, and formulas
Tokenization
• Split a string of characters into a set of tokens.
• Handle issues such as apostrophes and hyphens.
• Handle tenses, parts of speech, etc.
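As an illustration of the tokenization step, the sketch below splits text into lowercase tokens while keeping internal apostrophes and hyphens inside a word. The regular expression and the name `tokenize` are assumptions made for this example; it does not attempt the tense or part-of-speech handling mentioned above, which would need a stemmer or tagger.

```python
import re

# Runs of letters/digits, optionally joined by internal apostrophes or hyphens,
# so "doesn't" and "state-of-the-art" each stay a single token.
TOKEN_PATTERN = re.compile(r"[A-Za-z0-9]+(?:['-][A-Za-z0-9]+)*")

def tokenize(text):
    """Return the lowercase tokens found in text, in order."""
    return [m.group(0).lower() for m in TOKEN_PATTERN.finditer(text)]

tokens = tokenize("The state-of-the-art parser doesn't split e-mail addresses.")
print(tokens)
```

Whether a hyphenated word should be one token or several depends on the application; this sketch simply keeps it whole.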
3. Text Transformation
Text Representation
• A text document is represented by the words (features) it contains and their occurrences.
Bag of Words
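A bag of words can be sketched in a few lines: each document becomes a table of word counts, and a set of documents becomes a matrix with one column per vocabulary word. The function names (`bag_of_words`, `to_matrix`) and the sample sentences are assumptions for illustration; whitespace splitting stands in for a real tokenizer.

```python
from collections import Counter

def bag_of_words(document):
    """Count how often each word occurs; word order is discarded."""
    return Counter(document.lower().split())

def to_matrix(documents):
    """Build a document-term matrix: one row per document, one column per word."""
    bags = [bag_of_words(d) for d in documents]
    vocabulary = sorted(set().union(*bags))
    matrix = [[bag[word] for word in vocabulary] for bag in bags]
    return vocabulary, matrix

documents = ["the cat sat on the mat", "the dog sat"]
vocabulary, matrix = to_matrix(documents)
print(vocabulary)
for row in matrix:
    print(row)
```

Each row of the matrix is the structured form of one document, which is what the later stages operate on.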
4. Attribute Selection
Reduction of dimensionality
• Learners have difficulty with high-dimensional tasks.
• Limited resources and feasibility concerns also call for a further reduction in the number of attributes.
Irrelevant features
• Not all features help!
• E.g., the existence of a noun in a news article is unlikely to help classify it as “politics” or “sport”.
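One simple way to drop unhelpful attributes is to filter terms by document frequency: a term that appears in almost every document (like the noun example above) cannot separate classes, and a term that appears only once is too rare to learn from. The slides do not name a specific method, so this technique, the function name `select_features`, the thresholds `min_df` and `max_df_ratio`, and the toy matrix are all assumptions.

```python
def select_features(vocabulary, matrix, min_df=2, max_df_ratio=0.9):
    """Keep terms found in at least min_df documents and in at most
    max_df_ratio of all documents (document-frequency filtering)."""
    n_docs = len(matrix)
    keep = []
    for j, _term in enumerate(vocabulary):
        # Document frequency: number of documents containing the term.
        df = sum(1 for row in matrix if row[j] > 0)
        if df >= min_df and df / n_docs <= max_df_ratio:
            keep.append(j)
    reduced_vocabulary = [vocabulary[j] for j in keep]
    reduced_matrix = [[row[j] for j in keep] for row in matrix]
    return reduced_vocabulary, reduced_matrix

# Hypothetical document-term matrix: rows are documents, columns are terms.
vocabulary = ["election", "goal", "rare", "the"]
matrix = [
    [2, 0, 0, 5],
    [1, 0, 1, 3],
    [0, 3, 0, 4],
    [0, 1, 0, 6],
]
reduced_vocabulary, reduced_matrix = select_features(vocabulary, matrix)
print(reduced_vocabulary)
print(reduced_matrix)
```

Here "the" is dropped because it occurs in every document and "rare" because it occurs in only one, which halves the number of columns.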
5. Data Mining / Pattern Discovery
• The text mining process merges with the traditional data mining process.
• Classic data mining techniques are applied to the structured database that resulted from the previous stages.
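As one example of a classic technique applied once text has been turned into counts, the sketch below trains a small multinomial Naive Bayes classifier with add-one smoothing. The slides do not say which technique is used, so the choice of Naive Bayes, the function names (`train_naive_bayes`, `classify`), and the four training sentences are assumptions for illustration.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(documents, labels):
    """Collect per-class document counts and per-class word counts."""
    class_docs = Counter(labels)
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for text, label in zip(documents, labels):
        words = text.lower().split()
        word_counts[label].update(words)
        vocabulary.update(words)
    return class_docs, word_counts, vocabulary

def classify(text, model):
    """Return the class with the highest log-probability for text."""
    class_docs, word_counts, vocabulary = model
    total_docs = sum(class_docs.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_docs.items():
        total_words = sum(word_counts[label].values())
        score = math.log(n_docs / total_docs)  # class prior
        for word in text.lower().split():
            if word not in vocabulary:
                continue  # ignore words never seen in training
            # Add-one (Laplace) smoothing avoids zero probabilities.
            score += math.log(
                (word_counts[label][word] + 1) / (total_words + len(vocabulary))
            )
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical training data.
train_docs = [
    "election vote parliament",
    "minister vote law",
    "goal match team",
    "team score goal",
]
train_labels = ["politics", "politics", "sport", "sport"]
model = train_naive_bayes(train_docs, train_labels)
prediction = classify("the team won the match", model)
print(prediction)
```

Any other standard classifier or clustering method could take the place of Naive Bayes here; the point is only that the input is now structured data.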
6. Interpretation & Evaluation
What to do next?
• Terminate
• Iterate
How an Algorithm Is Derived for Text Mining
Text Analysis for Google Sheets
• Perform sentiment analysis.
• Extract mentions of entities and concepts.
• Summarize long chunks of text.
• Detect the language of a document.
• Find the best hashtags.
• Extract the full text of an article, as well as its author name, embedded media, etc.
Conclusion
Text mining generally consists of analyzing (multiple) text documents
by extracting key phrases, concepts, matches, etc., and preparing the
processed text for further analysis with numeric data mining techniques.