Copyright 2003-4, SPSS Inc. 1
An Introduction to Text Mining
Tim Daciuk
SPSS, Inc.
Services Manager, Canada

Agenda
• Introductions
• An Overview of Document Warehousing
• Understanding Unstructured Text
• Concept Extraction
• Text Mining
• Data Mining
• Demonstration

Tim Daciuk
• Background
  • Social research
  • Survey research
• SPSS
  • 25 years working with the product
  • 12 years working with the company
  • 5 years working with text analysis
• Prior history
  • Consulting
  • Education

Predictive Analytics: Defined
“Predictive analysis helps connect data to effective action by drawing reliable conclusions about current conditions and future events.”
- Gareth Herschel, Research Director, Gartner Group

SPSS At A Glance
• Leadership
  • Market leader in Predictive Analytics
  • Focus on online & offline customer data acquisition and analysis
• Stability
  • Founded in 1968
  • 30+ year heritage in analytic technologies
• Proven track record
  • 250,000+ customers worldwide
  • NASDAQ: SPSS
• Analytics standard
  • 80% of Fortune 500 are SPSS customers
  • 80% plus market share in Survey & Market Research sector
  • Ranked #1 Data Mining solution by KDnuggets

Some of Our Brands

Unstructured Data Management
Text Mining is a subset of Unstructured Data Management (UDM).
UDM can be broken down into:
• Content and Document Management
• Search and Retrieval
• XML database and tools
• Categorization, Classification, and Visualization

80% of Data is Unstructured
• Database notes:
  • Call center transcripts
  • Other CRM
• Email
• Open-ended survey responses
• Web pages
• Newsgroups
• Documents themselves
• Competitive information

Applications for Text Analysis
• Surveys
• ‘Reading’ email
• Call centre data
• Comment data
• Abstracts
• Document management
• Corporate history
• Thematic understanding of website

Data Warehouse vs. Document Warehouse
• Data warehouse
  • Who, what, when, where, how much
  • Internally focused
  • Operational information
  • Rarely includes external information
• Document warehouse
  • Why
  • May not be internally focused
  • May contain a range of information
  • Often integrates external information

Document Warehouse Features
• There is no single document structure or document type
• Documents are drawn from multiple sources
• Essential features of documents are automatically extracted and explicitly stored in the document warehouse
• Document warehouses are designed to integrate semantically related documents

Building the Document Warehouse
[Process diagram. Step labels: Identify Sources; Retrieve Document; Text Analysis; Pre-process Document; Compile Metadata]

Predict, Impact, Deploy
[Diagram. Labels: Customer Data; Attitudes; Actions; Attributes; Business User; Outcomes; Attract; Grow; Retain; Fraud; Data Collection; Text; Surveys; Web; Channel; Operational Systems; Business UI; Expert UI; Concepts; Concept Maps; Clustering; Categorization; Trending; Information Extraction; Prediction; NLP]

The Building Blocks of Language
• Morphology
• Syntax
• Semantics
• Phonology
• Pragmatics

Morphology
• Understanding words
  • Stems
  • Affixes
    • Prefix
    • Suffix
  • Inflectional elements
• Reduces complexity of analysis
• Reduces complexity of representation
• Supports text mining
[Word diagram. Node labels: Noun, Prefix, Noun, Stem, Suffix. Segments: "in-", "dispute", "-able"]
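The prefix, stem, and suffix split on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the affix lists and the `split_affixes` helper are hypothetical and are not part of any SPSS or LexiQuest product. A real morphological analyzer would rely on a lexicon rather than plain string matching.

```python
# Hypothetical affix lists, chosen only to illustrate the slide's example.
PREFIXES = ("in", "un", "re")
SUFFIXES = ("able", "ing", "ed")


def split_affixes(word):
    """Split a word into (prefix, stem, suffix) by naive string matching."""
    prefix = next((p for p in PREFIXES if word.startswith(p)), "")
    rest = word[len(prefix):]
    suffix = next((s for s in SUFFIXES if rest.endswith(s)), "")
    stem = rest[:len(rest) - len(suffix)] if suffix else rest
    return prefix, stem, suffix


print(split_affixes("indisputable"))  # ('in', 'disput', 'able')
```

The stem comes out as "disput" rather than "dispute" because string matching does not restore the dropped "e". Handling spelling changes like that is one reason practical systems use dictionaries.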

Syntax
• The Bank of Canada will curb inflation with higher interest rates
[Parse tree of the sentence. Node labels: Sentence, Noun phrase, Verb phrase, Prepositional phrase, Aux, Verb, Noun, Adjective. Leaves: "The Bank of Canada", "will", "curb", "inflation", "with", "higher", "interest rates"]
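A parse tree like the one on this slide can be represented as nested tuples. The bracketing below is an assumption, since the original diagram did not survive extraction. It is one plausible reading of the sentence, not necessarily the tree shown in the talk. The `leaves` helper is a hypothetical name.

```python
# One plausible bracketing of the slide's sentence (an assumption; the
# original tree diagram is not recoverable from the extracted text).
TREE = (
    "Sentence",
    ("Noun phrase", "The Bank of Canada"),
    (
        "Verb phrase",
        ("Aux", "will"),
        ("Verb", "curb"),
        ("Noun", "inflation"),
        (
            "Prepositional phrase",
            "with",
            ("Noun phrase", ("Adjective", "higher"), ("Noun", "interest rates")),
        ),
    ),
)


def leaves(node):
    """Return the words of a tree in left-to-right order."""
    if isinstance(node, str):
        return [node]
    words = []
    for child in node[1:]:  # node[0] is the phrase or part-of-speech label
        words.extend(leaves(child))
    return words


print(" ".join(leaves(TREE)))
```

Reading the leaves back in order reproduces the original sentence, which is a quick sanity check that the tree covers every word exactly once.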

Semantics
• The meaning of it all
• Approaches to meaning
  • Semantic networks
  • Deductive logic
  • Rule-based systems
• Useful for classification

Problems with NLP
• Limitations of Natural Language Processing
  • Correctly identifying the role of noun phrases
  • Representing abstract concepts
  • Classifying synonyms
  • Representing the number of concepts

Problems with NLP
• Limitations of technology
  • Language-specific designs are required
  • Classification speed
  • Classifying hybrid words and sentences

Underlying Technology is Based on Linguistics
The Linguistic Approach:
• Does not treat a document as a bag of words
• Removes ambiguity by extracting structured concepts
Concepts are the DNA of text.
Text is unstructured, ambiguous, and language dependent.
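To see why the slide contrasts the linguistic approach with a bag of words, consider this minimal sketch. The `bag_of_words` helper and the two sentences are made up for illustration. The point is only that a word-count representation discards word order.

```python
from collections import Counter


def bag_of_words(text):
    """Count lower-cased words, ignoring their order."""
    return Counter(text.lower().split())


a = bag_of_words("the bank will curb inflation")
b = bag_of_words("inflation will curb the bank")
print(a == b)  # True: word order, and with it the meaning, is lost
```

Two sentences with opposite meanings end up with identical representations, which is the ambiguity that structured concept extraction is meant to reduce.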

From Text to Concepts
[Diagram. Labels: Morphology; Syntax; Semantics; Statistics; Linguistic Terminology Extractor; Scalable; Accurate; Customizable; Discovery-Oriented]
• Compound words, proper nouns, figures, named entities, domain specifics
• Speed (1GB/hour), multiple formats (PDF, MS Office, text…), multiple languages (English, French, German, Spanish, Italian, Dutch, Japanese)
• SPSS dictionaries, user dictionaries, extraction rules, extraction patterns
• Known terms, unknown terms, new terms
• Examples shown on the slide: Inserm; merck & co…; tnp-470; glut-4…; factor receptor; inhibitory effect; D. John Paganoni, ..; Positive/Negative opinion…; London, Paris…; Names, Orgs…; MeSH, genes...; Predicates; Synonyms, stop words..; Trends

From Concepts to Predictive Analytics Components
Linguistic Terminology Extractor
• LexiQuest Mine: discover concepts, relationships and trends
• LexiQuest Categorize: understand documents and assign them to pre-defined categories
• Text Mining for Clementine: add text fields to data mining for better prediction

Concept Extraction Engine
The extractor turns unstructured text into concepts:
[Diagram. Labels: LexiQuest Extractor Engine; Linguistic Processor; Visualization; Probabilities; LexiQuest Mine; Clementine; LexiQuest Categorize]

Part-of-Speech Tagging
a: adjective     b: adverb      c: preposition
d: determiner    n: noun        v: verb
o: coordination  p: participle  s: stop word

How is a Concept Extracted?
Step 1: Part-of-Speech Tagging
Using/V a/P tool/N like/A LexiQuest/N Mine/N is/V a/P great/A
idea/N for/P any/A organization/N that/P is/V interested/V in/P maintaining/V
information/N on/P competitive/N intelligence/N .

How is a Concept Extracted?
Step 2: Matching to Known Patterns
This:
V P N A N N V P A N P A N P V V P V N P N N
Looks Most Like:
N C D N N
(32 known patterns for English)

How is the Concept Extracted?
The extractor looks at this sentence:
Using a tool like LexiQuest Mine is a great idea for any organization that is interested in maintaining information on competitive intelligence.
And extracts the concept:
Competitive Intelligence
Concepts are:
• Noun based
• Can be longer than one word
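The two steps above (tag each word, then match the tag sequence against known patterns) can be sketched as follows. The slides do not list the 32 English patterns, so this example assumes a single hypothetical pattern, namely two or more consecutive noun tags. The `noun_runs` helper is an illustrative name, not a LexiQuest function. The word and tag pairs are the ones given in Step 1.

```python
# Word/tag pairs as given in the Step 1 example.
TAGGED = [
    ("Using", "V"), ("a", "P"), ("tool", "N"), ("like", "A"),
    ("LexiQuest", "N"), ("Mine", "N"), ("is", "V"), ("a", "P"),
    ("great", "A"), ("idea", "N"), ("for", "P"), ("any", "A"),
    ("organization", "N"), ("that", "P"), ("is", "V"),
    ("interested", "V"), ("in", "P"), ("maintaining", "V"),
    ("information", "N"), ("on", "P"), ("competitive", "N"),
    ("intelligence", "N"),
]


def noun_runs(tagged, min_len=2):
    """Return runs of at least min_len consecutive N-tagged words."""
    concepts, run = [], []
    for word, tag in tagged + [("", "END")]:  # sentinel flushes the last run
        if tag == "N":
            run.append(word)
        else:
            if len(run) >= min_len:
                concepts.append(" ".join(run))
            run = []
    return concepts


print(noun_runs(TAGGED))  # ['LexiQuest Mine', 'competitive intelligence']
```

With this one assumed pattern the sketch also picks up "LexiQuest Mine", whereas the slide reports only "Competitive Intelligence". The real extractor presumably applies further patterns and dictionaries that are not described here.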

Example: Categorization

The Issue of Language
• NLP requires separate language understanding
• Clementine text mining
  • French
  • English
  • English/French
  • German
  • Spanish
  • Dutch
  • Japanese
  • Italian
• MeSH (Medical Subject Headings)
  • http://www.nlm.nih.gov/mesh/meshhome.html

Data Mining Defined
“The process of discovering meaningful new relationships, patterns and trends by sifting through data using pattern recognition technologies as well as statistical and mathematical techniques.”
- The Gartner Group

Why data mining?
• Data mining software generally employs modeling algorithms designed to handle non-linearities and unusual patterns in data
• This is in contrast to classical linear models (e.g., linear regression), which are less capable of handling them
• A related issue is ‘noise’ in the data: for example, two seemingly similar sets of inputs yield different outputs
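The limitation of a linear model on non-linear data can be shown with a tiny worked example. The data below are made up, and `fit_line` is a hypothetical helper implementing ordinary least squares for a single input. It is a sketch of the general point, not of any specific SPSS algorithm.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


xs = [-2, -1, 0, 1, 2]
ys = [x * x for x in xs]  # a purely non-linear (quadratic) relationship
slope, intercept = fit_line(xs, ys)
print(slope, intercept)  # 0.0 2.0
```

The fitted line is flat: it predicts 2.0 for every input even though y is fully determined by x. A model that can represent curvature would capture the pattern that the straight line misses.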

A Data Mining Methodology
• Use the Cross-Industry Standard Process for Data Mining (CRISP-DM)
• Based on real-world lessons:
  • Focus on business issues
  • User-centric & interactive
  • Full process
  • Results are used

Data Mining is not…
• Keep in mind that data mining is not…
  • “Blind” application of analysis/modeling algorithms
  • Brute-force crunching of bulk data
  • Black box technology
  • Magic

Back to the Process
Text Mining

Understanding
• Business Understanding
  • Determine objective
  • Assess situation
  • Determine data mining goals
  • Produce project plan
• Data Understanding
  • Collect initial data
  • Describe data
  • Explore data
  • Verify data quality

Data Preparation
• Data
  • Data set
  • Data set description
  • Select data
  • Clean data
  • Construct data set / Integrate data
  • Format data
• Text
  • Concept extraction
  • Concept combination
  • Concept assessment

Modeling
• Select modeling technique
  • Universe of techniques
  • Appropriate techniques
  • Data
  • Text
  • Requirements
  • Constraints
  • Selected tools
• Generate test design
• Run model(s)
• Assess model(s)

Evaluation
• Results = Models + Findings
• Evaluate results
• Review process
• Determine next steps

Deployment
• Plan deployment
• Plan monitoring and maintenance
• Final report
• Project review

Data Mining Approaches
• Unsupervised methods:
  • Group patients by drugs and demographic information and try to find unusual patients
• Supervised methods:
  • Attempt to predict amount due and find sets of cases where the amount due is very different from the predicted amount
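The supervised idea on this slide (predict the amount due, then look for cases far from the prediction) can be sketched as below. Everything here is assumed for illustration: the claims data, the `flag_unusual` helper, the 50% tolerance, and the deliberately simple "model" that predicts the median amount for each procedure code. A real project would use a proper predictive model.

```python
from collections import defaultdict
from statistics import median

# Hypothetical claims: (claim id, procedure code, amount due).
CLAIMS = [
    ("c1", "xray", 100.0), ("c2", "xray", 110.0), ("c3", "xray", 95.0),
    ("c4", "xray", 400.0), ("c5", "cast", 250.0), ("c6", "cast", 260.0),
]


def flag_unusual(claims, tolerance=0.5):
    """Flag claims whose amount differs from the predicted amount by more
    than `tolerance`, expressed as a fraction of the prediction."""
    by_code = defaultdict(list)
    for _, code, amount in claims:
        by_code[code].append(amount)
    # Deliberately simple "model": predict the median amount for the code.
    predicted = {code: median(amounts) for code, amounts in by_code.items()}
    return [
        claim_id
        for claim_id, code, amount in claims
        if abs(amount - predicted[code]) > tolerance * predicted[code]
    ]


print(flag_unusual(CLAIMS))  # ['c4']
```

The median is used instead of the mean so that the unusual claim itself does not pull the prediction toward it. That is a design choice for this sketch and is not something stated in the slides.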

What Does Data Mining Do?
• Data mining uses existing data to:
  • Predict
    • Category membership
    • Numeric value
    • E.g. credit risk
  • Group
    • Cluster (group) things together based on their characteristics
    • E.g. different types of TV viewers
  • Associate
    • Find events that occur together, or in a sequence
    • E.g. beer and diapers
  • Find outliers
    • Identify cases that don’t follow expected behaviour
    • E.g. fraudulent behaviour
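The "associate" task (finding events that occur together) can be illustrated by counting item pairs across shopping baskets. The baskets and the `pair_counts` helper are hypothetical. This is a bare co-occurrence count, a first step toward association rules rather than a full algorithm.

```python
from collections import Counter
from itertools import combinations

# Hypothetical shopping baskets.
BASKETS = [
    {"beer", "diapers", "chips"},
    {"beer", "diapers"},
    {"milk", "diapers"},
    {"beer", "chips"},
]


def pair_counts(baskets):
    """Count how many baskets contain each unordered pair of items."""
    counts = Counter()
    for basket in baskets:
        # Sorting gives every pair a canonical (alphabetical) order.
        counts.update(combinations(sorted(basket), 2))
    return counts


counts = pair_counts(BASKETS)
print(counts[("beer", "diapers")])  # 2
```

Dividing a pair's count by the number of baskets gives its support, one of the usual measures for ranking candidate associations.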

Benefits of Document Warehousing
• Richer operational business intelligence
• Knowing your customers
• Macroenvironmental monitoring
• Technology assessment

Conclusions
• Text mining is
  • More than word counts
  • Linguistically based
  • Concept extraction
• Data mining is
  • Advanced analytics applied to datasets
  • A family of techniques
  • Supervised or unsupervised

Conclusions
• Text and data mining
  • Add dimensionality to the data
  • Allow for automation of the text analysis event
  • Create 360 degree view
• Applications
  • Websites
  • Surveys
  • Email
  • Call centre
  • Documentation

?

So How Do I Get Started?
• Document Warehousing and Text Mining
  • Dan Sullivan, Wiley, 2001
• Survey of Text Mining: Clustering, Classification and Retrieval
  • Michael W. Berry (ed.), Springer, 2003
• Natural Language Processing for Online Applications: Text Retrieval, Extraction and Categorization
  • P. Jackson and I. Moulinier, John Benjamins, 2002

SPSS Canada
• Tim Daciuk
  • Services Manager, Canada
  • 416-410-7921
  • 800-543-6607 ext. 5156
  • tdaciuk@spss.com
• Hugh Rooney
  • SPSS Sales Canada
  • 416-410-7921
  • 905-886-4322
  • hrooney@spss.com
www.spss.com

More Related Content

PDF
II-SDV 2012 Making Knowledge Discoverable: The Role of Agile Text Mining
PPTX
AI-SDV 2020: Combining Knowledge and Machine Learning for the Analysis of Sci...
PPTX
Directed versus undirected network analysis of student essays
PPTX
Haystack 2018 - Algorithmic Extraction of Keywords Concepts and Vocabularies
PPTX
Relevancy and Search Quality Analysis - Search Technologies
PPTX
Text Analytics Presentation
PDF
Work towards a quantitative model of risk in patent litigation
PDF
Crowdsourced query augmentation through the semantic discovery of domain spec...
II-SDV 2012 Making Knowledge Discoverable: The Role of Agile Text Mining
AI-SDV 2020: Combining Knowledge and Machine Learning for the Analysis of Sci...
Directed versus undirected network analysis of student essays
Haystack 2018 - Algorithmic Extraction of Keywords Concepts and Vocabularies
Relevancy and Search Quality Analysis - Search Technologies
Text Analytics Presentation
Work towards a quantitative model of risk in patent litigation
Crowdsourced query augmentation through the semantic discovery of domain spec...

What's hot (20)

DOCX
Scalable keyword search on large rdf data
PPT
Predictive Text Analytics
PPTX
Text Analytics for Dummies 2010
PPTX
Using a keyword extraction pipeline to understand concepts in future work sec...
PPTX
Language Models for Information Retrieval
PDF
II-SDV 2012 Dealing with Large Data Volumes in Statistical Analysis and Text ...
PDF
An Advanced IR System of Relational Keyword Search Technique
PDF
Web_Mining_Overview_Nfaoui_El_Habib
PPTX
An Introduction to Text Analytics: 2013 Workshop presentation
PPTX
Semantic Data Normalization For Efficient Clinical Trial Research
PPT
Tovek Presentation by Livio Costantini
PPTX
Lexalytics Text Analytics Workshop: Perfect Text Analytics
PPTX
Text mining
PPTX
Interleaving, Evaluation to Self-learning Search @904Labs
PDF
How Graph Algorithms Answer your Business Questions in Banking and Beyond
PPTX
South Big Data Hub: Text Data Analysis Panel
PPT
Text Analytics: Yesterday, Today and Tomorrow
PPTX
Text Analytics for Non-Experts
PDF
ANALYTICS OF PATENT CASE RULINGS: EMPIRICAL EVALUATION OF MODELS FOR LEGAL RE...
PDF
Discovering Emerging Tech through Graph Analysis - Henry Hwangbo @ GraphConne...
Scalable keyword search on large rdf data
Predictive Text Analytics
Text Analytics for Dummies 2010
Using a keyword extraction pipeline to understand concepts in future work sec...
Language Models for Information Retrieval
II-SDV 2012 Dealing with Large Data Volumes in Statistical Analysis and Text ...
An Advanced IR System of Relational Keyword Search Technique
Web_Mining_Overview_Nfaoui_El_Habib
An Introduction to Text Analytics: 2013 Workshop presentation
Semantic Data Normalization For Efficient Clinical Trial Research
Tovek Presentation by Livio Costantini
Lexalytics Text Analytics Workshop: Perfect Text Analytics
Text mining
Interleaving, Evaluation to Self-learning Search @904Labs
How Graph Algorithms Answer your Business Questions in Banking and Beyond
South Big Data Hub: Text Data Analysis Panel
Text Analytics: Yesterday, Today and Tomorrow
Text Analytics for Non-Experts
ANALYTICS OF PATENT CASE RULINGS: EMPIRICAL EVALUATION OF MODELS FOR LEGAL RE...
Discovering Emerging Tech through Graph Analysis - Henry Hwangbo @ GraphConne...
Ad

Viewers also liked (17)

PPTX
Uploading Data From Microsoft Excel - Microsoft SLQ Server 2008 (by Rakesh Mi...
PPT
Original definition Predictive Analytics SPSS Jan 15, 2003 Intriduction Slides
PPTX
Experimental design data analysis
PDF
Move from Business Intelligence to Advanced Analytics by Integrating IBM SPSS...
PPTX
Uses of SPSS and Excel to analyze data
PPTX
Economic Analysis of Maruti Suzuki through various software tools
PDF
Applied Statistical Methods - Question & Answer on SPSS
PDF
PoT - probeer de mogelijkheden van datamining zelf uit 30-10-2014
PPT
An Introduction to SPSS
PPT
SPSS an intro...
PPTX
Wifi and Lifi Technology
PPT
Data Analysis With Spss - Reliability
PDF
Six Sigma Quality Using R: Tools and Training
PPTX
Introduction to Mediation using SPSS
DOC
SPSS statistics - how to use SPSS
PDF
Data analysis using spss
Uploading Data From Microsoft Excel - Microsoft SLQ Server 2008 (by Rakesh Mi...
Original definition Predictive Analytics SPSS Jan 15, 2003 Intriduction Slides
Experimental design data analysis
Move from Business Intelligence to Advanced Analytics by Integrating IBM SPSS...
Uses of SPSS and Excel to analyze data
Economic Analysis of Maruti Suzuki through various software tools
Applied Statistical Methods - Question & Answer on SPSS
PoT - probeer de mogelijkheden van datamining zelf uit 30-10-2014
An Introduction to SPSS
SPSS an intro...
Wifi and Lifi Technology
Data Analysis With Spss - Reliability
Six Sigma Quality Using R: Tools and Training
Introduction to Mediation using SPSS
SPSS statistics - how to use SPSS
Data analysis using spss
Ad

Similar to Irmac presentation for website (20)

PPTX
Text mining introduction-1
PPTX
Text mining
PPT
Web & text mining lecture10
PPTX
text Mining topic in data Mining subject
PPT
turban_ch07ch07ch07ch07ch07ch07dss9e_ch07.ppt
PPTX
sentiment analysis
PPTX
Text mining
PPTX
Text mining
PPT
Week12
PDF
Understanding voice of the member via text mining
PPTX
Text Analytics Overview, 2011
PDF
Veda Semantics - introduction document
PPTX
Predictive Maintenance- From fixing to predicting problems
PPTX
Smarter Analytics - Businesses Use Analytics to Find Hidden Opportunities
PPT
Text mining turban_dss9e_ch07 to learn about
PPT
Web mining and data mining seminar topic
PDF
Decision Support and Business Intelligence Systems 9th Edition Turban Test Bank
PPTX
Prescriptive Analytics-1.pptx
PPTX
Introduction to Text Mining
PDF
Decision Support and Business Intelligence Systems 9th Edition Turban Test Bank
Text mining introduction-1
Text mining
Web & text mining lecture10
text Mining topic in data Mining subject
turban_ch07ch07ch07ch07ch07ch07dss9e_ch07.ppt
sentiment analysis
Text mining
Text mining
Week12
Understanding voice of the member via text mining
Text Analytics Overview, 2011
Veda Semantics - introduction document
Predictive Maintenance- From fixing to predicting problems
Smarter Analytics - Businesses Use Analytics to Find Hidden Opportunities
Text mining turban_dss9e_ch07 to learn about
Web mining and data mining seminar topic
Decision Support and Business Intelligence Systems 9th Edition Turban Test Bank
Prescriptive Analytics-1.pptx
Introduction to Text Mining
Decision Support and Business Intelligence Systems 9th Edition Turban Test Bank

Recently uploaded (20)

PPTX
FMIS 108 and AISlaudon_mis17_ppt_ch11.pptx
PDF
©️ 02_SKU Automatic SW Robotics for Microsoft PC.pdf
PPTX
Machine Learning and working of machine Learning
PPTX
Crypto_Trading_Beginners.pptxxxxxxxxxxxxxx
PPTX
CHAPTER-2-THE-ACCOUNTING-PROCESS-2-4.pptx
PDF
OneRead_20250728_1808.pdfhdhddhshahwhwwjjaaja
PPTX
retention in jsjsksksksnbsndjddjdnFPD.pptx
PDF
Global Data and Analytics Market Outlook Report
PPTX
New ISO 27001_2022 standard and the changes
PPT
statistic analysis for study - data collection
PDF
REAL ILLUMINATI AGENT IN KAMPALA UGANDA CALL ON+256765750853/0705037305
PPT
lectureusjsjdhdsjjshdshshddhdhddhhd1.ppt
PPT
expt-design-lecture-12 hghhgfggjhjd (1).ppt
PDF
A biomechanical Functional analysis of the masitary muscles in man
PPTX
DS-40-Pre-Engagement and Kickoff deck - v8.0.pptx
PPTX
MBA JAPAN: 2025 the University of Waseda
PDF
Session 11 - Data Visualization Storytelling (2).pdf
PPT
statistics analysis - topic 3 - describing data visually
PPTX
Caseware_IDEA_Detailed_Presentation.pptx
PPTX
recommendation Project PPT with details attached
FMIS 108 and AISlaudon_mis17_ppt_ch11.pptx
©️ 02_SKU Automatic SW Robotics for Microsoft PC.pdf
Machine Learning and working of machine Learning
Crypto_Trading_Beginners.pptxxxxxxxxxxxxxx
CHAPTER-2-THE-ACCOUNTING-PROCESS-2-4.pptx
OneRead_20250728_1808.pdfhdhddhshahwhwwjjaaja
retention in jsjsksksksnbsndjddjdnFPD.pptx
Global Data and Analytics Market Outlook Report
New ISO 27001_2022 standard and the changes
statistic analysis for study - data collection
REAL ILLUMINATI AGENT IN KAMPALA UGANDA CALL ON+256765750853/0705037305
lectureusjsjdhdsjjshdshshddhdhddhhd1.ppt
expt-design-lecture-12 hghhgfggjhjd (1).ppt
A biomechanical Functional analysis of the masitary muscles in man
DS-40-Pre-Engagement and Kickoff deck - v8.0.pptx
MBA JAPAN: 2025 the University of Waseda
Session 11 - Data Visualization Storytelling (2).pdf
statistics analysis - topic 3 - describing data visually
Caseware_IDEA_Detailed_Presentation.pptx
recommendation Project PPT with details attached

Irmac presentation for website

  • 1. Copyright 2003-4, SPSS Inc. 1 An Introduction to Text Mining Tim Daciuk SPSS, Inc. Services Manager, Canada
  • 2. Copyright 2003-4, SPSS Inc. 2 AgendaAgenda  Introductions  An Overview of Document Warehousing  Understanding Unstructured Text  Concept Extraction  Text Mining  Data Mining  Demonstration
  • 3. Copyright 2003-4, SPSS Inc. 3 Tim DaciukTim Daciuk  Background  Social research  Survey research  SPSS  25 years working with the product  12 years working with the company  5 years working with text analysis  Prior history  Consulting  Education
  • 4. Copyright 2003-4, SPSS Inc. 4 Predictive analysis helps connect data to effective action by drawing reliable conclusions about current conditions and future events. — Gareth Herschel, Research Director, Gartner Group Predictive Analytics: DefinedPredictive Analytics: Defined
  • 5. Copyright 2003-4, SPSS Inc. 5 SPSS At A GlanceSPSS At A Glance  Leadership  Market leader in Predictive Analytics  Focus on online & offline customer data acquisition and analysis  Stability  Founded in 1968  30+ year heritage in analytic technologies  Proven track record  250,000+ customers worldwide  NASDAQ: SPSS  Analytics standard  80% of Fortune 500 are SPSS customers  80% plus market share in Survey & Market Research sector  Ranked #1 Data Mining solution by KD Nuggets
  • 6. Some of Our BrandsSome of Our Brands
  • 7. Copyright 2003-4, SPSS Inc. 7 Unstructured Data ManagementUnstructured Data Management Text Mining is a subset of Unstructured Data Management. UDM can be broken down into:  Content and Document Management  Search and Retrieval  XML database and tools  Categorization, Classification, and Visualization
  • 8. Copyright 2003-4, SPSS Inc. 8 80% of Data is Unstructured80% of Data is Unstructured  Database notes:  Call center transcripts  Other CRM  Email  Open-ended survey responses  Web pages  NewsGroups  Documents themselves  Competitive information
  • 9. Copyright 2003-4, SPSS Inc. 9 Applications for Text AnalysisApplications for Text Analysis  Surveys  ‘Reading’ email  Call centre data  Comment data  Abstracts  Document management  Corporate history  Thematic understanding of website
  • 10. Copyright 2003-4, SPSS Inc. 10 Data Warehouse vs. DocumentData Warehouse vs. Document WarehouseWarehouse  Data warehouse  Who, what, when, where, how much  Internally focused  Operational information  Rarely include external information  Document warehouse  Why  May not be internally focused  May contain a range of information  Often integrate external information
  • 11. Copyright 2003-4, SPSS Inc. 11 Document Warehouse FeaturesDocument Warehouse Features  There is no single document structure or document type  Documents are drawn from multiple sources  Essential features of documents are automatically extracted and explicitly stored in the document warehouse  Document warehouses are designed to integrate semantically related documents
  • 12. Copyright 2003-4, SPSS Inc. 12 Building the Document WarehouseBuilding the Document Warehouse Identify Sources Retrieve Document Text Analysis Pre-process Document Compile Metadata
  • 13. Copyright 2003-4, SPSS Inc. 13 Predict, Impact, DeployPredict, Impact, Deploy Customer Data Attitudes Actions Attributes Business User Grow Retain Fraud Outcomes Attract Data Collection Text Surveys Web Channel Operational Systems Text BusinessUI Expert UIExpert UI Concepts Concept Maps Clustering Categoriza- tion Trending Information Extraction Prediction NLP
  • 14. Copyright 2003-4, SPSS Inc. 14 The Building Blocks of LanguageThe Building Blocks of Language  Morphology  Syntax  Semantics  Phonology  Pragmatics
  • 15. Copyright 2003-4, SPSS Inc. 15 MorphologyMorphology  Understanding words  Stems  Affixes  Prefix  Suffix  Inflectional elements  Reducing complexity of analysis  Reduces complexity of representation  Supports text mining Noun Prefix Noun Stem Suffix - abledisputein -
  • 16. Copyright 2003-4, SPSS Inc. 16 SyntaxSyntax  The Bank of Canada will curb inflation with higher interest rates Prepositional phrase Adjective Sentence Noun phrase Verb phrase Noun VerbAux Noun phrase NounAdjective Noun The Bank of Canada inflationcurbwill Interest rateshigher with
  • 17. Copyright 2003-4, SPSS Inc. 17 SemanticsSemantics  The meaning of it all  Approaches to meaning  Semantic networks  Deductive logic  Rule-based systems  Useful for classification
  • 18. Copyright 2003-4, SPSS Inc. 18 Problems with NLPProblems with NLP  Limitations of Natural Language Processing  Correctly identifying the role of noun phrases  Representing abstract concepts  Classifying synonyms  Representing the number of concepts
  • 19. Copyright 2003-4, SPSS Inc. 19 Problems with NLPProblems with NLP  Limitations of technology  Language specific designs are required  Classification speed  Classifying hybrid words and sentences
  • 20. Copyright 2003-4, SPSS Inc. 20 Underlying Technology is Based onUnderlying Technology is Based on LinguisticsLinguistics The Linguistic Approach:  Does not treat a document as a bag of words  Removes ambiguity by extracting structured concepts Concepts are the DNA of text. Text is unstructured, ambiguous, and language dependent.
  • 21. Copyright 2003-4, SPSS Inc. 21 From Text to ConceptsFrom Text to Concepts Morphology Syntax Semantics Statistics Linguistic Terminology Extractor ScalableAccurate Customizable Discovery- Oriented •Compound words •Proper nouns •Figures •Named entities •Domain specifics •Speed •Multiple formats •Multiple languages •SPSS dictionaries •User dictionaries •Extraction rules •Extraction patterns •Known terms •Unknown terms •New terms •1GB/hour •PDF, MS Office, text… •English, French, German Spanish, Italian, Dutch, Japanese • Inserm; merck & co… • tnp-470; glut-4… • factor receptor; Inhibitory effect; • D. John Paganoni, .. • Positive/Negative opinion… • London, Paris… •Names, Orgs… •MeSH, genes... •Predicates •Synonyms, stop words.. •Trends
  • 22. Copyright 2003-4, SPSS Inc. 22 From Concepts to PredictiveFrom Concepts to Predictive Analytics ComponentsAnalytics Components Linguistic Terminology Extractor LexiQuest Mine Discover concepts, relationships and trends LexiQuest Categorize Understand documents and assign in pre-defined categories Text Mining for Clementine Add text fields to data mining for better prediction
  • 23. Copyright 2003-4, SPSS Inc. 23 Concept Extraction EngineConcept Extraction Engine The extractor turns unstructured text into concepts: LexiQuest Extractor Engine Linguistic Processor Visualization Probabilities LexiQuest Mine Clementine LexiQuest Categorize
  • 24. Copyright 2003-4, SPSS Inc. 24 Part-of-Speech TaggingPart-of-Speech Tagging a: adjective b: adverb c: preposition d: determiner n: noun v: verb o: coordination p: participle s: stop word
  • 25. Copyright 2003-4, SPSS Inc. 25 How is a Concept Extracted?How is a Concept Extracted? Step 1: Part-of-Speech Tagging Using a tool like LexiQuest Mine is a great V P N A N N V P A idea for any organization that is interested in maintaining N P A N P V V P V information on competitive intelligence. N P N N
  • 26. Copyright 2003-4, SPSS Inc. 26 How is a Concept Extracted?How is a Concept Extracted? Step 2: Matching to Known Patterns This: V P N A N N V P A N PA N P V V P V N PN N Looks Most Like: N C D N N (32 Known patterns for English)
  • 27. Copyright 2003-4, SPSS Inc. 27 How is the Concept Extracted?How is the Concept Extracted? The extractor looks at this sentence: Using a tool like LexiQuest Mine is a great idea for any organization that is interested in maintaining information on competitive intelligence. And extracts the concept: Competitive Intelligence Concepts are:  Noun based  Can be longer than one word
  • 28. Copyright 2003-4, SPSS Inc. 28 Example: CategorizationExample: Categorization
  • 29. Copyright 2003-4, SPSS Inc. 29 The Issue of LanguageThe Issue of Language  NLP requires separate language understanding  Clementine text mining  French  English  English/French  German  Spanish  Dutch  Japanese  Italian  Mesh (Medical subject headings)  http://www.nlm.nih.gov/mesh/meshhome.html
  • 30. “The process of discovering meaningful new relationships, patterns and trends by sifting through data using pattern recognition technologies as well as statistical and mathematical techniques.” - The Gartner group. Data Mining DefinedData Mining Defined
  • 31. Copyright 2003-4, SPSS Inc. 31 Why data mining?Why data mining?  Data Mining software generally employs modeling algorithms designed to handle non-linearities and unusual patterns in data  As opposed to classical linear models (e.g., linear regression) that aren’t as capable  A related issue is ‘noise’ in the data: where, for example, 2 seemingly similar sets of inputs yield a different output
  • 32. Copyright 2003-4, SPSS Inc. 32  Use the cross industry standard process for data mining (CRISP- DM)  Based on real-world lessons:  Focus on business issues  User-centric & interactive  Full process  Results are used A Data Mining MethodologyA Data Mining Methodology
  • 33. Copyright 2003-4, SPSS Inc. 33 Data Mining is not…Data Mining is not…  Keep in mind that data mining is not…  “Blind” application of analysis/modeling algorithms  Brute-force crunching of bulk data  Black box technology  Magic
  • 34. Copyright 2003-4, SPSS Inc. 34 Back to the ProcessBack to the Process Text Mining
  • 35. Copyright 2003-4, SPSS Inc. 35 UnderstandingUnderstanding  Business Understanding  Determine objective  Assess situation  Determine data mining goals  Produce project plan  Data Understanding  Collect initial data  Describe data  Explore data  Verify data quality
  • 36. Copyright 2003-4, SPSS Inc. 36 Data PreparationData Preparation  Data  Data set  Data set description  Select data  Clean data  Construct data set / Integrate data  Format data  Text  Concept extraction  Concept combination  Concept assessment
  • 37. Copyright 2003-4, SPSS Inc. 37 ModelingModeling  Select modeling technique  Universe of techniques  Appropriate techniques  Data  Text  Requirements  Constraints  Selected tools  Generate test design  Run model(s)  Assess model(s)
  • 38. Copyright 2003-4, SPSS Inc. 38 EvaluationEvaluation  Results = Models + Findings  Evaluate results  Review process  Determine next steps
  • 39. Copyright 2003-4, SPSS Inc. 39 DeploymentDeployment  Plan deployment  Plan monitoring and maintenance  Final report  Project review
  • 40. Copyright 2003-4, SPSS Inc. 40  Unsupervised methods:  Group patients by drugs and demographic information and try to find unusual patients  Supervised methods:  Attempt to predict amount due and find sets of cases where the amount due is very different from the predicted amount Data Mining ApproachesData Mining Approaches
  • 41. What Does Data Mining Do? Data mining uses existing data to: Predict (category membership or a numeric value, e.g. credit risk); Group (cluster things together based on their characteristics, e.g. different types of TV viewers); Associate (find events that occur together, or in a sequence, e.g. beer and diapers); Find outliers (identify cases that don’t follow expected behavior, e.g. fraudulent behaviour).
  • 42. Benefits of Document Warehousing. Richer operational business intelligence; knowing your customers; macroenvironmental monitoring; technology assessment.
  • 43. Conclusions. Text mining is: more than word counts; linguistically based; concept extraction. Data mining is: advanced analytics applied to datasets; a family of techniques; supervised or unsupervised.
  • 44. Conclusions. Text and data mining: add dimensionality to the data; allow for automation of the text analysis event; create a 360-degree view. Applications: websites; surveys; email; call centre; documentation.
  • 46. So How Do I Get Started? Document Warehousing and Text Mining, Dan Sullivan, Wiley, 2001. Survey of Text Mining: Clustering, Classification and Retrieval, Michael W. Berry (ed.), Springer, 2003. Natural Language Processing for Online Applications: Text Retrieval, Extraction and Categorization, P. Jackson and I. Moulinier, John Benjamins, 2002.
  • 47. SPSS Canada. Tim Daciuk, Services Manager, Canada: 416-410-7921; 800-543-6607 ext. 5156; tdaciuk@spss.com. Hugh Rooney, SPSS Sales Canada: 416-410-7921; 905-886-4322; hrooney@spss.com. www.spss.com
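
The supervised approach on slide 40 (predict the amount due, then look for cases where the actual amount is very different from the predicted amount) can be sketched roughly as below. This is only an illustration under stated assumptions, not the method used in the presentation or in Clementine: the data are invented, the "model" is a one-variable least-squares line, and the names fit_line and flag_outliers and the cut-off of two residual standard deviations are hypothetical choices.

```python
from statistics import mean, pstdev


def fit_line(xs, ys):
    """Ordinary least squares fit of y = a + b * x (one predictor)."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = my - b * mx
    return a, b


def flag_outliers(xs, ys, threshold=2.0):
    """Return indices of cases whose actual value is far from the prediction.

    'Far' is an assumption here: more than `threshold` standard
    deviations of the residuals away from the fitted line.
    """
    a, b = fit_line(xs, ys)
    residuals = [y - (a + b * x) for x, y in zip(xs, ys)]
    spread = pstdev(residuals)
    if spread == 0:
        return []
    return [i for i, r in enumerate(residuals) if abs(r) > threshold * spread]


# Invented example: number of services billed vs. amount due.
services = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
amount_due = [10, 20, 30, 40, 50, 60, 70, 80, 90, 300]

flagged = flag_outliers(services, amount_due)
print(flagged)  # indices of cases worth a closer look
```

In a real project the predictor would be whatever model the Modeling phase selected, and the cut-off would be tuned during Evaluation; the point is only that the outliers are defined relative to a prediction.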

Editor's Notes

  • #5: According to Gartner Group: “Predictive Analysis helps you connect data to effective action by drawing reliable conclusions about current conditions and future events.” Predictive analysis leverages an organization’s business knowledge by applying sophisticated analytic techniques to enterprise data. It turns that data into insights that lead to the development of programs to increase revenues, reduce costs, improve processes, and prevent criminal or fraudulent activities. It encourages actions that demonstrably change how people behave as your customers, employees, patients, students, and citizens. Bottom line: it turns data into effective actions that positively impact your bottom line.
  • #6: Here are some of the stats you may want to know about SPSS (read highlights from the slide). SPSS has been a cornerstone of the software industry since 1968. We’ve also been at the forefront of blending both new and established technologies to help customers around the world solve business problems. We’ve continued to grow, deliberately and thoughtfully, over the years, acquiring companies and technology complementary to our existing business. The bottom line for you: We’re here with innovative, proven solutions to help you solve your immediate business problems. And, we’ll be here in the future to support you and your organization…
  • #7: The Clementine Server data mining workbench fits within SPSS’ overall business intelligence product strategy. Our entire business intelligence product line includes products for collecting data, preparing data, reporting and OLAP, as well as modeling. Because different users have different needs and levels of expertise, they are presented with the appropriate product interface. SPSS delivers the right product for every person, supported by our 30 years of experience in data analysis and data mining. The analytical solutions we deliver are scalable and can be deployed throughout your organization to help you transform your business with information.
  • #9: It’s been estimated that 80% of the data in an organization is in an unstructured format, that is, in the form of documents, HTML pages, database notes, email, open-ended survey responses, etc. This fact means that decision-makers often rely on only 20% of the data available and a little bit of the documents that they can read. Take open-ended surveys, for example: cross-tab reports of responses are common, but the open-ended responses, which hold valuable information that qualifies the responses and brings up new themes, are rarely analyzed. Organizations rarely have the tools or the time to truly process and disseminate this important information. In a similar fashion, database notes on customer contacts are effectively used to manage individual contacts, but this valuable source of customer information is never used to really understand the customer experience overall. What if you could use this information to keep and grow customers to increase customer lifetime value?
  • #10: So, I think we can agree that there is a need for text analysis, but where can this technology be applied? [click] Well, surveys are the most obvious; we’ve just talked about that. [click] We could apply this to email, reading the email and making a handling decision based on the content of the email. [click] Call centre data is another candidate for text analysis. What are my customers’ complaints? Where are there problems with my product? What did the customers who left have to say about my service? [click] Reading comment data is an important potential application; an application that is currently laborious or subjective. In the State of Georgia example, comments are triaged and only those indicating a definite problem or requirement for re-arrest are used. The majority are ignored, even though there is a real sense that there is something in that group. [click] The ability to read abstracts from online databases using a more intelligent engine than a simple word search is an application. [click] Document management and the ability to categorize gigabytes of documents (policies, procedures) is an application, and along with that [click] corporate history and the ability to manage corporate information resources is an application. Finally, [click] we have seen some use of this technology in the analysis of messages in websites. What concepts is your website conveying, and are these concepts the appropriate ones, appropriately placed?
  • #14: At a high level, you need linguistics, or Natural Language Processing, to extract concepts, which form the basis of business user interfaces like concept maps or feed data mining techniques to predict customer behavior.
  • #15: Morphology is the study of the structure and form of words. Syntax is the study of how words and phrases form sentences. Semantics relates to the meaning of words and statements. Phonology is the study of sounds in language. Pragmatics is the study of idiomatic phrases that cannot be analyzed with strict semantic analysis. We tend to deal with the first three and ignore the last two when we are talking about natural language processing.
  • #22: So how do we get from text to concepts? Linguistics, the science of text, includes ideas such as 1) morphology, or how words change based on part of speech, 2) syntax, or how sentences are structured, 3) semantics, or the meaning of words, and 4) statistics, such as the frequency of terms and patterns. It takes linguistics to cut through the noise of text to find relevant concepts without leaving important concepts undiscovered. Other statistical or machine learning approaches fall short of linguistic extraction, because only a linguistic approach can deal with the ambiguity and complexity of text. That is, linguistic extraction is [click] accurate, [click] scalable, [click] customizable and [click] discovery-oriented. By accurate, I mean that [click] compound words, proper nouns, etc., [click] like these examples, are extracted. [click] In terms of scalability, we can process about 1 GB per hour, multiple formats and multiple languages [click]. By customizable, [click] I mean that you can use dictionaries, rules and patterns [click] to tailor your extraction. Vertical resources can be used, like MeSH, which is the official medical thesaurus. And finally, [click] by discovery-oriented, I mean that, depending on your analysis, you can focus on known terms, unknown terms, new terms and [click] trends.
  • #23: For the next step, to move from concepts to Predictive Analytics, tools are available which address specific business needs, from delivering knowledge to adding prediction to operational systems. [click] LexiQuest Mine enables users to quickly identify key concepts, and the relationships between them, within thousands of documents. Mine displays these concepts and the links between them in an easy-to-navigate, color-coded graphical map and trend analysis charts. Mine is designed for people who want to discover, structure and anticipate. [click] LexiQuest Categorize automatically catalogues documents into a predefined taxonomy based on their content. Able to “read” and understand content, Categorize is able to automatically and accurately place a document into its proper category. From there, it can be sent to the right audience based on their profile or simply reside there for easy retrieval from a portal, intranet or extranet site. [click] Text Mining for Clementine, a new component of Clementine which we will see in a few minutes, has the ability to unlock knowledge contained in unstructured text data so that it can be combined with information from databases and other data sources to build better models using traditional data mining techniques.
  • #24: The extraction process works basically in three parts. First, a linguistic processor reads the text and comes up with a set of categories. These categories are passed to one of three different applications, depending upon the objective. The application may be a stand-alone concept understanding application, such as our LexiQuest Mine, which represents the concepts and illustrates their relationships. Our Clementine application uses the concepts as data, as part of a larger data mining application. Finally, categorization uses the concept information as the basis of further analysis. The final application layer can be used for visualization, data mining, or strict probabilistic assignment of information to known categories.
  • #29: Another text mining example is categorization. The folders on the left represent categories of different types of incoming emails. Text mining can be used to learn which emails, depending on their content, should be placed in each category (and therefore routed appropriately and automatically). [click] In this case an email on a problem with an ActiveX control [click] can be routed to Dev support.
  • #30: MeSH is the National Library of Medicine's controlled vocabulary thesaurus. It consists of sets of terms naming descriptors in a hierarchical structure that permits searching at various levels of specificity. MeSH descriptors are arranged in both an alphabetic and a hierarchical structure. At the most general level of the hierarchical structure are very broad headings such as "Anatomy" or "Mental Disorders." At more narrow levels are found more specific headings such as "Ankle" and "Conduct Disorder." There are 21,973 descriptors in MeSH. There are also thousands of cross-references that assist in finding the most appropriate MeSH Heading, for example, Vitamin C see Ascorbic Acid. These entries include 23,512 printed see references and 102,346 other entry points
  • #31: Let’s define data mining: “The process of discovering meaningful new relationships, patterns and trends by sifting through data using pattern recognition technologies as well as statistical and mathematical techniques.” Data mining means finding patterns or relationships in your data that you can use to solve your organization’s problems.
  • #33: How does one mine data? The Cross-Industry Standard Process for Data Mining (CRISP-DM) provides a framework for all data mining efforts. This process focuses on business issues, allows the user to work with and interact with the data, works on the data mining process from beginning to end, and USES the results. Business Understanding: where you might convert a business problem to a data mining problem. Data Understanding: where you get your first look at the data. Data Preparation: the hardest part, where you clean the data. Modeling: the neatest part, where you build predictions or MODELS. Evaluation: where you examine performance. Deployment: where you actually integrate the results of your data mining into your organization. Notice the arrows going around the chart and back and forth amongst the boxes? These arrows show that data mining is an iterative process where the miner may step from one box to another and back for an effective data mining activity.
  • #42: Broadly speaking, data mining can serve four basic purposes: prediction, segmentation, association and outlier detection. Prediction takes a known result and attempts to combine input fields in order to best replicate this result. An example may be deciding if someone is a good or bad credit risk, or whether someone will churn or not. Segmentation finds groups within cases. The number of groups is unknown, i.e. any number of groups is possible. An example may be the examination of customer segments, or the creation of groupings within financial data. Association methods try to develop an “if this, then that” type of analysis. An example is that people who watch news programming also watch the weather network. Outlier detection is used to derive atypical cases. These are examples of unusual behaviour vis-à-vis the rest of the data. An example may be fraud detection when examining claims data.
  • #43: The ability to move to the why from the other who, what, when, where, and how much. Personalization for customer knowledge; this allows marketers to craft messages aimed more at individuals than at generalized groups. Understanding events inside and outside the organization and how they relate. The ability to assess competitors’ technology and to understand the technology positions of the market.
  • #46: Questions
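
The “if this, then that” association analysis described in note #42 (and the beer and diapers example on slide 41) is usually summarized with two numbers per rule: support (how often both items occur together) and confidence (how often the “then” item occurs when the “if” item does). The sketch below is a minimal illustration with invented baskets; the function name rule_stats is hypothetical and this is not how any SPSS product computes associations.

```python
def rule_stats(baskets, antecedent, consequent):
    """Support and confidence for the rule 'if antecedent, then consequent'.

    baskets is a list of sets of items; both measures are simple
    proportions computed by counting.
    """
    n = len(baskets)
    both = sum(1 for b in baskets if antecedent in b and consequent in b)
    with_antecedent = sum(1 for b in baskets if antecedent in b)
    support = both / n
    confidence = both / with_antecedent if with_antecedent else 0.0
    return support, confidence


# Invented shopping baskets for illustration only.
baskets = [
    {"beer", "diapers", "chips"},
    {"beer", "diapers"},
    {"diapers", "milk"},
    {"beer", "chips"},
    {"milk", "bread"},
]

support, confidence = rule_stats(baskets, "diapers", "beer")
print(f"support={support:.2f} confidence={confidence:.2f}")
```

Rules with high confidence but very low support rest on only a handful of cases, which is one reason the results need the Evaluation step before they are deployed.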