Processing Text

         Shilpa Shukla
       Graduate Student
School of Information, UT Austin
Indexing Process
Text Processing

● Goal: transform documents into index terms or
  features.
● Why do text processing?
   ○ Exact search is too restrictive
   ○ E.g. "computer hardware" doesn't match
     "Computer hardware"
● Easy to handle this example by converting to
  lowercase
● But search engines go much further!
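
A minimal sketch of the lowercase fix for the example above. The function names are made up for illustration; real engines apply the same normalization to both the indexed text and the query.

```python
def exact_match(query: str, text: str) -> bool:
    """Exact substring search: case differences cause a miss."""
    return query in text


def case_folded_match(query: str, text: str) -> bool:
    """Normalize case on both sides before comparing."""
    return query.casefold() in text.casefold()


document = "Computer hardware prices fell last year."
print(exact_match("computer hardware", document))        # False
print(case_folded_match("computer hardware", document))  # True
```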
Outline of presentation
● Text statistics
   ○ meaning of text often captured by occurrences and
     co-occurrences of words
   ○ understanding of text statistics is fundamental
● Text transformation
   ○ Tokenization
   ○ Stopping
   ○ Stemming
   ○ Phrases & N-grams
● Document structure
   ○ Web pages have structure (headings, titles, tags)
     that can be exploited to improve search
Text Statistics
● Luhn observed in 1958: significance of a word
  depends on its frequency in the document
● Statistical models of word occurrences are
  therefore very important in IR
● Most obvious statistical feature: distribution of
  word frequencies is skewed
   ○ only a few words have high frequencies ("the" and
     "of" alone account for about 10% of all occurrences)
   ○ most words have low frequencies
● This is nicely captured by Zipf's Law
Zipf's law: the rank r of a word times its
probability of occurrence P_r is approximately a constant
                  r * P_r = c
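
One way to check this relationship on a collection is to rank words by frequency and print rank times probability for each. The sketch below assumes whitespace tokenization and uses a toy sentence, far too small to show the law itself; on a large corpus the last column should stay roughly constant for the higher-frequency words.

```python
from collections import Counter


def rank_times_probability(tokens):
    """Return (rank, word, frequency, rank * probability) rows, most frequent first."""
    counts = Counter(tokens)
    total = sum(counts.values())
    rows = []
    for rank, (word, freq) in enumerate(counts.most_common(), start=1):
        probability = freq / total
        rows.append((rank, word, freq, rank * probability))
    return rows


text = "the cat sat on the mat and the dog sat on the log"
rows = rank_times_probability(text.split())
for rank, word, freq, r_times_p in rows:
    print(f"{rank:>2}  {word:<4} {freq:>2}  {r_times_p:.3f}")
```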
Text Transformation

● Tokenization
     ■ splitting words apart
● Stopping
     ■ ignoring some words
● Stemming
     ■ allowing similar words to match each other
       (like "run" and "running")
● Phrases and N-grams
     ■ storing sequence of words
Tokenizing
● Process of forming words (tokens) from a
  sequence of characters
● Simple for English but not for all languages (e.g.
  Chinese)
● Earlier IR systems: any sequence of 3 or more alphanumeric
  characters, terminated by a space or special character,
  was considered a word
● Example:
     ■ "Bigcorp's 2007 bi‐annual report showed profits rose 10%."
       → "bigcorp 2007 annual report showed profits rose"
● Leads to too much information loss
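
The early rule can be sketched in a few lines. The function name is hypothetical, and lowercasing is an assumption made to match the output shown in the example; the 3-character minimum comes from the slide.

```python
import re


def early_ir_tokenize(text, min_length=3):
    """Keep runs of alphanumeric characters that are at least min_length long."""
    runs = re.findall(r"[a-z0-9]+", text.lower())
    return [token for token in runs if len(token) >= min_length]


sentence = "Bigcorp's 2007 bi-annual report showed profits rose 10%."
print(" ".join(early_ir_tokenize(sentence)))
# bigcorp 2007 annual report showed profits rose
```

The possessive "s", the prefix "bi", and the number "10" all disappear, which is the information loss the slide refers to.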
(Some) Tokenizing Problems
  Problem             Examples
  Small words         xp, world war II
  Hyphens             e-bay, mazda rx-7
  Capital letters     Bush, Apple
  Apostrophes         can't, 80's, kid's
  Numbers             nokia 3250, 288358
  Periods             I.B.M., Ph.D., ischool.utexas.edu
Steps in Tokenizing
● First: identify the parts of the document to be tokenized, using a
  tokenizer and parser designed for the specific language.
● Second: tokenize the relevant parts of the document
    ○ Defer complex decisions to other components
       ■ Identification of word variants: stemmer
       ■ Recognizing that a string is a name or a date: information
         extractor
   ○ Retain capitalization and punctuation until information
     extraction has been done
● Examples of rules used with TREC
   ○ Apostrophes in words ignored
       ■ o’connor → oconnor, bob’s → bobs
   ○ Periods in abbreviations ignored
          ■ I.B.M. → ibm, Ph.D. → ph d
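
A rough sketch of these two rules follows. It is one interpretation, not the actual TREC tooling: it assumes an abbreviation is a run of single letters each followed by a period, and that any other period acts as a separator, which is what yields "ph d". The function name is hypothetical.

```python
import re


def trec_style_tokenize(text):
    """Apply the two example rules, then split on non-alphanumeric characters."""
    text = text.lower()
    # Rule 1: drop apostrophes (straight or curly) inside words.
    text = re.sub(r"['\u2019]", "", text)
    # Rule 2: drop periods in runs of single letters, e.g. "i.b.m." -> "ibm".
    text = re.sub(
        r"\b(?:[a-z]\.){2,}",
        lambda match: match.group(0).replace(".", ""),
        text,
    )
    # Everything else that is not a letter or digit separates tokens.
    return re.findall(r"[a-z0-9]+", text)


for example in ["o\u2019connor", "bob's", "I.B.M.", "Ph.D."]:
    print(example, "->", " ".join(trec_style_tokenize(example)))
```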
Stopping
● Gets rid of stopwords
   ○ determiners like a, an, the
   ○ prepositions like on, below, over
● Reasons to eliminate stopwords
   ○ Nearly all of the most frequent words fall in this
      category.
   ○ Do not convey relevant information on their own
● Stopping decreases index size, increases retrieval
  efficiency, and generally improves effectiveness.
● Caution: removing too many words might hurt
  effectiveness
       ■ e.g. queries such as "Take That" or "The Who" consist
         largely of stopwords
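
A small sketch of stopping and of the failure case in the caution. The stopword set here is an illustrative assumption, not a standard list.

```python
# Illustrative stopword set; real lists are much longer.
STOPWORDS = {"a", "an", "the", "on", "below", "over", "of", "that", "who", "take"}


def remove_stopwords(tokens, stopwords=STOPWORDS):
    """Drop every token that appears in the stopword set."""
    return [token for token in tokens if token not in stopwords]


print(remove_stopwords("the report on profits".split()))  # ['report', 'profits']
print(remove_stopwords("the who".split()))                # []
```

The second query loses every term, so nothing is left to match against the index.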
Stopping continued

● A stopword list can be prepared manually from high-frequency
  words or based on a standard list.
● Lists are customized for applications, domains, and
  even parts of documents
   ○ e.g. "click" is a good stopword for anchor text
● Best policy: index all words in documents and decide
  which words to use at query time
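
Candidate stopwords can be drawn from the most frequent words in a collection and then reviewed by hand. The sketch below assumes whitespace tokenization; the sample strings are invented anchor texts, chosen to show why "click" surfaces as a candidate there.

```python
from collections import Counter


def stopword_candidates(documents, top_k=5):
    """Return the top_k most frequent words across all documents."""
    counts = Counter(
        token for document in documents for token in document.lower().split()
    )
    return [word for word, _ in counts.most_common(top_k)]


anchor_texts = [
    "click here for the annual report",
    "click here to read the news",
    "click the link for the schedule",
]
print(stopword_candidates(anchor_texts, top_k=2))  # ['the', 'click']
```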
Stemming
● Captures the relationships between different variations
  of a word by reducing all the forms (inflection, derivation)
  in which the word can occur to a common stem
● Examples
      ■ is, be, was
      ■ ran, run
      ■ tweet, tweets
● Crucial for highly inflected languages (e.g. Arabic)
● There are three types of stemmers
      ■ Algorithm based: uses knowledge of word
        suffixes. e.g. Porter stemmer
      ■ Dictionary based: uses a pre-created dictionary
        of related terms
      ■ Hybrid approach: e.g. Krovetz stemmer
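
The three types can be illustrated with a toy example. This is not the Porter or Krovetz algorithm: the suffix rules and the small irregular-forms dictionary are simplifying assumptions made only to show the idea.

```python
# Dictionary-based part: irregular forms that suffix rules cannot handle.
IRREGULAR = {"is": "be", "was": "be", "ran": "run"}


def suffix_stem(word):
    """Toy algorithmic stemmer: strip one common English suffix."""
    for suffix in ("ing", "ed", "s"):
        stem = word[: -len(suffix)]
        if word.endswith(suffix) and len(stem) >= 3:
            # Undo consonant doubling, e.g. "runn" -> "run".
            if suffix != "s" and stem[-1] == stem[-2] and stem[-1] not in "aeioulsz":
                stem = stem[:-1]
            return stem
    return word


def hybrid_stem(word):
    """Hybrid: consult the dictionary first, then fall back to suffix rules."""
    word = word.lower()
    return IRREGULAR.get(word) or suffix_stem(word)


for word in ["is", "was", "ran", "running", "tweets", "tweet"]:
    print(word, "->", hybrid_stem(word))
```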
Phrases & N-grams
● Phrases are important as they are
   ○ More precise than single words
       ■ e.g. "World Wide Web"
   ○ Less ambiguous
       ■ e.g. "green bush" vs. "bush"
● Ranking issue
● Text processing issue - recognizing phrases
● Three possible approaches for recognizing phrases
   ○ Parts Of Speech (POS) tagger
   ○ Store word positions in indexes and use proximity
     operators in queries (not covered here)
   ○ N-gram
Recognizing Phrases
● POS tagger
   ○ uses syntactic structure of sentence
        ■ sequences of nouns or
        ■ adjectives followed by nouns
   ○ too slow for large databases
● N-grams
   ○ uses a simpler definition of phrase
   ○ phrase is just a sequence of N words
        ■ 1 word - unigram
        ■ 2 words - bigram
        ■ 3 words - trigram
        ■ N words - N-gram
   ○ fits the Zipf distribution better than words alone
   ○ improves retrieval effectiveness, so commonly used
   ○ takes up a lot of memory
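
Generating word N-grams is a sliding window over the token list. A minimal sketch, assuming the text has already been tokenized:

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = "the world wide web".split()
print(ngrams(tokens, 2))
# [('the', 'world'), ('world', 'wide'), ('wide', 'web')]
print(ngrams(tokens, 3))
# [('the', 'world', 'wide'), ('world', 'wide', 'web')]
```

Every position in the text starts a new N-gram for each N that is indexed, which is why storage grows quickly.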
Document Structure and Markup

● Some parts of a document are more important
● Document parser recognizes structure using markup
   ○ Title, Heading, Bold text
   ○ Anchor tags
   ○ Metadata
   ○ Links - used in ranking algorithms
Information Retrieval

From Wikipedia, the free encyclopedia

Information retrieval (IR) is the area of study concerned with
searching for documents, for information within documents, and for
metadata about documents, as well as that of searching relational
databases and the World Wide Web. There is overlap in the usage
of the terms data retrieval, document retrieval, information retrieval,
and text retrieval, but each also has its own body of literature, theory,
praxis, and technologies. IR is interdisciplinary, based on computer
science, mathematics, library science, information science,
information architecture, cognitive psychology, linguistics, and
statistics.




            Part of a Web page from Wikipedia
<html>
<head>

<title>Information retrieval - Wikipedia, the free encyclopedia</title>

…

<body>

    <h1 id="firstHeading" class="firstHeading">Information retrieval</h1>
<p><b>Information retrieval</b> (<b>IR</b>) is the area of study concerned with searching for documents, for <a
href="/wiki/Information" title="Information">information</a> within documents, and for <a
href="/wiki/Metadata_(computing)" title="Metadata (computing)" class="mw-redirect">metadata</a> about documents, as well as that of
searching <a href="/wiki/Relational_database" title="Relational database">relational databases</a> and the <a
href="/wiki/World_Wide_Web" title="World Wide Web">World Wide Web</a>.
...
</body>
</html>




           HTML source for example Wikipedia page
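
A document parser for markup like this can be sketched with Python's standard html.parser module. The class name, the chosen set of tags, and the shortened sample page are assumptions for illustration; the point is that text is collected per structural field so a ranker could weight title, heading, bold, and anchor text differently.

```python
from html.parser import HTMLParser


class FieldExtractor(HTMLParser):
    """Collect the text found inside selected structural tags."""

    FIELDS = ("title", "h1", "b", "a")

    def __init__(self):
        super().__init__()
        self.open_fields = []  # tracked tags that are currently open
        self.fields = {name: [] for name in self.FIELDS}

    def handle_starttag(self, tag, attrs):
        if tag in self.FIELDS:
            self.open_fields.append(tag)

    def handle_endtag(self, tag):
        if tag in self.open_fields:
            self.open_fields.remove(tag)

    def handle_data(self, data):
        text = data.strip()
        if text:
            for tag in set(self.open_fields):
                self.fields[tag].append(text)


page = """<html><head>
<title>Information retrieval - Wikipedia, the free encyclopedia</title></head>
<body><h1 id="firstHeading" class="firstHeading">Information retrieval</h1>
<p><b>Information retrieval</b> (<b>IR</b>) is the area of study concerned with
searching for <a href="/wiki/Information" title="Information">information</a>
within documents.</p></body></html>"""

extractor = FieldExtractor()
extractor.feed(page)
extractor.close()
for name, texts in extractor.fields.items():
    print(name, texts)
```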
Questions??

   Thanks!
