Information Extraction: Distilling Structured Data from Unstructured Text
Presenter: Shanshan Lu, 03/04/2010
Referenced papers:
Andrew McCallum: Information Extraction: Distilling Structured Data from Unstructured Text. ACM Queue 3(9), November 2005.
Craig A. Knoblock, Kristina Lerman, Steven Minton, Ion Muslea: Accurately and Reliably Extracting Data from the Web: A Machine Learning Approach. IEEE Data Eng. Bull. 23(4): 33-41, 2000.
Example
Task: build a website that helps people find continuing education opportunities at colleges, universities, and organizations across the country, supporting fielded searches over locations, dates, times, etc.
Problem: much of the data was not available in structured form; the only universally available public interfaces were web pages designed for human browsing.
Information extraction
Information extraction is the process of filling the fields and records of a database from unstructured or loosely formatted text.
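A minimal illustration of "filling the fields and records of a database": the course announcement below and the hand-written patterns are hypothetical, not from either paper, but they show unstructured text becoming one structured record.

```python
import re

# Hypothetical course announcement, as it might appear on a web page.
text = ("Intro to Pottery meets Tuesdays 6-8pm at "
        "Jefferson Community College, starting 04/06/2010.")

# Hand-written patterns that fill one database record from the text.
record = {
    "course":   re.search(r"^(.*?) meets", text).group(1),
    "day":      re.search(r"meets (\w+)", text).group(1),
    "time":     re.search(r"(\d+-\d+pm)", text).group(1),
    "location": re.search(r"at (.*?),", text).group(1),
    "date":     re.search(r"starting ([\d/]+)", text).group(1),
}
print(record)
```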
Information Extraction
Information extraction
Information extraction involves five major subtasks: segmentation, classification, association, normalization, and deduplication.
Techniques in information extraction
Some simple extraction tasks can be solved by writing regular expressions. Because web pages change frequently, however, hand-written patterns alone are not sufficient for the information extraction task. Over the past decade there has been a revolution in the use of statistical and machine-learning methods for information extraction.
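The brittleness of hand-written patterns is easy to demonstrate. In this sketch (the page layouts and the price field are hypothetical), a regular expression that works on one version of a page silently stops matching after a redesign:

```python
import re

# A price extractor written against one page layout.
price_rule = re.compile(r"<b>Price:</b> \$(\d+\.\d{2})")

old_page = "<p><b>Price:</b> $12.99</p>"
new_page = "<p><span class='label'>Price</span> <span>$12.99</span></p>"

print(price_rule.search(old_page))   # matches
print(price_rule.search(new_page))   # None: the redesign broke the rule
```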
A Machine Learning Approach (Accurately and Reliably Extracting Data from the Web)
A wrapper is a piece of software that enables a semi-structured Web source to be queried as if it were a database.
Contributions
- Learning highly accurate extraction rules.
- Verifying the wrapper to ensure that the correct data continues to be extracted.
- Automatically adapting to changes in the sites from which the data is being extracted.
Learning extraction rules
One of the critical problems in building a wrapper is defining a set of extraction rules that precisely specify how to locate the information on the page. For any given item to be extracted, one needs an extraction rule to locate both the beginning and the end of that item. A key idea underlying their work is that extraction rules are based on "landmarks" (groups of consecutive tokens) that enable a wrapper to locate the start and end of the item within the page.
Samples
Rules
Start rules locate the beginning of an item; end rules are similar to start rules. Disjunctive rules combine several alternative rules, for sources whose pages use more than one format.
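A landmark-based rule can be sketched in a few lines of Python. The page, field, and landmark strings below are hypothetical, and real STALKER rules also allow wildcards such as _HtmlTag_; this only shows the SkipTo semantics of a start rule paired with a simple end rule:

```python
def skip_to(page, landmarks, start=0):
    """Apply a start rule: skip to the end of each landmark in turn and
    return the index where extraction begins, or None if the rule fails."""
    pos = start
    for lm in landmarks:
        i = page.find(lm, pos)
        if i < 0:
            return None          # landmark missing: the rule does not match
        pos = i + len(lm)
    return pos

page = "Name: Joe's Pizza <p> Address: <i>512 Main St</i> Phone: (800) 555-1234"

# Hypothetical rules for the address field:
#   start rule SkipTo("Address:") SkipTo("<i>"), end rule SkipTo("</i>")
begin = skip_to(page, ["Address:", "<i>"])
end = page.find("</i>", begin)
print(page[begin:end].strip())
```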
STALKER to learn rules
STALKER is a hierarchical wrapper induction algorithm that learns extraction rules from examples labeled by the user. STALKER requires no more than 10 examples because of the fixed web-page format and the hierarchical structure. It exploits the hierarchical structure of the source to constrain the learning problem.
STALKER to learn rules
For instance, instead of using one complex rule that extracts all restaurant names, addresses, and phone numbers from a page, STALKER takes a hierarchical approach:
1. Apply a rule that extracts the whole list of restaurants;
2. then use another rule to break the list into tuples that correspond to individual restaurants;
3. finally, from each tuple, extract the name, address, and phone number of the corresponding restaurant.

Algorithm to learn each rule
STALKER is a sequential covering algorithm that, given the training examples E, tries to learn a minimal number of perfect disjuncts that cover all examples in E. A perfect disjunct is a rule that covers at least one training example and produces the correct result on every example it matches.
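The sequential-covering idea can be sketched as follows. This is a toy version, not the paper's algorithm: the "rules" are plain landmark strings and the candidate set is fixed, whereas STALKER refines candidates during search. It only illustrates the loop of picking perfect disjuncts until all examples are covered:

```python
def sequential_covering(examples, candidate_rules, matches, correct):
    """Greedily pick 'perfect' rules (correct on every example they match)
    until all examples are covered, returning the chosen disjuncts."""
    remaining = set(examples)
    disjuncts = []
    while remaining:
        perfect = [r for r in candidate_rules
                   if any(matches(r, e) for e in remaining)
                   and all(correct(r, e) for e in examples if matches(r, e))]
        if not perfect:
            break  # no perfect rule left; real STALKER would refine further
        # Prefer the rule covering the most remaining examples.
        best = max(perfect, key=lambda r: sum(matches(r, e) for e in remaining))
        disjuncts.append(best)
        remaining -= {e for e in remaining if matches(best, e)}
    return disjuncts

# Hypothetical data: each example is (page, correct value); applying a
# landmark rule extracts whatever follows the landmark.
examples = [("tel: 555-1", "555-1"), ("tel: 555-2", "555-2"),
            ("phone 555-3", "555-3")]
rules = ["tel: ", "phone ", ": "]
matches = lambda r, e: r in e[0]
correct = lambda r, e: e[0].split(r, 1)[1] == e[1]

print(sequential_covering(examples, rules, matches, correct))
```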
STALKER to learn rules: learning a start rule for address
First, STALKER selects an example, say E4, to guide the search. Second, it generates a set of initial candidates: rules consisting of a single 1-token landmark, chosen so that each landmark matches the token immediately preceding the beginning of the address in the guiding example.
STALKER to learn rules: learning a start rule for address
Because R6 has better generalization potential, STALKER selects R6 for further refinement. While refining R6, STALKER creates, among others, the new candidates R7, R8, R9, and R10.
STALKER to learn rules: learning a start rule for address
As R10 works correctly on all four examples, STALKER stops the learning process and returns R10.
Results of STALKER: in an empirical evaluation on 28 sources, STALKER had to learn 206 extraction rules. It learned 182 perfect rules (100% accurate) and another 18 rules with accuracy of at least 90%; in other words, only 3% of the learned rules were less than 90% accurate.
Identifying highly informative examples
The most informative examples illustrate exceptional cases. The authors developed an active learning approach called co-testing that analyzes the set of unlabeled examples to automatically select examples for the user to label. A forward rule locates an item by scanning the page from its beginning; a backward rule locates the same item by scanning from the end.
Identifying highly informative examples
Basic idea: after the user labels one or two examples, the system learns both a forward and a backward rule. It then runs both rules on a set of unlabeled pages; whenever the rules disagree on an example, the system selects that example for the user to label next. Co-testing makes it possible to generate accurate extraction rules from a very small number of labeled examples.
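The disagreement test at the heart of co-testing can be sketched directly. The two rules and the pages below are hypothetical stand-ins for learned forward and backward rules; the point is only that pages where the rules disagree are the informative ones to hand to the user:

```python
import re

def forward_rule(page):
    """Forward rule (hypothetical): the token after 'Phone:'."""
    i = page.find("Phone:")
    return None if i < 0 else page[i + len("Phone:"):].split()[0]

def backward_rule(page):
    """Backward rule (hypothetical): the last ddd-dddd token on the page."""
    hits = re.findall(r"\d{3}-\d{4}", page)
    return hits[-1] if hits else None

unlabeled = [
    "Phone: 555-1212 Fax: 555-3434",   # rules disagree -> ask the user
    "Phone: 555-9999",                 # rules agree -> no label needed
]
to_label = [p for p in unlabeled if forward_rule(p) != backward_rule(p)]
print(to_label)
```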
Identifying highly informative examples
Assume that the initial training set consists of E1 and E2, while E3 and E4 are unlabeled. Based on these examples, the system learns a forward rule and a backward rule.
Identifying highly informative examples
Co-testing was applied to the 24 tasks on which STALKER failed to learn perfect rules. The results were excellent: the average accuracy over all tasks improved from 85.7% to 94.2%. Furthermore, 10 of the learned rules were 100% accurate, and another 11 were at least 90% accurate.
Verifying the extracted data
Since the information for even a single field can vary considerably, the system learns the statistical distribution of the patterns for each field. Wrappers are verified by comparing the patterns of newly extracted data against this learned distribution. When a significant difference is found, an operator can be notified, or the wrapper repair process can be launched automatically.
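One simple way to realize this check is to collapse each value to a coarse pattern and flag the wrapper when too many newly extracted values have patterns never seen in training. The `shape` function, the phone-number field, and the 50% threshold below are all illustrative assumptions, not the paper's actual verification algorithm:

```python
from collections import Counter
import re

def shape(value):
    """Collapse a value to a coarse pattern, e.g. '555-0123' -> '000-0000'."""
    return re.sub(r"[A-Za-z]", "A", re.sub(r"\d", "0", value))

# Distribution of patterns seen during training (hypothetical phone field).
trained = Counter(shape(v) for v in ["555-0123", "555-0199", "(800) 555-0142"])

def looks_broken(extracted, threshold=0.5):
    """Flag the wrapper if too many extracted values have unseen patterns."""
    unseen = sum(1 for v in extracted if shape(v) not in trained)
    return unseen / max(len(extracted), 1) > threshold

print(looks_broken(["555-0177", "555-0143"]))          # familiar patterns
print(looks_broken(["<td>broken</td>", "error 404"]))  # wrapper likely broke
```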
Automatically repairing wrappers
- Locate correct examples of the data field on the new pages.
- Re-label the new pages automatically.
- Run the labeled and re-labeled examples through STALKER to produce correct rules for the changed site.
How to locate the correct examples?
Each new page is scanned to identify all text segments that begin with one of the starting patterns and end with one of the ending patterns; these segments are called candidates. The candidates are then clustered to identify subgroups that share common features (relative position on the page, adjacent landmarks, and visibility to the user). Each group is given a score based on how similar it is to the training examples; the highest-ranked group is expected to contain the correct examples of the data field.
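The candidate-generation step can be sketched as a scan for known start/end patterns. The price field and the patterns below are hypothetical, and the clustering and scoring steps are omitted; this only shows how a new page yields a pool of candidate segments:

```python
import re

# Hypothetical learned patterns for a price field.
start_patterns = [r"\$"]
end_patterns = [r"\.\d{2}"]

def candidates(page):
    """Collect segments that begin with a starting pattern and end
    with an ending pattern; these feed the clustering/scoring steps."""
    segs = []
    for s in start_patterns:
        for e in end_patterns:
            segs += re.findall(s + r"\d+" + e, page)
    return segs

page = "Was $19.99, now $12.49! Call 555-0123."
print(candidates(page))
```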
Upcoming trends and capabilities
Combine IE and data mining to perform text mining and to improve the performance of the underlying extraction system: rules mined from a database that was itself extracted from a corpus of texts are used to predict additional information to extract from future documents, thereby improving the recall of IE.
Upcoming trends and capabilities
SQL --> database
Information extraction, the Web, and the future
The second half of the Internet revolution: giving machines access to this immense knowledge base.
Information extraction, the Web, and the future
In web search there will be a transition from keyword search over documents to higher-level queries:
- queries whose hits are objects, such as people or companies, instead of simply documents;
- queries that are structured and return information integrated and synthesized from multiple pages;
- queries stated as natural-language questions ("Who were the first three female U.S. Senators?") and answered with succinct responses.
