Diadem 1.0

1. DIADEM: Domain-centric, Intelligent, Automated Data Extraction. Tim Furche, Georg Gottlob, Giorgio Orsi. May 11th, 2011 @ Oxford University Computing Laboratories. Joint work with Giovanni Grasso, Omer Gunes, Xiaonan Guo, Andrey Kravchenko, Thomas Lukasiewicz, Christian Schallhart, Andrew Sellers, Gerardo Simaris, Cheng Wang

3. 1 Web Data Extraction

4. Section 1: Web Data Extraction. Data on the Web
   - there is more of it than we can use
   - the problem is no longer availability, but finding, integrating, analysing, …

5. Surface vs. Deep Web
   - the deep web is an estimated 500 × the size of the surface web
   - an estimated 400000 deep web databases
   - What? Products (stores); directories (yellow pages); catalogs (libraries); public DBs (publications, census, data.gov, …); public services (weather, location, …)

6. And it’s not just one haystack …

11. 7 bedrooms; 5 bedrooms

12. The Web is more than HTML

13. Overview
   - Introducing Web Data Extraction: Scenarios; Why now?
   - Supervised Web Data Extraction
   - Unsupervised Web Data Extraction
   - DIADEM: OPAL, AMBER, OXPath, IVLIA, Datalog±

14. 1.1 Web Data Extraction: Scenarios

15. The Need for Web Data Extraction
   - information drives business (decision making, trend analysis, …)
   - it is available in troves on the internet
   - but: as HTML made for humans, not as structured data
   - companies need: product specifications; pricing information; market trends; regulatory information

16. Keyword search fails (example due to Fabian Suchanek)

17. Keyword search fails

18. Scenario ➀: Electronics retailer
   - electronics retailer: online market intelligence
   - comprehensive overview of the market
   - daily information on price, shipping costs, trends, product mix
   - by product, geographical region, or competitor
   - thousands of products, hundreds of competitors
   - nowadays: specialised companies; mostly manual, interpolation; large cost

19. Scenario ➁: Supermarket chain
   - competitors’ product prices
   - special offers or promotions (time sensitive)
   - new products, product formats & packaging

20. Scenario ➂: Hotel agency
   - online travel agency: best price guarantee
   - prices of competing agencies
   - average market price

21. Scenario ➃: Hedge fund
   - house price index: published at regular intervals by the national statistics agency; affects share values of various industries
   - hedge fund: online market intelligence to predict the house price index

22. And a lot more …
   - monitor blogs and forums: market intelligence, e.g., complaints, common problems; customer opinions
   - ranking and analysing product reviews
   - financial analysts: monitor trends and stats for products of a certain company / category; interest rates from financial institutions; press releases and financial reports
   - patent search & analysis
   - …

24. 1.1 Web Data Extraction: Why Now?

25. Scale

26. Applications

27. How to book a flight?

28. How to find a history book?

29. How to find a paper?

30. How to find a flat?

31. Structured Data

33. Why Web Data Extraction Now? Trends
   - Trend ➊: scale. Every business is online; automation at scale
   - Trend ➋: web applications rather than web documents; automated form filling (deep web navigation)
   - Trend ➌: structured, common-sense data is available; it allows more sophisticated automated analysis; also a tool for improved data extraction?

34. 2 Web Data Extraction: Supervised

35. manual (e.g., Web Harvest): user writes the wrapper, sometimes using wrapping libraries; supervised (e.g., Lixto): user provides examples and refines the wrapper; semi-supervised: user provides examples (per site), wrapper is automatically learned; unsupervised (e.g., DIADEM): entirely automated; some systems omit examples and run analysis directly on all pages, some systems automatically guess examples

36. Section 2: Supervised Web Data Extraction
   - user interaction is needed to record interaction sequences (such as form fillings) and to visually select examples for data, rather than manually writing a wrapper in a programming language
   - current gold standard for high-accuracy extraction
   - examples: Lixto; Automation Anywhere; Web Harvest; …

40. Lixto: Extraction & Analysis
   - Lixto: sophisticated, visual, semi-automated extraction tool
   - visually select, automatically derive patterns, verify
   - highly scalable extraction and processing with Lixto server
   - but also a data integration & business analytics suite: data cleaning; data flow scenarios (merge & filter from different web sites); market intelligence & analytics

43. 3 Web Data Extraction: Unsupervised

44. 17000 real estate sites in the UK alone

45. Section 3: Unsupervised Web Data Extraction. Why Automate Data Extraction? Too many fish in the pond
   - more than 17000 UK real estate sites
   - similar for restaurants, travel, airlines, pharmacies, retail shops, …
   - aggregators cover only a fraction and are updated slowly
   - per-site manual work is infeasible: wrapper construction too expensive; tracking changes
   - this excludes manual & (semi-)supervised approaches

46. Why Automate Data Extraction? All the fish are different
   - large, modern aggregators (> 100000)
   - nation-wide agencies (> 10000)
   - agencies for a single quarter (< 15)
   - no single unsupervised wrapper can do this today

47. … and we really need it!
   - search engine providers (Google, Microsoft, Yahoo!) all work on information and data extraction for “vertical”, “object” and “semantic” search
   - goal: turn search engines into knowledge bases for decision support

48. “no one really has done this successfully at scale yet” (Raghu Ramakrishnan, Yahoo!, March 2009)
   “Current technologies are not good enough yet to provide what search engines really need. [...] Any successful approach would probably need a combination of knowledge and learning.” (Alon Halevy, Google, Feb. 2009)

49. Unsupervised: The Story so Far
   - Key observation: “database” web sites are generated using templates, so wrapper generators need to identify these templates automatically
   - Two major approaches:
     1. machine learning from a few hand-labeled examples: similar to semi-supervised, but only one set of examples for an entire domain; high precision only for simple domains (single entity type, few attributes)
     2. fully automatic exploitation of the repeated structure of result pages: good precision needs a lot of data (many records per page, many pages); doesn’t work for forms (no repetition)
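
The second approach, exploiting repeated structure, can be illustrated with a minimal Python sketch. It assumes that records generated from one template share the same root-to-element tag path and simply counts how often each path occurs. The class `PathCounter`, the helper `repeated_paths`, the threshold `min_count` and the sample page are hypothetical and not taken from any system named in the slides.

```python
from collections import Counter
from html.parser import HTMLParser


class PathCounter(HTMLParser):
    """Counts how often each root-to-element tag path occurs on a page."""

    def __init__(self):
        super().__init__()
        self.stack = []         # currently open tags
        self.paths = Counter()  # tag path -> number of occurrences

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)
        self.paths["/".join(self.stack)] += 1

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; tolerates unclosed children.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass


def repeated_paths(html, min_count=3):
    """Return the tag paths that occur at least min_count times."""
    parser = PathCounter()
    parser.feed(html)
    parser.close()
    return {p: n for p, n in parser.paths.items() if n >= min_count}


# Hypothetical result page with three records from one template.
PAGE = """
<html><body>
  <h1>Flats in Oxford</h1>
  <ul>
    <li><span>2 bedrooms</span><b>£950</b></li>
    <li><span>3 bedrooms</span><b>£1200</b></li>
    <li><span>5 bedrooms</span><b>£2100</b></li>
  </ul>
</body></html>
"""

print(repeated_paths(PAGE))
```

On the sample page only the paths below the result list repeat, while the heading and the list container occur once. With a single record per page, or on a form, no path repeats, which matches the limitations listed on the slide.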

51. ?

52. 4 DIADEM

53. Section 4: DIADEM. Domain-Centric Data Extraction: a blackbox analyser that turns any of the thousands of websites of a domain into structured data

54. host of domain-specific annotators

55. domain ontology & phenomenology

56. + everything the others are doing: template discovery; machine learning for classification

59. DIADEM: Overview. DIADEM combines
   - a host of domain-specific annotators: gives us a first “guess” to automatically generate examples
   - with a high-level ontology about domain entities and their phenomenology on web sites of the domain: allows us to verify & refine examples
   - + advances in existing techniques for repeated structure analysis and page & block classification
   - bottom-up understanding & top-down reasoning
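
The interplay of annotators ("first guess") and ontology ("verify & refine") can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the gazetteer entries, the price pattern and the rule that a property record needs a location and a price are all hypothetical, and the actual DIADEM annotators and ontology are far richer.

```python
import re

# Hypothetical gazetteer: surface form -> annotation type.
GAZETTEER = {"oxford": "Location", "summertown": "Location", "flat": "PropertyType"}

# Hypothetical pattern annotator for prices such as "£950" or "£1,200".
PRICE = re.compile(r"£\s?\d[\d,]*")

# Hypothetical ontology fragment: attribute types a property record must carry.
REQUIRED = {"Location", "Price"}


def annotate(text):
    """Return (type, matched text) pairs: the annotators' first guess."""
    found = [("Price", m.group()) for m in PRICE.finditer(text)]
    for word in re.findall(r"[a-z]+", text.lower()):
        if word in GAZETTEER:
            found.append((GAZETTEER[word], word))
    return found


def is_plausible_record(text):
    """Verify the guess: keep a block only if all required types occur."""
    types = {annotation_type for annotation_type, _ in annotate(text)}
    return REQUIRED <= types


print(annotate("2 bedroom flat in Oxford, £950 pcm"))
print(is_plausible_record("2 bedroom flat in Oxford, £950 pcm"))
print(is_plausible_record("Contact our Oxford office"))
```

The point of the design is the division of labour: cheap annotators may over-generate, and the domain knowledge filters their output, so that no hand-labelled examples are needed per site.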

60. 4.1 DEMO

62. DIADEM 0.1: First prototype

65. Form successfully filled. Next step

66. Achievements in Numbers
   - 15k-150k facts (5-50MB) generated per web page
   - time: usually between 30-60 sec, at most a few minutes
   - 300-400 predicates
   - Some numbers on the prototype:
     - Java files: 293 with 44993 lines of code
     - DLV rules: over 500 rules, over 200 predicates
     - Gazetteers: 111 gazetteers with 48000 entries
     - JAPE rules: 23 rules files with 30 rules

70. 4.2 OPAL: Ontologies for Form Analysis

72. Diversity

74. Section 4: DIADEM » OPAL. OPAL: Overview
   - Three step process: (1) browser extraction and annotation; (2) labelling & segmentation; (3) classification (phenomenological mapping)
   - Model-based, knowledge driven: the latter two steps are model transformations
   - thin layer of domain-dependent concepts: field types and labels; triggers for field & form creation

79. ICQ Data Set: Application to Other Domains

80. 4.3 AMBER: Ontologies for Record Extraction

82. just the opposite of OPAL

83. Section 4: DIADEM » AMBER. AMBER: Overview
   - Three step process like OPAL: (1) browser extraction and annotation; (2) classification (phenomenological mapping); (3) record segmentation (much harder than in OPAL)
   - Model-based, knowledge driven: the latter two steps are model transformations
   - thin layer of domain-dependent concepts: record and attribute types; triggers for record & attribute creation

86. Repeating

87. Similarity

89. 4.4 OXPath: Scalable, Memory-Efficient Web Extraction

90. Section 4: DIADEM » OXPath. How to book a flight?

91. How to find a history book?

92. How to find a flat?

93. How to find a paper? (Scenarios)

94. How to find a flat with OXPath
   - Start at rightmove.co.uk: doc("rightmove.co.uk")
   - Fill "oxford" into the first visible field: /descendant::field()[1]/{"oxford"}
   - Click on the second next button: /following::field()[2]/{click /}
   - On the refinement form just continue by clicking on the last field: /descendant::field()[last()]/{click /}
   - Grab all the prices: //p.price
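
To see the whole expression at once, the fragments from the slide can be put together in order. The Python sketch below only builds the expression string; it does not evaluate OXPath, and joining the fragments by plain concatenation is an assumption about how the slide's steps combine. The names `STEPS` and `expression` are hypothetical.

```python
# Each entry pairs an OXPath fragment copied from the slide with its purpose.
# Assumption: the fragments combine by plain concatenation, in slide order.
STEPS = [
    ('doc("rightmove.co.uk")', "start at rightmove.co.uk"),
    ('/descendant::field()[1]/{"oxford"}', "fill 'oxford' into the first visible field"),
    ("/following::field()[2]/{click /}", "click on the second next field"),
    ("/descendant::field()[last()]/{click /}", "continue past the refinement form"),
    ("//p.price", "grab all the prices"),
]

expression = "".join(fragment for fragment, _ in STEPS)

for fragment, purpose in STEPS:
    print(f"{fragment:45} # {purpose}")
print(expression)
```

Read this way, navigation, form filling and extraction sit in one path expression, with the actions in braces as the only visible additions to familiar XPath steps.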

95. State of Web Extraction
   - No interaction with rich, scripted interfaces: no actions other than form filling and submission
   - ➀ Imperative extraction scripts: explicit variable assignments, flow control, etc.; either a proprietary selection language or a mix of XPath & external flow control
   - ➁ Focus on automation and visual interfaces: no or very limited extraction language, only ad-hoc extractions; no multiway navigation, no optimization

96. Why OXPath? scalability; familiarity; there is no XPath for data extraction; simplicity; web applications

98. Summary of Complexity
   - Combined complexity: PTime-hard; PTime-hard
   - Data complexity: NLogSpace; LogSpace
   - Extraction marker = n-ary, nested queries
   - Actions = multiple pages
   - O(n⁴⋅q²); O(n³⋅q²)
   - Contextual actions (action free prefix)
   - Buffer bounded by page depth

99. Constant Memory

100. browser bound

101. … for many pages

102. … for many results

103. memory

104. faster

105. even faster

106. 4.5 IVLIA: Ontologies for PDF Extraction

108. Section 4: DIADEM » IVLIA. PDF Analysis

109. Semantic Analysis and Annotation

110. 4.6 Datalog±: Ontological Reasoning at Web Scale

111. Section 4: DIADEM » Datalog±. Datalog and ontological reasoning: much is possible with Datalog

                              DL axiom             Datalog rule
   Concept inclusion          employee ⊑ person    employee(X) -> person(X)
   (Inverse) role inclusion   reports⁻ ⊑ manager   reports(X,Y) -> manager(Y,X)
   Role transitivity          trans(manager)       manager(X,Y), manager(Y,Z) -> manager(X,Z)
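
The three rules on this slide are plain Datalog, so they can be evaluated bottom-up until no new facts appear. The following Python sketch shows naive forward chaining over them. The encoding of atoms as `(predicate, arguments)` tuples, the convention that uppercase strings are variables, and the sample facts about `ann`, `bob` and `carol` are assumptions made for illustration.

```python
def is_var(term):
    """Variables start with an uppercase letter, as in the slide's rules."""
    return term[:1].isupper()


def match(atom, fact, binding):
    """Extend binding so that atom matches fact, or return None."""
    (pred, args), (fact_pred, fact_args) = atom, fact
    if pred != fact_pred or len(args) != len(fact_args):
        return None
    binding = dict(binding)
    for arg, value in zip(args, fact_args):
        if is_var(arg):
            if binding.setdefault(arg, value) != value:
                return None
        elif arg != value:
            return None
    return binding


def fire(rule, facts):
    """Return all head instances derivable from facts with one rule."""
    body, (head_pred, head_args) = rule
    bindings = [{}]
    for atom in body:
        extended = []
        for binding in bindings:
            for fact in facts:
                new_binding = match(atom, fact, binding)
                if new_binding is not None:
                    extended.append(new_binding)
        bindings = extended
    return {(head_pred, tuple(b[a] if is_var(a) else a for a in head_args))
            for b in bindings}


def saturate(rules, facts):
    """Apply all rules repeatedly until a fixpoint is reached."""
    facts = set(facts)
    while True:
        derived = set()
        for rule in rules:
            derived |= fire(rule, facts)
        if derived <= facts:
            return facts
        facts |= derived


RULES = [
    ([("employee", ("X",))], ("person", ("X",))),
    ([("reports", ("X", "Y"))], ("manager", ("Y", "X"))),
    ([("manager", ("X", "Y")), ("manager", ("Y", "Z"))], ("manager", ("X", "Z"))),
]
FACTS = {
    ("employee", ("ann",)),
    ("reports", ("ann", "bob")),
    ("reports", ("bob", "carol")),
}

derived = saturate(RULES, FACTS)
for fact in sorted(derived):
    print(fact)
```

Because every head variable also occurs in the body, no new constants are ever invented and the fixpoint is finite.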

112. Datalog and ontological reasoning: but it’s not enough …

                   DL axiom               Datalog(?) rule
   Participation   employee ⊑ ∃report     employee(X) -> ∃Y report(X,Y)
   Disjointness    employee ⊑ ¬customer   employee(X), customer(X) -> ⊥
   Functionality   funct(reports)         reports(X,Y), reports(X,Z) -> Y = Z

113. Ontological Databases: E/R Schema; Object Relational Schema; Relational Schema
   - person(ssn, name, birthdate)
   - employee(ssn, empID, name, birthdate, department)
   - department(depName, building)
   - project(projID, startDate, duration)
   - supervision(supervisor, supervised)
   - assignment(employee, project)

114. Ontological Constraints
   Taxonomy definitions:
   - employee(X,Y,Z,W) -> ∃V person(V,Y,Z)
   - project(X,Y,Z) -> activity(X,Y,Z)
   Concept definitions:
   - employee(X1,Y1,Z1,W1,U1), supervision(Y1,Y2), employee(X2,Y2,Z2,W2,U2) -> supervisor(X1,Y1,Z1,W1,U1)
     (an employee who supervises another employee is a supervisor)
   - generalManager(X1,Y1,Z1,W1,U1) -> supervision(Y1,Y1)
     (a general manager supervises him/herself)

115. Big Picture: KR (expressiveness, efficiency); DB (expressiveness, efficiency)

116. Big Picture

117. Our goal … a unifying framework for DB technology + constraints, Datalog, and DLs (DL-Lite, EL, F-Logic Lite), while keeping query answering tractable in data complexity!

118. Extend Datalog by allowing in the head:
   - existential (∃) variables: tuple-generating dependencies (TGDs), e.g. employee(X), inProject(X,Y) -> ∃Z employee(Z), supervises(Z,X)
   - equality (=): equality-generating dependencies (EGDs), e.g. reports(X,Y), reports(Z,X) -> Y = Z
   - constant false (⊥): negative constraints (NCs), e.g. employee(X), customer(X) -> ⊥
   What we get is Datalog[∃,=,⊥]: Datalog+, Datalog±
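
How the three kinds of heads behave on data can be sketched in Python, hard-coding the three example rules exactly as printed on this slide. The functions `chase_step`, `egd_equalities` and `violates_nc`, the tuple encoding of facts, the `_:z_` naming of labelled nulls and the sample facts are assumptions for illustration; this is not the reasoning engine used in DIADEM.

```python
def chase_step(facts):
    """Apply the TGD once wherever its head is not yet satisfied.

    TGD: employee(X), inProject(X,Y) -> ∃Z employee(Z), supervises(Z,X)
    """
    facts = set(facts)
    employees = {args[0] for pred, args in facts if pred == "employee"}
    in_project = {args[0] for pred, args in facts if pred == "inProject"}
    supervised = {args[1] for pred, args in facts
                  if pred == "supervises" and args[0] in employees}
    for x in sorted((employees & in_project) - supervised):
        z = f"_:z_{x}"  # labelled null standing for the unknown supervisor
        facts |= {("employee", (z,)), ("supervises", (z, x))}
    return facts


def egd_equalities(facts):
    """Pairs (Y, Z) that the EGD reports(X,Y), reports(Z,X) -> Y = Z equates."""
    reports = {args for pred, args in facts if pred == "reports"}
    return {(y, z) for x, y in reports for z, x2 in reports
            if x2 == x and y != z}


def violates_nc(facts):
    """True if the NC employee(X), customer(X) -> ⊥ is violated."""
    employees = {args[0] for pred, args in facts if pred == "employee"}
    customers = {args[0] for pred, args in facts if pred == "customer"}
    return bool(employees & customers)


# Hypothetical sample facts.
FACTS = {
    ("employee", ("ann",)),
    ("inProject", ("ann", "diadem")),
    ("reports", ("ann", "bob")),
    ("reports", ("carol", "ann")),
}

chased = chase_step(FACTS)
print(sorted(chased))
print(egd_equalities(FACTS))
print(violates_nc(FACTS))
```

The TGD invents a placeholder for a supervisor whose identity is unknown, which plain Datalog cannot express. An EGD that equates two distinct constants, as `bob` and `carol` here, signals an inconsistency, whereas a labelled null could simply be replaced by the other term.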

119. Datalog±: Overview. Linear; DL-Lite; Sticky-join; FO-rewritable; Guarded; EL; PTIME

120. Datalog±: In practice (experiments)
   - Comparison with existing semantic data management solutions: IBM IODT [Ma et al., SIGMOD '08]; Ontotext BigOWLIM [Kiryakov, WWW '06]; Requiem [Horrocks et al., ISWC '09]
   - Prototype implementation: Nyaya (http://mais.dia.uniroma3.it/Nyaya/Home.html)
   - implements guarded, weakly-acyclic, linear and sticky Datalog±
   - couples a Datalog± engine with an efficient storage mechanism

121. Datalog±: In practice (experiments)
   - Paper: “Semantic Data Markets: Store, Reason and Query” by R. De Virgilio, G. Orsi, L. Tanca and R. Torlone (submitted)
   - Findings: commercial systems do not identify FO-rewritable fragments; they could answer queries much faster than they do now; testing FO-rewritability conditions is easy

122. Datalog±: Updates. If the language of Σ is FO-rewritable:
   - fact updates reduce to updates in an RDBMS
   - predicate updates reduce to re-computing the rewriting
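
Why the two kinds of updates differ can be seen in a small Python sketch of query rewriting. It assumes a deliberately tiny fragment with only unary inclusion rules sub(X) -> super(X); the rule set `INCLUSIONS`, the dictionary `db` standing in for the RDBMS, and the functions `rewrite` and `answer` are hypothetical.

```python
# Hypothetical unary inclusion rules (sub, super), read as sub(X) -> super(X).
INCLUSIONS = [("employee", "person"), ("generalManager", "employee")]


def rewrite(query_pred, inclusions):
    """Rewrite an atomic query into the set of predicates to union over."""
    result, frontier = {query_pred}, [query_pred]
    while frontier:
        current = frontier.pop()
        for sub, sup in inclusions:
            if sup == current and sub not in result:
                result.add(sub)
                frontier.append(sub)
    return result


def answer(query_pred, inclusions, db):
    """Evaluate the rewritten query directly over the stored facts."""
    predicates = rewrite(query_pred, inclusions)
    return set().union(*(db.get(p, set()) for p in predicates))


# Stand-in for the database: predicate -> set of stored constants.
db = {"person": {"dave"}, "employee": {"ann"}, "generalManager": {"carol"}}

rewriting = rewrite("person", INCLUSIONS)
print(sorted(rewriting))
print(sorted(answer("person", INCLUSIONS, db)))

# Fact update: only the stored data changes, the rewriting stays valid.
db["employee"].add("bob")
print(sorted(answer("person", INCLUSIONS, db)))

# Predicate update: a new rule means the rewriting has to be recomputed.
print(sorted(rewrite("person", INCLUSIONS + [("customer", "person")])))
```

All reasoning is compiled into the query, so the data can be stored and updated without materialising any inferred facts.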

124. Q&A: diadem-project.info
