Analysing Structured Scholarly Data
Embedded in Web Pages
Pracheta Sahoo, Ujwal Gadiraju, Ran Yu,
Sriparna Saha and Stefan Dietze
WWW 2016
April 11th, 2016
Montreal, Canada
OVERVIEW
❏ INTRODUCTION
❏ MOTIVATION
❏ RESEARCH QUESTIONS
❏ ANALYSES
❏ CONCLUSIONS
❏ FUTURE WORK
INTRODUCTION (1/3)
The Web: nearly 46 trillion
Web pages indexed by Google
VS
Linked Data: approx. 1000
datasets & 100 billion
statements
● different order of
magnitude w.r.t. scale &
dynamics
Are there other semantics (structured facts) on the Web?
INTRODUCTION (2/3)
● Web pages embed structured data
(microdata, microformats and RDFa)
○ Interpretation of web documents
(search & retrieval)
● Increase in prevalence of embedded
markup (2014 Google study of 12 bn
pages estimates an adoption of 26%)
● “Web Data Commons” (Meusel et al.
[ISWC’14])
○ Markup from Common Crawl (2.2 bn
pages)
○ 17 billion RDF quads
○ Markup in 26% of pages, 14% of PLDs
in 2013 (increase from 6% in 2011)
Other semantics
(structured facts) on
the Web!
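A minimal sketch of how embedded microdata can be read out of a page, using only the Python standard library. The class name MicrodataTypes, the helper extract_markup and the sample HTML are assumptions made for illustration; they are not part of the Web Data Commons extraction pipeline.

```python
from html.parser import HTMLParser


class MicrodataTypes(HTMLParser):
    """Collect itemtype values and itemprop names from an HTML page."""

    def __init__(self):
        super().__init__()
        self.types = []
        self.props = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value is None for bare attributes
        attr_map = dict(attrs)
        if "itemtype" in attr_map:
            self.types.append(attr_map["itemtype"])
        if "itemprop" in attr_map:
            self.props.append(attr_map["itemprop"])


# Hypothetical page fragment with schema.org microdata
SAMPLE_HTML = """
<div itemscope itemtype="http://schema.org/ScholarlyArticle">
  <span itemprop="name">An Example Paper</span>
  <span itemprop="author" itemscope itemtype="http://schema.org/Person">
    <span itemprop="name">A. Author</span>
  </span>
</div>
"""


def extract_markup(html):
    """Return (types, properties) found in the given HTML string."""
    parser = MicrodataTypes()
    parser.feed(html)
    return parser.types, parser.props
```

Calling extract_markup(SAMPLE_HTML) yields the declared types and the property names in document order. Nesting and property values are ignored here to keep the sketch short.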
INTRODUCTION (3/3)
Characteristics of Markup Data
MOTIVATION
● Embedded markup ⇒ sparsely
linked, large % of coreferences,
redundant statements
● Uptake and reuse of embedded
markup is hindered by the lack
of dynamics, scale
● Lack of understanding of the
adoption of markup for
scholarly resource metadata
WHAT WE BRING TO THE TABLE ...
● Study of scholarly data
extracted from embedded
annotations (Web Data
Commons)
● Shape & characteristics of
entity descriptions
● Level of adoption of terms
& types, distributions
across TLDs, PLDs, data
publishers
RESEARCH QUESTIONS
RQ1 What are frequently used
terms & types for scholarly data?
RQ2 How are statements about
bibliographic data distributed
across the web? Who are the key
providers of bibliographic markup?
RQ3 What are the frequent errors
that can be observed?
DATASET
● Web Data Commons (WDC) 2014 dataset
● Subset ⇒ all statements describing entities
of type s:ScholarlyArticle or co-occurring
on the same document with any
s:ScholarlyArticle instance
○ 6,793,764 quads
○ 1,184,623 entities
○ 83 distinct classes
○ 429 distinct predicates
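The subset selection described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: quads are modelled as plain (subject, predicate, object, document) tuples, and the function name select_scholarly_subset is hypothetical.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
SCHOLARLY_ARTICLE = "http://schema.org/ScholarlyArticle"


def select_scholarly_subset(quads):
    """Keep every quad from documents that contain a ScholarlyArticle instance.

    Each quad is assumed to be a (subject, predicate, object, document) tuple.
    """
    # Documents in which at least one entity is typed as s:ScholarlyArticle
    docs = {
        doc
        for (_subj, pred, obj, doc) in quads
        if pred == RDF_TYPE and obj == SCHOLARLY_ARTICLE
    }
    # All statements from those documents: the articles and co-occurring entities
    return [q for q in quads if q[3] in docs]
```

Selecting whole documents covers both cases in one pass, since every s:ScholarlyArticle entity lives in one of the selected documents.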
DATASET - Considerations
● s:ScholarlyArticle is the only type which
explicitly refers to scholarly articles
● We focus on schema.org, the most
widely used schema
● Types considered ⇒ s:ScholarlyArticle,
s:Person and s:Organization
○ 280,616 instances (s:ScholarlyArticle)
○ 847,417 instances (s:Person)
○ 3,798 instances (s:Organization)
SCHOLARLY TYPES & PREDICATES (1/2)
Cumulative dist. of predicates over instances across
extracted types
[Plot annotations: 1 to 14, 1 to 9, 1 to 4]
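One way such a cumulative distribution could be computed is sketched below. It assumes the statistic is the number of distinct predicates per instance, which the slide does not spell out; the function name and the triple layout are hypothetical.

```python
from collections import Counter


def cumulative_predicate_distribution(triples):
    """Return [(k, fraction of instances with at most k distinct predicates)].

    Each triple is assumed to be (subject, predicate, object).
    """
    predicates_by_instance = {}
    for subj, pred, _obj in triples:
        predicates_by_instance.setdefault(subj, set()).add(pred)

    # Histogram: number of distinct predicates -> number of instances
    histogram = Counter(len(p) for p in predicates_by_instance.values())
    total = len(predicates_by_instance)

    running = 0
    result = []
    for k in sorted(histogram):
        running += histogram[k]
        result.append((k, running / total))
    return result
```

Running this once per type (s:ScholarlyArticle, s:Person, s:Organization) would give one curve per type.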
SCHOLARLY TYPES & PREDICATES (2/2)
Top-10 Predicates for s:ScholarlyArticle
DOMAINS & DOCUMENTS (1/5)
Distribution of Entities & Statements across PLDs
DOMAINS & DOCUMENTS (2/5)
Top-10 PLDs (ranked by no. of entities)
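A rough sketch of ranking pay-level domains (PLDs) by number of entities. The helper naive_pld takes the last two host labels, which is only an approximation (it is wrong for suffixes such as co.uk; a proper implementation would consult the Public Suffix List). Function names and the quad layout are assumptions for illustration.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def naive_pld(url):
    """Approximate the pay-level domain as the last two host labels (assumption)."""
    host = urlsplit(url).hostname or ""
    return ".".join(host.split(".")[-2:])


def entities_per_pld(quads):
    """Return [(pld, entity_count)] sorted by descending count, then by name.

    Each quad is assumed to be (subject, predicate, object, document_url).
    """
    entities = defaultdict(set)
    for subj, _pred, _obj, doc in quads:
        # Blank node labels are only unique per document, so key on both
        entities[naive_pld(doc)].add((doc, subj))
    counts = {pld: len(ents) for pld, ents in entities.items()}
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
```

Slicing the result with [:10] would give a top-10 list in the style of the slide.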
DOMAINS & DOCUMENTS (3/5)
Distribution of Entities & Statements across TLDs
DOMAINS & DOCUMENTS (4/5)
Distribution of Entities & Statements across HTML
Documents
DOMAINS & DOCUMENTS (5/5)
Top-10 Documents Ranked According to
Embedded Entities
TOPICS & PUBLICATION TYPES (1/4)
Distribution of Scholarly Articles across Publishers
TOPICS & PUBLICATION TYPES (2/4)
Top-10 Publishers and corresponding no. of
Publications
TOPICS & PUBLICATION TYPES (3/4)
Top-10 Publication Types (genres) across WDC
TOPICS & PUBLICATION TYPES (4/4)
Top-10 Article Titles (ranked by frequency of occurrence)
FREQUENT ERRORS - Schema Violations
Top-10 Misused Predicates
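A minimal sketch of how misused predicates could be counted, taking "misuse" to mean a predicate applied to a type whose schema does not define it. That definition, the ALLOWED table (a tiny hand-picked subset, not the full schema.org vocabulary) and the function name are all assumptions for illustration.

```python
from collections import Counter

# Illustrative subset only; the real schema.org definitions are far larger
ALLOWED = {
    "ScholarlyArticle": {"name", "author", "datePublished", "publisher"},
    "Person": {"name", "affiliation"},
}


def misused_predicates(statements, allowed=ALLOWED):
    """Count (type, predicate) pairs where the predicate is not defined for the type.

    Each statement is assumed to be (entity_type, predicate).
    Types missing from the table are skipped rather than flagged.
    """
    misuse = Counter()
    for entity_type, predicate in statements:
        if entity_type in allowed and predicate not in allowed[entity_type]:
            misuse[(entity_type, predicate)] += 1
    return misuse.most_common()
```

The output is ordered by frequency, so the first ten entries correspond to a top-10 list of violations.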
CONCLUSIONS (1/2)
● First study on coverage & char. of
bibliographic metadata embedded
in web pages.
● Early adopters ⇒ publishers,
libraries, other providers of
bibliographic data.
● Usage of terms, types ⇒ dist.
across providers, domains and
topics follows a power law; few
providers & documents
contributing to majority of data.
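The concentration claim (few providers contributing the majority of data) can be checked with a simple top-k share. This is a hedged sketch; top_k_share is a hypothetical helper and the example numbers are made up, not taken from the study.

```python
def top_k_share(counts, k):
    """Fraction of all items contributed by the k largest providers."""
    ordered = sorted(counts, reverse=True)
    total = sum(ordered)
    return sum(ordered[:k]) / total if total else 0.0


# Made-up entity counts per provider, for illustration only
example_counts = [90, 5, 3, 2]
share_of_top_provider = top_k_share(example_counts, 1)
```

A share close to 1 for small k is what a heavy-tailed, power-law-like distribution would produce.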
CONCLUSIONS (2/2)
● Top-k genres & publishers indicate a
bias towards French and English data
providers.
● Article titles, PLDs & publishers ⇒
bias towards Computer Science and Life
Sciences.
● In this study we only consider entities
tagged explicitly as "scholarlyArticle";
a deeper analysis considering more
types (article, book, etc.) and other
creative works can shed light on the
true scale and potential of
embedded markup data.
FUTURE WORK
● Targeted crawl of typical
providers of scholarly data
(publishers, academic
orgs., libraries, etc.)
● Consider implicitly typed
bibliographic or creative
work as scholarly data
Contact Details :
gadiraju@l3s.de
http://www.L3S.de
LIMITATIONS
● Our study is limited to
schema.org & the types of
s:ScholarlyArticle, s:Person,
s:Organization.
● We consider only explicitly
linked scholarly works.
