Legislative document content extraction based
on Semantic Web technologies
A use case on processing the History of the Law in Chile
Francisco Cifuentes Silva
Library of Congress, Chile
PhD Student
WESO research group
Jose Emilio Labra Gayo
WESO research group
University of Oviedo, Spain
Chilean Library of Congress
In Spanish: BCN (Biblioteca del Congreso Nacional de Chile)
Political powers: Executive, Judiciary, Legislative
Independent body inside the Legislative power
Advises the parliament and provides services to citizens
http://www.bcn.cl
Two projects at the Library of Congress (BCN)
History of the Law
Parliamentary work
History of the Law (LeyChile)
Collect all documents generated during the legislative process of a law
Phases:
An initiative starts life as a draft bill
Subject to debates
Validity period (once it is published)
Modifications, additions, ...
Repeal (derogation)
Goal:
Capture the spirit of the law
Traceability
https://www.bcn.cl/historiadelaley
Parliamentary work
Collect all legislative activity by each Member of Parliament
Retrieve all interventions made
Parliamentary motion
Session journal
Commission report
Ordered and categorised
https://www.bcn.cl/laborparlamentaria/
Both projects adopted semantic technologies
Some initial reasons:
Semantic technologies considered one pillar of strategic plan (in 2014)
Innovative action to generate new products
Improve interoperability mechanisms
Sem. Web aligned well with open & public data
Which semantic technologies?
Text mining and content enrichment
Entity extraction
Topic identification
Automatic markup
Classification
Machine readable info
XML & URIs
RDF
Ontologies
Linked Open Data
Workflow pipelines
3 main steps
Automatic XML Marker
RDF & Linked data generation
Content delivery
Workflow overview
[Diagram components: National library legislative documents (paper, which requires OCR, and text documents); automatic XML marker; SVN repository; Akoma-Ntoso XML editor & tools; publishing (RDF extraction from Akoma-Ntoso); Linked Open Data; query DB; services layer; content portals]
Automatic XML marker
Source: Text
Target: XML following Akoma-Ntoso
Automatic XML marker: 4 phases
[Diagram: Text → Named Entity Recognizer → (entity, type) → Mediator, backed by the Legal Knowledge Base → (entity, type, URI) → Structural marker → internal XML representation → Converter → AKN XML]
1. Named Entity Recognizer
Detection of entities & types of entities
Web service implementing the Stanford NER with a CRF classifier
Evaluation in production: detects 97% of entities
Type          Some examples                                          # of entities
Person        Salvador Allende, Sebastián Piñera                     5,139
Organization  Ministerio de Salud, SERNATUR                          2,848
Location      Valparaíso, Santiago de Chile                          1,251
Document      Ley 20.000, Diario de sesión nº 12                     732,497
Role          Senador, Diputado, Alcalde                             428
Events        Nacimiento de Eduardo Frei, Sesión Nº 23               14,389
Law           Boletín 11536-04, Prohíbe fumar en espacios cerrados   12,737
Dates         27 de febrero de 2010, el próximo año, ...             20,632
[Diagram: Text → Named Entity Recognizer → text with (entity, type) annotations]
2. Mediator
Entity linking and disambiguation
Text similarity algorithms
Based on Apache Lucene
In-house development
- Use of context information to narrow the list of candidates
- Custom filters and association heuristics
- Specialized web services
[Diagram: text with (entity, type) → Mediator, backed by the Legal Knowledge Base → text with (entity, type, URI)]
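The candidate narrowing and similarity ranking listed above can be sketched as follows. This is only an illustration under stated assumptions, not the BCN implementation: Python's `difflib` stands in for Apache Lucene, the entity type reported by the recognizer is the only context used, and the knowledge-base entries, `example.org` URIs, threshold and the `link_entity` helper are all hypothetical.

```python
from difflib import SequenceMatcher

# Hypothetical knowledge-base entries: (label, entity type, URI).
# The URIs are placeholders, not real BCN identifiers.
KNOWLEDGE_BASE = [
    ("Salvador Allende Gossens", "Person", "http://example.org/persona/1"),
    ("Sebastián Piñera Echenique", "Person", "http://example.org/persona/2"),
    ("Ministerio de Salud", "Organization", "http://example.org/organizacion/1"),
]


def similarity(a, b):
    """Case-insensitive string similarity in [0, 1] (stand-in for Lucene scoring)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def link_entity(mention, entity_type, threshold=0.6):
    """Return the URI of the best-matching entry of the given type, or None."""
    # Context step: keep only candidates whose type matches the NER output.
    candidates = [entry for entry in KNOWLEDGE_BASE if entry[1] == entity_type]
    scored = [(similarity(mention, label), uri) for label, _, uri in candidates]
    if not scored:
        return None
    best_score, best_uri = max(scored)
    # Reject weak matches instead of linking to the wrong entity.
    return best_uri if best_score >= threshold else None
```

Filtering by type before scoring keeps the candidate list short, which is the role the slide assigns to context information; a production mediator would add further filters and heuristics on top.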
3. Structural marker
Detect structures in the text
Titles, subtitles, paragraphs, sections,...
Special structure for debates: participation
Regular expressions + custom rules
[Diagram: text with (entity, type, URI) → Structural marker → internal XML representation]
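A minimal sketch of the "regular expressions + custom rules" idea follows. The actual rule set is not given in the slides, so the three patterns below (title, article, debate participation) and the `mark_structure` helper are assumptions chosen for illustration.

```python
import re

# Hypothetical rules, tried in order; the first match wins.
RULES = [
    ("title", re.compile(r"^T[IÍ]TULO\s+[IVXLC]+\b", re.IGNORECASE)),
    ("article", re.compile(r"^Art[ií]culo\s+\d+", re.IGNORECASE)),
    # Debate intervention, e.g. "El señor ALLENDE.- ..."
    ("participation", re.compile(r"^(El señor|La señora)\s+[A-ZÁÉÍÓÚÑ]+")),
]


def mark_structure(text):
    """Label each non-empty line with the first rule it matches."""
    marked = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        kind = "paragraph"  # default when no rule applies
        for name, pattern in RULES:
            if pattern.match(line):
                kind = name
                break
        marked.append((kind, line))
    return marked
```

Applied to a short text with a title, an article, a speaker line and a plain sentence, `mark_structure` returns one `(kind, line)` pair per line, which a later step could turn into nodes of the internal representation.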
4. XML converter to Akoma-Ntoso
Programmatic approach
Internal XML representation similar to DOM
Each node converted to text in AKN-XML
[Diagram: internal XML representation → Converter → AKN XML]
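The programmatic approach, a DOM-like tree in which each node serializes itself to text, can be sketched as below. The `Node` class is hypothetical, and the output is deliberately simplified: it omits the namespace and the metadata a valid Akoma-Ntoso document needs. The element names (`article`, `num`, `content`, `p`) and the `eId` attribute are taken from the Akoma-Ntoso vocabulary as I understand it.

```python
from dataclasses import dataclass, field
from xml.sax.saxutils import escape, quoteattr


@dataclass
class Node:
    """One node of a DOM-like internal representation (hypothetical)."""
    tag: str
    attrs: dict = field(default_factory=dict)
    text: str = ""
    children: list = field(default_factory=list)

    def to_xml(self):
        """Serialize this node and its children to an XML string."""
        attrs = "".join(f" {k}={quoteattr(v)}" for k, v in self.attrs.items())
        inner = escape(self.text) + "".join(c.to_xml() for c in self.children)
        return f"<{self.tag}{attrs}>{inner}</{self.tag}>"


article = Node(
    "article",
    {"eId": "art_1"},
    children=[
        Node("num", text="Artículo 1"),
        Node("content", children=[Node("p", text="Texto & más")]),
    ],
)
xml_text = article.to_xml()
```

Escaping text and attribute values at serialization time keeps the generated documents well-formed even when the source text contains characters such as `&`.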
Human editing of AKN documents
Quality assurance by human analysts
They review the generated XML documents
2 editors:
Ad-hoc XML editor
Commercial editor: LegisPro (Xcential)
Linked data generation
The pilot project (2011) carefully defined a stable URI model
URIs have been maintained since then
URIs = IDs in the whole system
URIs are dereferenceable
Content negotiation
Custom linked data browser
Documentation (in Spanish)
http://datos.bcn.cl/es/documentacion
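Dereferenceable URIs with content negotiation mean that the same resource URI yields a different representation depending on the HTTP `Accept` header. The sketch below shows the selection logic only; the mapping of media types to extensions is an assumption (only the `.xml` form appears in the slides), and wildcards such as `*/*` are not handled.

```python
# Hypothetical mapping from media type to the extension of the representation.
REPRESENTATIONS = {
    "text/html": ".html",
    "application/rdf+xml": ".rdf",
    "text/turtle": ".ttl",
    "application/xml": ".xml",
}


def negotiate(accept_header, default="text/html"):
    """Pick the supported media type with the highest q value."""
    best, best_q = default, 0.0
    for part in accept_header.split(","):
        fields = [f.strip() for f in part.split(";")]
        media_type, q = fields[0], 1.0
        for param in fields[1:]:
            if param.startswith("q="):
                try:
                    q = float(param[2:])
                except ValueError:
                    q = 0.0  # ignore malformed weights
        if media_type in REPRESENTATIONS and q > best_q:
            best, best_q = media_type, q
    return best


def representation_url(resource_uri, accept_header):
    """Build the URL of the representation chosen for this request."""
    return resource_uri.rstrip("/") + REPRESENTATIONS[negotiate(accept_header)]
```

A server would typically answer a request for the resource URI with a redirect to the URL returned by `representation_url`, so the resource URI itself stays stable.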
AKN2RDF
RDF extraction from Akoma-Ntoso XML
● Custom-made converter (XSL discarded for perceived complexity)
● Each XML tag implemented in one Class
● Extracted data saved into multiple databases (Relational and RDF)
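The "one class per XML tag" design can be sketched as a small handler registry. Everything specific here is an assumption made for illustration: the `example.org` base URI and vocabulary, the two handler classes, and the namespace-free input XML. The real converter and its vocabulary are not shown in the slides.

```python
import xml.etree.ElementTree as ET

BASE = "http://example.org/recurso/"  # placeholder base URI
VOCAB = "http://example.org/vocab#"   # hypothetical vocabulary


class ArticleHandler:
    """Extracts triples from <article> elements."""
    tag = "article"

    def triples(self, elem, doc_uri):
        uri = f"{doc_uri}/{elem.get('eId')}"
        yield (doc_uri, VOCAB + "hasPart", uri)
        num = elem.findtext("num")
        if num:
            yield (uri, VOCAB + "number", num.strip())


class SpeechHandler:
    """Extracts triples from <speech> elements."""
    tag = "speech"

    def triples(self, elem, doc_uri):
        speaker = elem.get("by", "").lstrip("#")
        if speaker:
            yield (doc_uri, VOCAB + "hasSpeaker", BASE + "persona/" + speaker)


# Registry: one handler instance per tag name.
HANDLERS = {cls.tag: cls() for cls in (ArticleHandler, SpeechHandler)}


def extract(xml_text, doc_uri):
    """Walk the document and collect the triples emitted by each handler."""
    triples = []
    for elem in ET.fromstring(xml_text).iter():
        handler = HANDLERS.get(elem.tag)
        if handler:
            triples.extend(handler.triples(elem, doc_uri))
    return triples
```

Keeping the extraction logic for each tag in its own class makes it possible to add or change the handling of one element without touching the others, which may be part of why this was preferred over a single XSL stylesheet.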
Linked data generation
Source: AKN XML documents
Linked data browser (WESO-DESH)
Target: RDF data
http://datos.bcn.cl/recurso/cl/documento/579095/
http://datos.bcn.cl/recurso/cl/documento/579095.xml
SPARQL endpoint
RDF triples are published as a public SPARQL endpoint
Number of norms by municipality
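A query such as "number of norms by municipality" is an aggregation over the published triples. The sketch below builds such a query and the GET request URL defined by the SPARQL 1.1 Protocol (the query is passed in the `query` parameter). The endpoint address, the `ex:` vocabulary and the property names are placeholders; the real ones are described in the BCN documentation.

```python
from urllib.parse import urlencode

# Placeholder endpoint and vocabulary; both are assumptions for illustration.
ENDPOINT = "http://example.org/sparql"

QUERY = """
PREFIX ex: <http://example.org/vocab#>
SELECT ?municipality (COUNT(?norm) AS ?norms)
WHERE {
  ?norm a ex:Norm ;
        ex:municipality ?municipality .
}
GROUP BY ?municipality
ORDER BY DESC(?norms)
"""


def build_request_url(endpoint, query):
    """Return the GET URL for a SPARQL query (SPARQL 1.1 Protocol)."""
    return endpoint + "?" + urlencode({"query": query})


url = build_request_url(ENDPOINT, QUERY)
```

Sending this URL with an `Accept: application/sparql-results+json` header would normally return the counts as JSON bindings.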
Content delivery
Web portals using Open Source Technologies
CMS (Typo3)
Python/Java
Varnish
Apache Lucene
REST Web service layers which connect to RDF triplestore and DB
Data exports to PDF, Doc and XML formats
URIs of parliamentary profiles = URIs in triplestore
History of the Law portal
https://www.bcn.cl/historiadelaley
Links to Members of Parliament
Each article has a link
Different versions of a law
History of the Law portal
https://www.bcn.cl/historiadelaley
Compare different versions
Parliamentary Work
https://www.bcn.cl/laborparlamentaria
Show participation of each Member of Parliament
Some experimental visualizations
Relationships between laws
Historical Parliament
Parliamentary genealogy (family relationships)
Regions mentioned in laws (legislative hackathon)
Links between laws
Historical parliament
http://datos.bcn.cl/visualizaciones/genealogia-parlamentaria/
Parliamentary genealogy
http://datos.bcn.cl/visualizaciones/genealogia-parlamentaria/consulta.jsp
Regions mentioned by law
Result of a legislative hackathon
http://datos.bcn.cl/global-legislative-hackathon-2016/Hackaton/www/html/master.html
In 2010 there was an earthquake in the Biobío region
Some statistics
24,368 documents (Nov. 2018)
Number of RDF triples: 28 million
According to Google Analytics
Average browsing time: 2 min 26 s
Visits received: 331,481 (Nov. 2016-2017) → 476,241 (Nov. 2016-2017)
And some findings...
Question: why are there some valleys?
Answer: the dictatorship period
[Charts: session attendance by year; RDF triples generated by year]
Some lessons learnt
RDF granularity & inference trade-off
RDF statements + inference (high running times; some queries didn't terminate)
Inferred triples added a priori to the triple store (high response times for large docs)
Small subset of RDF triples (structural parts of docs and metadata)
Performance problems in the XML editor when browsing long docs (>1000 pages)
Low SPARQL endpoint usage by external apps
If we could start again, I would recommend ShEx
Personal note: this kind of data portal led to my interest in ShEx
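The a priori option in the granularity trade-off above means computing inferred triples once and storing them, instead of inferring at query time. A minimal sketch of that materialization, assuming only `rdfs:subClassOf` reasoning over triples held as Python tuples (the `ex:` names are hypothetical):

```python
SUBCLASS = "rdfs:subClassOf"
TYPE = "rdf:type"


def materialize(triples):
    """Add inferred rdf:type and rdfs:subClassOf triples until none are new."""
    store = set(triples)
    changed = True
    while changed:
        changed = False
        subclass = {(s, o) for s, p, o in store if p == SUBCLASS}
        new = set()
        for s, p, o in store:
            if p in (TYPE, SUBCLASS):
                # If o is a subclass of sup, then s relates to sup in the same way.
                for sub, sup in subclass:
                    if sub == o:
                        new.add((s, p, sup))
        if not new <= store:
            store |= new
            changed = True
    return store


facts = {
    ("ex:ley20000", TYPE, "ex:Law"),
    ("ex:Law", SUBCLASS, "ex:Norm"),
    ("ex:Norm", SUBCLASS, "ex:Document"),
}
closure = materialize(facts)
```

The store grows from three to six triples here; on large documents the same effect multiplies, which is consistent with the response-time problems reported on the slide.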
Conclusions & future projects
Well designed URIs can act as a perfect glue for interoperability
Automatic workflow pipelines help long-term survival of LD-based projects
SPARQL endpoint since 2011
Future projects on top of existing ones
National Budget as Linked data
Diana Project: Members of Parliament linked to social network analysis
New portal: User customization & recommender systems
End of presentation
Acknowledgements:
David Vilches, Eridan Otto, Christian Sifaqui
