Legislative document content extraction based on Semantic Web technologies

A use case about processing the History of the Law in Chile
Francisco Cifuentes Silva
Library of Congress, Chile
PhD Student
WESO research group
Jose Emilio Labra Gayo
WESO research group
University of Oviedo, Spain

Chilean Library of Congress
In Spanish: BCN (Biblioteca del Congreso Nacional de Chile)
Political powers: Executive, Judiciary, Legislative
Independent body inside the Legislative power
Advises the parliament and provides services to citizens
http://www.bcn.cl

Two projects at the Library of Congress (BCN)
History of the Law
Parliamentary work

History of the Law (LeyChile)
Collects all documents generated during the legislative process of a law
Phases:
An initiative starts life as a draft bill
Subject to debates
Period of validity (once it is published)
Modifications, additions, ...
Derogation
Goal:
Capture the spirit of the law
Traceability
https://www.bcn.cl/historiadelaley

Parliamentary work
Collects all legislative activity of each Member of Parliament
Retrieve all interventions made
Parliamentary motion
Session journal
Commission report
Ordered and categorised
https://www.bcn.cl/laborparlamentaria/

Both projects adopted semantic technologies
Some initial reasons:
Semantic technologies were considered a pillar of the strategic plan (in 2014)
Innovative action to generate new products
Improve interoperability mechanisms
The Semantic Web aligned well with open and public data

Which semantic technologies?
Text mining and content enrichment
Entity extraction
Topic identification
Automatic markup
Classification
Machine readable info
XML & URIs
RDF
Ontologies
Linked Open Data

Workflow pipelines
3 main steps
Automatic XML Marker
RDF & Linked data generation
Content delivery

Workflow overview
Diagram components:
• National library: legislative documents
  • Paper (requires OCR)
  • Text documents
• Automatic XML marker
• SVN repository (Akoma-Ntoso), with XML editor & tools
• Publishing (RDF extraction from Akoma-Ntoso)
• Linked Open Data and query DB
• Services layer
• Content portals

Automatic XML marker
Source: Text
Target: XML following Akoma-Ntoso

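Automatic XML marker
4 phases:
Text -> Named Entity Recognizer -> Text + Entity Type
Text + Entity Type -> Mediator (uses the Legal Knowledge Base) -> Text + Entity Type + URI
Text + Entity Type + URI -> Structural marker -> Internal XML representation
Internal XML representation -> Converter -> AKN XML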

1. Named Entity Recognizer
Detection of entities & types of entities
Web service implementing the Stanford NER with a CRF classifier
Evaluation in production: detects 97% of entities
Type | Some examples | # of entities
Person | Salvador Allende, Sebastián Piñera | 5,139
Organization | Ministerio de Salud, SERNATUR | 2,848
Location | Valparaíso, Santiago de Chile | 1,251
Document | Ley 20.000, Diario de sesión nº 12 | 732,497
Role | Senador, Diputado, Alcalde | 428
Events | Nacimiento de Eduardo Frei, Sesión Nº 23 | 14,389
Law | Boletín 11536-04, Prohíbe fumar en espacios cerrados | 12,737
Dates | 27 de febrero de 2010, el próximo año, ... | 20,632
Diagram: Text -> Named Entity Recognizer -> Text + Entity Type

2. Mediator
Entity linking and disambiguation
Text similarity algorithms
Based on Apache Lucene
In-house development
- Use of context information to narrow the list of candidates
- Custom filters and association heuristics
- Specialized web services
Diagram: Text + Entity Type -> Mediator (uses the Legal Knowledge Base) -> Text + Entity Type + URI
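A minimal sketch of the linking step, under stated assumptions: the slides say the Mediator relies on Apache Lucene and in-house heuristics, whereas this example swaps in Python's difflib for text similarity and uses the entity type as the only context filter. The knowledge-base entries, the example.org URIs, the function names and the 0.6 threshold are all hypothetical.

```python
from difflib import SequenceMatcher

# Hypothetical knowledge-base entries: (URI, label, entity type).
KNOWLEDGE_BASE = [
    ("http://example.org/person/salvador-allende", "Salvador Allende", "Person"),
    ("http://example.org/person/sebastian-pinera", "Sebastián Piñera", "Person"),
    ("http://example.org/org/ministerio-de-salud", "Ministerio de Salud", "Organization"),
    ("http://example.org/location/valparaiso", "Valparaíso", "Location"),
]


def similarity(a, b):
    """Case-insensitive string similarity in [0, 1] (stand-in for Lucene scoring)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def link_entity(mention, entity_type, threshold=0.6):
    """Return the URI of the best-matching entry of the same type, or None."""
    # Context filter: only entries of the recognised type are candidates.
    candidates = [(uri, label) for uri, label, kind in KNOWLEDGE_BASE
                  if kind == entity_type]
    if not candidates:
        return None
    best_score, best_uri = max((similarity(mention, label), uri)
                               for uri, label in candidates)
    # Reject weak matches instead of linking to the wrong entity.
    return best_uri if best_score >= threshold else None


print(link_entity("S. Allende", "Person"))
```

Filtering by type before scoring keeps the candidate list short, which is the idea behind "use of context information to narrow the list of candidates"; a production system would use richer context than the type alone.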

3. Structural marker
Detect structures in the text
Titles, subtitles, paragraphs, sections,...
Special structure for debates: participation
Regular expressions + custom rules
Diagram: Text + Entity Type + URI -> Structural marker -> Internal XML representation
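The "regular expressions + custom rules" approach could look roughly like the sketch below. The patterns, the block kinds and the sample text are assumptions made for illustration; the production rules are not given in the slides.

```python
import re

# Hypothetical patterns for three kinds of structure.
TITLE_RE = re.compile(r"^T[ÍI]TULO\s+[IVXLC]+\b")
ARTICLE_RE = re.compile(r"^Art[íi]culo\s+(\d+)", re.IGNORECASE)
# Debate participation, e.g. "El señor ALLENDE.- Pido la palabra."
SPEECH_RE = re.compile(r"^(?:El|La)\s+señora?\s+([A-ZÁÉÍÓÚÑ]+)\.-\s*(.*)$")


def mark_structure(text):
    """Classify each non-empty line as title, article, participation or paragraph."""
    blocks = []
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        if TITLE_RE.match(line):
            blocks.append({"kind": "title", "text": line})
        elif (m := ARTICLE_RE.match(line)):
            blocks.append({"kind": "article", "num": m.group(1), "text": line})
        elif (m := SPEECH_RE.match(line)):
            blocks.append({"kind": "participation",
                           "speaker": m.group(1), "text": m.group(2)})
        else:
            # Custom rule: anything unrecognised is a plain paragraph.
            blocks.append({"kind": "paragraph", "text": line})
    return blocks


SAMPLE = """TÍTULO I
Artículo 1.- Se prohíbe fumar en espacios cerrados.
El señor ALLENDE.- Pido la palabra.
Texto libre."""

for block in mark_structure(SAMPLE):
    print(block["kind"], "|", block["text"])
```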

4. XML converter to Akoma-Ntoso
Programmatic approach
Internal XML representation similar to DOM
Each node converted to text in AKN-XML
Diagram: Internal XML representation -> Converter -> AKN XML
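A programmatic conversion of this kind can be sketched as a recursive walk over a DOM-like tree. The internal node layout (dicts with "kind", "attrs", "text", "children") and the kind-to-element mapping are assumptions; the element names act, body, article, num and p exist in Akoma Ntoso, but the output here omits the namespace and metadata and is not meant to be schema-valid.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping from internal node kinds to Akoma Ntoso element names.
KIND_TO_AKN = {
    "document": "akomaNtoso",
    "law": "act",
    "body": "body",
    "article": "article",
    "number": "num",
    "text": "p",
}


def to_akn(node):
    """Convert one internal node, and recursively its children, to an XML element."""
    elem = ET.Element(KIND_TO_AKN[node["kind"]], dict(node.get("attrs", {})))
    elem.text = node.get("text")
    for child in node.get("children", []):
        elem.append(to_akn(child))
    return elem


# Hypothetical internal representation of a one-article law.
internal = {"kind": "document", "children": [
    {"kind": "law", "children": [
        {"kind": "body", "children": [
            {"kind": "article", "attrs": {"eId": "art_1"}, "children": [
                {"kind": "number", "text": "1"},
                {"kind": "text", "text": "Se prohíbe fumar en espacios cerrados."},
            ]},
        ]},
    ]},
]}

akn_xml = ET.tostring(to_akn(internal), encoding="unicode")
print(akn_xml)
```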

Human editing of AKN documents
Quality assurance by human analysts
They review the generated XML documents
2 editors:
Ad-hoc XML editor
Commercial editor: LegisPro (Xcential)

Linked data generation
The pilot project (2011) carefully defined a stable URI model
URIs have been maintained since then
URIs = IDs in the whole system
URIs are dereferenceable
Content negotiation
Custom linked data browser
Documentation (in Spanish)
http://datos.bcn.cl/es/documentacion
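Content negotiation on dereferenceable URIs can be illustrated with a small sketch that picks a representation from the HTTP Accept header. This is not the BCN implementation: the media-type table, the file extensions other than .xml, the example.org URI and the function names are assumptions, and wildcard handling is reduced to */*.

```python
# Hypothetical table of available representations per resource.
REPRESENTATIONS = {
    "text/html": ".html",
    "application/rdf+xml": ".rdf",
    "text/turtle": ".ttl",
    "application/xml": ".xml",
}


def parse_accept(header):
    """Return the acceptable media types, highest q-value first."""
    entries = []
    for part in header.split(","):
        fields = [field.strip() for field in part.split(";")]
        media_type = fields[0]
        if not media_type:
            continue
        quality = 1.0
        for field in fields[1:]:
            if field.startswith("q="):
                try:
                    quality = float(field[2:])
                except ValueError:
                    quality = 0.0
        entries.append((quality, media_type))
    # The sort is stable, so header order breaks ties between equal q-values.
    entries.sort(key=lambda entry: entry[0], reverse=True)
    return [media_type for quality, media_type in entries if quality > 0]


def negotiate(resource_uri, accept_header, default="text/html"):
    """Return the URI of the representation to serve, or None (HTTP 406)."""
    for media_type in parse_accept(accept_header):
        if media_type in REPRESENTATIONS:
            return resource_uri + REPRESENTATIONS[media_type]
        if media_type == "*/*":
            return resource_uri + REPRESENTATIONS[default]
    return None


print(negotiate("http://example.org/doc/1", "text/turtle;q=0.9, text/html;q=0.5"))
```

The same URI thus serves both the browser (HTML) and data clients (RDF or XML), which is what lets one URI act as the identifier across the whole system.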

AKN2RDF
RDF extraction from Akoma-Ntoso XML
● Custom-made converter (XSL discarded for perceived complexity)
● Each XML tag is handled by its own class
● Extracted data saved into multiple databases (Relational and RDF)
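The "one class per XML tag" design might be sketched as follows. The handler classes, the example.org vocabulary and the sample document are hypothetical; triples are plain tuples rather than objects from an RDF library, namespaces are omitted, and refersTo holds a full URI here for brevity, whereas real Akoma Ntoso documents point to a reference declared in the metadata.

```python
import xml.etree.ElementTree as ET

EX = "http://example.org/ontology#"  # hypothetical vocabulary


class TagHandler:
    """Base class: one subclass per XML tag that yields RDF triples."""
    tag = None

    def extract(self, elem, doc_uri):
        raise NotImplementedError


class ArticleHandler(TagHandler):
    tag = "article"

    def extract(self, elem, doc_uri):
        article_uri = f"{doc_uri}/article/{elem.get('eId')}"
        yield (doc_uri, EX + "hasArticle", article_uri)
        number = elem.findtext("num")
        if number:
            yield (article_uri, EX + "number", number.strip())


class PersonHandler(TagHandler):
    tag = "person"

    def extract(self, elem, doc_uri):
        yield (doc_uri, EX + "mentions", elem.get("refersTo"))


HANDLERS = {handler.tag: handler for handler in (ArticleHandler(), PersonHandler())}


def akn_to_triples(xml_text, doc_uri):
    """Walk the document and let the handler for each tag emit its triples."""
    triples = []
    for elem in ET.fromstring(xml_text).iter():
        handler = HANDLERS.get(elem.tag)
        if handler is not None:
            triples.extend(handler.extract(elem, doc_uri))
    return triples


SAMPLE_AKN = (
    '<act><body><article eId="art_1"><num>1</num><content><p>Texto con '
    '<person refersTo="http://example.org/person/1">una persona</person>.'
    '</p></content></article></body></act>'
)

for triple in akn_to_triples(SAMPLE_AKN, "http://example.org/doc/1"):
    print(triple)
```

Supporting a new tag then means adding one class and registering it, without touching the traversal code.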

Linked data generation
Source: AKN XML documents
Linked data browser (WESO-DESH)
Target: RDF data
http://datos.bcn.cl/recurso/cl/documento/579095/
http://datos.bcn.cl/recurso/cl/documento/579095.xml

SPARQL endpoint
RDF triples are published through a public SPARQL endpoint
Example query: number of norms by municipality
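A query of this shape could be sent to an endpoint as sketched below. The endpoint URL is a placeholder, and the ex: prefix, class and property names are invented for illustration since the slides do not show the actual vocabulary. The request is only built, not sent.

```python
from urllib.parse import urlencode
from urllib.request import Request

ENDPOINT = "http://example.org/sparql"  # placeholder, not the real endpoint

# Hypothetical vocabulary: ex:Norm and ex:municipality are assumptions.
QUERY = """
PREFIX ex: <http://example.org/ontology#>
SELECT ?municipality (COUNT(?norm) AS ?norms)
WHERE {
  ?norm a ex:Norm ;
        ex:municipality ?municipality .
}
GROUP BY ?municipality
ORDER BY DESC(?norms)
"""


def build_request(endpoint, query):
    """Build a SPARQL Protocol GET request that asks for JSON results."""
    url = endpoint + "?" + urlencode({"query": query})
    return Request(url, headers={"Accept": "application/sparql-results+json"})


request = build_request(ENDPOINT, QUERY)
print(request.full_url)
```

Passing the request to urllib.request.urlopen would execute the query against a real endpoint.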

Content delivery
Web portals using Open Source Technologies
CMS (Typo3)
Python/Java
Varnish
Apache Lucene
REST Web service layers which connect to RDF triplestore and DB
Data exports to PDF, Doc and XML formats
URIs of parliamentary profiles = URIs in triplestore

History of the Law portal
Links to Members of Parliament
Each article has a link
Different versions of a law

History of the Law portal
Compare different versions

Parliamentary Work
https://www.bcn.cl/laborparlamentaria
Shows the participation of each Member of Parliament

Some experimental visualizations
Relationships between laws
Historical Parliament
Parliamentary genealogy (family relationships)
Regions mentioned in laws (legislative hackathon)

Historical parliament
http://datos.bcn.cl/visualizaciones/genealogia-parlamentaria/

Parliamentary genealogy
http://datos.bcn.cl/visualizaciones/genealogia-parlamentaria/consulta.jsp

Regions mentioned by law
Result of a legislative hackathon
http://datos.bcn.cl/global-legislative-hackathon-2016/Hackaton/www/html/master.html
In 2010 there was an earthquake in the Biobío region

Some statistics
24,368 documents (Nov. 2018)
Number of RDF triples: 28 million
According to Google Analytics
Average browsing time: 2 min 26 s
Visits received: 331,481 (Nov. 2016-2017), 476,241 (Nov. 2016-2017)

And some findings...
Charts: session attendance by year; RDF triples generated by year
Question: why are there some valleys?
Answer: the dictatorship period

Some lessons learnt
RDF granularity & inference trade-off
RDF statements + inference (high running times, queries that didn't terminate)
A priori inferred triples added to the triple store (high response times for large docs)
Small subset of RDF triples (structural parts of docs and metadata)
Performance problems in the XML editor when browsing long docs (> 1000 pages)
Low SPARQL endpoint usage by external apps
If we could start again, I would recommend ShEx
Personal note: This kind of data portal led to my interest in ShEx

Conclusions & future projects
Well-designed URIs can act as a perfect glue for interoperability
Automatic workflow pipelines help long-term survival of LD-based projects
SPARQL endpoint since 2011
Future projects on top of existing ones
National Budget as Linked data
Diana Project: Members of Parliament linked to social network analysis
New portal: User customization & recommender systems

End of presentation
Acknowledgements:
David Vilches, Eridan Otto, Christian Sifaqui
