WWW.SPAZIODATI.EU




                                       JSONpedia
                          Facilitating consumption of MediaWiki content.




                  Michele Mostarda <mostarda@spaziodati.eu>, TW: @micmos
Wednesday, 10 October 2012
What is JSONpedia?



“JSONpedia is a library and a web service
                     meant to read WikiText markup as JSON.”




             ‣       Initially conceived as a tool to produce data to
                     train Machine Learning models.
             ‣       The REST service, inspired by Sweble
                     CrystalBall, produces JSON, HTML and
                     (coming soon) RDF data.
             ‣       Written on top of a context-dependent, event-based
                     parser to be more performant than a regex-based
                     parser (like the wikiparser) or a DOM-based
                     parser (like Sweble).


Differences from Sweble




                     ‣    Lightweight event-based parser.
                     ‣    More tolerant of the frequent syntax errors
                          present within WikiText pages.
                     ‣    Serializes to JSON output, which is easier
                          to consume!




Differences from DBpedia




                  ‣       JSONpedia does not add any semantics to
                          the extracted data.
                  ‣       JSONpedia could integrate the current
                          DBpedia regex-based parser.
                  ‣       JSONpedia is not a competitor of DBpedia
                          but rather a complement.




JSONpedia Internals




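Architecture

[Diagram: the Parser reads the input WikiText and feeds the Structure, Validator, Extractor, Splitter and Linker processors; the Linker draws on the DBpedia API / Freebase; the combined results form the output JSON.]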




WikiText Parser Events

                   // Document bounding.
                   void beginDocument(URL document);
                   void endDocument();

                   // Error handling.
                   void parseWarning(String msg, ParserLocation location);
                   void parseError(Exception e, ParserLocation location);

                   // Tag handling.
                   void beginTag(String node, Attribute[] attributes);
                   void endTag(String node);
                   void inlineTag(String node, Attribute[] attributes);
                   void commentTag(String comment);

                   // Sections
                   void section(String title, int level);

                   // References
                   void beginReference(String label);
                   void endReference(String label);

                   // Links
                   void beginLink(String url);
                   void endLink(String url);

                   // Lists
                   void beginList();
                   void listItem();
                   void endList();

                   // Templates
                   void beginTemplate(String name);
                   void endTemplate(String name);

                   // Tables
                   void beginTable();
                   void headCell(int row, int col);
                   void bodyCell(int row, int col);
                   void endTable();

                   // Generic parameter
                   void parameter(String param);

                   // Parameter / text value
                   void text(String content);
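
As a rough illustration of how a consumer might react to the events listed above, the sketch below declares a cut-down listener with four of those methods and a handler that records an indented trace. The names `WikiTextEventListener`, `TraceListener` and `TraceListenerDemo`, and the hand-fired event sequence, are assumptions made for this example; the slides do not show the actual JSONpedia types or the exact order in which the parser fires events.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical cut-down listener: only four of the events listed above.
interface WikiTextEventListener {
    void beginTemplate(String name);
    void endTemplate(String name);
    void parameter(String param);
    void text(String content);
}

// Records an indented trace of the events it receives.
class TraceListener implements WikiTextEventListener {
    final List<String> trace = new ArrayList<>();
    private int depth = 0;

    // Prefixes the line with two spaces per open template.
    private void log(String line) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < depth; i++) sb.append("  ");
        trace.add(sb.append(line).toString());
    }

    @Override public void beginTemplate(String name) { log("begin template " + name); depth++; }
    @Override public void endTemplate(String name) { depth--; log("end template " + name); }
    @Override public void parameter(String param) { log("parameter " + param); }
    @Override public void text(String content) { log("text " + content); }
}

class TraceListenerDemo {
    public static void main(String[] args) {
        TraceListener listener = new TraceListener();
        // Events fired by hand; a real parser would emit them while reading
        // markup such as {{Infobox|name=Rome}} (the exact sequence is assumed).
        listener.beginTemplate("Infobox");
        listener.parameter("name");
        listener.text("Rome");
        listener.endTemplate("Infobox");
        listener.trace.forEach(System.out::println);
    }
}
```

Because the handler only sees method calls, it never needs the whole document in memory, which is the usual argument for an event-based design over a DOM-based one.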



WikiText Processors
                Processors receive the stream of events generated by the
                parser and perform data construction and transformation.

                ‣    Structure
                ‣    Extractors
                ‣    Linkers
                ‣    Splitters
                ‣    Validator



Structure



                 The Structure Processor receives a stream of
                 WikiText parsing events and builds a one-to-one JSON
                 representation of the document DOM.
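
A minimal sketch of this idea, assuming a reduced event set (templates and text only) and plain `Map`/`List` values standing in for JSON nodes. The class `StructureSketch` and the key names `type`, `name` and `content` are invented for illustration and are not JSONpedia's actual output schema.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical Structure-like processor: nests nodes as begin/end events arrive.
class StructureSketch {
    private final Map<String, Object> root = node("document", null);
    // Stack of currently open nodes; the top one receives new children.
    private final Deque<Map<String, Object>> open = new ArrayDeque<>();

    StructureSketch() {
        open.push(root);
    }

    // Creates a node with an empty child list (key names are assumptions).
    private static Map<String, Object> node(String type, String name) {
        Map<String, Object> n = new LinkedHashMap<>();
        n.put("type", type);
        if (name != null) n.put("name", name);
        n.put("content", new ArrayList<Object>());
        return n;
    }

    @SuppressWarnings("unchecked")
    private void append(Object child) {
        ((List<Object>) open.peek().get("content")).add(child);
    }

    void beginTemplate(String name) {
        Map<String, Object> template = node("template", name);
        append(template);
        open.push(template);
    }

    void endTemplate(String name) {
        // Never pop the document root, even on an unbalanced end event.
        if (open.size() > 1) open.pop();
    }

    void text(String content) {
        append(content);
    }

    Map<String, Object> result() {
        return root;
    }
}

class StructureSketchDemo {
    public static void main(String[] args) {
        StructureSketch structure = new StructureSketch();
        structure.text("Intro ");
        structure.beginTemplate("Infobox");
        structure.text("Rome");
        structure.endTemplate("Infobox");
        System.out.println(structure.result());
    }
}
```

The stack mirrors the nesting of begin/end events, so each parsed element ends up as exactly one node in the resulting tree.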




Extractors

                          Extractors are specialized Processors that
                          collect one type of data from the
                          event stream. For example, the
                          SectionsExtractor collects the list of all
                          sections detected in the document
                          stream.
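
The following sketch shows what such an extractor could look like, assuming it only reacts to the `section(String title, int level)` event from the parser event list and ignores everything else. The class `SectionsExtractorSketch` and its nested `Section` type are hypothetical; the real SectionsExtractor is not shown in the slides.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical extractor: collects section events and ignores the rest.
class SectionsExtractorSketch {
    // One collected entry per section event.
    static final class Section {
        final String title;
        final int level;

        Section(String title, int level) {
            this.title = title;
            this.level = level;
        }

        @Override
        public String toString() {
            return level + ":" + title;
        }
    }

    private final List<Section> sections = new ArrayList<>();

    // Same signature as the section event in the parser event list.
    void section(String title, int level) {
        sections.add(new Section(title.trim(), level));
    }

    // Text is irrelevant to this extractor, so the event is dropped.
    void text(String content) {
    }

    List<Section> sections() {
        return sections;
    }
}

class SectionsExtractorDemo {
    public static void main(String[] args) {
        SectionsExtractorSketch extractor = new SectionsExtractorSketch();
        extractor.section("History", 2);
        extractor.text("Some paragraph text.");
        extractor.section(" Origins ", 3);
        System.out.println(extractor.sections());
    }
}
```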



Linkers


                      A Linker is a Processor that links the
                      current document entity to other
                      information acquired from external sources.
                      An example is the FreebaseLinker,
                      which connects an entity to its corresponding
                      representation in Freebase, if any.
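
A hedged sketch of the linking step, with the external source replaced by an injected lookup function so that the example runs offline. The class `LinkerSketch`, the `sameAs` key and the identifier `external:rome` are made up for illustration; nothing here reflects the real FreebaseLinker or any Freebase API.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;
import java.util.function.Function;

// Hypothetical linker: resolves a document entity against an external source.
class LinkerSketch {
    // Stands in for a remote call; injected so the sketch needs no network.
    private final Function<String, Optional<String>> lookup;

    LinkerSketch(Function<String, Optional<String>> lookup) {
        this.lookup = lookup;
    }

    // Returns the entity plus the external identifier when one exists.
    Map<String, String> link(String entity) {
        Map<String, String> out = new LinkedHashMap<>();
        out.put("entity", entity);
        lookup.apply(entity).ifPresent(id -> out.put("sameAs", id));
        return out;
    }
}

class LinkerSketchDemo {
    public static void main(String[] args) {
        Map<String, String> fakeSource = new HashMap<>();
        fakeSource.put("Rome", "external:rome"); // made-up identifier
        LinkerSketch linker = new LinkerSketch(
                name -> Optional.ofNullable(fakeSource.get(name)));
        System.out.println(linker.link("Rome"));
        System.out.println(linker.link("Atlantis")); // no match, so no link is added
    }
}
```

Returning the entity unchanged when the lookup finds nothing matches the "if any" wording: a missing external match is not treated as an error.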



Splitters


                          A Splitter is a Processor able to cut
                          subtrees out of the JSON document built by the
                          Structure processor. An example is the
                          TableSplitter, which extracts
                          the JSON structures representing the
                          tables declared in the document.
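
One way to picture a table splitter is as a depth-first walk over the document tree that collects every table node. The sketch below assumes the tree is made of nested `Map` and `List` values and that tables are marked with a `type` of `table`; both are assumptions for this example, as is the class name `TableSplitterSketch`.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical splitter: cuts table subtrees out of a Map/List document tree.
class TableSplitterSketch {
    // Helper to build tree nodes for the example (schema is an assumption).
    static Map<String, Object> node(String type, Object... content) {
        Map<String, Object> n = new LinkedHashMap<>();
        n.put("type", type);
        n.put("content", new ArrayList<>(Arrays.asList(content)));
        return n;
    }

    // Returns every node whose "type" is "table", in document order.
    static List<Map<?, ?>> split(Object root) {
        List<Map<?, ?>> tables = new ArrayList<>();
        collect(root, tables);
        return tables;
    }

    private static void collect(Object node, List<Map<?, ?>> out) {
        if (node instanceof Map) {
            Map<?, ?> map = (Map<?, ?>) node;
            if ("table".equals(map.get("type"))) {
                out.add(map);
                return; // keep the table subtree whole; nested tables stay inside it
            }
            for (Object value : map.values()) collect(value, out);
        } else if (node instanceof List) {
            for (Object item : (List<?>) node) collect(item, out);
        }
    }
}

class TableSplitterDemo {
    public static void main(String[] args) {
        Map<String, Object> first = TableSplitterSketch.node("table", "cell A");
        Map<String, Object> second = TableSplitterSketch.node("table", "cell B");
        Map<String, Object> doc = TableSplitterSketch.node("document",
                "intro", first, TableSplitterSketch.node("section", second));
        System.out.println(TableSplitterSketch.split(doc).size() + " table(s) found");
    }
}
```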



Validator



                          A Validator is a Processor that checks
                          the data structures parsed from a
                          document.
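
The slides do not say which checks the Validator runs. One plausible check, shown here purely as an assumption, is that every begin event is matched by the corresponding end event. The class `BalanceValidatorSketch` and its `begin`/`end`/`finish` methods are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical validator: reports begin/end events that do not pair up.
class BalanceValidatorSketch {
    private final Deque<String> open = new ArrayDeque<>();
    private final List<String> problems = new ArrayList<>();

    // Called for any begin event, e.g. begin("table") for beginTable().
    void begin(String kind) {
        open.push(kind);
    }

    // Called for any end event; must match the most recently opened element.
    void end(String kind) {
        if (open.isEmpty()) {
            problems.add("unexpected end of " + kind);
        } else if (!open.peek().equals(kind)) {
            problems.add("expected end of " + open.peek() + " but got end of " + kind);
        } else {
            open.pop();
        }
    }

    // Called at the end of the document; anything still open is a problem.
    List<String> finish() {
        List<String> all = new ArrayList<>(problems);
        for (String kind : open) all.add("unclosed " + kind);
        return all;
    }
}

class BalanceValidatorDemo {
    public static void main(String[] args) {
        BalanceValidatorSketch validator = new BalanceValidatorSketch();
        validator.begin("table");
        validator.begin("list");
        validator.end("list");
        // The table is never closed, so finish() reports it.
        System.out.println(validator.finish());
    }
}
```

Collecting problems instead of throwing keeps the validator tolerant, in line with the goal of coping with syntax errors in WikiText pages.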




Forthcoming Features

                     ‣    JSONpedia DB (based on MongoDB +
                          ElasticSearch), which can be queried online.
                          JSONpedia dumps will also be available.
                     ‣    Online data model Exporter Tool (CSV).
                     ‣    RDF output.



Release



                          JSONpedia will be fully released as
                          open source by the end of the year.




Live Demo


                          http://bit.ly/jsonpedia
                                    or
        http://json.it.dbpedia.org/frontend/form.html




WWW.SPAZIODATI.EU




                                   Thanks!

                  Michele Mostarda <mostarda@spaziodati.eu>, TW: @micmos
