Elasticsearch: a key element of Invenio 3
Elasticsearch Meetup
Johnny Mariéthoz
Lausanne, 2017/03/10
About Me
12 years as a computer scientist in machine learning
7 years as an Invenio developer and instance maintainer
bass and double bass player
novice analog camera photographer
Library Network of Western Switzerland 2 Lausanne, 2017/03/10
Library Network of Western Switzerland
RERO: Library Network of Western Switzerland
220 libraries
academic libraries, heritage libraries, public libraries, school libraries or specialized libraries
50’000 students
5 cantons: FR, GE, JU, NE, VS
280’000 registered patrons
3 academic universities: Geneva, Fribourg, Neuchâtel
1 University of Applied Sciences
central office in Martigny
19 employees
Typical Data-Centered Web Application
[Diagram. Labels: Data; Web Server; Data Schema; Persistent Storage; Search Engine; PID Store; External Files; REST API; HTML Web Pages; GUI; Search Engines (Google, etc.); External Services; Access Rights; Files (Download / Preview for users); Other Formats; Browser apps; Desktop apps]
Commonly Needed Features
data management with versioning, validation and PIDs (Persistent Identifiers)
search engine
rights management (ACL, OAuth)
web page management with templates (search results and others, such as news, front page, etc.)
URL management (routing)
REST API generation
format conversion/migration
CLI utilities
data acquisition (HTML forms-based editor)
Development
modular software architecture
easy creation of new modules
web assets management for Less, Sass, Node.js, etc.
asynchronous task management
unit testing and logging
i18n (translations)
web front-end and back-end
and many more...
History
digital library and document repository software
created by CERN
mature platform: first public release v0.0.9 in 2002
open source project
originated in high-energy physics
institutional repository: CERN Document Server
integrated library system: CERN Document Server
disciplinary repository: INSPIRE
open research data server: ZENODO
self-contained Python/MySQL web application until v1.x
transition with v2.x
completely rewritten in v3
Used Technologies
set of Python modules
includes module interaction mechanisms
delivered as a framework around state-of-the-art technologies
steep learning curve
Elasticsearch Integration
SQL only for persistent data, no more SQL queries during the HTTP request
data model using JSON, JSON-Schema and ES Mapping
sorting, facets and query configuration by type of object
uses the official Elasticsearch Python package: elasticsearch-dsl
CLI to create indexes, push mappings and index the data
JSON-Schema
JSON File
[{
  "album": "A Tribute to Jack Johnson",
  "artist": "Miles Davis",
  "genre": [
    "Jazz"
  ],
  "mime": "audio/mp3",
  "performers": [
    "Miles Davis"
  ],
  "tracks": [
    "Part 1",
    "Part 2"
  ],
  "year": 1970
}, ...]
Schema File
{
  "title": "Music Album",
  "type": "object",
  "properties": {
    "album": {
      "type": "string",
      "minLength": 1
    },
    "mime": {
      "type": "string",
      "minLength": 7,
      "pattern": "^audio/",
      "enum": [
        "audio/flac",
        "audio/mp2",
        "audio/mp3",
        "audio/mp4",
        "audio/vorbis"
      ]
    },
    ...
  },
  "required": ["album", "year"],
  "additionalProperties": false
}
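To make the validation step concrete, here is a minimal sketch of what checking a record against such a schema involves. It is a hand-written toy validator covering only the keywords shown on the slide, not the validation code Invenio uses. The `year` property and its `integer` type are assumptions, since the slide elides the remaining properties, and the function name `validate` is hypothetical.

```python
import re

# Reduced version of the slide's schema. The "year" property and its
# "integer" type are assumptions: the slide elides them with "...".
SCHEMA = {
    "title": "Music Album",
    "type": "object",
    "properties": {
        "album": {"type": "string", "minLength": 1},
        "mime": {
            "type": "string",
            "minLength": 7,
            "pattern": "^audio/",
            "enum": ["audio/flac", "audio/mp2", "audio/mp3",
                     "audio/mp4", "audio/vorbis"],
        },
        "year": {"type": "integer"},
    },
    "required": ["album", "year"],
    "additionalProperties": False,
}


def validate(record, schema):
    """Return a list of error messages; an empty list means the record is valid.

    Hypothetical helper supporting only: required, additionalProperties,
    type (string/integer), minLength, pattern and enum.
    """
    errors = []
    properties = schema.get("properties", {})
    for name in schema.get("required", []):
        if name not in record:
            errors.append(f"missing required property: {name}")
    if schema.get("additionalProperties") is False:
        for name in record:
            if name not in properties:
                errors.append(f"unexpected property: {name}")
    for name, rules in properties.items():
        if name not in record:
            continue
        value = record[name]
        expected = rules.get("type")
        if expected == "string" and not isinstance(value, str):
            errors.append(f"{name}: expected a string")
            continue
        # bool is a subclass of int in Python, so exclude it explicitly.
        if expected == "integer" and (isinstance(value, bool)
                                      or not isinstance(value, int)):
            errors.append(f"{name}: expected an integer")
            continue
        if isinstance(value, str):
            if len(value) < rules.get("minLength", 0):
                errors.append(f"{name}: shorter than minLength")
            # JSON Schema patterns are not implicitly anchored: use search.
            if "pattern" in rules and not re.search(rules["pattern"], value):
                errors.append(f"{name}: does not match pattern")
        if "enum" in rules and value not in rules["enum"]:
            errors.append(f"{name}: not one of the allowed values")
    return errors


record = {"album": "A Tribute to Jack Johnson", "mime": "audio/mp3", "year": 1970}
print(validate(record, SCHEMA))
```

Because `additionalProperties` is false, a record carrying a property the schema does not declare is rejected, which is why the example record is reduced to the three declared properties.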
ES Mapping
JSON File
[{
  "album": "A Tribute to Jack Johnson",
  "artist": "Miles Davis",
  "genre": [
    "Jazz"
  ],
  "mime": "audio/mp3",
  "performers": [
    "Miles Davis"
  ],
  "tracks": [
    "Part 1",
    "Part 2"
  ],
  "year": 1970
}, ...]
Mapping File
{
  "mappings": {
    "record-v1.0.0": {
      "date_detection": false,
      "numeric_detection": false,
      "properties": {
        "album": {
          "type": "string",
          "analyzer": "english",
          "copy_to": "sort_album"
        },
        "sort_album": {
          "type": "string",
          "index": "not_analyzed"
        },
        "artist": {
          "type": "string",
          "analyzer": "standard",
          "copy_to": "facet_artist"
        },
        "facet_artist": {
          "type": "string",
          "index": "not_analyzed"
        },
        ...
      }
    }
  }
}
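The mapping relies on `copy_to` to keep an analyzed field for full-text search next to a `not_analyzed` copy for sorting or faceting. The sketch below simulates that effect in plain Python to show which fields end up with which values at index time. It is only an illustration of the idea, not how Elasticsearch implements it, and the function name `apply_copy_to` is hypothetical.

```python
# The two field pairs from the slide's mapping (other properties omitted).
PROPERTIES = {
    "album": {"type": "string", "analyzer": "english", "copy_to": "sort_album"},
    "sort_album": {"type": "string", "index": "not_analyzed"},
    "artist": {"type": "string", "analyzer": "standard", "copy_to": "facet_artist"},
    "facet_artist": {"type": "string", "index": "not_analyzed"},
}


def apply_copy_to(properties, doc):
    """Return {field: [values]} as seen by the index, including copy_to targets.

    Hypothetical helper: the stored document (_source) is left untouched,
    only the set of indexed fields grows.
    """
    indexed = {}
    for field, value in doc.items():
        values = value if isinstance(value, list) else [value]
        indexed.setdefault(field, []).extend(values)
        target = properties.get(field, {}).get("copy_to")
        # copy_to accepts a single field name or a list of names.
        targets = [target] if isinstance(target, str) else (target or [])
        for name in targets:
            indexed.setdefault(name, []).extend(values)
    return indexed


doc = {"album": "A Tribute to Jack Johnson", "artist": "Miles Davis"}
for field, values in apply_copy_to(PROPERTIES, doc).items():
    print(field, values)
```

Sorting on `sort_album` and aggregating on `facet_artist` then operate on the whole, untokenized string, while queries on `album` and `artist` still benefit from analysis.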
Schema and Mapping
JSON-SCHEMA: schemas/records/record-v1.0.0.json
should be referenced in the data itself ($schema)
used for data validation
serves as the human-readable documentation
the file name matters: for record indexing it gives index_name: records-record-v1.0.0 and document_type: record-v1.0.0
Mapping: mappings/records/record-v1.0.0.json
can be set using a CLI
the file name matters: at index creation it gives index_name: records-record-v1.0.0 with alias=records, and document_type: record-v1.0.0
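The naming convention above can be sketched as a small function that derives the index name, document type and alias from a schema or mapping path. This is an illustration of the convention as stated on the slide, not Invenio's own code. The function name `path_to_names` and the `root` parameter are hypothetical.

```python
from pathlib import PurePosixPath


def path_to_names(path, root="schemas"):
    """Derive (index_name, document_type, alias) from a schema/mapping path.

    Hypothetical helper: the directories below `root` and the file name
    without ".json" are joined with "-" to form the index name; the first
    directory is used as the alias.
    """
    parts = PurePosixPath(path).parts
    if parts and parts[0] == root:
        parts = parts[1:]
    *dirs, filename = parts
    doc_type = filename[:-len(".json")] if filename.endswith(".json") else filename
    index_name = "-".join([*dirs, doc_type])
    alias = dirs[0] if dirs else None
    return index_name, doc_type, alias


print(path_to_names("schemas/records/record-v1.0.0.json"))
print(path_to_names("mappings/records/record-v1.0.0.json", root="mappings"))
```

Both calls yield the index name `records-record-v1.0.0`, the document type `record-v1.0.0` and the alias `records`, which is what lets a record carrying a given `$schema` land in the index created from the mapping of the same name.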
Configuration
Facets Configuration
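from invenio_records_rest.facets import range_filter, terms_filter

# Facets (aggregations and filters) for the 'records' index alias.
RECORDS_REST_FACETS = dict(
    records=dict(
        aggs={
            'genre': dict(terms=dict(field='genre', size=10)),
            'years': dict(date_histogram=dict(
                field='year',
                interval='year',
                format='yyyy')
            )
        },
        # Applied to the query itself, so they also narrow the aggregations.
        filters=dict(
            genre=terms_filter('genre')
        ),
        # Applied after aggregation, so facet counts are unaffected.
        post_filters=dict(
            years=range_filter(
                'year',
                format='yyyy',
                end_date_math='/y'),
        )
    )
)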
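To show how such a facets configuration can turn into an Elasticsearch request, here is a dependency-free sketch. The `terms_filter` below is a simplified stand-in that returns plain dictionaries; it is not the function from `invenio_records_rest.facets`. `build_search_body` is a hypothetical helper, and the example configuration keeps only the `genre` facet.

```python
def terms_filter(field):
    """Simplified stand-in: map the selected values to an ES terms clause."""
    def inner(values):
        return {"terms": {field: list(values)}}
    return inner


def build_search_body(facets, selected, query=None):
    """Assemble a search body from a facets config and the selected facet values.

    Hypothetical helper. `selected` maps a facet name to the values chosen
    by the user, for example {"genre": ["Jazz"]}.
    """
    must = {"query_string": {"query": query}} if query else {"match_all": {}}
    filters = [make(selected[name])
               for name, make in facets.get("filters", {}).items()
               if name in selected]
    body = {
        "query": {"bool": {"must": must, "filter": filters}},
        "aggs": facets.get("aggs", {}),
    }
    post = [make(selected[name])
            for name, make in facets.get("post_filters", {}).items()
            if name in selected]
    if post:
        # post_filter narrows the hits without changing aggregation counts.
        body["post_filter"] = {"bool": {"filter": post}}
    return body


FACETS = {
    "aggs": {"genre": {"terms": {"field": "genre", "size": 10}}},
    "filters": {"genre": terms_filter("genre")},
}

print(build_search_body(FACETS, {"genre": ["Jazz"]}, query="davis"))
```

Keeping each facet as a small function of the selected values is what allows the same configuration to serve any combination of user selections.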
Configuration
Sorting Configuration
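# Sort options offered for the 'records' index; a leading '-' means descending.
RECORDS_REST_SORT_OPTIONS = dict(
    records=dict(
        bestmatch=dict(
            fields=['-_score'],
            title='Best match',
            default_order='asc',
            order=1,
        ),
        mostrecent=dict(
            fields=['-_created'],
            title='Most recent',
            default_order='asc',
            order=2,
        )
    )
)

# Default sort: 'query' applies when a search term is given, 'noquery' otherwise.
RECORDS_REST_DEFAULT_SORT = dict(
    records=dict(query='bestmatch', noquery='mostrecent'),
)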
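The way these two settings combine can be sketched as follows: pick the sort option (the requested one, or the default depending on whether a query was given), then translate each field into an Elasticsearch sort clause. This is a simplified illustration, not Invenio's sorter. The function name `resolve_sort` is hypothetical, and treating a leading `-` on the requested option as "reverse the order" is an assumption.

```python
SORT_OPTIONS = {
    "records": {
        "bestmatch": {"fields": ["-_score"], "title": "Best match"},
        "mostrecent": {"fields": ["-_created"], "title": "Most recent"},
    }
}
DEFAULT_SORT = {"records": {"query": "bestmatch", "noquery": "mostrecent"}}


def resolve_sort(sort_options, default_sort, index, has_query, requested=None):
    """Return the list of ES sort clauses for one search request.

    Hypothetical helper. `requested` is the option named by the client,
    optionally prefixed with "-" to reverse it (an assumption).
    """
    key = requested or default_sort[index]["query" if has_query else "noquery"]
    reverse = key.startswith("-")
    name = key[1:] if reverse else key
    clauses = []
    for field in sort_options[index][name]["fields"]:
        descending = field.startswith("-")
        field_name = field[1:] if descending else field
        if reverse:
            descending = not descending
        clauses.append({field_name: {"order": "desc" if descending else "asc"}})
    return clauses


print(resolve_sort(SORT_OPTIONS, DEFAULT_SORT, "records", has_query=True))
print(resolve_sort(SORT_OPTIONS, DEFAULT_SORT, "records", has_query=False))
```

With a search term the results are ordered by relevance; without one, relevance carries no information, so the most recently created records come first.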
Configuration
REST Configuration
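from invenio_records_rest.query import es_search_factory
from invenio_search import RecordsSearch

# One REST endpoint, keyed by the PID type ('recid').
RECORDS_REST_ENDPOINTS = dict(
    recid=dict(
        search_class=RecordsSearch,
        search_index='records',
        search_type=None,
        # Serializer for a single record, per response MIME type.
        record_serializers={
            'application/json': ('invenio_records_rest.serializers'
                                 ':json_v1_response'),
        },
        # Serializer for a list of search results.
        search_serializers={
            'application/json': ('invenio_records_rest.serializers'
                                 ':json_v1_search'),
        },
        search_factory_imp=es_search_factory,
        list_route='/records/',
        item_route='/records/<pid(recid):pid_value>'
    )
)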
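From a client's point of view, the two routes configured above give one URL for searching and one per record. The sketch below only builds such URLs and sends no request. The base URL `https://example.org/api` is a placeholder, the helper names are hypothetical, and the query parameters `q`, `sort`, `page` and `size` are assumed to be the ones the endpoint accepts.

```python
from urllib.parse import quote, urlencode


def search_url(base_url, q=None, sort=None, page=1, size=10):
    """Build the URL of a search on the list route '/records/'."""
    params = {"page": page, "size": size}
    if q:
        params["q"] = q
    if sort:
        params["sort"] = sort
    return f"{base_url.rstrip('/')}/records/?{urlencode(params)}"


def item_url(base_url, pid_value):
    """Build the URL of a single record on the item route '/records/<pid_value>'."""
    return f"{base_url.rstrip('/')}/records/{quote(str(pid_value), safe='')}"


BASE = "https://example.org/api"  # placeholder host, not a real instance
print(search_url(BASE, q="miles davis", sort="mostrecent"))
print(item_url(BASE, 42))
```

A client would send these requests with an `Accept: application/json` header, matching the MIME type registered in the serializers.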
Demo
Conclusion
very generic and flexible tool
great open source community (many thanks to CERN)
easy to prototype and develop new applications and features
takes time to master (learning curve)
at the center of RERO’s future developments
Swiss open access research publications repository (SONAR)
new Integrated Library System (ILS) for public libraries (3-year project)
and many more projects...
References
RERO http://www.rero.ch
Invenio http://invenio-software.org/
Invenio Documentation http://invenio.readthedocs.io
Elasticsearch https://www.elastic.co
CERN http://home.cern
JSON-LD http://json-ld.org/
JSON Schema http://json-schema.org/
ZENODO https://zenodo.org/