CrawlerLD - Distributed crawler for linked data

RAPHAEL DO VALE

Summary
Introduction
Until now
Issues
Large Memory Footprint
Graphical Interface

Introduction
How can we recommend linked data sources to a beginner user?
◦ Data sources may not use popular ontologies.
◦ There might be more than one ontology for the same domain.
◦ The user may not know all (if any) of the ontologies.

Introduction
Our solution:
◦ Create a recommender system that receives a small set of generic URI
resources and returns a complete report of related resources (URIs, Datasets
and Ontologies).
◦ Why generic? Because our user is a beginner exploring Linked Data! They do not need
to know about specific datasets or ontologies, only how to get started.
◦ The recommender system would benefit from a Linked Data crawler, based
on metadata.

Introduction
Metadata focused crawler
◦ INPUT:
◦ User should summarize the desired domain with a small set of related terms (URI Resources).
◦ OUTPUT:
◦ The tool returns a list of vocabulary terms, as well as provenance data indicating how the output
was generated.
◦ With these results, the user should evaluate which vocabularies are most
relevant for the triplification or linkage process.
◦ This step could be manual or use another tool (e.g., a recommender system).

Introduction
Our solution:
◦ Executes several SPARQL queries over the whole LOD Cloud (Linked Open Data
Cloud).
◦ For each dataset, applies several queries to discover relationships
between the dataset and the resource being crawled.
◦ A breadth-first algorithm is used to discover more data in cycles.
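The cycle-by-cycle discovery described above can be sketched as a level-bounded breadth-first traversal. This is a minimal illustration, not the CrawlerLD implementation: the class name, the `crawl` signature and the `related` lookup (which stands in for the SPARQL queries sent to each dataset) are assumptions, and the URIs are made up.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;

/** Breadth-first crawl sketch: each cycle (level) expands the resources found in the previous one. */
final class BreadthFirstCrawlSketch {

    /**
     * @param seeds     initial URIs supplied by the user
     * @param maxLevels number of crawling cycles to run
     * @param related   hypothetical lookup standing in for the SPARQL queries run on each dataset
     * @return every URI visited, in discovery order
     */
    static List<String> crawl(List<String> seeds, int maxLevels,
                              Function<String, List<String>> related) {
        Set<String> visited = new LinkedHashSet<>(seeds);
        List<String> frontier = new ArrayList<>(visited);
        for (int level = 0; level < maxLevels && !frontier.isEmpty(); level++) {
            List<String> next = new ArrayList<>();
            for (String uri : frontier) {
                for (String found : related.apply(uri)) {
                    if (visited.add(found)) { // skip resources already seen
                        next.add(found);
                    }
                }
            }
            frontier = next; // the next cycle starts from the newly discovered resources
        }
        return new ArrayList<>(visited);
    }

    public static void main(String[] args) {
        // Toy graph with made-up URIs; a real run would query remote datasets instead.
        Map<String, List<String>> graph = new HashMap<>();
        graph.put("ex:Music", Arrays.asList("ex:Album", "ex:Artist"));
        graph.put("ex:Album", Arrays.asList("ex:Track", "ex:Music"));
        graph.put("ex:Track", Arrays.asList("ex:Lyrics"));
        Function<String, List<String>> lookup =
                uri -> graph.getOrDefault(uri, Collections.<String>emptyList());
        System.out.println(crawl(Arrays.asList("ex:Music"), 2, lookup));
    }
}
```

With two levels, the toy run stops before reaching `ex:Lyrics`, which is three hops from the seed.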

Until now
Simplified workflow:
[Diagram: List of Terms, Processor, Mediator]

Until now
Processors:
◦ Each way of retrieving data from Linked Data is mapped to a processor.
◦ Small pieces of code that can be plugged and unplugged.
◦ Any user can create a new processor.
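One way to picture pluggable processors is a small interface plus a registry. The slides do not show the real CrawlerLD contract, so the `Processor` interface, its two methods and the registry below are hypothetical names chosen for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Sketch of pluggable processors: each one encapsulates a single way of retrieving data. */
final class ProcessorRegistrySketch {

    /** Hypothetical contract; the real CrawlerLD interface is not shown in the slides. */
    interface Processor {
        String name();
        List<String> process(String resourceUri);
    }

    private final List<Processor> processors = new ArrayList<>();

    /** Plugs a processor in. */
    void plug(Processor p) {
        processors.add(p);
    }

    /** Unplugs every processor registered under the given name. */
    void unplug(String name) {
        processors.removeIf(p -> p.name().equals(name));
    }

    /** Runs every plugged processor on the resource and merges their findings. */
    List<String> run(String resourceUri) {
        List<String> results = new ArrayList<>();
        for (Processor p : processors) {
            results.addAll(p.process(resourceUri));
        }
        return results;
    }

    public static void main(String[] args) {
        ProcessorRegistrySketch registry = new ProcessorRegistrySketch();
        registry.plug(new Processor() {
            public String name() { return "echo"; }
            public List<String> process(String uri) {
                return Collections.singletonList(uri + "#seen");
            }
        });
        System.out.println(registry.run("ex:Music")); // one processor contributes
        registry.unplug("echo");
        System.out.println(registry.run("ex:Music")); // nothing plugged, nothing found
    }
}
```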

Until now
Crawling stages.
◦ Challenge: based on generic terms, how can we discover more data?
◦ Answer: using strong relationships (sameAs, subclassOf, seeAlso and
instanceOf).
[Diagram: Schema.org, DBpedia, WordNet, Music Ontology, BBC Music; axis label: More specific]
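As an illustration of following strong relationships, the sketch below builds one SPARQL query per relationship for a given resource. Several points are assumptions rather than the author's queries: the mapping of the slide's names to the standard `owl:sameAs`, `rdfs:subClassOf`, `rdfs:seeAlso` and `rdf:type` predicates (in particular "instanceOf" to `rdf:type`), the choice to match in both directions, and the `LIMIT 100`. URIs are concatenated without escaping, which a real crawler would need to handle.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Builds one SPARQL query per "strong" relationship for a given resource URI. */
final class StrongRelationQueriesSketch {

    // Assumed mapping from the slide's relationship names to standard RDF/RDFS/OWL predicates.
    static final Map<String, String> PREDICATES = new LinkedHashMap<>();
    static {
        PREDICATES.put("sameAs", "http://www.w3.org/2002/07/owl#sameAs");
        PREDICATES.put("subClassOf", "http://www.w3.org/2000/01/rdf-schema#subClassOf");
        PREDICATES.put("seeAlso", "http://www.w3.org/2000/01/rdf-schema#seeAlso");
        PREDICATES.put("instanceOf", "http://www.w3.org/1999/02/22-rdf-syntax-ns#type");
    }

    /** Returns a query selecting resources linked to the URI by the relation, in either direction. */
    static String queryFor(String relation, String resourceUri) {
        String predicate = PREDICATES.get(relation);
        if (predicate == null) {
            throw new IllegalArgumentException("Unknown relation: " + relation);
        }
        return "SELECT DISTINCT ?related WHERE { "
                + "{ <" + resourceUri + "> <" + predicate + "> ?related } UNION "
                + "{ ?related <" + predicate + "> <" + resourceUri + "> } } LIMIT 100";
    }

    public static void main(String[] args) {
        // Prints the four queries for a made-up resource.
        for (String relation : PREDICATES.keySet()) {
            System.out.println(queryFor(relation, "http://example.org/Music"));
        }
    }
}
```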

Issues
Large Memory Footprint
◦ A 2-level task with 20 concurrent threads consumes 40 GB of RAM (!!)
Absence of Graphical Interface
‘Locked code’
◦ Open source on roadmap
Small number of processors

Identifying the issue
[Diagram: Processor, ResultSets; stages labeled Asynchronous and Synchronous]
◦ One request for each dataset
◦ Over 500 distinct datasets
◦ Several processors running at the same time
◦ Each of them with a growing result set
◦ A Jena result set is far from small

Theoretical Solution
[Diagram: Processor, ResultSets; both stages labeled Asynchronous]
◦ One request for each dataset
◦ Over 500 distinct datasets
◦ Several processors running at the same time
◦ The results are processed immediately
◦ Even with bigger result sets, memory stays under control
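The idea of processing each result set as soon as it arrives, instead of holding all of them until every dataset has answered, can be sketched with `CompletableFuture` from the Java standard library. This is not the Akka-based design of the slides. The class, the `countRows` method and the `query` function (a stand-in for a SPARQL call) are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.Function;

/** Consumes each result set on arrival so that only in-flight results stay in memory. */
final class StreamingResultsSketch {

    /**
     * @param datasets dataset identifiers to query
     * @param query    hypothetical stand-in for a SPARQL call returning the rows of one dataset
     * @param pool     executor that bounds how many queries run at once
     * @return total number of rows seen across all datasets
     */
    static long countRows(List<String> datasets,
                          Function<String, List<String>> query,
                          ExecutorService pool) {
        AtomicLong rows = new AtomicLong();
        List<CompletableFuture<Void>> pending = new ArrayList<>();
        for (String dataset : datasets) {
            pending.add(CompletableFuture
                    .supplyAsync(() -> query.apply(dataset), pool)
                    // Reduce the result set to a number right away; the rows are not retained.
                    .thenAccept(resultSet -> rows.addAndGet(resultSet.size())));
        }
        // Wait for completion only; no result set is accumulated here.
        CompletableFuture.allOf(pending.toArray(new CompletableFuture[0])).join();
        return rows.get();
    }

    public static void main(String[] args) {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        try {
            // Fake query: a dataset named with n characters yields n rows.
            long total = countRows(Arrays.asList("a", "bb", "ccc"),
                    d -> Collections.nCopies(d.length(), d), pool);
            System.out.println(total);
        } finally {
            pool.shutdown();
        }
    }
}
```

Because each result set is reduced as soon as its query completes, peak memory in this sketch depends on the pool size rather than on the number of datasets.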

The reactive manifesto
Reactive Systems are
◦ Responsive
◦ The system responds in a timely manner if at all possible
◦ Resilient
◦ The system stays responsive in the face of failure
◦ Elastic
◦ The system stays responsive under varying workload.
◦ Message Driven
◦ Reactive Systems rely on asynchronous message-passing to establish a boundary between
components that ensures loose coupling, isolation, location transparency, and provides the
means to delegate errors as messages
◦ Essentially, reactive systems are event-driven applications where modules
send events (messages) to other modules. Each module should make its
requests to other modules asynchronously.
http://www.reactivemanifesto.org/

Actor model
The actor model in computer science is a mathematical model of
concurrent computation that treats "actors" as the universal primitives of
concurrent computation: in response to a message that it receives, an actor
can make local decisions, create more actors, send more messages, and
determine how to respond to the next message received. The actor model
originated in 1973.[1] It has been used both as a framework for a
theoretical understanding of computation, and as the theoretical basis for
several practical implementations of concurrent systems. The relationship
of the model to other work is discussed in Indeterminacy in concurrent
computation and Actor model and process calculi.
http://en.wikipedia.org/wiki/Actor_model
[1] Carl Hewitt; Peter Bishop; Richard Steiger (1973). "A Universal Modular
Actor Formalism for Artificial Intelligence". IJCAI.
http://pt.slideshare.net/drorbr/the-actor-model-towards-better-concurrency
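To make the definition above concrete, here is a deliberately tiny actor written with the Java standard library only: a mailbox, private state, and one message handled at a time. It is a teaching sketch, not Akka and not a complete actor system (no supervision, no actor creation, no addresses). All names are hypothetical.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.function.Consumer;

/** Minimal actor: a mailbox drained by one thread, so messages are handled one at a time. */
final class MiniActorSketch<M> {
    private final BlockingQueue<M> mailbox = new LinkedBlockingQueue<>();
    private final Thread worker;

    MiniActorSketch(Consumer<M> behavior) {
        worker = new Thread(() -> {
            try {
                while (true) {
                    behavior.accept(mailbox.take()); // strictly sequential processing
                }
            } catch (InterruptedException stop) {
                // interrupted by stop(): leave the loop
            }
        });
        worker.setDaemon(true);
        worker.start();
    }

    /** Asynchronous send: enqueue the message and return immediately. */
    void tell(M message) {
        mailbox.add(message);
    }

    void stop() {
        worker.interrupt();
    }

    /** Demo helper: sends each value to an adder actor and waits until all are handled. */
    static int sumViaActor(int... values) {
        int[] total = {0}; // written only by the actor thread until the latch opens
        CountDownLatch done = new CountDownLatch(values.length);
        MiniActorSketch<Integer> adder =
                new MiniActorSketch<>(n -> { total[0] += n; done.countDown(); });
        for (int v : values) {
            adder.tell(v);
        }
        try {
            done.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        adder.stop();
        return total[0];
    }

    public static void main(String[] args) {
        System.out.println(sumViaActor(1, 2, 3));
    }
}
```

The state in `total` needs no lock because only the actor's own thread touches it while messages are being processed.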

Actor model
http://codermonkey65.blogspot.com.br/2012/09/actors-in-c-with-nact.html

Akka
http://akka.io/
Java or Scala framework for the Actor Model

Akka
Comparison with Java’s thread model
◦ + Simpler
◦ CrawlerLD worked with two thread pools:
◦ One to manage all the system’s algorithm
◦ Another to make calls to datasets
◦ Using the same thread pool could block all threads in IO operations
◦ + No thread blocking
◦ No need to worry about shared resources
◦ Each actor runs at most one task at a time
◦ + Better performance
◦ No blocking
◦ Allows distributed computing
◦ + Better error management
◦ Actor hierarchy allows supervisor actors to manage errors and even repeat the failed tasks
◦ Support for transactions (atomic operations between several actors, even if distributed over several
machines)
◦ + Configuration can change system behavior without code change
◦ Change the number of allocated threads, create thread pools for different actors, distribute over several
machines, or change message priority without touching the code.
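The "repeat the failed tasks" part of error management can be illustrated without Akka by a small retry wrapper. In Akka this role is played by a supervisor strategy. The helper below is only a plain-Java analogy with hypothetical names, and it retries immediately with no backoff.

```java
import java.util.function.Supplier;

/** Plain-Java analogy of a supervisor that re-runs a failed task a bounded number of times. */
final class SupervisedRetrySketch {

    /**
     * Runs the task until it succeeds or maxAttempts is reached.
     * The last failure is rethrown if every attempt fails.
     */
    static <T> T runWithRetries(Supplier<T> task, int maxAttempts) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        RuntimeException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return task.get();
            } catch (RuntimeException e) {
                last = e; // remember the failure and try again
            }
        }
        throw last;
    }

    public static void main(String[] args) {
        int[] calls = {0};
        // Simulated dataset call that fails twice before answering.
        String result = runWithRetries(() -> {
            if (++calls[0] < 3) {
                throw new IllegalStateException("dataset unavailable");
            }
            return "ok after " + calls[0] + " attempts";
        }, 5);
        System.out.println(result);
    }
}
```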

Akka
Comparison with Java’s thread model
◦ - Much harder to learn
◦ New paradigm
◦ - Not native

Results
[Diagram: actor hierarchy and messages]
◦ Actors: CrawlerLDMainActor, LevelActor, ResourceActor
◦ Processors: DereferenceProcessor, NumberOfInstancesProcessor, PropertyQueryProcessor, Processor
◦ Messages: Calculate, CalculateResource, LevelFinished, ResourceProcessedFromLevel, ResourceProcessed

Results
[Diagram: SPARQL query actors, messages and annotations]
◦ Actors and components: Processor, SparqlQuerierMasterActor, SparqlQuerierActor, Jena (modified version)
◦ Modules: CrawlerLD, UtilitiesSemanticWeb
◦ Messages: Calculate, QueryFinishedMessage, SparqlResultset, ProcessSparqlOnDataset
◦ Annotation: blocking calls, managed by another Akka Dispatcher
◦ Annotation: critical message, must be processed immediately
◦ Annotation: one actor for each dataset
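Keeping blocking calls on their own dispatcher can be approximated in plain Java with two executors: one reserved for blocking I/O and one for the rest of the logic. This sketch uses standard-library executors, not Akka dispatchers. The pool names, sizes and the simulated query are assumptions.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;

/** Runs the blocking call on a dedicated pool and the follow-up logic on another. */
final class BlockingDispatcherSketch {

    /** Creates a fixed pool whose daemon threads are named prefix-1, prefix-2, ... */
    static ExecutorService namedPool(String prefix, int size) {
        AtomicInteger ids = new AtomicInteger();
        return Executors.newFixedThreadPool(size, r -> {
            Thread t = new Thread(r, prefix + "-" + ids.incrementAndGet());
            t.setDaemon(true);
            return t;
        });
    }

    /** Returns "ioThreadName|logicThreadName" for one simulated query. */
    static String runQuery(ExecutorService ioPool, ExecutorService logicPool) {
        return CompletableFuture
                .supplyAsync(() -> {
                    sleepQuietly(20); // stands in for a blocking SPARQL call
                    return Thread.currentThread().getName();
                }, ioPool)
                // The continuation is handed to the logic pool, so I/O threads never run it.
                .thenApplyAsync(io -> io + "|" + Thread.currentThread().getName(), logicPool)
                .join();
    }

    private static void sleepQuietly(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        ExecutorService io = namedPool("io", 8);
        ExecutorService logic = namedPool("logic", 2);
        System.out.println(runQuery(io, logic));
        io.shutdown();
        logic.shutdown();
    }
}
```

If every I/O thread is stuck waiting on a slow dataset, the logic pool still has free threads, which is the property the separate dispatcher is meant to give.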

Results
Complete refactor of the code
◦ Better organization
◦ Better understanding
◦ Bugs found and resolved
◦ Almost two months to understand the paradigm, change the code and test
Better performance
◦ Even under heavy workload, the system is always available
◦ Another message to another actor
◦ Distributed code made easy
◦ Each SparqlQuerierActor could run on a separate machine
◦ Not yet implemented / tested
(Much) better memory footprint
◦ A 3-level task ran with at most 1.5 GB of RAM (!!)
◦ The number of levels or any other parameter does not seem to affect the memory
footprint

Graphical Interface
60% completed

Graphical Interface
New actor message to retrieve task status while running
[Diagram: CrawlerLDMainActor with messages Calculate, GetSimplifiedStatus / CrawlerLDSimplifiedStatus, GetFullStatus / CrawlerLDFullStatus]
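A status request can be modeled as just one more message in the same mailbox as the work messages. The sketch below imitates that with a single-threaded executor from the Java standard library. Only the message name GetSimplifiedStatus comes from the slides. The class, its fields and the "processed/total" reply format are hypothetical.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

/** Status queries travel through the same mailbox as work messages and are answered in order. */
final class StatusMessageSketch {
    // A single-threaded executor plays the role of the actor's mailbox.
    private final ExecutorService mailbox = Executors.newSingleThreadExecutor();
    private final int total;
    private int processed; // touched only from the mailbox thread

    StatusMessageSketch(int total) {
        this.total = total;
    }

    /** Work message: marks one resource as processed. */
    void resourceProcessed() {
        mailbox.execute(() -> processed++);
    }

    /** Status request, analogous to GetSimplifiedStatus; the reply arrives asynchronously. */
    CompletableFuture<String> getSimplifiedStatus() {
        return CompletableFuture.supplyAsync(() -> processed + "/" + total, mailbox);
    }

    void shutdown() {
        mailbox.shutdown();
    }

    public static void main(String[] args) {
        StatusMessageSketch task = new StatusMessageSketch(3);
        task.resourceProcessed();
        task.resourceProcessed();
        // The request is queued behind the two work messages, so it sees both of them.
        System.out.println(task.getSimplifiedStatus().join()); // prints 2/3
        task.shutdown();
    }
}
```

Since the caller only waits on a future, a graphical interface can poll for status without blocking the crawl itself.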

Graphical Interface
Allows creation and monitoring of the tasks
Takes advantage of the actor model
Anyone will be able to create new tasks
URL available soon
