CrawlerLD - Distributed Crawler for Linked Data 
RAPHAEL DO VALE
Summary 
Introduction 
Until now=) 
Issues 
Large Memory Footprint 
Graphical Interface
Introduction 
How can we recommend linked data sources to a beginner user? 
◦ Data sources may not use popular ontologies. 
◦ There might be more than one ontology for the same domain. 
◦ The user may not know all (if any) of the ontologies. 
Introduction 
Our solution: 
◦ Create a recommender system that receives a small set of generic URI 
resources and returns a complete report of related resources (URIs, Datasets 
and Ontologies). 
◦ Why generic? Because our user is a beginner exploring Linked Data. They do not have 
to know about specific datasets or ontologies; they only need to know how to get started. 
◦ The recommender system would benefit from a Linked Data crawler, based 
on metadata. 
Introduction 
Metadata focused crawler 
◦ INPUT: 
◦ User should summarize the desired domain with a small set of related terms (URI Resources). 
◦ OUTPUT: 
◦ The tool returns a list of vocabulary terms, as well as provenance data indicating how the output 
was generated. 
◦ With the output results, the user should evaluate the most relevant 
vocabularies for the triplification or linkage process. 
◦ This step could be manual or use another tool (e.g., a recommender system). 
Introduction 
Our solution: 
◦ Executes several SPARQL queries over the whole LOD Cloud (Linked Open Data 
Cloud). 
◦ For each dataset, applies several queries that try to discover relationships 
between the dataset and the resource being crawled. 
◦ A breadth-first algorithm is used to discover more data in cycles. 
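The breadth-first crawl in cycles can be sketched as follows. This is a minimal illustration, not the CrawlerLD implementation: the `crawl` function, the `neighbours` callback and the `TOY_LINKS` graph are hypothetical stand-ins for the SPARQL queries that discover related resources.

```python
def crawl(seeds, neighbours, max_level):
    """Visit resources level by level, starting from the seed terms.

    `neighbours` is a hypothetical callback that returns the resources
    linked to a given resource (in the real tool, SPARQL query results).
    """
    seen = set(seeds)
    frontier = list(seeds)
    levels = [list(frontier)]          # level 0 is the user's input
    for _ in range(max_level):
        next_frontier = []
        for resource in frontier:
            for linked in neighbours(resource):
                if linked not in seen:  # never crawl a resource twice
                    seen.add(linked)
                    next_frontier.append(linked)
        if not next_frontier:
            break                       # nothing new was discovered
        levels.append(next_frontier)
        frontier = next_frontier
    return levels


# Hypothetical toy link graph standing in for real query results.
TOY_LINKS = {
    "schema:MusicGroup": ["dbpedia:Band", "wordnet:band"],
    "dbpedia:Band": ["mo:MusicGroup", "schema:MusicGroup"],
    "wordnet:band": [],
    "mo:MusicGroup": ["bbc:Artist"],
}

levels = crawl(["schema:MusicGroup"], lambda r: TOY_LINKS.get(r, []), max_level=2)
```

Each cycle only expands the resources found in the previous one, so `max_level` bounds how far the crawl moves away from the user's generic terms.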
Until now 
Simplified Workflow: 
[Diagram with the components: List of Terms, Processor, Mediator]
Until now 
Processors: 
◦ Each way of retrieving data from Linked Data is mapped to a processor. 
◦ Processors are small pieces of code that can be plugged in and unplugged. 
◦ Any user can create a new processor. 
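One way to picture pluggable processors is a small registry of functions. The sketch below is an assumption about the general shape, not the tool's actual API: `REGISTRY`, `register`, `unregister` and `run_all` are hypothetical names, and the processor bodies are placeholders rather than real queries.

```python
from typing import Callable, Dict, List

# A processor takes a resource URI and returns what it found about it.
Processor = Callable[[str], List[str]]

REGISTRY: Dict[str, Processor] = {}


def register(name: str):
    """Decorator that plugs a processor into the registry."""
    def decorator(fn: Processor) -> Processor:
        REGISTRY[name] = fn
        return fn
    return decorator


def unregister(name: str) -> None:
    """Unplug a processor; unknown names are ignored."""
    REGISTRY.pop(name, None)


@register("dereference")
def dereference(uri: str) -> List[str]:
    # Placeholder body: a real processor would fetch the URI.
    return [f"dereferenced:{uri}"]


@register("property_query")
def property_query(uri: str) -> List[str]:
    # Placeholder body: a real processor would run a SPARQL query.
    return [f"property-of:{uri}"]


def run_all(uri: str) -> Dict[str, List[str]]:
    """Apply every currently plugged-in processor to one resource."""
    return {name: processor(uri) for name, processor in REGISTRY.items()}
```

Adding a new way of retrieving data then means writing one more decorated function, without touching the code that calls `run_all`.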
Until now 
Crawling stages. 
◦ Challenge: based on generic terms, how can we discover more data? 
◦ Answer: using strong relationships (sameAs, subclassOf, seeAlso and 
instanceOf). 
[Diagram: crawling stages moving towards more specific vocabularies. Labels: Schema.org, DBpedia, WordNet, Music Ontology, BBC Music, "More specific"]
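The strong relationships named on this slide could be turned into SPARQL queries along these lines. This is a sketch under assumptions: it maps the slide's names to the standard predicates owl:sameAs, rdfs:subClassOf, rdfs:seeAlso and rdf:type, and `build_query` is a hypothetical helper, not a CrawlerLD function. It does no escaping of the URI.

```python
# Assumed mapping from the slide's relationship names to standard predicates.
STRONG_PREDICATES = {
    "sameAs": "http://www.w3.org/2002/07/owl#sameAs",
    "subclassOf": "http://www.w3.org/2000/01/rdf-schema#subClassOf",
    "seeAlso": "http://www.w3.org/2000/01/rdf-schema#seeAlso",
    "instanceOf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#type",
}


def build_query(resource_uri: str, relationship: str, limit: int = 100) -> str:
    """Build a SELECT query for resources linked to `resource_uri`."""
    predicate = STRONG_PREDICATES[relationship]
    return (
        "SELECT DISTINCT ?related WHERE { "
        f"<{resource_uri}> <{predicate}> ?related . "
        f"}} LIMIT {limit}"
    )


example_query = build_query("http://dbpedia.org/resource/Jazz", "sameAs", limit=10)
```

The same query text can be sent to every dataset; only the resource and the relationship change between requests.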
Issues 
Large Memory Footprint 
◦ A 2-level task with 20 concurrent threads consumes 40 GB of RAM (!!) 
Absence of Graphical Interface 
‘Locked code’ 
◦ Open source on roadmap 
Small number of processors
LARGE MEMORY FOOTPRINT
Identifying the issue 
[Diagram: Processor and ResultSets, with stages labeled "Asynchronous" and "Synchronous"]
◦ One request for each dataset 
◦ Over 500 distinct datasets 
◦ Several processors running at the same time 
◦ Each of them with an increasing result set 
◦ A Jena result set is far from small
Theoretical Solution 
[Diagram: Processor and ResultSets, with both stages labeled "Asynchronous"]
◦ One request for each dataset 
◦ Over 500 distinct datasets 
◦ Several processors running at the same time 
◦ The results are processed immediately 
◦ Even with bigger result sets, memory stays under control
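The difference between holding every result set and processing results as they arrive can be illustrated with a toy model. This is only a sketch: `fetch_rows` is a hypothetical stand-in for a SPARQL request, and the number of rows held at once is used as a rough proxy for memory.

```python
def fetch_rows(dataset, rows_per_dataset):
    """Hypothetical stand-in for a SPARQL request; yields rows lazily."""
    for i in range(rows_per_dataset):
        yield (dataset, i)


def accumulate_then_process(datasets, rows_per_dataset):
    """Collect every result set first, then process: all rows are held."""
    held = []
    for dataset in datasets:
        held.extend(fetch_rows(dataset, rows_per_dataset))
    peak = len(held)            # everything sits in memory at once
    processed = len(held)
    return processed, peak


def process_as_they_arrive(datasets, rows_per_dataset):
    """Handle each row immediately, so only one row is held at a time."""
    processed = 0
    in_memory = 0
    peak = 0
    for dataset in datasets:
        for _row in fetch_rows(dataset, rows_per_dataset):
            in_memory += 1      # the row has arrived
            peak = max(peak, in_memory)
            processed += 1      # handle it right away
            in_memory -= 1      # and release it
    return processed, peak


datasets = [f"dataset-{n}" for n in range(500)]
eager = accumulate_then_process(datasets, 10)
streaming = process_as_they_arrive(datasets, 10)
```

Both variants process the same 5000 rows, but the first holds all of them at its peak while the second never holds more than one.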
The reactive manifesto 
Reactive Systems are 
◦ Responsive 
◦ The system responds in a timely manner if at all possible 
◦ Resilient 
◦ The system stays responsive in the face of failure 
◦ Elastic 
◦ The system stays responsive under varying workload. 
◦ Message Driven 
◦ Reactive Systems rely on asynchronous message-passing to establish a boundary between 
components that ensures loose coupling, isolation, location transparency, and provides the 
means to delegate errors as messages 
◦ Essentially, reactive systems are event-driven applications where modules 
send events (messages) to other modules. Each module makes its requests 
to other modules asynchronously. 
http://www.reactivemanifesto.org/
Actor model 
The actor model in computer science is a mathematical model of 
concurrent computation that treats "actors" as the universal primitives of 
concurrent computation: in response to a message that it receives, an actor 
can make local decisions, create more actors, send more messages, and 
determine how to respond to the next message received. The actor model 
originated in 1973.[1] It has been used both as a framework for a 
theoretical understanding of computation, and as the theoretical basis for 
several practical implementations of concurrent systems. The relationship 
of the model to other work is discussed in Indeterminacy in concurrent 
computation and Actor model and process calculi. 
http://en.wikipedia.org/wiki/Actor_model 
1 - Carl Hewitt; Peter Bishop; Richard Steiger (1973). "A Universal Modular 
Actor Formalism for Artificial Intelligence". IJCAI. 
http://pt.slideshare.net/drorbr/the-actor-model-towards-better-concurrency
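A toy version of the idea: actors only react to messages, keep their state private, and reply by sending more messages. The sketch below is an assumption-laden simplification. It delivers messages on a single thread from one shared queue so that the result is deterministic, whereas real actor systems give each actor its own mailbox and run actors concurrently. `ActorSystem`, `Counter` and `Collector` are hypothetical names.

```python
from collections import deque


class ActorSystem:
    """Delivers messages one at a time, in the order they were sent."""

    def __init__(self):
        self._pending = deque()   # (actor, message) pairs awaiting delivery

    def send(self, actor, message):
        self._pending.append((actor, message))

    def run(self):
        while self._pending:
            actor, message = self._pending.popleft()
            actor.receive(message)


class Counter:
    """Keeps private state and changes it only in response to messages."""

    def __init__(self, system):
        self.system = system
        self.total = 0

    def receive(self, message):
        kind, payload = message
        if kind == "add":
            self.total += payload
        elif kind == "get":
            # Reply by sending a message to the actor given in the payload.
            self.system.send(payload, ("total", self.total))


class Collector:
    """Records every message it receives."""

    def __init__(self):
        self.seen = []

    def receive(self, message):
        self.seen.append(message)


system = ActorSystem()
counter = Counter(system)
collector = Collector()
system.send(counter, ("add", 2))
system.send(counter, ("add", 3))
system.send(counter, ("get", collector))
system.run()
```

Nothing outside `Counter` touches `total` directly, which is why no locks are needed in this style.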
Actor model 
http://codermonkey65.blogspot.com.br/2012/09/actors-in-c-with-nact.html
Akka 
http://akka.io/ 
Java or Scala framework for the Actor Model
Akka 
Comparison with Java’s thread model 
◦ + Simpler 
◦ CrawlerLD worked with two thread pools: 
◦ One to run the system’s algorithms 
◦ Another to make calls to datasets 
◦ Using the same thread pool could block all threads in IO operations 
◦ + No thread blocking 
◦ No need to worry about shared resources 
◦ Each actor runs at most one task at a time 
◦ + Better performance 
◦ No blocking 
◦ Allows distributed computing 
◦ + Better error management 
◦ Actor hierarchy allows supervisor actors to manage errors and even repeat the failed tasks 
◦ Support for transactions (atomic operations between several actors, even if distributed over several 
machines) 
◦ + Configuration can change system behavior without code change 
◦ Change number of allocated threads, create thread pools for different actors, distribute over several 
machines, change message priority without touching the code.
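The "repeat the failed tasks" idea can be approximated outside Akka with a plain retry loop. This is a loose analogy only, not Akka's supervision API: `supervise` and `FlakyQuery` are hypothetical, and the failure is simulated.

```python
class FlakyQuery:
    """Simulated task that fails a fixed number of times, then succeeds."""

    def __init__(self, failures_before_success):
        self.remaining_failures = failures_before_success

    def __call__(self):
        if self.remaining_failures > 0:
            self.remaining_failures -= 1
            raise ConnectionError("simulated endpoint timeout")
        return "42 rows"


def supervise(task, max_retries):
    """Run `task`, repeating it after a failure up to `max_retries` times.

    Returns the task's result and the number of attempts it took.
    Re-raises the last error once the retries are used up.
    """
    attempts = 0
    while True:
        attempts += 1
        try:
            return task(), attempts
        except ConnectionError:
            if attempts > max_retries:
                raise


result, attempts = supervise(FlakyQuery(failures_before_success=2), max_retries=3)
```

The task itself contains no retry logic; the decision to repeat it lives in the supervising code, which is the separation the slide describes.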
Akka 
Comparison with Java’s thread model 
◦ - Much harder to learn 
◦ New paradigm 
◦ - Not native
Results 
[Diagram: actor hierarchy and messages. Actors, top to bottom: CrawlerLDMainActor, LevelActor, ResourceActor, then the processors (DereferenceProcessor, NumberOfInstancesProcessor, PropertyQueryProcessor, Processor). Messages shown: Calculate, CalculateResource, LevelFinished, ResourceProcessedFromLevel, ResourceProcessed.]
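The message names in this diagram suggest a bookkeeping pattern: a level is finished once every resource in it has reported back. The sketch below borrows the names Calculate, ResourceProcessed and LevelFinished from the diagram, but their fields and the `LevelTracker` class are assumptions made for illustration, and delivery is synchronous rather than actor-based.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Calculate:
    resources: Tuple[str, ...]      # assumed field: resources in this level


@dataclass(frozen=True)
class ResourceProcessed:
    resource: str                   # assumed field: the resource that is done


@dataclass(frozen=True)
class LevelFinished:
    processed: Tuple[str, ...]      # assumed field: completion order


class LevelTracker:
    """Hypothetical stand-in for the bookkeeping a level actor might do."""

    def __init__(self):
        self.pending = set()
        self.done = []

    def receive(self, message) -> Optional[LevelFinished]:
        if isinstance(message, Calculate):
            self.pending = set(message.resources)
            self.done = []
            return None
        if isinstance(message, ResourceProcessed):
            self.pending.discard(message.resource)
            self.done.append(message.resource)
            if not self.pending:
                # Every resource has reported back: the level is complete.
                return LevelFinished(tuple(self.done))
        return None


tracker = LevelTracker()
first = tracker.receive(Calculate(("a", "b")))
second = tracker.receive(ResourceProcessed("b"))
third = tracker.receive(ResourceProcessed("a"))
```

Because the tracker only counts replies, resources can finish in any order without extra coordination.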
Results 
[Diagram: SPARQL querying. Components: Processor, SparqlQuerierMasterActor, SparqlQuerierActor, CrawlerLD, UtilitiesSemanticWeb, Jena (modified version). Messages shown: Calculate, QueryFinishedMessage, ProcessSparqlOnDataset, SparqlResultset. Annotations: "Blocking calls, managed by another Akka Dispatcher"; "Critical message. Must be processed immediately."; "One actor for each dataset".]
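Keeping blocking calls on their own dispatcher has a rough analogue in plain thread pools: one pool for calls that wait on the network, another for the remaining work. The sketch below uses Python's `concurrent.futures` and is not Akka configuration; `blocking_query`, `summarize` and `run` are hypothetical, and the network wait is simulated with a short sleep.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


def blocking_query(dataset):
    """Simulated blocking call; reports which thread executed it."""
    time.sleep(0.01)                              # stands in for network wait
    return dataset, threading.current_thread().name


def summarize(result):
    """Non-blocking follow-up work on a finished query."""
    dataset, io_thread = result
    pool_name = io_thread.rsplit("_", 1)[0]       # drop the worker number
    return f"{dataset} via {pool_name}"


def run(datasets):
    # Blocking calls get their own pool so they cannot starve other work.
    with ThreadPoolExecutor(max_workers=4, thread_name_prefix="blocking-io") as io_pool, \
         ThreadPoolExecutor(max_workers=2, thread_name_prefix="compute") as cpu_pool:
        io_futures = [io_pool.submit(blocking_query, d) for d in datasets]
        cpu_futures = [cpu_pool.submit(summarize, f.result()) for f in io_futures]
        return [f.result() for f in cpu_futures]


results = run(["dbpedia", "bbc-music"])
```

If the blocking pool is saturated by slow endpoints, the `compute` pool still has free threads for everything else.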
Results 
Complete refactor of the code 
◦ Better organization 
◦ Better understanding 
◦ Bugs found and resolved 
◦ Almost two months to understand the paradigm, change the code and test 
Better performance 
◦ Even under heavy workload, the system is always available 
◦ Another message to another actor 
◦ Distributed code made easy 
◦ Each SparqlQuerierActor could run on a separate machine 
◦ Not yet implemented / tested 
(Much) better memory footprint 
◦ A 3-level task ran with at most 1.5 GB of RAM (!!) 
◦ The number of levels or any other parameter does not seem to affect the memory 
footprint
Graphical Interface 
60% completed
Graphical Interface 
New actor message to retrieve task status while running 
[Diagram: CrawlerLDMainActor receives Calculate, GetSimplifiedStatus and GetFullStatus. Replies shown: CrawlerLDSimplifiedStatus, CrawlerLDFullStatus.]
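Status queries of this kind can be modelled as request and reply messages. The sketch below reuses the message names from the diagram, but the fields of the replies and the `MainState` class are assumptions for illustration only; the real messages may carry different data.

```python
from dataclasses import dataclass
from typing import List


class GetSimplifiedStatus:
    """Request for a short summary of a running task."""


class GetFullStatus:
    """Request for the detailed state of a running task."""


@dataclass
class CrawlerLDSimplifiedStatus:
    level: int              # assumed field: current crawl level
    processed: int          # assumed field: number of finished resources
    pending: int            # assumed field: number of waiting resources


@dataclass
class CrawlerLDFullStatus:
    level: int
    processed: List[str]    # assumed field: the finished resources
    pending: List[str]      # assumed field: the waiting resources


class MainState:
    """Hypothetical stand-in for the state a main actor might report."""

    def __init__(self, level, processed, pending):
        self.level = level
        self.processed = list(processed)
        self.pending = list(pending)

    def receive(self, message):
        if isinstance(message, GetSimplifiedStatus):
            return CrawlerLDSimplifiedStatus(
                self.level, len(self.processed), len(self.pending))
        if isinstance(message, GetFullStatus):
            # Copies are returned so callers cannot change internal state.
            return CrawlerLDFullStatus(
                self.level, list(self.processed), list(self.pending))
        raise ValueError(f"unsupported message: {message!r}")


state = MainState(level=2, processed=["a", "b"], pending=["c"])
summary = state.receive(GetSimplifiedStatus())
details = state.receive(GetFullStatus())
```

A graphical interface can poll with the simplified request and ask for the full status only when the user opens a task.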
Graphical Interface 
Allows creation and monitoring of the tasks 
Takes advantage of actor model 
Anyone will be able to create new tasks 
URL available soon
Questions?
