Intro
This memo describes the steps needed to configure and run language resource
processing. It is intended for internal use only.
Architecture overview
Main components
Three main components are involved in language resource processing:
● The Resource Server (hereafter RS) manages information about resources,
their status and associated files.
● The Workflow Server (hereafter WS) processes resource input files into
output files that are loaded into the Virtuoso server. The WS is
implemented using Oozie and Hadoop.
● Processing components provided by DERI and other participants.
Data and Processing Flow
The following diagram shows the communication between the WS and the RS while a
resource is being processed:
The flow:
1. The flow is started by the administrator with an HTTP call to the RS REST API.
The call URL contains the resource ID as a parameter. Example: POST
/resources/48957c5d-456c-4d7a-abc9-3062c91dafdd/processed
2. The first processing step is done by the RS. It downloads the resource input
file and uploads it to the SCP server under the name ${resource_id}.ext
3. The resource server then selects the flow by resource type, sets the flow
properties and starts the flow using the WS API of Oozie.
4. Oozie executes the flow, which contains data-moving steps and the execution of
the resource processing components. The penultimate step in the flow is the
loading of data into the Virtuoso server, which is done by the miniLoader java
action.
5. The last step in the Oozie flow notifies the resource server about the
Virtuoso load status. The resource server then notifies the LRPMA about the
processing status.
Processing set up overview
Processing is configured in the following steps:
1. Definition of the resource type
2. Registration of the resource
3. Definition of the workflow
Processing set up
Definition of the resource type
First, a resource type must be created on the resource server. A resource type is
created with an HTTP POST request, so this can be done either with a command-line
HTTP tool such as curl or with a REST client. The following text includes
screenshots from the Postman REST client for illustration. The request parameters
are also given in tables, because they are easier to read (and to copy and paste).
The HTTP header Content-Type should be set to “application/json”.
The resource server address is http://54.201.101.125:9999. Suppose that it is
necessary to process resources provided by Paradigma ltd. These resources contain
a lexicon, so the result of processing will be one graph.
Request: POST http://54.201.101.125:9999/resourcestypes
Example body:
{
  "id": "paradigma",
  "description": "type intended for processing of resources provided by Paradigma ",
  "graphsSuffixes": ["lexicon"]
}
Example response:
{
  "id": "paradigma"
}
The resource type defines which workflow is used to process the resource, and
the resource type id is used as the name of the HDFS subfolder that holds the
Oozie workflow.
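The same request can also be prepared from a script instead of Postman or curl. The following is only a minimal sketch using the Python standard library; the helper name build_json_post is hypothetical and not part of the RS. The request is constructed but not sent.

```python
import json
import urllib.request

RS_BASE_URL = "http://54.201.101.125:9999"  # resource server address from this memo


def build_json_post(path, payload):
    """Build (but do not send) a JSON POST request for the resource server.

    Hypothetical helper for illustration only.
    """
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        RS_BASE_URL + path,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


resource_type = {
    "id": "paradigma",
    "description": "type intended for processing of resources provided by Paradigma",
    "graphsSuffixes": ["lexicon"],
}
req = build_json_post("/resourcestypes", resource_type)
print(req.get_method(), req.full_url)
# To actually send the request: urllib.request.urlopen(req)
```

Keeping construction and sending separate makes it possible to inspect the URL and body before anything reaches the server.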
Registration of the resource
The language resource must be registered on the resource server. Normally this is
done via the LRPMA, but for test purposes it can be done manually using the
resource server REST API.
Request: POST http://54.201.101.125:9999/resources
Example body:
{
  "id": "48957c5d-456c-4d7a-abc9-3062c91dafE0",
  "resourceType": "paradigma",
  "downloadUri": "scp://ubuntu@54.201.101.125/home/ubuntu/ParadigmaData/hotel_ca_tricks.csv",
  "credentials": "-----BEGIN RSA PRIVATE KEY----- …...",
  "language": "ca",
  "domain": "hotel",
  "provider": "Paradigma ltd",
  "licence": "LRGPL",
  "graphNamesPrefix": "http://www.eurosentiment.com/hotel/ca/lexicon/paradigma/"
}
Example response:
{
  "id": "48957c5d-456c-4d7a-abc9-3062c91dafE0"
}
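When the registration body is assembled by hand, the private key is easy to get wrong, because a JSON string cannot contain raw line breaks. The sketch below lets a JSON library do the escaping. The key text is a placeholder, and it is an assumption that the RS expects the key as a single escaped string in this form.

```python
import json

# Placeholder key text (not a real key); a real PEM key spans several lines.
private_key = (
    "-----BEGIN RSA PRIVATE KEY-----\n"
    "(key material omitted)\n"
    "-----END RSA PRIVATE KEY-----\n"
)

resource = {
    "id": "48957c5d-456c-4d7a-abc9-3062c91dafE0",
    "resourceType": "paradigma",
    "downloadUri": "scp://ubuntu@54.201.101.125/home/ubuntu/ParadigmaData/hotel_ca_tricks.csv",
    "credentials": private_key,
    "language": "ca",
    "domain": "hotel",
    "provider": "Paradigma ltd",
    "licence": "LRGPL",
    "graphNamesPrefix": "http://www.eurosentiment.com/hotel/ca/lexicon/paradigma/",
}

# json.dumps turns each line break in the key into the two characters \n,
# so the resulting body is a single valid JSON document.
body = json.dumps(resource)
print(body)
```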
Definition of Workflow
The processing steps are defined by an XML workflow file that must be copied to
the Hadoop Distributed File System (HDFS), to the location configured in the
resource server configuration file. The flow consists of actions. Every action
defines the next action to run if it succeeds, and the node to go to if it fails.
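Because each action names its successor, a workflow file can be checked before it is copied to HDFS by following the success transitions from the start node to the end node. The function below is a minimal sketch of such a check; ok_path is a hypothetical helper, and the sample workflow is reduced to transitions only, so it is not a complete Oozie definition.

```python
import xml.etree.ElementTree as ET

NS = {"wf": "uri:oozie:workflow:0.3"}  # namespace used by the workflow files in this memo


def ok_path(workflow_xml):
    """Return the node names visited when every action succeeds."""
    root = ET.fromstring(workflow_xml.strip())
    next_on_ok = {
        action.get("name"): action.find("wf:ok", NS).get("to")
        for action in root.findall("wf:action", NS)
    }
    end_name = root.find("wf:end", NS).get("name")
    node = root.find("wf:start", NS).get("to")
    path = [node]
    while node != end_name:
        if node not in next_on_ok:
            raise ValueError("transition to undefined node: " + node)
        node = next_on_ok[node]
        if node in path:
            raise ValueError("cycle detected at node: " + node)
        path.append(node)
    return path


# Reduced sample: real actions also contain an action body such as <java> or <shell>.
SAMPLE = """
<workflow-app xmlns="uri:oozie:workflow:0.3" name="demo">
  <start to="first"/>
  <action name="first"><ok to="second"/><error to="fail"/></action>
  <action name="second"><ok to="end"/><error to="fail"/></action>
  <kill name="fail"><message>failed</message></kill>
  <end name="end"/>
</workflow-app>
"""
print(ok_path(SAMPLE))
```

A misspelled target in an ok element shows up here as a ValueError instead of as a failed Oozie submission.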
Properties populated by the resource server are used in the workflow definition
XML files.
Properties of flows populated by the Resource Server:
Properties calculated or retrieved from the resource properties:
Property                        Description
rsresourceid                    id of the resource
rsgraphprefix                   prefix for graphs, please see the miniLoader java action
                                description below
rsgraphsufix0,                  graph suffixes, one for each file produced by the flow
[rsgraphsufix1]...
rsdomain                        domain of the processed resource
rslanguage                      language of the processed resource
rsprovider                      provider
rslicense                       license
oozie.wf.application.path       ${hdfs-folder-uri}/${resourceTypeId}
                                hdfs-folder-uri is specified in conf.properties of the RS,
                                resourceTypeId is a property of the resource on the RS
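To make the table concrete, the sketch below derives these properties from the example resource. It is only an illustration of the mapping, not the RS implementation: the function name, the mapping from the JSON field names to the property names, and the hdfs-folder-uri value are all assumptions.

```python
def flow_properties(resource, graph_suffixes, hdfs_folder_uri):
    """Derive flow properties from a resource record (assumed mapping)."""
    props = {
        "rsresourceid": resource["id"],
        "rsgraphprefix": resource["graphNamesPrefix"],
        "rsdomain": resource["domain"],
        "rslanguage": resource["language"],
        "rsprovider": resource["provider"],
        "rslicense": resource["licence"],
        "oozie.wf.application.path": hdfs_folder_uri + "/" + resource["resourceType"],
    }
    # One suffix property per graph: rsgraphsufix0, rsgraphsufix1, ...
    for index, suffix in enumerate(graph_suffixes):
        props["rsgraphsufix%d" % index] = suffix
    return props


resource = {
    "id": "48957c5d-456c-4d7a-abc9-3062c91dafE0",
    "resourceType": "paradigma",
    "language": "ca",
    "domain": "hotel",
    "provider": "Paradigma ltd",
    "licence": "LRGPL",
    "graphNamesPrefix": "http://www.eurosentiment.com/hotel/ca/lexicon/paradigma/",
}
props = flow_properties(
    resource,
    graph_suffixes=["lexicon"],  # from the "paradigma" resource type
    hdfs_folder_uri="hdfs://namenode:8020/user/ubuntu/nuig-flows",  # hypothetical value
)
# The workflow builds the graph name as ${rsgraphprefix}${rsgraphsufix0}.
graph_name = props["rsgraphprefix"] + props["rsgraphsufix0"]
print(props["oozie.wf.application.path"])
print(graph_name)
```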
The resource server also copies the properties from its configuration file
conf/job.properties to the flow properties. This can be used for properties that
are common to all flows, such as:
Property                        Description
nameNode                        HDFS name node address
jobTracker                      MapReduce job tracker address
queueName                       MapReduce job queue name
user.name                       user used to run the Oozie flow
inputfolder                     folder where downloaded resource files are stored
rspfilesdir                     folder for processed files
rsvirtuosoloadfolder            absolute path to the folder where files for loading are
                                stored
rsvirtuosohost                  hostname or address of the Virtuoso server
rsvirtuosojdbcport              JDBC port
rsvirtuosojdbcuser              JDBC user
rsvirtuosojdbcpasswd            JDBC password
rsprocessedurl                  URL to which the result of the Virtuoso load is sent
Configuring Actions
Workflows usually contain the following sequence:
◦ Move the data to a place where it can be reached by the first processing
component
◦ Processing by the first component
◦ Move the data to a place where it can be reached by the second processing
component
◦ Processing by the second component
◦ …
◦ Load into the Virtuoso triple store
Moving the resource file to the processing components
The following snippet shows an example configuration of the first step in the
flow, which copies the resource file to a folder where it can be picked up by a
processing component.
<workflow-app xmlns="uri:oozie:workflow:0.3" name="deri-workflow">
  <start to="move-resource-file"/>
  <action name="move-resource-file" retry-max="2" retry-interval="1">
    <sshWithRetry xmlns="uri:oozie:sshWithRetry-action:0.1">
      <host>ubuntu@ptwf</host>
      <command>${moveScriptPath} -onlyCopy ${inputfolder}${rsresourceid}* ubuntu@ptnuig:/home/ubuntu/data/${rsresourceid}.csv</command>
      <capture-output/>
    </sshWithRetry>
    <ok to="lemon-marl-generator"/>
    <error to="fail"/>
  </action>
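Oozie substitutes the ${...} properties itself when the flow runs. To preview what a command will look like, plain property references can be expanded locally, as in the sketch below. It covers only simple ${name} properties, not EL functions such as wf:actionData, and the values for moveScriptPath and inputfolder are hypothetical.

```python
from string import Template

# The command from the move-resource-file action, written on one logical line.
command = Template(
    "${moveScriptPath} -onlyCopy ${inputfolder}${rsresourceid}* "
    "ubuntu@ptnuig:/home/ubuntu/data/${rsresourceid}.csv"
)

values = {
    "moveScriptPath": "/home/ubuntu/bin/move.sh",  # hypothetical path
    "inputfolder": "/home/ubuntu/input/",          # hypothetical; note the trailing slash
    "rsresourceid": "48957c5d-456c-4d7a-abc9-3062c91dafE0",
}

# substitute() raises KeyError if a referenced property is missing.
expanded = command.substitute(values)
print(expanded)
```

Since ${inputfolder} and ${rsresourceid} are concatenated without a separator, the folder value has to end with a slash for the resulting path to be valid.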
Configuring processing
The following XML snippet shows an example of processing by the Lemon Marl
generator.
<action name="lemon-marl-generator" retry-max="3" retry-interval="1">
  <sshWithRetry xmlns="uri:oozie:sshWithRetry-action:0.1">
    <host>ubuntu@ptnuig</host>
    <command>~/bin/runLemonMarlGeneratorParadigma.sh /home/ubuntu/data/${rsresourceid}.csv /home/ubuntu/data/outputs/${rsresourceid}.ttl ${rsdomain} ${rslanguage} ${rsgraphprefix}${rsgraphsufix0}</command>
    <capture-output/>
  </sshWithRetry>
  <ok to="move-file2virtuoso"/>
  <error to="fail"/>
</action>
Moving data to Virtuoso Server
The following XML snippet shows an action that moves the output of the previous
step to the Virtuoso server.
<action name="move-file2virtuoso" retry-max="2" retry-interval="1">
  <sshWithRetry xmlns="uri:oozie:sshWithRetry-action:0.1">
    <host>ubuntu@ptnuig</host>
    <command>${moveScriptPath} /home/ubuntu/data/outputs/${rsresourceid}.ttl ${virtuosoUser}@${rsvirtuosohost}:${rsvirtuosoloadfolder}${rsresourceid}.ttl</command>
    <capture-output/>
  </sshWithRetry>
  <ok to="load2virtuoso"/>
  <error to="fail"/>
</action>
Load data to the Virtuoso Server
The following XML snippet shows an example configuration of the miniLoader
component, which loads the processed resource files into the Virtuoso server.
<action name="load2virtuoso" retry-max="2" retry-interval="10">
  <java>
    <job-tracker>${jobTracker}</job-tracker>
    <name-node>${nameNode}</name-node>
    <configuration>
      <property>
        <name>mapred.job.queue.name</name>
        <value>${queueName}</value>
      </property>
    </configuration>
    <main-class>com.sindice.miniloader.Miniloader</main-class>
    <arg>${rsvirtuosohost}</arg>
    <arg>${rsvirtuosojdbcport}</arg>
    <arg>${rsvirtuosojdbcuser}</arg>
    <arg>${rsvirtuosojdbcpasswd}</arg>
    <arg>${rsvirtuosoloadfolder}${rsresourceid}.ttl</arg>
    <arg>${rsgraphprefix}${rsgraphsufix0}</arg>
    <capture-output/>
  </java>
  <ok to="notify_rs" />
  <error to="fail" />
</action>
Notifying the resource server
The last step notifies the RS that the data was loaded into the Virtuoso server.
<action name="notify_rs">
  <shell xmlns="uri:oozie:shell-action:0.1">
    <job-tracker>${jobTracker}</job-tracker>
    <name-node>${nameNode}</name-node>
    <exec>curl</exec>
    <argument>-H</argument>
    <argument>Content-Type:application/json</argument>
    <argument>-X</argument>
    <argument>POST</argument>
    <argument>-d</argument>
    <argument>${wf:actionData('load2virtuoso')['miniloader_json4rs']}</argument>
    <argument>${rsprocessedurl}${rsresourceid}/processed</argument>
  </shell>
  <ok to="end" />
  <error to="fail" />
</action>
Copy the configuration to the HDFS
The property “hdfs-folder-uri” in the RS configuration file conf.properties
defines the path where the configuration should be stored.
The resource type ID (paradigma) is part of the HDFS path, so it is first
necessary to check whether the folder exists.
If the folder for the given resource type does not exist yet, it must be created.
Next, copy the workflow and the required jars. In this case only the miniloader
jar is required, and it should be copied to the lib subfolder.
hadoop fs -put workflow.xml /user/ubuntu/nuig-flows/paradigma/
hadoop fs -put ~/virtuoso-miniloader-0.0.1-SNAPSHOT.jar /user/ubuntu/nuig-flows/paradigma/lib
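When several resource types have to be deployed, the command list can be generated instead of typed. The following sketch only builds and prints the commands; it does not run them. The function name and the local jar path are hypothetical, and the -p option of hadoop fs -mkdir assumes Hadoop 2 or later.

```python
def deploy_commands(hdfs_root, resource_type_id, workflow_file, jars):
    """Build the hadoop fs commands that deploy one workflow (not executed here)."""
    target = hdfs_root.rstrip("/") + "/" + resource_type_id
    commands = [
        # Create the workflow folder and its lib subfolder if they are missing.
        ["hadoop", "fs", "-mkdir", "-p", target + "/lib"],
        ["hadoop", "fs", "-put", workflow_file, target + "/"],
    ]
    for jar in jars:
        commands.append(["hadoop", "fs", "-put", jar, target + "/lib"])
    return commands


cmds = deploy_commands(
    "/user/ubuntu/nuig-flows",
    "paradigma",
    "workflow.xml",
    ["/home/ubuntu/virtuoso-miniloader-0.0.1-SNAPSHOT.jar"],  # hypothetical local path
)
for cmd in cmds:
    print(" ".join(cmd))
```

Each entry is an argument list, so it could be passed to subprocess.run(cmd, check=True) on a machine where the hadoop client is installed.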
Processing Resources
Processing is started by an HTTP POST request with an empty body to the RS.
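As a minimal sketch, the start request can be built as follows. The URL pattern /resources/{id}/processed is the one from the architecture overview; the function name is hypothetical, and the request is constructed but not sent.

```python
import urllib.request

RS_BASE_URL = "http://54.201.101.125:9999"  # resource server address from this memo


def build_start_request(resource_id):
    """Build the POST request that starts processing of one resource."""
    url = "%s/resources/%s/processed" % (RS_BASE_URL, resource_id)
    # data=b"" gives the empty body that the RS expects for this call.
    return urllib.request.Request(url, data=b"", method="POST")


start_req = build_start_request("48957c5d-456c-4d7a-abc9-3062c91dafE0")
print(start_req.get_method(), start_req.full_url)
# To actually start processing: urllib.request.urlopen(start_req)
```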
The status of the processing can be monitored in the Oozie web console:
Clicking the line of the running job opens the detail window.
When processing has finished, all steps should have the status OK.
When a resource has been processed successfully, a SPARQL request can be made to
verify the content.
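One simple check is to count the triples in the graph that the flow loaded. The sketch below builds such a query as a GET URL according to the SPARQL protocol. The endpoint address is hypothetical (host name and port depend on the Virtuoso installation), and the graph name assumes the prefix and suffix of the Paradigma example.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def graph_count_url(sparql_endpoint, graph_uri):
    """Return a GET URL for a query that counts the triples in one graph."""
    query = (
        "SELECT (COUNT(*) AS ?triples) "
        "WHERE { GRAPH <%s> { ?s ?p ?o } }" % graph_uri
    )
    return sparql_endpoint + "?" + urlencode({"query": query})


url = graph_count_url(
    "http://virtuoso-host:8890/sparql",  # hypothetical endpoint address
    "http://www.eurosentiment.com/hotel/ca/lexicon/paradigma/lexicon",
)
print(url)

# Decode the URL again to show the query that the server would receive.
sent_query = parse_qs(urlsplit(url).query)["query"][0]
print(sent_query)
```

A count greater than zero indicates that the load wrote something into the graph; inspecting individual triples needs a more specific query.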
Appendix A: Example of a complete flow definition
<workflow-app xmlns="uri:oozie:workflow:0.3" name="deri-workflow">
  <start to="move-resource-file"/>
  <action name="move-resource-file" retry-max="2" retry-interval="1">
    <sshWithRetry xmlns="uri:oozie:sshWithRetry-action:0.1">
      <host>ubuntu@ptwf</host>
      <command>${moveScriptPath} -onlyCopy ${inputfolder}${rsresourceid}* ubuntu@ptnuig:/home/ubuntu/data/${rsresourceid}.csv</command>
      <capture-output/>
    </sshWithRetry>
    <ok to="lemon-marl-generator"/>
    <error to="fail"/>
  </action>
  <action name="lemon-marl-generator" retry-max="3" retry-interval="1">
    <sshWithRetry xmlns="uri:oozie:sshWithRetry-action:0.1">
      <host>ubuntu@ptnuig</host>
      <command>~/bin/runLemonMarlGeneratorParadigma.sh /home/ubuntu/data/${rsresourceid}.csv /home/ubuntu/data/outputs/${rsresourceid}.ttl ${rsdomain} ${rslanguage} ${rsgraphprefix}${rsgraphsufix0}</command>
      <capture-output/>
    </sshWithRetry>
    <ok to="move-file2virtuoso"/>
    <error to="fail"/>
  </action>
  <action name="move-file2virtuoso" retry-max="2" retry-interval="1">
    <sshWithRetry xmlns="uri:oozie:sshWithRetry-action:0.1">
      <host>ubuntu@ptnuig</host>
      <command>${moveScriptPath} /home/ubuntu/data/outputs/${rsresourceid}.ttl ${virtuosoUser}@${rsvirtuosohost}:${rsvirtuosoloadfolder}${rsresourceid}.ttl</command>
      <capture-output/>
    </sshWithRetry>
    <ok to="load2virtuoso"/>
    <error to="fail"/>
  </action>
  <action name="load2virtuoso" retry-max="2" retry-interval="10">
    <java>
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <configuration>
        <property>
          <name>mapred.job.queue.name</name>
          <value>${queueName}</value>
        </property>
      </configuration>
      <main-class>com.sindice.miniloader.Miniloader</main-class>
      <arg>${rsvirtuosohost}</arg>
      <arg>${rsvirtuosojdbcport}</arg>
      <arg>${rsvirtuosojdbcuser}</arg>
      <arg>${rsvirtuosojdbcpasswd}</arg>
      <arg>${rsvirtuosoloadfolder}${rsresourceid}.ttl</arg>
      <arg>${rsgraphprefix}${rsgraphsufix0}</arg>
      <capture-output/>
    </java>
    <ok to="notify_rs" />
    <error to="fail" />
  </action>
  <action name="notify_rs">
    <shell xmlns="uri:oozie:shell-action:0.1">
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <exec>curl</exec>
      <argument>-H</argument>
      <argument>Content-Type:application/json</argument>
      <argument>-X</argument>
      <argument>POST</argument>
      <argument>-d</argument>
      <argument>${wf:actionData('load2virtuoso')['miniloader_json4rs']}</argument>
      <argument>${rsprocessedurl}${rsresourceid}/processed</argument>
    </shell>
    <ok to="dir4processed_file" />
    <error to="fail" />
  </action>
  <action name="dir4processed_file">
    <shell xmlns="uri:oozie:shell-action:0.1">
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <exec>mkdir</exec>
      <argument>${rspfilesdir}/${rsresourceid}</argument>
    </shell>
    <ok to="move_processed_file" />
    <error to="fail" />
  </action>
  <action name="move_processed_file">
    <shell xmlns="uri:oozie:shell-action:0.1">
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <exec>mv</exec>
      <argument>${rsvirtuosoloadfolder}${rsresourceid}.ttl</argument>
      <argument>${rspfilesdir}/${rsresourceid}</argument>
    </shell>
    <ok to="end" />
    <error to="fail" />
  </action>
  <kill name="fail">
    <message>SSH action failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
  </kill>
  <end name="end"/>
</workflow-app>

Language Resource Processing Configuration and Run
