[Unclear] words are denoted in brackets
Webinar: Tracking Research Data Footprints via
Integration with Research Graph
1 March 2018
Video & slides available from ANDS website
START OF TRANSCRIPT
Facilitator: Good afternoon everyone, thanks for coming to the webinar today. We
have a talk today on tracking the footprint of research data across
infrastructures, using the Research Graph API. The speakers today are
Doctor Ben Evans, Associate Director of NCI, and Doctor Jingbo Wang,
who's a Collection Manager at NCI. So with that introduction, I'll hand
over to Ben to start the talk.
Ben Evans: So we're going to be talking about work that's going on to help track
research data and how it's used in a broader setting. I should mention
that NCI's got a lot of partners who have been backing this and working
with us on it, including from NCRIS, the Bureau of Meteorology,
Geoscience Australia, CSIRO, the ANU and a host of other partners and
collaborators, including ANDS in particular.
So some of the open, motivating questions, beyond just getting data
management in place, are these. Say you publish data and datasets: how
is the research community actually connecting with that data? After
you've put it into a public arena they could be connecting with it in
various ways and making [use] of it, so how do you track that? Also,
how do you track the impact of that investment in research data on
other derived products downstream? That's a challenging question that
we can't answer fully from [inside] a single centre; you're really in
an international world. That motivated us a lot to work on this
particular project, which is part of the solution.
So I should say something about the standing of this work. This piece
of infrastructure that we'll be going through, Research Graph, started
with a fairly small partnership. But now it's grown quite a bit and RDA,
the Research Data Alliance, have picked it up as this Registry
Interoperability Working Group. It's got a number of players. You can
see some of the players who've been strongly supporting this work
over a period of time listed there, and you can follow that link on the
RD Alliance website to track this. Furthermore, now really through
Amir's good work and others, the European Commission have picked
this up and said, yes, this needs to now be pushed into an ICT
specification. So all that is to say that this work is now on a pretty
strong pathway and well worth paying attention to as it goes
forward.
So there are four types of what we call nodes in this graph network
when you're publishing data and using data: one is the researcher,
one's the dataset, one's the publication and one's the grant. There
could be other nodes as well, but at the moment the whole graph is
basically built up from those four fundamental types. When we get down
to it inside of the tool, you can see the attributes through that
graphic on the right-hand side. Researchers are always in green,
datasets are in orange, publications are in blue and grants are in
yellow. You can see some of the attributes that are listed there and
we'll talk about that.
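
To make the four node types concrete, here is a minimal Python sketch. It
is an illustration only, not Research Graph's actual schema: the class
names, the key format and the example titles are all assumptions, and
only the four types and their display colours come from the talk.

```python
from dataclasses import dataclass, field
from enum import Enum


class NodeType(Enum):
    """The four node types named in the talk, with their display colours."""
    RESEARCHER = "green"
    DATASET = "orange"
    PUBLICATION = "blue"
    GRANT = "yellow"


@dataclass(frozen=True)
class Node:
    key: str          # hypothetical unique identifier
    kind: NodeType
    title: str = ""


@dataclass
class Graph:
    nodes: dict = field(default_factory=dict)  # key -> Node
    edges: set = field(default_factory=set)    # undirected: frozenset of two keys

    def add_node(self, node: Node) -> None:
        self.nodes[node.key] = node

    def relate(self, a: str, b: str) -> None:
        # Refuse to create an edge to a node that has not been added yet.
        if a not in self.nodes or b not in self.nodes:
            raise KeyError("both endpoints must be added before relating them")
        self.edges.add(frozenset((a, b)))


# Invented example data: one researcher linked to one dataset.
graph = Graph()
graph.add_node(Node("researcher:1", NodeType.RESEARCHER, "Example Researcher"))
graph.add_node(Node("dataset:1", NodeType.DATASET, "Example Dataset"))
graph.relate("researcher:1", "dataset:1")
```

Keeping the colour on the node type, rather than on each node, means a
viewer can style every node consistently from its type alone.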
The other thing is that this graph network that's been built up
understands very well-known metadata standards like ISO 19115, which
is for geospatial data; a lot of geospatial data fits into that. It
also understands things like RIF-CS, which is used in the library
world and inside of Research Data Australia, if you know that
catalogue, and MARC 21, and there are others as well. So just to say
that this graph system is already supporting those standards.
For NCI, we make a number of major national reference datasets
available on NCI. We've curated them and put them into a certain
form. They come principally from the science agencies, such as the
Bureau of Meteorology and Geoscience Australia and so forth, and
sometimes from our research community itself. But they're classified
as the major national reference collections that are associated with
NCI. You can see some of the things listed there: climate, weather,
satellite imagery, bathymetry, elevation, all of these earth systems
and geospatial data in particular.
As an example of a dataset, we've got this thing called the Bluelink
ReANalysis dataset. On the left-hand side it gives you a summary of
what it is. Many people are familiar with and work with catalogue
systems, and we're using GeoNetwork as part of our core catalogue
system; that's on the right-hand side. So you get the title, that's
the blue one you can see circled on the right-hand side, and an
abstract about it. You can see points of contact. This is all part of
the ISO 19115 standard; that's how all of this is recorded, including
how to get hold of the data.
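
As a rough illustration of what such a catalogue record holds, the
sketch below pulls the title, abstract and points of contact out of a
small XML record. It assumes the record uses the ISO 19139 XML encoding
of ISO 19115 (the gmd and gco namespaces); the sample record is
invented and heavily trimmed, and this is not GeoNetwork's or Research
Graph's own harvesting code.

```python
import xml.etree.ElementTree as ET

# Namespaces of the ISO 19139 XML encoding (assumed for this sketch).
NS = {
    "gmd": "http://www.isotc211.org/2005/gmd",
    "gco": "http://www.isotc211.org/2005/gco",
}

# Invented, heavily trimmed record for illustration only.
SAMPLE_RECORD = """\
<gmd:MD_Metadata xmlns:gmd="http://www.isotc211.org/2005/gmd"
                 xmlns:gco="http://www.isotc211.org/2005/gco">
  <gmd:identificationInfo>
    <gmd:MD_DataIdentification>
      <gmd:citation>
        <gmd:CI_Citation>
          <gmd:title>
            <gco:CharacterString>Example Reanalysis Dataset</gco:CharacterString>
          </gmd:title>
        </gmd:CI_Citation>
      </gmd:citation>
      <gmd:abstract>
        <gco:CharacterString>An invented abstract.</gco:CharacterString>
      </gmd:abstract>
      <gmd:pointOfContact>
        <gmd:CI_ResponsibleParty>
          <gmd:individualName>
            <gco:CharacterString>Example Researcher</gco:CharacterString>
          </gmd:individualName>
        </gmd:CI_ResponsibleParty>
      </gmd:pointOfContact>
    </gmd:MD_DataIdentification>
  </gmd:identificationInfo>
</gmd:MD_Metadata>
"""

IDENT = "gmd:identificationInfo/gmd:MD_DataIdentification"


def text_of(root, path):
    """Return the stripped text at `path`, or None if it is absent or empty."""
    element = root.find(path, NS)
    if element is None or not element.text:
        return None
    return element.text.strip()


def summarise(xml_text):
    """Extract title, abstract and contact names from one metadata record."""
    root = ET.fromstring(xml_text)
    contact_path = (
        f"{IDENT}/gmd:pointOfContact/gmd:CI_ResponsibleParty"
        "/gmd:individualName/gco:CharacterString"
    )
    return {
        "title": text_of(
            root, f"{IDENT}/gmd:citation/gmd:CI_Citation/gmd:title/gco:CharacterString"
        ),
        "abstract": text_of(root, f"{IDENT}/gmd:abstract/gco:CharacterString"),
        "contacts": [el.text.strip() for el in root.findall(contact_path, NS) if el.text],
    }


record = summarise(SAMPLE_RECORD)
```

The dataset title and the contact names are exactly the pieces that
become a dataset node and researcher nodes in the graph.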
So the question that you've got from something like this is which
researchers are working on that or related datasets, how they're
publishing, and whether there's anything else connected to it. So you
end up with this little graph of stuff. Just down on the bottom
right-hand side here, just off this basic diagram, you can see [Peter
Oak], who's the main contact for that dataset, is associated with this
BRAN - Bluelink ReANalysis - dataset. So he's associated with that even
from our local information. You can find out a little bit more about
Peter. We have other information systems that have got Peter's
details: what project he's working on, publications linked to him, his
contact details and a pretty picture there of Peter looking very
sprightly.
So we have that information in NCI. On the left-hand side, inside the
dotted line with the NCI logo, we know a fair bit about Peter; that's
the number one in green. There he is as a researcher, with an identity
and attributes inside of our local information. We know various things
about datasets that Peter is associated with. But there are other
things that live outside of NCI. In particular, on the right-hand side
there, out in the external world, Peter Oak has what's called an ORCID
ID, as many of you will know. Associated with his ORCID ID, we know
things about his publication record.
So the trick for all of this is to try and associate our internal
information with the external information. There are a number of steps
that we go through here. Number one, let's have the information
recorded inside of a little graph that we'll go through in a second.
Then we can augment the graph by connecting it up with the ORCID ID.
Then we can find out further information, in particular about other
external records like his publication record.
So, almost redescribing the same [step], in a fundamental way what we
do is this. We've got a GeoNetwork catalogue with a lot of this
information. Via the utilities in the Research Graph system, we
harvest that and put it into Neo4j, which is a type of graph database,
just the one that we happen to be using for this. That Neo4j is just
hosted inside of the cloud. It has our information; it's just a
recasting of the local information, put inside of this system. Then
what we do is go out into the broader Research Graph in the outside
world and augment the local graph database with that extra
information.
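
The augmentation step can be sketched in a few lines of Python. This is
a simplified illustration under stated assumptions, not the Research
Graph implementation: plain dictionaries stand in for the Neo4j
database, the function name `augment` is hypothetical, and the ORCID
and DOI values are invented placeholders. The idea it shows is the one
described above: find external records that share an ORCID ID with a
local researcher, then pull in their direct connections.

```python
def augment(local_nodes, local_edges, external_nodes, external_edges):
    """Return an augmented copy of the local graph (first level only).

    Nodes are dicts keyed by identifier; researcher nodes may carry an
    "orcid" field, which is used to match local and external records.
    """
    nodes = dict(local_nodes)
    edges = set(local_edges)

    # Index local researchers by ORCID so external records can be matched.
    local_by_orcid = {
        node["orcid"]: key for key, node in local_nodes.items() if node.get("orcid")
    }

    for ext_key, ext_node in external_nodes.items():
        orcid = ext_node.get("orcid")
        if orcid not in local_by_orcid:
            continue
        # Same person: bring in the external record and link it to ours.
        nodes[ext_key] = ext_node
        edges.add((local_by_orcid[orcid], ext_key))
        # Pull in everything directly connected to the external record.
        for a, b in external_edges:
            if ext_key in (a, b):
                other = b if a == ext_key else a
                nodes[other] = external_nodes[other]
                edges.add((a, b))
    return nodes, edges


# Invented placeholder data.
local_nodes = {
    "nci:researcher/1": {"type": "researcher", "orcid": "0000-0000-0000-0001"},
    "nci:dataset/1": {"type": "dataset", "title": "Example Dataset"},
}
local_edges = {("nci:researcher/1", "nci:dataset/1")}
external_nodes = {
    "orcid:0000-0000-0000-0001": {"type": "researcher", "orcid": "0000-0000-0000-0001"},
    "doi:10.0000/example": {"type": "publication", "title": "Example paper"},
    "orcid:0000-0000-0000-0002": {"type": "researcher", "orcid": "0000-0000-0000-0002"},
    "doi:10.0000/other": {"type": "publication", "title": "Unrelated paper"},
}
external_edges = {
    ("orcid:0000-0000-0000-0001", "doi:10.0000/example"),
    ("orcid:0000-0000-0000-0002", "doi:10.0000/other"),
}

nodes, edges = augment(local_nodes, local_edges, external_nodes, external_edges)
```

Only the publication reachable through the matching ORCID ID is added;
the unrelated researcher and paper stay outside the local graph.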
Then we can visualise it in various ways. That's what this image
shows, and there is a graphical tool that comes along with this, to
start seeing a whole bunch of connected things to do with this data
that can start to be exploited. If we just had the local information
about the various datasets, then all we would have is the left-hand
side of this. Through that extra augmentation, going and querying the
international Research Graph and then augmenting the local data, we
end up with a much richer set of information about each of the
individual datasets and researchers, what they're doing and their
associations. So that's pretty simply what's going on.
The Research Graph system that's been put in place by the partners,
with Amir in particular driving this, interoperates with a whole bunch
of different services: ORCID, DataCite, Scholix has come on board, and
other major data centres like [ASIS] and so on and so forth. So
there's a list there, and a growing list, of information being put into
an interoperable graph system. So now there are richer and deeper
details that we can start harvesting. What we did, as described on the
previous page, is the simplest augmentation. But, actually, you can
run several levels of augmentation, and we're still, I guess, trying to
explore what's the best way of augmenting the data for the questions
that we're trying to answer.
So, look, I'm going to hand over now to Jingbo who's going to take us
a little bit more through some of the details of Research Graph and
where it's going.
Jingbo Wang: Thank you, Ben. Hi. From this point I want to go through a
couple of slides, in the next 10 minutes or so, to demonstrate how we
implement the Research Graph [pipeline]. I'll also report on what we
are currently working on, plus some future plans. This slide shows you
what the input is and what the output is. The input is NCI's metadata
database. As you saw in the previous slides by Ben, our datasets are
available in GeoNetwork in various formats - it could be CSV or XML or
JSON. They are the input, so the Jenkins server takes that input from
the [data hub] and builds the NCI graph. So the output will be the NCI
graph.
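
As a small illustration of the input-to-graph step, the sketch below
turns a CSV export into nodes and edges. It is not the Jenkins job
itself: the column names, the key format and the sample rows are all
assumptions made for the example.

```python
import csv
import io

# Invented CSV export; the column names are assumptions for this sketch.
CSV_INPUT = """\
dataset_id,title,contact
ds-1,Example Reanalysis Dataset,Example Researcher
ds-2,Example Elevation Dataset,Example Researcher
ds-3,Example Imagery Dataset,Another Researcher
"""


def build_graph(csv_text):
    """Build dataset and researcher nodes, plus the edges between them."""
    nodes, edges = {}, []
    for row in csv.DictReader(io.StringIO(csv_text)):
        dataset_key = f"dataset:{row['dataset_id']}"
        researcher_key = "researcher:" + row["contact"].lower().replace(" ", "-")
        nodes[dataset_key] = {"type": "dataset", "title": row["title"]}
        # A contact who appears on several datasets becomes a single node.
        nodes.setdefault(researcher_key, {"type": "researcher", "name": row["contact"]})
        edges.append((researcher_key, dataset_key))
    return nodes, edges


nodes, edges = build_graph(CSV_INPUT)
```

Matching researchers by name, as done here, is fragile; a persistent
identifier such as an ORCID ID is a more reliable key when the source
metadata carries one.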
On the right-hand side, the bottom screenshot just shows you how easy
it is to maintain and update the database with only one click of a
button. The five different modules, in green, show you the steps
inside of the Jenkins server to build the NCI graph and also to augment
it with other databases such as [ORCID]. So what we get eventually is
an NCI graph in [GraphML]. There are different ways to visualise the
graph. One way, which is not presented here, is to use the [Gephi]
software. But a more popular way is to present our graph in a web-based
format.
So if you click that link or type this link in your browser, you can
actually see this online. I'm going to show you three screenshots from
this webpage, followed by a little live demo. Basically, this is the
interesting part: once we get the graph, we're going to analyse it and
try to tell the story from the graph. The first screenshot just gives
you an overview of how many publications are in our augmented graph,
how many datasets and how many researchers. I'm going to run a little
live demo to repeat the story that Ben told you about Peter Oak. If you
type this: researchgraph.org/NCI.
Jingbo Wang: Alright, in the web browser you can see a webpage about NCI's
graph. Click that orange button and it'll open a new tab to show the
graph. This is what the actual graph looks like. If I find Peter Oak as
a researcher and click on him, it only shows the connections of this
researcher. By the colour code, this dot is the dataset, which is the
Bluelink ReANalysis data associated with Peter Oak. If you notice,
there is another green dot over here, and this is the augmented part
from ORCID. The blue dot represents the publication associated with
this researcher. So this really demonstrates that, through the
augmentation, our own database, with its datasets and researchers, is
connected to the rest of the world.
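
The "show only this researcher's connections" behaviour in the demo can
be approximated with a depth-limited neighbourhood search. The sketch
below is an assumption about how such a view could work, not the code
behind the web page; the node keys are invented, and whether the
publication hangs off the local researcher or the ORCID record is a
guess.

```python
from collections import defaultdict


def within(edges, start, depth):
    """Return all nodes reachable from `start` in at most `depth` steps."""
    adjacency = defaultdict(set)
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)

    seen = {start}
    frontier = {start}
    for _ in range(depth):
        # Step outwards by one edge, ignoring nodes already visited.
        frontier = {n for node in frontier for n in adjacency[node]} - seen
        seen |= frontier
    return seen


# Invented example: r1 is a local researcher, o1 the matching ORCID record.
edges = [("r1", "d1"), ("r1", "o1"), ("o1", "p1"), ("r2", "d2")]

direct = within(edges, "r1", 1)     # dataset and ORCID record
augmented = within(edges, "r1", 2)  # also the publication found via ORCID
```

Raising `depth` widens the view one hop at a time, so the same function
can show either the local picture or the augmented one.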
Let me go back to my presentation again. I should say that we did play
around with different analytics, and this is the most interesting
part. We demonstrate a few cases that we think people are interested
in. For example, which researcher has the most publications, where the
researcher is always identified by the ORCID ID. Also, which researcher
has the most datasets associated with him and with his affiliation. On
the right-hand side, if you are still in the web browser, you can put
your mouse onto some of the names. It will show only the connections
between this researcher and other researchers. So it's more like an
interactive mode.
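
The two analytics mentioned here, most publications and most datasets
per researcher, come down to counting edges by node type. A minimal
sketch, with invented data and a hypothetical function name:

```python
from collections import Counter


def rank_researchers(nodes, edges, target_type):
    """Rank researchers by how many nodes of `target_type` they connect to."""
    counts = Counter()
    for a, b in edges:
        # Edges are undirected, so check both orientations.
        for researcher, other in ((a, b), (b, a)):
            if (
                nodes[researcher]["type"] == "researcher"
                and nodes[other]["type"] == target_type
            ):
                counts[researcher] += 1
    return counts.most_common()


# Invented example graph.
nodes = {
    "r1": {"type": "researcher"},
    "r2": {"type": "researcher"},
    "p1": {"type": "publication"},
    "p2": {"type": "publication"},
    "p3": {"type": "publication"},
    "d1": {"type": "dataset"},
    "d2": {"type": "dataset"},
    "d3": {"type": "dataset"},
}
edges = [
    ("r1", "p1"), ("r1", "p2"), ("d1", "r1"),
    ("r2", "p3"), ("r2", "d2"), ("r2", "d3"),
]

by_publications = rank_researchers(nodes, edges, "publication")
by_datasets = rank_researchers(nodes, edges, "dataset")
```

Here `r1` leads on publications and `r2` on datasets, which is the kind
of contrast the analytics page is meant to surface.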
I should also say that this augmentation is still work in progress. It
means that we can augment with other databases, such as DataCite or
other European data repositories, and we can make our graph bigger and
bigger. The last screenshot just shows the number of publications by
year. As I said, this is not a static graph, because we can always
augment with other databases and bring in more publications if they
are not in the ORCID database. Behind the scenes we use a Jupyter
Notebook to generate this interactive web format. We plan to play
around more, maybe by providing predefined queries, so that people can
put in a person's name or ORCID and find out the connections between
this researcher and the publications and datasets and, in the future,
even the grants if they're available in our database.
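
Both ideas in this passage, publications counted by year and a
predefined query keyed on a researcher's ORCID, are simple to express
over a node and edge list. The sketch below is hypothetical: the
function names, field names and the ORCID value are invented for
illustration and do not describe the notebook NCI uses.

```python
from collections import Counter


def publications_per_year(nodes):
    """Count publication nodes by their "year" field, in year order."""
    years = Counter(
        node["year"]
        for node in nodes.values()
        if node["type"] == "publication" and "year" in node
    )
    return dict(sorted(years.items()))


def connections_for(orcid, nodes, edges):
    """Group everything linked to the researcher with this ORCID by node type."""
    researcher_keys = {
        key
        for key, node in nodes.items()
        if node["type"] == "researcher" and node.get("orcid") == orcid
    }
    grouped = {"publication": [], "dataset": [], "grant": []}
    for a, b in edges:
        for source, target in ((a, b), (b, a)):
            if source in researcher_keys and nodes[target]["type"] in grouped:
                grouped[nodes[target]["type"]].append(target)
    return {kind: sorted(keys) for kind, keys in grouped.items()}


# Invented example graph.
nodes = {
    "r1": {"type": "researcher", "orcid": "0000-0000-0000-0001"},
    "r2": {"type": "researcher", "orcid": "0000-0000-0000-0002"},
    "p1": {"type": "publication", "year": 2016},
    "p2": {"type": "publication", "year": 2017},
    "p3": {"type": "publication", "year": 2017},
    "d1": {"type": "dataset"},
}
edges = [("r1", "p1"), ("r1", "p2"), ("r1", "d1"), ("r2", "p3")]

per_year = publications_per_year(nodes)
linked = connections_for("0000-0000-0000-0001", nodes, edges)
```

The empty `grant` list in the result reflects the point made in the
talk: grants can be reported once they are present in the database.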
Next, we think that Research Graph can be useful for a number of
different groups of people. We also think providing Research Graph in
a linked-data format would be beneficial for people who want to work
with a more machine-searchable and actionable approach. So what we've
done is a bit of proof-of-concept work, extending our current Research
Graph format from JSON to JSON-LD and using schema.org to enhance the
semantic features of the Research Graph. We have a publication from
last year talking about the approach and the ideas; the reference is
at the bottom of the slide.
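
To show what moving from plain JSON to JSON-LD with schema.org can look
like, here is a minimal sketch. The mapping is an assumption for
illustration and is not the one described in the publication mentioned
above; the input field names, URL and ORCID value are invented, while
`Dataset`, `Person`, `name`, `description` and `creator` are standard
schema.org terms.

```python
import json


def to_json_ld(dataset, creator):
    """Express one dataset and its creator as a schema.org JSON-LD document."""
    return {
        "@context": "https://schema.org",
        "@type": "Dataset",
        "@id": dataset["url"],
        "name": dataset["title"],
        "description": dataset["abstract"],
        "creator": {
            "@type": "Person",
            # An ORCID URI gives the person a globally resolvable identifier.
            "@id": f"https://orcid.org/{creator['orcid']}",
            "name": creator["name"],
        },
    }


# Invented placeholder values.
doc = to_json_ld(
    {
        "url": "https://example.org/dataset/1",
        "title": "Example Reanalysis Dataset",
        "abstract": "An invented abstract.",
    },
    {"orcid": "0000-0000-0000-0001", "name": "Example Researcher"},
)
serialised = json.dumps(doc, indent=2)
```

Because the identifiers are URIs rather than local keys, another system
can merge this record with its own statements about the same dataset or
person.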
The other thing is, once we build the Research Graph there is a lot of
interesting analysis that we can do. So we are currently exploring new
ways of analysing the information in the Research Graph and trying to
pick out the good stories about what Research Graph can tell us. The
other thing is, because we are a national data repository, we actually
encourage people to do cross-disciplinary research based on our
high-performance platform. If we can demonstrate the value of
[cross-system] and cross-disciplinary research, by showing that when
different types of datasets are available on the same platform there
is more research, more publications and more funding granted, it will
be quite a good way to demonstrate the impact of our data management
practice.
So in summary, I think Research Graph means a couple of things for
different groups of users. For example, the users of the data
repository themselves can understand the dynamic research integration
through these analytics. I remember that when some researchers submit
an ARC grant, they sometimes show their publication citations getting
better and better year by year. But with the Research Graph they can
actually show more information: not just publications but also their
contribution of datasets and their awards of additional funding.
For the higher-level executives and board, as a data repository we can
demonstrate the value of our good data management practice and provide
the interoperability of the data services through these more advanced
services. We also advance science research by having more publications
and more impact in the metrics. Finally, for the funding bodies, since
they have invested a good amount of money in the data repository, we
can demonstrate the impact of that investment by showing quantitative
analysis of the impact metrics within the research community.
So if you want to learn more about the graph, we have the GitHub
source code and we also have the interactive demo of the graph, and
there is Twitter also if you wanted to socialise it. I think that's it.
Facilitator: Okay, thanks Jingbo. I'd like to thank Ben and Jingbo for giving this
talk and thank you, everyone, for attending the webinar. Thank you.
END OF TRANSCRIPT

More Related Content

PDF
Learning Multilingual Semantics from Big Data on the Web
PPT
Data, data, data
PDF
Web Ontologies: Lessons Learned from Conceptual Modeling at Scale
PDF
Information Extraction from Web-Scale N-Gram Data
PPTX
Open Data - a goldmine (JavaZone 2009)
PPTX
Slow-cooked data and APIs in the world of Big Data: the view from a city per...
PPTX
LD4 Wikidata Affinity Group - Shorthouse
PDF
Tracking research data footprints - slides
Learning Multilingual Semantics from Big Data on the Web
Data, data, data
Web Ontologies: Lessons Learned from Conceptual Modeling at Scale
Information Extraction from Web-Scale N-Gram Data
Open Data - a goldmine (JavaZone 2009)
Slow-cooked data and APIs in the world of Big Data: the view from a city per...
LD4 Wikidata Affinity Group - Shorthouse
Tracking research data footprints - slides

Similar to Transcript - Tracking Research Data Footprints via Integration with Research Graph (20)

PPTX
Metadata and Linked Data. Where is it all going?
PDF
Learning from past infrastructure to embrace friction and create the Research...
PPT
Aggregation as tactic sm new
PPT
Aggregation as Tactic
PPT
Going for GOLD - Adventures in Open Linked Geospatial Metadata
PPT
Introduction to RAGLD
PDF
ALIAOnline Practical Linked (Open) Data for Libraries, Archives & Museums
PDF
Research Data Alliance Plenary 9: DDRI Working Group Session
PDF
20120718 linkedopendataandnextgenerationsciencemcguinnessesip final
PDF
Stories of “Glocality"—Nations in a Global Infrastructure
PDF
Beyond Meta-Data: Nano-Publications Recording Scientific Endeavour
DOC
Notes for talk on 12th June 2013 to Open Innovation meeting, Glasgow
PPTX
The Information Workbench - Linked Data and Semantic Wikis in the Enterprise
PDF
Provenance and Reuse of Open Data (PILOD 2.0 June 2014)
PDF
20120419 linkedopendataandteamsciencemcguinnesschicago
PDF
Information Visualization for Social Network Analysis,
PDF
"Plans are worthless, but planning is essential"
PDF
Keynote: Mark Parsons - Plans are Useless, But Planning is Essential
PDF
En un mundo hiperconectado, las bases de datos de grafos son tu arma secreta
ODP
Ontology based semantics and graphical notation as directed graphs
Metadata and Linked Data. Where is it all going?
Learning from past infrastructure to embrace friction and create the Research...
Aggregation as tactic sm new
Aggregation as Tactic
Going for GOLD - Adventures in Open Linked Geospatial Metadata
Introduction to RAGLD
ALIAOnline Practical Linked (Open) Data for Libraries, Archives & Museums
Research Data Alliance Plenary 9: DDRI Working Group Session
20120718 linkedopendataandnextgenerationsciencemcguinnessesip final
Stories of “Glocality"—Nations in a Global Infrastructure
Beyond Meta-Data: Nano-Publications Recording Scientific Endeavour
Notes for talk on 12th June 2013 to Open Innovation meeting, Glasgow
The Information Workbench - Linked Data and Semantic Wikis in the Enterprise
Provenance and Reuse of Open Data (PILOD 2.0 June 2014)
20120419 linkedopendataandteamsciencemcguinnesschicago
Information Visualization for Social Network Analysis,
"Plans are worthless, but planning is essential"
Keynote: Mark Parsons - Plans are Useless, But Planning is Essential
En un mundo hiperconectado, las bases de datos de grafos son tu arma secreta
Ontology based semantics and graphical notation as directed graphs
Ad

More from ARDC (20)

PPTX
Introduction to ADA
PPTX
Architecture and Standards
PPTX
Data Sharing and Release Legislation
PPT
Australian Dementia Network (ADNet)
PPTX
Investigator-initiated clinical trials: a community perspective
PPTX
NCRIS and the health domain
PPTX
International perspective for sharing publicly funded medical research data
PPTX
Clinical trials data sharing
PPTX
Clinical trials and cohort studies
PPTX
Introduction to vision and scope
PPTX
FAIR for the future: embracing all things data
PDF
ARDC 2018 state engagements - Nov-Dec 2018 - Slides - Ian Duncan
PDF
Skilling-up-in-research-data-management-20181128
PDF
Research data management and sharing of medical data
PPTX
Findable, Accessible, Interoperable and Reusable (FAIR) data
PPTX
Applying FAIR principles to linked datasets: Opportunities and Challenges
PDF
How to make your data count webinar, 26 Nov 2018
PDF
Ready, Set, Go! Join the Top 10 FAIR Data Things Global Sprint
PDF
How FAIR is your data? Copyright, licensing and reuse of data
PDF
Peter neish DMPs BoF eResearch 2018
Introduction to ADA
Architecture and Standards
Data Sharing and Release Legislation
Australian Dementia Network (ADNet)
Investigator-initiated clinical trials: a community perspective
NCRIS and the health domain
International perspective for sharing publicly funded medical research data
Clinical trials data sharing
Clinical trials and cohort studies
Introduction to vision and scope
FAIR for the future: embracing all things data
ARDC 2018 state engagements - Nov-Dec 2018 - Slides - Ian Duncan
Skilling-up-in-research-data-management-20181128
Research data management and sharing of medical data
Findable, Accessible, Interoperable and Reusable (FAIR) data
Applying FAIR principles to linked datasets: Opportunities and Challenges
How to make your data count webinar, 26 Nov 2018
Ready, Set, Go! Join the Top 10 FAIR Data Things Global Sprint
How FAIR is your data? Copyright, licensing and reuse of data
Peter neish DMPs BoF eResearch 2018
Ad

Recently uploaded (20)

PDF
Module 4: Burden of Disease Tutorial Slides S2 2025
PDF
Classroom Observation Tools for Teachers
PPTX
Pharma ospi slides which help in ospi learning
PPTX
1st Inaugural Professorial Lecture held on 19th February 2020 (Governance and...
PDF
TR - Agricultural Crops Production NC III.pdf
PDF
Anesthesia in Laparoscopic Surgery in India
PDF
2.FourierTransform-ShortQuestionswithAnswers.pdf
PDF
Insiders guide to clinical Medicine.pdf
PDF
Chapter 2 Heredity, Prenatal Development, and Birth.pdf
PPTX
Final Presentation General Medicine 03-08-2024.pptx
PPTX
Renaissance Architecture: A Journey from Faith to Humanism
PPTX
PPT- ENG7_QUARTER1_LESSON1_WEEK1. IMAGERY -DESCRIPTIONS pptx.pptx
PPTX
GDM (1) (1).pptx small presentation for students
PPTX
Microbial diseases, their pathogenesis and prophylaxis
PPTX
Cell Structure & Organelles in detailed.
PDF
Abdominal Access Techniques with Prof. Dr. R K Mishra
PDF
Pre independence Education in Inndia.pdf
PDF
O7-L3 Supply Chain Operations - ICLT Program
PDF
grade 11-chemistry_fetena_net_5883.pdf teacher guide for all student
PDF
Complications of Minimal Access Surgery at WLH
Module 4: Burden of Disease Tutorial Slides S2 2025
Classroom Observation Tools for Teachers
Pharma ospi slides which help in ospi learning
1st Inaugural Professorial Lecture held on 19th February 2020 (Governance and...
TR - Agricultural Crops Production NC III.pdf
Anesthesia in Laparoscopic Surgery in India
2.FourierTransform-ShortQuestionswithAnswers.pdf
Insiders guide to clinical Medicine.pdf
Chapter 2 Heredity, Prenatal Development, and Birth.pdf
Final Presentation General Medicine 03-08-2024.pptx
Renaissance Architecture: A Journey from Faith to Humanism
PPT- ENG7_QUARTER1_LESSON1_WEEK1. IMAGERY -DESCRIPTIONS pptx.pptx
GDM (1) (1).pptx small presentation for students
Microbial diseases, their pathogenesis and prophylaxis
Cell Structure & Organelles in detailed.
Abdominal Access Techniques with Prof. Dr. R K Mishra
Pre independence Education in Inndia.pdf
O7-L3 Supply Chain Operations - ICLT Program
grade 11-chemistry_fetena_net_5883.pdf teacher guide for all student
Complications of Minimal Access Surgery at WLH

Transcript - Tracking Research Data Footprints via Integration with Research Graph

  • 1. [Unclear] words are denoted in brackets Webinar: Tracking Research Data Footprints via Integration with Research Graph 1 March 2018 Video & slides available from ANDS website START OF TRANSCRIPT Facilitator: Good afternoon everyone, thanks for coming to the webinar today. We have a talk today on the topic of tracking the footprint of research data across infrastructures, using the Research Graph API. The speakers today are Doctor Ben Evans from NCI, Associate Director of NCI, and Doctor Jingbo Wang who's a Collection Manager in NCI. So with that introduction, I'll actually hand over the talk to Ben for starting the talk. Ben Evans: So we're going to be talking about work that's going on to help track research data and how it's used in a broader setting. I should mention, NCI's got a lot of partners as a part of this that have been backing and worked with us in this, including from NCRIS and Bureau of Meteorology, Geoscience Australia, CSIRO, the ANU and a host of other partners and collaborators, including ANDS in particular, for this work. So some of the open questions, motivating questions, beyond just getting data management in place is - so say you publish data and datasets, is how is the research community actually connecting with that data? After you've put it into a public arena they could be connecting with it in various ways and making [use], so how do you track that? Also, how do you track the impact of that investment of that
  • 2. Page 2 of 8 research data for other derived products downstream? So that's a challenging question that we can't answer fully with inside] of a single centre; you're really into an international world. That motivated us a lot to be working on this particular project which has part of the solution. So I should say that the standing of this work and this piece of infrastructure that we'll be going through on Research Graph started with a fairly small partnership. But now it's grown quite a bit and RDA, Research Data Alliance, have picked it up as this Registry Interoperability Working Group. It's got a number of players. You can see some of the players who've been strongly supporting this work over a period of time listed there and you can follow that link on RD Alliance website to track this. But, furthermore, now really through Amir's good work and others, the European Commission have picked this up and said, yes, this needs to now be pushed into an ICT specification. So all that is to say that this work is now on a pretty strong pathway and well worth paying attention to now as it goes forward. So there's four types of what we call nodes in this graph network when you're publishing data and using data. So one is the researcher, one's the dataset, one's the publication, one's grants. There could be other nodes as well, but the status of these whole graphs at the moment is basically built up of those fundamental areas. When we get down to it inside of the tool, you can see the attributes through that graphic on the right-hand side. Research is always in green and datasets are in orange, publication is blue and grants in yellow. You can see some of the attributes that are listed there and we'll talk about that. The other thing is that this graph network that's been built up understands very well-known metadata standards like ISO 19115-4; that's geospatial data, a lot of geospatial data fits into that. 
But also things like RIF-CS that's used in the librarian world, and inside of Research Data Australia - if you know that catalogue - uses RIF-CS, and MARC 21 and there are others as well. So just to say that this graph system is already supporting that framework.
  • 3. Page 3 of 8 For NCI, we make a number of major national reference datasets available on NCI. We've curated them and put them into a certain form. They come, in principle, from a lot of the science agencies, being Bureau of Meteorology and Geoscience Australia and so forth, also sometimes from our research community itself. But they're being classified as really the major national reference collections that are associated with NCI. You can see some of the things listed there, climate, weather and satellite imagery, bathymetry, elevation, all of these earth systems, geospatial data in particular. As an example of a dataset now is - so we've got this thing called Bluelink ReANalysis dataset. On the left-hand side it gives you a summary of what it is. On the right-hand side many people are familiar and work with catalogue systems, so we're using GeoNetwork as part of our core catalogue system. So you get the title, so that's the blue - you can see on the right-hand side it's circled there and an abstract about it. You can see points of contact. So this is all part of this ISO 19115 standard, that's how all of this is recorded, how to get hold of that data. So the question that you've got off something like this is what researchers are working on that, or related datasets, how they're publishing, is there anything else connected to it. So you end up with this little graph of stuff. Just down on the bottom right-hand side here, just off this basic diagram here, you can see [Peter Oak], who's the main contact for that dataset, is somehow associated with this BRAN - Bluelink ReANalysis - dataset. So they're somehow associated with that even off our local information. So you can find out a little bit more about Peter. We have other information systems that have got Peter's details, so what project he's working on, publications somehow linked to him, his contact detail and a pretty picture there of Peter looking very spritely. So we have that information in NCI. 
So on the left-hand side, in this dotted line that you can see with the NCI logo around it, we know a fair bit about Peter, that's the number one with the green, there he is.
  • 4. Page 4 of 8 There he is with his - as a researcher and an identity and attributes inside of our local information. We know various things about datasets that Peter is associated with. But there's other things that live outside of NIC. In particular, on the right-hand side there, you can say out in the real world, or out in the external world, Peter Oak has what's called an ORCID ID, and many of you know this. Inside of - associated with his ORCID ID we know things about his publication record. So the trick for all of this stuff is to try and associate our internal information to the external information. There's a number of steps that we go through here. Number one, let's have the information recorded inside of a little graph that we'll go through in a second. Then we can augment the graph with how it gets connected up with the ORCID ID. Then we can find out further information, in particular about other external records like his publication record. So almost redescribing this same [step] is, in a fundamental way what we do is we've got a GeoNetwork catalogue with a lot of this information; that is via the utilities in the Research Graph system. Harvest that and puts it into a Neo4j, which is a type of a graph database, just the one that we happen to be using for this. That Neo4j is just hosted inside of the cloud. That has our information, it's just a recasting of the local information and put inside of this system. Then what we do is go out into a broader Research Graph on the outside world, and we augment then the local graph database with that extra information. Then we can visualise it in various ways. So that's what this image - and there is a graphical tool that comes along with this, to start seeing a whole bunch of connected things to do with this data that can start to be exploited. So if we just had the local information of various datasets, then all we would have is the left-hand side of this. 
Through that extra augmentation, going and querying in the international Research Graph and then augmenting for the local data, we end up with a much richer set of information about what each of the individual
  • 5. Page 5 of 8 datasets and researchers and what they're doing and their associations. So that's pretty simply what's going on. The Research Graph system that's been put in place really by the partners, and particularly Amir driving this, interoperates with a whole bunch of different services; ORCID, DataCite, Skolix has come on board, and other major datacentres like [ASIS] and so on and so forth. So there's a list there, and a growing list, of information being put into an interoperable graph system. So now there's richer and deeper details that we can start harvesting. There's actually - we did the simplest augmentation, is the description on this previous page. But, actually, you can run several levels of augmentation and we're still I guess trying to explore what's the best way of augmenting the data of the questions that we're trying to face. So, look, I'm going to hand over now to Jingbo who's going to take us a little bit more through some of the details of Research Graph and where it's going. Jingbo Wang: Thank you, Ben. Hi, from this point of time I wanted to go through a couple of slides, in the next 10 minutes or so, to demonstrate how we implement the Research Graph [pack line]. Also, report what are we currently working on, plus some future plans going forward. So in this slide, it shows you what is the input and what is the output. The input is NCI's metadata database. As you see in the previous slides by Ben, our dataset available in GeoNetwork in various formats - it could be CSV or XML or JSON - they are the input so that Jenkins server take that input from the [data hub] and build the NCI graph. So the output will be NCI graph. On the right-hand side, the bottom screenshot just shows you how easy to maintain and update the database with only one click of the button. 
The five different modules, in green, show you the steps inside the Jenkins server to build the NCI graph and also the augmentation with other databases such as a geo - [ORCID]. So what we get eventually is an NCI graph [ML]. There are different ways to visualise the graph. One way, which is not presented here, is to use the [GAVI] software to visualise it. But a more popular way is to present our graph in a web-based format. So if you click that link or type this link in your browser, you can actually see this online. I'm going to show you three screenshots of this webpage, followed by a little live demo afterwards.
Basically, this is the interesting part: once we get the graph, we're going to analyse it and try to tell the story from the graph. The first screenshot just gives you an overview of how many publications, how many datasets and how many researchers are in our augmented graph. I'm going to run a little live demo to repeat the story that Ben told you about Peter Oak. If you type this, researchgraph.org/NCI.
Jingbo Wang: Alright, in the web browser you can see a webpage about NCI's graph. Click that orange button and it'll open a new tab to show the graph. This is what the actual graph looks like. If I find Peter Oak as a researcher and click that one, it only shows the connections with this researcher. The colour code of the dots is that this is the dataset, which is the Bluelink ReANalysis data associated with Peter Oak. If you notice, there is another green dot over here and this is the augmented part from ORCID. The blue dot represents the publication associated with this researcher. So this really demonstrates that, through the augmentation, the datasets and researchers in our own database are connected to the rest of the world.
Let me go back to my presentation again. I should say that we did play around with different analytics, and this is the most interesting part. We demonstrate a few cases that we think people are interested in. For example, what is the largest number of publications related to a researcher, where the researcher is always identified by the ORCID iD. Also, which researcher has the most datasets associated with him, with his affiliation.
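The two analytics just mentioned amount to counting links per researcher. A minimal sketch, assuming the graph is available as a flat list of (ORCID iD, linked node, node type) tuples; the edge list, the identifiers and the function `top_researcher` are hypothetical and are not taken from the NCI notebook.

```python
from collections import Counter

# Hypothetical edge list: (researcher ORCID iD, linked node key, linked node type).
EDGES = [
    ("0000-0000-0000-0001", "10.0000/example.1", "publication"),
    ("0000-0000-0000-0001", "10.0000/example.2", "publication"),
    ("0000-0000-0000-0001", "ds-1", "dataset"),
    ("0000-0000-0000-0002", "10.0000/example.3", "publication"),
    ("0000-0000-0000-0002", "ds-2", "dataset"),
    ("0000-0000-0000-0002", "ds-3", "dataset"),
]


def top_researcher(edges, node_type):
    """Return (orcid, count) for the researcher with the most links of node_type."""
    counts = Counter(orcid for orcid, _, kind in edges if kind == node_type)
    return counts.most_common(1)[0]


print("Most publications:", top_researcher(EDGES, "publication"))
print("Most datasets:", top_researcher(EDGES, "dataset"))
```

Keying the counts on the ORCID iD rather than on a name avoids merging two researchers who happen to share a name.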
On the right-hand side, if you are still with the web browser, you can actually put your mouse onto some of the names. It will show only the connections between this researcher and other researchers. So it's more like an interactive mode. I should also say that this augmentation is still a work in progress. It means that we can augment with other databases, such as DataCite or other European data repositories, and we can actually make our graph bigger and bigger.
The last screenshot is just showing the number of publications by year. As I said, this is not a static graph, because we can always augment with other databases and we can introduce more publications if they are not in the ORCID database. Behind the scenes we use Jupyter Notebook to generate this interactive web format. We plan to play around more by providing maybe predefined queries, so that people can put in the person's name or ORCID and find out what the connection is between this researcher and the publications and the datasets and, in the future, even the grants if they're available in our database.
So next, we think that Research Graph can be useful for a number of different groups of people. We also think providing Research Graph in a linked-data format would be beneficial for people who want to work with a more machine-searchable and actionable approach. So what we've done is a bit of proof-of-concept work, extending our current format of the Research Graph from JSON to JSON-LD, using schema.org to enhance the semantic features of the Research Graph. We had a publication last year talking about the approach and the ideas, so the reference is at the bottom of the slide.
The other thing is, once we build the Research Graph there is a lot of interesting analysis that we can do. So we are currently exploring new ways of analysing the information in the Research Graph and trying to pick up the good stories about what Research Graph can tell us.
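To make the JSON to JSON-LD step mentioned above concrete, here is a minimal sketch of how a dataset record might be mapped to schema.org terms. It is not the mapping from the publication referred to in the talk: the input record layout, the identifiers and the function `to_jsonld` are hypothetical, and only the schema.org types `Dataset` and `Person` and the properties `name` and `creator` are real vocabulary.

```python
import json

# Hypothetical plain-JSON record, standing in for one dataset node in the graph.
record = {
    "key": "https://example.org/dataset/ds-1",
    "title": "Example ocean reanalysis",
    "researchers": [
        {"orcid": "0000-0000-0000-0001", "name": "Researcher A"},
    ],
}


def to_jsonld(rec):
    """Wrap a dataset record in a schema.org JSON-LD document."""
    return {
        "@context": "https://schema.org",
        "@type": "Dataset",
        "@id": rec["key"],
        "name": rec["title"],
        "creator": [
            {
                "@type": "Person",
                "@id": "https://orcid.org/" + person["orcid"],
                "name": person["name"],
            }
            for person in rec["researchers"]
        ],
    }


doc = to_jsonld(record)
print(json.dumps(doc, indent=2))
```

Using the ORCID URL as the `@id` of each person gives every researcher a globally resolvable identifier, which is what makes the result linkable from outside the repository.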
The other thing is, because we are the national data repository, we actually encourage people to do cross-disciplinary research based on our high-performance platform. If we can demonstrate the value of [cross-system] and disciplinary research, by showing that when different types of datasets are available on the same platform, more research is done, more publications come out and more funding is granted, it will be quite good for demonstrating the impact of our data management practice.
So in summary, I think Research Graph means a couple of things for different groups of users. For example, users of the data repository themselves can understand the dynamic research integration through these analytics. I remember when some researchers submit an ARC grant, they sometimes show their publication citations increasing over the years. But with the Research Graph they can actually show more information: not just publications but also their contribution of datasets and their awards of additional funding.
For the higher-level executive and board, as a data repository we can demonstrate the value of our good data management practice and provide interoperability of the data services through these more advanced services. We also advance scientific research by having more publications and more impact in the metrics. Finally, for the funding bodies, since they invested a good amount of money in the data repository, we can demonstrate the impact of that investment by showing quantitative analysis of the impact metrics within the research community.
So if you want to learn more about the graph, we have the GitHub source code and we also have the interactive demo of the graph, and there is Twitter also if you want to socialise it. I think that's it.
Facilitator: Okay, thanks Jingbo. I'd like to thank Ben and Jingbo for giving this talk and thank you, everyone, for attending the webinar. Thank you.
END OF TRANSCRIPT