Applying a new subject classification
scheme for a database
by a data-driven correspondence
Kei Kurakawa, Yuan Sun
National Institute of Informatics
Satoko Ando
Clarivate Analytics Co., Ltd.
These are the presentation slides for the joint conference of the 134th SIG conference of Information Fundamentals and Access
Technologies (IFAT) and the 112th SIG conference of Document Communication (DC), Information Processing Society of Japan (IPSJ),
March 22, 2019, at Toyo University, Hakusan Campus.
Cite: Kei Kurakawa, Yuan Sun, and Satoko Ando, Applying a new subject classification scheme for a database by a data-driven
correspondence, IPSJ SIG Technical Report, Vol.2019-IFAT-134/2019-DC-112, No.7, pp.1-10, (2019).
Outline
• Information science methodologies
• Subject classification for research evaluation
• An issue
• How can we apply a new classification scheme for the database, cost-effectively and efficiently?
• Our approach
• A subject classification model of database
• Forming a compact topological space for yet another subject classification scheme
• Deciding a correspondence between two subject classification schemes by means of a research
project database
• A case study
• InCites - a benchmarking tool for research evaluation
• Web of Science subject classification scheme
• KAKEN subject classification scheme
• Conclusions and future work
Major objectives on information resources
• Library
• Information management and knowledge organization
• Information retrieval
• Knowledge
• Domain knowledge representation
• Knowledge extraction
Information science methodologies
Objective | Method: terms and associations | Method: categorization
Information management and knowledge organization / information retrieval | Thesaurus | Classification
Domain knowledge representation / knowledge extraction | Ontology | Taxonomy
Practices of the knowledge structures
• Classification
• Library classification
• Dewey Decimal Classification (DDC) (1876 - )
• Universal Decimal Classification (UDC) (1895 - )
• Library of Congress Classification (LCC) (1897 - )
• Colon Classification (CC) (1933 - )
• Nippon Decimal Code (NDC) (1928 - )
• NDL Classification (NDLC) (1963 - )
• Journal classification
• Web of Science subject classification
• Taxonomy
• Scientific classification
• Research project classification
• KAKENHI subject classification
• Journal classification for research evaluation
• Essential Science Indicator subject classification
• Thesaurus
• Subject headings
• Library of Congress Subject Headings (1898 - )
• Medical Subject Headings (MeSH)
• NDL Subject Headings (NDLSH)(1964 - )
• Ontology
• (Wikipedia definition) In computer science and information
science, an ontology encompasses a representation, formal
naming, and definition of the categories, properties,
and relations between the concepts, data, and entities that
substantiate one, many, or all domains.
• Automatic document classification methods (cf. Wikipedia)
• Expectation maximization (EM)
• Naive Bayes classifier
• tf–idf
• Instantaneously trained neural networks
• Latent semantic indexing
• Support vector machines (SVM)
• Artificial neural network
• K-nearest neighbor algorithms
• Decision trees such as ID3 or C4.5
• Concept Mining
• Rough set-based classifier
• Soft set-based classifier
• Multiple-instance learning
• Natural language processing approaches
Subject classification for research evaluation –
a type of taxonomy
• Subject classification for research evaluation
• Scientific fields for national research evaluation
• UK RAE (research assessment exercise) Units of
Assessment (UK)
• UK REF (research excellence framework) Units of
Assessment (UK)
• The ANVUR (Agenzia Nazionale di Valutazione del
Sistema Universitario e della Ricerca) category
scheme (Italia)
• Australia ERA FoR (excellence in research for
Australia, fields of research) (Australia)
• FAPESP (The São Paulo Research Foundation)
classification scheme (Brazil)
• CAPES (Coordenação de Aperfeiçoamento de Pessoal
de Nível Superior) (Brazil)
• China SCADC (State Council Academic Degrees
Committee) subject categories (China)
• Scientific fields for international research
evaluation
• OECD category scheme (OECD)
• GIPP(Global Institutional Profiles Project) categories
(Clarivate Analytics)
• Subject classification for funding
• Scientific fields for research project funds
• KAKENHI subject classification
• Subject classification for research output
• Journal classification
• Web of Science subject classification
• Essential Science Indicator subject classification
An issue
• When assessing research activities with bibliometrics, analysts are
accustomed to using the major citation database Web of Science, whose
subject classification schemes (WoS Subject Category, ESI, and
GIPP) are prepared for qualitative analysis.
• Analysts also need domestic subject classification schemes for their
analysis, but these are not implemented in the database.
• Applying a new classification scheme to the database by hand is too
labor-intensive and time-consuming.
• How can we apply a new classification scheme to the database
cost-effectively and efficiently?
Our approach
• Induce a correspondence between two subject classification
schemes, one of which is already applied to the database and the
other of which is not yet.
A subject classification model of database
• There exists a bibliographic database that represents a set of articles
for scientific research. Each article is labeled with at least one
category of a subject classification scheme. It means that all articles
are classified under the subject classification scheme.
• The subject classification scheme implies its compact topological
space in the database. It states the database structure, which affects
analysis by the subject classification scheme.
Definition 1 (a database with a subject
classification scheme)
• A database $S$ is a set of articles $a_n$.
• A subject classification scheme $C$ is a set of subject categories $c_\lambda$.
• Articles attributed to a subject category comprise a subset of $S$, so
the subject categories in a subject classification scheme refer to a
family of subsets $\{O_\lambda\}_{\lambda \in \Lambda}$ of $S$. Each $O_\lambda$ is an open set, and $\Lambda$ is an index set.
• A subset $O_\lambda$ depends on the corresponding subject category $c_\lambda$.
Therefore, we define a map $f$ from a subject classification scheme $C$
to the power set $\mathfrak{P}(S)$.
Theorem 1 (a finite cover)
• Theorem 1
• A practical subject classification scheme $C$ is mapped to a finite cover $\mathfrak{O}$ of $S$.
• Proof
• In practical databases, a subject classification scheme $C$ consists of finitely many
elements $c_\lambda$ that are mapped to finite subsets $O_\lambda$ by a map $f$.
• Let $\mathfrak{O} = \{O_i \mid i \in I\}$ be a subset of $\mathfrak{P}(S)$, where $I$ is a finite index set.
• Usually, $S = \bigcup_{i \in I} O_i$ $(O_i \in \mathfrak{O})$.
• $\mathfrak{O}$ is then called a finite cover of $S$.
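The finite-cover condition of Theorem 1 can be checked mechanically on a small example. The following is a minimal sketch, assuming the map $f$ is stored as a Python dictionary from category labels to sets of article identifiers; the function name `is_finite_cover` and the toy data are hypothetical and not part of the original slides.

```python
from typing import Dict, Set

def is_finite_cover(S: Set[str], f: Dict[str, Set[str]]) -> bool:
    """Return True if every f[c] is a subset of S and their union equals S."""
    union = set().union(*f.values())
    return all(O <= S for O in f.values()) and union == S

# Toy database of four articles and a three-category scheme (hypothetical data).
S = {"a1", "a2", "a3", "a4"}
f = {"c1": {"a1", "a2"}, "c2": {"a2", "a3"}, "c3": {"a4"}}

print(is_finite_cover(S, f))           # every article carries at least one category
print(is_finite_cover(S | {"a5"}, f))  # "a5" has no category, so f is not a cover
```

The check fails exactly when some article has no category, which is the case the word "usually" in the proof sets aside.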
Theorem 2 (a compact topological space)
• Theorem 2
• A practical subject classification scheme $C$ implies a compact topological space $(S, \widetilde{\mathfrak{O}})$.
• Proof
• In practical databases, a subject classification scheme $C$ consists of finitely many elements $c_i$ that are
mapped to finite subsets $O_i$ by a map $f$.
• Let $\mathfrak{O} = \{O_i \mid i \in I\}$ be a subset of $\mathfrak{P}(S)$, where $I$ is a finite index set.
• As a basis, let $\mathfrak{O}_0 = \{\bigcap_{i \in I} A_i \mid A_i \in \mathfrak{O}\}$ be a subset of $\mathfrak{P}(S)$, where the element
is $S$ if $I = \emptyset$.
• Let $\widetilde{\mathfrak{O}} = \{\bigcup_{\lambda \in \Lambda} B_\lambda \mid B_\lambda \in \mathfrak{O}_0\}$ be a subset of $\mathfrak{P}(S)$, where the element is $\emptyset$ if
$\Lambda = \emptyset$. $\Lambda$ is a finite or infinite index set.
• Thus, $\widetilde{\mathfrak{O}} \supset \mathfrak{O}$, $S \in \widetilde{\mathfrak{O}}$, and $\emptyset \in \widetilde{\mathfrak{O}}$.
• $\widetilde{\mathfrak{O}}$ satisfies the necessary and sufficient conditions to be a topology.
• Together with Theorem 1, this implies a compact topological space $(S, \widetilde{\mathfrak{O}})$.
• When there exists a finite cover in a topological space, we call it a compact topological space.
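The two-step construction in the proof (finite intersections of cover elements as a basis, then unions of basis elements) can be sketched on a toy example. This is an illustrative sketch only: the function name `generated_topology` is hypothetical, and the brute-force enumeration is exponential in the number of categories, so it is meant for a handful of sets rather than a real database.

```python
from itertools import combinations
from typing import FrozenSet, Iterable, Set

def generated_topology(S: Iterable, cover: Iterable[Iterable]) -> Set[FrozenSet]:
    """Build the topology generated by a finite cover used as a subbasis."""
    whole = frozenset(S)
    sub = [frozenset(O) for O in cover]
    # Basis: all finite intersections; the empty intersection is S itself.
    basis = {whole}
    for k in range(1, len(sub) + 1):
        for combo in combinations(sub, k):
            basis.add(frozenset.intersection(*combo))
    # Topology: all unions of basis elements; the empty union is the empty set.
    topology = {frozenset()}
    basis_list = list(basis)
    for k in range(1, len(basis_list) + 1):
        for combo in combinations(basis_list, k):
            topology.add(frozenset().union(*combo))
    return topology

topology = generated_topology({1, 2, 3}, [{1, 2}, {2, 3}])
for open_set in sorted(topology, key=lambda s: (len(s), sorted(s))):
    print(sorted(open_set))
```

Because $S$ is finite, the generated family is finite as well, so closure under unions and finite intersections can be verified by direct enumeration.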
Forming a compact topological space for yet
another subject classification scheme
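• A prior condition
• A subject classification scheme $C^{1}$ that consists of subject categories $c_i^{1}$ is mapped to a finite cover $\mathfrak{O}^{1} = \{O_i^{1} \mid i \in I^{1}\}$ by a map $f_1$, which implies a compact topological space $(S, \mathfrak{O}^{1})$.
• Direct category assignment approach
• In the same way, we assign subject categories $c_i^{2}$ of a new classification scheme $C^{2}$ to each article of $S$.
• This creates a map $f_2$ from $C^{2}$ to a finite cover $\mathfrak{O}^{2} = \{O_i^{2} \mid i \in I^{2}\}$, which implies a compact topological space $(S, \mathfrak{O}^{2})$.
• Indirect correspondence approach (our approach)
• We build a correspondence $\Gamma: C^{2} \to C^{1}$ ($\Gamma = (C^{2}, C^{1}; G)$, $G \subset C^{2} \times C^{1}$), where $c_i^{2} \in C^{2}$, $c_j^{1} \in C^{1}$, $(c_i^{2}, c_j^{1}) \in G$, $C^{2} = \bigcup_i \{c_i^{2}\}$, and $C^{1} = \bigcup_j \{c_j^{1}\}$ to guarantee the existence of a finite cover.
• Then, we create a map $g_1: C^{2} \to \mathfrak{C}^{1} = \{ C_i^{1} \mid c_i^{2} \in C^{2},\ c_j^{1} \in C^{1},\ (c_i^{2}, c_j^{1}) \in G,\ i \in I^{2},\ C_i^{1} = \bigcup_{j \in I_i^{1}} \{c_j^{1}\} \}$, where $S = \bigcup_{i \in I^{2}} C_i^{1}$ to be a finite cover.
• Finally, we create a map $g_2: \mathfrak{C}^{1} \to \widehat{\mathfrak{O}}^{1} = \{ \widehat{O}_i^{1} \mid C_i^{1} \in \mathfrak{C}^{1},\ c_j^{1} \in C_i^{1},\ O_j^{1} = f_1(c_j^{1}),\ \widehat{O}_i^{1} = \bigcup_{j \in I_i^{1}} O_j^{1} \}$, where $S = \bigcup_{i \in I^{2}} \widehat{O}_i^{1}$ to be a finite cover.
• We get a composite map $g_2 \circ g_1$ from $C^{2}$ to a finite cover $\widehat{\mathfrak{O}}^{1}$, which implies a compact topological space $(S, \widehat{\mathfrak{O}}^{1})$. Obviously, $\widehat{\mathfrak{O}}^{1} \subset \widetilde{\mathfrak{O}}^{1}$, where $\widetilde{\mathfrak{O}}^{1}$ is the topology generated by $\mathfrak{O}^{1}$.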
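The indirect correspondence approach amounts to a dictionary lookup followed by a union. Below is a minimal sketch, assuming the graph $G$ of the correspondence is given as a list of (new category, existing category) pairs and $f_1$ as a dictionary of article sets; the name `induced_cover` and the labels `k1`, `w1`, and so on are hypothetical.

```python
from typing import Dict, Iterable, Set, Tuple

def induced_cover(G: Iterable[Tuple[str, str]],
                  f1: Dict[str, Set[str]]) -> Dict[str, Set[str]]:
    """Composite g2 after g1: each new category gets the union of the article
    sets of the existing categories it corresponds to."""
    cover: Dict[str, Set[str]] = {}
    for c2, c1 in G:
        cover.setdefault(c2, set()).update(f1[c1])
    return cover

# Existing scheme (already applied to the database) and a correspondence graph.
f1 = {"w1": {"a1", "a2"}, "w2": {"a2", "a3"}, "w3": {"a4"}}
G = [("k1", "w1"), ("k1", "w2"), ("k2", "w3")]

cover = induced_cover(G, f1)
print(cover)
```

No article is labeled directly with the new scheme here; the new cover is assembled entirely from the existing labels, which is where the cost saving over direct assignment comes from.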
Deciding a correspondence between two
subject classification schemes
• Expert driven approach
• Experts of the two subject classification schemes decide a correspondence
between them based on their knowledge and practical experiences.
• Data driven approach (our approach)
• Data scientists analyze a database where an entity is categorized with the two
subject classification schemes, and decide a correspondence between them
based on the analysis.
By means of a research project database
• A research project database
• A database $T$ describes research projects $b_n$, one of whose outputs is a list of research articles $a_n$ in a database $S$.
• Research articles $a_n$ of $S$ are categorized with a subject classification scheme $C^{1}$. We define a map $f_1$ where $C^{1}$ is mapped to a finite cover $\mathfrak{O}_S^{1} = \{O_i^{1} \mid i \in I^{1}\}$ of $S$, which implies a compact topological space $(S, \mathfrak{O}_S^{1})$.
• Research projects $b_n$ of $T$ are categorized with a subject classification scheme $C^{2}$. We define a map $h_1$ where $C^{2}$ is mapped to a finite cover $\mathfrak{O}_T^{2} = \{O_i^{2} \mid i \in I^{2}\}$ of $T$, which implies a compact topological space $(T, \mathfrak{O}_T^{2})$.
• Research projects $b_n$ produce sets of research articles $a_n$, so we define a map $h_2: T \to \mathfrak{P}(S)$ to express this. Here, let the codomain of the map be reduced to its image $\mathfrak{S} \subset \mathfrak{P}(S)$ so that the map is a surjection. Then, we also define a map $h_2': T \to \mathfrak{P}(S')$, where $S' = \bigcup_{i \in I_{\mathfrak{S}}} O_i$ $(O_i \in \mathfrak{S})$ and $S' \subset S$. For the image $S'$, we define a map $f_1'$ where $C^{1}$ is mapped to a finite cover $\mathfrak{O}_{S'}^{1} = \{O_i'^{1} \mid i \in I^{1}\}$ of $S'$, which implies a compact topological space $(S', \mathfrak{O}_{S'}^{1})$.
• Create a map
• We create a map $h_3: \mathfrak{O}_T^{2} \to \mathfrak{O}_{S'}^{2} = \{ O_{S',i}^{2} \mid O_{T,i}^{2} \in \mathfrak{O}_T^{2},\ b_j^{2} \in O_{T,i}^{2},\ O_{S',j}^{2} = h_2'(b_j^{2}),\ O_{S',i}^{2} = \bigcup_j O_{S',j}^{2} \}$, which is a subset of $\mathfrak{P}(S')$ and where $\mathfrak{O}_{S'}^{2}$ is a finite cover.
• We get a composite map $h_3 \circ h_1: C^{2} \to \mathfrak{O}_{S'}^{2}$. Since $\mathfrak{O}_{S'}^{2}$ is a finite cover, it induces a compact topological space.
• Supposition
• The composite map $h_3 \circ h_1: C^{2} \to \mathfrak{O}_{S'}^{2}$ represents the classification of articles by the subject classification scheme.
• If the two images on $S'$ under the map $f_1'$ and the map $h_3 \circ h_1$ are equivalent, their inverse images are in an equivalence relation.
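The composite map from project categories to article sets can be sketched in a few lines. This is an illustrative sketch under stated assumptions: `h1` maps each project category to its set of projects, `h2` maps each project to the set of linked articles, and all identifiers are hypothetical toy data.

```python
from typing import Dict, Set

def project_induced_cover(h1: Dict[str, Set[str]],
                          h2: Dict[str, Set[str]]) -> Dict[str, Set[str]]:
    """For each project category, collect the articles produced by its projects."""
    return {
        c2: set().union(*(h2.get(b, set()) for b in projects))
        for c2, projects in h1.items()
    }

# h1: project category -> projects; h2: project -> articles it produced.
h1 = {"k1": {"b1", "b2"}, "k2": {"b2", "b3"}}
h2 = {"b1": {"a1"}, "b2": {"a2", "a3"}, "b3": {"a4"}}

cover2 = project_induced_cover(h1, h2)
S_prime = set().union(*h2.values())  # articles reachable from some project

print(cover2)
print(S_prime == set().union(*cover2.values()))
```

Only articles linked to at least one project appear in the result, which mirrors the restriction from $S$ to $S'$ in the slides.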
Data driven approach to decide a
correspondence
• An observation
• In a database $S'$, the elements of the finite covers $\mathfrak{O}_{S'}^{1}$ and $\mathfrak{O}_{S'}^{2}$ are naturally overlapping sets.
• For an $O^{2}$ $(\in \mathfrak{O}_{S'}^{2})$, there exist intersections $O^{2} \cap O^{1}$ with all $O^{1}$ $(\in \mathfrak{O}_{S'}^{1})$. Their cardinalities greater than zero, when sorted in rank order, obey a discrete version of a generalized beta distribution [Martínez-Mekler, et al. (2009)], which is given by $f(r) = A (N + 1 - r)^{b} / r^{a}$, where $r$ is the rank, $N$ its maximum value, $A$ the normalization constant, and $(a, b)$ two fitting exponents.
• Decide a correspondence by calculating precision and recall
• To decide a correspondence, for an $O_j^{2}$ $(j \in I^{2})$, find a subset $\{O_i^{1} \mid i \in I_j^{1}\}$ of $\mathfrak{O}_{S'}^{1}$ that satisfies $O_j^{2} = \bigcup_{i \in I_j^{1}} O_i^{1}$.
• In most cases, $O_j^{2} \not\supset O_i^{1}$ and $O_j^{2} \neq O_i^{1}$.
• So we define the following metrics:
• $d_p = \frac{\left| \bigcup_{i \in I_j^{1}} (O_j^{2} \cap O_i^{1}) \right|}{\left| \bigcup_{i \in I_j^{1}} O_i^{1} \right|}$ (precision),
• $d_r = \frac{\left| \bigcup_{i \in I_j^{1}} (O_j^{2} \cap O_i^{1}) \right|}{\left| O_j^{2} \right|}$ (recall).
• And a generalized harmonic mean of precision and recall:
• $d_f = \frac{(1 + \beta^{2})\, d_p\, d_r}{\beta^{2} d_p + d_r}$, $\beta > 0$ ($F_\beta$-measure).
• Finally, we decide a threshold on the F-measure to determine which elements have a correspondence relation.
Martínez-Mekler, G., Alvarez Martínez, R., Beltrán del Río, M., Mansilla, R., Miramontes, P., & Cocho, G. (2009). Universality of rank-ordering distributions in the arts and sciences. PloS One, 4(3), e4791. http://doi.org/10.1371/journal.pone.0004791
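The precision, recall, and $F_\beta$ metrics above can be computed directly when the article sets themselves are available. The sketch below assumes that the numerators and denominators are cardinalities of unions of sets (the operators are not legible in the extracted slide text, so this reading is an assumption), and the function name and toy data are hypothetical.

```python
from typing import Iterable, Set, Tuple

def precision_recall_f(O2_j: Set[int], selected_O1: Iterable[Set[int]],
                       beta: float = 1.0) -> Tuple[float, float, float]:
    """Precision, recall and F-beta of a union of existing categories
    measured against one category of the new scheme."""
    union = set().union(*selected_O1)
    hit = O2_j & union  # equals the union of the pairwise intersections
    p = len(hit) / len(union) if union else 0.0
    r = len(hit) / len(O2_j) if O2_j else 0.0
    f = (1 + beta**2) * p * r / (beta**2 * p + r) if p + r > 0 else 0.0
    return p, r, f

O2_j = {1, 2, 3, 4}                 # articles in one new-scheme category
selected_O1 = [{1, 2, 5}, {2, 3}]   # candidate existing categories

p, r, f = precision_recall_f(O2_j, selected_O1)
print(p, r, f)
```

With `beta` above 1 the measure weights recall more heavily, and below 1 it weights precision more heavily, as in the usual $F_\beta$ definition.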
Application example
• InCites™ (Clarivate Analytics)
• A world class research evaluation platform
• User scenarios
• Users
• Research organizations
• Funding and policy organizations
• Publishers
• The users can
• Identify and manage research activities and their impact,
• Benchmark and compare performance to peers,
• Identify experts both inside and outside the organization,
• Identify emerging subject areas, researchers, and experts,
• Manage funding activity from submission to progress
reports through outcomes,
• Demonstrate results and impact of funding policy,
• Identify new trends and key indicators to enable policy
development,
• Uncover new or emerging areas in which to publish,
• Monitor trends within a field or geographic region,
• Identify the best authors and reviewers,
• Maintain competitive advantage by monitoring the
competition.
• Dataset
• Web of Science™ Core Collection
• Entities
• People
• Organizations
• Regions
• Research Areas
• Journals, Books, Conference Proceedings
Linking bibliographic entities between WoS
and KAKEN
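[Figure: research projects $b$ in KAKEN (database $T$) are mapped by $h$ to sets $O$ of articles in Web of Science (database $S'$ within $S$). The articles $a'_1, a'_2, a'_3$ listed in KAKEN are identified with the Web of Science articles $a_1, a_2, a_3$ by bibliographic linkage: $a'_1 \equiv a_1$, $a'_2 \equiv a_2$, $a'_3 \equiv a_3$.]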
A bibliographic linkage [Kurakawa, et al. 2014]
• Databases
• KAKEN as of 2009
• 173,940 article citations in English
• WoS as of 2009, 2010
• 3,925,776 article citations
• Method
• Identifying pairs of article citations with
the following techniques
• i-Linkage
• Blocking: for each KAKEN article citation, the top 5 candidate
WoS article citations are retained as candidate pairs
• SVM (support vector machine)
• Classifying each candidate pair as a true or false match
• Output result
• 75,042 pairs of citations
• 43.1% of the 173,940 citations from KAKEN
• 10-fold cross-validation (800 true
pairs judged by humans)
• Accuracy 95.01
• Precision 94.92
• Recall 95.10
• F-measure 94.98
• Pairs of citations that are
categorized with both subject
classification schemes
• 59,595 pairs of citations
Kurakawa, K., Sun, Y., & Aizawa, A. (2014). Mapping between research fields of grants-in-aid for scientific research and web of science subject areas. NII
Technical Reports. National Institute of Informatics. Retrieved from https://www.nii.ac.jp/TechReports/public_html/14-002J.html
Subject classification schemes
• Web of Science subject
classification scheme
• Web of Science subject areas
• 251 subject categories
• ESI research areas
• 22 subject categories
• GIPP research areas
• 6 subject categories
• KAKEN subject classification
scheme (as of 2009)
• 4 categories
• 10 areas
• 67 disciplines
• 284 research fields
A contingency table for two subject
classification schemes
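Rows are the Web of Science subject categories $O_1^{1}, \dots, O_m^{1}$; columns are the KAKENHI subject categories $O_1^{2}, \dots, O_n^{2}$.
$$
\begin{array}{c|ccccc}
 & O_1^{2} & \cdots & O_j^{2} & \cdots & O_n^{2} \\ \hline
O_1^{1} & f_{11} & \cdots & f_{1j} & \cdots & f_{1n} \\
\vdots & \vdots & \ddots & \vdots & & \vdots \\
O_i^{1} & f_{i1} & \cdots & f_{ij} & \cdots & f_{in} \\
\vdots & \vdots & & \vdots & \ddots & \vdots \\
O_m^{1} & f_{m1} & \cdots & f_{mj} & \cdots & f_{mn}
\end{array}
\qquad f_{ij} = \left| O_i^{1} \cap O_j^{2} \right|
$$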
An example contingency table (a part view of
251 WoS and 67 KAKENHI subject categories)
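A contingency table of this kind can be accumulated from the linked records in one pass. The sketch below assumes each linked article is represented as a pair (its WoS categories, its KAKENHI categories); the record layout, the function name, and the labels are hypothetical.

```python
from collections import Counter
from typing import Iterable, List, Tuple

def contingency_table(records: Iterable[Tuple[List[str], List[str]]]) -> Counter:
    """Count, for every (WoS category, KAKENHI category) pair, the number of
    linked articles carrying both labels."""
    table: Counter = Counter()
    for wos_cats, kaken_cats in records:
        for i in set(wos_cats):        # set() guards against duplicate labels
            for j in set(kaken_cats):
                table[(i, j)] += 1
    return table

# Three linked articles (hypothetical labels).
records = [
    (["w1"], ["k1"]),
    (["w1", "w2"], ["k1"]),
    (["w2"], ["k2"]),
]

table = contingency_table(records)
print(table[("w1", "k1")], table[("w2", "k1")], table[("w2", "k2")])
```

An article with several categories contributes to several cells, so the sum of all cells can exceed the number of linked articles.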
Analysis of the contingency table
The discrete generalized beta distribution (DGBD):
$f(r) = A (N + 1 - r)^{b} / r^{a}$, where $r$ is the rank value, $N$ its maximum value, $A$ a normalization constant, and $(a, b)$ two fitting exponents.
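One simple way to estimate the DGBD parameters is ordinary least squares on the logarithm of $f(r) = A (N + 1 - r)^{b} / r^{a}$, since $\log f(r) = \log A + b \log(N + 1 - r) - a \log r$ is linear in $(\log A, b, a)$. The following is a sketch of that idea in pure Python; the slides do not state which fitting procedure was actually used, so the log-linear fit and the helper names are assumptions.

```python
import math
from typing import List, Sequence, Tuple

def solve_linear(M: List[List[float]], v: List[float]) -> List[float]:
    """Solve M x = v by Gaussian elimination with partial pivoting."""
    n = len(v)
    A = [row[:] + [v[i]] for i, row in enumerate(M)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(col + 1, n):
            factor = A[r][col] / A[col][col]
            for c in range(col, n + 1):
                A[r][c] -= factor * A[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(A[i][c] * x[c] for c in range(i + 1, n))
        x[i] = (A[i][n] - s) / A[i][i]
    return x

def fit_dgbd(counts: Sequence[float]) -> Tuple[float, float, float]:
    """Return (A, a, b) fitted to the positive counts sorted in rank order."""
    values = sorted((c for c in counts if c > 0), reverse=True)
    N = len(values)
    if N < 3:
        raise ValueError("need at least three positive counts")
    rows = [[1.0, math.log(N + 1 - r), -math.log(r)] for r in range(1, N + 1)]
    y = [math.log(v) for v in values]
    XtX = [[sum(row[i] * row[j] for row in rows) for j in range(3)] for i in range(3)]
    Xty = [sum(row[i] * yi for row, yi in zip(rows, y)) for i in range(3)]
    log_A, b, a = solve_linear(XtX, Xty)
    return math.exp(log_A), a, b

# Synthetic rank-ordered values generated from known parameters.
synthetic = [100.0 * (20 + 1 - r) ** 0.5 / r ** 0.8 for r in range(1, 21)]
A, a, b = fit_dgbd(synthetic)
print(round(A, 3), round(a, 3), round(b, 3))
```

Fitting in log space weights relative rather than absolute errors, which tends to emphasize the low-count tail; a nonlinear fit on the raw counts would be an alternative.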
Pseudo precision and recall for subject categories
of a new subject classification scheme
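[Figure: Venn diagram of a KAKENHI category set $O_j^{2}$ overlapping the Web of Science category sets $O_1^{1}$, $O_2^{1}$, $O_3^{1}$, and $O_4^{1}$.]
As for the whole counting of papers, we define
$d_p' = \frac{\sum_i \left| O_j^{2} \cap O_i^{1} \right|}{\sum_i \left| O_i^{1} \right|}$ (pseudo precision),
$d_r' = \frac{\sum_i \left| O_j^{2} \cap O_i^{1} \right|}{\left| O_j^{2} \right|}$ (pseudo recall),
$d_f' = \frac{(1 + \beta^{2})\, d_p'\, d_r'}{\beta^{2} d_p' + d_r'}$, $\beta > 0$ (pseudo $F_\beta$-measure).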
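The pseudo metrics need only counts, so they can be evaluated from one column of the contingency table plus category sizes. The sketch below also searches for the number of top-ranked WoS categories that maximizes the pseudo $F_\beta$-measure; that prefix search over rank order is an assumption about how the maximum was found, and the function name and numbers are hypothetical.

```python
from typing import Dict, Tuple

def max_pseudo_f(column: Dict[str, int], wos_sizes: Dict[str, int],
                 kaken_size: int, beta: float = 1.0
                 ) -> Tuple[float, float, float, int]:
    """Return (pseudo precision, pseudo recall, pseudo F, k) for the best
    prefix of WoS categories ranked by overlap with one KAKENHI category."""
    ranked = sorted((i for i, n in column.items() if n > 0),
                    key=lambda i: column[i], reverse=True)
    best = (0.0, 0.0, 0.0, 0)
    hit = size = 0
    for k, i in enumerate(ranked, start=1):
        hit += column[i]      # sum of |O_j^2 ∩ O_i^1| over the prefix
        size += wos_sizes[i]  # sum of |O_i^1| over the prefix
        p = hit / size
        r = hit / kaken_size
        f = (1 + beta**2) * p * r / (beta**2 * p + r)
        if f > best[2]:
            best = (p, r, f, k)
    return best

column = {"w1": 30, "w2": 10, "w3": 1}        # overlaps with one KAKENHI category
wos_sizes = {"w1": 40, "w2": 50, "w3": 100}   # |O_i^1| for each WoS category
best = max_pseudo_f(column, wos_sizes, kaken_size=50)
print(best)
```

Because articles with several categories are counted more than once, the pseudo recall is not guaranteed to stay below 1, unlike the set-based recall.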
Maximum pseudo f-measure
KAKEN 4 categories, 67 disciplines
Columns: KAKEN subject category, translation, number of WoS subject categories to cover, pseudo precision, pseudo recall, max pseudo F1 measure
(01-01) 情報学 Informatics 17 0.57582 0.62589 0.59981
(01-02) 神経科学 Brain sciences 1 0.21829 0.36497 0.27318
(01-03) 実験動物学 Laboratory animal science 1 0.05863 0.07438 0.06557
(01-04) 人間医工学 Biomedical engineering 8 0.22199 0.21253 0.21716
(01-05) 健康・スポーツ科学 Health / sports science 5 0.18095 0.29028 0.22293
(01-06) 生活科学 Human life science 4 0.23905 0.28051 0.25813
(01-07) 科学教育・教育工学 Science education / educational technology 2 0.37736 0.10309 0.16194
(01-08) 科学社会学・科学技術史 Sociology / history of science and technology 6 0.11111 0.16279 0.13208
(01-09) 文化財科学 Cultural assets study 1 0.2 0.03636 0.06154
(01-10) 地理学 Geography 4 0.11719 0.2027 0.14851
(01-11) 環境学 Environmental science 14 0.26227 0.3853 0.3121
(01-12) ナノ・マイクロ科学 Nano / micro science 4 0.10326 0.31317 0.15531
(01-13) 社会・安全システム科学 Social / safety system science 14 0.18656 0.21429 0.19946
(01-14) ゲノム科学 Genome science 3 0.04047 0.20305 0.06748
(01-15) 生物分子科学 Biomolecular science 2 0.11913 0.32457 0.17429
(01-16) 資源保全学 Resource conservation science 3 0.18116 0.14535 0.16129
(01-17) 地域研究 Area studies 7 0.16429 0.27059 0.20444
(01-18) ジェンダー Gender 3 0.23077 0.11111 0.15
(02-01) 哲学 Philosophy 4 0.4359 0.28333 0.34343
(02-02) 芸術学 Art studies 1 0.09091 0.11111 0.1
(02-03) 文学 Literature 10 0.7 0.68293 0.69136
(02-04) 言語学 Linguistics 3 0.70504 0.41004 0.51852
(02-05) 史学 History 6 0.41176 0.34146 0.37333
(02-06) 人文地理学 Human geography 3 0.175 0.5 0.25926
(02-07) 文化人類学 Cultural anthropology 3 0.05634 0.10526 0.07339
(02-08) 法学 Law 3 0.38462 0.12195 0.18519
(02-09) 政治学 Politics 2 0.40909 0.45763 0.432
(02-10) 経済学 Economics 12 0.6917 0.62198 0.65499
(02-11) 経営学 Management 5 0.29412 0.38462 0.33333
(02-12) 社会学 Sociology 8 0.17606 0.27778 0.21552
(02-13) 心理学 Psychology 14 0.4878 0.47859 0.48315
(02-14) 教育学 Education 9 0.24375 0.25828 0.2508
(03-01) 数学 Mathematics 4 0.73424 0.79181 0.76194
(03-02) 天文学 Astronomy 1 0.5052 0.86965 0.63912
(03-03) 物理学 Physics 6 0.49831 0.65128 0.56462
(03-04) 地球惑星科学 Earth and planetary science 7 0.6186 0.66222 0.63967
(03-05) プラズマ科学 Plasma science 1 0.23261 0.19094 0.20973
(03-06) 基礎化学 Basic chemistry 7 0.22929 0.80065 0.35649
(03-07) 複合化学 Applied chemistry 6 0.28307 0.52645 0.36817
(03-08) 材料化学 Materials chemistry 7 0.1571 0.34801 0.21647
(03-09) 応用物理学・工学基礎 Applied physics 5 0.17011 0.39374 0.23758
(03-10) 機械工学 Mechanical engineering 11 0.43053 0.38804 0.40818
(03-11) 電気電子工学 Electrical and electronic engineering 10 0.33758 0.66933 0.4488
(03-12) 土木工学 Civil engineering 8 0.37069 0.48383 0.41977
(03-13) 建築学 Architecture and building engineering 3 0.28571 0.50588 0.36518
(03-14) 材料工学 Material engineering 6 0.34794 0.52269 0.41778
(03-15) プロセス工学 Process / chemical engineering 4 0.14529 0.30553 0.19694
(03-16) 総合工学 Integrated engineering 8 0.25637 0.30922 0.28032
Average pseudo precision 0.31469, average pseudo recall 0.36724, average pseudo F1 measure 0.31718
(04-01) 基礎生物学 Basic biology 7 0.375 0.39992 0.38706
(04-02) 生物科学 Biological science 4 0.16679 0.58193 0.25927
(04-03) 人類学 Anthropology 3 0.31504 0.44 0.36718
(04-04) 農学 Plant production and environmental agriculture 4 0.30676 0.44939 0.36462
(04-05) 農芸化学 Agricultural chemistry 6 0.22042 0.38632 0.28069
(04-06) 林学 Forest and forest products science 5 0.40751 0.25224 0.3116
(04-07) 水産学 Applied aquatic science 2 0.4185 0.32702 0.36715
(04-08) 農業経済学 Agricultural science in society and economy 2 0.33333 0.09677 0.15
(04-09) 農業工学 Agro-engineering 4 0.15686 0.25926 0.19546
(04-10) 畜産学・獣医学 Animal life science 4 0.51054 0.38655 0.43998
(04-11) 境界農学 Boundary agriculture 4 0.2346 0.14787 0.18141
(04-12) 薬学 Pharmacy 4 0.29417 0.3694 0.32752
(04-13) 基礎医学 Basic medicine 16 0.21266 0.55141 0.30695
(04-14) 境界医学 Boundary medicine 12 0.16156 0.11176 0.13213
(04-15) 社会医学 Society medicine 8 0.28153 0.26197 0.2714
(04-16) 内科系臨床医学 Clinical internal medicine 24 0.44074 0.61743 0.51433
(04-17) 外科系臨床医学 Clinical surgery 20 0.41795 0.468 0.44156
(04-18) 歯学 Dentistry 3 0.64007 0.27983 0.38941
(04-19) 看護学 Nursing 2 0.73684 0.44304 0.55336
Miscellaneous considerations to decide a
correspondence
• A threshold for decision
• At most the top 4 ranked elements have correspondence relations.
• For every Web of Science subject category $O_i^{1}$, the number of relations with KAKENHI
subject categories $O_j^{2}$ is limited to at most 4.
• For every Web of Science subject category $O_i^{1}$, once the recall rate exceeds one half, we
stop adding further relations.
• Decision by experts
• Professionals who know the subject classification schemes check all
correspondences between $O_i^{1}$ and $O_j^{2}$.
• They add or remove correspondence relations between them by means of subject
classification keywords.
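The threshold rules can be read as a small greedy selection per Web of Science category. The sketch below is one possible interpretation, not the authors' procedure: it takes one row of the contingency table, ranks the KAKENHI categories by count, and treats "recall rate" as the cumulative share of the row total, which is an assumption. The function name, parameters, and data are hypothetical.

```python
from typing import Dict, List

def select_relations(row: Dict[str, int], max_relations: int = 4,
                     recall_stop: float = 0.5) -> List[str]:
    """Pick KAKENHI categories for one WoS category in rank order, keeping at
    most `max_relations` and stopping once the covered share exceeds
    `recall_stop`."""
    total = sum(row.values())
    selected: List[str] = []
    covered = 0
    for j in sorted(row, key=row.get, reverse=True):
        if len(selected) >= max_relations or row[j] == 0:
            break
        if total and covered / total > recall_stop:
            break
        selected.append(j)
        covered += row[j]
    return selected

skewed = {"k1": 60, "k2": 25, "k3": 10, "k4": 3, "k5": 2}
flat = {f"k{n}": 10 for n in range(1, 11)}

print(select_relations(skewed))  # one dominant category is enough
print(select_relations(flat))    # the cap of four relations applies
```

The output of such a rule would still be a candidate list for the expert check described above, not a final correspondence.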
InCites example screen (Analysis by KAKEN
subject classification scheme)
WoS Documents: 58,395,008
for Web of Science subject categories
WoS Documents: 3,192,449
for Web of Science subject categories
limited with
“LOCATION = JAPAN”
WoS Documents: 3,191,448
for KAKEN L3 subject categories
limited with
“LOCATION = JAPAN”
(a snapshot of 2018-12-14)
InCites outputs by subject classification
schemes
• WoS Documents
• “LOCATION = JAPAN”
• Web of Science subject areas
• 251 subject categories
• ESI research areas
• 22 subject categories
• KAKEN subject classification
scheme (as of 2009)
• 10 areas (KAKEN L2)
• 67 disciplines (KAKEN L3)
User feedback
• KAKEN classification scheme
• April 2016, released on InCites Benchmarking
• User survey
• March 2017 by online questionnaire for institutional active users
• 18 questions
• Results
• Feedback from 26 institutional users
User role in the institution | Yes (multiple answers possible)
RA (research administrator) | 20
Administrator / officer | 3
IR (institutional research) staff | 5
Others | 2
User feedback results (degree of expertise)
Other: 4, when needed
1, when evaluating researchers
User feedback results (validity of applying
KAKENHI subject classification scheme)
Other: 1, I need more detailed categories
User feedback results (miscellaneous user
voices)
• Comments
• Needs for the KAKENHI subject classification scheme
• When I apply the KAKENHI subject classification scheme to a set of Web of Science documents, the top
1% paper metrics based on WoS categories and ESI categories do not fit well. I need a feature that recalculates the
citation ranking by the KAKENHI subject classification scheme.
• I need the KAKENHI subject classification scheme in the Web of Science search service as well.
• I hope the KAKENHI subject classification scheme will be updated to the newest version as far as possible. (It might be hard to
keep up with the updates since it changes every year.)
• Adding the KAKENHI subject classification scheme is very timely.
• Although it applies only to a limited range of subjects, the KAKENHI subject classification scheme lets us
analyze research more precisely because it has a higher resolution than ESI, which makes it
possible to map “Animal & Plant Science” of ESI to more precise categories, i.e. “Applied Physics”, “Applied
Chemistry”, etc. of Japan.
• Need for more precise categories in the KAKENHI subject classification scheme
• The sixty-odd KAKENHI categories are not sufficient for relative comparison of research, compared with ESI (only
22) and WoS (251, four times as many or more). It may also cause over-evaluation when comparing
research fields, because the KAKENHI subject classification is made in a clock counter-like classification
method. We need more accurate analysis with more concrete examples.
Discussion (1)
• Our approach, i.e. deciding a correspondence between two subject classification
schemes, has an inherent limitation.
• In the natural correlations between the subject categories of two subject classification schemes, each
subject category of one scheme partly overlaps several subject categories of the other scheme.
• There is no inclusion relationship between them.
• Correspondence relations are probabilistic.
• Research projects and journal articles have both similarities and differences in subject.
• Projects and articles are strongly correlated in subject.
• In our approach, we used a grants database that records the outputs of research projects, i.e.
research articles.
• We focused on the subject classification scheme for the research projects and its relationship to a set of
research articles. The research articles are classified with another subject classification scheme.
• We compared the two subject classification schemes through this relationship.
• However, they also differ in subject.
• Projects precede articles. There is a time lag between the start of a project and the publication of its articles, which causes
subject divergence or drift between them.
• Projects tend to indicate the central concept with essential keywords. This allows the subjects of the
articles to diversify.
Discussion (2)
• Nevertheless, the classification results were accepted by InCites users.
• The users might focus on comparative bibliometric analysis by subject category and not care about
the specific cases of individual articles.
• They might need only rough metrics at the evaluation stage.
• Metrics are central limits of quantitative attributes of a set of entities, which are the main indicators to be
checked for research evaluation.
• Our approach is extremely cost-effective.
• The number of Web of Science documents in InCites is 58,395,008.
• The number of journals in the Web of Science citation database is 24,688.
• http://mjl.clarivate.com/cgi-bin/jrnlst/jlresults.cgi
• The number of subject category pairs for which a correspondence must be decided is 16,817.
• For KAKEN 67 - WoS 251, the number of pairs is 16,817.
• For KAKEN 10 - WoS 251, the number of pairs is 2,510.
• The evidence data is too small.
• The sum of the frequency counts in the contingency table is 97,175. This is not enough to decide a
correspondence between the subject classification schemes automatically, so manual handling was needed.
Conclusions and future work
• We proposed an approach to apply a new subject classification scheme to a bibliographic database that is
already classified by another subject classification scheme.
• We defined a subject classification model of a database that consists of a topological space.
• Then, we showed our approach based on the model, in which the key step is to form a compact topological space for the new
subject classification scheme.
• To form the space, the approach uses a correspondence between the two subject classification schemes derived from a research project
database.
• We applied the approach to a practical example, i.e. InCites, a benchmarking tool for research evaluation
based on the Web of Science citation database, so as to add the KAKENHI subject classification scheme.
• Subject classification schemes
• Web of Science subject categories 251
• KAKENHI subject categories 67 / 10
• 59,595 pairs of bibliographic records classified by both WoS subject categories and KAKEN subject categories induce a
correspondence between the two subject classification schemes.
• User feedback revealed that users accepted our classification results.
• Future work
• In the present data age, this task will be handled on the basis of external data and artificial intelligence. Our approach becomes more robust
with a larger amount of data.
• Alternatively, it is promising to look directly into the content and extract knowledge for the same purposes as with metadata.
43

More Related Content

PPTX
Data Clustering Using Swarm Intelligence Algorithms An Overview
PPT
3.5 model based clustering
PDF
A survey on Efficient Enhanced K-Means Clustering Algorithm
PPT
Chap8 basic cluster_analysis
PPTX
Document clustering and classification
PPTX
Document clustering for forensic analysis
PPT
Cluster analysis
PPTX
Introduction to Clustering algorithm
Data Clustering Using Swarm Intelligence Algorithms An Overview
3.5 model based clustering
A survey on Efficient Enhanced K-Means Clustering Algorithm
Chap8 basic cluster_analysis
Document clustering and classification
Document clustering for forensic analysis
Cluster analysis
Introduction to Clustering algorithm

What's hot (18)

PDF
Big data Clustering Algorithms And Strategies
PDF
Bl24409420
PDF
10 clusbasic
PPTX
Clustering in data Mining (Data Mining)
PPTX
Data Mining: clustering and analysis
PPT
3.2 partitioning methods
PDF
F04463437
PPT
10 clusbasic
PPTX
Clusters techniques
PPT
CLUSTERING
PPT
Capter10 cluster basic
PPTX
Document clustering for forensic analysis an approach for improving compute...
PDF
Current clustering techniques
PPT
PPTX
A multi criteria evaluation of environmental databases using hasse
PDF
10 Algorithms in data mining
PPTX
Data clustring
PDF
Charting the Digital Library Evaluation Domain with a Semantically Enhanced M...
Big data Clustering Algorithms And Strategies
Bl24409420
10 clusbasic
Clustering in data Mining (Data Mining)
Data Mining: clustering and analysis
3.2 partitioning methods
F04463437
10 clusbasic
Clusters techniques
CLUSTERING
Capter10 cluster basic
Document clustering for forensic analysis an approach for improving compute...
Current clustering techniques
A multi criteria evaluation of environmental databases using hasse
10 Algorithms in data mining
Data clustring
Charting the Digital Library Evaluation Domain with a Semantically Enhanced M...
Ad

Similar to Applying a new subject classification scheme for a database by a data-driven correspondence (20)

PPTX
Application of a Novel Subject Classification Scheme for a Bibliographic Data...
PDF
Application of Ontology in Semantic Information Retrieval by Prof Shahrul Azm...
PDF
As we may link: a model to support aggregated scientific knowledge
PDF

Applying a new subject classification scheme for a database by a data-driven correspondence

  • 1. Applying a new subject classification scheme for a database by a data-driven correspondence
      Kei Kurakawa, Yuan Sun (National Institute of Informatics); Satoko Ando (Clarivate Analytics Co., Ltd.)
      These are the presentation slides for the joint conference of the 134th SIG conference of Information Fundamentals and Access Technologies (IFAT) and the 112th SIG conference of Document Communication (DC), Information Processing Society of Japan (IPSJ), March 22, 2019, at Toyo University, Hakusan Campus.
      Cite: Kei Kurakawa, Yuan Sun, and Satoko Ando, Applying a new subject classification scheme for a database by a data-driven correspondence, IPSJ SIG Technical Report, Vol.2019-IFAT-134/2019-DC-112, No.7, pp.1-10, (2019).
  • 2. Outline
      • Information science methodologies
          • Subject classification for research evaluation
      • An issue
          • How can we apply a new classification scheme for the database, cost-effectively and efficiently?
      • Our approach
          • A subject classification model of database
          • Forming a compact topological space for yet another subject classification scheme
          • Deciding a correspondence between two subject classification schemes by means of a research project database
      • A case study
          • InCites - a benchmarking tool for research evaluation
          • Web of Science subject classification scheme
          • KAKEN subject classification scheme
      • Conclusions and future work
  • 3. Major objectives on information resources
      • Library
          • Information management and knowledge organization
          • Information retrieval
      • Knowledge
          • Domain knowledge representation
          • Knowledge extraction
  • 4. Information science methodologies
      Objective | Method: terms and associations | Method: categorization
      Information management and knowledge organization / information retrieval | Thesaurus | Classification
      Domain knowledge representation / knowledge extraction | Ontology | Taxonomy
  • 5. Practices of the knowledge structures
      • Classification
          • Library classification
              • Dewey Decimal Classification (DDC) (1876 - )
              • Universal Decimal Classification (UDC) (1895 - )
              • Library of Congress Classification (LCC) (1897 - )
              • Colon Classification (CC) (1933 - )
              • Nippon Decimal Classification (NDC) (1928 - )
              • NDL Classification (NDLC) (1963 - )
          • Journal classification
              • Web of Science subject classification
      • Taxonomy
          • Scientific classification
          • Research project classification
              • KAKENHI subject classification
          • Journal classification for research evaluation
              • Essential Science Indicators subject classification
      • Thesaurus
          • Subject headings
              • Library of Congress Subject Headings (1898 - )
              • Medical Subject Headings (MeSH)
              • NDL Subject Headings (NDLSH) (1964 - )
      • Ontology
          • (Wikipedia definition) In computer science and information science, an ontology encompasses a representation, formal naming, and definition of the categories, properties, and relations between the concepts, data, and entities that substantiate one, many, or all domains.
      • Automatic document classification methods (cf. Wikipedia)
          • Expectation maximization (EM)
          • Naive Bayes classifier
          • tf–idf
          • Instantaneously trained neural networks
          • Latent semantic indexing
          • Support vector machines (SVM)
          • Artificial neural network
          • K-nearest neighbor algorithms
          • Decision trees such as ID3 or C4.5
          • Concept Mining
          • Rough set-based classifier
          • Soft set-based classifier
          • Multiple-instance learning
          • Natural language processing approaches
  • 6. Subject classification for research evaluation – a type of taxonomy
      • Subject classification for research evaluation
          • Scientific fields for national research evaluation
              • UK RAE (Research Assessment Exercise) Units of Assessment (UK)
              • UK REF (Research Excellence Framework) Units of Assessment (UK)
              • The ANVUR (Agenzia Nazionale di Valutazione del Sistema Universitario e della Ricerca) category scheme (Italy)
              • Australia ERA FoR (Excellence in Research for Australia, Fields of Research) (Australia)
              • FAPESP (The São Paulo Research Foundation) classification scheme (Brazil)
              • CAPES (Coordenação de Aperfeiçoamento de Pessoal de Nível Superior) (Brazil)
              • China SCADC (State Council Academic Degrees Committee) subject categories (China)
          • Scientific fields for international research evaluation
              • OECD category scheme (OECD)
              • GIPP (Global Institutional Profiles Project) categories (Clarivate Analytics)
      • Subject classification for funding
          • Scientific fields for research project funds
              • KAKENHI subject classification
      • Subject classification for research output
          • Journal classification
              • Web of Science subject classification
              • Essential Science Indicators subject classification
  • 7. An issue
      • In assessing research activities based on bibliometrics, analysts are accustomed to using the major citation database Web of Science, whose subject classification schemes, i.e. WoS Subject Category, ESI, and GIPP, are prepared for qualitative analysis.
      • Analysts need domestic subject classification schemes for their analysis, which are not implemented on the database.
      • Applying a new classification scheme to the database by hand is a far too labor-intensive and time-consuming task.
      • How can we apply a new classification scheme for the database, cost-effectively and efficiently?
  • 8. Our approach
      • Induce a correspondence between two subject classification schemes, one of which is already applied to the database while the other is not yet applied.
  • 9. A subject classification model of database
      • There exists a bibliographic database that represents a set of articles for scientific research. Each article is labeled with at least one category of a subject classification scheme, which means that all articles are classified under the subject classification scheme.
      • The subject classification scheme implies a compact topological space in the database. This space describes the database structure, which affects analysis by the subject classification scheme.
  • 10. Definition 1 (a database with a subject classification scheme)
      • A database $S$ is a set of articles $a_n$.
      • A subject classification scheme $C$ is a set of subject categories $c_\lambda$.
      • Articles attributed to a subject category comprise a subset of $S$, so that subject categories in a subject classification scheme refer to a family of subsets $\{O_\lambda\}_{\lambda \in \Lambda}$ of $S$. $O$ is an open set. $\Lambda$ is an index set.
      • A subset $O_\lambda$ depends on the corresponding subject category $c_\lambda$. Therefore, we define a map $f$ from a subject classification scheme $C$ to the power set $\mathfrak{P}(S)$.
  • 11. Theorem 1 (a finite cover)
      • Theorem 1
          • A practical subject classification scheme $C$ is mapped to a finite cover $\mathfrak{O}$ of $S$.
      • Proof
          • In practical databases, a subject classification scheme $C$ consists of finitely many elements $c_\lambda$ that are mapped to finitely many subsets $O_\lambda$ by a map $f$.
          • Let $\mathfrak{O}$ be the subset of $\mathfrak{P}(S)$ that consists of $\{O_i \mid i \in I\}$, where $I$ is a finite index set.
          • And, usually, $S = \bigcup_{i \in I} O_i$ $(O_i \in \mathfrak{O})$.
          • $\mathfrak{O}$ is called a finite cover of $S$.
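As a minimal sketch of Definition 1 and Theorem 1, the map $f$ can be held as a dictionary from subject categories to sets of articles, and the cover condition $S = \bigcup_i O_i$ checked directly. The article IDs and category names below are hypothetical and are not taken from the slides.

```python
# Toy database S: a set of article IDs (hypothetical values).
S = {"a1", "a2", "a3", "a4"}

# The map f: subject category -> subset of S (hypothetical labels).
f = {
    "Physics": {"a1", "a2"},
    "Chemistry": {"a2", "a3"},
    "Mathematics": {"a4"},
}


def is_finite_cover(database, category_map):
    """Return True if the union of the finitely many subsets equals the database."""
    covered = set().union(*category_map.values())
    return covered == database


print(is_finite_cover(S, f))
```

An article may sit in several subsets (here `a2`), which is why the family is a cover rather than a partition.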
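  • 12. Theorem 2 (a compact topological space)
      • Theorem 2
          • A practical subject classification scheme $C$ implies a compact topological space $(S, \widetilde{\mathfrak{O}})$, where $\widetilde{\mathfrak{O}}$ is the topology generated by the finite cover $\mathfrak{O}$.
      • Proof
          • In practical databases, a subject classification scheme $C$ consists of finitely many elements $c_i$ that are mapped to finitely many subsets $O_i$ by a map $f$.
          • Let $\mathfrak{O}$ be the subset of $\mathfrak{P}(S)$ that consists of $\{O_i \mid i \in I\}$, where $I$ is a finite index set.
          • As a basis, let $\mathfrak{O}_0$ be the subset of $\mathfrak{P}(S)$ that consists of $\{\bigcap_{i \in I} A_i \mid A_i \in \mathfrak{O}\}$, where the element is $S$ if $I = \emptyset$.
          • Let $\widetilde{\mathfrak{O}}$ be the subset of $\mathfrak{P}(S)$ that consists of $\{\bigcup_{\lambda \in \Lambda} B_\lambda \mid B_\lambda \in \mathfrak{O}_0\}$, where the element is $\emptyset$ if $\Lambda = \emptyset$. $\Lambda$ is a finite or infinite index set.
          • Thus, $\widetilde{\mathfrak{O}} \supset \mathfrak{O}$, $S \in \widetilde{\mathfrak{O}}$, and $\emptyset \in \widetilde{\mathfrak{O}}$.
          • $\widetilde{\mathfrak{O}}$ satisfies the necessary and sufficient conditions to be a topology.
          • Together with Theorem 1, this implies a compact topological space $(S, \widetilde{\mathfrak{O}})$.
          • When a topological space has a finite cover, we call it a compact topological space.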
  • 13. Forming a compact topological space for yet another subject classification scheme
      • A prior condition
          • A subject classification scheme $C^1$ that consists of subject categories $c_i^1$ is mapped to a finite cover $\mathfrak{O}^1 = \{O_i^1 \mid i \in I^1\}$ by a map $f_1$, which implies a compact topological space $(S, \widetilde{\mathfrak{O}}^1)$ (the tilde denotes the topology generated by the cover).
      • Direct category assignment approach
          • In the same way, we assign subject categories $c_i^2$ of a new classification scheme $C^2$ to each article of $S$.
          • This creates a map $f_2$ from $C^2$ to a finite cover $\mathfrak{O}^2 = \{O_i^2 \mid i \in I^2\}$, which implies a compact topological space $(S, \widetilde{\mathfrak{O}}^2)$.
      • Indirect correspondence approach (our approach)
          • We build a correspondence $\Gamma: C^2 \to C^1$ ($\Gamma = (C^2, C^1; G)$, $G \subset C^2 \times C^1$), where $c_i^2 \in C^2$, $c_j^1 \in C^1$, $c_i^2 \times c_j^1 \in G$, $C^2 = \bigcup_i c_i^2$, and $C^1 = \bigcup_j c_j^1$, to guarantee the existence of a finite cover.
          • Then, we create a map $g_1: C^2 \to \mathfrak{C}^1 = \{C_i^1 \mid c_i^2 \in C^2,\ c_j^1 \in C^1,\ c_i^2 \times c_j^1 \in G,\ i \in I^2,\ C_i^1 = \bigcup_{j \in I_i^1} c_j^1\}$, where $S = \bigcup_{i \in I^2} C_i^1$, to be a finite cover.
          • Finally, we create a map $g_2: \mathfrak{C}^1 \to \hat{\mathfrak{O}}^1 = \{\hat{O}_i^1 \mid C_i^1 \in \mathfrak{C}^1,\ c_j^1 \in C_i^1,\ O_j^1 = f_1(c_j^1),\ \hat{O}_i^1 = \bigcup_{j \in I_i^1} O_j^1\}$, where $S = \bigcup_{i \in I^2} \hat{O}_i^1$, to be a finite cover.
          • We get a composite map $g_2 \circ g_1$ from $C^2$ to a finite cover $\hat{\mathfrak{O}}^1$, which implies a compact topological space on $S$. Obviously, $\hat{\mathfrak{O}}^1 \subset \widetilde{\mathfrak{O}}^1$.
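The indirect correspondence approach can be illustrated with a small sketch: the graph $G$ is a set of category pairs, $g_1$ collects the existing-scheme categories paired with a new-scheme category, and $g_2$ takes the union of their article sets under $f_1$. All identifiers and data below (`k1`, `w1`, `a1`, and so on) are hypothetical and only serve to show the composite $g_2 \circ g_1$.

```python
# f1: categories of the existing scheme C^1 -> sets of articles (hypothetical).
f1 = {
    "w1": {"a1", "a2"},
    "w2": {"a2", "a3"},
    "w3": {"a4"},
}

# Graph G of the correspondence, a subset of C^2 x C^1 (hypothetical pairs).
G = {("k1", "w1"), ("k1", "w2"), ("k2", "w3")}


def g1(c2, graph):
    """Map a new-scheme category to the set of old-scheme categories paired with it."""
    return {c1 for (k, c1) in graph if k == c2}


def g2(c1_subset, f1_map):
    """Map a set of old-scheme categories to the union of their article sets."""
    articles = set()
    for c1 in c1_subset:
        articles |= f1_map[c1]
    return articles


def induced_cover(c2_categories, graph, f1_map):
    """Apply the composite g2 . g1 to every category of the new scheme."""
    return {c2: g2(g1(c2, graph), f1_map) for c2 in c2_categories}


cover = induced_cover(["k1", "k2"], G, f1)
print(cover)
```

In this toy case every article ends up in at least one induced subset, so the induced family is again a finite cover; that holds only when every old-scheme category that holds articles appears in some pair of $G$.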
  • 14. Deciding a correspondence between two subject classification schemes
      • Expert-driven approach
          • Experts of the two subject classification schemes decide a correspondence between them based on their knowledge and practical experiences.
      • Data-driven approach (our approach)
          • Data scientists analyze a database where an entity is categorized with the two subject classification schemes, and decide a correspondence between them based on the analysis.
  • 15. By means of a research project database
      • A research project database
          • A database $T$ describes research projects $b_n$, one of whose outputs is a list of research articles $a_n$ in a database $S$.
          • Research articles $a_n$ of $S$ are categorized with a subject classification scheme $C^1$. We define a map $f_1$ where $C^1$ is mapped to a finite cover $\mathfrak{O}_S^1 = \{O_i^1 \mid i \in I^1\}$ of $S$, which implies a compact topological space $(S, \widetilde{\mathfrak{O}}_S^1)$ (the tilde denotes the topology generated by the cover).
          • Research projects $b_n$ of $T$ are categorized with a subject classification scheme $C^2$. We define a map $h_1$ where $C^2$ is mapped to a finite cover $\mathfrak{O}_T^2 = \{O_i^2 \mid i \in I^2\}$ of $T$, which implies a compact topological space $(T, \widetilde{\mathfrak{O}}_T^2)$.
          • Research projects $b_n$ produce a set of research articles $a_n$, so we define a map $h_2: T \to \mathfrak{P}(S)$ to express this. Here, let the codomain of the map be reduced to its image $\mathfrak{S} \subset \mathfrak{P}(S)$ so that it is a surjection. Then, we also define a map $h_2': T \to \mathfrak{P}(S')$, where $S' = \bigcup_{i \in I_{\mathfrak{S}}} O_i$ $(O_i \in \mathfrak{S})$ and $S' \subset S$. For the image $S'$, we define a map $f_1'$ where $C^1$ is mapped to a finite cover $\mathfrak{O}_{S'}^1 = \{O_i'^1 \mid i \in I^1\}$ of $S'$, which implies a compact topological space $(S', \widetilde{\mathfrak{O}}_{S'}^1)$.
      • Create a map
          • We create a map $h_3: \mathfrak{O}_T^2 \to \mathfrak{O}_{S'}^2 = \{O_{S'i}^2 \mid O_{Ti}^2 \in \mathfrak{O}_T^2,\ b_j^2 \in O_{Ti}^2,\ O_{S'j}^2 = h_2'(b_j^2),\ O_{S'i}^2 = \bigcup_j O_{S'j}^2\}$, which is a subset of $\mathfrak{P}(S')$, where $\mathfrak{O}_{S'}^2$ is a finite cover.
          • We get a composite map $h_3 \circ h_1: C^2 \to \mathfrak{O}_{S'}^2$. Since $\mathfrak{O}_{S'}^2$ is a finite cover, it induces a compact topological space.
      • Supposition
          • The composite map $h_3 \circ h_1: C^2 \to \mathfrak{O}_{S'}^2$ represents the classification of articles by the subject classification scheme.
          • If the two images on $S'$ under the map $f_1'$ and the map $h_3 \circ h_1$ are equivalent, their inverse images are in an equivalence relation.
  • 16. Data-driven approach to decide a correspondence
      • An observation
          • In a database $S'$, elements of the finite covers $\mathfrak{O}_{S'}^1$ and $\mathfrak{O}_{S'}^2$ represent naturally overlapping sets.
          • For an $O^2$ $(\in \mathfrak{O}_{S'}^2)$, there exist its intersections $O^2 \cap O^1$ with all $O^1$ $(\in \mathfrak{O}_{S'}^1)$. Their cardinalities greater than zero, if sorted in rank order, obey a discrete version of a generalized beta distribution [Martínez-Mekler, et al. (2009)], which is given by $f(r) = A (N + 1 - r)^b / r^a$, where $r$ is the rank, $N$ its maximum value, $A$ the normalization constant, and $(a, b)$ two fitting exponents.
      • Decide a correspondence by calculating precision and recall
          • To decide a correspondence, find a subset $\{O_i^1 \mid i \in I_1\}$ of $\mathfrak{O}_{S'}^1$ for an $O_j^2$ $(j \in I_2)$ that satisfies $O_j^2 = \bigcup_i O_i^1$.
          • In most cases, $O_j^2 \not\supset O_i^1$ and $O_j^2 \neq O_i^1$.
          • So we define the following metrics:
              • $d_p = \dfrac{\left|\bigcup_{i \in I_j^1} (O_j^2 \cap O_i^1)\right|}{\left|\bigcup_{i \in I_j^1} O_i^1\right|}$ (precision),
              • $d_r = \dfrac{\left|\bigcup_{i \in I_j^1} (O_j^2 \cap O_i^1)\right|}{\left|O_j^2\right|}$ (recall).
          • And a generalized harmonic mean of precision and recall:
              • $d_f = \dfrac{(1 + \beta^2)\, d_p\, d_r}{\beta^2 d_p + d_r}$, $\beta > 0$ ($F_\beta$-measure).
          • Finally, we decide a threshold of the F-measure to determine which elements have a correspondence relation.
      Martínez-Mekler, G., Alvarez Martínez, R., Beltrán del Río, M., Mansilla, R., Miramontes, P., & Cocho, G. (2009). Universality of rank-ordering distributions in the arts and sciences. PloS One, 4(3), e4791. http://doi.org/10.1371/journal.pone.0004791
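A short sketch of the precision, recall, and $F_\beta$ computation for one new-scheme subset against a chosen group of existing-scheme subsets follows. It assumes the metrics are cardinalities of set unions, which is one reading of the slide's notation; the function name and the article IDs are hypothetical.

```python
def precision_recall_f(target, candidates, beta=1.0):
    """Score how well the union of candidate subsets matches the target subset.

    target:     set of articles in one new-scheme category (O_j^2)
    candidates: list of article sets from the existing scheme (the chosen O_i^1)
    """
    union = set().union(*candidates)
    hit = target & union
    precision = len(hit) / len(union) if union else 0.0
    recall = len(hit) / len(target) if target else 0.0
    denom = beta ** 2 * precision + recall
    f_beta = (1 + beta ** 2) * precision * recall / denom if denom else 0.0
    return precision, recall, f_beta


# Hypothetical article sets.
O2_j = {"a1", "a2", "a3", "a4"}
chosen = [{"a1", "a2", "a5"}, {"a3", "a6"}]

p, r, f1_score = precision_recall_f(O2_j, chosen)
print(p, r, f1_score)
```

With `beta=1.0` the result is the ordinary F1 score; a larger `beta` weights recall more heavily than precision.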
  • 17. Application example
      • InCites™ (Clarivate Analytics)
          • A world class research evaluation platform
      • User scenarios
          • Users
              • Research organizations
              • Funding and policy organizations
              • Publishers
          • The users can
              • Identify and manage research activities and their impact,
              • Benchmark and compare performance to peers,
              • Identify experts both inside and outside the organization,
              • Identify emerging subject areas, researchers, and experts,
              • Manage funding activity from submission to progress reports through outcomes,
              • Demonstrate results and impact of funding policy,
              • Identify new trends and key indicators to enable policy development,
              • Uncover new or emerging areas in which to publish,
              • Monitor trends within a field or geographic region,
              • Identify the best authors and reviewers,
              • Maintain competitive advantage by monitoring the competition.
      • Dataset
          • Web of Science™ Core Collection
      • Entities
          • People
          • Organizations
          • Regions
          • Research Areas
          • Journals, Books, Conference Proceedings
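  • 18. Linking bibliographic entities between WoS and KAKEN
      [Figure: a project $b$ in KAKEN ($T$) is linked by $h$ to the set $O$ of its articles $a_1', a_2', a_3'$ in $S'$; bibliographic linkage identifies them with Web of Science ($S$) articles: $a_1' \equiv a_1$, $a_2' \equiv a_2$, $a_3' \equiv a_3$.]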
  • 19. A bibliographic linkage [Kurakawa, et al. 2014]
      • Databases
          • KAKEN as of 2009
              • 173,940 article citations in English
          • WoS as of 2009, 2010
              • 3,925,776 article citations
      • Method
          • Identifying pairs of article citations by the following techniques
              • i-Linkage
                  • Blocking: the top 5 candidate WoS article citations for each KAKEN article citation are taken as candidate pairs of citations
              • SVM (support vector machine)
                  • Classifying each candidate pair of citations as a true match or not
      • Output result
          • 75,042 pairs of citations
              • 43.1% of the 173,940 from KAKEN
          • 10-fold cross validation (800 true pairs judged by humans)
              • Accuracy 95.01
              • Precision 94.92
              • Recall 95.10
              • F-Measure 94.98
          • Pairs of citations that are categorized with both subject classification schemes
              • 59,595 pairs of citations
      Kurakawa, K., Sun, Y., & Aizawa, A. (2014). Mapping between research fields of grants-in-aid for scientific research and web of science subject areas. NII Technical Reports. National Institute of Informatics. Retrieved from https://www.nii.ac.jp/TechReports/public_html/14-002J.html
  • 20. Subject classification schemes
      • Web of Science subject classification scheme
          • Web of Science subject areas
              • 251 subject categories
          • ESI research areas
              • 22 subject categories
          • GIPP research areas
              • 6 subject categories
      • KAKEN subject classification scheme (as of 2009)
          • 4 categories
          • 10 areas
          • 67 disciplines
          • 284 research fields
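  • 21. A contingency table for two subject classification schemes
      Rows are Web of Science subject categories $O_1^1, \ldots, O_m^1$; columns are KAKENHI subject categories $O_1^2, \ldots, O_n^2$:
      $$\begin{array}{c|ccccc} & O_1^2 & \cdots & \cdots & \cdots & O_n^2 \\ \hline O_1^1 & f_{11} & \cdots & f_{1j} & \cdots & f_{1n} \\ \vdots & \vdots & \ddots & \vdots & & \vdots \\ & f_{i1} & \cdots & f_{ij} & \cdots & f_{in} \\ \vdots & \vdots & & \vdots & \ddots & \vdots \\ O_m^1 & f_{m1} & \cdots & f_{mj} & \cdots & f_{mn} \end{array}$$
      where $f_{ij} = |O_i^1 \cap O_j^2|$.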
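A contingency table of this kind can be tallied from linked records, as in the sketch below. It assumes each linked record carries the set of Web of Science categories of the article and the set of KAKENHI categories of the producing project; the record layout and the labels `w1`, `k1`, and so on are hypothetical.

```python
from collections import Counter

# Each linked record: (WoS categories of the article, KAKENHI categories of its project).
# Hypothetical data for illustration only.
linked_pairs = [
    ({"w1"}, {"k1"}),
    ({"w1", "w2"}, {"k1"}),
    ({"w2"}, {"k2"}),
]


def contingency_table(pairs):
    """Count f_ij = number of linked articles labeled with both WoS category i and KAKENHI category j."""
    table = Counter()
    for wos_cats, kaken_cats in pairs:
        for w in wos_cats:
            for k in kaken_cats:
                table[(w, k)] += 1
    return table


table = contingency_table(linked_pairs)
print(sorted(table.items()))
```

Because an article with several categories contributes to several cells, the sum of all cells can exceed the number of linked records.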
  • 22. An example contingency table (a partial view of 251 WoS and 67 KAKENHI subject categories)
  • 23. Analysis of the contingency table
      The discrete generalized beta distribution (DGBD): $f(r) = A (N + 1 - r)^b / r^a$, where $r$ is the rank value, $N$ its maximum value, $A$ a normalization constant, and $(a, b)$ two fitting exponents.
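To make the DGBD formula concrete, the sketch below evaluates $f(r) = A (N + 1 - r)^b / r^a$ and picks $A$ so that the values over ranks $1, \ldots, N$ sum to a given total. The parameter values are hypothetical, and the normalization helper is an illustrative assumption rather than the fitting procedure used in the slides.

```python
def dgbd(r, N, A, a, b):
    """Discrete generalized beta distribution value at rank r (1 <= r <= N)."""
    return A * (N + 1 - r) ** b / r ** a


def normalization_constant(N, a, b, total=1.0):
    """Choose A so that the DGBD values over ranks 1..N sum to `total`."""
    unnormalized = sum(dgbd(r, N, 1.0, a, b) for r in range(1, N + 1))
    return total / unnormalized


# Hypothetical parameters.
N, a, b = 5, 1.0, 0.5
A = normalization_constant(N, a, b)
values = [dgbd(r, N, A, a, b) for r in range(1, N + 1)]
print(values)
```

For non-negative exponents the values decrease with rank: $a$ controls the decay at the head of the ranking and $b$ the drop-off at its tail.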
  • 26. Pseudo precision and recall for subject categories of a new subject classification scheme
      [Figure: $O_j^2$ overlapping $O_1^1$, $O_2^1$, $O_3^1$, and $O_4^1$.]
      As for the whole counting of papers, we define
      • $d_p' = \dfrac{\sum_i |O_j^2 \cap O_i^1|}{\sum_i |O_i^1|}$ (pseudo precision),
      • $d_r' = \dfrac{\sum_i |O_j^2 \cap O_i^1|}{|O_j^2|}$ (pseudo recall),
      • $d_f' = \dfrac{(1 + \beta^2)\, d_p'\, d_r'}{\beta^2 d_p' + d_r'}$, $\beta > 0$ (pseudo $F_\beta$-measure).
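The pseudo metrics can be computed from contingency counts alone. The sketch below reads "whole counting" as summing overlap counts and category sizes, and then picks the number of existing-scheme categories that maximizes the pseudo F-measure by adding them in descending order of overlap. That selection procedure is an assumption made for illustration, and all names and numbers are hypothetical.

```python
def pseudo_scores(overlaps, sizes, target_size, beta=1.0):
    """Pseudo precision, recall, and F for a chosen group of existing-scheme categories.

    overlaps[i] = |O_j^2 ∩ O_i^1|, sizes[i] = |O_i^1|, target_size = |O_j^2|.
    """
    hit = sum(overlaps)
    total = sum(sizes)
    p = hit / total if total else 0.0
    r = hit / target_size if target_size else 0.0
    denom = beta ** 2 * p + r
    f = (1 + beta ** 2) * p * r / denom if denom else 0.0
    return p, r, f


def best_prefix(candidates, target_size, beta=1.0):
    """candidates: list of (name, overlap, size). Returns (names, p, r, f) with maximal F."""
    ranked = sorted(candidates, key=lambda c: c[1], reverse=True)
    best = ([], 0.0, 0.0, 0.0)
    for k in range(1, len(ranked) + 1):
        chosen = ranked[:k]
        p, r, f = pseudo_scores(
            [c[1] for c in chosen], [c[2] for c in chosen], target_size, beta
        )
        if f > best[3]:
            best = ([c[0] for c in chosen], p, r, f)
    return best


# Hypothetical counts for one new-scheme category with 100 linked articles.
candidates = [("w1", 40, 50), ("w2", 30, 60), ("w3", 5, 100)]
names, p, r, f = best_prefix(candidates, target_size=100)
print(names, p, r, f)
```

In this toy case adding the third category lowers the score, because its small overlap does not offset the precision lost to its large size.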
  • 27. Maximum pseudo F-measure
      4 categories – 67 disciplines
      KAKEN subject category | Translation | # of WoS subject categories to cover | Pseudo precision | Pseudo recall | Max pseudo F1 measure
      (01-01) 情報学 | Informatics | 17 | 0.57582 | 0.62589 | 0.59981
      (01-02) 神経科学 | Brain sciences | 1 | 0.21829 | 0.36497 | 0.27318
      (01-03) 実験動物学 | Laboratory animal science | 1 | 0.05863 | 0.07438 | 0.06557
      (01-04) 人間医工学 | Biomedical engineering | 8 | 0.22199 | 0.21253 | 0.21716
      (01-05) 健康・スポーツ科学 | Health / sports science | 5 | 0.18095 | 0.29028 | 0.22293
      (01-06) 生活科学 | Human life science | 4 | 0.23905 | 0.28051 | 0.25813
      (01-07) 科学教育・教育工学 | Science education / educational technology | 2 | 0.37736 | 0.10309 | 0.16194
      (01-08) 科学社会学・科学技術史 | Sociology / history of science and technology | 6 | 0.11111 | 0.16279 | 0.13208
      (01-09) 文化財科学 | Cultural assets study | 1 | 0.2 | 0.03636 | 0.06154
      (01-10) 地理学 | Geography | 4 | 0.11719 | 0.2027 | 0.14851
      (01-11) 環境学 | Environmental science | 14 | 0.26227 | 0.3853 | 0.3121
      (01-12) ナノ・マイクロ科学 | Nano / micro science | 4 | 0.10326 | 0.31317 | 0.15531
      (01-13) 社会・安全システム科学 | Social / safety system science | 14 | 0.18656 | 0.21429 | 0.19946
      (01-14) ゲノム科学 | Genome science | 3 | 0.04047 | 0.20305 | 0.06748
      (01-15) 生物分子科学 | Biomolecular science | 2 | 0.11913 | 0.32457 | 0.17429
      (01-16) 資源保全学 | Resource conservation science | 3 | 0.18116 | 0.14535 | 0.16129
      (01-17) 地域研究 | Area studies | 7 | 0.16429 | 0.27059 | 0.20444
      (01-18) ジェンダー | Gender | 3 | 0.23077 | 0.11111 | 0.15
  • 28. Maximum pseudo F-measure (continued)
      4 categories – 67 disciplines
      KAKEN subject category | Translation | # of WoS subject categories to cover | Pseudo precision | Pseudo recall | Max pseudo F1 measure
      (02-01) 哲学 | Philosophy | 4 | 0.4359 | 0.28333 | 0.34343
      (02-02) 芸術学 | Art studies | 1 | 0.09091 | 0.11111 | 0.1
      (02-03) 文学 | Literature | 10 | 0.7 | 0.68293 | 0.69136
      (02-04) 言語学 | Linguistics | 3 | 0.70504 | 0.41004 | 0.51852
      (02-05) 史学 | History | 6 | 0.41176 | 0.34146 | 0.37333
      (02-06) 人文地理学 | Human geography | 3 | 0.175 | 0.5 | 0.25926
      (02-07) 文化人類学 | Cultural anthropology | 3 | 0.05634 | 0.10526 | 0.07339
      (02-08) 法学 | Law | 3 | 0.38462 | 0.12195 | 0.18519
      (02-09) 政治学 | Politics | 2 | 0.40909 | 0.45763 | 0.432
      (02-10) 経済学 | Economics | 12 | 0.6917 | 0.62198 | 0.65499
      (02-11) 経営学 | Management | 5 | 0.29412 | 0.38462 | 0.33333
      (02-12) 社会学 | Sociology | 8 | 0.17606 | 0.27778 | 0.21552
      (02-13) 心理学 | Psychology | 14 | 0.4878 | 0.47859 | 0.48315
      (02-14) 教育学 | Education | 9 | 0.24375 | 0.25828 | 0.2508
      (03-01) 数学 | Mathematics | 4 | 0.73424 | 0.79181 | 0.76194
      (03-02) 天文学 | Astronomy | 1 | 0.5052 | 0.86965 | 0.63912
      (03-03) 物理学 | Physics | 6 | 0.49831 | 0.65128 | 0.56462
      (03-04) 地球惑星科学 | Earth and planetary science | 7 | 0.6186 | 0.66222 | 0.63967
      (03-05) プラズマ科学 | Plasma science | 1 | 0.23261 | 0.19094 | 0.20973
      (03-06) 基礎化学 | Basic chemistry | 7 | 0.22929 | 0.80065 | 0.35649
      (03-07) 複合化学 | Applied chemistry | 6 | 0.28307 | 0.52645 | 0.36817
      (03-08) 材料化学 | Materials chemistry | 7 | 0.1571 | 0.34801 | 0.21647
      (03-09) 応用物理学・工学基礎 | Applied physics | 5 | 0.17011 | 0.39374 | 0.23758
      (03-10) 機械工学 | Mechanical engineering | 11 | 0.43053 | 0.38804 | 0.40818
      (03-11) 電気電子工学 | Electrical and electronic engineering | 10 | 0.33758 | 0.66933 | 0.4488
      (03-12) 土木工学 | Civil engineering | 8 | 0.37069 | 0.48383 | 0.41977
      (03-13) 建築学 | Architecture and building engineering | 3 | 0.28571 | 0.50588 | 0.36518
      (03-14) 材料工学 | Material engineering | 6 | 0.34794 | 0.52269 | 0.41778
      (03-15) プロセス工学 | Process / chemical engineering | 4 | 0.14529 | 0.30553 | 0.19694
      (03-16) 総合工学 | Integrated engineering | 8 | 0.25637 | 0.30922 | 0.28032
  • 29. Maximum pseudo F-measure (continued)
      Average pseudo precision: 0.31469 | Average pseudo recall: 0.36724 | Average pseudo F1 measure: 0.31718
      4 categories – 67 disciplines
      KAKEN subject category | Translation | # of WoS subject categories to cover | Pseudo precision | Pseudo recall | Max pseudo F1 measure
      (04-01) 基礎生物学 | Basic biology | 7 | 0.375 | 0.39992 | 0.38706
      (04-02) 生物科学 | Biological science | 4 | 0.16679 | 0.58193 | 0.25927
      (04-03) 人類学 | Anthropology | 3 | 0.31504 | 0.44 | 0.36718
      (04-04) 農学 | Plant production and environmental agriculture | 4 | 0.30676 | 0.44939 | 0.36462
      (04-05) 農芸化学 | Agricultural chemistry | 6 | 0.22042 | 0.38632 | 0.28069
      (04-06) 林学 | Forest and forest products science | 5 | 0.40751 | 0.25224 | 0.3116
      (04-07) 水産学 | Applied aquatic science | 2 | 0.4185 | 0.32702 | 0.36715
      (04-08) 農業経済学 | Agricultural science in society and economy | 2 | 0.33333 | 0.09677 | 0.15
      (04-09) 農業工学 | Agro-engineering | 4 | 0.15686 | 0.25926 | 0.19546
      (04-10) 畜産学・獣医学 | Animal life science | 4 | 0.51054 | 0.38655 | 0.43998
      (04-11) 境界農学 | Boundary agriculture | 4 | 0.2346 | 0.14787 | 0.18141
      (04-12) 薬学 | Pharmacy | 4 | 0.29417 | 0.3694 | 0.32752
      (04-13) 基礎医学 | Basic medicine | 16 | 0.21266 | 0.55141 | 0.30695
      (04-14) 境界医学 | Boundary medicine | 12 | 0.16156 | 0.11176 | 0.13213
      (04-15) 社会医学 | Society medicine | 8 | 0.28153 | 0.26197 | 0.2714
      (04-16) 内科系臨床医学 | Clinical internal medicine | 24 | 0.44074 | 0.61743 | 0.51433
      (04-17) 外科系臨床医学 | Clinical surgery | 20 | 0.41795 | 0.468 | 0.44156
      (04-18) 歯学 | Dentistry | 3 | 0.64007 | 0.27983 | 0.38941
      (04-19) 看護学 | Nursing | 2 | 0.73684 | 0.44304 | 0.55336
  • 30. Miscellaneous considerations to decide a correspondence
      • A threshold for decision
          • At most the top 4 ranked elements have correspondence relations.
              • For every Web of Science subject category $O_i^1$, the number of relations with KAKENHI subject categories $O_j^2$ is limited to at most 4.
              • For every Web of Science subject category $O_i^1$, once the recall rate exceeds one half, we stop adding further relations.
      • Decision by experts
          • Professionals who know the subject classification schemes check all correspondences between $O_i^1$ and $O_j^2$.
          • They add or remove correspondence relations between them by means of subject classification keywords.
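The two threshold rules can be sketched as a small selection routine over one row of the contingency table. It assumes that "recall rate" means the cumulative share of the Web of Science category's linked counts covered by the KAKENHI categories selected so far; that reading, the function name, and the counts are hypothetical.

```python
def select_relations(row_counts, max_relations=4, recall_stop=0.5):
    """Pick KAKENHI categories for one WoS category from its contingency-table row.

    row_counts: dict mapping KAKENHI category -> overlap count with this WoS category.
    Categories are added in descending order of count until either `max_relations`
    are selected or the cumulative share of the row exceeds `recall_stop`.
    """
    total = sum(row_counts.values())
    if total == 0:
        return []
    ranked = sorted(row_counts.items(), key=lambda kv: kv[1], reverse=True)
    selected, covered = [], 0
    for name, count in ranked:
        if len(selected) >= max_relations or count == 0:
            break
        selected.append(name)
        covered += count
        if covered / total > recall_stop:
            break
    return selected


# Hypothetical row: two categories already cover more than half of the counts.
print(select_relations({"k1": 30, "k2": 25, "k3": 20, "k4": 15, "k5": 10}))
```

A routine like this only produces candidates; the slide's second step, review by experts, remains necessary.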
  • 31. InCites example screen (Analysis by KAKEN subject classification scheme)
      • WoS Documents: 58,395,008 for Web of Science subject categories
      • WoS Documents: 3,192,449 for Web of Science subject categories limited with “LOCATION = JAPAN”
      • WoS Documents: 3,191,448 for KAKEN L3 subject categories limited with “LOCATION = JAPAN”
      (a snapshot of 2018-12-14)
  • 32. InCites outputs by subject classification schemes
      • WoS Documents
          • “LOCATION = JAPAN”
      • Web of Science subject areas
          • 251 subject categories
      • ESI research areas
          • 22 subject categories
      • KAKEN subject classification scheme (as of 2009)
          • 10 areas (KAKEN L2)
          • 67 disciplines (KAKEN L3)
  • 37. User feedback
      • KAKEN classification scheme
          • April 2016, released on InCites Benchmarking
      • User survey
          • March 2017 by online questionnaire for institutional active users
          • 18 questions
      • Results
          • Feedback from 26 institutional users
      User role in the institution | Yes (multiple answers possible)
      RA (research administrator) | 20
      Administrator / officer | 3
      IR (institutional research) staff | 5
      Others | 2
  • 38. User feedback results (degree of expertise)
      Other: 4, when needed; 1, when evaluating researchers
  • 39. User feedback results (validity of applying KAKENHI subject classification scheme)
      Other: 1, I need more detailed categories
  • 40. User feedback results (miscellaneous user voices)
      • Comments
          • Needs of KAKENHI subject classification scheme
              • When I analyze a set of Web of Science documents by the KAKENHI subject classification scheme, the top 1% paper metrics based on WoS categories and ESI categories do not fit well. I need a feature to recalculate the citation ranking by the KAKENHI subject classification scheme.
              • I need the KAKENHI subject classification scheme in the Web of Science search service as well.
              • I hope the KAKENHI subject classification scheme will be updated to the new one where possible. (It might be hard to keep up with updates, since it changes every year.)
              • It is very timely to add the KAKENHI subject classification scheme.
              • Although only in limited subject areas, the KAKENHI subject classification scheme lets us analyze research precisely because it has a higher resolution than ESI, which makes it possible to map “Animal & Plant Science” of ESI to more precise categories, i.e. “Applied Physics”, “Applied Chemistry”, etc. of Japan.
          • Need more precise categories of KAKENHI subject classification scheme
              • The sixty-odd categories of KAKENHI are not sufficient for relative comparison of research, compared with ESI (22 only) and WoS (251, four times as many or more). It may also cause over-evaluation in comparisons between research fields, because the KAKENHI subject classification is made in a clock counter-like classification method. We need more accurate analysis of more concrete examples.
  • 41. Discussion (1)
      • Our approach, i.e. deciding a correspondence between two subject classification schemes, has an inherent limitation.
          • In natural correlations between subject categories of two subject classification schemes, each subject category of one scheme partly overlaps several subject categories of the other scheme.
          • There is no inclusion relationship between them.
          • Correspondence relations are probabilistic.
      • Research projects and journal articles have both similarities and differences in subject.
          • Projects and articles have a strong correlation in subject.
              • In our approach, we used a grants database which describes how research projects produce outputs, i.e. research articles.
              • We focused on the subject classification scheme for the research projects and its relationship to a set of research articles. Research articles are classified with another subject classification scheme.
              • We compared those two subject classification schemes through this relationship.
          • But they also differ in subject.
              • Projects precede articles. There is a time lag between the start of a project and its article outputs. This causes subject divergence or drift between them.
              • Projects tend to indicate the central concept with essential keywords. This allows subject diversification of articles.
  • 42. Discussion (2)
      • Nevertheless, the classification results were accepted by InCites users.
          • The users might focus on comparative analysis of bibliometrics by the subject categories, and not care about specific cases of articles.
          • They might need only rough quality of metrics at the evaluation stage.
          • Metrics are central limits of quantitative attributes of a set of entities, which are the main indicators to be checked for research evaluation.
      • Our approach is extremely cost-effective.
          • The number of journal titles in the Web of Science citation database is 24,688.
              • http://mjl.clarivate.com/cgi-bin/jrnlst/jlresults.cgi
          • The number of Web of Science documents in InCites is 58,395,008.
          • The number of subject category pairs for which to decide a correspondence is 16,817.
              • For KAKEN 67 - WoS 251, the number of pairs is 67 × 251 = 16,817.
              • For KAKEN 10 - WoS 251, the number of pairs is 10 × 251 = 2,510.
      • The evidence data is too small.
          • The sum of frequency counts of the contingency table is 97,175. It is not enough to automatically decide a correspondence between subject classification schemes. Manual handling was needed.
  • 43. Conclusions and future work
      • We proposed an approach to apply a new subject classification scheme to a bibliographic database that is already classified by another subject classification scheme.
          • We defined a subject classification model of a database that consists of a topological space.
          • Then, we showed our approach based on the model, where the key step is to form a compact topological space for a new subject classification scheme.
          • To form the space, it utilizes a correspondence between two subject classification schemes, derived from a research project database as data.
      • We applied the approach to a practical example, i.e. InCites, a benchmarking tool for research evaluation based on the Web of Science citation database, so as to add the KAKENHI subject classification scheme.
          • Subject classification schemes
              • Web of Science subject categories: 251
              • KAKENHI subject categories: 67 / 10
          • 59,595 pairs of bibliographic records classified by WoS subject categories and KAKEN subject categories induce a correspondence between the two subject classification schemes.
          • User feedback revealed that users accepted our classification results.
      • Future work
          • In the present data age, this task will be handled on the basis of external data and artificial intelligence. Our approach becomes more robust with a larger amount of data.
          • Alternatively, it is promising to look directly into content and extract knowledge for the same purposes as with metadata.