Applying a new subject classification scheme for a database by a data-driven correspondence

Kei Kurakawa, Yuan Sun
National Institute of Informatics
Satoko Ando
Clarivate Analytics Co., Ltd.
These are the presentation slides for the joint conference of the 134th SIG conference of Information Fundamentals and Access
Technologies (IFAT) and the 112th SIG conference of Document Communication (DC), Information Processing Society of Japan (IPSJ),
held on March 22, 2019, at Toyo University, Hakusan Campus.
Cite: Kei Kurakawa, Yuan Sun, and Satoko Ando, Applying a new subject classification scheme for a database by a data-driven
correspondence, IPSJ SIG Technical Report, Vol.2019-IFAT-134/2019-DC-112, No.7, pp.1-10, (2019).

Outline
• Information science methodologies
• Subject classification for research evaluation
• An issue
• How can we apply a new classification scheme for the database, cost-effectively and efficiently?
• Our approach
• A subject classification model of database
• Forming a compact topological space for yet another subject classification scheme
• Deciding a correspondence between two subject classification schemes by means of a research
project database
• A case study
• InCites - a benchmarking tool for research evaluation
• Web of Science subject classification scheme
• KAKEN subject classification scheme
• Conclusions and future work

Major objectives on information resources
• Library
• Information management and knowledge organization
• Information retrieval
• Knowledge
• Domain knowledge representation
• Knowledge extraction

Information science methodologies
Objective | Method: terms and associations | Method: categorization
Information management and knowledge organization / information retrieval | Thesaurus | Classification
Domain knowledge representation / knowledge extraction | Ontology | Taxonomy

Practices of the knowledge structures
• Classification
• Library classification
• Dewey Decimal Classification (DDC) (1876 - )
• Universal Decimal Classification (UDC) (1895 - )
• Library of Congress Classification (LCC) (1897 - )
• Colon Classification (CC) (1933 - )
• Nippon Decimal Classification (NDC) (1928 - )
• NDL Classification (NDLC) (1963 - )
• Journal classification
• Web of Science subject classification
• Taxonomy
• Scientific classification
• Research project classification
• KAKENHI subject classification
• Journal classification for research evaluation
• Essential Science Indicator subject classification
• Thesaurus
• Subject headings
• Library of Congress Subject Headings (1898 - )
• Medical Subject Headings (MeSH)
• NDL Subject Headings (NDLSH)(1964 - )
• Ontology
• (Wikipedia definition) In computer science and information
science, an ontology encompasses a representation, formal
naming, and definition of the categories, properties,
and relations between the concepts, data, and entities that
substantiate one, many, or all domains.
• Automatic document classification methods (cf. Wikipedia)
• Expectation maximization (EM)
• Naive Bayes classifier
• tf–idf
• Instantaneously trained neural networks
• Latent semantic indexing
• Support vector machines (SVM)
• Artificial neural network
• K-nearest neighbor algorithms
• Decision trees such as ID3 or C4.5
• Concept Mining
• Rough set-based classifier
• Soft set-based classifier
• Multiple-instance learning
• Natural language processing approaches

Subject classification for research evaluation –
a type of taxonomy
• Subject classification for research evaluation
• Scientific fields for national research evaluation
• UK RAE (Research Assessment Exercise) Units of
Assessment (UK)
• UK REF (Research Excellence Framework) Units of
Assessment (UK)
• The ANVUR (Agenzia Nazionale di Valutazione del
Sistema Universitario e della Ricerca) category
scheme (Italy)
• Australia ERA FoR (excellence in research for
Australia, fields of research) (Australia)
• FAPESP (The São Paulo Research Foundation)
classification scheme (Brazil)
• CAPES (Coordenação de Aperfeiçoamento de Pessoal
de Nível Superior) (Brazil)
• China SCADC (State Council Academic Degrees
Committee) subject categories (China)
• Scientific fields for international research
evaluation
• OECD category scheme (OECD)
• GIPP(Global Institutional Profiles Project) categories
(Clarivate Analytics)
• Subject classification for funding
• Scientific fields for research project funds
• KAKENHI subject classification
• Subject classification for research output
• Journal classification
• Web of Science subject classification
• Essential Science Indicator subject classification

An issue
• In assessing research activities based on bibliometrics, analysts are
accustomed to using the major citation database Web of Science, whose
subject classification schemes, i.e. WoS Subject Category, ESI, and
GIPP, are prepared for qualitative analysis.
• Analysts need domestic subject classification schemes for their
analysis, which are not implemented on the database.
• Applying a new classification scheme to the database by hand is too
labor-intensive and time-consuming a task.
• How can we apply a new classification scheme for the database,
cost-effectively and efficiently?

Our approach
• Induce a correspondence between two subject classification
schemes, one of which is already applied to the database and the
other of which is not yet applied.

A subject classification model of database
• There exists a bibliographic database that represents a set of articles
for scientific research. Each article is labeled with at least one
category of a subject classification scheme. It means that all articles
are classified under the subject classification scheme.
• The subject classification scheme implies its compact topological
space in the database. It states the database structure, which affects
analysis by the subject classification scheme.

Definition 1 (a database with a subject
classification scheme)
• A database $S$ is a set of articles $a_n$.
• A subject classification scheme $C$ is a set of subject categories $c_\lambda$.
• Articles attributed to a subject category comprise a subset of $S$, so
that subject categories in a subject classification scheme refer to a
family of subsets $(O_\lambda)_{\lambda \in \Lambda}$ of $S$. $O$ is an open set. $\Lambda$ is an index set.
• A subset $O_\lambda$ depends on the corresponding subject category $c_\lambda$.
Therefore, we define a map $f$ from a subject classification scheme $C$
to the powerset $\mathfrak{P}(S)$.

Theorem 1 (a finite cover)
• Theorem 1
• A practical subject classification scheme $C$ is mapped to a finite cover $\mathfrak{O}$ of $S$.
• Proof
• In practical databases, a subject classification scheme $C$ consists of finitely many
elements $c_\lambda$ that are mapped to finite subsets $O_\lambda$ by a map $f$.
• Let $\mathfrak{O}$ be a subset of $\mathfrak{P}(S)$ which consists of $\{O_i \mid i \in I\}$. $I$ is a finite index set.
• And, usually $S = \bigcup_{i \in I} O_i$ ($O_i \in \mathfrak{O}$).
• $\mathfrak{O}$ is called a finite cover of $S$.
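
As a minimal sketch of Definition 1 and Theorem 1, the following Python fragment models a database as a set of article IDs and the map $f$ as a dictionary from categories to subsets, then checks the finite-cover condition. The article IDs, category labels, and the helper name `is_finite_cover` are hypothetical and only serve as illustration.

```python
# Hypothetical toy data: article IDs and category labels are invented.
S = {"a1", "a2", "a3", "a4"}

# The map f: each subject category c is sent to the subset O of S it labels.
f = {
    "c1": {"a1", "a2"},
    "c2": {"a2", "a3"},   # an article may carry more than one category
    "c3": {"a4"},
}

def is_finite_cover(database, scheme_map):
    """Return True if every category subset lies in the database
    and the union of all category subsets equals the database."""
    subsets = list(scheme_map.values())
    inside = all(o <= database for o in subsets)
    covered = set().union(*subsets) == database
    return inside and covered

is_cover = is_finite_cover(S, f)
```

The check fails as soon as one article carries no category, which corresponds to the "usually" in the proof: the cover property is an assumption about how the database is labeled.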

Theorem 2 (a compact topological space)
• Theorem 2
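• A practical subject classification scheme $C$ implies a compact topological space $(S, \widetilde{\mathfrak{O}})$, where $\widetilde{\mathfrak{O}}$ is the topology generated by its finite cover $\mathfrak{O}$.
• Proof
• In practical databases, a subject classification scheme $C$ consists of finitely many elements $c_i$ that are
mapped to finite subsets $O_i$ by a map $f$.
• Let $\mathfrak{O}$ be a subset of $\mathfrak{P}(S)$ which consists of $\{O_i \mid i \in I\}$. $I$ is a finite index set.
• As a basis, let $\mathfrak{O}_0$ be a subset of $\mathfrak{P}(S)$ which consists of $\{\bigcap_{i \in I} A_i \mid A_i \in \mathfrak{O}\}$, where the element
is $S$ if $I = \emptyset$.
• Let $\widetilde{\mathfrak{O}}$ be a subset of $\mathfrak{P}(S)$ which consists of $\{\bigcup_{\lambda \in \Lambda} B_\lambda \mid B_\lambda \in \mathfrak{O}_0\}$, where the element is $\emptyset$ if
$\Lambda = \emptyset$. $\Lambda$ is a finite or infinite index set.
• Thus, $\widetilde{\mathfrak{O}} \supset \mathfrak{O}$, $S \in \widetilde{\mathfrak{O}}$, and $\emptyset \in \widetilde{\mathfrak{O}}$.
• $\widetilde{\mathfrak{O}}$ satisfies the necessary and sufficient conditions to be a topology.
• Together with Theorem 1, it implies a compact topological space $(S, \widetilde{\mathfrak{O}})$.
• When there exists a finite cover in a topological space, we call it a compact topological space.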
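
The construction in the proof (finite intersections of cover elements as a basis, then unions of basis elements) can be sketched for a tiny example as follows. This is only an illustration under the assumption of a very small database, since the number of generated open sets grows quickly; the function name `generate_topology` and the sample sets are hypothetical.

```python
from itertools import combinations

def generate_topology(S, cover):
    """Generate the topology on S that has the given cover as a subbase."""
    S = frozenset(S)
    subbase = [frozenset(o) for o in cover]

    # Basis: all finite intersections of cover elements.
    # The empty intersection is taken to be S itself.
    basis = {S}
    for k in range(1, len(subbase) + 1):
        for combo in combinations(subbase, k):
            basis.add(frozenset.intersection(*combo))

    # Topology: all unions of basis elements.
    # The empty union is taken to be the empty set.
    topology = {frozenset()}
    for b in basis:
        topology |= {t | b for t in topology}
    return topology

S = {1, 2, 3, 4}
cover = [{1, 2}, {2, 3}, {4}]
topology = generate_topology(S, cover)
```

The result contains $S$, $\emptyset$, and every cover element, and is closed under unions and intersections, which are the conditions checked in the proof.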

Forming a compact topological space for yet
another subject classification scheme
• A prior condition
• A subject classification scheme $C^1$ that consists of subject categories $c_i^1$ is mapped to a finite cover $\mathfrak{O}^1 = \{O_i^1 \mid i \in I^1\}$ by a map $f_1$, which implies a compact topological space $(S, \mathfrak{O}^1)$.
• Direct category assignment approach
• In the same way, we assign subject categories $c_i^2$ of a new classification scheme $C^2$ to each article of $S$.
• This creates a map $f_2$ from $C^2$ to a finite cover $\mathfrak{O}^2 = \{O_i^2 \mid i \in I^2\}$, which implies a compact topological space $(S, \mathfrak{O}^2)$.
• Indirect correspondence approach (our approach)
• We build a correspondence $\Gamma: C^2 \to C^1$ ($\Gamma = (C^2, C^1; G)$, $G \subset C^2 \times C^1$), where $c_i^2 \in C^2$, $c_j^1 \in C^1$, $(c_i^2, c_j^1) \in G$, $C^2 = \bigcup_i \{c_i^2\}$, and $C^1 = \bigcup_j \{c_j^1\}$ to guarantee existence of a finite cover.
• Then, we create a map $g_1: C^2 \to \mathfrak{C}^1 = \{C_i^1 \mid c_i^2 \in C^2,\ c_j^1 \in C^1,\ (c_i^2, c_j^1) \in G,\ i \in I^2,\ C_i^1 = \bigcup_{j \in I_i^1} \{c_j^1\}\}$, where $C^1 = \bigcup_{i \in I^2} C_i^1$ to be a finite cover.
• Finally, we create a map $g_2: \mathfrak{C}^1 \to \mathfrak{O}'^1 = \{O_i'^1 \mid C_i^1 \in \mathfrak{C}^1,\ c_j^1 \in C_i^1,\ O_j^1 = f_1(c_j^1),\ O_i'^1 = \bigcup_{j \in I_i^1} O_j^1\}$, where $S = \bigcup_{i \in I^2} O_i'^1$ to be a finite cover. (Primes mark the sets induced through the correspondence.)
• We get a composite map $g_2 \circ g_1$ from $C^2$ to a finite cover $\mathfrak{O}'^1$, which implies a compact topological space $(S, \mathfrak{O}'^1)$. Obviously, $\mathfrak{O}'^1 \subset \widetilde{\mathfrak{O}}^1$, the topology generated by $\mathfrak{O}^1$.
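
The indirect correspondence approach can be sketched as two dictionary transformations. The category labels (`w*` for the existing scheme, `k*` for the new one), the article IDs, and the pairs in `G` are hypothetical; the functions `g1` and `g2` only mirror the maps named on this slide.

```python
# f1: existing scheme C^1, category -> set of articles (hypothetical data).
f1 = {"w1": {"a1", "a2"}, "w2": {"a2", "a3"}, "w3": {"a4"}}

# G: correspondence between the new scheme C^2 and C^1, as (c2, c1) pairs.
G = {("k1", "w1"), ("k1", "w2"), ("k2", "w3")}

def g1(G):
    """Map each C^2 category to the set of C^1 categories related to it."""
    out = {}
    for c2, c1 in G:
        out.setdefault(c2, set()).add(c1)
    return out

def g2(c1_sets, f1):
    """Map each set of C^1 categories to the union of their article sets."""
    return {
        c2: set().union(*(f1[c1] for c1 in c1s))
        for c2, c1s in c1_sets.items()
    }

# Composite map g2 . g1: new-scheme category -> induced set of articles.
induced = g2(g1(G), f1)
```

No article is relabeled by hand here: the new scheme inherits article sets entirely through `G`, which is why the quality of the correspondence decides the quality of the result.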

Deciding a correspondence between two
subject classification schemes
• Expert driven approach
• Experts of the two subject classification schemes decide a correspondence
between them based on their knowledge and practical experiences.
• Data driven approach (our approach)
• Data scientists analyze a database where an entity is categorized with the two
subject classification schemes, and decide a correspondence between them
based on the analysis.

By means of a research project database
• A research project database
• A database $T$ describes research projects $b_n$, one of whose outputs is a list of research articles $a_n$ on a database $S$.
• Research articles $a_n$ of $S$ are categorized with a subject classification scheme $C^1$. We define a map $f_1$ where $C^1$ is mapped to a finite cover $\mathfrak{O}_S^1 = \{O_i^1 \mid i \in I^1\}$ of $S$, which implies a compact topological space $(S, \mathfrak{O}_S^1)$.
• Research projects $b_n$ of $T$ are categorized with a subject classification scheme $C^2$. We define a map $h_1$ where $C^2$ is mapped to a finite cover $\mathfrak{O}_T^2 = \{O_i^2 \mid i \in I^2\}$ of $T$, which implies a compact topological space $(T, \mathfrak{O}_T^2)$.
• Research projects $b_n$ produce a set of research articles $a_n$, so we define a map $h_2: T \to \mathfrak{P}(S)$ to express this. Here, let the codomain be reduced to the image $\mathfrak{S} \subset \mathfrak{P}(S)$ so that the map is a surjection. Then, we also define a map $h_2': T \to \mathfrak{P}(S')$ where $S' = \bigcup_{i \in I_{\mathfrak{S}}} O_i$ ($O_i \in \mathfrak{S}$) and $S' \subset S$. For the image $S'$, we define a map $f_1'$ where $C^1$ is mapped to a finite cover $\mathfrak{O}_{S'}^1 = \{O_i'^1 \mid i \in I^1\}$ of $S'$, which implies a compact topological space $(S', \mathfrak{O}_{S'}^1)$.
• Create a map
• We create a map $h_3: \mathfrak{O}_T^2 \to \mathfrak{O}_{S'}^2 = \{O_{S',i}^2 \mid O_{T,i}^2 \in \mathfrak{O}_T^2,\ b_j^2 \in O_{T,i}^2,\ O_{S',j}^2 = h_2'(b_j^2),\ O_{S',i}^2 = \bigcup_j O_{S',j}^2\}$ that is a subset of $\mathfrak{P}(S')$, where $\mathfrak{O}_{S'}^2$ is a finite cover.
• We get a composite map $h_3 \circ h_1: C^2 \to \mathfrak{O}_{S'}^2$. Since $\mathfrak{O}_{S'}^2$ is a finite cover, it induces a compact topological space.
• Supposition
• The composite map $h_3 \circ h_1: C^2 \to \mathfrak{O}_{S'}^2$ represents the classification of articles by the subject classification scheme.
• If two images on $S'$ by a map $f_1'$ and a map $h_3 \circ h_1$ are equivalent, their inverse images are in an equivalence relation.
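
A minimal sketch of the composite map $h_3 \circ h_1$, assuming toy project and article IDs: each new-scheme category is first mapped to its projects, and each project is then replaced by the articles it produced. All identifiers and data below are hypothetical.

```python
# h1: new scheme C^2, category -> projects in T labeled with it (hypothetical).
h1 = {"k1": {"b1", "b2"}, "k2": {"b3"}}

# h2': project -> articles of S' that the project lists as outputs.
h2_prime = {"b1": {"a1"}, "b2": {"a2", "a3"}, "b3": {"a3", "a4"}}

def h3_after_h1(h1, h2_prime):
    """For each C^2 category, take the union of the article sets
    produced by the projects carrying that category."""
    return {
        c2: set().union(*(h2_prime[b] for b in projects))
        for c2, projects in h1.items()
    }

# S': all articles reachable from at least one project.
S_prime = set().union(*h2_prime.values())

# Finite cover of S' induced by the project classification.
cover_2 = h3_after_h1(h1, h2_prime)
```

In this toy data article `a3` ends up under both `k1` and `k2`, the kind of overlap the next slide treats as the natural situation.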

Data driven approach to decide a
correspondence
• An observation
• In a database $S'$, elements of finite covers $\mathfrak{O}_{S'}^1$ and $\mathfrak{O}_{S'}^2$ represent naturally overlapping sets.
• For an $O^2 \in \mathfrak{O}_{S'}^2$, there exist its intersections $O^2 \cap O^1$ with all $O^1 \in \mathfrak{O}_{S'}^1$. Their cardinalities greater than zero, if sorted
in rank order, obey a discrete version of a generalized beta distribution [Martínez-Mekler, et al. (2009)], which is given by
$f(r) = A\,(N + 1 - r)^b / r^a$, where $r$ is the rank, $N$ its maximum value, $A$ the normalization constant and $(a, b)$ two fitting
exponents.
• Decide a correspondence by calculating precision and recall
• To decide a correspondence, for an $O_j^2$ ($j \in I^2$), find a subset $\{O_i^1 \mid i \in I_j^1\}$ of $\mathfrak{O}_{S'}^1$ that satisfies $O_j^2 = \bigcup_i O_i^1$.
• In most cases, $O_j^2 \not\supset O_i^1$ and $O_j^2 \neq O_i^1$.
• So we define the following metrics:
• $d_p = \dfrac{\left|\bigcup_{i \in I_j^1} (O_j^2 \cap O_i^1)\right|}{\left|\bigcup_{i \in I_j^1} O_i^1\right|}$ (precision),
• $d_r = \dfrac{\left|\bigcup_{i \in I_j^1} (O_j^2 \cap O_i^1)\right|}{\left|O_j^2\right|}$ (recall).
• And a generalized harmonic mean of precision and recall:
• $d_f = \dfrac{(1 + \beta^2)\, d_p\, d_r}{\beta^2 d_p + d_r}$, $\beta > 0$ ($F_\beta$-measure).
• Finally, we decide a threshold of the f-measure to determine which element has a correspondence relation.
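
The three metrics can be sketched with plain Python sets. The sketch assumes that the lost operators in the formulas are set unions with cardinalities, which is an interpretation rather than something the slide states; the function name and the sample sets are hypothetical.

```python
def precision_recall_f(o2, o1_subsets, beta=1.0):
    """Compare a target set O_j^2 with the union of candidate sets O_i^1.

    Returns (d_p, d_r, d_f): precision, recall, and F_beta-measure.
    """
    union_o1 = set().union(*o1_subsets)
    hit = len(o2 & union_o1)          # articles shared by both sides
    d_p = hit / len(union_o1) if union_o1 else 0.0
    d_r = hit / len(o2) if o2 else 0.0
    denom = beta ** 2 * d_p + d_r
    d_f = (1 + beta ** 2) * d_p * d_r / denom if denom else 0.0
    return d_p, d_r, d_f

# Hypothetical example: one new-scheme category and two candidate categories.
o2 = {1, 2, 3, 4}
o1_subsets = [{1, 2, 5}, {3, 6}]
d_p, d_r, d_f = precision_recall_f(o2, o1_subsets)
```

Here three of the five candidate articles fall inside the target (precision 0.6) and three of the four target articles are reached (recall 0.75).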
Martínez-Mekler, G., Alvarez Martínez, R., Beltrán del Río, M., Mansilla, R., Miramontes, P., & Cocho, G. (2009). Universality of rank-ordering distributions in the arts and sciences. PLoS ONE, 4(3), e4791. http://doi.org/10.1371/journal.pone.0004791

Application example
• InCites™ (Clarivate Analytics)
• A world class research evaluation platform
• User scenarios
• Users
• Research organizations
• Funding and policy organizations
• Publishers
• The users can
• Identify and manage research activities and their impact,
• Benchmark and compare performance to peers,
• Identify experts both inside and outside the organization,
• Identify emerging subject areas, researchers, and experts,
• Manage funding activity from submission to progress
reports through outcomes,
• Demonstrate results and impact of funding policy,
• Identify new trends and key indicators to enable policy
development,
• Uncover new or emerging areas in which to publish,
• Monitor trends within a field or geographic region,
• Identify the best authors and reviewers,
• Maintain competitive advantage by monitoring the
competition.
• Dataset
• Web of Science™ Core Collection
• Entities
• People
• Organizations
• Regions
• Research Areas
• Journals, Books, Conference Proceedings

Linking bibliographic entities between WoS
and KAKEN
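[Figure: bibliographic linkage between KAKEN and Web of Science. In KAKEN ($T$), a project $b$ is mapped by $h$ to the set $O$ of articles $a_1', a_2', a_3'$ it lists; in Web of Science ($S$), the matching articles $a_1, a_2, a_3$ form $S'$. Bibliographic linkage identifies $a_1' \equiv a_1$, $a_2' \equiv a_2$, $a_3' \equiv a_3$.]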

A bibliographic linkage [Kurakawa, et al. 2014]
• Databases
• KAKEN as of 2009
• 173,940 article citations in English
• WoS as of 2009, 2010
• 3,925,776 article citations
• Method
• Identifying pairs of article citations by
the following techniques
• i-Linkage
• Blocking: the top 5 candidate WoS article
citations are paired with each KAKEN
article citation
• SVM (support vector machine)
• Classifying each candidate pair as a
true or false match
• Output result
• 75,042 pairs of citations
• 43.1% of 173,940 from KAKEN
• 10-fold cross validation (800 true
pairs by human judgment)
• Accuracy 95.01
• Precision 94.92
• Recall 95.10
• F-Measure 94.98
• Pairs of citations that are
categorized with both subject
classification schemes
• 59,595 pairs of citations
Kurakawa, K., Sun, Y., & Aizawa, A. (2014). Mapping between research fields of grants-in-aid for scientific research and web of science subject areas. NII
Technical Reports. National Institute of Informatics. Retrieved from https://www.nii.ac.jp/TechReports/public_html/14-002J.html

Subject classification schemes
• Web of Science subject
classification scheme
• Web of Science subject areas
• 251 subject categories
• ESI research areas
• GIPP research areas
• KAKEN subject classification
scheme (as of 2009)
• 4 categories
• 10 areas
• 67 disciplines
• 284 research fields

A contingency table for two subject
classification schemes
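Rows: Web of Science subject categories $O_i^1$ ($i = 1, \dots, m$). Columns: KAKENHI subject categories $O_j^2$ ($j = 1, \dots, n$).

$$
\begin{array}{c|ccccc}
 & O_1^2 & \cdots & O_j^2 & \cdots & O_n^2 \\ \hline
O_1^1 & f_{11} & \cdots & f_{1j} & \cdots & f_{1n} \\
\vdots & \vdots & \ddots & \vdots & & \vdots \\
O_i^1 & f_{i1} & \cdots & f_{ij} & \cdots & f_{in} \\
\vdots & \vdots & & \vdots & \ddots & \vdots \\
O_m^1 & f_{m1} & \cdots & f_{mj} & \cdots & f_{mn}
\end{array}
\qquad f_{ij} = \left| O_i^1 \cap O_j^2 \right|
$$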
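
A minimal sketch of how such a contingency table could be counted from linked records, assuming each linked pair carries a set of WoS categories and a set of KAKENHI categories. The record layout and labels are hypothetical; a record with several categories on either side is counted once per category combination.

```python
from collections import Counter

# Each linked record: (WoS categories of the article, KAKENHI categories
# of the project that lists it). Hypothetical toy data.
records = [
    ({"w1"}, {"k1"}),
    ({"w1", "w2"}, {"k1"}),
    ({"w2"}, {"k2"}),
]

def contingency(records):
    """Count f_ij = number of linked records in both O_i^1 and O_j^2."""
    table = Counter()
    for wos_cats, kaken_cats in records:
        for w in wos_cats:
            for k in kaken_cats:
                table[(w, k)] += 1
    return table

table = contingency(records)
```

Because of multi-category records, the sum of all cells can exceed the number of linked records.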

An example contingency table (a partial view of
251 WoS and 67 KAKENHI subject categories)

Analysis of the contingency table
The discrete generalized beta distribution (DGBD):
$f(r) = A\,(N + 1 - r)^b / r^a$,
where $r$ is the rank value, $N$ its maximum value,
$A$ a normalization constant,
and $(a, b)$ two fitting exponents.
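
One simple way to estimate the DGBD parameters is a least-squares fit in log space, since $\log f(r) = \log A + b \log(N + 1 - r) - a \log r$ is linear in $(\log A, b, a)$. The sketch below uses only the standard library and is an assumption about how such a fit could be done, not the procedure used for the slides; all names are hypothetical.

```python
import math

def dgbd(r, N, A, a, b):
    """Discrete generalized beta distribution f(r) = A (N+1-r)^b / r^a."""
    return A * (N + 1 - r) ** b / r ** a

def fit_dgbd(freqs):
    """Fit (A, a, b) to positive frequencies by linear least squares on
    log f = log A + b*log(N+1-r) - a*log r."""
    ys = sorted(freqs, reverse=True)          # rank 1 = largest frequency
    N = len(ys)
    rows = [(1.0, math.log(N + 1 - r), -math.log(r)) for r in range(1, N + 1)]
    t = [math.log(y) for y in ys]
    # Normal equations M x = v for x = (log A, b, a).
    M = [[sum(row[p] * row[q] for row in rows) for q in range(3)] for p in range(3)]
    v = [sum(row[p] * ti for row, ti in zip(rows, t)) for p in range(3)]
    # Gauss-Jordan elimination with partial pivoting on the 3x3 system.
    for col in range(3):
        piv = max(range(col, 3), key=lambda k: abs(M[k][col]))
        M[col], M[piv] = M[piv], M[col]
        v[col], v[piv] = v[piv], v[col]
        for k in range(3):
            if k != col:
                factor = M[k][col] / M[col][col]
                M[k] = [mk - factor * mc for mk, mc in zip(M[k], M[col])]
                v[k] -= factor * v[col]
    log_A, b, a = (v[i] / M[i][i] for i in range(3))
    return math.exp(log_A), a, b

# Synthetic check: generate exact DGBD values and recover the parameters.
N = 20
true_A, true_a, true_b = 100.0, 0.8, 0.5
sample = [dgbd(r, N, true_A, true_a, true_b) for r in range(1, N + 1)]
A_hat, a_hat, b_hat = fit_dgbd(sample)
```

The fit needs at least three strictly positive frequencies, which matches the slide's restriction to cardinalities greater than zero.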

Pseudo precision and recall for subject categories
of a new subject classification scheme
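[Figure: Venn diagram of a set $O_j^2$ overlapping the sets $O_1^1$, $O_2^1$, $O_3^1$, and $O_4^1$.]
As for the whole counting of papers, we define
$d_p' = \dfrac{\sum_i \left| O_j^2 \cap O_i^1 \right|}{\sum_i \left| O_i^1 \right|}$ (pseudo precision),
$d_r' = \dfrac{\sum_i \left| O_j^2 \cap O_i^1 \right|}{\left| O_j^2 \right|}$ (pseudo recall),
$d_f' = \dfrac{(1 + \beta^2)\, d_p'\, d_r'}{\beta^2 d_p' + d_r'}$, $\beta > 0$ (pseudo $F_\beta$-measure).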
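
The pseudo metrics need only counts, so they can be computed from one column of the contingency table plus the category sizes. The sketch below also searches for the maximum pseudo F-measure by adding existing-scheme categories in descending order of overlap; that greedy search order is an assumption made for illustration, and all names and numbers are hypothetical.

```python
def max_pseudo_f(overlaps, o1_sizes, o2_size, beta=1.0):
    """overlaps: {c1: |O_j^2 ∩ O_i^1|}, o1_sizes: {c1: |O_i^1|},
    o2_size: |O_j^2|. Returns (d_p', d_r', d_f', chosen categories)
    for the rank-ordered prefix with the highest pseudo F-measure."""
    ranked = sorted(
        (c for c in overlaps if overlaps[c] > 0),
        key=lambda c: overlaps[c],
        reverse=True,
    )
    best = (0.0, 0.0, 0.0, [])
    hit = total = 0
    chosen = []
    for c1 in ranked:
        hit += overlaps[c1]       # numerator: sum of overlap counts
        total += o1_sizes[c1]     # denominator: sum of category sizes
        chosen.append(c1)
        d_p = hit / total
        d_r = hit / o2_size
        d_f = (1 + beta ** 2) * d_p * d_r / (beta ** 2 * d_p + d_r)
        if d_f > best[2]:
            best = (d_p, d_r, d_f, list(chosen))
    return best

# Hypothetical counts for one new-scheme category.
overlaps = {"w1": 60, "w2": 30, "w3": 5}
o1_sizes = {"w1": 100, "w2": 60, "w3": 200}
best_p, best_r, best_f, best_cats = max_pseudo_f(overlaps, o1_sizes, 100)
```

With whole counting an article in several chosen categories is counted more than once, so the pseudo recall is not guaranteed to stay below 1.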

Maximum pseudo f-measure
4 categories – 67 disciplines
Columns: KAKEN subject category; Translation; # of WoS subject categories to cover; Pseudo precision; Pseudo recall; Max pseudo F1 measure
(01-01) 情報学 Informatics 17 0.57582 0.62589 0.59981
(01-02) 神経科学 Brain sciences 1 0.21829 0.36497 0.27318
(01-03) 実験動物学 Laboratory animal science 1 0.05863 0.07438 0.06557
(01-04) 人間医工学 Biomedical engineering 8 0.22199 0.21253 0.21716
(01-05) 健康・スポーツ科学 Health / sports science 5 0.18095 0.29028 0.22293
(01-06) 生活科学 Human life science 4 0.23905 0.28051 0.25813
(01-07) 科学教育・教育工学 Science education / educational technology 2 0.37736 0.10309 0.16194
(01-08) 科学社会学・科学技術史 Sociology / history of science and technology 6 0.11111 0.16279 0.13208
(01-09) 文化財科学 Cultural assets study 1 0.2 0.03636 0.06154
(01-10) 地理学 Geography 4 0.11719 0.2027 0.14851
(01-11) 環境学 Environmental science 14 0.26227 0.3853 0.3121
(01-12) ナノ・マイクロ科学 Nano / micro science 4 0.10326 0.31317 0.15531
(01-13) 社会・安全システム科学 Social / safety system science 14 0.18656 0.21429 0.19946
(01-14) ゲノム科学 Genome science 3 0.04047 0.20305 0.06748
(01-15) 生物分子科学 Biomolecular science 2 0.11913 0.32457 0.17429
(01-16) 資源保全学 Resource conservation science 3 0.18116 0.14535 0.16129
(01-17) 地域研究 Area studies 7 0.16429 0.27059 0.20444
(01-18) ジェンダー Gender 3 0.23077 0.11111 0.15

(02-01) 哲学 Philosophy 4 0.4359 0.28333 0.34343
(02-02) 芸術学 Art studies 1 0.09091 0.11111 0.1
(02-03) 文学 Literature 10 0.7 0.68293 0.69136
(02-04) 言語学 Linguistics 3 0.70504 0.41004 0.51852
(02-05) 史学 History 6 0.41176 0.34146 0.37333
(02-06) 人文地理学 Human geography 3 0.175 0.5 0.25926
(02-07) 文化人類学 Cultural anthropology 3 0.05634 0.10526 0.07339
(02-08) 法学 Law 3 0.38462 0.12195 0.18519
(02-09) 政治学 Politics 2 0.40909 0.45763 0.432
(02-10) 経済学 Economics 12 0.6917 0.62198 0.65499
(02-11) 経営学 Management 5 0.29412 0.38462 0.33333
(02-12) 社会学 Sociology 8 0.17606 0.27778 0.21552
(02-13) 心理学 Psychology 14 0.4878 0.47859 0.48315
(02-14) 教育学 Education 9 0.24375 0.25828 0.2508
(03-01) 数学 Mathematics 4 0.73424 0.79181 0.76194
(03-02) 天文学 Astronomy 1 0.5052 0.86965 0.63912
(03-03) 物理学 Physics 6 0.49831 0.65128 0.56462
(03-04) 地球惑星科学 Earth and planetary science 7 0.6186 0.66222 0.63967
(03-05) プラズマ科学 Plasma science 1 0.23261 0.19094 0.20973
(03-06) 基礎化学 Basic chemistry 7 0.22929 0.80065 0.35649
(03-07) 複合化学 Applied chemistry 6 0.28307 0.52645 0.36817
(03-08) 材料化学 Materials chemistry 7 0.1571 0.34801 0.21647
(03-09) 応用物理学・工学基礎 Applied physics 5 0.17011 0.39374 0.23758
(03-10) 機械工学 Mechanical engineering 11 0.43053 0.38804 0.40818
(03-11) 電気電子工学 Electrical and electronic engineering 10 0.33758 0.66933 0.4488
(03-12) 土木工学 Civil engineering 8 0.37069 0.48383 0.41977
(03-13) 建築学 Architecture and building engineering 3 0.28571 0.50588 0.36518
(03-14) 材料工学 Material engineering 6 0.34794 0.52269 0.41778
(03-15) プロセス工学 Process / chemical engineering 4 0.14529 0.30553 0.19694
(03-16) 総合工学 Integrated engineering 8 0.25637 0.30922 0.28032

Average pseudo precision: 0.31469; average pseudo recall: 0.36724; average pseudo F1 measure: 0.31718
(04-01) 基礎生物学 Basic biology 7 0.375 0.39992 0.38706
(04-02) 生物科学 Biological science 4 0.16679 0.58193 0.25927
(04-03) 人類学 Anthropology 3 0.31504 0.44 0.36718
(04-04) 農学 Plant production and environmental agriculture 4 0.30676 0.44939 0.36462
(04-05) 農芸化学 Agricultural chemistry 6 0.22042 0.38632 0.28069
(04-06) 林学 Forest and forest products science 5 0.40751 0.25224 0.3116
(04-07) 水産学 Applied aquatic science 2 0.4185 0.32702 0.36715
(04-08) 農業経済学 Agricultural science in society and economy 2 0.33333 0.09677 0.15
(04-09) 農業工学 Agro-engineering 4 0.15686 0.25926 0.19546
(04-10) 畜産学・獣医学 Animal life science 4 0.51054 0.38655 0.43998
(04-11) 境界農学 Boundary agriculture 4 0.2346 0.14787 0.18141
(04-12) 薬学 Pharmacy 4 0.29417 0.3694 0.32752
(04-13) 基礎医学 Basic medicine 16 0.21266 0.55141 0.30695
(04-14) 境界医学 Boundary medicine 12 0.16156 0.11176 0.13213
(04-15) 社会医学 Society medicine 8 0.28153 0.26197 0.2714
(04-16) 内科系臨床医学 Clinical internal medicine 24 0.44074 0.61743 0.51433
(04-17) 外科系臨床医学 Clinical surgery 20 0.41795 0.468 0.44156
(04-18) 歯学 Dentistry 3 0.64007 0.27983 0.38941
(04-19) 看護学 Nursing 2 0.73684 0.44304 0.55336

Miscellaneous considerations to decide a
correspondence
• A threshold for decision
• At most the top 4 ranked elements have correspondence relations.
• For every Web of Science subject category $O_i^1$, the number of relations with KAKENHI
subject categories $O_j^2$ is limited to 4 at most.
• For every Web of Science subject category $O_i^1$, when the recall rate exceeds a half, we
stop adding any more relations.
• Decision by experts
• Professionals who know the subject classification schemes check all
correspondences between $O_i^1$ and $O_j^2$.
• They add or remove correspondence relations between them by means of subject
classification keywords.
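
The two automatic rules (at most 4 relations, stop once the recall rate exceeds one half) can be sketched as below. The slide does not define the recall rate precisely; the sketch assumes it is the share of a WoS category's contingency-table row covered by the KAKENHI categories chosen so far. The function name and sample rows are hypothetical, and the expert review step is not modeled.

```python
def select_relations(row, max_relations=4, recall_stop=0.5):
    """row: {kakenhi_category: f_ij} for one WoS category O_i^1.
    Add KAKENHI categories in descending order of f_ij until the limit
    is reached or the covered share of the row exceeds recall_stop."""
    total = sum(row.values())
    chosen, covered = [], 0
    for cat, n in sorted(row.items(), key=lambda kv: kv[1], reverse=True):
        if n == 0 or len(chosen) >= max_relations:
            break
        chosen.append(cat)
        covered += n
        if covered / total > recall_stop:
            break
    return chosen

# Hypothetical rows of a contingency table.
concentrated = select_relations({"k1": 40, "k2": 30, "k3": 20, "k4": 10})
scattered = select_relations({c: 1 for c in "abcdefghi"})
```

A concentrated row stops early on the recall rule, while a scattered row is cut off by the limit of 4.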

InCites example screen (Analysis by KAKEN
subject classification scheme)
[Screenshots (a snapshot of 2018-12-14): InCites analysis of WoS Documents (58,395,008) for Web of Science subject categories; for Web of Science subject categories limited with “LOCATION = JAPAN”; and for KAKEN L3 subject categories limited with “LOCATION = JAPAN”.]

InCites outputs by subject classification
schemes
• WoS Documents
• “LOCATION = JAPAN”
• Web of Science subject areas
• ESI research areas
• KAKEN subject classification
scheme (as of 2009)
• 10 areas (KAKEN L2)
• 67 disciplines (KAKEN L3)

User feedback
• KAKEN classification scheme
• April 2016, released on InCites Benchmarking
• User survey
• March 2017 by online questionnaire for institutional active users
• 18 questions
• Results
• Feedback from 26 institutional users
User role in the institution | Yes (multiple answers possible)
RA (research administrator) | 20
Administrator / officer | 3
IR (institutional research) staff | 5
Others | 2

User feedback results (degree of expertise)
[Chart omitted.] Other: 4, when needed; 1, when evaluating researchers

User feedback results (validity of applying
KAKENHI subject classification scheme)
[Chart omitted.] Other: 1, I need more detailed categories

User feedback results (miscellaneous user
voices)
• Comments
• Needs of KAKENHI subject classification scheme
• When I apply a set of Web of Science documents to the KAKENHI subject classification scheme, the top
1% paper metrics are still based on WoS categories and ESI categories, which is unsatisfactory. I need a feature to recalculate the
citation ranking by the KAKENHI subject classification scheme.
• I need the KAKENHI subject classification scheme in the Web of Science search service as well.
• I hope the KAKENHI subject classification scheme will be updated to the newest version as soon as possible. (It might be hard to
keep up with the updates since it changes every year.)
• It is very timely to add the KAKENHI subject classification scheme.
• Although it applies only to a limited range of subjects, the KAKENHI subject classification scheme has the
advantage of letting us analyze research precisely because it has a finer resolution than ESI, which makes it
possible to map “Animal & Plant Science” of ESI to more precise ones, i.e. “Applied Physics”, “Applied
Chemistry”, etc. of Japan.
• Need more precise categories of KAKENHI subject classification scheme
• The sixty-odd categories of KAKENHI are not sufficient for relative comparison of research, compared with ESI (22
only) and WoS (251, about four times as many). It may also cause over-evaluation in comparisons between
research fields because the KAKENHI subject classification is made in a clock counter-like classification
method. We need more accurate analysis with more concrete examples.

Discussion (1)
• Our approach, i.e. deciding a correspondence between two subject classification
schemes has an inherent limitation.
• In natural correlations between subject categories of two subject classification schemes, each
subject category of one scheme partly overlaps several subject categories of the other scheme.
• There is no inclusion relationship between them.
• Correspondence relations are probabilistic.
• Research projects and journal articles have similarities and differences on subject.
• Projects and articles have a strong correlation on subject.
• In our approach, we used a grants database which describes that research projects produce outputs, i.e.
research articles.
• We focused on the subject classification scheme for the research projects and its relationship to a set of
research articles. Research articles are classified with another subject classification scheme.
• We compared those two subject classification schemes through this relationship.
• But, they also have differences on subject.
• Projects precede articles. There is a time lag between the start of a project and the output of its articles. This causes a subject
divergence or drift between them.
• Projects tend to indicate the central concept with essential keywords. This allows the subjects of their articles to
diversify.

Discussion (2)
• Nevertheless, the classification results were accepted by InCites users.
• The users might focus on comparative analysis of bibliometrics by the subject categories, and not care about
specific cases of articles.
• They might need only rough metrics at the evaluation stage.
• Metrics are central limits of quantitative attributes of a set of entities, which is the main indicator to be
checked for the research evaluation.
• Our approach is extremely cost effective.
• The number of journal titles in the Web of Science citation database is 24,688.
• http://mjl.clarivate.com/cgi-bin/jrnlst/jlresults.cgi
• The number of Web of Science documents in InCites is 58,395,008.
• The number of subject category pairs to decide a correspondence is 16,817.
• For KAKEN 67 - WoS 251, the number of the pairs is 67 × 251 = 16,817.
• For KAKEN 10 - WoS 251, the number of the pairs is 10 × 251 = 2,510.
• Evidence data is too small.
• The sum of frequency counts of the contingency table is 97,175. It is not enough to automatically decide a
correspondence between subject classification schemes. Manual handling was needed.

Conclusions and future work
• We proposed an approach to apply a new subject classification scheme for a bibliographic database that is
classified by a subject classification scheme.
• We defined a subject classification model of database that consists of a topological space.
• Then, we showed our approach based on the model, where the step is to form a compact topological space for a new
subject classification scheme.
• To form the space, it utilizes a correspondence between two subject classification schemes by a research project database as
data.
• We applied the approach to a practical example, i.e. InCites - a benchmarking tool for research evaluation
based on the Web of Science citation database so as to add the KAKENHI subject classification scheme.
• Subject classification schemes
• Web of Science subject categories 251
• KAKENHI subject categories 67 / 10
• 59,595 pairs of bibliographic records classified by WoS subject categories and KAKEN subject categories induce a
correspondence between the two subject classification schemes.
• User feedback revealed that users accepted our classification results.
• Future work
• In the present data age, this task will be handled on the basis of external data and artificial intelligence. Our approach becomes more robust
with larger amounts of data.
• Alternatively, it is promising to look directly into content and extract knowledge to serve the same purposes as the metadata.