Analysis and Modeling of Complex Data in Behavioral and Social Sciences
Joint meeting of Japanese and Italian Classification Societies
Anacapri (Capri Island, Italy), 3-4 September 2012

A SVM Applied Text Categorization of Academia-Industry Collaborative
Research and Development Documents on the Web

Kei Kurakawa1, Yuan Sun1, Nagayoshi Yamashita2, Yasumasa Baba3
1. National Institute of Informatics
2. GMO Research (ex- Japan Society for the Promotion of Science)
3. The Institute of Statistical Mathematics
U-I-G relations
•  To make science and technology research and development policy, university-industry-government (U-I-G) relations are an important aspect to investigate (Leydesdorff and Meyer, 2003).
•  Web documents are one of the research targets for clarifying the state of the relationship.
•  In the clarification process, obtaining the exact resources of U-I-G relations is the first requirement.
Objective
•  Our objective is to extract resources of U-I relations automatically from the web.
•  We target the “press release articles” of organizations, and build a framework that automatically crawls them and decides which ones describe U-I relations.
Automatic extraction framework for
U-I relations documents on the web
Input: press release articles published on university or company web sites.
1. Crawling the web → crawled documents
2. Extracting text from the documents → extracted texts
3. Learning to classify the documents → learned model file
4. Classifying the documents with the learned model
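As a minimal sketch, the four steps can be wired together like this in Python; every function here (`crawl`, `extract_text`, `learn`, `classify`) is a hypothetical stand-in, not the authors' implementation:

```python
# Minimal sketch of the 4-step framework; each component is a
# hypothetical placeholder for the real crawler/extractor/classifier.
import re

def crawl(urls):
    """Step 1: fetch press-release pages (stubbed: returns raw HTML strings)."""
    return ["<html><body>Press release ...</body></html>" for _ in urls]

def extract_text(html):
    """Step 2: strip markup, keep visible text (very crude stand-in)."""
    return re.sub(r"<[^>]+>", " ", html).strip()

def learn(texts, labels):
    """Step 3: learn a classifier; a trivial keyword rule stands in for SVM training."""
    return lambda text: 1 if "joint research" in text else 0

def classify(model, text):
    """Step 4: apply the learned model to a new document (1 = U-I relations)."""
    return model(text)

model = learn(["joint research with NEC", "campus festival"], [1, 0])
```

The point is only the data flow: each step's output feeds the next, and the learned model is reusable across new documents.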
Support Vector Machine (1)
(Vapnik, 1995)
•  Two-class classifier
     y(x) = w^T φ(x) + b
   where φ(x) is a fixed feature-space transformation and b is a bias parameter.
•  N input vectors
   –  Input vectors: x_1, ..., x_N
   –  Target values: t_1, ..., t_N, where t_n ∈ {−1, 1}
•  For all input vectors, t_n y(x_n) > 0.
•  Maximize the margin between the hyperplanes y(x) = 1 and y(x) = −1.
[Figure: separating hyperplane y(x) = 0 with margin boundaries y(x) = 1 and y(x) = −1; the points lying on the boundaries are the support vectors.]
Support Vector Machine (2)
•  Optimization problem
     arg min_{w,b} (1/2) ||w||^2
   subject to the constraints
     t_n (w^T φ(x_n) + b) ≥ 1,   n = 1, ..., N
•  By means of the Lagrangian method, the solution takes the dual form
     y(x) = Σ_{n=1}^{N} a_n t_n k(x, x_n) + b,
   where the kernel function is defined by k(x, x′) = φ(x)^T φ(x′), and the a_n ≥ 0 are Lagrange multipliers (a_n > 0 only for the support vectors).
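The dual form can be checked numerically. The sketch below uses scikit-learn's SVC (the slides' experiment used SVMlight) and re-evaluates y(x) by hand from the support vectors, the stored dual coefficients a_n·t_n, and an RBF kernel; the toy data and γ = 0.5 are illustrative only:

```python
# Sketch: evaluate the dual form y(x) = sum_n a_n t_n k(x, x_n) + b
# by hand and compare with scikit-learn's built-in decision function.
import numpy as np
from sklearn.svm import SVC

GAMMA = 0.5  # illustrative RBF width, not the experiment's setting

X = np.array([[0.0, 0.0], [1.0, 1.0], [0.0, 1.0], [1.0, 0.0],
              [3.0, 3.0], [4.0, 4.0], [3.0, 4.0], [4.0, 3.0]])
t = np.array([-1, -1, -1, -1, 1, 1, 1, 1])

clf = SVC(kernel="rbf", gamma=GAMMA).fit(X, t)

def decision(x):
    """Compute y(x) = sum_n a_n t_n k(x, x_n) + b from the support vectors."""
    sv = clf.support_vectors_
    k = np.exp(-GAMMA * np.sum((sv - x) ** 2, axis=1))  # RBF kernel k(x, x_n)
    # dual_coef_ stores a_n * t_n for the support vectors; intercept_ is b
    return (clf.dual_coef_ @ k).item() + clf.intercept_[0]
```

`decision(x)` agrees with `clf.decision_function`, which shows that only the support vectors (where a_n > 0) contribute to the prediction.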
U-I relations documents on the web
•  Texts extracted from web documents are very noisy for content analysis.
   –  Irrelevant text, e.g. menu labels, page headers and footers, and ads, still remains.
•  In our observation,
   –  irrelevant text tends to be isolated terms rather than full sentences,
   –  for detecting U-I relations, the exact evidence of relevance occurs in two or three sequential, formal sentences.
      •  For example, ”the MIT researchers and scientists from MicroCHIPS Inc. reported that... ”,
      •  or the Japanese target ”東京大学とオムロン株式会社は、共同研究により、重なりや隠れに強く....” (The University of Tokyo and OMRON Corporation, through joint research, ... robust to overlap and occlusion ....)
•  It is therefore enough to keep only text that contains punctuation marks, which signal fully formal sentences.
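A minimal sketch of this filter, assuming we keep any line containing Japanese (。、) or Latin (. ,) sentence punctuation; the exact punctuation set is our assumption, not specified on the slide:

```python
# Sketch of the noise filter: keep only lines containing sentence
# punctuation, which signals a formal sentence rather than a menu
# label, header/footer, or ad fragment.
import re

SENTENCE_PUNCT = re.compile(r"[。、.,]")  # assumed punctuation set

def keep_formal_sentences(lines):
    """Return only the lines that look like formal sentences."""
    return [ln for ln in lines if SENTENCE_PUNCT.search(ln)]
```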
Feature selection
•  tf-idf (Term Frequency – Inverse Document Frequency)
•  tf-idf is defined by tf-idf(t, d, D) = tf(t, d) × idf(t, D), where t is a term, d a document, and D the set of all documents.
•  The feature vector is defined by
     x_d = (x_{t1,d}, x_{t2,d}, ..., x_{tM,d}),
     x_{t,d} = tf-idf(t, d, D) × b_{t,d},   b_{t,d} = 1 if t ∈ d, 0 if t ∉ d.
•  In our experiment, a term can be a word in a document, the POS (part-of-speech) type of a morpheme, or the analytical output of external tools.
Mapping a document into a feature vector
A document
   東北大学は、NECとの共同研究によりCPU内で使用される電子回路(CAM:連想メモリプロセッサ)において、世界で初めて、既存回路と同等の高速動作と、処理中に電源を切ってもデータを回路上に保持できる不揮発動作、を両立する技術を開発、実証しました。
   (Tohoku University, in joint research with NEC, developed and demonstrated, for the first time in the world, a technology for the electronic circuits used inside CPUs (CAM: content-addressable memory processors) that combines high-speed operation equal to existing circuits with non-volatile operation that retains data on the circuit even when power is cut during processing.)
Feature selection
   x = (tf-idf(産官学, d, D), tf-idf(協力, d, D),
        tf-idf(開始+動詞, d, D), tf-idf(受託+動詞, d, D),
        tf-idf(研究+動詞, d, D), tf-idf(実験+動詞, d, D),
        tf-idf(開始+名詞,サ変接続, d, D), tf-idf(発見+動詞, d, D),
        tf-idf(研究員, d, D), tf-idf(研究+名詞,サ変接続, d, D),
        tf-idf(開発+名詞,サ変接続, d, D), tf-idf(共同, d, D))
A feature vector
   x = (0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.1473467, 2.4748564)
Features (1)
1)  BoW
    –  Bag of Words: the full output of MeCab (a Japanese morphological analyzer). The tf-idf of each word is an element of the feature vector.
2)  BoW(N)
    –  Only nouns are chosen.
3)  BoW(N-3)
    –  Words are restricted to proper nouns, general nouns, and Sahen-nouns (nouns that form a verb by adding ”する” ([suru], do)).
4)  K(14)
    –  Fourteen keywords related to U-I relations: ”研究” ([kennkyu], research), ”開発” ([kaihatsu], development), ”実験” ([jikken], experiment), ”成功” ([seikou], success), ”発見” ([hakken], discover), ”開始” ([kaisi], start), ”受賞” ([jushou], award), ”表彰” ([hyoushou], honor), ”共同” ([kyoudou], collaboration), ”協同” ([kyoudou], cooperation), ”協力” ([kyouryoku], join forces), ”産学” ([sangaku], U-I relationship), ”産官学” ([sankangaku], U-I-G (University-Industry-Government) relations), and ”連携” ([renkei], coordination).
5)  K(18)
    –  K(14) plus 4 keywords: ”受託” ([jutaku], entrusted with), ”委託” ([itaku], consignment), ”締結” ([teiketsu], conclusion), and ”研究員” ([kennkyuin], researcher).
Features (2)
6)  K(18)+NM
    –  Keywords plus the POS (part of speech) of the next morpheme in the text sequence; the grammatical connections of the keywords are restricted to verb, auxiliary verb, and Sahen-noun.
7)  Corp.
    –  Corporation marks: ”株式会社” ([kabushikigaisha], Incorporated), ”㈱” (the Unicode character U+3231), ”(株)”, or ”（株）”.
8)  Univ.
    –  University names: ”大学” ([daigaku], university) or ”大” ([dai], a shortened representation of university).
9)  C.+U.
    –  Both a corporation mark and a university name occur in the same sentence.
10) ORG
    –  The existence of an organization name, detected by CaboCha's Japanese named-entity extraction function.
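Features (7)–(9) can be sketched as regular-expression tests. The mark lists follow the slides; note that the short form ”大” deliberately over-matches (it appears in many ordinary words), as in the slides:

```python
# Sketch of features (7)-(9): corporation marks, university names,
# and their co-occurrence in one sentence. Mark lists follow the
# slides: 株式会社, ㈱ (U+3231), (株)/（株）, and 大学/大.
import re

CORP = re.compile(r"株式会社|㈱|\(株\)|（株）")
UNIV = re.compile(r"大学|大")  # the short form 大 over-matches by design

def corp_univ_features(sentence):
    """Binary features Corp., Univ., and C.+U. for one sentence."""
    has_corp = bool(CORP.search(sentence))
    has_univ = bool(UNIV.search(sentence))
    return {"Corp.": has_corp, "Univ.": has_univ, "C.+U.": has_corp and has_univ}
```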
Feature selection and SVM kernel functions

Test ID  TF-IDF feature elements                                   Kernel function
1-1      (1) BoW                                                   Linear
1-2      (2) BoW(N)                                                Linear
1-3      (3) BoW(N-3)                                              Linear
2-1      (4) K(14)                                                 Linear
2-2      (4) K(14)                                                 Polynomial
2-3      (4) K(14)                                                 RBF
3-1      (5) K(18)                                                 Linear
3-2      (5) K(18)                                                 Polynomial
3-3      (5) K(18)                                                 RBF
4-1      (6) K(18)+NM                                              Linear
4-2      (6) K(18)+NM                                              Polynomial
4-3      (6) K(18)+NM                                              RBF
5-1      (6) K(18)+NM, (10) ORG                                    Linear
5-2      (6) K(18)+NM, (10) ORG                                    Polynomial
5-3      (6) K(18)+NM, (10) ORG                                    RBF
6-1      (6) K(18)+NM, (7) Corp., (8) Univ., (10) ORG              Linear
6-2      (6) K(18)+NM, (7) Corp., (8) Univ., (10) ORG              Polynomial
6-3      (6) K(18)+NM, (7) Corp., (8) Univ., (10) ORG              RBF
7-1      (6) K(18)+NM, (7) Corp., (8) Univ., (9) C.+U.             Linear
7-2      (6) K(18)+NM, (7) Corp., (8) Univ., (9) C.+U.             Polynomial
7-3      (6) K(18)+NM, (7) Corp., (8) Univ., (9) C.+U.             RBF
7-4      (6) K(18)+NM, (7) Corp., (8) Univ., (9) C.+U.             RBF (γ tuned)
8-1      (6) K(18)+NM, (7) Corp., (8) Univ., (9) C.+U., (10) ORG   Linear
8-2      (6) K(18)+NM, (7) Corp., (8) Univ., (9) C.+U., (10) ORG   Polynomial
8-3      (6) K(18)+NM, (7) Corp., (8) Univ., (9) C.+U., (10) ORG   RBF
8-4      (6) K(18)+NM, (7) Corp., (8) Univ., (9) C.+U., (10) ORG   RBF (γ tuned)
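The three kernel types compared in the table can be sketched in NumPy; the degree, coef0, and γ values below are illustrative defaults, not the experiment's settings:

```python
# Sketch of the three SVM kernel types from the table.
# k(x, z) = phi(x)^T phi(z); hyperparameter values are illustrative.
import numpy as np

def linear_kernel(x, z):
    return float(np.dot(x, z))

def polynomial_kernel(x, z, degree=3, coef0=1.0):
    return float((np.dot(x, z) + coef0) ** degree)

def rbf_kernel(x, z, gamma=0.5):
    d = np.asarray(x) - np.asarray(z)
    return float(np.exp(-gamma * np.sum(d ** 2)))
```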
Data set for experiment

Organization          | Crawled articles     | Articles for experiment
                      | Positive | Negative  | Positive | Negative
Tohoku Univ.          |      44  |     499   |      44  |      44
The Univ. of Tokyo    |     106  |     848   |     106  |     106
Kyoto Univ.           |      40  |     329   |      40  |      40
Tokyo Inst. of Tech.  |      37  |     343   |      37  |      37
Hitachi Corp.         |     103  |     450   |     103  |     103
Total                 |     330  |    2469   |     330  |     330
Classification results (SVM light (Joachims))
Average scores over 10-fold cross validation

Feature set                         Test ID  Accuracy  Precision  Recall  F-measure
BoW                                 1-1      61.21     64.04      42.12   47.28
                                    1-2      60.61     63.75      40.00   45.54
                                    1-3      61.52     67.44      40.00   46.72
K(14)                               2-1      67.58     72.02      61.52   63.70
                                    2-2      58.03     69.76      23.33   34.45
                                    2-3      66.51     62.53      86.37   71.89
K(18)                               3-1      68.18     72.02      63.33   64.78
                                    3-2      57.88     69.00      23.03   34.08
                                    3-3      66.67     62.22      88.18   72.43
K(18)+NM                            4-1      70.61     74.66      63.64   67.40
                                    4-2      -         -          -       -
                                    4-3      70.76     65.49      90.30   75.66
K(18)+NM, ORG                       5-1      70.61     74.61      63.64   67.31
                                    5-2      -         -          -       -
                                    5-3      70.76     65.49      90.30   75.66
K(18)+NM, Corp., Univ., ORG         6-1      -         -          -       -
                                    6-2      -         -          -       -
                                    6-3      70.15     64.64      93.64   76.09
K(18)+NM, Corp., Univ., C.+U.       7-1      78.79     85.01      71.52   76.99
                                    7-2      -         -          -       -
                                    7-3      72.27     66.07      94.85   77.61
                                    7-4      80.15     78.81      83.94   81.05
K(18)+NM, Corp., Univ., C.+U., ORG  8-1      78.94     85.03      71.82   77.16
                                    8-2      -         -          -       -
                                    8-3      71.82     65.73      94.85   77.35
                                    8-4      79.85     78.51      83.94   80.86

- : not calculated because of zero precision or a learning-optimization fault
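The evaluation protocol (10-fold cross validation, averaging accuracy, precision, recall, and f-measure) can be sketched with scikit-learn on synthetic data; the actual experiment used SVMlight on the press-release corpus:

```python
# Sketch of the evaluation protocol: 10-fold cross validation reporting
# accuracy/precision/recall/F-measure. Synthetic data stands in for the
# press-release corpus; the slides' experiment used SVMlight.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_validate
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=20, random_state=0)
scores = cross_validate(
    SVC(kernel="rbf"), X, y, cv=10,
    scoring=("accuracy", "precision", "recall", "f1"),
)
# Average each metric over the 10 folds, as in the results table.
means = {m: scores[f"test_{m}"].mean()
         for m in ("accuracy", "precision", "recall", "f1")}
```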
Findings and discussion (1)
•  In tests 1-1, 1-2, and 1-3, the feature elements consist of BoW, with element counts over 15,800, 13,000, and 12,000 respectively. Their f-measures are worse than those of the other feature sets with the same linear kernel; learning appears to have failed.
•  The likely reason for the failure is that the training data is far smaller than what is needed. With enough training data, the number of training examples would exceed the feature vector size; once the training data size surpasses the number of basis functions of the SVM, learning can proceed without over-fitting.
Findings and discussion (2)
•  In tests 2-1 through 8-3, the feature element size is about 14 to 33.
•  Accuracy and f-measure gradually increase as the feature elements become more complex.
Findings and discussion (3)
•  Tests 7-* and 8-* involve the occurrence of university and company symbols. In test 7-3 in particular, recall and f-measure are highest, which means the co-occurrence of the two symbols in one sentence is a sensitive indicator of U-I relations.
•  The scores depend strongly on the kernel function type.
•  Kernel function parameters and the coefficient of the loss function affect the balance between precision and recall. The γ of the radial basis function was chosen to give the highest f-measure under cross validation in this experiment.
Conclusion and future work
•  To extract resources of U-I relations automatically from the web,
    –  we targeted the “press release articles” of organizations, and
    –  adapted a classification technique, the support vector machine (SVM), to the decision.
•  We conducted an experiment over several combinations of feature vector elements and SVM kernel function types.
•  The combinations reveal that
    –  U-I relations keywords, and
    –  university and company symbols in one sentence
    are effective feature elements.
•  The SVM parameters were tuned for a higher f-measure, which also affects the balance between precision and recall.
•  Finally, we obtained accuracy 80.15 and f-measure 81.05 for classifying U-I relations documents on the web.
•  In future work, we will build the classifier into a context crawler to automatically crawl the press release web sites of organizations and gather more resources.

More Related Content

PPTX
Rules for inducing hierarchies from social tagging data
PDF
Word Embedding In IR
PDF
The Black-Litterman model in the light of Bayesian portfolio analysis
PDF
Parameter Uncertainty and Learning in Dynamic Financial Decisions
PDF
Querying UML Class Diagrams - FoSSaCS 2012
PDF
Harnessing Deep Neural Networks with Logic Rules
PDF
A Text Mining Research Based on LDA Topic Modelling
PDF
www.ijerd.com
Rules for inducing hierarchies from social tagging data
Word Embedding In IR
The Black-Litterman model in the light of Bayesian portfolio analysis
Parameter Uncertainty and Learning in Dynamic Financial Decisions
Querying UML Class Diagrams - FoSSaCS 2012
Harnessing Deep Neural Networks with Logic Rules
A Text Mining Research Based on LDA Topic Modelling
www.ijerd.com

What's hot (19)

PPTX
Using Text Embeddings for Information Retrieval
PPT
similarity measure
PDF
Basic review on topic modeling
PDF
Julia text mining_inmobi
PDF
DOMAIN KEYWORD EXTRACTION TECHNIQUE: A NEW WEIGHTING METHOD BASED ON FREQUENC...
PDF
A Document Exploring System on LDA Topic Model for Wikipedia Articles
PDF
Non-parametric regressions & Neural Networks
PDF
Topic Modeling for Information Retrieval and Word Sense Disambiguation tasks
DOC
2007bai7604.doc.doc
PPTX
Neural Models for Information Retrieval
PDF
Latent Structured Ranking
PDF
Algorithm
PDF
Language Models for Information Retrieval
PDF
Mining Users Rare Sequential Topic Patterns from Tweets based on Topic Extrac...
PDF
Bayesian Nonparametrics, Applications to biology, ecology, and marketing
PDF
IRJET - Document Comparison based on TF-IDF Metric
PDF
Olivier Cappé's talk at BigMC March 2011
PDF
LDA on social bookmarking systems
Using Text Embeddings for Information Retrieval
similarity measure
Basic review on topic modeling
Julia text mining_inmobi
DOMAIN KEYWORD EXTRACTION TECHNIQUE: A NEW WEIGHTING METHOD BASED ON FREQUENC...
A Document Exploring System on LDA Topic Model for Wikipedia Articles
Non-parametric regressions & Neural Networks
Topic Modeling for Information Retrieval and Word Sense Disambiguation tasks
2007bai7604.doc.doc
Neural Models for Information Retrieval
Latent Structured Ranking
Algorithm
Language Models for Information Retrieval
Mining Users Rare Sequential Topic Patterns from Tweets based on Topic Extrac...
Bayesian Nonparametrics, Applications to biology, ecology, and marketing
IRJET - Document Comparison based on TF-IDF Metric
Olivier Cappé's talk at BigMC March 2011
LDA on social bookmarking systems
Ad

Similar to A SVM Applied Text Categorization of Academia-Industry Collaborative Research and Development Documents on the Web (20)

PDF
Automatic generation of domain models for call centers
PDF
Machine Learning: Learning with data
PDF
One talk Machine Learning
PDF
Introduction to active learning
PPTX
Intro to Vectorization Concepts - GaTech cse6242
PDF
論文サーベイ(Sasaki)
PDF
Context Driven Technique for Document Classification
PDF
Basic Research on Text Mining at UCSD.
DOC
Team G
PPT
16 17 bag_words
PPTX
Statistical classification: A review on some techniques
PDF
Statistical Machine Learning for Text Classification with scikit-learn and NLTK
PPT
text
PPT
Extraction of topic evolutions from references in scientific articles and its...
PPT
lecture_mooney.ppt
PDF
Simple semantics in topic detection and tracking
PDF
[Harvard CS264] 09 - Machine Learning on Big Data: Lessons Learned from Googl...
PDF
Declarative analysis of noisy information networks
PPTX
Data Mining Email SPam Detection PPT WITH Algorithms
PDF
Machine Learning in Computer Vision
Automatic generation of domain models for call centers
Machine Learning: Learning with data
One talk Machine Learning
Introduction to active learning
Intro to Vectorization Concepts - GaTech cse6242
論文サーベイ(Sasaki)
Context Driven Technique for Document Classification
Basic Research on Text Mining at UCSD.
Team G
16 17 bag_words
Statistical classification: A review on some techniques
Statistical Machine Learning for Text Classification with scikit-learn and NLTK
text
Extraction of topic evolutions from references in scientific articles and its...
lecture_mooney.ppt
Simple semantics in topic detection and tracking
[Harvard CS264] 09 - Machine Learning on Big Data: Lessons Learned from Googl...
Declarative analysis of noisy information networks
Data Mining Email SPam Detection PPT WITH Algorithms
Machine Learning in Computer Vision
Ad

More from National Institute of Informatics (20)

PPTX
Application of a Novel Subject Classification Scheme for a Bibliographic Data...
PPTX
Applying a new subject classification scheme for a database by a data-driven ...
PPTX
Toward universal information access on the digital object cloud
PDF
Making data typing efforts or automatically detecting data types for automat...
PDF
Applying tensor decompositions to author name disambiguation of common Japane...
PPTX
Emerging domain agnostic functionalities on the handle-centered networks
PPTX
テンソル分解の著者名寄せへの応用と潜在変数を持つモデルとの比較
PPTX
研究者識別子の重要性とORCIDアップデート
PPTX
離散一般化ベータ分布を仮定した研究分野マッピングの導出
PDF
レコードリンケージに基づく科研費分野-WoS分野マッピングの導出
PPTX
レコードリンケージに基づく科研費分野-WoS分野マッピング
PPTX
科研費分野-トピック分類マトリックスへの主成分分析の適用
PDF
学術情報流通のための識別子とメタデータDBを対象とした融合研究シーズ探索 - 超高層物理学分野における観測データを例として -
PDF
機械学習を用いたWeb上の産学連携関連文書の抽出
PDF
科研費データベースの分野分類とトピック分類の比較分析
PPTX
Researcher Identifiers and National Federated Search Portal for Japanese Inst...
PDF
著者の同定・識別について- JAIRO著者名検索プロジェクトへ -
PDF
1.研究者リゾルバーとJAIRO著者名検索、2.KAKENデータベースの機能拡張
PDF
なぜ研究者の名寄せが必要か ~ 世界の動向と研究者リゾルバー ~
PDF
ORCIDのプロトタイプシステムと著者ID関連技術の動向
Application of a Novel Subject Classification Scheme for a Bibliographic Data...
Applying a new subject classification scheme for a database by a data-driven ...
Toward universal information access on the digital object cloud
Making data typing efforts or automatically detecting data types for automat...
Applying tensor decompositions to author name disambiguation of common Japane...
Emerging domain agnostic functionalities on the handle-centered networks
テンソル分解の著者名寄せへの応用と潜在変数を持つモデルとの比較
研究者識別子の重要性とORCIDアップデート
離散一般化ベータ分布を仮定した研究分野マッピングの導出
レコードリンケージに基づく科研費分野-WoS分野マッピングの導出
レコードリンケージに基づく科研費分野-WoS分野マッピング
科研費分野-トピック分類マトリックスへの主成分分析の適用
学術情報流通のための識別子とメタデータDBを対象とした融合研究シーズ探索 - 超高層物理学分野における観測データを例として -
機械学習を用いたWeb上の産学連携関連文書の抽出
科研費データベースの分野分類とトピック分類の比較分析
Researcher Identifiers and National Federated Search Portal for Japanese Inst...
著者の同定・識別について- JAIRO著者名検索プロジェクトへ -
1.研究者リゾルバーとJAIRO著者名検索、2.KAKENデータベースの機能拡張
なぜ研究者の名寄せが必要か ~ 世界の動向と研究者リゾルバー ~
ORCIDのプロトタイプシステムと著者ID関連技術の動向

Recently uploaded (20)

PPTX
20250228 LYD VKU AI Blended-Learning.pptx
PPTX
Digital-Transformation-Roadmap-for-Companies.pptx
PDF
Empathic Computing: Creating Shared Understanding
PDF
Chapter 3 Spatial Domain Image Processing.pdf
PDF
Encapsulation theory and applications.pdf
PDF
TokAI - TikTok AI Agent : The First AI Application That Analyzes 10,000+ Vira...
PDF
Mobile App Security Testing_ A Comprehensive Guide.pdf
PDF
Review of recent advances in non-invasive hemoglobin estimation
PDF
The Rise and Fall of 3GPP – Time for a Sabbatical?
PDF
Blue Purple Modern Animated Computer Science Presentation.pdf.pdf
PDF
Reach Out and Touch Someone: Haptics and Empathic Computing
PDF
Per capita expenditure prediction using model stacking based on satellite ima...
PPTX
Detection-First SIEM: Rule Types, Dashboards, and Threat-Informed Strategy
PPTX
Effective Security Operations Center (SOC) A Modern, Strategic, and Threat-In...
PDF
Approach and Philosophy of On baking technology
PDF
Network Security Unit 5.pdf for BCA BBA.
PDF
Encapsulation_ Review paper, used for researhc scholars
PDF
Spectral efficient network and resource selection model in 5G networks
PDF
MIND Revenue Release Quarter 2 2025 Press Release
PDF
cuic standard and advanced reporting.pdf
20250228 LYD VKU AI Blended-Learning.pptx
Digital-Transformation-Roadmap-for-Companies.pptx
Empathic Computing: Creating Shared Understanding
Chapter 3 Spatial Domain Image Processing.pdf
Encapsulation theory and applications.pdf
TokAI - TikTok AI Agent : The First AI Application That Analyzes 10,000+ Vira...
Mobile App Security Testing_ A Comprehensive Guide.pdf
Review of recent advances in non-invasive hemoglobin estimation
The Rise and Fall of 3GPP – Time for a Sabbatical?
Blue Purple Modern Animated Computer Science Presentation.pdf.pdf
Reach Out and Touch Someone: Haptics and Empathic Computing
Per capita expenditure prediction using model stacking based on satellite ima...
Detection-First SIEM: Rule Types, Dashboards, and Threat-Informed Strategy
Effective Security Operations Center (SOC) A Modern, Strategic, and Threat-In...
Approach and Philosophy of On baking technology
Network Security Unit 5.pdf for BCA BBA.
Encapsulation_ Review paper, used for researhc scholars
Spectral efficient network and resource selection model in 5G networks
MIND Revenue Release Quarter 2 2025 Press Release
cuic standard and advanced reporting.pdf

A SVM Applied Text Categorization of Academia-Industry Collaborative Research and Development Documents on the Web

  • 1. Analysis and Modeling of Complex Data in Behavioral and Social Sciences Joint meeting of Japanese and Italian Classification Societies Anacapri (Capri Island, Italy), 3-4 September 2012 A SVM Applied Text Categorization of Academia-Industry Collaborative Research and Development Documents on the Web Kei Kurakawa1, Yuan Sun1, Nagayoshi Yamashita2, Yasumasa Baba3 1. National Institute of Informatics 2. GMO Research (ex- Japan Society for the Promotion of Science) 3. The Institute of Statistical Mathematics
  • 2. U-I-G relations •  To make a policy of science and technology research and U development, university- industry-government (U-I-G) relations is an important aspect I G to investigate it (Leydesdorff and Meyer, 2003). •  Web document is one of the research targets to clarify the state of the relationship. •  In the clarification process, to get the exact resources of U-I-G relations is the first requirement. 2
  • 3. Objective •  Objective is to extract automatically resources of U-I relations from the web. U I G •  We set a target into “press release articles” of organizations, and make a framework to automatically crawl them and decide which is of U-I relations. 3
  • 4. Automatic extraction framework for U-I relations documents on the web Press  release   ar7cles  published   on  university  or   company  web  site 1.  Crawling  Web   Crawled   Documents Documents 2.  Extrac7ng  Text   From  the   Extracted   Documents Texts 3.  Learning  to   Learned   Classify  the   4.  Classifying  the   Model   Document Document File 4
  • 5. Support Vector Machine (1) (Vapnik, 1995) y=1 •  Two class classifier y=0 y(x) = wT (x) + b y= 1 Bias parameter Fixed feature space transformation •  N input vectors margin –  Input vector: x1 , . . . , xN –  Target values: t1 , . . . , tN where tn 2 { 1, 1} Support Vector •  For all input vectors, tn y(xn ) > 0 •  Maximize margin between hyperplane y(x) = 1 and y(x) = 1 5
  • 6. Support Vector Machine (2) •  Optimization problem 1 2 arg min kwk . w,b 2 T subject to the constraints tn (w (x) + b) 1, n = 1, . . . , N •  By means of Lagrangian method N X y(x) = an tn k(x, xn ) + b. n=1 where kernel function is defined by k(x, x0 ) = (x)T (x0 ) ,and an > 0 is Lagrange multipliers 6
  • 7. U-I relations documents on the web •  Extracted texts from the web documents are very noisy for content analysis. –  Irrelevant text, e.g. menu label text, header or footer of page, ads are still remained. •  In our observation, –  irrelevant text tends to be solely term not in a sentence, –  in terms of detecting U-I relations, the exact evidence of relevance are occurred in two or three sequential and formal sentences. •  For example, ”the MIT researchers and scientists from MicroCHIPS Inc. reported that... ”, •  target of Japanese ”東京大学とオムロン株式会社は、共同研究に より、重なりや隠れに強く....” •  It’s enough to filter text including punctuation marks which means fully formal sentence. 7
• 8. Feature selection
•  tf-idf (term frequency – inverse document frequency) is defined by
   tf-idf(t, d, D) = tf(t, d) × idf(t, D),
   where t is a term, d a document, and D the whole document collection.
•  The feature for term t in document d is defined by
   x_{t,d} = tf-idf(t, d, D) × b_{t,d},
   x_d = (x_{t_1,d}, x_{t_2,d}, ..., x_{t_M,d}),
   where b_{t,d} = 1 if t ∈ d, and b_{t,d} = 0 if t ∉ d.
•  In our experiment, a ”term” can be a word in a document, the POS (part-of-speech) type of a morpheme, or the analytical output of external tools.
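The feature definition above (tf-idf masked by the indicator b_{t,d}) can be sketched from scratch for a fixed keyword vocabulary. This is an illustration, not the authors' code; the common logarithmic form idf(t, D) = log(|D| / df(t)) is assumed, since the slide does not spell out the idf variant used:

```python
import math

def tf_idf_vector(doc_tokens, corpus, vocab):
    """Map one tokenized document d to (tf-idf(t,d,D) * b_{t,d}) over vocab."""
    N = len(corpus)
    vec = []
    for t in vocab:
        b = 1 if t in doc_tokens else 0         # b_{t,d}
        tf = doc_tokens.count(t)                # tf(t, d)
        df = sum(1 for d in corpus if t in d)   # document frequency of t in D
        idf = math.log(N / df) if df else 0.0   # idf(t, D), assumed log form
        vec.append(tf * idf * b)
    return vec

# Toy corpus of pre-tokenized documents over a tiny keyword vocabulary
corpus = [["研究", "開発"], ["研究", "共同"], ["開発"]]
vocab = ["研究", "開発", "共同"]
print(tf_idf_vector(corpus[1], corpus, vocab))
```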
• 9. Mapping a document into a feature vector
A document:
東北大学は、NECとの共同研究によりCPU内で使用される電子回路(CAM:連想メモリプロセッサ)において、世界で初めて、既存回路と同等の高速動作と、処理中に電源を切ってもデータを回路上に保持できる不揮発動作、を両立する技術を開発、実証しました。
(”Through joint research with NEC, Tohoku University developed and demonstrated, for the first time in the world, a technology for electronic circuits used in CPUs (CAM: content-addressable memory processors) that combines high-speed operation equal to existing circuits with non-volatile operation that retains data on the circuit even when the power is cut during processing.”)
Feature selection:
x = (tf-idf(産官学, d, D), tf-idf(協力, d, D), tf-idf(開始+動詞, d, D), tf-idf(受託+動詞, d, D), tf-idf(研究+動詞, d, D), tf-idf(実験+動詞, d, D), tf-idf(開始+名詞,サ変接続, d, D), tf-idf(発見+動詞, d, D), tf-idf(研究員, d, D), tf-idf(研究+名詞,サ変接続, d, D), tf-idf(開発+名詞,サ変接続, d, D), tf-idf(共同, d, D))
A feature vector:
x = (0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.1473467, 2.4748564)
• 10. Features (1)
1)  BoW – Bag of words: the full output of MeCab (a Japanese morphological analyzer). The tf-idf of each word is an element of the feature vector x_n.
2)  BoW(N) – Only nouns are chosen.
3)  BoW(N-3) – Words restricted to proper nouns, general nouns, and sahen nouns (nouns that form verbs by adding ”する” ([suru], do)).
4)  K(14) – Fourteen keywords related to U-I relations: ”研究” ([kenkyu], research), ”開発” ([kaihatsu], development), ”実験” ([jikken], experiment), ”成功” ([seikou], success), ”発見” ([hakken], discovery), ”開始” ([kaishi], start), ”受賞” ([jushou], award), ”表彰” ([hyoushou], honor), ”共同” ([kyoudou], collaboration), ”協同” ([kyoudou], cooperation), ”協力” ([kyouryoku], join forces), ”産学” ([sangaku], U-I relations), ”産官学” ([sankangaku], U-I-G (university-industry-government) relations), and ”連携” ([renkei], coordination).
5)  K(18) – K(14) plus four keywords: ”受託” ([jutaku], entrusted with), ”委託” ([itaku], consignment), ”締結” ([teiketsu], conclusion), and ”研究員” ([kenkyuin], researcher).
• 11. Features (2)
6)  K(18)+NM – The POS (part of speech) of the morpheme following each keyword in the text is checked, so that the grammatical connections of the keywords are restricted to verbs, auxiliary verbs, and sahen nouns.
7)  Corp. – Corporation marks: ”株式会社” ([kabushikigaisha], Incorporated), ”㈱” (the Unicode character U+3231), ”(株)”, or ”（株）”.
8)  Univ. – University names: ”大学” ([daigaku], university) or ”大” ([dai], a shortened form of university).
9)  C.+U. – Both a corporation mark and a university name occur in the same sentence.
10) ORG – The presence of an organization name, detected by CaboCha's Japanese named-entity extraction function.
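The surface features (7)–(9) above reduce to per-sentence pattern checks. A hedged sketch with hypothetical regexes (the patterns follow the slide's description, but the exact expressions used in the experiment are not shown; the shortened ”大” university form is omitted here to keep the pattern unambiguous):

```python
import re

# Feature (7): corporation marks, including the Unicode character U+3231 (㈱)
CORP = re.compile(r"株式会社|㈱|\(株\)|（株）")
# Feature (8): university names ("大学"; the short form "大" is omitted here)
UNIV = re.compile(r"大学")

def cu_features(sentence):
    """Return binary features (Corp., Univ., C.+U.) for one sentence."""
    c = bool(CORP.search(sentence))
    u = bool(UNIV.search(sentence))
    return c, u, c and u   # (9) C.+U.: both marks in the same sentence

print(cu_features("東京大学とオムロン株式会社は共同研究を開始した。"))  # → (True, True, True)
```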
• 12. Feature selection and SVM kernel functions
Test ID | TF-IDF feature elements | Kernel function
1-1 | (1) BoW | Linear
1-2 | (2) BoW(N) | Linear
1-3 | (3) BoW(N-3) | Linear
2-1 / 2-2 / 2-3 | (4) K(14) | Linear / Polynomial / RBF
3-1 / 3-2 / 3-3 | (5) K(18) | Linear / Polynomial / RBF
4-1 / 4-2 / 4-3 | (6) K(18)+NM | Linear / Polynomial / RBF
5-1 / 5-2 / 5-3 | (6) K(18)+NM, (10) ORG | Linear / Polynomial / RBF
6-1 / 6-2 / 6-3 | (6) K(18)+NM, (7) Corp., (8) Univ., (10) ORG | Linear / Polynomial / RBF
7-1 / 7-2 / 7-3 / 7-4 | (6) K(18)+NM, (7) Corp., (8) Univ., (9) C.+U. | Linear / Polynomial / RBF / RBF (γ tuned)
8-1 / 8-2 / 8-3 / 8-4 | (6) K(18)+NM, (7) Corp., (8) Univ., (9) C.+U., (10) ORG | Linear / Polynomial / RBF / RBF (γ tuned)
• 13. Data set for experiment
Organization | Crawled: Positive | Crawled: Negative | Experiment: Positive | Experiment: Negative
Tohoku Univ. | 44 | 499 | 44 | 44
The Univ. of Tokyo | 106 | 848 | 106 | 106
Kyoto Univ. | 40 | 329 | 40 | 40
Tokyo Inst. of Tech. | 37 | 343 | 37 | 37
Hitachi Corp. | 103 | 450 | 103 | 103
Total | 330 | 2469 | 330 | 330
• 14. Classification results (SVM light (Joachims))
Average scores over 10-fold cross validation:
Features | Test ID | Accuracy | Precision | Recall | F-measure
BoW variants | 1-1 | 61.21 | 64.04 | 42.12 | 47.28
 | 1-2 | 60.61 | 63.75 | 40.00 | 45.54
 | 1-3 | 61.52 | 67.44 | 40.00 | 46.72
K(14) | 2-1 | 67.58 | 72.02 | 61.52 | 63.70
 | 2-2 | 58.03 | 69.76 | 23.33 | 34.45
 | 2-3 | 66.51 | 62.53 | 86.37 | 71.89
K(18) | 3-1 | 68.18 | 72.02 | 63.33 | 64.78
 | 3-2 | 57.88 | 69.00 | 23.03 | 34.08
 | 3-3 | 66.67 | 62.22 | 88.18 | 72.43
K(18)+NM | 4-1 | 70.61 | 74.66 | 63.64 | 67.40
 | 4-2 | - | - | - | -
 | 4-3 | 70.76 | 65.49 | 90.30 | 75.66
K(18)+NM, ORG | 5-1 | 70.61 | 74.61 | 63.64 | 67.31
 | 5-2 | - | - | - | -
 | 5-3 | 70.76 | 65.49 | 90.30 | 75.66
K(18)+NM, Corp., Univ., ORG | 6-1 | - | - | - | -
 | 6-2 | - | - | - | -
 | 6-3 | 70.15 | 64.64 | 93.64 | 76.09
K(18)+NM, Corp., Univ., C.+U. | 7-1 | 78.79 | 85.01 | 71.52 | 76.99
 | 7-2 | - | - | - | -
 | 7-3 | 72.27 | 66.07 | 94.85 | 77.61
 | 7-4 | 80.15 | 78.81 | 83.94 | 81.05
K(18)+NM, Corp., Univ., C.+U., ORG | 8-1 | 78.94 | 85.03 | 71.82 | 77.16
 | 8-2 | - | - | - | -
 | 8-3 | 71.82 | 65.73 | 94.85 | 77.35
 | 8-4 | 79.85 | 78.51 | 83.94 | 80.86
(”-”: not calculated because of zero precision or a fault in the learning optimization)
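The averaged scores above come from 10-fold cross validation. The evaluation loop can be sketched with scikit-learn (synthetic data stands in for the 330+330 labeled article vectors; SVM-light itself, which the experiment used, is not invoked here):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_validate
from sklearn.svm import SVC

# Synthetic stand-in for the 660 labeled feature vectors (18 K(18) features)
X, y = make_classification(n_samples=660, n_features=18, random_state=0)

clf = SVC(kernel="rbf")
scores = cross_validate(clf, X, y, cv=10,
                        scoring=["accuracy", "precision", "recall", "f1"])

# Average each metric over the 10 folds, as in the results table
for name in ["accuracy", "precision", "recall", "f1"]:
    print(name, scores[f"test_{name}"].mean())
```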
• 15. Findings and discussion (1) •  In test IDs 1-1, 1-2, and 1-3, the feature vectors consist of BoW elements numbering over 15,800, 13,000, and 12,000 respectively. Their F-measures are worse than those of the other features under the same linear kernel; learning appears to have failed. •  The likely reason is that the training data set is far too small for learning. With a sufficiently large training set, the number of examples would exceed the feature vector size, i.e. the number of basis functions of the SVM, so that learning could be done without over-fitting.
• 16. Findings and discussion (2) •  In test IDs 2-1 through 8-3, the feature element size is about 14 to 33. •  Accuracy and F-measure gradually improve as the feature elements become more complex.
• 17. Findings and discussion (3) •  Test IDs 7-* and 8-* involve the co-occurrence of university and company symbols. In ID 7-3 in particular, recall and F-measure are highest, which means the occurrence of the two symbols in a single sentence is a sensitive indicator of U-I relations. •  The scores strongly depend on the kernel function type. •  The kernel parameters and the loss-function coefficient affect the balance between precision and recall. In this experiment, γ of the radial basis function kernel was chosen to maximize the F-measure under cross validation.
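The γ tuning described above, choosing the RBF width that maximizes the F-measure under cross validation, corresponds to a grid search. A sketch with scikit-learn, again on synthetic stand-in data (the candidate γ values are illustrative; the slide does not give the grid actually searched):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in for the labeled article feature vectors
X, y = make_classification(n_samples=660, n_features=18, random_state=0)

# Pick gamma of the RBF kernel by the highest cross-validated F-measure
grid = GridSearchCV(SVC(kernel="rbf"),
                    param_grid={"gamma": [1e-3, 1e-2, 1e-1, 1.0]},
                    scoring="f1", cv=10).fit(X, y)
print(grid.best_params_["gamma"], grid.best_score_)
```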
• 18. Conclusion and future work •  To automatically extract resources on U-I relations from the web, –  we targeted the “press release articles” of organizations, and –  adopted a classification technique, the support vector machine (SVM), for the decision. •  We conducted an experiment over several combinations of feature vector elements and SVM kernel function types. •  The combinations reveal that –  U-I relations keywords, and –  the co-occurrence of university and company symbols in a sentence, are effective feature elements. •  The SVM parameters were tuned for a higher F-measure, which also affects the balance between precision and recall. •  Finally, we obtained an accuracy of 80.15 and an F-measure of 81.05 for classifying U-I relations documents on the web. •  In future work, we will build the classifier into a crawler to automatically crawl the press-release web sites of organizations and obtain more resources.