Analysis and Modeling of Complex Data in Behavioral and Social Sciences
Joint meeting of Japanese and Italian Classification Societies
Anacapri (Capri Island, Italy), 3-4 September 2012

A SVM Applied Text Categorization of Academia-Industry Collaborative
Research and Development Documents on the Web

Kei Kurakawa1, Yuan Sun1, Nagayoshi Yamashita2, Yasumasa Baba3
1. National Institute of Informatics
2. GMO Research (ex- Japan Society for the Promotion of Science)
3. The Institute of Statistical Mathematics
U-I-G relations
•  To make science and technology research and development policy, university-industry-government (U-I-G) relations are an important aspect to investigate (Leydesdorff and Meyer, 2003).
•  Web documents are one of the research targets for clarifying the state of the relationship.
•  In the clarification process, obtaining the exact resources of U-I-G relations is the first requirement.
Objective
•  Our objective is to extract resources of U-I relations automatically from the web.
•  We target the “press release articles” of organizations, and build a framework that automatically crawls them and decides which ones describe U-I relations.
Automatic extraction framework for
U-I relations documents on the web
Input: press release articles published on university or company web sites.
1. Crawling the web → crawled documents
2. Extracting text from the documents → extracted texts
3. Learning to classify the documents → learned model file
4. Classifying the documents with the learned model
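As a minimal sketch, the four steps can be wired together like this in Python; every function here (`crawl`, `extract_text`, `learn`, `classify`) is a hypothetical stand-in, not the authors' implementation:

```python
# Minimal sketch of the 4-step framework; each component is a
# hypothetical placeholder for the real crawler/extractor/classifier.
import re

def crawl(urls):
    """Step 1: fetch press-release pages (stubbed: returns raw HTML strings)."""
    return ["<html><body>Press release ...</body></html>" for _ in urls]

def extract_text(html):
    """Step 2: strip markup, keep visible text (very crude stand-in)."""
    return re.sub(r"<[^>]+>", " ", html).strip()

def learn(texts, labels):
    """Step 3: learn a classifier; a trivial keyword rule stands in for SVM training."""
    return lambda text: 1 if "joint research" in text else 0

def classify(model, text):
    """Step 4: apply the learned model to a new document (1 = U-I relations)."""
    return model(text)

model = learn(["joint research with NEC", "campus festival"], [1, 0])
```

The point is only the data flow: each step's output feeds the next, and the learned model is reusable across new documents.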
Support Vector Machine (1)
(Vapnik, 1995)
•  Two-class classifier
     y(x) = w^T φ(x) + b
   where φ(x) is a fixed feature-space transformation and b is a bias parameter.
•  N input vectors
   –  Input vectors: x_1, ..., x_N
   –  Target values: t_1, ..., t_N, where t_n ∈ {−1, 1}
•  For all input vectors, t_n y(x_n) > 0.
•  Maximize the margin between the hyperplanes y(x) = 1 and y(x) = −1.
[Figure: separating hyperplane y(x) = 0 with margin boundaries y(x) = 1 and y(x) = −1; the points lying on the boundaries are the support vectors.]
Support Vector Machine (2)
•  Optimization problem
     arg min_{w,b} (1/2) ||w||^2
   subject to the constraints
     t_n (w^T φ(x_n) + b) ≥ 1,   n = 1, ..., N
•  By means of the Lagrangian method, the solution takes the dual form
     y(x) = Σ_{n=1}^{N} a_n t_n k(x, x_n) + b,
   where the kernel function is defined by k(x, x′) = φ(x)^T φ(x′), and the a_n ≥ 0 are Lagrange multipliers (a_n > 0 only for the support vectors).
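The dual form can be checked numerically. The sketch below uses scikit-learn's SVC (the slides' experiment used SVMlight) and re-evaluates y(x) by hand from the support vectors, the stored dual coefficients a_n·t_n, and an RBF kernel; the toy data and γ = 0.5 are illustrative only:

```python
# Sketch: evaluate the dual form y(x) = sum_n a_n t_n k(x, x_n) + b
# by hand and compare with scikit-learn's built-in decision function.
import numpy as np
from sklearn.svm import SVC

GAMMA = 0.5  # illustrative RBF width, not the experiment's setting

X = np.array([[0.0, 0.0], [1.0, 1.0], [0.0, 1.0], [1.0, 0.0],
              [3.0, 3.0], [4.0, 4.0], [3.0, 4.0], [4.0, 3.0]])
t = np.array([-1, -1, -1, -1, 1, 1, 1, 1])

clf = SVC(kernel="rbf", gamma=GAMMA).fit(X, t)

def decision(x):
    """Compute y(x) = sum_n a_n t_n k(x, x_n) + b from the support vectors."""
    sv = clf.support_vectors_
    k = np.exp(-GAMMA * np.sum((sv - x) ** 2, axis=1))  # RBF kernel k(x, x_n)
    # dual_coef_ stores a_n * t_n for the support vectors; intercept_ is b
    return (clf.dual_coef_ @ k).item() + clf.intercept_[0]
```

`decision(x)` agrees with `clf.decision_function`, which shows that only the support vectors (where a_n > 0) contribute to the prediction.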
U-I relations documents on the web
•  Texts extracted from web documents are very noisy for content analysis.
   –  Irrelevant text, e.g. menu labels, page headers and footers, and ads, still remains.
•  In our observation,
   –  irrelevant text tends to be isolated terms rather than full sentences,
   –  for detecting U-I relations, the exact evidence of relevance occurs in two or three sequential, formal sentences.
      •  For example, ”the MIT researchers and scientists from MicroCHIPS Inc. reported that... ”,
      •  or the Japanese target ”東京大学とオムロン株式会社は、共同研究により、重なりや隠れに強く....” (The University of Tokyo and OMRON Corporation, through joint research, ... robust to overlap and occlusion ....)
•  It is therefore enough to keep only text that contains punctuation marks, which signal fully formal sentences.
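A minimal sketch of this filter, assuming we keep any line containing Japanese (。、) or Latin (. ,) sentence punctuation; the exact punctuation set is our assumption, not specified on the slide:

```python
# Sketch of the noise filter: keep only lines containing sentence
# punctuation, which signals a formal sentence rather than a menu
# label, header/footer, or ad fragment.
import re

SENTENCE_PUNCT = re.compile(r"[。、.,]")  # assumed punctuation set

def keep_formal_sentences(lines):
    """Return only the lines that look like formal sentences."""
    return [ln for ln in lines if SENTENCE_PUNCT.search(ln)]
```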
Feature selection
•  tf-idf (Term Frequency – Inverse Document Frequency)
•  tf-idf is defined by tf-idf(t, d, D) = tf(t, d) × idf(t, D), where t is a term, d a document, and D the set of all documents.
•  The feature vector is defined by
     x_d = (x_{t1,d}, x_{t2,d}, ..., x_{tM,d}),
     x_{t,d} = tf-idf(t, d, D) × b_{t,d},   b_{t,d} = 1 if t ∈ d, 0 if t ∉ d.
•  In our experiment, a term can be a word in a document, the POS (part-of-speech) type of a morpheme, or the analytical output of external tools.
Mapping a document into a feature vector
A document
   東北大学は、NECとの共同研究によりCPU内で使用される電子回路(CAM:連想メモリプロセッサ)において、世界で初めて、既存回路と同等の高速動作と、処理中に電源を切ってもデータを回路上に保持できる不揮発動作、を両立する技術を開発、実証しました。
   (Tohoku University, in joint research with NEC, developed and demonstrated, for the first time in the world, a technology for the electronic circuits used inside CPUs (CAM: content-addressable memory processors) that combines high-speed operation equal to existing circuits with non-volatile operation that retains data on the circuit even when power is cut during processing.)
Feature selection
   x = (tf-idf(産官学, d, D), tf-idf(協力, d, D),
        tf-idf(開始+動詞, d, D), tf-idf(受託+動詞, d, D),
        tf-idf(研究+動詞, d, D), tf-idf(実験+動詞, d, D),
        tf-idf(開始+名詞,サ変接続, d, D), tf-idf(発見+動詞, d, D),
        tf-idf(研究員, d, D), tf-idf(研究+名詞,サ変接続, d, D),
        tf-idf(開発+名詞,サ変接続, d, D), tf-idf(共同, d, D))
A feature vector
   x = (0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.1473467, 2.4748564)
Features (1)
1)  BoW
    –  Bag of Words: the full output of MeCab (a Japanese morphological analyzer). The tf-idf of each word is an element of the feature vector.
2)  BoW(N)
    –  Only nouns are chosen.
3)  BoW(N-3)
    –  Words are restricted to proper nouns, general nouns, and Sahen-nouns (nouns that form a verb by adding ”する” ([suru], do)).
4)  K(14)
    –  Fourteen keywords related to U-I relations: ”研究” ([kennkyu], research), ”開発” ([kaihatsu], development), ”実験” ([jikken], experiment), ”成功” ([seikou], success), ”発見” ([hakken], discover), ”開始” ([kaisi], start), ”受賞” ([jushou], award), ”表彰” ([hyoushou], honor), ”共同” ([kyoudou], collaboration), ”協同” ([kyoudou], cooperation), ”協力” ([kyouryoku], join forces), ”産学” ([sangaku], U-I relationship), ”産官学” ([sankangaku], U-I-G (University-Industry-Government) relations), and ”連携” ([renkei], coordination).
5)  K(18)
    –  K(14) plus 4 keywords: ”受託” ([jutaku], entrusted with), ”委託” ([itaku], consignment), ”締結” ([teiketsu], conclusion), and ”研究員” ([kennkyuin], researcher).
Features (2)
6)  K(18)+NM
    –  Keywords plus the POS (part of speech) of the next morpheme in the text sequence; the grammatical connections of the keywords are restricted to verb, auxiliary verb, and Sahen-noun.
7)  Corp.
    –  Corporation marks: ”株式会社” ([kabushikigaisha], Incorporated), ”㈱” (the Unicode character U+3231), ”(株)”, or ”（株）”.
8)  Univ.
    –  University names: ”大学” ([daigaku], university) or ”大” ([dai], a shortened representation of university).
9)  C.+U.
    –  Both a corporation mark and a university name occur in the same sentence.
10) ORG
    –  The existence of an organization name, detected by CaboCha's Japanese named-entity extraction function.
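Features (7)–(9) can be sketched as regular-expression tests. The mark lists follow the slides; note that the short form ”大” deliberately over-matches (it appears in many ordinary words), as in the slides:

```python
# Sketch of features (7)-(9): corporation marks, university names,
# and their co-occurrence in one sentence. Mark lists follow the
# slides: 株式会社, ㈱ (U+3231), (株)/（株）, and 大学/大.
import re

CORP = re.compile(r"株式会社|㈱|\(株\)|（株）")
UNIV = re.compile(r"大学|大")  # the short form 大 over-matches by design

def corp_univ_features(sentence):
    """Binary features Corp., Univ., and C.+U. for one sentence."""
    has_corp = bool(CORP.search(sentence))
    has_univ = bool(UNIV.search(sentence))
    return {"Corp.": has_corp, "Univ.": has_univ, "C.+U.": has_corp and has_univ}
```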
Feature selection and SVM kernel functions

Test ID  TF-IDF feature elements                                   Kernel function
1-1      (1) BoW                                                   Linear
1-2      (2) BoW(N)                                                Linear
1-3      (3) BoW(N-3)                                              Linear
2-1      (4) K(14)                                                 Linear
2-2      (4) K(14)                                                 Polynomial
2-3      (4) K(14)                                                 RBF
3-1      (5) K(18)                                                 Linear
3-2      (5) K(18)                                                 Polynomial
3-3      (5) K(18)                                                 RBF
4-1      (6) K(18)+NM                                              Linear
4-2      (6) K(18)+NM                                              Polynomial
4-3      (6) K(18)+NM                                              RBF
5-1      (6) K(18)+NM, (10) ORG                                    Linear
5-2      (6) K(18)+NM, (10) ORG                                    Polynomial
5-3      (6) K(18)+NM, (10) ORG                                    RBF
6-1      (6) K(18)+NM, (7) Corp., (8) Univ., (10) ORG              Linear
6-2      (6) K(18)+NM, (7) Corp., (8) Univ., (10) ORG              Polynomial
6-3      (6) K(18)+NM, (7) Corp., (8) Univ., (10) ORG              RBF
7-1      (6) K(18)+NM, (7) Corp., (8) Univ., (9) C.+U.             Linear
7-2      (6) K(18)+NM, (7) Corp., (8) Univ., (9) C.+U.             Polynomial
7-3      (6) K(18)+NM, (7) Corp., (8) Univ., (9) C.+U.             RBF
7-4      (6) K(18)+NM, (7) Corp., (8) Univ., (9) C.+U.             RBF (γ tuned)
8-1      (6) K(18)+NM, (7) Corp., (8) Univ., (9) C.+U., (10) ORG   Linear
8-2      (6) K(18)+NM, (7) Corp., (8) Univ., (9) C.+U., (10) ORG   Polynomial
8-3      (6) K(18)+NM, (7) Corp., (8) Univ., (9) C.+U., (10) ORG   RBF
8-4      (6) K(18)+NM, (7) Corp., (8) Univ., (9) C.+U., (10) ORG   RBF (γ tuned)
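The three kernel types compared in the table can be sketched in NumPy; the degree, coef0, and γ values below are illustrative defaults, not the experiment's settings:

```python
# Sketch of the three SVM kernel types from the table.
# k(x, z) = phi(x)^T phi(z); hyperparameter values are illustrative.
import numpy as np

def linear_kernel(x, z):
    return float(np.dot(x, z))

def polynomial_kernel(x, z, degree=3, coef0=1.0):
    return float((np.dot(x, z) + coef0) ** degree)

def rbf_kernel(x, z, gamma=0.5):
    d = np.asarray(x) - np.asarray(z)
    return float(np.exp(-gamma * np.sum(d ** 2)))
```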
Data set for experiment

Organization          | Crawled articles     | Articles for experiment
                      | Positive | Negative  | Positive | Negative
Tohoku Univ.          |      44  |     499   |      44  |      44
The Univ. of Tokyo    |     106  |     848   |     106  |     106
Kyoto Univ.           |      40  |     329   |      40  |      40
Tokyo Inst. of Tech.  |      37  |     343   |      37  |      37
Hitachi Corp.         |     103  |     450   |     103  |     103
Total                 |     330  |    2469   |     330  |     330
Classification results (SVM light (Joachims))
Average scores over 10-fold cross validation

Feature set                         Test ID  Accuracy  Precision  Recall  F-measure
BoW                                 1-1      61.21     64.04      42.12   47.28
                                    1-2      60.61     63.75      40.00   45.54
                                    1-3      61.52     67.44      40.00   46.72
K(14)                               2-1      67.58     72.02      61.52   63.70
                                    2-2      58.03     69.76      23.33   34.45
                                    2-3      66.51     62.53      86.37   71.89
K(18)                               3-1      68.18     72.02      63.33   64.78
                                    3-2      57.88     69.00      23.03   34.08
                                    3-3      66.67     62.22      88.18   72.43
K(18)+NM                            4-1      70.61     74.66      63.64   67.40
                                    4-2      -         -          -       -
                                    4-3      70.76     65.49      90.30   75.66
K(18)+NM, ORG                       5-1      70.61     74.61      63.64   67.31
                                    5-2      -         -          -       -
                                    5-3      70.76     65.49      90.30   75.66
K(18)+NM, Corp., Univ., ORG         6-1      -         -          -       -
                                    6-2      -         -          -       -
                                    6-3      70.15     64.64      93.64   76.09
K(18)+NM, Corp., Univ., C.+U.       7-1      78.79     85.01      71.52   76.99
                                    7-2      -         -          -       -
                                    7-3      72.27     66.07      94.85   77.61
                                    7-4      80.15     78.81      83.94   81.05
K(18)+NM, Corp., Univ., C.+U., ORG  8-1      78.94     85.03      71.82   77.16
                                    8-2      -         -          -       -
                                    8-3      71.82     65.73      94.85   77.35
                                    8-4      79.85     78.51      83.94   80.86

- : not calculated because of zero precision or a learning-optimization fault
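The evaluation protocol (10-fold cross validation, averaging accuracy, precision, recall, and f-measure) can be sketched with scikit-learn on synthetic data; the actual experiment used SVMlight on the press-release corpus:

```python
# Sketch of the evaluation protocol: 10-fold cross validation reporting
# accuracy/precision/recall/F-measure. Synthetic data stands in for the
# press-release corpus; the slides' experiment used SVMlight.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_validate
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=20, random_state=0)
scores = cross_validate(
    SVC(kernel="rbf"), X, y, cv=10,
    scoring=("accuracy", "precision", "recall", "f1"),
)
# Average each metric over the 10 folds, as in the results table.
means = {m: scores[f"test_{m}"].mean()
         for m in ("accuracy", "precision", "recall", "f1")}
```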
Findings and discussion (1)
•  In tests 1-1, 1-2, and 1-3, the feature elements consist of BoW, with element counts over 15,800, 13,000, and 12,000 respectively. Their f-measures are worse than those of the other feature sets with the same linear kernel; learning appears to have failed.
•  The likely reason for the failure is that the training data is far smaller than what is needed. With enough training data, the number of training examples would exceed the feature vector size; once the training data size surpasses the number of basis functions of the SVM, learning can proceed without over-fitting.
Findings and discussion (2)
•  In tests 2-1 through 8-3, the feature element size is about 14 to 33.
•  Accuracy and f-measure gradually increase as the feature elements become more complex.
Findings and discussion (3)
•  Tests 7-* and 8-* involve the occurrence of university and company symbols. In test 7-3 in particular, recall and f-measure are highest, which means the co-occurrence of the two symbols in one sentence is a sensitive indicator of U-I relations.
•  The scores depend strongly on the kernel function type.
•  Kernel function parameters and the coefficient of the loss function affect the balance between precision and recall. The γ of the radial basis function was chosen to give the highest f-measure under cross validation in this experiment.
Conclusion and future work
•  To extract resources of U-I relations automatically from the web,
    –  we targeted the “press release articles” of organizations, and
    –  adapted a classification technique, the support vector machine (SVM), to the decision.
•  We conducted an experiment over several combinations of feature vector elements and SVM kernel function types.
•  The combinations reveal that
    –  U-I relations keywords, and
    –  university and company symbols in one sentence
    are effective feature elements.
•  The SVM parameters were tuned for a higher f-measure, which also affects the balance between precision and recall.
•  Finally, we obtained accuracy 80.15 and f-measure 81.05 for classifying U-I relations documents on the web.
•  In future work, we will build the classifier into a context crawler to automatically crawl the press release web sites of organizations and gather more resources.

More Related Content

PPTX
Rules for inducing hierarchies from social tagging data
PDF
Word Embedding In IR
PDF
The Black-Litterman model in the light of Bayesian portfolio analysis
PDF
Parameter Uncertainty and Learning in Dynamic Financial Decisions
PDF
Querying UML Class Diagrams - FoSSaCS 2012
PDF
Harnessing Deep Neural Networks with Logic Rules
PDF
A Text Mining Research Based on LDA Topic Modelling
PDF
www.ijerd.com
Rules for inducing hierarchies from social tagging data
Word Embedding In IR
The Black-Litterman model in the light of Bayesian portfolio analysis
Parameter Uncertainty and Learning in Dynamic Financial Decisions
Querying UML Class Diagrams - FoSSaCS 2012
Harnessing Deep Neural Networks with Logic Rules
A Text Mining Research Based on LDA Topic Modelling
www.ijerd.com

What's hot (19)

PPTX
Using Text Embeddings for Information Retrieval
PPT
similarity measure
PDF
Basic review on topic modeling
PDF
Julia text mining_inmobi
PDF
DOMAIN KEYWORD EXTRACTION TECHNIQUE: A NEW WEIGHTING METHOD BASED ON FREQUENC...
PDF
A Document Exploring System on LDA Topic Model for Wikipedia Articles
PDF
Non-parametric regressions & Neural Networks
PDF
Topic Modeling for Information Retrieval and Word Sense Disambiguation tasks
DOC
2007bai7604.doc.doc
PPTX
Neural Models for Information Retrieval
PDF
Latent Structured Ranking
PDF
Algorithm
PDF
Language Models for Information Retrieval
PDF
Mining Users Rare Sequential Topic Patterns from Tweets based on Topic Extrac...
PDF
Bayesian Nonparametrics, Applications to biology, ecology, and marketing
PDF
IRJET - Document Comparison based on TF-IDF Metric
PDF
Olivier Cappé's talk at BigMC March 2011
PDF
LDA on social bookmarking systems
Using Text Embeddings for Information Retrieval
similarity measure
Basic review on topic modeling
Julia text mining_inmobi
DOMAIN KEYWORD EXTRACTION TECHNIQUE: A NEW WEIGHTING METHOD BASED ON FREQUENC...
A Document Exploring System on LDA Topic Model for Wikipedia Articles
Non-parametric regressions & Neural Networks
Topic Modeling for Information Retrieval and Word Sense Disambiguation tasks
2007bai7604.doc.doc
Neural Models for Information Retrieval
Latent Structured Ranking
Algorithm
Language Models for Information Retrieval
Mining Users Rare Sequential Topic Patterns from Tweets based on Topic Extrac...
Bayesian Nonparametrics, Applications to biology, ecology, and marketing
IRJET - Document Comparison based on TF-IDF Metric
Olivier Cappé's talk at BigMC March 2011
LDA on social bookmarking systems
Ad

Similar to A SVM Applied Text Categorization of Academia-Industry Collaborative Research and Development Documents on the Web (20)

PDF
Automatic generation of domain models for call centers
PDF
Machine Learning: Learning with data
PDF
One talk Machine Learning
PDF
Introduction to active learning
PPTX
Intro to Vectorization Concepts - GaTech cse6242
PDF
論文サーベイ(Sasaki)
PDF
Context Driven Technique for Document Classification
PDF
Basic Research on Text Mining at UCSD.
DOC
Team G
PPT
16 17 bag_words
PPTX
Statistical classification: A review on some techniques
PDF
Statistical Machine Learning for Text Classification with scikit-learn and NLTK
PPT
text
PPT
Extraction of topic evolutions from references in scientific articles and its...
PPT
lecture_mooney.ppt
PDF
Simple semantics in topic detection and tracking
PDF
[Harvard CS264] 09 - Machine Learning on Big Data: Lessons Learned from Googl...
PDF
Declarative analysis of noisy information networks
PPTX
Data Mining Email SPam Detection PPT WITH Algorithms
PDF
Machine Learning in Computer Vision
Automatic generation of domain models for call centers
Machine Learning: Learning with data
One talk Machine Learning
Introduction to active learning
Intro to Vectorization Concepts - GaTech cse6242
論文サーベイ(Sasaki)
Context Driven Technique for Document Classification
Basic Research on Text Mining at UCSD.
Team G
16 17 bag_words
Statistical classification: A review on some techniques
Statistical Machine Learning for Text Classification with scikit-learn and NLTK
text
Extraction of topic evolutions from references in scientific articles and its...
lecture_mooney.ppt
Simple semantics in topic detection and tracking
[Harvard CS264] 09 - Machine Learning on Big Data: Lessons Learned from Googl...
Declarative analysis of noisy information networks
Data Mining Email SPam Detection PPT WITH Algorithms
Machine Learning in Computer Vision
Ad

More from National Institute of Informatics (20)

PPTX
Application of a Novel Subject Classification Scheme for a Bibliographic Data...
PPTX
Applying a new subject classification scheme for a database by a data-driven ...
PPTX
Toward universal information access on the digital object cloud
PDF
Making data typing efforts or automatically detecting data types for automat...
PDF
Applying tensor decompositions to author name disambiguation of common Japane...
PPTX
Emerging domain agnostic functionalities on the handle-centered networks
PPTX
テンソル分解の著者名寄せへの応用と潜在変数を持つモデルとの比較
PPTX
研究者識別子の重要性とORCIDアップデート
PPTX
離散一般化ベータ分布を仮定した研究分野マッピングの導出
PDF
レコードリンケージに基づく科研費分野-WoS分野マッピングの導出
PPTX
レコードリンケージに基づく科研費分野-WoS分野マッピング
PPTX
科研費分野-トピック分類マトリックスへの主成分分析の適用
PDF
学術情報流通のための識別子とメタデータDBを対象とした融合研究シーズ探索 - 超高層物理学分野における観測データを例として -
PDF
機械学習を用いたWeb上の産学連携関連文書の抽出
PDF
科研費データベースの分野分類とトピック分類の比較分析
PPTX
Researcher Identifiers and National Federated Search Portal for Japanese Inst...
PDF
著者の同定・識別について- JAIRO著者名検索プロジェクトへ -
PDF
1.研究者リゾルバーとJAIRO著者名検索、2.KAKENデータベースの機能拡張
PDF
なぜ研究者の名寄せが必要か ~ 世界の動向と研究者リゾルバー ~
PDF
ORCIDのプロトタイプシステムと著者ID関連技術の動向
Application of a Novel Subject Classification Scheme for a Bibliographic Data...
Applying a new subject classification scheme for a database by a data-driven ...
Toward universal information access on the digital object cloud
Making data typing efforts or automatically detecting data types for automat...
Applying tensor decompositions to author name disambiguation of common Japane...
Emerging domain agnostic functionalities on the handle-centered networks
テンソル分解の著者名寄せへの応用と潜在変数を持つモデルとの比較
研究者識別子の重要性とORCIDアップデート
離散一般化ベータ分布を仮定した研究分野マッピングの導出
レコードリンケージに基づく科研費分野-WoS分野マッピングの導出
レコードリンケージに基づく科研費分野-WoS分野マッピング
科研費分野-トピック分類マトリックスへの主成分分析の適用
学術情報流通のための識別子とメタデータDBを対象とした融合研究シーズ探索 - 超高層物理学分野における観測データを例として -
機械学習を用いたWeb上の産学連携関連文書の抽出
科研費データベースの分野分類とトピック分類の比較分析
Researcher Identifiers and National Federated Search Portal for Japanese Inst...
著者の同定・識別について- JAIRO著者名検索プロジェクトへ -
1.研究者リゾルバーとJAIRO著者名検索、2.KAKENデータベースの機能拡張
なぜ研究者の名寄せが必要か ~ 世界の動向と研究者リゾルバー ~
ORCIDのプロトタイプシステムと著者ID関連技術の動向

Recently uploaded (20)

PPTX
20250228 LYD VKU AI Blended-Learning.pptx
PPTX
Digital-Transformation-Roadmap-for-Companies.pptx
PDF
Empathic Computing: Creating Shared Understanding
PDF
Chapter 3 Spatial Domain Image Processing.pdf
PDF
Encapsulation theory and applications.pdf
PDF
TokAI - TikTok AI Agent : The First AI Application That Analyzes 10,000+ Vira...
PDF
Mobile App Security Testing_ A Comprehensive Guide.pdf
PDF
Review of recent advances in non-invasive hemoglobin estimation
PDF
The Rise and Fall of 3GPP – Time for a Sabbatical?
PDF
Blue Purple Modern Animated Computer Science Presentation.pdf.pdf
PDF
Reach Out and Touch Someone: Haptics and Empathic Computing
PDF
Per capita expenditure prediction using model stacking based on satellite ima...
PPTX
Detection-First SIEM: Rule Types, Dashboards, and Threat-Informed Strategy
PPTX
Effective Security Operations Center (SOC) A Modern, Strategic, and Threat-In...
PDF
Approach and Philosophy of On baking technology
PDF
Network Security Unit 5.pdf for BCA BBA.
PDF
Encapsulation_ Review paper, used for researhc scholars
PDF
Spectral efficient network and resource selection model in 5G networks
PDF
MIND Revenue Release Quarter 2 2025 Press Release
PDF
cuic standard and advanced reporting.pdf
20250228 LYD VKU AI Blended-Learning.pptx
Digital-Transformation-Roadmap-for-Companies.pptx
Empathic Computing: Creating Shared Understanding
Chapter 3 Spatial Domain Image Processing.pdf
Encapsulation theory and applications.pdf
TokAI - TikTok AI Agent : The First AI Application That Analyzes 10,000+ Vira...
Mobile App Security Testing_ A Comprehensive Guide.pdf
Review of recent advances in non-invasive hemoglobin estimation
The Rise and Fall of 3GPP – Time for a Sabbatical?
Blue Purple Modern Animated Computer Science Presentation.pdf.pdf
Reach Out and Touch Someone: Haptics and Empathic Computing
Per capita expenditure prediction using model stacking based on satellite ima...
Detection-First SIEM: Rule Types, Dashboards, and Threat-Informed Strategy
Effective Security Operations Center (SOC) A Modern, Strategic, and Threat-In...
Approach and Philosophy of On baking technology
Network Security Unit 5.pdf for BCA BBA.
Encapsulation_ Review paper, used for researhc scholars
Spectral efficient network and resource selection model in 5G networks
MIND Revenue Release Quarter 2 2025 Press Release
cuic standard and advanced reporting.pdf

A SVM Applied Text Categorization of Academia-Industry Collaborative Research and Development Documents on the Web

  • 1. Analysis and Modeling of Complex Data in Behavioral and Social Sciences Joint meeting of Japanese and Italian Classification Societies Anacapri (Capri Island, Italy), 3-4 September 2012 A SVM Applied Text Categorization of Academia-Industry Collaborative Research and Development Documents on the Web Kei Kurakawa1, Yuan Sun1, Nagayoshi Yamashita2, Yasumasa Baba3 1. National Institute of Informatics 2. GMO Research (ex- Japan Society for the Promotion of Science) 3. The Institute of Statistical Mathematics
  • 2. U-I-G relations •  To make a policy of science and technology research and U development, university- industry-government (U-I-G) relations is an important aspect I G to investigate it (Leydesdorff and Meyer, 2003). •  Web document is one of the research targets to clarify the state of the relationship. •  In the clarification process, to get the exact resources of U-I-G relations is the first requirement. 2
  • 3. Objective •  Objective is to extract automatically resources of U-I relations from the web. U I G •  We set a target into “press release articles” of organizations, and make a framework to automatically crawl them and decide which is of U-I relations. 3
  • 4. Automatic extraction framework for U-I relations documents on the web Press  release   ar7cles  published   on  university  or   company  web  site 1.  Crawling  Web   Crawled   Documents Documents 2.  Extrac7ng  Text   From  the   Extracted   Documents Texts 3.  Learning  to   Learned   Classify  the   4.  Classifying  the   Model   Document Document File 4
  • 5. Support Vector Machine (1) (Vapnik, 1995) y=1 •  Two class classifier y=0 y(x) = wT (x) + b y= 1 Bias parameter Fixed feature space transformation •  N input vectors margin –  Input vector: x1 , . . . , xN –  Target values: t1 , . . . , tN where tn 2 { 1, 1} Support Vector •  For all input vectors, tn y(xn ) > 0 •  Maximize margin between hyperplane y(x) = 1 and y(x) = 1 5
  • 6. Support Vector Machine (2) •  Optimization problem 1 2 arg min kwk . w,b 2 T subject to the constraints tn (w (x) + b) 1, n = 1, . . . , N •  By means of Lagrangian method N X y(x) = an tn k(x, xn ) + b. n=1 where kernel function is defined by k(x, x0 ) = (x)T (x0 ) ,and an > 0 is Lagrange multipliers 6
  • 7. U-I relations documents on the web •  Extracted texts from the web documents are very noisy for content analysis. –  Irrelevant text, e.g. menu label text, header or footer of page, ads are still remained. •  In our observation, –  irrelevant text tends to be solely term not in a sentence, –  in terms of detecting U-I relations, the exact evidence of relevance are occurred in two or three sequential and formal sentences. •  For example, ”the MIT researchers and scientists from MicroCHIPS Inc. reported that... ”, •  target of Japanese ”東京大学とオムロン株式会社は、共同研究に より、重なりや隠れに強く....” •  It’s enough to filter text including punctuation marks which means fully formal sentence. 7
• 8. Feature selection
•  tf-idf (term frequency – inverse document frequency) is defined by
   tf-idf(t, d, D) = tf(t, d) × idf(t, D),
   where t is a term, d a document, and D the whole document collection.
•  The feature for term t in document d is defined by
   x_{t,d} = tf-idf(t, d, D) × b_{t,d},
   x_d = (x_{t_1,d}, x_{t_2,d}, ..., x_{t_M,d}),
   where b_{t,d} = 1 if t ∈ d, and b_{t,d} = 0 if t ∉ d.
•  In our experiment, a ”term” can be a word in a document, the POS (part-of-speech) type of a morpheme, or the analytical output of external tools.
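The feature definition above (tf-idf masked by the indicator b_{t,d}) can be sketched from scratch for a fixed keyword vocabulary. This is an illustration, not the authors' code; the common logarithmic form idf(t, D) = log(|D| / df(t)) is assumed, since the slide does not spell out the idf variant used:

```python
import math

def tf_idf_vector(doc_tokens, corpus, vocab):
    """Map one tokenized document d to (tf-idf(t,d,D) * b_{t,d}) over vocab."""
    N = len(corpus)
    vec = []
    for t in vocab:
        b = 1 if t in doc_tokens else 0         # b_{t,d}
        tf = doc_tokens.count(t)                # tf(t, d)
        df = sum(1 for d in corpus if t in d)   # document frequency of t in D
        idf = math.log(N / df) if df else 0.0   # idf(t, D), assumed log form
        vec.append(tf * idf * b)
    return vec

# Toy corpus of pre-tokenized documents over a tiny keyword vocabulary
corpus = [["研究", "開発"], ["研究", "共同"], ["開発"]]
vocab = ["研究", "開発", "共同"]
print(tf_idf_vector(corpus[1], corpus, vocab))
```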
• 9. Mapping a document into a feature vector
A document:
東北大学は、NECとの共同研究によりCPU内で使用される電子回路(CAM:連想メモリプロセッサ)において、世界で初めて、既存回路と同等の高速動作と、処理中に電源を切ってもデータを回路上に保持できる不揮発動作、を両立する技術を開発、実証しました。
(”Through joint research with NEC, Tohoku University developed and demonstrated, for the first time in the world, a technology for electronic circuits used in CPUs (CAM: content-addressable memory processors) that combines high-speed operation equal to existing circuits with non-volatile operation that retains data on the circuit even when the power is cut during processing.”)
Feature selection:
x = (tf-idf(産官学, d, D), tf-idf(協力, d, D), tf-idf(開始+動詞, d, D), tf-idf(受託+動詞, d, D), tf-idf(研究+動詞, d, D), tf-idf(実験+動詞, d, D), tf-idf(開始+名詞,サ変接続, d, D), tf-idf(発見+動詞, d, D), tf-idf(研究員, d, D), tf-idf(研究+名詞,サ変接続, d, D), tf-idf(開発+名詞,サ変接続, d, D), tf-idf(共同, d, D))
A feature vector:
x = (0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.1473467, 2.4748564)
• 10. Features (1)
1)  BoW – Bag of words: the full output of MeCab (a Japanese morphological analyzer). The tf-idf of each word is an element of the feature vector x_n.
2)  BoW(N) – Only nouns are chosen.
3)  BoW(N-3) – Words restricted to proper nouns, general nouns, and sahen nouns (nouns that form verbs by adding ”する” ([suru], do)).
4)  K(14) – Fourteen keywords related to U-I relations: ”研究” ([kenkyu], research), ”開発” ([kaihatsu], development), ”実験” ([jikken], experiment), ”成功” ([seikou], success), ”発見” ([hakken], discovery), ”開始” ([kaishi], start), ”受賞” ([jushou], award), ”表彰” ([hyoushou], honor), ”共同” ([kyoudou], collaboration), ”協同” ([kyoudou], cooperation), ”協力” ([kyouryoku], join forces), ”産学” ([sangaku], U-I relations), ”産官学” ([sankangaku], U-I-G (university-industry-government) relations), and ”連携” ([renkei], coordination).
5)  K(18) – K(14) plus four keywords: ”受託” ([jutaku], entrusted with), ”委託” ([itaku], consignment), ”締結” ([teiketsu], conclusion), and ”研究員” ([kenkyuin], researcher).
• 11. Features (2)
6)  K(18)+NM – The POS (part of speech) of the morpheme following each keyword in the text is checked, so that the grammatical connections of the keywords are restricted to verbs, auxiliary verbs, and sahen nouns.
7)  Corp. – Corporation marks: ”株式会社” ([kabushikigaisha], Incorporated), ”㈱” (the Unicode character U+3231), ”(株)”, or ”（株）”.
8)  Univ. – University names: ”大学” ([daigaku], university) or ”大” ([dai], a shortened form of university).
9)  C.+U. – Both a corporation mark and a university name occur in the same sentence.
10) ORG – The presence of an organization name, detected by CaboCha's Japanese named-entity extraction function.
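The surface features (7)–(9) above reduce to per-sentence pattern checks. A hedged sketch with hypothetical regexes (the patterns follow the slide's description, but the exact expressions used in the experiment are not shown; the shortened ”大” university form is omitted here to keep the pattern unambiguous):

```python
import re

# Feature (7): corporation marks, including the Unicode character U+3231 (㈱)
CORP = re.compile(r"株式会社|㈱|\(株\)|（株）")
# Feature (8): university names ("大学"; the short form "大" is omitted here)
UNIV = re.compile(r"大学")

def cu_features(sentence):
    """Return binary features (Corp., Univ., C.+U.) for one sentence."""
    c = bool(CORP.search(sentence))
    u = bool(UNIV.search(sentence))
    return c, u, c and u   # (9) C.+U.: both marks in the same sentence

print(cu_features("東京大学とオムロン株式会社は共同研究を開始した。"))  # → (True, True, True)
```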
• 12. Feature selection and SVM kernel functions
Test ID | TF-IDF feature elements | Kernel function
1-1 | (1) BoW | Linear
1-2 | (2) BoW(N) | Linear
1-3 | (3) BoW(N-3) | Linear
2-1 / 2-2 / 2-3 | (4) K(14) | Linear / Polynomial / RBF
3-1 / 3-2 / 3-3 | (5) K(18) | Linear / Polynomial / RBF
4-1 / 4-2 / 4-3 | (6) K(18)+NM | Linear / Polynomial / RBF
5-1 / 5-2 / 5-3 | (6) K(18)+NM, (10) ORG | Linear / Polynomial / RBF
6-1 / 6-2 / 6-3 | (6) K(18)+NM, (7) Corp., (8) Univ., (10) ORG | Linear / Polynomial / RBF
7-1 / 7-2 / 7-3 / 7-4 | (6) K(18)+NM, (7) Corp., (8) Univ., (9) C.+U. | Linear / Polynomial / RBF / RBF (γ tuned)
8-1 / 8-2 / 8-3 / 8-4 | (6) K(18)+NM, (7) Corp., (8) Univ., (9) C.+U., (10) ORG | Linear / Polynomial / RBF / RBF (γ tuned)
• 13. Data set for experiment
Organization | Crawled: Positive | Crawled: Negative | Experiment: Positive | Experiment: Negative
Tohoku Univ. | 44 | 499 | 44 | 44
The Univ. of Tokyo | 106 | 848 | 106 | 106
Kyoto Univ. | 40 | 329 | 40 | 40
Tokyo Inst. of Tech. | 37 | 343 | 37 | 37
Hitachi Corp. | 103 | 450 | 103 | 103
Total | 330 | 2469 | 330 | 330
• 14. Classification results (SVM light (Joachims))
Average scores over 10-fold cross validation:
Features | Test ID | Accuracy | Precision | Recall | F-measure
BoW variants | 1-1 | 61.21 | 64.04 | 42.12 | 47.28
 | 1-2 | 60.61 | 63.75 | 40.00 | 45.54
 | 1-3 | 61.52 | 67.44 | 40.00 | 46.72
K(14) | 2-1 | 67.58 | 72.02 | 61.52 | 63.70
 | 2-2 | 58.03 | 69.76 | 23.33 | 34.45
 | 2-3 | 66.51 | 62.53 | 86.37 | 71.89
K(18) | 3-1 | 68.18 | 72.02 | 63.33 | 64.78
 | 3-2 | 57.88 | 69.00 | 23.03 | 34.08
 | 3-3 | 66.67 | 62.22 | 88.18 | 72.43
K(18)+NM | 4-1 | 70.61 | 74.66 | 63.64 | 67.40
 | 4-2 | - | - | - | -
 | 4-3 | 70.76 | 65.49 | 90.30 | 75.66
K(18)+NM, ORG | 5-1 | 70.61 | 74.61 | 63.64 | 67.31
 | 5-2 | - | - | - | -
 | 5-3 | 70.76 | 65.49 | 90.30 | 75.66
K(18)+NM, Corp., Univ., ORG | 6-1 | - | - | - | -
 | 6-2 | - | - | - | -
 | 6-3 | 70.15 | 64.64 | 93.64 | 76.09
K(18)+NM, Corp., Univ., C.+U. | 7-1 | 78.79 | 85.01 | 71.52 | 76.99
 | 7-2 | - | - | - | -
 | 7-3 | 72.27 | 66.07 | 94.85 | 77.61
 | 7-4 | 80.15 | 78.81 | 83.94 | 81.05
K(18)+NM, Corp., Univ., C.+U., ORG | 8-1 | 78.94 | 85.03 | 71.82 | 77.16
 | 8-2 | - | - | - | -
 | 8-3 | 71.82 | 65.73 | 94.85 | 77.35
 | 8-4 | 79.85 | 78.51 | 83.94 | 80.86
(”-”: not calculated because of zero precision or a fault in the learning optimization)
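The averaged scores above come from 10-fold cross validation. The evaluation loop can be sketched with scikit-learn (synthetic data stands in for the 330+330 labeled article vectors; SVM-light itself, which the experiment used, is not invoked here):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_validate
from sklearn.svm import SVC

# Synthetic stand-in for the 660 labeled feature vectors (18 K(18) features)
X, y = make_classification(n_samples=660, n_features=18, random_state=0)

clf = SVC(kernel="rbf")
scores = cross_validate(clf, X, y, cv=10,
                        scoring=["accuracy", "precision", "recall", "f1"])

# Average each metric over the 10 folds, as in the results table
for name in ["accuracy", "precision", "recall", "f1"]:
    print(name, scores[f"test_{name}"].mean())
```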
• 15. Findings and discussion (1) •  In test IDs 1-1, 1-2, and 1-3, the feature vectors consist of BoW elements numbering over 15,800, 13,000, and 12,000 respectively. Their F-measures are worse than those of the other features under the same linear kernel; learning appears to have failed. •  The likely reason is that the training data set is far too small for learning. With a sufficiently large training set, the number of examples would exceed the feature vector size, i.e. the number of basis functions of the SVM, so that learning could be done without over-fitting.
• 16. Findings and discussion (2) •  In test IDs 2-1 through 8-3, the feature element size is about 14 to 33. •  Accuracy and F-measure gradually improve as the feature elements become more complex.
• 17. Findings and discussion (3) •  Test IDs 7-* and 8-* involve the co-occurrence of university and company symbols. In ID 7-3 in particular, recall and F-measure are highest, which means the occurrence of the two symbols in a single sentence is a sensitive indicator of U-I relations. •  The scores strongly depend on the kernel function type. •  The kernel parameters and the loss-function coefficient affect the balance between precision and recall. In this experiment, γ of the radial basis function kernel was chosen to maximize the F-measure under cross validation.
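The γ tuning described above, choosing the RBF width that maximizes the F-measure under cross validation, corresponds to a grid search. A sketch with scikit-learn, again on synthetic stand-in data (the candidate γ values are illustrative; the slide does not give the grid actually searched):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in for the labeled article feature vectors
X, y = make_classification(n_samples=660, n_features=18, random_state=0)

# Pick gamma of the RBF kernel by the highest cross-validated F-measure
grid = GridSearchCV(SVC(kernel="rbf"),
                    param_grid={"gamma": [1e-3, 1e-2, 1e-1, 1.0]},
                    scoring="f1", cv=10).fit(X, y)
print(grid.best_params_["gamma"], grid.best_score_)
```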
• 18. Conclusion and future work •  To automatically extract resources on U-I relations from the web, –  we targeted the “press release articles” of organizations, and –  adopted a classification technique, the support vector machine (SVM), for the decision. •  We conducted an experiment over several combinations of feature vector elements and SVM kernel function types. •  The combinations reveal that –  U-I relations keywords, and –  the co-occurrence of university and company symbols in a sentence, are effective feature elements. •  The SVM parameters were tuned for a higher F-measure, which also affects the balance between precision and recall. •  Finally, we obtained an accuracy of 80.15 and an F-measure of 81.05 for classifying U-I relations documents on the web. •  In future work, we will build the classifier into a crawler to automatically crawl the press-release web sites of organizations and obtain more resources.