Pattern Mining for Chinese Unknown Word Extraction. 3rd-year CS master's student, 955202037, 楊傑程, 2008/10/14
Outline Introduction Related Works Unknown Word Detection Unknown Word Extraction Experiments Conclusions
Introduction With the growing popularity of Chinese, Chinese text processing has drawn a great amount of interest in recent years. Before the knowledge in Chinese texts can be utilized, some preprocessing work must be done, such as Chinese word segmentation, because there are no blanks to mark word boundaries in Chinese texts.
Introduction Chinese word segmentation encounters two major problems: ambiguity and unknown words. Ambiguity: one unsegmented Chinese character string has different segmentations depending on the context. Unknown words: also known as out-of-vocabulary (OOV) words, mostly unfamiliar proper nouns or newly coined words. Ex: the sentence "王義氣熱衷於研究生命" would be segmented into "王 義氣 熱衷 於 研究 生命" because "王義氣" is an uncommon personal name that is not in the vocabulary.
Introduction- types of unknown words In this paper, we focus on the Chinese unknown word problem. Types of Chinese unknown words:
- Proper names: personal names (Ex: 王小明), organization names (Ex: 華碩電腦)
- Abbreviations (Ex: 中油、中大)
- Derived words (Ex: 總經理、電腦化)
- Compounds (Ex: 電腦桌、搜尋法)
- Numeric-type compounds (Ex: 1986年、19巷)
Introduction- unknown word identification Chinese word segmentation process: (1) Initial segmentation (dictionary-assisted): correctly identified words are called known words; unknown words are wrongly segmented into two or more parts. Ex: the personal name 王小明 becomes 王 小 明 after initial segmentation. (2) Unknown word identification: characters belonging to one unknown word should be re-combined. Ex: re-combine 王 小 明 into 王小明.
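The slides do not specify which dictionary-assisted segmenter performs the initial segmentation; the following forward maximum-matching sketch in Python (toy lexicon and names are ours) merely illustrates why an out-of-vocabulary name such as 王義氣 ends up split into single characters:

# Forward maximum matching: an illustrative stand-in for the paper's
# unspecified dictionary-assisted initial segmentation.
LEXICON = {"義氣", "熱衷", "於", "研究", "生命", "王"}  # toy lexicon
MAX_WORD_LEN = 4

def segment(text):
    tokens, i = [], 0
    while i < len(text):
        # Try the longest lexicon word starting at position i.
        for length in range(min(MAX_WORD_LEN, len(text) - i), 0, -1):
            word = text[i:i + length]
            if length == 1 or word in LEXICON:  # single chars pass as a fallback
                tokens.append(word)
                i += length
                break
    return tokens

print(segment("王義氣熱衷於研究生命"))
# ['王', '義氣', '熱衷', '於', '研究', '生命']: the name 王義氣 is split apart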
Introduction- unknown word identification How does unknown word identification work? A character can be a word (馬) or part of an unknown word (馬 + 英 + 九). Unknown word detection: find detection rules that distinguish monosyllabic words from monosyllabic morphemes. Unknown word extraction: focus on the detected morphemes and combine them.
Introduction- applied techniques In this paper, we apply continuity pattern mining to discover unknown word detection rules. Then, we apply machine-learning methods (classification algorithms and sequential learning) to extract unknown words, utilizing syntactic information, context information, and heuristic statistical information. Our unknown word identification method is a general method, not limited to specific types of unknown words.
Related Works- particular methods So far, research on Chinese word segmentation has lasted for a decade. First, researchers applied different kinds of information to discover particular kinds of unknown words. Proper nouns (Chinese personal names, transliteration names, organization names) <[Chen & Li, 1996], [Chen & Chen, 2000]>: patterns, frequency, context information.
Related Works- general methods (Rule-based) Then, researchers started to develop methods that extract all kinds of unknown words. Rule-based detection and extraction: <[Chen et al., 1998]> distinguish monosyllabic words from monosyllabic morphemes. <[Chen et al., 2002]> combine morphological rules with statistical rules to extract personal names, transliteration names, and compound nouns (Precision: 89%, Recall: 68%). <[Ma et al., 2003]> utilize the context-free grammar concept, propose a bottom-up merging algorithm, and adopt morphological rules and general rules to extract all kinds of unknown words (Precision: 76%, Recall: 57%).
Related Works- general methods (Machine Learning-based) Sequential learning: <[T. G. Dietterich, 2002]> transform the sequential learning problem into a classification problem. Direct methods, like HMM and CRF: <[Goh et al., 2006]> HMM+SVM (Precision: 63.8%, Recall: 58.3%); <[Tsai et al., 2006]> CRF (Recall: 73%). Indirect methods, like sliding windows and recurrent sliding windows.
Related Works – Imbalanced Data The imbalanced data problem. Ensemble methods: <C. Li, 2007> combine the learning ability of multiple base classifiers by voting. Cost-sensitive learning and sampling: <G. M. Weiss et al., 2007> focus more on minority-class examples; <C. Drummond et al., 2003> under-sampling performs better than over-sampling; <[Seyda et al., 2007]> select the most informative instances.
Unknown Word Detection & Extraction Our idea is similar to [Chen et al., 2002]. Unknown word detection: continuity pattern mining to derive detection rules. Unknown word extraction: machine-learning based, with classification algorithms and (indirect) sequential learning. We call unknown word detection "Phase 1" and unknown word extraction "Phase 2".
Unknown Word Detection & Extraction (system flow; the original slide is a flowchart)
Phase 1 (unknown word detection = detection rule mining): 8/10 corpus → initial segmentation → POS tagging → mining tool (Prowl) → detection rules; a 1/10 corpus (validation) is used to judge the rules.
Phase 2 (unknown word extraction = machine-learning classification): the 8/10 corpus + detection tags is used for training the model; the 1/10 corpus + detection tags is used for testing, yielding the classification decision; another 1/10 corpus (validation) is used to judge the decision.
Unknown Word Detection Mine detection rules from the 8/10 training corpus by continuity pattern mining, focusing on monosyllables.
Unknown word detection- Pattern Mining Pattern mining. Sequential pattern: "因為…, 所以…". Required items must appear in pattern order; noise is allowed between the required items. Continuity pattern: "打 * 球" matches "打棒球" but not "打躲避球". Every item and its position are strictly constrained, which makes pattern mining efficient.
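A small Python sketch of the two matching semantics; following the slide's example, "*" is assumed to stand for exactly one item:

def matches_sequential(pattern, seq):
    # Sequential pattern: items must appear in order, gaps allowed in between.
    it = iter(seq)
    return all(item in it for item in pattern)

def matches_continuity(pattern, seq, wildcard="*"):
    # Continuity pattern: items must be contiguous; "*" matches exactly one item.
    n = len(pattern)
    return any(all(p == wildcard or p == w
                   for p, w in zip(pattern, seq[i:i + n]))
               for i in range(len(seq) - n + 1))

print(matches_continuity(list("打*球"), list("打棒球")))    # True
print(matches_continuity(list("打*球"), list("打躲避球")))  # False: two items in between
print(matches_sequential(list("打球"), list("打躲避球")))   # True: gaps are allowed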
Unknown word detection- Continuity Pattern Mining Prowl <[Huang et al., 2004]>: starts with 1-frequent patterns; extends to length-2 patterns by joining two adjacent 1-frequent patterns, then evaluates their frequency; iteratively extends to longer patterns.
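A simplified Python sketch of that grow-and-recount loop (the actual Prowl algorithm has more machinery; names here are ours):

from collections import Counter

def mine_contiguous_patterns(sequences, min_support):
    # Level 1: frequent single items.
    counts = Counter(item for seq in sequences for item in seq)
    frequent = {(item,) for item, c in counts.items() if c >= min_support}
    result, length = set(frequent), 1
    while frequent:
        # Extend every occurrence of a frequent pattern by its right neighbor,
        # then keep only the extensions that are themselves frequent.
        extended = Counter()
        for seq in sequences:
            for i in range(len(seq) - length):
                if tuple(seq[i:i + length]) in frequent:
                    extended[tuple(seq[i:i + length + 1])] += 1
        frequent = {p for p, c in extended.items() if c >= min_support}
        result |= frequent
        length += 1
    return result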
Encoding Label the words of the original segmentation by lexicon matching: known (Y) or unknown (N). "葡萄" is in the lexicon → "葡萄" is labeled as a known word (Y). "葡萄皮" is not in the lexicon → "葡萄皮" is labeled as an unknown word (N). Encoding examples: 葡萄 (Na) → 葡 (Na) Y + 萄 (Na) Y; 葡萄皮 (Na) → 葡 (Na) N + 萄 (Na) N + 皮 (Na) N.
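A minimal Python sketch of this labeling, assuming a toy lexicon:

LEXICON = {"葡萄"}  # toy lexicon; 葡萄皮 is out-of-vocabulary here

def encode(word, pos):
    # Split a segmented word into characters, each labeled known (Y) or unknown (N).
    label = "Y" if word in LEXICON else "N"
    return [(ch, pos, label) for ch in word]

print(encode("葡萄", "Na"))    # [('葡', 'Na', 'Y'), ('萄', 'Na', 'Y')]
print(encode("葡萄皮", "Na"))  # [('葡', 'Na', 'N'), ('萄', 'Na', 'N'), ('皮', 'Na', 'N')]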
Create Detection Rules Rule pattern: (character, pos, label), with max length 3. The character inside "{ }" is the primary character of the rule. Ex: ({葡}, 萄): "葡" is a known word when "葡萄" appears. Rule accuracy, e.g. for ({葡 (Na)}, 萄 (Na)): P(葡 (Na) is a known word | (葡 (Na), 萄 (Na)) occurs). Toy counts: (葡 (Na), 萄 (Na)): 2; (葡 (Na) Y, 萄 (Na)): 1; (葡 (Na) N, 萄 (Na) N): 1; (葡 (Na) Y, 萄 (Na) Y): 1; (葡, 萄): 2; (葡 (Na), 萄): 2; (葡, 萄 (Na)): 2. So this rule's accuracy is 1/2.
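A sketch of the accuracy computation on the toy counts above (helper names are ours):

# The context (葡 (Na), 萄 (Na)) occurs twice in the toy data: once inside the
# known word 葡萄 (primary character labeled Y) and once inside 葡萄皮 (labeled N).
occurrences = [
    (("葡", "Na"), ("萄", "Na"), "Y"),
    (("葡", "Na"), ("萄", "Na"), "N"),
]

def rule_accuracy(occurrences, predicted_label="Y"):
    # Accuracy = #(contexts with the predicted label) / #(context occurrences).
    hits = sum(1 for *_, label in occurrences if label == predicted_label)
    return hits / len(occurrences)

print(rule_accuracy(occurrences))  # 0.5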
Unknown Word Extraction Machine learning: classification and sequential learning.
Unknown Word Extraction- feature (pos) We use the TnT POS tagger to assign part-of-speech (pos) tags to terms. Kinds of pos tags: nouns (Na, Nb, …), verbs (VA, VB, VC, …), adjectives (A, …), punctuation (comma, period, …), …
Unknown Word Extraction- feature (term_attribute) After initial segmentation and applying the detection rules, each term carries a "term_attribute" label. The six term_attributes are:
- ms(): monosyllabic word, Ex: 你、我、他
- ms(?): morpheme of an unknown word, Ex: "王", "小", "明" in "王小明"
- ds(): disyllabic word, Ex: 學校
- ps(): polysyllabic word, Ex: 筆記型電腦
- dot(): punctuation, Ex: "，", "。", …
- none(): none of the above information, or a new term
Target of unknown word extraction: at least one ms(?).
Example: 運動會/ps() ‧/dot() 四年/ds() 甲班/ds() 王/ms(?) 姿/ms(?) 分/ms(?) ‧/dot() 本校/ds() 為/ms() 響/ms() 應/ms()
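A minimal mapping sketch, assuming (as the list above suggests) that the attribute follows from punctuation membership, term length, and the Phase 1 detection flag; none() is omitted:

def term_attribute(term, detected_morpheme=False):
    if term in {"，", "。", "、", "‧"}:          # toy punctuation set
        return "dot()"
    if len(term) == 1:
        return "ms(?)" if detected_morpheme else "ms()"
    if len(term) == 2:
        return "ds()"
    return "ps()"

print(term_attribute("運動會"))                       # ps()
print(term_attribute("王", detected_morpheme=True))   # ms(?)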
Data Processing- Sliding Window Sequential supervised learning, indirect method: transform sequential learning into classification learning via sliding windows. We build three SVM models to extract unknown words of different lengths, n = 2, 3, 4. Each time we take n+2 terms (the n-gram plus prefix and suffix terms) as one window, then shift one token to the right to generate the next window, and so on. Window: n+2 terms (n + prefix + suffix). N-gram constraint: the n core terms must contain at least one ms(?). (Window layout for n = 3: prefix t0, 3-gram t1 t2 t3, suffix t4.) A code sketch follows the example on the next slide.
Ex: 3-gram Model Input: 運動會() ‧() 四年() 甲班() 王(?) 姿(?) 分(?) ‧() 本校() 為() 響() 應(). Windows (prefix + 3-gram + suffix) and their labels:
運動會 ‧ 四年 甲班 王(?) → discard (no ms(?) in the 3-gram core)
‧ 四年 甲班 王(?) 姿(?) → negative
四年 甲班 王(?) 姿(?) 分(?) → negative
甲班 王(?) 姿(?) 分(?) ‧ → positive (the core 王姿分 is exactly the unknown word)
王(?) 姿(?) 分(?) ‧ 本校 → negative
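A Python sketch of the window generation and the keep/discard test, reproducing the example above (function and variable names are ours):

def windows(terms, is_morpheme, n=3):
    # Slide an (n+2)-term window (prefix + n-gram + suffix) one token at a time;
    # keep a window only if its n-gram core contains at least one ms(?) term.
    kept = []
    for i in range(len(terms) - (n + 2) + 1):
        if any(is_morpheme[i + 1:i + 1 + n]):   # flags of the core, excluding affixes
            kept.append(terms[i:i + n + 2])
        # otherwise the window is discarded, like 運動會 ‧ 四年 甲班 王 above
    return kept

terms = ["運動會", "‧", "四年", "甲班", "王", "姿", "分", "‧", "本校"]
flags = [False, False, False, False, True, True, True, False, False]
for w in windows(terms, flags):
    print(w)   # prints the four kept windows from the example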
Unknown Word Extraction- feature (Statistical Information) Statistical information (exemplified by the 3-gram model; window layout as above: prefix t0, 3-gram t1 t2 t3, suffix t4):
- frequency of the 3-gram
- p(prefix | 3-gram), e.g. p(prefix | t1~t3)
- p(suffix | 3-gram), e.g. p(suffix | t1~t3)
- p(first term of n | the other n-1 consecutive terms), e.g. p(t1 | t2~t3)
- p(last term of n | the other n-1 preceding terms), e.g. p(t3 | t1~t2)
- pos_freq(prefix) / pos_freq(prefix in training positives)
- pos_freq(suffix) / pos_freq(suffix in training positives)
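A toy Python sketch of how the conditional-probability features could be estimated from training windows (the two pos_freq ratios are omitted for brevity; names are ours):

def ngram_stats(train_windows, window):
    # train_windows: 5-term tuples (prefix, t1, t2, t3, suffix) from training data.
    prefix, t1, t2, t3, suffix = window
    gram = (t1, t2, t3)
    n_gram   = sum(1 for w in train_windows if w[1:4] == gram)
    n_prefix = sum(1 for w in train_windows if w[1:4] == gram and w[0] == prefix)
    n_suffix = sum(1 for w in train_windows if w[1:4] == gram and w[4] == suffix)
    n_t2t3   = sum(1 for w in train_windows if w[2:4] == (t2, t3))
    n_t1t2   = sum(1 for w in train_windows if w[1:3] == (t1, t2))
    div = lambda a, b: a / b if b else 0.0
    return {"freq(3-gram)":     n_gram,
            "p(prefix|3-gram)": div(n_prefix, n_gram),
            "p(suffix|3-gram)": div(n_suffix, n_gram),
            "p(t1|t2,t3)":      div(n_gram, n_t2t3),
            "p(t3|t1,t2)":      div(n_gram, n_t1t2)}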
Data presentation Format for machine-learning use: for each term in the window (prefix, t1, t2, …, suffix), one term_attribute block (6 dimensions) and one pos block (55 dimensions); the 7 statistical features are appended at the end. Dimensions accumulate across the window.
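A sketch of the vector layout with one-hot blocks per term; with the 55-tag POS set, a 3-gram window (5 terms) would give 5 * (6 + 55) + 7 = 312 dimensions (the short POS list below is a placeholder):

ATTRIBUTES = ["ms()", "ms(?)", "ds()", "ps()", "dot()", "none()"]  # 6 dimensions
POS_TAGS = ["Na", "Nb", "VA", "VB", "VC", "A", "COMMA"]            # 55 tags in the paper

def one_hot(value, vocabulary):
    return [1.0 if value == v else 0.0 for v in vocabulary]

def encode_window(window_terms, statistics):
    # window_terms: [(term_attribute, pos), ...] for prefix, t1..tn, suffix in order.
    vec = []
    for attr, pos in window_terms:
        vec += one_hot(attr, ATTRIBUTES) + one_hot(pos, POS_TAGS)
    return vec + list(statistics)   # append the 7 statistical features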
Experiments Unknown word detection. Unknown word extraction.
Unknown Word Detection 8/10 of the balanced corpus (460m words) as training data, mined with the pattern mining tool Prowl [Huang et al., 2004]; 1/10 of the balanced corpus as validation data. Rule accuracy and rule frequency serve as selection thresholds for the detection rules. On the 1/10 corpus used as real test data (input to Phase 2): 60.3% precision and 93.6% recall.

Threshold (accuracy) | Precision | Recall | F-measure (our system) | F-measure (AS system)
0.70 | 0.9324 | 0.4305 | 0.589035 | 0.71250
0.80 | 0.9008 | 0.5289 | 0.66648 | 0.752447
0.90 | 0.8343 | 0.7148 | 0.769941 | 0.76955
0.95 | 0.764 | 0.8288 | 0.795082 | 0.76553
0.98 | 0.686 | 0.8786 | 0.770446 | 0.744036

Freq >= | Precision | Recall | F-measure
3 | 0.764 | 0.8288 | 0.795082
7 | 0.7113 | 0.8819 | 0.787466
11 | 0.6924 | 0.8932 | 0.780085
19 | 0.6736 | 0.8995 | 0.77033
29 | 0.6552 | 0.9092 | 0.76158
Unknown Word Extraction 8/10 balanced corpus (460m words) as training data; 1/10 balanced corpus as testing data. Imbalanced data solution: ensemble method (voting) + random under-sampling. Another 1/10 balanced corpus serves as validation data to find the sampling ratios (positive : negative): 2-gram 1:2, 3-gram 1:3, 4-gram 1:6.
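A sketch of the voting-plus-under-sampling setup; scikit-learn's LinearSVC stands in for the paper's SVM models, and the twelve base classifiers correspond to C1..C12 in the ensemble table below:

import random
from sklearn.svm import LinearSVC  # a stand-in for the paper's SVM implementation

def undersample(X_pos, X_neg, ratio):
    # Keep every positive window; draw ratio * |positives| negatives at random.
    neg = random.sample(X_neg, min(len(X_neg), ratio * len(X_pos)))
    return X_pos + neg, [1] * len(X_pos) + [0] * len(neg)

def train_ensemble(X_pos, X_neg, ratio, n_models=12):
    # One base classifier per random under-sample.
    models = []
    for _ in range(n_models):
        X, y = undersample(X_pos, X_neg, ratio)
        models.append(LinearSVC().fit(X, y))
    return models

def vote(models, x):
    # Majority voting over the base classifiers (the Censemble row below).
    return int(sum(int(m.predict([x])[0]) for m in models) > len(models) / 2)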
Unknown Word Extraction In judging the overlap and conflict problems among different candidate combinations of unknown words: <[Chen et al., 2002]> use frequency(w) * length(w). Ex: "律師 班 奈 特" → compare freq(律師+班)*3 with freq(班+奈+特)*3. Our method: first solve overlaps within identical n-grams by P(combine | overlap). Ex: "單 親 家庭": compare P(單親 | 親) with P(親家庭 | 親). Then solve conflicts between different n-grams by real frequency: freq(X) - freq(Y) if X is included in Y. Ex: X = "醫學", "學院"; Y = "醫學院".
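A sketch of the real-frequency adjustment on the slide's example (the counts are made up):

def real_freq(x, freq):
    # Discount occurrences of x that are already explained by a longer candidate y.
    return freq[x] - sum(c for y, c in freq.items() if x != y and x in y)

freq = {"醫學": 10, "學院": 8, "醫學院": 6}   # hypothetical corpus counts
print(real_freq("醫學", freq))   # 10 - 6 = 4
print(real_freq("學院", freq))   # 8 - 6 = 2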
Extraction result Comparison: <[Ma et al., 2003]> (morphological rules + statistical rules + context-free grammar rules): Precision 76%, Recall 57%. Our result:
n-gram | Precision | Recall | F1-score
2-gram | 56.7% | 67.1% | 0.614
3-gram | 63.3% | 80% | 0.707
4-gram | 30.6% | 70.3% | 0.426
Total | 58.1% | 68.2% | 0.627
Ensemble Method Improvement Per-model results (Censemble is the majority vote over base classifiers C1..C12; Caverage is their average):
Model | 2-gram P | 2-gram R | 2-gram F1 | 3-gram P | 3-gram R | 3-gram F1 | 4-gram P | 4-gram R | 4-gram F1
C1 | 0.518 | 0.640 | 0.572 | 0.542 | 0.808 | 0.649 | 0.252 | 0.419 | 0.315
C2 | 0.569 | 0.657 | 0.610 | 0.627 | 0.791 | 0.700 | 0.219 | 0.743 | 0.338
C3 | 0.535 | 0.633 | 0.580 | 0.563 | 0.810 | 0.664 | 0.222 | 0.378 | 0.280
C4 | 0.557 | 0.645 | 0.598 | 0.574 | 0.796 | 0.667 | 0.305 | 0.676 | 0.420
C5 | 0.555 | 0.660 | 0.603 | 0.549 | 0.779 | 0.644 | 0.205 | 0.554 | 0.299
C6 | 0.536 | 0.636 | 0.582 | 0.568 | 0.735 | 0.641 | 0.230 | 0.608 | 0.333
C7 | 0.557 | 0.660 | 0.604 | 0.611 | 0.691 | 0.648 | 0.211 | 0.703 | 0.325
C8 | 0.541 | 0.673 | 0.600 | 0.579 | 0.813 | 0.676 | 0.226 | 0.486 | 0.309
C9 | 0.548 | 0.657 | 0.598 | 0.587 | 0.715 | 0.645 | 0.215 | 0.635 | 0.321
C10 | 0.543 | 0.661 | 0.596 | 0.599 | 0.723 | 0.655 | 0.232 | 0.662 | 0.344
C11 | 0.533 | 0.668 | 0.593 | 0.607 | 0.740 | 0.667 | 0.240 | 0.554 | 0.335
C12 | 0.538 | 0.645 | 0.587 | 0.587 | 0.776 | 0.669 | 0.299 | 0.662 | 0.412
Caverage | 0.544 | 0.653 | 0.594 | 0.583 | 0.765 | 0.660 | 0.238 | 0.590 | 0.336
Censemble | 0.567 | 0.671 | 0.614 | 0.633 | 0.800 | 0.707 | 0.306 | 0.703 | 0.426
Experiment- One phase What if there were no unknown word detection phase? Two phases do work better:
Performance | Precision | Recall | F-score
One Phase | 40.8% | 71.4% | 0.52
Two Phases | 58.1% | 68.2% | 0.627
Conclusions We adopt a two-phase method to solve the unknown word problem. Unknown word detection: continuity pattern mining to derive detection rules. Unknown word extraction: machine-learning based, with classification algorithms and (indirect) sequential learning, plus an imbalanced-data solution. Our experiments show that two phases work better than one. Future work: apply machine learning to detection; utilize more information (patterns, rules) to improve extraction precision.
