Pattern Mining for Chinese Unknown Word Extraction
楊傑程 (second-year M.S. student, Computer Science, student ID 955202037), 2008/08/12
Outline
- Introduction
- Related Works
- Unknown Word Detection
- Unknown Word Extraction
- Experiments
- Conclusions
Introduction
- With the growing popularity of Chinese, Chinese text processing has become a popular research task in recent years.
- Before the knowledge in Chinese texts can be exploited, some preprocessing is required, such as Chinese word segmentation: Chinese texts contain no blanks to mark word boundaries.
Introduction
Chinese word segmentation encounters two major problems: ambiguity and unknown words.
- Ambiguity: one un-segmented Chinese character string has different segmentations under different contexts. Ex: the sentence "研究生命起源" can be segmented into "研究 生命 起源" or "研究生 命 起源".
- Unknown words: also known as out-of-vocabulary (OOV) words, mostly unfamiliar proper nouns or newly coined words. Ex: the sentence "王義氣熱衷於研究生命" would be segmented into "王 義氣 熱衷 於 研究 生命" because "王義氣" is an uncommon personal name that is not in the vocabulary.
Introduction - types of unknown words
In this paper, we focus on the Chinese unknown word problem. Types of Chinese unknown words:
- Proper names
  - Organization names, Ex: 華碩電腦
  - Abbreviations, Ex: 中油、中大
  - Personal names, Ex: 王小明
- Derived words, Ex: 總經理、電腦化
- Compounds, Ex: 電腦桌、搜尋法
- Numeric-type compounds, Ex: 1986 年、19 巷
Introduction - unknown word identification
Chinese word segmentation process:
- Initial segmentation (dictionary-assisted)
  - Correctly identified words are called known words.
  - Unknown words are wrongly segmented into two or more parts. Ex: the personal name 王小明 becomes 王 小 明 after initial segmentation.
- Unknown word identification
  - Characters belonging to one unknown word should be combined together. Ex: combine 王 小 明 into 王小明. A sketch of the dictionary-assisted step follows below.
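The slides do not spell out the matching algorithm; below is a minimal sketch assuming forward maximum matching with a toy lexicon (both are illustrative assumptions, not the paper's setup), showing how an unknown word ends up split into single characters:

```python
# Dictionary-assisted initial segmentation via forward maximum matching.
# The real system uses the Libtabe lexicon; this toy lexicon and the
# maximum word length are illustrative assumptions.
LEXICON = {"研究", "研究生", "生命", "起源"}
MAX_LEN = 5  # longest lexicon entry to try

def initial_segment(sentence: str) -> list[str]:
    tokens, i = [], 0
    while i < len(sentence):
        # Try the longest dictionary match first.
        for l in range(min(MAX_LEN, len(sentence) - i), 1, -1):
            if sentence[i:i + l] in LEXICON:
                tokens.append(sentence[i:i + l])
                i += l
                break
        else:
            # No multi-character match: emit a single character.
            # Characters of unknown words end up segmented apart like this.
            tokens.append(sentence[i])
            i += 1
    return tokens

print(initial_segment("研究生命起源"))  # ['研究生', '命', '起源'] with this toy lexicon
```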
Introduction - unknown word identification
How does unknown word identification work?
- A character can be a word by itself (馬) or part of an unknown word (馬 + 英 + 九).
- Unknown word detection rules, derived with the help of syntactic and context information, mark the morphemes likely to belong to unknown words.
- Extraction then only needs to focus on the detected morphemes and combine them.
Introduction - detection and extraction
In this paper, we apply continuity pattern mining to discover unknown word detection rules. Then we utilize syntactic information, context information, and heuristic statistical information to correctly extract unknown words.
Introduction - applied techniques
We adopt sequential data learning methods and machine learning algorithms to carry out unknown word extraction. Our extraction method is general: it does not limit extraction to specific types of unknown words through hand-crafted rules.
Related Works - particular methods
Research on Chinese word segmentation has been going on for over a decade. At first, researchers applied various kinds of information to discover particular kinds of unknown words:
- Patterns, frequency, context information
- Proper nouns ([Chen & Li, 1996], [Chen & Chen, 2000])
Related Works - general methods (rule-based)
Later, researchers started to develop methods that extract all types of unknown words.
Rule-based detection:
- Distinguish monosyllabic words from monosyllabic morphemes ([Chen et al., 1998]).
- Combine morphological rules with statistical rules to extract personal names, transliteration names, and compound nouns ([Chen et al., 2002]). <Precision: 89%, Recall: 68%>
- Utilize the context-free grammar concept and propose a bottom-up merging algorithm; adopt morphological rules and general rules to extract all kinds of unknown words ([Ma et al., 2003]). <Precision: 76%, Recall: 57%>
Related Works - general methods (statistical model-based)
Statistical model-based detection applies machine learning algorithms and sequential supervised learning.
Direct method: generate one corresponding statistical model.
- Initial segmentation and role tagging (HMM, CRF)
- Chunking (SVM)
- [Goh et al., 2006]: HMM + SVM, <Precision: 63.8%, Recall: 58.3%>
- [Tsai et al., 2006]: CRF, <Recall: 73%>
Related Works - data
Sequential supervised learning:
- Direct methods, like HMM and CRF.
- Indirect methods, like sliding windows and recurrent sliding windows: transform the sequential learning problem into a classification problem <[T. G. Dietterich, 2002]>.
Imbalanced data problem <[Seyda et al., 2007]>: select the most informative instances; randomly sample 59 instances in each iteration, then pick the instance closest to the hyper-plane (see the sketch below).
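A sketch of the selection idea from [Seyda et al., 2007] as summarized above; the synthetic data and the use of scikit-learn's LinearSVC are illustrative assumptions, not the cited paper's setup:

```python
# Per iteration: randomly sample 59 candidates, keep the one closest
# to the current SVM hyper-plane (smallest absolute decision value).
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 10))
y = (X[:, 0] + 0.3 * rng.normal(size=5000) > 0).astype(int)

model = LinearSVC(dual=False).fit(X[:200], y[:200])  # seed model on a small pool

def pick_informative(model, X_pool, sample_size=59):
    """Randomly sample candidates, return the index closest to the hyper-plane."""
    idx = rng.choice(len(X_pool), size=sample_size, replace=False)
    margins = np.abs(model.decision_function(X_pool[idx]))
    return idx[np.argmin(margins)]

chosen = pick_informative(model, X)
print("selected instance", chosen)
```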
Unknown Word Detection & Extraction
Our idea is similar to [Chen et al., 2002]:
- Unknown word detection: continuity pattern mining to derive detection rules.
- Unknown word extraction: utilize natural language information, content & context information, and statistical information to extract unknown words. Sequential supervised learning methods (indirect) and machine-learning-based models are used.
Unknown Word Detection
We call unknown word detection the "Phase 1 process" and unknown word extraction the "Phase 2 process". The following is the flow chart of unknown word detection (Phase 1).
[Flow chart of Phase 1. Training: training data (8/10 of the balanced corpus) → initial segmentation against the dictionary (Libtabe lexicon) → POS tagging (TnT) → pattern mining to derive detection rules. Testing: un-segmented testing data (1/10 of the balanced corpus) → initial segmentation → POS tagging (TnT) → unknown word detection with the detection rules → labeled Phase 2 training data.]
Unknown word detection - pattern mining
Pattern mining:
- Sequential pattern: "因為…, 所以…". Required items must match the pattern order; noise is allowed between the required items.
- Continuity pattern: "打球" => "打球" matches, "打籃球" does not. Every item and its position are strictly defined, which makes pattern mining efficient.
Unknown word detection - continuity pattern mining
Prowl <[Huang et al., 2004]>:
- Start with 1-frequent patterns.
- Extend to 2-patterns by joining two adjacent 1-frequent patterns, then evaluate their frequency, and so on.
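A minimal sketch of contiguous pattern growth in the spirit of Prowl; the real tool's data structures and optimizations are not shown, and the sequences and min_support below are toy assumptions:

```python
# Grow frequent contiguous patterns one adjacent item at a time,
# keeping only extensions that remain frequent.
from collections import Counter

def mine_contiguous(sequences, min_support=2):
    # 1-frequent patterns
    counts = Counter(item for seq in sequences for item in seq)
    frequent = {(item,): c for item, c in counts.items() if c >= min_support}
    result = dict(frequent)
    while frequent:
        candidates = Counter()
        for seq in sequences:
            for i in range(len(seq)):
                for length in {len(p) for p in frequent}:
                    window = tuple(seq[i:i + length + 1])
                    if len(window) == length + 1 and window[:-1] in frequent:
                        candidates[window] += 1
        frequent = {p: c for p, c in candidates.items() if c >= min_support}
        result.update(frequent)
    return result

seqs = [list("打籃球"), list("打球"), list("打球賽"), list("籃球")]
for pattern, count in mine_contiguous(seqs).items():
    print("".join(pattern), count)  # 打球 and 籃球 survive; 打球賽 does not
```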
Encoding
The original segmentation labels each word by lexicon matching as known (Y) or unknown (N):
- "葡萄" is in the lexicon => "葡萄" is labeled as a known word (Y).
- "葡萄皮" is not in the lexicon => "葡萄皮" is labeled as an unknown word (N).
Encoding examples:
- 葡萄 (Na) → 葡 (Na) Y + 萄 (Na) Y
- 葡萄皮 (Na) → 葡 (Na) N + 萄 (Na) N + 皮 (Na) N
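A small sketch of this encoding step, assuming a toy lexicon and (word, POS) input pairs:

```python
# Each word is checked against the lexicon; every character inherits
# the word's POS plus a known (Y) / unknown (N) label.
LEXICON = {"葡萄"}

def encode(segmented):
    """segmented: list of (word, pos) pairs -> list of (char, pos, label)."""
    encoded = []
    for word, pos in segmented:
        label = "Y" if word in LEXICON else "N"
        encoded.extend((ch, pos, label) for ch in word)
    return encoded

print(encode([("葡萄", "Na")]))    # [('葡', 'Na', 'Y'), ('萄', 'Na', 'Y')]
print(encode([("葡萄皮", "Na")]))  # [('葡', 'Na', 'N'), ('萄', 'Na', 'N'), ('皮', 'Na', 'N')]
```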
Create detection rules
( 葡 (Na), 萄 Y ) : 1
( 葡 (Na), 萄 N ) : 1
This pattern rule means: when "葡 (Na), 萄 (Na)" appears, the probability that "葡 (Na)" is part of a known word (respectively, an unknown word) is 0.5.
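A sketch of how such counts could be turned into a thresholded rule; the count storage is an illustrative assumption:

```python
# Estimate a rule's accuracy from its Y/N counts; only rules passing an
# accuracy threshold are kept (the experiments use thresholds 0.7-0.98).
from collections import Counter

pattern_counts = Counter({("葡(Na)", "萄", "Y"): 1, ("葡(Na)", "萄", "N"): 1})

def rule_accuracy(context):
    y = pattern_counts[context + ("Y",)]
    n = pattern_counts[context + ("N",)]
    total = y + n
    return (y / total, n / total) if total else (0.0, 0.0)

p_known, p_unknown = rule_accuracy(("葡(Na)", "萄"))
print(p_known, p_unknown)  # 0.5 0.5 -> too unreliable to pass, e.g., a 0.9 threshold
```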
[Flow chart of Phase 2. Training: Phase 2 training data stored as sequential data (term + term_attribute + POS) → sliding window (positive examples: find BIES; negative examples: learn and drop) → calculate term frequency per document → SVM training → three SVM models (2-gram, 3-gram, 4-gram). Testing: 1/10 of the balanced corpus → merging evaluation → solve overlap and conflict (SVM) → calculate precision/recall against the correct segmentation.]
Unknown Word Extraction
After initial segmentation and applying the detection rules, each term carries a "term_attribute" label. The six term_attributes are as follows:
- ms() → monosyllabic word, Ex: 你、我、他
- ms(?) → morpheme of an unknown word, Ex: "王"、"小"、"明" in "王小明"
- ds() → double-syllabic word, Ex: 學校
- ps() → poly-syllabic word, Ex: 筆記型電腦
- dot() → punctuation, Ex: "，"、"。"…
- none() → none of the above, or a new term
The targets of unknown word extraction are the terms whose term_attribute is "ms(?)".
Positive / Negative Judgment
A term is either a word or part of an unknown word. Based on the position of a term in the sentence, we use the following four position labels:
- B (Begin), ex: "王" of "王姿分"
- I (Intermediate), ex: "姿" of "王姿分"
- E (End), ex: "分" of "王姿分"
- S (Singular), ex: "我"、"你"
Find a B + I* (zero or more) + E combination (positive): 王 (?) B 姿 (?) I 分 (?) E combine into a new word (王姿分); a sketch of this test follows below.
For training, we randomly sample so that the numbers of positive and negative examples are equal.
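A minimal sketch of the B + I* + E test, encoded here as a regular expression over the position labels (the regex encoding is our illustration, not necessarily the paper's implementation):

```python
# A candidate is positive when its position labels form B + I* + E:
# one Begin, zero or more Intermediate, one End.
import re

def is_positive(labels):
    """labels: sequence like ['B', 'I', 'E'] -> True for B I* E."""
    return re.fullmatch(r"BI*E", "".join(labels)) is not None

print(is_positive(["B", "I", "E"]))  # True  -> 王姿分 combines into one word
print(is_positive(["B", "E"]))       # True  -> two-character unknown word
print(is_positive(["B", "I"]))       # False -> incomplete, negative example
```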
Data Processing - Sliding Window
Sequential supervised learning, indirect method: transform sequential learning into classification learning.
Sliding window: each time we take n+2 terms (n terms plus prefix & suffix) as one instance, then shift one token to the right to generate the next, and so on. Note: at least one ms(?) must exist among the n middle terms.
We offer three choices of n (2, 3, 4), i.e. three SVM models to extract unknown words of different lengths. We call these the n-gram data (models).
Ex: 3-gram model. Sentence: 運動會() ‧() 四年() 甲班() 王(?) 姿(?) 分(?) ‧() 本校() 為() 響() 應()
- 運動會 ‧ 四年 甲班 王(?) → discard (no ms(?) among the middle terms)
- ‧ 四年 甲班 王(?) 姿(?) → negative
- 四年 甲班 王(?) 姿(?) 分(?) → negative
- 甲班 王(?)B 姿(?)I 分(?)E ‧ → positive
- 王(?) 姿(?) 分(?) ‧ 本校 → negative
A sketch of this window generation follows below.
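A sketch of the window generation above; the (term, attribute) layout and the simplified positive test (all middle terms are ms(?)) are assumptions for illustration:

```python
# Each window holds prefix + n terms + suffix. Windows whose middle terms
# contain no ms(?) are discarded; a window is positive when its middle
# terms form a B + I* + E span of one unknown word.
def windows(terms, n=3):
    """terms: list of (term, attr); yields (window, label)."""
    for i in range(len(terms) - n - 1):
        window = terms[i:i + n + 2]          # prefix + n terms + suffix
        middle = window[1:-1]
        if not any(attr == "ms(?)" for _, attr in middle):
            continue                          # discard: nothing to extract
        yield window, "positive" if is_unknown_word(middle) else "negative"

def is_unknown_word(middle):
    # Simplified positive test: the whole middle span is made of ms(?)
    # morphemes, i.e. it can be labeled B + I* + E as one unknown word.
    return len(middle) >= 2 and all(attr == "ms(?)" for _, attr in middle)

sent = [("運動會", ""), ("‧", "dot"), ("四年", ""), ("甲班", ""),
        ("王", "ms(?)"), ("姿", "ms(?)"), ("分", "ms(?)"),
        ("‧", "dot"), ("本校", "")]
for w, label in windows(sent):
    print([t for t, _ in w], label)  # reproduces the labels on the slide
```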
Statistical Information
For each n-gram instance (window layout: prefix (0), t1, t2, t3, suffix (4)), we record:
- The POS tag of each term
- The term_attribute of each term (ms(), ms(?), ds()…)
- Statistical information (exemplified by the 3-gram model):
  - Frequency of the 3-gram
  - p( prefix | 3-gram ), e.g. p( prefix | t1~t3 )
  - p( suffix | 3-gram ), e.g. p( suffix | t1~t3 )
  - p( first term of n | other n-1 consecutive terms ), e.g. p( t1 | t2~t3 )
  - p( last term of n | other n-1 preceding terms ), e.g. p( t3 | t1~t2 )
  - pos_freq(prefix) / pos_freq(prefix in training positives)
  - pos_freq(suffix) / pos_freq(suffix in training positives)
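A sketch of computing the probability features as maximum-likelihood ratios over n-gram counts; the toy corpus and the counting scheme are assumptions:

```python
# Count n-grams over the Phase 2 data, then estimate each conditional
# probability as count(joint n-gram) / count(conditioning n-gram).
from collections import Counter

corpus = [["甲班", "王", "姿", "分", "‧"], ["王", "姿", "分", "‧", "本校"]]
ngram_counts = Counter()
for sent in corpus:
    for n in (1, 2, 3, 4, 5):
        for i in range(len(sent) - n + 1):
            ngram_counts[tuple(sent[i:i + n])] += 1

def p(joint_seq, given_seq):
    """MLE of P(joint_seq | given_seq) from raw counts."""
    return ngram_counts[joint_seq] / ngram_counts[given_seq] if ngram_counts[given_seq] else 0.0

prefix, t1, t2, t3, suffix = "甲班", "王", "姿", "分", "‧"
features = {
    "freq_3gram": ngram_counts[(t1, t2, t3)],
    "p_prefix_given_3gram": p((prefix, t1, t2, t3), (t1, t2, t3)),
    "p_suffix_given_3gram": p((t1, t2, t3, suffix), (t1, t2, t3)),
    "p_t1_given_t2_t3": p((t1, t2, t3), (t2, t3)),
    "p_t3_given_t1_t2": p((t1, t2, t3), (t1, t2)),
}
print(features)
```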
Experiments
- Unknown word detection
- Unknown word extraction
Unknown Word Detection
- 8/10 of the balanced corpus (575m words) is used as training data, mined with the pattern mining tool Prowl [Huang et al., 2004].
- 1/10 of the balanced corpus (not covered by the training data) is randomly picked as testing data.
- Rule accuracy is used as the threshold for keeping detection rules.

Threshold (Accuracy) | Precision | Recall | F-measure (our system) | F-measure (AS system)
0.70 | 0.9324 | 0.4305 | 0.589035 | 0.71250
0.80 | 0.9008 | 0.5289 | 0.66648 | 0.752447
0.90 | 0.8343 | 0.7148 | 0.769941 | 0.76955
0.95 | 0.7640 | 0.8288 | 0.795082 | 0.76553
0.98 | 0.6860 | 0.8786 | 0.770446 | 0.744036
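The F-measure here is presumably the balanced F1 score,

$$F_1 = \frac{2PR}{P + R},$$

which is consistent with the table: for the 0.95 row, 2(0.764)(0.8288)/(0.764 + 0.8288) ≈ 0.7951.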
Unknown Word Extraction
The rest of the Sinica corpus is used as testing data in Phase 2. [Chen et al., 2002] evaluates unknown word extraction mainly on Chinese personal names, foreign transliteration names, and compound nouns; we apply our extraction method to all types of unknown words.
Unknown Word Extraction
Judging the overlap and conflict problem among different combinations of unknown words:
- [Chen et al., 2002]: frequency(w) * length(w). Ex: "律師 班 奈 特" => compare freq(律師+班)*2 with freq(班+奈+特)*3.
- Our method:
  - First solve the overlap problem within identical n-gram data: compare P( prefix | overlap ) with P( suffix | overlap ). Ex: "義民 廟 中": P( 義民 | 廟 ) vs. P( 中 | 廟 ).
  - Then solve the conflict problem across different n-gram data by:
    - Real frequency: freq(X) - freq(Y), if X is included in Y. Ex: X = "醫學"、"學院", Y = "醫學院" (see the sketch below).
    - Freq( n-gram ) * Freq( POS_n-gram* ), n: 2~4.
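A sketch of the "real frequency" comparison for included candidates; the toy counts are illustrative assumptions:

```python
# When X is included in Y (e.g. X = 醫學 / 學院, Y = 醫學院), compare Y's
# frequency with X's real frequency freq(X) - freq(Y), i.e. occurrences
# of X outside Y.
freq = {"醫學": 30, "學院": 25, "醫學院": 20}

def real_freq(x, y):
    """Occurrences of x not accounted for by the longer candidate y."""
    return freq[x] - freq[y]

for x in ("醫學", "學院"):
    print(x, "real freq:", real_freq(x, "醫學院"), "vs 醫學院:", freq["醫學院"])
# With these counts, 醫學院 (20) beats 醫學 (10) and 學院 (5),
# so the longer candidate wins the conflict.
```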
Testing result
We also evaluate the three kinds of unknown words covered in [Chen et al., 2002]:
- 3-gram unknown words: recall = 0.73
- 2-gram unknown words: recall = 0.70
- 3-gram and 2-gram combined: recall = 0.68
[Chen et al., 2002]:
- Morphological rules only: F1 = 0.62 (precision = 0.92, recall = 0.47)
- Statistical rules only: F1 = 0.52 (precision = 0.78, recall = 0.39)
- Combination: F1 = 0.77 (precision = 0.89, recall = 0.68)
SVM testing result (general purpose):

N-gram | F1 score | Precision | Recall
Only 4-gram | 0.164 | 0.100 | 0.57
Only 3-gram | 0.377 | 0.257 | 0.70
Only 2-gram | 0.587 | 0.492 | 0.73
Three n-gram models combined | 0.524 | 0.457 | 0.614
Ongoing Experiments
Two experimental directions:
- Sampling policy <[Seyda et al., 2007]>: in SVM, the instances close to the hyper-plane are the most informative for learning.
  - Weka classification confidence: split the whole training data to obtain confidence scores, e.g.:
    inst#  actual  predicted  error  prediction
    1      2:-1    2:-1       -      0.984
    2      1:1     1:1        -      0.933
    ...
    116    2:-1    1:1        +      0.505
- Ensemble methods: Bagging, AdaBoost.
Gram | Sample By | Algorithm (inside) | Precision | Recall | F-Measure
2 | P:N = 1:2 | Libsvm | 0.637 | 0.716 | 0.674
2 | Confidence=0.95 + error + all p | Libsvm | 0.759 | 0.612 | 0.678
3 | P:N = 1:4 | Libsvm | 0.717 | 0.722 | 0.72
3 | Confidence=0.97 + all p | Libsvm | 0.829 | 0.674 | 0.743
3 | Confidence=0.97 + all p | Bagging (SMO) | 0.825 | 0.688 | 0.75
