Pattern Mining for Chinese Unknown Word Extraction. 3rd-year CS master's student, 955202037, 楊傑程, 2008/10/14
Outline Introduction Related Works Unknown Word Detection Unknown Word Extraction Experiments Conclusions
Introduction With the growing popularity of Chinese, Chinese text processing has drawn a great amount of interest in recent years. Before the knowledge in Chinese texts can be utilized, some preprocessing work must be done, such as Chinese word segmentation, because there are no blanks to mark word boundaries in Chinese texts.
Introduction Chinese word segmentation encounters two major problems: ambiguity and unknown words. Ambiguity: one unsegmented Chinese character string has different segmentations depending on the context. Unknown words: also known as out-of-vocabulary (OOV) words, mostly unfamiliar proper nouns or newly coined words. Ex: the sentence "王義氣熱衷於研究生命" would be segmented into "王 義氣 熱衷 於 研究 生命" because "王義氣" is an uncommon personal name that is not in the vocabulary.
Introduction- types of unknown words In this paper, we focus on the Chinese unknown word problem. Types of Chinese unknown words:
- Proper names: personal names (Ex: 王小明), organization names (Ex: 華碩電腦)
- Abbreviations (Ex: 中油、中大)
- Derived words (Ex: 總經理、電腦化)
- Compounds (Ex: 電腦桌、搜尋法)
- Numeric-type compounds (Ex: 1986年、19巷)
Introduction- unknown word identification Chinese word segmentation process: (1) Initial segmentation (dictionary-assisted): correctly identified words are called known words; unknown words are wrongly segmented into two or more parts. Ex: the personal name 王小明 becomes 王 小 明 after initial segmentation. (2) Unknown word identification: characters belonging to one unknown word should be re-combined. Ex: re-combine 王 小 明 into 王小明.
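The slides do not specify which dictionary-assisted segmenter performs the initial segmentation; the following forward maximum-matching sketch in Python (toy lexicon and names are ours) merely illustrates why an out-of-vocabulary name such as 王義氣 ends up split into single characters:

# Forward maximum matching: an illustrative stand-in for the paper's
# unspecified dictionary-assisted initial segmentation.
LEXICON = {"義氣", "熱衷", "於", "研究", "生命", "王"}  # toy lexicon
MAX_WORD_LEN = 4

def segment(text):
    tokens, i = [], 0
    while i < len(text):
        # Try the longest lexicon word starting at position i.
        for length in range(min(MAX_WORD_LEN, len(text) - i), 0, -1):
            word = text[i:i + length]
            if length == 1 or word in LEXICON:  # single chars pass as a fallback
                tokens.append(word)
                i += length
                break
    return tokens

print(segment("王義氣熱衷於研究生命"))
# ['王', '義氣', '熱衷', '於', '研究', '生命']: the name 王義氣 is split apart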
Introduction- unknown word identification How does unknown word identification work? A character can be a word (馬) or part of an unknown word (馬 + 英 + 九). Unknown word detection: find detection rules that distinguish monosyllabic words from monosyllabic morphemes. Unknown word extraction: focus on the detected morphemes and combine them.
Introduction- applied techniques In this paper, we apply continuity pattern mining to discover unknown word detection rules. Then, we apply machine-learning methods (classification algorithms and sequential learning) to extract unknown words, utilizing syntactic information, context information, and heuristic statistical information. Our unknown word identification method is a general method, not limited to specific types of unknown words.
Related Works- particular methods So far, research on Chinese word segmentation has lasted for a decade. First, researchers applied different kinds of information to discover particular kinds of unknown words. Proper nouns (Chinese personal names, transliteration names, organization names) <[Chen & Li, 1996], [Chen & Chen, 2000]>: patterns, frequency, context information.
Related Works- general methods (Rule-based) Then, researchers started to develop methods that extract all kinds of unknown words. Rule-based detection and extraction: <[Chen et al., 1998]> distinguish monosyllabic words from monosyllabic morphemes. <[Chen et al., 2002]> combine morphological rules with statistical rules to extract personal names, transliteration names, and compound nouns (Precision: 89%, Recall: 68%). <[Ma et al., 2003]> utilize the context-free grammar concept, propose a bottom-up merging algorithm, and adopt morphological rules and general rules to extract all kinds of unknown words (Precision: 76%, Recall: 57%).
Related Works- general methods (Machine Learning-based) Sequential learning: <[T. G. Dietterich, 2002]> transform the sequential learning problem into a classification problem. Direct methods, like HMM and CRF: <[Goh et al., 2006]> HMM+SVM (Precision: 63.8%, Recall: 58.3%); <[Tsai et al., 2006]> CRF (Recall: 73%). Indirect methods, like sliding windows and recurrent sliding windows.
Related Works – Imbalanced Data The imbalanced data problem. Ensemble methods: <C. Li, 2007> combine the learning ability of multiple base classifiers by voting. Cost-sensitive learning and sampling: <G. M. Weiss et al., 2007> focus more on minority-class examples; <C. Drummond et al., 2003> under-sampling performs better than over-sampling; <[Seyda et al., 2007]> select the most informative instances.
Unknown Word Detection & Extraction Our idea is similar to [Chen et al., 2002]. Unknown word detection: continuity pattern mining to derive detection rules. Unknown word extraction: machine-learning based, with classification algorithms and (indirect) sequential learning. We call unknown word detection "Phase 1" and unknown word extraction "Phase 2".
Unknown Word Detection & Extraction (system flow; the original slide is a flowchart)
Phase 1 (unknown word detection = detection rule mining): 8/10 corpus → initial segmentation → POS tagging → mining tool (Prowl) → detection rules; a 1/10 corpus (validation) is used to judge the rules.
Phase 2 (unknown word extraction = machine-learning classification): the 8/10 corpus + detection tags is used for training the model; the 1/10 corpus + detection tags is used for testing, yielding the classification decision; another 1/10 corpus (validation) is used to judge the decision.
Unknown Word Detection Mine detection rules from the 8/10 training corpus by continuity pattern mining, focusing on monosyllables.
Unknown word detection- Pattern Mining Pattern mining. Sequential pattern: "因為…, 所以…". Required items must appear in pattern order; noise is allowed between the required items. Continuity pattern: "打 * 球" matches "打棒球" but not "打躲避球". Every item and its position are strictly constrained, which makes pattern mining efficient.
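A small Python sketch of the two matching semantics; following the slide's example, "*" is assumed to stand for exactly one item:

def matches_sequential(pattern, seq):
    # Sequential pattern: items must appear in order, gaps allowed in between.
    it = iter(seq)
    return all(item in it for item in pattern)

def matches_continuity(pattern, seq, wildcard="*"):
    # Continuity pattern: items must be contiguous; "*" matches exactly one item.
    n = len(pattern)
    return any(all(p == wildcard or p == w
                   for p, w in zip(pattern, seq[i:i + n]))
               for i in range(len(seq) - n + 1))

print(matches_continuity(list("打*球"), list("打棒球")))    # True
print(matches_continuity(list("打*球"), list("打躲避球")))  # False: two items in between
print(matches_sequential(list("打球"), list("打躲避球")))   # True: gaps are allowed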
Unknown word detection- Continuity Pattern Mining Prowl <[Huang et al., 2004]>: starts with 1-frequent patterns; extends to length-2 patterns by joining two adjacent 1-frequent patterns, then evaluates their frequency; iteratively extends to longer patterns.
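A simplified Python sketch of that grow-and-recount loop (the actual Prowl algorithm has more machinery; names here are ours):

from collections import Counter

def mine_contiguous_patterns(sequences, min_support):
    # Level 1: frequent single items.
    counts = Counter(item for seq in sequences for item in seq)
    frequent = {(item,) for item, c in counts.items() if c >= min_support}
    result, length = set(frequent), 1
    while frequent:
        # Extend every occurrence of a frequent pattern by its right neighbor,
        # then keep only the extensions that are themselves frequent.
        extended = Counter()
        for seq in sequences:
            for i in range(len(seq) - length):
                if tuple(seq[i:i + length]) in frequent:
                    extended[tuple(seq[i:i + length + 1])] += 1
        frequent = {p for p, c in extended.items() if c >= min_support}
        result |= frequent
        length += 1
    return result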
Encoding Label the words of the original segmentation by lexicon matching: known (Y) or unknown (N). "葡萄" is in the lexicon → "葡萄" is labeled as a known word (Y). "葡萄皮" is not in the lexicon → "葡萄皮" is labeled as an unknown word (N). Encoding examples: 葡萄 (Na) → 葡 (Na) Y + 萄 (Na) Y; 葡萄皮 (Na) → 葡 (Na) N + 萄 (Na) N + 皮 (Na) N.
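A minimal Python sketch of this labeling, assuming a toy lexicon:

LEXICON = {"葡萄"}  # toy lexicon; 葡萄皮 is out-of-vocabulary here

def encode(word, pos):
    # Split a segmented word into characters, each labeled known (Y) or unknown (N).
    label = "Y" if word in LEXICON else "N"
    return [(ch, pos, label) for ch in word]

print(encode("葡萄", "Na"))    # [('葡', 'Na', 'Y'), ('萄', 'Na', 'Y')]
print(encode("葡萄皮", "Na"))  # [('葡', 'Na', 'N'), ('萄', 'Na', 'N'), ('皮', 'Na', 'N')]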
Create Detection Rules Rule pattern: (character, pos, label), with max length 3. The character inside "{ }" is the primary character of the rule. Ex: ({葡}, 萄): "葡" is a known word when "葡萄" appears. Rule accuracy, e.g. for ({葡 (Na)}, 萄 (Na)): P(葡 (Na) is a known word | (葡 (Na), 萄 (Na)) occurs). Toy counts: (葡 (Na), 萄 (Na)): 2; (葡 (Na) Y, 萄 (Na)): 1; (葡 (Na) N, 萄 (Na) N): 1; (葡 (Na) Y, 萄 (Na) Y): 1; (葡, 萄): 2; (葡 (Na), 萄): 2; (葡, 萄 (Na)): 2. So this rule's accuracy is 1/2.
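A sketch of the accuracy computation on the toy counts above (helper names are ours):

# The context (葡 (Na), 萄 (Na)) occurs twice in the toy data: once inside the
# known word 葡萄 (primary character labeled Y) and once inside 葡萄皮 (labeled N).
occurrences = [
    (("葡", "Na"), ("萄", "Na"), "Y"),
    (("葡", "Na"), ("萄", "Na"), "N"),
]

def rule_accuracy(occurrences, predicted_label="Y"):
    # Accuracy = #(contexts with the predicted label) / #(context occurrences).
    hits = sum(1 for *_, label in occurrences if label == predicted_label)
    return hits / len(occurrences)

print(rule_accuracy(occurrences))  # 0.5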
Unknown Word Extraction Machine learning: classification and sequential learning.
Unknown Word Extraction- feature (pos) We use the TnT POS tagger to assign part-of-speech (pos) tags to terms. Kinds of pos tags: nouns (Na, Nb, …), verbs (VA, VB, VC, …), adjectives (A, …), punctuation (comma, period, …), …
Unknown Word Extraction- feature (term_attribute) After initial segmentation and applying the detection rules, each term carries a "term_attribute" label. The six term_attributes are:
- ms(): monosyllabic word, Ex: 你、我、他
- ms(?): morpheme of an unknown word, Ex: "王", "小", "明" in "王小明"
- ds(): disyllabic word, Ex: 學校
- ps(): polysyllabic word, Ex: 筆記型電腦
- dot(): punctuation, Ex: "，", "。", …
- none(): none of the above information, or a new term
Target of unknown word extraction: at least one ms(?).
Example: 運動會/ps() ‧/dot() 四年/ds() 甲班/ds() 王/ms(?) 姿/ms(?) 分/ms(?) ‧/dot() 本校/ds() 為/ms() 響/ms() 應/ms()
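A minimal mapping sketch, assuming (as the list above suggests) that the attribute follows from punctuation membership, term length, and the Phase 1 detection flag; none() is omitted:

def term_attribute(term, detected_morpheme=False):
    if term in {"，", "。", "、", "‧"}:          # toy punctuation set
        return "dot()"
    if len(term) == 1:
        return "ms(?)" if detected_morpheme else "ms()"
    if len(term) == 2:
        return "ds()"
    return "ps()"

print(term_attribute("運動會"))                       # ps()
print(term_attribute("王", detected_morpheme=True))   # ms(?)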
Data Processing- Sliding Window Sequential supervised learning, indirect method: transform sequential learning into classification learning via sliding windows. We build three SVM models to extract unknown words of different lengths, n = 2, 3, 4. Each time we take n+2 terms (the n-gram plus prefix and suffix terms) as one window, then shift one token to the right to generate the next window, and so on. Window: n+2 terms (n + prefix + suffix). N-gram constraint: the n core terms must contain at least one ms(?). (Window layout for n = 3: prefix t0, 3-gram t1 t2 t3, suffix t4.) A code sketch follows the example on the next slide.
Ex: 3-gram Model Input: 運動會() ‧() 四年() 甲班() 王(?) 姿(?) 分(?) ‧() 本校() 為() 響() 應(). Windows (prefix + 3-gram + suffix) and their labels:
運動會 ‧ 四年 甲班 王(?) → discard (no ms(?) in the 3-gram core)
‧ 四年 甲班 王(?) 姿(?) → negative
四年 甲班 王(?) 姿(?) 分(?) → negative
甲班 王(?) 姿(?) 分(?) ‧ → positive (the core 王姿分 is exactly the unknown word)
王(?) 姿(?) 分(?) ‧ 本校 → negative
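A Python sketch of the window generation and the keep/discard test, reproducing the example above (function and variable names are ours):

def windows(terms, is_morpheme, n=3):
    # Slide an (n+2)-term window (prefix + n-gram + suffix) one token at a time;
    # keep a window only if its n-gram core contains at least one ms(?) term.
    kept = []
    for i in range(len(terms) - (n + 2) + 1):
        if any(is_morpheme[i + 1:i + 1 + n]):   # flags of the core, excluding affixes
            kept.append(terms[i:i + n + 2])
        # otherwise the window is discarded, like 運動會 ‧ 四年 甲班 王 above
    return kept

terms = ["運動會", "‧", "四年", "甲班", "王", "姿", "分", "‧", "本校"]
flags = [False, False, False, False, True, True, True, False, False]
for w in windows(terms, flags):
    print(w)   # prints the four kept windows from the example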
Unknown Word Extraction- feature (Statistical Information) Statistical information (exemplified by the 3-gram model; window layout as above: prefix t0, 3-gram t1 t2 t3, suffix t4):
- frequency of the 3-gram
- p(prefix | 3-gram), e.g. p(prefix | t1~t3)
- p(suffix | 3-gram), e.g. p(suffix | t1~t3)
- p(first term of n | the other n-1 consecutive terms), e.g. p(t1 | t2~t3)
- p(last term of n | the other n-1 preceding terms), e.g. p(t3 | t1~t2)
- pos_freq(prefix) / pos_freq(prefix in training positives)
- pos_freq(suffix) / pos_freq(suffix in training positives)
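A toy Python sketch of how the conditional-probability features could be estimated from training windows (the two pos_freq ratios are omitted for brevity; names are ours):

def ngram_stats(train_windows, window):
    # train_windows: 5-term tuples (prefix, t1, t2, t3, suffix) from training data.
    prefix, t1, t2, t3, suffix = window
    gram = (t1, t2, t3)
    n_gram   = sum(1 for w in train_windows if w[1:4] == gram)
    n_prefix = sum(1 for w in train_windows if w[1:4] == gram and w[0] == prefix)
    n_suffix = sum(1 for w in train_windows if w[1:4] == gram and w[4] == suffix)
    n_t2t3   = sum(1 for w in train_windows if w[2:4] == (t2, t3))
    n_t1t2   = sum(1 for w in train_windows if w[1:3] == (t1, t2))
    div = lambda a, b: a / b if b else 0.0
    return {"freq(3-gram)":     n_gram,
            "p(prefix|3-gram)": div(n_prefix, n_gram),
            "p(suffix|3-gram)": div(n_suffix, n_gram),
            "p(t1|t2,t3)":      div(n_gram, n_t2t3),
            "p(t3|t1,t2)":      div(n_gram, n_t1t2)}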
Data presentation Format for machine-learning use: for each term in the window (prefix, t1, t2, …, suffix), one term_attribute block (6 dimensions) and one pos block (55 dimensions); the 7 statistical features are appended at the end. Dimensions accumulate across the window.
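A sketch of the vector layout with one-hot blocks per term; with the 55-tag POS set, a 3-gram window (5 terms) would give 5 * (6 + 55) + 7 = 312 dimensions (the short POS list below is a placeholder):

ATTRIBUTES = ["ms()", "ms(?)", "ds()", "ps()", "dot()", "none()"]  # 6 dimensions
POS_TAGS = ["Na", "Nb", "VA", "VB", "VC", "A", "COMMA"]            # 55 tags in the paper

def one_hot(value, vocabulary):
    return [1.0 if value == v else 0.0 for v in vocabulary]

def encode_window(window_terms, statistics):
    # window_terms: [(term_attribute, pos), ...] for prefix, t1..tn, suffix in order.
    vec = []
    for attr, pos in window_terms:
        vec += one_hot(attr, ATTRIBUTES) + one_hot(pos, POS_TAGS)
    return vec + list(statistics)   # append the 7 statistical features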
Experiments Unknown word detection. Unknown word extraction.
Unknown Word Detection 8/10 of the balanced corpus (460m words) as training data, mined with the pattern mining tool Prowl [Huang et al., 2004]; 1/10 of the balanced corpus as validation data. Rule accuracy and rule frequency serve as selection thresholds for the detection rules. On the 1/10 corpus used as real test data (input to Phase 2): 60.3% precision and 93.6% recall.

Threshold (accuracy) | Precision | Recall | F-measure (our system) | F-measure (AS system)
0.70 | 0.9324 | 0.4305 | 0.589035 | 0.71250
0.80 | 0.9008 | 0.5289 | 0.66648 | 0.752447
0.90 | 0.8343 | 0.7148 | 0.769941 | 0.76955
0.95 | 0.764 | 0.8288 | 0.795082 | 0.76553
0.98 | 0.686 | 0.8786 | 0.770446 | 0.744036

Freq >= | Precision | Recall | F-measure
3 | 0.764 | 0.8288 | 0.795082
7 | 0.7113 | 0.8819 | 0.787466
11 | 0.6924 | 0.8932 | 0.780085
19 | 0.6736 | 0.8995 | 0.77033
29 | 0.6552 | 0.9092 | 0.76158
Unknown Word Extraction 8/10 balanced corpus (460m words) as training data; 1/10 balanced corpus as testing data. Imbalanced data solution: ensemble method (voting) + random under-sampling. Another 1/10 balanced corpus serves as validation data to find the sampling ratios (positive : negative): 2-gram 1:2, 3-gram 1:3, 4-gram 1:6.
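A sketch of the voting-plus-under-sampling setup; scikit-learn's LinearSVC stands in for the paper's SVM models, and the twelve base classifiers correspond to C1..C12 in the ensemble table below:

import random
from sklearn.svm import LinearSVC  # a stand-in for the paper's SVM implementation

def undersample(X_pos, X_neg, ratio):
    # Keep every positive window; draw ratio * |positives| negatives at random.
    neg = random.sample(X_neg, min(len(X_neg), ratio * len(X_pos)))
    return X_pos + neg, [1] * len(X_pos) + [0] * len(neg)

def train_ensemble(X_pos, X_neg, ratio, n_models=12):
    # One base classifier per random under-sample.
    models = []
    for _ in range(n_models):
        X, y = undersample(X_pos, X_neg, ratio)
        models.append(LinearSVC().fit(X, y))
    return models

def vote(models, x):
    # Majority voting over the base classifiers (the Censemble row below).
    return int(sum(int(m.predict([x])[0]) for m in models) > len(models) / 2)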
Unknown Word Extraction In judging the overlap and conflict problems among different candidate combinations of unknown words: <[Chen et al., 2002]> use frequency(w) * length(w). Ex: "律師 班 奈 特" → compare freq(律師+班)*3 with freq(班+奈+特)*3. Our method: first solve overlaps within identical n-grams by P(combine | overlap). Ex: "單 親 家庭": compare P(單親 | 親) with P(親家庭 | 親). Then solve conflicts between different n-grams by real frequency: freq(X) - freq(Y) if X is included in Y. Ex: X = "醫學", "學院"; Y = "醫學院".
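A sketch of the real-frequency adjustment on the slide's example (the counts are made up):

def real_freq(x, freq):
    # Discount occurrences of x that are already explained by a longer candidate y.
    return freq[x] - sum(c for y, c in freq.items() if x != y and x in y)

freq = {"醫學": 10, "學院": 8, "醫學院": 6}   # hypothetical corpus counts
print(real_freq("醫學", freq))   # 10 - 6 = 4
print(real_freq("學院", freq))   # 8 - 6 = 2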
Extraction result Comparison: <[Ma et al., 2003]> (morphological rules + statistical rules + context-free grammar rules): Precision 76%, Recall 57%. Our result:
n-gram | Precision | Recall | F1-score
2-gram | 56.7% | 67.1% | 0.614
3-gram | 63.3% | 80% | 0.707
4-gram | 30.6% | 70.3% | 0.426
Total | 58.1% | 68.2% | 0.627
Ensemble Method Improvement Per-model results (Censemble is the majority vote over base classifiers C1..C12; Caverage is their average):
Model | 2-gram P | 2-gram R | 2-gram F1 | 3-gram P | 3-gram R | 3-gram F1 | 4-gram P | 4-gram R | 4-gram F1
C1 | 0.518 | 0.640 | 0.572 | 0.542 | 0.808 | 0.649 | 0.252 | 0.419 | 0.315
C2 | 0.569 | 0.657 | 0.610 | 0.627 | 0.791 | 0.700 | 0.219 | 0.743 | 0.338
C3 | 0.535 | 0.633 | 0.580 | 0.563 | 0.810 | 0.664 | 0.222 | 0.378 | 0.280
C4 | 0.557 | 0.645 | 0.598 | 0.574 | 0.796 | 0.667 | 0.305 | 0.676 | 0.420
C5 | 0.555 | 0.660 | 0.603 | 0.549 | 0.779 | 0.644 | 0.205 | 0.554 | 0.299
C6 | 0.536 | 0.636 | 0.582 | 0.568 | 0.735 | 0.641 | 0.230 | 0.608 | 0.333
C7 | 0.557 | 0.660 | 0.604 | 0.611 | 0.691 | 0.648 | 0.211 | 0.703 | 0.325
C8 | 0.541 | 0.673 | 0.600 | 0.579 | 0.813 | 0.676 | 0.226 | 0.486 | 0.309
C9 | 0.548 | 0.657 | 0.598 | 0.587 | 0.715 | 0.645 | 0.215 | 0.635 | 0.321
C10 | 0.543 | 0.661 | 0.596 | 0.599 | 0.723 | 0.655 | 0.232 | 0.662 | 0.344
C11 | 0.533 | 0.668 | 0.593 | 0.607 | 0.740 | 0.667 | 0.240 | 0.554 | 0.335
C12 | 0.538 | 0.645 | 0.587 | 0.587 | 0.776 | 0.669 | 0.299 | 0.662 | 0.412
Caverage | 0.544 | 0.653 | 0.594 | 0.583 | 0.765 | 0.660 | 0.238 | 0.590 | 0.336
Censemble | 0.567 | 0.671 | 0.614 | 0.633 | 0.800 | 0.707 | 0.306 | 0.703 | 0.426
Experiment- One phase What if there were no unknown word detection phase? Two phases do work better:
Performance | Precision | Recall | F-score
One Phase | 40.8% | 71.4% | 0.52
Two Phases | 58.1% | 68.2% | 0.627
Conclusions We adopt a two-phase method to solve the unknown word problem. Unknown word detection: continuity pattern mining to derive detection rules. Unknown word extraction: machine-learning based, with classification algorithms and (indirect) sequential learning, plus an imbalanced-data solution. Our experiments show that two phases work better than one. Future work: apply machine learning to detection; utilize more information (patterns, rules) to improve extraction precision.
