Documentation of changes on document parsing of BlueHOUND
Version 1.2 Date 2016-06-02
Mentor: Long Yin
Interns: Wan Xulang, Zhang Yuxi
Author: Wan Xulang
1. Files
To improve the performance of structure detection & clause extraction in BlueHOUND, we identified a
list of issues and tested solutions for 4 of them. Changes were made in two files. One is the
configuration file (.txt), which contains the parameter settings of the structure detection & clause extraction
module. The other (.java) is the structure detection & clause extraction module itself, which contains the detailed
rules of structure detection & clause extraction. There are 6 files in the document folder.
ConfigFile-Original.txt
- the original configuration file
ConfigFile-Updated.txt
- the updated configuration file
DetectStructure-Original.java
- the original source code of the structure detection & clause extraction module
DetectStructure-Updated.java
- the updated source code of the structure detection & clause extraction module
StructureDetection&ClauseExtractionModule.pptx
- Flow charts describing the procedures in the structure detection & clause extraction module. This
file can be used as a reference to quickly understand DetectStructure-Original.java
DocumentationofchangesondocumentparsingofBlueHOUND.docx
- Documents the changes we made in the configuration file (.txt) and the structure detection & clause
extraction module (.java)
2. Problems and solutions
We tested the performance of the document parsing function by reviewing 31 documents. We found that not
all documents could be parsed 100% correctly. A list of issues has been identified, and we proposed 4
changes to fix some of these issues. Note that for problems 2 and 3, we need to modify both the
configuration file and the source code.
# | Problems | Changes | Modifications of Configuration File | Modifications of Source Code
1 | Clauses/Subclauses started with ‘section #.#’ cannot be detected | Add regular expression in config file for section #.# | 3.1. TAB Punctuation; 3.3. Section #.# | -
2 | Clauses/Subclauses started with ‘Article one/two/three’ cannot be detected | Add regular expression in config file for Article One/Two/Three | 3.2. Article one/two/three… | 4.1. Recognize LetterNumeric
3 | Title of Clauses/Subclauses ended with colon cannot be detected | Turn off the filtering rule which excludes clauses whose title ends with a colon | 3.4. ColonTitles | 4.2. ColonTitles Filter
4 | Missing Clauses/Subclauses if there is a gap in numbering | Set continuous key | - | 4.3. Numbering Gap
3. Modifications of Config File
3.1. TAB Punctuation
Motivation: In document parsing, titles which follow a tab character are set as priority 3. But these
are actually important titles, and we don’t want them to be priority 3.
Solution: Add TAB to the important punctuation in the config file.
3.2. Article one/two/three…
Motivation: Titles with a “letter number”, like article one/two/three, can’t be detected by the tool.
Solution: Add a new regular expression in the config file to extract such titles. Besides, to make them
recognizable, a mapping function is also needed to map a “letter number” like one/two/three to a “key
value” like 1, 2, 3. This mapping function is added in the source code. Please refer to 4.1 for more
information.
Regular Expression:
sectionRegexp \b((?:[Aa][Rr][Tt][Ii][Cc][Ll][Ee])\s+(([TtWwEeNnHhIiRrFfOoGg]{4,5}[Yy]){0,1}-{0,1}[OoNnTtWwRrEeFfSsVvXxGgHhLlIiUuYy]{3,9}))\b
Explanation: This regular expression is made up of two parts. The first part is the prefix “article”, while
the second part is a letter set covering all possible combinations of letter numbers from one to fifty-nine.
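To illustrate how such an expression behaves, here is a minimal Java sketch. It assumes the backslash escapes (\b, \s) as written above; the class name ArticleTitleDemo and the helper letterNumber are hypothetical and do not come from DetectStructure-Updated.java.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; not part of the BlueHOUND source.
final class ArticleTitleDemo {
    // Assumed form of the sectionRegexp value, written as a Java string literal.
    static final Pattern ARTICLE = Pattern.compile(
        "\\b((?:[Aa][Rr][Tt][Ii][Cc][Ll][Ee])\\s+"
        + "(([TtWwEeNnHhIiRrFfOoGg]{4,5}[Yy]){0,1}-{0,1}"
        + "[OoNnTtWwRrEeFfSsVvXxGgHhLlIiUuYy]{3,9}))\\b");

    /** Returns the letter-number part of the first "Article ..." title, or null if none is found. */
    static String letterNumber(String line) {
        Matcher m = ARTICLE.matcher(line);
        // Group 2 is the part after "article": optional tens word, optional hyphen, units word.
        return m.find() ? m.group(2) : null;
    }

    public static void main(String[] args) {
        System.out.println(letterNumber("ARTICLE ONE Definitions"));   // ONE
        System.out.println(letterNumber("Article Twenty-One. Term"));  // Twenty-One
        System.out.println(letterNumber("Section 3.1 Fees"));          // null
    }
}
```

Because the second part is a character set rather than a word list, it also accepts words such as "review" that are built from the same letters. This is one reason a separate check such as the “isLetterNumeral” function described in 4.1 is useful.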
3.3. Section #.# (# refers to numeral symbols)
Motivation: When extracting multilevel titles, the system gives low priority to titles which follow a
word like ‘section’ or ‘article’.
Solution: Add new regular expressions which contain these words as prefixes to avoid this situation.
Figure 1 - TAB Punctuation
Regular Expression:
multilevelRegexp (([Ss][Ee][Cc][Tt][Ii][Oo][Nn])\s(\d{1,3}|[a-z]{1,2}|[A-Z]{1,2}|[MDCLXVI]{1,5}|[mdclxvi]{1,5})\.)((\d{1,3}|[a-z]{1,2}|[A-Z]{1,2}|[MDCLXVI]{1,5}|[mdclxvi]{1,5})(\.))*((\d{1,3}|[a-z]{1,2}|[A-Z]{1,2}|[MDCLXVI]{1,5}|[mdclxvi]{1,5})(\.?))\s+
multilevelRegexp (([Ss][Ee][Cc][Tt][Ii][Oo][Nn])\s(\d{1,3}|[a-z]{1,2}|[A-Z]{1,2}|[MDCLXVI]{1,5}|[mdclxvi]{1,5})-)((\d{1,3}|[a-z]{1,2}|[A-Z]{1,2}|[MDCLXVI]{1,5}|[mdclxvi]{1,5})(-))*(\d{1,3}|[a-z]{1,2}|[A-Z]{1,2}|[MDCLXVI]{1,5}|[mdclxvi]{1,5})\s+
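The following Java sketch shows what the two multilevel expressions accept. It is an illustration only: it assumes the escapes \s, \d and \. in the expressions, uses non-capturing groups for brevity, and the names SectionTitleDemo, NUM and titlePrefix are hypothetical rather than taken from the module.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; not part of the BlueHOUND source.
final class SectionTitleDemo {
    // One numbering level: digits, one or two letters, or a short Roman numeral.
    static final String NUM =
        "(?:\\d{1,3}|[a-z]{1,2}|[A-Z]{1,2}|[MDCLXVI]{1,5}|[mdclxvi]{1,5})";
    static final String SECTION = "[Ss][Ee][Cc][Tt][Ii][Oo][Nn]";

    // Dot-separated levels, e.g. "Section 3.1 " or "Section 2.3.4. ".
    static final Pattern DOTTED = Pattern.compile(
        SECTION + "\\s" + NUM + "\\.(?:" + NUM + "\\.)*" + NUM + "\\.?\\s+");
    // Hyphen-separated levels, e.g. "Section 4-2 ".
    static final Pattern DASHED = Pattern.compile(
        SECTION + "\\s" + NUM + "-(?:" + NUM + "-)*" + NUM + "\\s+");

    /** Returns the matched "Section #.#" prefix at the start of the line, or null. */
    static String titlePrefix(String line) {
        for (Pattern p : new Pattern[] {DOTTED, DASHED}) {
            Matcher m = p.matcher(line);
            if (m.lookingAt()) {
                return m.group().trim();
            }
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(titlePrefix("Section 3.1 Fees"));        // Section 3.1
        System.out.println(titlePrefix("SECTION 2.3.4. Payment"));  // SECTION 2.3.4.
        System.out.println(titlePrefix("Section 4-2 Scope"));       // Section 4-2
        System.out.println(titlePrefix("Section 3 Fees"));          // null
    }
}
```

Note that both expressions require at least two numbering levels, so a single-level title such as "Section 3" is not matched by them.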
3.4. ColonTitles
Motivation: When extracting items, the system excludes titles followed by colons when the
“LongTitleAndBlankDescription” filter is on.
Solution: The “LongTitleAndBlankDescription” filter contains a set of rules: exclude clauses/subclauses
with a long title; exclude clauses/subclauses with a blank description; and exclude clauses/subclauses whose title
ends with a colon or semicolon. If we simply turned this filter off, it would have bad effects on other documents.
So we separated the colon rule from the “LongTitleAndBlankDescription” filter and added a new filter named
“ColonTitles” in the config file. We also adjusted the source code to separate the colon/semicolon rule
from the rule set of the “LongTitleAndBlankDescription” filter. Please refer to 4.2 for more information.
4. Modifications of Source Code
4.1. Recognize LetterNumeric
Motivation: As we’ve added a new regular expression in the config file in 3.2, the system can now extract
titles with letter numerals. But the system can’t give them a numeric key, as there is no function in
the source code to do so.
Solution: Build a function to give keys to such titles. We simply use a “decode” function to achieve this.
Figure 3 shows two lists which are pre-built for the decoding function.
Figure 2 - ColonTitles Filter
Figure 3 - Preparation for Decoding
Figure 4 shows the code of the decoding function. This function splits the numeral into
parts at “-” and converts each part into a numeric value by looking it up in the two lists we set up before.
Finally, it sums the numeric values of the parts to give the key.
Before this function, another function named “isLetterNumeral” is used to identify whether a title is a letter
number or not.
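The decoding steps above can be sketched in Java as follows. This is not the code from Figure 4: the contents of the two lists are assumed (the real ones appear only in Figure 3), and the class name LetterNumeralDemo and helper partValue are hypothetical. Only the names decode and isLetterNumeral are taken from the text.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical demo class; a sketch of the idea, not the BlueHOUND implementation.
final class LetterNumeralDemo {
    // Assumed list contents; position in the list determines the value.
    static final List<String> UNITS = Arrays.asList(
        "one", "two", "three", "four", "five", "six", "seven", "eight", "nine", "ten",
        "eleven", "twelve", "thirteen", "fourteen", "fifteen", "sixteen",
        "seventeen", "eighteen", "nineteen");
    static final List<String> TENS = Arrays.asList("twenty", "thirty", "forty", "fifty");

    /** Value of one hyphen-separated part, or -1 if it is in neither list. */
    static int partValue(String part) {
        String p = part.toLowerCase(Locale.ROOT);
        int i = UNITS.indexOf(p);
        if (i >= 0) {
            return i + 1;            // "one" -> 1 ... "nineteen" -> 19
        }
        i = TENS.indexOf(p);
        return i >= 0 ? 20 + 10 * i  // "twenty" -> 20 ... "fifty" -> 50
                      : -1;
    }

    /** True if every part of the title number is a known number word. */
    static boolean isLetterNumeral(String s) {
        if (s == null || s.isEmpty()) {
            return false;
        }
        for (String part : s.split("-", -1)) {
            if (partValue(part) < 0) {
                return false;
            }
        }
        return true;
    }

    /** Splits on "-", converts each part, and sums the values to give the key. */
    static int decode(String s) {
        int key = 0;
        for (String part : s.split("-", -1)) {
            int v = partValue(part);
            if (v < 0) {
                throw new IllegalArgumentException("not a letter numeral: " + s);
            }
            key += v;
        }
        return key;
    }

    public static void main(String[] args) {
        System.out.println(decode("Twenty-One"));        // 21
        System.out.println(decode("SEVEN"));             // 7
        System.out.println(isLetterNumeral("review"));   // false
    }
}
```

Calling isLetterNumeral before decode, as the text describes, keeps words that merely match the regular expression from receiving a key.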
4.2. ColonTitles Filter
Motivation: To separate the function that excludes colon titles from the “LongTitleAndBlankDescription” filter,
we make a new filter named “ColonTitles”. Thus, the “LongTitleAndBlankDescription” filter will only exclude
clauses which have long titles or a blank description.
Solution: Add a new function in the filtering module to act as a new filter.
Figure 4 - Convert Function
Figure 5 shows the code that identifies a title followed by a colon and gives it a mark. When this filter is
set to “true”, these marked titles will be excluded from the final output. At the same time, the corresponding
rules in the long title filter have been removed.
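A minimal Java sketch of such a separate filter is given below. It is not the code from Figure 5: the Title class, the colonMarked field and the apply method are hypothetical stand-ins, and the flag colonTitlesFilter is assumed to carry the value of the “ColonTitles” setting from the config file.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical demo class; a sketch of the idea, not the BlueHOUND implementation.
final class ColonTitlesDemo {
    /** Hypothetical stand-in for an extracted clause title. */
    static final class Title {
        final String text;
        boolean colonMarked;   // set when the title ends with ':' or ';'

        Title(String text) {
            this.text = text;
        }
    }

    static boolean endsWithColon(String title) {
        String t = title.trim();
        return t.endsWith(":") || t.endsWith(";");
    }

    /** Marks colon titles; drops them only when the ColonTitles filter is on. */
    static List<Title> apply(List<Title> titles, boolean colonTitlesFilter) {
        List<Title> kept = new ArrayList<>();
        for (Title t : titles) {
            t.colonMarked = endsWithColon(t.text);
            if (colonTitlesFilter && t.colonMarked) {
                continue;   // excluded from the final output
            }
            kept.add(t);
        }
        return kept;
    }

    public static void main(String[] args) {
        List<Title> titles = Arrays.asList(
            new Title("1. Definitions:"),
            new Title("2. Payment"),
            new Title("3. Term;"));
        System.out.println(apply(titles, true).size());   // 1
        System.out.println(apply(titles, false).size());  // 3
    }
}
```

Keeping the colon rule behind its own flag means it can be switched off for documents with colon-terminated titles without disabling the long-title and blank-description rules.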
4.3. Numbering Gap
Motivation: In the item extraction part, the system does a fast filtering pass to check the continuity of the extracted
titles. However, documents with missing titles are badly affected by this. Also, the pruning function
is based on continuous keys of the extracted titles, so we can’t just turn off that fast filtering function to avoid
this.
Solution: Build a workaround function to bridge this numbering gap. Note that this change may have a negative
impact on parsing other documents, and thus it needs more validation before implementation.
Figure 5 - ColonTitles Filter
Figure 6 shows the code of the Givekeys function. This function checks the priority and prefix of
neighboring titles. If they are the same, the system changes the key of the second title to one plus the
key of the first title. Note that the lists (e.g. LastKn[], LastPrfx[]..) in this function are reset when a
new regular expression is used.
This function currently can’t distinguish English letters and Roman numerals reliably. When both are used in the
same article, it gets confused. In the attached code, this function is therefore used only for digit titles. For
letter numerals, English letters and Roman numerals, it is not used for now. (However, if only one of them
is used in a given article, this function is still a good solution.) You can simply search for “givekeys” to see
where to enable this function for these kinds of titles (the related statements are commented out in the code).
Figure 6 - Givekeys Function
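The re-keying rule described above can be sketched in Java as follows. This is a simplified illustration, not the code from Figure 6: the Item class and its fields are hypothetical, and the bookkeeping lists mentioned in the text (LastKn[], LastPrfx[]) and their reset per regular expression are left out.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical demo class; a sketch of the idea, not the BlueHOUND implementation.
final class GiveKeysDemo {
    /** Hypothetical stand-in for an extracted title with a numeric key. */
    static final class Item {
        final int priority;
        final String prefix;
        int key;

        Item(int priority, String prefix, int key) {
            this.priority = priority;
            this.prefix = prefix;
            this.key = key;
        }
    }

    /**
     * For each pair of neighboring titles with the same priority and prefix,
     * sets the key of the second title to the key of the first plus one.
     */
    static void giveKeys(List<Item> items) {
        for (int i = 1; i < items.size(); i++) {
            Item prev = items.get(i - 1);
            Item cur = items.get(i);
            if (prev.priority == cur.priority && prev.prefix.equals(cur.prefix)) {
                cur.key = prev.key + 1;
            }
        }
    }

    public static void main(String[] args) {
        // "Section 3" is missing in the source document: keys 1, 2, 4, 5.
        List<Item> items = Arrays.asList(
            new Item(1, "Section", 1),
            new Item(1, "Section", 2),
            new Item(1, "Section", 4),
            new Item(1, "Section", 5));
        giveKeys(items);
        for (Item it : items) {
            System.out.print(it.key + " ");   // 1 2 3 4
        }
        System.out.println();
    }
}
```

After this step the keys are continuous, so a continuity check no longer drops the titles after the gap. The trade-off is that the keys no longer match the numbers printed in the document, which is one reason the change needs further validation.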
5. Outcome and Defects
Figure 7 shows the performance of document parsing after applying changes 1-4. It shows that changes 1, 2 and 3
improve the overall performance of document parsing.
Change 4 has a negative impact on several documents, especially “Google Play Terms of Service”. More
validation is needed before we implement this change in production. After improving change 4, some of the
negative impacts have been resolved.
Figure 7 - Performance on Test Documents
