Finding a Common Language:
Bringing Complex and Disparate
     Vocabularies Together


             Paula R. McCoy
     Manager, Taxonomy Development
                ProQuest
       paula.mccoy@proquest.com
Part of Cambridge Information Group & CSA

Headquartered in Ann Arbor, Michigan
Editorial offices in Louisville, Kentucky
Access to over 125 billion digital pages of content from
  magazine, trade, & scholarly publications, current &
historical newspapers, original materials such as annual
   reports & civil war pamphlets, and daily wire feeds

  Subscription-based ProQuest® online information
   service available in academic and public libraries
Louisville editors abstract & index 4,000+
periodicals & newspapers

ProQuest Controlled Vocabulary used to index
subjects; Authority Files used to index
company, geographic, personal, product names

CV applied to non-periodical & third-party
content via mapping, to allow cross-searching
of multiple DBs with one vocabulary
Topics of Discussion
Description of ProQuest Controlled
Vocabulary & Authority Files

Taxonomy Management -- Overview

Life Before Synaptica

Thesaurus Management System Purchase
Implementing Synaptica

Life With Synaptica

Q&A
PQ CV




        ProQuest Controlled Vocabulary

         Natural language, hierarchical vocabulary complying
         with ANSI/NISO Standard Z39.19 (Guidelines for
         the Construction, Format, and Management of
         Monolingual Controlled Vocabularies)

         Created in 1970s for ABI/INFORM business database

         Based on Library of Congress Subject Headings
        Merged with general reference vocabulary in 1980s
        Major development effort in past 4 years to boost
        science, education & medical terms
        Thesaurus subjects:
          Business, economics & trade – 4300 terms
          Science, math & technology – 1600 terms
          Medicine – 1150 terms
          Humanities – 960 terms
          Government & policy – 850 terms
          Education – 400 terms

        ProQuest CV: Statistics

          Preferred terms: 11,046
          Non-preferred terms: 5631
          Scope Notes: 3194 (29%)
          Cross-references (Broader,
          Narrower, Related terms): 67,700
          Terms added in 2007: 77
          Terms added in 2008: 58+

        Authority Files: Statistics

         Corporate/Organization Names: 438,098
         Names added in 2008: 5489

         Personal Names: 416,239
         Names added in 2008: 1526

         Geographic (Location) Names: 34,331
         Names added in 2008: 144

         Product Names: 38,210
         Names added in 2008: 54
Taxonomy Management




                The Taxonomy Manager’s Job

                      Add subject terms as dictated by new
                      concepts & new content to index

                      Maintain hierarchies & Scope Notes

                      Load updated Thesaurus to ProQuest interface

                      Manage authority files to maintain standards
                      & control file size

                                OBJECTIVE:

         To ensure that indexers and searchers alike have access to a
         complete and accurate Thesaurus that they can use to
         maximize the discoverability of documents in ProQuest

                      Thesaurus on ProQuest®

                        Sample Subject Term

          Chronic obstructive pulmonary disease
          SN: Any lung disease, such as chronic bronchitis or
          emphysema, causing obstruction of bronchial airflow
           UF COPD
           BT Disease
           BT Respiratory diseases
           NT Asthma
           NT Bronchitis
           NT Emphysema
           RT Airway management
           RT Lungs

          Callouts:
          Main entry: Preferred, or main term
          SN: Scope note defining term and how it is used
          UF: Non-preferred term: points to term used to index
          BT: Terms broader in nature to main term: COPD is a disease,
              and specifically, a respiratory disease
          NT: Terms narrower in nature to main term: these are chronic
              lung diseases
          RT: Terms related to main term that might be used to narrow
              the search
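
The sample record can be pictured as a small data structure. This is only a sketch of the idea, not how ProQuest or Synaptica stores terms: the class name SubjectTerm, its field names, and the resolve helper are all made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class SubjectTerm:
    """One thesaurus record, using the field labels from the sample slide."""
    preferred: str                                       # preferred (main) term
    scope_note: str = ""                                 # SN
    used_for: list[str] = field(default_factory=list)    # UF: non-preferred terms
    broader: list[str] = field(default_factory=list)     # BT
    narrower: list[str] = field(default_factory=list)    # NT
    related: list[str] = field(default_factory=list)     # RT


copd = SubjectTerm(
    preferred="Chronic obstructive pulmonary disease",
    scope_note=("Any lung disease, such as chronic bronchitis or emphysema, "
                "causing obstruction of bronchial airflow"),
    used_for=["COPD"],
    broader=["Disease", "Respiratory diseases"],
    narrower=["Asthma", "Bronchitis", "Emphysema"],
    related=["Airway management", "Lungs"],
)


def resolve(entry: str, terms: list[SubjectTerm]) -> str:
    """Return the preferred term for an entry, following UF pointers."""
    for term in terms:
        if entry == term.preferred or entry in term.used_for:
            return term.preferred
    raise KeyError(entry)
```

A searcher or indexer who types the non-preferred term is pointed to the term used to index: resolve("COPD", [copd]) returns the full preferred term.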
Before Synaptica


         Managing terms meant:

Multiple files → Duplicate entries → Errors

 = less than ideal thesaurus management

                   MS Word Document

                   Vocabulary Documents in Word

                      ProQuest controlled vocabulary
                      French-language controlled vocabulary
                      German-language controlled vocabulary
                      Spanish-language controlled vocabulary
                      Combined PQ-CBCA controlled vocabulary
                      Ethnic database vocabulary, English
                      Ethnic database vocabulary, Spanish

                   Foreign-Language Vocabularies




               French         German       Spanish

                   Oracle Database Forms

                   Authority Files in Oracle

                    Class codes (related to subjects)
                    CORP names (391,665+ terms)
                    NAIC codes (related to companies)
                    GEOG names (32,000+ terms)
                    PERS names (350,000+ terms)
                    PROD names (38,000+ terms)

                          Adding New Terms

                   1. Enter full term hierarchy into new Word doc
                   2. Copy term into main Word-based vocabulary &
                      enter reciprocal relationships
                   3. Enter term & relationships into Oracle
                   4. Review next-day report on Oracle activity
                   5. Send new term doc to editors via e-mail
                   6. Print new vocabulary (at least every two years)
[Graphic: SN, UF, BT, NT, RT, Class Code]
       [whew!]
TMS Purchase




               Thesaurus Management Systems
                       Buying Criteria

            1. Ability to interact in real time with editorial system
            2. Ability to accommodate authority files of 400,000+
               names

                         Synaptica

          Up to 40 admin & 100 read-only users within multiple locations
          Ability to load vocabs from multiple Word docs & Oracle
          authority files
          Support for foreign-language vocabularies
          Ability to add new vocabularies
          Vendor onsite installation & training
          Software upgrades & tech support
Implementing Synaptica




                         Implementing Synaptica

                   Contract signed and work begun in August 2004

                   PQ sent to Synaptica all the Word & Oracle files for
                   analysis

                   Decision points: how to load & structure data;
                   how to handle “suspect” or erroneous
                   relationships

                         Synaptica Data Analysis
                          Relationship Validation Tests:

                            Term Uniqueness
                            Use Violations
                            Self-Referencing Relationships
                            One Relationship per Term Pair
                            Relationship Unique
                            Relationship Reciprocates
                            Circular References

     Exception Reports delivered to PQ; Errors fixed before production
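
Several of the validation tests above can be sketched in a few lines. The slides name the tests but do not describe how Synaptica runs them, so everything here is an assumption for illustration: the (source, type, target) tuple layout, the RECIPROCAL table, and all function names. Term Uniqueness and Use Violations are left out because the slides do not define them precisely.

```python
from collections import defaultdict

# Each relationship is a (source term, type, target term) tuple.
# Assumed reciprocal pairs; an unknown type would raise KeyError below.
RECIPROCAL = {"BT": "NT", "NT": "BT", "RT": "RT", "UF": "USE", "USE": "UF"}


def self_references(rels):
    """Relationships that point a term at itself."""
    return [r for r in rels if r[0] == r[2]]


def missing_reciprocals(rels):
    """Relationships whose reverse entry is absent (e.g. BT without NT)."""
    present = set(rels)
    return [(s, t, o) for (s, t, o) in rels
            if (o, RECIPROCAL[t], s) not in present]


def multiple_relationships(rels):
    """Term pairs linked by more than one relationship type."""
    kinds = defaultdict(set)
    for s, t, o in rels:
        kinds[(s, o)].add(t)
    return {pair: sorted(ts) for pair, ts in kinds.items() if len(ts) > 1}


def circular_references(rels):
    """Terms that can reach themselves by following BT links upward."""
    parents = defaultdict(set)
    for s, t, o in rels:
        if t == "BT":
            parents[s].add(o)
    cyclic = set()
    for start in list(parents):
        stack, seen = list(parents[start]), set()
        while stack:
            node = stack.pop()
            if node == start:
                cyclic.add(start)
                break
            if node not in seen:
                seen.add(node)
                stack.extend(parents.get(node, ()))
    return sorted(cyclic)
```

Each function returns the offending entries, which is roughly what an exception report would list for an editor to fix before production.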

                         Use Validation Error

                            Marine resources

                         Foreign-Language Errors

           Terms with no language equivalent (LEQ), i.e., no translation

           In all 3 languages, multiple English terms with the same
            translation, e.g.:

         English term          French term      French term-revised
          Purchasing            Achats
          Shopping              Achats           Shopping
          Buyers                Acheteurs
          Purchasing agents     Acheteurs        Agents d'achat
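
The second error type, several English terms sharing one translation, is easy to detect mechanically. The sketch below uses the French examples from the table; the function name shared_translations and the dictionary layout are assumptions for illustration, not part of any real tool.

```python
from collections import defaultdict


def shared_translations(equivalents):
    """Map each translation to the English terms that share it (2 or more)."""
    by_translation = defaultdict(list)
    for english, translation in equivalents.items():
        by_translation[translation].append(english)
    return {t: terms for t, terms in by_translation.items() if len(terms) > 1}


# English term -> French term, before and after revision (from the table).
original = {"Purchasing": "Achats", "Shopping": "Achats",
            "Buyers": "Acheteurs", "Purchasing agents": "Acheteurs"}
revised = {**original, "Shopping": "Shopping",
           "Purchasing agents": "Agents d'achat"}

print(shared_translations(original))  # the two collisions from the table
print(shared_translations(revised))   # empty once the revisions are applied
```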

                            Final Challenge

              Issue:     Different editorial systems = 2x data
                         entry: once for Synaptica, once for Oracle

              Solution: Overnight synchronization process to copy
                        Synaptica work into Oracle every night

                         Synch process discontinued April 2008

             Putting Synaptica Into Production
                                   Nov 2004

                Train users — provide documentation & hands-on
                demonstrative training

                Deal with people resistant to change

                Encourage written feedback on system functionality
                Send feedback to Synaptica – many of our suggestions
                implemented in later versions
Life With Synaptica




                        Life With Synaptica
                      Terms Management Made Easy!




              Word – Old, Bad ☹   Synaptica – New, Good ☺

              Adding Terms Today: 3 Easy Steps

                      1. Enter term and relationships into Synaptica
                         “Item Details” window

                      2. Export report of new terms into Word

                      3. Send Word document to editors

            Improving Thesaurus Management
                      Categories Feature

                      Subject Term Categories

         CORP Names – Categories & Website

                  Foreign-Language Vocabularies



                                            Language
                                           Equivalents

                  Foreign-Language Vocabularies
                  [Screenshots: Spanish, German, and French vocabularies,
                  each listed alphabetically by language]

                               Synaptica Updates

                        Synaptica version 6.0 released in early 2006

                        Synaptica version 7.0 is being implemented now:

                      • Enhanced user interface
                      • Semantic Web standardization (RDF, OWL, SKOS) and
                         Web Services integration
                      • Expanded Reporting functionality
                      • Enhanced adding and editing of term relationships
                         including “rapid-fire” simple drag-and-drop editing
                      • Improved global term editing
                      • Online help and user guides
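
For readers unfamiliar with SKOS, the sample subject term shown earlier maps onto it fairly directly: the preferred term becomes skos:prefLabel, UF becomes skos:altLabel, SN becomes skos:scopeNote, and BT becomes skos:broader. The sketch below writes one concept as Turtle. It is not Synaptica's export format, which the slides do not show; the function name to_skos_turtle and the example.org URIs are placeholders.

```python
def to_skos_turtle(uri, pref, alt=(), scope_note="", broader=()):
    """Render one concept as SKOS in Turtle (no escaping of quotes in labels)."""
    lines = [f"<{uri}> a skos:Concept ;",
             f'    skos:prefLabel "{pref}"@en ;']
    lines += [f'    skos:altLabel "{label}"@en ;' for label in alt]
    if scope_note:
        lines.append(f'    skos:scopeNote "{scope_note}"@en ;')
    lines += [f"    skos:broader <{b}> ;" for b in broader]
    lines[-1] = lines[-1][:-1] + "."   # replace the final ';' to close the statement
    prefix = "@prefix skos: <http://www.w3.org/2004/02/skos/core#> ."
    return "\n".join([prefix, ""] + lines)


print(to_skos_turtle(
    "http://example.org/term/copd",            # placeholder URI
    "Chronic obstructive pulmonary disease",
    alt=["COPD"],
    broader=["http://example.org/term/respiratory-diseases"],
))
```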

                             Benefits of Synaptica
                      Greater awareness of thesaurus standards and
                      terminology, e.g.: “preferred” and “non-preferred”
                      instead of Use and Used For
                      Long-needed updating and improvement in term
                      hierarchies; ability to provide thesaurus statistics
                      Increase in Company name NPTs — from 1935 to
                       8952 today
                      Immediate responsiveness to indexer needs —
                       real-time term additions, esp. NPTs and SNs
                      Easier loading of updated Thesaurus on PQ interface
Questions?

Thank you!

ProQuest Taxonomy Boot Camp Presentation 2008
