A Methodology and Tool Suite for
Evaluating the Accuracy of
Interoperating Statistical Natural
Language Processing Engines
Uma Murthy
Virginia Tech

John Pitrelli, Ganesh Ramaswamy,
Martin Franz, and Burn Lewis
IBM T.J. Watson Research Center


Interspeech
22-26 September 2008
Brisbane, Australia
Outline
•    Motivation
•    Context
•    Issues
•    Evaluation methodology
•    Example evaluation modules
•    Future directions


Motivation
•  Combining Natural Language Processing
   (NLP) engines for information processing in
   complex tasks
•  Methods exist for evaluating the output accuracy of individual NLP engines
   –  sliding window, BLEU score, word-error rate, etc.
•  No work on evaluation methods for large
   combinations, or aggregates, of NLP engines
   –  Foreign-language videos → transcription → translation → story segmentation → topic clustering


Project Goal

To develop a methodology and tool suite for evaluating the accuracy (of output) of interoperating statistical natural language processing engines

in the context of IOD


Interoperability Demonstration System (IOD)

[Diagram: the IOD aggregate of interoperating NLP engines]

Built upon UIMA
Issues
1.  How is the accuracy of one engine, or a set of engines, evaluated when it operates as part of an aggregate?
2.  What is the measure of accuracy of an
    aggregate and how can it be computed?
3.  How can the mechanics of this evaluation
    methodology be validated and tested?




“Evaluation Space”
•  Core of the evaluation methodology
•  Defines the options for comparison: at every stage in the pipeline, the output may be human-generated (H, a ground-truth option) or machine-generated (M), and the resulting output sequences are compared against one another (a sketch follows)



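As a minimal sketch, the evaluation space can be enumerated as every assignment of human (H) or machine (M) output to the pipeline's stages; the stage names and code below are illustrative, not the paper's notation:

```python
from itertools import product

# Illustrative stage names for the IOD-style pipeline:
# speech-to-text, machine translation, story boundary
# detection, topic clustering.
STAGES = ["STT", "MT", "SBD", "TC"]

def evaluation_space(stages):
    """Enumerate every assignment of human (H) or machine (M)
    output to each pipeline stage."""
    return [dict(zip(stages, labels))
            for labels in product("HM", repeat=len(stages))]

for point in evaluation_space(STAGES):
    print("-".join(point[s] for s in STAGES))
# H-H-H-H is the all-ground-truth sequence; M-M-M-M is the
# fully automatic pipeline.
```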
[Diagram: the evaluation space of human (H) and machine (M) output sequences across the pipeline stages]
1.  Comparison between M-M-M… and H-H-H… evaluates the accuracy of the entire aggregate

2.  Emerging pattern

3.  Comparison of adjacent evaluations determines how much one engine (TC, topic clustering) degrades the accuracy of the aggregate (a sketch follows this slide)

4.  Do not consider H-M sequences

5.  Comparing two engines of the same function

6.  Assembling ground truths is the most expensive task
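A hedged sketch of observation 3, assuming each evaluation point yields a single accuracy score; differencing adjacent points attributes the accuracy drop to the engine whose output changed from human to machine (the numbers are hypothetical):

```python
def degradation_per_engine(scores):
    """scores: accuracies of successive evaluation points, each a
    comparison against the all-human ground truth, with one more
    stage switched from human to machine output at each step.
    Adjacent differences attribute the accuracy drop to the engine
    whose output was switched."""
    return [earlier - later for earlier, later in zip(scores, scores[1:])]

# Hypothetical scores for M-H-H, M-M-H, and M-M-M against H-H-H:
scores = [0.91, 0.85, 0.78]
print(degradation_per_engine(scores))  # [0.06, 0.07] (up to float rounding)
```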
Evaluation Modules
•  Use the evaluation space as a template to automatically evaluate the performance of an aggregate (a structural sketch follows this slide)
•  Development
    –  Explore methods that are used to evaluate the last engine in the aggregate
    –  If required, modify these methods, considering
       •  Preceding engines and their input and output
       •  Different ground truth formats
•  Testing
    –  Focus on validating the mechanics of evaluation, not the engines in question
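A structural sketch of such a module, under the assumption that each module wraps the metric of the aggregate's last engine and adapts it to the available ground-truth format; the class and method names are illustrative, not taken from the tool suite:

```python
from abc import ABC, abstractmethod

class EvaluationModule(ABC):
    """Template for evaluating an aggregate: compare the pipeline's
    final output against a ground truth using the metric normally
    applied to the last engine alone."""

    def __init__(self, ground_truth, **params):
        self.ground_truth = ground_truth
        self.params = params  # e.g., window length, BLEU n-gram order

    @abstractmethod
    def align(self, system_output):
        """Adapt formats (e.g., map word positions to times) so the
        last-engine metric can be applied to the aggregate's output."""

    @abstractmethod
    def score(self, aligned_output):
        """Apply the last engine's metric (sliding window, BLEU, ...)."""

    def evaluate(self, system_output):
        return self.score(self.align(system_output))
```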
Example Evaluation Modules
•  STT → SBD (speech-to-text → story-boundary detection)
   – Sliding-window scheme
   – Automatically generated comparable ROC curves
     •  Validated module with six 30-minute Arabic news shows
•  STT → MT (speech-to-text → machine translation)
   – BLEU metric
   – Automatically generated BLEU scores
     •  Validated module with two Arabic-English MT engines on 38 minutes of audio
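For reference, a self-contained sketch of the BLEU metric used by the STT → MT module; this is textbook corpus-level BLEU with a brevity penalty, not necessarily the module's exact implementation:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(references, hypotheses, max_n=4):
    """Corpus-level BLEU: geometric mean of modified n-gram
    precisions times a brevity penalty (one reference per segment)."""
    hyp_len = ref_len = 0
    matches = [0] * max_n
    totals = [0] * max_n
    for ref, hyp in zip(references, hypotheses):
        hyp_len += len(hyp)
        ref_len += len(ref)
        for n in range(1, max_n + 1):
            h, r = ngrams(hyp, n), ngrams(ref, n)
            matches[n - 1] += sum(min(c, r[g]) for g, c in h.items())
            totals[n - 1] += sum(h.values())
    if 0 in matches or 0 in totals:
        return 0.0  # real toolkits smooth instead of returning zero
    log_prec = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    bp = 1.0 if hyp_len >= ref_len else math.exp(1 - ref_len / hyp_len)
    return bp * math.exp(log_prec)

refs = [["the", "story", "begins", "here"]]
hyps = [["the", "story", "starts", "here"]]
print(round(bleu(refs, hyps, max_n=2), 3))  # 0.5
```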
Future Directions
•  Develop more evaluation modules and
   validate them
    –  Test with actual ground truths
    –  Test with more data sets
    –  Test on different engines (of the same
       kind)
•  Methodology
    –  Identify points of error
    –  How much does an engine impact the
       performance of the aggregate?


Summary
•  Presented a methodology for automatic
   evaluation of accuracy of aggregates of
   interoperating statistical NLP engines
   –  Evaluation space and evaluation modules
•  Developed and validated evaluation modules
   for two aggregates

•  Miles to go!
   –  Small portion of a vast research area

Thank You



Questions?
Back-up Slides




Evaluation Module Implementation
•  Each module was implemented as a UIMA CAS consumer
•  Ground truth and other evaluation parameters were input as CAS consumer parameters
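The actual modules are Java UIMA CAS consumers; the following is only a loose Python analogue of that pattern (none of these names come from the UIMA API), showing how ground truth and window length would arrive as consumer parameters and how each processed document would be scored:

```python
class BoundaryEvaluationConsumer:
    """Loose Python analogue of a UIMA CAS consumer (illustrative
    only; the real modules implement the Java UIMA CasConsumer
    interface). Constructor arguments stand in for the CAS
    consumer configuration parameters."""

    def __init__(self, true_boundaries, window_seconds=15.0):
        self.truth = sorted(true_boundaries)  # ground truth, seconds
        self.window = window_seconds
        self.system = []

    def process(self, document):
        """Called once per processed document; 'document' is a plain
        dict standing in for a CAS carrying boundary annotations."""
        self.system.extend(document.get("story_boundaries", []))

    def collection_complete(self):
        """Score the accumulated boundaries with a crude stand-in
        metric: a true boundary counts as hit if some system
        boundary falls within the window of it."""
        hits = sum(any(abs(s - t) <= self.window for s in self.system)
                   for t in self.truth)
        return {"hits": hits, "misses": len(self.truth) - hits}

consumer = BoundaryEvaluationConsumer([30.0, 300.0])
consumer.process({"story_boundaries": [28.5, 150.0]})
print(consumer.collection_complete())  # {'hits': 1, 'misses': 1}
```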
Measuring the performance of story boundary detection
TDT-style sliding window approach: partial credit for slightly misplaced segment boundaries




• True and system boundaries agree within the window → Correct
• No system boundary in a window containing a true boundary → Miss
• System boundary in a window containing no true boundary → False Alarm

• Window length: 15 seconds
Source: Franz et al., “Breaking Translation Symmetry”


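A minimal sketch of a sliding-window scorer under the rules above; the half-window matching here is one plausible reading, and the official TDT scoring tools differ in detail:

```python
def sliding_window_score(true_bounds, system_bounds, window=15.0):
    """Score story boundaries (times in seconds) with a tolerance
    window: a true boundary matched by a system boundary within
    +/- window/2 is correct; unmatched true boundaries are misses,
    unmatched system boundaries are false alarms."""
    half = window / 2.0
    unmatched = list(system_bounds)
    correct = 0
    for t in true_bounds:
        match = next((s for s in unmatched if abs(s - t) <= half), None)
        if match is not None:
            unmatched.remove(match)
            correct += 1
    return {"correct": correct,
            "misses": len(true_bounds) - correct,
            "false_alarms": len(unmatched)}

truth = [120.0, 480.0, 900.0]
system = [125.0, 700.0, 905.5]
print(sliding_window_score(truth, system))
# {'correct': 2, 'misses': 1, 'false_alarms': 1}
```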
STTSBD Test Constraints
•  Ground truth availability: word-position-
   based story boundaries on ASR
   transcripts
  –  Transcripts were already segmented into
     sentences
•  For the pipeline (STTSBD) output, we
   needed to compare time-based story
   boundaries on Arabic speech
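A hedged sketch of the conversion this implies, assuming the ASR output supplies a start time for every word (the tool suite's actual alignment may differ):

```python
def positions_to_times(boundary_word_indices, word_start_times):
    """Map word-position-based story boundaries onto the time axis
    using ASR per-word start times, so they can be compared against
    time-based system boundaries."""
    return [word_start_times[i] for i in boundary_word_indices]

# Hypothetical ASR word timings (seconds) and a boundary before word 3:
word_starts = [0.0, 0.6, 1.1, 62.4, 63.0]
print(positions_to_times([3], word_starts))  # [62.4]
```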
