SemEval 2012 task 6
        A pilot on
Semantic Textual Similarity
   http://www.cs.york.ac.uk/semeval-2012/task6/


     Eneko Agirre (University of the Basque Country)
            Daniel Cer (Stanford University)
           Mona Diab (Columbia University)
                  Bill Dolan (Microsoft)
Aitor Gonzalez-Agirre (University of the Basque Country)
Outline

    Motivation

    Description of the task

    Source Datasets

    Definition of similarity and annotation

    Results

    Conclusions, open issues




                      STS task - SemEval 2012   2
Motivation
●   Word similarity and relatedness correlate highly
    with human judgments
      Move to longer text fragments (STS)
       –   Li et al. (2006) 65 pairs of glosses
       –   Lee et al. (2005) 50 documents on news
●   Paraphrase datasets judge semantic equivalence
    between text fragments
●   Textual entailment (TE) judges whether one
    fragment entails another
      Move to graded notion of semantic equivalence (STS)

                            STS task - SemEval 2012         3
Motivation
●   STS has been part of the core implementation
    of TE and paraphrase systems
●   Algorithms for STS have been extensively
    applied
    ●   MT, MT evaluation, Summarization, Generation,
        Distillation, Machine Reading, Textual Inference,
        Deep QA
●   Interest from the application side confirmed in a
    recent STS workshop:
    ●   http://www.cs.columbia.edu/~weiwei/workshop/

                          STS task - SemEval 2012           4
Motivation
●   STS as a unified framework to combine and evaluate
    semantic (and pragmatic) components
      word sense disambiguation and induction
      lexical substitution
      semantic role labeling
      multiword expression detection and handling
      anaphora and coreference resolution
      time and date resolution
      named-entity handling
      underspecification
      hedging
      semantic scoping
      discourse analysis

                         STS task - SemEval 2012         5
Motivation
●   Start with a pilot task, with the following goals
    1. To set a definition of STS as a graded notion which
       can be easily communicated to non-expert
       annotators, beyond a bare Likert scale
    2. To gather a substantial amount of sentence pairs
       from diverse datasets, and to annotate them with
       high quality
    3. To explore evaluation measures for STS
    4. To explore the relation of STS to paraphrase and
       Machine Translation evaluation exercises

                       STS task - SemEval 2012              6
Description of the task
●   Given two sentences, s1 and s2
    ●   Return a similarity score
        and an optional confidence score
●   Evaluation
    ●   Correlation (Pearson)
        with average of human scores
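
A minimal sketch of how a run is scored, assuming the gold standard is the average of the human annotations and both gold and system scores are read from one-score-per-line files (file names and format are illustrative, not the official ones):

```python
# Minimal scoring sketch: Pearson correlation between a system run and
# the gold standard (average of the human annotations). File names and
# the one-score-per-line format are illustrative.
from scipy.stats import pearsonr

def load_scores(path):
    with open(path) as f:
        return [float(line) for line in f if line.strip()]

gold = load_scores("gold.txt")      # averaged human scores, 0-5
system = load_scores("system.txt")  # scores returned by one run

r, _ = pearsonr(system, gold)
print(f"Pearson r = {r:.3f}")
```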




                          STS task - SemEval 2012   7
Data sources
●   MSR paraphrase: train (750), test (750)
●   MSR video: train (750), test (750)
●   WMT 07–08 (EuroParl): train (734), test (499)
●   Surprise datasets
    ●   WMT 2007 news: test (399)
    ●   OntoNotes – WordNet glosses: test (750)




                        STS task - SemEval 2012     8
Definition of similarity
Likert scale with definitions




                 STS task - SemEval 2012   9
Annotation
●   Pilot with 200 pairs annotated by three authors
    ●   Pairwise (0.84r to 0.87r), with average (0.87r to 0.89r)
●   Amazon Mechanical Turk
    ●   5 annotations per pair, averaged
    ●   Remove turkers with very low correlations with pilot
    ●   Correlation with us 0.90r to 0.94r
    ●   MSR: 2.76 mean, 0.66 sdv.
    ●   SMT: 4.05 mean, 0.66 sdv.



                           STS task - SemEval 2012                 10
Results
●   Baselines: random, cosine of tokens (see the sketch after this list)
●   Participation: 120 hours to submit up to three runs.
    ●   35 teams, 88 runs
●   Evaluation
    ●   Pearson for each dataset
    ●   Concatenate all 5 datasets: ALL
        –   Some systems doing well on each individual dataset
            got low overall results
    ●   Weighted mean over 5 datasets (micro-average): MEAN
        –   Statistical significance
    ●   Normalize each dataset and concatenate (least squares): ALLnorm
        –   Corrects scaling errors (even a random baseline would get 0.59r)
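
For reference, a minimal sketch of the token-cosine baseline mentioned above (plain lowercased whitespace tokens, no weighting; the official baseline's preprocessing may differ):

```python
# Token-cosine baseline: cosine similarity between bags of tokens,
# rescaled to the task's 0-5 range. Whitespace tokenisation only;
# the official baseline's preprocessing may differ.
from collections import Counter
from math import sqrt

def cosine_tokens(s1, s2):
    v1, v2 = Counter(s1.lower().split()), Counter(s2.lower().split())
    dot = sum(v1[t] * v2[t] for t in v1)
    norm = sqrt(sum(c * c for c in v1.values())) * sqrt(sum(c * c for c in v2.values()))
    return dot / norm if norm else 0.0

score = 5 * cosine_tokens("A man is slicing a cucumber.",
                          "A man is slicing cucumber.")
```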

                                       STS task - SemEval 2012           11
Results
●   Large majority better than both baselines
●   Best three runs
    ●   ALL: 0.82r UKP, TAKELAB, TAKELAB
    ●   MEAN: 0.67r TAKELAB, UKP, TAKELAB
    ●   ALLnorm: 0.86r UKP, TAKELAB, SOFT-CARDINALITY
●   Statistical significance (ALL 95% confidence interval)
    ●   1st 0.824r [0.812,0.835]
    ●   2nd 0.814r [0.802,0.825]
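
One common way to obtain such 95% intervals for a Pearson correlation is the Fisher z-transformation; the sketch below assumes that approach, and the n used is illustrative (the organizers' exact procedure may differ):

```python
# 95% confidence interval for a Pearson correlation via the Fisher
# z-transformation; n is the number of evaluated sentence pairs.
from math import atanh, sqrt, tanh

def pearson_ci(r, n, z_crit=1.96):
    z = atanh(r)
    half = z_crit / sqrt(n - 3)
    return tanh(z - half), tanh(z + half)

low, high = pearson_ci(0.824, n=3000)  # n is illustrative
```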



                           STS task - SemEval 2012           12
Results
●   Datasets (ALL)
    ●   MSRpar 0.73r TAKELAB
    ●   MSRvid 0.88r TAKELAB
    ●   SMT-eur 0.57r SRANJANS
    ●   SMT-news 0.61r FBK
    ●   On-WN 0.73r WEIWEI




                      STS task - SemEval 2012   13
Results
●   Evaluation using confidence scores
    ●   Weighted Pearson correlation (sketched after this list)
    ●   Some systems improve results (IRIT, TIANTIANZHU7)
        –   IRIT: 0.48r => 0.55r
    ●   Others did not (UNED)
●   Unfortunately, only a few teams submitted
    confidence scores
●   Promising direction, potentially useful in
    applications (Watson)
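
A minimal sketch of a confidence-weighted Pearson correlation, in the usual weighted-covariance form (the exact formula used in the task evaluation may differ):

```python
# Weighted Pearson: each pair contributes with weight w, e.g. the
# confidence score the system reported for that pair.
import numpy as np

def weighted_pearson(x, y, w):
    x, y, w = (np.asarray(a, dtype=float) for a in (x, y, w))
    w = w / w.sum()
    mx, my = np.sum(w * x), np.sum(w * y)
    cov = np.sum(w * (x - mx) * (y - my))
    return cov / np.sqrt(np.sum(w * (x - mx) ** 2) * np.sum(w * (y - my) ** 2))
```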

                             STS task - SemEval 2012        14
Tools used
●   WordNet, corpora and Wikipedia most used
●   Knowledge-based and distributional methods used about equally
●   Machine learning widely used for combination
    and tuning
●   Best systems used most resources
    ●   Exception: SOFT-CARDINALITY




                      STS task - SemEval 2012      15
Conclusions
●   Pilot worked!
    ●   Defined STS as a Likert scale with definitions
    ●   Produced a wealth of high-quality data (~3,750 pairs)
    ●   Very successful participation
    ●   All data and system outputs are publicly available
●   Started to explore evaluation of STS
●   Started to explore relation to paraphrase and
    MT evaluation
●   Planning for STS 2013
                          STS task - SemEval 2012            16
Open issues
●   Data sources, alternatives to the opportunistic method
    ●   New pairs of sentences
    ●   Possibly related to specific phenomena, e.g. negation
●   Definition of task
    ●   Agreement for definitions
    ●   Compare to Likert scale with no definitions
    ●   Define multiple dimensions of similarity
        (polarity, sentiment, modality, relatedness, entailment, etc.)
●   Evaluation
    ●   Spearman, Kendall's Tau
    ●   Significance tests over multiple datasets (Bergmann & Hommel, 1989)
●   And more!! Join the STS-semeval Google group

                                  STS task - SemEval 2012                     17
STS presentations
●   The three best systems will be presented in the
    last session of SemEval today (4:00pm)
●   Analysis of runs and some thoughts on
    evaluation will also be presented
●   Tomorrow in the poster sessions




                     STS task - SemEval 2012   18
Thanks for your attention!

And thanks to all participants, especially those contributing to the
evaluation discussion (Yoan Gutierrez, Michael Heilman, Sergio Jimenez,
Nitin Madnani, Diana McCarthy and Shrutiranjan Satpathy)
Eneko Agirre was partially funded by the European Community's Seventh Framework Programme
(FP7/2007-2013) under grant agreement no. 270082 (PATHS project) and the Ministry of Economy
under grant TIN2009-14715-C04-01 (KNOW2 project). Daniel Cer gratefully acknowledges the support
of the Defense Advanced Research Projects Agency (DARPA) Machine Reading Program under Air
Force Research Laboratory (AFRL) prime contract no. FA8750-09-C-0181 and the support of the
DARPA Broad Operational Language Translation (BOLT) program through IBM. The STS annotations
were funded by an extension to DARPA GALE subcontract to IBM # W0853748 4911021461.0 to Mona
Diab. Any opinions, findings, and conclusion or recommendations expressed in this material are those
of the author(s) and do not necessarily reflect the view of DARPA, AFRL, or the US government.



                                        STS task - SemEval 2012                                        19
SemEval 2012 task 6
        A pilot on
Semantic Textual Similarity
   http://www.cs.york.ac.uk/semeval-2012/task6/


     Eneko Agirre (University of the Basque Country)
             Daniel Cer (Stanford University)
             Mona Diab (Columbia University)
                  Bill Dolan (Microsoft)
Aitor Gonzalez-Agirre (University of the Basque Country)
MSR paraphrase corpus
●   Widely used to evaluate text similarity algorithms
●   Gleaned over a period of 18 months from
    thousands of news sources on the web.
●   5801 pairs of sentences
    ●   70% train, 30% test
    ●   67% yes, 33% no
        –   ranging from completely unrelated semantically, through
            partially overlapping, to almost-but-not-quite semantically
            equivalent
    ●   IAA 82%-84%
●   (Dolan et al. 2004)
                              STS task - SemEval 2012                      21
MSR paraphrase corpus
●   The Senate Select Committee on Intelligence is preparing a
    blistering report on prewar intelligence on Iraq.
●   American intelligence leading up to the war on Iraq will be
    criticised by a powerful US Congressional committee due to
    report soon, officials said today.

●   A strong geomagnetic storm was expected to hit Earth today
    with the potential to affect electrical grids and satellite
    communications.
●   A strong geomagnetic storm is expected to hit Earth
    sometime %%DAY%% and could knock out electrical grids
    and satellite communications.

                          STS task - SemEval 2012                 22
MSR paraphrase corpus
●   Methodology (see the sketch after this list):
    ●   Rank pairs according to string similarity
        –   "Algorithms for Approximate String Matching", E. Ukkonen,
            Information and Control, Vol. 64, 1985, pp. 100-118.
    ●   Five bands (0.8 – 0.4 similarity)
    ●   Sample equal number of pairs from each band
    ●   Repeat for paraphrases / non-paraphrases
    ●   50% from each
●   750 pairs for train, 750 pairs for test
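
A sketch of the sampling described above, assuming each candidate pair already carries its string-similarity value and a paraphrase label (the tuple format and helper names are illustrative):

```python
# Band-based sampling sketch: bucket candidate pairs by string similarity
# into equal-width bands and draw the same number from each band,
# separately for paraphrases and non-paraphrases.
import random

def sample_banded(pairs, per_band, lo=0.4, hi=0.8, n_bands=5):
    width = (hi - lo) / n_bands
    chosen = []
    for b in range(n_bands):
        band = [p for p in pairs
                if lo + b * width <= p[2] < lo + (b + 1) * width]
        chosen += random.sample(band, min(per_band, len(band)))
    return chosen

def select_750(all_pairs):
    # all_pairs: list of (sent1, sent2, string_sim, is_paraphrase) tuples
    # (illustrative format)
    paraphrases = [p for p in all_pairs if p[3]]
    non_paraphrases = [p for p in all_pairs if not p[3]]
    # 5 bands x 75 pairs x 2 classes = 750 pairs
    return sample_banded(paraphrases, 75) + sample_banded(non_paraphrases, 75)
```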
                            STS task - SemEval 2012                    23
MSR Video Description Corpus
●   Show a segment of a YouTube video
    ●   Ask for one-sentence description of the main
        action/event in the video (AMT)
    ●   120K sentences, 2,000 videos
    ●   Roughly parallel descriptions (not only in English)
●   (Chen and Dolan, 2011)




                          STS task - SemEval 2012             24
MSR Video Description Corpus
                       ●   A person is slicing a cucumber into
                           pieces.
                       ●   A chef is slicing a vegetable.
                       ●   A person is slicing a cucumber.
                       ●   A woman is slicing vegetables.
                       ●   A woman is slicing a cucumber.
                       ●   A person is slicing cucumber with
                           a knife.
                       ●   A person cuts up a piece of
                           cucumber.
                       ●   A man is slicing cucumber.
                       ●   A man cutting zucchini.
                       ●   Someone is slicing fruit.

          STS task - SemEval 2012                                25
MSR Video Description Corpus
●   Methodology:
    ●   All possible pairs from the same video
    ●   1% of all possible pairs from different videos
    ●   Rank pairs according to string similarity
    ●   Four bands (0.8 – 0.5 similarity)
    ●   Sample equal number of pairs from each band
    ●   Repeat for same video / different video
    ●   50% from each
●   750 pairs for train, 750 pairs for test

                          STS task - SemEval 2012        26
WMT: MT evaluation
●   Pairs of segments (~ sentences) that had been part
    of the human evaluation for WMT systems
    ●   a reference translation
    ●   a machine translation submission
●   To keep things consistent, we used only French-to-
    English system submissions
●   Train contains pairs from WMT 2007
●   Test contains pairs with fewer than 16 tokens from
    WMT 2008
●   Train and test come from Europarl

                            STS task - SemEval 2012      27
WMT: MT evaluation
●   The only instance in which no tax is levied is
    when the supplier is in a non-EU country and
    the recipient is in a Member State of the EU.
●   The only case for which no tax is still perceived
    "is an example of supply in the European
    Community from a third country.
●   Thank you very much, Commissioner.
●   Thank you very much, Mr Commissioner.


                      STS task - SemEval 2012           28
Surprise datasets
●   Human-ranked fr-en system submissions from
    the WMT 2007 news conversation test set,
    resulting in 351 unique system-reference pairs.
●   The second set is radically different as it
    comprised 750 pairs of glosses from OntoNotes
    4.0 (Hovy et al., 2006) and WordNet 3.1
    (Fellbaum, 1998) senses.




                     STS task - SemEval 2012          29
Pilot
●   Mona, Dan, Eneko
●   ~200 pairs from three datasets
●   Pairwise agreement:
    ●   GS:dan     SYS:eneko     N:188 Pearson: 0.874
    ●   GS:dan     SYS:mona      N:174 Pearson: 0.845
    ●   GS:eneko   SYS:mona      N:184 Pearson: 0.863
●   Agreement with average of rest of us:
    ●   GS:average  SYS:dan   N:188 Pearson: 0.885
    ●   GS:average  SYS:eneko N:198 Pearson: 0.889
    ●   GS:average  SYS:mona  N:184 Pearson: 0.875

                          STS task - SemEval 2012       30
STS task - SemEval 2012   31
Pilot with turkers
●   Average of turkers with our average:
    ●   N:197 Pearson: 0.959
●   Each of us with average of turkers:
    ●   dan        N:187 Pearson: 0.937
    ●   eneko      N:197 Pearson: 0.919
    ●   mona       N:183 Pearson: 0.896




                     STS task - SemEval 2012   32
Working with AMT
●   Requirements:
    ●   95% approval rating for their other HITs on AMT.
    ●   To pass a qualification test with 80% accuracy (see sketch below).
        –   6 example pairs
        –   answers were marked correct if they were within +1/-1 of our
            annotations
    ●   Targeting US turkers, but accepted all origins
●   HIT: 5 pairs of sentences, $0.20, 5 turkers per HIT
●   114.9 seconds per HIT on the most recent data we
    submitted.
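
A small sketch of the qualification grading mentioned above (an answer counts as correct if it is within one point of our annotation; names are illustrative):

```python
# Qualification test: a turker passes if at least 80% of their answers on
# the 6 example pairs fall within +/-1 of our gold annotations.
def passes_qualification(turker_scores, gold_scores, tolerance=1, threshold=0.8):
    hits = sum(abs(t - g) <= tolerance for t, g in zip(turker_scores, gold_scores))
    return hits / len(gold_scores) >= threshold
```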


                                STS task - SemEval 2012                    33
Working with AMT
●   Quality control
    ●   Each HIT contained one pair from our pilot
    ●   After the tagging, we checked the correlation of
        individual turkers with our scores
    ●   Remove annotations of low correlation turkers
        –   A2VJKPNDGBSUOK N:100 Pearson: -0.003
    ●   We later realized that we could instead use the
        correlation with the average of the other turkers
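
A sketch of this filter in the leave-one-out form mentioned in the last bullet: each turker is correlated against the average of the other turkers on the pairs they share, and dropped below a cutoff (the data structure and cutoff value are illustrative):

```python
# Filter out low-quality turkers: correlate each turker with the average
# of the other turkers' scores on shared pairs; drop them below a cutoff.
from scipy.stats import pearsonr

def filter_turkers(annotations, min_r=0.5):
    # annotations: {turker_id: {pair_id: score}} (illustrative structure)
    kept = {}
    for turker, scores in annotations.items():
        own, others_avg = [], []
        for pair, score in scores.items():
            others = [s[pair] for t, s in annotations.items()
                      if t != turker and pair in s]
            if others:
                own.append(score)
                others_avg.append(sum(others) / len(others))
        if len(own) >= 2 and pearsonr(own, others_avg)[0] >= min_r:
            kept[turker] = scores
    return kept
```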




                          STS task - SemEval 2012              34
Assessing quality of annotation




           STS task - SemEval 2012   35
Assessing quality of annotation
●   MSR datasets
    ●   Average score: 2.76
    ●   Score distribution (annotation counts per score):
        0: 2228   1: 1456   2: 1895   3: 4072   4: 3275   5: 2126


                       STS task - SemEval 2012   36
Average (MSR data)

[Figure: per-pair average similarity scores ("ave") for the MSR data; y-axis from 0 to 6]

         STS task - SemEval 2012         37
Standard deviation (MSR data)


[Figure: per-pair standard deviations for the MSR data; y-axis from -2 to 7]

                STS task - SemEval 2012   38
Standard deviation (MSR data)

[Figure: per-pair standard deviations ("sdv") for the MSR data; y-axis from 0 to 2.5]

                 STS task - SemEval 2012         39
Average SMTeuroparl




      STS task - SemEval 2012   40
