SPIDER: A SYSTEM FOR PARAPHRASING
       IN DOCUMENT EDITING AND REVISION
          APPLICABILITY IN MACHINE TRANSLATION PRE-EDITING




                            Anabela Barreiro



                              ab@metatrad.com




CICLing 2011                                        February 20-26, 2011
Anabela Barreiro                                    Tokyo, Japan
OUTLINE
          INTRODUCTION
                  PARAPHRASES IN NLP
                  PARAPHRASES IN PEDAGOGICAL AND PROFESSIONAL CONTEXTS

          SPIDER
                  FIRST STEPS
                  IMPORTANT FEATURES
                  PARAPHRASES COVERED BY SPIDER
                  INTERFACE
                  LINGUISTIC RESOURCES
                  EVALUATION RESULTS

          THE FUTURE
                  FUTURE APPLICATIONS?
                  FUTURE RESEARCH


IMPORTANCE OF PARAPHRASES IN NLP TASKS
         Question Answering
          [Ibrahim et al., 2003], [Paşca, 2003], [Duboué & Chu-Carroll, 2006]
         Information Extraction and Text Mining
          [Ibrahim et al., 2003], [Shinyama et al., 2002] [Shinyama & Sekine, 2003],
          [Sekine, 2005] [Paşca, 2005], [Paşca & Dienes, 2005]
         Summarization
          [McKeown et al., 2002], [Barzilay, 2001, 2003], [Hirao et al., 2004] [Zhou et
          al., 2006b]
         Natural Language Generation
          [Iordanskaja et al. 1991]
         Plagiarism Detection
          [Potthast et al., 2010], [Vila et al., 2010]
         Machine Translation
          [Zhou et al., 2006], [Callison-Burch et al., 2006a, 2006b, 2007 and 2008]
          [Barreiro, 2008, 2009, 2011]



THE PRACTICAL NEED FOR PARAPHRASES
                    IN PEDAGOGICAL CONTEXTS

          Text Processing and Authoring Aids
           Writing and revision of original/creative/customized texts
          Learning Tools
           Native and second language learning
           Creation of clear and understandable text content
           e.g. students learning language and writing skills
          Style Editors
           Uniformization /consistency of style




THE PRACTICAL NEED FOR PARAPHRASES
                    IN PROFESSIONAL CONTEXTS
          Technical Writing
           Professional high quality documentation and domain-specific texts
           Controlled language
          Linguistic Quality Assurance
           Linguistic quality of generic texts and specialized documentation
           Verification/validation of meaningful content
          Text Optimization
           Readable / publishable texts (business-oriented or purpose-oriented content)
          Terminology
           Search for the “exact” term or relevant keywords
          Translation
           Indispensable for human and machine translation (pre-editing and post-editing)


SPIDER PARAPHRASING SYSTEM
                                      FIRST STEPS

            Initially developed for Portuguese
            1st version – ReEscreve
            publicly available service at http://www.linguateca.pt/ReEscreve/

            2nd version – eSPERTo (Portuguese: the smart/clever one; expert)
            currently being integrated in a cyber school project within the scope of an
            educational program

            Writing exercises – students learning how to improve their writing skills in
            the Portuguese language

            English SPIDER
            prototype to assist writing of domain-specific texts



SPIDER
                             IMPORTANT FEATURES
       Applies linguistic knowledge to recognize and generate paraphrases
      automatically (preserves the source text semantics and grammaticality -
      inflectional features) in the suggestions provided (including transformations of
      multi-word units)
       Uses text-editing mechanisms which provide a variety of alternatives for
      each expression and the possibility to choose among them (according to
      personal preferences, style, idiomaticity, etc.)
       Allows users to suggest new expressions that can be immediately applied
      to their text, making the text editing process easier, more flexible, and
      upgradable
       Designed to help with writing optimization, understandability and
      translatability (improving the quality of the source text so that it has
      a positive impact on translation)
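
The suggestion mechanism described above (several alternatives per expression, a user choice, and user-added expressions that become available immediately) can be pictured with a minimal sketch. This is not SPIDER's implementation: the class name `SuggestionStore`, its methods, and the user-supplied alternative "reach a decision" are hypothetical and only illustrate the idea.

```python
from collections import defaultdict


class SuggestionStore:
    """Hypothetical store of paraphrase suggestions that users can extend."""

    def __init__(self, built_in):
        # Map each source expression to an ordered list of alternatives.
        self._alternatives = defaultdict(list)
        for expression, options in built_in.items():
            self._alternatives[expression].extend(options)

    def suggest(self, expression):
        """Return all known alternatives for an expression (possibly empty)."""
        return list(self._alternatives.get(expression, []))

    def add_user_suggestion(self, expression, alternative):
        """Record a user-supplied alternative so it is offered from now on."""
        if alternative not in self._alternatives[expression]:
            self._alternatives[expression].append(alternative)

    def apply(self, text, expression, choice):
        """Replace the expression in the text with the chosen alternative."""
        if choice not in self._alternatives.get(expression, []):
            raise ValueError(f"{choice!r} is not a known alternative for {expression!r}")
        return text.replace(expression, choice)


store = SuggestionStore({"make a decision": ["decide"]})
store.add_user_suggestion("make a decision", "reach a decision")  # hypothetical user input
print(store.suggest("make a decision"))
print(store.apply("They will make a decision soon.", "make a decision", "decide"))
```

The plain string replacement ignores inflection, which the system itself is said to preserve; it only shows how a user-extensible list of alternatives might be organized.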


PARAPHRASES COVERED BY SPIDER
       Synonyms in context (e.g. phrasal verbs into equivalent expressions)
             to clear up (weather) = (weather) to become better/brighter
       Support verb constructions into single verbs and stylistic variants
             to make a decision = to decide; to make an audit = to perform an audit
       Aspectual constructions into single verbs
             to launch an attack = to attack
       Adverbials (compounds into single adverbs)
             in a constructive way = constructively
       Relatives into participial adjectives
             the president that was elected = the president elect
       Relatives into possessives
             the role that Europe plays/has = the role of Europe
       Relatives into compound nouns (and vice-versa)
             a container for the milk = a milk container; a bottle made of plastic = a plastic bottle
       Agentive passives into actives
             the man was released by the police officer = the police officer released the man
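
As a rough illustration of one of the paraphrase types listed above, the sketch below rewrites support verb and aspectual constructions as single verbs while carrying over the inflection of the support verb. The lexicon, the tense tags and the function name `paraphrase_svc` are assumptions made for this example; SPIDER itself relies on NooJ dictionaries and grammars rather than on regular expressions.

```python
import re

# Hypothetical mini-lexicon: (support verb lemma, predicate noun) -> single verb lemma.
SVC_TO_VERB = {
    ("make", "decision"): "decide",
    ("launch", "attack"): "attack",
}

# Inflected forms of the support verbs, mapped to (lemma, tense tag).
SUPPORT_FORMS = {
    "make": ("make", "base"), "makes": ("make", "3sg"), "made": ("make", "past"),
    "launch": ("launch", "base"), "launches": ("launch", "3sg"),
    "launched": ("launch", "past"),
}

# Inflected forms of the target verbs for the same tense tags.
VERB_FORMS = {
    "decide": {"base": "decide", "3sg": "decides", "past": "decided"},
    "attack": {"base": "attack", "3sg": "attacks", "past": "attacked"},
}

# Candidate construction: word + determiner + word.
PATTERN = re.compile(r"\b(\w+) (?:a|an|the) (\w+)\b")


def paraphrase_svc(sentence):
    """Replace known support verb constructions with an inflected single verb."""

    def rewrite(match):
        form, noun = match.group(1).lower(), match.group(2).lower()
        if form not in SUPPORT_FORMS:
            return match.group(0)          # not a support verb: leave unchanged
        lemma, tag = SUPPORT_FORMS[form]
        verb = SVC_TO_VERB.get((lemma, noun))
        if verb is None:
            return match.group(0)          # unknown verb + noun pair
        return VERB_FORMS[verb][tag]       # same tense tag as the support verb

    return PATTERN.sub(rewrite, sentence)


print(paraphrase_svc("The board made a decision yesterday."))
print(paraphrase_svc("The rebels launched an attack."))
```

Keeping the tense tag separate from the lemma is what lets "made a decision" become "decided" rather than "decide"; a real system would draw these forms from its inflectional dictionary instead of hand-written tables.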


INTERFACE
                       SUGGESTIONS FOR EXAMPLE SENTENCES
 Suggestions for general language linguistic phenomena
       • Compound adverbs > single adverbs
       • Relatives > participial adjectives
       • Support verb constructions > single verbs




INTERFACE
       SELECTION OF PARAPHRASING GRAMMARS FOR SPECIFIC
                                        LINGUISTIC PHENOMENA
    Users can select among general and technical dictionaries (more than one
selection allowed) and grammars for specific linguistic transformations (one, several
or all grammars can be selected). The interface provides sample texts for testing.


                                                                                      Informative details about the
                                                                                       linguistic resources selected




                                                                  Sample LEGAL text




INTERFACE
                          SELECTION OF A DOMAIN DICTIONARY




                                                                                  Identification of legal terms in the text




                       Suggestions for the term “breach of law”

 Users can select one term from the list of suggestions or provide a new suggestion

INTERFACE
  SUGGESTIONS PROVIDED AND USER’S CAPABILITY TO ADD NEW REWRITING
                                                         OPTIONS




                                                                              The user can suggest new words or
                                                                            expressions (synonyms or paraphrases)

                                                                            It is possible to go back and change the user
                                                                                   option as many times as necessary

                                Text rewritten
                 • In red, the expressions in the source text
    •   In green, suggestions provided by SPIDER and selected by the user




LINGUISTIC RESOURCES
        Eng4NooJ – linguistic knowledge system
       • OpenLogos dictionary (http://logos-os.dfki.de/), converted into NooJ
             format and enhanced with new properties, including derivational,
             morpho-syntactic and semantic relations
       • Morphological system
       • Contextual rules and grammars
       • Domain specific dictionary (sample “legal terms”)




LINGUISTIC RESOURCES
      General language dictionary entries (with morpho-syntactic and semantic relations)
      impress,V+FLX=POLISH+SAL=PVPCpleasetype+PT=impressionar+DRV=NDRV01:BOOK+VSUP=make+VSUP=cause+NPREP=on
      aesthetic,A+FLX=NATURAL+SAL=AVstate+PT=aesthetically+DRV=AVDRV03
      skepticism,N+FLX=BOOK+SAL=ABcause+PT=cepticismo+DRV=NAVDRV02

      Rules to transform morpho-syntactically and semantically related words of
      different parts of speech
       NDRV04 = <B>ion/Npred+Nom
       ADRV02 = <B>icable
       AVDRV01 = <E>ly/ADV
       AVDRV04 = <B>tically/ADV

      Grammar to recognize adverbial compounds and transform them into
      equivalent single adverbs
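
To make the entry format and the derivational rules above more concrete, here is a small sketch that splits an entry into lemma, category and features, and applies a rule such as `<B>ion` to a word. It assumes the usual NooJ reading of the operators, in which `<B>` deletes the last character and `<E>` stands for the empty string. The function names and the words "create" and "apply" are illustrative and do not come from the slides.

```python
def parse_entry(line):
    """Split 'lemma,CAT+KEY=VALUE+...' into lemma, category and a feature dict."""
    lemma, _, info = line.partition(",")
    category, *features = info.split("+")
    parsed = {}
    for feature in features:
        key, _, value = feature.partition("=")
        # A key such as VSUP can occur more than once, so values are kept in lists.
        parsed.setdefault(key, []).append(value)
    return lemma, category, parsed


def apply_operator(word, rule):
    """Apply a rule such as '<B>ion/Npred+Nom' under the assumed operator semantics."""
    rule = rule.split("/")[0]              # drop the output tag after the slash
    while rule.startswith("<"):
        op, _, rule = rule[1:].partition(">")
        if op == "B":
            word = word[:-1]               # assumed: <B> deletes the last character
        elif op != "E":                    # assumed: <E> is the empty string (no-op)
            raise ValueError(f"unsupported operator <{op}>")
    return word + rule                     # what remains of the rule is the suffix


entry = ("impress,V+FLX=POLISH+SAL=PVPCpleasetype+PT=impressionar"
         "+DRV=NDRV01:BOOK+VSUP=make+VSUP=cause+NPREP=on")
lemma, category, features = parse_entry(entry)
print(lemma, category, features["VSUP"])

print(apply_operator("create", "<B>ion/Npred+Nom"))
print(apply_operator("apply", "<B>icable"))
print(apply_operator("constructive", "<E>ly/ADV"))
```

Storing repeated keys as lists keeps both support verbs of "impress" (make, cause) available, which is the information a paraphrasing grammar would need to relate "make an impression on" to "impress".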


      Contextual rules

Rules to improve precision
in specific contexts
[bring(vt) N(charge; action)
> present(vt) N(idem)]
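
The contextual rule above rewrites the transitive verb "bring" as "present" only when its object noun is "charge" or "action". A minimal sketch of that behaviour follows, assuming a simple regular expression in place of the NooJ grammar that the system actually uses; the inflection table and the function name `apply_contextual_rule` are hypothetical.

```python
import re

# Inflected forms of 'bring' mapped to the matching forms of 'present'.
BRING_TO_PRESENT = {
    "bring": "present", "brings": "presents",
    "brought": "presented", "bringing": "presenting",
}

# Nouns that license the rewrite, taken from the rule: N(charge; action).
LICENSING_NOUNS = ("charge", "charges", "action", "actions")

RULE = re.compile(
    r"\b(?P<verb>bring|brings|brought|bringing)"
    r"(?P<det>\s+(?:a|an|the))?"
    r"\s+(?P<noun>" + "|".join(LICENSING_NOUNS) + r")\b",
    re.IGNORECASE,
)


def apply_contextual_rule(sentence):
    """Rewrite 'bring' as 'present' only before a licensing noun."""

    def rewrite(match):
        verb = BRING_TO_PRESENT[match.group("verb").lower()]
        det = match.group("det") or ""     # keep the determiner, if any
        return f"{verb}{det} {match.group('noun')}"

    return RULE.sub(rewrite, sentence)


print(apply_contextual_rule("They brought charges against him."))
print(apply_contextual_rule("He will bring an action."))
print(apply_contextual_rule("She brought a book."))
```

Restricting the rewrite to the licensing nouns is what improves precision: "brought a book" is left alone because "present a book" would not be an equivalent paraphrase in general.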



LINGUISTIC RESOURCES




                                          Sample of terms classified
                                              as Information +
                                             Instructional/legal




EVALUATION RESULTS: PARAPHRASING
                                     PRECISION
                   Corpus: 500 sentences
                   100 sentences for each of 5 elementary support verbs

                     SVC Recognition        SVC Recognition        SVC Paraphrasing
                        Precision               Recall                 Precision
       Pôr             73/73 - 100%          73/100 - 73%           72/73 - 98.6%
       Tomar           75/75 - 100%          75/100 - 75%           68/73 - 93.1%
       Ter             65/65 - 100%          65/100 - 65%           59/65 - 90.7%
       Dar             57/60 - 95%           57/100 - 57%           46/51 - 90.1%
       Fazer           43/45 - 95.5%         43/100 - 43%           40/45 - 88.8%
       Average       62.6/63.6 - 98.4%     62.6/100 - 62.6%         57/61 - 93.4%

                              Evaluation of recognition and paraphrasing
                                    of support verb constructions
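
The figures in the table follow from simple ratios: recognition precision is correct recognitions over all recognitions, recognition recall is correct recognitions over the 100 sentences per verb, and paraphrasing precision is good paraphrases over paraphrases evaluated. The sketch below recomputes them from the counts in the table. The function and variable names are illustrative, and the average is computed only for the recognition columns.

```python
# Counts copied from the table, per support verb:
# (correctly recognized, total recognized, good paraphrases, paraphrases evaluated)
COUNTS = {
    "Pôr":   (73, 73, 72, 73),
    "Tomar": (75, 75, 68, 73),
    "Ter":   (65, 65, 59, 65),
    "Dar":   (57, 60, 46, 51),
    "Fazer": (43, 45, 40, 45),
}
SENTENCES_PER_VERB = 100  # each verb was tested on 100 sentences


def scores(correct, recognized, good, evaluated):
    """Return the three table columns as percentages."""
    return {
        "recognition_precision": 100 * correct / recognized,
        "recognition_recall": 100 * correct / SENTENCES_PER_VERB,
        "paraphrasing_precision": 100 * good / evaluated,
    }


def average_recognition(counts):
    """Average row for recognition: mean counts first, then the ratios."""
    n = len(counts)
    mean_correct = sum(row[0] for row in counts.values()) / n
    mean_recognized = sum(row[1] for row in counts.values()) / n
    return {
        "recognition_precision": 100 * mean_correct / mean_recognized,
        "recognition_recall": 100 * mean_correct / SENTENCES_PER_VERB,
    }


for verb, row in COUNTS.items():
    print(verb, {name: round(value, 1) for name, value in scores(*row).items()})
print("Average", {name: round(value, 1)
                  for name, value in average_recognition(COUNTS).items()})
```

Some recomputed values may differ from the table in the last digit, since the table appears to truncate rather than round in a few cells.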



EVALUATION RESULTS: IMPACT ON
                   TRANSLATABILITY (MT)
     Same corpus, 50 sentences selected randomly

     (i) support verb constructions were automatically pre-processed with SPIDER and
          converted into equivalent single verbs
     (ii) the pre-processed sentences (automatically generated paraphrases) and the original
          sentences were submitted to MT, and the output translations of both were compared

     • In 29 cases (58%), the translation of the automatically generated paraphrase was better
     • In 9 cases (18%), the translation of the original support verb construction was better
     • In 12 cases (24%), the two translations were equally good or equally bad

     CONCLUSION
     The experiment indicates that paraphrases such as those generated by SPIDER help
     improve translation scores
     • The automated paraphrasing of support verb constructions through SPIDER
       allowed a significant improvement of the quality of the MT results in that context

FUTURE APPLICATIONS?
     •     Writing / authoring aid (word processing applications)
     •     Language composition tool - general and technical language (e.g. student texts or legal
     texts)
     •     Text production and style editor
     •     Terminology verification tool - professional use of terminology in technical domains
                (elimination of informal, idiomatic, slang use of language)
     •      Empirical testbed for linguistic quality assurance (source and target texts)
     •     Text editing (machine translation pre-editing and post-editing) and translation aid
     •     Controlled language tool
                   •   Consistent, direct, and simple language
                   •   Restricted grammar (avoid certain types of construction)
                   •   Avoid complex reasoning, figures of speech, metaphors, etc.
                   •   Elimination of wordiness
     •     “Revision memory” tool (≈ “translation memory”) - recycling of validated reviewed
                sentences, structures or phrases



FUTURE RESEARCH
                    FROM SPIDER TO MACHINE TRANSLATION

         a fazer um estágio para   dar aulas de / tutor         Religião
         a fazer um estágio para   dar aulas de / lecture       Religião
         a fazer um estágio para   dar aulas de / teach         Religião
         começa a                  dar exemplos / exemplify     :
         sentia-se capaz de        dar um murro em / punch      quem quisesse detê-lo
         gostávamos de lhe         dar uma palavrinha / speak   .




                                                                                    $EN



CICLing 2011                                                               February 20-26, 2011
Anabela Barreiro                                                           Tokyo, Japan
SPIDER: A SYSTEM FOR PARAPHRASING
       IN DOCUMENT EDITING AND REVISION
          APPLICABILITY IN MACHINE TRANSLATION PRE-EDITING




                            Anabela Barreiro



                              ab@metatrad.com




CICLing 2011                                        February 20-26, 2011
Anabela Barreiro                                    Tokyo, Japan

More Related Content

PDF
To Be or Not to be a Zero Pronoun: A Machine Learning Approach for Romanian
PDF
Ontology Integration and Interoperability (OntoIOp) – Part 1: The Distributed...
PDF
ADVANCEMENTS ON NLP APPLICATIONS FOR MANIPURI LANGUAGE
PDF
ADVANCEMENTS ON NLP APPLICATIONS FOR MANIPURI LANGUAGE
PDF
EXTRACTING LINGUISTIC SPEECH PATTERNS OF JAPANESE FICTIONAL CHARACTERS USING ...
PDF
CSTalks-Natural Language Processing-17Aug
PDF
Development of text to speech system for yoruba language
PDF
Ontology Integration and Interoperability (OntoIOp) – Part 1: The Distributed...
To Be or Not to be a Zero Pronoun: A Machine Learning Approach for Romanian
Ontology Integration and Interoperability (OntoIOp) – Part 1: The Distributed...
ADVANCEMENTS ON NLP APPLICATIONS FOR MANIPURI LANGUAGE
ADVANCEMENTS ON NLP APPLICATIONS FOR MANIPURI LANGUAGE
EXTRACTING LINGUISTIC SPEECH PATTERNS OF JAPANESE FICTIONAL CHARACTERS USING ...
CSTalks-Natural Language Processing-17Aug
Development of text to speech system for yoruba language
Ontology Integration and Interoperability (OntoIOp) – Part 1: The Distributed...

Similar to SPIDER: a System for Paraphrasing - Applicability in Machine Translation Pre-Editing - Anabela Barreiro (20)

PDF
Ricardo Baeza-Yates, Luz Rello - Lexical Quality of Social Media - ICWSM - FO...
DOCX
CALL (computer Assisted Language)
PPTX
IFE-MT: An English-to-Yorùbá Machine Translation System
PPTX
Content Writing Optimization with ReWriter
PDF
Different valuable tools for Arabic sentiment analysis: a comparative evaluat...
PDF
Word Segmentation and Lexical Normalization for Unsegmented Languages
PDF
Make it simple with paraphrases: Automated paraphrasing for authoring aids an...
PDF
Towards a Marketplace of Open Source Software Data
PDF
Ara--CANINE: Character-Based Pre-Trained Language Model for Arabic Language U...
PDF
Recent Developments in Functional Discourse Grammar 1st Edition Evelien Keizer
PDF
Identification of prosodic features of punjabi for enhancing the pronunciatio...
PDF
Poster @ enetCollect CA MC meeting in Iasi, Romania
PDF
International Journal on Natural Language Computing (IJNLC) Vol. 4, No.2,Apri...
PDF
IRJET- Tamil Speech to Indian Sign Language using CMUSphinx Language Models
PDF
IRJET- Kinyarwanda Speech Recognition in an Automatic Dictation System for Tr...
PPTX
Metaphors in the ESP business class
PPTX
Paraphrasing biomedical support verb constructions for machine translation
PPTX
Language acquisition
PDF
5a use of annotated corpus
PDF
Dynamic Construction of Telugu Speech Corpus for Voice Enabled Text Editor
Ricardo Baeza-Yates, Luz Rello - Lexical Quality of Social Media - ICWSM - FO...
CALL (computer Assisted Language)
IFE-MT: An English-to-Yorùbá Machine Translation System
Content Writing Optimization with ReWriter
Different valuable tools for Arabic sentiment analysis: a comparative evaluat...
Word Segmentation and Lexical Normalization for Unsegmented Languages
Make it simple with paraphrases: Automated paraphrasing for authoring aids an...
Towards a Marketplace of Open Source Software Data
Ara--CANINE: Character-Based Pre-Trained Language Model for Arabic Language U...
Recent Developments in Functional Discourse Grammar 1st Edition Evelien Keizer
Identification of prosodic features of punjabi for enhancing the pronunciatio...
Poster @ enetCollect CA MC meeting in Iasi, Romania
International Journal on Natural Language Computing (IJNLC) Vol. 4, No.2,Apri...
IRJET- Tamil Speech to Indian Sign Language using CMUSphinx Language Models
IRJET- Kinyarwanda Speech Recognition in an Automatic Dictation System for Tr...
Metaphors in the ESP business class
Paraphrasing biomedical support verb constructions for machine translation
Language acquisition
5a use of annotated corpus
Dynamic Construction of Telugu Speech Corpus for Voice Enabled Text Editor
Ad

More from INESC-ID (Spoken Language Systems Laboratory - L2F) (20)

PDF
Análise comparativa das edições portuguesa e brasileira de Os livros que dev...
PDF
Welcome session 3rd Annual MC Meeting - enetCollect COST Action
PPTX
Syntactic-semantic analysis for information extraction in biomedicine
PPT
Cross language semantic relations between English and Portuguese
PDF
PPTX
eSPERTo’s Paraphrastic Knowledge Applied to Question-Answering and Summarization
PDF
Barreiro et al POP@PROPOR2018-informal2formal-language
PDF
Rebelo-Arnold et al POP@PROPOR2018-EP-BP-alignments
PPTX
Barreiro-Batista-LR4NLP@Coling2018-presentation
PPTX
Barreiro-Mota-VarDial@Coling2018-poster
PDF
ReEscreve: A Translator-Friendly Multi-Purpose Paraphrasing Software Tool
PDF
Machine Translation of Discontinuous Multiword Units
Análise comparativa das edições portuguesa e brasileira de Os livros que dev...
Welcome session 3rd Annual MC Meeting - enetCollect COST Action
Syntactic-semantic analysis for information extraction in biomedicine
Cross language semantic relations between English and Portuguese
eSPERTo’s Paraphrastic Knowledge Applied to Question-Answering and Summarization
Barreiro et al POP@PROPOR2018-informal2formal-language
Rebelo-Arnold et al POP@PROPOR2018-EP-BP-alignments
Barreiro-Batista-LR4NLP@Coling2018-presentation
Barreiro-Mota-VarDial@Coling2018-poster
ReEscreve: A Translator-Friendly Multi-Purpose Paraphrasing Software Tool
Machine Translation of Discontinuous Multiword Units
Ad

Recently uploaded (20)

PDF
7 ChatGPT Prompts to Help You Define Your Ideal Customer Profile.pdf
PDF
Spectral efficient network and resource selection model in 5G networks
PPTX
KOM of Painting work and Equipment Insulation REV00 update 25-dec.pptx
PDF
Architecting across the Boundaries of two Complex Domains - Healthcare & Tech...
PDF
August Patch Tuesday
PDF
Getting Started with Data Integration: FME Form 101
PPTX
TechTalks-8-2019-Service-Management-ITIL-Refresh-ITIL-4-Framework-Supports-Ou...
PDF
Univ-Connecticut-ChatGPT-Presentaion.pdf
PDF
Reach Out and Touch Someone: Haptics and Empathic Computing
PPTX
Group 1 Presentation -Planning and Decision Making .pptx
PDF
Approach and Philosophy of On baking technology
PDF
Build a system with the filesystem maintained by OSTree @ COSCUP 2025
PDF
Profit Center Accounting in SAP S/4HANA, S4F28 Col11
PPTX
Tartificialntelligence_presentation.pptx
PDF
NewMind AI Weekly Chronicles - August'25-Week II
PDF
Per capita expenditure prediction using model stacking based on satellite ima...
PDF
Blue Purple Modern Animated Computer Science Presentation.pdf.pdf
PPTX
Digital-Transformation-Roadmap-for-Companies.pptx
PDF
TokAI - TikTok AI Agent : The First AI Application That Analyzes 10,000+ Vira...
PDF
Building Integrated photovoltaic BIPV_UPV.pdf
7 ChatGPT Prompts to Help You Define Your Ideal Customer Profile.pdf
Spectral efficient network and resource selection model in 5G networks
KOM of Painting work and Equipment Insulation REV00 update 25-dec.pptx
Architecting across the Boundaries of two Complex Domains - Healthcare & Tech...
August Patch Tuesday
Getting Started with Data Integration: FME Form 101
TechTalks-8-2019-Service-Management-ITIL-Refresh-ITIL-4-Framework-Supports-Ou...
Univ-Connecticut-ChatGPT-Presentaion.pdf
Reach Out and Touch Someone: Haptics and Empathic Computing
Group 1 Presentation -Planning and Decision Making .pptx
Approach and Philosophy of On baking technology
Build a system with the filesystem maintained by OSTree @ COSCUP 2025
Profit Center Accounting in SAP S/4HANA, S4F28 Col11
Tartificialntelligence_presentation.pptx
NewMind AI Weekly Chronicles - August'25-Week II
Per capita expenditure prediction using model stacking based on satellite ima...
Blue Purple Modern Animated Computer Science Presentation.pdf.pdf
Digital-Transformation-Roadmap-for-Companies.pptx
TokAI - TikTok AI Agent : The First AI Application That Analyzes 10,000+ Vira...
Building Integrated photovoltaic BIPV_UPV.pdf

SPIDER: a System for Paraphrasing - Applicability in Machine Translation Pre-Editing - Anabela Barreiro

  • 1. SPIDER: A SYSTEM FOR PARAPHRASING IN DOCUMENT EDITING AND REVISION APPLICABILITY IN MACHINE TRANSLATION PRE-EDITING Anabela Barreiro ab@metatrad.com CICLing 2011 February 20-26, 2011 Anabela Barreiro Tokyo, Japan
  • 2. OUTLINE INTRODUCTION  PARAPHRASES IN NLP  PARAPHRASES IN PEDAGOGICAL AND PROFESSIONAL CONTEXTS SPIDER  FIRST STEPS  IMPORTANT FEATURES  PARAPHRASES COVERED BY SPIDER  INTERFACE  LINGUISTIC RESOURCES  EVALUATION RESULTS THE FUTURE  FUTURE APPLICATIONS?  FUTURE RESEARCH CICLing 2011 February 20-26, 2011 Anabela Barreiro Tokyo, Japan
  • 3. IMPORTANCE OF PARAPHRASES IN NLP TASKS  Question Answering [Ibrahim et al., 2003], [Paşca, 2003], [Duboué & Chu-Carroll, 2006]  Information Extraction and Text Mining [Ibrahim et al., 2003], [Shinyama et al., 2002] [Shinyama & Sekine, 2003], [Sekine, 2005] [Paşca, 2005], [Paşca & Dienes, 2005]  Summarization [McKeown et al., 2002], [Barzilay, 2001, 2003], [Hirao et al., 2004] [Zhou et al., 2006b]  Natural Language Generation [Iordanskaja et al. 1991]  Plagiarism Detection [Potthast et al., 2010], [Vila et al., 2010]  Machine Translation [Zhou et al., 2006], [Callison-Burch et al., 2006a, 2006b, 2007 and 2008] [Barreiro, 2008, 2009, 2011] CICLing 2011 February 20-26, 2011 Anabela Barreiro Tokyo, Japan
  • 4. THE PRACTICAL NEED FOR PARAPHRASES IN PEDAGOGICAL CONTEXTS  Text Processing and Authoring Aids Writing and revision of original/creative/customized texts  Learning Tools Native and second language learning Creation of clear and understandable text content e.g. students learning language and writing skills  Style Editors Uniformization /consistency of style CICLing 2011 February 20-26, 2011 Anabela Barreiro Tokyo, Japan
  • 5. THE PRACTICAL NEED FOR PARAPHRASES IN PROFESSIONAL CONTEXTS  Technical Writing Professional high quality documentation and domain-specific texts Controlled language  Linguistic Quality Assurance Linguistic quality of generic texts and specialized documentation Verification/validation of meaningful content  Text Optimization Readable / publishable texts (business-oriented or purpose-oriented content)  Terminology Search for the “exact” term or relevant keywords  Translation Indispensable for human and machine translation (pre-editing and post-editing) CICLing 2011 February 20-26, 2011 Anabela Barreiro Tokyo, Japan
  • 6. OUTLINE INTRODUCTION  PARAPHRASES IN NLP  PARAPHRASES IN PEDAGOGICAL AND PROFESSIONAL CONTEXTS SPIDER  FIRST STEPS  IMPORTANT FEATURES  PARAPHRASES COVERED BY SPIDER  INTERFACE  LINGUISTIC RESOURCES  EVALUATION RESULTS THE FUTURE  FUTURE APPLICATIONS?  FUTURE RESEARCH CICLing 2011 February 20-26, 2011 Anabela Barreiro Tokyo, Japan
  • 7. SPIDER PARAPHRASING SYSTEM FIRST STEPS Initially developed for Portuguese 1st version – ReEscreve publicly available service at http://www.linguateca.pt/ReEscreve/ 2nd version – eSPERTo (Portuguese: the smart/clever one; expert) currently being integrated in a cyber school project within the scope of an educational program Writing exercises – students learning how to improve their writing skills in the Portuguese language English SPIDER prototype to assist writing of domain-specific texts CICLing 2011 February 20-26, 2011 Anabela Barreiro Tokyo, Japan
  • 8. SPIDER IMPORTANT FEATURES  Applies linguistic knowledge to recognize and generate paraphrases automatically (preserves the source text semantics and grammaticality - inflectional features) in the suggestions provided (included transformations of multi-word units)  Uses text-editing mechanisms which provide a variety of alternatives for each expression and the possibility to choose among them (according to personal preferences, style, idiomacity, etc.)  Allows users to suggest new expressions that can be immediately applied to their text, making the text editing process easier, more flexible, and upgradable  Designed to help with writing optimization, understandability and translatability (improvement of the quality of the source text so that it can cause a positive impact in translation) CICLing 2011 February 20-26, 2011 Anabela Barreiro Tokyo, Japan
  • 9. PARAPHRASES COVERED BY SPIDER  Synonyms in context (ex: phrasal verbs into equivalent expressions) to clear up (weather) = (weather) to become better/brighter  Support verb constructions into single verbs and stylistic variants to make a decision = to decide; to make an audit = to perform an audit  Aspectual constructions into single verbs to launch an attack = to attack  Adverbials (compounds into single adverbs) in a constructive way = constructively  Relatives into participial adjectives the president that was elected = the president elect  Relatives into possessives the role that Europe plays/has = the role of Europe  Relatives into compound nouns (and vice-versa) a container for the milk = a milk container; a bottle made of plastic = a plastic bottle  Agentive passives into actives the man was released by the police officer = the police officer released the man CICLing 2011 February 20-26, 2011 Anabela Barreiro Tokyo, Japan
  • 10. INTERFACE SUGGESTIONS FOR EXAMPLE SENTENCES Suggestions for general language linguistic phenomena Compound adverbs > single adverbs Relatives > participial adjectives Support verb constructions > single verbs CICLing 2011 February 20-26, 2011 Anabela Barreiro Tokyo, Japan
  • 11. INTERFACE: SELECTION OF PARAPHRASING GRAMMARS FOR SPECIFIC LINGUISTIC PHENOMENA
      • Users can select among general and technical dictionaries (more than one selection allowed) and grammars for specific linguistic transformations (one, several, or all grammars can be selected)
      • The interface provides sample texts for testing
      • Informative details about the linguistic resources selected
      • Sample LEGAL text
  • 12. INTERFACE: SELECTION OF A DOMAIN DICTIONARY
      • Identification of legal terms in the text
      • Suggestions for the term “breach of law”
      • Users can select one term from the list of suggestions or provide a new suggestion
  • 13. INTERFACE: SUGGESTIONS PROVIDED AND USER’S CAPABILITY TO ADD NEW REWRITING OPTIONS
      • The user can suggest new words or expressions (synonyms or paraphrases)
      • The user can go back and change a selection as many times as necessary
      • Text rewritten
          • In red, the expressions in the source text
          • In green, suggestions provided by SPIDER and selected by the user
  • 14. LINGUISTIC RESOURCES
      • Eng4NooJ – linguistic knowledge system
          • OpenLogos dictionary (http://logos-os.dfki.de/), converted into NooJ format and enhanced with new properties, including derivational, morpho-syntactic, and semantic relations
          • Morphological system
          • Contextual rules and grammars
          • Domain-specific dictionary (sample “legal terms”)
  • 15. LINGUISTIC RESOURCES
      • General language dictionary entries (morpho-syntactic and semantic relations)
          impress,V+FLX=POLISH+SAL=PVPCpleasetype+PT=impressionar+DRV=NDRV01:BOOK+VSUP=make+VSUP=cause+NPREP=on
          aesthetic,A+FLX=NATURAL+SAL=AVstate+PT=aesthetically+DRV=AVDRV03
          skepticism,N+FLX=BOOK+SAL=ABcause+PT=cepticismo+DRV=NAVDRV02
      • Rules to transform morpho-syntactically and semantically related words of different parts of speech
          NDRV04 = <B>ion/Npred+Nom
          ADRV02 = <B>icable
          AVDRV01 = <E>ly/ADV
          AVDRV04 = <B>tically/ADV
      • Grammar to recognize adverbial compounds and transform them into equivalent single adverbs
      • Contextual rules: rules to improve precision in specific contexts
          [bring(vt) N(charge; action) > present(vt) N(idem)]
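The entry and rule notation above can be read mechanically. The sketch below is an illustration under stated assumptions, not Eng4NooJ code: it assumes each entry has the shape `lemma,CATEGORY+KEY=VALUE+...`, and it assumes the usual NooJ reading of the morphological operators, where `<B>` deletes the last character and `<E>` is the empty string. The function names and the test words ("educate", "apply", "chaos") are hypothetical and do not come from the slides.

```python
def parse_entry(line: str) -> dict:
    """Split 'lemma,CAT+KEY=VALUE+...' into lemma, category, and feature lists."""
    lemma, _, info = line.partition(",")
    category, *fields = info.split("+")
    features: dict[str, list[str]] = {}
    for field in fields:
        key, _, value = field.partition("=")
        # Keys such as VSUP can occur more than once, so values are kept in lists.
        features.setdefault(key, []).append(value)
    return {"lemma": lemma, "category": category, "features": features}


def apply_rule(word: str, rule: str) -> str:
    """Apply a suffix rule such as '<B>ion/Npred+Nom' to a base word."""
    commands = rule.split("/", 1)[0]  # drop the output category after '/'
    while commands.startswith("<"):
        operator, _, commands = commands.partition(">")
        if operator == "<B":
            word = word[:-1]  # assumed: <B> deletes the last character
        elif operator != "<E":  # assumed: <E> is the empty string (no change)
            raise ValueError(f"unsupported operator {operator}>")
    return word + commands
```

For example, parsing the `impress` entry gives category `V` and two `VSUP` values (`make`, `cause`), and `apply_rule("constructive", "<E>ly/ADV")` returns `"constructively"`, matching the adverbial paraphrase shown earlier in the deck.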
  • 16. LINGUISTIC RESOURCES
      • Sample of terms classified as Information + Instructional/legal
  • 17. EVALUATION RESULTS: PARAPHRASING PRECISION
      Corpus: 500 sentences (100 sentences for each of 5 elementary support verbs)
      Evaluation of recognition and paraphrasing of support verb constructions (SVC)

      Support verb | SVC recognition precision | SVC recognition recall | SVC paraphrasing precision
      Pôr          | 73/73 (100%)              | 73/100 (73%)           | 72/73 (98.6%)
      Tomar        | 75/75 (100%)              | 75/100 (75%)           | 68/73 (93.1%)
      Ter          | 65/65 (100%)              | 65/100 (65%)           | 59/65 (90.7%)
      Dar          | 57/60 (95%)               | 57/100 (57%)           | 46/51 (90.1%)
      Fazer        | 43/45 (95.5%)             | 43/100 (43%)           | 40/45 (88.8%)
      Average      | 62.6/63.6 (98.4%)         | 62.6/100 (62.6%)       | 57/61 (93.4%)
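The percentages on this slide follow from simple ratios, which the short sketch below recomputes from the reported counts. The tuple layout, the names `COUNTS`, `SENTENCES_PER_VERB`, `scores`, and `mean_recall`, and the reading of each fraction (correct over recognized for precision, correct over the 100 sentences per verb for recall) are assumptions inferred from the table, not definitions given in the slides.

```python
# Per verb: (correctly recognized SVCs, recognized SVCs,
#            correct paraphrases, paraphrases evaluated), as reported on the slide.
COUNTS = {
    "Pôr": (73, 73, 72, 73),
    "Tomar": (75, 75, 68, 73),
    "Ter": (65, 65, 59, 65),
    "Dar": (57, 60, 46, 51),
    "Fazer": (43, 45, 40, 45),
}
SENTENCES_PER_VERB = 100  # each sentence is assumed to contain one SVC


def scores(verb: str) -> dict:
    """Return recognition precision/recall and paraphrasing precision in percent."""
    correct, recognized, good_paraphrases, paraphrased = COUNTS[verb]
    return {
        "recognition_precision": 100.0 * correct / recognized,
        "recognition_recall": 100.0 * correct / SENTENCES_PER_VERB,
        "paraphrasing_precision": 100.0 * good_paraphrases / paraphrased,
    }


def mean_recall() -> float:
    """Average recognition recall over the five support verbs."""
    return sum(scores(verb)["recognition_recall"] for verb in COUNTS) / len(COUNTS)
```

Under these assumptions the recomputed values agree with the slide up to the last decimal; several slide figures (for example 46/51 shown as 90.1%) appear to be truncated rather than rounded.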
  • 18. EVALUATION RESULTS: IMPACT ON TRANSLATABILITY (MT)
      Same corpus, 50 sentences selected randomly
      (i) Support verb constructions were pre-processed automatically with SPIDER and converted into equivalent single verbs
      (ii) The pre-processed sentences (automatically generated paraphrases) and the original sentences were submitted to MT, and the output translations for both were compared
      • 29 (58%) of the best translations were of automatically generated paraphrases
      • 9 (18%) were of support verb constructions
      • 12 (24%) were equally bad or equally good
      CONCLUSION
      • The experiment indicates that paraphrases such as those generated by SPIDER help improve translation scores
      • The automated paraphrasing of support verb constructions through SPIDER allowed a significant improvement of the quality of the MT results in that context
  • 19. OUTLINE
      INTRODUCTION
          • PARAPHRASES IN NLP
          • PARAPHRASES IN PEDAGOGICAL AND PROFESSIONAL CONTEXTS
      SPIDER
          • FIRST STEPS
          • IMPORTANT FEATURES
          • PARAPHRASES COVERED BY SPIDER
          • INTERFACE
          • LINGUISTIC RESOURCES
          • EVALUATION RESULTS
      THE FUTURE
          • FUTURE APPLICATIONS?
          • FUTURE RESEARCH
  • 20. FUTURE APPLICATIONS?
      • Writing / authoring aid (word processing applications)
      • Language composition tool - general and technical language (e.g. student texts or legal texts)
      • Text production and style editor
      • Terminology verification tool - professional use of terminology in technical domains (elimination of informal, idiomatic, slang use of language)
      • Empirical testbed for linguistic quality assurance (source and target texts)
      • Text editing (machine translation pre-editing and post-editing) and translation aid
      • Controlled language tool
          • Consistent, direct, and simple language
          • Restricted grammar (avoid certain types of construction)
          • Avoid complex reasoning, figures of speech, metaphors, etc.
          • Elimination of wordiness
      • “Revision memory” tool (≈ “translation memory”) - recycling of validated reviewed sentences, structures or phrases
  • 21. FUTURE RESEARCH: FROM SPIDER TO MACHINE TRANSLATION
      a fazer um estágio para dar aulas de / tutor Religião
      a fazer um estágio para dar aulas de / lecture Religião
      a fazer um estágio para dar aulas de / teach Religião
      começa a dar exemplos / exemplify :
      sentia-se capaz de dar um murro em / punch quem quisesse detê-lo
      gostávamos de lhe dar uma palavrinha / speak .
      $EN
  • 22. SPIDER: A SYSTEM FOR PARAPHRASING IN DOCUMENT EDITING AND REVISION
      APPLICABILITY IN MACHINE TRANSLATION PRE-EDITING
      Anabela Barreiro
      ab@metatrad.com