Identifying similar text documents

            André Santos
         andrefs@cpan.org




            November 2011
What we get




        André Santos andrefs@cpan.org    Identifying similar text documents
Duplicated versions




Candidate pairs




What this is really about




                    similarity



It’s all LIEs!


  Language Independent Element (LIE)
  Terms which are usually kept untouched during
  translation.

      Year references (e.g. “1977”)
      Proper names (e.g. “Sherlock Holmes”)
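
  A minimal sketch of how the two kinds of LIE listed above might be pulled
  out of plain text. The slides do not show how pairbooks does this, so the
  regular expressions and the name `extract_lies` are assumptions made for
  illustration only: a year is taken to be a standalone four-digit number
  from 1000 to 2099, and a proper name is a run of two or more capitalized
  ASCII words.

```python
import re

# Hypothetical patterns, not taken from pairbooks.
# Year: a standalone four-digit number between 1000 and 2099.
YEAR_RE = re.compile(r"\b(?:1\d{3}|20\d{2})\b")
# Proper name: two or more consecutive capitalized ASCII words.
NAME_RE = re.compile(r"\b[A-Z][a-z]+(?:\s+[A-Z][a-z]+)+\b")


def extract_lies(text):
    """Return the set of candidate LIEs (years and multi-word names) in text."""
    years = set(YEAR_RE.findall(text))
    # Collapse internal whitespace so names split across lines compare equal.
    names = {" ".join(match.split()) for match in NAME_RE.findall(text)}
    return years | names


sample = "In 1977 a new edition of the Sherlock Holmes stories appeared."
print(sorted(extract_lies(sample)))  # ['1977', 'Sherlock Holmes']
```

  The ASCII-only name pattern is a deliberate simplification: it misses
  accented names and single-word names, and it will pick up capitalized
  phrases that are not names at all.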




Measuring similarity




  \[
    \mathrm{similarity}(A, B) = \frac{|A_{\mathrm{LIEs}} \cap B_{\mathrm{LIEs}}|}{|A_{\mathrm{LIEs}} \cup B_{\mathrm{LIEs}}|}
  \]
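
  The ratio above is the Jaccard index of the two LIE sets. A short sketch
  of it in Python follows; the function name `similarity` mirrors the slide,
  while the handling of two empty sets (returning 0.0) is an assumption,
  since the slides do not cover that case.

```python
def similarity(a_lies, b_lies):
    """Jaccard index of two collections of LIEs: |A ∩ B| / |A ∪ B|."""
    a, b = set(a_lies), set(b_lies)
    union = a | b
    if not union:
        # Assumption: two documents without any LIEs are scored as unrelated.
        return 0.0
    return len(a & b) / len(union)


doc_a = {"1977", "Sherlock Holmes", "London"}
doc_b = {"1977", "Sherlock Holmes", "Londres"}
print(similarity(doc_a, doc_b))  # 2 shared out of 4 distinct elements: 0.5
```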




pairbooks

  Similarity values
     < 0.2   Documents are not related
     > 0.4   Documents are candidate pairs
     > 0.9   Documents are near duplicates
       1.0   Documents are duplicates

  Languages
  High similarity, same language: (Near) duplicates
  High similarity, different language: Candidate pairs
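
  One possible reading of the two rules above, written as a small function.
  This is a sketch under assumptions, not the logic pairbooks uses: the name
  `classify` is hypothetical, the band from 0.2 to 0.4 has no label on the
  slide and is reported here as "undetermined", and a pair in different
  languages is always treated as a candidate pair once it passes 0.4.

```python
def classify(score, same_language):
    """Label a document pair from its similarity score and language match."""
    if score < 0.2:
        return "not related"
    if score <= 0.4:
        # Assumption: the slide gives no label for scores from 0.2 to 0.4.
        return "undetermined"
    if not same_language:
        # High similarity across languages suggests a translation pair.
        return "candidate pair"
    if score == 1.0:
        return "duplicate"
    if score > 0.9:
        return "near duplicate"
    return "candidate pair"


print(classify(0.95, same_language=True))   # near duplicate
print(classify(0.95, same_language=False))  # candidate pair
```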

Behold, pairbooks!

  ~ $ pairbooks PT_list.txt ES_list.txt
  PTBR__Umberto_EcoO_nome_da_rosa.txt
    (0.227) [6954,7382]   ES__Umberto_EcoEl_Nombre_de_la_Rosa(...)
    (0.018) [6954,11408]  ES__Umberto_EcoEl_Pendulo_De_Foucau(...)
    (0.018) [6954,5604]   ES__Umberto_EcoDiario_Minimo__2.txt(...)

  PTBR__Umberto_EcoO_Pendulo_de_Focault.txt
    (0.391) [11276,11408] ES__Umberto_EcoEl_Pendulo_De_Foucau(...)
    (0.042) [11276,6024]  ES__Umberto_EcoLa_busqueda_de_la_Le(...)
    (0.035) [11276,5604]  ES__Umberto_EcoDiario_Minimo__2.txt
  (...)
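
  A listing of this shape could be produced by scoring every document of one
  list against every document of the other and keeping the best matches. The
  sketch below assumes that approach; `rank_candidates` and `jaccard` are
  hypothetical names, the file names and LIE sets are invented toy data, and
  printing the two set sizes in brackets is only a guess at what the
  bracketed numbers in the real output mean.

```python
def jaccard(a, b):
    """Jaccard index of two sets; 0.0 when both are empty (an assumption)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def rank_candidates(source_docs, target_docs, top=3):
    """For each source document, keep the `top` most similar target documents.

    Both arguments map a file name to that document's set of LIEs.
    """
    report = {}
    for src_name, src_lies in source_docs.items():
        scored = [
            (jaccard(src_lies, tgt_lies), len(src_lies), len(tgt_lies), tgt_name)
            for tgt_name, tgt_lies in target_docs.items()
        ]
        scored.sort(key=lambda row: row[0], reverse=True)  # best match first
        report[src_name] = scored[:top]
    return report


# Invented toy data, for illustration only.
pt_docs = {"PT_book.txt": {"1901", "Ana Silva", "Lisboa", "Porto"}}
es_docs = {
    "ES_book_a.txt": {"1901", "Ana Silva", "Lisboa", "Oporto"},
    "ES_book_b.txt": {"1850", "Juan Ruiz", "Madrid"},
}

report = rank_candidates(pt_docs, es_docs)
for src_name, rows in report.items():
    print(src_name)
    for score, n_src, n_tgt, tgt_name in rows:
        print(f"  ({score:.3f}) [{n_src},{n_tgt}] {tgt_name}")
```

  Comparing every pair is quadratic in the number of documents, which is
  acceptable for small collections but would need pruning for large ones.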




Perfect LIEs do not exist
  Year references
      Can be confused with page numbers
      Headers/footers can contain them (publishing year, copyright, ...)
  Proper names
      Are sometimes translated (e.g. “São Tomé”, “Judas Tomé”, etc.)
      Some languages use different scripts (e.g. Russian)
      Some languages have declensions
  ...
How to improve LIEs (future work)



     accept a list of equivalent words
     accept a list of stop words
     ...
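
  Since these are listed as future work, the slides give no design for them.
  The following is only a sketch of how the two lists might be applied to a
  set of LIEs before comparison; `normalize_lies`, the canonical form
  "Thomas", and the sample entries are all hypothetical.

```python
def normalize_lies(lies, equivalents=None, stop_words=None):
    """Map each LIE to its canonical form, then drop stop words.

    equivalents: dict from a variant spelling to its canonical form.
    stop_words:  set of canonical forms to ignore entirely.
    """
    equivalents = equivalents or {}
    stop_words = stop_words or set()
    # Unknown elements map to themselves.
    canonical = {equivalents.get(lie, lie) for lie in lies}
    # Stop words are matched after mapping, against canonical forms.
    return {lie for lie in canonical if lie not in stop_words}


# Illustrative lists, not shipped with pairbooks.
equivalents = {"São Tomé": "Thomas", "Judas Tomé": "Thomas"}
stop_words = {"2011"}  # e.g. a publishing year repeated in every footer

doc_a = normalize_lies({"São Tomé", "1977", "2011"}, equivalents, stop_words)
doc_b = normalize_lies({"Judas Tomé", "1977"}, equivalents, stop_words)
print(doc_a == doc_b)  # both reduce to {"Thomas", "1977"}
```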




Give me one of those!


  CPAN
   http://search.cpan.org/perldoc?pairbooks

     Developer version
     requires Linux, Perl
     Incomplete documentation



