Interface for Finding Close Matches from Translation Memory
Nipun Edara - 10010119
Priyatham Bollimpalli - 10010148
G Sharath Reddy - 10010174
P V S Dileep - 10010180
IR Search Engine
• First, retrieve the top relevant sentences.
• Then, filter the meaning-equivalent sentences from them.
Reasons for our own Search Engine
• Difficulty in customizing the ranking function. Simple ranking based
on BM25 may not give optimal results, since BM25 is essentially a
TF-IDF based ranking scheme and does not consider phrasal searches
or proximity measures.
• Flexibility in index size. Whoosh builds an index larger than a
conventional one because it assumes the user needs all of its
features. In contrast, building our own index reduced the size
by 50%.
• Flexibility in the query model. Whoosh has a strict query model in
which only a single AND/OR operator is allowed between terms.
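For reference, the plain BM25 ranking criticized above can be sketched as follows. This is an illustrative stand-in with common default parameters (k1 = 1.5, b = 0.75), not our actual ranking code, and every name in it is hypothetical:

```python
import math

def bm25_score(query_terms, doc_terms, df, num_docs, avg_doc_len,
               k1=1.5, b=0.75):
    """Score one document against a query with plain BM25.

    df: dict mapping a term to the number of documents containing it.
    Note that no phrase or proximity information is used -- each term
    contributes independently, which is the limitation noted above.
    """
    doc_len = len(doc_terms)
    tf = {}
    for t in doc_terms:
        tf[t] = tf.get(t, 0) + 1
    score = 0.0
    for t in query_terms:
        if t not in tf:
            continue
        # Smoothed IDF, as in the BM25+ / Lucene-style formulation.
        idf = math.log((num_docs - df.get(t, 0) + 0.5)
                       / (df.get(t, 0) + 0.5) + 1)
        # TF saturation with document-length normalization.
        score += idf * tf[t] * (k1 + 1) / (
            tf[t] + k1 * (1 - b + b * doc_len / avg_doc_len))
    return score
```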
Preprocessing Stage: Indexing
Preprocessing is done to obtain overall parameters of the dataset,
such as the average document length and term frequencies.
Conventional Indexing
Proximity
Query Expansion
Every query, as well as every sentence in the documents, is subjected
to the following during indexing:
• Conversion to lower case
• Tokenization and normalization; for example, "it's" is converted to "it is"
• Removal of punctuation
• Stemming, using the Porter stemmer
• Synonym expansion using WordNet
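A minimal sketch of this pipeline, using a tiny contraction map and a crude suffix stripper as stand-ins for full normalization and the Porter stemmer (WordNet synonym lookup is omitted here; all names are illustrative):

```python
import string

# Tiny contraction map -- a stand-in for full normalization.
CONTRACTIONS = {"it's": "it is", "don't": "do not", "can't": "can not"}

def light_stem(token):
    """Crude suffix stripper standing in for the Porter stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[:-len(suffix)]
    return token

def preprocess(sentence):
    """Lower-case, normalize, strip punctuation, tokenize, and stem."""
    text = sentence.lower()
    for contraction, expansion in CONTRACTIONS.items():
        text = text.replace(contraction, expansion)
    text = text.translate(str.maketrans("", "", string.punctuation))
    return [light_stem(t) for t in text.split()]
```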
Architectural Overview
Closest matching
Two main problems in applying EBMT are:
• As a sentence becomes longer, the number of retrieved similar
sentences decreases greatly. This often results in no output when
translating long sentences.
• Differences in style between the input sentences and the example
corpus.
Meaning Equivalent Sentence
• A sentence that shares the main meaning with the input sentence
despite lacking some unimportant information. It contains no
information beyond that in the input sentence.
Features:
• Content Words: Words categorized as a noun, pronoun, adjective,
adverb, or verb are recognized as content words. Interrogatives are
also included.
• Functional Words: Words such as particles, auxiliary verbs,
conjunctions, and interjections are recognized as functional words.
Matching and Ranking:
• # of identical content words
• # of synonymous words
• # of common functional words
• # of different functional words
• # of different content words
Algorithm
• Given a query and a sentence, the matching score is calculated as follows:
• Get content words of the query (A)
• Get functional words of the query (B)
• Get synonyms of the content words of the query (C)
• Get content words of the sentence (D)
• Get functional words of the sentence (E)
• E1) Identical content words = Number of matching words in A and D
• E2) Identical synonymous words = Number of matching words between C and D
• E3) Identical functional words = Number of matching words between B and E
• Different content words = #(A) + #(D) - 2*( E1 )
• Different functional words = #(B) + #(E) - 2*( E3 )
• Weights are assigned to the above quantities and the total score is computed.
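The steps above can be sketched in code. The weights here are illustrative placeholders, not values from our system, and all names are hypothetical:

```python
def matching_score(query_content, query_functional, query_synonyms,
                   sent_content, sent_functional,
                   weights=(1.0, 0.8, 0.5, -0.5, -1.0)):
    """Score a candidate sentence against a query using the five
    quantities above. Weight values are illustrative placeholders."""
    w_ident, w_syn, w_func, w_diff_func, w_diff_cont = weights
    A, B, C = set(query_content), set(query_functional), set(query_synonyms)
    D, E = set(sent_content), set(sent_functional)
    e1 = len(A & D)                      # identical content words
    e2 = len(C & D)                      # identical synonymous words
    e3 = len(B & E)                      # identical functional words
    diff_content = len(A) + len(D) - 2 * e1
    diff_functional = len(B) + len(E) - 2 * e3
    return (w_ident * e1 + w_syn * e2 + w_func * e3
            + w_diff_func * diff_functional + w_diff_cont * diff_content)
```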
Sequence Matcher
• An improvement of the "gestalt pattern matching" algorithm proposed
by Ratcliff and Obershelp.
• Idea: find the longest contiguous matching subsequence that contains
no "junk" elements, and recursively apply the same procedure to the
parts to the left and right of the matching subsequence.
• It does not always yield minimal edit sequences, but it does tend to
yield matches that "look right" to people.
• Time complexity: cubic in the length of the strings in the worst case.
PATTERN MATCHING: THE GESTALT APPROACH
• Gestalt describes how people can recognize a pattern as a
functional unit that has properties not derivable by summing its
parts.
Example for Gestalt Approach
The Ratcliff/Obershelp
Pattern-matching algorithm
• Works along the same lines as the example mentioned above.
• First, locate the largest group of characters in common.
• Using this as an anchor, recursively find the largest group of
common characters by comparing the left parts of both strings and,
likewise, the right parts of both strings.
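A sketch of that recursion, counting the total number of matched characters (the quadratic longest-common-substring scan is written for clarity, not speed; all names are hypothetical):

```python
def gestalt_matches(a, b):
    """Count matched characters, Ratcliff/Obershelp style: anchor on
    the longest common contiguous block, then recurse on the pieces
    to its left and to its right."""
    best_i = best_j = best_len = 0
    # Dynamic-programming scan for the longest common substring.
    lengths = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a)):
        for j in range(len(b)):
            if a[i] == b[j]:
                lengths[i + 1][j + 1] = lengths[i][j] + 1
                if lengths[i + 1][j + 1] > best_len:
                    best_len = lengths[i + 1][j + 1]
                    best_i = i + 1 - best_len
                    best_j = j + 1 - best_len
    if best_len == 0:
        return 0
    # Recurse on the left and right remainders around the anchor.
    return (best_len
            + gestalt_matches(a[:best_i], b[:best_j])
            + gestalt_matches(a[best_i + best_len:], b[best_j + best_len:]))
```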
• Returns a score reflecting the percentage match.
• Score = 2 × (# of characters matched) / (len(string1) + len(string2))
• The higher the score, the higher the matching percentage.
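Python's standard library exposes this algorithm as difflib.SequenceMatcher, whose ratio() method returns exactly this score:

```python
from difflib import SequenceMatcher

a, b = "translation memory", "translation memories"
# None disables the junk filter, so all characters participate.
matcher = SequenceMatcher(None, a, b)
# ratio() = 2 * (matched characters) / (len(a) + len(b)).
# Here the common block "translation memor" (17 chars) matches,
# giving 2 * 17 / 38.
score = matcher.ratio()
```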
