Interface for Finding Close Matches from Translation Memory
Nipun Edara - 10010119
Priyatham Bollimpalli - 10010148
G Sharath Reddy - 10010174
P V S Dileep - 10010180
IR Search Engine
• First, retrieve the top relevant sentences.
• Then, filter the meaning-equivalent sentences from them.
Reasons for our own Search Engine
• Difficulty in customizing the ranking function. Simple ranking based
on BM25 may not give optimal results, since BM25 is essentially a
TF-IDF based ranking scheme and does not consider phrasal searches
or proximity measures.
• Flexibility in index size. Whoosh builds an index larger than a
conventional one because it assumes the user needs all of its
features. In contrast, building our own index reduced the size
by 50%.
• Flexibility in the query model. Whoosh has a strict query model in
which only a single AND/OR operator is allowed between terms.
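For reference, the plain BM25 ranking criticized above can be sketched as follows. This is an illustrative stand-in with common default parameters (k1 = 1.5, b = 0.75), not our actual ranking code, and every name in it is hypothetical:

```python
import math

def bm25_score(query_terms, doc_terms, df, num_docs, avg_doc_len,
               k1=1.5, b=0.75):
    """Score one document against a query with plain BM25.

    df: dict mapping a term to the number of documents containing it.
    Note that no phrase or proximity information is used -- each term
    contributes independently, which is the limitation noted above.
    """
    doc_len = len(doc_terms)
    tf = {}
    for t in doc_terms:
        tf[t] = tf.get(t, 0) + 1
    score = 0.0
    for t in query_terms:
        if t not in tf:
            continue
        # Smoothed IDF, as in the BM25+ / Lucene-style formulation.
        idf = math.log((num_docs - df.get(t, 0) + 0.5)
                       / (df.get(t, 0) + 0.5) + 1)
        # TF saturation with document-length normalization.
        score += idf * tf[t] * (k1 + 1) / (
            tf[t] + k1 * (1 - b + b * doc_len / avg_doc_len))
    return score
```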
Preprocessing Stage: Indexing
Preprocessing is done to obtain overall parameters of the dataset,
such as the average document length and term frequencies.
Conventional Indexing
Proximity
Query Expansion
Every query, as well as every sentence in the documents, is subjected
to the following during indexing:
• Conversion to lower case
• Tokenization and normalization; for example, "it's" is converted to "it is"
• Removal of punctuation
• Stemming, using the Porter stemmer
• Synonym expansion using WordNet
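A minimal sketch of this pipeline, using a tiny contraction map and a crude suffix stripper as stand-ins for full normalization and the Porter stemmer (WordNet synonym lookup is omitted here; all names are illustrative):

```python
import string

# Tiny contraction map -- a stand-in for full normalization.
CONTRACTIONS = {"it's": "it is", "don't": "do not", "can't": "can not"}

def light_stem(token):
    """Crude suffix stripper standing in for the Porter stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[:-len(suffix)]
    return token

def preprocess(sentence):
    """Lower-case, normalize, strip punctuation, tokenize, and stem."""
    text = sentence.lower()
    for contraction, expansion in CONTRACTIONS.items():
        text = text.replace(contraction, expansion)
    text = text.translate(str.maketrans("", "", string.punctuation))
    return [light_stem(t) for t in text.split()]
```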
Architectural Overview
Closest matching
Two main problems in applying EBMT are:
• As a sentence becomes longer, the number of retrieved similar
sentences decreases greatly. This often results in no output when
translating long sentences.
• Differences in style between the input sentences and the example
corpus.
Meaning Equivalent Sentence
• A sentence that shares the main meaning with the input sentence
despite lacking some unimportant information. It contains no
information beyond that in the input sentence.
Features:
• Content Words: Words categorized as a noun, pronoun, adjective,
adverb, or verb are recognized as content words. Interrogatives are
also included.
• Functional Words: Words such as particles, auxiliary verbs,
conjunctions, and interjections are recognized as functional words.
Matching and Ranking:
• # of identical content words
• # of synonymous words
• # of common functional words
• # of different functional words
• # of different content words
Algorithm
• Given a query and a sentence, the matching score is calculated as follows:
• Get content words of the query (A)
• Get functional words of the query (B)
• Get synonyms of the content words of the query (C)
• Get content words of the sentence (D)
• Get functional words of the sentence (E)
• E1) Identical content words = Number of matching words in A and D
• E2) Identical synonymous words = Number of matching words between C and D
• E3) Identical functional words = Number of matching words between B and E
• Different content words = #(A) + #(D) - 2*( E1 )
• Different functional words = #(B) + #(E) - 2*( E3 )
• Weights are assigned to the above quantities and the total score is computed.
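The steps above can be sketched in code. The weights here are illustrative placeholders, not values from our system, and all names are hypothetical:

```python
def matching_score(query_content, query_functional, query_synonyms,
                   sent_content, sent_functional,
                   weights=(1.0, 0.8, 0.5, -0.5, -1.0)):
    """Score a candidate sentence against a query using the five
    quantities above. Weight values are illustrative placeholders."""
    w_ident, w_syn, w_func, w_diff_func, w_diff_cont = weights
    A, B, C = set(query_content), set(query_functional), set(query_synonyms)
    D, E = set(sent_content), set(sent_functional)
    e1 = len(A & D)                      # identical content words
    e2 = len(C & D)                      # identical synonymous words
    e3 = len(B & E)                      # identical functional words
    diff_content = len(A) + len(D) - 2 * e1
    diff_functional = len(B) + len(E) - 2 * e3
    return (w_ident * e1 + w_syn * e2 + w_func * e3
            + w_diff_func * diff_functional + w_diff_cont * diff_content)
```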
Sequence Matcher
• An improvement of the "gestalt pattern matching" algorithm proposed
by Ratcliff and Obershelp.
• Idea: find the longest contiguous matching subsequence that contains
no "junk" elements, and recursively apply the same procedure to the
parts to the left and right of the matching subsequence.
• It does not always yield minimal edit sequences, but it does tend to
yield matches that "look right" to people.
• Time complexity: cubic in the length of the strings in the worst case.
PATTERN MATCHING: THE GESTALT APPROACH
• Gestalt describes how people can recognize a pattern as a
functional unit that has properties not derivable by summing its
parts.
Example for Gestalt Approach
The Ratcliff/Obershelp
Pattern-matching algorithm
• Works along the same lines as the example mentioned above.
• First, locate the largest group of characters in common.
• Using this as an anchor, recursively find the largest group of
common characters by comparing the left parts of both strings and,
likewise, the right parts of both strings.
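A sketch of that recursion, counting the total number of matched characters (the quadratic longest-common-substring scan is written for clarity, not speed; all names are hypothetical):

```python
def gestalt_matches(a, b):
    """Count matched characters, Ratcliff/Obershelp style: anchor on
    the longest common contiguous block, then recurse on the pieces
    to its left and to its right."""
    best_i = best_j = best_len = 0
    # Dynamic-programming scan for the longest common substring.
    lengths = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a)):
        for j in range(len(b)):
            if a[i] == b[j]:
                lengths[i + 1][j + 1] = lengths[i][j] + 1
                if lengths[i + 1][j + 1] > best_len:
                    best_len = lengths[i + 1][j + 1]
                    best_i = i + 1 - best_len
                    best_j = j + 1 - best_len
    if best_len == 0:
        return 0
    # Recurse on the left and right remainders around the anchor.
    return (best_len
            + gestalt_matches(a[:best_i], b[:best_j])
            + gestalt_matches(a[best_i + best_len:], b[best_j + best_len:]))
```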
• Returns a score reflecting the percentage match.
• Score = 2 × (# of characters matched) / (len(string1) + len(string2))
• The higher the score, the higher the matching percentage.
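Python's standard library exposes this algorithm as difflib.SequenceMatcher, whose ratio() method returns exactly this score:

```python
from difflib import SequenceMatcher

a, b = "translation memory", "translation memories"
# None disables the junk filter, so all characters participate.
matcher = SequenceMatcher(None, a, b)
# ratio() = 2 * (matched characters) / (len(a) + len(b)).
# Here the common block "translation memor" (17 chars) matches,
# giving 2 * 17 / 38.
score = matcher.ratio()
```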
