Minimum Edit Distance
How similar are two strings?
Spell correction
The user typed “graffe”. Which word is closest?
◦ graf
◦ graft
◦ grail
◦ giraffe
Also used in speech recognition, machine translation (alignment), and information extraction (NER).
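As a rough illustration of the spell-correction question above, the sketch below ranks the four candidates by unit-cost edit distance (insertion, deletion, and substitution each cost 1, the definition introduced on the following slides). The function name `edit_distance` and the variable names are illustrative choices, not part of the original deck.

```python
def edit_distance(source: str, target: str) -> int:
    """Unit-cost edit distance: insert, delete, substitute each cost 1."""
    # prev[j] holds the distance between the processed prefix of source
    # and the first j characters of target.
    prev = list(range(len(target) + 1))
    for i, s_char in enumerate(source, start=1):
        curr = [i]  # deleting the first i characters of source
        for j, t_char in enumerate(target, start=1):
            curr.append(min(
                prev[j] + 1,                       # delete s_char
                curr[j - 1] + 1,                   # insert t_char
                prev[j - 1] + (s_char != t_char),  # substitute, or match for free
            ))
        prev = curr
    return prev[-1]


typed = "graffe"
candidates = ["graf", "graft", "grail", "giraffe"]
# Rank candidates from closest to farthest.
ranked = sorted(candidates, key=lambda word: edit_distance(typed, word))
for word in ranked:
    print(word, edit_distance(typed, word))
```

Under these unit costs, "giraffe" is one insertion away from "graffe", so it ranks first. A practical spell corrector would usually combine such a distance with word frequencies or error weights.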
Edit Distance
The minimum edit distance between two strings is the minimum
number of editing operations needed to transform one into the
other
◦ Insertion
◦ Deletion
◦ Substitution
Minimum Edit Distance
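Two strings and their alignment:
I N T E * N T I O N
* E X E C U T I O N
d s s   i s
(d = deletion, s = substitution, i = insertion, * = gap)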
Minimum Edit Distance
If each operation has cost of 1
◦ Distance between these is 5
If substitutions cost 2 (Levenshtein)
◦ Distance between them is 8
Uses of Edit Distance in NLP
Evaluating Machine Translation and speech recognition
R  Spokesman  confirms       senior  government  adviser  was  appointed
H  Spokesman  said      the  senior              adviser  was  appointed
              S         I            D
(R = reference, H = hypothesis; S = substitution, I = insertion, D = deletion)
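The same distance can be computed over words instead of characters. The sketch below is one possible way to do it, assuming unit costs for all three operations and whitespace tokenization. It also derives a word error rate as errors divided by reference length, which is a common convention but is not stated on the slide. The name `word_edit_distance` is hypothetical.

```python
from typing import Sequence


def word_edit_distance(reference: Sequence[str], hypothesis: Sequence[str]) -> int:
    """Unit-cost edit distance between two token sequences."""
    prev = list(range(len(hypothesis) + 1))
    for i, ref_word in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hypothesis, start=1):
            curr.append(min(
                prev[j] + 1,                           # deletion (word missing from H)
                curr[j - 1] + 1,                       # insertion (extra word in H)
                prev[j - 1] + (ref_word != hyp_word),  # substitution or match
            ))
        prev = curr
    return prev[-1]


ref = "Spokesman confirms senior government adviser was appointed".split()
hyp = "Spokesman said the senior adviser was appointed".split()

errors = word_edit_distance(ref, hyp)
word_error_rate = errors / len(ref)  # assumed convention: errors / reference length
print(errors, round(word_error_rate, 3))
```

For this pair the minimum is three edits, matching the one substitution, one insertion, and one deletion marked on the slide.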
Entity Coreference
◦ IBM Inc. announced today
◦ IBM profits
◦ Stanford Professor Jennifer Eberhardt announced yesterday
◦ for Professor Eberhardt…
How to find the Min Edit Distance?
Searching for a path (sequence of edits) from the start
string to the final string:
◦ Initial state: the word we are transforming
◦ Operators: insert, delete, substitute
◦ Goal state: the word we are trying to get to
◦ Path cost: what we want to minimize: the number of edits
Edit distance is an NLP technique for finding the minimum distance between two words
Minimum Edit as Search
Space of all edit sequences is huge
◦ Can’t navigate naïvely
◦ Lots of distinct paths wind up at the same state.
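To make the point about repeated states concrete, here is a small sketch that counts how many calls a naive recursive search makes compared with the number of distinct (i, j) states. It assumes unit costs, and the function names are illustrative only.

```python
from functools import lru_cache
from typing import Tuple


def naive_search(x: str, y: str) -> Tuple[int, int]:
    """Return (distance, number of recursive calls) without any caching."""
    calls = 0

    def dist(i: int, j: int) -> int:
        nonlocal calls
        calls += 1
        if i == 0:
            return j  # insert the remaining j characters
        if j == 0:
            return i  # delete the remaining i characters
        sub = 0 if x[i - 1] == y[j - 1] else 1
        return min(dist(i - 1, j) + 1, dist(i, j - 1) + 1, dist(i - 1, j - 1) + sub)

    return dist(len(x), len(y)), calls


def memoized_search(x: str, y: str) -> Tuple[int, int]:
    """Return (distance, number of distinct states actually computed)."""

    @lru_cache(maxsize=None)
    def dist(i: int, j: int) -> int:
        if i == 0:
            return j
        if j == 0:
            return i
        sub = 0 if x[i - 1] == y[j - 1] else 1
        return min(dist(i - 1, j) + 1, dist(i, j - 1) + 1, dist(i - 1, j - 1) + sub)

    result = dist(len(x), len(y))
    return result, dist.cache_info().misses  # each miss is one new state


naive_result, naive_calls = naive_search("graffe", "giraffe")
memo_result, distinct_states = memoized_search("graffe", "giraffe")
print(naive_calls, distinct_states)
```

Both versions return the same distance, but the cached version evaluates at most (n+1)(m+1) states, which is the observation that motivates the tabular approach.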
Defining Min Edit Distance
For two strings
◦ X of length n
◦ Y of length m
We define D(i,j)
◦ The edit distance between X[1..i] and Y[1..j]
◦ i.e., the first i characters of X and the first j characters of Y
◦ The edit distance between X and Y is thus D(n, m)
Minimum Edit Distance
COMPUTING MINIMUM EDIT DISTANCE
Dynamic Programming for Minimum Edit Distance
Dynamic programming: A tabular computation of D(n,m)
Solving problems by combining solutions to subproblems.
Bottom-up
◦ Compute D(i,j) for small i,j
◦ Compute larger D(i,j) based on previously computed smaller
values
◦ i.e., compute D(i,j) for all i (0 ≤ i ≤ n) and j (0 ≤ j ≤ m)
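Defining Min Edit Distance (Levenshtein)
Initialization:
$$D(i,0) = i, \qquad D(0,j) = j$$
Recurrence relation: for each i = 1…N and each j = 1…M (N is the length of X, M the length of Y),
$$D(i,j) = \min \begin{cases} D(i-1,j) + 1 & \text{(deletion)} \\ D(i,j-1) + 1 & \text{(insertion)} \\ D(i-1,j-1) + \begin{cases} 2 & \text{if } X(i) \neq Y(j) \\ 0 & \text{if } X(i) = Y(j) \end{cases} & \text{(substitution)} \end{cases}$$
Termination:
D(N,M) is the distance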
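The recurrence can be sketched as a bottom-up table fill. This is a minimal version, assuming 0-indexed Python strings (so X(i) is `x[i - 1]`) and a `sub_cost` parameter, a name introduced here so that both the cost-2 and cost-1 variants can be tried.

```python
from typing import List


def min_edit_distance_table(x: str, y: str, sub_cost: int = 2) -> List[List[int]]:
    """Fill D[i][j] = edit distance between x[:i] and y[:j]."""
    n, m = len(x), len(y)
    D = [[0] * (m + 1) for _ in range(n + 1)]

    # Initialization: D(i,0) = i and D(0,j) = j.
    for i in range(1, n + 1):
        D[i][0] = i
    for j in range(1, m + 1):
        D[0][j] = j

    # Recurrence: each cell depends on its three already-filled neighbours.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diagonal = 0 if x[i - 1] == y[j - 1] else sub_cost
            D[i][j] = min(
                D[i - 1][j] + 1,             # deletion
                D[i][j - 1] + 1,             # insertion
                D[i - 1][j - 1] + diagonal,  # substitution or match
            )
    return D


table = min_edit_distance_table("intention", "execution")
print(table[-1][-1])                                                     # D(N,M), substitutions cost 2
print(min_edit_distance_table("intention", "execution", sub_cost=1)[-1][-1])  # unit-cost variant
```

The table takes O(NM) time and space. If only the final distance is needed, two rows are enough.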
N 9
O 8
I 7
T 6
N 5
E 4
T 3
N 2
I 1
# 0 1 2 3 4 5 6 7 8 9
# E X E C U T I O N
The Edit Distance Table
Edit Distance
[Figure: the edit distance table, rows indexed by i (INTENTION) and columns by j (EXECUTION), with example cells marked Del, Ins, and Sub]
N   9   8   9  10  11  12  11  10   9   8
O   8   7   8   9  10  11  10   9   8   9
I   7   6   7   8   9  10   9   8   9  10
T   6   5   6   7   8   9   8   9  10  11
N   5   4   5   6   7   8   9  10  11  10
E   4   3   4   5   6   7   8   9  10   9
T   3   4   5   6   7   8   7   8   9   8
N   2   3   4   5   6   7   8   7   8   7
I   1   2   3   4   5   6   7   6   7   8
#   0   1   2   3   4   5   6   7   8   9
    #   E   X   E   C   U   T   I   O   N
The Edit Distance Table
Minimum Edit Distance
BACKTRACE FOR COMPUTING ALIGNMENTS
Computing alignments
✓ Edit distance isn’t sufficient
◦ We often need to align each character of the two strings to each other
✓ Do this by keeping a “backtrace”
✓ Every time we enter a cell, remember where we came from
✓ When we reach the end:
◦ Trace back the path from the upper right corner to read off the alignment
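Adding Backtrace to Minimum Edit Distance
Base conditions:
$$D(i,0) = i, \qquad D(0,j) = j$$
Termination:
D(N,M) is the distance
Recurrence relation: for each i = 1…N and each j = 1…M,
$$D(i,j) = \min \begin{cases} D(i-1,j) + 1 & \text{(deletion)} \\ D(i,j-1) + 1 & \text{(insertion)} \\ D(i-1,j-1) + \begin{cases} 2 & \text{if } X(i) \neq Y(j) \\ 0 & \text{if } X(i) = Y(j) \end{cases} & \text{(substitution)} \end{cases}$$
$$\mathrm{ptr}(i,j) = \begin{cases} \text{LEFT} & \text{(insertion)} \\ \text{DOWN} & \text{(deletion)} \\ \text{DIAG} & \text{(substitution)} \end{cases}$$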
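One way to sketch the backtrace is to store a pointer per cell while filling the table and then walk the pointers back from D(N,M). The version below assumes substitution cost 2 and breaks ties in favour of DIAG, then DOWN, then LEFT. That tie-breaking order is an arbitrary choice made here, so the alignment it prints may differ from the one on the slides while having the same total cost. The function name `align` is illustrative.

```python
from typing import Tuple


def align(x: str, y: str, sub_cost: int = 2) -> Tuple[int, str, str]:
    """Return (distance, aligned x, aligned y), with '*' marking gaps."""
    n, m = len(x), len(y)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    ptr = [[""] * (m + 1) for _ in range(n + 1)]

    # Base conditions: the first column is all deletions, the first row all insertions.
    for i in range(1, n + 1):
        D[i][0], ptr[i][0] = i, "DOWN"
    for j in range(1, m + 1):
        D[0][j], ptr[0][j] = j, "LEFT"

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = D[i - 1][j - 1] + (0 if x[i - 1] == y[j - 1] else sub_cost)
            down = D[i - 1][j] + 1  # deletion
            left = D[i][j - 1] + 1  # insertion
            best = min(diag, down, left)
            D[i][j] = best
            # Remember where we came from (ties: DIAG, then DOWN, then LEFT).
            ptr[i][j] = "DIAG" if best == diag else ("DOWN" if best == down else "LEFT")

    # Trace back from D(N,M) to D(0,0), collecting aligned characters in reverse.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        move = ptr[i][j]
        if move == "DIAG":
            top.append(x[i - 1]); bottom.append(y[j - 1]); i -= 1; j -= 1
        elif move == "DOWN":
            top.append(x[i - 1]); bottom.append("*"); i -= 1
        else:  # LEFT
            top.append("*"); bottom.append(y[j - 1]); j -= 1
    return D[n][m], "".join(reversed(top)), "".join(reversed(bottom))


distance, aligned_x, aligned_y = align("intention", "execution")
print(distance)
print(aligned_x)
print(aligned_y)
```

The trace visits at most N + M cells, so recovering the alignment adds only O(N + M) work on top of the O(NM) table fill.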
MinEdit with Backtrace
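Path read back from the upper right corner (INTENTION → EXECUTION):
◦ n/n, o/o, i/i, t/t: match, cost 0
◦ n/u: substitution, cost 2
◦ c: insertion, cost 1
◦ e/e: match, cost 0
◦ t/x: substitution, cost 2
◦ n/e: substitution, cost 2
◦ i: deletion, cost 1
Total cost: 8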
Result of Backtrace
[Figure: the two strings INTENTION and EXECUTION and the alignment read off the backtrace]
Minimum Edit Distance
WEIGHTED MINIMUM EDIT DISTANCE
Weighted Edit Distance
Why would we add weights to the computation?
◦ Spell Correction: some letters are more likely to be
mistyped than others
Confusion matrix for spelling errors
Weighted Min Edit Distance
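Initialization:
$$D(0,0) = 0$$
$$D(i,0) = D(i-1,0) + \mathrm{del}[x(i)], \qquad 1 \le i \le N$$
$$D(0,j) = D(0,j-1) + \mathrm{ins}[y(j)], \qquad 1 \le j \le M$$
Recurrence Relation:
$$D(i,j) = \min \begin{cases} D(i-1,j) + \mathrm{del}[x(i)] \\ D(i,j-1) + \mathrm{ins}[y(j)] \\ D(i-1,j-1) + \mathrm{sub}[x(i), y(j)] \end{cases}$$
Termination:
D(N,M) is the distance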
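The weighted recurrence can be sketched by passing the three cost tables in as functions. The weights below are hypothetical values chosen only for illustration (a cheaper a/e substitution); real weights would come from a confusion matrix of observed spelling errors. All function and parameter names are assumptions of this sketch.

```python
from typing import Callable


def weighted_edit_distance(
    x: str,
    y: str,
    del_cost: Callable[[str], float],
    ins_cost: Callable[[str], float],
    sub_cost: Callable[[str, str], float],
) -> float:
    """Weighted minimum edit distance following the recurrence above."""
    n, m = len(x), len(y)
    D = [[0.0] * (m + 1) for _ in range(n + 1)]

    # Initialization: accumulate deletion costs down the first column
    # and insertion costs along the first row.
    for i in range(1, n + 1):
        D[i][0] = D[i - 1][0] + del_cost(x[i - 1])
    for j in range(1, m + 1):
        D[0][j] = D[0][j - 1] + ins_cost(y[j - 1])

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i][j] = min(
                D[i - 1][j] + del_cost(x[i - 1]),
                D[i][j - 1] + ins_cost(y[j - 1]),
                D[i - 1][j - 1] + sub_cost(x[i - 1], y[j - 1]),
            )
    return D[n][m]


def toy_sub_cost(a: str, b: str) -> float:
    """Hypothetical weights: identical letters are free, a/e confusions are cheap."""
    if a == b:
        return 0.0
    if {a, b} == {"a", "e"}:
        return 0.5  # made-up value for illustration
    return 1.0


def unit(_: str) -> float:
    return 1.0


print(weighted_edit_distance("cat", "cet", unit, unit, toy_sub_cost))
print(weighted_edit_distance("cat", "cut", unit, unit, toy_sub_cost))
```

With these toy weights, "cat" is closer to "cet" than to "cut", which is the kind of preference a confusion matrix is meant to encode.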
