Spell Correction Systems for E-commerce engines
Anjan Goswami HuiZhong Duan
Anjan Goswami, HuiZhong Duan (Search Science, WalmartLabs)Spell correction Systems 1 / 31
The Spell correction problem
Rich literature [KCG90, Pet80].
Active research area [CB04].
Combination of NLP, Machine Learning [DH11, BB01, LDZ12] and
Systems problems [Kuk92].
Spell correction for e-commerce
Critical site feature for e-commerce.
Impact of ML-based spell correction
Adds revenue.
Reduces bounce rate.
Reduces null results.
Departments such as pharmacy can see large revenue gains from
spell correction.
Spell correction for e-commerce
The science is the same as in any other large-scale spell correction system.
Demand and supply side corpus.
Conversion focus.
User Interfaces.
Spell correction Evaluation
Accuracy for misspelled queries.
Accuracy for correctly spelled queries.
Business metrics.
Coverage.
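The offline metrics above can be sketched as follows. This is a minimal illustration, not the authors' evaluation code: the function name `evaluate`, the labeled-pair format, and the definition of coverage as "fraction of queries the corrector changes" are all assumptions made here. Business metrics need live traffic and are not covered.

```python
from typing import Callable, Iterable, Tuple

def evaluate(corrector: Callable[[str], str],
             labeled: Iterable[Tuple[str, str]]) -> dict:
    """Score a corrector on (query, gold) pairs.

    gold == query means the query was already spelled correctly.
    Coverage is taken here (an assumption) to be the share of
    queries for which the corrector proposes any change.
    """
    mis_total = mis_right = ok_total = ok_kept = changed = n = 0
    for query, gold in labeled:
        n += 1
        out = corrector(query)
        if out != query:
            changed += 1
        if query != gold:            # misspelled query
            mis_total += 1
            mis_right += int(out == gold)
        else:                        # correctly spelled query
            ok_total += 1
            ok_kept += int(out == query)
    return {
        "accuracy_misspelled": mis_right / mis_total if mis_total else None,
        "accuracy_correct": ok_kept / ok_total if ok_total else None,
        "coverage": changed / n if n else None,
    }

# Hypothetical toy corrector: a lookup table that also over-corrects one query.
toy = {"covr": "cover", "visio tv": "vizio tv", "life of pi": "life of pie"}
data = [("covr", "cover"), ("visio tv", "vizio tv"),
        ("ipad", "ipad"), ("life of pi", "life of pi")]
print(evaluate(lambda q: toy.get(q, q), data))
```

Reporting accuracy separately for misspelled and correctly spelled queries makes over-correction visible: the toy corrector fixes every misspelling but damages half of the correct queries.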
The problem
Error statistics
Approximately 26% of web queries contain a spelling error [JM].
E-commerce query data can be expected to be similar.
Error Types
Typographic errors: Covr ← Cover
Cognitive errors: Visio Tv ← Vizio Tv
Non-English word errors: X345678 ← X345677
Contextual errors: life of Pie ← Life of Pi
Challenges
General Challenges
Large candidate pool: queries
Open dictionary: all terms are feasible
Efficiency: happens before search is executed
User behavior: query formulation is different from typical writing
Devices: different devices may cause different types of typos
Under-correction: even if a term is in correct form, it may still need
correction
Over-correction: a term that doesn't appear correct could still be a
good search term
Languages: Different languages have different challenges.
Query Spelling Challenges
Special Challenges (and Opportunities) in e-Commerce
optimization target: linguistic correctness or conversion?
unique dictionary: model numbers, etc.
high cost for over-correction
availability of inventory data
availability of conversion data
General problems
Error modeling
Candidate generation
Ranking and selection of the best candidate.
Modeling
A Noisy Channel Framework
Given user input query q, for every candidate correction c, compute the
conditional probability p(c|q):
\[
p(c \mid q) = \frac{p(q \mid c) \cdot p(c)}{p(q)} \propto p(q \mid c) \cdot p(c) \tag{1}
\]
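Equation (1) can be sketched as an argmax over candidates, computed in log space to avoid underflow. The probability tables below are made-up toy numbers for illustration only, and the names `SOURCE`, `ERROR`, `score`, and `correct` are hypothetical.

```python
import math

# Hypothetical toy probabilities, for illustration only.
SOURCE = {"vizio tv": 0.009, "visio tv": 0.00001, "video tv": 0.001}  # p(c)
ERROR = {  # ERROR[(q, c)] = p(q | c)
    ("visio tv", "vizio tv"): 0.05,
    ("visio tv", "visio tv"): 0.90,
    ("visio tv", "video tv"): 0.001,
}

def score(q: str, c: str) -> float:
    """log p(q|c) + log p(c); p(q) is dropped because it is constant in c."""
    return math.log(ERROR[(q, c)]) + math.log(SOURCE[c])

def correct(q: str, candidates: list) -> str:
    """Return the candidate that maximizes the noisy channel score."""
    return max(candidates, key=lambda c: score(q, c))

print(correct("visio tv", ["vizio tv", "visio tv", "video tv"]))
```

Note that the query itself is one of the candidates: leaving the query unchanged wins whenever its source probability is high enough to offset the error model.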
A Noisy Channel Framework (cont.)
Source model p(c)
Captures: how likely the user is to intend query c in the first place
Typically: language model
Rationale: common phrases have high probabilities
Error model p(q|c)
Captures: how likely c is misspelled as q
Straightforward model: edit distance
Rationale: misspelled query should not be too different from original
query
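A minimal sketch of the straightforward error model: plain Levenshtein edit distance, turned into an unnormalized log-probability by an exponential decay. The decay form and the parameter `lam` are assumptions made for this example, not something the slides specify.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via dynamic programming over two rows."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # substitute or match
        prev = cur
    return prev[-1]

def log_error(q: str, c: str, lam: float = 2.0) -> float:
    """Unnormalized log p(q|c) = -lam * distance; lam is an assumed parameter."""
    return -lam * edit_distance(q, c)

print(edit_distance("covr", "cover"), log_error("covr", "cover"))
```

Every single-character edit costs the same here, which is exactly the limitation that a weighted error model addresses.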
A Noisy Channel Framework (cont.)
More on Source model p(c)
Linguistic correctness is important
Should also reflect query popularity
In e-Commerce, we also need to consider query conversion, and query
revenue
A Noisy Channel Framework (cont.)
Language Model
n-gram language model: data sparsity increases as n goes up
backoff to, or interpolation with, lower-order n-grams is necessary
smoothing is important
Good-Turing smoothing: use 1-frequency items to estimate 0-frequency
probabilities
Additive smoothing: add pseudo counts to terms/phrases
Kneser-Ney smoothing: smart way of backoff and interpolation
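As an illustration of the simplest option listed above, here is a bigram source model with additive smoothing. It is a sketch under assumptions: the class name `BigramLM`, the pseudo count `k`, and the tiny query log are invented for the example, and a production system would more likely use Kneser-Ney or interpolation.

```python
import math
from collections import Counter

class BigramLM:
    """Bigram language model with additive (add-k) smoothing."""

    def __init__(self, queries, k: float = 0.5):
        self.k = k
        self.context = Counter()   # how often each token is followed by another
        self.bigrams = Counter()
        vocab = set()
        for q in queries:
            tokens = ["<s>"] + q.lower().split()
            vocab.update(tokens[1:])
            self.context.update(tokens[:-1])
            self.bigrams.update(zip(tokens, tokens[1:]))
        self.vocab_size = len(vocab) + 1   # +1 slot for any unseen word

    def prob(self, prev: str, word: str) -> float:
        """p(word | prev) with pseudo count k added to every bigram."""
        return ((self.bigrams[(prev, word)] + self.k)
                / (self.context[prev] + self.k * self.vocab_size))

    def log_prob(self, query: str) -> float:
        tokens = ["<s>"] + query.lower().split()
        return sum(math.log(self.prob(p, w)) for p, w in zip(tokens, tokens[1:]))

# Hypothetical query log.
lm = BigramLM(["life of pi", "life of pi dvd", "life jacket"])
print(lm.log_prob("life of pi"), lm.log_prob("life of pie"))
```

The unseen bigram ("of", "pie") still receives non-zero probability, which is the point of smoothing, but the observed phrase scores higher.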
A Noisy Channel Framework (cont.)
More on Error model p(q|c)
Weighted edit model is better: p(a → e) > p(a → n)
Context matters: p(a → e | context = "be...")
Multi-word errors need to be considered: p("gopro" | "go pro"), can
be modeled by HMM, joint sequence model, etc.
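A weighted edit model can be sketched by giving each substitution its own cost, where a lower cost plays the role of a higher edit probability (cost as a negative log-probability). The cost function `vowel_aware_cost` and its values are hypothetical; in practice the weights would be estimated from observed misspelling data.

```python
VOWELS = set("aeiou")

def vowel_aware_cost(a: str, b: str) -> float:
    """Hypothetical costs: vowel-for-vowel confusions are cheaper."""
    return 0.5 if a in VOWELS and b in VOWELS else 1.0

def weighted_edit_distance(source: str, target: str,
                           sub_cost=vowel_aware_cost,
                           indel_cost: float = 1.0) -> float:
    """Edit distance with per-pair substitution costs and a flat insert/delete cost."""
    prev = [j * indel_cost for j in range(len(target) + 1)]
    for i, cs in enumerate(source, 1):
        cur = [i * indel_cost]
        for j, ct in enumerate(target, 1):
            sub = 0.0 if cs == ct else sub_cost(cs, ct)
            cur.append(min(prev[j] + indel_cost,      # delete cs
                           cur[j - 1] + indel_cost,   # insert ct
                           prev[j - 1] + sub))        # substitute or match
        prev = cur
    return prev[-1]

print(weighted_edit_distance("bat", "bet"))       # a -> e, cheap
print(weighted_edit_distance("bat", "bnt"))       # a -> n, expensive
print(weighted_edit_distance("gopro", "go pro"))  # one inserted space
```

Treating the space as an ordinary character lets the same routine score simple split and merge errors such as "gopro" versus "go pro"; context-dependent weights would need a richer cost function than this sketch provides.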
A Noisy Channel Framework (cont.)
Hierarchical Error models
Character level error model
p(a → e | context = "be...")
generalizes well
less accurate
Syllable level error model
Word level error model
p(pi → pie | context = "life of ...")
sparse data
more accurate
Phrase level error model
...
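One way to combine the levels, shown here only as a sketch, is to interpolate the sparse but accurate word-level estimate with the more general character-level estimate, trusting the word level more as its evidence grows. The interpolation weight `total / (total + m)`, the parameter `m`, and all names and counts below are assumptions for illustration; the slides do not say how the levels are combined.

```python
def hierarchical_error_prob(q: str, c: str, word_counts: dict,
                            char_model, m: float = 100.0) -> float:
    """Estimate p(q|c) by mixing word-level and character-level models.

    word_counts[c][q] = number of times intended word c was typed as q.
    char_model(q, c)  = character-level estimate of p(q|c).
    The weight on the word level grows with the evidence available for c.
    """
    observed = word_counts.get(c, {})
    total = sum(observed.values())
    lam = total / (total + m)
    p_word = observed.get(q, 0) / total if total else 0.0
    return lam * p_word + (1.0 - lam) * char_model(q, c)

# Hypothetical data: "pi" was observed 100 times, 10 of them typed as "pie".
counts = {"pi": {"pi": 90, "pie": 10}}
flat_char_model = lambda q, c: 0.01   # placeholder character-level model

print(hierarchical_error_prob("pie", "pi", counts, flat_char_model))
print(hierarchical_error_prob("visio", "vizio", counts, flat_char_model))
```

For a word with no observations the estimate falls back entirely to the character-level model, which is how the generalization of the lower level compensates for sparse data at the higher level.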
Discriminative Models
Why?
The noisy channel model is a generative framework
Multiplying probabilities is difficult because they are estimated in
different ways
How to merge signals into one probability estimate is unknown (e.g.
linguistic correctness vs. popularity vs. revenue)
There are other heuristic and domain-specific features that
cannot be subsumed in the noisy channel model
Discriminative Models (cont.)
How?
Learn to score each (q, c) pair so that the best correction has the highest score
Challenges
Obtaining large scale training data: text parsing, human annotation
Learning methods
Classification
Learning to Rank
Structural learning
Efficiency: use the noisy channel model to retrieve a handful of candidates
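Scoring (q, c) pairs can be sketched with a linear model trained by perceptron-style ranking updates: whenever a wrong candidate outranks the labeled correction, the weights move toward the correct pair's features. The three features, the popularity counts, and the training examples are all hypothetical; a real system would use many more signals (conversion, revenue, error-model scores) and a stronger learner.

```python
import math

def features(q: str, c: str, popularity: dict) -> list:
    """Hypothetical features for a (q, c) pair."""
    return [
        1.0 if q == c else 0.0,               # candidate leaves the query unchanged
        -float(abs(len(q) - len(c))),         # penalize length changes
        math.log1p(popularity.get(c, 0)),     # log query popularity of c
    ]

def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def predict(w, q, candidates, popularity):
    return max(candidates, key=lambda c: dot(w, features(q, c, popularity)))

def train_ranker(examples, popularity, epochs: int = 10):
    """examples: (query, candidates, gold). Perceptron-style ranking updates."""
    w = [0.0, 0.0, 0.0]
    for _ in range(epochs):
        for q, cands, gold in examples:
            pred = predict(w, q, cands, popularity)
            if pred != gold:
                fg = features(q, gold, popularity)
                fp = features(q, pred, popularity)
                w = [wi + g - p for wi, g, p in zip(w, fg, fp)]
    return w

# Hypothetical popularity counts and labeled examples.
popularity = {"vizio tv": 1000, "visio tv": 2, "cover": 500, "covr": 1, "ipad": 800}
examples = [
    ("visio tv", ["visio tv", "vizio tv"], "vizio tv"),
    ("covr", ["covr", "cover"], "cover"),
    ("ipad", ["ipad", "iped"], "ipad"),
]
w = train_ranker(examples, popularity)
print(w, predict(w, "covr", ["covr", "cover"], popularity))
```

Because the features are combined by learned weights rather than multiplied as probabilities, signals estimated in different ways can be mixed without being calibrated against one another.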
Discriminative Models (cont.)
Discriminative models such as SVMs can also be used to
rerank the spelling candidates.
Recent successes with deep neural networks.
Systems for Spelling Correction
Candidate generation for Spelling Correction
Given a word, find all neighboring words within edit distance k.
Given a word, find potential close matches using a hashing trick.
Generate candidates using heuristic rules for common errors.
N-gram based techniques.
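One common hash-based approach to candidate generation is a deletion-neighborhood index: every dictionary term is stored under each string reachable by deleting up to k characters, and a query is looked up under its own deletions. Whether this is the hashing trick the slides refer to is not stated, so treat it as an assumed stand-in; all names and the vocabulary below are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

def deletes(word: str, k: int) -> set:
    """All strings obtained by deleting up to k characters (including word itself)."""
    out = {word}
    for n in range(1, min(k, len(word)) + 1):
        for idx in combinations(range(len(word)), n):
            drop = set(idx)
            out.add("".join(ch for i, ch in enumerate(word) if i not in drop))
    return out

def build_index(vocabulary, k: int = 1) -> dict:
    """Map each deletion variant to the dictionary terms that produce it."""
    index = defaultdict(set)
    for term in vocabulary:
        for key in deletes(term, k):
            index[key].add(term)
    return index

def candidates(word: str, index: dict, k: int = 1) -> set:
    """Dictionary terms sharing a deletion variant with word."""
    found = set()
    for key in deletes(word, k):
        found |= index.get(key, set())
    return found

index = build_index(["cover", "vizio", "clover", "over"], k=1)
print(candidates("covr", index))
print(candidates("visio", index))
```

The lookup returns a superset: "over" matches "covr" although it is two edits away, so the retrieved candidates should still be verified with a true edit distance or scored by the error model. The index is built offline and lookups are hash probes, which suits the requirement that correction run before the search executes.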
Candidate generation scaling up
Distributed implementation.
Hashing tricks.
Spell correction for E-commerce
UI for the spell correction.
Input data: Whether to include item titles or not?
Impact of autocorrection on conversion.
References I
[BB01] Michele Banko and Eric Brill, Scaling to very very large corpora for
natural language disambiguation, Proceedings of the 39th Annual
Meeting on Association for Computational Linguistics, Association for
Computational Linguistics, 2001, pp. 26–33.
[CB04] Silviu Cucerzan and Eric Brill, Spelling correction as an iterative
process that exploits the collective knowledge of web users, EMNLP,
vol. 4, 2004, pp. 293–300.
[DH11] Huizhong Duan and Bo-June Paul Hsu, Online spelling correction for
query completion, Proceedings of the 20th international conference on
World wide web, ACM, 2011, pp. 117–126.
[JM] Daniel Jurafsky and James H Martin, Speech and language processing:
An introduction to natural language processing, computational
linguistics, and speech recognition.
References II
[KCG90] Mark D Kernighan, Kenneth W Church, and William A Gale, A
spelling correction program based on a noisy channel model,
Proceedings of the 13th conference on Computational
linguistics-Volume 2, Association for Computational Linguistics, 1990,
pp. 205–210.
[Kuk92] Karen Kukich, Techniques for automatically correcting words in text,
ACM Computing Surveys (CSUR) 24 (1992), no. 4, 377–439.
[LDZ12] Yanen Li, Huizhong Duan, and ChengXiang Zhai, A generalized
hidden Markov model with discriminative training for query spelling
correction, Proceedings of the 35th international ACM SIGIR
conference on Research and development in information retrieval,
ACM, 2012, pp. 611–620.
References III
[Pet80] James L Peterson, Computer programs for detecting and correcting
spelling errors, Communications of the ACM 23 (1980), no. 12,
676–687.