Searn Algorithm for
Structured Prediction
Presented by Supun Abeysinghe
Outline
● What is Structured Prediction
● Approaches to Structured Prediction
● Idea of Search-Based Structured Prediction
● Background Information for SEARN
● SEARN Algorithm
● Comparison with other approaches
Structured Prediction
What is Structured Prediction?
● Informally, structured prediction is the process of capturing the structure present in a given input.
● The main difference from other machine learning problems is that structured prediction problems usually have complex outputs.
POS Tagging
She sells seashells on the seashore
PRP VBZ NNS IN DT NN
Chunking
Constituency Parsing
Dependency Parsing
And a lot more...
Approaches for Structured
Prediction
Different Approaches to SP
● Structured Perceptron
○ A direct adaptation of the averaged perceptron from binary classification to SP
● Incremental Perceptron
○ Also a search-based approach.
● Maximum Entropy Markov Models
○ Similar to logistic regression in binary classification
● Conditional Random Fields
○ Solves the label bias problem of Maximum Entropy models
● Maximum Margin Markov Networks
● SVM for Independent and Structured Outputs (SVMstruct)
Search-based Structured
Prediction
Traditional Approach to SP
[Diagram: input Xn, with features F, goes into a model with parameters W, which scores all possible outputs]
● There is a model that can generate all the possible outputs for a given input.
● Based on the input features, the model parameters assign a score to each of those outputs (a minimal sketch follows below).
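As a concrete illustration (a hedged sketch, not the only possible instantiation), a common linear model scores each candidate output with a weight vector w over joint features Φ(x, y) and picks the highest-scoring one:

$$ \text{score}(x, y) = w^{\top} \Phi(x, y), \qquad \hat{y} = \arg\max_{y \in \mathcal{Y}(x)} w^{\top} \Phi(x, y) $$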
Training
[Diagram: training pair (Xn, Yn) goes into the model with parameters W; among all possible outputs, the correct output Yn should receive the highest score]
● Model parameters are trained such that the correct output for each training
example will have the highest score.
Decoding
[Diagram: input Xn goes into the model with parameters W; all possible outputs are searched for the one with the highest score]
● In the decoding phase, the input is run through the model, and all the outputs are searched to find the output with the highest score.
Role of Search
● Search retrieves the output with the highest score from the search space.
● Almost all SP approaches need a search component.
● In most cases, searching through the whole space is intractable.
○ Assumptions about the output are made so that dynamic programming can be applied.
○ Alternatively, approximate methods such as beam search, greedy search, or other heuristic-based search methods are used.
● Search can be seen as a sequence of decisions taken to get the best
output.
Search-based SP
● The search phase and the model are combined.
● Rather than searching after learning the model, learn how to search.
● Each decision made during the search is treated as a classification problem.
● Each search decision then builds the output incrementally.
● The goal is to train these classifiers to build an optimal output (see the sketch after this list).
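A minimal sketch of this idea for sequence labelling, assuming a generic pre-trained multi-class classifier with a scikit-learn-style predict method; the classifier and feature function are hypothetical placeholders, not part of SEARN itself:

```python
def greedy_decode(words, classifier, feature_fn):
    """Build a tag sequence one search decision at a time.

    Each classifier call is one decision: given the input and the
    partial output built so far (the state), predict the next action
    (here, the tag for the current word).
    """
    tags = []                                    # partial output = current state
    for i in range(len(words)):
        feats = feature_fn(words, tags, i)       # features may span the input and previous decisions
        action = classifier.predict([feats])[0]  # the classification decision
        tags.append(action)                      # the decision extends the structure
    return tags
```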
Background information before
SEARN
Learning Reductions
● Relates a hard, complex prediction problem to a simpler prediction problem.
● Maps the harder problem to a simpler problem, obtains a solution for the simple problem, and maps that solution back to the harder problem.
● A reduction has three components:
○ Sample mapping - mapping the complex problem's dataset to the simpler problem
○ Hypothesis mapping - mapping the solution of the easier problem back to the hard problem
○ Bounds - how well the reduction solves the larger problem
Importance Weighted Binary Classification
● A simple extension of binary classification.
● Each example (data item) has an associated weight that reflects the importance of that data item: (xᵢ, yᵢ, cᵢ).
● The solution should be a binary classifier that minimizes the expected weighted loss (stated below).
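A hedged formal statement of this objective in the notation above, following standard treatments of importance-weighted classification (D denotes the example distribution):

$$ \ell(h) = \mathbb{E}_{(x,\, y,\, c) \sim D}\left[\, c \cdot \mathbf{1}\!\left[ h(x) \neq y \right] \,\right] $$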
Importance Weighted Binary Classification
● Solved by reducing the problem to C parallel binary classifiers.
● C datasets are generated by sampling from the original dataset with probability proportional to the importance weights.
● Using those datasets, C binary classifiers are trained.
● Predictions are made by taking the majority vote of those C parallel binary classifiers (a rough sketch follows below).
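A rough sketch of this reduction under stated assumptions: scikit-learn's LogisticRegression stands in for the base binary learner, C = 10 is an arbitrary choice, and X, y are NumPy arrays with 0/1 labels; none of these specifics come from the original slides.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_weighted_by_resampling(X, y, weights, C=10, seed=0):
    """Reduce importance-weighted binary classification to C plain binary
    classifiers by resampling with probability proportional to the weights."""
    rng = np.random.default_rng(seed)
    probs = np.asarray(weights, dtype=float)
    probs = probs / probs.sum()                          # sampling distribution over examples
    classifiers = []
    for _ in range(C):
        idx = rng.choice(len(X), size=len(X), p=probs)   # weighted resample of the dataset
        classifiers.append(LogisticRegression().fit(X[idx], y[idx]))
    return classifiers

def predict_majority(classifiers, X):
    """Predict by majority vote over the C classifiers (assumes 0/1 labels)."""
    votes = np.stack([clf.predict(X) for clf in classifiers])
    return (votes.mean(axis=0) >= 0.5).astype(int)
```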
Cost Sensitive Classification
● This is a natural extension of Importance Weighted Binary Classification to a multi-class scenario.
● For a K-class task, we have to find a hypothesis h that minimizes the expected cost of its predictions (stated below).
● c is a K-dimensional vector containing the cost of each possible prediction.
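A hedged statement of that objective, consistent with the cost-vector notation used in the SEARN paper:

$$ \ell(h) = \mathbb{E}_{(x,\, c) \sim D}\left[\, c_{h(x)} \,\right] $$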
Cost Sensitive Classification
● This is reduced to an Importance Weighted Binary Classification problem using the Weighted All Pairs (WAP) reduction (Beygelzimer et al., 2005).
● WAP generates $\binom{K}{2}$ importance-weighted binary classification problems, one for each pair of classes.
● The importance weights are derived from the cost vectors so that minimizing the weighted binary losses also minimizes the original cost-sensitive loss.
SEARN Algorithm
SEARN Algorithm
● SEARN is developed by casting structured prediction in the language of reductions.
● In particular, it reduces structured prediction to cost-sensitive classification.
● The cost-sensitive classification problem can, in turn, be reduced to binary classification by applying the Weighted All Pairs method.
● So structured prediction can ultimately be solved using binary classification.
SEARN Algorithm
● Removes the “search” from the prediction process by learning a classifier
to make incremental decisions.
Definition of Structured Prediction
We can define the structured prediction problem as a cost-sensitive classification problem, as sketched below.
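A hedged paraphrase of the formal definition from the SEARN paper: a structured prediction problem is given by a distribution D over pairs of an input and a cost vector with one entry per candidate output, where each output decomposes into a sequence of smaller decisions.

$$ D \text{ is a distribution over } (x, c), \quad x \in \mathcal{X}, \quad c \in \mathbb{R}_{\geq 0}^{|\mathcal{Y}_x|}, \quad \text{each } y \in \mathcal{Y}_x \text{ decomposing as } (y_1, \ldots, y_{T}). $$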
Definition of Structured Prediction
The goal of structured prediction is to find a hypothesis h : X → Y that minimizes the given loss (written out below).
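Written out in the notation above (a hedged reconstruction); note that this is exactly the cost-sensitive loss from earlier, which is the heart of the reduction:

$$ h^{*} = \arg\min_{h} \; \mathbb{E}_{(x,\, c) \sim D}\left[\, c_{h(x)} \,\right] $$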
Policy
● We need to find an h such that, given a state s and the input x, h(x, s) gives the next action.
● We can consider the policy h as a classifier; the whole problem then becomes a classification problem.
● Now we need to train this classifier.
Training
● Training is an iterative process:
● Initialize with a known policy.
● Using that policy, create cost-sensitive examples.
● Create a new policy using the cost-sensitive examples.
● Interpolate the previous policy and the new policy (as written below).
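The interpolation step can be written as stochastic mixing of policies, following SEARN, where β ∈ (0, 1] is the interpolation parameter and h_new is the newly learned classifier; at each decision the new classifier is used with probability β and the previous policy with probability 1 − β:

$$ \pi_{\text{new}} = \beta \, h_{\text{new}} + (1 - \beta) \, \pi_{\text{old}} $$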
Cost Sensitive Examples
● A policy generates one path per training example. (A path is a sequence of states; a state is a partial structure.)
● SEARN creates a single cost-sensitive example for each state on each path.
● The classes associated with each example are the available actions (next states), each labelled with its cost.
● The difficulty lies in specifying these cost values.
Cost
● The cost of each action can be viewed as regret, defined below (π is the policy).
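A hedged reconstruction of that definition, consistent with how SEARN assigns costs: the cost of action a in state s is the expected loss of taking a and then following π to completion, minus the expected loss of the best available action.

$$ c_{s}(a) \;=\; \mathbb{E}\!\left[\, L \mid s, a, \pi \,\right] \;-\; \min_{a'} \, \mathbb{E}\!\left[\, L \mid s, a', \pi \,\right] $$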
● The complexity of computing this quantity is problem dependent.
● There are multiple ways to compute it (Monte-Carlo sampling, single Monte-Carlo sampling, etc.).
Optimal Policy
● The optimal policy is a policy that, for a given state, input, and output (structured prediction cost vector), always predicts the best action to take (one way to write this follows below).
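A hedged way to write this in the notation above: the optimal policy picks the action whose best reachable completion has the lowest cost.

$$ \pi^{*}(s, x, c) \;=\; \arg\min_{a} \;\; \min_{y \,\text{reachable from}\, (s,\, a)} c_{y} $$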
Optimal Policy
● SEARN uses the optimal policy to initialize the iterative process, and attempts to migrate toward a completely learned policy that generalizes well.
● SEARN assumes the existence of an optimal policy for the problem.
Algorithm
● π* is the optimal policy.
● Learn is a multi-class learner.
● The policy is initialized with the optimal policy. (Line 1)
● The algorithm then iterates for a number of iterations.
● In each iteration, it creates cost-sensitive examples using the current policy and learns a new classifier from them.
● The previous policy is then interpolated with the new policy (a rough sketch of the loop follows below).
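A compact sketch of this loop, assuming hypothetical helper functions (generate_path, costs_for_state, train_cost_sensitive) and a stochastic-mixture policy; this is a paraphrase of the algorithm's structure, not the reference implementation:

```python
import random

def searn(train_set, optimal_policy, iterations=5, beta=0.3):
    """SEARN main loop (sketch): start from the optimal policy and gradually
    shift probability mass toward learned classifiers."""
    policies, weights = [optimal_policy], [1.0]      # the mixture starts as pi*

    def current_policy(x, state):
        # Stochastic interpolation: sample one component policy per decision.
        p = random.choices(policies, weights=weights)[0]
        return p(x, state)

    for _ in range(iterations):
        examples = []
        for x, y in train_set:
            # One path per training example; each state yields one cost-sensitive example.
            for state in generate_path(current_policy, x):            # hypothetical helper
                costs = costs_for_state(current_policy, x, y, state)  # hypothetical helper
                examples.append((x, state, costs))
        h_new = train_cost_sensitive(examples)       # hypothetical multi-class cost-sensitive learner
        # Interpolation: the new classifier gets weight beta, the old mixture keeps (1 - beta).
        weights = [w * (1.0 - beta) for w in weights] + [beta]
        policies = policies + [h_new]
    return current_policy
```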
Comparison with other
approaches
Vs. Independent Classifiers
● The output structure is assumed to be decomposable, and each part is classified (predicted) individually.
● Cannot define features that span across the output structure.
● Even if previous predictions are taken into consideration, the result can be suboptimal.
● Limited to Hamming loss.
Vs. Perceptron algorithms
● Assume a tractable argmax operation.
● Generalize poorly (this can be mitigated by averaging the weights).
● Limited to only one loss function.
Vs. Global prediction algorithms
● Highly dependent on assumptions about the output structure (e.g., the Markov assumption).
● In comparison, SEARN is more general, limited neither to linear chains nor to Markov-style features.
● SEARN requires far weaker assumptions.
A brief introduction to Searn Algorithm
The SEARN algorithm can solve structured prediction problems under any model, any feature functions, and any loss function.
References
● Search-based Structured Prediction. Hal Daumé III, John Langford and
Daniel Marcu. Submitted to Machine Learning Journal, 2006.
● Practical Structured Learning Techniques for Natural Language
Processing. Hal Daumé III. PhD Thesis, 2006 (USC)
Thank you!
supun.14@cse.mrt.ac.lk