Mathematical Language Processing
via Tree Embeddings
Jack Wang, Andrew Lan, Richard Baraniuk
June 15, 2021
Mathematical Language Is Everywhere
textbooks
academic papers
Wikipedia articles
Difficult to extract and synthesize information from massive content
How to efficiently find relevant mathematical content?
The Mathematical Content Retrieval Problem
Difficult to extract and synthesize information from massive content
Desired: an efficient, automated system to aid in indexing, searching, and organizing
mathematical content
We focus on formula retrieval:
- Search for and retrieve similar equations, given a query equation
The Mathematical Content Retrieval Problem
Current search engines lack the ability to effectively search for mathematical content
Machine learning
The Mathematical Content Retrieval Problem
Current search engines lack the ability to effectively search for mathematical content
Query equation in a machine learning textbook
Search results contain only specific characters that match the input query, but NOT the entire equation
The Mathematical Content Retrieval Problem
Desired retrieval
Our Solution: Formula Representation via
Tree Embeddings
A novel framework that learns a good representation of mathematical formulae
Based on the encoder-decoder architecture
● A novel encoding scheme: equations as trees
● A novel decoding scheme: generate equations as trees
[Figure: formula → encoder → formula embedding → decoder → reconstructed formula; training minimizes the reconstruction loss]
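
As a rough illustration of the "equations as trees" idea, the sketch below turns a small arithmetic expression into an operator tree and lists its node labels in depth-first order. It is only a stand-in: it assumes Python's built-in `ast` parser as the front end and handles just `+ - * / **`, whereas the slides do not say how formulae are parsed. The names `Node`, `parse_formula`, and `preorder` are hypothetical and chosen for illustration.

```python
import ast
from dataclasses import dataclass, field


@dataclass
class Node:
    """One node of an operator tree: an operator, a variable, or a constant."""
    label: str
    children: list["Node"] = field(default_factory=list)


# Assumption: only these binary operators are supported in this toy front end.
_OPS = {ast.Add: "+", ast.Sub: "-", ast.Mult: "*", ast.Div: "/", ast.Pow: "^"}


def to_tree(expr: ast.expr) -> Node:
    """Convert a parsed Python expression into an operator tree."""
    if isinstance(expr, ast.BinOp):
        return Node(_OPS[type(expr.op)], [to_tree(expr.left), to_tree(expr.right)])
    if isinstance(expr, ast.Name):
        return Node(expr.id)
    if isinstance(expr, ast.Constant):
        return Node(str(expr.value))
    raise ValueError(f"unsupported expression: {ast.dump(expr)}")


def parse_formula(text: str) -> Node:
    """Parse a formula written in Python syntax into an operator tree."""
    return to_tree(ast.parse(text, mode="eval").body)


def preorder(node: Node) -> list[str]:
    """List node labels in depth-first (parent before children) order."""
    labels = [node.label]
    for child in node.children:
        labels.extend(preorder(child))
    return labels


tree = parse_formula("a*x**2 + b")
print(preorder(tree))  # ['+', '*', 'a', '^', 'x', '2', 'b']
```

A tree like this keeps operator precedence and operand grouping explicit, which a flat character sequence does not.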
Our Solution, part #1: Equation Encoding
Explicitly capture the semantic and syntactic information in an equation
Encoder (GRU)
Our Solution, part #1: Equation Encoding
Encoder (GRU)
The formula embedding that we will use in the formula retrieval task
Our Solution, part #1: Equation Encoding
Encoder (GRU)
After the encoding step
- Decode to recover the input formula tree, using the formula embedding
- Tree beam search to improve reconstruction quality
Formula Retrieval Experiment
- 18 query formulae
- Train (and search) on 770k equations
- Compute the embedding of all equations and queries
- Compute the cosine similarity between all equations and each query
- For each query, choose the top 25 most relevant equations
- Human evaluation: compute % of relevant equations for each query
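
The retrieval steps above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' code: the embeddings are toy 2-dimensional vectors, the relevance labels are invented for the example, and the function names `cosine_similarity`, `retrieve_top_k`, and `percent_relevant` are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def retrieve_top_k(query, corpus, k=25):
    """Return indices of the k corpus embeddings most similar to the query."""
    scored = [(cosine_similarity(query, emb), idx) for idx, emb in enumerate(corpus)]
    # Highest similarity first; ties broken by the lower index.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [idx for _, idx in scored[:k]]


def percent_relevant(retrieved, relevant):
    """Percentage of retrieved items that a human judged relevant."""
    if not retrieved:
        return 0.0
    hits = sum(1 for idx in retrieved if idx in relevant)
    return 100.0 * hits / len(retrieved)


# Toy data: four "equation" embeddings and one query embedding.
corpus = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]
query = [1.0, 0.05]

top = retrieve_top_k(query, corpus, k=2)
print(top)                                     # [0, 1]
print(percent_relevant(top, relevant={0, 2}))  # 50.0
```

In the experiment described above, `k` would be 25 and the corpus would hold the embeddings of all 770k equations; at that scale a vectorized or approximate nearest-neighbour search would likely replace the plain loop.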
Formula Retrieval Experiment
Formula Retrieval: Main Results
Our method outperforms the data-driven baseline
Formula Retrieval: Main Results
Our method achieves state-of-the-art when combined with Approach0
Formula Retrieval: Examples
Our method retrieves structurally and semantically more similar formulae
Learnt Formula Representation: t-SNE Example
Our method learns meaningful representations of different formulae
Summary
Framework to process equations via tree embeddings
- Novel encoder + decoder + beam search
- State-of-the-art formula retrieval performance
- Application to textbook math content search and beyond
Future work
- Joint math and text processing
- Deploy and pilot study at OpenStax
- Open-ended math solution feedback
Zhang et al. Math Operation Embeddings for Open-ended Solution Analysis and Feedback. To appear @EDM’21
https://arxiv.org/abs/2104.12047

Editor's Notes

  • #2: Hello, my name is Jack Wang, and today I am going to present my project on mathematical language processing.
  • #3: The question we focus on here is: how do we efficiently find relevant mathematical content?
  • #4: In this talk, I will primarily focus on formula retrieval as a representative problem. Namely, given a query equation, we would like to find the most relevant equations. You can think of this as a search engine such as Google, but devoted to mathematical formulae. The ability to search for formulae is useful for a number of education-related applications. For example, a student might want to search for relevant assessment questions given a query question, or for relevant content in a textbook given a query formula.
  • #5: Here is a concrete hypothetical example. Say you have a machine learning textbook and you are searching for relevant formulae given a query formula. Current search engines lack the ability to effectively search for formulae.
  • #6: If you look at the retrieval results, you will find that they contain specific components that match the query but not the entire formula. This observation suggests that we need a method that better captures the semantics of a math formula, so that a search engine can return the most relevant ones.
  • #7: For example, this retrieval result is a good match to the query.
  • #8: In this project, we present a solution from a representation learning perspective. The starting point is that we want to learn a good representation of math formulae, such that we can use this representation for the formula retrieval task. Our solution is a novel framework that processes math formulae in the form of trees. This is because every formula can be inherently represented as a tree structure, and by explicitly learning their tree representations, our framework retains the inherent properties of formulae and therefore improves the retrieval performance. More specifically, the framework contains 3 key components. The first component is a tree encoder, which encodes the formula in its tree format into a vector representation, or embedding. The second component is a generator, which reconstructs the input formula tree. The entire pipeline is optimized end-to-end by minimizing the reconstruction error between the input formula tree and the reconstructed formula tree.
  • #9: As I mentioned earlier, this step allows us to explicitly capture the semantic and syntactic information in an equation.
  • #10: This embedding is what we will use for the formula retrieval task.
  • #11: To complete the pipeline, after the encoding step, we use a decoder that reconstructs the input formula in its tree format. To improve reconstruction quality, we also develop a beam search algorithm specifically for tree-structured data. I’ll skip the technical details, but you can find them in the paper.
  • #12: We validate our framework on a formula retrieval task. In this task, we have 18 query formulae.
  • #13: Here are some examples of queries. You can see that they are diverse in appearance and subject domain.
  • #14: First of all, we can observe that our method outperforms the other data-driven baseline on both metrics.
  • #15: So we develop a new method that combines the strengths of both our method and Approach0. We can see that this method achieves state-of-the-art performance on this formula retrieval task.
  • #16: We can see that our method retrieves equations that are semantically and structurally more similar to the query, whereas the TangentCFT baseline fails to do so in some cases.
  • #17: I also want to visualize the learnt formula representations. Here, we choose a small number of formulae from different math topics and plot their 2-dimensional t-SNE embeddings. We can see that these embeddings form nice clusters, which indicates that our model learns meaningful representations of these formulae.
  • #18: And finally, we can apply our method to analyze students' step-wise answers to open-ended math questions. We have a paper that is going to appear at the Educational Data Mining conference later this month. The arXiv version is already out. If you are interested, you are welcome to check out the paper and attend our talk at EDM to learn more. Thanks.