Erlangen
Artificial Intelligence &
Machine Learning Meetup
presents
AI Applications In Education
Hi, I am Pascal Zoleko
My Projects

Flexudy
PR & AI
For Education
Study
Work
PR & AI
For Privacy
People Analytics
Artificial Intelligence &
Pattern Recognition
Problems we want to solve
1. Too much to read
2. Too long to read
3. Abstracts are sometimes too bold.
4. Abstracts are sometimes too vague.
5. Abstracts are not available for all kinds of text documents (e.g. web pages).
Some students (learners) … :
6. Read and forget
7. Can’t continuously evaluate their knowledge on a subject.
8. Can’t revise while on the train, bus etc.
Flexudy

Education
Today.
Automatic Text 

Summarisation
Simple Question

Generation
Demo Video
NLP
Ranking
Reinforcement

Learning
Rules
A simple overview
Good enough to give

an idea about the text
Deep

Learning
Fill in the blanks
Simple but useful to

Remember the keywords

found in a text.
We won’t have enough time for this, but we can discuss it.
Text Summarisation
Extractive: simpler. Selects the relevant phrases.
Abstractive: harder. Can generate phrases not found in the text.
Automatic Extractive 

Summarisation
Existing Solutions
TextRank
LexRank
LSA
GitHub
Reinforcement Learning
Ranking Algorithms
Natural Language Processing
+
Automatic Extractive 

Summarisation
Actually: stochastic optimisation with cross-entropy
- model-free
- policy-based
The Summariser Pipeline.
“Let AI do all the work

and then reap the fruits

of its labour.”
Step by Step
I will avoid technical terms as much as possible.

I made no assumption about the audience.

So, no Maths!
Summary generation algorithm. Easy, but not trivial.
1. Get user text to be summarised
2. For each sentence in the text
3. Decide if sentence should be added to the summary.
4. If yes, then append the sentence to the summary
5. Format the summary and return
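The five steps above can be sketched as a short loop. This is only an illustration under stated assumptions: the regex sentence splitter, the bullet formatting and the `keep_long` decision rule are hypothetical stand-ins, and in the real pipeline the decision would come from the trained agent.

```python
import re
from typing import Callable, List


def split_sentences(text: str) -> List[str]:
    """Naive sentence splitter (assumption): break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


def summarise(text: str, keep: Callable[[str], bool]) -> str:
    """Steps 1-5: visit every sentence, keep the selected ones, format the result."""
    summary: List[str] = []
    for sentence in split_sentences(text):   # step 2
        if keep(sentence):                   # step 3
            summary.append(sentence)         # step 4
    return "\n".join(f"- {sentence}" for sentence in summary)  # step 5


def keep_long(sentence: str) -> bool:
    """Placeholder decision rule (not the trained agent): keep sentences with 5+ words."""
    return len(sentence.split()) >= 5


if __name__ == "__main__":
    print(summarise("Short one. This sentence has more than five words. Tiny.", keep_long))
```

Passing the decision as a function keeps the loop unchanged when the placeholder rule is swapped for a learned model.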
How do we train our summariser?
Reinforcement Learning - Cross-Entropy
Can be improved by using other state-of-the-art (SOTA) algorithms, e.g. Deep Q-Networks.
First, a quick recap.
The central idea behind Reinforcement Learning:
1. Agent
2. Reward
3. Environment
The agent receives an observation from the environment and responds with an action.
Money & Environment icon made by Freepik from www.flaticon.com
How does it translate to our use case?
Agent: a trainable non-linear function
Environment: the original text
Observation: sentence features
Action: a prediction in [0, 1]
Reward: a score
Next observation: the next sentence
Note: although the environment is fully observable, we decide to observe sentences one at a time.
Can easily be improved by observing many at a time.
We need data to train the agent.
Gutenberg
arXiv.org
Wikipedia
Broad corpus for

higher coverage.
~50% of our development time.
Data is handpicked

From different domains:



Biology, History,

Physics, Psychology etc.
Our current (English) implementation used ~300 documents.
There is obviously a lot of overfitting.
The next release will be trained on a lot more data.
Then prepare the data.
Generate random chunks of text: [25K - 50K] characters.
Chunks are kept small to keep training episodes short. Better for RL with cross-entropy.
Cheap data augmentation. If there are few documents, as in our case (e.g. ~300), then we obtain overfitting.
[Figure: documents of 30k <= y <= ~400k chars are cut into random chunks of 25k <= x <= 50k chars (e.g. 28K chars, 45.5K chars, …). We generated 12K chunks in our case.]
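The chunk generation described above could look roughly like the sketch below. It assumes chunks are random, possibly overlapping character windows; the function name `random_chunks` and the choice to cut at arbitrary character offsets (rather than at sentence boundaries) are assumptions, not details taken from the slides.

```python
import random
from typing import List, Optional


def random_chunks(
    document: str,
    n_chunks: int,
    min_len: int = 25_000,
    max_len: int = 50_000,
    rng: Optional[random.Random] = None,
) -> List[str]:
    """Draw n_chunks random windows of min_len..max_len characters from one document."""
    rng = rng or random.Random()
    if len(document) < min_len:
        return []  # document too short to yield a valid chunk
    chunks: List[str] = []
    for _ in range(n_chunks):
        # Chunk length is random, but never longer than the document itself.
        length = rng.randint(min_len, min(max_len, len(document)))
        start = rng.randint(0, len(document) - length)
        chunks.append(document[start:start + length])
    return chunks
```

Because windows may overlap, a few hundred documents can yield thousands of chunks, which is what makes this a cheap form of data augmentation.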
Training step by step.
For each chunk in the batch:
1. Start a new training episode.
2. Tokenise and extract sentences.
3. For each sentence (a sentence represents a step in RL jargon):
   - Extract sentence features.
   - The agent observes them and makes a decision: add sentence to summary? YES / NO
   - If YES, get a reward. « Reward is accumulated »
4. If no more sentences: save all episode steps and the final reward.
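One training episode, as described above, can be sketched as follows. The policy `p_yes` (returning the probability of YES for a feature vector) and the `reward` function (returning the reward for selecting the sentence at a given index) are hypothetical callables supplied by the caller; the sketch only shows how steps and the accumulated reward are recorded.

```python
import random
from typing import Callable, List, Sequence, Tuple

# One step: (sentence features, action), where action 1 = YES and 0 = NO.
Step = Tuple[Sequence[float], int]


def run_episode(
    sentence_features: Sequence[Sequence[float]],
    p_yes: Callable[[Sequence[float]], float],
    reward: Callable[[int], float],
    rng: random.Random,
) -> Tuple[List[Step], float]:
    """Play one chunk: decide YES/NO per sentence and accumulate the reward."""
    steps: List[Step] = []
    total_reward = 0.0
    for index, features in enumerate(sentence_features):
        # Random decision, weighted by the policy's probability of YES.
        action = 1 if rng.random() < p_yes(features) else 0
        if action == 1:
            total_reward += reward(index)  # reward is only collected on YES
        steps.append((features, action))
    # No more sentences: return all episode steps and the final reward.
    return steps, total_reward
```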
Feature Extraction
Part of speech ratios
Dependency ratios
Word Embeddings: We use GloVe. Could BERT be better? We will try that soon.
Sentence position in document
Ratio of skipped sentences



Etc. Be creative.
Possible Improvement: 

SOTA sentence embeddings, more complex features (minimising the similarity of sentences)
Named Entity Recognition
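A few of the features listed above can be computed without any model. The sketch below assumes the part-of-speech tags are already available from a tagger, uses an illustrative subset of tags, and defines the ratio of skipped sentences as "skipped so far divided by sentences seen so far"; all of these choices, and the function names, are assumptions.

```python
from collections import Counter
from typing import Dict, List, Sequence

POS_TAGS = ("NOUN", "VERB", "ADJ")  # illustrative subset of tags (assumption)


def pos_ratios(tags: Sequence[str]) -> Dict[str, float]:
    """Share of each tracked part-of-speech tag among the sentence's tokens."""
    counts = Counter(tags)
    total = len(tags) or 1  # avoid division by zero for empty sentences
    return {tag: counts[tag] / total for tag in POS_TAGS}


def sentence_features(
    tags: Sequence[str], index: int, n_sentences: int, n_skipped: int
) -> List[float]:
    """Feature vector: POS ratios, relative position, ratio of skipped sentences."""
    ratios = pos_ratios(tags)
    position = index / max(n_sentences - 1, 1)        # 0.0 = first, 1.0 = last
    skipped = n_skipped / index if index > 0 else 0.0  # skipped so far / seen so far
    return [ratios[tag] for tag in POS_TAGS] + [position, skipped]
```

Dependency ratios, word embeddings and named entities would be appended to the same vector in the same way.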
Decision Making
Extract sentence

features
Agent observes
Random
Choice
Add sentence to summary ?
YES / NO
1. Decisions are always random: Yes or No (1 or 0) with probabilities P(Decision = 1) and P(Decision = 0) respectively.
2. Probabilities are based on softmax predictions.
3. In early episodes, softmax predictions are arbitrary.
4. We use a fully connected (FC) neural network: five FC layers, each with a high dropout probability to minimise overfitting.
Possible Improvement: 

Sequence models, 1D Convolutional Neural Networks
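The random decision described above can be sketched with plain Python. The two-logit layout `[logit_no, logit_yes]` and the function names are assumptions; in practice the logits would come from the network's output layer.

```python
import math
import random
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax: subtract the maximum before exponentiating."""
    highest = max(logits)
    exps = [math.exp(x - highest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def decide(logits: Sequence[float], rng: random.Random) -> int:
    """Sample 1 (YES) with probability P(Decision = 1), else 0 (NO)."""
    p_no, p_yes = softmax(logits)
    return 1 if rng.random() < p_yes else 0


if __name__ == "__main__":
    rng = random.Random(42)
    # With equal logits both decisions are equally likely, as in early episodes.
    print([decide([0.0, 0.0], rng) for _ in range(10)])
```

Sampling instead of always taking the most probable action is what lets the agent explore different summaries for the same chunk.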
Reward
Rewards are positive or negative:
If YES, get a reward. « Reward is accumulated »
Positive if constraints are met.
Otherwise negative.
How are rewards computed?
With the TextRank algorithm.
We forked SummaNLP’s implementation and modified it to our needs.
What are the constraints?
The number of selected sentences S should not exceed an integer M, with M <= total number of sentences.
M is the theoretical maximum number of sentences in any generated summary. In our case, M = 20.
For example: if a sentence with score x is selected for the summary (i.e. yes is predicted) but S >= M, then the reward is -x instead of x.
In other words, we punish the agent for exceeding the upper bound.
Possible Improvement: 

Try different algorithms, e.g LexRank. Combine algorithms. Manually rank sentences.
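A minimal sketch of the constraint above, assuming S counts the sentences already in the summary before the current one and that x is a non-negative TextRank-style score. The name `step_reward` is hypothetical.

```python
def step_reward(score: float, n_selected: int, max_sentences: int = 20) -> float:
    """Reward for answering YES on a sentence with the given score.

    score: x, the sentence's ranking score (assumed non-negative).
    n_selected: S, sentences already selected before this one (assumption).
    max_sentences: M, the upper bound on summary length.
    """
    if n_selected >= max_sentences:
        return -score  # punish the agent for exceeding the upper bound
    return score


if __name__ == "__main__":
    print(step_reward(0.3, n_selected=5))   # within the bound: positive reward
    print(step_reward(0.3, n_selected=20))  # bound reached: negative reward
```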
The steps are repeated for every sentence 

and every chunk in the batch.
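[Figure: each chunk is one episode (Episode 1, Episode 2, …, Episode j) and each sentence is one step (Step 1, Step 2, Step 3, …, Step k). The score of an episode is the sum of the scores collected in its steps, $\text{Score}(E) = \sum_{i=1}^{K} \text{score}_i$, e.g. over s1, s3, …, sK for Episode 1.]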
The learning step.
1. Select the episodes with the best scores,
i.e. episodes with scores at least as high as some p-th percentile.
We chose p = 90 based on our empirical analysis.
2. Train the agent on the elite episodes.
The elite episodes are our new “ground truth”.
Note: the score is not fed into the neural network (agent).
The score is no longer needed at inference time.
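The elite selection above can be sketched as a percentile filter. This sketch uses a nearest-rank percentile and represents an episode as a `(steps, score)` pair, where each step is a `(features, action)` pair; both the representation and the function names are assumptions.

```python
import math
from typing import List, Sequence, Tuple

Step = Tuple[Sequence[float], int]   # (sentence features, action)
Episode = Tuple[List[Step], float]   # (steps, episode score)


def percentile(values: Sequence[float], p: float) -> float:
    """Nearest-rank p-th percentile of a non-empty sequence."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def select_elite(episodes: Sequence[Episode], p: float = 90) -> Tuple[List[Step], float]:
    """Keep the steps of episodes scoring at least the p-th percentile."""
    bound = percentile([score for _, score in episodes], p)
    training_pairs = [
        step
        for steps, score in episodes
        if score >= bound
        for step in steps
    ]
    # Only (features, action) pairs are returned: the score itself is not a model input.
    return training_pairs, bound
```

The returned pairs serve as supervised targets for a cross-entropy loss between the agent's predicted action probabilities and the elite actions.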
[Plots: loss, reward bound and reward mean during training. Annotated phases: the agent is careless; the agent is shy; the agent has learned from experience.]
But wait, aren’t we just implicitly learning the TextRank scoring algorithm?
Yes, but:
1. The model does not depend on vocabulary.
2. Transfer learning can be used to improve the agent:
- For a particular use case or in general.
- By simply changing the scoring function when training on new data.
3. The pipeline is flexible.
- Easily integrate new algorithms and architectures.
4. In practice, summaries are usually generated faster.
An honest example: Summarise this page
https://en.wikipedia.org/wiki/Cross_entropy
An honest example: TextRank results - 17 sentences
- In information theory, the cross entropy between two probability distributions p {displaystyle p} p and q {displaystyle q} q over the same underlying set …

- The cross entropy of the distribution q {displaystyle q} q relative to a distribution p {displaystyle p} p over a given set is defined as follows:

- The definition may be formulated using the Kullback–Leibler divergence D K L ( p ‖ q ) {displaystyle D_{mathrm {KL} }(p|q)} D_{{{mathrm {KL}}}}(p|q) of … 

- For discrete probability distributions p {displaystyle p} p and q {displaystyle q} q with the same support X {displaystyle {mathcal {X}}} {mathcal {X}} …

- H ( p , q ) = − ∑ x ∈ X p ( x ) log q ( x ) {displaystyle H(p,q)=-sum _{xin {mathcal {X}}}p(x),log q(x)} {displaystyle H(p,q)=-sum _{xin {mathcal {X}}} …

- H ( p , q ) = − ∫ X P ( x ) log Q ( x ) d r ( x ) {displaystyle H(p,q)=-int _{mathcal {X}}P(x),log Q(x),dr(x)} {displaystyle H(p,q)=-int _{mathcal {X}}P(x), …

- Therefore, cross entropy can be interpreted as the expected message-length per datum when a wrong distribution q {displaystyle q} q is assumed while …

- That is why the expectation is taken over the true probability distribution p {displaystyle p} p and not q {displaystyle q} q.

- There are many situations where cross-entropy needs to be measured but the distribution of p {displaystyle p} p is unknown.

- This is a Monte Carlo estimate of the true cross entropy, where the test set is treated as samples from p ( x ) {displaystyle p(x)} p(x)[citation needed].

- If the estimated probability of outcome I {displaystyle I} I is q I {displaystyle q_{I}} q_{I}, while the frequency (empirical probability) of outcome I …

- 1 N log ∏ I q I N p I = ∑ I p I log q I = − H ( p , q ) {displaystyle {frac {1}{N}}log prod _{I}q_{I}^{Np_{I}}=sum _{I}p_{I}log q_{I}=-H(p,q)} {displaystyle …

- When comparing a distribution q {displaystyle q} q against a fixed reference distribution p {displaystyle p} p, cross entropy and KL divergence are …

- This has led to some ambiguity in the literature, with some authors attempting to resolve the inconsistency by redefining cross-entropy to be D K L ( p …

- The true probability p I {displaystyle p_{I}} p_{I} is the true label, and the given distribution q I {displaystyle q_{I}} q_{I} is the predicted value of the …

- Having set up our notation, p ∈ { y , 1 − y } {displaystyle pin {y,1-y}} pin {y,1-y} and q ∈ { y ^ , 1 − y ^ } {displaystyle qin {{hat {y}},1-{hat {y}}}} …

- J ( w ) = 1 N ∑ n = 1 N H ( p n , q n ) = − 1 N ∑ n = 1 N [ y n log y ^ n + ( 1 − y n ) log ( 1 − y ^ n ) ] , {displaystyle {begin{aligned}J(mathbf {w} ) …

{aligned}J(mathbf {w} ) &= {frac {1}{N}}sum _{n=1}^{N}H(p_{n},q_{n}) = -{frac {1}{N}}sum _{n=1}^{N} {bigg [}y_{n}log {hat {y}}_{n}+(1-y_{n})log( …
An honest example: Flexudy results - 12 sentences
- In information theory, the cross entropy between two probability distributions p {displaystyle p} p and q {displaystyle q} q over the …



- In information theory, the Kraft–McMillan theorem establishes that any directly decodable coding scheme for coding a message to …

- Therefore, cross entropy can be interpreted as the expected message-length per datum when a wrong distribution q {displaystyle …

- An example is language modeling, where a model is created based on a training set T {displaystyle T} T, and then its cross-entropy …

- In this example, p {displaystyle p} p is the true distribution of words in any corpus, and q {displaystyle q} q is the distribution of …

- In these cases, an estimate of cross-entropy is calculated using the following formula:



H ( T , q ) =

- displaystyle N} N. This is a Monte Carlo estimate of the true cross entropy, where the test set is treated as samples from p ( x ) …



- Cross-entropy minimization



Cross-entropy minimization is frequently used in optimization and rare-event probability estimation; see the cross-entropy method.





- This has led to some ambiguity in the literature, with some authors attempting to resolve the inconsistency by redefining cross- …

- Cross entropy can be used to define a loss function in machine learning and optimization.

- The output of the model for a given observation, given a vector of input features x {displaystyle x} x, can be interpreted as a …

- The typical cost function that one uses in logistic regression is computed by taking the average of all cross-entropies in the sample.
Is Flexudy’s current implementation better than TextRank?
We cannot tell. We do not yet have evidence to support such a claim.
// TODO - Evaluate Flexudy with BLEU and ROUGE scores
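For the evaluation mentioned in the TODO, a ROUGE-1 score could be computed along these lines. This is a simplified sketch, assuming lowercase whitespace tokenisation and a single reference summary; it is not the official ROUGE toolkit, and the function name `rouge_1` is hypothetical.

```python
from collections import Counter
from typing import Tuple


def rouge_1(candidate: str, reference: str) -> Tuple[float, float, float]:
    """Unigram-overlap precision, recall and F1 between two summaries."""
    cand_counts = Counter(candidate.lower().split())
    ref_counts = Counter(reference.lower().split())
    # Clipped overlap: each word counts at most as often as it appears in both.
    overlap = sum((cand_counts & ref_counts).values())
    precision = overlap / max(sum(cand_counts.values()), 1)
    recall = overlap / max(sum(ref_counts.values()), 1)
    if overlap == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


if __name__ == "__main__":
    print(rouge_1("the cat sat", "the cat sat on the mat"))
```

Such a score needs reference summaries written by humans, which the slides do not mention having yet.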
An honest example II: Summarise this page
https://en.wikipedia.org/wiki/Renaissance
An honest example II: Flexudy results - 11 sentences
- The School of Athens (1509–1511), Raphael

Topics



Humanism Age of Discovery Architecture Dance Fine arts

- Depicting the Hebrew prophet-prodigy-king David as a muscular Greek athlete, the Christian humanist ideal can be seen in the ..





- REN-ə-sahnss)[2][a] was a period in European history marking the transition from the Middle Ages to Modernity and covering …

- In addition to the standard periodization, proponents of a long Renaissance put its beginning in the 14th century and its end in the 17th …

- The traditional view focuses more on the early modern aspects of the Renaissance and argues that it was a break from the past, …



The intellectual basis of the Renaissance was its version of humanism, derived from the concept of Roman Humanitas and the rediscovery …

- Early examples were the development of perspective in oil painting and the recycled knowledge of how to make concrete.

- Although the invention of metal movable type sped the dissemination of ideas from the later 15th century, the changes of the Renaissance …





- As a cultural movement, the Renaissance encompassed innovative flowering of Latin and vernacular literatures, beginning with the …

- In politics, the Renaissance contributed to the development of the customs and conventions of diplomacy, and in science to an …

- Various theories have been proposed to account for its origins and characteristics, focusing on a variety of factors including the …

- Other major centres were northern Italian city-states such as Venice, Genoa, Milan, Bologna, and finally Rome during the …

The first 2 sentences make absolutely no sense
Hence, there is still a lot of 

work to do.
Future work
1. Try new architectures and algorithms, e.g. 1D convolutions.
2. Support formulas, e.g. mathematics:
Combine Reinforcement Learning and Logic (Symbolic AI).
3. Manual annotation to improve sentence selection.
4. Collect more data.
5. Use SOTA sentence embeddings.
6. Improve sentence boundary detection algorithms.
7. Implement co-reference resolution to deal with pronouns.
References
1. Deep Reinforcement Learning Hands-On, by Maxim Lapan
2. A survey automatic text summarization, by Oguzhan Tas & Farzad Kiyani
3. Deep Transfer Reinforcement Learning for Text Summarization, by Yaser Keneshloo, Naren Ramakrishnan & Chandan K. Reddy
4. Variations of the Similarity Function of TextRank for Automated Summarization, by Federico Barrios, Federico López, Luis Argerich & Rosa Wachenchauzer
5. Natural language understanding with Bloom embeddings, convolutional neural networks and incremental parsing, by Matthew Honnibal & Ines Montani
To learn more about the meetup, follow the link:
https://www.meetup.com/Erlangen-Artificial-Intelligence-Machine-Learning-Meetup


AI applications in education, Pascal Zoleko, Flexudy

  • 2. AI Applications In Education
  • 3. Hi, I am Pascal Zoleko My Projects Flexudy PR & AI For Education Study Work PR & AI For Privacy People Analytics Artificial Intelligence & Pattern Recognition
  • 4. Problems we want to solve 1. Too much to read 2. Too long to read 3. Abstracts are sometime too bold. 4. Abstracts are sometimes too vague. 5. Abstracts are not available for all kinds of text documents. (Web page) Some students (learners) … : 6. Read and forget 7. Can’t continuously evaluate their knowledge on a subject. 8. Can’t revise while on the train, bus etc.
  • 5. Flexudy Education Today. Automatic Text Summarisation Simple Question Generation Demo Video NLP Ranking Reinforcement Learning Rules A simple overview Good enough to give an idea about the text Deep Learning Fill in the blanks Simple but useful to Remember the keywords found in a text. We won’t have enough time for this. But we can discuss about it.
  • 6. Text Summarisation Extractive Abstractive Simpler Select the relevant phrases Harder Can generate phrases not found in the text
  • 7. Automatic Extractive Summarisation Existing Solutions Text rank Lexrank LSA GitHub
  • 9. The Summariser Pipeline. “Let AI do all the work and then reap the fruits of its labour.” Step by Step I will avoid technical terms as much as possible. I made no assumption about the audience. So, no Maths!
  • 10. Summary generation algorithm. Easy, but not trivial. 1. Get user text to be summarised 2. For each sentence in the text 3. Decide if sentence should be added to the summary. 4. If yes, then append the sentence to the summary 5. Format the summary and return How do we train our a summariser ? Reinforcement Learning - Cross-Entropy Can be improved by using other state of the SOTA algorithms, e.g Deep Q-Networks.
  • 11. First, a quick recap. 1. Agent Money & Environment icon made by Freepik from www.flaticon.com 2. Reward The central idea behind Reinforcement Learning 3. Environment Observation Actions
  • 12. Trainable Non-linear function Reward How does it translate to our use case ? Original Text Sentence Features Prediction [0, 1] Score The next sentence First, a quick recap. Note: although the environment is fully observable, we decide to observe, sentences one at a time. Can easily be improved: By observing many at a time.
  • 13. We need data to train the agent. Gutenberg arXiv.org wikipedia Broad corpus for higher coverage. ~50% of our development time. Data is handpicked From different domains: Biology, History, Physics, Psychology etc. Our current (English) implementation used ~300 documents. There is a lot of overfitting obviously. Next release will be trained on a lot more data.
  • 14. Then prepare the data. Generate random Chunks of text: [25K - 50K] characters Chunks are kept small to keep training episodes short. Better for RL with Cross-entropy Cheap data augmentation. If there are few documents like in our case (e.g ~300) then we obtain overfitting. 28K chars 45.5K chars 25k <= x <= 50k chars … chars We generated 12K chunks in our case 30k <= y <= ~400k chars
  • 15. Training step by step. Start a new training episode. Tokenise and extract sentences Extract sentence features For each chunk in batch For sentence Agent observes A sentence represents a step in RL jargon. Makes a Decision Add sentence to summary ? YES / NO If YES get a reward « Reward is accumulated « If no more Sentences Save all episode steps and final reward. Money & Environment icon made by Freepik from www.flaticon.com
  • 16. Feature Extraction Part of speech ratios Dependency ratios Word Embeddings: We use Glove. Could BERT be better ? We will try that soon. Sentence position in document Ratio of skipped sentences Etc. Be creative. Possible Improvement: SOTA sentence embeddings, more complex features (minimising the similarity of sentences) Named Entity Recognition
  • 17. Decision Making Extract sentence features Agent observes Random Choice Add sentence to summary ? YES / NO 1. Decisions are always random: Yes or No (1 or 0) with probabilities P(Decision = 1) and P(Decision = 0) respectively 2. Probabilities are based on Softmax predictionsoftmax 3. In early, episodes, softmax prediction are arbitrary. 4. We use a fully connected (FC) Neural Network. Five FC Layers each with high dropout probability. to minimise overfitting. Possible Improvement: Sequence models, 1D Convolutional Neural Networks
  • 18. Reward Rewards are positive and negative: If YES get a reward « Reward is accumulated « Positive if constraints are met. Otherwise negative. How are rewards computed ? With the Textrank algorithm. We forked SummaNLP’s Implementation and modified it to our needs. What are the constraints ? Number of sentences selected S should not exceed an integer M. With M <= total number of sentences. M is the theoretical maximum number of sentences in any generated summary. In our case, M = 20. For example: If a sentence with score x is selected for the summary (i.e yes is predicted), but S >= M then x = -x . In other words, we punish the agent for exceeding the upper bound. Possible Improvement: Try different algorithms, e.g LexRank. Combine algorithms. Manually rank sentences. Money & Environment icon made by Freepik from www.flaticon.com
  • 19. The steps are repeated for every sentence and every chunk in the batch. Step 1 Step 2 Step 3 Step k … Episode 1 Step 1 Episode 2 … Step 1 Episode j … Sentences Chunks Score E1 ∑ ∑ ∑ score1: s1 s3 score i sK s1 s1 i = 1 K Score E2 Score E3 s2
  • 20. The learning step. 1. Select the episodes with the best scores. i.e episodes with scores at least as high as some p-th percentile. We chose 90 based on our empirical analysis. 2. Train the agent, on the elite episodes. … Our new “ground truth” { Note: The score is not fed into the Neural Network (Agent) The score is no longer need at inference time.
  • 21. Loss Reward bound Reward mean The agent is careless. The agent is shy. The agent has learned from experience.
  • 22. But wait, aren’t we just implicitly learning the TextRank scoring algorithm ?
  • 23. Yes, but: 1. The model does not depend on vocabulary. 2. Transfer learning can be used to improve the agent: - For particular a use case or in general. - By simply changing the scoring function when training on new data. 3. The pipeline is flexible. - Easily integrate new algorithms and architectures. 4. In practice, summaries a usually generated faster.
  • 24. An honest example: Summarise this page https://en.wikipedia.org/wiki/Cross_entropy
  • 25. An honest example: TextRank results - 17 sentences - In information theory, the cross entropy between two probability distributions p {displaystyle p} p and q {displaystyle q} q over the same underlying set … - The cross entropy of the distribution q {displaystyle q} q relative to a distribution p {displaystyle p} p over a given set is defined as follows: - The definition may be formulated using the Kullback–Leibler divergence D K L ( p ‖ q ) {displaystyle D_{mathrm {KL} }(p|q)} D_{{{mathrm {KL}}}}(p|q) of … - For discrete probability distributions p {displaystyle p} p and q {displaystyle q} q with the same support X {displaystyle {mathcal {X}}} {mathcal {X}} … - H ( p , q ) = − ∑ x ∈ X p ( x ) log q ( x ) {displaystyle H(p,q)=-sum _{xin {mathcal {X}}}p(x),log q(x)} {displaystyle H(p,q)=-sum _{xin {mathcal {X}}} … - H ( p , q ) = − ∫ X P ( x ) log Q ( x ) d r ( x ) {displaystyle H(p,q)=-int _{mathcal {X}}P(x),log Q(x),dr(x)} {displaystyle H(p,q)=-int _{mathcal {X}}P(x), … - Therefore, cross entropy can be interpreted as the expected message-length per datum when a wrong distribution q {displaystyle q} q is assumed while … - That is why the expectation is taken over the true probability distribution p {displaystyle p} p and not q {displaystyle q} q. - There are many situations where cross-entropy needs to be measured but the distribution of p {displaystyle p} p is unknown. - This is a Monte Carlo estimate of the true cross entropy, where the test set is treated as samples from p ( x ) {displaystyle p(x)} p(x)[citation needed]. 
- If the estimated probability of outcome I {displaystyle I} I is q I {displaystyle q_{I}} q_{I}, while the frequency (empirical probability) of outcome I … - 1 N log ∏ I q I N p I = ∑ I p I log q I = − H ( p , q ) {displaystyle {frac {1}{N}}log prod _{I}q_{I}^{Np_{I}}=sum _{I}p_{I}log q_{I}=-H(p,q)} {displaystyle … - When comparing a distribution q {displaystyle q} q against a fixed reference distribution p {displaystyle p} p, cross entropy and KL divergence are … - This has led to some ambiguity in the literature, with some authors attempting to resolve the inconsistency by redefining cross-entropy to be D K L ( p … - The true probability p I {displaystyle p_{I}} p_{I} is the true label, and the given distribution q I {displaystyle q_{I}} q_{I} is the predicted value of the … - Having set up our notation, p ∈ { y , 1 − y } {displaystyle pin {y,1-y}} pin {y,1-y} and q ∈ { y ^ , 1 − y ^ } {displaystyle qin {{hat {y}},1-{hat {y}}}} … - J ( w ) = 1 N ∑ n = 1 N H ( p n , q n ) = − 1 N ∑ n = 1 N [ y n log y ^ n + ( 1 − y n ) log ( 1 − y ^ n ) ] , {displaystyle {begin{aligned}J(mathbf {w} ) … {aligned}J(mathbf {w} ) &= {frac {1}{N}}sum _{n=1}^{N}H(p_{n},q_{n}) = -{frac {1}{N}}sum _{n=1}^{N} {bigg [}y_{n}log {hat {y}}_{n}+(1-y_{n})log( …
  • 26. An honest example: Flexudy results - 12 sentences - In information theory, the cross entropy between two probability distributions p {displaystyle p} p and q {displaystyle q} q over the … - In information theory, the Kraft–McMillan theorem establishes that any directly decodable coding scheme for coding a message to … - Therefore, cross entropy can be interpreted as the expected message-length per datum when a wrong distribution q {displaystyle … - An example is language modeling, where a model is created based on a training set T {displaystyle T} T, and then its cross-entropy … - In this example, p {displaystyle p} p is the true distribution of words in any corpus, and q {displaystyle q} q is the distribution of … - In these cases, an estimate of cross-entropy is calculated using the following formula: H ( T , q ) = - displaystyle N} N. This is a Monte Carlo estimate of the true cross entropy, where the test set is treated as samples from p ( x ) … - Cross-entropy minimization Cross-entropy minimization is frequently used in optimization and rare-event probability estimation; see the cross-entropy method. - This has led to some ambiguity in the literature, with some authors attempting to resolve the inconsistency by redefining cross- … - Cross entropy can be used to define a loss function in machine learning and optimization. - The output of the model for a given observation, given a vector of input features x {displaystyle x} x, can be interpreted as a … - The typical cost function that one uses in logistic regression is computed by taking the average of all cross-entropies in the sample.
  • 27. Is Flexudy’s current implementation better than TextRank?
  • 28. We cannot tell. We do not yet have evidence to support such a claim. // TODO - Evaluate Flexudy with BLEU and ROUGE scores
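The evaluation in the TODO could start from something like the sketch below. It is not Flexudy's evaluation code: it is a minimal, hypothetical ROUGE-N implementation (clipped n-gram overlap between a generated summary and a human-written reference summary, which is assumed to exist). The names `tokenize` and `rouge_n` and the naive regex tokenizer are illustrative assumptions. A real comparison against TextRank would more likely rely on an established ROUGE/BLEU package and many reference summaries.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split into word tokens (deliberately naive; an assumption)."""
    return re.findall(r"\w+", text.lower())


def rouge_n(candidate, reference, n=1):
    """Return (precision, recall, f1) of n-gram overlap between two texts."""
    def ngrams(tokens):
        # Count every contiguous n-gram in the token list.
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    cand = ngrams(tokenize(candidate))
    ref = ngrams(tokenize(reference))
    # Counter intersection keeps the minimum count per n-gram (clipped matches).
    overlap = sum((cand & ref).values())
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return precision, recall, f1


if __name__ == "__main__":
    # Hypothetical summaries, only to show how the function is called.
    generated = "the cat sat on the mat"
    reference = "the cat lay on the mat"
    print(rouge_n(generated, reference, n=1))
    print(rouge_n(generated, reference, n=2))
```

Running the same function on Flexudy and TextRank summaries of the same documents, against the same references, would give a first, rough basis for comparison.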
  • 29. An honest example II: Summarise this page https://en.wikipedia.org/wiki/Renaissance
  • 30. An honest example II: Flexudy results - 11 sentences
    - The School of Athens (1509–1511), Raphael Topics Humanism Age of Discovery Architecture Dance Fine arts
    - Depicting the Hebrew prophet-prodigy-king David as a muscular Greek athlete, the Christian humanist ideal can be seen in the ..
    - REN-ə-sahnss)[2][a] was a period in European history marking the transition from the Middle Ages to Modernity and covering …
    - In addition to the standard periodization, proponents of a long Renaissance put its beginning in the 14th century and its end in the 17th …
    - The traditional view focuses more on the early modern aspects of the Renaissance and argues that it was a break from the past, … The intellectual basis of the Renaissance was its version of humanism, derived from the concept of Roman Humanitas and the rediscovery …
    - Early examples were the development of perspective in oil painting and the recycled knowledge of how to make concrete.
    - Although the invention of metal movable type sped the dissemination of ideas from the later 15th century, the changes of the Renaissance …
    - As a cultural movement, the Renaissance encompassed innovative flowering of Latin and vernacular literatures, beginning with the …
    - In politics, the Renaissance contributed to the development of the customs and conventions of diplomacy, and in science to an …
    - Various theories have been proposed to account for its origins and characteristics, focusing on a variety of factors including the …
    - Other major centres were northern Italian city-states such as Venice, Genoa, Milan, Bologna, and finally Rome during the …
    The first 2 sentences make absolutely no sense.
  • 31. Hence, there is still a lot of work to do.
  • 32. Future work
    1. Try new architectures and algorithms, e.g. 1D convolutions.
    2. Support formulas, e.g. mathematics: combine reinforcement learning and logic (symbolic AI).
    3. Manual annotation to improve sentence selection.
    4. Collect more data.
    5. Use SOTA sentence embeddings.
    6. Improve sentence boundary detection algorithms.
    7. Implement co-reference resolution to deal with pronouns.
  • 33. References
    1. Deep Reinforcement Learning Hands-On by Maxim Lapan
    2. A survey automatic text summarization by Oguzhan Tas & Farzad Kiyani
    3. Deep Transfer Reinforcement Learning for Text Summarization by Yaser Keneshloo, Naren Ramakrishnan & Chandan K. Reddy
    4. Variations of the Similarity Function of TextRank for Automated Summarization by Federico Barrios, Luis Argerich & Rosa W.
    5. Natural language understanding with Bloom embeddings, convolutional neural networks and incremental parsing by Matthew Honnibal & Ines Montani
  • 34. To learn more about the meetup, click the link: https://www.meetup.com/Erlangen-Artificial-Intelligence-Machine-Learning-Meetup