Reference Scope Identification for Citances
Using Convolutional Neural Network
• SAURAV JHA
• AANCHAL CHAURASIA
• AKHILESH SUDHAKAR
• ANIL KUMAR SINGH
19 December, 2017
MNNIT, Allahabad IIT (BHU), Varanasi
Overview of the problem
• Automatically generating the reference scope (the span of cited text) in a reference paper
• Corresponding to citances (sentences in the citing papers that cite it)
• Application: Scientific Paper Summarization
The Computational Linguistics Scientific Document
Summarization Shared Task (CL-SciSumm)
• Given: A topic consisting of a Reference Paper (RP) and Citing Papers (CPs) that all contain citations to
the RP. In each CP, the text spans (i.e., citances) have been identified that pertain to a particular citation to
the RP.
• Task: For each citance, identify the spans of text (cited text spans) in the RP that most accurately reflect
the citance:
• A sentence fragment, a full sentence, or several consecutive sentences (no more than 5).
Citing Paper (ours) Referenced Paper (Yeh et al. (2017))
Example
Contributions:
● Modeling a new feature set to represent a citance-
reference sentence pair
● Building a classification system: binary classification of a
<CP sentence, RP sentence> pair.
● Showing performance gains over state-of-the-art results
of Yeh et al. (2017)
● Better F1-scores
● Smaller feature set
● 3 binary classifiers:
● Adaptive Boosting Classifier (ABC)
● Gradient Boosting Classifier (GBC)
● CNN classifier
System pipeline (overview): citance–reference sentence pairs → feature extraction (label: 0/1) → undersampling + SMOTE on the train + validation set → principal component analysis → classifiers → train on the train + validation set, predict labels (0/1) on the test set.
Dataset
• CL-SciSumm Shared Tasks 2016 and 2017
• Development corpus
• Training corpora
• Test corpus
● Each corpus = 10 topics.
● Each topic = a reference paper (RP) + its citing papers (CPs).
● The citation annotations specify citances, their associated reference text
and the discourse facet they represent.
● Citances in CPs are paired with each sentence in the RPs, along with a
binary label (0 or 1) indicating their actual reference relation.
ANNOTATION FORMAT
Feature Extraction:
• Three different classes of citation-dependent features (i.e., lexical,
knowledge-based and corpus-based) and one class of citation-
independent features (i.e., surface).
1. LEXICAL FEATURES
● Word overlap* using 5 metrics: Dice coefficient, Jaccard coefficient,
cosine similarity, Levenshtein-distance-based fuzzy string similarity and a
modified gestalt pattern-matching based sequence matcher score (see the
sketch after this slide).
● TF-IDF similarity: The TF-IDF vector cosine similarity.
● ROUGE measure: ROUGE-1, ROUGE-2 and ROUGE-L.
● Named entity overlap*: Using Dice coefficient, fuzzy string similarity,
sequence matcher score and word2vec similarity.
● Number overlap*: Fuzzy string similarity and sequence matcher score.
● Significance of citation-related word pairs: Based on Pointwise Mutual
Information (PMI) score (Church and Hanks, 1989).
Convention:
● ∗ = borrowed, but modified features.
● ∗∗ = newly added features in this work.
1. Lexical
2. Knowledge-based
3. Corpus-based
4. Surface
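Below is a minimal Python sketch (not the exact implementation used in this work) of the five word-overlap metrics listed above, treating both texts as lower-cased, whitespace-separated tokens and using only the standard library; the example sentences are placeholders.

```python
import difflib
import math
from collections import Counter

def tokens(text):
    return text.lower().split()

def dice(a, b):
    sa, sb = set(tokens(a)), set(tokens(b))
    return 2 * len(sa & sb) / (len(sa) + len(sb)) if sa or sb else 0.0

def jaccard(a, b):
    sa, sb = set(tokens(a)), set(tokens(b))
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def cosine(a, b):
    ca, cb = Counter(tokens(a)), Counter(tokens(b))
    dot = sum(ca[w] * cb[w] for w in ca)
    norm = math.sqrt(sum(v * v for v in ca.values())) * math.sqrt(sum(v * v for v in cb.values()))
    return dot / norm if norm else 0.0

def levenshtein_similarity(a, b):
    # Fuzzy string similarity: 1 - (edit distance / length of the longer string).
    m, n = len(a), len(b)
    if max(m, n) == 0:
        return 1.0
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            cur[j] = min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (a[i - 1] != b[j - 1]))
        prev = cur
    return 1 - prev[n] / max(m, n)

def sequence_matcher_score(a, b):
    # Gestalt (Ratcliff/Obershelp) pattern matching as implemented in difflib.
    return difflib.SequenceMatcher(None, a, b).ratio()

citance = "We agree with Sekine (2005) who claims that several methods are required."
reference = "Rather believe several methods developed using different heuristics."
for metric in (dice, jaccard, cosine, levenshtein_similarity, sequence_matcher_score):
    print(metric.__name__, round(metric(citance, reference), 3))
```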
Feature Extraction:
2. KNOWLEDGE-BASED FEATURES
• WordNet-based semantic similarity*: The best semantic similarity
score between words in the citance and the reference sentence, taken over all
the sets of cognitive synonyms (synsets) present in WordNet (see the sketch
after this slide).
Convention:
● ∗ = borrowed, but modified features.
● ∗∗ = newly added features in this work.
1. Lexical
2. Knowledge-based
3. Corpus-based
4. Surface
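As an illustration, the sketch below computes a "best synset similarity" using NLTK's WordNet interface; the slide does not specify the exact similarity measure, so path similarity is used here as a stand-in assumption.

```python
from itertools import product
from nltk.corpus import wordnet as wn  # requires a one-time nltk.download('wordnet')

def best_wordnet_similarity(citance, reference):
    # Highest path similarity over all synset pairs of all cross-sentence word pairs.
    best = 0.0
    for w1, w2 in product(set(citance.lower().split()), set(reference.lower().split())):
        for s1, s2 in product(wn.synsets(w1), wn.synsets(w2)):
            sim = s1.path_similarity(s2)  # may be None for incompatible parts of speech
            if sim is not None and sim > best:
                best = sim
    return best

print(best_wordnet_similarity("paraphrase discovery from corpora",
                              "finding similar sentences in text"))
```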
Feature Extraction:
3. CORPUS-BASED FEATURES
• Word2Vec-based semantic similarity**: Based on the pre-trained
embedding vectors of the GoogleNews corpus, following Mikolov et
al. (2013) (see the sketch after this slide).
Convention:
● ∗ = borrowed, but modified features.
● ∗∗ = newly added features in this work.
1. Lexical
2. Knowledge-based
3. Corpus-based
4. Surface
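A minimal sketch of this feature, assuming the gensim library and the publicly released GoogleNews word2vec binary; averaging word vectors and taking the cosine of the resulting sentence vectors is one simple realization, not necessarily the exact computation used in this work.

```python
import numpy as np
from gensim.models import KeyedVectors

# Load the pre-trained GoogleNews vectors (the file path is a placeholder).
w2v = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True)

def sentence_vector(sentence):
    # Average the vectors of in-vocabulary tokens; zero vector if none are known.
    vecs = [w2v[w] for w in sentence.split() if w in w2v]
    return np.mean(vecs, axis=0) if vecs else np.zeros(w2v.vector_size)

def word2vec_similarity(citance, reference):
    a, b = sentence_vector(citance), sentence_vector(reference)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(np.dot(a, b) / denom) if denom else 0.0
```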
Feature Extraction:
4. SURFACE FEATURES
• Count of words: In the reference sentence.
• Count of characters** : In the reference sentence.
• Count of digits: In the reference sentence.
• Count of special characters** : “@”, “#”, “$”, “%”, “&”, “*”, “-”, “=”, “+”, “>”,
“<”, “[”,“]”, “{”, “}”, “/”.
• Normalized count of punctuation markers** : The ratio of count of
punctuation characters to the total count of characters.
• Count of long words** : Words exceeding six letters in length.
• Average word length**: The ratio of the total count of characters to the
count of words in the reference sentence.
• Count of named entities: In the reference sentence.
• Average sentiment score**: The overall positive and negative sentiment
score of the reference sentence averaged over all the words, based on the
SentiWordNet 3.0 lexical resource (Baccianella et al. (2010)).
• Lexical richness**: The lexical richness of the reference sentence, based
on Yule's K index (a sketch of several surface features follows this slide).
Convention:
● ∗ = borrowed, but modified features.
● ∗∗ = newly added features in this work.
1. Lexical
2. Knowledge-based
3. Corpus-based
4. Surface
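The sketch below illustrates several of the surface features above, including Yule's K as the lexical-richness measure (K = 10^4 · (Σ_i i²·V_i − N) / N², with V_i the number of word types occurring exactly i times and N the token count); the tokenization and punctuation ratio here are simplified assumptions.

```python
from collections import Counter

SPECIAL = set("@#$%&*-=+><[]{}/")

def surface_features(sentence):
    words = sentence.split()
    n_chars = len(sentence)
    counts = Counter(w.lower() for w in words)
    n = sum(counts.values())
    # Yule's K: 10^4 * (sum_i i^2 * V_i - N) / N^2.
    freq_of_freqs = Counter(counts.values())
    yules_k = 1e4 * (sum(i * i * v for i, v in freq_of_freqs.items()) - n) / (n * n) if n else 0.0
    return {
        "n_words": len(words),
        "n_chars": n_chars,
        "n_digits": sum(c.isdigit() for c in sentence),
        "n_special": sum(c in SPECIAL for c in sentence),
        "punct_ratio": sum(not c.isalnum() and not c.isspace() for c in sentence) / n_chars if n_chars else 0.0,
        "n_long_words": sum(len(w) > 6 for w in words),
        "avg_word_len": sum(len(w) for w in words) / len(words) if words else 0.0,
        "yules_k": yules_k,
    }

print(surface_features("SMOTE generates synthetic minority samples (Bowyer et al., 2002)."))
```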
1. The Class Imbalance
Problem
2. Handling Class Imbalance
3. Handling Correlated
Features
• For a given reference paper, the number of
sentences in it that are cited by some citing
paper is much smaller than the number of
sentences that are not cited.
• Highly imbalanced data set with the ratio of non-
cited to cited pairs being 383.83 : 1 in the
combined corpus of development and training set
and 355.76 : 1 in the test set corpus.
Data Handling Techniques:
1. The Class Imbalance
Problem
2. Handling Class Imbalance
3. Handling Correlated
Features
• We experimented with combinations of three
different degrees of Random under-sampling
(20%, 30% and 35%) on the majority class
(negative samples).
• On each such undersampled dataset, we apply
SMOTE (Synthetic Minority Over-sampling
Technique) to generate synthetic cited pairs until
the ratio of cited to non-cited pairs is 1:1.
Data Handling Techniques:
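A minimal sketch of this two-step rebalancing with the imbalanced-learn package; interpreting the undersampling degree as the fraction of negatives retained is an assumption, as are the random seeds and the synthetic data in the demo.

```python
import numpy as np
from imblearn.under_sampling import RandomUnderSampler
from imblearn.over_sampling import SMOTE

def rebalance(X, y, keep_fraction=0.30):
    """Undersample the negatives to `keep_fraction`, then SMOTE the positives to 1:1."""
    y = np.asarray(y)
    n_pos = int((y == 1).sum())
    n_neg_keep = max(n_pos, int((y == 0).sum() * keep_fraction))
    rus = RandomUnderSampler(sampling_strategy={0: n_neg_keep, 1: n_pos}, random_state=0)
    X_u, y_u = rus.fit_resample(X, y)
    # Oversample the cited pairs until both classes are equal in size.
    return SMOTE(sampling_strategy=1.0, random_state=0).fit_resample(X_u, y_u)

# Tiny demonstration on synthetic, heavily imbalanced data.
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 20))
y = np.zeros(2000, dtype=int)
y[:30] = 1  # roughly 1.5% positives
X_bal, y_bal = rebalance(X, y)
print(np.bincount(y), "->", np.bincount(y_bal))
```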
1. The Class Imbalance
Problem
2. Handling Class Imbalance
3. Handling Correlated
Features
• Principal Component Analysis (PCA) is
applied to both the training and testing feature sets.
• Experiments were done by varying the number
of principal components from 30 to 40; the
best performance was obtained by retaining the
top 35 principal components.
Data Handling Techniques:
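A minimal sketch of this step with scikit-learn, fitting the projection on the (rebalanced) training features and reusing it on the test features; the random matrices below stand in for the real feature sets.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 60))  # placeholder for the rebalanced training features
X_test = rng.normal(size=(50, 60))    # placeholder for the test features

pca = PCA(n_components=35)            # retain the top 35 principal components
X_train_reduced = pca.fit_transform(X_train)
X_test_reduced = pca.transform(X_test)  # reuse the training projection on the test set
print(X_train_reduced.shape, X_test_reduced.shape)
```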
1. Adaptive Boosting Classifier
(ABC)
2. Gradient Boosting Classifier
(GBC)
3. Convolutional Neural Network
(CNN)
• Work by creating a sequence of models that
attempt to correct the mistakes of the models
used before them in the sequence.
• Offer the added benefit of combining outputs
from weak learners (those whose performance
is at least better than random chance) to create
a strong learner with improved prediction
performance.
• Focus more on instances that have been
misclassified or have higher errors.
• The base classifiers (weak learners) used in ABC
are decision trees.
Classification Algorithms:
Boosting Ensemble Algorithms
1. Adaptive Boosting Classifier
(ABC)
2. Gradient Boosting Classifier
(GBC)
3. Convolutional Neural Network
(CNN)
• Allows each base classifier to gradually
minimize the loss function of the whole system
using the Gradient Descent method (Collobert
et al. (2004)).
• The base classifiers in a GBC are regression
trees.
Classification Algorithms:
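For concreteness, both boosting classifiers are available off the shelf in scikit-learn; the sketch below uses default tree depths and synthetic data as placeholders rather than our actual configuration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier

X, y = make_classification(n_samples=500, n_features=35, random_state=0)  # stand-in features
abc = AdaBoostClassifier(n_estimators=100, random_state=0)          # reweighted decision-tree stumps
gbc = GradientBoostingClassifier(n_estimators=100, random_state=0)  # regression trees fitted to the loss gradient

for clf in (abc, gbc):
    clf.fit(X, y)
    print(type(clf).__name__, round(clf.score(X, y), 3))
```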
Figure 1: Schematic illustration of the boosting framework. Adapted from Bishop and Nasrabadi (2007): each base classifier y_m(x)
is trained on a weighted form of the training set (blue arrows) in which the weights w_n^(m) depend on the performance of the
previous base classifier y_{m-1}(x) (green arrows). Once all base classifiers have been trained, they are combined to give the final
classifier Y_M(x) (red arrows).
1. Adaptive Boosting Classifier
(ABC)
2. Gradient Boosting Classifier
(GBC)
3. Convolutional Neural Network (CNN)
• Have the ability to extract features of high-level
abstraction with minimum pre-processing of
data.
➢ ARCHITECTURE :
• A 1D convolutional layer accepts inputs of the
form (height × width × channels).
• We visualize each feature vector as an image with
a unit channel, unit height and a width equal to
the number of features in the reduced feature
vector obtained after applying PCA.
• The input shape of the vector fed into the
input layer of the CNN is therefore (number of features × 1).
Classification Algorithms:
Figure 2: Our CNN architecture: a stack of two 1-D convolutional layers with 64 hidden units each (ReLU activations) + 1-D
max pooling + a stack of two 1-D convolutional layers with 128 hidden units each (ReLU activations) + 1-D global
average pooling + 50% dropout + a single-unit output dense layer (sigmoid activation)
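A sketch of the Figure 2 architecture using the Keras Sequential API (the deck's references include Chollet's Keras); the layer widths follow the caption, while the kernel size, padding and optimizer are assumptions, since they are not given on the slide.

```python
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Conv1D, MaxPooling1D, GlobalAveragePooling1D, Dropout, Dense

n_features = 35  # width of the PCA-reduced feature vector

model = Sequential([
    # Stack of two 1-D convolutional layers with 64 units each (ReLU).
    Conv1D(64, kernel_size=3, activation="relu", padding="same", input_shape=(n_features, 1)),
    Conv1D(64, kernel_size=3, activation="relu", padding="same"),
    MaxPooling1D(pool_size=2),
    # Stack of two 1-D convolutional layers with 128 units each (ReLU).
    Conv1D(128, kernel_size=3, activation="relu", padding="same"),
    Conv1D(128, kernel_size=3, activation="relu", padding="same"),
    GlobalAveragePooling1D(),
    Dropout(0.5),
    Dense(1, activation="sigmoid"),  # single-unit sigmoid output for the binary label
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()
```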
Post Filtering:
• The binary classifier may classify multiple sentences in the RP as positive, i.e., being relevant to a
particular citance: all of these might not be true!
• In order to reduce our false positive error rate, we post-process by filtering out some of these false
positives.
• We use the method of Yeh et al. (2017): the final output consists of the top-k sentences from the
ordered sequence of classified reference sentences, based on the TF-IDF vector cosine similarity
score to measure the relevance between the citance and the reference sentences.
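A minimal sketch of this post-filtering step with scikit-learn: among the reference sentences predicted positive for a citance, only the top-k by TF-IDF cosine similarity to the citance are kept; the value of k used below is a placeholder.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def filter_top_k(citance, positive_sentences, k=3):
    """Keep the k positively classified reference sentences most similar to the citance."""
    if not positive_sentences:
        return []
    vectorizer = TfidfVectorizer()
    matrix = vectorizer.fit_transform([citance] + positive_sentences)
    sims = cosine_similarity(matrix[0], matrix[1:]).ravel()
    ranked = sorted(zip(positive_sentences, sims), key=lambda p: p[1], reverse=True)
    return [sentence for sentence, _ in ranked[:k]]
```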
EXPERIMENTS:
• Evaluation Metrics: Precision, Recall and F1-Score
• The average score on all topics in the test corpus is reported.
• We run experiments on two separate training sets: a first run and a second run.
➢ In the first run, we use data only from the 2016 shared task for comparison with the existing state-of-the-art (Yeh et
al. (2017)):
1. Train our models on the training set, and tune the CNN’s hyper-parameters on the development set.
2. We then merge the training data and the development data to train the final models.
3. We test our models on the test set provided as part of this dataset.
Table 1: F1 score comparison of CNN with previous models
➢ In the second run, we make use of the datasets from both 2016 and 2017.
1. Both training datasets are merged to form the initial training set.
2. After tuning the CNN’s hyperparameters on the development set, the initial training and development sets are
merged to form the final training set.
➢ A grid search over 10-fold cross-validation was used to find the best model parameters for ABC and GBC:
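A minimal sketch of that parameter search with scikit-learn's GridSearchCV over 10-fold cross-validation; the parameter grids and the synthetic data below are illustrative, not the exact grids used in our experiments.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=300, n_features=35, random_state=0)  # stand-in data

searches = [
    (AdaBoostClassifier(random_state=0),
     {"n_estimators": [50, 100], "learning_rate": [0.5, 1.0]}),
    (GradientBoostingClassifier(random_state=0),
     {"n_estimators": [100], "max_depth": [3, 5]}),
]
for estimator, grid in searches:
    search = GridSearchCV(estimator, grid, scoring="f1", cv=10)  # 10-fold CV, F1 as the criterion
    search.fit(X, y)
    print(type(estimator).__name__, search.best_params_)
```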
Results And Analysis:
● Precision, recall and F1-score obtained by the models on the test set with respect to the positive
class, evaluated by 10-fold cross-validation, are shown in Table 3.
● The CNN-based classifier was trained for 30 epochs.
Instances of TP, FP, TN, FN:
• True Positive:
Citance: We agree with Sekine (2005) who claims that several different methods are required to
discover a wider variety of paraphrases.
Reference: Rather believe several methods developed using different heuristics
discover wider variety paraphrases
• False Positive:
Citance: Similarly, (Sekine, 2005) improved information retrieval based on pattern
recognition by introducing paraphrase generation.
Reference: obstacles completing idea, believe automatic paraphrase discovery
important component building fully automatic information extraction system.
• True Negative:
Citance: We agree with Sekine (2005) who claims that several different methods are required to
discover a wider variety of paraphrases.
Reference: Keyword detection error Even keyword consists single word, words desirable
keywords domain.
• False Negative:
Citance: This sparked intensive research on unsupervised acquisition of entailment
rules (and similarly paraphrases) e.g. (Lin and Pantel, 2001; Szpektor et al., 2004; Sekine, 2005).
Reference: proposed unsupervised method discover paraphrases large untagged corpus.
Comparison with Klampfl et al. (2016):
• Reported an F1-score of 0.346 on the development set corpus and 0.432 on the training set corpus of
2016 using a TextSentenceRank-assisted sentence classifier.
• Because their performance results on the test set corpus are unavailable, we compare the
performance of our CNN classifier with theirs on the development and training set corpus (80:20
train:test split) of 2016.
1. Effect of Feature Classes
2. Effect of Data Handling
Techniques
Ablation Studies:
• More Data
• Extensions to Word2Vec: Paragraph Vector
(Le and Mikolov (2014)).
• Modeling a learning-to-rank problem:
establish a partial order between the
training instances using the binary labels
assigned to each <CP sentence, RP sentence>
pair.
Future Work
● We describe our work on reference
scope identification for citances
using an extended feature set
applied to three different classifiers.
● Among the classifiers trained to
distinguish cited and non-cited
pairs, the CNN-based model gave
the overall best results with an F1
score of 0.5558 on the combined
corpus of CL-SciSumm 2016 and
2017.
● We also achieved an F1 score of
0.2462 on the 2016 dataset, which
surpasses the previous state-of-the-
art result on that dataset.
References:
• Peeyush Aggarwal and Richa Sharma. 2016. Lexical and syntactic cues to identify reference scope of
citance. In BIRNDL@JCDL, pages 103–112.
• Stefano Baccianella, Andrea Esuli, and Fabrizio Sebastiani. 2010. SentiWordNet 3.0: An enhanced lexical
resource for sentiment analysis and opinion mining. In LREC.
• Nitesh V. Chawla, Kevin W. Bowyer, Lawrence O. Hall, and W. Philip Kegelmeyer. 2002. SMOTE: Synthetic
minority over-sampling technique. J. Artif. Intell. Res. (JAIR), 16:321–357.
• François Chollet et al. 2015. Keras. https://github.com/fchollet/keras
• Yoon Kim. 2014. Convolutional neural networks for sentence classification. In EMNLP.
• Bruno Malenfant and Guy Lapalme. 2016. RALI system description for CL-SciSumm 2016 shared task. In
BIRNDL@JCDL, pages 146–155.
• Tomas Mikolov, Ilya Sutskever, Kai Chen, Gregory S. Corrado, and Jeffrey Dean. 2013. Distributed
representations of words and phrases and their compositionality. In NIPS.
• Ted Pedersen, Siddharth Patwardhan, and Jason Michelizzi. 2004. WordNet::Similarity - Measuring the
relatedness of concepts. In AAAI.
• Jen-Yuan Yeh, Tien-Yu Hsu, Cheng-Jung Tsai, and Pei-Cheng Cheng. 2017. Reference scope identification for
citances by classification with text similarity measures. In ICSCA ’17.
THANK YOU