Personalized classifiers

Motivation
• With the proliferation of digital documents, it is
important to have sound organization, i.e.,
categorization
– Faceted Search, Exploratory Search, Navigational
Search, Diversifying Search Results, Ranking, etc.
• Yahoo! employs 200 (?) people for manual
labeling of Web pages and managing a hierarchy
of 500,000+ categories*
• MEDLINE (National Library of Medicine) spends
$2 million/year for manual indexing of journal
articles and evolving Medical Subject Headings
(MeSH; 18,000+ categories)*
* Source: www.cs.purdue.edu/homes/lsi/CS547_2013_Spring/.../IR_TC_I_Big.pdf
(Department of Computer Science, Purdue University)

Challenges
• Which categories to choose?
• Predefined?
– Reuters, DMOZ, Yahoo categories
• Relevant to organization?
– Personalized categories

Assumptions
• We assume that a knowledge graph exists
with all possible categories
– that can cover the terminology of nearly any
document collection;
– for example, Wikipedia
• Nodes are categories
• Edges are relationships between them
– Association (related)
• Organization receives documents in batches
– Monthly, Weekly, etc.
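
A minimal sketch of the assumed graph structure, in Python. The class name `KnowledgeGraph`, its methods, and the sample categories and glosses are illustrative assumptions, not taken from the original system.

```python
from collections import defaultdict


class KnowledgeGraph:
    """Undirected graph: nodes are categories, edges are associations."""

    def __init__(self):
        self.gloss = {}                    # category title -> short description
        self.neighbors = defaultdict(set)  # category title -> associated titles

    def add_category(self, title, gloss=""):
        self.gloss[title] = gloss
        self.neighbors[title]  # touch the key so isolated nodes are listed too

    def associate(self, a, b):
        """Add an undirected 'related' edge between two known categories."""
        if a not in self.gloss or b not in self.gloss:
            raise KeyError("both categories must be added before associating")
        self.neighbors[a].add(b)
        self.neighbors[b].add(a)


# Hypothetical sample content.
kng = KnowledgeGraph()
kng.add_category("Linear Classifier", "classifier using a linear decision function")
kng.add_category("Machine Learning", "study of algorithms that learn from data")
kng.add_category("Virus", "infectious agent that replicates inside cells")
kng.associate("Linear Classifier", "Machine Learning")
```

Storing neighbors as sets keeps association lookups cheap, which matters when candidate categories are expanded through their neighbors.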

A part of Knowledge Graph (KnG)

Problem Definition
• Learning a personalized model for the
association of the categories in KnG to a
document collection through active learning
and feature design
• Building an evolving multi-label categorization
system to categorize documents into
Categories Specific to an Organization
– Personalization of categories

Overall Architecture
• We evolve the personalized classifier based on
– Documents seen so far
– Categories referenced from Knowledge Graph
– Feedback provided by the user

Step 1: Spotting
• Spot the keywords in documents
– Key phrase identification Techniques
– NLP (noun phrases)
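
A rough sketch of spotting, assuming a simple stopword-delimited heuristic in place of the key-phrase and noun-phrase techniques named above. The function `spot_phrases` and the tiny stopword list are hypothetical; a real system would likely use a proper NLP chunker.

```python
import re

# Tiny illustrative stopword list (assumption); a real system would use a fuller one.
STOPWORDS = {"a", "an", "the", "of", "for", "and", "or", "to", "in", "on",
             "is", "are", "we", "with", "that", "this", "by", "as"}


def spot_phrases(text, max_words=3):
    """Return candidate key phrases: maximal runs of non-stopword tokens."""
    phrases = []
    for sentence in re.split(r"[.!?;:,]", text.lower()):
        run = []
        # The trailing None acts as a sentinel that flushes the last run.
        for token in re.findall(r"[a-z][a-z\-]*", sentence) + [None]:
            if token is None or token in STOPWORDS:
                if 0 < len(run) <= max_words:
                    phrases.append(" ".join(run))
                run = []
            else:
                run.append(token)
    # Deduplicate while preserving first-seen order.
    return list(dict.fromkeys(phrases))


spots = spot_phrases("We train a linear classifier with kernel methods.")
# spots == ["train", "linear classifier", "kernel methods"]
```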

Step 2: Candidate Categories
• Keywords are indicative of document topics
• Identify the categories from KnG based on
keyword lookups
– Title Match, Gloss match with Wikipedia categories
• Add categories in Markov Blanket
– Observe that categories that get assigned to a
document exhibit semantic relations such as
“associations”
– E.g.: category “Linear Classifier” is related to
categories such as “Kernel Methods in Classifiers,”
“Machine Learning,” and the like
– Refer to our paper for more details
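
A sketch of candidate generation, assuming title match only (gloss match is omitted) and reading "Markov blanket" as the direct neighbors of a matched node, which is what the Markov blanket is in an undirected network. The function name and toy graph are hypothetical.

```python
def candidate_categories(spots, titles, neighbors):
    """Title-match spotted phrases, then add each match's graph neighbors."""
    by_lower = {t.lower(): t for t in titles}
    matched = {by_lower[s.lower()] for s in spots if s.lower() in by_lower}
    expanded = set(matched)
    for title in matched:
        # In an undirected Markov network the Markov blanket of a node
        # is exactly its set of direct neighbors.
        expanded |= neighbors.get(title, set())
    return matched, expanded


# Hypothetical toy graph; the titles echo the example above.
neighbors = {
    "Linear Classifier": {"Machine Learning", "Kernel Methods in Classifiers"},
    "Machine Learning": {"Linear Classifier"},
    "Kernel Methods in Classifiers": {"Linear Classifier"},
    "Virus": set(),
}
matched, expanded = candidate_categories(
    ["linear classifier", "train"], neighbors.keys(), neighbors)
```

Here only "Linear Classifier" matches by title, and the expansion pulls in its two associated categories while leaving "Virus" out.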

Candidate Categories
• Not all candidate categories are relevant to
the document
– The document is not about that category
– The category is not of interest to the user
• We need to select only the most appropriate
categories from these candidate categories

Step 3: Associative Markov Network Formation
• Two types of informative features available
– a feature that is a function of the document and a
category, such as the category-specific classifier
scoring function evaluated on a document
– a feature that is a function of two categories, such
as their co-occurrence frequency or textual
overlap between their descriptions.
• An Associative Markov Network (AMN) is a very
natural way of modeling these two types of
features.

Associative Markov Network
• The candidate categories for a journal article
taken from arXiv.org
• Only some are actually relevant due to
– Relevance to the document
– User preferences

Step 4: Collectively Inferring Categories of a Document
• Node features
– Capture the similarity of a node (category) to
the document
– E.g.: kernels, SVM / Naïve Bayes classifier scores
• Edge features
– Capture the similarity between nodes
– E.g.: title match, gloss match, etc.
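
A sketch of one node feature and one edge feature, under stated assumptions: cosine similarity between a document and a category description stands in for a classifier score, and Jaccard overlap of title words stands in for title match. Both function names are hypothetical.

```python
import math
from collections import Counter


def bag(text):
    """Bag-of-words counts over lowercased whitespace tokens."""
    return Counter(text.lower().split())


def node_feature(document, category_gloss):
    """Cosine similarity between a document and a category description."""
    d, c = bag(document), bag(category_gloss)
    dot = sum(d[w] * c[w] for w in d)
    norm = math.sqrt(sum(v * v for v in d.values())) * \
        math.sqrt(sum(v * v for v in c.values()))
    return dot / norm if norm else 0.0


def edge_feature(title_a, title_b):
    """Jaccard overlap between the word sets of two category titles."""
    a, b = set(title_a.lower().split()), set(title_b.lower().split())
    return len(a & b) / len(a | b) if a | b else 0.0
```

Both features are non-negative, which fits the associative setting where edges can only reward agreeing labels.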

Collectively Inferring Categories of a Document
• This is MAP inference in a standard Markov network with
only node and edge potentials
• Using the indicator variable, we can express the node and edge
potentials in terms of the feature weights
• Note, we have separate feature weights for 0 and 1 labels
[Figure: network of nodes x0 through x9 with labels 1 and 0]
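
For reference, a sketch of the standard binary AMN formulation from the literature, which may differ in detail from what this slide showed. The symbols f_i, f_{ij}, and y_i^k are assumed notation; W_n and W_e are the node and edge weights named under Training.

```latex
% Assumed notation: y_i^k = 1 iff category node i takes label k (k = 0 or 1);
% f_i and f_{ij} are node and edge feature vectors; W_n^k and W_e^k are the
% label-specific node and edge weights.
\begin{align*}
\log \phi_i(k) &= W_n^k \cdot f_i, \\
\log \phi_{ij}(k, l) &=
  \begin{cases}
    W_e^k \cdot f_{ij} & \text{if } k = l, \\
    0 & \text{otherwise,}
  \end{cases} \\
\hat{y} &= \arg\max_{y} \;
  \sum_{i} \sum_{k \in \{0,1\}} \left( W_n^k \cdot f_i \right) y_i^k
  + \sum_{(i,j) \in E} \sum_{k \in \{0,1\}}
    \left( W_e^k \cdot f_{ij} \right) y_i^k \, y_j^k,
  \qquad y_i^0 + y_i^1 = 1 .
\end{align*}
```

The associative property corresponds to requiring W_e^k \cdot f_{ij} \ge 0, so an edge can only reward two nodes for sharing a label.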

Training
• The training process involves learning
– The AMN feature weights (Wn and We)
– Node-specific classifier (SVM, Naïve Bayes, etc.)
weights
• Training is done as part of personalization,
explained in the coming slides

Personalization
• Process of learning to categorize with
categories that are of interest to an
organization
• We achieve this by soliciting feedback from a
human oracle on the system-suggested
categories and using it to retrain the system
parameters.
• The feedback is solicited as “correct”,
“incorrect” or “never again” for the categories
assigned to a document by the system.
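
A sketch of how the three feedback types might be recorded. The `Feedback` enum and `apply_feedback` are hypothetical names, and treating "never again" as an additional negative label for retraining is an assumption.

```python
from enum import Enum


class Feedback(Enum):
    CORRECT = "correct"
    INCORRECT = "incorrect"
    NEVER_AGAIN = "never again"


def apply_feedback(events):
    """Split feedback into training labels and a global block list.

    events: iterable of (document_id, category, Feedback)
    """
    labels = {}      # (document_id, category) -> 1 or 0, used for retraining
    blocked = set()  # categories that must never be suggested again
    for doc, category, fb in events:
        if fb is Feedback.NEVER_AGAIN:
            blocked.add(category)
            labels[(doc, category)] = 0  # assumption: also a negative example
        else:
            labels[(doc, category)] = 1 if fb is Feedback.CORRECT else 0
    return labels, blocked


# Hypothetical feedback events.
labels, blocked = apply_feedback([
    ("d1", "Machine Learning", Feedback.CORRECT),
    ("d1", "Virus", Feedback.NEVER_AGAIN),
    ("d2", "Physics", Feedback.INCORRECT),
])
```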

Personalization: Constraints
• Users can indicate (via feedback) that a
category suggested by the system should
never reappear in future categorization
– E.g., a Computer Science department may not be
interested in detailed categorization of documents
based on types of Viruses
• The system remembers this feedback as a
hard constraint, which is applied during the
inference process

Personalization: Constraints
• Due to the AMN’s associative property, the
constraints naturally propagate
– Users do not have to apply constraints on every
unwanted category on KnG
By applying a “never again” constraint on node N, the label of node N is
forced to 0. This forces the labels of strongly associated neighbors (O, P, Q, R) to 0.
This follows from AMN MAP inference: the objective attains its maximum when
these neighbors (with high edge potentials) are assigned label 0.
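
A toy illustration of this propagation, using exhaustive search in place of whatever inference procedure the system actually uses (exhaustive search is exponential and only feasible for a handful of nodes). All scores and the function `map_labels` are hypothetical.

```python
from itertools import product


def map_labels(node_score, edge_score, clamp=None):
    """Exhaustive MAP inference for a tiny binary associative network.

    node_score[n] = (score if label 0, score if label 1)
    edge_score[(a, b)] = (reward if both 0, reward if both 1); rewards >= 0
    clamp = {node: forced_label}, e.g. {"N": 0} for a "never again" node
    """
    clamp = clamp or {}
    nodes = sorted(node_score)
    best, best_value = None, float("-inf")
    for labels in product((0, 1), repeat=len(nodes)):
        y = dict(zip(nodes, labels))
        if any(y[n] != v for n, v in clamp.items()):
            continue  # violates a hard constraint
        value = sum(node_score[n][y[n]] for n in nodes)
        # An edge contributes only when both endpoints share a label.
        value += sum(r[y[a]] for (a, b), r in edge_score.items() if y[a] == y[b])
        if value > best_value:
            best, best_value = y, value
    return best


# Hypothetical scores: N and its neighbors O, P mildly prefer label 1,
# and strong edges reward agreeing labels.
node_score = {"N": (0.0, 1.0), "O": (0.0, 0.5), "P": (0.0, 0.5)}
edge_score = {("N", "O"): (2.0, 2.0), ("N", "P"): (2.0, 2.0)}

free = map_labels(node_score, edge_score)
constrained = map_labels(node_score, edge_score, clamp={"N": 0})
```

Without the constraint all three nodes take label 1. Clamping N to 0 flips O and P to 0 as well, because the edge rewards outweigh their weak node preferences.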

Personalization: Active Feedback
• To improve the categorization accuracy, users
can train the system by providing feedback
(“correct”, “incorrect”) on select categories of
select documents.
• The system uses this feedback to retrain the AMN,
SVM (and other classifiers, e.g., Naïve Bayes)
• The system chooses the documents and categories
for feedback that help it learn the
best parameters with as little feedback as
possible

Active Learning
• We prove a claim “There exists a feature space
and a hyperplane in the feature space that
separates AMN nodes with label 1 from the
nodes that have label 0 and that passes
through the origin”
• This claim lets us transform the AMN
model into a hyperplane-based two-class
classification problem and apply uncertainty-
based principles to determine the most
uncertain category for a document

Active Learning
• ai: gain in selecting category i, based on its distance from
the hyperplane
• bj : gain in selecting document j based on the categories it
has
• Feedback is sought from the user for the documents with zj
= 1 and only for those categories that are identified as the
most uncertain for that document (yi = 1).
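
A greedy sketch of uncertainty-based selection, standing in for the optimization over yi and zj described above. The gain definitions are assumptions: a category's gain is its closeness to the hyperplane, and a document's gain is that of its most uncertain category. The function `select_queries` and the margins are hypothetical.

```python
def select_queries(margins, budget):
    """Pick (document, category) pairs to show the user.

    margins[doc][category] = signed distance of the category node from the
    separating hyperplane for that document (assumed to be given).
    """
    picks = []
    for doc, per_category in margins.items():
        if not per_category:
            continue
        # Most uncertain category: the one closest to the hyperplane.
        category = min(per_category, key=lambda c: abs(per_category[c]))
        picks.append((abs(per_category[category]), doc, category))
    picks.sort()  # smallest distance (highest uncertainty) first
    return [(doc, category) for _, doc, category in picks[:budget]]


# Hypothetical margins for three documents.
margins = {
    "doc1": {"Machine Learning": 2.1, "Virus": -0.1},
    "doc2": {"Linear Classifier": 0.9, "Machine Learning": -1.5},
    "doc3": {"Kernel Methods in Classifiers": 0.3},
}
queries = select_queries(margins, budget=2)
```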

Evaluation
• Warm Start
– RCV1-v2 categories and documents
– Demonstrates our system on a standard dataset
– 5000 documents in batches of 50 docs
– 2000 held-out test documents for F1 score
– Compared against SVM and HICLASS from Shantanu et al.
• Cold Start
– User evaluation using Wikipedia categories and
arXiv articles
– Compared against WikipediaMiner
Warm Start Results
[Figures: comparison with SVM; active learning with different algorithms]

Cold Start Experiments and Results
• 263 arXiv documents
• Annotated by 8 human annotators using
Wikipedia titles
• 5-fold cross-validation
– Trained AMN, SVM weights in each fold
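
A sketch of a 5-fold split over 263 documents. Contiguous, unshuffled folds are an assumption, since how the folds were formed is not stated here; `k_fold_indices` is a hypothetical helper.

```python
def k_fold_indices(n_items, k):
    """Yield (train, test) index lists for k contiguous folds."""
    base, extra = divmod(n_items, k)
    start = 0
    for fold in range(k):
        # The first `extra` folds get one additional item.
        size = base + (1 if fold < extra else 0)
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_items))
        yield train, test
        start += size


folds = list(k_fold_indices(263, 5))
```

With 263 documents, three folds hold 53 test documents and two hold 52; the weights would be retrained on the remaining documents in each fold.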

To be addressed…
• Each document is assigned categories separately. This
leads to many accumulated categories at the
organization level
– Over specified number of categories
• AMN inference over thousands of candidate categories
is time-consuming, so the system cannot be used in
real time
• KnG evolves over time
– Documents that have already been assigned categories need
to be updated wisely
