Gender Detection on Blogs

Presented By (Team No. 32)
Nitish Jain (201301227)
Ganesh Borle (201505587)
Vamshikrishna Reddy (201202177)
Mentored By
Lokesh Walase
IRE [CSE474]

ABSTRACT
● Through the sands of time, textual content has remained a
prominent feature of internet media, especially blogs.
● But . . .
● A lot of content brings a lot of responsibility.
● The internet can’t take responsibility for all the content; the
author should.
● Thus, author profiling and attribution become an important
task, and we try to capture one aspect of it, i.e. gender.

The Question
Given a text blog, can we identify whether
the writer is male or female?

THE APPROACH
● We take advantage of the linguistic features of the
blog and create a feature file.
● Various classifiers are then trained on this feature file, and a
model is prepared for each classifier.
● An ensemble is applied to these models, and the input
document is classified as written by a male or a female.

The Dataset
● Koppel's blog dataset
● contains about 19 thousand documents
● each document contains the text of about 35 blog posts
in XML format.
[Dataset Link : http://u.cs.biu.ac.il/~koppel/BlogCorpus.htm ]

PARSING
● Language used : Python
● Each blog entry is stored in XML format
<Blog>
<date>....... </date>
<post>
….
</post>
...
</Blog>
● Each blog filename contains the name and gender
of the author
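
A minimal sketch of this parsing step, for illustration only. It pulls the <date> and <post> elements out with regular expressions, which still works if the raw blog text is not strictly well-formed XML. The function names and the dot-separated filename layout are assumptions made for this example, not details taken from the project code.

```python
import re

# Non-greedy patterns so each match stops at the first closing tag.
POST_RE = re.compile(r"<post>(.*?)</post>", re.DOTALL)
DATE_RE = re.compile(r"<date>(.*?)</date>", re.DOTALL)


def parse_blog(xml_text):
    """Return a list of (date, post_text) pairs from one blog document."""
    dates = [d.strip() for d in DATE_RE.findall(xml_text)]
    posts = [p.strip() for p in POST_RE.findall(xml_text)]
    return list(zip(dates, posts))


def gender_from_filename(filename):
    """Read the gender label from the filename.

    Assumption: the filename is a dot-separated list of fields and
    one of those fields is exactly 'male' or 'female'.
    """
    for field in filename.lower().split("."):
        if field in ("male", "female"):
            return field
    return None
```

For example, gender_from_filename("12345.female.25.Student.Leo.xml") returns "female" under the assumed layout.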

FEATURES
For our task of Gender Identification, we take the help of
the following linguistic features:
● Character Based Features
● Word Based Features
● Syntactic Features
● Structural Features
● Function Words
● POS Start Probability
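
A rough sketch of how a few of these feature groups could be computed from raw text. The specific features, their names, and the short function-word list are illustrative assumptions, not the feature set used in the project. Syntactic features and POS start probabilities need a part-of-speech tagger and are left out here.

```python
import re
import string

# Hypothetical short list; a real function-word list would be much longer.
FUNCTION_WORDS = ("the", "a", "of", "and", "to", "in", "i", "you")


def extract_features(text):
    """Map one blog text to a dictionary of numeric features."""
    words = re.findall(r"[A-Za-z']+", text.lower())
    n_chars = len(text) or 1   # avoid division by zero on empty input
    n_words = len(words) or 1
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]

    features = {
        # Character-based features
        "char_count": len(text),
        "digit_ratio": sum(c.isdigit() for c in text) / n_chars,
        "upper_ratio": sum(c.isupper() for c in text) / n_chars,
        "punct_ratio": sum(c in string.punctuation for c in text) / n_chars,
        # Word-based features
        "word_count": len(words),
        "avg_word_len": sum(len(w) for w in words) / n_words,
        "vocab_richness": len(set(words)) / n_words,
        # Structural feature
        "sentence_count": len(sentences),
    }
    # Function-word features: relative frequency of each listed word
    for fw in FUNCTION_WORDS:
        features["fw_" + fw] = words.count(fw) / n_words
    return features
```

Writing one such dictionary per document, as one row each, would give the kind of feature file that the classifiers are trained on.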

THE CLASSIFICATION TASK
For the task of classification, we tried several classification
algorithms and arrived at a model that uses an ensemble of the
following classification algorithms:
● Random Forest Classifier
● Neural Networks Classifier
● Adaboost Tree Classifier
● Gradient Boosting Classifier
● Bagging Classifier

THE CLASSIFICATION TASK
For each classifier:
● We fed it partial feature sets to see how the accuracy
varies with the features.
● We applied 10-fold cross-validation to measure the accuracies.
To measure the accuracy of the ensemble, we took the
majority class from the predictions of the individual classifiers.
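
The 10-fold procedure can be sketched with the standard library alone. The data is split into 10 folds, each fold is held out once as the test set, and the per-fold accuracies are averaged. The helper names and the train_and_predict callback are hypothetical, and the sketch assumes there are at least as many samples as folds.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Fold i takes every k-th shuffled index starting at position i.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test_idx = folds[i]
        train_idx = [j for f_i, fold in enumerate(folds) if f_i != i for j in fold]
        yield train_idx, test_idx


def cross_val_accuracy(train_and_predict, X, y, k=10):
    """Average accuracy over k folds.

    train_and_predict(X_train, y_train, X_test) is a hypothetical callback
    that trains a classifier and returns predicted labels for X_test.
    """
    accuracies = []
    for train_idx, test_idx in k_fold_indices(len(X), k):
        predicted = train_and_predict(
            [X[i] for i in train_idx],
            [y[i] for i in train_idx],
            [X[i] for i in test_idx],
        )
        correct = sum(p == y[i] for p, i in zip(predicted, test_idx))
        accuracies.append(correct / len(test_idx))
    return sum(accuracies) / len(accuracies)
```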

RANDOM FOREST CLASSIFIER
● A meta estimator that fits a number
of decision tree classifiers on various
sub-samples of the dataset and averages
their predictions
● By using Random Forest Classifier we
were able to achieve an accuracy of
69.79%

NEURAL NETWORKS CLASSIFIER
● Consists of multiple layers of nodes,
with each layer fully connected to the
nodes of the next layer; each node is a
neuron with a non-linear activation function.
● Uses a supervised learning technique called
backpropagation for training the
network.
● By using Neural Networks Classifier
we were able to achieve an accuracy
of 69.51%

ADABOOST TREE CLASSIFIER
● A meta estimator that begins by
fitting a classifier on the original
dataset and then fits the next rounds of
classifiers on the same dataset, with the
weights of misclassified instances increased
● By using Adaboost Tree Classifier we
were able to achieve an accuracy of
69.57%
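
Fitting successive classifiers on the same dataset relies on re-weighting the samples between rounds. The sketch below shows one such update in the classic two-class AdaBoost formulation. It is an illustration of the idea, not necessarily the exact variant used by the library in the project, and the function name is hypothetical.

```python
import math


def adaboost_reweight(weights, correct):
    """One round of the classic two-class AdaBoost weight update.

    weights: current sample weights, summing to 1
    correct: booleans, True where the current weak classifier was right
    Returns (new_weights, alpha), where alpha is the classifier's vote weight.
    """
    # Weighted error of the current weak classifier
    error = sum(w for w, ok in zip(weights, correct) if not ok)
    # Clamp to keep the logarithm finite for perfect or useless classifiers
    error = min(max(error, 1e-10), 1 - 1e-10)
    alpha = 0.5 * math.log((1 - error) / error)
    # Misclassified samples gain weight, correctly classified ones lose it
    updated = [w * math.exp(-alpha if ok else alpha) for w, ok in zip(weights, correct)]
    total = sum(updated)
    return [w / total for w in updated], alpha
```

After the update, the misclassified samples together carry half of the total weight, so the next classifier is pushed to focus on them.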

GRADIENT BOOSTING CLASSIFIER
● Builds the model in a forward stage-wise
fashion.
● In each subsequent stage, weak
classifiers are introduced to
compensate for the shortcomings of the
existing weak learners; these
shortcomings are identified by the
gradients.
● By using Gradient Boosting Classifier
we were able to achieve an accuracy
of 70.81%

BAGGING CLASSIFIER
● A meta estimator that fits the base
classifiers each on random subsets of
the dataset and then aggregates their
individual predictions.
● By using Bagging Classifier
we were able to achieve an accuracy
of 70.03%
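
A small sketch of the bagging idea: each base model is fitted on a random subset drawn with replacement, and the individual predictions are aggregated by voting. The function names are hypothetical, and fit is assumed to be any function that takes training data and returns a callable model.

```python
import random
from collections import Counter


def bootstrap_sample(X, y, rng):
    """Draw len(X) samples with replacement from (X, y)."""
    idx = [rng.randrange(len(X)) for _ in range(len(X))]
    return [X[i] for i in idx], [y[i] for i in idx]


def bagging_predict(fit, X, y, X_new, n_estimators=10, seed=0):
    """Fit n_estimators models on bootstrap samples and vote on X_new.

    fit(X_train, y_train) is assumed to return a function model(x) -> label.
    """
    rng = random.Random(seed)
    models = [fit(*bootstrap_sample(X, y, rng)) for _ in range(n_estimators)]
    predictions = []
    for x in X_new:
        votes = Counter(model(x) for model in models)
        predictions.append(votes.most_common(1)[0][0])
    return predictions
```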

THE ENSEMBLE
● An ensemble takes the outputs of the other
classifiers and then applies majority
voting to those outputs
to determine the final output.
● By using the Ensemble model on the
above discussed classifiers we were
able to achieve an accuracy of
71.10%
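
Majority voting over the classifier outputs can be sketched as follows. The function name and input layout are assumptions for illustration. With five classifiers and two classes, a tie cannot occur.

```python
from collections import Counter


def majority_vote(predictions_per_classifier):
    """Combine per-classifier label lists into one ensemble label list.

    predictions_per_classifier: one list of labels per classifier,
    all of the same length (one label per document).
    """
    ensemble = []
    # zip(*...) groups the votes that all classifiers cast for one document
    for votes in zip(*predictions_per_classifier):
        ensemble.append(Counter(votes).most_common(1)[0][0])
    return ensemble
```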

THE FINAL RESULTS
● By using the ensemble, we were
able to increase our accuracy
by nearly 1% in each case, irrespective
of the performance of the individual
classifiers.
● The maximum accuracy
observed during the
experiments was 73.19%, achieved by the
Ensemble model.

The maximum accuracy achieved: 73.188406%

USEFUL LINKS
● Github - https://github.com/nitishjain2007/Gender_Identification
● Youtube - https://www.youtube.com/watch?v=T04BJ6cIeTs
● Slideshare - http://bit.ly/1Q8UiCe
● Website - http://nitishjain2007.github.io/Gender_Identification/
● Dropbox - http://bit.ly/1Xx0ppL

REFERENCES
● http://u.cs.biu.ac.il/~koppel/papers/male-female-llc-final.pdf
● http://www.aaai.org/ocs/index.php/ICWSM/09/paper/viewFile/208/537
● http://www.cs.columbia.edu/nlp/papers/2011/acl2011age.pdf
● http://www.ccse.kfupm.edu.sa/~ahmadsm/coe589-121/cheng2011-gender-identification.pdf
