GENDER DETECTION IN BLOGS
Presented By (Team No. 32)
Nitish Jain (201301227)
Ganesh Borle (201505587)
Vamshikrishna Reddy (201202177)
Mentored By
Lokesh Walase
IRE [CSE474]
The Big Picture
ABSTRACT
● Through the sands of time, textual content has remained a
prominent feature of internet media, especially blogs.
● But . . .
● A lot of content brings a lot of responsibility.
● The internet cannot take responsibility for all of that content;
the responsibility should lie with the author.
● Thus, author profiling and attribution becomes an important
task, and we try to capture one aspect of it, i.e. gender.
The Question
Given the text of a blog, can we identify whether
the writer is male or female?
WHO IS THE AUTHOR?
OUR APPROACH
THE APPROACH
● We take advantage of the linguistic features of the
blog and create a feature file.
● Various classifiers are then trained on this feature file, and a
model is prepared for each classifier.
● An ensemble is applied to these models, and the input
document is classified as written by a male or a female.
WORKFLOW
The Dataset
● Koppel's blog dataset
● Contains about 19 thousand documents
● Each document contains the text of about 35 blog posts
in XML format.
[Dataset Link : http://u.cs.biu.ac.il/~koppel/BlogCorpus.htm ]
PARSING
● Language used : Python
● Each blog entry is stored in XML format
<Blog>
<date>....... </date>
<post>
….
</post>
...
</Blog>
● Each blog filename contains the name and gender
of the author
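The parsing step above could be sketched roughly as follows. This is a minimal illustration, not the project's actual parser: the function names are hypothetical, the regular-expression approach is an assumption (chosen because scraped blog XML is often not well-formed), and the dot-separated filename layout in the usage note is assumed rather than taken from the slides.

```python
import re
from pathlib import Path

# Matches the text between <post> and </post>, across line breaks.
POST_RE = re.compile(r"<post>(.*?)</post>", re.DOTALL | re.IGNORECASE)


def extract_posts(xml_text):
    """Return the stripped text of every <post> element in one blog file."""
    return [post.strip() for post in POST_RE.findall(xml_text)]


def gender_from_filename(filename):
    """Return 'male' or 'female' if it appears as a dot-separated token.

    Assumes the gender label is one of the dot-separated parts of the
    filename; returns None when no such token is found.
    """
    for part in Path(filename).name.lower().split("."):
        if part in ("male", "female"):
            return part
    return None
```

For a hypothetical file named `1000331.female.37.indUnk.Leo.xml`, `gender_from_filename` would return `"female"`, and `extract_posts` would return one string per `<post>` block.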
The Feature Extraction
FEATURES
For our task of Gender Identification, we rely on
the following linguistic features:
● Character Based Features
● Word Based Features
● Syntactic Features
● Structural Features
● Function Words
● POS Start Probability
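A few of the feature families listed above can be illustrated with a small sketch. The feature names, the function-word list, and the exact formulas below are assumptions made for illustration; the slides do not specify them. POS start probability is left out because it would need a part-of-speech tagger.

```python
import re
from collections import Counter

# Illustrative function-word list; the slides do not give the list actually used.
FUNCTION_WORDS = ("the", "a", "of", "and", "to", "in", "i", "you", "he", "she")


def extract_features(text):
    """Map one document to a flat dict of numeric features."""
    n_chars = max(len(text), 1)
    words = re.findall(r"[a-z']+", text.lower())
    n_words = max(len(words), 1)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    counts = Counter(words)

    features = {
        # Character-based features
        "char_count": len(text),
        "upper_ratio": sum(c.isupper() for c in text) / n_chars,
        "digit_ratio": sum(c.isdigit() for c in text) / n_chars,
        # Word-based features
        "avg_word_len": sum(len(w) for w in words) / n_words,
        "type_token_ratio": len(counts) / n_words,
        # Structural features
        "sentence_count": len(sentences),
        "avg_sentence_len": len(words) / max(len(sentences), 1),
    }
    # Function words: relative frequency of each listed word
    for fw in FUNCTION_WORDS:
        features["fw_" + fw] = counts[fw] / n_words
    return features
```

Writing one such dict per document, one row each, would give a feature file of the kind described in the approach.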
The Classification
THE CLASSIFICATION TASK
For the task of classification, we tried several classification
algorithms and arrived at a model that uses an ensemble of the
following:
● Random Forest Classifier
● Neural Networks Classifier
● AdaBoost Tree Classifier
● Gradient Boosting Classifier
● Bagging Classifier
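As a rough sketch, the five model families listed above could be set up as follows. The slides do not name the library or any hyperparameters, so the use of scikit-learn, the specific settings, and the synthetic data standing in for the feature file are all assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    AdaBoostClassifier,
    BaggingClassifier,
    GradientBoostingClassifier,
    RandomForestClassifier,
)
from sklearn.neural_network import MLPClassifier


def build_classifiers(seed=0):
    """Return one model per family named on the slide (settings are assumed)."""
    return {
        "random_forest": RandomForestClassifier(n_estimators=50, random_state=seed),
        "neural_network": MLPClassifier(
            hidden_layer_sizes=(32,), max_iter=500, random_state=seed
        ),
        "adaboost": AdaBoostClassifier(n_estimators=50, random_state=seed),
        "gradient_boosting": GradientBoostingClassifier(
            n_estimators=50, random_state=seed
        ),
        "bagging": BaggingClassifier(n_estimators=20, random_state=seed),
    }


# Synthetic stand-in for the feature file: 200 documents, 20 features each.
X, y = make_classification(n_samples=200, n_features=20, random_state=0)

models = build_classifiers()
for name, model in models.items():
    model.fit(X, y)
```

In the real pipeline, `X` would hold the extracted linguistic features and `y` the gender labels taken from the filenames.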
THE CLASSIFICATION TASK
For each classifier:
● We fed it subsets of the features to see how the accuracy
varies with the features used.
● We applied 10-fold cross-validation to measure the accuracy.
To measure the accuracy of the ensemble, we took the
majority class among the predictions of the individual classifiers.
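The 10-fold validation mentioned above can be sketched in plain Python as below. This is an illustrative version only: the helper names are hypothetical, and the trivial baseline classifier merely stands in for the real models.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Shuffle sample indices and split them into k nearly equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_val_accuracy(train_and_predict, X, y, k=10, seed=0):
    """Average accuracy over k folds (assumes len(X) >= k).

    train_and_predict(X_train, y_train, X_test) must return predicted labels.
    """
    scores = []
    for held_out in k_fold_indices(len(X), k, seed):
        held = set(held_out)
        train = [i for i in range(len(X)) if i not in held]
        preds = train_and_predict(
            [X[i] for i in train],
            [y[i] for i in train],
            [X[i] for i in held_out],
        )
        correct = sum(p == y[i] for p, i in zip(preds, held_out))
        scores.append(correct / len(held_out))
    return sum(scores) / len(scores)


def majority_baseline(X_train, y_train, X_test):
    """Toy classifier: always predict the most common training label."""
    most_common = max(set(y_train), key=y_train.count)
    return [most_common] * len(X_test)
```

Each document is held out exactly once, so the averaged score reflects performance on data the model did not see during training.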
RANDOM FOREST CLASSIFIER
● A meta-estimator that fits a number
of decision tree classifiers on various
sub-samples of the dataset.
● Using the Random Forest Classifier, we
achieved an accuracy of
69.79%
NEURAL NETWORKS CLASSIFIER
● Consists of multiple layers of nodes,
with each layer fully connected to the
next; each node is a neuron with a
non-linear activation function.
● Uses a supervised learning technique
called backpropagation to train the
network.
● Using the Neural Networks Classifier,
we achieved an accuracy
of 69.51%
ADABOOST TREE CLASSIFIER
● A meta-estimator that begins by
fitting a classifier on the original
dataset and then fits further rounds of
classifiers on the same dataset, with more
weight on the instances that earlier rounds
misclassified.
● Using the AdaBoost Tree Classifier, we
achieved an accuracy of
69.57%
GRADIENT BOOSTING CLASSIFIER
● Builds the model in a forward stage-wise
fashion.
● In each subsequent stage, weak
classifiers are introduced to
compensate for the shortcomings of the
existing weak learners; these
shortcomings are identified by the
gradients.
● Using the Gradient Boosting Classifier,
we achieved an accuracy
of 70.81%
BAGGING CLASSIFIER
● A meta-estimator that fits base
classifiers, each on a random subset of
the dataset, and then aggregates their
individual predictions.
● Using the Bagging Classifier,
we achieved an accuracy
of 70.03%
THE ENSEMBLE
● An Ensemble takes the outputs of the other
classifiers and applies majority
voting to those outputs
to determine the final output.
● Using the Ensemble model on the
classifiers discussed above, we
achieved an accuracy of
71.10%
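Majority voting as described above can be sketched in a few lines. The function name and the toy predictions are hypothetical; with five classifiers and two classes a tie cannot occur, which this sketch relies on.

```python
from collections import Counter


def majority_vote(predictions_per_classifier):
    """Combine per-classifier label lists into one label per document.

    Each inner list holds one classifier's predictions, in document order.
    """
    combined = []
    for votes in zip(*predictions_per_classifier):
        combined.append(Counter(votes).most_common(1)[0][0])
    return combined


# Toy predictions from five classifiers for four documents.
preds = [
    ["male", "female", "female", "male"],
    ["male", "male", "female", "female"],
    ["female", "female", "female", "male"],
    ["male", "female", "male", "male"],
    ["male", "female", "female", "female"],
]
combined = majority_vote(preds)
```

Here `combined` is `["male", "female", "female", "male"]`: each document receives the label chosen by at least three of the five classifiers.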
FINAL RESULTS
THE FINAL RESULTS
● By using the ensemble, we were able to increase our accuracy
by nearly 1% in each case, irrespective
of the performance of the individual
classifiers.
● The maximum accuracy observed during the
experiments was 73.19%, achieved by the
Ensemble model.
Maximum accuracy achieved: 73.188406%
USEFUL LINKS
● Github - https://github.com/nitishjain2007/Gender_Identification
● Youtube - https://www.youtube.com/watch?v=T04BJ6cIeTs
● Slideshare - http://bit.ly/1Q8UiCe
● Website - http://nitishjain2007.github.io/Gender_Identification/
● Dropbox - http://bit.ly/1Xx0ppL
REFERENCES
● http://u.cs.biu.ac.il/~koppel/papers/male-female-llc-final.pdf
● http://www.aaai.org/ocs/index.php/ICWSM/09/paper/viewFile/208/537
● http://www.cs.columbia.edu/nlp/papers/2011/acl2011age.pdf
● http://www.ccse.kfupm.edu.sa/~ahmadsm/coe589-121/cheng2011-gender-identification.pdf
Thanks!
Any questions?
