Multi level classifier for the detection of insults

The document presents a multi-level classifier for detecting insults in social media, motivated by the effect of such content on vulnerable age groups such as children and teenagers. It outlines the component approaches (a lexicon-based classifier, n-gram SVM classifiers, and a neural network) along with the training and test data. The conclusion is that combining classifiers can improve their overall efficiency.

1. Multi-level classifier for the detection of insults in social media. March 2015. Conference: 15th Philippine Computing Science Congress, at University of St. Louis, Tuguegarao City, Philippines

2. Outline: Introduction; Old approaches in insult detection; The multi-level classifier (data, lexicon-based classifier, n-gram SVM classifier, neural network); Results; Conclusion

3. Introduction: Progressive growth of social networks; Huge impact on our lives; Vulnerable age groups easily affected; Urgent need for a solution

4. Old approaches in insult detection: Lexical Syntactic Feature (LSF); Naïve Bayes text classifier

5. The multi-level classifier:

6. Data: Training: 3947 rows; Test and verification: 4881 rows

8. Second kind of data: A list of 1048 curse words; Second-person pronoun list: you, your, ya, u, ur, yourself, yo, and yours

9. The lexicon-based classifier: A curse word and a second-person pronoun appearing in the same text are treated as a sign of an insult; The closer they are, the more insulting the text is considered

10. Pseudocode to obtain the lexical score:

11. N-gram SVM classifiers:

12. Neural network: 10 inputs, 3 of which are generated by the lower-level classifiers; One hidden layer with 15 nodes; Learning rate: 5*10^-6

13. Results:

14. Conclusion: The multi-level approach showed that even if some classifiers produce only moderate results on their own, their efficiency can be improved by combining them

15. Thank you for your attention

Editor's Notes

#4: Social media keeps growing and has become part of daily life. People of all ages spend much of their time online on social networking sites, which shows how successful these networks are, but also how strongly they can influence our lives, positively or negatively. Among their users are children and teenagers, who can be badly affected by insulting or offensive content. This creates an urgent need for research on the topic.
#5: Because the need is critical, many researchers have worked on this problem and proposed solutions. One is the Lexical Syntactic Feature (LSF) approach, which detects insults on social media using two kinds of features: lexical features, which treat each word or phrase as an entity, and syntactic features, which identify whom those words are directed at. Another is an experiment conducted by Vandermissen, which resulted in a text classifier based on a naïve Bayes classifier. Both gave very poor precision.
#6: The classifier proposed in this project has three main parts, or four if data preprocessing is counted. After preprocessing, the datasets are passed to two classifiers: the lexicon-based classifier and the n-gram SVM classifier. The lexicon-based classifier (classifier 1) does not learn from the dataset with supervision; it uses only word lists to generate a lexical score, explained later. Classifier 2 contains two SVM classifiers that reduce the dimension of the feature set: preprocessing produces thousands of n-grams, and feeding them directly to the neural network would make it converge very slowly.
#7: Two kinds of data need to be gathered. The first is the training and test sets: this project used a dataset that a cybersecurity start-up released on Kaggle. It contains 3947 rows for training and 4881 rows for test and verification, and has only three columns: the date the text was written, the text itself, and the classification. The classification is binary: insult or not. A text counts as an insult only if it is intended to insult a person who is part of the larger blog or website conversation, which is the reason for the second kind of data.
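As a rough illustration of the three-column layout described in this note, the sketch below reads such a file into (text, label) pairs. The column names `Insult`, `Date`, and `Comment`, the date format, and the two sample rows are assumptions made up for the example; they are not taken from the slides.

```python
import csv
import io

# Made-up sample in an assumed layout: label, date, text.
SAMPLE = '''Insult,Date,Comment
1,20120618192155Z,"You are an idiot"
0,20120528192215Z,"Thanks for sharing this"
'''


def load_rows(handle):
    """Return a list of (text, label) pairs; label 1 means insult, 0 means not."""
    reader = csv.DictReader(handle)
    return [(row["Comment"], int(row["Insult"])) for row in reader]


rows = load_rows(io.StringIO(SAMPLE))
for text, label in rows:
    print(label, text)
```

The date column is read but not used here, since the notes do not say it feeds any of the classifiers.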
#8: Before being sent to the first and second classifiers, the data is preprocessed and divided into three datasets: word n-grams and character n-grams, which are sent to the second classifier (the n-gram SVM classifier), and a third dataset, the second kind of data, which is sent to the lexicon-based classifier.
#9: The second kind of data consists of two word lists needed by the lexicon-based classifier: the curse word list and the second-person pronoun list. The need for the curse word list is self-evident; the second-person pronoun list ensures that the curse words are aimed at someone who is part of the conversation. There are 1048 curse words, gathered from different sources. The second-person pronoun list consists of only 8 words: you, your, ya, u, ur, yourself, yo, and yours.
#10: The lexicon-based classifier assumes that if a text contains both curse words and second-person pronouns, it is more likely to be insulting, and the closer they are to each other, the more insulting the text is considered.
#11: The lexical score is then normalized to fit the range [0, 1].
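The pseudocode for the lexical score is not preserved in this transcript, so the following is only a minimal sketch of one scoring rule consistent with the description: a text scores above zero only when it contains both a curse word and a second-person pronoun, and scores higher the closer they are. The inverse-distance rule, the averaging used for normalization, and the three-word curse list are all assumptions; only the eight pronouns come from the slides.

```python
import re

# Illustrative stand-ins: the source uses a 1048-entry curse list, not reproduced here.
CURSE_WORDS = {"idiot", "moron", "stupid"}
# The eight second-person pronouns listed in the notes.
SECOND_PERSON = {"you", "your", "ya", "u", "ur", "yourself", "yo", "yours"}


def lexical_score(text):
    """Return a score in [0, 1]; higher means curse words sit closer to a pronoun."""
    tokens = re.findall(r"[a-z']+", text.lower())
    curse_positions = [i for i, t in enumerate(tokens) if t in CURSE_WORDS]
    pronoun_positions = [i for i, t in enumerate(tokens) if t in SECOND_PERSON]
    if not curse_positions or not pronoun_positions:
        return 0.0
    # Assumed rule: each curse word contributes 1/distance to its nearest pronoun.
    raw = sum(
        1.0 / min(abs(c - p) for p in pronoun_positions)
        for c in curse_positions
    )
    # Averaging over the curse words keeps the result in [0, 1].
    return raw / len(curse_positions)


print(lexical_score("you idiot"))
print(lexical_score("you are an idiot"))
print(lexical_score("that idiot left"))
```

Under this assumed rule, an adjacent pronoun and curse word give the maximum score, and a curse word with no pronoun in the text gives zero.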
#12: There are two SVM classifiers, one for the character n-grams and one for the word n-grams (the first kind of data described in the data part). Word n-grams range from unigrams to 4-grams; character n-grams range from unigrams to 10-grams. After training, each SVM classifier generates an SVM output vector, one for the characters and one for the words.
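To make the n-gram ranges concrete, here is a small sketch that extracts word n-grams (1 to 4) and character n-grams (1 to 10) from a text. Lowercasing and whitespace tokenization are assumptions; the slides do not describe the preprocessing in detail, and the SVM training step itself is not shown.

```python
def word_ngrams(text, n_min=1, n_max=4):
    """Return all word n-grams of length n_min..n_max, shortest first."""
    tokens = text.lower().split()
    return [
        " ".join(tokens[i:i + n])
        for n in range(n_min, n_max + 1)
        for i in range(len(tokens) - n + 1)
    ]


def char_ngrams(text, n_min=1, n_max=10):
    """Return all character n-grams of length n_min..n_max, shortest first."""
    chars = text.lower()
    return [
        chars[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(chars) - n + 1)
    ]


print(word_ngrams("you are so dumb"))
print(char_ngrams("abc"))
```

Even a short comment yields many n-grams, which illustrates why the notes describe the SVM level as a way to reduce thousands of n-gram features to a single value per classifier before the neural network.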
#13: The neural network, the last level, takes 10 inputs. The first three come from the first two levels of classifiers: the lexical score and the two SVM output vectors. Added to these are the number of curse words, the number of second-person pronouns, the number of characters, the number of exclamation points, the number of asterisks, and the number of capital letters. The neural network has a single hidden layer with 15 nodes, and the learning rate is 5*10^-6.
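The sketch below illustrates the six count features named in this note and a forward pass through a network with 10 inputs and one hidden layer of 15 nodes. Several things are assumptions: the three-word curse list, the sigmoid activations, the single output node, and the random initial weights. The notes name only nine of the ten inputs, so the example passes a generic 10-element vector rather than guessing the tenth; training with the stated learning rate is not shown.

```python
import math
import random
import re

# Illustrative stand-ins; the source uses a 1048-entry curse list not reproduced here.
CURSE_WORDS = {"idiot", "moron", "stupid"}
SECOND_PERSON = {"you", "your", "ya", "u", "ur", "yourself", "yo", "yours"}


def surface_counts(text):
    """Return the six count features named in the notes, in that order."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return [
        sum(t in CURSE_WORDS for t in tokens),    # curse words
        sum(t in SECOND_PERSON for t in tokens),  # second-person pronouns
        len(text),                                # characters
        text.count("!"),                          # exclamation points
        text.count("*"),                          # asterisks
        sum(ch.isupper() for ch in text),         # capital letters
    ]


def init_network(n_inputs=10, n_hidden=15, seed=0):
    """Small random weights; the last weight in each row is the bias."""
    rng = random.Random(seed)
    hidden = [[rng.uniform(-0.1, 0.1) for _ in range(n_inputs + 1)]
              for _ in range(n_hidden)]
    output = [rng.uniform(-0.1, 0.1) for _ in range(n_hidden + 1)]
    return hidden, output


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def forward(network, inputs):
    """One forward pass: inputs -> sigmoid hidden nodes -> one sigmoid output."""
    hidden, output = network
    activations = [
        sigmoid(row[-1] + sum(w * x for w, x in zip(row, inputs)))
        for row in hidden
    ]
    return sigmoid(output[-1] + sum(w * a for w, a in zip(output, activations)))


network = init_network()
print(surface_counts("YOU are an idiot!!"))
print(forward(network, [0.0] * 10))
```

In the full pipeline, the lexical score and the two SVM outputs would take the first three input positions, followed by the count features.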
