Detecting a Hacked Tweet with Machine Learning (5 Minute Presentation)


This presentation describes a machine learning approach to detecting hacked tweets, using a support vector machine (SVM) trained on a corpus of 6,054 tweets. The process covers tweet extraction, corpus and vocabulary creation, Porter stemming, and digitizing the text with TF*IDF or word existence. With a linear kernel, the SVM reached 100% accuracy on the training set, 97.38% on cross validation, and 96.23% on the test set.


DETECTING A HACKED TWEET
with Machine Learning and Artificial Intelligence
Sponsored by
Kory Becker 2015
http://primaryobjects.com/cms/article158
http://linkedin.com/in/korybecker
http://twitter.com/primaryobjects

ALL YOUR DATA ARE BELONG TO US
- Accord.NET SVM: tried Gaussian kernel (96%), then linear kernel (97%)
- Extract tweets with TweetSharp
- Create document corpus (6,054 tweets)
- Create vocabulary (2,225 words)
- Digitize corpus
- Porter stemmer (“talking” => “talk”, “explosion” => “explos”)
- Term Frequency Inverse Document Frequency (TF*IDF)
- Word existence
- Vector size = vocabulary size | Matrix = double[6054][2225]
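The TF*IDF step listed above can be sketched in a few lines. This is a minimal illustration in Python, not the author's C#/Accord.NET code. The function name `tf_idf_matrix`, the sample tweets, and the weighting used (raw term count times log(N / document frequency)) are assumptions, since the slides do not say which TF*IDF variant was applied.

```python
import math

def tf_idf_matrix(documents, vocabulary):
    """Return one row per document and one column per vocabulary term.

    Assumed weighting: raw term count * log(N / document frequency).
    """
    n_docs = len(documents)
    # Document frequency: how many documents contain each term.
    doc_freq = {term: sum(1 for doc in documents if term in doc)
                for term in vocabulary}
    matrix = []
    for doc in documents:
        row = []
        for term in vocabulary:
            tf = doc.count(term)
            idf = math.log(n_docs / doc_freq[term]) if doc_freq[term] else 0.0
            row.append(tf * idf)
        matrix.append(row)
    return matrix

# Hypothetical tweets, already tokenized and stemmed.
docs = [["break", "explos", "white", "hous"],
        ["white", "hous", "brief"],
        ["stock", "market", "drop"]]
vocab = sorted({term for doc in docs for term in doc})
matrix = tf_idf_matrix(docs, vocab)
```

Each row has as many entries as the vocabulary, which matches the slide's "vector size = vocabulary size". Terms that appear in many tweets receive a lower weight than terms that appear in only one.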

ACCURACY
100% TRAINING
97.38% CV
96.23% TEST
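Each figure above is the percentage of tweets classified correctly within one split of the corpus. A minimal Python sketch of such a split and accuracy measure follows. It is not the author's C#/Accord.NET code, and the 60/20/20 ratio, the fixed seed, and the names `split_dataset` and `accuracy` are assumptions, since the slides do not state how the 6,054 tweets were divided.

```python
import random

def split_dataset(samples, train=0.6, cv=0.2, seed=0):
    """Shuffle samples and split them into training, cross validation, and test sets.

    The 60/20/20 default is an assumption, not a ratio taken from the slides.
    """
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train)
    n_cv = int(len(shuffled) * cv)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_cv],
            shuffled[n_train + n_cv:])

def accuracy(predictions, labels):
    """Percentage of predictions that match the true labels."""
    correct = sum(1 for p, y in zip(predictions, labels) if p == y)
    return 100.0 * correct / len(labels)

# Split placeholder tweet indices; real code would split (vector, label) pairs.
train_set, cv_set, test_set = split_dataset(range(6054))
```

A large gap between training accuracy and cross validation accuracy would suggest overfitting; the reported figures (100% versus 97.38%) show only a small gap.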

CONCLUSION
Kory Becker
http://linkedin.com/in/korybecker
http://twitter.com/primaryobjects
Detecting a Hacked Tweet with Machine Learning
http://primaryobjects.com/CMS/Article158
An Intelligent Approach to Image Classification By Color
http://primaryobjects.com/CMS/Article154
Self-Programming Artificial Intelligence
http://primaryobjects.com/CMS/Article149


2. APRIL 23, 2013 1:15PM 143 POINT DROP


Editor's Notes

#2: 1. Introduction
My name is Kory Becker. I'm a Software Architect at The Associated Press. I develop web applications by day and have a fascination with artificial intelligence. If you like, you can follow the (short) slides for this talk at slideshare.net/korybecker.
#3: 2. What?
On April 23, 2013 the stock market experienced one of its biggest flash-crash drops of the year, with the Dow Jones Industrial Average falling 143 points (over 1%) in a matter of minutes. Unlike the 2012 stock market blip, this one wasn't caused by an individual trade, but by a single tweet from The Associated Press account on Twitter. The tweet, of course, wasn't written by AP, but by an impostor who had temporarily gained control of the account (the Syrian Electronic Army claimed responsibility).

Could a computer program have detected the tweet as hacked? The tweet was "Breaking: Two Explosions in the White House and Barack Obama is injured". There are a couple of specific characteristics in this text. The term "Breaking" has the wrong casing for an AP tweet, which would usually write it in all capitals. The combination of "White House" + "and" + "Barack Obama" is rare. Maybe a computer could pick up on this? So, what did we do?
#4: 3. How?
The idea was to write a program using artificial intelligence, specifically a machine learning algorithm with supervised learning. The computer is given a list of tweets and told whether each tweet is real or fake. It can then learn the common terms in each category and (hopefully) figure out how to detect the hacked tweet.

Using the Accord.NET machine learning library, I started by implementing a support vector machine (SVM) with a Gaussian kernel. SVMs work with different kernels, and a Gaussian kernel allows fitting data points in a variety of non-linear shapes (round, curvy, etc.).

I extracted tweets using the TweetSharp library and created a document corpus of 6,054 tweets and a vocabulary of 2,225 words. The documents were digitized by tokenizing the tweets, running the Porter stemmer to shorten words, and then creating a bag-of-words model. Each tweet's unique terms were added to the vocabulary. Then you loop through each tweet and check each vocabulary word against it. If the word exists in the tweet, you mark a 1 in an array for that tweet. If it doesn't, you mark a 0. You end up with an array of 1s and 0s for each tweet, which is perfect for training a machine learning program.

To train and test the accuracy, the tweets were split into a training set, a cross validation set, and a test set. The computer uses the training set to learn which tweets it classifies right or wrong and to fine-tune its model. It then runs against the cross validation set to see how it does on tweets that it hasn't trained on. So, what were the results?
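The word-existence digitizing described in this note can be sketched as follows. This is a minimal Python illustration, not the author's C#/Accord.NET code. All function names are hypothetical, and `crude_stem` is only a stand-in that strips a few suffixes; it is not the Porter algorithm the talk used.

```python
import re

def crude_stem(word):
    """Placeholder stemmer that strips a few common suffixes. NOT the Porter algorithm."""
    for suffix in ("ions", "ing", "ion", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

def tokenize(tweet):
    """Lowercase the tweet, split it into alphabetic tokens, and stem each one."""
    return [crude_stem(w) for w in re.findall(r"[a-z]+", tweet.lower())]

def build_vocabulary(tweets):
    """Collect every unique stemmed term across the corpus, in a stable order."""
    return sorted({term for tweet in tweets for term in tokenize(tweet)})

def digitize(tweet, vocabulary):
    """Mark 1 for each vocabulary word present in the tweet, 0 otherwise."""
    terms = set(tokenize(tweet))
    return [1 if word in terms else 0 for word in vocabulary]

# Hypothetical two-tweet corpus for illustration.
corpus = ["Breaking: Two Explosions in the White House",
          "White House briefing today"]
vocab = build_vocabulary(corpus)
vector = digitize("Explosion at the White House", vocab)
```

Words that are not in the vocabulary (here "at") are simply ignored, so every tweet maps to a vector of the same length regardless of its wording.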
#5: 4. Result?
The Gaussian kernel did pretty well. It scored 99.7% accuracy on the training set and 96% on the cross validation set. The SVM was then switched to a linear kernel, which bumped the accuracy up to 100% on training and 97% on cross validation.

OK, but did it detect the hacked tweet? The initial training set contained random tweets from AP and non-AP Twitter accounts. The SVM correctly classified AP tweets, but failed on the particular hacked tweet. I then fed the training set additional tweets returned by searches such as "-from:AP obama" and "-from:AP breaking", so it had knowledge of the actual topic. And what do you know, it worked!
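The two kernels compared in this note differ in how they measure the similarity between two tweet vectors. The sketch below shows the standard textbook definitions in Python. It is not Accord.NET's implementation, and the function names and the default `sigma` are assumptions; the source does not give the kernel parameters that were used.

```python
import math

def linear_kernel(x, y):
    """Dot product of two feature vectors."""
    return sum(a * b for a, b in zip(x, y))

def gaussian_kernel(x, y, sigma=1.0):
    """exp(-||x - y||^2 / (2 * sigma^2)); equals 1.0 for identical vectors."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-squared_distance / (2 * sigma ** 2))

# Two hypothetical word-existence vectors that differ in one position.
tweet_a = [1, 0, 1]
tweet_b = [1, 1, 1]
shared_words = linear_kernel(tweet_a, tweet_b)
similarity = gaussian_kernel(tweet_a, tweet_b)
```

For 0/1 word-existence vectors, the linear kernel simply counts the words two tweets share, while the Gaussian kernel decays with the number of positions in which they differ.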
#6: 5. Conclusion
There are a lot more details in this project, including some cool learning curve charts and examples of tweets being classified. You can read my full article at http://www.primaryobjects.com/cms/article158 (the top link in the last slide). There are some code samples for setting up the SVM and you can even download the test set results.

If you're curious about artificial intelligence, I also have some other interesting articles, including Self-Programming Artificial Intelligence (the last link in the slide), where a computer program uses genetic algorithms to successfully write its own computer programs. Scary stuff!

In conclusion, my name is Kory Becker. Feel free to chat if you have any questions or connect online via @primaryobjects on Twitter or Kory Becker on LinkedIn. Thanks.
