2017 CodeFest
$how Me the Money
Kory Becker
October 2017, http://primaryobjects.com
Unsupervised Learning
Exploratory data analysis
Discovers patterns in unlabeled data
No training set
No error rate for potential solution
K-means Clustering, Markov Chains, Feature Extraction, Principal Component Analysis (Dimensionality Reduction)
K-Means Clustering
Popular clustering algorithm
Groups data into k clusters
Data points belong to the cluster with closest mean
Each cluster has a centroid (center)
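As a concrete illustration (not code from the talk), the short Python sketch below clusters a synthetic set of 2D points with scikit-learn's KMeans; the data and parameter values are made up for demonstration.

import numpy as np
from sklearn.cluster import KMeans

# Synthetic 2D data points (illustrative only)
rng = np.random.default_rng(42)
points = rng.normal(size=(300, 2))

# Group the points into k = 3 clusters
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42).fit(points)

print(kmeans.cluster_centers_)   # one centroid (center) per cluster
print(kmeans.labels_[:10])       # each point's closest-centroid assignment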
Clustering Example 1
Clustering Example 1
Clustering Example 1
Clustering Example 2
Clustering Example 2
What About Text?
Natural language processing
Term document matrix
Digitize text into an array of 0’s and 1’s by term
Remove sparse terms (infrequently occurring terms)
Reduced dimensionality
Compressed data
Speed
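A minimal sketch of these ideas (not the talk's actual code): scikit-learn's CountVectorizer builds a binary term-document matrix, and its min_df parameter drops sparse terms that appear in too few documents. The example documents are placeholders.

from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "Cats like to chase mice.",
    "Dogs like to eat big bones.",
    "Cats and dogs like to play.",
]

# binary=True digitizes each term as 0 or 1; min_df=2 strips terms
# that occur in fewer than two documents (sparse terms)
vectorizer = CountVectorizer(binary=True, min_df=2)
tdm = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())   # surviving dictionary terms
print(tdm.toarray())                        # one row of 0s and 1s per document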
Natural Language Processing
Convert text into a numerical representation
Find commonalities within data
Clustering
Make predictions from data
Classification
Category, Popularity, Sentiment, Relationships
Bag of Words Model
Corpus
Cats like to chase mice.
Dogs like to eat big bones.
Create a Dictionary
Dictionary
0 - cats
1 - like
2 - chase
3 - mice
4 - dogs
5 - eat
6 - big
7 - bones
Corpus
Cats like to chase mice.
Dogs like to eat big bones.
Digitize Text
Cats like to chase mice.
1 1 1 1 0 0 0 0
Dogs like to eat big bones.
0 1 0 0 1 1 1 1
Vector Length = 8
Corpus
Dictionary
0 - cats
1 - like
2 - chase
3 - mice
4 - dogs
5 - eat
6 - big
7 - bones
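The slide's example can be reproduced in a few lines of Python (a sketch for illustration, not the original implementation): build the ordered dictionary, drop the stop word "to", and digitize each sentence into a fixed-length vector of 0s and 1s.

# The slide's dictionary, in order, after removing the stop word "to"
dictionary = ["cats", "like", "chase", "mice", "dogs", "eat", "big", "bones"]
stop_words = {"to"}

def digitize(sentence, dictionary):
    # lowercase, strip punctuation, drop stop words, then flag each dictionary term
    tokens = {w.strip(".,").lower() for w in sentence.split()} - stop_words
    return [1 if term in tokens else 0 for term in dictionary]

print(digitize("Cats like to chase mice.", dictionary))     # [1, 1, 1, 1, 0, 0, 0, 0]
print(digitize("Dogs like to eat big bones.", dictionary))  # [0, 1, 0, 0, 1, 1, 1, 1]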
Unigrams vs Bigrams
Unigrams
George
Bush
Clooney
Bigrams
George Bush
George Clooney
N-grams?
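To make the difference concrete, here is a small, purely illustrative sketch that tokenizes a toy corpus as unigrams and as bigrams using scikit-learn's CountVectorizer (the documents are invented; the talk itself does not show code).

from sklearn.feature_extraction.text import CountVectorizer

docs = ["George Bush speaks", "George Clooney speaks"]

# ngram_range controls the tokenization: (1, 1) = unigrams, (2, 2) = bigrams
unigrams = CountVectorizer(ngram_range=(1, 1)).fit(docs)
bigrams = CountVectorizer(ngram_range=(2, 2)).fit(docs)

print(unigrams.get_feature_names_out())  # ['bush' 'clooney' 'george' 'speaks']
print(bigrams.get_feature_names_out())   # ['bush speaks' 'clooney speaks' 'george bush' 'george clooney']

The bigrams keep "George Bush" and "George Clooney" distinct, which unigrams cannot, at the cost of matching fewer documents per term.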
ML + News + ??? = Profit!
Extract news stories
Build corpus of headlines
Use bigrams (word pairs)
Strip sparse terms
Apply k-means clustering
... and what do we get?
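Putting the pipeline together, a rough Python sketch might look like the following. It stands in for whatever tooling the talk actually used; the headlines, cluster count, and thresholds are placeholder values, not the real news data.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import CountVectorizer

# Placeholder headlines standing in for the real news corpus
headlines = [
    "stocks rally as markets rebound",
    "markets rebound after early losses",
    "local team wins championship game",
    "championship game draws record crowd",
]

# Bigram term-document matrix; min_df strips sparse (rarely occurring) bigrams
vectorizer = CountVectorizer(ngram_range=(2, 2), binary=True, min_df=2)
tdm = vectorizer.fit_transform(headlines)

# Cluster the headlines with k-means
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(tdm)

# Highest-count terms in each cluster serve as a rough "trending topic" label
terms = vectorizer.get_feature_names_out()
for c in range(2):
    counts = np.asarray(tdm[kmeans.labels_ == c].sum(axis=0)).ravel()
    top = [terms[i] for i in counts.argsort()[::-1][:3]]
    print("cluster", c, "->", top)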
Visualizing Clusters
Visualizing Clusters
Visualizing Clusters
Visualizing Clusters
Additional Reading
Discovering Trending Topics in News
http://primaryobjects.com/CMS/Article162
Mirroring Your Twitter Personal with Intelligence
http://primaryobjects.com/CMS/Article160
TF*IDF with .NET
http://primaryobjects.com/CMS/Article157
Thank you!
Kory Becker
http://primaryobjects.com
@primaryobjects

Editor's Notes

  • #3: Unsupervised learning is a type of exploratory data analysis. Unlike supervised learning, it doesn’t require labeled outputs, a training set, cross-validation, or a test set. Give it a pile of data and the algorithm will try to make sense of it on its own, discovering patterns. Unsupervised learning is also a key ingredient of deep learning (layers of unsupervised neural networks learn to recognize abstract patterns and feed into a supervised layer for fine-tuning).
  • #4: One of the most common algorithms used for unsupervised learning is k-means clustering. It works by grouping data into a specified number of groups, also called “clusters”. Each data point in the data set belongs to the closest cluster, and each cluster has a centroid (the center of the cluster). K-means is a simple yet powerful algorithm for automatically grouping data, and it often serves as a first go-to algorithm for any data exploration project. Let’s take a look at how it works.
  • #5: Now that we have an idea of how the algorithm works, let’s see an example. In the picture above we have a series of data points scattered within the plot. The data seems to have some kind of pattern, but the points are mostly random within it. Suppose we want to divide this data into 6 groups (or clusters). You can probably get a visual idea of where those boundaries would be, dividing the data into 6 parts, one at each spoke. But what if we want to cluster into 3 groups? What would that look like? Let’s run through the k-means algorithm and cluster this data into 3 groups, starting by initializing 3 random centroids within the data.
  • #6: We’ve added 3 random centroids to the data. They appear fairly well spaced apart, but they are indeed randomly placed. Each point has been assigned to its closest centroid, coloring the area in that centroid’s color. For example, consider the blue area: the point at the far top-right, sitting right on the boundary between blue and green, might look closer to green, but it is actually closer to the blue centroid. The same goes for every other point in its assigned cluster. With the data points assigned, the next step is to move each centroid to the center of its assigned points. The blue centroid will shift slightly up and to the right so that it sits squarely in the center of the blue area; likewise, the green centroid will shift slightly right and down, and the red centroid will shift slightly to the right. After the centroids shift, some of the data points are re-assigned. For example, when the blue centroid moves to the right, some points that were assigned to the green centroid are now closer to the blue centroid, so they are re-assigned to blue. We repeat this process until the centroids stop shifting or the data points stop changing clusters, at which point the k-means algorithm has converged. (A from-scratch sketch of these steps appears after these notes.)
  • #7: This image shows the final iteration of the k-means algorithm, grouping our data into 3 clusters. You can see how the data is evenly divided, with each point assigned to its respective cluster.
  • #8: Let’s see one more example. This time, we’ll use 6 clusters. In this image, it’s easy to see the randomness of the initial cluster placements. The groups are nowhere near equal. Let’s see what the final iteration of the k-means algorithm looks like with 6 clusters.
  • #9: You can see how the groups are now evenly divided, with 6 clusters displayed with their respective assigned data points.
  • #10: Text can be clustered too! First convert it to a bit vector using a bag-of-words / term-document matrix; this is the key step in natural language processing. Each document is reduced to an array of 1’s and 0’s, one entry per term (1 if the dictionary term appears in the document, 0 if it does not). Remove sparse terms (words not appearing in many documents) to reduce dimensionality and compress the data. In the example data, removing sparse terms reduced memory usage from 2GB to 91MB.
  • #11: Natural Language Processing: The most basic form of natural language processing is simply converting text into a numerical representation, so each document becomes a same-sized array of numbers. With this, you can apply machine learning algorithms such as clustering and classification, which lets you build insights into a set of documents: category, popularity, sentiment, and relationships. This is the same type of processing that many popular online machine learning APIs use to classify data; for example, IBM Watson, Microsoft, Amazon, and Google all include NLP APIs for working with text.
  • #12: Bag of Words Model: Let’s take a look at a quick example. Here are two documents: “Cats like to chase mice.” and “Dogs like to eat big bones.” We’re going to try to categorize these documents as being about “eating”. To do this, we’ll build a bag-of-words model and then apply a classification algorithm. The first thing to note is that the two documents are of different lengths; in practice, documents will almost always differ in length. This is fine, because after we digitize the corpus, the resulting data fits neatly into same-sized vectors.
  • #13: Create a Dictionary: The first step is to create a dictionary from our corpus. We pre-process the corpus, removing the stop word “to” and stemming the remaining terms. Next, we find each unique term and add it to our dictionary; the resulting list is shown on the right side of the slide. Our dictionary contains 8 terms.
  • #14: Digitize Text: With our dictionary created, we can now digitize the documents. Since our dictionary has 8 terms, each document will be encoded into a vector of length 8. This ensures that all documents end up having the same length, which makes them easier to process with machine learning algorithms. Let’s look at the first document. We take the first term in the dictionary and see if it exists in the first document. The term is “cats”, which does indeed exist, so we set a 1 as the first bit. The next term is “like”; again, it exists in the first document, so we set a 1 as the next bit. This repeats until we reach the term “dogs”, which does not exist in the first document, so we set a 0. We run through all terms in the dictionary and end up with a vector of length 8 for the first document. We then repeat the same steps for the second document, going through each term in the dictionary and checking whether it exists in the document.
  • #15: Which words should we include in our dictionary? In other words, how should we tokenize the text? Take every word: “and”, “or”, “boy”, “dog”, etc.? No. We remove stop words and apply a Porter stemmer to reduce longer words to their stems, then tokenize by either individual words (unigrams) or word pairs (bigrams). Bigrams give more unique clusters, but the downside is that they match fewer documents each, because finding documents that contain the same pair of words is less likely than finding documents with the same single words. You can go further with N-grams, but this reduces the number of items in each cluster even further (although the clusters will be more unique); in the extreme case, N-grams will assign each headline to its own cluster.
  • #16: What can we do with news data? Read the news database and extract headlines, build a corpus, use bigrams, strip sparse terms, and apply k-means clustering. Get the highest-count terms in each cluster -> trending topics!
  • #17-#20: Examples of the results. Each word cloud corresponds to a set of news stories. If you assigned each cluster a trending-topic name (by term popularity), you could, for example, display a dropdown of trending topics; selecting one could take the user to a results page of news stories that correspond to that topic.
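The from-scratch sketch referenced in note #6 (illustrative only; it assumes random initial centroids sampled from the data and plain Euclidean distance):

import numpy as np

def kmeans(points, k, max_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # 1. Initialize k random centroids by sampling from the data points.
    centroids = points[rng.choice(len(points), size=k, replace=False)].copy()
    labels = None
    for _ in range(max_iters):
        # 2. Assign each point to the cluster with the closest centroid.
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = distances.argmin(axis=1)
        # 4. Stop once assignments no longer change (the algorithm has converged).
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # 3. Move each centroid to the center (mean) of its assigned points.
        for c in range(k):
            if np.any(labels == c):
                centroids[c] = points[labels == c].mean(axis=0)
    return centroids, labels

# Example: cluster 200 random 2D points into 3 groups
points = np.random.default_rng(42).normal(size=(200, 2))
centroids, labels = kmeans(points, k=3)
print(centroids)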