Company Segmentation
• Challenge Summary
• Objectives
• Libraries
• Data
• Question
o Step 1 - Convert stock prices to a standardized format (daily returns)
o Step 2 - Convert to User-Item Format
o Step 3 - Perform K-Means Clustering
o Step 4 - Find the optimal value of K
o Step 5 - Apply UMAP
o Step 6 - Combine K-Means and UMAP
Challenge Summary
Your organization wants to know which companies are similar to each other to
help identify potential customers of a SaaS solution (e.g. Salesforce
CRM or equivalent) in various segments of the market. The Sales Department is
very interested in this analysis, which will help them more easily penetrate various
market segments.
We will be using stock prices in this analysis. We will classify companies based on how
their stocks trade using their daily stock returns (percentage movement from one day to
the next). This analysis will help your organization determine which companies are
related to each other (for example, competitors or companies with similar attributes).
We can analyze the stock prices using unsupervised learning tools including K-Means
and UMAP. We’ll use a combination of kmeans() to find groups and umap() to visualize
similarity of daily stock returns.
Objectives
Apply our knowledge of K-Means and UMAP along with dplyr, ggplot2, and purrr to
create a visualization that identifies subgroups in the S&P 500 Index. We’ll achieve this
with:
• Modeling: kmeans() and umap()
• Iteration: purrr
• Data Manipulation: dplyr, tidyr, and tibble
• Visualization: ggplot2 and plotly
Libraries
Load the following libraries.
# install.packages("plotly")
library(tidyverse)
library(tidyquant)
library(broom)
library(umap)
library(plotly)
Data
We will be using stock prices in this analysis. The tidyquant R package contains an API
to retrieve stock prices. The following code is shown so you can see how I obtained the
stock prices for every stock in the S&P 500 index.
# # NOT RUN - WILL TAKE SEVERAL MINUTES TO DOWNLOAD ALL THE STOCK PRICES
# # JUST SHOWN FOR FUN SO YOU KNOW HOW I GOT THE DATA
#
# # GET ALL STOCKS IN A STOCK INDEX (E.G. SP500)
# sp_500_index_tbl <- tq_index("SP500")
# sp_500_index_tbl
#
# # PULL IN STOCK PRICES FOR EACH STOCK IN THE INDEX
# sp_500_prices_tbl <- sp_500_index_tbl %>%
# select(symbol) %>%
# tq_get(get = "stock.prices")
#
# # SAVING THE DATA
# fs::dir_create("week_6_data")
# sp_500_prices_tbl %>% write_rds("week_6_data/sp_500_prices_tbl.rds")
# sp_500_index_tbl %>% write_rds("week_6_data/sp_500_index_tbl.rds")
We can read in the stock prices. The data set has about 1.2M observations. The most
important columns for our analysis are:
• symbol: The stock ticker symbol that corresponds to a company’s stock price
• date: The timestamp relating the symbol to the share price at that point in time
• adjusted: The stock price, adjusted for any splits and dividends (we use this when
analyzing stock data over long periods of time)
# STOCK PRICES
sp_500_prices_tbl <- read_rds("week_6_data/sp_500_prices_tbl.rds")
sp_500_prices_tbl
The second data frame contains information about the stocks. The most important
columns are:
• company: The company name
• sector: The sector that the company belongs to
# SECTOR INFORMATION
sp_500_index_tbl <- read_rds("week_6_data/sp_500_index_tbl.rds")
sp_500_index_tbl
Question
Which stock prices behave similarly?
Answering this question helps us understand which companies are related, and we
can use clustering to help us answer it!
Even if you’re not interested in finance, this is still a great analysis because it will tell you
which companies are competitors and which are likely in the same space (often called
sectors) and can be categorized together. Bottom line - This analysis can help you better
understand the dynamics of the market and competition, which is useful for all types of
analyses from finance to sales to marketing.
Step 1 - Convert stock prices to a standardized format (daily returns)
What you first need to do is get the data in a format that can be converted to a “user-
item” style matrix. The challenge here is to connect the dots between what we have and
what we need to do to format it properly.
We know that in order to compare the data, it needs to be standardized or normalized.
Why? Because we cannot compare values (stock prices) that are of completely different
magnitudes. In order to standardize, we will convert from adjusted stock price (dollar
value) to daily returns (percent change from previous day). Here is the formula.
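$$\text{return}_{\text{daily}} = \frac{\text{price}_{n} - \text{price}_{n-1}}{\text{price}_{n-1}}$$
This formula can be rewritten as
$$\text{return}_{\text{daily}} = \frac{\text{price}_{n}}{\text{price}_{n-1}} - 1$$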
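As a quick check of the formula, here is a minimal sketch in base R using three made-up prices (the values are illustrative and not taken from the data set). It computes the daily return with both forms of the formula so they can be compared.

```r
# Hypothetical adjusted prices on three consecutive trading days
prices <- c(100, 102, 99.96)

price_n      <- prices[-1]               # price on day n
price_n_prev <- prices[-length(prices)]  # price on day n - 1

# Form 1: (price_n - price_{n-1}) / price_{n-1}
returns_diff <- (price_n - price_n_prev) / price_n_prev

# Form 2: price_n / price_{n-1} - 1
returns_ratio <- price_n / price_n_prev - 1

# Both forms agree: a 2% gain followed by a 2% loss
round(returns_ratio, 4)
all.equal(returns_diff, returns_ratio)
```

Note that three prices yield only two returns, because the first day has no previous price to compare against.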
First, what do we have? We have daily stock prices for every stock in the S&P 500 Index,
which covers over 500 stocks. The data set has over 1.2M observations.
sp_500_prices_tbl %>% glimpse()
## Observations: 1,225,765
## Variables: 8
## $ symbol <chr> "MSFT", "MSFT", "MSFT", "MSFT", "MSFT", "MSFT", "MSFT...
## $ date <date> 2009-01-02, 2009-01-05, 2009-01-06, 2009-01-07, 2009...
## $ open <dbl> 19.53, 20.20, 20.75, 20.19, 19.63, 20.17, 19.71, 19.5...
## $ high <dbl> 20.40, 20.67, 21.00, 20.29, 20.19, 20.30, 19.79, 19.9...
## $ low <dbl> 19.37, 20.06, 20.61, 19.48, 19.55, 19.41, 19.30, 19.5...
## $ close <dbl> 20.33, 20.52, 20.76, 19.51, 20.12, 19.52, 19.47, 19.8...
## $ volume <dbl> 50084000, 61475200, 58083400, 72709900, 70255400, 498...
## $ adjusted <dbl> 15.86624, 16.01451, 16.20183, 15.22628, 15.70234, 15....
Our first task is to convert to a tibble named sp_500_daily_returns_tbl by
performing the following operations:
• Select the symbol, date and adjusted columns
• Filter to dates beginning in the year 2018 and beyond.
• Compute a Lag of 1 day on the adjusted stock price. Be sure to group by symbol
first, otherwise we will have lags computed using values from the previous stock
in the data frame.
• Remove any NA values from the lagging operation
• Compute the difference between adjusted and the lag
• Compute the percentage difference by dividing the difference by that lag. Name
this column pct_return.
• Return only the symbol, date, and pct_return columns
• Save as a variable named sp_500_daily_returns_tbl
(sp_500_daily_returns_tbl <- sp_500_prices_tbl %>%
    select(symbol, date, adjusted) %>%
    # Equivalent alternative: filter(date >= ymd('2018-01-01'))
    filter(lubridate::year(date) >= 2018) %>%
    # Sort by date, then group by symbol so lag() never crosses stocks
    arrange(date) %>%
    group_by(symbol) %>%
    mutate(pct_return = adjusted / lag(adjusted) - 1) %>%
    ungroup() %>%
    # The first row of each symbol has no previous price, so its return is NA
    drop_na() %>%
    select(symbol, date, pct_return))
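To see why grouping by symbol matters, here is a small sketch on toy data. The tickers "AAA" and "BBB" and their prices are hypothetical. It compares the lag computed with and without group_by().

```r
library(dplyr)

# Toy prices for two hypothetical tickers (not real data)
toy_prices_tbl <- tibble(
  symbol   = c("AAA", "AAA", "AAA", "BBB", "BBB", "BBB"),
  date     = rep(as.Date("2018-01-02") + 0:2, times = 2),
  adjusted = c(10, 11, 12.1, 200, 190, 209)
)

# Without grouping, the first BBB row is compared against the last AAA price
ungrouped_tbl <- toy_prices_tbl %>%
  mutate(pct_return = adjusted / lag(adjusted) - 1)

# With grouping, each symbol starts with NA and only uses its own prices
grouped_tbl <- toy_prices_tbl %>%
  group_by(symbol) %>%
  mutate(pct_return = adjusted / lag(adjusted) - 1) %>%
  ungroup()

ungrouped_tbl
grouped_tbl
```

In the ungrouped result, the first "BBB" row gets a meaningless return of more than 1,500% (200 compared against 12.1). In the grouped result that row is NA, which is then dropped.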
Step 2 - Convert to User-Item Format
The next step is to convert to a user-item format with the symbol in the first column and
every other column the value of the daily returns (pct_return) for every stock at each
date.
Now that we have the daily returns (percentage change from one day to the next), we
can convert to a user-item format. The user in this case is the symbol (company), and the
item in this case is the pct_return at each date.
• Pivot the date column wider (one column per date) with the percentage returns as
the values. Make sure to fill any NA values with zeros.
• Save the result as stock_date_matrix_tbl
# Convert to User-Item Format
(stock_date_matrix_tbl <- sp_500_daily_returns_tbl %>%
pivot_wider(names_from = date,
values_from = pct_return,
values_fill = list(pct_return = 0))
)
# Double check for NA's with stock_date_matrix_tbl %>% is.na() %>% any()
# Output: stock_date_matrix_tbl
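The following sketch shows the same reshaping on a tiny, made-up table, so the effect of values_fill is easy to see. The tickers and returns are hypothetical. "BBB" has no row for the second date, which would otherwise become an NA.

```r
library(dplyr)
library(tidyr)

# Toy long-format returns; "BBB" has no row for 2018-01-04
toy_returns_tbl <- tibble(
  symbol     = c("AAA", "AAA", "BBB"),
  date       = as.Date(c("2018-01-03", "2018-01-04", "2018-01-03")),
  pct_return = c(0.010, -0.020, 0.005)
)

# One row per symbol, one column per date, missing combinations filled with 0
toy_matrix_tbl <- toy_returns_tbl %>%
  pivot_wider(
    names_from  = date,
    values_from = pct_return,
    values_fill = list(pct_return = 0)
  )

toy_matrix_tbl
anyNA(toy_matrix_tbl)
```

Filling with zero treats a missing day as "no price movement". That is a simplifying assumption, needed because kmeans() cannot handle missing values.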
Step 3 - Perform K-Means Clustering
Next, we’ll perform K-Means clustering.
Beginning with the stock_date_matrix_tbl, perform the following operations:
• Drop the non-numeric column, symbol
• Perform kmeans() with centers = 4 and nstart = 20
• Save the result as k_means_obj
# Create k_means_obj for 4 centers
k_means_obj <- stock_date_matrix_tbl %>%
select(-symbol) %>%
kmeans(centers = 4,
nstart = 20 )
Use glance() to get the tot.withinss.
# Apply glance() to get the tot.withinss
k_means_obj %>% broom::glance()
# k_means_obj %>% broom::augment(stock_date_matrix_tbl)
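For intuition about what tot.withinss measures, here is a minimal base R sketch on synthetic data (the toy matrix is an assumption for illustration only). It recomputes the total within-cluster sum of squares by hand and compares it with the value stored in the kmeans object.

```r
set.seed(123)

# Two well-separated synthetic groups of 20 points each in 2D
toy_matrix <- rbind(
  matrix(rnorm(40, mean = 0, sd = 0.3), ncol = 2),
  matrix(rnorm(40, mean = 3, sd = 0.3), ncol = 2)
)

toy_kmeans <- kmeans(toy_matrix, centers = 2, nstart = 20)

# Look up the center of the cluster each point was assigned to
assigned_centers <- toy_kmeans$centers[toy_kmeans$cluster, ]

# Sum of squared distances from every point to its own cluster center
manual_tot_withinss <- sum((toy_matrix - assigned_centers)^2)

all.equal(manual_tot_withinss, toy_kmeans$tot.withinss)
```

A smaller tot.withinss means the points sit closer to their cluster centers, which is why it is used to compare different values of K.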
Step 4 - Find the optimal value of K
Now that we are familiar with the process for calculating kmeans(), let’s use purrr to
iterate over many values of “k” using the centers argument.
We’ll use this custom function called kmeans_mapper():
kmeans_mapper <- function(center = 3) {
    stock_date_matrix_tbl %>%
        select(-symbol) %>%
        kmeans(centers = center, nstart = 20)
}
Apply the kmeans_mapper() and glance() functions iteratively using purrr.
• Create a tibble containing a column called centers that goes from 1 to 30
• Add a column named k_means with the kmeans_mapper() output. Use mutate()
to add the column and map() to map centers to the kmeans_mapper() function.
• Add a column named glance with the glance() output. Use mutate() and
map() again to iterate over the column of k_means.
• Save the output as k_means_mapped_tbl
# Use purrr to map
k_means_mapped_tbl <- tibble(centers = 1:30) %>%
mutate(k_means = centers %>% map(kmeans_mapper)) %>%
mutate(glance = k_means %>% map(broom::glance))
# Output: k_means_mapped_tbl
k_means_mapped_tbl
Next, let’s visualize the “tot.withinss” from the glance output as a Scree Plot.
• Begin with the k_means_mapped_tbl
• Unnest the glance column
• Plot the centers column (x-axis) versus the tot.withinss column (y-axis) using
geom_point() and geom_line()
• Add a title “Scree Plot”
# Visualize Scree Plot
k_means_mapped_tbl %>%
    unnest(glance) %>%
    select(centers, tot.withinss) %>%
    ggplot(aes(x = centers, y = tot.withinss)) +
    geom_line(color = palette_dark()[1]) +
    # Highlight the candidate range of 5 to 10 centers
    geom_point(aes(color = centers %in% 5:10),
               size = 1.5,
               shape = 7,
               show.legend = FALSE) +
    scale_color_tq() +
    theme_minimal() +
    labs(title = 'Scree Plot')
We can see that the Scree Plot becomes linear (constant rate of change) between 5 and
10 centers for K.
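As a sketch of how to read such a plot numerically, the example below runs kmeans() for several values of K on synthetic data with three obvious groups, then looks at how much tot.withinss drops with each extra center. The toy data and variable names are assumptions for illustration, not the S&P 500 data.

```r
library(purrr)

set.seed(123)

# Three synthetic, well-separated groups of 30 points each in 2D
toy_matrix <- rbind(
  matrix(rnorm(60, mean = 0,  sd = 0.5), ncol = 2),
  matrix(rnorm(60, mean = 5,  sd = 0.5), ncol = 2),
  matrix(rnorm(60, mean = 10, sd = 0.5), ncol = 2)
)

# Fit K-Means for K = 1..6 and keep only tot.withinss
k_values <- 1:6
tot_withinss <- map_dbl(
  k_values,
  ~ kmeans(toy_matrix, centers = .x, nstart = 20)$tot.withinss
)

# Reduction in tot.withinss gained by adding one more center
improvement <- -diff(tot_withinss)
round(improvement, 1)
```

On this toy data the improvement is large up to three centers and small afterwards, which is the "elbow". On real stock returns the elbow is usually less sharp, so the choice of K remains a judgment call.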
Step 5 - Apply UMAP
Next, let’s plot the UMAP 2D visualization to help us investigate cluster assignments.
First, let’s apply the umap() function to the stock_date_matrix_tbl, which contains
our user-item matrix in tibble format.
• Start with stock_date_matrix_tbl
• De-select the symbol column
• Use the umap() function storing the output as umap_results
# Apply UMAP
umap_results <- stock_date_matrix_tbl %>%
select(-symbol) %>%
umap()
# Store results as: umap_results
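UMAP is a stochastic algorithm, so the layout can differ between runs. The sketch below shows one way to make it repeatable by setting random_state in a copy of the package's umap.defaults configuration. It uses the built-in iris measurements as a stand-in for the user-item matrix, which is an assumption made purely for illustration.

```r
library(umap)

# Copy the default configuration and fix the seed for a repeatable layout
custom_config <- umap.defaults
custom_config$random_state <- 123

# Built-in iris measurements stand in for the user-item matrix
toy_matrix <- as.matrix(iris[, 1:4])

toy_umap <- umap(toy_matrix, config = custom_config)

# One row per observation, two columns (the 2D coordinates)
dim(toy_umap$layout)
head(toy_umap$layout)
```

The layout element is a plain matrix with one row per input row, in the same order as the input. This is what makes it safe to bind the symbol column back on afterwards.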
Next, we want to combine the layout from the umap_results with the symbol column
from the stock_date_matrix_tbl.
• Start with umap_results$layout
• Convert from a matrix data type to a tibble with as_tibble()
• Bind the columns of the umap tibble with the symbol column from the
stock_date_matrix_tbl.
• Save the results as umap_results_tbl.
• Change V1 to x and V2 to y, respectively
# Convert umap results to tibble with symbols
(umap_results_tbl <- umap_results$layout %>% as_tibble() %>%
bind_cols(stock_date_matrix_tbl['symbol']) %>%
set_names(c('x','y','symbol'))
)
# Output: umap_results_tbl
Finally, let’s make a quick visualization of the umap_results_tbl.
• Pipe the umap_results_tbl into ggplot() mapping the x and y columns to x-
axis and y-axis
• Add a geom_point() geometry with an alpha = 0.5
• Apply theme_tq() and add a title “UMAP Projection”
# Visualize UMAP results
umap_results_tbl %>%
ggplot(aes(x, y))+
geom_point(alpha = 0.5,
color= palette_light()[1])+
theme_tq()+
labs(
title = 'UMAP Projection'
)
We can now see that we have some clusters. However, we still need to combine the K-
Means clusters and the UMAP 2D representation.
Step 6 - Combine K-Means and UMAP
Next, we combine the K-Means clusters and the UMAP 2D representation.
First, pull out the K-Means object for 10 centers. We use this value because the Scree
Plot flattens beyond it.
• Begin with the k_means_mapped_tbl
• Filter to centers == 10
• Pull the k_means column
• Pluck the first element
• Store this as k_means_obj
# Get the k_means_obj that was fitted with 10 centers
k_means_obj <- k_means_mapped_tbl %>%
filter(centers == 10) %>%
pull(k_means) %>%
pluck(1)
# Alternative:
# k_means_mapped_tbl %>% pull(k_means) %>% pluck(10)
# Store as k_means_obj
Next, we’ll combine the clusters from the k_means_obj with the umap_results_tbl.
• Begin with the k_means_obj
• Augment the k_means_obj with the stock_date_matrix_tbl to get the clusters
added to the end of the tibble
• Select just the symbol and .cluster columns
• Left join the result with the umap_results_tbl by the symbol column
• Left join the result with the result of sp_500_index_tbl %>% select(symbol,
company, sector) by the symbol column.
• Store the output as umap_kmeans_results_tbl
# Use your dplyr & broom skills to combine the k_means_obj with the umap_results_tbl
(umap_kmeans_results_tbl <- k_means_obj %>%
    broom::augment(stock_date_matrix_tbl) %>%
    select(symbol, .cluster) %>%
    left_join(umap_results_tbl, by = "symbol") %>%
    left_join(sp_500_index_tbl %>% select(symbol, company, sector), by = "symbol")
)
# Output: umap_kmeans_results_tbl
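The augment-then-join pattern can be tried on a small scale first. The sketch below uses a made-up table with ten hypothetical symbols and two numeric columns. It shows that augment() adds a .cluster factor column, and that joining by an explicit key keeps the rows aligned.

```r
library(dplyr)
library(broom)

set.seed(123)

# Toy "user-item" table: an id column plus two numeric columns
toy_tbl <- tibble(
  symbol = paste0("S", 1:10),
  d1     = c(rnorm(5, 0, 0.1), rnorm(5, 5, 0.1)),
  d2     = c(rnorm(5, 0, 0.1), rnorm(5, 5, 0.1))
)

toy_kmeans <- toy_tbl %>%
  select(-symbol) %>%
  kmeans(centers = 2, nstart = 20)

# augment() returns the original rows plus a .cluster factor column
toy_clusters_tbl <- augment(toy_kmeans, toy_tbl) %>%
  select(symbol, .cluster)

# Toy 2D coordinates, joined explicitly by the shared key
toy_layout_tbl <- tibble(symbol = paste0("S", 1:10), x = 1:10, y = 10:1)

toy_joined_tbl <- toy_clusters_tbl %>%
  left_join(toy_layout_tbl, by = "symbol")

toy_joined_tbl
```

Naming the key with by = "symbol" avoids relying on dplyr's automatic detection of common columns, and makes the intent of each join visible.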
Plot the K-Means and UMAP results.
• Begin with the umap_kmeans_results_tbl
• Use ggplot() mapping x, y and color = .cluster
• Add the geom_point() geometry with alpha = 0.5
• Apply theme_tq() and scale_color_tq()
Note - If you’ve used centers greater than 12, you will need to use a hack to enable
scale_color_tq() to work. Just replace with: scale_color_manual(values =
palette_light() %>% rep(3))
# Visualize the combined K-Means and UMAP results
umap_kmeans_results_tbl %>%
    ggplot(aes(x, y, color = .cluster)) +
    geom_point(alpha = 0.5) +
    theme_tq() +
    # scale_color_manual(values = palette_light() %>% rep(3))
    scale_color_brewer(palette = 'Paired')
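Since plotly is loaded in the Libraries section, one possible next step is to make the chart interactive so that hovering over a point shows the company. The sketch below is only an illustration under stated assumptions. It uses a made-up stand-in for umap_kmeans_results_tbl with hypothetical symbols and sectors, and builds the tooltip from a text aesthetic.

```r
library(ggplot2)
library(plotly)

# Toy stand-in for umap_kmeans_results_tbl (all values are made up)
toy_results_tbl <- data.frame(
  x        = c(-1.2, -0.8, 2.1, 2.4),
  y        = c(0.5, 0.9, -1.1, -0.7),
  .cluster = factor(c(1, 1, 2, 2)),
  symbol   = c("AAA", "BBB", "CCC", "DDD"),
  sector   = c("Sector A", "Sector A", "Sector B", "Sector B")
)

g <- ggplot(toy_results_tbl, aes(x, y, color = .cluster)) +
  # ggplot2 does not draw the text aesthetic; ggplotly() uses it for tooltips
  geom_point(aes(text = paste0(symbol, "\n", sector)), alpha = 0.5) +
  labs(title = "K-Means clusters on a 2D projection (toy data)")

# Convert the static ggplot into an interactive plotly widget
interactive_plot <- ggplotly(g, tooltip = "text")
```

Printing interactive_plot in an interactive session renders the widget. ggplot2 may warn about the unknown text aesthetic, which is expected here.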