Predicting Business Confidence using News and Social Media
Why Predict Business Confidence?
Financial
Crisis
John Cai, University of Cambridge 1
Why use News Media and Social Media data?
Data Collection
8000+
CNN Articles containing
“US Economy” scraped
92000+
Tweets containing “US
Economy” obtained
32000+
Tweets obtained from
@realDonaldTrump
6000+
NYT Articles from the
Economy Section scraped
News Data Twitter Data
Fig 1: Business Confidence vs GDP Growth
1. Financial Crisis: The drop in business confidence preceded US GDP growth data by 1 month.
2. News: Scraped using Selenium, Beautiful Soup and Newspaper in Python.
3. Tweets: Obtained using Python’s Twitterscraper .
(2) (3)
Trump
Presidency
Lead Indicator
OECD Business Confidence
Index (BCI) is a key lead
indicator of GDP growth, as
shown in Fig 1. BCI measures
expectations using surveys.
Infrequency
OECD Business Confidence
Index is published monthly.
We do not currently have
daily or weekly estimates of
business confidence.
High Frequency
Sentiments in news media
and social media change in
real-time. We are able to
construct estimates of daily
business confidence.
Broad-based
Traditional estimates of daily
sentiments are from financial
markets. News and Social
Media capture sentiments in
the broader real economy.
- VIX index (uses option prices to measure volatility)
- S&P500 daily returns and change in daily returns
Financial Data
(1)
Note: Y-axis is the standardized value of the variables
Building a Prediction Model using NLP and ML
John Cai, University of Cambridge 2
1. NLTK: developed by Stanford and trained on movie reviews. VADER: Developed by Georgia Tech and trained on tweets.
2. Cross-validation using h-step ahead: Calculates MSPE by taking the actual value – forecasted value from the model.
3. Plot of Squared Prediction Error over time, to see how the relative performance of the two models varies over the test period.
Use Textblob to tokenize the
textual data at the words level
and the sentence level. Remove
stop-words and creating n-grams.
NLP
Perform Features Scaling with
scikit-learn by standardizing all
features. Set aside 20% of the
data for Testing (2016-2018) and
another 20% for Validation.
1. Data Pre-processing
Employ VADER’s lexical and rule-
based classifier and NLTK’s
Naïve Bayes Classifier to obtain
polarity and/or subjectivity.
2. Sentiment Analysis
Compute the average and cross-
sectional variance of sentiments
over every month. Omit features
with unit roots.
3. Feature Engineering
4. Data Preparation
Employ Cross-Validation LASSO
with a rolling forecast origin and
fixed window (adapted for time-
series). LASSO selects features
with high prediction value.
5. Training the Model
0
0.1
0.2
0.3
0.4
Oct-16 Apr-17 Oct-17 Apr-18 Oct-18
Less Parsimonious More Parsimonious
ML
Fig 2: Test-Set Squared Error
!"#$ , which corresponds to a
more parsimonious model, is
preferred as the model has a
smaller MSPE for out-of-sample
predictions (shown in Fig 2).
6. Model Selection
Training subset Validation
Get Mean Squared Prediction
Error (MSPE) over the Test Set
Cross Validation Loop
Tunes hyper-parameters by
minimizing the MSPE from h-
step ahead forecasting over
the Validation SetGives penalty terms !%&' and !"#$,
which allows us to select
features.
Training Set Test
(1)
(2)
(3)
John Cai, University of Cambridge 3
1. LASSO selected the VADER Score rather than the NLTK Score, likely because VADER is trained on Social Media data.
2. Uncertainty is reflected in sentiment variance. In recessions, uncertainty and sentiment variance increase (Bloom, 2018).
3. Markov Switching Models would account for the structural breaks expected during recessions (Hamilton, 2010).
Fig 3: Best Prediction Model (!"#$) Fig 4: Model Evaluation
Neutral BullishBearish
High
Error
Very Low
Error
Low
Error
Limitations and ExtensionsResults and Implications
1. Twitter sentiments are
most informative of BCI
Twitter has the highest value in
prediction compared to NYT,
CNN and Trump’s Tweets.
2. Results are consistent
with economic theory
Variance of sentiments predicts
business confidence because it
is counter-cyclical (-- coefficent).
Features Selected (sign of
coeff)
1. Twitter Polarity Mean (+)
2. Twitter Polarity Variance (--)
3. VIX Index (--)
4. Returns (+)
5. Lagged Returns (+)
6. Lagged BCI (+)
3. Model works best during
neutral and bullish periods
As shown in Fig 4, the squared
prediction error is much higher
for bearish periods.
4. Possible extension:
Markov Switching Models
Performance in bearish periods
could improve if Markov chains
are used for structural breaks.
Analyzing and Evaluating Results from the Prediction Model
(1) (2) (3)

More Related Content

PDF
Predicting Business Confidence using News and Social Media
PDF
Analyzing Real Time News
PPTX
Rethinking China- 30 sept 2021 - war-room slides
PPT
Can Consumers Predict Macroeconomics Variables
PPTX
B.2 concept of power
PDF
Google Trends
PDF
Spotle AI-thon Top 10 Showcase - Analysing Mental Health Of India - Team La c...
PDF
IRJET- Analysis of Brand Value Prediction based on Social Media Data
Predicting Business Confidence using News and Social Media
Analyzing Real Time News
Rethinking China- 30 sept 2021 - war-room slides
Can Consumers Predict Macroeconomics Variables
B.2 concept of power
Google Trends
Spotle AI-thon Top 10 Showcase - Analysing Mental Health Of India - Team La c...
IRJET- Analysis of Brand Value Prediction based on Social Media Data

Similar to Predicting Business Confidence using NLP (20)

PDF
Machine Learning in Banking
PDF
Bitcoin Price Prediction using Sentiment and Historical Price
PDF
w-cyber-risk-modeling Owasp cyber risk quantification 2018
PDF
ML in banking
PDF
Size matters a lot rick collins - technomics
PDF
Pillar III presentation 2 27-15 - redacted version
DOCX
Text book title and AuthorMoeller, Robert R. IT audit, cont.docx
PPTX
Global CISO Forum 2017: How To Measure Anything In Cybersecurity Risk
PDF
IRJET - Bankruptcy Score Indexing
PDF
Investment Portfolio Risk Manager using Machine Learning and Deep-Learning.
PDF
PREDICTING BANKRUPTCY USING MACHINE LEARNING ALGORITHMS
PPTX
THE REACTION DATA ANALYSIS OFCOVID-19 VACCINATIONS
PPTX
Population Stability Index(PSI) for Big Data World
PDF
Report 190804110930
PDF
Predicting Bank Customer Churn Using Classification
PDF
COVID Sentiment Analysis of Social Media Data Using Enhanced Stacked Ensemble
PDF
IRJET - Twitter Sentimental Analysis
PPTX
Nick Jesteadt: Predictive Attrition Using Survival Analysis
PPTX
Stock Market Trends Prediction after Earning Release.pptx
PDF
Quant Foundry Labs - Low Probability Defaults
Machine Learning in Banking
Bitcoin Price Prediction using Sentiment and Historical Price
w-cyber-risk-modeling Owasp cyber risk quantification 2018
ML in banking
Size matters a lot rick collins - technomics
Pillar III presentation 2 27-15 - redacted version
Text book title and AuthorMoeller, Robert R. IT audit, cont.docx
Global CISO Forum 2017: How To Measure Anything In Cybersecurity Risk
IRJET - Bankruptcy Score Indexing
Investment Portfolio Risk Manager using Machine Learning and Deep-Learning.
PREDICTING BANKRUPTCY USING MACHINE LEARNING ALGORITHMS
THE REACTION DATA ANALYSIS OFCOVID-19 VACCINATIONS
Population Stability Index(PSI) for Big Data World
Report 190804110930
Predicting Bank Customer Churn Using Classification
COVID Sentiment Analysis of Social Media Data Using Enhanced Stacked Ensemble
IRJET - Twitter Sentimental Analysis
Nick Jesteadt: Predictive Attrition Using Survival Analysis
Stock Market Trends Prediction after Earning Release.pptx
Quant Foundry Labs - Low Probability Defaults
Ad

Recently uploaded (20)

PPTX
Tapan_20220802057_Researchinternship_final_stage.pptx
PPTX
Lesson-01intheselfoflifeofthekennyrogersoftheunderstandoftheunderstanded
PDF
Navigating the Thai Supplements Landscape.pdf
PPTX
FMIS 108 and AISlaudon_mis17_ppt_ch11.pptx
PDF
An essential collection of rules designed to help businesses manage and reduc...
PDF
Best Data Science Professional Certificates in the USA | IABAC
PPTX
Caseware_IDEA_Detailed_Presentation.pptx
PPTX
IMPACT OF LANDSLIDE.....................
PDF
Session 11 - Data Visualization Storytelling (2).pdf
PPT
statistic analysis for study - data collection
PPTX
Topic 5 Presentation 5 Lesson 5 Corporate Fin
PPTX
statsppt this is statistics ppt for giving knowledge about this topic
PPT
statistics analysis - topic 3 - describing data visually
PPTX
eGramSWARAJ-PPT Training Module for beginners
DOCX
Factor Analysis Word Document Presentation
PPTX
DS-40-Pre-Engagement and Kickoff deck - v8.0.pptx
PPTX
recommendation Project PPT with details attached
PPTX
Business_Capability_Map_Collection__pptx
PPTX
chuitkarjhanbijunsdivndsijvndiucbhsaxnmzsicvjsd
PPTX
retention in jsjsksksksnbsndjddjdnFPD.pptx
Tapan_20220802057_Researchinternship_final_stage.pptx
Lesson-01intheselfoflifeofthekennyrogersoftheunderstandoftheunderstanded
Navigating the Thai Supplements Landscape.pdf
FMIS 108 and AISlaudon_mis17_ppt_ch11.pptx
An essential collection of rules designed to help businesses manage and reduc...
Best Data Science Professional Certificates in the USA | IABAC
Caseware_IDEA_Detailed_Presentation.pptx
IMPACT OF LANDSLIDE.....................
Session 11 - Data Visualization Storytelling (2).pdf
statistic analysis for study - data collection
Topic 5 Presentation 5 Lesson 5 Corporate Fin
statsppt this is statistics ppt for giving knowledge about this topic
statistics analysis - topic 3 - describing data visually
eGramSWARAJ-PPT Training Module for beginners
Factor Analysis Word Document Presentation
DS-40-Pre-Engagement and Kickoff deck - v8.0.pptx
recommendation Project PPT with details attached
Business_Capability_Map_Collection__pptx
chuitkarjhanbijunsdivndsijvndiucbhsaxnmzsicvjsd
retention in jsjsksksksnbsndjddjdnFPD.pptx
Ad

Predicting Business Confidence using NLP

  • 1. Predicting Business Confidence using News and Social Media Why Predict Business Confidence? Financial Crisis John Cai, University of Cambridge 1 Why use News Media and Social Media data? Data Collection 8000+ CNN Articles containing “US Economy” scraped 92000+ Tweets containing “US Economy” obtained 32000+ Tweets obtained from @realDonaldTrump 6000+ NYT Articles from the Economy Section scraped News Data Twitter Data Fig 1: Business Confidence vs GDP Growth 1. Financial Crisis: The drop in business confidence preceded US GDP growth data by 1 month. 2. News: Scraped using Selenium, Beautiful Soup and Newspaper in Python. 3. Tweets: Obtained using Python’s Twitterscraper . (2) (3) Trump Presidency Lead Indicator OECD Business Confidence Index (BCI) is a key lead indicator of GDP growth, as shown in Fig 1. BCI measures expectations using surveys. Infrequency OECD Business Confidence Index is published monthly. We do not currently have daily or weekly estimates of business confidence. High Frequency Sentiments in news media and social media change in real-time. We are able to construct estimates of daily business confidence. Broad-based Traditional estimates of daily sentiments are from financial markets. News and Social Media capture sentiments in the broader real economy. - VIX index (uses option prices to measure volatility) - S&P500 daily returns and change in daily returns Financial Data (1) Note: Y-axis is the standardized value of the variables
  • 2. Building a Prediction Model using NLP and ML John Cai, University of Cambridge 2 1. NLTK: developed by Stanford and trained on movie reviews. VADER: Developed by Georgia Tech and trained on tweets. 2. Cross-validation using h-step ahead: Calculates MSPE by taking the actual value – forecasted value from the model. 3. Plot of Squared Prediction Error over time, to see how the relative performance of the two models varies over the test period. Use Textblob to tokenize the textual data at the words level and the sentence level. Remove stop-words and creating n-grams. NLP Perform Features Scaling with scikit-learn by standardizing all features. Set aside 20% of the data for Testing (2016-2018) and another 20% for Validation. 1. Data Pre-processing Employ VADER’s lexical and rule- based classifier and NLTK’s Naïve Bayes Classifier to obtain polarity and/or subjectivity. 2. Sentiment Analysis Compute the average and cross- sectional variance of sentiments over every month. Omit features with unit roots. 3. Feature Engineering 4. Data Preparation Employ Cross-Validation LASSO with a rolling forecast origin and fixed window (adapted for time- series). LASSO selects features with high prediction value. 5. Training the Model 0 0.1 0.2 0.3 0.4 Oct-16 Apr-17 Oct-17 Apr-18 Oct-18 Less Parsimonious More Parsimonious ML Fig 2: Test-Set Squared Error !"#$ , which corresponds to a more parsimonious model, is preferred as the model has a smaller MSPE for out-of-sample predictions (shown in Fig 2). 6. Model Selection Training subset Validation Get Mean Squared Prediction Error (MSPE) over the Test Set Cross Validation Loop Tunes hyper-parameters by minimizing the MSPE from h- step ahead forecasting over the Validation SetGives penalty terms !%&' and !"#$, which allows us to select features. Training Set Test (1) (2) (3)
  • 3. John Cai, University of Cambridge 3 1. LASSO selected the VADER Score rather than the NLTK Score, likely because VADER is trained on Social Media data. 2. Uncertainty is reflected in sentiment variance. In recessions, uncertainty and sentiment variance increase (Bloom, 2018). 3. Markov Switching Models would account for the structural breaks expected during recessions (Hamilton, 2010). Fig 3: Best Prediction Model (!"#$) Fig 4: Model Evaluation Neutral BullishBearish High Error Very Low Error Low Error Limitations and ExtensionsResults and Implications 1. Twitter sentiments are most informative of BCI Twitter has the highest value in prediction compared to NYT, CNN and Trump’s Tweets. 2. Results are consistent with economic theory Variance of sentiments predicts business confidence because it is counter-cyclical (-- coefficent). Features Selected (sign of coeff) 1. Twitter Polarity Mean (+) 2. Twitter Polarity Variance (--) 3. VIX Index (--) 4. Returns (+) 5. Lagged Returns (+) 6. Lagged BCI (+) 3. Model works best during neutral and bullish periods As shown in Fig 4, the squared prediction error is much higher for bearish periods. 4. Possible extension: Markov Switching Models Performance in bearish periods could improve if Markov chains are used for structural breaks. Analyzing and Evaluating Results from the Prediction Model (1) (2) (3)