Essay on Data Analysis
Raman Kannan
We will start with Ordinary Least Squares (OLS) and review how to verify and validate the
OLS results. Then we will branch out to a few rudimentary tools and techniques for data
mining.
Running Linear Regression using MySQL and R
This is a basic introduction to regression using MySQL. We will verify the results using R,
which is an excellent tool for data analysis. We source our data from Yahoo Finance.
Retrieve data from Yahoo Finance
Export the data from Yahoo Finance into GE.csv and GSPC.csv.
Create tables in MySQL
In MySQL, create the tables as shown above.
Load the data into the newly created tables
I am loading as root. Also, note C:/ (forward slash, not backslash '\').
Data Quality
Quality of data is paramount. I am going to create a new table with data for GSPC and GE,
ensuring identical dates.
I will treat GSPC as the independent variable and GE as the dependent variable, as if GSPC
had reliable predictive capability. I need the slope and intercept calculations from the wiki,
and I will do them in SQL using MySQL math functions.
Query for intercept: -2.46
select avg(y) - (avg(x*y) - avg(x)*avg(y)) / (avg(x*x) - avg(x)*avg(x)) * avg(x) from ols;
Query for slope: 0.02
select (avg(x*y) - avg(x)*avg(y)) / (avg(x*x) - avg(x)*avg(x)) from ols;
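To see that the two queries match the textbook formulas, here is a small Python sketch (Python is used purely as an illustration here; the essay's pipeline stays in MySQL and R). The sample xs/ys values are made up for the example.

```python
# Slope = (E[xy] - E[x]E[y]) / (E[x^2] - E[x]^2), intercept = E[y] - slope * E[x]
# -- the same moment formulas the two SQL queries apply to the ols table.

def mean(v):
    return sum(v) / len(v)

def ols_slope_intercept(xs, ys):
    mx, my = mean(xs), mean(ys)
    mxy = mean([x * y for x, y in zip(xs, ys)])
    mxx = mean([x * x for x in xs])
    slope = (mxy - mx * my) / (mxx - mx * mx)
    intercept = my - slope * mx
    return slope, intercept

# Made-up check: points on y = 2x + 1 recover slope 2 and intercept 1.
print(ols_slope_intercept([1.0, 2.0, 3.0, 4.0], [3.0, 5.0, 7.0, 9.0]))  # (2.0, 1.0)
```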
In continuation of this thread: the slope and intercept are shown above; the dataset and these
SQL queries are in the attached xls.
Is this mathematically fine? But does GSPC really cause a move in GE like that?
Could this be random? How likely is it to be random?
We need R-squared -- this gives us the goodness of fit
(is this formula trustworthy? does X explain Y?).
Then comes the p-value. I will cover that in Part III.
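R-squared itself is easy to state: the share of the variance of y that the fitted line explains. A minimal Python sketch (illustrative only; the function name is mine):

```python
# R-squared = 1 - SS_res / SS_tot: the fraction of y's variance explained by the line.

def r_squared(xs, ys, slope, intercept):
    my = sum(ys) / len(ys)
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - my) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot

# A perfect fit gives R-squared of exactly 1.
print(r_squared([1.0, 2.0, 3.0], [3.0, 5.0, 7.0], 2.0, 1.0))  # 1.0
```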
Recall that the ols table holds the time series of GSPC and GE.
We can extract the time series into a csv file as follows.
I now have xy.csv.
Let us install R and load this file into R as follows:
data <- read.table("C:/Users/mari/Downloads/xy.csv", header=TRUE, sep="\t")
If you examine data you will see it has two columns, accessed as data[,1] and data[,2]:
y <- data[,1]
x <- data[,2]
# we run a linear regression
m <- lm(y ~ x)
summary(m)
The intercept is -2.458 and the slope is 2.069e-02 -- precisely the intercept and slope we had
come up with using MySQL.
Since the p-value, 2e-16, is much less than 0.05 or even 0.01, we reject the null hypothesis.
This being a single-variable linear regression, the null hypothesis is that the slope is zero and
there is no relationship between GSPC and GE.
We reject the claim that there is no relationship between GSPC and GE. The p-value is a
statistical measure -- its significance is of statistical origin. Statisticians quibble about
"Evidence of Absence vs Absence of Evidence" -- EA != AE. Correlation and causation are
two different things.
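What summary(m) does under the hood for that p-value can be sketched as: t = slope / SE(slope), referred to a t distribution with n - 2 degrees of freedom. The Python below is an illustration only; for a large n it substitutes a normal approximation for the exact t distribution, so it is not exactly what R reports. The example data is made up.

```python
from statistics import NormalDist

def slope_t_test(xs, ys):
    """t-statistic and approximate two-sided p-value for H0: slope = 0."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    a = my - b * mx
    rss = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    se_b = (rss / (n - 2) / sxx) ** 0.5          # standard error of the slope
    t = b / se_b
    p = 2.0 * (1.0 - NormalDist().cdf(abs(t)))   # normal approximation to the t
    return b, t, p

# Made-up data with a strong linear relationship: a tiny p-value, so we
# would reject H0 (no relationship), just as in the GSPC/GE regression.
xs = [float(i) for i in range(1, 11)]
ys = [2.0 * x + 1.0 + (0.1 if i % 2 else -0.1) for i, x in enumerate(xs)]
b, t, p = slope_t_test(xs, ys)
print(p < 0.01)  # True
```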
Some of you enquired about SE, the t-test, etc.; http://stattrek.com/regression/slope-test.aspx is your
source.
Any statistical analysis must be verified for consistency, validity and accuracy. That is, the model and
the prevailing reality must be valid, accurate and consistent with the assumptions under which the
analysis is done. OLS is constructed on the following assumptions:
– Residuals are normally distributed
– Variance is constant
– Absence of serial correlation
Here is how one can verify that those assumptions hold and therefore that the inferences are worth
further consideration. This “audit” is paramount, and it is the sole responsibility of the data scientist –
the one constructing the models and conducting the experiment and the analysis.
Validating OLS
Confirm the vectors are present, run the model using lm, and extract
the residuals using the residuals() function.
Confirming the residuals are normally distributed
Since the p-value is more than 0.05, we cannot reject the null
hypothesis. The null hypothesis for this test is that the residuals are
normally distributed. The Anderson-Darling test works the same way.
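The screenshot uses a dedicated normality test; the same idea can be sketched by hand with the Jarque-Bera statistic, which needs only the residuals' skewness and kurtosis. A Python illustration (not the test in the screenshot; the chi-squared(2) survival function conveniently reduces to exp(-JB/2)):

```python
import math

def jarque_bera(resid):
    """Jarque-Bera normality test; H0: the residuals are normal."""
    n = len(resid)
    m = sum(resid) / n
    m2 = sum((v - m) ** 2 for v in resid) / n
    m3 = sum((v - m) ** 3 for v in resid) / n
    m4 = sum((v - m) ** 4 for v in resid) / n
    skew = m3 / m2 ** 1.5
    kurt = m4 / m2 ** 2
    jb = n / 6.0 * (skew ** 2 + (kurt - 3.0) ** 2 / 4.0)
    return jb, math.exp(-jb / 2.0)  # p-value from chi-squared with 2 df

# Heavily skewed made-up residuals: p is essentially 0, so normality is rejected.
jb, p = jarque_bera([0.1] * 90 + [10.0] * 10)
print(p < 0.001)  # True
```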
Confirming constant variance
The statistical “speak” for this is heteroskedasticity (which means the variance is not constant). We run
bptest, for which the null hypothesis is that the variance is constant (homoskedastic). Note that I have
used bptest from the lmtest package. Here again we reject the null if p is less than 0.05, in which case it
is appropriate to assume that the residuals do not exhibit constant variance.
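The mechanics behind bptest can be sketched by hand: regress the squared residuals on x and refer LM = n times the R-squared of that auxiliary regression to chi-squared(1). A hedged Python illustration (not lmtest's exact studentized variant):

```python
import math

def breusch_pagan(xs, resid):
    """LM statistic and p-value; H0: the residual variance is constant."""
    n = len(xs)
    e2 = [e * e for e in resid]               # squared residuals
    mx, me = sum(xs) / n, sum(e2) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (v - me) for x, v in zip(xs, e2)) / sxx
    a = me - b * mx
    ss_res = sum((v - (a + b * x)) ** 2 for x, v in zip(xs, e2))
    ss_tot = sum((v - me) ** 2 for v in e2)
    if ss_tot == 0.0:                         # perfectly constant squared residuals
        return 0.0, 1.0
    lm = n * (1.0 - ss_res / ss_tot)          # n * R-squared of the auxiliary fit
    return lm, math.erfc(math.sqrt(lm / 2.0))  # chi-squared(1) survival function

# Made-up residuals whose spread grows with x: small p, so heteroskedastic.
xs = [float(i) for i in range(1, 101)]
resid = [0.1 * x * (1 if i % 2 else -1) for i, x in enumerate(xs)]
lm, p = breusch_pagan(xs, resid)
print(p < 0.05)  # True
```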
Confirming that there is no Serial or Auto Correlation
Here again the p-value is less than 0.05 and
therefore we reject the null hypothesis. The
null hypothesis for this test is that there is no
auto-correlation. To learn more about these
tests, kindly visit the wiki or bing it.
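The statistic behind this check is Durbin-Watson: roughly 2 means no first-order autocorrelation, while values toward 0 or 4 signal positive or negative serial correlation respectively. A minimal Python sketch (illustrative only; dwtest in lmtest also supplies a p-value, which this omits):

```python
def durbin_watson(resid):
    """Durbin-Watson statistic: ~2 means no first-order autocorrelation."""
    num = sum((resid[i] - resid[i - 1]) ** 2 for i in range(1, len(resid)))
    den = sum(e * e for e in resid)
    return num / den

# Made-up extremes: a constant series is perfectly positively correlated
# (DW = 0); a strictly alternating series is strongly negatively correlated.
print(durbin_watson([1.0] * 50))                        # 0.0
print(durbin_watson([(-1.0) ** i for i in range(50)]))  # 3.92
```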
Loading libraries as needed
Try require as shown.
If require is not successful, load lmtest as follows
and then re-run the require command as above. Use the Packages
menu at the top to install a package.
Normality of the Residuals: Visual Verification
The residuals can be extracted as follows,
and the following Q-Q plot is generated. If the
points deviate from the line, then the residuals
are not from a Normal distribution.
I have kept screenshots that show typos and
miscellaneous problems, for a reason.
Getting multiple technologies to work together
requires the determination to complete the
task and never quit. We have to persist, and to
highlight that, I have shown the snafus I experienced.
Beyond OLS – Numerical Regression
OLS is the very first step toward becoming a data analyst. From here we move on to analyzing
categorical data. Real-world problems come with very large datasets, and very large datasets come
with many predictor variables. Performing an analysis that includes all of them quickly escalates into a
computational hog and, most importantly, all of the variables may not even be relevant or influential.
More often than not, blindly including all the variables results in spurious fitting – over- or under-fitting.
Therefore, before embarking on more involved analysis, we must validate the features or dimensions
we wish to include in our analysis. This can be done using standard “Dimensionality Reduction”
techniques such as Principal Component Analysis (PCA) or Singular Value Decomposition (SVD).
R provides exceptional support for performing PCA.
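For just two features, the whole PCA computation fits in a few lines: centre the data, form the 2x2 covariance matrix, and solve for its eigenvalues in closed form. A Python sketch of the idea (illustration only; in R one would simply call prcomp):

```python
import math

def pca2_explained(xs, ys):
    """Fraction of total variance captured by the first principal component."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cxx = sum((x - mx) ** 2 for x in xs) / (n - 1)
    cyy = sum((y - my) ** 2 for y in ys) / (n - 1)
    cxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)
    # eigenvalues of [[cxx, cxy], [cxy, cyy]] via trace and determinant
    tr, det = cxx + cyy, cxx * cyy - cxy * cxy
    root = math.sqrt(max(tr * tr / 4.0 - det, 0.0))
    l1, l2 = tr / 2.0 + root, tr / 2.0 - root
    return l1 / (l1 + l2)

# Perfectly collinear made-up data: one component carries all the variance,
# so the second dimension is redundant and could be dropped.
print(round(pca2_explained([1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0]), 6))  # 1.0
```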
Then we move on to predictive analytics and simple applications of data mining techniques such as
unsupervised/supervised learning, also known as clustering and classification. In subsequent essays, we
will cover unsupervised learning techniques including K-means and hierarchical clustering in R.
Please visit http://www.slideshare.net/rk2153/introduction-to-unsupervised-learning-kmeans-and-h
to get a bird's-eye view. These techniques are extremely useful for a diverse set of business
applications ranging from customer segmentation to fraud detection.
Questions and feedback are always welcome. Let us have some fun with data.