Essay on Data Analysis
Raman Kannan
We will start with Ordinary Least Squares (OLS) and review how to verify and validate the
OLS results. Then we will branch out to a few rudimentary tools and techniques for data
mining.
Running Linear Regression using MySQL and R
This is a basic introduction to regression using MySQL. We will verify the results using R,
which is an excellent tool for data analysis. We source our data from Yahoo Finance.
Retrieve data from Yahoo Finance
Export the data from Yahoo Finance into GE.csv and GSPC.csv.
Create tables in MySQL
In MySQL, create the tables as shown above.
Load the data into the newly created tables
I am loading as root. Also, note C:/ (forward slash, not backslash '\').
Data Quality
Quality of data is paramount. I am going to create a new table with data for GSPC and GE,
ensuring identical dates.
I will treat GSPC as the independent variable and GE as the dependent variable, as if GSPC
had reliable predictive capability. I need the slope and intercept calculations from the wiki,
and I will do them in SQL using MySQL math functions.
Query for intercept: -2.46
select avg(y) - (avg(x*y) - avg(x)*avg(y)) / (avg(x*x) - avg(x)*avg(x)) * avg(x) from ols;
Query for slope: 0.02
select (avg(x*y) - avg(x)*avg(y)) / (avg(x*x) - avg(x)*avg(x)) from ols;
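To see that the two queries match the textbook formulas, here is a small Python sketch (Python is used purely as an illustration here; the essay's pipeline stays in MySQL and R). The sample xs/ys values are made up for the example.

```python
# Slope = (E[xy] - E[x]E[y]) / (E[x^2] - E[x]^2), intercept = E[y] - slope * E[x]
# -- the same moment formulas the two SQL queries apply to the ols table.

def mean(v):
    return sum(v) / len(v)

def ols_slope_intercept(xs, ys):
    mx, my = mean(xs), mean(ys)
    mxy = mean([x * y for x, y in zip(xs, ys)])
    mxx = mean([x * x for x in xs])
    slope = (mxy - mx * my) / (mxx - mx * mx)
    intercept = my - slope * mx
    return slope, intercept

# Made-up check: points on y = 2x + 1 recover slope 2 and intercept 1.
print(ols_slope_intercept([1.0, 2.0, 3.0, 4.0], [3.0, 5.0, 7.0, 9.0]))  # (2.0, 1.0)
```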
In continuation of this thread: the slope and intercept are shown above; the dataset and these
SQL queries are in the attached xls.
Is this mathematically fine? But does GSPC really cause a move in GE like that?
Could this be random? How likely is it to be random?
We need R-squared -- this gives us the goodness of fit
(is this formula trustworthy? does X explain Y?).
Then comes the p-value. I will cover that in Part III.
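R-squared itself is easy to state: the share of the variance of y that the fitted line explains. A minimal Python sketch (illustrative only; the function name is mine):

```python
# R-squared = 1 - SS_res / SS_tot: the fraction of y's variance explained by the line.

def r_squared(xs, ys, slope, intercept):
    my = sum(ys) / len(ys)
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - my) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot

# A perfect fit gives R-squared of exactly 1.
print(r_squared([1.0, 2.0, 3.0], [3.0, 5.0, 7.0], 2.0, 1.0))  # 1.0
```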
Recall that the ols table holds the time series of GSPC and GE.
We can extract the time series into a csv file as follows.
I now have xy.csv.
Let us install R and load this file into R as follows:
data <- read.table("C:/Users/mari/Downloads/xy.csv", header=TRUE, sep="\t")
If you examine data you will see it has two columns, accessed as data[,1] and data[,2]:
y <- data[,1]
x <- data[,2]
# we run a linear regression
m <- lm(y ~ x)
summary(m)
The intercept is -2.458 and the slope is 2.069e-02 -- precisely the intercept and slope we had
come up with using MySQL.
Since the p-value, 2e-16, is much less than 0.05 or even 0.01, we reject the null hypothesis.
This being a single-variable linear regression, the null hypothesis is that the slope is zero and
there is no relationship between GSPC and GE.
We reject the claim that there is no relationship between GSPC and GE. The p-value is a
statistical measure -- its significance is of statistical origin. Statisticians quibble about
"Evidence of Absence vs Absence of Evidence" -- EA != AE. Correlation and causation are
two different things.
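What summary(m) does under the hood for that p-value can be sketched as: t = slope / SE(slope), referred to a t distribution with n - 2 degrees of freedom. The Python below is an illustration only; for a large n it substitutes a normal approximation for the exact t distribution, so it is not exactly what R reports. The example data is made up.

```python
from statistics import NormalDist

def slope_t_test(xs, ys):
    """t-statistic and approximate two-sided p-value for H0: slope = 0."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    a = my - b * mx
    rss = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    se_b = (rss / (n - 2) / sxx) ** 0.5          # standard error of the slope
    t = b / se_b
    p = 2.0 * (1.0 - NormalDist().cdf(abs(t)))   # normal approximation to the t
    return b, t, p

# Made-up data with a strong linear relationship: a tiny p-value, so we
# would reject H0 (no relationship), just as in the GSPC/GE regression.
xs = [float(i) for i in range(1, 11)]
ys = [2.0 * x + 1.0 + (0.1 if i % 2 else -0.1) for i, x in enumerate(xs)]
b, t, p = slope_t_test(xs, ys)
print(p < 0.01)  # True
```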
Some of you enquired about SE, the t-test, etc.; http://stattrek.com/regression/slope-test.aspx is your
source.
Any statistical analysis must be verified for consistency, validity and accuracy. That is, the model and
the prevailing reality must be valid, accurate and consistent with the assumptions under which the
analysis is done. OLS is constructed on the following assumptions:
– Residuals are normally distributed
– Variance is constant
– Absence of serial correlation
Here is how one can verify that those assumptions hold and therefore that the inferences are worth
further consideration. This “audit” is paramount, and it is the sole responsibility of the data scientist –
the one constructing the models and conducting the experiment and the analysis.
Validating OLS
Confirm the vectors are present, run the model using lm, and extract
the residuals using the residuals() function.
Confirming the residuals are normally distributed
Since the p-value is more than 0.05, we cannot reject the null
hypothesis. The null hypothesis for this test is that the residuals are
normally distributed. The Anderson-Darling test works the same way.
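The screenshot uses a dedicated normality test; the same idea can be sketched by hand with the Jarque-Bera statistic, which needs only the residuals' skewness and kurtosis. A Python illustration (not the test in the screenshot; the chi-squared(2) survival function conveniently reduces to exp(-JB/2)):

```python
import math

def jarque_bera(resid):
    """Jarque-Bera normality test; H0: the residuals are normal."""
    n = len(resid)
    m = sum(resid) / n
    m2 = sum((v - m) ** 2 for v in resid) / n
    m3 = sum((v - m) ** 3 for v in resid) / n
    m4 = sum((v - m) ** 4 for v in resid) / n
    skew = m3 / m2 ** 1.5
    kurt = m4 / m2 ** 2
    jb = n / 6.0 * (skew ** 2 + (kurt - 3.0) ** 2 / 4.0)
    return jb, math.exp(-jb / 2.0)  # p-value from chi-squared with 2 df

# Heavily skewed made-up residuals: p is essentially 0, so normality is rejected.
jb, p = jarque_bera([0.1] * 90 + [10.0] * 10)
print(p < 0.001)  # True
```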
Confirming constant variance
The statistical “speak” for this is heteroskedasticity (which means the variance is not constant). We run
bptest, for which the null hypothesis is that the variance is constant (homoskedastic). Note that I have
used bptest from the lmtest package. Here again we reject the null if p is less than 0.05, in which case it
is appropriate to assume that the residuals do not exhibit constant variance.
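The mechanics behind bptest can be sketched by hand: regress the squared residuals on x and refer LM = n times the R-squared of that auxiliary regression to chi-squared(1). A hedged Python illustration (not lmtest's exact studentized variant):

```python
import math

def breusch_pagan(xs, resid):
    """LM statistic and p-value; H0: the residual variance is constant."""
    n = len(xs)
    e2 = [e * e for e in resid]               # squared residuals
    mx, me = sum(xs) / n, sum(e2) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (v - me) for x, v in zip(xs, e2)) / sxx
    a = me - b * mx
    ss_res = sum((v - (a + b * x)) ** 2 for x, v in zip(xs, e2))
    ss_tot = sum((v - me) ** 2 for v in e2)
    if ss_tot == 0.0:                         # perfectly constant squared residuals
        return 0.0, 1.0
    lm = n * (1.0 - ss_res / ss_tot)          # n * R-squared of the auxiliary fit
    return lm, math.erfc(math.sqrt(lm / 2.0))  # chi-squared(1) survival function

# Made-up residuals whose spread grows with x: small p, so heteroskedastic.
xs = [float(i) for i in range(1, 101)]
resid = [0.1 * x * (1 if i % 2 else -1) for i, x in enumerate(xs)]
lm, p = breusch_pagan(xs, resid)
print(p < 0.05)  # True
```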
Confirming that there is no Serial or Auto Correlation
Here again the p-value is less than 0.05 and
therefore we reject the null hypothesis. The
null hypothesis for this test is that there is no
auto-correlation. To learn more about these
tests, kindly visit the wiki or bing it.
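The statistic behind this check is Durbin-Watson: roughly 2 means no first-order autocorrelation, while values toward 0 or 4 signal positive or negative serial correlation respectively. A minimal Python sketch (illustrative only; dwtest in lmtest also supplies a p-value, which this omits):

```python
def durbin_watson(resid):
    """Durbin-Watson statistic: ~2 means no first-order autocorrelation."""
    num = sum((resid[i] - resid[i - 1]) ** 2 for i in range(1, len(resid)))
    den = sum(e * e for e in resid)
    return num / den

# Made-up extremes: a constant series is perfectly positively correlated
# (DW = 0); a strictly alternating series is strongly negatively correlated.
print(durbin_watson([1.0] * 50))                        # 0.0
print(durbin_watson([(-1.0) ** i for i in range(50)]))  # 3.92
```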
Loading libraries as needed
Try require as shown.
If require is not successful, load lmtest as follows
and then re-run the require command as above. Use the Packages
menu at the top to install a package.
Normality of the Residuals: Visual Verification
The residuals can be extracted as follows,
and the following Q-Q plot is generated. If the
points deviate from the line, then the residuals
are not from a Normal distribution.
I have kept screenshots that show typos and
miscellaneous problems, for a reason.
Getting multiple technologies to work together
requires the determination to complete the
task and never quit. We have to persist, and to
highlight that, I have shown the snafus I experienced.
Beyond OLS – Numerical Regression
OLS is the very first step toward becoming a data analyst. From here we move on to analyzing
categorical data. Real-world problems come with very large datasets, and very large datasets come
with many predictor variables. Performing an analysis that includes all of them quickly escalates into a
computational hog and, most importantly, all of the variables may not even be relevant or influential.
More often than not, blindly including all the variables results in spurious fitting – over- or under-fitting.
Therefore, before embarking on more involved analysis, we must validate the features or dimensions
we wish to include in our analysis. This can be done using standard “Dimensionality Reduction”
techniques such as Principal Component Analysis (PCA) or Singular Value Decomposition (SVD).
R provides exceptional support for performing PCA.
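For just two features, the whole PCA computation fits in a few lines: centre the data, form the 2x2 covariance matrix, and solve for its eigenvalues in closed form. A Python sketch of the idea (illustration only; in R one would simply call prcomp):

```python
import math

def pca2_explained(xs, ys):
    """Fraction of total variance captured by the first principal component."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cxx = sum((x - mx) ** 2 for x in xs) / (n - 1)
    cyy = sum((y - my) ** 2 for y in ys) / (n - 1)
    cxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)
    # eigenvalues of [[cxx, cxy], [cxy, cyy]] via trace and determinant
    tr, det = cxx + cyy, cxx * cyy - cxy * cxy
    root = math.sqrt(max(tr * tr / 4.0 - det, 0.0))
    l1, l2 = tr / 2.0 + root, tr / 2.0 - root
    return l1 / (l1 + l2)

# Perfectly collinear made-up data: one component carries all the variance,
# so the second dimension is redundant and could be dropped.
print(round(pca2_explained([1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0]), 6))  # 1.0
```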
Then we move on to predictive analytics and simple applications of data mining techniques such as
unsupervised/supervised learning, also known as clustering and classification. In subsequent essays, we
will cover unsupervised learning techniques including K-means and hierarchical clustering in R.
Please visit http://www.slideshare.net/rk2153/introduction-to-unsupervised-learning-kmeans-and-h
to get a bird's-eye view. These techniques are extremely useful for a diverse set of business
applications ranging from customer segmentation to fraud detection.
Questions and feedback are always welcome. Let us have some fun with data.