Marshall University
College of Science
Department of Mathematics
- STA 564 -
Time Series Analysis and Forecasting
Focused on Air Pollution in an Urban Area
By Kenneth Guzman
December 7, 2018
Contents
1 INTRODUCTION
2 Yeo-Johnson transformation, Kolmogorov-Smirnov Test for normality
3 Factor Analysis and PCA
4 Box Jenkins Methodology
5 Personal Study Carried out in R
6 R Code Explanation and Software Packages Used
7 Conclusion
References
TIME SERIES ANALYSIS FORECASTING
NOTES
At its core, air pollution in the atmosphere is strongly governed by meteorology. However,
the "univariate" models we will consider assume that the concentration of an air pollutant
in the atmosphere is the final result of all the complex interactions of meteorology,
chemistry, transport, diffusion, etc. The combined information about their effect on the
pollutant concentration is therefore contained, in a stochastic way, in the corresponding
time series. With this approach, calculations are simplified and performed using only the
time series of the pollutant, without explicit inclusion of meteorological or other
measurements.
Four professors from Plovdiv University in Bulgaria produced a research paper on time series
analysis of air pollution. The methods used were explicitly stated in their article as:
(i) Identify correlation-type dependencies and groupings of the observed air pollutants using
factor analysis, to explain the mutual effects of pollution.
(ii) Conduct time series analysis by determining relevant parametric seasonal ARIMA models of
the pollutants (based on hourly data).
(iii) Analyze and diagnose the constructed models.
(iv) Apply the models to short-term forecasting.
(v) Interpret the results and define the conditions contributing to the exceeding of national
and European concentration norms for the considered air pollutants.
Their study was carried out using IBM SPSS 19 and EViews 7.[3]
1 INTRODUCTION
Even though there are established regulations for monitoring and controlling effects on
air quality in certain territories, air quality may remain unsatisfactory. Let us consider
the particular case of the town of Blagoevgrad, Bulgaria. Blagoevgrad is a typical
representative of a small urban region, with a population of approximately 70,000.
Time span of study: a 1-year period from September 1st, 2011 to August 31st, 2012. Six air
pollutants were observed, based on hourly measurements. Factor analysis and Box-Jenkins
methodology were applied to inspect concentrations of the primary air pollutants of interest.
The pollutants were grouped into three factors, and the degree of contribution of each factor
to the overall pollution was determined; this contribution was interpreted as the presence
of common sources of pollution. The classical techniques of principal component analysis
(PCA) and factor analysis are important statistical instruments frequently used in the
environmental sciences.
The study focused on time series analysis and the development of univariate stochastic
seasonal autoregressive integrated moving average (SARIMA) models, with seasonality defined
on the basis of the hourly recordings. The study incorporates the Yeo-Johnson power
transformation to stabilize the variance of the data, and selects models using the Bayesian
Information Criterion. The SARIMA models obtained in the Bulgarian study fit the observed
air pollutants well and gave good short-term predictions for 72 hours ahead, specifically
in the case of ozone and particulate matter PM10. The methods presented allowed the building
of less complex models that are effective for short-term air pollution forecasting and
useful for advance warning purposes in urban areas.[3]
Continuous and careful monitoring and forecasting of atmospheric air pollutants is important
when evaluating regulatory control measures related to air quality. In Bulgaria, 12 types
of pollutants are systematically monitored by more than 36 automated stations run by the
Executive Environment Agency (EEA), which manages and coordinates activities related to
the control and environmental protection of the country. Atmospheric air quality reports
for the various regions of the country are regularly published, so a large amount of data
accumulates. This accumulated data is what allows statistical analysis to be carried out,
leading to the discovery of general patterns and dependencies over different time periods
and of relationships between the observed air pollutants. The air pollutants observed in the
study carried out in Blagoevgrad, Bulgaria are particulate matter PM10, nitrogen oxide NO,
nitrogen dioxide NO2, nitrogen oxides NOx, sulfur dioxide SO2, and ground level ozone O3.
The measurements are expressed in units of mass concentration, µg/m³; only NOx is in
ppb (parts per billion), as it covers pollution from all kinds of nitrogen oxides. The data
consisted of 8,744 observations (hourly data).
The goal of their study was to demonstrate the capabilities of the mentioned methods, which
can be applied to other recorded data sets, including ones covering shorter or longer
periods of time.
2 Yeo-Johnson transformation, Kolmogorov-Smirnov Test for normality
Time series data often requires preparation before forecasting methods are applied. A normal
or near-normal distribution of the univariate data is important because it reduces issues
when we forecast future values. The obtained Kolmogorov-Smirnov (K-S) statistic indicated
non-normality of the data collected in Bulgaria, which led to the transformation of the
data prior to constructing the forecasting models. In that particular case the Yeo-Johnson
transformation was carried out, after which the transformed data satisfied the
Kolmogorov-Smirnov test for normality at the 0.05 level of significance and could be assumed
to be normally distributed. The Yeo-Johnson transformation finds the optimal value of lambda
that minimizes the Kullback-Leibler¹ distance between the normal distribution and the
transformed distribution.[1][2]
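As a rough illustration of the normality check described above, the following Python sketch computes the K-S statistic of a sample against a normal distribution fitted to that sample. It is not the procedure used in the Bulgarian study, which relied on SPSS and EViews. The function names, the sample data, and the 1.36/sqrt(n) large-sample approximation of the 0.05 critical value are assumptions made for illustration. Because the mean and standard deviation are estimated from the same sample, that critical value is only approximate here.

```python
import math
from statistics import NormalDist, fmean, stdev


def ks_statistic_normal(sample):
    """Largest gap between the empirical CDF and a fitted normal CDF."""
    xs = sorted(sample)
    n = len(xs)
    fitted = NormalDist(mu=fmean(xs), sigma=stdev(xs))
    d = 0.0
    for i, x in enumerate(xs, start=1):
        cdf = fitted.cdf(x)
        # The empirical CDF jumps from (i - 1)/n to i/n at x.
        d = max(d, i / n - cdf, cdf - (i - 1) / n)
    return d


def approx_critical_value(n, coefficient=1.36):
    """Assumed large-sample approximation of the 0.05 critical value."""
    return coefficient / math.sqrt(n)


# Hypothetical data: evenly spaced normal quantiles, and a skewed copy of them.
n = 200
symmetric = [NormalDist().inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
skewed = [math.exp(z) for z in symmetric]

for name, data in (("symmetric", symmetric), ("skewed", skewed)):
    d = ks_statistic_normal(data)
    verdict = "reject" if d > approx_critical_value(n) else "do not reject"
    print(f"{name}: D = {d:.3f}, {verdict} normality at the 0.05 level")
```

A statistic above the critical value is what would motivate transforming the data before modeling.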
Properties of Yeo-Johnson transformation below:
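\[
g(x; \lambda) =
\begin{cases}
\dfrac{(x + 1)^{\lambda} - 1}{\lambda}, & \lambda \neq 0,\ x \geq 0 \\
\log(x + 1), & \lambda = 0,\ x \geq 0 \\
\dfrac{(1 - x)^{2 - \lambda} - 1}{\lambda - 2}, & \lambda \neq 2,\ x < 0 \\
-\log(1 - x), & \lambda = 2,\ x < 0
\end{cases}
\]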
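The four cases of the transformation above can be written as a short Python function. This is a minimal sketch for a fixed lambda; it does not estimate lambda, which in the study described later was done by the yeojohnson function in R. The function name and the sample concentrations are hypothetical, and the lambda value is the one reported later in this report for the 3 year PM-2.5 data.

```python
import math


def yeo_johnson(x, lam):
    """Apply the Yeo-Johnson transformation g(x; lambda) to one value."""
    if x >= 0:
        if lam != 0:
            return ((x + 1) ** lam - 1) / lam
        return math.log(x + 1)
    if lam != 2:
        return ((1 - x) ** (2 - lam) - 1) / (lam - 2)
    return -math.log(1 - x)


# Hypothetical concentrations, transformed with a fixed lambda.
observations = [0.0, 3.2, 7.5, 12.1, 30.4]
lam = 0.227158
transformed = [yeo_johnson(x, lam) for x in observations]
print([round(t, 4) for t in transformed])
```

With lambda = 1 the function returns its input unchanged, which is a convenient sanity check on any implementation.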
3 Factor Analysis and PCA
The statistical techniques of factor analysis and principal component analysis help identify
patterns in the correlation between variables. The patterns identified are used to create
factors, which was the case in Bulgaria and allowed the grouping of correlated pollutants.
The steps followed for the particular case in Bulgaria were: (a) calculation of the
correlation matrix, (b) testing the adequacy of factor analysis, (c) factor extraction,
(d) factor rotation, and (e) score calculation of the factor variables. The particular
advantages of these methods are that they reveal strong correlation relationships between
observed variables and allow their grouping into new variables (factors) in order to reduce
the dimensions of the complex data structure. The factors can thereafter be used to build
regression or other types of models.[5]
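Step (a) above, the calculation of the correlation matrix, can be sketched in Python as follows. This only illustrates Pearson correlation between pollutant series; the later steps (adequacy testing, extraction, rotation, scoring) are not shown. The function names and the readings are hypothetical and are not data from either study.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long series."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def correlation_matrix(columns):
    """Map every pair of column names to their correlation."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}


# Hypothetical readings for three pollutants.
readings = {
    "NO": [12.0, 15.0, 11.0, 20.0, 18.0, 25.0],
    "NO2": [22.0, 27.0, 21.0, 35.0, 31.0, 44.0],
    "O3": [60.0, 52.0, 63.0, 41.0, 47.0, 30.0],
}
matrix = correlation_matrix(readings)
for name, row in matrix.items():
    print(name, [round(row[other], 2) for other in readings])
```

Pollutants whose pairwise correlations are strong are the candidates for being grouped into a common factor.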
4 Box Jenkins Methodology
Other methods frequently used in time series analysis and forecasting are the auto-regressive
integrated moving average (ARIMA) and seasonal ARIMA (SARIMA) models, also known as
Box-Jenkins stochastic models. Box-Jenkins methodology is widely applied in air quality
research, among other disciplines, and is a systematic strategy for identifying, fitting, and
forecasting univariate time series data. ARIMA models generally take the form ARIMA(p,d,q),
where p is the number of auto-regressive terms, d is the number of nonseasonal differences
needed to reach stationarity, and q is the number of lagged forecast errors in the prediction
equation. Similarly, SARIMA models take the general form ARIMA(p,d,q)(P,D,Q)s, where P is
the number of seasonal auto-regressive terms, D is the order of seasonal differencing, and
Q is the number of seasonal moving average terms. In the seasonal part of the model, the
three parameters P, D, Q operate across multiples of lag s, where s is the number of time
periods until a pattern repeats itself.
¹ In mathematical statistics, the Kullback-Leibler divergence (also called relative entropy) is
a measure of how one probability distribution is different from a second, reference
probability distribution.
Main advantages of the Box-Jenkins approach:
(i) Applicability for modeling and forecasting practically any time series that is stationary
or can be reduced to a stationary one by a differencing procedure.
(ii) Ability to capture the trends and serial correlations in the data within one general
model equation, leaving only a minimal white noise (shock) sequence, so that the model
reflects how the historical data developed.
(iii) The method has been incorporated into many standard software packages, available in R,
SPSS, etc., which speeds up and assists the modeling process considerably.
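The differencing procedure mentioned in advantage (i), and counted by the orders d and D, can be sketched in Python. The series below is hypothetical: a linear trend plus a pattern that repeats every s = 4 steps. The helper name `difference` is an assumption for illustration, not a function from any package used in this report.

```python
def difference(series, lag=1):
    """Return series[t] - series[t - lag] for every t >= lag."""
    return [series[t] - series[t - lag] for t in range(lag, len(series))]


# Hypothetical series: a linear trend plus a pattern that repeats every s = 4 steps.
s = 4
pattern = [0, 5, 2, 8]
series = [t + pattern[t % s] for t in range(12)]

seasonal = difference(series, lag=s)      # one seasonal difference (D = 1)
stationary = difference(seasonal, lag=1)  # one nonseasonal difference (d = 1)

print(series)
print(seasonal)
print(stationary)
```

Here the seasonal difference removes the repeating pattern and leaves a constant, and the additional first difference reduces that constant to zeros. Each differencing step shortens the series by its lag.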
5 Personal Study Carried out in R
Using the presented methods, I carried out my own study in the statistical software R.
Using data provided by the Environmental Protection Agency (EPA) here in the United States
(https://www.epa.gov/outdoor-air-quality-data), I accessed pollutant concentration data for
the city of Richmond, Virginia, which has a population of approximately 220,000.
Time span of observed data: a total of 4 years of data was accessed, covering January 2010
to December 2013 and based on weekly measurements of the following air pollutants.
Concentrations of particulate matter PM2.5, particulate matter PM10, and lead Pb are
expressed in units of mass concentration (µg/m³); carbon monoxide CO and ground level ozone
O3 are in ppm (parts per million); sulfur dioxide SO2 and nitrogen dioxide NO2 are in
ppb (parts per billion). The goal of my personal research is to apply the time series
analysis and forecasting methods from the research paper produced in Bulgaria to a local
city here in the US. As was the case in Bulgaria, once these methods are applied to the
Richmond pollutant data I hope to visually show an appropriate forecast for each pollutant
for the year 2013.
Before proceeding, I would like to point out that while the research paper concerning
Bulgaria highlighted a factor analysis and principal component analysis approach, the
correlation matrix calculated in R for the Richmond pollutant data sets displayed no signs
of positive or negative correlation between the pollutants. Therefore I did not carry out
any factor analysis or PCA. Also, the 2013 pollutant data sets were used strictly to compare
our forecast models with the actual data recorded by the EPA in 2013.
Directly below is the correlation matrix for all 7 pollutants concerning data over the time
span of the years 2010, 2011, and 2012.
Analyzing PM-2.5 using 3 year data
The first pollutant we will analyze is particulate matter PM2.5.
The lambda value used to transform the original PM-2.5 observations was λ = 0.227158.
Directly below is the time series plot for the 3 years after a Yeo-Johnson transformation.
Directly below is the time series plot using only the forecast function in R.
Directly below is the time series plot using the auto.arima function in R.
Now that we have our forecast and ARIMA models, the next step was to access our 2013
pollutant concentration data for PM-2.5 and compare each model to see how accurately it
predicted the 2013 values.
Before I created the time series plot for 2013, I performed a Yeo-Johnson transformation on
the 2013 data. The lambda value used to transform the original PM-2.5 2013 observations was
λ = 0.05030683.
Finally, once we plot the forecast and ARIMA models against the 2013 time series plot, I
believe the auto.arima function is most appropriate for predicting the 2013 values of the
pollutant PM-2.5.
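The comparison above is made visually. As a hedged supplement that was not part of this study, the following Python sketch shows how two forecasts could be compared against observed values using root mean squared error and mean absolute error. The function names and all numbers are hypothetical, and the sketch assumes the forecasts and the observations are expressed on the same scale.

```python
import math


def rmse(forecast, observed):
    """Root mean squared error between two equally long series."""
    if len(forecast) != len(observed):
        raise ValueError("series must have the same length")
    errors = [f - o for f, o in zip(forecast, observed)]
    return math.sqrt(sum(e * e for e in errors) / len(errors))


def mae(forecast, observed):
    """Mean absolute error between two equally long series."""
    if len(forecast) != len(observed):
        raise ValueError("series must have the same length")
    return sum(abs(f - o) for f, o in zip(forecast, observed)) / len(observed)


# Hypothetical values standing in for observed data and two competing forecasts.
observed = [2.1, 2.4, 1.9, 2.8, 2.6]
forecast_a = [2.0, 2.5, 2.0, 2.6, 2.7]
forecast_b = [2.3, 2.3, 2.3, 2.3, 2.3]

for name, values in (("forecast_a", forecast_a), ("forecast_b", forecast_b)):
    print(name, round(rmse(values, observed), 3), round(mae(values, observed), 3))
```

The forecast with the smaller error values tracks the observed series more closely.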
Using only 2012 data to predict 2013 values
The lambda value used to transform the original PM-2.5 observations for the year 2012 was
λ = 0.7078218.
Directly below is the time series plot for 2012 after a Yeo-Johnson transformation.
The time series plot using only the forecast function did not yield an appropriate graph
in R.
Directly below is the time series plot using the auto.arima function in R.
Now that we have our ARIMA model, the next step was to access our 2013 pollutant
concentration data for PM-2.5 to see how accurately auto.arima() predicted the 2013 values.
Before I created the time series plot for 2013, I performed a Yeo-Johnson transformation on
the 2013 data. The lambda value used to transform the original PM-2.5 2013 observations was
λ = 0.05030683.
Finally, once we plot the ARIMA model against the 2013 time series plot, I believe the
auto.arima function is somewhat appropriate for predicting the trend of the pollutant PM-2.5
for the year 2013.
Analyzing PM10 using 3 year data
The second pollutant we will analyze is particulate matter PM10.
The lambda value used to transform the original PM10 observations was λ = 0.2409915.
Directly below is the time series plot for the 3 years after a Yeo-Johnson transformation.
Directly below is the time series plot using only the forecast function in R.
Directly below is the time series plot using the auto.arima function in R.
Now that we have our forecast and ARIMA models, the next step was to access our 2013
pollutant concentration data for PM10 and compare each model to see how accurately it
predicted the 2013 values.
Before I created the time series plot for 2013, I performed a Yeo-Johnson transformation on
the 2013 data. The lambda value used to transform the original PM10 2013 observations was
λ = 0.7845362.
Finally, once we plot the forecast and ARIMA models against the 2013 time series plot, I
believe neither the forecast nor the auto.arima function is appropriate for predicting the
2013 values of the pollutant PM10.
Using only 2012 data to predict 2013 values
The lambda value used to transform the original PM10 observations for the year 2012 was
λ = −0.04297711.
Directly below is the time series plot for 2012 after a Yeo-Johnson transformation.
The time series plot using only the forecast function did not yield an appropriate graph
in R.
The time series plot using the auto.arima function did not yield an appropriate graph in R.
Analyzing Pb (Lead) using 3 year data
The third pollutant we will analyze is lead Pb.
The lambda value used to transform the original Pb observations was λ = −4.99994.
Directly below is the time series plot for the 3 years after a Yeo-Johnson transformation.
Directly below is the time series plot using only the forecast function in R.
Directly below is the time series plot using the auto.arima function in R.
Now that we have our forecast and ARIMA models, the next step was to access our 2013
pollutant concentration data for Pb and compare each model to see how accurately it
predicted the 2013 values.
Before I created the time series plot for 2013, I performed a Yeo-Johnson transformation
on the 2013 data. The lambda value used to transform the original Pb 2013 observations was
λ = −4.99994.
Finally, once we plot the forecast and ARIMA models against the 2013 time series plot, I
believe the auto.arima function is most appropriate for predicting the 2013 values of the
pollutant Pb.
Using only 2012 data to predict 2013 values
The lambda value used to transform the original Pb (lead) observations for the year 2012 was
λ = −4.99994.
Directly below is the time series plot for 2012 after a Yeo-Johnson transformation.
The time series plot using only the forecast function did not yield an appropriate graph
in R.
The time series plot using the auto.arima function did not yield an appropriate graph in R.
Analyzing CO using 3 year data
The fourth pollutant we will analyze is carbon monoxide CO.
The lambda value used to transform the original CO observations was λ = −3.577325.
Directly below is the time series plot for the 3 years after a Yeo-Johnson transformation.
Directly below is the time series plot using only the forecast function in R.
Directly below is the time series plot using the auto.arima function in R.
Now that we have our forecast and ARIMA models, the next step was to access our 2013
pollutant concentration data for CO and compare each model to see how accurately it
predicted the 2013 values.
Before I created the time series plot for 2013, I performed a Yeo-Johnson transformation
on the 2013 data. The lambda value used to transform the original CO 2013 observations was
λ = −2.432302.
Finally, once we plot the forecast and ARIMA models against the 2013 time series plot, I
believe the auto.arima function is most appropriate for predicting the 2013 values of the
pollutant CO.
Using only 2012 data to predict 2013 values
The lambda value used to transform the original CO observations for the year 2012 was
λ = −3.641187.
Directly below is the time series plot for 2012 after a Yeo-Johnson transformation.
The time series plot using only the forecast function did not yield an appropriate graph
in R.
Directly below is the time series plot using the auto.arima function in R.
Now that we have our ARIMA model, the next step was to access our 2013 pollutant
concentration data for CO and see how accurately the ARIMA model predicted the 2013 values.
Before I created the time series plot for 2013, I performed a Yeo-Johnson transformation
on the 2013 data. The lambda value used to transform the original CO 2013 observations was
λ = −2.432302.
Finally, once we plot the ARIMA model against the 2013 time series plot, I believe the
auto.arima function is most appropriate for predicting the trend of the pollutant CO for
2013.
Analyzing O3 using 3 year data
The fifth pollutant we will analyze is ground level ozone O3.
The lambda value used to transform the original O3 observations was λ = 3.615548.
Directly below is the time series plot for the 3 years after a Yeo-Johnson transformation.
Directly below is the time series plot using only the forecast function in R.
Directly below is the time series plot using the auto.arima function in R.
Now that we have our forecast and ARIMA models, the next step was to access our 2013
pollutant concentration data for O3 and compare each model to see how accurately it
predicted the 2013 values.
Before I created the time series plot for 2013, I performed a Yeo-Johnson transformation
on the 2013 data. The lambda value used to transform the original O3 2013 observations was
λ = 4.99994.
Finally, once we plot the forecast and ARIMA models against the 2013 time series plot, I
believe the forecast function is most appropriate for predicting the 2013 values of the
pollutant O3.
Using only 2012 data to predict 2013 values
The lambda value used to transform the original O3 observations for the year 2012 was
λ = 4.99994.
Directly below is the time series plot for 2012 after a Yeo-Johnson transformation.
The time series plot using only the forecast function did not yield an appropriate graph
in R.
The time series plot using the auto.arima function did not yield an appropriate graph in R.
Analyzing SO2 using 3 year data
The sixth pollutant we will analyze is sulfur dioxide SO2.
The lambda value used to transform the original SO2 observations was λ = −0.227093.
Directly below is the time series plot for the 3 years after a Yeo-Johnson transformation.
Directly below is the time series plot using only the forecast function in R.
Directly below is the time series plot using the auto.arima function in R.
Now that we have our forecast and ARIMA models, the next step was to access our 2013
pollutant concentration data for SO2 and compare each model to see how accurately it
predicted the 2013 values.
Before I created the time series plot for 2013, I performed a Yeo-Johnson transformation on
the 2013 data. The lambda value used to transform the original SO2 2013 observations was
λ = 0.2616144.
Finally, once we plot the forecast and ARIMA models against the 2013 time series plot, I
believe the forecast function is most appropriate for predicting the 2013 values of the
pollutant SO2.
Using only 2012 data to predict 2013 values
The lambda value used to transform the original SO2 observations for the year 2012 was
λ = −0.1123281.
Directly below is the time series plot for 2012 after a Yeo-Johnson transformation.
The time series plot using only the forecast function did not yield an appropriate graph
in R.
The time series plot using the auto.arima function did not yield an appropriate graph in R.
Analyzing NO2 using 3 year data
The seventh and final pollutant we will analyze is nitrogen dioxide NO2.
The lambda value used to transform the original NO2 observations was λ = 0.9783584.
Directly below is the time series plot for the 3 years after a Yeo-Johnson transformation.
Directly below is the time series plot using only the forecast function in R.
Directly below is the time series plot using the auto.arima function in R.
Now that we have our forecast and ARIMA models, the next step was to access our 2013
pollutant concentration data for NO2 and compare each model to see how accurately it
predicted the 2013 values.
Before I created the time series plot for 2013, I performed a Yeo-Johnson transformation on
the 2013 data. The lambda value used to transform the original NO2 2013 observations was
λ = 1.003092.
Finally, once we plot the forecast and ARIMA models against the 2013 time series plot, I
believe the auto.arima function is most appropriate for predicting the 2013 values of the
pollutant NO2.
Using only 2012 data to predict 2013 values
The lambda value used to transform the original NO2 observations for the year 2012 was
λ = 1.229131.
Directly below is the time series plot for 2012 after a Yeo-Johnson transformation.
The time series plot using only the forecast function did not yield an appropriate graph
in R.
Directly below is the time series plot using the auto.arima function in R.
Now that we have our ARIMA model, the next step was to access our 2013 pollutant
concentration data for NO2 and see how accurately the model predicted the 2013 values.
Before I created the time series plot for 2013, I performed a Yeo-Johnson transformation
on the 2013 data. The lambda value used to transform the original NO2 2013 observations was
λ = 1.003092.
Finally, once we plot the ARIMA model against the 2013 time series plot, I believe the
auto.arima function is most appropriate for predicting the trend of the pollutant NO2 for
2013.
6 R Code Explanation and Software Packages Used
The following packages in the R software were used: MASS, bestNormalize, forecast.
• From MASS, the function truehist was used to plot histograms of the pollutant
data before and after the Yeo-Johnson transformation was applied, to show visually the
change from a non-normal to a normal distribution of the data.
• From bestNormalize, the function yeojohnson was used to transform the pollutant
data from non-normal to normally distributed, in order to better carry out our statistical
analysis.
• From forecast, the functions forecast and auto.arima were used, each playing the most
important role in analyzing prior pollutant observations and forecasting future values
for each pollutant as accurately as R allows.
The main functions that I will highlight in this section are the forecast() and auto.arima()
functions in R, but I will also briefly explain my usage of the ts() and yeojohnson() functions.
It was very important to my study that level=F was set within the forecast function: while
having prediction intervals in our graphs could be useful, they were not needed for my
study, since I was mostly interested in the specific values that the forecast function gave
in its output. Also, in the forecast function it was very important that we forecast exactly
59 future values, simply because there are exactly 59 values in our EPA 2013 data for each
pollutant. In the auto.arima() function, no restrictions needed to be set within the call,
but it was most important that we accessed our forecast values by auto.arima()$f; for
reference, we are also able to access the original values that were put into the function by
using auto.arima()$x.
One last note: when I was plotting the time series for the 3 year data, you should notice
that within each ts() function frequency=(58), which I interpret as there being an average
of 58 observations per year. I got 58 by dividing the total number of observations in our
3 year data by 3, so 174/3 = 58. Within the yeojohnson() function you will notice that
standardize=FALSE. This is because, if it is not declared within the function, R will by
default further standardize the values put into the function. I did not find the further
standardization useful when dealing with the Richmond data, mainly because the Yeo-Johnson
transformation itself was of interest in the Bulgaria study, so I wanted to follow that
transformation as it is, without further standardization.
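The two numerical points above can be illustrated with a short Python sketch: the frequency arithmetic, and what the additional standardization step would do to already transformed values. The sketch assumes that standardization means centering to mean 0 and scaling by the sample standard deviation; the function name and the transformed values are hypothetical, and this is not the R code used in the study.

```python
from statistics import fmean, stdev

# Figures quoted in the text: 174 observations over the 3 years 2010-2012.
total_observations = 174
years = 3
frequency = total_observations // years
print(frequency)  # 58


def standardize(values):
    """Center to mean 0 and scale to sample standard deviation 1 (assumed meaning)."""
    centre = fmean(values)
    spread = stdev(values)
    return [(v - centre) / spread for v in values]


# Hypothetical transformed values, before and after the extra standardization step.
transformed = [1.2, 1.9, 2.4, 3.1, 3.4]
standardized = standardize(transformed)
print([round(v, 3) for v in standardized])
```

Skipping this step keeps the values on the scale produced by the transformation alone, which is the choice described above.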
7 Conclusion
In the Bulgaria study, the researchers' main goal was to use the ARIMA models to forecast
72 hours ahead, because they used hourly data. Similarly, I feel it necessary to highlight
the important role the auto.arima() function played in helping forecast the year 2013.
While it was not helpful for forecasting every pollutant, it was definitely more helpful
than the forecast() function in identifying the trend or behavior of each pollutant
throughout the year(s). The most important finding I came across was that the 2012 data
alone was not enough in most cases when attempting to forecast a future year, but the
combined 3 year (2010, 2011, 2012) data allowed both the forecast() and auto.arima()
functions to display their usefulness when forecasting. I certainly enjoyed preparing this
study and learning about time series, and I hope that I am given the opportunity to further
explore this discipline in the future.
References
[1] Kullback, S. (1959). Information Theory and Statistics. John Wiley and Sons.
Republished by Dover Publications in 1968; reprinted in 1978: ISBN 0-8446-5625-9.
[2] Yeo, I. K., and Johnson, R. A. (2000). A new family of power transformations to improve
normality or symmetry. Biometrika.
[3] Gocheva-Ilieva, S., Ivanov, A., Voynikova, D., and Boyadzhiev, D. (2013). Time series
analysis and forecasting for air pollution in small urban area: An SARIMA and factor
analysis approach. Stochastic Environmental Research and Risk Assessment, 28, 1045-1060.
doi:10.1007/s00477-013-0800-4.
[4] Alcosser, Howard. "Diamond Bar High School" Internal Assessment: Mathematical
Exploration. Web. 27 May 2015.
[5] Jolliffe, I. (1986). Principal Component Analysis and Factor Analysis.
doi:10.1007/978-1-4757-1904-8_7.
PDF
Data Engineering Interview Questions & Answers Cloud Data Stacks (AWS, Azure,...
PDF
[EN] Industrial Machine Downtime Prediction
PPTX
Phase1_final PPTuwhefoegfohwfoiehfoegg.pptx
PPT
DU, AIS, Big Data and Data Analytics.ppt
PPT
statistic analysis for study - data collection
PDF
Optimise Shopper Experiences with a Strong Data Estate.pdf
PDF
Systems Analysis and Design, 12th Edition by Scott Tilley Test Bank.pdf
PPTX
Managing Community Partner Relationships
PPTX
Pilar Kemerdekaan dan Identi Bangsa.pptx
PDF
Microsoft Core Cloud Services powerpoint
PDF
Transcultural that can help you someday.
PPT
lectureusjsjdhdsjjshdshshddhdhddhhd1.ppt
QUANTUM_COMPUTING_AND_ITS_POTENTIAL_APPLICATIONS[2].pptx
Navigating the Thai Supplements Landscape.pdf
SAP 2 completion done . PRESENTATION.pptx
Introduction to the R Programming Language
Capcut Pro Crack For PC Latest Version {Fully Unlocked 2025}
(Ali Hamza) Roll No: (F24-BSCS-1103).pptx
REAL ILLUMINATI AGENT IN KAMPALA UGANDA CALL ON+256765750853/0705037305
Global Data and Analytics Market Outlook Report
Data Engineering Interview Questions & Answers Cloud Data Stacks (AWS, Azure,...
[EN] Industrial Machine Downtime Prediction
Phase1_final PPTuwhefoegfohwfoiehfoegg.pptx
DU, AIS, Big Data and Data Analytics.ppt
statistic analysis for study - data collection
Optimise Shopper Experiences with a Strong Data Estate.pdf
Systems Analysis and Design, 12th Edition by Scott Tilley Test Bank.pdf
Managing Community Partner Relationships
Pilar Kemerdekaan dan Identi Bangsa.pptx
Microsoft Core Cloud Services powerpoint
Transcultural that can help you someday.
lectureusjsjdhdsjjshdshshddhdhddhhd1.ppt

Time Series Analysis

  • 1. Marshall University College of Science Department of Mathematics - STA 564 - Time Series Analysis and Forecasting focused on Air Pollution in an Urban Area By Kenneth Guzman December 7, 2018
  • 2. Contents: 1 INTRODUCTION; 2 Yeo-Johnson transformation, Kolmogorov-Smirnov Test for normality; 3 Factor Analysis and PCA; 4 Box Jenkins Methodology; 5 Personal Study Carried out in R; 6 R Code Explanation and Software Packages Used; 7 Conclusion; References
  • 3. TIME SERIES ANALYSIS FORECASTING NOTES At its core, the influences on air pollution in the atmosphere are strongly governed by meteorology. However, the "univariate" models we will consider assume that the final concentration of air pollutants in the atmosphere is the net result of all the complex interactions of meteorology, chemistry, transport, diffusion, etc. For this reason, the combined information about their effect on air pollutant concentration is contained in the corresponding time series in a stochastic way. Using this approach, calculations are simplified and performed using only the time series of the pollutant, without explicit inclusion of meteorological or other measurements. Four professors from Plovdiv University in Bulgaria produced a research paper on time series analysis concerning air pollution. The methods used were explicitly stated in their article as: (i) Identify correlation-type dependencies and groupings of observed air pollutants using the method of factor analysis to explain mutual effects of pollution. (ii) Conduct time series analysis by determining relevant parametric seasonal ARIMA models of the pollutants (based on hourly data). (iii) Analysis and diagnostics of the constructed models. (iv) Application of the models for short-term forecasting. (v) Interpretation of the results and definition of the conditions contributing to the exceeding of national and European concentration norms for the considered air pollutants. Their study was carried out using IBM SPSS 19 and EViews 7.[3]
  • 4. 1 INTRODUCTION Even though there are established regulations for monitoring and controlling effects on air quality in certain territories, air quality may remain unsatisfactory. Let's consider the particular case of the town of Blagoevgrad, Bulgaria. Blagoevgrad is a typical representative of a small urban region, with a population of approximately 70,000. Time span of study: a 1-year period from September 1st, 2011 to August 31st, 2012. Based on hourly measurements, six air pollutants were observed. Factor analysis and Box-Jenkins methodology were applied to inspect concentrations of the primary air pollutants of interest. The pollutants were grouped into three factors and the degree of contribution of each factor to the overall pollution was determined; this contribution was interpreted as the presence of common sources of pollution. The classical techniques of principal component analysis (PCA) and factor analysis are important statistical instruments frequently used in the environmental sciences. The focus of the study was time series analysis and the development of univariate stochastic seasonal autoregressive integrated moving average (SARIMA) models, with seasonality based on the hourly recordings. The study incorporates the Yeo-Johnson power transformation to stabilize the variance of the data, and model selection using the Bayesian Information Criterion. The SARIMA models obtained in the Bulgarian study fit the observed air pollutants well and gave good short-term predictions for 72 hours ahead, specifically in the case of ozone and particulate matter PM10.
The methods presented allowed the building of less complex models that are effective for short-term air pollution forecasting and useful for advance warning purposes in urban areas.[3] Continuous and careful monitoring and forecasting of atmospheric air pollutants is important when evaluating regulatory control measures related to air quality. In Bulgaria, 12 types of pollutants are systematically monitored by more than 36 automated stations run by the Executive Environment Agency (EEA), which manages and coordinates activities related to the control and environmental protection of the country. Atmospheric air quality reports for the various regions of the country are regularly published, and from these much data is accumulated. This accumulated data is what allows us to carry out statistical analysis, which leads to the discovery of general patterns and dependencies for different time periods and of relationships between observed air pollutants. The observed air pollutants in the study carried out in Blagoevgrad, Bulgaria are particulate matter PM10, nitrogen oxide NO, nitrogen dioxide NO2, nitrogen oxides NOx, sulfur dioxide SO2, and ground-level ozone O3. The measurements are expressed in units of mass concentration, µg/m³; only NOx is in ppb (parts per billion), as it captures pollution from all kinds of nitrogen oxides. The data consisted of 8,744 observations (hourly data). The goal of their study was to demonstrate the capabilities of the mentioned methods, which can be applied to other recorded data sets, including those covering shorter or longer periods of time.
  • 5. 2 Yeo-Johnson transformation, Kolmogorov-Smirnov Test for normality Time series data often requires preparation before forecasting methods are applied. A normal or near-normal distribution of the univariate data is important because it reduces issues when we forecast future values. The K-S statistic obtained for the data collected in Bulgaria indicated non-normality, which led to the transformation of the data prior to constructing the forecasting models. In that particular case the Yeo-Johnson transformation was carried out, after which the data satisfied the Kolmogorov-Smirnov test for normality at the 0.05 level of significance and could be assumed to be normally distributed. The Yeo-Johnson transformation finds the optimal value of lambda that minimizes the Kullback-Leibler distance between the normal distribution and the transformed distribution.[1][2] (Footnote 1: In mathematical statistics, the Kullback-Leibler divergence, also called relative entropy, is a measure of how one probability distribution differs from a second, reference probability distribution.) The Yeo-Johnson transformation is defined piecewise:
\[
g(x;\lambda) =
\begin{cases}
\dfrac{(x+1)^{\lambda} - 1}{\lambda}, & \lambda \neq 0,\ x \geq 0 \\
\log(x+1), & \lambda = 0,\ x \geq 0 \\
\dfrac{(1-x)^{2-\lambda} - 1}{\lambda - 2}, & \lambda \neq 2,\ x < 0 \\
-\log(1-x), & \lambda = 2,\ x < 0
\end{cases}
\]
3 Factor Analysis and PCA The statistical techniques of factor analysis and principal component analysis help identify patterns in the correlation between variables. The patterns identified are used to create factors, which was the case in Bulgaria and allowed the grouping of correlated pollutants. The steps followed for the particular case in Bulgaria were: (a) calculation of the correlation matrix, (b) testing the adequacy of factor analysis, (c) factor extraction, (d) factor rotation, and (e) score calculation of factor variables. The particular advantages of these methods are that they reveal strong correlation relationships between observed variables and allow their grouping into new variables (factors) in order to reduce the dimensions of the complex data structure.
The factors can thereafter be used to build regression or other types of models.[5] 4 Box Jenkins Methodology Other methods frequently used in time series analysis and forecasting are the autoregressive integrated moving average (ARIMA) and seasonal ARIMA (SARIMA) models, also known as Box-Jenkins stochastic models. Box-Jenkins methodology is widely applied in air quality research, among other disciplines, and is a systematic strategy for identifying, fitting, and forecasting univariate time series data. ARIMA models generally take the form ARIMA(p,d,q)
  • 6. where p is the number of parameters describing the autoregressive process, d is the number of nonseasonal differences needed to reach stationarity, and q is the number of lagged forecast errors in the prediction equation. Similarly, SARIMA models take the general form ARIMA(p,d,q)(P,D,Q)s, where P is the number of seasonal autoregressive terms, D is the order of seasonal differencing, and Q is the number of seasonal moving average terms. In the seasonal part of the model, the three parameters P, D, Q operate across multiples of lag s, where s is the number of time periods until a pattern repeats itself. Main advantages of the Box-Jenkins approach: (i) Applicability for modeling and forecasting practically any time series that is stationary or can be reduced to stationary by a differencing procedure. (ii) Ability to extract all the trends and serial correlations in the data, leaving a minimized sequence of white noise (shocks), through inclusion in one general model equation that captures the basis of the historical data development. (iii) The method has been incorporated into many standard software packages, such as R and SPSS, which speeds up and assists the modeling process considerably. 5 Personal Study Carried out in R Using the presented methods, I was able to carry out my own study using the statistical software R. Using data provided by the Environmental Protection Agency here in the United States (https://www.epa.gov/outdoor-air-quality-data), I accessed pollutant concentration data for the city of Richmond, Virginia, which has a population of approximately 220,000.
Time span of observed data: a total of 4 years of data was accessed, from January 2010 to December 2013, based on weekly measurements of the following air pollutants. Concentrations of particulate matter PM2.5, particulate matter PM10, and lead Pb are expressed in units of mass concentration (µg/m³); carbon monoxide CO and ground-level ozone O3 are in ppm (parts per million); sulfur dioxide SO2 and nitrogen dioxide NO2 are in ppb (parts per billion). The goal of my personal research is to apply the time series analysis and forecasting methods from the research paper produced in Bulgaria to a local city here in the US. As was the case in Bulgaria, once these methods are applied to the Richmond pollutant data I hope to visually show an appropriate forecast for each pollutant for the year 2013. Before I proceed, I would like to point out that while the research paper concerning Bulgaria highlighted a factor analysis and principal component analysis approach, the correlation matrix calculated in R for the Richmond pollutant data sets displayed no signs of positive or negative correlation between the pollutants. Therefore I did not carry out any factor analysis or PCA. Also, the 2013 pollutant data sets were used strictly to compare our forecast models to the actual data recorded by the EPA in 2013.
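The correlation check described above can be sketched in base R. This is only an illustration: the data frame `pollutants`, its column names, the simulated values, and the 0.5 cutoff are all assumptions made for the example, not the Richmond data or the code used in the study.

```r
# Hypothetical stand-in for the Richmond data: 174 observations (3 years)
# of 7 pollutants, simulated independently of one another.
set.seed(1)
pollutants <- data.frame(
  PM25 = rgamma(174, shape = 4, rate = 0.5),
  PM10 = rgamma(174, shape = 5, rate = 0.3),
  Pb   = rgamma(174, shape = 2, rate = 200),
  CO   = rgamma(174, shape = 3, rate = 10),
  O3   = rgamma(174, shape = 6, rate = 150),
  SO2  = rgamma(174, shape = 2, rate = 0.5),
  NO2  = rgamma(174, shape = 8, rate = 0.5)
)

# Pearson correlation matrix; pairwise deletion tolerates missing values.
corr <- cor(pollutants, use = "pairwise.complete.obs")
print(round(corr, 2))

# List the pollutant pairs whose absolute correlation exceeds an
# illustrative cutoff (0.5 is an assumption, not a value from the study).
cutoff <- 0.5
strong <- which(abs(corr) > cutoff & upper.tri(corr), arr.ind = TRUE)
print(strong)
```

If `strong` comes back empty, as one would expect for unrelated series, there is little shared variance for factor analysis or PCA to summarize, which matches the decision to skip those steps.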
  • 7. Directly below is the correlation matrix for all 7 pollutants over the years 2010, 2011, and 2012. Analyzing PM2.5 using 3 year data The first pollutant we will analyze is particulate matter PM2.5. The lambda value used to transform the original PM2.5 observations is λ = 0.227158. Directly below is the time series plot for the 3 years after a Yeo-Johnson transformation.
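For readers who want to see what the transformation does for a fixed lambda, here is a minimal base R sketch. The helper `yeo_johnson()` and the simulated `pm25` values are hypothetical; the study itself used `yeojohnson()` from the bestNormalize package, which also estimates lambda. The only value taken from the text is λ = 0.227158.

```r
# Yeo-Johnson transform for a given lambda (base R sketch).
# Assumes x contains no missing values.
yeo_johnson <- function(x, lambda, eps = 1e-8) {
  out <- numeric(length(x))
  pos <- x >= 0
  if (abs(lambda) > eps) {
    out[pos] <- ((x[pos] + 1)^lambda - 1) / lambda
  } else {
    out[pos] <- log1p(x[pos])
  }
  if (abs(lambda - 2) > eps) {
    # Same as ((1 - x)^(2 - lambda) - 1) / (lambda - 2)
    out[!pos] <- -((1 - x[!pos])^(2 - lambda) - 1) / (2 - lambda)
  } else {
    out[!pos] <- -log1p(-x[!pos])
  }
  out
}

# Hypothetical right-skewed data standing in for 3 years of PM2.5 readings.
set.seed(42)
pm25 <- rlnorm(174, meanlog = 2.2, sdlog = 0.4)

lambda  <- 0.227158                       # value reported in the text
pm25_yj <- yeo_johnson(pm25, lambda)

# Time series object with 58 observations per year, starting in 2010.
pm25_ts <- ts(pm25_yj, start = 2010, frequency = 58)

# Kolmogorov-Smirnov check against a standard normal after centering and
# scaling. Estimating mean and sd from the same data makes this an
# approximate check only.
ks <- ks.test(as.numeric(scale(pm25_yj)), "pnorm")
print(ks$p.value)
```

A p-value above 0.05 would be read, as in the Bulgarian study, as no evidence against normality of the transformed series.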
  • 8. Directly below is the time series plot using only the forecast function in R. Directly below is the time series plot using the auto.arima function in R. Now that we have our forecast and ARIMA models, the next step was to access our 2013 pollutant concentration data for PM2.5 and compare each model to see how accurately it predicted the 2013 values. Before I created the time series plot for 2013, I performed a Yeo-Johnson transformation on the 2013 data. The lambda value used to transform the original PM2.5 2013 observations is λ = 0.05030683. Finally, once we plot the forecast and ARIMA models against the 2013 time series plot, I believe the auto.arima function is most appropriate for predicting the 2013 values for the pollutant PM2.5.
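The fit-then-compare step can also be expressed numerically. The sketch below uses only base R: a small AIC grid search over `arima()` orders stands in for `auto.arima()` (it is a simplification, not the algorithm in the forecast package), and the series `train` and `actual_2013` are simulated placeholders. It assumes the training data and the 2013 data are on the same transformed scale, which requires applying one common lambda to both.

```r
set.seed(7)

# Hypothetical stand-ins: 174 transformed training values (2010 to 2012)
# and 59 transformed values for 2013.
train <- ts(as.numeric(arima.sim(list(ar = 0.6), n = 174)) + 5,
            start = 2010, frequency = 58)
actual_2013 <- as.numeric(arima.sim(list(ar = 0.6), n = 59)) + 5

# Pick the ARMA(p, q) order with the lowest AIC from a small grid.
best <- NULL
for (p in 0:2) for (q in 0:2) {
  fit <- tryCatch(arima(train, order = c(p, 0, q)),
                  error = function(e) NULL)
  if (!is.null(fit) && (is.null(best) || AIC(fit) < AIC(best))) best <- fit
}

# Forecast exactly as many steps as there are 2013 observations.
h    <- length(actual_2013)
pred <- predict(best, n.ahead = h)$pred

# Root mean squared error as one numeric summary of forecast accuracy.
rmse <- sqrt(mean((as.numeric(pred) - actual_2013)^2))
print(rmse)
```

Computing the same RMSE for each candidate model would give a numeric counterpart to the visual comparison of plots used in this study.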
  • 9. Using only 2012 data to predict 2013 values The lambda value used to transform the original PM2.5 observations for the year 2012 is λ = 0.7078218.
  • 10. Directly below is the time series plot for 2012 after a Yeo-Johnson transformation. The forecast function alone did not yield an appropriate time series plot in R. Directly below is the time series plot using the auto.arima function in R. Now that we have our ARIMA model, the next step was to access our 2013 pollutant concentration data for PM2.5 to see how accurately auto.arima() predicted the 2013 values. Before I created the time series plot for 2013, I performed a Yeo-Johnson transformation on the 2013 data. The lambda value used to transform the original PM2.5 2013 observations is λ = 0.05030683. Finally, once we plot the ARIMA model against the 2013 time series plot, I believe the auto.arima function is somewhat appropriate for predicting the trend of the pollutant PM2.5 for the year 2013.
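One possible reason the single-year series behaved poorly, offered here as an assumption rather than a confirmed diagnosis, is that a year of data at frequency 58 contains only one seasonal period, and seasonal procedures generally need at least two. The base R function `stl()` makes this requirement explicit, so it is used below purely as an illustration; the series are simulated and `try_stl()` is a hypothetical helper.

```r
set.seed(3)

# Hypothetical series: one year (58 values) and three years (174 values),
# both declared with 58 observations per seasonal cycle.
one_year    <- ts(rnorm(58),  start = 2012, frequency = 58)
three_years <- ts(rnorm(174), start = 2010, frequency = 58)

# Run a seasonal decomposition and report either "ok" or the error text.
try_stl <- function(y) {
  tryCatch({
    stl(y, s.window = "periodic")
    "ok"
  }, error = function(e) conditionMessage(e))
}

msg_one   <- try_stl(one_year)     # fewer than two full periods
msg_three <- try_stl(three_years)  # three full periods

print(msg_one)
print(msg_three)
```

The one-year series is rejected while the three-year series decomposes, which is consistent with the observation that the 3 year data supported forecasting better than the 2012 data alone.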
  • 11. Analyzing PM10 using 3 year data The second pollutant we will analyze is particulate matter PM10. The lambda value used to transform the original PM10 observations is λ = 0.2409915. Directly below is the time series plot for the 3 years after a Yeo-Johnson transformation. Below is the time series plot using only the forecast function in R.
  • 12. Directly below is the time series plot using the auto.arima function in R. Now that we have our forecast and ARIMA models, the next step was to access our 2013 pollutant concentration data for PM10 and compare each model to see how accurately it predicted the 2013 values. Before I created the time series plot for 2013, I performed a Yeo-Johnson transformation on the 2013 data. The lambda value used to transform the original PM10 2013 observations is λ = 0.7845362. Finally, once we plot the forecast and ARIMA models against the 2013 time series plot, I believe neither the forecast nor the auto.arima function is appropriate for predicting the 2013 values for the pollutant PM10.
  • 13. Using only 2012 data to predict 2013 values The lambda value used to transform the original PM10 observations for the year 2012 is λ = −0.04297711. Below is the time series plot for 2012 after a Yeo-Johnson transformation.
  • 14. The forecast function alone did not yield an appropriate time series plot in R. The auto.arima function did not yield an appropriate time series plot in R either. Analyzing Pb (Lead) using 3 year data The third pollutant we will analyze is lead Pb. The lambda value used to transform the original Pb observations is λ = −4.99994. Directly below is the time series plot for the 3 years after a Yeo-Johnson transformation.
  • 15. Directly below is the time series plot using only the forecast function in R. Directly below is the time series plot using the auto.arima function in R. Now that we have our forecast and ARIMA models, the next step was to access our 2013 pollutant concentration data for Pb and compare each model to see how accurately it predicted the 2013 values. Before I created the time series plot for 2013, I performed a Yeo-Johnson transformation on the 2013 data. The lambda value used to transform the original Pb 2013 observations is λ = −4.99994. Finally, once we plot the forecast and ARIMA models against the 2013 time series plot, I believe the auto.arima function is most appropriate for predicting the 2013 values for the pollutant Pb.
  • 16. Using only 2012 data to predict 2013 values The lambda value used to transform the original Pb (Lead) observations for the year 2012 is λ = −4.99994. Directly below is the time series plot for 2012 after a Yeo-Johnson transformation.
  • 17. The forecast function alone did not yield an appropriate time series plot in R. The auto.arima function did not yield an appropriate time series plot in R either. Analyzing CO using 3 year data The fourth pollutant we will analyze is carbon monoxide CO. The lambda value used to transform the original CO observations is λ = −3.577325. Directly below is the time series plot for the 3 years after a Yeo-Johnson transformation.
  • 18. Directly below is the time series plot using only the forecast function in R. Directly below is the time series plot using the auto.arima function in R. Now that we have our forecast and ARIMA models, the next step was to access our 2013 pollutant concentration data for CO and compare each model to see how accurately it predicted the 2013 values. Before I created the time series plot for 2013, I performed a Yeo-Johnson transformation on the 2013 data. The lambda value used to transform the original CO 2013 observations is λ = −2.432302.
  • 19. Finally, once we plot the forecast and ARIMA models against the 2013 time series plot, I believe the auto.arima function is most appropriate for predicting the 2013 values for the pollutant CO. Using only 2012 data to predict 2013 values The lambda value used to transform the original CO observations for the year 2012 is λ = −3.641187. Directly below is the time series plot for 2012 after a Yeo-Johnson transformation.
  • 20. The forecast function alone did not yield an appropriate time series plot in R. Directly below is the time series plot using the auto.arima function in R. Now that we have our ARIMA model, the next step was to access our 2013 pollutant concentration data for CO and see how accurately the ARIMA model predicted the 2013 values. Before I created the time series plot for 2013, I performed a Yeo-Johnson transformation on the 2013 data. The lambda value used to transform the original CO 2013 observations is λ = −2.432302. Finally, once we plot the ARIMA model against the 2013 time series plot, I believe the auto.arima function is most appropriate for predicting the trend of the pollutant CO for 2013.
  • 21. Analyzing O3 using 3 year data The fifth pollutant we will analyze is ground-level ozone O3. The lambda value used to transform the original O3 observations is λ = 3.615548. Directly below is the time series plot for the 3 years after a Yeo-Johnson transformation.
  • 22. Directly below is the time series plot using only the forecast function in R. Directly below is the time series plot using the auto.arima function in R. Now that we have our forecast and ARIMA models, the next step was to access our 2013 pollutant concentration data for O3 and compare each model to see how accurately it predicted the 2013 values. Before I created the time series plot for 2013, I performed a Yeo-Johnson transformation on the 2013 data. The lambda value used to transform the original O3 2013 observations is λ = 4.99994. Finally, once we plot the forecast and ARIMA models against the 2013 time series plot, I believe the forecast function is most appropriate for predicting the 2013 values for the pollutant O3.
  • 23. Using only 2012 data to predict 2013 values The lambda value used to transform the original O3 observations for the year 2012 is λ = 4.99994. Directly below is the time series plot for 2012 after a Yeo-Johnson transformation.
  • 24. The forecast function alone did not yield an appropriate time series plot in R. The auto.arima function did not yield an appropriate time series plot in R either. Analyzing SO2 using 3 year data The sixth pollutant we will analyze is sulfur dioxide SO2. The lambda value used to transform the original SO2 observations is λ = −0.227093. Directly below is the time series plot for the 3 years after a Yeo-Johnson transformation.
  • 25. Directly below is the time series plot using only the forecast function in R. Directly below is the time series plot using the auto.arima function in R. Now that we have our forecast and ARIMA models, the next step was to access our 2013 pollutant concentration data for SO2 and compare each model to see how accurately it predicted the 2013 values. Before I created the time series plot for 2013, I performed a Yeo-Johnson transformation on the 2013 data. The lambda value used to transform the original SO2 2013 observations is λ = 0.2616144. Finally, once we plot the forecast and ARIMA models against the 2013 time series plot, I believe the forecast function is most appropriate for predicting the 2013 values for the pollutant SO2.
  • 26. Using only 2012 data to predict 2013 values The lambda value used to transform the original SO2 observations for the year 2012 is λ = −0.1123281. Below is the time series plot for 2012 after a Yeo-Johnson transformation.
  • 27. The forecast function alone did not yield an appropriate time series plot in R. The auto.arima function did not yield an appropriate time series plot in R either. Analyzing NO2 using 3 year data The seventh and final pollutant we will analyze is nitrogen dioxide NO2. The lambda value used to transform the original NO2 observations is λ = 0.9783584. Below is the time series plot for the 3 years after a Yeo-Johnson transformation.
  • 28. Directly below is the time series plot using only the forecast function in R. Directly below is the time series plot using the auto.arima function in R. Now that we have our forecast and ARIMA models, the next step was to access our 2013 pollutant concentration data for NO2 and compare each model to see how accurately it predicted the 2013 values. Before I created the time series plot for 2013, I performed a Yeo-Johnson transformation on the 2013 data. The lambda value used to transform the original NO2 2013 observations is λ = 1.003092.
  • 29. Finally, once we plot the forecast and ARIMA models against the 2013 time series plot, I believe the auto.arima function is most appropriate for predicting the 2013 values for the pollutant NO2. Using only 2012 data to predict 2013 values The lambda value used to transform the original NO2 observations for the year 2012 is λ = 1.229131. Directly below is the time series plot for 2012 after a Yeo-Johnson transformation.
  • 30. The forecast function alone did not yield an appropriate time series plot in R. Directly below is the time series plot using the auto.arima function in R. Now that we have our ARIMA model, the next step was to access our 2013 pollutant concentration data for NO2 and see how accurately the model predicted the 2013 values. Before I created the time series plot for 2013, I performed a Yeo-Johnson transformation on the 2013 data. The lambda value used to transform the original NO2 2013 observations is λ = 1.003092. Finally, once we plot the ARIMA model against the 2013 time series plot, I believe the auto.arima function is most appropriate for predicting the trend of the pollutant NO2 for 2013.
  • 31. 6 R Code Explanation and Software Packages Used The following R packages were used: MASS, bestNormalize, forecast. • From MASS, the function truehist was used to plot histograms of the pollutant data before and after the Yeo-Johnson transformation was applied, to visually show the change from a non-normal to a normal distribution of the data. • From bestNormalize, the function yeojohnson was used to transform the pollutant data from non-normal to normally distributed, in order to better carry out our statistical analysis. • From forecast, the functions forecast and auto.arima were used, each playing the most important role in analyzing prior pollutant observations and forecasting future values as accurately as R allows for each pollutant. The main functions that I will highlight in this section are forecast() and auto.arima(), but I will also briefly explain my usage of the ts() and yeojohnson() functions. It was very important to my study that level=F was set within the forecast function: while confidence intervals in our graphs could be useful, they were not needed for my study, since I was mostly interested in the specific values that the forecast function gave in its output. Also, it was very important that we forecast exactly 59 future values, simply because there are exactly 59 values in our EPA 2013 data for each pollutant. In the auto.arima() function no restrictions needed to be specified, but it was most important that we accessed our forecast values with auto.arima()$f; for reference, we can also access the original values that were put into the function with auto.arima()$x.
One last note: when I was plotting the time series for the 3 year data, you should notice that within each ts() function frequency=58, which I interpret as there being an average of 58 observations per year. I got 58 by dividing the total number of observations in our 3 year data by 3, so 174/3 = 58. Within the yeojohnson() function you will notice that standardize=FALSE.
  • 32. This is because, if it is not declared within the function, by default R will additionally standardize the transformed values. I did not find the further standardization useful when dealing with the Richmond data, mainly because the Yeo-Johnson transformation was of interest in the Bulgaria study, so I wanted to follow that transformation as it is without further standardization. 7 Conclusion In the Bulgaria study the researchers' main goal was to use the ARIMA models to forecast 72 hours ahead, because they used hourly data. Similarly, I feel it necessary to highlight the importance of the auto.arima() function in helping forecast the year 2013. While it was not helpful for forecasting all pollutants, it was definitely more helpful than the forecast() function in identifying the trend or behavior of each pollutant throughout the year(s). The most important finding I came across was that the 2012 data alone was not enough in most cases when attempting to forecast a future year, but the 3 year (2010, 2011, 2012) data combination allowed both the forecast() and auto.arima() functions to display their usefulness when forecasting. I certainly enjoyed preparing this study and learning about time series, and I hope that I am given the opportunity to further explore this discipline in the future.
  • 33. References [1] Kullback, S. (1959). Information Theory and Statistics. John Wiley and Sons. Republished by Dover Publications in 1968; reprinted in 1978: ISBN 0-8446-5625-9. [2] Yeo, I. K., and Johnson, R. A. (2000). A new family of power transformations to improve normality or symmetry. Biometrika. [3] Gocheva-Ilieva, Snezhana; Ivanov, A; Voynikova, Desislava; Boyadzhiev, Doychin. (2013). Time series analysis and forecasting for air pollution in small urban area: An SARIMA and factor analysis approach. Stochastic Environmental Research and Risk Assessment. 28. 1045-1060. 10.1007/s00477-013-0800-4. [4] Alcosser, Howard. "Diamond Bar High School" Internal Assessment: Mathematical Exploration. Web. 27 May 2015. [5] Jolliffe, Ian. (1986). Principal Component Analysis and Factor Analysis. 10.1007/978-1-4757-1904-8_7.