Reading Material - Module-3 - Introduction To Time Series Analysis
FINANCIAL ANALYTICS
Curated by Kiran Kumar K V
Histograms - Histograms depict the distribution of values in the time series, providing
insights into the data's central tendency and dispersion.
Autocorrelation plots - Autocorrelation measures the correlation between a time series
and a lagged version of itself. Autocorrelation plots help identify patterns such as
seasonality and cyclical behavior.
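As a brief illustration of these two techniques, the following sketch draws a histogram and an autocorrelation plot with pandas and matplotlib (the synthetic price series is a placeholder, since no dataset has been introduced at this point):

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from pandas.plotting import autocorrelation_plot

# Synthetic random-walk "price" series, used purely for illustration
rng = np.random.default_rng(42)
prices = pd.Series(100 + np.cumsum(rng.normal(0, 1, 500)))

# Histogram: distribution, central tendency, and dispersion of the values
prices.plot(kind="hist", bins=30, title="Distribution of Prices")
plt.show()

# Autocorrelation plot: correlation of the series with lagged copies of itself
autocorrelation_plot(prices)
plt.show()
```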
Identifying Patterns and Trends
Patterns and trends in time series data can provide valuable insights into the underlying
processes driving the data. Common patterns include:
Trend - The long-term movement or directionality of the data. Trends can be increasing,
decreasing, or stable over time.
Seasonality - Regular, periodic fluctuations in the data that occur at fixed intervals, such
as daily, weekly, or yearly patterns.
Cyclical behavior - Non-periodic fluctuations in the data that occur over longer time
frames, often associated with economic cycles or other recurring phenomena.
Irregularities - Random fluctuations or noise in the data that are not explained by trends,
seasonality, or cyclical patterns.
Seasonality - Seasonal effects refer to regular, repeating patterns in the data that occur
at fixed intervals, such as daily, weekly, or yearly cycles. Non-stationary series may exhibit
seasonality, leading to variations in mean and variance across different time periods.
Other Time-Dependent Structures - Non-stationary series may also display other time-
dependent structures, such as cyclical behavior or irregular fluctuations.
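These components (trend, seasonality, and the irregular remainder) can be made visible by decomposing a series. A minimal sketch using statsmodels' seasonal_decompose on synthetic monthly data:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.tsa.seasonal import seasonal_decompose

# Synthetic monthly series: trend + yearly seasonality + noise, for illustration
rng = np.random.default_rng(0)
n = 96
idx = pd.date_range("2015-01-01", periods=n, freq="MS")
series = pd.Series(
    0.5 * np.arange(n)                            # upward trend
    + 10 * np.sin(2 * np.pi * np.arange(n) / 12)  # yearly seasonality
    + rng.normal(0, 2, n),                        # irregular component
    index=idx,
)

# Additive decomposition into trend, seasonal, and residual components
result = seasonal_decompose(series, model="additive", period=12)
result.plot()
plt.show()
```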
Transformations to Achieve Stationarity
When dealing with non-stationary time series data, transformations can be applied to make
the series stationary. One common transformation technique is differencing, which involves
computing the difference between consecutive observations. By removing trends and
seasonality through differencing, the resulting series may exhibit stationarity.
First-order differencing - Computes the difference between each observation and its
immediate predecessor.
Higher-order differencing - In cases of higher-order trends or seasonality, multiple
difference operations may be necessary to achieve stationarity.
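A short sketch of both operations with pandas (the toy series has a quadratic trend, so one round of differencing is not enough):

```python
import pandas as pd

s = pd.Series([100, 103, 108, 115, 124, 135])  # series with a quadratic trend

first_diff = s.diff().dropna()          # first-order differencing
second_diff = s.diff().diff().dropna()  # second-order (higher-order) differencing

print(first_diff.tolist())   # [3.0, 5.0, 7.0, 9.0, 11.0] -> still trending
print(second_diff.tolist())  # [2.0, 2.0, 2.0, 2.0]       -> constant mean
```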
Importance of Stationarity
Stationarity is important in time series analysis for several reasons:
Many statistical techniques and models assume stationarity to produce accurate results.
For example, classic time series models like ARMA (Auto-Regressive Moving Average)
require stationary data.
Stationarity simplifies the analysis by providing a stable framework for interpreting the
data and making predictions.
Stationarity allows for meaningful comparisons between different time periods and
facilitates the identification of underlying patterns and relationships within the data.
ARMA Model
Combining the autoregressive and moving average components, an ARMA(p, q) model can
be expressed as the sum of an AR(p) process and an MA(q) process:
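X_t = c + \phi_1 X_{t-1} + \cdots + \phi_p X_{t-p} + \varepsilon_t + \theta_1 \varepsilon_{t-1} + \cdots + \theta_q \varepsilon_{t-q}

where X_t is the value of the series at time t, c is a constant, \phi_1, \ldots, \phi_p are the AR coefficients, \theta_1, \ldots, \theta_q are the MA coefficients, and \varepsilon_t is a white-noise error term.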
4. Power Transformations
Power transformations are a valuable technique used in time series analysis to stabilize the
variance of a series, especially when the variance is not constant across different levels of the
mean. By transforming the data using a power function, power transformations can help
address issues such as heteroscedasticity and non-normality, making the data more amenable
to analysis and modeling.
Motivation for Power Transformations
In many time series datasets, the variability of the observations may change as the mean of
the series changes. This phenomenon, known as heteroscedasticity, violates the assumption
of constant variance required by many statistical techniques. Additionally, non-normality in
the data distribution can affect the validity of statistical tests and inference procedures. Power
transformations, most notably the Box-Cox transformation, address these issues by applying a power function of the form:
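y^{(\lambda)} = \begin{cases} \dfrac{y^{\lambda} - 1}{\lambda}, & \lambda \neq 0 \\ \ln y, & \lambda = 0 \end{cases}

where y denotes the (strictly positive) original observations and λ is the transformation parameter.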
Estimating λ
The choice of the power parameter λ is crucial in the Box-Cox transformation and can
significantly impact the effectiveness of the transformation. Common methods for estimating
λ include maximum likelihood estimation (MLE), which seeks to maximize the likelihood
function of the transformed data, and graphical techniques such as the profile likelihood plot
or the Box-Cox plot.
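In practice, the MLE of λ is often obtained with scipy. A minimal sketch on synthetic, strictly positive, right-skewed data:

```python
import numpy as np
from scipy import stats

# Strictly positive, right-skewed sample data (Box-Cox requires y > 0)
rng = np.random.default_rng(1)
y = rng.lognormal(mean=0.0, sigma=0.5, size=500)

# When no lambda is supplied, scipy.stats.boxcox estimates it by maximum likelihood
y_transformed, lam = stats.boxcox(y)
print(f"MLE estimate of lambda: {lam:.3f}")
```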
Interpretation and Application
λ > 1: Indicates a convex power transformation, where the data is raised to a power greater
than 1. This stretches higher values relative to lower ones and is typically used to stabilize
variance in data with left-skewed distributions.
λ = 1: Leaves the data essentially unchanged (an identity transformation, up to a shift),
indicating that no transformation is needed.
λ = 0: Corresponds to the natural logarithm transformation, useful for stabilizing variance
and approximating normality in data with exponential growth patterns.
0 < λ < 1: Indicates a fractional (root) transformation, such as the square root at λ = 0.5,
commonly used to stabilize variance in data with right-skewed distributions.
Considerations and Limitations
The Box-Cox transformation assumes that the data values are strictly positive. For data
containing zero or negative values, alternative transformations may be necessary.
The choice of λ should be guided by both statistical considerations and domain
knowledge, as overly aggressive transformations can distort the interpretation of the data.
The effectiveness of the transformation should be assessed through diagnostic checks,
such as examining residual plots or conducting statistical tests for normality and
homoscedasticity.
Careful selection of model parameters is essential, as overly complex models may lead to
overfitting, while overly simple models may fail to capture important patterns in the data.
Interpretation of ARIMA model results should be done cautiously, considering the
assumptions and limitations of the model.
Python Project
Let’s create a Python project that fits an ARIMA model to a security’s price data:
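The original code listing is not reproduced here. A minimal sketch of the first steps, assuming the yfinance package, a placeholder ticker, and statsmodels for the Durbin-Watson statistic (column names can vary across yfinance versions):

```python
import yfinance as yf
from statsmodels.stats.stattools import durbin_watson

# Download adjusted closing prices; ticker and date range are placeholders
prices = yf.download("AAPL", start="2020-01-01", end="2024-01-01")["Adj Close"]

# First-order differencing to remove the trend
diff = prices.diff().dropna()

# Durbin-Watson statistic on the differenced series (ranges from 0 to 4)
print("Durbin-Watson:", durbin_watson(diff))
```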
The Durbin-Watson test statistic of 3.0786708200908914, which falls within the test's 0-to-4
range, indicates the degree of autocorrelation present in the differenced data.
The Durbin-Watson statistic ranges between 0 and 4, with a value close to 2 indicating
no autocorrelation. Values significantly below 2 indicate positive autocorrelation (i.e.,
consecutive observations are positively correlated), while values significantly above 2 indicate
negative autocorrelation (i.e., consecutive observations are negatively correlated).
In this case, the statistic of roughly 3.08 lies above 2, which points to mild negative
autocorrelation in the differenced data. It is, however, close enough to 2 that it is often
considered acceptable. Therefore, the differenced data is likely adequately stationary for
further analysis.
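The ACF and PACF plots discussed next can be produced with statsmodels, continuing the sketch above:

```python
import matplotlib.pyplot as plt
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

# ACF and PACF of the raw price series
fig, axes = plt.subplots(2, 1, figsize=(10, 8))
plot_acf(prices, lags=40, ax=axes[0], title="ACF of Price Data")
plot_pacf(prices, lags=40, ax=axes[1], title="PACF of Price Data")
plt.show()
```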
For the ACF (Autocorrelation Function) plot of the Price Data, the correlation values at all lag
levels are close to 1, indicating strong positive autocorrelation: each observation in the time
series is highly correlated with its neighboring observations. The slow decay of these
correlations shows that while adjacent observations are strongly related, the relationship
weakens as the lag increases. In other words, recent prices heavily influence one another, and
that influence gradually fades as we move further back in time.
Regarding the PACF (Partial Autocorrelation Function) plot of the Price Data, the lag-0 value
equals 1 by definition, while the lag-1 value close to 1 indicates strong partial autocorrelation
at the first lag. This suggests that each observation in the time series is significantly
influenced by its immediate predecessor. Additionally, all other partial autocorrelations being
close to 0 and falling within the significance range (represented by the shaded blue area)
indicate that once we account for the influence of these immediate predecessors, the
influence of other observations becomes minimal and not statistically significant. This
suggests that the current observation is predominantly influenced by its recent past, with
diminishing influence from observations further back in time.
The line plot of the stock returns illustrates a consistent level of volatility throughout the
analyzed period, characterized by fluctuations in returns around the mean. However, there
appears to be a notable increase in volatility during the initial period, marked by a spike in
the amplitude of fluctuations. This spike indicates a period of heightened variability in returns,
suggesting that significant market events or factors may have influenced stock performance
during that time.
In terms of stationarity, the visualization suggests that the data may not be entirely stationary.
Stationarity in a time series context typically refers to the statistical properties of the data
remaining constant over time, such as constant mean and variance. While the overall pattern
of returns shows relatively steady volatility, the spike in volatility during the initial period
suggests a departure from stationarity. Stationarity assumptions are crucial for many time
series analysis techniques, such as ARIMA modeling, as violations of stationarity can lead to
unreliable model results. Therefore, further investigation and potentially applying
transformations or differencing techniques may be necessary to achieve stationarity in the
data before proceeding with analysis.
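As a complementary check (not shown in the original output), the returns and a formal stationarity test can be computed as follows, continuing the sketch:

```python
from statsmodels.tsa.stattools import adfuller

# Daily returns and a visual check of their volatility
returns = prices.pct_change().dropna()
returns.plot(title="Daily Returns")

# Augmented Dickey-Fuller test: null hypothesis = unit root (non-stationary);
# a small p-value supports treating the series as stationary
adf_stat, p_value, *_ = adfuller(returns)
print(f"ADF statistic: {adf_stat:.3f}, p-value: {p_value:.4f}")
```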
The output represents the results of a stepwise search to find the best ARIMA model order
that minimizes the Akaike Information Criterion (AIC), a measure used for model selection.
Each line in the output corresponds to a different ARIMA model that was evaluated during
the search.
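The listing itself is omitted here; a stepwise search of this kind is commonly run with pmdarima's auto_arima. A sketch, assuming the pmdarima package and applying the search to the returns series (the original may have modeled the adjusted close directly):

```python
from pmdarima import auto_arima

# Stepwise search over (p, d, q) orders, minimizing AIC; trace=True prints
# one line per candidate model, as in the output discussed below
model = auto_arima(
    returns,
    seasonal=False,
    stepwise=True,
    trace=True,
    information_criterion="aic",
)
print("Best ARIMA Model Order:", model.order)
```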
Here's an interpretation of the output:
ARIMA(p,d,q)(P,D,Q)[m] - This notation represents the parameters of the ARIMA model,
where 'p' denotes the number of autoregressive (AR) terms, 'd' denotes the degree of
differencing, 'q' denotes the number of moving average (MA) terms, 'P', 'D', and 'Q' denote
seasonal AR, differencing, and MA terms, respectively, and 'm' denotes the seasonal
period.
AIC - The Akaike Information Criterion is a measure of the relative quality of a statistical
model for a given set of data. Lower AIC values indicate better-fitting models.
Time - The time taken to fit the respective ARIMA model.
Best model - The ARIMA model with the lowest AIC value, which indicates the best-fitting
model according to the stepwise search.
Best ARIMA Model Order - This indicates the order of the best-fitting ARIMA model, in this
case, (4, 0, 5), suggesting that the model includes four AR terms, no differencing, and five
MA terms.
Total fit time - The total time taken for the entire stepwise search process.
In this output, the best-fitting ARIMA model order is (4, 0, 5), meaning it includes four AR
terms, no differencing, and five MA terms. This model achieved the lowest AIC value among
all the tested models, indicating its superiority in capturing the underlying patterns in the
data. However, it's essential to further evaluate the model's performance using diagnostic
checks to ensure its adequacy for forecasting purposes.
The provided output represents the results of fitting a SARIMAX (Seasonal AutoRegressive
Integrated Moving Average with eXogenous regressors) model to the data.
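A fit like this might be produced as follows (a sketch continuing the example, using the order found above and a constant term to match the "const" row in the summary):

```python
from statsmodels.tsa.statespace.sarimax import SARIMAX

# Fit the selected ARIMA(4, 0, 5) order; trend="c" adds the constant term
sarimax_model = SARIMAX(returns, order=(4, 0, 5), trend="c")
results = sarimax_model.fit(disp=False)
print(results.summary())
```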
Here's an interpretation of the key components of the output:
Dep. Variable - This indicates the dependent variable used in the model, which in this case
is "Adj Close" representing adjusted closing prices of the stock.
No. Observations - The number of observations used in the model, which is 989.
Model - Specifies the type of model used. In this case, it's an ARIMA(4, 0, 5) model,
indicating four autoregressive (AR) terms, no differencing (d = 0), and five moving average
(MA) terms.
Log Likelihood - The log-likelihood value, which is a measure of how well the model fits
the data. Higher values indicate better fit.
AIC (Akaike Information Criterion) - A measure of the relative quality of a statistical model
for a given set of data. Lower AIC values indicate better-fitting models. In this case, the
AIC value is -5808.515.
BIC (Bayesian Information Criterion) - Similar to AIC, BIC is used for model selection. It
penalizes model complexity more strongly than AIC. Lower BIC values indicate better-
fitting models. Here, the BIC value is -5754.651.
Sample - Specifies the range of observations used in the model. In this case, it ranges from
0 to 989.
Covariance Type - Specifies the type of covariance estimator used in the model. In this
case, it's "opg" (outer product of gradients), which is one of the available options for
estimating the covariance matrix.
The provided output represents the coefficient estimates, standard errors, z-values, p-values,
and the confidence intervals for each coefficient in the SARIMAX model.
const - Represents the intercept term in the model. In this case, the coefficient is 1.505e-
06 with a standard error of 5.46e-05. The z-value is 0.028, and the p-value is 0.978,
indicating that the intercept term is not statistically significant at conventional levels of
significance (e.g., α = 0.05).
ar.L1, ar.L2, ar.L3, ar.L4 - These are the autoregressive (AR) terms in the model. They
represent the coefficients of the lagged values of the dependent variable. The coefficients
represent the strength and direction of the relationship between the variable and its
lagged values. All of these coefficients have p-values less than 0.05, indicating that they
are statistically significant.
ma.L1, ma.L2, ma.L3, ma.L4, ma.L5 - These are the moving average (MA) terms in the
model. They represent the coefficients of the lagged forecast errors. Similar to the AR
terms, these coefficients indicate the strength and direction of the relationship between
the forecast errors and their lagged values. Notably, ma.L2 has a p-value greater than 0.05,
suggesting that it is not statistically significant.
sigma2 - Represents the variance of the residuals (error term) in the model. It is estimated
to be 0.0002 with a high level of statistical significance.
The provided output includes diagnostic test results for the SARIMAX model, including the
Ljung-Box test, Jarque-Bera test, and tests for heteroskedasticity.
Ljung-Box (L1) (Q) - The Ljung-Box test is a statistical test that checks whether any group
of autocorrelations of a time series is different from zero. The "L1" in parentheses indicates
the lag at which the test is performed. In this case, the test statistic is 0.24, and the p-value
(Prob(Q)) is 0.63. Since the p-value is greater than the significance level (commonly 0.05),
we fail to reject the null hypothesis of no autocorrelation in the residuals at lag 1.
Jarque-Bera (JB) - The Jarque-Bera test is a goodness-of-fit test that checks whether the
data follows a normal distribution. The test statistic is 5303.06, and the p-value (Prob(JB))
is reported as 0.00. A low p-value suggests that the residuals are not normally distributed.
Heteroskedasticity (H) - Heteroskedasticity refers to the situation where the variability of a
variable is unequal across its range. The test for heteroskedasticity reports a test statistic
of 0.16 and a p-value (Prob(H)) of 0.00. A low p-value suggests that there is evidence of
heteroskedasticity in the residuals.
Skew and Kurtosis - Skewness measures the asymmetry of the distribution of the residuals,
and kurtosis measures the "tailedness" or thickness of the tails of the distribution. In this
case, skew is reported as 0.07, indicating a slight skewness, and kurtosis is reported as
14.34, indicating heavy-tailedness.
Overall, these diagnostic tests provide valuable information about the adequacy of the
SARIMAX model. While the Ljung-Box test suggests no significant autocorrelation at lag 1,
the Jarque-Bera test indicates potential non-normality in the residuals, and the test for
heteroskedasticity suggests unequal variability. These findings may warrant further
investigation or model refinement.
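These diagnostics can be reproduced and inspected visually from the fitted results object; a sketch:

```python
import matplotlib.pyplot as plt
from statsmodels.stats.diagnostic import acorr_ljungbox

# Ljung-Box test on the residuals at lag 1 (corresponds to the Prob(Q) line above)
print(acorr_ljungbox(results.resid, lags=[1]))

# Built-in 2x2 diagnostic panel: standardized residuals, histogram with KDE,
# normal Q-Q plot, and correlogram
results.plot_diagnostics(figsize=(12, 8))
plt.show()
```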
Predicted values
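The predicted values themselves are not reproduced here. A sketch of how they can be generated from the fitted results object, with confidence intervals:

```python
# Out-of-sample forecast for the next 30 periods
forecast = results.get_forecast(steps=30)

print(forecast.predicted_mean.head())  # point forecasts
print(forecast.conf_int().head())      # lower/upper confidence bounds
```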