A series of participative sessions among coding enthusiasts interested in data analysis, number crunching and teaching of statistics. Quite R-based but welcoming other languages and tools.
Abstract: This is the contribution to the Coding Club UC3M. The aim is to show the use of TensorFlow with Keras for classification and prediction in Time Series Analysis. The latter is implemented with a Long Short Term Memory (LSTM) model (an instance of a Recurrent Neural Network designed to mitigate the vanishing gradient problem).
Introduction
The code below aims to give a quick introduction to Deep Learning analysis with TensorFlow, using Keras as the front-end, in the R environment. Keras is a high-level neural networks API, developed with a focus on enabling fast experimentation rather than on final products. Keras, and in particular the keras R package, can also run computations on the GPU if the installation environment allows for it.
Installing Keras and TensorFlow in Windows … otherwise it will be simpler
and this will install the Google TensorFlow module in Python.
If you want it to work on the GPU and you have a suitable CUDA version, you can install it with the tensorflow = "gpu" option.
Simple check
Background on Neural Networks
Example: the old faithful iris data
Consider the well-known iris data set
We want to build an iris species classifier based on the four observed iris dimensions. This is the usual classification (prediction) problem, so we have to consider a training sample and evaluate the classifier on a test sample.
Data in TensorFlow
Data are
Matrices (`matrix`) of doubles.
Categorical variables need to be encoded as dummy variables: one-hot encoding.
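As an illustration of the one-hot encoding just mentioned, here is a minimal sketch in plain Python (chosen only for portability; it is not the code of this session). The helper `one_hot` and the toy label vector are hypothetical. In the keras R package the same job is usually done with `to_categorical()`.

```python
def one_hot(labels):
    """Return (levels, matrix): one row per label, with a single 1.0 in the column of its level."""
    levels = sorted(set(labels))  # fixed, reproducible column order
    column = {level: j for j, level in enumerate(levels)}
    matrix = [
        [1.0 if column[label] == j else 0.0 for j in range(len(levels))]
        for label in labels
    ]
    return levels, matrix


# Toy labels (an assumption for illustration, not the full iris data set).
species = ["setosa", "versicolor", "virginica", "setosa"]
levels, y = one_hot(species)

for label, row in zip(species, y):
    print(label, row)
```

Each row sums to one, so the encoded response can be compared directly with a vector of class probabilities.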
Training and Test Data Sets
Define training and test
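A random split can be sketched as follows in plain Python. The 80/20 proportion, the seed and the helper name `train_test_split` are assumptions made for illustration; the split actually used in the session may differ.

```python
import random


def train_test_split(n_rows, train_fraction=0.8, seed=1):
    """Randomly partition the row indices 0..n_rows-1 into train and test sets."""
    rng = random.Random(seed)          # local generator, so the split is reproducible
    indices = list(range(n_rows))
    rng.shuffle(indices)
    n_train = int(round(train_fraction * n_rows))
    return sorted(indices[:n_train]), sorted(indices[n_train:])


# The iris data set has 150 rows.
train_idx, test_idx = train_test_split(150)
print(len(train_idx), len(test_idx))
```

The two index sets are disjoint and together cover all rows, so no observation is used both for fitting and for evaluation.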
Model building
Initialize the model
and suppose we use a very simple one
This is the model structure:
and a graphical representation:
In the plot, blue colors stand for input and green ones for output.
Its analytic representation is the following one:
where the activation function is the softmax (the good old logistic!):
which estimates $Pr(Species = j|\mathbf{x} = (PW,PL,SW,SL))$.
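For reference, the standard softmax maps a vector of scores $z$ to probabilities $\exp(z_j)/\sum_k \exp(z_k)$. The sketch below, in plain Python, is only an illustration: the scores are made up, and subtracting the maximum is a common numerical safeguard rather than something taken from this session.

```python
import math


def softmax(scores):
    """Map real-valued scores to probabilities that sum to one."""
    shift = max(scores)                       # subtracting the max avoids overflow in exp
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical scores for the three species.
probs = softmax([2.0, 1.0, 0.1])
predicted_class = probs.index(max(probs))     # class with the largest probability
print(probs, predicted_class)
```

With two classes and scores $(z, 0)$ the first probability reduces to $1/(1+e^{-z})$, which is why the softmax can be read as a multi-class logistic.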
Model fitting: fit() and the optimizer
Estimation consists in finding the weights $\mathbf{w}$ that minimize a loss function. For instance, if the response $Y$ were quantitative, then
whose solution is given by the usual equations obtained by setting the derivatives with respect to $w$ to zero:
Note however, that
is parallelizable in batches of samples (of length batch_size), that is
where $n_l$ is the batch_size.
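To make the batching idea concrete, here is a minimal sketch in plain Python. It assumes the one-weight model without intercept, $y_i \approx w x_i$, whose least squares solution is $\hat{w} = \sum_i x_i y_i / \sum_i x_i^2$; the helper `batched_sums` and the toy data are hypothetical.

```python
def batched_sums(x, y, batch_size):
    """Accumulate sum(x*y) and sum(x*x) batch by batch."""
    sxy, sxx = 0.0, 0.0
    for start in range(0, len(x), batch_size):
        xb = x[start:start + batch_size]
        yb = y[start:start + batch_size]
        # Each batch contributes independently, so batches could run in parallel.
        sxy += sum(a * b for a, b in zip(xb, yb))
        sxx += sum(a * a for a in xb)
    return sxy, sxx


# Toy data generated with w = 2 and no noise.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0]

sxy, sxx = batched_sums(x, y, batch_size=2)
w_hat = sxy / sxx
print(w_hat)
```

The estimate does not depend on the batch size, because the sums over the whole sample are just sums of the per-batch sums.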
Suppose in general a loss function without an analytical minimizer (the usual case in more complicated networks), written here as $Q(w) = \sum_{i = 1}^m(y_i-wx_i)^2$, that is, suppose that the solution of $\frac{\partial Q(w)}{\partial w} = 0$ is not available analytically. Then we would have to use an iterative optimizer in the spirit of “Newton-Raphson” (the gradient optimizers), whose best known member in Deep Learning (DL) is Stochastic Gradient Descent (SGD):
Starting from an initial weight $w^{(0)}$, at step $m$:
where $\eta>0$ is the Learning Rate: the lower (bigger) $\eta$ is, the more (fewer) steps are needed to achieve the optimum, with a greater (worse) precision.
It is stochastic in the sense that the index $i$ of the sample is random (which helps against overfitting): $\Delta Q(w) := \Delta Q_i(w)$. This also induces complications when (if) dealing with time series.
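The update described above can be sketched in plain Python for the one-weight loss $Q(w) = \sum_i (y_i - w x_i)^2$. This is an illustration under stated assumptions, not the optimizer used by Keras: the function `sgd`, the toy data, the number of steps and the seed are all hypothetical, and one randomly chosen observation is used per step.

```python
import random


def sgd(x, y, eta=0.01, steps=2000, w0=0.0, seed=1):
    """Stochastic gradient descent for the loss sum_i (y_i - w * x_i)^2."""
    rng = random.Random(seed)
    w = w0
    for _ in range(steps):
        i = rng.randrange(len(x))                  # random sample index
        grad_i = -2.0 * x[i] * (y[i] - w * x[i])   # derivative of (y_i - w*x_i)^2 in w
        w -= eta * grad_i                          # step against the gradient
    return w


# Toy data generated with w = 2 and no noise.
x = [0.5, 1.0, 1.5, 2.0]
y = [1.0, 2.0, 3.0, 4.0]

w_hat = sgd(x, y)
print(w_hat)
```

With a smaller `eta` the same number of steps leaves the weight further from the optimum, which is the trade-off between learning rate and number of steps.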
There are other variations to the SGD: Momentum, Averaging, AdaGrad, Adam, …
Using SGD with $\eta = 0.01$ we have to set:
and then this is plugged into the model and used afterwards in compilation. Once the loss function $Q$ is established (here we use categorical_crossentropy because the response is a non-binary categorical variable):
we have to train it in epochs (i.e. passes over the training sample made of the steps above) using a portion of the training sample, validation_split, to check for possible overfitting (i.e. the model is fitted and the loss is evaluated on that held-out part of the sample, which Keras takes from the end of the training data and which is not used for training):
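To see what the categorical cross-entropy loss measures, here is a minimal sketch in plain Python. It averages $-\sum_j y_{ij}\log p_{ij}$ over observations; the clipping constant `eps` and the toy probabilities are assumptions made for illustration, not values from this session.

```python
import math


def categorical_crossentropy(y_true, y_prob, eps=1e-12):
    """Average of -sum_j y_ij * log(p_ij) over the observations."""
    total = 0.0
    for truth_row, prob_row in zip(y_true, y_prob):
        # eps guards against log(0); it is an illustrative choice.
        total += -sum(t * math.log(max(p, eps)) for t, p in zip(truth_row, prob_row))
    return total / len(y_true)


# One-hot responses and two hypothetical sets of predicted probabilities.
y_true = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
y_good = [[0.9, 0.05, 0.05], [0.1, 0.8, 0.1]]
y_bad = [[0.2, 0.4, 0.4], [0.6, 0.2, 0.2]]

loss_good = categorical_crossentropy(y_true, y_good)
loss_bad = categorical_crossentropy(y_true, y_bad)
print(loss_good, loss_bad)
```

Only the probability assigned to the true class enters the loss, so confident correct predictions give a loss close to zero.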
The result of the trained model is:
Validation on the test sample:
with a validation score
Another example: Classification of breast cancer
We have 10 variables (all factors) and a binary response: benign versus malignant.
Data in matrices
Set training and test
Let’s build the DL model with three layers of neurons:
As activation function (the response being binary) we use a user-defined relu ($f(x) = x^+$):
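A minimal sketch of such a function in plain Python, together with a single neuron that uses it. The names `relu` and `dense_relu`, and the weights, are hypothetical and only illustrate $f(x) = x^+$.

```python
def relu(x):
    """Rectified linear unit: the positive part of x."""
    return max(0.0, x)


def dense_relu(inputs, weights, bias):
    """One neuron: weighted sum of the inputs plus a bias, passed through relu."""
    return relu(sum(w * v for w, v in zip(weights, inputs)) + bias)


values = [relu(v) for v in (-2.0, 0.0, 3.5)]
print(values)
print(dense_relu([1.0, 2.0], [0.5, 1.0], 0.5))
```

Negative pre-activations are set to zero and positive ones pass through unchanged.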
Let’s use the adam optimizer
Train the model
Validate it on the test set
also with a score
LSTM model
Here we apply DL to time series analysis: train and test samples cannot be drawn at random, and they must instead be sequences of consecutive observations whose lengths are multiples of batch_size.
Data
From Yahoo Finance let’s download the IBEX 35 time series over the last 15 years and consider the last 3000 days of trading:
The Yahoo database query and the ACF of the considered IBEX 35 series are here:
Training and Testing samples
Data must be standardized
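A minimal standardization sketch in plain Python follows. It assumes the usual z-score (subtract the mean, divide by the standard deviation) and, as a further assumption, returns the centre and scale so that they can be reused on the test sample; the toy values are not IBEX 35 data.

```python
import statistics


def standardize(series, center=None, scale=None):
    """Return the z-scores of a series together with the centre and scale used."""
    if center is None:
        center = statistics.mean(series)
    if scale is None:
        scale = statistics.stdev(series)   # sample standard deviation
    return [(v - center) / scale for v in series], center, scale


# Toy values (hypothetical).
prices = [10.0, 12.0, 11.0, 13.0, 14.0]
z, center, scale = standardize(prices)
print(z)

# Reusing the training centre and scale on new data avoids leaking test information.
z_new, _, _ = standardize([15.0, 16.0], center=center, scale=scale)
```

Predictions made on the standardized scale can be mapped back with `value * scale + center`.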
Let’s use the first 2000 days for training and the last 1000 for test. Remember that the ratio between the number of train samples and the number of test samples must be an integer, and so must the ratio between each of these two lengths and batch_size. This is why we use 2000/1000, 2000/50 and 1000/50:
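These divisibility constraints are easy to check before building the arrays. The helper `check_lengths` below is a hypothetical sketch in plain Python, not part of Keras.

```python
def check_lengths(n_train, n_test, batch_size):
    """Return the list of violated constraints (empty when the lengths are compatible)."""
    problems = []
    if n_train % n_test != 0:
        problems.append("n_train is not a multiple of n_test")
    if n_train % batch_size != 0:
        problems.append("n_train is not a multiple of batch_size")
    if n_test % batch_size != 0:
        problems.append("n_test is not a multiple of batch_size")
    return problems


ok = check_lengths(2000, 1000, 50)     # the choice made in the text
bad = check_lengths(2000, 900, 50)     # a choice that breaks the first constraint
print(ok, bad)
```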
Data for LSTM
Predictor $X$ is a 3D array:
first dimension is the length of the time series;
second is the lag;
third is the number of variables used for prediction $X$ (at least 1 for the series at a given lag).
Response $Y$ is a 2D matrix:
first dimension is the length of the time series;
second is the lag.
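The shapes listed above can be illustrated with nested lists in plain Python. This is a sketch under assumptions, not the construction used in the session: each row of $X$ holds `lag` consecutive values of a single variable, each row of $Y$ holds the corresponding one-step-ahead values, and the number of rows is the series length minus the lag. The helper `make_lstm_arrays` is hypothetical.

```python
def make_lstm_arrays(series, lag):
    """Build X with shape (n_samples, lag, 1) and Y with shape (n_samples, lag)."""
    n_samples = len(series) - lag
    # X[t][k][0] is the series value at time t + k (one variable, hence the trailing 1).
    x = [[[series[t + k]] for k in range(lag)] for t in range(n_samples)]
    # Y[t][k] is the value one step after X[t][k][0].
    y = [[series[t + k + 1] for k in range(lag)] for t in range(n_samples)]
    return x, y


# Toy series (hypothetical values).
series = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
x, y = make_lstm_arrays(series, lag=2)

shape_x = (len(x), len(x[0]), len(x[0][0]))
shape_y = (len(y), len(y[0]))
print(shape_x, shape_y)
```

With `lag = 1` the arrays reduce to the series itself as predictor and the series shifted by one step as response.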
The LSTM model codified with Keras
Let’s train in 2000 steps. Remember: since the model is stateful (stateful = TRUE), which means that the signal state (the latent part of the model) is carried over from one batch of the time series to the next, you need to reset the states manually (otherwise batches are supposed to be independent sequences (!)):
The prediction
more on validation:
Some notes on Deep Learning
A deep learning (DL) model is a neural network with many layers of neurons (Schmidhuber 2015). It is an algorithmic approach rather than a probabilistic one; see (Breiman 2001) for the merits of both approaches. Each neuron is a deterministic function, so that a neuron of a neuron is a function of a function, along with an associated weight $w$. Essentially, for a response variable $Y_i$ for the unit $i$ and a predictor $X_i$ we have to estimate $Y_i = w_1f_1(w_2f_2(\ldots(w_kf_k(X_i))))$, and the larger $k$ is, the “deeper” the network is. With many stacked layers of neurons all connected (a.k.a. dense layers) it is possible to capture strong non-linearities and all interactions among variables. The approach to model estimation underpinned by a DL model is that of function composition, as opposed to that of additive functions underpinned by the usual regression techniques, including the most modern ones (i.e. $Y_i = w_1f_1(X_i)+w_2f_2(X_i)+\ldots+w_kf_k(X_i)$). A thorough review of DL can be found in (Schmidhuber 2015).
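The contrast between composition and addition can be sketched in plain Python. For simplicity the sketch assumes the same function (`tanh`) for every $f_j$ and scalar weights; both choices, and the function names, are illustrative assumptions.

```python
import math


def compose(x, weights, activation=math.tanh):
    """Evaluate w_1 f(w_2 f(... w_k f(x))), starting from the innermost layer."""
    value = x
    for w in reversed(weights):        # w_k is applied first, w_1 last
        value = w * activation(value)
    return value


def additive(x, weights, activation=math.tanh):
    """Evaluate w_1 f(x) + w_2 f(x) + ... + w_k f(x)."""
    return sum(w * activation(x) for w in weights)


weights = [3.0, 2.0]                   # hypothetical weights, k = 2
deep = compose(0.5, weights)
flat = additive(0.5, weights)
print(deep, flat)
```

In the additive form every term sees the raw predictor, whereas in the composed form each layer sees only the output of the layer below it.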
The DL model can also be interpreted as a maximum a posteriori estimation of $Pr(Y|X,Data)$ (Polson and Sokolov 2017) for Gaussian process priors. Despite this, and because of its complexity, the whole distribution $Pr(Y|X,Data)$ cannot be evaluated, but only its mode.
Estimating a DL model consists in just estimating the vectors $w_1,\ldots,w_k$. The estimation requires evaluating a multidimensional gradient which cannot be evaluated jointly for all observations, because of its dimensionality and complexity. Recall that the derivative of a composite function is the product of the derivatives of the inner functions (i.e. the chain rule $(f\circ g)' = (f'\circ g)\cdot g'$), which is implemented, for purposes of computational feasibility, as a tensor product. Such a tensor product is evaluated for batches of observations and it is implemented in the open source software known as Google TensorFlow (Abadi et al. 2015).
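The chain rule can be checked numerically on a single neuron. The sketch below, in plain Python, assumes $f = \tanh$ and $g(x) = wx$ with a hypothetical weight $w = 1.5$, and compares the analytic derivative with a central finite difference; it illustrates the rule only and says nothing about how TensorFlow organises the computation.

```python
import math

W = 1.5  # hypothetical weight


def f(u):
    return math.tanh(u)


def f_prime(u):
    return 1.0 - math.tanh(u) ** 2


def g(x):
    return W * x


def g_prime(x):
    return W


def chain_rule_derivative(x):
    """(f o g)'(x) = f'(g(x)) * g'(x)."""
    return f_prime(g(x)) * g_prime(x)


def numerical_derivative(x, h=1e-6):
    """Central finite difference of the composed function f(g(x))."""
    return (f(g(x + h)) - f(g(x - h))) / (2.0 * h)


x0 = 0.3
analytic = chain_rule_derivative(x0)
numeric = numerical_derivative(x0)
print(analytic, numeric)
```

With more layers the same rule is applied repeatedly, multiplying one factor per layer.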
Fundamentals of LSTM can be found in (Sherstinsky 2018) (it needs some translation into the statistical formalism).
References
Abadi, Martín, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, et al. 2015. “TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems.” https://www.tensorflow.org/.
Breiman, Leo. 2001. “Statistical Modeling: The Two Cultures (with Comments and a Rejoinder by the Author).” Statistical Science 16 (3). Institute of Mathematical Statistics: 199–231.
Polson, Nicholas G., and Vadim Sokolov. 2017. “Deep Learning: A Bayesian Perspective.” Bayesian Analysis 12 (4). International Society for Bayesian Analysis: 1275–1304.
Schmidhuber, Jürgen. 2015. “Deep Learning in Neural Networks: An Overview.” Neural Networks 61. Elsevier: 85–117.