Copyright (C) 2020 Laura E. Boucheron
This information is free; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 3 of the License, or (at your option) any later version.
This work is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received a copy of the GNU General Public License along with this work; if not, see https://www.gnu.org/licenses/.
In this tutorial, we will introduce the basic structure and common components (convolutional layers, pooling layers, nonlinearities, fully connected layers, etc.) of deep learning networks through a combination of illustrations and hands-on implementation of a network. By the end of this tutorial, we will have built from scratch a deep convolutional neural network to operate on the standard MNIST handwritten digits dataset. We will then explore some ways of probing the characteristics of the trained network to help us debug common pitfalls in adapting network architectures.
This tutorial contains 7 sections:
There are a few subsections with the heading "Your turn" throughout this tutorial in which you will be asked to apply what you have learned.
Portions of this tutorial have been taken or adapted from https://elitedatascience.com/keras-tutorial-deep-learning-in-python and the documentation at https://keras.io.
There are two main types of cells in this notebook: code and markdown (text). You can add a new cell with the plus sign in the menu bar above and you can change the type of cell with the dropdown menu in the menu bar above. As you complete this tutorial, you may wish to add additional code cells to try out your own code and markdown cells to add your own comments or notes.
Markdown cells can be augmented with a number of text formatting features, including embedded $\LaTeX$, monotype specification of code syntax, bold font, and italic font. There are many other features of markdown cells--see the jupyter documentation for more information.
You can edit a cell by double clicking on it. If you double click on this cell, you can see how to implement the various formatting referenced above. Code cells can be run and markdown cells can be formatted using Shift+Enter or by selecting the Run button in the toolbar above.
Once you have completed all (or part) of this notebook, you can share your results with colleagues by sending them the .ipynb file. Your colleagues can then open the file and will see your markdown and code cells as well as any results that were printed or displayed at the time you saved the notebook. If you prefer to send a notebook without results displayed (as this notebook appeared when you downloaded it), you can select "Restart & Clear Output" from the Kernel menu above. You can also export this notebook in a non-executable form, e.g., .pdf, through the File, Save As menu.
Here, at the top of the code, we import all the libraries necessary for this tutorial. We will introduce the functionality of any new libraries throughout the tutorial, but include all import statements here as standard coding practice. We include a brief comment after each library here to indicate its main purpose within this tutorial.
It would be best to run this next cell before the workshop starts to make sure you have all the necessary packages installed on your machine.
A few other notes:
Using Theano backend
or
Using TensorFlow backend
keras
library also includes a tool to do just that. After you have downloaded the dataset for the first time, keras
will load the dataset from its local location.
import numpy as np # mathematical and scientific functions
import matplotlib.pyplot as plt # visualization
# format matplotlib options
%matplotlib inline
plt.rcParams.update({'font.size': 20})
import keras.backend # information on the backend that keras is using
from keras.utils import np_utils # functions to wrangle label vectors
from keras.models import Sequential # the basic deep learning model
from keras.layers import Dense, Flatten, Convolution2D, MaxPooling2D # important CNN layers
from keras.models import load_model # to load a pre-saved model (may require hdf libraries installed)
from keras.datasets import mnist # the MNIST dataset
from keras.datasets import fashion_mnist # the Fashion-MNIST dataset
Using TensorFlow backend.
Open a terminal from inside JupyterLab (File > New > Terminal) and type the following commands
source activate
wget https://kerriegeil.github.io/NMSU-USDA-ARS-AI-Workshops/aiworkshop.yml
conda env create --prefix /project/your_project_name/envs/aiworkshop -f aiworkshop.yml
This will build the environment in one of your project directories. It may take 5 minutes to build the Conda environment.
See https://kerriegeil.github.io/NMSU-USDA-ARS-AI-Workshops/setup/ for more information.
When the environment finishes building, select this environment as your kernel in your Jupyter Notebook (click top right corner where you see Python 3, select your new kernel from the dropdown menu, click select)
You will want to do this BEFORE the workshop starts.
The line in the code cell above that reads from keras.datasets import mnist has loaded the keras package that interfaces with the local copy of the MNIST dataset.
Before we get going, let's check which backend keras is using. All subsequent instructions should be valid for either tensorflow or theano.
print(keras.backend.backend())
tensorflow
As a note, there are other datasets available as part of keras.datasets; see https://keras.io/datasets/ for more information.
Now we can use the mnist.load_data function to read in the standard training data and test data. The first time you run the following command you will see a printout of the download progress. Subsequent times you run the command, you will not see any printout as the data will be loaded from where keras stored it locally on your computer. The mnist.load_data function outputs numpy arrays.
(X_train, y_train), (X_test, y_test) = mnist.load_data()
Downloading data from https://s3.amazonaws.com/img-datasets/mnist.npz 11493376/11490434 [==============================] - 6s 0us/step
In loading the MNIST data, we are storing the data (images) in X_train and X_test and the corresponding labels in y_train and y_test. It is common convention to label the input data with a capital 'X' and the labels with a lowercase 'y'. Since these data are images which can be represented as arrays, the convention of using 'X' and 'y' comes from matrix notation where vectors are assigned a lowercase variable and matrices an uppercase variable.
We know that the MNIST dataset consists of 70,000 examples of $28\times28$ pixel images of handwritten digits from 0-9. We also know that there are 60,000 images reserved for training and 10,000 reserved for testing. As such, we expect the dimensionality of X_train and X_test to reflect this. We print the shape of the two variables.
print('The dimensions of X_train are:')
print(X_train.shape)
print('The dimensions of X_test are:')
print(X_test.shape)
The dimensions of X_train are: (60000, 28, 28) The dimensions of X_test are: (10000, 28, 28)
We also check the variable types of X_train and X_test. Since the mnist.load_data function outputs numpy arrays, we need to use the dtype attribute to query the variable type.
print('The variable type of X_train is:')
print(X_train.dtype)
print('The variable type of X_test is:')
print(X_test.dtype)
The variable type of X_train is: uint8 The variable type of X_test is: uint8
In the literature and documentation related to deep learning, you will see the word "tensor" quite often. We have just encountered our first tensors. Think of tensors as multidimensional arrays. X_train took the 60,000 $28\times28$ 2D pixel arrays, each of which represents an image, and stacked them to create a 3D array (tensor). Before we're done here, we'll add a fourth dimension to X_train and X_test.
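As a small illustration of the stacking idea described above, the following sketch builds a 3D tensor from a few made-up 2D arrays and then adds a fourth axis. The tiny $2\times2$ "images" and the variable names are assumptions chosen for illustration; they are not part of the MNIST data.

```python
import numpy as np

# Three made-up 2x2 "images" (stand-ins for the 28x28 MNIST digits)
images = [np.full((2, 2), fill_value=v, dtype=np.uint8) for v in (0, 127, 255)]

# Stacking 2D arrays along a new first axis gives a 3D tensor:
# samples x height x width
stack = np.stack(images, axis=0)
print(stack.shape)

# Adding a trailing axis of length 1 gives a 4D tensor:
# samples x height x width x channels
stack4d = stack[..., np.newaxis]
print(stack4d.shape)
```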
From these dimensions, it appears that the first dimension indexes the sample (image) and the second and third dimensions index the spatial dimensions of the image. It also appears that the images are uint8. We can check this assumption by visualizing one of the samples of X_train. In this case we look at the first image in X_train.
plt.figure()
plt.imshow(X_train[0],cmap='gray')
plt.show()
Look at some other images in X_train or X_test. Does there appear to be any order in which the digits appear?
plt.figure(figsize=(20,20))
for k in range(0,10):
    plt.subplot(1,10,k+1)
    plt.imshow(X_train[k],cmap='gray')
    plt.axis('off')
plt.show()
There does not appear to be any order to the digits--they appear to be random.
The y_train variable contains the label, or the "truth" of what is represented in the image. We can print out the label for the same image we visualized above (the first image in X_train).
print(y_train[0])
5
This indicates that the image we plotted above corresponds to a ground truth label of '5'.
Revise your code from above to title your plot with the ground truth label.
plt.figure(figsize=(20,20))
for k in range(0,10):
    plt.subplot(1,10,k+1)
    plt.imshow(X_train[k],cmap='gray')
    plt.title(y_train[k])
    plt.axis('off')
plt.show()
In addition to providing the labels for training a supervised classifier, these label vectors provide an important way to index into our dataset. The following subsection illustrates one use of the label vector.
We can get a brief sense of the sort of variation included in this dataset by plotting 10 examples for each of the digits. The following code makes use of the X_train variable and also the corresponding labels in y_train.
In the following code, we loop over the 10 digits using variable d and over 10 examples using variable k. We plot the first 10 examples for each digit. Let's take a more careful look at the syntax X_train[np.where(y_train==d)[0][k],:,:]
- np.where(y_train==d) finds those indexes where the ground truth indicates that we have a specific digit d
- the np.where command returns a tuple; in this case there is only one dimension to the tuple, so we pull off the first dimension, so we have np.where(y_train==d)[0]
- we then select the k-th index, so we have np.where(y_train==d)[0][k]
- finally, we index into X_train to pull out the k-th instance of the digit d, and we have X_train[np.where(y_train==d)[0][k],:,:]
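The indexing steps above can be traced on a toy label vector. This is a minimal sketch; the ten-entry labels array and the choice of d and k are assumptions made for illustration.

```python
import numpy as np

labels = np.array([5, 0, 4, 1, 9, 2, 1, 3, 1, 4])  # toy label vector
d, k = 1, 2  # look for the third occurrence (k=2) of the digit 1

hits = np.where(labels == d)  # a tuple holding a single index array
print(type(hits), len(hits))

idx = hits[0]                 # the index array itself: positions 3, 6 and 8
print(idx)

print(idx[k])                 # position of the k-th instance of digit d
```

The value idx[k] is what the tutorial code then uses as the first index into X_train.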
plt.figure(figsize=(20,20))
for d in range(0,10): # loop over the digits 0 through 9
    for k in range(0,10): # choose 10 example images for each digit
        plt.subplot(10,10,d*10+k+1) # select the current subplot
        plt.imshow(X_train[np.where(y_train==d)[0][k],:,:],cmap='gray') # plot the image
        plt.axis('off')
keras with the theano backend expects input to be tensors of the form samples $\times$ channels $\times$ height $\times$ width ('channels_first') or samples $\times$ height $\times$ width $\times$ channels ('channels_last').
MNIST images are one channel (grayscale), but we don't see that explicitly represented in the shape of X_train or X_test. Thus, we need to add a dimension to the X_train and X_test tensors to have the proper shape.
We can do this with the reshape command. We choose the 'channels_last' option and tack on the channel as the fourth dimension.
X_train = X_train.reshape(X_train.shape[0], 28, 28, 1)
X_test = X_test.reshape(X_test.shape[0], 28, 28, 1)
Now when we check the shape, we find the expected form samples $\times$ height $\times$ width $\times$ channels.
print('The dimensions of X_train are:')
print(X_train.shape)
print('The dimensions of X_test are:')
print(X_test.shape)
The dimensions of X_train are: (60000, 28, 28, 1) The dimensions of X_test are: (10000, 28, 28, 1)
We note that there is a default assumption of either 'channels_last' or 'channels_first' for each deep learning framework such as theano or tensorflow. To avoid potential misinterpretation, we will explicitly specify data_format='channels_last' in our keras code below.
This is the first example of the care with which we need to consider the shape/dimensionality of our data. This example is specific to keras, but the general principles here are similar for other deep learning frameworks, e.g., tensorflow, caffe, pytorch.
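To make the two layouts concrete, here is a short sketch that converts a 'channels_last' batch into the 'channels_first' arrangement with np.moveaxis. The all-zero batch of five images is an assumption standing in for real data; the tutorial itself keeps 'channels_last' throughout.

```python
import numpy as np

# A stand-in batch: samples x height x width x channels ('channels_last')
batch_last = np.zeros((5, 28, 28, 1), dtype=np.float32)

# Move the last axis (channels) to position 1:
# samples x channels x height x width ('channels_first')
batch_first = np.moveaxis(batch_last, -1, 1)

print(batch_last.shape)
print(batch_first.shape)
```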
Above, you worked with the original X_train and X_test arrays as loaded by keras. Now we have expanded the dimensions of those arrays. Does your visualization code from above still work?
plt.figure(figsize=(20,20))
for k in range(0,10):
    plt.subplot(1,10,k+1)
    plt.imshow(X_train[k],cmap='gray')
    plt.axis('off')
plt.show()
Depending on your library versions, you may have found that your visualization code from above no longer works. If you get an error, it is likely similar to
TypeError: Invalid shape (28,28,1) for image data
when you try to visualize one of the MNIST images. This error is due to the very fact that we explicitly expanded the dimensions to make keras happy.
Modify your code to work with the newly shaped X_train and X_test arrays. The np.squeeze function for numpy arrays will likely be of use here: it removes single-dimensional entries from the shape of an array. Note--you do not want to actually modify the shape of X_train or X_test here. Your goal is to modify the visualization code to deal with the singleton dimensions. Even if you were able to run the code above, it is a good exercise to learn the usage of np.squeeze as other functions may grumble about singleton dimensions in the future.
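The following sketch shows the behavior of np.squeeze on an array shaped like a single reshaped MNIST image. The all-zero image is an assumption used in place of X_train[k].

```python
import numpy as np

image = np.zeros((28, 28, 1), dtype=np.uint8)  # stand-in for X_train[k]

squeezed = np.squeeze(image)             # drops every axis of length 1
print(squeezed.shape)

only_last = np.squeeze(image, axis=-1)   # drops only the named axis
print(only_last.shape)

print(image.shape)                       # the original array keeps its shape
```

Specifying axis=-1 is a slightly more defensive choice, since it raises an error if that axis does not have length 1 rather than silently squeezing some other axis.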
plt.figure(figsize=(20,20))
for k in range(0,10):
    plt.subplot(1,10,k+1)
    plt.imshow(X_train[k].squeeze(),cmap='gray')
    plt.axis('off')
plt.show()
We noted earlier that X_train and X_test are of variable type uint8. It is considered best practice to normalize the range of your input data, commonly to $[0,1]$. Back in the world of classical machine learning, this avoids a specific feature dominating the classification simply because it is larger. In deep learning, continuing this convention allows for more consistency and robustness in the computation of the various gradients during training. The risk of overflow (exceeding the capabilities of a variable type to represent a very large number) or underflow (exceeding the capabilities of a variable type to represent a very small, i.e., close to zero, number) is very real in deep learning.
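Overflow is easy to demonstrate with the uint8 type itself. In this minimal sketch (the two small arrays are made up for illustration), adding uint8 arrays wraps around modulo 256, while casting to float32 first gives the expected sum.

```python
import numpy as np

a = np.array([200, 100], dtype=np.uint8)
b = np.array([100, 100], dtype=np.uint8)

# 200 + 100 = 300 does not fit in uint8 (maximum 255), so it wraps to 300 - 256 = 44
wrapped = a + b
print(wrapped, wrapped.dtype)

# Casting to a floating point type before the arithmetic avoids the wraparound
safe = a.astype(np.float32) + b.astype(np.float32)
print(safe, safe.dtype)
```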
Before we normalize the input data intensity, we should double check that the variables are within the range we expect. Verify that X_train and X_test are within the expected range of [0,255] for a uint8 variable.
print('The range of X_train is ['+str(X_train.min())+','+str(X_train.max())+']')
print('The range of X_test is ['+str(X_test.min())+','+str(X_test.max())+']')
The range of X_train is [0,255] The range of X_test is [0,255]
Here, we cast the numpy arrays as float32 and divide by the maximum we expect for a uint8 variable (255).
X_train = X_train.astype('float32')
X_test = X_test.astype('float32')
X_train /= 255
X_test /= 255
Check the range of the normalized X_train and X_test arrays to verify that they are now in the range [0,1].
print('The range of X_train is ['+str(X_train.min())+','+str(X_train.max())+']')
print('The range of X_test is ['+str(X_test.min())+','+str(X_test.max())+']')
The range of X_train is [0.0,1.0] The range of X_test is [0.0,1.0]
There is a suite of common data preprocessing methods. Most of these involve some form of statistical normalization. For example, we might scale our data to have a mean of 0 and a standard deviation of 1. Or we might whiten the data to remove correlations between its features. Here we have considered a simple range normalization, but note that other standard preprocessing routines exist. See https://keras.io/preprocessing/image/ for some examples of other preprocessing methods and syntax for the built-in functions in keras to perform those methods.
In this case we cast as float32 since that is already overkill for uint8 variables and it will take up less memory than casting those arrays as float64. We note, however, that if your data is natively float64, you probably want to leave it as such.
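As a sketch of the two points above, the following code standardizes a made-up batch of pixel data to zero mean and unit standard deviation, and compares the memory used by uint8, float32, and float64 copies. The random pixels array is an assumption standing in for X_train; this is plain numpy, not one of the built-in keras preprocessing functions.

```python
import numpy as np

rng = np.random.default_rng(0)
pixels = rng.integers(0, 256, size=(100, 28, 28), dtype=np.uint8)  # stand-in data

# Statistical normalization: subtract the mean and divide by the standard deviation
x32 = pixels.astype(np.float32)
standardized = (x32 - x32.mean()) / x32.std()
print(standardized.mean(), standardized.std())  # approximately 0 and 1

# Memory footprint in bytes: 1, 4 and 8 bytes per pixel respectively
x64 = pixels.astype(np.float64)
print(pixels.nbytes, x32.nbytes, x64.nbytes)
```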
print('The dimensions of y_train are:')
print(y_train.shape)
print('The dimensions of y_test are:')
print(y_test.shape)
The dimensions of y_train are: (60000,) The dimensions of y_test are: (10000,)
We have already looked at entries in y_train and y_test and noted that they are integers that (at least in this case) directly correspond to the digit that the image represents. More on this in a bit...
However... keras (and many other common classification and deep learning frameworks) expects labels of shape $N_\text{samples}\times N_\text{classes}$. We see that we are okay in terms of $N_\text{samples}$ (60,000 for training and 10,000 for test), but we have an empty second dimension. We somehow need to reconfigure the label vectors so that they will be $60,000\times10$ for y_train and $10,000\times10$ for y_test. How do we get to a $60,000\times10$ array for the labels?
What we really need is a representation of the label vectors that better matches the typical output of a neural network. The output layer of a neural network classifier will have $N_\text{classes}$ nodes. In a typical application, the last layer is a softmax layer which outputs probabilities of a sample belonging to each of the classes $C_j,~j=0,\ldots,N_\text{classes}-1$. Thus, the softmax layer for an MNIST digit classification will have form $$[p(C_0),p(C_1),p(C_2),p(C_3),p(C_4),p(C_5),p(C_6),p(C_7),p(C_8),p(C_9)]^T.$$ A simple argmax predicts the label as the class with the highest probability, i.e., $\hat{y}=\text{argmax}_j p(C_j)$. This means that if the network is absolutely 100% certain that a sample is a digit '3', all coefficients in the softmax layer will be zero except the coefficient corresponding to the digit '3', i.e., $$[0,0,0,1,0,0,0,0,0,0]^T$$ with $\hat{y}=3$.
This gives us insight into how to "encode" the input label vector. We want a value of 1 for the given class and zeros everywhere else; this is also known as one-hot coding.
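To connect the softmax output and the argmax prediction, here is a minimal numpy sketch. The softmax helper and the ten raw scores are assumptions written for illustration; they are not taken from a trained network.

```python
import numpy as np

def softmax(z):
    """Convert a vector of raw scores into probabilities that sum to 1."""
    z = z - np.max(z)   # shift for numerical stability; the result is unchanged
    e = np.exp(z)
    return e / e.sum()

# Made-up raw scores for the ten digit classes, with class 3 scoring highest
scores = np.array([0.1, 0.2, 0.3, 4.0, 0.1, 0.0, 0.5, 0.2, 0.1, 0.3])

p = softmax(scores)
print(p.round(3))
print(p.sum())         # the probabilities sum to 1
print(np.argmax(p))    # predicted label: the class with the highest probability
```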
If we print the first ten labels in y_train, we see that the labels are reported as the digit itself.
print('The first ten entries of y_train are:')
print(y_train[:10])
The first ten entries of y_train are: [5 0 4 1 9 2 1 3 1 4]
In this case there is a very direct and obvious relationship between the label and the meaning. If y_train==3, the data is an image of the numeral three.
It is important to keep in mind, however, that these labels are a very abstract concept--when we see the ASCII character '3' printed in the first ten entries of y_train above, we interpret that to mean 'an image of the numeral three.' We could just as easily have labeled the images of the numeral three with the label 'hamster' and nothing about the following code would change. The performance we will see below on the ability to correctly classify all images of the numeral 3 would be identical. The only difference is that the network would very cheerfully report that an image of the numeral three is a 'hamster' instead of a '3'. And it would be correct because we would have told it that images of the numeral three are 'hamsters.'
This highlights the importance of leveraging humans to provide labels for the training data. It is the humans that are providing the abstract interpretation of what those images represent. Computers, however, only understand numbers. So we need to find some means to translate our abstract notion of the classes of the input data to something numerical for the computer to interpret.
As a more concrete example of this abstractness of the labels, consider the Fashion-MNIST dataset (see also https://keras.io/datasets/). This dataset was designed to be a drop-in replacement for MNIST. The dimensionality is exactly the same (60,000 28$\times$28 pixel training images and 10,000 28$\times$28 pixel testing images), but the images are grayscale images of clothing articles. Thus in the Fashion-MNIST dataset, if the ground truth label is specified as '3', instead of interpreting that as 'an image of the numeral three,' you interpret that as 'an image of a dress.'
A convenient and common translation is to these one-hot coded vectors. Different frameworks and different networks may have different conventions.
We will use the keras function np_utils.to_categorical to convert the label vector to a one-hot vector. We specify y_train or y_test as input and denote the one-hot label vector with a capital Y to remind ourselves that this is now actually a matrix of probabilities and thus a very different representation than the original label vector.
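For intuition about what the conversion produces, a one-hot matrix can also be built in plain numpy by selecting rows of an identity matrix. This is an illustrative sketch on a three-entry toy label vector, not the implementation that keras uses.

```python
import numpy as np

y = np.array([5, 0, 4])   # toy integer labels
n_classes = 10

# Row j of the identity matrix is the one-hot vector for class j,
# so indexing the identity matrix with the labels gives one row per sample
Y = np.eye(n_classes, dtype=np.float32)[y]

print(Y.shape)
print(Y)
print(np.argmax(Y, axis=1))   # recovers the original integer labels
```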
Y_train = np_utils.to_categorical(y_train, 10)
Y_test = np_utils.to_categorical(y_test, 10)
Let's check the dimensionality of these new one-hot labels Y_train and Y_test.
print('The dimensions of Y_train are:')
print(Y_train.shape)
print('The dimensions of Y_test are:')
print(Y_test.shape)
The dimensions of Y_train are: (60000, 10) The dimensions of Y_test are: (10000, 10)
We note that the first dimension of Y_train and Y_test corresponds to the sample and the second dimension consists of 10 entries. Let's look at the one-hot label for the first 10 samples in Y_train and compare to the first 10 samples in the original label vector y_train.
print('The first ten entries of y_train (original label vector) are:')
print(y_train[0:10])
print('The first ten entries of Y_train (one-hot label vector) are:')
print(Y_train[0:10,:])
The first ten entries of y_train (original label vector) are:
[5 0 4 1 9 2 1 3 1 4]
The first ten entries of Y_train (one-hot label vector) are:
[[0. 0. 0. 0. 0. 1. 0. 0. 0. 0.]
 [1. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 1. 0. 0. 0. 0. 0.]
 [0. 1. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 1.]
 [0. 0. 1. 0. 0. 0. 0. 0. 0. 0.]
 [0. 1. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 1. 0. 0. 0. 0. 0. 0.]
 [0. 1. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 1. 0. 0. 0. 0. 0.]]
Verify to yourself that the correct entries in the one-hot label vector are hot.
We use the first entry as an example. The first label is '5', so we expect that index 5 of the first row in Y_train should be 1 and all others zero. Recalling that python indexes beginning at 0 (which is also very convenient for the MNIST dataset that begins at 0), we find that index 5 of the first row is indeed the only value of 1.
We can double check ourselves by applying an argmax to the one-hot labels. We expect to get back the original labels.
print('The first ten entries of y_train (original label vector) are:')
print(y_train[0:10])
print('The argmax of the first ten entries of Y_train (one-hot label vector) are:')
print(np.argmax(Y_train[0:10,:],axis=1))
The first ten entries of y_train (original label vector) are:
[5 0 4 1 9 2 1 3 1 4]
The argmax of the first ten entries of Y_train (one-hot label vector) are:
[5 0 4 1 9 2 1 3 1 4]
We find that we do get back the original labels.
Now we've loaded and preprocessed the input data (samples X_train and labels Y_train) needed to train a deep learning network. We need to decide the specific architecture of the network itself. We begin here with a simple 2-layer network. This network will result in approximately 95% accuracy on the training data after several epochs, but can take a few minutes per epoch to run on a CPU. As such, we will set this up to run and then cycle back around to understand the details as it is training.
Within each epoch of training, the entire training dataset is visited once. Thus, an epoch can be thought of in the general sense of an iteration. Deep learning uses the distinct terminology of epoch to specifically mean the one visit of the entire training set. Within each epoch, you have batches of the input data. The decomposition of an epoch into multiple batches is particularly important for very large datasets that cannot fit into working memory.
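The relationship between epochs and batches comes down to simple arithmetic. The sketch below uses the 60,000 MNIST training images and the batch size of 64 that is passed to the fit call later in this tutorial.

```python
import math

n_samples = 60000   # number of MNIST training images
batch_size = 64     # batch size used in the fit call later in this tutorial

# One epoch visits every sample once, so it takes this many batches
batches_per_epoch = math.ceil(n_samples / batch_size)

# The final batch holds whatever is left over after the full batches
last_batch_size = n_samples - (batches_per_epoch - 1) * batch_size

print(batches_per_epoch)
print(last_batch_size)
```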
In Section 0.3 we directly imported only those functions we use from keras to make our code more compact.
Next we define our first model, which we call model1. We'll cycle back to understand the components of this model after we set it training, but also include some descriptions here:
- The Sequential model is the base class used in keras to define a sequential stack of layers.
- The Convolution2D layer defines a typical convolution layer. This layer takes a tensor of images as input and outputs a tensor of images. Note that you need to define the input shape for the first convolutional layer, $28\times28\times1$ in this case. The input shape for all subsequent layers is automatically inferred to be the same as the output shape for the previous layer. We explicitly specify 'channels_last' for the data_format since that is how we defined our input data. Note that for CNNs we generally only count the convolutional layers when reporting on the depth of the network.
- The MaxPooling2D layer reduces the spatial dimensions of the input tensor. We again explicitly specify 'channels_last' for the data_format.
- The Flatten layer essentially reshapes the dimensions of the data. In this case it takes the $12\times12\times32$ tensor output from the max pooling layer and reshapes it into a length $12*12*32=4608$ vector. (With the default 'valid' padding, each $3\times3$ convolution trims the spatial size, from 28 to 26 to 24, and the $2\times2$ pooling then halves it to 12.)
- The Dense layer is the layer type for fully connected (i.e., dense) layers. This type of layer defines a connection from all nodes in the previous layer to all nodes in the subsequent layer. The first layer defined here takes the 4608 inputs and outputs 128 values. The second fully connected layer takes those 128 as input and outputs 10 values.
- The final layer uses a 'softmax' activation. This means that the output will be the probability of an example belonging to each of the 10 classes.
model1 = Sequential()
model1.add(Convolution2D(32, (3, 3), activation='relu', input_shape=(28,28,1), data_format='channels_last'))
model1.add(Convolution2D(32, (3, 3), activation='relu', data_format='channels_last'))
model1.add(MaxPooling2D(pool_size=(2,2),data_format='channels_last'))
model1.add(Flatten())
model1.add(Dense(128, activation='relu'))
model1.add(Dense(10, activation='softmax'))
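The layer sizes in this model can be checked by hand. The sketch below traces the spatial size through the layers and counts the trainable parameters, assuming the keras defaults of 'valid' padding and stride 1 for the convolutions and a pooling stride equal to the pool size. The helper functions conv_valid and pool are hypothetical names written for this illustration.

```python
def conv_valid(size, kernel):
    """Output size of a stride-1 convolution with 'valid' (no) padding."""
    return size - kernel + 1

def pool(size, window):
    """Output size of non-overlapping pooling with the given window."""
    return size // window

size = 28
size = conv_valid(size, 3)   # first 3x3 convolution: 28 -> 26
size = conv_valid(size, 3)   # second 3x3 convolution: 26 -> 24
size = pool(size, 2)         # 2x2 max pooling: 24 -> 12
flat = size * size * 32      # Flatten: 12 * 12 * 32 values
print(size, flat)

# Trainable parameters: weights plus one bias per output channel or node
params = {
    'conv1': 3 * 3 * 1 * 32 + 32,
    'conv2': 3 * 3 * 32 * 32 + 32,
    'dense1': flat * 128 + 128,
    'dense2': 128 * 10 + 10,
}
total = sum(params.values())
print(params)
print(total)
```

If keras is available, calling model1.summary() is a convenient way to compare these hand-computed numbers against what the library reports.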
Next we need to compile the model before we can train it. This requires specification of several parameters including the loss, the optimizer and the metrics. Again, we will cycle back to understand these after we set it training. We specify a minimum of options here, including:
- A loss of 'categorical_crossentropy', which is a common loss function for multi-class classification problems. This loss expects labels in one-hot coded format.
- The 'adam' optimizer, which is a good optimizer in the absence of any other preferred optimizer. The 'adam' optimizer adjusts the learning rate throughout the training process to help convergence.
- The metrics, which are the "human-interpretable" measurements of network performance. Here we request the accuracy, which will be the fraction of correctly classified digits.
model1.compile(loss='categorical_crossentropy', optimizer='adam',metrics=['accuracy'])
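To give a feel for the loss and the metric named above, here is a minimal numpy sketch of categorical cross-entropy and accuracy for one-hot labels. The function names, the clipping constant eps, and the two made-up predictions are assumptions for illustration; this is not the keras implementation.

```python
import numpy as np

def categorical_crossentropy(y_true, y_pred, eps=1e-7):
    """Per-sample cross-entropy between one-hot labels and predicted probabilities."""
    y_pred = np.clip(y_pred, eps, 1.0)   # avoid taking log(0)
    return -np.sum(y_true * np.log(y_pred), axis=-1)

def accuracy(y_true, y_pred):
    """Fraction of samples whose most probable class matches the label."""
    return np.mean(np.argmax(y_true, axis=-1) == np.argmax(y_pred, axis=-1))

# One-hot label for the digit 3
y_true = np.array([[0, 0, 0, 1, 0, 0, 0, 0, 0, 0]], dtype=np.float32)

# A confident, correct prediction and a completely unsure one
confident = np.full((1, 10), 0.01)
confident[0, 3] = 0.91
unsure = np.full((1, 10), 0.1)

loss_confident = categorical_crossentropy(y_true, confident)[0]
loss_unsure = categorical_crossentropy(y_true, unsure)[0]
print(loss_confident, loss_unsure)   # the confident prediction has the lower loss
print(accuracy(y_true, confident))
```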
Now we finally start the actual training of this model. We input the X_train and Y_train variables that we worked with above, and specify a few simulation parameters such as batch_size and epochs which we will cycle back to in a while. We specify verbose=1 in order to print out status so we can keep track of where we are in the training process.
You may get one or more warnings, but as long as you don't get any errors, you should see something of the form
Epoch 1/10
38208/60000 [==================>..........] - ETA: 1:12 - loss: 0.1849 - acc: 0.9448
We have specified a total of 1 epoch, so the ETA specified at the beginning of the current epoch is approximately the total time the training is expected to take. How long the training takes is very dependent on the hardware and how well that hardware is configured to perform the sort of computation required for CNN training. On my desktop machine (AMD Ryzen 7 2700X 4.3 GHz processor), the training took about 3.5 minutes on all 8 CPUs. While this might be longer than the average person is accustomed to waiting for a computer to finish processing, this is actually a very reasonable time to train a complete deep network. This is because the MNIST dataset is not too large, nor is the network we specified. Ordinarily, we would (need to) run the training for more than one epoch. For MNIST, however, we can converge to a very good accuracy within one epoch.
model1.fit(X_train, Y_train, batch_size=64, epochs=1, verbose=1)
Epoch 1/1 60000/60000 [==============================] - 145s 2ms/step - loss: 0.1403 - accuracy: 0.9602
<keras.callbacks.callbacks.History at 0x16b6e9170c8>
There are some other issues related to testing the trained network that we will return to in Section 4. For the remainder of this section, we focus on deepening our understanding of this model that we have trained on MNIST.
Error messages from deep learning models are not always illuminating. Here we explore some common errors in model definition, compilation, and training. Below, we have copied the definition, compilation, and training stages from above and named this new model model2. You can copy and paste this code into subsequent code cells and modify different aspects of the three stages to explore the effects and/or error messages encountered.
model2 = Sequential()
model2.add(Convolution2D(32, (3, 3), activation='relu', input_shape=(28,28,1), data_format='channels_last'))
model2.add(Convolution2D(32, (3, 3), activation='relu', data_format='channels_last'))
model2.add(MaxPooling2D(pool_size=(2,2),data_format='channels_last'))
model2.add(Flatten())
model2.add(Dense(128, activation='relu'))
model2.add(Dense(10, activation='softmax'))
model2.compile(loss='categorical_crossentropy', optimizer='adam',metrics=['accuracy'])
model2.fit(X_train, Y_train, batch_size=64, epochs=1, verbose=1)
Epoch 1/1 60000/60000 [==============================] - 103s 2ms/step - loss: 0.1444 - accuracy: 0.9560
<keras.callbacks.callbacks.History at 0x16b6f514748>
The following mistakes should each raise an error. That error may or may not be particularly elucidating in helping you find the source of the problem if you weren't aware of it in advance.
What happens if you specify data_format='channels_first'? How useful is the error in this case?
model2 = Sequential()
model2.add(Convolution2D(32, (3, 3), activation='relu', input_shape=(28,28,1), data_format='channels_first'))
model2.add(Convolution2D(32, (3, 3), activation='relu', data_format='channels_first'))
model2.add(MaxPooling2D(pool_size=(2,2),data_format='channels_first'))
model2.add(Flatten())
model2.add(Dense(128, activation='relu'))
model2.add(Dense(10, activation='softmax'))
model2.compile(loss='categorical_crossentropy', optimizer='adam',metrics=['accuracy'])
model2.fit(X_train, Y_train, batch_size=64, epochs=1, verbose=1)
--------------------------------------------------------------------------- AttributeError Traceback (most recent call last) <ipython-input-28-4b1bdaf93886> in <module> 1 model2 = Sequential() ----> 2 model2.add(Convolution2D(32, (3, 3), activation='relu', input_shape=(28,28,1), data_format='channels_first')) 3 model2.add(Convolution2D(32, (3, 3), activation='relu', data_format='channels_first')) 4 model2.add(MaxPooling2D(pool_size=(2,2),data_format='channels_first')) 5 model2.add(Flatten()) ~\anaconda3\envs\aiworkshop1\lib\site-packages\keras\engine\sequential.py in add(self, layer) 164 # and create the node connecting the current layer 165 # to the input layer we just created. --> 166 layer(x) 167 set_inputs = True 168 else: ~\anaconda3\envs\aiworkshop1\lib\site-packages\keras\backend\tensorflow_backend.py in symbolic_fn_wrapper(*args, **kwargs) 73 if _SYMBOLIC_SCOPE.value: 74 with get_graph().as_default(): ---> 75 return func(*args, **kwargs) 76 else: 77 return func(*args, **kwargs) ~\anaconda3\envs\aiworkshop1\lib\site-packages\keras\engine\base_layer.py in __call__(self, inputs, **kwargs) 487 # Actually call the layer, 488 # collecting output(s), mask(s), and shape(s). 
--> 489 output = self.call(inputs, **kwargs) 490 output_mask = self.compute_mask(inputs, previous_mask) 491 ~\anaconda3\envs\aiworkshop1\lib\site-packages\keras\layers\convolutional.py in call(self, inputs) 169 padding=self.padding, 170 data_format=self.data_format, --> 171 dilation_rate=self.dilation_rate) 172 if self.rank == 3: 173 outputs = K.conv3d( ~\anaconda3\envs\aiworkshop1\lib\site-packages\keras\backend\tensorflow_backend.py in conv2d(x, kernel, strides, padding, data_format, dilation_rate) 3699 data_format = normalize_data_format(data_format) 3700 -> 3701 x, tf_data_format = _preprocess_conv2d_input(x, data_format) 3702 3703 padding = _preprocess_padding(padding) ~\anaconda3\envs\aiworkshop1\lib\site-packages\keras\backend\tensorflow_backend.py in _preprocess_conv2d_input(x, data_format, force_transpose) 3572 tf_data_format = 'NHWC' 3573 if data_format == 'channels_first': -> 3574 if not _has_nchw_support() or force_transpose: 3575 x = tf.transpose(x, (0, 2, 3, 1)) # NCHW -> NHWC 3576 else: ~\anaconda3\envs\aiworkshop1\lib\site-packages\keras\backend\tensorflow_backend.py in _has_nchw_support() 520 """ 521 explicitly_on_cpu = _is_current_explicit_device('cpu') --> 522 gpus_available = len(_get_available_gpus()) > 0 523 return (not explicitly_on_cpu and gpus_available) 524 ~\anaconda3\envs\aiworkshop1\lib\site-packages\keras\backend\tensorflow_backend.py in _get_available_gpus() 504 _LOCAL_DEVICES = [x.name for x in devices] 505 else: --> 506 _LOCAL_DEVICES = tf.config.experimental_list_devices() 507 return [x for x in _LOCAL_DEVICES if 'device:gpu' in x.lower()] 508 AttributeError: module 'tensorflow_core._api.v2.config' has no attribute 'experimental_list_devices'
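Independently of the version-related error above, it can help to see what the two data formats mean for the array layout. The following is a minimal sketch using only NumPy; the array `X_last` is a hypothetical stand-in for a small batch of MNIST images and is not part of the tutorial. With 'channels_first', an input shape of (28,28,1) would be read as 28 channels, 28 rows, and 1 column, so the matching input shape would be (1,28,28).

```python
import numpy as np

# Hypothetical stand-in for a small batch of MNIST images in
# channels_last layout: (samples, rows, cols, channels)
X_last = np.zeros((4, 28, 28, 1), dtype='float32')

# Move the channel axis to position 1 to obtain the
# channels_first layout: (samples, channels, rows, cols)
X_first = np.moveaxis(X_last, -1, 1)

print('channels_last shape: ', X_last.shape)
print('channels_first shape:', X_first.shape)
```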
What happens if you forget the flatten layer? How useful is the error in this case?
model2 = Sequential()
model2.add(Convolution2D(32, (3, 3), activation='relu', input_shape=(28,28,1), data_format='channels_last'))
model2.add(Convolution2D(32, (3, 3), activation='relu', data_format='channels_last'))
model2.add(MaxPooling2D(pool_size=(2,2),data_format='channels_last'))
#model2.add(Flatten())
model2.add(Dense(128, activation='relu'))
model2.add(Dense(10, activation='softmax'))
model2.compile(loss='categorical_crossentropy', optimizer='adam',metrics=['accuracy'])
model2.fit(X_train, Y_train, batch_size=64, epochs=1, verbose=1)
--------------------------------------------------------------------------- ValueError Traceback (most recent call last) <ipython-input-29-b42bce916ce9> in <module> 9 model2.compile(loss='categorical_crossentropy', optimizer='adam',metrics=['accuracy']) 10 ---> 11 model2.fit(X_train, Y_train, batch_size=64, epochs=1, verbose=1) ~\anaconda3\envs\aiworkshop1\lib\site-packages\keras\engine\training.py in fit(self, x, y, batch_size, epochs, verbose, callbacks, validation_split, validation_data, shuffle, class_weight, sample_weight, initial_epoch, steps_per_epoch, validation_steps, validation_freq, max_queue_size, workers, use_multiprocessing, **kwargs) 1152 sample_weight=sample_weight, 1153 class_weight=class_weight, -> 1154 batch_size=batch_size) 1155 1156 # Prepare validation data. ~\anaconda3\envs\aiworkshop1\lib\site-packages\keras\engine\training.py in _standardize_user_data(self, x, y, sample_weight, class_weight, check_array_lengths, batch_size) 619 feed_output_shapes, 620 check_batch_axis=False, # Don't enforce the batch size. --> 621 exception_prefix='target') 622 623 # Generate sample-wise weight values given the `sample_weight` and ~\anaconda3\envs\aiworkshop1\lib\site-packages\keras\engine\training_utils.py in standardize_input_data(data, names, shapes, check_batch_axis, exception_prefix) 133 ': expected ' + names[i] + ' to have ' + 134 str(len(shape)) + ' dimensions, but got array ' --> 135 'with shape ' + str(data_shape)) 136 if not check_batch_axis: 137 data_shape = data_shape[1:] ValueError: Error when checking target: expected dense_6 to have 4 dimensions, but got array with shape (60000, 10)
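The "4 dimensions" in this message can be understood from how a fully connected layer treats a multi-dimensional input: it acts on the last axis only and leaves the other axes in place. The sketch below imitates this with a plain matrix product in NumPy. The array `pooled`, the weight matrices, and the omission of biases and activations are simplifying assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical output of the pooling layer for a batch of 2 images:
# (samples, rows, cols, channels) = (2, 12, 12, 32)
pooled = rng.standard_normal((2, 12, 12, 32))

# Without Flatten: a 10-unit dense layer multiplies along the last axis only,
# so the output keeps the two spatial axes and has 4 dimensions
W_no_flatten = rng.standard_normal((32, 10))
out_no_flatten = pooled @ W_no_flatten
print('without Flatten:', out_no_flatten.shape)

# With Flatten: each image becomes one vector of length 12*12*32 = 4608,
# and the dense layer output has the 2 dimensions that the labels have
flat = pooled.reshape(pooled.shape[0], -1)
W_flatten = rng.standard_normal((flat.shape[1], 10))
out_flatten = flat @ W_flatten
print('with Flatten:   ', out_flatten.shape)
```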
What happens if you specify an output layer that is not length 10? How useful is the error in this case?
model2 = Sequential()
model2.add(Convolution2D(32, (3, 3), activation='relu', input_shape=(28,28,1), data_format='channels_last'))
model2.add(Convolution2D(32, (3, 3), activation='relu', data_format='channels_last'))
model2.add(MaxPooling2D(pool_size=(2,2),data_format='channels_last'))
model2.add(Flatten())
model2.add(Dense(128, activation='relu'))
model2.add(Dense(128, activation='softmax'))
model2.compile(loss='categorical_crossentropy', optimizer='adam',metrics=['accuracy'])
model2.fit(X_train, Y_train, batch_size=64, epochs=1, verbose=1)
--------------------------------------------------------------------------- ValueError Traceback (most recent call last) <ipython-input-30-2e1393361a52> in <module> 9 model2.compile(loss='categorical_crossentropy', optimizer='adam',metrics=['accuracy']) 10 ---> 11 model2.fit(X_train, Y_train, batch_size=64, epochs=1, verbose=1) ~\anaconda3\envs\aiworkshop1\lib\site-packages\keras\engine\training.py in fit(self, x, y, batch_size, epochs, verbose, callbacks, validation_split, validation_data, shuffle, class_weight, sample_weight, initial_epoch, steps_per_epoch, validation_steps, validation_freq, max_queue_size, workers, use_multiprocessing, **kwargs) 1152 sample_weight=sample_weight, 1153 class_weight=class_weight, -> 1154 batch_size=batch_size) 1155 1156 # Prepare validation data. ~\anaconda3\envs\aiworkshop1\lib\site-packages\keras\engine\training.py in _standardize_user_data(self, x, y, sample_weight, class_weight, check_array_lengths, batch_size) 619 feed_output_shapes, 620 check_batch_axis=False, # Don't enforce the batch size. --> 621 exception_prefix='target') 622 623 # Generate sample-wise weight values given the `sample_weight` and ~\anaconda3\envs\aiworkshop1\lib\site-packages\keras\engine\training_utils.py in standardize_input_data(data, names, shapes, check_batch_axis, exception_prefix) 143 ': expected ' + names[i] + ' to have shape ' + 144 str(shape) + ' but got array with shape ' + --> 145 str(data_shape)) 146 return data 147 ValueError: Error when checking target: expected dense_8 to have shape (128,) but got array with shape (10,)
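Both of the last two errors come down to a mismatch between the shape of the network output and the shape of the label array. A small check before training can make such a mismatch obvious. The helper `check_target_shape` below is hypothetical (it is not a keras function) and simply compares everything after the sample axis.

```python
import numpy as np

def check_target_shape(output_shape, Y):
    """Return True if the label array Y matches a model output shape
    such as (None, 10), ignoring the leading sample axis."""
    return tuple(output_shape[1:]) == tuple(Y.shape[1:])

# Hypothetical one-hot labels for 5 samples and 10 classes
Y_demo = np.zeros((5, 10))

print(check_target_shape((None, 10), Y_demo))          # correct output layer
print(check_target_shape((None, 128), Y_demo))         # output layer of length 128
print(check_target_shape((None, 12, 12, 10), Y_demo))  # missing Flatten layer
```

With a keras model, the first argument could be taken from the `output_shape` attribute of the model, for example `check_target_shape(model2.output_shape, Y_train)`.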
Sometimes, errors in your specification of the model will not result in an explicit coding error, which may cause further issues in debugging. Here are two examples that we will explore further after we study more about testing models in Section 4.
What happens if you use a 'tanh'
activation instead of a 'softmax'
activation on the output layer?
model3 = Sequential()
model3.add(Convolution2D(32, (3, 3), activation='relu', input_shape=(28,28,1), data_format='channels_last'))
model3.add(Convolution2D(32, (3, 3), activation='relu', data_format='channels_last'))
model3.add(MaxPooling2D(pool_size=(2,2),data_format='channels_last'))
model3.add(Flatten())
model3.add(Dense(128, activation='relu'))
model3.add(Dense(10, activation='tanh'))
model3.compile(loss='categorical_crossentropy', optimizer='adam',metrics=['accuracy'])
model3.fit(X_train, Y_train, batch_size=64, epochs=1, verbose=1)
Epoch 1/1 60000/60000 [==============================] - 110s 2ms/step - loss: 8.0835 - accuracy: 0.1009
<keras.callbacks.callbacks.History at 0x16b71d1de88>
What happens if we specify a 'binary_crossentropy'
loss function instead of 'categorical_crossentropy'
?
model4 = Sequential()
model4.add(Convolution2D(32, (3, 3), activation='relu', input_shape=(28,28,1), data_format='channels_last'))
model4.add(Convolution2D(32, (3, 3), activation='relu', data_format='channels_last'))
model4.add(MaxPooling2D(pool_size=(2,2),data_format='channels_last'))
model4.add(Flatten())
model4.add(Dense(128, activation='relu'))
model4.add(Dense(10, activation='softmax'))
model4.compile(loss='binary_crossentropy', optimizer='adam',metrics=['accuracy'])
model4.fit(X_train, Y_train, batch_size=64, epochs=1, verbose=1)
Epoch 1/1 60000/60000 [==============================] - 108s 2ms/step - loss: 0.0251 - accuracy: 0.9915
<keras.callbacks.callbacks.History at 0x16b71e73dc8>
The accuracies that you see reported as the network trains are the accuracies on the training data. This can be a good indication of the convergence of the network since you expect that the loss should decrease and accuracy should increase as training progresses.
There is concern, however, that the network has "learned the data" instead of learning a more general classifier. That is why we set aside a separate test set. All of the data in the test set were unseen in training and thus are brand new to the network. If the network is good and has not overfit the training data (learned the training data), we expect to see a good accuracy on the test data. We expect that the test accuracy will likely be a bit lower than the training accuracy.
We can take the trained model and evaluate it on a dataset using the evaluate
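method of the trained model. As a sanity check, if we were to input the training data again, we would expect an accuracy close to the last accuracy reported in training (not necessarily identical, since the accuracy reported during training is averaged over batches while the weights are still being updated).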
We again use the verbose=1
option here to track the progress of evaluating the model on all 10,000 test images.
score = model1.evaluate(X_test, Y_test, verbose=1)
print(score)
10000/10000 [==============================] - 7s 682us/step [0.04841461051488295, 0.9836999773979187]
Note that the test stage is very quick. The major computational overhead in deep learning is in training. The operational use of the trained model is very computationally light. On my desktop computer, using the CPU, all 10,000 test images were labeled and compared to their ground truth in about 7 s, or roughly 680 $\mu$s per image. The evaluation reported two values after completion. We can check what those values are by looking at the metrics_names
attribute of the model.
print(model1.metrics_names)
['loss', 'accuracy']
We note that these metrics are the loss and accuracy. The loss is reported by default since that is the metric used by the network during training. We requested that the network keep track of the additional metric of accuracy with the option metrics=['accuracy']
when compiling the model.
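The loss value can also be reproduced by hand. The sketch below assumes the usual definition of categorical cross-entropy, namely the mean over samples of the negative log of the probability assigned to the true class. The function name, the clipping constant `eps`, and the small demo arrays are illustrative assumptions and not part of the tutorial.

```python
import numpy as np

def categorical_crossentropy(Y_true, Y_pred, eps=1e-7):
    """Mean categorical cross-entropy for one-hot labels Y_true and
    predicted probabilities Y_pred, both of shape (samples, classes)."""
    Y_pred = np.clip(Y_pred, eps, 1.0)  # avoid log(0)
    return -np.mean(np.sum(Y_true * np.log(Y_pred), axis=-1))

# Hypothetical labels and predictions for 2 samples and 3 classes
Y_true_demo = np.array([[0, 0, 1],
                        [1, 0, 0]], dtype=float)
Y_pred_demo = np.array([[0.1, 0.2, 0.7],
                        [0.5, 0.3, 0.2]])

loss_demo = categorical_crossentropy(Y_true_demo, Y_pred_demo)
print(loss_demo)
```

Applied to the one-hot test labels and the network's predicted probabilities for the test set, this should come out close to the first value returned by evaluate, up to small numerical differences.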
We might want more information than just a summary of the accuracy. If we output the predicted label for each of the test images, we can look more carefully at the performance of the network. We use the predict
method of the model. This has to run all 10,000 test images through the trained network and determine the class for each image.
Y_predict = model1.predict(X_test,verbose=1)
10000/10000 [==============================] - 7s 681us/step
When we computed the one-hot coded label vector used to train the network, we began with the assumption that a one-hot form is consistent with the native output of the network. We would thus expect that Y_predict
is in a one-hot format. We check this by printing the dimensions of Y_predict
.
print(Y_predict.shape)
(10000, 10)
Y_predict
does have the dimensions we would expect for a one-hot coded label vector. Similar to our process when we developed the one-hot coded vector Y_test
, we can look at the first 10 entries of Y_predict
.
print(Y_predict[0:10,:])
[[4.05177758e-08 7.43065387e-09 2.37806557e-06 2.58250338e-06 1.32016120e-10 6.64513600e-09 6.50171193e-12 9.99993682e-01 6.94475716e-07 5.78243203e-07] [1.32357775e-06 8.71970027e-04 9.99126256e-01 5.49216530e-08 6.64163308e-11 3.73077036e-09 1.32753513e-07 2.21830199e-09 3.57962051e-07 1.14670252e-10] [1.79842009e-05 9.98999059e-01 5.65366936e-05 1.32168325e-05 6.44273678e-05 1.28859028e-05 1.01682672e-05 7.75526569e-04 4.35361144e-05 6.71234602e-06] [9.99958873e-01 3.54211949e-08 9.94088714e-06 8.73126549e-09 3.90109278e-07 3.95757098e-07 2.31599006e-05 6.50946845e-07 5.35008394e-06 1.46342552e-06] [5.83880501e-06 6.28110342e-07 5.05188427e-06 3.39747743e-07 9.99858856e-01 8.84693065e-08 2.88095930e-06 1.05900563e-05 5.37548749e-06 1.10328197e-04] [7.17639432e-06 9.98586416e-01 9.56198073e-06 1.58115131e-06 4.67288082e-05 1.08867607e-06 1.44783314e-06 1.32461940e-03 1.69964169e-05 4.56649468e-06] [1.40812929e-07 8.77150342e-06 2.12768646e-05 1.91419222e-05 3.51276517e-01 7.10015942e-04 1.34696995e-06 9.68216409e-05 4.54874605e-01 1.92991406e-01] [4.10326662e-09 3.42933930e-07 9.80323193e-07 1.17285090e-05 2.24219402e-04 1.33342022e-04 3.50651312e-08 8.88734803e-07 6.77975011e-04 9.98950481e-01] [1.26563573e-05 2.65104583e-09 5.87654597e-07 4.32099547e-08 3.93154920e-07 9.84957039e-01 8.30869284e-03 3.97608204e-08 6.70878543e-03 1.16976953e-05] [1.21137347e-07 1.99952876e-09 1.36449376e-06 1.15867590e-06 6.39622682e-04 2.09732275e-06 6.38211084e-09 5.66185452e-04 4.46928432e-03 9.94320154e-01]]
At first glance, this form looks very different from what we saw for Y_train
. Remember, however, that with Y_train
we knew exactly what the actual labels were. Here, with Y_predict
, the network is computing probabilities of the image belonging to each of the 10 classes. If you pay careful attention to the exponents of the coefficients in each row of Y_predict
, you will note that in most rows one coefficient is very close to 1 and the remainder are very close to zero.
In fact, most of these coefficients round to 1 and 0 if rounded to two decimal places:
print(np.round(Y_predict[0:10,:],2))
[[0. 0. 0. 0. 0. 0. 0. 1. 0. 0. ] [0. 0. 1. 0. 0. 0. 0. 0. 0. 0. ] [0. 1. 0. 0. 0. 0. 0. 0. 0. 0. ] [1. 0. 0. 0. 0. 0. 0. 0. 0. 0. ] [0. 0. 0. 0. 1. 0. 0. 0. 0. 0. ] [0. 1. 0. 0. 0. 0. 0. 0. 0. 0. ] [0. 0. 0. 0. 0.35 0. 0. 0. 0.45 0.19] [0. 0. 0. 0. 0. 0. 0. 0. 0. 1. ] [0. 0. 0. 0. 0. 0.98 0.01 0. 0.01 0. ] [0. 0. 0. 0. 0. 0. 0. 0. 0. 0.99]]
We can apply the argmax function to the one-hot label vector Y_predict
to determine the class label for each sample. Since this output will have a similar form to the original label vectors, we denote it as y_predict
.
y_predict = Y_predict.argmax(axis=-1)
If we print these numerical labels, we see that they correspond to the one-hot interpretation above.
print(y_predict[0:10])
[7 2 1 0 4 1 8 9 5 9]
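The argmax discards how confident the network was. In the rounded output above, one row splits its probability between three classes (0.35, 0.45, and 0.19). A minimal sketch for flagging such uncertain predictions follows; the array `Y_demo` reuses three of the rounded rows shown above, and the threshold of 0.9 is an arbitrary, hypothetical choice.

```python
import numpy as np

# Three rows taken from the rounded predictions printed above
Y_demo = np.zeros((3, 10))
Y_demo[0, 7] = 1.0
Y_demo[1, [4, 8, 9]] = [0.35, 0.45, 0.19]
Y_demo[2, [5, 6, 8]] = [0.98, 0.01, 0.01]

labels_demo = Y_demo.argmax(axis=-1)   # predicted class per sample
confidence = Y_demo.max(axis=-1)       # probability of the predicted class

threshold = 0.9                        # hypothetical cutoff
uncertain = np.where(confidence < threshold)[0]

print('Predicted labels:', labels_demo)
print('Confidence:', confidence)
print('Samples below threshold:', uncertain)
```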
The determination of accuracy requires a comparison of the predicted labels to the ground truth labels. That is what is done "under the hood" when keras
reports accuracy using the evaluate
method of the model. As a sanity check, we can compute the accuracy "by hand" using y_predict
and y_test
.
my_acc = (y_predict==y_test).sum()/len(y_predict)
print('My accuracy computation says:')
print(my_acc)
My accuracy computation says: 0.9837
We see that this value exactly matches that reported by keras
above.
We can also use both y_predict
and y_test
to gain a bit more insight into the performance of the network.
As a very simple verification, we can print the first 10 labels of both y_predict
and y_test
and compare by eye.
print('Actual labels are:')
print(y_test[0:10])
print('Predicted labels are:')
print(y_predict[0:10])
Actual labels are: [7 2 1 0 4 1 4 9 5 9] Predicted labels are: [7 2 1 0 4 1 8 9 5 9]
Looking more closely at those images that the network incorrectly classified can give us some insight into the robustness of the network. If the incorrectly classified images are difficult images, we may have more confidence in the network than if it is incorrectly classifying obvious images (more fun examples of that tomorrow!).
We can find which images were incorrectly classified by the network by looking for those images where the predicted and ground truth labels do not match.
incorrect_labels = np.where(y_predict!=y_test)[0]
print('There are '+str(len(incorrect_labels))+' incorrectly classified images')
There are 163 incorrectly classified images
The code below visualizes the first 9 of these incorrectly classified images and titles each plot with both the correct and predicted label.
plt.figure(figsize=(15,15))
for k in range(0,9): # choose 9 examples
    plt.subplot(3,3,k+1) # select the current subplot
    plt.imshow(np.squeeze(X_test[incorrect_labels[k],:,:]),cmap='gray') # plot the image
    plt.title('Actual:'+str(y_test[incorrect_labels[k]])+' Predicted:'+str(y_predict[incorrect_labels[k]]))
    plt.axis('off')
In many of these cases, the digits do not appear "typical" in form and it is thus not surprising that the network may have had difficulty correctly classifying them. In most cases, it is also easy to postulate which structures in the image may have led to the incorrect classification.
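Beyond looking at individual images, a confusion matrix summarizes which digits are mistaken for which. The sketch below builds one with NumPy only; the function name is an assumption for illustration, and the demo labels are the first 10 actual and predicted labels printed earlier.

```python
import numpy as np

def confusion_matrix(y_true, y_pred, num_classes=10):
    """Entry [i, j] counts samples with actual label i and predicted label j."""
    cm = np.zeros((num_classes, num_classes), dtype=int)
    np.add.at(cm, (y_true, y_pred), 1)  # accumulate counts, including repeats
    return cm

# First 10 actual and predicted labels from the output above
y_true_demo = np.array([7, 2, 1, 0, 4, 1, 4, 9, 5, 9])
y_pred_demo = np.array([7, 2, 1, 0, 4, 1, 8, 9, 5, 9])

cm_demo = confusion_matrix(y_true_demo, y_pred_demo)
print(cm_demo)
print('Correct predictions:', np.trace(cm_demo))
```

The diagonal holds the correctly classified samples, and each off-diagonal entry is one kind of mistake (here, a single "4" predicted as "8"). The same function could be applied to the full y_test and y_predict arrays.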
Explore the performance of model3
in which we used a 'tanh'
activation on the output layer and model4
in which we used a 'binary_crossentropy'
loss.
score = model3.evaluate(X_test, Y_test, verbose=1)
print(score)
Y_predict = model3.predict(X_test,verbose=1)
y_predict = Y_predict.argmax(axis=-1)
my_acc = (y_predict==y_test).sum()/len(y_predict)
print('My accuracy computation says:')
print(my_acc)
print(Y_predict)
10000/10000 [==============================] - 8s 773us/step [8.057436024475098, 0.10090000182390213] 10000/10000 [==============================] - 3s 323us/step My accuracy computation says: 0.1009 [[-0.973072 -0.8842376 0.96776295 ... 0.98423636 0.9954213 0.9915314 ] [-0.9976873 -0.8748546 0.99810606 ... 0.99707055 0.9997076 0.99870926] [-0.90909916 -0.8807202 0.8966731 ... 0.86087024 0.9748953 0.9491585 ] ... [-0.98996896 -0.96023846 0.9889787 ... 0.99146676 0.99941707 0.9986351 ] [-0.989946 -0.9145103 0.9892308 ... 0.9919738 0.99909675 0.99780166] [-0.99878985 -0.96146595 0.99856424 ... 0.9987184 0.9999072 0.9996895 ]]
score = model4.evaluate(X_test, Y_test, verbose=1)
print(score)
Y_predict = model4.predict(X_test,verbose=1)
y_predict = Y_predict.argmax(axis=-1)
my_acc = (y_predict==y_test).sum()/len(y_predict)
print('My accuracy computation says:')
print(my_acc)
print(Y_predict)
10000/10000 [==============================] - 7s 678us/step [0.009117328152665869, 0.9967696070671082] 10000/10000 [==============================] - 3s 329us/step My accuracy computation says: 0.9831 [[5.9873059e-07 3.5044065e-10 1.0003435e-07 ... 9.9999797e-01 1.1135760e-07 5.5747341e-07] [2.1488156e-06 8.6308628e-06 9.9998629e-01 ... 3.1622252e-10 7.8438552e-07 1.0652504e-10] [7.1815870e-05 9.9565083e-01 1.2233267e-03 ... 4.1579080e-04 2.2510721e-03 1.3366683e-05] ... [2.7786931e-09 2.8147473e-09 1.9756570e-09 ... 2.9908915e-06 3.6714137e-05 4.7742269e-05] [8.2209077e-08 7.3970114e-09 9.7479913e-10 ... 9.9465547e-09 1.2571395e-04 2.8228129e-08] [1.6427188e-07 8.0722729e-10 3.2076801e-05 ... 7.6732789e-09 1.1046466e-06 7.8044499e-10]]
The issues with model3
were probably apparent in the training stage, in that the accuracy reported was very low. You may also have noticed that the predicted one-hot labels Y_predict
are not consistent with the probabilities that we get for a softmax
activation. In some cases, you may get values in Y_predict
that are not-a-number (nan
) which is another indication that the training has gone very wrong.
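To see why a 'tanh' output is a poor match for a cross-entropy loss, it helps to compare the two activations on the same inputs. The sketch below uses NumPy only; the `softmax` helper and the example vector `z` are illustrative assumptions. A softmax output is non-negative and sums to 1, so it can be read as class probabilities, whereas a tanh output lies between -1 and 1 and has no such interpretation.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax along the last axis."""
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Hypothetical pre-activation values of an output layer for one sample
z = np.array([[2.0, -1.0, 0.5, 3.0]])

p_softmax = softmax(z)
p_tanh = np.tanh(z)

print('softmax:', p_softmax, 'sum =', p_softmax.sum())
print('tanh:   ', p_tanh, 'sum =', p_tanh.sum())
```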
The issues with model4
are much more subtle. You should have noticed that the accuracy reported by keras
' evaluate
function is not the same as when we compute it by hand. From the documentation (https://keras.io/api/losses/probabilistic_losses/), we note that 'binary_crossentropy'
should be used "when there are only two label classes" whereas 'categorical_crossentropy'
should be used "when there are two or more label classes."
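A likely explanation for the mismatch, assuming that keras interprets the generic 'accuracy' metric as a binary accuracy when the loss is 'binary_crossentropy', is that each of the 10 outputs is thresholded and scored separately. A misclassified image then still gets most of its 10 entries "right", which inflates the reported value. The sketch below illustrates the two computations in NumPy with hypothetical labels and predictions.

```python
import numpy as np

# Hypothetical one-hot labels and (hard) predictions for 4 samples;
# the third sample is misclassified (actual 4, predicted 9)
Y_true_demo = np.eye(10)[[3, 1, 4, 1]]
Y_pred_demo = np.eye(10)[[3, 1, 9, 1]]

# Categorical accuracy: one comparison per sample (argmax against argmax)
categorical_acc = np.mean(Y_pred_demo.argmax(axis=-1) == Y_true_demo.argmax(axis=-1))

# Binary accuracy: one comparison per output entry after thresholding at 0.5
binary_acc = np.mean((Y_pred_demo > 0.5) == (Y_true_demo > 0.5))

print('categorical accuracy:', categorical_acc)
print('binary accuracy:     ', binary_acc)
```

Here only 2 of the 40 entries differ, so the binary accuracy is much higher than the fraction of correctly classified samples.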
Here we look at what happens when we input data to a network that is completely different from what it has seen before. To make our lives easier, we will use the Fashion-MNIST dataset, which is designed as a drop-in replacement for the MNIST dataset. This way, we don't need to worry about as many details in the data preprocessing and can focus on the behavior of the network on completely different data.
We import and preprocess the Fashion-MNIST dataset in exactly the same way we did the MNIST data.
(X_train_f, y_train_f), (X_test_f, y_test_f) = fashion_mnist.load_data()
X_train_f = X_train_f.reshape(X_train_f.shape[0], 28, 28, 1)
X_train_f = X_train_f.astype('float32')
X_train_f /= 255
Y_train_f = np_utils.to_categorical(y_train_f, 10)
X_test_f = X_test_f.reshape(X_test_f.shape[0], 28, 28, 1)
X_test_f = X_test_f.astype('float32')
X_test_f /= 255
Y_test_f = np_utils.to_categorical(y_test_f, 10)
Downloading data from http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/train-labels-idx1-ubyte.gz 32768/29515 [=================================] - 0s 5us/step Downloading data from http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/train-images-idx3-ubyte.gz 26427392/26421880 [==============================] - 21s 1us/step Downloading data from http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/t10k-labels-idx1-ubyte.gz 8192/5148 [===============================================] - 0s 0us/step Downloading data from http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/t10k-images-idx3-ubyte.gz 4423680/4422102 [==============================] - 5s 1us/step
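Since the same preprocessing steps are applied to every dataset and split, they could be collected in one function. The sketch below is one way to do that using NumPy only; the function name `preprocess` is hypothetical, the one-hot coding via `np.eye` stands in for np_utils.to_categorical, and the random demo images and labels are placeholders rather than real data.

```python
import numpy as np

def preprocess(X, y, num_classes=10):
    """Reshape 28x28 images to (samples, 28, 28, 1), scale to [0, 1],
    and one-hot code the integer labels."""
    X = X.reshape(X.shape[0], 28, 28, 1).astype('float32') / 255
    Y = np.eye(num_classes, dtype='float32')[y]
    return X, Y

# Placeholder data with the same dtype and layout as the raw datasets
rng = np.random.default_rng(0)
X_raw = rng.integers(0, 256, size=(5, 28, 28), dtype=np.uint8)
y_raw = np.array([9, 2, 1, 1, 6])

X_demo, Y_demo = preprocess(X_raw, y_raw)
print(X_demo.shape, X_demo.dtype, X_demo.min(), X_demo.max())
print(Y_demo.shape)
```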
Let's check the performance of the MNIST network on this new dataset.
score = model1.evaluate(X_test_f, Y_test_f, verbose=1)
print(score)
10000/10000 [==============================] - 7s 672us/step [4.987198582458496, 0.1193000003695488]
As a point of reference, since there are 10 classes in the Fashion-MNIST dataset, you would expect a random guess to yield approximately 10% accuracy. We find about 12% accuracy (this may differ depending on exactly where your model converged to in training and may differ from run to run given the random initialization and randomization in assigning data to batches). Why is the performance so bad?
Let's look at one of the images from the Fashion-MNIST dataset.
plt.figure()
plt.imshow(np.squeeze(X_test_f[0]),cmap='gray')
plt.show()
print('This image is class '+str(y_test_f[0])+' in the Fashion-MNIST dataset')
This image is class 9 in the Fashion-MNIST dataset
We see that this is an image of an "ankle boot," which corresponds to class 9 in the Fashion-MNIST dataset (see https://keras.io/datasets/ for the full list of class labels and descriptions).
Let's see what class our digit MNIST network classifies this image as.
Y_example = model1.predict(X_test_f[0].reshape(1,28,28,1),verbose=1)
y_example = np.argmax(Y_example)
print(y_example)
1/1 [==============================] - 0s 14ms/step 1
This network has decided that this image of an "ankle boot" is the digit "1" in the run shown here (the network usually converges to "2", but may have converged to a different value depending on differences in training). It has never seen an ankle boot. But it will still do its level best to match that ankle boot to the closest thing it knows. In this case, that is apparently a "1".
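To see whether the network favors a few digits for all Fashion-MNIST images, one can count how often each digit is predicted. The sketch below shows the counting step with NumPy; the array `y_pred_demo` is a hypothetical set of predicted digits, not actual output of the network.

```python
import numpy as np

# Hypothetical predicted digits for 10 Fashion-MNIST images
y_pred_demo = np.array([1, 2, 2, 8, 2, 1, 2, 3, 2, 2])

# Count how many times each of the 10 digit classes was predicted
counts = np.bincount(y_pred_demo, minlength=10)

print('Predictions per digit:', counts)
print('Most frequently predicted digit:', counts.argmax())
```

With the trained network, the predicted digits could be obtained in the same way as before, for example with `model1.predict(X_test_f).argmax(axis=-1)`.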
In transfer learning, we can "transfer" knowledge learned in one domain (e.g., MNIST) to another domain (e.g., Fashion-MNIST). The idea of transfer learning is predicated on the assumption that all images share the same basic primitives (edges, corners, etc.) which are essentially the features of images that we hand-designed in the second tutorial. In transfer learning, we re-use those image primitives and only have to relearn how to combine those primitives together in order to correctly classify a new domain of images. To do this, we will copy our MNIST model1
architecture and "freeze" all layers except the last layer by setting the trainable
attribute of layers to False
. All the parameters from the two convolutional layers and the first fully connected layer will remain in the state that we converged to when training the network on MNIST. It is only that final fully connected layer that will change in order to (hopefully) learn to correctly classify the Fashion-MNIST data.
model1_f = keras.models.clone_model(model1)
model1_f.set_weights(model1.get_weights())
for layer in model1_f.layers[:-1]:
    layer.trainable=False
model1_f.compile(loss='categorical_crossentropy', optimizer='adam',metrics=['accuracy'])
model1_f.fit(X_train_f, Y_train_f, batch_size=64, epochs=1, verbose=1)
Epoch 1/1 60000/60000 [==============================] - 23s 389us/step - loss: 1.2105 - accuracy: 0.5974
<keras.callbacks.callbacks.History at 0x16b70101d88>
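The main advantages of transfer learning are related to computational efficiency and small datasets: with most layers frozen, far fewer parameters are updated in each training step, so training is faster and can succeed with fewer labeled examples.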
How well does our new transfer learning model model1_f
perform on the Fashion-MNIST data?
score = model1_f.evaluate(X_test_f, Y_test_f, verbose=1)
print(score)
10000/10000 [==============================] - 8s 760us/step [0.7792962156295776, 0.729200005531311]
You have probably found that the network does not perform as well on the Fashion-MNIST dataset as it did on MNIST. If you trained the full 2-layer network from scratch (as we did for MNIST), you would achieve approximately 89% test accuracy.
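Part of this gap can be understood by counting how many parameters were actually free to change. The sketch below counts the parameters per layer in plain Python, assuming that model1 has the same layers as the model2 examples earlier (two 3x3 convolutional layers with 32 filters, 2x2 max pooling, a flatten layer, and dense layers of 128 and 10 units). The helper names are hypothetical.

```python
def conv2d_params(kernel, in_channels, filters):
    """Weights plus biases of a 2D convolutional layer."""
    return kernel[0] * kernel[1] * in_channels * filters + filters

def dense_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out

# 28x28 input -> 26x26 -> 24x24 after two 3x3 convolutions -> 12x12 after pooling
layer_params = [
    ('conv1',   conv2d_params((3, 3), 1, 32)),
    ('conv2',   conv2d_params((3, 3), 32, 32)),
    ('pool',    0),
    ('flatten', 0),
    ('dense1',  dense_params(12 * 12 * 32, 128)),
    ('dense2',  dense_params(128, 10)),
]

total = sum(n for _, n in layer_params)
trainable_last = sum(n for _, n in layer_params[-1:])      # only last layer unfrozen
trainable_last_two = sum(n for _, n in layer_params[-2:])  # last two layers unfrozen

print('Total parameters:', total)
print('Trainable with all but the last layer frozen:', trainable_last)
print('Trainable with all but the last two layers frozen:', trainable_last_two)
```

Under these assumptions, only a small fraction of the parameters is retrained when everything but the final layer is frozen, which limits how far the network can adapt to the new images.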
How well does our new transfer learning model model1_f
perform on the original MNIST data?
score = model1_f.evaluate(X_test, Y_test, verbose=1)
print(score)
10000/10000 [==============================] - 7s 738us/step [4.2286063625335695, 0.3264999985694885]
How does the transfer learning work if you freeze fewer layers?
model1_f = keras.models.clone_model(model1)
model1_f.set_weights(model1.get_weights())
for layer in model1_f.layers[:-2]:
    layer.trainable=False
model1_f.compile(loss='categorical_crossentropy', optimizer='adam',metrics=['accuracy'])
model1_f.fit(X_train_f, Y_train_f, batch_size=64, epochs=1, verbose=1)
score = model1_f.evaluate(X_test_f, Y_test_f, verbose=1)
print(score)
Epoch 1/1 60000/60000 [==============================] - 33s 546us/step - loss: 0.4156 - accuracy: 0.8545 10000/10000 [==============================] - 3s 338us/step [0.32236459945440293, 0.8853999972343445]
We can save a trained model so that we don't have to go through the bother of training it again later. The following instructions save the model in a binary file in HDF5 (Hierarchical Data Format). The use of these commands assumes that you have h5py
(the python interface to HDF5 format) installed. For at least the Linux version of Anaconda 3.7, it appears that h5py
was included. If it does not appear that you have h5py
installed, you can run the following command from your computer's terminal
conda install h5py
The successful installation of h5py
, however, requires the HDF5 libraries to be installed on your computer.
model1.save('model1.h5')
model1_f.save('model1_f.h5')
This will save binary files named model1.h5 and model1_f.h5
to the same directory as this notebook. You can load these files and pick up right where we left off.
model1 = load_model('model1.h5')
model1_f = load_model('model1_f.h5')