{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Tutorial 3: Deep Learning for Images\n",
    "## Laura E. Boucheron, Electrical & Computer Engineering, NMSU\n",
    "### May 2021\n",
    "Copyright (C) 2021  Laura E. Boucheron\n",
    "\n",
    "This information is free; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 3 of the License, or (at your option) any later version.\n",
    "\n",
    "This work is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for more details.\n",
    "\n",
    "You should have received a copy of the GNU General Public License along with this work; if not, If not, see <https://www.gnu.org/licenses/>.\n",
    "\n",
    "## Overview\n",
    "In this tutorial, we will introduce the basic structure and common components (convolutional layers, pooling layers, nonlinearities, fully connected layers, etc.) of deep learning networks through a combination of illustrations and hands-on implementation of a network.  By the end of this tutorial, we will have built from scratch a deep convolutional neural network to operate on the standard MNIST handwritten digits dataset.  We will then explore some ways of probing the characteristics of the trained network to help us debug common pitfalls in adapting network architectures.  \n",
    "\n",
    "This tutorial contains 7 sections:\n",
    "  - **Section 0: Preliminaries**: some notes on using this notebook, how to download the image dataset that we will use for this tutorial, and import commands for the libraries necessary for this tutorial\n",
    "  - **Section 1: The MNIST Dataset** how to load the MNIST dataset and some ideas for visualization of the dataset\n",
    "  - **Section 2: Data Preprocessing (Dimensionality Wrangling)** how to preprocess the MNIST dataset in preparation for using it to train a deep network, including dimensionality and intensity scaling of the images and coding of the labels\n",
    "  - **Section 3: Building a CNN for MNIST** how to build a basic 2-layer CNN for classification of MNIST digits including definition of the network architecture, compilation, and training\n",
    "  - Detour to a powerpoint presentation to learn more about the different CNN layers which are currently training\n",
    "  - **Section 4: Testing the Trained CNN** how to test the accuracy of the trained network and locate those images incorrectly classified\n",
    "  - **Section 5: Transfer Learning for MNIST** how to adapt a previously trained network to a new dataset\n",
    "  - **Section 6: Saving a Trained Model** how to save a trained model so that you can load it and use it later.\n",
    "  \n",
    "There are a few subsections with the heading \"**Your turn**\" throughout this tutorial in which you will be asked to apply what you have learned.  \n",
    "\n",
    "Portions of this tutorial have been taken or adapted from https://elitedatascience.com/keras-tutorial-deep-learning-in-python and the documentation at https://keras.io."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Section 0: Preliminaries \n",
    "## A Note on Jupyter Notebooks\n",
    "\n",
    "There are two main types of cells in this notebook: code and markdown (text).  You can add a new cell with the plus sign in the menu bar above and you can change the type of cell with the dropdown menu in the menu bar above.  As you complete this tutorial, you may wish to add additional code cells to try out your own code and markdown cells to add your own comments or notes. \n",
    "\n",
    "Markdown cells can be augmented with a number of text formatting features, including\n",
    "  - bulleted\n",
    "  - lists\n",
    "\n",
    "embedded $\\LaTeX$, monotype specification of `code syntax`, **bold font**, and *italic font*.  There are many other features of markdown cells--see the jupyter documentation for more information.\n",
    "\n",
    "You can edit a cell by double clicking on it.  If you double click on this cell, you can see how to implement the various formatting referenced above.  Code cells can be run and markdown cells can be formatted using Shift+Enter or by selecting the Run button in the toolbar above.\n",
    "\n",
    "Once you have completed (all or part) of this notebook, you can share your results with colleagues by sending them the `.ipynb` file.  Your colleagues can then open the file and will see your markdown and code cells as well as any results that were printed or displayed at the time you saved the notebook.  If you prefer to send a notebook without results displayed (like this notebook appeared when you downloaded it), you can select (\"Restart & Clear Output\") from the Kernel menu above.  You can also export this notebook in a non-executable form, e.g., `.pdf` through the File, Save As menu."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Section 0.3a Import Necessary Libraries (For users using a local machine)\n",
    "Here, at the top of the code, we import all the libraries necessary for this tutorial.  We will introduce the functionality of any new libraries throughout the tutorial, but include all import statements here as standard coding practice.  We include a brief comment after each library here to indicate its main purpose within this tutorial.\n",
    "\n",
    "It would be best to run this next cell before the workshop starts to make sure you have all the necessary packages installed on your machine.\n",
    "\n",
    "A few other notes:\n",
    " - After the first import of keras packages, you may get a printout in a pink box that states\n",
    "```\n",
    "Using Theano backend\n",
    "```\n",
    "or\n",
    "```\n",
    "Using TensorFlow backend\n",
    "```\n",
    " - You may get one or more warnings complaining about various configs.  As long as you don't get any errors, you should be good to go.  You can, if you wish, fix whatever is causing a warning at a later point in time.  I find it best to copy and paste the error warning itself into a Google search and tack on the OS in which you encountered the error.  Seldom have I encountered an error that someone else hasn't encountered in my same OS.\n",
    " - The last two lines in the following code cell import the MNIST and Fashion-MNIST datasets.  Those datasets can be downloaded directly from online, but the `keras` library also includes a tool to do just that.  After you have downloaded the dataset for the first time, `keras` will load the dataset from its local location.  "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np # mathematical and scientific functions\n",
    "import matplotlib.pyplot as plt # visualization\n",
    "\n",
    "# format matplotlib options\n",
    "%matplotlib inline\n",
    "plt.rcParams.update({'font.size': 12})\n",
    "\n",
    "import keras.backend # information on the backend that keras is using\n",
    "from keras.utils import np_utils # functions to wrangle label vectors\n",
    "from keras.models import Sequential # the basic deep learning model\n",
    "from keras.layers import Dense, Flatten, Convolution2D, MaxPooling2D # important CNN layers\n",
    "from keras.models import load_model # to load a pre-saved model (may require hdf libraries installed)\n",
    "\n",
    "from keras.datasets import mnist # the MNIST dataset\n",
    "from keras.datasets import fashion_mnist # the Fashion-MNIST dataset\n",
    "\n",
    "# The following lines are included for the first time you access the mnist and fashion-mist datasets as they \n",
    "# download the dataset from online the first time.  Subsequent times, this command will load the dataset from the \n",
    "# local installation location.  If you have already downloaded the mnist and fashion-mnist datasets, you can \n",
    "# comment out these lines as we will load the dataset later.  \n",
    "(X_train, y_train), (X_test, y_test) = mnist.load_data()\n",
    "(X_train_f, y_train_f), (X_test_f, y_test_f) = fashion_mnist.load_data()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Section 0.3b Build the Conda Environment (For users using the ARS HPC Ceres with JupyterLab)\n",
    "Please follow instructions at https://kerriegeil.github.io/NMSU-USDA-ARS-AI-Workshops/setup/#on-the-ceres-hpc\n",
    "\n",
    "You will want to do this BEFORE the workshop starts."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Section 1: The MNIST Dataset"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 1.1 Importing the MNIST dataset\n",
    "The line in the code cell above that reads `from keras.datasets import mnist` has loaded the `keras` package that interfaces with the local copy of MNIST dataset."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Printing out the current backend\n",
    "Before we get going, let's check which backend keras is using. All subsequent instructions should be valid for either `tensorflow` or `theano`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(keras.backend.backend())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### A note on other standard datasets included in keras\n",
    "As a note, there are other datasets available as part of `keras.datasets`, see https://keras.io/datasets/ for more information."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 1.2 Load training and test data\n",
    "Now we can use the `mnist.load_data` function to read in the standard training data and test data.  The first time you run the following command you will see a printout of the download progress.  Subsequent times you run the command, you will not see any printout as the data will be loaded from where `keras` stored it locally on your computer.  The `mnist.load_data` function outputs `numpy` arrays."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "(X_train, y_train), (X_test, y_test) = mnist.load_data()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "help(mnist.load_data)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### A note on the variable name conventions\n",
    "In loading the MNIST data, we are storing the data (images) in `X_train` and `X_test` and the corresponding labels in `y_train` and `y_test`.  It is common convention to label the input data with a capital 'X' and the labels with a lowercase 'y'.  Since these data are images which can be represented as arrays, the convention of using 'X' and 'y' comes from matrix notation where vectors are assigned a lowercase variable and matrices an uppercase variable.  "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 1.3 Checking dimensionality of the MNIST data variables\n",
    "We know that the MNIST dataset consists of 70,000 examples of $28\\times28$ pixels images of handwritten digits from 0-9.  We also know that there are 60,000 images reserved for training and 10,000 reserved for testing.  As such, we expect that the dimensionality of `X_train` and `X_test` to reflect this.  We print the shape of the two variables. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print('The dimensions of X_train are:')\n",
    "print(X_train.shape)\n",
    "print('The dimensions of X_test are:')\n",
    "print(X_test.shape)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We also check the variable types of `X_train` and `X_test`.  Since the `mnist.load_data` function outputs `numpy` arrays, we need to use the `dtype` method to query the variable type."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print('The variable type of X_train is:')\n",
    "print(X_train.dtype)\n",
    "print('The variable type of X_test is:')\n",
    "print(X_test.dtype)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If we were to use the type command, we would only get information that the variable is a `numpy` array."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "type(X_train)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can also use the command %whos to get information about the variables"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%whos"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### A note on tensors\n",
    "In the literature and documentation related to deep learning, you will see the word \"tensor\" quite often.  We have just encountered our first tensors.  Think of tensors as multidimensional arrays.  `X_train` took the 60,000 $28\\times28$ 2D pixel arrays, each of which represents an image, and stacked them to create a 3D array (tensor).  Before we're done here, we'll add a fourth dimension to `X_train` and `X_test`."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 1.4 Visualizing an MNIST image\n",
    "From these dimensions, it appears that the first dimension indexes the sample (image) and the second and third dimensions index the spatial dimensions of the image.  It also appears that the images are `uint8`. We can check this assumption by visualizing one of the samples of `X_train`.  In this case we look at the first image in `X_train`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "plt.figure()\n",
    "plt.imshow(X_train[0],cmap='gray')\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We see that this first image is of the digit 5.  We can look at the balance of the dataset in terms of digit representation."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print('In the training dataset:')\n",
    "for d in range(0,10):\n",
    "    print('There are '+str((y_train==d).sum())+\\\n",
    "          ' images of digit '+str(d))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print('In the test dataset:')\n",
    "for d in range(0,10):\n",
    "    print('There are '+str((y_test==d).sum())+\\\n",
    "          ' images of digit '+str(d))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## <span style='color:Green'> Your turn: </span>\n",
    "Look at some other images in `X_train` or `X_test`.  Does there appear to be any order in which the digits appear?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 1.5 MNIST label vectors\n",
    "The `y_train` variable contains the label, or the \"truth\" of what is represented in the image.  We can print out the label for the same image we visualized above (the first image in `X_train`)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(y_train[0])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This indicates that the image we plotted above corresponds to a ground truth label of '5'. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## <span style='color:Green'> Your turn: </span>\n",
    "Revise your code from above to title your plot with the ground truth label."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 1.5 A visualization of the digit variation in MNIST\n",
    "In addition to providing the labels for training a supervised classifier, these label vectors provide an important way to index into our dataset.  The following subsection illustrates one use of the label vector.\n",
    "\n",
    "We can get a brief sense of the sort of variation included in this dataset by plotting 10 examples for each of the digits.  The following code makes use of the `X_train` variable and also the corresponding labels in `y_train`.  \n",
    "\n",
    "In the following code, we loop over the 10 digits using variable `d` and over 10 examples using variable `k`.  We plot the first 10 examples for each digit.  Let's take a more careful look at the syntax `X_train[np.where(y_train==d)[0][k],:,:]`\n",
    "  - the `np.where(y_train==d)` finds those indexes where the ground truth indicates that we have a specific digit\n",
    "  - the `np.where` command returns a tuple; in this case there is only one dimension to the tuple, so we pull of the first dimension, so we have `np.where(y_train==d)[0]`\n",
    "  - we now pull off the `k`-th index, so we have `np.where(y_train==d)[0][k]`\n",
    "  - now we need to grab the image corresponding to the `k`-th instance of the digit `d`, and we have `X_train[np.where(y_train==d)[0][k],:,:]`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "plt.figure(figsize=(20,20))\n",
    "for d in range(0,10): # loop over the digits 0 through 9\n",
    "    for k in range(0,10): # choose 10 example images for each digit\n",
    "        plt.subplot(10,10,d*10+k+1) # select the current subplot\n",
    "        plt.imshow(X_train[np.where(y_train==d)[0][k],:,:],cmap='gray') # plot the image\n",
    "        plt.axis('off')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Section 2: Data Preprocessing (Dimensionality Wrangling)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 2.1 Input data dimensionality considerations\n",
    "`keras` expects input to be tensors of the form samples $\\times$ channels $\\times$ height $\\times$ width (`'channels_first'`) or samples $\\times$ height $\\times$ width $\\times$ channels (`'channels_last'`).  \n",
    "\n",
    "MNIST images are one channel (grayscale), but we don't see that explicitly represented in the shape of `X_train` or `X_test`.  Thus, we need to add a dimension to the `X_train` and `X_test` tensors to have the proper shape.  \n",
    "\n",
    "We can do this with the `reshape` command.  We choose the `'channels_last'` option and and tack on the channel as the fourth dimension.  "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "X_train = X_train.reshape(X_train.shape[0], 28, 28, 1)\n",
    "X_test = X_test.reshape(X_test.shape[0], 28, 28, 1)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now when we check the shape, we find the expected form samples $\\times$ height $\\times$ width $\\times$ channels."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print('The dimensions of X_train are:')\n",
    "print(X_train.shape)\n",
    "print('The dimensions of X_test are:')\n",
    "print(X_test.shape)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We note that there is a default assumption of either `'channels_last'` or `'channels_first'` for each deep learning framework such as `theano` or `tensorflow`.  To avoid potential misinterpretation, we will explicitly specify `data_format='channels_last'` in our `keras` code below.  "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### A note on the importance of dimensionality\n",
    "This is the first example of the care with which we need to consider the shape/dimensionality of our data.  This example is specific to `keras`, but the general principles here are similar for other deep learning frameworks, e.g., `tensorflow`, `caffe`, `pytorch`."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## <span style='color:Green'> Your turn: </span>\n",
    "Above, you worked with the original `X_train` and `X_test` arrays as loaded by `keras`.  Now we have expanded the dimensions of those arrays, what is the tensor dimensionality of the `k`-th data sample?  Does your visualization code from above still work?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Depending on your library versions, you may have found that your visualization code from above no longer works.  If you get an error, it is likely similar to\n",
    "```\n",
    "TypeError: Invalid shape (28,28,1) for image data\n",
    "```\n",
    "when you try to visualize one of the MNIST images.  This error is due to the very fact that we explicitly expanded the dimensions to make `keras` happy.  More recent versions of the `matplotlib` library appear to be able to handle such image shapes. It is worth, however, introducing yourself to the `.squeeze` method for `numpy` arrays.  "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## <span style='color:Green'> Your turn: </span>\n",
    "Modify your code to work with the newly shaped `X_train` and `X_test` arrays.  The `.squeeze` method for numpy arrays will likely be of use here: it removes single-dimensional entries from the shape of an array.  Note--you do not want to actually modify the shape of `X_train` or `X_test` here.  Your goal is to modify the visualization code to deal with the singleton dimensions.  Even if you were able to run the code above, it is a good exercise to learn the usage of `.squeeze` as other functions may grumble about singleton dimensions in the future."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 2.2 Input data intensity scaling considerations\n",
    "We noted earlier that `X_train` and `X_test` are of variable type `uint8`.  It is considered best practice to normalize the range of your input data, commonly to $[0,1]$.  Back in the world of classical machine learning, this avoids a specific feature dominating the classification simply because it is larger.  In deep learning, continuing this convention allows for more consistency and robustness in the computation of the various gradients during training.  The risk of overflow (exceeding the capabilities of a variable type to represent a very large number) or underflow (exceeding the capabilities of a variable type to represent a very small, i.e., close to zero, number) is very real in deep learning."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## <span style='color:Green'> Your turn: </span>\n",
    "Before we normalize the input data intensity, we should double check that the variables are within the range we expect.  Verify that `X_train` and `X_test` are within the expected range of [0,255] for a `uint8` variable."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Casting the data as float\n",
    "\n",
    "Here, we cast the `numpy` arrays as `float32` and divide by the maximum we expect for a `uint8` variable (255)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "X_train = X_train.astype('float32')/255\n",
    "X_test = X_test.astype('float32')/255"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### A note on using lower precision variable types\n",
    "We could cast these variables as `float64` or as whatever native `float` your computer defaults to.  Deep learning, however, is relatively insensitive to precision of the input data.  In fact, many deep learning methods will work very well even with fixed point variables.  Since many deep learning methods are very memory intensive, we can save a bit of memory by using 32-bit floating point variables.  In this case we cast as `float32` since that is already overkill for `uint8` variables and it will take up less memory than casting those arrays as `float64`.  We note, however, that if your data is natively `float64`, you probably want to leave it as such."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## <span style='color:Green'> Your turn: </span>\n",
    "Check the range of the normalized `X_train` and `X_test` arrays to verify that they are now in the range [0,1]."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### A note on overwriting the X_train and X_test variables\n",
    "We overwrite the `X_train` and `X_test` variables when we scale the intensity.  The tensors of image data can be a very large data variable memory-wise.  As such, it is best not to keep multiple copies of the data in working memory.  It is important to note, however, that you will need to be careful rerunning code cells.  If you were to re-run the normalization cell above again, your normalized data will no longer be in the range $[0,1]$ since you will have divided by 255 twice."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### A note on other common data preprocessing methods\n",
    "There are a suite of common data preprocessing methods.  Most of these involve some form of statistical normalization.  For example, we might scale our data to have a mean of 0 and a standard deviation of 1.  Or we might whiten the data to make it more normally distributed.  Here we have considered a simple range normalization, but note that other standard preprocessing routines exist.  See https://keras.io/preprocessing/image/ for some examples of other preprocessing methods and syntax for the built-in functions in `keras` to perform those methods."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 2.3 Label vector coding\n",
    "### 2.3.1 Dimensionality of the loaded label vectors\n",
    "Now we turn our attention to the label vectors.  We check the shape of the label vectors."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print('The dimensions of y_train are:')\n",
    "print(y_train.shape)\n",
    "print('The dimensions of y_test are:')\n",
    "print(y_test.shape)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We have already looked at entries in `y_train` and `y_test` and noted that they are integers that (at least in this case) directly correspond to the digit that the image represents.  More on this in a bit...\n",
    "\n",
    "However... `keras` (and many other common classification and deep learning frameworks) expects labels of shape $N_\\text{samples}\\times N_\\text{classes}$.  We see that we are okay in terms of $N_\\text{samples}$ (60,000 for training and 10,000 for test), but we have an empty second dimension. We somehow need to reconfigure the label vectors so that they will be $60,000\\times10$ for `y_train` and $10,000\\times10$ for `y_test`.  How do we get to a $60,000\\times10$ array for the labels?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 2.3.2 A brief introduction to one-hot coding\n",
    "What we really need is a representation of the label vectors that better matches the typical output of a neural network.  The output layer of a neural network classifier will have $N_\\text{classes}$ nodes.  In a typical application, the last layer is a `softmax` layer which outputs *probabilities of a sample belonging to each of the classes* $C_j,~j=0,\\ldots,N_\\text{classes}-1$.  Thus, the `softmax` layer for an MNIST digit classification will have form $$[p(C_0),p(C_1),p(C_2),p(C_3),p(C_4),p(C_5),p(C_6),p(C_7),p(C_8),p(C_9)]^T.$$  A simple argmax predicts the label as the class with the highest probability, i.e., $\\hat{y}=\\text{argmax}_j p(C_j)$.  This means that if the network is absolutely 100% certain that a sample is a digit '3', all coefficients in the softmax layer will be zero except the coefficient corresponding to the digit '3', i.e., $$[0,0,0,1,0,0,0,0,0,0]^T$$ with $\\hat{y}=3$.  \n",
    "\n",
    "This gives us insight into how to \"encode\" the *input* label vector.  We want a value of 1 for the given class and zeros everywhere else; this is also known as **one-hot coding**."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 2.3.3 What do labels \"mean\"?\n",
    "If we print the first ten labels in `y_train`, we see that the labels are reported as the digit itself."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print('The first ten entries of y_train are:')\n",
    "print(y_train[:10])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In this case there is a very direct and obvious relationship between the label and the meaning.  If `y_train==3`, the data is an image of the numeral three.  \n",
    "\n",
    "### A very important note on the abstractness of labels\n",
    "It is important to keep in mind, however, that these labels are a very abstract concept--when we see the ASCII character '3' printed in the first ten entries of `y_train` above, we interpret that to mean 'an image of the numeral three.'  We could just as easily have labeled the images of the numeral three with the label 'hamster' and *nothing* about the following code would change.  The performance we will see below on the ability to correctly classify all images of the numeral 3 would be identical.  The only difference is that the network would very cheerfully report that an image of the numeral three is a 'hamster' instead of a '3'.  And it would be correct because we would have told it that images of the numeral three are 'hamsters.'  \n",
    "\n",
    "This highlights the importance of leveraging humans to provide labels for the training data.  It is the humans that are providing the abstract intepretation of what those images represent.  Computers, however, only understand numbers.  So we need to find some means to translate our abstract notion of the classes of the input data to something numerical for the computer to interpret.  \n",
    "\n",
    "As a more concrete example of this abstractness of the labels, consider the Fashion-MNIST dataset (see also https://keras.io/datasets/).  This dataset was designed to be a drop-in replacement for MNIST.  The dimensionality is exactly the same (60,000 28$\\times$28 pixel training images and 10,000 28$\\times$28 pixel testing images), but the images are grayscale images of clothing articles.  Thus in the Fashion-MNIT dataset, if ground truth label is specified as '3', instead of interpreting that as 'an image of the 'numeral three,' you interpret that as 'an image of a dress.'  We will take a brief look at the Fashion-MNIST dataset a bit later.\n",
    "\n",
    "A convenient and common translation is to these one-hot coded vectors.  Different frameworks and different networks may have different conventions.  Many frameworks have become more user friendly in the ability to specify a list of categories, similar to what we provided to the confusion matrix code in Tutorial 2.  Since so many frameworks rely on on-hot coding, and especially since many deep learing networks have a softmax output layer, it is worth the time to understand the basics of one-hot coding."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 2.3.4 Converting labels to one-hot coding\n",
    "We will use the `keras` function `np_utils.to_categorical` to convert the label vector to a one-hot vector.  We specifying `y_train` or `y_test` as input and denote the one-hot label vector with a capital `Y` to remind ourselves that this is now actually a matrix of probabilities and thus a very different representation than the original label vector."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "Y_train = np_utils.to_categorical(y_train, 10)\n",
    "Y_test = np_utils.to_categorical(y_test, 10)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's check the dimensionality of these new one-hot labels `Y_train` and `Y_test`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print('The dimensions of Y_train are:')\n",
    "print(Y_train.shape)\n",
    "print('The dimensions of Y_test are:')\n",
    "print(Y_test.shape)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We note that the first dimension of `Y_train` and `Y_test` correspond to the sample and the second dimension consists of 10 entries.  Let's look at the one-hot label for the first 10 samples in `Y_train` and compare to the first 10 samples in the original label vector `y_train`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print('The first ten entries of y_train (original label vector) are:')\n",
    "print(y_train[0:10])\n",
    "print('The argmax of the first ten entries of Y_train (one-hot label vector) are:')\n",
    "print(Y_train[0:10,:])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## <span style='color:Green'> Your turn: </span>\n",
    "Verify to yourself that the correct entries in the one-hot label vector are hot."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 2.3.5 Converting labels from one-hot coding\n",
    "We can double check ourselves by applying an argmax to the one-hot labels.  We expect to get back the original labels."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print('The first ten entries of y_train (original label vector) are:')\n",
    "print(y_train[0:10])\n",
    "print('The first ten entries of Y_train (one-hot label vector) are:')\n",
    "print(np.argmax(Y_train[0:10,:],axis=1))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We find that we do get back the original labels."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Section 3: Building a CNN for MNIST"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we've loaded and preprocessed the input data (samples `X_train` and labels `y_train`) needed to train a deep learning network.  We need to decide the specific architecture of the network itself.  We begin here with a simple 2-layer network.  This network will result in approximately 95% accuracy on the training data after several epochs, but can take a few minutes per epoch to run on a CPU (a few seconds on a GPU).  As such, we will set this up to run and then cycle back around to understand the details as it is training.  \n",
    "\n",
    "Within each epoch of training, the entire training dataset is visited once.  Thus, an epoch can be thought of in the general sense of an iteration.  Deep learning uses the distinct terminology of epoch to specifically mean the one visit of the entire training set.  Within each epoch, you have batches of the input data.  The decomposition of an epoch into multiple batches is particularly important for very large datasets that cannot fit into working memory.\n",
    "\n",
    "We repeat in the following cell the necessary steps for the data wrangling studied in Section 2.  This will make sure we have all the correct wrangling in place before inputting the data to the network below.  If we re-ran cells out of order as we were studying things above, it is likely that we may have ended up with an incorrectly wrangled `X_train`, `X_test`, `y_train`, or `y_test`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "(X_train, y_train), (X_test, y_test) = mnist.load_data() # read in data\n",
    "X_train = X_train.astype('float32')/255 # normalize data to [0,1]\n",
    "X_test = X_test.astype('float32')/255 # normalize data to [0,1]\n",
    "X_train = X_train.reshape(X_train.shape[0], 28, 28, 1) # add channel dimension to end of tensor\n",
    "X_test = X_test.reshape(X_test.shape[0], 28, 28, 1) # add channel dimension to end of tensor\n",
    "Y_train = np_utils.to_categorical(y_train, 10) # compute one-hot label vector\n",
    "Y_test = np_utils.to_categorical(y_test, 10) # compute one-hot label vector"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 3.1 Import necessary keras library functions\n",
    "In Section 0.3 we directly imported only those functions we use from `keras` to make our code more compact.  "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 3.2 Define the model architecture\n",
    "Next we define our first model, which we call `model1`.  We'll cycle back to understand the components of this model after we set it training, but also include some descriptions here:\n",
    " - The `Sequential` model is the base class used in `keras` to define a sequential stack of layers\n",
    " - The `Convolution2D` layer defines a typical convolution layer.  This layer takes a tensor of images as input and outputs a tensor of images.  Note that you need to define the input shape for the first convolutional layer $28\\times28\\times1$ in this case.  The input shape for all subsequent layers is automatically inferred to be the same as the output shape for the previous layer.  We explicitly specify `'channels_last'` for the `data_format` since that is how we defined our input data.  Note that for CNNs we generally only count the convolutional layers when reporting on the depth of the network.\n",
    " - The `MaxPooling2D` layer reduces the spatial dimensions of the input tensor.  We again explicitly specify `'channels_last'` for the `data_format`.\n",
    " - The `Flatten` layer essentially reshapes the dimensions of the data.  In this case it takes the $28\\times28\\times32$ tensor output from the second convolutional layer and reshapes it into a length $28*28*32=25088$ vector.\n",
    " - The `Dense` layer is the layer type for fully connected (i.e., dense) layers.  This type of layer defines a connection from all nodes in the previous layer to all nodes in the subsequent layer.  The first layer defined here takes the 25088 inputs and outputs 128 values.  The second fully connected layer takes those 128 as input and outputs 10 values.  \n",
    " - Note that the size of the output layer is consistent with the number of classes in our dataset (10 in this case) and that we have specified a `'softmax'` activation. This means that the output will be the probability of an example belonging to each of the 10 classes. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "model1 = Sequential()\n",
    "model1.add(Convolution2D(32, (3, 3), activation='relu', input_shape=(28,28,1), data_format='channels_last'))\n",
    "model1.add(Convolution2D(32, (3, 3), activation='relu', data_format='channels_last'))\n",
    "model1.add(MaxPooling2D(pool_size=(2,2),data_format='channels_last'))\n",
    "model1.add(Flatten())\n",
    "model1.add(Dense(128, activation='relu'))\n",
    "model1.add(Dense(10, activation='softmax'))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 3.3 Compile the model\n",
    "Next we need to compile the model before we can train it.  This requires specification of several parameters including the `loss`, the `optimizer` and the `metrics`.  Again, we will cycle back to understand these after we set it training.  We specify a minimum of options here, including:\n",
    " - The loss function which is what the network uses to gauge its performance as it trains.  In this case we use `'categorical_crossentropy'` which is a common loss function for multi-class classification problems.  This loss expects labels in one-hot coded format.\n",
    " - The optimizer which is how the network adjusts its learning as it trains.  In this case we use the `'adam'` optimizer which is a good optimizer in the absence of any other prefered optimizer.  The `'adam'` optimizer adjusts the learning rate throughout the training process to help convergence.\n",
    " - The `metrics` which are the \"human-interpretable\" measurements of network performance.  Here we request the accuracy which will be a the percentage of correctly classified digits."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "model1.compile(loss='categorical_crossentropy', optimizer='adam',metrics=['accuracy'])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 3.4 Train the model\n",
    "Now we finally start the actual training of this model.  We input the `X_train` and `Y_train` variables that we worked with above, and specify a few simulation parameters such as `batch_size` and `epochs` which we will cycle back to in a while.  We specify `verbose=1` in order to print out status so we can keep track of where we are in the training process.\n",
    "\n",
    "You may get one or more warnings, but as long as you don't get any errors, you should see \n",
    "something of the form\n",
    "```\n",
    "Epoch 1/10\n",
    "38208/60000 [==================>..........] - ETA: 1:12 - loss: 0.1849 - acc: 0.9448\n",
    "```\n",
    "\n",
    "We have specified a total of 1 epoch, so the ETA specified at the beginning of the current epoch is approximately the total time the training is expected to take.  How long the training takes is very dependent on the hardware and how well that hardware is configured to perform the sort of computation required for CNN training.  On my desktop machine (AMD Ryzen 7 2700X 4.3 GHz processor), the training took about 3.5 minutes on all 8 CPUs.  While this might be longer than the average person is accustomed to waiting for a computer to finish processing, this is actually a very reasonable time to train a complete deep network.  This is because the MNIST dataset is not too large, nor is the network we specified.  Ordinarily, we would (need to) run the training for more than one epoch.  For MNIST, however, we can converge to a very good accuracy within one epoch.  On my Titan XP GPU, the same training took about 2 seconds, illustrating the suitability of deep learning networks to GPU computation.  "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "model1.fit(X_train, Y_train, batch_size=64, epochs=1, verbose=1)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "There are some other issues related to testing the trained network that we will return to in Section 4.  For the remainder of this section, we focus on deepening our understanding of this model that we have trained on MNIST."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 3.5 Some Common Errors in Defining, Compiling, and Training the Model\n",
    "Error reporting is not always the most elucidating in deep learning models.  Here we explore some common errors in model definition, compilation, and training.  Below, we have copied the definition, compilation, and training stages from above and named this new model `model2`.  You can copy and paste this code into subsequent code cells and modify different aspects of the three stages to explore the effects and/or error messages encountered."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "model2 = Sequential()\n",
    "model2.add(Convolution2D(32, (3, 3), activation='relu', input_shape=(28,28,1), data_format='channels_last'))\n",
    "model2.add(Convolution2D(32, (3, 3), activation='relu', data_format='channels_last'))\n",
    "model2.add(MaxPooling2D(pool_size=(2,2),data_format='channels_last'))\n",
    "model2.add(Flatten())\n",
    "model2.add(Dense(128, activation='relu'))\n",
    "model2.add(Dense(10, activation='softmax'))\n",
    "\n",
    "model2.compile(loss='categorical_crossentropy', optimizer='adam',metrics=['accuracy'])\n",
    "\n",
    "model2.fit(X_train, Y_train, batch_size=64, epochs=1, verbose=1)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 3.5.1 Errors that actually report as errors\n",
    "The following errors should actually report as an error.  That error may or may not be particularly ellucidating in helping you find the source of the error if you weren't aware of the source in advance."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## <span style='color:Green'> Your turn: </span>\n",
    "What happens if you specify `data_format='channels_first'`?  How useful is the error in this case?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "model2 = Sequential()\n",
    "model2.add(Convolution2D(32, (3, 3), activation='relu', input_shape=(28,28,1), data_format='channels_first'))\n",
    "model2.add(Convolution2D(32, (3, 3), activation='relu', data_format='channels_first'))\n",
    "model2.add(MaxPooling2D(pool_size=(2,2),data_format='channels_first'))\n",
    "model2.add(Flatten())\n",
    "model2.add(Dense(128, activation='relu'))\n",
    "model2.add(Dense(10, activation='softmax'))\n",
    "\n",
    "model2.compile(loss='categorical_crossentropy', optimizer='adam',metrics=['accuracy'])\n",
    "\n",
    "model2.fit(X_train, Y_train, batch_size=64, epochs=1, verbose=1)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## <span style='color:Green'> Your turn: </span>\n",
    "What happens if you forget the flatten layer?  How useful is the error in this case?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "model2 = Sequential()\n",
    "model2.add(Convolution2D(32, (3, 3), activation='relu', input_shape=(28,28,1), data_format='channels_last'))\n",
    "model2.add(Convolution2D(32, (3, 3), activation='relu', data_format='channels_last'))\n",
    "model2.add(MaxPooling2D(pool_size=(2,2),data_format='channels_last'))\n",
    "#model2.add(Flatten())\n",
    "model2.add(Dense(128, activation='relu'))\n",
    "model2.add(Dense(10, activation='softmax'))\n",
    "\n",
    "model2.compile(loss='categorical_crossentropy', optimizer='adam',metrics=['accuracy'])\n",
    "\n",
    "model2.fit(X_train, Y_train, batch_size=64, epochs=1, verbose=1)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## <span style='color:Green'> Your turn: </span>\n",
    "What happens if you specify an output layer that is not length 10?  How useful is the error in this case?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "model2 = Sequential()\n",
    "model2.add(Convolution2D(32, (3, 3), activation='relu', input_shape=(28,28,1), data_format='channels_last'))\n",
    "model2.add(Convolution2D(32, (3, 3), activation='relu', data_format='channels_last'))\n",
    "model2.add(MaxPooling2D(pool_size=(2,2),data_format='channels_last'))\n",
    "model2.add(Flatten())\n",
    "model2.add(Dense(128, activation='relu'))\n",
    "model2.add(Dense(128, activation='softmax'))\n",
    "\n",
    "model2.compile(loss='categorical_crossentropy', optimizer='adam',metrics=['accuracy'])\n",
    "\n",
    "model2.fit(X_train, Y_train, batch_size=64, epochs=1, verbose=1)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## <span style='color:Green'> Your turn: </span>\n",
    "What happens if you forgot to extend the dimensionality of the input tensor `X_train`?  Remember the method `.squeeze` will remove singleton dimensions.  How useful is the error in this case?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "model2 = Sequential()\n",
    "model2.add(Convolution2D(32, (3, 3), activation='relu', input_shape=(28,28,1), data_format='channels_last'))\n",
    "model2.add(Convolution2D(32, (3, 3), activation='relu', data_format='channels_last'))\n",
    "model2.add(MaxPooling2D(pool_size=(2,2),data_format='channels_last'))\n",
    "model2.add(Flatten())\n",
    "model2.add(Dense(128, activation='relu'))\n",
    "model2.add(Dense(128, activation='softmax'))\n",
    "\n",
    "model2.compile(loss='categorical_crossentropy', optimizer='adam',metrics=['accuracy'])\n",
    "\n",
    "model2.fit(X_train.squeeze(), Y_train, batch_size=64, epochs=1, verbose=1)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## <span style='color:Green'> Your turn: </span>\n",
    "What happens if you accidentally used the label vector `y_train` instead of the one-hot label vector `Y_train`?  How useful is the error in this case?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "model2 = Sequential()\n",
    "model2.add(Convolution2D(32, (3, 3), activation='relu', input_shape=(28,28,1), data_format='channels_last'))\n",
    "model2.add(Convolution2D(32, (3, 3), activation='relu', data_format='channels_last'))\n",
    "model2.add(MaxPooling2D(pool_size=(2,2),data_format='channels_last'))\n",
    "model2.add(Flatten())\n",
    "model2.add(Dense(128, activation='relu'))\n",
    "model2.add(Dense(128, activation='softmax'))\n",
    "\n",
    "model2.compile(loss='categorical_crossentropy', optimizer='adam',metrics=['accuracy'])\n",
    "\n",
    "model2.fit(X_train.squeeze, y_train, batch_size=64, epochs=1, verbose=1)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 3.5.2 Errors that don't appear to be errors at first\n",
    "\n",
    "Sometimes, errors in your specification of the model will not result in an explicit coding error, which may cause further issues in debugging.  Here are two examples that we will explore further after we study more about testing models in Section 4."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "What happens if you use a `'tanh'` activation instead of a `'softmax'` activation on the output layer?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "model3 = Sequential()\n",
    "model3.add(Convolution2D(32, (3, 3), activation='relu', input_shape=(28,28,1), data_format='channels_last'))\n",
    "model3.add(Convolution2D(32, (3, 3), activation='relu', data_format='channels_last'))\n",
    "model3.add(MaxPooling2D(pool_size=(2,2),data_format='channels_last'))\n",
    "model3.add(Flatten())\n",
    "model3.add(Dense(128, activation='relu'))\n",
    "model3.add(Dense(10, activation='tanh'))\n",
    "\n",
    "model3.compile(loss='categorical_crossentropy', optimizer='adam',metrics=['accuracy'])\n",
    "\n",
    "model3.fit(X_train, Y_train, batch_size=64, epochs=1, verbose=1)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "What happens if we specify an `'mse'` (mean squared error) loss function instead of `'categorical_crossentropy'`?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "model4 = Sequential()\n",
    "model4.add(Convolution2D(32, (3, 3), activation='relu', input_shape=(28,28,1), data_format='channels_last'))\n",
    "model4.add(Convolution2D(32, (3, 3), activation='relu', data_format='channels_last'))\n",
    "model4.add(MaxPooling2D(pool_size=(2,2),data_format='channels_last'))\n",
    "model4.add(Flatten())\n",
    "model4.add(Dense(128, activation='relu'))\n",
    "model4.add(Dense(10, activation='softmax'))\n",
    "\n",
    "model4.compile(loss='mse', optimizer='adam',metrics=['accuracy'])\n",
    "\n",
    "model4.fit(X_train, Y_train, batch_size=64, epochs=1, verbose=1)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Section 4: Testing the Trained CNN"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 4.1 Determining accuracy on test data\n",
    "The accuracies that you see reported as the network trains are the accuracies on the *training* data.  This can be a good indication of the convergence of the network since you expect that the loss should decrease and accuracy should increase as training progresses.\n",
    "\n",
    "There is concern, however, that the network has \"learned the data\" instead of learned a more general classifier.  That is why we set aside a separate test set.  All of the data in the test set were unseen in training and thus are brand new to the network.  If the network is good and has not overfit the training data (learned the training data), we expect to see a good accuracy on the test data.  We expect that the test accuracy will likely be a bit lower than the training accuracy.\n",
    "\n",
    "We can take the trained model and evaluate it on a dataset using the `evaluate` method of the trained model.  As a sanity check, if we were to input the training data again, we would expect exactly the last accuracy reported in training. \n",
    "\n",
    "We again use the `verbose=1` option here to track the progress of evaluating the model on all 10,000 test images.  "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "score = model1.evaluate(X_test, Y_test, batch_size=64, verbose=1)\n",
    "print(score)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Note that the test stage is very quick. The major computational overhead in deep learning is in training.  The operational use of the trained model is very computationally light.  On my desktop computer, using the CPU, all 10,000 test images were labeled and compared to their ground truth in 8s, or 840 $\\mu$s per image.  On my GPU, all 10,000 test images were labeled and compared to their ground truth in less than 1 s, or less than 31 $\\mu$s per image.\n",
    "\n",
    "This reported two values after completion.  We can check what those values are by looking at the `metrics_names` attribute of the model."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(model1.metrics_names)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We note that these metrics are the loss and accuracy.  The loss is reported by default since that is the metric used by the network during training.  We requested that the network keep track of the additional metric of accuracy with the option `metrics=['accuracy']` when compiling the model."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 4.2 Determining the predicted labels\n",
    "### 4.2.1 Determining the one-hot labels\n",
    "We might want more information than just a summary of the accuracy.  If we output the predicted label for each of the test images, we can look more carefully at the performance of the network.  We use the `predict` method of the model.  This has to run all 10,000 test images through the trained network and determine the class for each image."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "Y_predict = model1.predict(X_test,batch_size=64,verbose=1)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "When we computed the one-hot coded label vector used to train the network, we began with the assumption that a one-hot form is consistent with the native output of the network.  We would thus expect that `Y_predict` is in a one-hot format.  We check this by printing the dimensions of `Y_predict`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(Y_predict.shape)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`Y_predict` does have the dimensions we would expect for a one-hot coded label vector.  Similar to our process when we developed the one-hot coded vector `Y_test`, we can look at the first 10 entries of `Y_predict`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(Y_predict[0:10,:])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "At first glance, this form looks very different than those we saw for `Y_train`.  Remember, however, that with `Y_train` we knew exactly what the actual labels were.  Here, with `Y_predict`, the network is computing probabilities of the image belonging to each of the 10 classes.  If you pay careful attention to the exponents of the coefficients in each row of `Y_predict`, you will note that one coefficient is very close to 1 and the remainder are very close to zero.\n",
    "\n",
    "In fact, most of these coefficients round to 1 and 0 if rounded to two decimal places:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(np.round(Y_predict[0:10,:],2))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 4.2.2 Determining the numerical class labels\n",
    "We can apply the argmax function to the one-hot label vector `Y_predict` to determine the class label for each sample.  Since this output will have a similar form to the original label vectors, we denote it as `y_predict`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "y_predict = Y_predict.argmax(axis=-1)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If we print these numerical labels, we see that they correspond to the one-hot interpretation above."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(y_predict[0:10])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 4.2.3 Comparing the predicted labels to the ground truth\n",
    "The deterimination of accuracy requires a comparison of the predicted labels to the ground truth labels.  That is was is done \"under the hood\" when `keras` reports accuracy using the `evaluate` method of the model.  As a sanity check, we can compute the accuracy \"by hand\" using `y_predict` and `y_test`.  "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "my_acc = (y_predict==y_test).sum()/len(y_predict)\n",
    "print('My accuracy computation says:')\n",
    "print(my_acc)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We see that this value exactly matches that reported by `keras` above."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can also use both `y_predict` and `y_test` to gain a bit more insight into the performance of the network.\n",
    "\n",
    "As a very simple verification, we can print the first 10 labels of both `y_predict` and `y_train` and compare by eye."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print('Actual labels are:')\n",
    "print(y_test[0:10])\n",
    "print('Predicted labels are:')\n",
    "print(y_predict[0:10])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 4.2.4 Investigate errors in classification\n",
    "Looking more closely at those images that the network incorrectly classified can give us some insight in the robustness of the network.  If the incorrectly classified images are difficult images, we may have more confidence in the network than if it is incorrectly classifying obvious images (more fun examples of that tomorrow!).  \n",
    "\n",
    "We can find which images were incorrectly classified by the network by looking for those images where the predicted and ground truth labels do not match."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "incorrect_labels = np.where(y_predict!=y_test)[0]\n",
    "print('There are '+str(len(incorrect_labels))+' incorrectly classified images')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The code below visualizes the first 10 of these incorrectly classified images and titles the plots with both the correct and predicted label."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "plt.figure(figsize=(8,8))\n",
    "for k in range(0,9): # choose 10 examples\n",
    "    plt.subplot(3,3,k+1) # select the current subplot\n",
    "    plt.imshow(np.squeeze(X_test[incorrect_labels[k],:,:]),cmap='gray') # plot the image\n",
    "    plt.title('Actual:'+str(y_test[incorrect_labels[k]])+' Predicted:'+str(y_predict[incorrect_labels[k]]))\n",
    "    plt.axis('off')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In many of these cases, the digits do not appear \"typical\" in form and it is thus not surprising that the network may have had difficulty correctly classifying them.  In most cases, it is also easy to postulate what structures in the image may have resulted in the incorrect classification that did result."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## <span style='color:Green'> Your turn: </span>\n",
    "Explore the performance of `model3` in which we used a `'relu'` activation on the output layer and `model4` in which we used an `'mse'` loss."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Section 5: Transfer Learning\n",
    "## 5.1 Applying the MNIST Network Directly to Fashion-MNIST\n",
    "Here we look at what happens when we input data to a network that is completely different than what it has seen before.  To make our lives easier, we will use the Fashion-MNIST dataset which is designed as a dropin for the MNIST dataset.  This way, we don't need to worry about as many details in the data preprocessing and can focus on the behavior of the network to completely different data.\n",
    "\n",
    "We import and preprocess the Fashion-MNIST dataset in exactly the same way we did the MNIST data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "(X_train_f, y_train_f), (X_test_f, y_test_f) = fashion_mnist.load_data()\n",
    "X_train_f = X_train_f.reshape(X_train_f.shape[0], 28, 28, 1)\n",
    "X_train_f = X_train_f.astype('float32')/255\n",
    "Y_train_f = np_utils.to_categorical(y_train_f, 10)\n",
    "X_test_f = X_test_f.reshape(X_test_f.shape[0], 28, 28, 1)\n",
    "X_test_f = X_test_f.astype('float32')/255\n",
    "Y_test_f = np_utils.to_categorical(y_test_f, 10)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's check the performance of the MNIST network on this new dataset."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "score = model1.evaluate(X_test_f, Y_test_f, verbose=1)\n",
    "print(score)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a point of reference, since there are 10 classes in the Fashion-MNIST dataset, you would expect a random guess to yield approximately 10% accuracy.  We find about 10% accuracy (this may differ depending on exactly where your model converged to in training and may differ from run to run given the random initialization and randomization in assigning data to batches).  Why is the performance so bad?  \n",
    "\n",
    "Let's look at one of the images from the Fashion-MNIST dataset."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "plt.figure()\n",
    "plt.imshow(np.squeeze(X_test_f[0]),cmap='gray')\n",
    "plt.show()\n",
    "print('This image is class '+str(y_test[0])+' in the Fashion-MNIST dataset')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We see that this is an image of a \"sneaker,\" which also corresponds to class 7 in the Fashion-MNIST dataset (see https://keras.io/datasets/ for the full list of class labels and descriptions).\n",
    "\n",
    "Let's see what class our digit MNIST network classifies this image as."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "Y_example = model1.predict(X_test_f[0].reshape(1,28,28,1),verbose=1)\n",
    "y_example = np.argmax(Y_example)\n",
    "print(y_example)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(np.round(Y_example,2))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This network has decided that this image of a \"sneaker\" is the digit \"2\" (the network usually converges to \"2\", but may have converged to a different value depending on differences in training).  It has never seen a sneaker.  But it will still do its level best to match that sneaker to the closest thing it knows.  In this case, that is apparently a \"2\". Looking at the softmax output, we see that the network is 88% certain this is the digit \"2\".  This is just one of many examples of networks being very certain about very wrong decisions."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 5.2 Adapting the MNIST Model for Fashion-MNIST (Transfer Learning)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In transfer learning, we can \"transfer\" knowledge learned in one domain (e.g., MNIST) to another domain (e.g., Fashion-MNIST).  The idea of transfer learning is predicated on the assumption that all images share the same basic primitives (edges, corners, etc.) which are essentially the features of images that we hand-designed in the second tutorial.  In transfer learning, we re-use those image primitives and only have to relearn how to combine those primitives together in order to correctly classify a new domain of images.  To do this, we will copy our MNIST `model1` architecture and \"freeze\" all layers except the last layer by setting the `trainable` attribute of layers to `False`.  All the parameters from the two convolutional layers and the first fully connected layer will remain in the state that we converged to when training the network on MNIST.  It is only that final fully connected layer that will change in order to (hopefully) learn to correctly classify the Fashion-MNIST data. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "model1_f = keras.models.clone_model(model1)\n",
    "model1_f.set_weights(model1.get_weights())\n",
    "\n",
    "for layer in model1_f.layers[:-1]:\n",
    "    layer.trainable=False\n",
    "    \n",
    "model1_f.compile(loss='categorical_crossentropy', optimizer='adam',metrics=['accuracy'])\n",
    "\n",
    "model1_f.fit(X_train_f, Y_train_f, batch_size=64, epochs=1, verbose=1)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The main advantages of transfer learning are related to computational efficiency and small datasets:\n",
    " - **Computational Efficiency**: By freezing most of the layers in the network, we have fewer trainable parameters.  This allows us to adapt a network to a new problem with much less computation than it would take to train the entire network from scratch.  This computational advantage may not be apparent here since we are dealing with a relatively small network and small dataset, but can become significant with larger networks and datasets.\n",
    " - **Small Datasets**: Training a very large network on a small dataset runs the risk of significant overfitting (learning the data).  Transfer learning allows us to \"borrow\" knowledge from much larger datasets and focusing the learning on the classification task."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## <span style='color:Green'> Your turn: </span>\n",
    "How well does our new transfer learning model `model1_f` perform on the Fashion-MNIST data?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## <span style='color:Green'> Your turn: </span>\n",
    "How well does our new transfer learning model `model1_f` perform on the original MNIST data?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## <span style='color:Green'> Your turn: </span>\n",
    "How does the transfer learning work if you freeze fewer layers?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Section 6: Saving Trained Models\n",
    "We can save a trained model so that we don't have to go through the bother of training it again later.  The following instructions save the model in a binary file in HDF5 (Hierarchical Data Format).  The use of these commands assume that you have `h5py` (the python interface to HDF5 format) installed.  For at least the Linux version of Anaconda 3.8 with the necessary libraries for this tutorial, it appears that `h5py` was included.  If it does not appear that you have `h5py` installed, you can run the following command from your computer's terminal\n",
    "```\n",
    "conda install h5py\n",
    "```\n",
    "The successful installation of `h5py`, however, requires that the HDF5 libraries to be installed on your computer.  "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "model1.save('model1.h5')\n",
    "model1_f.save('model1_f.h5')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This will save a binary file named `model1.h5` to the same directory as this notebook.  You can load this file and pick up right where we left off."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "model1 = load_model('model1.h5')\n",
    "model1_f = load_model('model1_f.h5')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.10"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}