1. Fine-tune the pretrained AlexNet, VGG, and ResNet on the hymenoptera_data set, and report the training and testing accuracies after every epoch. Discuss your observations based on the comparison results.
2. Use the pretrained AlexNet, VGG, and ResNet as fixed feature extractors to perform classification on the hymenoptera_data set. Report the training and testing accuracies after every epoch, and discuss your observations based on the comparison results.
3. Train AlexNet, VGG, and ResNet from scratch on the hymenoptera_data set. Report the training and testing accuracies after every epoch, and discuss your observations based on the comparison results.
4. Compare the results of fine-tuning and training from scratch for AlexNet, VGG, and ResNet, and plot the curves of validation accuracy against training epochs.
5. Compare the results of feature extraction and training from scratch for AlexNet, VGG, and ResNet, and plot the curves of validation accuracy against training epochs. (A data-loading sketch shared by all five tasks follows this list.)
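Before starting, it may help to fix a common data pipeline for all five tasks. The following is a minimal PyTorch sketch, assuming hymenoptera_data uses the standard train/val layout with one subfolder per class (ants, bees); the folder path, batch size, and augmentation choices here are illustrative, not prescribed by this exercise. Normalization uses the ImageNet statistics, since the backbones are pretrained on ImageNet.

    import torch
    from torchvision import datasets, transforms

    # ImageNet channel statistics, needed because the backbones were
    # pretrained on ImageNet-normalized inputs.
    normalize = transforms.Normalize([0.485, 0.456, 0.406],
                                     [0.229, 0.224, 0.225])

    data_transforms = {
        'train': transforms.Compose([
            transforms.RandomResizedCrop(224),  # the backbones expect 224 x 224 inputs
            transforms.RandomHorizontalFlip(),
            transforms.ToTensor(),
            normalize,
        ]),
        'val': transforms.Compose([
            transforms.Resize(256),
            transforms.CenterCrop(224),
            transforms.ToTensor(),
            normalize,
        ]),
    }

    image_datasets = {split: datasets.ImageFolder(f'hymenoptera_data/{split}',
                                                  data_transforms[split])
                      for split in ('train', 'val')}
    dataloaders = {split: torch.utils.data.DataLoader(image_datasets[split],
                                                      batch_size=32,
                                                      shuffle=(split == 'train'))
                   for split in ('train', 'val')}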
Objectives
1. To understand the concepts of two types of transfer learning, i.e., fine-tuning and feature extraction.
2. To get familiar with three classical deep-learning-based architectures, i.e., AlexNet, VGG, and ResNet.
3. To use pretrained neural networks to solve visual classification tasks.
1 Transfer Learning
In this section, we first introduce two transfer-learning techniques, i.e., fine-tuning and feature extraction, which have been widely used to tackle various computer vision and machine learning tasks.
1.1 Fine-tuning
In the previous laboratory exercise, we discussed how to train models on the Fashion-MNIST training data set, which has only 60,000 images. Here, we introduce ImageNet, the most widely used large-scale image data set in the academic world, with more than 10 million images covering over 1,000 object categories. Assume that we want to identify different kinds of insects in images, but the number of examples is limited. If we directly train a neural network from scratch, the accuracy of the final trained model may not meet the practical requirements.
One potential solution is to apply transfer learning to migrate the knowledge learned from the source data set to the target data set. For example, although the images in ImageNet are mostly unrelated to insects, models trained on this data set can extract more general image features that can help identify edges, textures, shapes, and object composition. These similar features may be equally effective for recognizing insects.
In this section, we introduce a powerful technique in transfer learning: fine-tuning. As shown in Fig. 1, fine-tuning consists of the following four steps (a minimal PyTorch sketch is given after the list):
1) Pretrain a neural network model, i.e., the source model, on a source data set (usually large-scale, e.g., the ImageNet data set).
2) Create a new neural network model, i.e., the target model. This replicates all model designs and their parameters from the source model, except the output layer. We assume that the parameters of these deep models contain the knowledge learned from the source data set and that this knowledge will be applicable to the target data set. We also assume that the output layer of the source model is closely related to the labels of the source data set and is therefore not used in the target model.
3) Add an output layer, whose output size is the number of categories in the target data set, to the target model, and randomly initialize the model parameters of this layer.
4) Train the target model on a target data set, such as an insect data set. We will train the output layer from scratch, while the parameters of all remaining layers are fine-tuned based on the parameters of the source model.
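As a concrete illustration, here is a minimal PyTorch sketch of the four steps above, assuming a ResNet-18 backbone and the two-class hymenoptera_data target set (ants vs. bees); the same pattern applies to AlexNet and VGG, whose output layer lives in model.classifier rather than model.fc. The weights argument requires torchvision 0.13 or later; older versions use pretrained=True instead.

    import torch
    import torch.nn as nn
    import torchvision.models as models

    # Steps 1-2: obtain the source model pretrained on ImageNet and
    # reuse all of its layers except the output layer.
    model = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)

    # Step 3: replace the output layer with a randomly initialized one
    # sized to the number of target categories (2 for ants vs. bees).
    model.fc = nn.Linear(model.fc.in_features, 2)

    # Step 4: fine-tune ALL parameters on the target data set; the new
    # output layer is trained from scratch alongside the pretrained layers.
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)

In practice, the pretrained layers are often given a smaller learning rate than the new output layer (e.g., via optimizer parameter groups), since their weights are already close to a good solution.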
1.2 Feature Extraction
Another useful transfer-learning technique is feature extraction. Specifically, we start with a pretrained model and only update the weights of the final layer, from which we derive predictions. It is called feature extraction because we use the pretrained neural network as a fixed feature extractor and only change the output layer. Compared with the fine-tuning procedure described in Section 1.1, the only difference is in the last step: we train only the output layer and freeze the parameters of all the remaining layers that were pretrained on the source data set.
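The corresponding sketch for feature extraction differs from the fine-tuning sketch above in only two respects: the pretrained parameters are frozen, and only the new output layer is handed to the optimizer (same hedged assumptions: ResNet-18 backbone, two target classes, torchvision 0.13+).

    import torch
    import torch.nn as nn
    import torchvision.models as models

    model = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)

    # Freeze every pretrained parameter: the network becomes a fixed
    # feature extractor, and no gradients are computed for these layers.
    for param in model.parameters():
        param.requires_grad = False

    # The newly created output layer is trainable by default.
    model.fc = nn.Linear(model.fc.in_features, 2)

    # Only the output layer's parameters are updated during training.
    optimizer = torch.optim.SGD(model.fc.parameters(), lr=1e-3, momentum=0.9)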
2 Classical Deep-Learning-Based Architectures
In this section, we introduce three classical deep-learning-based architectures, i.e., AlexNet, VGG, and ResNet, which have been widely used as backbone networks in the computer vision community.
2.1 AlexNet
AlexNet [1] was the first highly successful CNN on the ImageNet data set. The overall architecture is illustrated in Fig. 2. It contains eight layers with weights; the first five are convolutional layers and the remaining three are fully connected layers. The first convolutional layer filters the 224 × 224 × 3 input image with 96 kernels of size 11 × 11 × 3, with a stride of 4 pixels. The second convolutional layer takes the output of the first convolutional layer as input and filters it with 256 kernels of size 5 × 5 × 48. The third, fourth, and fifth convolutional layers are connected to one another without any intervening pooling or normalization layers. The third convolutional layer has 384 kernels of size 3 × 3 × 256, connected to the (normalized, pooled) outputs of the second convolutional layer. The fourth convolutional layer has 384 kernels of size 3 × 3 × 192, and the fifth convolutional layer has 256 kernels of size 3 × 3 × 192. The first two fully connected layers have 4,096 neurons each, and the final fully connected layer feeds a 1,000-way softmax.
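The pretrained AlexNet used in this exercise can be loaded and inspected with torchvision; note that torchvision's single-stream implementation differs slightly in channel counts from the layout above (the 48- and 192-channel kernel depths come from the paper's split across two GPUs). This sketch assumes torchvision 0.13+.

    import torchvision.models as models

    # Load AlexNet pretrained on ImageNet and print its layers to see the
    # five convolutional layers followed by three fully connected layers.
    alexnet = models.alexnet(weights=models.AlexNet_Weights.IMAGENET1K_V1)
    print(alexnet)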
2.2 VGG
VGG [2] was published in 2015. Its main contribution is a thorough evaluation of networks of increasing depth using an architecture with very small (3 × 3) filters, showing that significant improvement can be achieved by pushing the depth of the CNN to 16–19 weight layers. The VGG team secured first place in the localisation task and second place in the classification task of the ImageNet Challenge 2014. The network configurations are shown in Fig. 3. The depth of the configurations increases from the left (A) to the right (E), as more layers are added (the added layers are shown in bold). The parameters of each convolutional layer are denoted as "conv(receptive field size)-(number of channels)". The ReLU activation function is not shown for brevity.
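For this exercise, configuration D (VGG-16, 16 weight layers) is a typical choice. Below is a minimal sketch of loading it and resizing its head for the two-class target set (same assumptions as before: torchvision 0.13+; ants vs. bees).

    import torch.nn as nn
    import torchvision.models as models

    # VGG-16 pretrained on ImageNet; its classifier ends in a
    # Linear(4096, 1000) layer for the 1,000 ImageNet classes.
    vgg16 = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1)

    # For hymenoptera_data, replace the last layer with a 2-way output.
    vgg16.classifier[6] = nn.Linear(4096, 2)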
2.3 ResNet
Deeper neural networks are more difficult to train. ResNet [3] was proposed based on a residual learning framework, which eases the training of networks that are substantially deeper than those used previously. Specifically, the layers are explicitly reformulated as learning residual functions with reference to the layer inputs, instead of learning unreferenced functions. Instead of hoping that each few stacked layers directly fit a desired underlying mapping, we explicitly let these layers fit a residual mapping. Formally, denoting the desired underlying mapping as H(x), we let the stacked nonlinear layers fit another mapping F(x) := H(x) − x, so that the original mapping is recast as F(x) + x. The hypothesis is that it is easier to optimize the residual mapping than to optimize the original, unreferenced mapping.
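The following is a minimal PyTorch sketch of a basic residual block implementing F(x) + x with an identity shortcut; it omits the 1 × 1 projection that ResNet uses when the spatial size or channel count changes across the block.

    import torch
    import torch.nn as nn

    class ResidualBlock(nn.Module):
        """Basic residual block: output = relu(F(x) + x)."""

        def __init__(self, channels):
            super().__init__()
            self.conv1 = nn.Conv2d(channels, channels, kernel_size=3,
                                   padding=1, bias=False)
            self.bn1 = nn.BatchNorm2d(channels)
            self.conv2 = nn.Conv2d(channels, channels, kernel_size=3,
                                   padding=1, bias=False)
            self.bn2 = nn.BatchNorm2d(channels)
            self.relu = nn.ReLU(inplace=True)

        def forward(self, x):
            # F(x): two 3 x 3 conv layers with batch norm, as in ResNet-18/34.
            residual = self.bn2(self.conv2(self.relu(self.bn1(self.conv1(x)))))
            # F(x) + x: the identity shortcut makes the block learn a residual.
            return self.relu(residual + x)

    # Example: the block preserves the input shape.
    y = ResidualBlock(64)(torch.randn(1, 64, 56, 56))  # shape (1, 64, 56, 56)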