Note: this article is also available in Italian as a pdf file.
Introduction
In an Image Classification task, the goal is to assign to an image one label from a set of available labels. Unlike many other classic computer vision tasks, this is the only information we need: we do not have to locate the object in the image with a bounding box, nor define membership classes pixel by pixel. Each image receives exactly one class from a predefined set, with no overlap. This project uses the CIFAR-10 dataset, which contains 60000 color images of size 32x32 pixels[1]. One of the peculiarities of this dataset is that the classes are completely mutually exclusive visually: the same image will never contain both a car and a truck, two of the categories present in the dataset.

Created as a subset of the 80 Million Tiny Images[2] dataset, CIFAR-10 is very widely used for benchmarking vision algorithms. The classes are airplane, automobile, bird, cat, deer, dog, frog, horse, ship, and truck; the dataset is also artificially balanced, with 6000 images per class, so we will not face any class-imbalance problems during classification. For all these reasons it is a standard benchmark for classification algorithms: its regular structure allows excellent repeatability of experiments and makes it easy to compare the same metrics across multiple architectures. To address the classification task, we will use two different convolutional network architectures. We will see a first approach based on transfer learning, which uses the ResNet-18 convolutional network as a pre-trained feature extractor, and a second, custom convolutional architecture built ad hoc for the problem, although there are no unified guidelines for designing such architectures, only some best practices. Finally, we will see how the available computational resources can very quickly become a bottleneck in applications of this kind.

Related Work
There is a great deal of literature on Image Classification via convolutional networks; a quick search on engines such as Google Scholar returns tens of thousands of results among publications and books[3]. Some of the most established architectures in Computer Vision for image classification tasks are VGGNet[4], AlexNet[5], GoogLeNet[6], and ResNet[7]. The latter is the network we will use for our experiment, and it is also the architecture that initiated the so-called “revolution of depth” of convolutional networks. Although the version we will use is composed of 18 layers, there are versions of ResNet with more than a thousand layers[8]. One of the most recent publications, from 2019[9], reports an accuracy of up to 99% on CIFAR-10 image classification; it is fair to say that this is a very thoroughly tested dataset.
Proposed Approach
The structure of the CIFAR-10 dataset includes 50000 training images and 10000 test images, both balanced across classes. We therefore have 5000 training images and 1000 test images available for each class.
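For completeness, the dataset can be obtained directly through torchvision (a minimal sketch; the preprocessing actually used in the experiments may differ):

```python
from torchvision import datasets, transforms

# CIFAR-10 as packaged by torchvision: 50000 training and 10000 test images,
# both balanced across the 10 classes.
transform = transforms.ToTensor()
train_set = datasets.CIFAR10(root="data", train=True, download=True, transform=transform)
test_set = datasets.CIFAR10(root="data", train=False, download=True, transform=transform)
print(len(train_set), len(test_set))  # 50000 10000
```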
*First approach: transfer learning, using ResNet-18 as a feature extractor.* Our first approach will use a transfer learning strategy: we will take a convolutional network pre-trained by the PyTorch[10] library and reuse it as a feature extractor. We will therefore not have to train the network from scratch to find all the weights to assign to its many convolutional layers. Thanks to libraries such as PyTorch this operation is very simple: we can import the network directly and replace the last fully connected layer with a dense layer having 10 output nodes, one per class to be discriminated. We can then freeze the weights of the convolutional layers, leaving the optimizer only the work on the last fully connected layer. One of the peculiarities of ResNet that makes it particularly high-performance is the introduction of the residual block, see Fig. 1.
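In PyTorch, this head swap can be sketched as follows (a minimal example of the setup described, not the exact training script used here):

```python
import torch
import torch.nn as nn
from torchvision import models

# Load ResNet-18 pre-trained on ImageNet (torchvision)
model = models.resnet18(pretrained=True)

# Freeze all backbone weights: the optimizer will not update them
for param in model.parameters():
    param.requires_grad = False

# Replace the final fully connected layer with a new 10-output layer
# (its weights are freshly initialized and therefore trainable)
model.fc = nn.Linear(model.fc.in_features, 10)

# Only the parameters of the new head are handed to the optimizer
optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
```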

This simply consists of forwarding the activation of a certain layer to a deeper layer of the network: as we can see in Fig. 1, the activation coming from an earlier layer is added to that of a deeper one. This simple expedient avoids numerical gradient problems in the optimization phase and prevents the loss of information during the training of very deep networks. We will see later that, although this method makes it possible to reuse prior knowledge without starting the training of a convolutional network from scratch each time, transfer learning is not always an efficient procedure for solving an image classification task.
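To make the idea concrete, a minimal residual block can be written as below (simplified: equal input and output channels and stride 1, so the shortcut needs no projection):

```python
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Minimal residual block: output = ReLU(F(x) + x)."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU()

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + x)  # skip connection: add the input back in
```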
*Second approach: custom convolutional network.* As a second method, we will use a convolutional network created specifically for the task and trained on the dataset from scratch. Its layers are defined as follows (a PyTorch sketch of the same architecture follows the list):
- INPUT
- CONVOLUTIONAL 1
    - Number of filters: \(32\)
    - Kernel size: \(3\times3\)
    - Padding: SAME
    - Stride: 1 px
- RELU
- MAX POOLING 2D
- CONVOLUTIONAL 2
    - Number of filters: \(64\)
    - Kernel size: \(3\times3\)
    - Padding: SAME
    - Stride: 1 px
- RELU
- MAX POOLING 2D
- CONVOLUTIONAL 3
    - Number of filters: \(128\)
    - Kernel size: \(3\times3\)
    - Padding: SAME
    - Stride: 1 px
- RELU
- FULLY CONNECTED 1
    - Input nodes: \(8192\) \((8 \times 8 \times 128)\)
    - Output nodes: \(512\)
- BATCH NORMALIZATION
- RELU
- DROPOUT (\(p=0.5\))
- FULLY CONNECTED 2
    - Input nodes: \(512\)
    - Output nodes: \(10\)
- DROPOUT (\(p=0.5\))
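A PyTorch sketch of this architecture might look as follows; the layer sequence mirrors the list above, while details such as weight initialization are left to the framework defaults:

```python
import torch.nn as nn

class CustomNet(nn.Module):
    """Sketch of the custom architecture listed above."""
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, padding=1),   # CONV 1, "same" padding
            nn.ReLU(),
            nn.MaxPool2d(2),                              # 32x32 -> 16x16
            nn.Conv2d(32, 64, kernel_size=3, padding=1),  # CONV 2
            nn.ReLU(),
            nn.MaxPool2d(2),                              # 16x16 -> 8x8
            nn.Conv2d(64, 128, kernel_size=3, padding=1), # CONV 3
            nn.ReLU(),
        )
        self.classifier = nn.Sequential(
            nn.Linear(8 * 8 * 128, 512),  # FULLY CONNECTED 1
            nn.BatchNorm1d(512),
            nn.ReLU(),
            nn.Dropout(p=0.5),
            nn.Linear(512, num_classes),  # FULLY CONNECTED 2
            nn.Dropout(p=0.5),            # final dropout, as in the list above
        )

    def forward(self, x):
        x = self.features(x)
        x = x.flatten(1)  # (N, 128, 8, 8) -> (N, 8192)
        return self.classifier(x)
```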
The loss function used for all trainings is the Cross Entropy Loss, also called Log Loss, which in the generic multi-class case with \(n\) classes is formulated as follows: $$L=-\frac{1}{n}\sum_{i=1}^{n}y_{i}\log(p_{i})$$ where \(p_i\) denotes the softmax probability for class \(i\) and \(y_i\) is the ground-truth label. It is by no means the only loss function available; there are others, such as the hinge loss or the logistic loss, but this one is certainly among the most widely used. As stated in Pattern Recognition and Machine Learning[11], using the cross-entropy error function instead of the sum-of-squares for a classification problem leads to faster training as well as improved generalization. After the first two convolutional layers we used MAX POOLING to reduce the resolution of the feature maps and make the network invariant to small translations of the input, keeping only the maximum value of each pooling region. This also increases the receptive field of the deeper neurons, allowing the network to handle more semantic information at a time in subsequent filters and to build progressively more complex feature maps.
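As a quick sanity check of the formula, PyTorch's `nn.CrossEntropyLoss` applies the softmax and the logarithm internally; a minimal example with made-up logits:

```python
import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss()  # combines LogSoftmax and NLLLoss

logits = torch.tensor([[2.0, 0.5, -1.0]])  # raw network outputs for one sample
target = torch.tensor([0])                 # index of the true class
loss = criterion(logits, target)           # = -log(softmax(logits)[0, 0])
print(loss.item())                         # ~0.2413
```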

DROPOUT layers were inserted after both of the last two fully connected layers, just after the nonlinearity; this was the original approach of the authors who proposed this type of layer[12]. More recent research shows the advantages of using this type of layer even after the first convolutions, again after the nonlinearities, although with much lower values of \(p\), around \(0.1\)-\(0.2\)[13]. The purpose of this type of layer is to regularize the network, forcing it to adapt to connections that are randomly dropped with probability \(p\) at each forward step (and only during training). As a result, at each training step the layer to which dropout is applied is seen and treated as having a different number of nodes and a different configuration than in the previous step. The end result is to approximate, or rather simulate, the training of multiple networks with different architectures in parallel.
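The training/evaluation distinction can be seen directly in PyTorch, where `nn.Dropout` zeroes elements only in training mode (and rescales the survivors by \(1/(1-p)\)):

```python
import torch
from torch import nn

drop = nn.Dropout(p=0.5)
x = torch.ones(8)

drop.train()    # training mode: each element is zeroed with probability p
print(drop(x))  # survivors are scaled by 1/(1-p) = 2.0

drop.eval()     # evaluation mode: dropout is a no-op
print(drop(x))  # tensor of ones, unchanged
```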
In addition, a BATCH NORMALIZATION layer was inserted after the first fully connected layer to help the convergence and stability of the network by re-centering and re-scaling the layer's inputs[14]. It was long believed that this practice reduced the internal covariate shift of the parameters, a problem related to network initialization, but more recent studies show that the performance gains are not due to this normalization effect[15]. A recent publication shows that, by using a gradient clipping technique and some tricks in the tuning of certain hyperparameters, the need for batch normalization becomes marginal[16].
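For reference, following [14], batch normalization transforms each activation \(x\) using the batch statistics \(\mu_B\) and \(\sigma_B^2\) plus two learned parameters \(\gamma\) and \(\beta\):

$$\hat{x}=\frac{x-\mu_B}{\sqrt{\sigma_B^{2}+\epsilon}},\qquad y=\gamma\,\hat{x}+\beta$$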
Experiments
Four different trainings were performed:

1-2) ResNet-18:
- Optimizer: Adam and SGD; the best model was chosen.
- Learning rate scheduled from \(1 \times 10^{-3}\) to \(8 \times 10^{-5}\) (see the scheduler sketch after this list)
- Loss: Cross Entropy

3) Custom net 1:
- Optimizer: Adam
- Learning rate fixed at \(1 \times 10^{-3}\)
- Loss: Cross Entropy

4) Custom net 2:
- Optimizer: Adam
- Learning rate scheduled from \(1 \times 10^{-3}\) to \(8 \times 10^{-5}\)
- Loss: Cross Entropy
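The exact scheduler is not specified above; one simple way to obtain such a decay in PyTorch, assuming an exponential schedule over the 30 epochs (a hypothetical reconstruction), is:

```python
import torch
from torch import nn
from torch.optim.lr_scheduler import ExponentialLR

# Dummy model, just to make the snippet runnable
model = nn.Linear(10, 10)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

gamma = (8e-5 / 1e-3) ** (1 / 29)  # ~0.917: decays 1e-3 toward 8e-5 over 30 epochs
scheduler = ExponentialLR(optimizer, gamma=gamma)

for epoch in range(30):
    # ... one training epoch would go here ...
    scheduler.step()
    print(epoch, optimizer.param_groups[0]["lr"])
```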
All nets were trained for a total of 30 epochs, for a twofold reason: on the one hand, training time started to become almost prohibitive beyond that number of epochs; on the other hand, we observed no benefit to the test accuracy metric as the epochs progressed (and as the learning rate decreased, in the scheduled runs). An early stopping technique could also have been implemented, but it was deemed useful to let the training complete all 30 scheduled epochs anyway, to verify that nothing unusual happened late in training. Moreover, the continuous decrease of the learning rate might still have yielded better results.

Custom network results

Let us then analyze the results of this graph: as we can see, after about 10-15 epochs the test accuracy has essentially reached its final value, both for the classifier with the learning rate scheduler and for the one without, settling around 79% for the network with the scheduler versus 77% for the network without. The difference is small, but it remains detectable throughout all training epochs, demonstrating the usefulness of optimizing with a progressively decreasing learning rate. Another thing to take into account is that the accuracy on the training set climbs to 99% in both cases, so both networks end up completely overfitting the training data.

Transfer learning results with ResNet-18

Regarding the results obtained with the second architecture, the one that uses ResNet-18 as a feature extractor, the results unfortunately leave much to be desired. Although a decreasing learning rate was used in this case as well, accuracy reaches its maximum almost immediately: accuracy on the test set peaks close to 48%, while that on the train set settles at a more stable value around 50%. Investigating the problem, I noticed that the network was pre-trained on the ImageNet[10][17] dataset, which consists of millions of images divided into more than 20000 categories. The theory I have formulated is that the relatively shallow depth of the network may have caused a bottleneck in the representational power of the network itself with respect to such a large amount of information, but it is only a hypothesis.
Let us now look at the confusion matrices: for the network trained with the custom architecture we will consider only the model that uses the progressive learning rate, as it reports better results overall.
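As a reference for how such matrices can be produced, here is a small PyTorch helper (hypothetical, not the exact code used for the figures) that accumulates predictions into a 10x10 matrix:

```python
import torch

def confusion_matrix(model, loader, num_classes=10, device="cpu"):
    """Accumulate a confusion matrix: rows = true class, columns = predicted."""
    cm = torch.zeros(num_classes, num_classes, dtype=torch.long)
    model.eval()
    with torch.no_grad():
        for images, labels in loader:
            preds = model(images.to(device)).argmax(dim=1).cpu()
            for t, p in zip(labels, preds):
                cm[t, p] += 1
    return cm
```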

As we see from Fig. 3, the classes are identified with more than good confidence; given the low resolution of the images, it is interesting but predictable to see that the classes most often confused with each other are “cat” with “dog,” “airplane” with “bird” and “ship” (probably because of the often mostly blue background), and “automobile” with “truck,” likely due to visual features they share, such as wheels or doors.

As the example in Fig. 4 shows, not all images are significantly different from each other from a visual point of view; some of them could easily be misclassified even by humans[18]. This explains the false positives off the diagonal of this matrix.

In the latter case, as we expected given the Top-1 accuracy results in Fig. 2, the classifier struggles much more to correctly identify the classes of the various images. We notice the same confusions discussed above, but much more pronounced, and some outliers as well: while we might expect some low-resolution “cat” images to be mistaken for “dog” images, we certainly do not expect “cat” images to be identified as “frog” with a frequency that is half that of the correct classifications.

On computational complexity

I think it is important to mention the issue of computational resources: the custom convolutional network seen in this project was definitely not on par with the depth and performance of the convolutional networks at the top of competitions such as the ILSVRC[19]. As we can see from Fig. 5, more layers, and thus more parameters, do not automatically mean better performance; they could simply mean more problems during optimization. In any case, as the depth of the network increases, it will undoubtedly gain a greater representational capacity, something I was not able to experience directly: even adding a single additional convolutional layer (thus bringing the total to four) already exposed the physical limitations of the computer in my possession, despite its graphics card with 2 GB of dedicated video memory. With the configuration just seen, memory use is around 1.3-1.5 GB during the training phase, with a batch size of 64 images. By reducing this hyperparameter you can reduce memory use although, contrary to my expectations, only rather marginally.

Also related to the ILSVRC, we can see from Fig. 6 how, with ResNet, there has been a “depth revolution” of convolutional networks: the number of layers has exploded, and classification performance has doubled. The real revolution, however, is that all this has not also brought an explosion of the parameters, and thus of the size and complexity of the networks; as we can see from Fig. 5, the number of parameters and the operations required for a forward pass remained quite contained, or even decreased in some cases.

Finally, let us visually explore some of the feature maps extracted from the dataset. Also called “activation maps,” these show the activation response of a given filter as an image passes through the convolutional network. In Fig. 7 we can see the responses of feature map number 6 in the various layers, and how the edge of the airplane's wings is made more and more pronounced.

In Fig. 8, we can see how the 11th feature map succeeds in suppressing some of the background elements as uninformative for identifying the dog class, while maintaining a high response on the contours of the dog's body and on the muzzle.
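Feature maps like those in these figures can be captured with forward hooks; a sketch using torchvision's ResNet-18 layer names (the custom network would be handled analogously):

```python
import torch
from torchvision import models

model = models.resnet18(pretrained=True).eval()
activations = {}

def save_activation(name):
    def hook(module, inputs, output):
        activations[name] = output.detach()  # store the layer's response
    return hook

for name in ["layer1", "layer2", "layer3", "layer4"]:
    getattr(model, name).register_forward_hook(save_activation(name))

image = torch.randn(1, 3, 224, 224)  # stand-in for a preprocessed image
with torch.no_grad():
    model(image)

# e.g. feature map number 6 of the first block, as a 2D tensor to plot
fmap = activations["layer1"][0, 6]
```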

Conclusion
From the results, we can therefore conclude that even a network with a shallow architecture, such as the custom one discussed here, can achieve more than satisfactory accuracy despite its simple nature. At the same time, the growth in complexity and in required computational resources quickly proved an obstacle to experimentation, preventing us from going beyond just 3 convolutional layers.

As for the network obtained from the pre-trained ResNet, performance is unfortunately disappointing. As discussed earlier, the weights of the non-fully-connected layers of this network come from training on a dataset different from the one used for the rest of the experiment, namely ImageNet[10]. Only the last dense layer was retrained to fit the task; however, this approach did not prove productive. Nevertheless, this absolutely does not mean that the transfer learning technique is not useful; rather, it means that it should be used with more care, for example by fine tuning the parameters instead of keeping them frozen; in other words, starting from a minimum and moving toward a better minimum through multiple optimization steps.
Finally, it is worth mentioning that the classes that put the network most in difficulty are, of course, those that are visually closest, such as cat and dog, or car and truck; and, due to a shared background, airplane is also confused a significant number of times with images of the bird and ship classes; an example is shown in Fig. 4.
Incremental work on these results would be conceptually very easy to realize, and could include increasing the number of convolutional layers, removing one of the fully connected layers, and better fine tuning of the parameters in the transfer learning case. In practice, this would require additional hardware resources, given what was discussed above.
References

1. Alex Krizhevsky. The CIFAR-10 dataset.
2. W. T. Freeman, R. Fergus, A. Torralba. 80 Million Tiny Images: A Large Data Set for Nonparametric Object and Scene Recognition.
3. “convolutional neural network image classification”, Google Scholar.
4. Karen Simonyan, Andrew Zisserman. Very Deep Convolutional Networks for Large-Scale Image Recognition.
5. Alex Krizhevsky, Ilya Sutskever, Geoffrey E. Hinton. ImageNet Classification with Deep Convolutional Neural Networks.
6. Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, Andrew Rabinovich. Going Deeper with Convolutions.
7. Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun. Deep Residual Learning for Image Recognition.
8. Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun. Identity Mappings in Deep Residual Networks.
9. Yanping Huang, Youlong Cheng, Ankur Bapna, Orhan Firat, Mia Xu Chen, Dehao Chen, HyoukJoong Lee, Jiquan Ngiam, Quoc V. Le, Yonghui Wu, Zhifeng Chen. GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism.
10. The Torchvision library authors and community. Torchvision Models.
11. Christopher M. Bishop. Pattern Recognition and Machine Learning (Information Science and Statistics).
12. Geoffrey E. Hinton, Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever, Ruslan R. Salakhutdinov. Improving Neural Networks by Preventing Co-adaptation of Feature Detectors.
13. Sungheon Park, Nojun Kwak. Analysis on the Dropout Effect in Convolutional Neural Networks.
14. Sergey Ioffe, Christian Szegedy. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift.
15. Shibani Santurkar, Dimitris Tsipras, Andrew Ilyas, Aleksander Madry. How Does Batch Normalization Help Optimization?
16. Andrew Brock, Soham De, Samuel L. Smith, Karen Simonyan. High-Performance Large-Scale Image Recognition Without Normalization.
17. Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, Li Fei-Fei. ImageNet: A Large-Scale Hierarchical Image Database.
18. After asking her to classify the pictures of the figure, a friend of mine mistakenly labeled the last two pictures as “cat” and “dog” respectively, thus reversing the classes.
19. Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, Li Fei-Fei. ImageNet Large Scale Visual Recognition Challenge.