Must-read Path-breaking Papers About Image Classification

The top-5 error rate of neural network architectures for image classification in the ILSVRC has declined steeply over the past few years

Deep Learning models for image classification have achieved a steep decline in error rate over the last few years, and Deep Learning has become a prime focus area for AI research. However, Deep Learning has been around for a few decades now. Yann LeCun presented a paper pioneering Convolutional Neural Networks (CNNs) in 1998. But it wasn’t until the start of the current decade that Deep Learning really took off. The recent disruption can be attributed to increased processing power (GPUs), the availability of abundant data (such as the ImageNet dataset) and new algorithms and techniques. It all started in 2012 with AlexNet, a large, deep Convolutional Neural Network which won the annual ImageNet Large Scale Visual Recognition Challenge (ILSVRC). ILSVRC is a competition where research teams evaluate their algorithms on a given data set and compete to achieve higher accuracy on several visual recognition tasks.
Since then, variants of CNNs have dominated the ILSVRC and have surpassed the level of human accuracy, which is considered to lie in the 5-10% error range.
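
To make the error metric concrete: under the top-5 criterion, a prediction counts as correct if the true label is among the model's five highest-scoring classes. The following is a minimal sketch in plain Python. The helper name `top_k_error` and the scores are made up for illustration and are not taken from ILSVRC tooling.

```python
def top_k_error(scores, labels, k=5):
    """Fraction of examples whose true label is NOT among the k highest-scoring classes."""
    misses = 0
    for class_scores, true_label in zip(scores, labels):
        # Class indices sorted by descending score; keep the first k.
        ranked = sorted(range(len(class_scores)),
                        key=lambda c: class_scores[c], reverse=True)
        if true_label not in ranked[:k]:
            misses += 1
    return misses / len(labels)


# Toy data (illustrative only): 3 images, 6 classes.
scores = [
    [0.05, 0.10, 0.50, 0.20, 0.10, 0.05],
    [0.30, 0.25, 0.20, 0.15, 0.08, 0.02],
    [0.01, 0.02, 0.03, 0.04, 0.10, 0.80],
]
labels = [2, 1, 0]

top1 = top_k_error(scores, labels, k=1)
top5 = top_k_error(scores, labels, k=5)
print(f"top-1 error: {top1:.2f}, top-5 error: {top5:.2f}")
```

The second image is a top-1 miss but a top-5 hit, which is why top-5 error is always less than or equal to top-1 error.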

For us as humans, it is very easy to understand the contents of an image. For example, while watching a movie (like Lord of The Rings) I just need to see one example of a Dwarf, and that allows me to identify other dwarves without any effort. However, for a machine, the task is extremely challenging because all it can see in an image is an array of numbers. If the task is to identify a cat in an image, you can appreciate the difficulty of finding a cat in this vast array of numbers. Also, cats come in all shapes, sizes, colors and poses, making the task even more challenging.
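
As a rough illustration of what "an array of numbers" means, here is a tiny hypothetical 4x4 grayscale image in plain Python. The pixel values are invented. The sketch only shows that a small shift changes many raw numbers even though a person would still see the same pattern.

```python
# A hypothetical 4x4 grayscale "image": each number is a pixel intensity
# (0 = black, 255 = white).
image = [
    [0, 0, 255, 255],
    [0, 0, 255, 255],
    [255, 255, 0, 0],
    [255, 255, 0, 0],
]

height, width = len(image), len(image[0])

# Intensities are commonly rescaled to the range [0, 1] before processing.
normalized = [[pixel / 255 for pixel in row] for row in image]

# Cyclically shift every row one pixel to the right.
shifted = [[row[-1]] + row[:-1] for row in image]

# Count how many raw pixel values differ between the two versions.
changed = sum(a != b
              for row_a, row_b in zip(image, shifted)
              for a, b in zip(row_a, row_b))
print(f"{changed} of {height * width} pixel values changed")
```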

How we see objects vs how a machine sees them

Based on more than four years of experience with Deep Learning, we are listing some path-breaking research papers that are a must-read for anyone associated with computer vision. In this blog post we focus specifically on image classification; following posts will cover other areas such as object detection and localization.
We have also added our two cents about some upcoming algorithms that have the potential to shape the future of computer vision research.

Path-breaking Research Papers on Image Classification


AlexNet

In ILSVRC 2012, Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton presented AlexNet, a deep CNN. AlexNet achieved a top-5 error rate of 15.3%, beating the second-best entry (26.2%) by more than 10 percentage points. This impressive feat took the whole Computer Vision community by storm and made Deep Learning and CNNs the disruptions they are today.

An illustration of the architecture of our CNN, explicitly showing the delineation of responsibilities between the two GPUs. One GPU runs the layer-parts at the top of the figure while the other runs the layer-parts at the bottom.

This was the first time a model performed so well on the historically difficult ImageNet dataset. AlexNet laid the foundation for advanced Deep Learning. It is still one of the most highly cited papers in Deep Learning, with about 7,000 citations.


ZFNet

Matthew D. Zeiler (founder of Clarifai) and Rob Fergus won the ILSVRC in 2013, outperforming AlexNet by reducing the error rate to 11.2%. ZFNet introduced a novel visualization technique that gives insight into the function of intermediate feature layers and the operation of the classifier, neither of which AlexNet offered.

Network architecture of ZFNet

ZFNet opened up the possibility of examining different feature activations and their relation to the input space, using a technique called the Deconvolutional Network.


VGG Net

Karen Simonyan and Andrew Zisserman of the University of Oxford created a deep CNN that placed second in the image classification task of ILSVRC 2014. VGG Net showed that a significant improvement over prior-art configurations can be achieved by increasing the depth to 16-19 weight layers, substantially deeper than anything used before.

Macro-architecture of VGG Net. Credits: Davi Fossard

The architecture was praised because it was much simpler to understand than GoogLeNet (the winner of ILSVRC 2014) yet still achieved strong accuracy. Its feature maps are now widely used in transfer learning and in other algorithms that require pre-trained networks, such as many GANs.
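
One reason the extra depth stays affordable is the VGG paper's use of small 3x3 convolution filters, a detail not covered above. A stack of three 3x3 layers covers the same 7x7 receptive field as a single 7x7 layer while using fewer weights. Below is a rough sketch of that arithmetic. Biases are ignored, and the helper names and the channel count are illustrative assumptions.

```python
def conv_weights(kernel_size, in_channels, out_channels):
    """Number of weights in one 2D convolution layer (biases ignored)."""
    return kernel_size * kernel_size * in_channels * out_channels


def receptive_field(num_layers, kernel_size):
    """Receptive field of a stack of stride-1 convolutions with equal kernel size."""
    return 1 + num_layers * (kernel_size - 1)


channels = 64  # illustrative value, same width in and out

stacked_3x3 = 3 * conv_weights(3, channels, channels)  # three 3x3 layers
single_7x7 = conv_weights(7, channels, channels)       # one 7x7 layer

print("receptive fields:", receptive_field(3, 3), receptive_field(1, 7))
print("weights:", stacked_3x3, "vs", single_7x7)
```

The stacked version also applies three non-linearities instead of one, which is part of the argument for going deeper with small filters.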


GoogLeNet

The winners of ILSVRC 2014, Christian Szegedy et al., presented a 22-layer neural network called GoogLeNet. It is a type of Inception model and solidified Google’s position in the Computer Vision space. GoogLeNet achieved an error rate of 6.7%. The main hallmark of this architecture is the improved utilization of the computing resources inside the network. This was achieved by a carefully crafted design that allows for increasing the depth and width of the network while keeping the computational budget constant. GoogLeNet introduced the concept of the Inception module, in which not everything happens sequentially as in previous architectures; instead, certain pieces of the network run in parallel.

A schematic representation of GoogLeNet architecture with the highlighted box being the inception module.

Notably, GoogLeNet’s error rate approached human performance, which lies in the 5-10% error range. GoogLeNet was one of the first models to show that CNN layers do not always have to be stacked sequentially. The Inception module demonstrated that a creative and careful structuring of layers improves both performance and computational efficiency.
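
To give a feel for how parallel branches keep the computational budget in check, here is a small weight-count sketch of one Inception-style module. The branch widths are illustrative values modeled on an early GoogLeNet module as we recall it, so treat them as assumptions. Biases are ignored and the helper names are our own.

```python
def conv_weights(kernel_size, in_channels, out_channels):
    """Number of weights in one 2D convolution layer (biases ignored)."""
    return kernel_size * kernel_size * in_channels * out_channels


def branch_weights(layers, in_channels):
    """Total weights of a branch given as a list of (kernel_size, out_channels)."""
    total = 0
    for kernel_size, out_channels in layers:
        total += conv_weights(kernel_size, in_channels, out_channels)
        in_channels = out_channels
    return total


in_ch = 192  # channels entering the module (assumed)

# Four branches that all read the same input and run in parallel.
branches = {
    "1x1": [(1, 64)],
    "3x3": [(1, 96), (3, 128)],  # 1x1 reduction, then 3x3 convolution
    "5x5": [(1, 16), (5, 32)],   # 1x1 reduction, then 5x5 convolution
    "pool": [(1, 32)],           # max-pool (no weights), then 1x1 projection
}

module_weights = sum(branch_weights(layers, in_ch) for layers in branches.values())

# Branch outputs are concatenated along the channel axis.
output_channels = sum(layers[-1][1] for layers in branches.values())

# Cost of the 3x3 and 5x5 branches with and without the 1x1 reductions.
reduced = branch_weights(branches["3x3"], in_ch) + branch_weights(branches["5x5"], in_ch)
direct = conv_weights(3, in_ch, 128) + conv_weights(5, in_ch, 32)

print(module_weights, output_channels, reduced, direct)
```

In this sketch the 1x1 reductions cut the weights of the two expensive branches to well under half of the direct version, which is the kind of saving that lets the network grow wider and deeper.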


ResNet

Microsoft’s ResNet, developed by Kaiming He, Xiangyu Zhang, Shaoqing Ren and Jian Sun, is a residual learning framework that eases the training of networks substantially deeper than those used previously. The authors provided comprehensive empirical evidence showing that these residual networks are easier to optimize and can gain accuracy from considerably increased depth.

A residual block in ResNet architecture.

With a new 152-layer network architecture, ResNet achieved an error rate of 3.57%, surpassing human performance and setting new records in classification, detection, and localization with a single architecture.
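
The core idea of a residual block is that the layers learn a residual F(x) which is added back to the input, so the block computes F(x) + x. The sketch below is a simplification under stated assumptions: it uses small fully connected layers on plain Python lists instead of convolutions, and the function names are our own.

```python
def linear(weights, x):
    """Matrix-vector product for a weight matrix stored as a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def relu(x):
    return [max(0.0, v) for v in x]


def residual_block(x, w1, w2):
    """Compute relu(F(x) + x) with F(x) = w2 * relu(w1 * x)."""
    fx = linear(w2, relu(linear(w1, x)))
    # The skip connection adds the unchanged input back onto F(x).
    return relu([f + xi for f, xi in zip(fx, x)])


x = [1.0, 2.0, 3.0]
zeros = [[0.0] * 3 for _ in range(3)]
identity = [[1.0 if i == j else 0.0 for j in range(3)] for i in range(3)]
half = [[0.5 if i == j else 0.0 for j in range(3)] for i in range(3)]

# With all-zero weights F(x) is zero, so the block passes x through unchanged.
passthrough = residual_block(x, zeros, zeros)

# With non-zero weights the block adds a learned correction on top of x.
corrected = residual_block(x, identity, half)
print(passthrough, corrected)
```

Because a block with zero weights already acts as the identity for non-negative inputs, adding more blocks does not have to make the mapping harder to learn, which is one intuition for why very deep residual networks are easier to optimize.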

Wide ResNets

Sergey Zagoruyko and Nikos Komodakis presented this paper in 2016 with a detailed experimental study of the architecture of ResNet blocks. Based on it, they proposed a novel architecture that decreases the depth of the network and increases the width of its residual blocks, where increasing width means using more feature maps in the residual layers. Although common wisdom says that this might cause the network to overfit, it actually works.

Various residual blocks used by the authors

The authors named the resulting network structures Wide Residual Networks (WRNs) and showed that they were far superior to their commonly used thin and very deep counterparts. A Wide ResNet can have 2-12x as many feature maps in its convolutional layers as a standard ResNet.
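
Widening is not free: the number of weights in a block grows with the square of the widening factor. Here is a small sketch of that arithmetic for a block with two 3x3 convolutions. The base width of 16 and the helper name are assumptions for illustration, and biases are ignored.

```python
def basic_block_weights(base_channels, widen_factor):
    """Weights in a block of two 3x3 convolutions with equal input/output width."""
    width = base_channels * widen_factor
    return 2 * 3 * 3 * width * width


base = 16  # assumed base number of feature maps

for k in (1, 2, 4, 10):
    weights = basic_block_weights(base, k)
    ratio = weights // basic_block_weights(base, 1)
    print(f"widening factor {k:>2}: {weights:>7} weights ({ratio}x the thin block)")
```

Wide layers are nevertheless a good fit for GPUs, since large convolutions parallelize well, which helps explain how a shallower and wider network can still train quickly.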


ResNeXt

ResNeXt secured second place in ILSVRC 2016. It is a simple, highly modularized network architecture for image classification. The ResNeXt design results in a homogeneous, multi-branch architecture that has only a few hyper-parameters to set.

A block of ResNeXt (right) compared to a block of ResNet (left)

This strategy exposes a new dimension, which the authors named “cardinality” (the size of the set of transformations), as an essential factor in addition to the dimensions of depth and width. According to the authors, increasing cardinality is more effective than going deeper or wider when increasing capacity. Accordingly, ResNeXt fared better than both ResNets and Wide ResNets in accuracy.
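
Cardinality can be raised without raising the parameter budget. The sketch below compares the weight count of a ResNet-style bottleneck with a ResNeXt-style one that splits its 3x3 convolution into 32 groups. The channel widths follow the example configuration from the ResNeXt paper as we recall it, so treat them as assumptions. Biases are ignored and the helper name is our own.

```python
def bottleneck_weights(in_channels, inner_width, cardinality):
    """Weights in a 1x1 -> grouped 3x3 -> 1x1 bottleneck block (biases ignored)."""
    reduce = in_channels * inner_width
    # A grouped convolution splits the channels into `cardinality` independent
    # groups, so each output channel only sees inner_width / cardinality inputs.
    grouped = 3 * 3 * (inner_width // cardinality) * inner_width
    expand = inner_width * in_channels
    return reduce + grouped + expand


resnet_block = bottleneck_weights(256, 64, cardinality=1)
resnext_block = bottleneck_weights(256, 128, cardinality=32)

print("ResNet bottleneck: ", resnet_block)
print("ResNeXt bottleneck:", resnext_block)
```

Both blocks come out at roughly 70k weights, so the comparison between the two designs is made at similar complexity.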


DenseNet

Dense Convolutional Networks (DenseNets), developed by Gao Huang, Zhuang Liu, Kilian Q. Weinberger and Laurens van der Maaten in 2016, connect each layer to every other layer in a feed-forward fashion. For each layer, the feature-maps of all preceding layers are used as inputs, and its own feature-maps are used as inputs into all subsequent layers.

A 5-layer dense block. Each layer takes all preceding feature-maps as input.

DenseNets have several compelling advantages such as alleviating the vanishing-gradient problem, strengthening the feature propagation, encouraging feature reuse, and substantially reducing the number of parameters. DenseNets outperformed ResNets whilst requiring less memory and computation to achieve high performance.
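
The dense connectivity pattern can be sketched with plain Python lists. In this toy version each "layer" is a stand-in that emits a fixed number of new features (the growth rate), each equal to the mean of its input, whereas a real DenseNet layer would be a convolution. The function name and all numbers are illustrative assumptions.

```python
def dense_block(x, num_layers, growth_rate):
    """Toy dense block: every layer reads the concatenation of all earlier outputs."""
    features = [x]    # the block input plus the output of every layer so far
    input_sizes = []  # how many features each layer receives
    connections = 0   # direct connections feeding the layers
    for _ in range(num_layers):
        connections += len(features)
        # Concatenate all preceding feature-maps into one input.
        concatenated = [v for fmap in features for v in fmap]
        input_sizes.append(len(concatenated))
        # Stand-in for a real layer: emit `growth_rate` new features.
        mean = sum(concatenated) / len(concatenated)
        features.append([mean] * growth_rate)
    output = [v for fmap in features for v in fmap]
    return output, input_sizes, connections


out, sizes, connections = dense_block([1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
                                      num_layers=5, growth_rate=4)
print("features seen by each layer:", sizes)
print("direct connections:", connections)
```

Each layer only adds a small number of new features and reuses everything computed before it, which is where the feature reuse and the reduced parameter count come from.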


New architectures with promising future potential

Variants of CNNs are likely to keep dominating architecture design for image classification. Attention modules and SENets are likely to become more important in due course.


Squeeze-and-Excitation Networks

The winning entry of ILSVRC 2017, Squeeze-and-Excitation Networks (SENet), is built on Squeeze, Excitation and Scaling operations. Rather than introducing a new spatial dimension for the integration of feature channels, SENet adopts a new “feature re-calibration” strategy.

A schematic representation of SENet model: Squeeze, Excitation and Scaling Operations

The authors explicitly modeled the interdependence between feature channels. SENet is trained to learn the importance of each feature channel automatically and uses this importance to enhance useful features. In the ILSVRC 2017 contest, the SENet model obtained an incredible 2.251% top-5 error rate on the test set.
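
The three operations can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: feature maps are nested lists, the excitation step is two tiny fully connected layers with a ReLU and a sigmoid, and the weights are hand-picked rather than learned.

```python
import math


def se_block(feature_maps, w_reduce, w_expand):
    """feature_maps: list of C channels, each an H x W list of lists."""
    # Squeeze: global average pooling yields one number per channel.
    squeezed = [sum(sum(row) for row in ch) / (len(ch) * len(ch[0]))
                for ch in feature_maps]
    # Excitation: FC -> ReLU -> FC -> sigmoid gives a gate in (0, 1) per channel.
    hidden = [max(0.0, sum(w * s for w, s in zip(row, squeezed)))
              for row in w_reduce]
    gates = [1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
             for row in w_expand]
    # Scaling: multiply every value in a channel by that channel's gate.
    scaled = [[[g * v for v in row] for row in ch]
              for ch, g in zip(feature_maps, gates)]
    return scaled, gates


# Two 2x2 channels and hand-picked weights (illustrative only).
fmaps = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[0.0, 0.0], [0.0, 8.0]],
]
w_reduce = [[1.0, -1.0]]     # 2 channels -> 1 hidden unit
w_expand = [[2.0], [-2.0]]   # 1 hidden unit -> 2 channel gates

scaled, gates = se_block(fmaps, w_reduce, w_expand)
print("channel gates:", gates)
```

With these weights the first channel is emphasized and the second is damped. In a trained SENet the gates depend on the input image, so the re-calibration adapts to each example.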

Residual Attention Networks

The Residual Attention Network is a convolutional neural network that uses an attention mechanism and can be incorporated into state-of-the-art feed-forward network architectures in an end-to-end training fashion. Attention residual learning is used to train very deep Residual Attention Networks, which can easily be scaled up to hundreds of layers.
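
Attention residual learning can be summarized as H(x) = (1 + M(x)) * F(x), where F(x) are the trunk features and M(x) is a soft attention mask with values between 0 and 1. A minimal elementwise sketch on plain lists follows. The function names and the numbers are our own illustrative choices.

```python
def attention_residual(trunk, mask):
    """H = (1 + M) * F, elementwise; mask values are expected to lie in [0, 1]."""
    return [(1.0 + m) * f for f, m in zip(trunk, mask)]


def naive_attention(trunk, mask):
    """Plain masking, H = M * F, shown for comparison."""
    return [m * f for f, m in zip(trunk, mask)]


trunk_features = [0.5, -1.0, 2.0]
mask = [0.0, 0.5, 1.0]

residual_out = attention_residual(trunk_features, mask)
naive_out = naive_attention(trunk_features, mask)
print("attention residual:", residual_out)
print("naive masking:     ", naive_out)
```

Where the mask is zero, naive masking erases the feature while the residual form passes it through unchanged. This is why stacking many attention modules does not progressively wipe out the signal.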

Residual Attention Network Classification Illustration: Selected images illustrating that different features have different corresponding attention masks in Residual Attention Network. The sky mask diminishes low-level background blue color features. The balloon instance mask highlights high-level balloon bottom part features.


The Path Forward


Credits: Waitbutwhy

Today, a computer you can buy for $1000 has roughly 1/1000th of the processing capacity of the human brain. If Moore’s law continues to hold, the same money is projected to buy the computing power of one human brain by 2025 and of all of humanity by 2050. AI’s effectiveness will only accelerate with time. As the availability of data and processing power is no longer holding researchers back, we can expect the accuracy of Deep Learning models used for image classification to keep improving. As a premier applied AI research group, we are here to be a part of this revolution.

ParallelDots AI APIs are a Deep Learning powered web service by ParallelDots Inc that can comprehend a huge amount of unstructured text and visual content to empower your products. You can check out some of our text analysis APIs and reach out to us by filling this form here or write to us at
