Getting Started with ‘Street View House Numbers’ (SVHN) Dataset

SVHN is a relatively new and popular dataset, a natural next step from MNIST and a complement to other popular computer vision datasets. This post gives an overview of the common preprocessing techniques, the best reported benchmark results, and the state-of-the-art neural network architectures applied to it. It should be useful for anyone considering testing their algorithms on SVHN.

We have previously discussed our experiments using the MNIST dataset. For the next phase, we have begun experimenting with the Street View House Numbers (SVHN) dataset to test the robustness of our algorithms. This is mainly because there is no more headroom in MNIST, so a more difficult dataset such as SVHN is required. We have recently implemented and open-sourced a preprocessing tool for SVHN to help get the experiments started. The next step is to survey the available literature and examine the best practices for preprocessing and neural network architectures.

Dataset Information

The Street View House Numbers (SVHN) dataset is a real-world image dataset used for developing machine learning and object recognition algorithms. It is one of the commonly used benchmark datasets as it requires minimal data preprocessing and formatting. Although it shares some similarities with MNIST, in that the images are of small cropped digits, SVHN incorporates an order of magnitude more labelled data (over 600,000 digit images). It also comes from a significantly harder real-world problem: recognising digits and numbers in natural scene images. The images lack any contrast normalisation and contain overlapping digits and distracting features, which makes it a much more difficult problem than MNIST.

Examples of the images in the SVHN dataset

The dataset consists of 73,257 digits for training and 26,032 digits for testing. It also comes with an additional 531,131 somewhat less difficult samples that can be used as extra training data. It is recommended to use the full dataset (about 630K images across the three sets) when evaluating algorithms, as this is common practice in the majority of the literature.
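As a rough illustration of working with these splits, the sketch below assumes the cropped-digit version of the dataset, which is distributed as MATLAB `.mat` files holding an image array `X` of shape (32, 32, 3, N) and a label array `y` of shape (N, 1) in which the digit 0 is stored as class 10. The function name `to_samples_first` is hypothetical, and synthetic arrays stand in for the real files.

```python
import numpy as np

# Split sizes quoted above; the total is the "630K" figure.
TRAIN, EXTRA, TEST = 73_257, 531_131, 26_032
TOTAL = TRAIN + EXTRA + TEST


def to_samples_first(X, y):
    """Convert SVHN-style arrays to a samples-first layout.

    Assumes X has shape (32, 32, 3, N) and y has shape (N, 1),
    with the digit 0 stored as label 10.
    """
    images = np.transpose(X, (3, 0, 1, 2))    # -> (N, 32, 32, 3)
    labels = y.reshape(-1).astype(np.int64)   # astype returns a copy
    labels[labels == 10] = 0                  # remap class 10 to digit 0
    return images, labels


# Synthetic stand-in for the real data. With the files on disk, the
# arrays could be read with scipy.io.loadmat("train_32x32.mat").
X = np.zeros((32, 32, 3, 4), dtype=np.uint8)
y = np.array([[1], [10], [5], [10]], dtype=np.uint8)
images, labels = to_samples_first(X, y)
```

Moving the sample axis to the front is a convenience rather than a requirement; most Python tooling expects one image per leading index.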

Preprocessing Techniques

Preprocessing techniques are typically used to ensure the data is in a suitable format and within a consistent scale or range. We explored the commonly used preprocessing techniques for SVHN in order to maximise the performance of our algorithms and ensure there is a fair comparison of results.

In the original paper that introduced the SVHN dataset [1], Netzer et al. used minimal preprocessing, only converting the images to grayscale. Most subsequent work on SVHN has used additional preprocessing techniques.
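A grayscale conversion can be sketched in a few lines. The weights below are the ITU-R BT.601 luma coefficients, which is an assumption on our part: the exact conversion used in [1] is not stated here. The function name `rgb_to_grayscale` is illustrative.

```python
import numpy as np


def rgb_to_grayscale(images):
    """Weighted sum over the colour axis of an (N, H, W, 3) array.

    The BT.601 luma weights are an assumed choice, not necessarily
    the conversion used in the original paper.
    """
    weights = np.array([0.299, 0.587, 0.114])
    return images.astype(np.float64) @ weights   # -> (N, H, W)


# Two all-white 32x32 RGB images as a quick sanity check.
white = np.full((2, 32, 32, 3), 255, dtype=np.uint8)
gray = rgb_to_grayscale(white)
```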

Normalising the intensity in the data through mean subtraction was used in [6, 8]. Global contrast normalisation and ZCA whitening were used in [7]. Local contrast normalisation was used in [3, 11]. In [2], both global and local contrast normalisation were used to preprocess the images.
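To make three of these techniques concrete, here is a minimal NumPy sketch of mean subtraction, global contrast normalisation and ZCA whitening on flattened images of shape (N, D). The function names, the `eps` values and the per-image standard deviation used for contrast normalisation are our own assumptions; the cited papers may differ in the details. Local contrast normalisation is not shown.

```python
import numpy as np


def subtract_mean(X):
    """Subtract the per-feature mean computed over all samples."""
    return X - X.mean(axis=0)


def global_contrast_normalise(X, scale=1.0, eps=1e-8):
    """Per image: remove its mean, then divide by its standard deviation."""
    X = X - X.mean(axis=1, keepdims=True)
    std = X.std(axis=1, keepdims=True)
    return scale * X / np.maximum(std, eps)   # eps guards flat images


def zca_whiten(X, eps=1e-5):
    """Decorrelate features while staying close to the original pixel space."""
    mean = X.mean(axis=0)
    Xc = X - mean
    cov = Xc.T @ Xc / Xc.shape[0]
    eigvals, eigvecs = np.linalg.eigh(cov)    # covariance is symmetric
    W = eigvecs @ np.diag(1.0 / np.sqrt(eigvals + eps)) @ eigvecs.T
    return Xc @ W, mean, W


# Small synthetic data set with one correlated pair of features.
rng = np.random.default_rng(0)
data = rng.normal(size=(500, 8))
data[:, 1] += 0.5 * data[:, 0]

centred = subtract_mean(data)
gcn = global_contrast_normalise(data)
whitened, mean, W = zca_whiten(data)
```

In practice the mean and the whitening matrix `W` would be fitted on the training set and then reused unchanged on the test set, so that both are transformed identically.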

Architectures Overview

This is a brief description of some of the unsupervised and supervised learning architectures that have been applied to the SVHN dataset, including autoencoders and convolutional neural networks. When designing new algorithms, it is useful to survey the various architectures and approaches in order to determine which are most effective and to compare the algorithms fairly.

Unsupervised Learning

Netzer et al. introduced the dataset in [1] and evaluated a stacked sparse auto-encoder (SSAE) and a convolutional K-means algorithm on it. They also estimated the human error rate to be about 2%. Classification was performed using an SVM, giving a baseline performance that later papers significantly improved upon.

Makhzani & Frey proposed a stacked convolutional winner-take-all autoencoder (Stacked Conv-WTA Autoencoder) in [2], which combines the benefits of autoencoders and convolutional architectures for learning shift-invariant sparse representations. An SVM was used to classify the learned representation, in a similar fashion to the original paper [1]. These experiments were particularly interesting to us as they are in line with our ongoing research and experiments on sparse coding and unsupervised learning.
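The winner-take-all idea can be illustrated with a small sketch of spatial sparsity, in which only the largest activation in each feature map is kept and the rest are set to zero. This is our simplified reading of the technique rather than the authors' implementation, and the function name `spatial_winner_take_all` is hypothetical.

```python
import numpy as np


def spatial_winner_take_all(feature_maps):
    """Keep only the largest activation in each feature map.

    feature_maps has shape (N, H, W, C). Ties keep every tied
    position, which is a simplification made for this sketch.
    """
    n, h, w, c = feature_maps.shape
    flat = feature_maps.reshape(n, h * w, c)
    winners = flat.max(axis=1, keepdims=True)   # one winner per map
    mask = flat == winners
    return (flat * mask).reshape(n, h, w, c)


# One 2x2 feature map with a single channel.
maps = np.array([[1.0, 3.0], [2.0, 0.0]]).reshape(1, 2, 2, 1)
sparse = spatial_winner_take_all(maps)
```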

Supervised Learning

Liang and Hu proposed a Recurrent Convolutional Neural Network (RCNN) model in [6]. The RCNN architecture, not to be confused with Region-based Convolutional Neural Networks (R-CNN), is motivated by two observations: recurrent connections are abundant in the visual system of the brain, and context is important in object recognition. Typical CNN models can only capture context in high-level layers with larger receptive fields, and this information cannot modulate the activities of earlier layers.
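A heavily simplified, single-channel sketch of a recurrent convolutional layer may help show how recurrence brings in context. It assumes that the layer state at each step is a ReLU of the feed-forward response plus a convolution of the previous state, and it omits the normalisation and multi-channel details of [6]. All names here are illustrative.

```python
import numpy as np


def conv2d_same(x, kernel):
    """Single-channel cross-correlation with zero padding (odd kernels)."""
    kh, kw = kernel.shape
    padded = np.pad(x, ((kh // 2, kh // 2), (kw // 2, kw // 2)))
    out = np.zeros_like(x, dtype=np.float64)
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            out[i, j] = np.sum(padded[i:i + kh, j:j + kw] * kernel)
    return out


def recurrent_conv_layer(u, w_forward, w_recurrent, bias=0.0, steps=3):
    """Unroll the layer: the input is fixed, the state is refined each step."""
    feed_forward = conv2d_same(u, w_forward) + bias
    state = np.maximum(feed_forward, 0.0)        # step 0: feed-forward only
    for _ in range(steps):
        recurrent = conv2d_same(state, w_recurrent)
        state = np.maximum(feed_forward + recurrent, 0.0)
    return state


# A single bright pixel shows how the region of influence grows per step.
u = np.zeros((7, 7))
u[3, 3] = 1.0
identity = np.zeros((3, 3))
identity[1, 1] = 1.0
spread = np.ones((3, 3))

after_0 = recurrent_conv_layer(u, identity, spread, steps=0)
after_1 = recurrent_conv_layer(u, identity, spread, steps=1)
after_2 = recurrent_conv_layer(u, identity, spread, steps=2)
```

With a 3x3 recurrent kernel, the area influenced by the single input pixel widens by one pixel in every direction at each step, so a unit in the same layer sees progressively more context without any additional layers.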

Lee, Gallagher and Tu [3] proposed improving deep neural networks by generalising the pooling operations that are common in convolutional neural network architectures. They proposed two distinct approaches: combining max and average pooling via a learned pooling function, and learning a pooling function that consists of a tree-structured fusion of learned pooling filters. They then combined both approaches in a single architecture, which achieved state-of-the-art results on the SVHN dataset.
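The first approach can be sketched as follows. In the mixed variant a single mixing proportion `a` blends max and average pooling; in the gated variant the proportion is computed per pooling region from a sigmoid of a weighted sum of that region's values. In [3] these parameters are learned, whereas here they are fixed inputs, and the helper names are hypothetical. Tree pooling is not shown.

```python
import numpy as np


def pool_regions(x, size=2):
    """Split an (H, W) map into non-overlapping size x size regions."""
    h, w = x.shape
    x = x[: h - h % size, : w - w % size]        # drop any ragged border
    rows, cols = x.shape[0] // size, x.shape[1] // size
    blocks = x.reshape(rows, size, cols, size).transpose(0, 2, 1, 3)
    return blocks.reshape(rows, cols, size * size)


def mixed_pool(x, a, size=2):
    """a * max pooling + (1 - a) * average pooling, with a fixed a."""
    r = pool_regions(x, size)
    return a * r.max(axis=-1) + (1.0 - a) * r.mean(axis=-1)


def gated_pool(x, w, size=2):
    """The mixing proportion depends on the content of each region."""
    r = pool_regions(x, size)
    gate = 1.0 / (1.0 + np.exp(-(r @ w)))        # sigmoid(w . region)
    return gate * r.max(axis=-1) + (1.0 - gate) * r.mean(axis=-1)


x = np.array([[1.0, 2.0, 5.0, 6.0],
              [3.0, 4.0, 7.0, 8.0],
              [0.0, 0.0, 1.0, 1.0],
              [0.0, 4.0, 1.0, 1.0]])

half = mixed_pool(x, a=0.5)
max_only = mixed_pool(x, a=1.0)
gated = gated_pool(x, w=np.zeros(4))   # zero weights give a gate of 0.5
```

Setting `a` to 1 or 0 recovers plain max or average pooling, so both standard operations are special cases of the mixed form.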


The best results achieved by these architectures are compared in the table below. Publicly available benchmarks for SVHN and other datasets can be found here. The current state-of-the-art result according to those benchmarks is a 1.69% test error produced by [3], which is lower than the 2% human error rate estimated in [1].

| Source | Model | Test Accuracy | Test Error |
| --- | --- | --- | --- |
| Netzer et al. [1] | Stacked Sparse Auto-encoders (SSAE) | 89.70% | 10.30% |
| Netzer et al. [1] | Convolutional K-means | 90.60% | 9.40% |
| Makhzani & Frey [2] | Stacked CONV-WTA Autoencoder | 93.10% | 6.90% |
| Liang & Hu [6] | RCNN-192 | 98.23% | 1.77% |
| Lee, Gallagher & Tu [3] | Mixed Max-Average Pooling | 98.24% | 1.76% |
| Lee, Gallagher & Tu [3] | Gated Max-Average Pooling | 98.26% | 1.74% |
| Lee, Gallagher & Tu [3] | Tree Pooling | 98.30% | 1.70% |
| Lee, Gallagher & Tu [3] | Tree + Max-Avg | 98.31% | 1.69% |


SVHN is a very large and extensive dataset that comes from a significantly more difficult problem, where images contain a lot of clutter and noisy features. It seems to be underutilised in the literature compared to MNIST, CIFAR-10 and CIFAR-100. Unlike with MNIST and some other datasets, preprocessing is common practice for SVHN and is very important when comparing results fairly. Some form of contrast normalisation, in particular local contrast normalisation, is a common technique for preprocessing the SVHN images. With regard to architecture, convolutional architectures are quite common within the available benchmarks, as would be expected in a standard computer vision problem. The architectures in [3] and [6] highlight some problems with convolutional neural networks and attempt to address them. This is related to ongoing research into the fundamental issues with convolutional neural networks, including translation invariance and the loss of valuable information through pooling layers, which the capsule networks in [9] aim to address.


[1] Netzer, Y., Wang, T., Coates, A., Bissacco, A., Wu, B. & Ng, A.Y. 2011, Reading Digits in Natural Images with Unsupervised Feature Learning [PDF].

[2] Makhzani, A. & Frey, B. 2014, Winner-Take-All Autoencoders [PDF].

[3] Lee, C.Y, Gallagher, P.W. & Tu, Z. 2016, Generalizing Pooling Functions in Convolutional Neural Networks: Mixed, Gated, and Tree [PDF].

[4] P. Vincent, H. Larochelle, I. Lajoie, Y. Bengio, and P.-A. Manzagol. Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. Journal of Machine Learning Research, 11:3371–3408, 2010 [PDF].

[5] A. Coates, H. Lee, and A. Y. Ng. An Analysis of Single-Layer Networks in Unsupervised Feature Learning. In AI and Statistics, 2011 [PDF].

[6] Liang, M. & Hu, X. 2015, Recurrent Convolutional Neural Network for Object Recognition [PDF].

[7] I. Goodfellow, Q. Le, A. Saxe, H. Lee, and A. Ng. Measuring invariances in deep networks. In Y. Bengio, D. Schuurmans, J. Lafferty, C. K. I.Williams, and A. Culotta, editors, Advances in Neural Information Processing Systems 22, pages 646–654, 2009 [PDF].

[8] H. Lee, C. Ekanadham, and A. Y. Ng. Sparse deep belief net model for visual area V2. In NIPS, 2007 [PDF].

[9] S. Sabour, N. Frosst and G. Hinton. Dynamic Routing Between Capsules. In NIPS, 2017 [PDF].

Also published on Medium.

Abdelrahman Ahmed

Research Engineer at Project AGI

Comments ( 2 )

  1. Khoi
    Thank you for the post. Between CIFAR-10 and SVHN, which do you think is a harder problem?
    • David Rawlinson
      Overall I'd say CIFAR10 is harder because the amount of variability in the appearance of the items being classified is greater. There are more limits on the way the digits 0..9 can be written, than on the appearance of e.g. boats. I'd say the backgrounds are also more challenging in CIFAR because the objects are bigger, making the backgrounds more variable.