Dog Breed Classification
Imagine having an image of a dog on your phone and the device telling you exactly what breed the dog is, or whether the image shows a dog or a human being. With deep learning, we can perform this kind of image classification more efficiently than ever before.
Project Overview
Image classification has become one of the most influential innovations in computer vision since the first digital image scanner. Models that can classify images have transformed the way people interact (social media, search engines and image processing) and have found applications in retail (both in person and online), marketing, theatre and the performing arts, government, surveillance, law enforcement, and more. Thanks to image classification algorithms, we receive notifications on social media when someone posts a picture that may look like us, and self-driving cars can recognize objects on the road. The idea of a program identifying meaningful objects in an image and judging what each one is, what it is connected with and where it belongs, based only on the information found in the image, has endless applications. I implemented this classification using convolutional neural networks. This project is part of my Udacity nanodegree, and I credit Udacity for the data sources used and for some of the code in this write-up. The project description is available [here](https://github.com/udacity/dog-project). The datasets provided by Udacity for this project include:
- Dog Images — a dataset of images of dogs of different breeds, used to train and evaluate the classifier.
- Human Faces — a dataset of human face images, used for face detection and the breed-resemblance feature.
Problem Statement
How can we create a model that accurately classifies dog breeds and also tells us which breed a human face most closely resembles? In this project, I set out to build software that predicts the breed of any dog image supplied to it. The software accepts any user-supplied image as input. If a dog is detected in the image, it provides an estimate of the dog's breed. If a human is detected, it provides an estimate of the dog breed that the face most resembles.
The project is divided into eight steps:
- Step 1: Import Datasets
- Step 2: Detect Humans
- Step 3: Detect Dogs
- Step 4: Create a CNN to Classify Dog Breeds (from Scratch)
- Step 5: Use a CNN to Classify Dog Breeds (using Transfer Learning)
- Step 6: Create a CNN to Classify Dog Breeds (using Transfer Learning)
- Step 7: Write your Algorithm
- Step 8: Test Your Algorithm
Metrics
We use accuracy as the metric of choice to improve our model after each iteration: accuracy = (number of correct breed predictions) / (total number of predictions). Since the goal is to predict the breed of each image correctly, accuracy is a suitable metric for this task.
Data Exploration and Visualization
Import Datasets
To perform our classification, we first had to import the dog data: 133 dog breed categories and 8,351 dog images. We also imported human face data, since we try to assign the most resembling dog breed to a given human face; the dataset contains 13,233 human faces.
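As a sketch of what this import step can look like, the following assumes the Udacity project's directory layout (`dogImages/` for the dog data and `lfw/` for the human faces); both paths are assumptions here.

```python
import numpy as np
from glob import glob
from sklearn.datasets import load_files
from keras.utils import np_utils

def load_dataset(path):
    """Load image file paths and one-hot encoded breed targets."""
    data = load_files(path)
    dog_files = np.array(data['filenames'])
    dog_targets = np_utils.to_categorical(np.array(data['target']), 133)
    return dog_files, dog_targets

# assumed directory layout from the Udacity project
train_files, train_targets = load_dataset('dogImages/train')
valid_files, valid_targets = load_dataset('dogImages/valid')
test_files, test_targets = load_dataset('dogImages/test')

# human face images
human_files = np.array(glob('lfw/*/*'))
```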
Detect Humans
We made use of OpenCV’s implementation of Haar feature-based cascade classifiers to detect human faces in images. It is a machine learning-based approach where a cascade function is trained from a lot of positive and negative images, which is then used to detect objects in other images.
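A minimal sketch of such a face detector, assuming OpenCV's pre-trained frontal-face cascade file is available locally (the XML path follows the Udacity project layout and is an assumption here):

```python
import cv2

# pre-trained frontal-face Haar cascade shipped with OpenCV
face_cascade = cv2.CascadeClassifier('haarcascades/haarcascade_frontalface_alt.xml')

def face_detector(img_path):
    """Return True if at least one human face is detected in the image."""
    img = cv2.imread(img_path)
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)  # cascades operate on grayscale
    faces = face_cascade.detectMultiScale(gray)
    return len(faces) > 0
```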
Detect Dogs
In this section, we use a pre-trained ResNet-50 model to detect dogs in images. Our first line of code downloads the ResNet-50 model, along with weights that have been trained on ImageNet, a very large, very popular dataset used for image classification and other vision tasks. ImageNet contains over 10 million URLs, each linking to an image containing an object from one of 1000 categories. Given an image, this pre-trained ResNet-50 model returns a prediction (derived from the available categories in ImageNet) for the object that is contained in the image.
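A sketch of the resulting dog detector: in ImageNet's 1000 categories, the dog breeds occupy indices 151 to 268 (inclusive), so a top prediction in that range is treated as "dog detected".

```python
import numpy as np
from keras.preprocessing import image
from keras.applications.resnet50 import ResNet50, preprocess_input

# ResNet-50 with weights trained on ImageNet
ResNet50_model = ResNet50(weights='imagenet')

def dog_detector(img_path):
    """Return True if the top ImageNet prediction is a dog breed."""
    img = image.load_img(img_path, target_size=(224, 224))
    x = np.expand_dims(image.img_to_array(img), axis=0)
    prediction = np.argmax(ResNet50_model.predict(preprocess_input(x)))
    return 151 <= prediction <= 268  # ImageNet dog breed indices
```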
Data Preprocessing
When using TensorFlow as the backend, Keras CNNs require a 4D array (which we'll also refer to as a 4D tensor) as input, with shape

(nb_samples, rows, columns, channels),

where nb_samples corresponds to the total number of images (or samples), and rows, columns, and channels correspond to the number of rows, columns, and channels for each image, respectively.
The path_to_tensor function below takes a string-valued file path to a color image as input and returns a 4D tensor suitable for supplying to a Keras CNN. The function first loads the image and resizes it to a square image that is 224×224 pixels. Next, the image is converted to an array, which is then reshaped into a 4D tensor. Since we are working with color images, each image has three channels; and since we are processing a single image (or sample), the returned tensor always has shape

(1, 224, 224, 3).
The paths_to_tensor function takes a numpy array of string-valued image paths as input and returns a 4D tensor with shape

(nb_samples, 224, 224, 3),

where nb_samples is the number of samples, or number of images, in the supplied array of image paths. It is best to think of nb_samples as the number of 3D tensors (where each 3D tensor corresponds to a different image) in the dataset.
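These two functions are commonly implemented along the following lines (a sketch; tqdm is used only for a progress bar):

```python
import numpy as np
from keras.preprocessing import image
from tqdm import tqdm

def path_to_tensor(img_path):
    # load the image and resize it to 224x224 pixels
    img = image.load_img(img_path, target_size=(224, 224))
    # convert the image to a 3D tensor with shape (224, 224, 3)
    x = image.img_to_array(img)
    # add a sample dimension to get a 4D tensor of shape (1, 224, 224, 3)
    return np.expand_dims(x, axis=0)

def paths_to_tensor(img_paths):
    # stack the individual 4D tensors into one of shape (nb_samples, 224, 224, 3)
    list_of_tensors = [path_to_tensor(p) for p in tqdm(img_paths)]
    return np.vstack(list_of_tensors)
```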
Implementation
Create a CNN to Classify Dog Breeds (from Scratch)
Convolutional neural networks (CNNs) are a class of deep neural networks used primarily for analyzing images. To a certain extent, their design was inspired by the way a mammal's brain processes visual impressions. Translation invariance and shared weights are the advantages most often cited for CNNs over other types of neural networks in image analysis. The architecture of a convolutional network involves multiple hidden layers that perform mathematical convolution operations on their input. Our model architecture has an input layer with a shape of 224×224×3. The inputs are fed to two convolutional layers, followed by a max-pooling layer (for down-sampling). The outputs are then fed to a global average pooling layer to minimize overfitting by reducing the number of parameters. The output layer returns 133 probabilities, one per breed. A code sketch follows the list below.
- The convolutional layers extract high-level semantic features by building on lower-level features.
- Dropout layers randomly drop parts of the network during training, so that each part of the network gets the opportunity to be tuned on its own, which reduces overfitting.
- The ReLU activation function provides non-linear transformations while helping to prevent gradient saturation.
- Deeper models are known to enable better performance, so I added two fully-connected layers before the output softmax layer.
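Here is a minimal Keras sketch consistent with the architecture described above; the filter counts, dense-layer sizes and dropout rates are illustrative assumptions, not the exact values used.

```python
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, GlobalAveragePooling2D, Dense, Dropout

model = Sequential([
    # two convolutional layers on the 224x224x3 input
    Conv2D(16, (3, 3), activation='relu', input_shape=(224, 224, 3)),
    Conv2D(32, (3, 3), activation='relu'),
    MaxPooling2D((2, 2)),             # down-sampling
    GlobalAveragePooling2D(),         # few parameters, curbs overfitting
    # two fully-connected layers with dropout before the output
    Dense(256, activation='relu'),
    Dropout(0.3),
    Dense(128, activation='relu'),
    Dropout(0.3),
    Dense(133, activation='softmax')  # one probability per breed
])
model.compile(optimizer='rmsprop', loss='categorical_crossentropy', metrics=['accuracy'])
```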
The model was then trained and measured for accuracy. Training for 11 epochs, I obtained an accuracy of 2.87% on the first attempt and 3.5885% on the second, both above the 1% baseline expected of a from-scratch model.
Refinement
There are a lot of neural network models out there that already specialize in image recognition and have been trained on a huge amount of data. Our strategy now is to take advantage of such pre-trained networks and our plan can be outlined as follows:
- find a network model pre-trained for a general image classification task
- load the model with the pre-trained weights
- drop the "top" of the model, i.e. the section with the fully connected layers, because the specific task of a model is generally defined by this part of the network
- run the new data through the convolutional part of the pre-trained model (this is also called feature extraction, and the output of this step is known as bottleneck features)
- create a new network to define the specific task at hand and train it with the output (the bottleneck features) of the previous step.
As we will see in a moment, the structure of the model into which we feed the bottleneck features can usually be quite simple, because a large part of the training work has already been done by the pre-trained model. Udacity provides a blueprint for this strategy by having already fed our image dataset into a pre-trained VGG16 model and making the output available as bottleneck features, which we can now feed into a very simple training network that essentially consists of just one global average pooling layer and a final dense output layer.
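A sketch of that simple top network, assuming the bottleneck features are loaded from the .npz file Udacity supplies (the file name follows the project's convention and is an assumption here):

```python
import numpy as np
from keras.models import Sequential
from keras.layers import GlobalAveragePooling2D, Dense

# bottleneck features precomputed by running the images through VGG16
bottleneck_features = np.load('bottleneck_features/DogVGG16Data.npz')
train_VGG16 = bottleneck_features['train']

VGG16_model = Sequential([
    GlobalAveragePooling2D(input_shape=train_VGG16.shape[1:]),
    Dense(133, activation='softmax')
])
VGG16_model.compile(optimizer='rmsprop', loss='categorical_crossentropy', metrics=['accuracy'])
```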
Create a CNN to Classify Dog Breeds (using Transfer Learning)
We choose InceptionV3 as the network that provides the features for our training layers. InceptionV3 is a high-performing model on the ImageNet dataset, and its power lies in the fact that the network can be designed much deeper than earlier models through the use of subnetworks called inception modules.
On top of the pre-trained InceptionV3 model, we create our fully-connected (FC) layers with the following code. We also add two dropout layers to prevent overfitting.
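A sketch of those top layers; the dense-layer size and dropout rates are illustrative assumptions, and the bottleneck file name again follows the Udacity convention:

```python
import numpy as np
from keras.models import Sequential
from keras.layers import GlobalAveragePooling2D, Dense, Dropout

# bottleneck features precomputed by running the images through InceptionV3
bottleneck_features = np.load('bottleneck_features/DogInceptionV3Data.npz')
train_InceptionV3 = bottleneck_features['train']

InceptionV3_model = Sequential([
    GlobalAveragePooling2D(input_shape=train_InceptionV3.shape[1:]),
    Dropout(0.4),
    Dense(256, activation='relu'),
    Dropout(0.4),                     # second dropout layer against overfitting
    Dense(133, activation='softmax')
])
InceptionV3_model.compile(optimizer='rmsprop', loss='categorical_crossentropy', metrics=['accuracy'])
```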
Model Evaluation and Validation
Performance
After applying transfer learning, i.e. using the InceptionV3 model and training for 15 epochs, I obtained an accuracy of 80.9%, a dramatic improvement over my two initial from-scratch results. Still, there is room to do better, as an 80.9% accuracy score is far from optimal.
| Architecture | Test Accuracy |
| --- | --- |
| From-scratch CNN (initial) | 2.87% |
| From-scratch CNN (second attempt) | 3.5885% |
| Transfer learning with InceptionV3 | 80.98% |
In this project, I built a CNN from scratch and then improved it with transfer learning. Our final model was able to accurately predict the breeds of dogs and also distinguish humans from dogs, as seen in the picture below.
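Tying the pieces together, a minimal sketch of the final algorithm might look as follows; it assumes the face_detector and dog_detector sketched earlier, and InceptionV3_predict_breed is a hypothetical helper standing in for the breed-prediction routine built on the transfer-learned model.

```python
def classify_image(img_path):
    """Report the dog breed, the most resembling breed for a human, or neither."""
    if dog_detector(img_path):
        # InceptionV3_predict_breed is a hypothetical helper (not shown here)
        return "Detected a dog of breed: " + InceptionV3_predict_breed(img_path)
    elif face_detector(img_path):
        return "This human most resembles a: " + InceptionV3_predict_breed(img_path)
    else:
        return "Neither a dog nor a human face was detected in the image."
```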
Justification
In this project, we discussed two approaches for developing an algorithm to identify dog breeds, and we achieved our best results with a transfer learning model. We obtained an accuracy of 80.9% in our tests, a large improvement over our initial model's performance of 3.5%. Transfer learning has several benefits, but the main advantages are saved training time, better neural network performance (in most cases), and not needing a lot of data.
Conclusion
As shown in this project, CNN implementation is a relatively straightforward way of performing image classification. Keras provides a simple way to implement multiple types of CNN architectures and facilitates easy fine-tuning of hyperparameters, so models can be optimized without much effort. Our initial model had a very low accuracy of 3.5%, but with the application of transfer learning I was able to increase the performance to 80.9%. The aspect of the algorithm that likens a human face to a dog breed was something I found intriguing. Despite its accuracy score being only fair, the algorithm accurately predicted the sample images it was supplied with.
However, we still see several options to further improve our algorithm in the future:
- We could gather more training data.
- We could employ data augmentation to prevent overfitting.
- In this project, I trained models on VGG16 and InceptionV3. A next step could be to try other pre-trained architectures, as well as changing the architecture of our fully connected layers.
The code for the project can be found [here](https://github.com/UOIKENNA/Udacity_dog_app_capstone).