ogdenkev/udacity-dog-project

Project Overview

This repo houses my submission for the dog project from the Udacity AI Nanodegree.

The project was to build a convolutional neural network (CNN) and data pipeline that can process real-world, user-supplied images. Given an image of a dog, the CNN will estimate the canine’s breed. If supplied an image of a human, the code will identify the closest resembling dog breed.

Sample Output

The steps of this project were, roughly:

  1. We started off using a pre-trained ResNet-50 model to detect images of dogs (a sketch follows the list below). The model was trained on ImageNet.
  2. Then we built our own CNN from scratch using Keras (which at the time was its own package) with TensorFlow as the backend.
  3. Then we used transfer learning with a "pre-trained VGG-16 model as a fixed feature extractor" to predict the dog breed.
  4. We used another pre-trained CNN and added a few layers of our own design to the output. I chose Xception as the pre-trained CNN. The weights of the pre-trained CNN were fixed. My model achieved 82.5% accuracy on the test set.
  5. Finally, we plugged our trained model into a "pipeline" that would load images, use the model to make predictions, and then present results to the user.
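
For reference, the dog detector from step 1 can be written roughly as below. This is a minimal sketch, assuming the standalone Keras package with the TensorFlow backend; the helper names are illustrative, not necessarily the exact ones from the notebook. ImageNet class indices 151-268 correspond to dog breeds.

```python
import numpy as np
from keras.applications.resnet50 import ResNet50, preprocess_input
from keras.preprocessing import image

# ResNet-50 with ImageNet weights, used only for dog detection
ResNet50_model = ResNet50(weights='imagenet')

def path_to_tensor(img_path):
    """Load an image and convert it to a 4D tensor of shape (1, 224, 224, 3)."""
    img = image.load_img(img_path, target_size=(224, 224))
    x = image.img_to_array(img)
    return np.expand_dims(x, axis=0)

def dog_detector(img_path):
    """Return True if ResNet-50's top ImageNet prediction is a dog class (151-268)."""
    img = preprocess_input(path_to_tensor(img_path))
    prediction = np.argmax(ResNet50_model.predict(img))
    return 151 <= prediction <= 268
```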

My Work

All of my work is in dog_app.ipynb and the exported HTML file.

Takeaways

This was a great project for learning the mechanics of building a CNN. It was also an example of how transfer learning can be done, though not necessarily the best way to do it.

I would have liked to experiment more with the face detector and the pre-trained models, and perhaps to fine-tune the pre-trained weights rather than freezing them during transfer learning.

It'd be interesting to compare a gradient-boosted decision tree model that uses the Xception features against the deep learning network I built for this project.
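
As a rough illustration of that idea, the comparison might look something like the sketch below. This is hypothetical: `train_features` and `test_features` are assumed to be the Xception bottleneck features of shape (n, 7, 7, 2048), with integer breed labels, none of which appear under these names in the notebook.

```python
from sklearn.ensemble import GradientBoostingClassifier

# Pool the (n, 7, 7, 2048) Xception bottleneck features down to (n, 2048)
# by averaging over the spatial dimensions, mirroring global average pooling
train_X = train_features.mean(axis=(1, 2))
test_X = test_features.mean(axis=(1, 2))

# Fit a gradient-boosted tree model on the pooled features
gbt = GradientBoostingClassifier(n_estimators=200, max_depth=3)
gbt.fit(train_X, train_labels)

# Compare against the neural network's test accuracy
print('GBT test accuracy:', gbt.score(test_X, test_labels))
```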

One thing I did not appreciate at the time was how many moving pieces have to come together in order to actually use a deep learning model. Here, we had to load and preprocess images, use the Haar face detector to decide whether the image contained a human, pass the images to the CNN to get predictions, and then do something with those predictions. One improvement would be to show an image of the most similar dog breed next to the input image. Of course, this project focused on building, training, and using the deep learning model, so many of these other pieces were just black boxes (someone from Udacity had already decided what to do with those components).
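
A hedged sketch of what that glue code looks like when the pieces are wired together (the helper names here are illustrative; `predict_breed` stands in for whatever function maps an image to a breed name):

```python
def classify_image(img_path):
    """Tie the dog detector, face detector, and breed classifier together."""
    if dog_detector(img_path):
        return "This looks like a {}.".format(predict_breed(img_path))
    elif face_detector(img_path):
        return "This human most resembles a {}.".format(predict_breed(img_path))
    else:
        return "No dog or human face detected; try a clearer image."
```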

Highlights

Question 2:

This algorithmic choice necessitates that we communicate to the user that we accept human images only when they provide a clear view of a face (otherwise, we risk having unnecessarily frustrated users!). In your opinion, is this a reasonable expectation to pose on the user? If not, can you think of a way to detect humans in images that does not necessitate an image with a clearly presented face?

Answer:

Whether it is reasonable to require that the user input an image with a clear view of a human face depends on the purpose of the application the algorithm will be used in. For instance, if the objective were to provide an accurate breakdown of the percentages of breeds that may be present in a dog, similar to having a dog's DNA sequenced, then it seems reasonable to ask the user for as high-quality an image as possible. On the other hand, the objective of this application seems more lighthearted, given that the algorithm will accept human faces. In that scenario, I think it would be best for the algorithm to attempt to classify the image as human or non-human regardless of whether a human face could be detected. Of course, there will always be images on which the algorithm performs poorly, and in those cases we could then communicate that a better picture (perhaps one with a face) should be used.

One way to detect humans in images that does not require a clearly presented face would be to build a neural network (of an appropriate architecture, e.g. a convolutional neural net) to classify images as human or not without first detecting a face. Such a network could then detect a human from a picture of, say, the side or back of a person. That said, the face detector is already pretty good on the human images. It seems more problematic that 11% of the dog images have a detected human face. Perhaps it would be better to build a detector that could discriminate between human and dog faces.
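
For reference, the Haar-cascade face detector discussed here looks roughly like the sketch below, assuming OpenCV and the standard `haarcascade_frontalface_alt.xml` cascade file (the file path is an assumption about the repo layout):

```python
import cv2

# Load the pre-trained Haar cascade for frontal faces (ships with OpenCV)
face_cascade = cv2.CascadeClassifier('haarcascades/haarcascade_frontalface_alt.xml')

def face_detector(img_path):
    """Return True if at least one human face is detected in the image."""
    img = cv2.imread(img_path)
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)  # Haar cascades expect grayscale
    faces = face_cascade.detectMultiScale(gray)
    return len(faces) > 0
```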

Question 4: Outline the steps you took to get to your final CNN architecture and your reasoning at each step. If you chose to use the hinted architecture above, describe why you think that CNN architecture should work well for the image classification task.

Answer:

I started with a simplified AlexNet model, basing it on the description here. I limited the number of filters in the convolutional layers for computational speed, with the intent to change them if needed based on the performance of the net. Given that the AlexNet architecture broke new ground on the image classification problem, I reasoned that a similar architecture might work well for this dog breed classification problem.
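
A rough sketch of what such a simplified AlexNet-style model might look like in Keras. The reduced filter counts here are illustrative of the "limited number of filters" mentioned above, not the exact values from the notebook; 133 is the number of dog breed classes in the dataset.

```python
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense, Dropout

model = Sequential([
    # Large first kernel with stride 4, as in AlexNet, but fewer filters
    Conv2D(16, (11, 11), strides=4, activation='relu',
           input_shape=(224, 224, 3)),
    MaxPooling2D(pool_size=3, strides=2),
    Conv2D(32, (5, 5), padding='same', activation='relu'),
    MaxPooling2D(pool_size=3, strides=2),
    Conv2D(64, (3, 3), padding='same', activation='relu'),
    MaxPooling2D(pool_size=3, strides=2),
    Flatten(),
    Dense(256, activation='relu'),
    Dropout(0.5),
    Dense(133, activation='softmax')  # one node per dog breed
])
model.compile(optimizer='rmsprop', loss='categorical_crossentropy',
              metrics=['accuracy'])
```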

Question 5: Outline the steps you took to get to your final CNN architecture and your reasoning at each step. Describe why you think the architecture is suitable for the current problem.

Answer:

  1. I started with a Global Average Pooling layer, as in the example above.
  2. I then added a Dropout layer to prevent problems due to overfitting.
  3. I then added a hidden layer, starting with half as many neurons as the output of the Global Average Pooling layer. My motivation was that a similar strategy was used for AlexNet.
  4. I added another Dropout layer to again reduce potential overfitting.
  5. I then added a final output layer to make the prediction (a Keras sketch of the full stack follows the answer below).

I think this architecture is suitable for this problem because it uses a simple architecture composed of a single hidden layer and an output layer to detect patterns in the Xception features that relate to dog breeds. Thus the model is not overly complicated by several hidden layers. Yet, there are enough features from Xception (2048) that many nodes can be included in the hidden layer. Finally, the dropout layers may help prevent overfitting (although as we shall see, the model does seem to be overfitting).

Parameters that I would tune to improve performance, if needed, include the number of nodes in the hidden layer, the number of hidden layers, the type of pooling step (global average vs. max pooling, for instance), the percentage of dropout, and the activation type of the hidden layer.
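
In Keras, the architecture from those steps looks roughly like this. This is a sketch assuming Xception bottleneck features of shape (7, 7, 2048), the 30% dropout mentioned later in this README, 1024 hidden nodes (half of 2048, per step 3), and 133 breed classes.

```python
from keras.models import Sequential
from keras.layers import GlobalAveragePooling2D, Dropout, Dense

Xception_model = Sequential([
    GlobalAveragePooling2D(input_shape=(7, 7, 2048)),  # step 1: pool to 2048 features
    Dropout(0.3),                                      # step 2
    Dense(1024, activation='relu'),                    # step 3: half of 2048
    Dropout(0.3),                                      # step 4
    Dense(133, activation='softmax')                   # step 5: one node per breed
])
Xception_model.compile(optimizer='rmsprop',
                       loss='categorical_crossentropy',
                       metrics=['accuracy'])
```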

Question 6: Is the output better than you expected :) ? Or worse :( ? Provide at least three possible points of improvement for your algorithm.

Answer:

The output is a little worse than I expected, for two reasons. First, although my dog is a mutt and I don't know the exact mix of breeds he is, most people think he is a boxer puppy. However, my algorithm predicted he is a Parson Russell terrier. After looking at some pictures of Parson Russell terriers, I can see that his face may be similar, especially with the white stripe down his nose, but they have longer, more wiry hair. Second, my algorithm thought that a giant tortoise from the Galapagos was a human.

Possible points of improvement include:

  1. A more specific human face detector. The fact that the Galapagos tortoise was identified as human, and that 11% of dogs also had a detected human face, suggests that the human face detector could be more specific. Perhaps it was trained solely (or primarily) on human faces and thus does a good job of finding human faces, but not of excluding other faces.
  2. Although my dog breed classifier had over 82% accuracy on the test dataset, this could still be improved. I noticed that my model seemed to be overfitting because the validation loss did not decrease after epoch 4. Thus, I could decrease the number of nodes in my hidden layer or use more aggressive dropout than 30%. I could also augment the data to try to improve shift invariance, rotation invariance, etc.
  3. Not all breeds are represented equally in the training dataset. For example, Norwegian Buhunds appear only 26 times and Parson Russell terriers 30 times, while Alaskan malamutes occur 77 times and Border collies 74 times. Balancing the number of instances of each breed in the training set may improve performance.
  4. If the algorithm detects a dog but there is no single clear predicted breed, it could report the percentages of the top 2 or 3 predicted breeds rather than just a single prediction (a sketch follows this list).
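
A hypothetical sketch of point 4. Here `dog_names` is assumed to be a list of breed names indexed by class, and `bottleneck_features` a batch of one image's Xception features; neither name is necessarily the one used in the notebook.

```python
import numpy as np

def top_breeds(bottleneck_features, model, dog_names, k=3):
    """Return the top-k predicted breeds with their probabilities (as percents)."""
    probs = model.predict(bottleneck_features)[0]
    top = np.argsort(probs)[::-1][:k]  # indices of the k largest probabilities
    return [(dog_names[i], round(float(probs[i]) * 100, 1)) for i in top]
```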
