Recently I described how to use neural networks to recognize handwritten digits from the MNIST dataset: Neural Network Training Using Encog.
One real-world use of machine learning is image detection: finding the location of an object in an image. Facebook finds faces in photos; Google's self-driving cars detect pedestrians, cars, and traffic signs.
I tried to use a neural network to detect eyes in photos. Yes, I know there are many tools in various computer vision toolkits that solve this problem well, but I want to try the generic object detection problem, not focus just on face recognition. Though I'm experienced in software engineering, machine learning is a very new field for me, and this is my first attempt at image detection.
The system should take an image as input and return the location (x, y, width, and height of a rectangle) of the eyes in the image. It is not very useful to send the whole image to a neural network and expect to get an object location as a result. It is probably possible to train such a neural network, but it would take tremendous computational power and a huge training dataset.
A sliding window approach can be used instead. A neural network is trained to classify fixed-size images; for example, it can simply classify whether an image is an eye or not. The system takes a part of the image and classifies it. Then it takes another part, sliding the position slightly, and so on. When the whole image is covered, it starts over with a sliding window of a different size. When the process finishes, there is a classification result for every part of the image, so the locations of all found objects are known.
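The scanning loop described above can be sketched roughly like this. This is a minimal illustration, not the code from my repository; the window sizes, step, and scale factor are arbitrary example values, and `classify` stands in for the trained network:

```python
def sliding_windows(image_width, image_height, min_size=32, step=8, scale=1.5):
    """Yield (x, y, width, height) rectangles covering the image,
    repeating the scan with progressively larger windows."""
    size = min_size
    while size <= min(image_width, image_height):
        for y in range(0, image_height - size + 1, step):
            for x in range(0, image_width - size + 1, step):
                yield (x, y, size, size)
        size = int(size * scale)  # start over with a bigger window

# Detection loop (classify and crop are placeholders for the
# trained network and an image-cropping helper):
# detections = [w for w in sliding_windows(w, h)
#               if classify(crop(image, w))]
```

Every rectangle the generator yields gets classified, so the cost grows quickly with image size; a larger step or fewer scales trades accuracy for speed.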
Preparing a training dataset is another interesting problem. It is hard to find anything useful that is publicly available, so I made a very small dataset from my own photos. It is easy to use annotated photos to grab samples of an object for training. To get samples that contain no objects, I simply took random parts of the same photos that do not overlap much with the object rectangles. Maybe there are better ways to build a dataset from annotated images; I would be glad to learn about them.
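The negative-sampling idea above could be sketched like this, assuming rectangles are (x, y, width, height) tuples. The overlap measure (intersection over union) and the 0.2 threshold are my illustrative choices here, not details from the original dataset code:

```python
import random

def iou(a, b):
    """Intersection over union of two (x, y, w, h) rectangles."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    ix = max(0, min(ax + aw, bx + bw) - max(ax, bx))
    iy = max(0, min(ay + ah, by + bh) - max(ay, by))
    inter = ix * iy
    union = aw * ah + bw * bh - inter
    return inter / union if union else 0.0

def sample_negatives(img_w, img_h, objects, size, count, max_iou=0.2):
    """Pick random square crops that barely overlap any annotated object,
    to use as 'not an eye' training samples."""
    negatives = []
    while len(negatives) < count:
        x = random.randrange(0, img_w - size + 1)
        y = random.randrange(0, img_h - size + 1)
        rect = (x, y, size, size)
        if all(iou(rect, obj) < max_iou for obj in objects):
            negatives.append(rect)
    return negatives
```

One caveat with purely random negatives: most of them are easy (sky, walls), so the classifier sees few hard cases near object boundaries.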
The source code can be found here: https://github.com/surmenok/ImageRecognition (see objectDetection package there).
The performance of a neural network with 2 hidden layers of 500 neurons each is far from perfect. One thing that improves it is narrowing the search space: first find a face, then search for eyes only there. Another useful thing might be to take into account the location of a sample relative to the location of the face. Both of these amount to using some domain knowledge: how different types of objects relate to each other.
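Narrowing the search space could look roughly like the following sketch, assuming a separate face detector supplies a bounding rectangle; `classify_eye` and `crop` are hypothetical placeholders for the trained network and a cropping helper:

```python
def windows_in_region(region, size, step=4):
    """Yield square windows of the given size only inside a region
    (x, y, w, h), e.g. a detected face rectangle, instead of
    scanning the whole image."""
    rx, ry, rw, rh = region
    for y in range(ry, ry + rh - size + 1, step):
        for x in range(rx, rx + rw - size + 1, step):
            yield (x, y, size, size)

# eyes = [w for w in windows_in_region(face_rect, 24)
#         if classify_eye(crop(image, w))]
```

Besides being faster, this also reduces false positives, because eye-like patches outside any face are never considered.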
It could be that simple neural networks are not the best tool for this problem. It is worth trying convolutional neural networks.