As part of Bites of Bytes I thought it would be cool if we could take a look at some interesting research papers that I have come across. Today we’ll be looking at the the EdgeBoxes algorithm for object detection developed by Larry Zitnick and Piotr Dollar from Microsoft Research.
The task of object detection within an image is one with a massive number of applications, from robotics to analytics to countless real world situations. It is often cited as the example to illustrate how “some problems so easy to humans are incredibly difficult for computers” which then neatly segues into a discussion about a data driven approach which has been vastly elaborated by various different works. Traditional methods to solve the problem mostly followed the approach of merging and clustering similar pixels to obtain a boundary of the object which transitioned into segmentation of the image into foreground and background by means of a user provided seed for the image. What makes the EdgeBoxes algorithm cool is that rather than going from the image downwards, proposals are generated from edges in a bottom-up manner. That is the main aim of this work, to yield a set of bounding boxes each of which contains an object within an image.
- Given an input image, for every individual pixel, an edge response value is calculated by means of the Structured Edge Detector. Essentially a score is provided for each pixel which indicates whether or not that pixel is an edge pixel or not. How this is done is by means of a Structured Random Forest.
- Breaking that down even further, a Decision Forest is a collection of trained decision trees. Each tree is trained individually but because of the tendency of these classifiers to overfit data, pooling the results of the entire collection is employed.
- When we say a Random Forest, we mean that to increase the diversity among the various trees and hence increase the accuracy of the prediction of the forest as a whole. This is done by means of random subsampling during training of the trees or injecting a few random split nodes. Accuracy of the individual trees may be sacrificed in favour of a better overall result.
- The Forest retrieves input features from the image for each data pixel, sampling neighbours, sampling pairwise differences, gradient channels in all directions and returns a score for each pixel which is labelled as an edge pixel or not based on a fixed threshold. For more details on the Structured edge detector you can take a look at the base paper. These features are similar to simpler edge detection methods like the Prewitt operator and Laplacian operator which work on the principle of convolution.
- You get a map of all edge pixels on the image but when connected you now have a set of very thick edges on the image. These are sharpened by means of Non Maximal Suppression. Again a standard computer vision technique which is used to find edge peaks. Follow this link to learn more about that, but essentially you are now returned a sparse, sharp edge map.
- Now we define a term called a contour which is a set of connected edges that that form a coherent boundary. Since there are a very large number of edges detected within the image, they are first combined to form edge groups of connected sets of 8 by a greedy selection. An affinity score is computed pairwise on the set of all edge groups which is based on the mean position and orientation of the different groups.
- With the edge groups and affinities, a proposal score can be computed for every bounding box. The connected edge groups form a contour around whatever is the proposed object in the bounding box by means of a similarity score between this contour and the boundary of the box. Only those edges completely within the bounding box are considered when evaluating contours.
- Based on this score, we either select or reject the bounding box as an object proposal.
The algorithm yields a very large number of proposals with a near perfect recall score. We can make use of a hypothesis selection technique like Intersection Over Union (IoU) to remove similar hypotheses based on the task at hand.
I suppose what drew my attention to the paper was the novel way they approach the problem of building from the pixel upwards. The way of building from pixel to edge to edge group to contour to bounding box seemed both logically intuitive and deceptively simple. Honestly I feel that that is the hallmark of a great algorithm, when you read it, it instantly makes sense in your head. Also, as the results in the paper show, when we used the official implementation, the performance of the algorithm is state-of-the-art. We had used this as part of a larger project of multilabel classification by identifying individual objects in the image and the biggest challenge was not this stage of the process but the selection from the large number of boxes. Stay tuned for that post. (Edit: Here it is 😛 )