Principal Investigator William Freeman
Co-investigator Antonio Torralba
Project Website http://www.csail.mit.edu/node/127
Researchers seek to build a large collection of images with ground truth labels for use in object detection and recognition research. Such data is useful for supervised learning and quantitative evaluation. We have developed a web-based tool that allows easy image annotation and instant sharing of the resulting annotations. Using this annotation tool, we have collected a large dataset that spans many object categories, often containing multiple instances over a wide variety of images. We quantify the contents of the dataset and compare it against existing state-of-the-art datasets used for object recognition and detection. We also show how to extend the dataset to automatically enhance object labels with WordNet, discover object parts, recover a depth ordering of objects in a scene, and increase the number of labels using minimal user supervision and images from the web.
Research in object detection and recognition in cluttered scenes requires large image and video collections with ground truth labels. The labels should provide information about the object classes present in each image, as well as their shapes and locations, and possibly other attributes such as pose. Such data is useful for testing, as well as for supervised learning. Even algorithms that require little supervision need large databases with ground truth to validate the results. New algorithms that exploit context for object recognition require databases with many labeled object classes embedded in complex scenes. Such databases should contain a wide variety of environments with annotated objects that co-occur in the same images.
Building a large database of annotated images with many objects is a costly and lengthy enterprise. Traditionally, databases are built by a single research group and are tailored to solve a specific problem (e.g., face detection). Many databases currently available contain only a small number of classes, such as faces, pedestrians, and cars. A notable exception is the Caltech 101 database, with 101 object classes. Unfortunately, the objects in this set are generally of uniform size and orientation within an object class, and lack rich backgrounds.
Web-based annotation tools provide a new way of building large annotated databases by relying on the collaborative effort of a large population of users. LabelMe is an online annotation tool that allows the sharing of images and annotations. The tool provides many functionalities such as drawing polygons, querying images, and browsing the database. Both the image database and all of the annotations are freely available. The tool runs on almost any web browser, and uses a standard Javascript drawing tool that is easy to use. The resulting labels are stored in the XML file format, which makes the annotations portable and easy to extend. A Matlab toolbox is available that provides functionalities for manipulating the database (database queries, communication with the online tool, image transformations, etc.). The database is also searchable online.
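Because the annotations are stored as XML, they can be read with standard tools in any language. The sketch below parses a minimal annotation in the general shape of a LabelMe file; the exact element names (`object`, `name`, `polygon`, `pt`) follow the commonly documented layout, but treat the precise schema as an assumption rather than a specification.

```python
import xml.etree.ElementTree as ET

# A minimal annotation in the general shape of a LabelMe XML file.
# The element names below are an assumption based on the commonly
# documented <object>/<polygon>/<pt> layout.
SAMPLE = """
<annotation>
  <filename>street.jpg</filename>
  <object>
    <name>car</name>
    <polygon>
      <pt><x>10</x><y>20</y></pt>
      <pt><x>110</x><y>20</y></pt>
      <pt><x>110</x><y>80</y></pt>
      <pt><x>10</x><y>80</y></pt>
    </polygon>
  </object>
</annotation>
"""

def parse_annotation(xml_text):
    """Return a list of (label, [(x, y), ...]) pairs from one annotation file."""
    root = ET.fromstring(xml_text)
    objects = []
    for obj in root.findall("object"):
        label = obj.findtext("name").strip()
        points = [(int(pt.findtext("x")), int(pt.findtext("y")))
                  for pt in obj.find("polygon").findall("pt")]
        objects.append((label, points))
    return objects

print(parse_annotation(SAMPLE))
```

Because the format is plain XML, the same annotations can be loaded from the Matlab toolbox, a web service, or a script like this one without any format conversion.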
Currently the database contains more than 77,000 objects labeled within 23,000 images covering a large range of environments and several hundred object categories. The images are high resolution and cover a wide field of view, providing rich contextual information. Pose information is also available for a large number of objects. Since the annotation tool has been made available online there has been a constant increase in the size of the database, with about 7,500 new labels added every month, on average.
One important concern when data is collected using web-based tools is quality control. Currently, quality control is provided by the users themselves: polygons can be deleted and object names corrected using the annotation tool online. Despite the lack of a more direct control mechanism, the annotations are of quite good quality. Another issue is the complexity of the polygons provided by the users: do users trace simple or complex polygon boundaries? In practice, even for the most complicated object classes, the polygons provide a good idea of the outline of the object, which is sufficient for most object detection and segmentation algorithms.
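A simple way to inspect polygon complexity is to look at the vertex count of each outline, and to sanity-check it against the enclosed area computed with the shoelace formula: a coarse and a fine outline of the same region should agree on area even though their vertex counts differ. This is a generic illustrative check, not a procedure from the project itself.

```python
def polygon_area(points):
    """Enclosed area via the shoelace formula (vertices in order, any winding)."""
    n = len(points)
    s = 0.0
    for i in range(n):
        x0, y0 = points[i]
        x1, y1 = points[(i + 1) % n]
        s += x0 * y1 - x1 * y0
    return abs(s) / 2.0

# A coarse 4-point outline vs. a finer 8-point outline of the same square:
coarse = [(0, 0), (10, 0), (10, 10), (0, 10)]
fine = [(0, 0), (5, 0), (10, 0), (10, 5),
        (10, 10), (5, 10), (0, 10), (0, 5)]

for poly in (coarse, fine):
    print(len(poly), "vertices, area =", polygon_area(poly))
```

Both outlines report an area of 100.0, so the extra vertices in the finer polygon add boundary detail without changing the region covered.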
Another issue is what to label. For example, should you label a whole pedestrian, just the head, or just the face? What if it is a crowd of people -- should you label all of them? Currently we leave these decisions up to each user. In this way, we hope the annotations will reflect what various people think are "natural" ways to segment an image. A third issue is the label itself. For example, should the same object be called a "person," "pedestrian," or "man/woman"? An obvious solution is to provide a drop-down menu of standard object category names. However, we currently prefer to let people use their own descriptions, since these may capture some nuances that will be useful in the future. The Matlab toolbox allows querying the database using a list of possible synonyms.
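The effect of a synonym-list query can be sketched in a few lines: gather all labels that match any entry in the list, case-insensitively. This mirrors the spirit of the Matlab toolbox's query functionality, but the function and data layout below are illustrative assumptions, not the toolbox's actual API.

```python
def query_by_synonyms(annotations, synonyms):
    """Return annotations whose label matches any synonym (case-insensitive).

    `annotations` is assumed to be a list of dicts with a "name" key;
    this layout is a stand-in for whatever the real toolbox uses.
    """
    wanted = {s.lower() for s in synonyms}
    return [a for a in annotations if a["name"].lower() in wanted]

# Free-form labels as different users might have typed them:
annotations = [
    {"name": "person"},
    {"name": "Pedestrian"},
    {"name": "car"},
    {"name": "man"},
]

hits = query_by_synonyms(annotations, ["person", "pedestrian", "man", "woman"])
print([a["name"] for a in hits])  # → ['person', 'Pedestrian', 'man']
```

Letting users type free-form labels and resolving them at query time keeps the nuances of the original descriptions while still allowing a single query to retrieve all instances of a concept.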