Despite continual advances over many decades, automatic speech recognition (ASR) remains a fundamentally supervised learning problem that requires large quantities of annotated training data to achieve good performance. This requirement is arguably the major reason that fewer than 2% of the world's languages are supported by any form of ASR capability. Such a learning scenario also stands in stark contrast to the way that humans learn language, which inspires us to consider approaches that involve more learning and less supervision.
In our recent research toward unsupervised learning of spoken language, we are investigating the role that visual contextual information can play in learning word-like units from unannotated speech. This talk will outline our ongoing research at CSAIL to develop deep learning models that can associate images with unconstrained spoken descriptions, and will present analyses indicating that the models learn correspondences between objects in the images and their spoken instantiations.
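To give a flavor of the kind of model involved, the sketch below shows a minimal two-branch embedding network: one encoder for images, one for spoken captions represented as spectrograms, trained with a margin ranking loss so that matched image/caption pairs score higher than mismatched ones. This is only an illustrative sketch under assumed layer sizes, names, and loss details; it is not the actual CSAIL model discussed in the talk.

```python
# Hypothetical sketch of an audio-visual embedding model: an image encoder and a
# speech encoder map their inputs into a shared space, and a ranking loss pulls
# matched image/spoken-caption pairs together. All architecture choices here are
# illustrative assumptions, not the model presented in the talk.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ImageEncoder(nn.Module):
    """Maps an image (3 x 224 x 224) to a unit-norm embedding vector."""
    def __init__(self, embed_dim=512):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),          # global pooling over spatial dims
        )
        self.proj = nn.Linear(64, embed_dim)

    def forward(self, images):
        h = self.conv(images).flatten(1)
        return F.normalize(self.proj(h), dim=-1)

class SpeechEncoder(nn.Module):
    """Maps a log-mel spectrogram (1 x n_mels x n_frames) to a unit-norm embedding."""
    def __init__(self, n_mels=40, embed_dim=512):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 64, kernel_size=(n_mels, 11), padding=(0, 5)), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),          # pool over frequency and time
        )
        self.proj = nn.Linear(64, embed_dim)

    def forward(self, spectrograms):
        h = self.conv(spectrograms).flatten(1)
        return F.normalize(self.proj(h), dim=-1)

def ranking_loss(img_emb, aud_emb, margin=1.0):
    """Matched pairs (same batch index) should score higher than mismatched pairs."""
    scores = img_emb @ aud_emb.t()            # batch x batch similarity matrix
    pos = scores.diag().unsqueeze(1)          # similarity of the true pairs
    off_diag = ~torch.eye(scores.size(0), dtype=torch.bool)
    # hinge on both directions: image-to-audio and audio-to-image impostors
    loss_i = F.relu(margin + scores - pos)[off_diag].mean()
    loss_a = F.relu(margin + scores.t() - pos)[off_diag].mean()
    return loss_i + loss_a

# Example training step on a random batch of image/spoken-caption pairs.
images = torch.randn(8, 3, 224, 224)
spectrograms = torch.randn(8, 1, 40, 1024)
img_enc, aud_enc = ImageEncoder(), SpeechEncoder()
loss = ranking_loss(img_enc(images), aud_enc(spectrograms))
loss.backward()
```

Because training only requires paired images and raw spoken descriptions, no transcriptions or word boundaries are needed, which is what makes this style of model attractive for the low-supervision setting described above.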