Researchers from MIT, the MIT-IBM Watson AI Lab, IBM Research, and elsewhere have developed a new technique for analyzing unlabeled audio and visual data that could improve the performance of machine-learning models used in applications like speech recognition and object detection. The work, for the first time, combines two architectures of self-supervised learning, contrastive learning and masked data modeling, in order to scale machine-learning tasks like event classification in single- and multimodal data without the need for annotation, thereby replicating how humans understand and perceive our world.
“A larger portion of human knowledge is learned in a self-supervised way, because we don’t always get supervision signals, and we want to enable the machine-learning model to have the same ability,” says Yuan Gong, an MIT postdoc in the Computer Science and Artificial Intelligence Laboratory (CSAIL).
“So, another way to put it is that self-supervised learning often forms the foundation of an initial model, because it can learn on vast amounts of unlabeled data. And then you can use classical, supervised learning or reinforcement learning to fine-tune the model to something particular if you want to,” says Jim Glass, an MIT senior research scientist and member of the MIT-IBM Watson AI Lab.
The technique, called the contrastive audio-visual masked autoencoder (CAV-MAE), is a type of neural network that can learn to extract and map meaningful latent representations into high-dimensional space from acoustic and visual data by training on large YouTube datasets of 10-second audio and video clips. The researchers say the technique is more effective than previous approaches because it explicitly models the relationships between audio and visual data in a way that other methods do not.
Joining Gong and Glass on the study are graduate students Andrew Rouditchenko and Alexander H. Liu of MIT, David Harwath PhD ’18 of the University of Texas at Austin, and MIT-IBM Watson AI Lab members Leonid Karlinsky and Hilde Kuehne. Kuehne is also affiliated with Goethe University Frankfurt. The method was recently presented at the International Conference on Learning Representations.
A joint and coordinated approach
The CAV-MAE works by “learning by prediction” and “learning by comparison,” says Gong. The masked data modeling, or the prediction method, takes a video along with its coordinated audio waveform, converts the audio to a spectrogram, and masks 75 percent of both. The unmasked data is tokenized, then fed into separate audio and visual encoders before entering a joint encoder/decoder, where the model is asked to recover the missing data. The difference (reconstruction loss) between the resulting reconstructed prediction and the original audio-visual combination is then used to train the model for better performance. An example of this would be covering part of a video of a piano and part of a spectrogram of piano music, and then asking the model to try to determine the masked inputs. Unfortunately, this method may not capture the association between the video and audio pair, whereas contrastive learning leverages this, but may discard some modality-unique information, like the background in a video.
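To make the prediction branch concrete, the sketch below shows a heavily simplified PyTorch version of that data flow: embed spectrogram and video patches, drop a random 75 percent of the tokens in each modality, encode the visible tokens with modality-specific encoders, pass both streams through a small joint module, and reconstruct the masked patches. The module names, patch dimensions, and single-layer transformers here are illustrative assumptions; this is a minimal sketch of the idea, not the authors’ CAV-MAE implementation (positional embeddings and other details are omitted).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def tiny_transformer(dim):
    # One-layer stand-in for the deeper encoders used in practice.
    return nn.TransformerEncoder(
        nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True), num_layers=1)

class MaskedAVAutoencoder(nn.Module):
    """Illustrative masked audio-visual reconstruction, not the paper's model."""
    def __init__(self, audio_dim=128, video_dim=768, dim=256, mask_ratio=0.75):
        super().__init__()
        self.mask_ratio = mask_ratio
        self.audio_embed = nn.Linear(audio_dim, dim)   # spectrogram patch -> token
        self.video_embed = nn.Linear(video_dim, dim)   # video patch -> token
        self.audio_encoder = tiny_transformer(dim)     # modality-specific encoders
        self.video_encoder = tiny_transformer(dim)
        self.joint = tiny_transformer(dim)             # joint encoder/decoder stand-in
        self.mask_token = nn.Parameter(torch.zeros(dim))
        self.audio_head = nn.Linear(dim, audio_dim)    # predict masked audio patches
        self.video_head = nn.Linear(dim, video_dim)    # predict masked video patches

    def mask_and_encode(self, patches, embed, encoder):
        tokens = embed(patches)                        # (B, N, dim)
        b, n, d = tokens.shape
        n_keep = int(n * (1 - self.mask_ratio))        # keep 25% of the tokens
        keep_idx = torch.rand(b, n, device=tokens.device).argsort(dim=1)[:, :n_keep]
        visible = torch.gather(tokens, 1, keep_idx.unsqueeze(-1).expand(-1, -1, d))
        encoded = encoder(visible)
        # Scatter encoded visible tokens back; masked positions get the mask token.
        full = self.mask_token.expand(b, n, d).clone()
        full.scatter_(1, keep_idx.unsqueeze(-1).expand(-1, -1, d), encoded)
        is_masked = torch.ones(b, n, dtype=torch.bool, device=tokens.device)
        is_masked[torch.arange(b, device=tokens.device).unsqueeze(1), keep_idx] = False
        return full, is_masked

    def forward(self, audio_patches, video_patches):
        a_full, a_mask = self.mask_and_encode(audio_patches, self.audio_embed, self.audio_encoder)
        v_full, v_mask = self.mask_and_encode(video_patches, self.video_embed, self.video_encoder)
        joint = self.joint(torch.cat([a_full, v_full], dim=1))   # fuse both streams
        a_out, v_out = joint[:, :a_full.size(1)], joint[:, a_full.size(1):]
        # Reconstruction loss is computed only on the masked positions.
        loss_a = F.mse_loss(self.audio_head(a_out)[a_mask], audio_patches[a_mask])
        loss_v = F.mse_loss(self.video_head(v_out)[v_mask], video_patches[v_mask])
        return loss_a + loss_v

# Toy usage: 64 spectrogram patches and 196 video patches per 10-second clip.
model = MaskedAVAutoencoder()
loss = model(torch.randn(2, 64, 128), torch.randn(2, 196, 768))
loss.backward()
```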
Contrastive learning aims to map representations that are similar close to each other. For example, the model will try to place video and audio data of different parrots close to each other and farther away from pairs of video and audio of guitars playing. In a similar way to masked autoencoding, audio-visual pairs are passed into separate modality encoders; however, the audio and visual components are kept separate within the joint encoder before the model performs pooling and the contrastive loss. In this way, contrastive learning tries to identify the parts of each audio or video clip that are most relevant to the other. For example, if a video shows someone speaking and the corresponding audio clip contains speech, the autoencoder will learn to associate the mouth movements of the speaker with the words being spoken. It will then adjust the model’s parameters so that those inputs are represented close to each other. Ultimately, the CAV-MAE method combines both techniques with multiple forward data streams with masking as a first step, modality-specific encoders, and layer normalization so that the representation strengths are similar.
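The comparison branch can be summarized as a symmetric contrastive objective over pooled per-clip embeddings: matching audio-video pairs in a batch are pulled together, and mismatched pairs are pushed apart. The snippet below is a minimal InfoNCE-style sketch of that idea; the temperature and the way a combined objective would weight the contrastive and reconstruction terms are assumptions for illustration, not the paper’s exact settings.

```python
import torch
import torch.nn.functional as F

def audio_visual_contrastive_loss(audio_emb, video_emb, temperature=0.07):
    """Symmetric contrastive loss over a batch of paired clip embeddings.

    audio_emb, video_emb: (B, dim) pooled outputs of the modality encoders.
    The matching pairs (the diagonal of the similarity matrix) are treated as
    positives; every other combination in the batch is a negative.
    """
    a = F.normalize(audio_emb, dim=-1)
    v = F.normalize(video_emb, dim=-1)
    logits = a @ v.t() / temperature                    # (B, B) pairwise similarities
    targets = torch.arange(a.size(0), device=a.device)  # positives on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

# A combined CAV-MAE-style objective might look like (the weight is an assumption):
#   total_loss = reconstruction_loss + 0.01 * audio_visual_contrastive_loss(a, v)
```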
“We [then] wanted to compare the proposed CAV-MAE with a model trained only with a masked autoencoder and a model trained only with contrastive learning, because we want to show that by combining masked autoencoder and contrastive learning, we can get some performance improvement,” says Gong, “and the results support our hypothesis that there is obvious improvement.”
The researchers tested CAV-MAE, as well as versions of their method without contrastive loss or without a masked autoencoder, against other state-of-the-art methods on audio-visual retrieval and audio-visual event classification tasks, using the standard AudioSet (20K and 2M) and VGGSound datasets of labeled, realistic short clips that can include multiple sounds. Audio-visual retrieval means that the model sees either the audio or visual component of a query pair and searches for the missing one; event classification involves identifying actions or sounds within data, like a person singing or a car driving.
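Retrieval with embeddings like these is typically scored by ranking candidates from the other modality by cosine similarity to the query. The sketch below illustrates that scoring with random tensors standing in for encoder outputs; it is a generic formulation assumed for illustration, not the paper’s evaluation code.

```python
import torch
import torch.nn.functional as F

def retrieve(query_emb, candidate_embs, k=10):
    """Rank candidates of the other modality for one query by cosine similarity.

    query_emb: (dim,) embedding of, e.g., an audio query.
    candidate_embs: (N, dim) embeddings of, e.g., candidate video clips.
    Returns the indices of the top-k candidates.
    """
    q = F.normalize(query_emb, dim=-1)
    c = F.normalize(candidate_embs, dim=-1)
    scores = c @ q                      # (N,) cosine similarity to the query
    return scores.topk(k).indices

# Toy usage: find the 10 best video matches for one audio clip out of 1,000.
top10 = retrieve(torch.randn(256), torch.randn(1000, 256))
```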
Overall, they found that contrastive learning and masked data modeling are complementary methods. CAV-MAE was able to outperform previous techniques (with fully self-supervised pre-training) by about 2 percent for event classification performance versus models with comparable computation and, more impressively, kept pace with or outperformed models with industry-level computational resources. The team’s model ranked similarly to models trained with only the contrastive loss. And surprisingly, the team says, the incorporation of multi-modal data into CAV-MAE pre-training greatly improves the fine-tuning of single-modality representations via supervised learning (with some labeled data) and performance on audio-only event classification tasks. This demonstrates that, like humans, multi-modal information provides an additional “soft label” boost even for audio-only or visual-only tasks; for instance, it helps the model to understand whether it is looking for an electric or acoustic guitar, which is a richer supervision signal.
“I think people like the elegance of this model for combining information in the different audio and visual streams. It has the contrastive and the reconstruction loss, and compared to models that have been evaluated with similar data, it clearly does very well across a range of these tasks,” says Glass.
Building on this, “one special thing is, our model can do both classification and retrieval, which is not common,” Gong adds. “Before this work, these methods were used individually, but after this work, I see that most of the audio-visual learning frameworks use contrastive loss and the masked autoencoder together, implicitly or explicitly.”
Bringing self-supervised audio-visual learning into our world
The researchers see their contribution of the contrastive audio-visual masked autoencoder (CAV-MAE) as an important milestone and a step forward for applications, which are increasingly moving from single modality to multi-modality and which require or leverage audio-visual fusion. They hypothesize that one day it could be used for action recognition in realms like sports, education, entertainment, motorized vehicles, and public safety. It could also, one day, extend to other modalities. Currently, the fact that “this only applies to audio-visual data may be a limitation, but we are targeting multi-modal learning, which is a trend of machine learning,” says Gong. “As humans, we have multi-modalities; we have smell, touch, and many more things than just audio-visual. So, when we try to build AI, we try to mimic humans somehow, not necessarily from the biological perspective, and this method could [potentially be] generalized to other unexplored modalities.”
As machine-learning models continue to play an increasingly important role in our lives, techniques like this one will become increasingly valuable.
This research was supported by the MIT-IBM Watson AI Lab.