PixelPlayer automatically identifies the different sound sources in a video, matches each one to its position in the picture, and can extract each source as a separate audio track.
For example, it can tell which character is speaking or which instrument is being played in the video.
PixelPlayer learns and analyzes this on its own, without any manually annotated data.
This capability provides powerful tools for audio and video editing, multimedia content production, augmented reality applications, and other fields, making it possible to independently adjust the volume of different sound sources in a video, remove or enhance specific sources, and more.
For example, it can be used to dub AI-generated videos!
PixelPlayer’s core functions include:
1. Sound source separation: By analyzing the video, PixelPlayer separates the mixed sound signal into multiple components, each corresponding to a specific region of the frame. This lets the system identify and isolate the different sound sources in a video, such as different musical instruments, exporting voices, instruments, and other sources as separate audio tracks.
2. Sound localization: Beyond separating sounds, PixelPlayer can also locate each sound's origin, that is, determine which region of the video produced it. This means the system can identify which specific object in the video a sound is coming from, for example which character is speaking or which instrument is being played.
3. Multiple sound source processing: Even when several sources in the video sound at the same time, PixelPlayer can recognize and process each of them separately.
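As a rough illustration of the separation idea, the toy sketch below applies an "ideal" time-frequency mask to a synthetic two-tone mixture. PixelPlayer learns to predict such masks from video; here the mask is simply computed from the known sources, and all signals and parameters are illustrative, not taken from the paper.

```python
# Toy mask-based source separation on a spectrogram: a stand-in for
# the kind of time-frequency masking PixelPlayer performs. We "cheat"
# by building the mask from the known sources, which PixelPlayer
# instead predicts from the video frames.
import numpy as np
from scipy.signal import stft, istft

sr = 8000
t = np.arange(sr) / sr
src_a = np.sin(2 * np.pi * 440 * t)   # stand-in for one instrument
src_b = np.sin(2 * np.pi * 1760 * t)  # stand-in for another
mix = src_a + src_b

# Spectrograms of the mixture and of each (normally unknown) source
_, _, Zmix = stft(mix, fs=sr, nperseg=512)
_, _, Za = stft(src_a, fs=sr, nperseg=512)
_, _, Zb = stft(src_b, fs=sr, nperseg=512)

# Ideal binary mask: 1 in time-frequency bins where source A dominates
mask_a = (np.abs(Za) > np.abs(Zb)).astype(float)

# Apply the mask to the mixture spectrogram, invert back to a waveform
_, est_a = istft(mask_a * Zmix, fs=sr, nperseg=512)
est_a = est_a[:len(src_a)]

# The estimate correlates strongly with the original 440 Hz source
print(np.corrcoef(est_a, src_a)[0, 1] > 0.9)
```

Because the two tones occupy disjoint frequency bins, the masked mixture is almost exactly the first source; real recordings overlap far more, which is why a learned, video-conditioned mask is needed.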
Working principle:
1. Large-scale video training: The PixelPlayer system was trained on a large number of videos of people playing different combinations of instruments, including solos and duets. No information about which instruments appear in each video, where they are located, or how they sound was provided during training.
2. Data-driven learning: Crucially, PixelPlayer performs this complex analysis without manually annotated data. Traditional machine learning methods often rely on large amounts of labeled data to teach models to recognize and process information; in contrast, PixelPlayer learns the relationship between sound and images by watching large numbers of unlabeled videos, achieving source separation and localization through self-supervised learning.
3. Exploiting audio-visual synchronization: PixelPlayer relies on the natural synchronization between the visual and audio modalities: sound production is usually tied to visible events, such as a musician's movements while playing an instrument. By analyzing this synchrony, PixelPlayer learns the sound characteristics produced by different objects and actions.
4. Associating sound with pixels: Through joint analysis of sound and image, the system assigns a sound component to each pixel in the video, enabling precise localization and separation. This lets PixelPlayer identify which regions of the frame are producing sound and decompose the mixture into the components each region contributes.
5. Sound separation technology: PixelPlayer uses source separation techniques to split the mixed audio signal into multiple independent channels, each corresponding to one sound source in the video.
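The per-pixel association in points 4 and 5 can be sketched as follows: each spatial location carries a feature vector, the audio network yields a set of spectrogram channels, and their dot product (through a sigmoid) gives that pixel's separation mask. The shapes and the channel count `k` here are illustrative assumptions, not the paper's exact configuration.

```python
# Sketch of a per-pixel audio mask: a k-dim visual feature at each
# spatial location is combined with k audio "basis" spectrograms to
# produce a separation mask per pixel. All shapes are illustrative.
import numpy as np

rng = np.random.default_rng(0)
k = 16                      # shared audio/visual channel count (assumed)
H, W = 14, 14               # spatial grid of video features
F, T = 256, 48              # spectrogram frequency bins x time frames

visual_feats = rng.standard_normal((H, W, k))     # per-pixel vectors
audio_channels = rng.standard_normal((k, F, T))   # k basis spectrograms
mix_spec = np.abs(rng.standard_normal((F, T)))    # mixture magnitude

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Per-pixel mask: weighted sum of audio channels, squashed to (0, 1)
masks = sigmoid(np.einsum('hwk,kft->hwft', visual_feats, audio_channels))

# Sound attributed to pixel (y, x): its mask applied to the mixture
y, x = 3, 7
pixel_sound = masks[y, x] * mix_spec

print(masks.shape)        # (14, 14, 256, 48)
print(pixel_sound.shape)  # (256, 48)
```

In training, masks like these would be compared against a known mixture of two videos (the "mix-and-separate" idea), so the network never needs labels for individual sources.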
Application scenarios:
1. Audio source separation: PixelPlayer can automatically separate individual sound sources, such as instrument sounds, from a video. This is very useful for music production and editing, allowing audio engineers and producers to pull individual instrument tracks out of complex recordings for more refined processing and mixing.
2. Sound localization: By locating the specific location where sounds are generated in videos, PixelPlayer provides new possibilities for augmented reality (AR) and virtual reality (VR) applications. In an AR/VR environment, realistically simulating the source of sound based on the user’s perspective and interaction can greatly enhance the user experience.
3. AI content dubbing: In fields such as film production, video game development, and online education, PixelPlayer can help content creators dub visual content more easily, for example by automatically adding specific sound effects to different characters or objects in an animation.
4. Automatic subtitle and description generation: For people with hearing impairments, PixelPlayer can help generate more accurate subtitles and audio descriptions by identifying and separating the sound sources in a video, improving the accessibility of video content.
5. Audio visualization: PixelPlayer provides an innovative way to visualize sound and music. By directly associating sound with visual content, novel music visualization experiences can be created, such as dynamic sound visualization based on instrument position in music videos.
6. Music teaching and learning: In music education, PixelPlayer can be used to display the sound distribution and characteristics of different instruments in an ensemble, helping students better understand the structure of the music and the interaction between the instruments.
7. Research and development: As a research project, The Sound of Pixels pushes the boundaries of cross-modal learning (processing and understanding multiple sensory modalities simultaneously), providing new perspectives and tools for the development of future artificial intelligence systems.
Through this project, MIT’s research team not only pushes the boundaries of audio and video processing technology, but also provides new perspectives and tools for multimodal artificial intelligence research and applications.
Project and demo: http://sound-of-pixels.csail.mit.edu
Paper: https://arxiv.org/abs/1804.03160
GitHub: https://github.com/hangzhaomit/Sound-of-Pixels