

Mapping Pixels and Spectrograms to Unified Token Spaces

Discover how raw image patches and audio spectrograms are converted into discrete tokens using Vision Transformers (ViTs) and audio-specific encoders to enable unified model processing.

AI & ML · Advanced · 12 min read

The Bridge Between Continuous Signals and Discrete Tokens

The primary challenge in multimodal AI is the fundamental difference in data representation across text, vision, and audio. Text is naturally discrete and hierarchical, whereas vision and audio are continuous signals with high spatial or temporal redundancy. Modern architectures overcome this by treating all inputs as a sequence of discrete tokens within a unified vector space.

Traditional neural networks used modality-specific layers like convolutions for images and recurrent units for audio. However, these specialized layers often create silos that make it difficult for a model to learn cross-modal relationships. By converting all data into a sequence of embeddings, we allow the Transformer architecture to apply the same self-attention mechanism regardless of the original data source.

This unification relies on the concept of a shared latent space where a token representing a specific concept in an image is mathematically similar to its textual description. Achieving this requires sophisticated tokenization strategies that preserve the underlying structure of the raw signal while discarding noise. This process effectively translates the language of the physical world into the language of high-dimensional vectors.

Overcoming the Dimensionality Barrier

Raw sensory data is incredibly dense and contains a significant amount of redundant information. A single high-definition image contains millions of pixels, and a one-second audio clip at standard sample rates contains tens of thousands of data points. Processing these raw signals directly would be prohibitively expensive, because the cost of self-attention grows quadratically with sequence length.

The tokenization process serves as a downsampling and feature extraction layer that reduces this dimensionality. By grouping pixels into patches or audio samples into frequency bins, we focus on the most salient features of the data. This compression is essential for maintaining a manageable context window during inference and training.
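To get a feel for the scale of this compression, the token counts can be sketched with simple arithmetic. The patch size, image resolution, and spectrogram frame rate below are illustrative defaults, not values fixed by any particular model:

```python
# Back-of-envelope token counts for common inputs.
# Assumptions: a 224x224 image with non-overlapping 16x16 patches, and
# audio framed into 100 spectrogram frames per second (a 10 ms hop).

def image_token_count(height, width, patch_size):
    # Each non-overlapping patch becomes one token.
    return (height // patch_size) * (width // patch_size)

def audio_token_count(seconds, frames_per_second):
    # One token per spectrogram frame, before any pooling.
    return int(seconds * frames_per_second)

image_tokens = image_token_count(224, 224, 16)   # 14 * 14 = 196
audio_tokens = audio_token_count(10, 100)        # 1000 frames for 10 s
```

A 224x224 image thus shrinks from roughly 150,000 pixel values to 196 tokens, while ten seconds of audio becomes about a thousand frames rather than 160,000 raw samples.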

Vision Transformers: Transforming Pixels into Sequences

Vision Transformers (ViTs) revolutionized image processing by treating an image much like a sentence. Instead of words, the model receives a sequence of image patches that have been flattened into vectors. This approach allows the model to learn global dependencies across the entire image from the very first layer.

The first step in this pipeline is the patch extraction process where the image is divided into a grid of fixed-size squares. For a standard 224 by 224 pixel image, using a patch size of 16 results in a sequence of 196 individual patches. Each of these patches represents a local spatial region that contains texture, color, and edge information.

Implementing Image Patching with Einops

```python
import torch
from einops import rearrange

def create_image_patches(image_tensor, patch_size=16):
    # image_tensor shape: [Batch, Channels, Height, Width]
    # Rearrange the image into a sequence of flattened patches:
    # b c (h p1) (w p2) -> b (h w) (p1 p2 c)
    patches = rearrange(
        image_tensor,
        'b c (h p1) (w p2) -> b (h w) (p1 p2 c)',
        p1=patch_size,
        p2=patch_size
    )
    return patches

# Example: 1 batch, 3 channels, 224x224 image
raw_image = torch.randn(1, 3, 224, 224)
# Resulting shape will be [1, 196, 768]
processed_patches = create_image_patches(raw_image)
```

Once patches are extracted, they are passed through a linear projection layer. This layer is essentially a learnable matrix that maps the high-dimensional pixel data into the same embedding dimension used by the Transformer. This projection ensures that the visual information is compatible with the model's hidden states.
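A minimal sketch of this projection step, assuming 16x16 RGB patches (16 * 16 * 3 = 768 input features) and a hypothetical model width of 512:

```python
import torch
import torch.nn as nn

# Sketch of the patch projection layer. The model width of 512 is an
# illustrative assumption; real models use widths like 768 or 1024.
patch_dim = 16 * 16 * 3   # flattened pixels per 16x16 RGB patch
embed_dim = 512

projection = nn.Linear(patch_dim, embed_dim)

patches = torch.randn(1, 196, patch_dim)   # output of the patching step
embeddings = projection(patches)           # shape: [1, 196, 512]
```

Every patch now lives in the same 512-dimensional space as any other token the Transformer will see.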

The choice of patch size is the most significant trade-off in vision tokenization. Smaller patches provide higher resolution and better detail capture, but halving the patch size quadruples the number of patches, and because self-attention cost grows quadratically with sequence length, this dramatically raises memory consumption.

The Role of Positional Encodings

Standard Transformers are permutation invariant, meaning they do not inherently know the order of tokens in a sequence. In text, the order of words determines meaning, and in vision, the spatial arrangement of patches is critical for recognizing objects. Without spatial information, the model would treat the image as a bag of patches with no structural context.

To solve this, we add learnable positional embeddings to the patch projections before they enter the Transformer blocks. In the original ViT these are one-dimensional learned vectors, one per grid position, though some variants use explicitly two-dimensional encodings that reflect the grid structure of the image. Either way, the model can then distinguish between a patch at the top-left corner and one in the center of the frame.
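The mechanism itself is a simple addition. This sketch uses a flat learned table of one vector per patch position, in the style of the original ViT; the sequence length of 196 and width of 512 are illustrative:

```python
import torch
import torch.nn as nn

class PatchEmbeddingWithPosition(nn.Module):
    """Adds a learnable positional embedding to each patch embedding.

    A minimal sketch: num_patches and embed_dim are illustrative values,
    and the table is a flat [1, num_patches, embed_dim] parameter
    (1D learned positions over the flattened patch grid).
    """
    def __init__(self, num_patches=196, embed_dim=512):
        super().__init__()
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches, embed_dim))

    def forward(self, patch_embeddings):
        # Broadcasts the position table over the batch dimension.
        return patch_embeddings + self.pos_embed

module = PatchEmbeddingWithPosition()
tokens = module(torch.randn(2, 196, 512))  # shape preserved: [2, 196, 512]
```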

Audio Processing: From Waveforms to Spectrogram Patches

Audio data presents a unique challenge because it is a one-dimensional signal that varies over time. While we could tokenize raw waveforms directly, this is rarely efficient given typical sample rates of 16,000 samples per second or more. Instead, most multimodal models convert audio into a visual representation known as a Mel-spectrogram.

A Mel-spectrogram represents the frequency content of the audio signal as it changes over time, mapped to a scale that mimics human hearing. This transformation turns a temporal audio problem into a spatial vision problem. Once the spectrogram is generated, we can apply the same patching techniques used in Vision Transformers to tokenize the sound.

Audio Spectrogram Tokenization

```python
import torch.nn as nn
import torchaudio

class AudioEncoder(nn.Module):
    def __init__(self, embed_dim=768):
        super().__init__()
        # Convert raw audio to a Mel-spectrogram
        self.spec_gen = torchaudio.transforms.MelSpectrogram(
            sample_rate=16000,
            n_mels=128
        )
        # A simple linear layer to project frequency bins to embeddings
        self.projection = nn.Linear(128, embed_dim)

    def forward(self, waveform):
        # waveform shape: [batch, time]
        spectrogram = self.spec_gen(waveform)  # [batch, n_mels, time]
        # Transpose to treat time as the sequence dimension
        spectrogram = spectrogram.transpose(1, 2)
        tokens = self.projection(spectrogram)
        return tokens
```

By treating the spectrogram as an image, we can use an architecture known as the Audio Spectrogram Transformer (AST). This model patches the spectrogram along both the time and frequency axes, enabling it to attend to specific acoustic events at specific frequencies, such as a bird chirping or a door slamming.

Temporal vs. Spectral Tokenization

Tokenizing audio requires balancing resolution in the time domain versus the frequency domain. If the time windows are too large, the model may miss brief transient sounds like percussion or clicks. Conversely, if the frequency bins are too wide, the model might struggle to distinguish between different musical notes or vocal pitches.

Most advanced systems use overlapping patches to ensure that no information is lost at the boundaries of the windows. This redundancy helps the model maintain temporal continuity across the sequence. The overlap introduces extra computation but is vital for high-fidelity audio understanding and generation.
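Overlapping framing can be sketched with PyTorch's `tensor.unfold`, which slides a window along one axis. The window and hop sizes here are illustrative; a hop of half the window gives 50% overlap:

```python
import torch

# Sketch of overlapping time windows over a spectrogram using unfold.
# A window of 8 frames with a hop of 4 frames gives 50% overlap.
spec = torch.randn(1, 128, 64)      # [batch, n_mels, time]
window, hop = 8, 4

# Slide along the time axis (dim 2):
# result shape is [batch, n_mels, num_windows, window]
frames = spec.unfold(2, window, hop)
num_windows = frames.shape[2]       # (64 - 8) // 4 + 1 = 15
```

Each of the 15 windows shares four frames with its neighbor, so a transient sound landing on a window boundary is still seen whole by at least one window.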

Alignment Strategies and Unified Latent Spaces

After we have converted images and audio into sequences of tokens, the next hurdle is ensuring these tokens exist in a shared semantic space. If an image token for a dog and the word token for dog are in completely different parts of the vector space, the model cannot perform cross-modal reasoning. We use contrastive learning and shared projection heads to align these representations.

Contrastive learning works by pushing embeddings of related pairs closer together while pushing unrelated pairs further apart. For example, during training, a model might be fed an image of a sunset and its corresponding audio recording of ocean waves. The model is penalized if the distance between these two vectors is high, forcing it to find commonalities.
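This training signal is commonly implemented as a symmetric, CLIP-style contrastive loss over a batch of matched pairs; the temperature value below is an illustrative choice, not a prescribed constant:

```python
import torch
import torch.nn.functional as F

# Minimal sketch of a symmetric contrastive (InfoNCE-style) loss.
# Matched image/audio pairs sit on the diagonal of the similarity matrix.
def contrastive_loss(image_embeds, audio_embeds, temperature=0.07):
    image_embeds = F.normalize(image_embeds, dim=-1)
    audio_embeds = F.normalize(audio_embeds, dim=-1)
    # Cosine similarities scaled by temperature: [batch, batch]
    logits = image_embeds @ audio_embeds.T / temperature
    targets = torch.arange(logits.shape[0])
    # Symmetric cross-entropy: align images to audio and audio to images.
    return (F.cross_entropy(logits, targets)
            + F.cross_entropy(logits.T, targets)) / 2

loss = contrastive_loss(torch.randn(4, 512), torch.randn(4, 512))
```

Minimizing this loss pulls each matched pair toward the diagonal while pushing the off-diagonal mismatches apart.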

  • Linear Projection: Maps modality-specific features to a common vector dimension.
  • Modality Embeddings: Special tokens added to the sequence to help the model identify if a token came from text, vision, or audio.
  • Cross-Attention Layers: Allow one modality to attend to the features of another during the encoding process.
  • Bottleneck Adapters: Small learnable modules that fine-tune a pre-trained modality-specific encoder for a multimodal task.
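The modality-embedding idea from the list above can be sketched as a small lookup table of one learned vector per modality, added to every token from that modality before the sequences are concatenated. The three-modality setup and dimensions here are assumptions for illustration:

```python
import torch
import torch.nn as nn

# Sketch of modality embeddings: one learned vector per modality,
# broadcast-added to every token from that modality.
class ModalityEmbedding(nn.Module):
    def __init__(self, num_modalities=3, embed_dim=512):
        super().__init__()
        self.table = nn.Embedding(num_modalities, embed_dim)

    def forward(self, tokens, modality_id):
        # tokens: [batch, seq_len, embed_dim]
        # modality_id: 0 = text, 1 = vision, 2 = audio (illustrative)
        return tokens + self.table(torch.tensor(modality_id))

tagger = ModalityEmbedding()
vision_tokens = tagger(torch.randn(1, 196, 512), modality_id=1)
audio_tokens = tagger(torch.randn(1, 1000, 512), modality_id=2)
fused = torch.cat([vision_tokens, audio_tokens], dim=1)  # [1, 1196, 512]
```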

The ultimate goal of this alignment is to create a model that is truly modality-agnostic. In such a system, the internal layers of the Transformer do not care where the token originated. They simply process semantic concepts that have been successfully extracted from the raw sensory inputs.

Managing Context Window Constraints

One major bottleneck in multimodal processing is the length of the resulting token sequences. An image produces hundreds of tokens, and a short audio clip can produce thousands. When combined with text, these sequences can easily exceed the context limits of standard Transformer architectures.

To mitigate this, engineers use techniques such as Perceiver IO or attention-based pooling. These methods compress the large sequence of raw tokens into a smaller, fixed-size set of latent tokens before they enter the main processing blocks. This allows the model to ingest much longer multimodal inputs without a proportional increase in latency.
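A Perceiver-style bottleneck can be sketched as a single cross-attention layer in which a small set of learned latent vectors queries the long token sequence; all sizes here (64 latents, 512 dimensions, 4 heads) are illustrative:

```python
import torch
import torch.nn as nn

# Sketch of a Perceiver-style latent bottleneck: learned latents
# cross-attend to a long input sequence, yielding a fixed-length summary.
class LatentCompressor(nn.Module):
    def __init__(self, num_latents=64, embed_dim=512, num_heads=4):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(num_latents, embed_dim))
        self.cross_attn = nn.MultiheadAttention(
            embed_dim, num_heads, batch_first=True
        )

    def forward(self, tokens):
        # tokens: [batch, seq_len, embed_dim]; seq_len can be thousands
        batch = tokens.shape[0]
        queries = self.latents.unsqueeze(0).expand(batch, -1, -1)
        compressed, _ = self.cross_attn(queries, tokens, tokens)
        return compressed  # [batch, num_latents, embed_dim]

compressor = LatentCompressor()
summary = compressor(torch.randn(2, 3000, 512))  # 3000 tokens -> 64 latents
```

Downstream Transformer blocks then operate on the 64 latents, so their cost no longer depends on the raw input length.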

Systems Engineering for Multimodal Pipelines

Implementing these tokenization strategies in a production environment requires careful consideration of data throughput and hardware utilization. Preprocessing images and audio is computationally expensive and can become a bottleneck for high-speed inference. Offloading these tasks to dedicated hardware or optimized C++ kernels is often necessary.

Engineers must also consider the synchronization of different modalities. In video processing, audio tokens must be perfectly aligned with the corresponding image frames to ensure the model understands temporal causality. Misalignment by even a few milliseconds can significantly degrade the performance of tasks like lip-syncing or action recognition.

Finally, monitoring the quality of the unified embedding space is essential for long-term model health. As new data types or sensors are added to the system, the underlying alignment can shift, leading to modality collapse. Robust evaluation suites that test cross-modal retrieval are the best defense against these architectural regressions.
