
LLM Architecture

Mapping Text to Vectors: Advanced Tokenization and Embedding Techniques

Learn how Byte-Pair Encoding (BPE) and high-dimensional vector spaces allow models to process semantic relationships between discrete text units.

AI & ML · Advanced · 12 min read

The Bridge Between Discrete Text and Continuous Math

Computers operate exclusively on numerical data, yet human language is discrete and symbolic. To bridge this gap, Large Language Models utilize a process that converts raw strings into a structured numerical format. This process ensures that the model can mathematically manipulate language while preserving the nuanced relationships between different words.

A common misconception is that models read words the way humans do. In reality, the first stage of any LLM pipeline is to break a string of text into smaller units called tokens. These tokens serve as the atomic building blocks for all subsequent processing within the neural network architecture.

Historically, engineers tried mapping every unique word in a dictionary to a specific integer. However, this approach failed when the model encountered new words or variations, such as plurals and verb tenses. Modern architectures solve this by utilizing subword tokenization, which allows the model to handle an infinite variety of text using a finite and manageable vocabulary.

The Granularity Problem

If we use a character-level approach, the model must process a massive number of steps just to understand a single sentence. This increases computational costs and makes it harder for the model to capture long-range dependencies between distant concepts. Conversely, word-level approaches suffer from the Out-of-Vocabulary (OOV) problem, where any word not seen during training is collapsed into a generic unknown token.

Subword tokenization addresses these constraints by breaking rare words into common fragments while keeping frequent words intact. For example, a specialized technical term might be broken into three smaller, recognizable parts. This ensures that the model always has a numerical representation for any input, regardless of whether it appeared in the original training set.
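To make this concrete, here is a minimal sketch of greedy longest-match segmentation against a toy vocabulary. The vocabulary and word choices are illustrative assumptions; this longest-match style is closer to WordPiece, while BPE, described next, derives its splits from learned merge rules, but the resulting fragments look similar.

```python
# Hypothetical toy vocabulary; real tokenizers learn tens of thousands of subwords,
# always including every single character as a fallback.
vocab = {"token", "ization", "un", "break", "able",
         "t", "o", "k", "e", "n", "i", "z", "a", "b", "l", "r", "u"}

def segment(word, vocab):
    """Greedy longest-match segmentation of a word into subword units."""
    pieces = []
    i = 0
    while i < len(word):
        # Try the longest possible substring first, shrinking until a match is found
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                pieces.append(word[i:j])
                i = j
                break
        else:
            # Fall back to a single character if nothing matches
            pieces.append(word[i])
            i += 1
    return pieces

print(segment("tokenization", vocab))  # ['token', 'ization']
print(segment("unbreakable", vocab))   # ['un', 'break', 'able']
```

Because every individual character is in the vocabulary, segmentation can never fail, which is exactly how subword tokenizers guarantee a representation for any input.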

Byte-Pair Encoding and Vocabulary Construction

Byte-Pair Encoding, or BPE, is the standard algorithm used by models like GPT-4 and Llama to build their vocabularies. The algorithm starts at the character level and iteratively merges the most frequently occurring adjacent pairs of symbols into a single new symbol. This data-driven approach allows the vocabulary to adapt specifically to the patterns found in the training corpus.

During the training of a tokenizer, the BPE algorithm builds a merge table that dictates how characters should be combined. As the vocabulary size increases, the model can represent complex concepts with fewer tokens, which increases the context window efficiency. However, a larger vocabulary also requires more memory in the model's final output layer, representing a significant architectural trade-off.

Simplified BPE merge logic:

```python
from collections import Counter

def get_stats(ids):
    # Count the frequency of consecutive pairs
    counts = Counter()
    for pair in zip(ids, ids[1:]):
        counts[pair] += 1
    return counts

def merge_tokens(ids, pair, new_id):
    # Replace all occurrences of the given pair with a new token ID
    new_ids = []
    i = 0
    while i < len(ids):
        if i < len(ids) - 1 and (ids[i], ids[i+1]) == pair:
            new_ids.append(new_id)
            i += 2
        else:
            new_ids.append(ids[i])
            i += 1
    return new_ids

# Example of one iteration of BPE merging
current_ids = [1, 2, 3, 1, 2, 4]
top_pair = (1, 2)
new_token_id = 5
updated_ids = merge_tokens(current_ids, top_pair, new_token_id)
# updated_ids will be [5, 3, 5, 4]
```

The efficiency of BPE is most visible when dealing with code or structured data. Since common syntax patterns like function definitions or brackets appear frequently, the tokenizer merges them into single tokens. This allows a model to represent a complex block of logic in a fraction of the space required by a character-based system.
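The single merge step generalizes to a full training loop: repeatedly count adjacent pairs and fuse the most frequent one until the desired number of merges is reached. A self-contained sketch, with toy token IDs and merge counts chosen for illustration:

```python
from collections import Counter

def train_bpe(ids, num_merges, first_new_id):
    """Learn a merge table by repeatedly fusing the most frequent adjacent pair."""
    merges = {}  # (id_a, id_b) -> new_id, recorded in the order learned
    for step in range(num_merges):
        counts = Counter(zip(ids, ids[1:]))
        if not counts:
            break
        pair = counts.most_common(1)[0][0]
        new_id = first_new_id + step
        merges[pair] = new_id
        # Replace every occurrence of the pair with the new token ID
        out, i = [], 0
        while i < len(ids):
            if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
                out.append(new_id)
                i += 2
            else:
                out.append(ids[i])
                i += 1
        ids = out
    return ids, merges

# Encode a tiny corpus as byte-like IDs: a=1, b=2, space=3
ids, merges = train_bpe([1, 2, 1, 2, 3, 1, 2], num_merges=2, first_new_id=4)
print(ids)     # [5, 3, 4]
print(merges)  # {(1, 2): 4, (4, 4): 5}
```

The `merges` dictionary is the merge table described above: at inference time the tokenizer replays these rules in order to encode new text.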

Tokenizer Trade-offs

When selecting a tokenizer configuration, engineers must balance the size of the vocabulary against the average number of tokens per document. A small vocabulary produces long sequences that quickly consume the model's fixed context window, while a very large vocabulary inflates the embedding matrix and output layer. Most modern models settle on a vocabulary size between 32,000 and 100,000 tokens.

  • Vocabulary Size: Larger vocabularies improve compression but increase model parameters and memory usage.
  • Sequence Length: Efficient tokenization allows more information to fit within the fixed context window of the Transformer.
  • Morphological Awareness: Subword units allow the model to generalize meanings across related words like run and running.
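These trade-offs can be quantified with a back-of-the-envelope calculation; the hidden dimension and fp16 precision below are illustrative assumptions:

```python
def embedding_params(vocab_size, hidden_dim):
    """Parameter count of the embedding matrix (one row per token)."""
    return vocab_size * hidden_dim

for vocab in (32_000, 100_000):
    params = embedding_params(vocab, 4096)
    # Two bytes per parameter at fp16 precision
    print(f"vocab {vocab:>7,}: {params:,} params, {params * 2 / 1e9:.2f} GB at fp16")
```

Tripling the vocabulary roughly triples the memory cost of the embedding matrix, which is why vocabulary growth must buy a meaningful reduction in sequence length to be worthwhile.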

Mapping Symbols to High-Dimensional Vector Space

Once text is converted into token IDs, the model must transform these integers into a format suitable for neural computation. This is the role of the Embedding Layer, which acts as a massive lookup table. Each unique token ID is mapped to a high-dimensional vector, often consisting of thousands of floating-point numbers.

These vectors represent the initial semantic meaning of the token before it has been influenced by its context. In this high-dimensional space, the distance and direction between vectors correspond to logical relationships. For example, the vector representing a programming language might be closer to the vector for software than it is to the vector for biology.
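This notion of closeness is typically measured with cosine similarity. A minimal sketch with hand-picked 4-dimensional toy vectors (real embeddings are learned and have thousands of dimensions):

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hand-crafted illustrative vectors, not outputs of a real model
python_vec   = [0.9, 0.8, 0.1, 0.0]
software_vec = [0.8, 0.9, 0.2, 0.1]
biology_vec  = [0.1, 0.2, 0.9, 0.8]

print(cosine_similarity(python_vec, software_vec))  # high, ~0.99
print(cosine_similarity(python_vec, biology_vec))   # low, ~0.23
```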

Embeddings are not just arbitrary numbers; they are the learned coordinates of human concepts within a mathematical landscape where proximity defines meaning.

The dimensionality of these embeddings is a critical hyperparameter. In smaller models, a dimension of 768 might be used, whereas state-of-the-art models often use dimensions exceeding 4096. Higher dimensionality allows the model to capture more subtle nuances and complex interactions between tokens, but it also increases the computational overhead of every matrix multiplication in the network.

The Mechanics of Vector Lookups

In practice, the embedding layer is a matrix of shape vocabulary size by hidden dimension. When a token ID is passed to the layer, the system performs a row-lookup operation to retrieve the corresponding vector. This operation is mathematically equivalent to multiplying a one-hot encoded vector by the weight matrix, but it is optimized in hardware for performance.

PyTorch embedding layer implementation:

```python
import torch
import torch.nn as nn

# Define vocabulary size and embedding dimension
vocab_size = 50000
embedding_dim = 4096

# Initialize the embedding layer with random weights
# These weights are learned during the training process
embedding_layer = nn.Embedding(vocab_size, embedding_dim)

# A batch of sequences (3 sentences, each with 5 tokens)
input_indices = torch.randint(0, vocab_size, (3, 5))

# Look up the continuous vector representations
# Output shape: (batch_size, sequence_length, embedding_dim)
vector_representations = embedding_layer(input_indices)

print(f"Output tensor shape: {vector_representations.shape}")
```
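The claim that a row lookup equals a one-hot matrix multiplication can be verified directly with toy sizes, in pure Python for clarity:

```python
# Toy embedding matrix: 4-token vocabulary, 3-dimensional embeddings
W = [
    [0.1, 0.2, 0.3],
    [0.4, 0.5, 0.6],
    [0.7, 0.8, 0.9],
    [1.0, 1.1, 1.2],
]

def one_hot_matmul(token_id, W):
    """Multiply a one-hot row vector by the embedding matrix."""
    one_hot = [1.0 if i == token_id else 0.0 for i in range(len(W))]
    dim = len(W[0])
    return [sum(one_hot[i] * W[i][d] for i in range(len(W))) for d in range(dim)]

# The product zeroes out every row except one, so a direct lookup is identical
assert one_hot_matmul(2, W) == W[2]  # both give [0.7, 0.8, 0.9]
```

Frameworks exploit this equivalence by skipping the multiplication entirely and indexing the weight matrix directly.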

Semantic Manifolds and Dimensionality Scaling

The power of embedding vectors lies in their ability to organize information geometrically. During training, the model adjusts these vectors so that tokens used in similar contexts are pulled closer together. This creates a semantic manifold where the model can perform vector arithmetic to solve analogies or identify synonyms.
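A classic illustration is analogy arithmetic of the form king − man + woman ≈ queen. The hand-crafted 2-dimensional toy vectors below encode (royalty, gender) explicitly; real embedding spaces learn such directions implicitly during training:

```python
# Illustrative coordinates: first axis = royalty, second axis = gender
vectors = {
    "king":  [1.0, 1.0],
    "queen": [1.0, -1.0],
    "man":   [0.0, 1.0],
    "woman": [0.0, -1.0],
}

def add(u, v):
    return [a + b for a, b in zip(u, v)]

def sub(u, v):
    return [a - b for a, b in zip(u, v)]

# king - man + woman lands exactly on queen's coordinates
result = add(sub(vectors["king"], vectors["man"]), vectors["woman"])
print(result)  # [1.0, -1.0], the vector for "queen"
```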

As models scale, the density of this vector space becomes a bottleneck. If the dimensionality is too low, the model suffers from representational collapse, where distinct concepts are forced into the same coordinates. Conversely, excessive dimensionality can lead to overfitting, where the model memorizes specific training examples rather than learning general linguistic patterns.

Modern architectures often scale the embedding dimension proportionally with the number of layers and attention heads. This ensures that the information bottleneck remains consistent throughout the network. Engineers must also consider the initialization of these vectors, as poor starting values can lead to vanishing gradients during the early stages of training.

Standardization and Layer Norm

Raw embeddings often have high variance, which can destabilize the training of deep networks. To counter this, models frequently apply Layer Normalization or Root Mean Square Layer Normalization immediately after the embedding lookup. This keeps the activations within a predictable range and allows for faster convergence during backpropagation.
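A minimal sketch of RMS normalization (the learned per-dimension gain used in real implementations is omitted here):

```python
import math

def rms_norm(x, eps=1e-6):
    """Scale a vector so its root-mean-square is approximately 1."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]

raw = [4.0, -2.0, 8.0, -6.0]
normalized = rms_norm(raw)
rms_after = math.sqrt(sum(v * v for v in normalized) / len(normalized))
print(rms_after)  # ~1.0
```

Whatever scale the raw embeddings come in at, the normalized activations occupy a predictable range before entering the first Transformer block.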

Furthermore, some architectures share the weights between the input embedding layer and the final output projection layer. This technique, known as weight tying, reduces the total parameter count of the model significantly without a proportional loss in performance. It forces the model to use the same internal representation for both understanding and generating specific tokens.
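The savings from weight tying are easy to estimate; the sizes below are illustrative assumptions, not measurements of a specific model:

```python
def total_params(vocab_size, hidden_dim, other_params, tied):
    """Embedding + output projection + everything else, with or without tying."""
    embedding = vocab_size * hidden_dim
    # With tied weights, the output projection reuses the embedding matrix
    output_projection = 0 if tied else vocab_size * hidden_dim
    return embedding + output_projection + other_params

# Illustrative sizes: 50k vocabulary, 4096-dim embeddings, 6B remaining parameters
untied = total_params(50_000, 4096, 6_000_000_000, tied=False)
tied   = total_params(50_000, 4096, 6_000_000_000, tied=True)
print(f"saved {untied - tied:,} parameters")  # saved 204,800,000 parameters
```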
