
Decentralized Storage

Architecting Distributed Data with Merkle Directed Acyclic Graphs

Explore how Merkle DAGs enable efficient data verification, deduplication, and versioning across decentralized peer-to-peer storage networks.

Blockchain · Intermediate · 12 min read

The Evolution of Data Addressing

In the traditional web architecture, we rely heavily on location-based addressing. When you request a file via a URL, you are essentially telling the network to go to a specific IP address and look for a file at a specific path. This creates a single point of failure and makes the link fragile if the server moves or the file is renamed.

Decentralized storage flips this model by using content-based addressing. Instead of asking where a piece of data is located, the network asks what the data is. This is achieved through cryptographic hashing, where the unique digital fingerprint of the data becomes its address, ensuring that the same content always yields the same identifier regardless of its physical location.

The shift from location to content allows for a more resilient and distributed web. Multiple nodes can serve the same file because the client can verify the integrity of the data using the hash provided in the request. This eliminates the need for a central authority to validate the authenticity of the information being retrieved.

Location-based addressing tells you where to find a book in a library. Content-based addressing is like asking for the book by its ISBN, which remains constant no matter which shelf it sits on.

By identifying data by its content, we also gain the ability to cache and distribute files more efficiently. If ten users in the same local network request the same popular video, the network can serve it from the nearest peer rather than reaching back to a distant central data center. This reduces latency and conserves global bandwidth.

The Content Identifier (CID) Framework

At the heart of content addressing is the Content Identifier, or CID. A CID is a self-describing string that points to a specific piece of data in a decentralized network. It does not just contain the hash, but also metadata about how the hash was generated and how to interpret the underlying data.

Modern CIDs utilize a standard known as Multihash, which allows the network to be future-proof. If a specific hashing algorithm like SHA-256 is ever broken, the CID format allows the system to transition to a more secure algorithm without breaking the existing structure of the network nodes.

  • Multibase: Specifies the encoding used for the CID string, such as Base32 or Base58.
  • CID Version: Indicates whether the CID follows the older V0 or the modern V1 standard.
  • Multicodec: Defines the format of the data being linked, such as Protobuf or Raw binary.
  • Multihash: Contains the hash function code, the length of the hash, and the actual digest.
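Putting these parts together, a CIDv1 can be sketched in a few lines. The byte codes below (0x01 for version 1, 0x55 for raw data, 0x12 for SHA-256) follow the public multicodec tables, but this is a simplified illustration rather than a production CID implementation:

```python
import base64
import hashlib

def make_cid_v1(data: bytes) -> bytes:
    # Multihash: hash function code, digest length, then the digest itself.
    digest = hashlib.sha256(data).digest()
    multihash = bytes([0x12, len(digest)]) + digest
    # Prepend the CID version and the codec for raw binary data.
    return bytes([0x01, 0x55]) + multihash

cid = make_cid_v1(b"hello decentralized world")

# Multibase: the leading 'b' marks lowercase base32 encoding of the bytes.
cid_str = "b" + base64.b32encode(cid).decode().lower().rstrip("=")

# The same content always produces the same identifier.
assert cid == make_cid_v1(b"hello decentralized world")
```

Because the hash function code travels inside the identifier, a node receiving this CID knows exactly how to re-derive and check the digest.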

Architecting Data with Merkle DAGs

While simple files can be identified by a single hash, large datasets require a more complex structure to be managed effectively. This is where the Directed Acyclic Graph, or DAG, becomes essential. A Merkle DAG is a specific type of data structure where every node is identified by its cryptographic hash.

Unlike a standard Merkle Tree used in blockchains like Bitcoin, a Merkle DAG is more flexible. In a tree, each node has a single parent, but in a DAG, a node can have multiple parents. This allows for complex data relationships and efficient data reuse across different parts of the storage network.

The acyclic nature of the graph ensures that there are no loops. You cannot have a node that eventually points back to itself through its children. This property is crucial for maintaining a clear hierarchy and ensuring that every path through the graph is finite and verifiable.

Conceptual DAG Node Structure

```javascript
/*
 * A simplified representation of a DAG node.
 * In a real system, these are serialized using Protobuf or CBOR.
 */
const dagNode = {
  data: Buffer.from('This is a fragment of a larger file.'),
  links: [
    {
      name: 'next_chunk',
      cid: 'QmXoypizjW3WknFiJnKLwHCnL72vedxjQkDDP1mXWo6uco',
      size: 4096
    }
  ]
};

// The CID of this node is generated by hashing its data and its links.
const nodeCid = generateCid(dagNode);
```

In a Merkle DAG, the hash of a parent node depends on the hashes of all its child nodes. If even a single bit of data changes in a leaf node at the bottom of the graph, the hash of that node changes. This change propagates up the graph, resulting in a completely different root hash.
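This propagation is easy to demonstrate. If a node's identifier is a hash over its own data plus the CIDs of its children, changing a single leaf forces a new root. The `node_cid` helper below is a simplified stand-in for real DAG serialization:

```python
import hashlib

def node_cid(data: bytes, links: list) -> str:
    # A node's identifier covers both its data and its child CIDs,
    # so any change below propagates all the way to the root.
    h = hashlib.sha256(data)
    for child in links:
        h.update(child.encode())
    return h.hexdigest()

leaf_a = node_cid(b"chunk one", [])
leaf_b = node_cid(b"chunk two", [])
root = node_cid(b"", [leaf_a, leaf_b])

# Change one byte in a leaf: its CID changes, and so does the root's.
leaf_a2 = node_cid(b"chunk One", [])
root2 = node_cid(b"", [leaf_a2, leaf_b])
assert leaf_a != leaf_a2 and root != root2
```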

Immutability and Verification

The mathematical properties of the Merkle DAG provide built-in immutability. Since any change to the data would change the CID, you can be certain that the data you receive is exactly what you requested. If a malicious peer tries to serve tampered data, the hash verification will fail immediately.

Verification can also be performed partially. In a decentralized environment, you might not want to download a 10GB file just to verify one small piece of it. The DAG allows you to verify a specific branch of the graph by only fetching the necessary intermediate hashes and the target data block.

This partial verification is the foundation of efficient light clients. These nodes can interact with the network and verify the state of the data without having to store the entire dataset. It creates a trustless environment where every byte of data can be proven to belong to the original root hash.
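A branch proof can be sketched as follows: the client fetches only the nodes along one path and checks that each node hashes to a CID its parent vouches for. The `verify_branch` helper is a hypothetical simplification; real systems also serialize nodes canonically and match links by name:

```python
import hashlib

def node_cid(data: bytes, links: list) -> str:
    h = hashlib.sha256(data)
    for child in links:
        h.update(child.encode())
    return h.hexdigest()

def verify_branch(root_cid: str, path: list) -> bool:
    # path is a list of (data, links) nodes from the root to the target.
    # The first node must hash to the trusted root CID, and every child
    # must hash to a CID listed in its parent's links.
    if node_cid(*path[0]) != root_cid:
        return False
    for parent, child in zip(path, path[1:]):
        if node_cid(*child) not in parent[1]:
            return False
    return True

leaf = (b"target block", [])
sibling_cid = node_cid(b"other block", [])  # never downloaded, only referenced
root = (b"", [node_cid(*leaf), sibling_cid])
root_cid = node_cid(*root)

assert verify_branch(root_cid, [root, leaf])
assert not verify_branch(root_cid, [root, (b"tampered", [])])
```

Note that the sibling subtree is represented only by its CID: the verifier proves the target block belongs to the root without fetching anything else.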

Efficiency Through Deduplication

One of the most powerful features of Merkle DAGs is automatic data deduplication. Because addresses are derived from content, two identical blocks of data will always have the same CID. In a global decentralized network, this means the system only needs to store one copy of any unique block, regardless of how many files reference it.

Consider a scenario where a software developer updates a library used by thousands of applications. In a centralized system, every application would have its own copy of that library. In a Merkle DAG-based system, the storage network identifies that the blocks representing the library are already present and simply creates links to them.

This efficiency extends to version control systems. When you modify a large file, most of the data blocks remain the same. The Merkle DAG for the new version will reuse all the unchanged blocks from the previous version, only creating new nodes for the modified sections and their parent path to the root.

Deduplication Simulation

```python
# Simulating how DAGs handle identical content
import hashlib

def get_cid(data):
    return hashlib.sha256(data.encode()).hexdigest()

# Two different files containing the same block of text
file_a_block = "This is a shared data block."
file_b_block = "This is a shared data block."

# The network sees them as the same identifier
if get_cid(file_a_block) == get_cid(file_b_block):
    print(f"Storage saved! Shared CID: {get_cid(file_a_block)}")
else:
    print("Storing unique blocks.")
```

Deduplication significantly reduces the overhead of data synchronization. When syncing a folder between two peers, the receiving peer only needs to request the blocks it does not already have. By comparing CIDs, the system can precisely identify the missing fragments without scanning the entire contents of every file.
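The sync logic reduces to a set difference over CIDs. A minimal sketch, with the block contents invented for illustration:

```python
import hashlib

def cid(block: bytes) -> str:
    return hashlib.sha256(block).hexdigest()

# The sender's blocks, indexed by CID.
sender_blocks = {cid(b): b for b in [b"header", b"shared body", b"new footer"]}

# The receiver advertises the CIDs it already stores.
receiver_have = {cid(b"header"), cid(b"shared body")}

# Only the missing blocks cross the wire.
missing = set(sender_blocks) - receiver_have
to_send = [sender_blocks[c] for c in missing]
assert to_send == [b"new footer"]
```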

Chunking Strategies

The effectiveness of deduplication depends heavily on how data is sliced into blocks, a process known as chunking. Fixed-size chunking is simple but fragile; adding a single byte at the beginning of a file shifts every subsequent boundary, producing entirely different hashes for every block that follows.

To solve this, many decentralized protocols use Content-Defined Chunking. Algorithms like Rabin fingerprinting use a sliding window to find natural break points in the data stream. If a byte is added at the start, only the first few blocks change, while the rest of the file maintains its original block boundaries and CIDs.

Choosing the right chunking strategy is a trade-off between computational overhead and deduplication efficiency. Content-defined chunking requires more CPU cycles to process but yields significantly better results for versioned data and large datasets with internal repetition.
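The stability property can be shown with a deliberately simplified chunker. Real content-defined chunking hashes a sliding window (for example, Rabin fingerprints) and cuts where the hash matches a bitmask; the toy below cuts after a fixed delimiter byte, which exhibits the same boundary behavior:

```python
def chunk_content_defined(data: bytes, delim: int = 0xFF) -> list:
    # Boundaries depend only on local content (here, a delimiter byte),
    # so an insertion at the front cannot shift later boundaries.
    chunks, start = [], 0
    for i, b in enumerate(data):
        if b == delim:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])
    return chunks

def chunk_fixed(data: bytes, size: int = 4) -> list:
    # Boundaries depend only on position, so every block shifts.
    return [data[i:i + size] for i in range(0, len(data), size)]

data = b"aaaa\xffbbbb\xffcccc"
shifted = b"X" + data  # one byte inserted at the front

# Fixed-size: no block survives the one-byte shift.
assert not set(chunk_fixed(data)) & set(chunk_fixed(shifted))

# Content-defined: only the first chunk changes; the rest keep their CIDs.
assert chunk_content_defined(data)[1:] == chunk_content_defined(shifted)[1:]
```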

Data Persistence and Availability

A common misconception is that storing data in a Merkle DAG automatically makes it permanent. While the DAG ensures integrity, the actual availability of the data depends on the nodes in the peer-to-peer network. If no node is hosting a specific block, that part of the DAG becomes inaccessible, making the entire file unrecoverable.

To ensure long-term persistence, decentralized networks use incentive layers or pinning services. Pinning is the act of telling a node to keep a specific CID and all its recursive children in its local storage and never delete them during garbage collection. This guarantees that the data remains available to the network as long as that node is online.

Incentive-based systems like Filecoin add a layer of cryptographic proofs to this process. Storage providers must periodically provide Proof-of-Spacetime to prove they are still hosting the data they committed to store. These proofs are recorded on a blockchain, ensuring a high level of accountability for the storage provider.

The CID is the address of the data, but pinning is the lease that ensures the building still stands at that address.

Distributed Hash Tables are used to locate which nodes are currently hosting specific CIDs. When you request a file, your node queries the DHT to find peers who have the blocks you need. This decentralized index allows the network to scale to millions of nodes without a central directory.

Garbage Collection and Performance

Because Merkle DAGs are immutable, nodes often accumulate a large amount of data that is no longer being actively used. Garbage collection is the process of deleting unpinned blocks to free up local storage space. Nodes must be careful not to delete blocks that are still being referenced by active DAGs.
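Conceptually, this is a mark-and-sweep over the DAG: walk from every pinned root, mark everything reachable, and delete the rest. A minimal sketch with made-up CIDs standing in for real identifiers:

```python
def reachable(roots: set, links: dict) -> set:
    # links maps each CID in the local store to the CIDs of its children.
    seen, stack = set(), list(roots)
    while stack:
        c = stack.pop()
        if c not in seen:
            seen.add(c)
            stack.extend(links.get(c, []))
    return seen

store = {"root": ["a", "b"], "a": [], "b": ["c"], "c": [],
         "orphan": ["d"], "d": []}
pinned = {"root"}

# Anything not reachable from a pinned root is safe to collect.
garbage = set(store) - reachable(pinned, store)
assert garbage == {"orphan", "d"}
```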

The performance of a decentralized storage system is often determined by the speed of graph traversal. If a file is split into thousands of small blocks, fetching the file requires multiple round-trips to find and download each node in the hierarchy. Optimizing the DAG structure for depth versus breadth can significantly impact retrieval speeds.

Modern protocols implement pre-fetching and batching to mitigate these latencies. By analyzing the DAG structure, a client can predict which blocks will be needed next and request them in parallel from multiple peers, effectively saturating the available network bandwidth for faster downloads.
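One way to picture this batching is a level-by-level traversal that requests each frontier of links in parallel instead of one round-trip per block. `fetch_block` below is a stand-in for a real network call to a peer:

```python
from concurrent.futures import ThreadPoolExecutor

# A mock block store: each CID maps to the CIDs of its children.
BLOCKS = {"root": ["a", "b", "c"], "a": [], "b": [], "c": []}

def fetch_block(cid: str) -> list:
    # Stand-in for a network request; returns the block's links.
    return BLOCKS[cid]

def fetch_dag(root: str) -> dict:
    fetched = {}
    with ThreadPoolExecutor(max_workers=8) as pool:
        frontier = [root]
        while frontier:
            # Request the whole frontier in parallel, one level at a time.
            results = list(pool.map(fetch_block, frontier))
            next_frontier = []
            for cid, links in zip(frontier, results):
                fetched[cid] = links
                next_frontier.extend(l for l in links if l not in fetched)
            frontier = next_frontier
    return fetched

assert set(fetch_dag("root")) == {"root", "a", "b", "c"}
```

Here all three children of the root are fetched in a single parallel round rather than three sequential ones; deeper DAGs compound the savings.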
