
Graph Databases

Querying Interconnected Data with Cypher and SQL/PGQ

Master pattern-matching queries using the Cypher language and explore the new SQL:2023 standards for querying property graphs inside relational databases.

Databases · Intermediate · 15 min read

The Evolution of Connected Data

Relational databases have been the industry standard for decades because they excel at maintaining structured records and transactional integrity. However, the architecture of a relational database relies heavily on fixed schemas and foreign key constraints to define how data points relate to one another. When applications require deep traversals across multiple tables, such as identifying connections in a fraud ring or building a recommendation engine, join performance degrades rapidly as the traversal depth grows.

The fundamental problem with the relational approach is that connections are computed at query time rather than being inherent to the data structure itself. Each join operation requires the database engine to scan or seek through indexes to find matching records in separate tables. As the depth of the search increases from two levels to five or ten, the computational overhead becomes a bottleneck that limits real-time responsiveness.

Graph databases introduce a paradigm shift by treating relationships as first-class citizens. In this model, data is stored as nodes representing entities and edges representing the relationships between them. Both nodes and edges can contain properties, which are key-value pairs that provide additional context. This structure allows the database to traverse relationships by simply following pointers in memory, a technique often referred to as index-free adjacency.

Index-free adjacency is the secret sauce of graph performance. By storing the physical memory addresses of related nodes directly within the record, the engine avoids the need for expensive index lookups during deep traversals.

This architectural difference means that query performance in a graph database is dependent on the size of the subgraph being explored rather than the total volume of data stored in the system. Whether your database contains one million or one billion records, the time it takes to traverse from one specific user to their direct friends remains virtually constant. This predictable performance makes graph databases the ideal choice for modern, highly connected applications.

The Join Explosion Problem

In a traditional SQL environment, finding friends of friends might involve joining a single table to itself twice. While this is manageable for small datasets, adding more levels of depth creates a geometric increase in the number of rows the engine must process. If the average user has one hundred connections, a three-level traversal involves searching through one million potential paths.
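As a sketch of the self-join approach described above, a two-hop friends-of-friends lookup might look like the following (the table and column names are illustrative, not from any particular schema):

```sql
-- Friends of friends via a self-join.
-- Hypothetical schema: friendships(user_id, friend_id)
SELECT DISTINCT f2.friend_id
FROM friendships AS f1
JOIN friendships AS f2
  ON f2.user_id = f1.friend_id   -- second hop
WHERE f1.user_id = 123           -- starting user
  AND f2.friend_id <> 123;       -- exclude the starting user
```

Each additional level of depth adds another self-join, which is where the row counts the engine must process begin to explode.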

Developers often try to mitigate this by denormalizing data or using complex caching layers, but these solutions introduce data consistency risks and architectural complexity. Graph databases eliminate the need for these workarounds by providing a native way to navigate many-to-many relationships without the overhead of repetitive index scans or nested loops.

Mental Models: From Tables to Patterns

Shifting to a graph mindset requires moving away from thinking about sets of rows and instead focusing on patterns of connectivity. Instead of asking which records in table A share an ID with table B, you ask which path exists between two specific entities. This pattern-based thinking is more intuitive for developers because it mirrors how we naturally visualize networks and flowcharts.

The property graph model is the most common implementation in modern development. It allows you to assign labels to nodes to categorize them and types to edges to describe the nature of the connection. By attaching properties to the edges themselves, you can store metadata such as the weight of a connection or the timestamp of an interaction, which simplifies complex analytical queries.
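As an illustrative Cypher sketch of the property graph model (the labels, relationship type, and properties here are hypothetical), a labeled node, a typed edge, and edge-level metadata can be created in a single statement:

```cypher
// A Customer node, a Product node, and a PURCHASED edge that
// carries its own properties (amount and timestamp)
CREATE (c:Customer {name: 'Ada'})
       -[:PURCHASED {amount: 42.50, at: datetime('2024-05-01T10:00:00')}]->
       (p:Product {name: 'Mechanical Keyboard'})
```

Note that the amount and timestamp live on the relationship itself, not on either node, which is exactly the edge-property capability described above.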

Mastering Pattern Matching with Cypher

Cypher is a declarative query language specifically designed to interact with property graphs. It was created to be easily readable and expressive, using a visual syntax that resembles the connections it queries. Instead of verbose select and join statements, Cypher uses ASCII-art style syntax to describe the paths you want to find within your data.

The core of a Cypher query is the match clause, which allows you to define a template of the relationship structure you are looking for. Parentheses are used to represent nodes, while brackets and arrows represent relationships. This direct mapping between the query syntax and the visual representation of the graph makes it easier for developers to design and debug complex queries.

Cypher — Recommendation Engine Query

```cypher
// Find products purchased by friends that the user has not bought yet
MATCH (user:Customer {id: 'c-123'})-[:FRIEND]->(friend)-[:PURCHASED]->(product:Product)
WHERE NOT (user)-[:PURCHASED]->(product)
RETURN product.name, COUNT(*) AS recommendation_score
ORDER BY recommendation_score DESC
LIMIT 5;
```

In the code example above, we traverse two relationship hops across three sets of nodes in a single readable line. We start with a specific customer, follow their friendship links to other customers, and then find the products those friends have bought. The query then filters out products the original user already owns, effectively creating a simple but powerful collaborative filtering recommendation engine.

Beyond simple traversals, Cypher excels at variable-length path queries. This allows you to find connections of an unknown depth, such as finding the shortest path between two people in a social network or identifying all dependencies in a complex microservices architecture. Handling this in SQL would typically require recursive common table expressions, which are significantly harder to write and optimize.
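A hedged sketch of a shortest-path query follows; the shortestPath function shown here is Neo4j-style syntax and the node ids are illustrative, so exact spelling varies by engine:

```cypher
// Find the shortest chain of KNOWS relationships (up to 10 hops)
// between two people, traversing in either direction
MATCH p = shortestPath(
  (a:Person {id: 'p-1'})-[:KNOWS*..10]-(b:Person {id: 'p-2'})
)
RETURN length(p) AS hops, [n IN nodes(p) | n.name] AS chain;
```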

Advanced Filtering and Aggregation

Once a pattern is matched, Cypher provides a suite of tools for processing the results. You can use the where clause to apply predicates to node and relationship properties, much like in SQL. However, Cypher also supports list comprehension and map projections, allowing you to transform the resulting data into complex structures directly within the database layer.

Aggregation in Cypher is implicit based on the non-aggregated columns in your return statement. This removes the need for a group by clause, further simplifying the query structure. When you return a node and a count of its relationships, the engine automatically groups the results by the unique identity of that node.
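For example, the following sketch groups implicitly by the non-aggregated c.name expression, with no GROUP BY clause anywhere (labels and properties are illustrative):

```cypher
// Results are grouped by c.name automatically, because it is
// the only non-aggregated expression in the RETURN clause
MATCH (c:Customer)-[:PURCHASED]->(p:Product)
RETURN c.name, count(p) AS purchases
ORDER BY purchases DESC;
```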

Working with Variable Length Paths

Variable length paths allow you to search for connections that are any number of hops away. By using the asterisk symbol inside a relationship definition, you can specify a minimum and maximum depth for the search. This is particularly useful for detecting circular dependencies or auditing access control lists in an enterprise environment.

Cypher — Fraud Detection Loop

```cypher
// Detect potential money laundering by finding circular transfer paths:
// paths of 3 to 6 hops that start and end at the same account
MATCH path = (account:BankAccount)-[:TRANSFER*3..6]->(account)
WHERE account.id = 'acc-9988'
RETURN path;
```

This capability allows developers to uncover hidden patterns that are nearly impossible to detect using traditional relational queries. By identifying loops in financial transactions or identifying bottlenecks in logistics networks, businesses can gain deep insights into the operational health of their systems.

Bridging the Gap with SQL:2023

One of the most significant developments in the database industry is the inclusion of Property Graph Queries in the SQL:2023 standard. This update allows developers to use graph-style pattern matching directly within their existing relational databases. It acknowledges that while graph databases are powerful, many organizations have a massive investment in relational infrastructure that they cannot easily migrate.

The SQL/PGQ standard introduces a new sub-language that allows you to define a graph view over your existing tables. You can map your rows to nodes and your foreign keys to edges without duplicating any data. This gives you the best of both worlds: the transactional reliability and ecosystem of a relational database combined with the expressive power of graph traversals.

Using the GRAPH_TABLE operator and its MATCH clause within SQL, you can perform pattern matching that looks remarkably similar to Cypher. This convergent approach means that the skills developers learn in the graph world are becoming increasingly relevant in the relational world. It also simplifies the architecture of modern applications by reducing the need for specialized polyglot persistence layers.

  • Graph Definition: Use CREATE PROPERTY GRAPH to define nodes and edges over existing tables.
  • Pattern Matching: Use the GRAPH_TABLE function to execute match patterns within a standard SELECT statement.
  • Interoperability: Easily join graph query results with traditional relational data in a single result set.

However, it is important to understand the trade-offs. While SQL/PGQ provides the syntax of a graph database, the underlying execution engine may still rely on traditional joins. If the database engine has not been optimized with graph-specific storage structures like index-free adjacency, you may not see the same performance gains as you would with a native graph database like Neo4j.

Defining a Property Graph in SQL

To use graph features in SQL:2023, you must first define how your tables map to graph elements. You specify which tables represent vertices and which represent edges. For edge tables, you define the source and destination keys that link the tables together.

SQL — SQL:2023 Property Graph Definition

```sql
CREATE PROPERTY GRAPH social_graph
  VERTEX TABLES (Users, Products)
  EDGE TABLES (
    Purchases
      SOURCE KEY (user_id) REFERENCES Users (id)
      DESTINATION KEY (product_id) REFERENCES Products (id)
      LABEL bought
  );
```

Executing Graph Queries in SQL

Once the graph is defined, you can query it using the GRAPH_TABLE operator. This function takes the graph name and a match pattern as arguments and returns a set of columns that can be treated like any other table. This allows you to integrate graph-derived insights into your existing reporting and analytics workflows.
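As a sketch of what such a query might look like against the social_graph defined earlier (the IS label syntax follows the SQL/PGQ standard, but exact spelling varies by vendor, and the column names are illustrative):

```sql
-- Count purchases per product for one user, using graph pattern
-- matching inside an ordinary SELECT statement
SELECT product_name, COUNT(*) AS purchase_count
FROM GRAPH_TABLE (social_graph
       MATCH (u IS Users)-[IS bought]->(p IS Products)
       WHERE u.id = 'c-123'
       COLUMNS (p.name AS product_name)
     )
GROUP BY product_name;
```

The result of GRAPH_TABLE behaves like any other derived table, so it can be joined, grouped, and filtered with standard SQL.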

This approach is particularly valuable for incremental adoption. You can start by solving a specific connectivity problem using graph syntax while keeping the rest of your application logic firmly rooted in the relational world. As your needs evolve, you can decide whether to stick with the converged model or move to a dedicated graph engine.

Performance Tuning and Pitfalls

While graph databases offer massive performance benefits for connected data, they are not a silver bullet for every use case. One of the most common pitfalls is the supernode problem. A supernode is a node with an unusually high number of relationships, such as a celebrity in a social network or a major hub in a transportation system.

When a query reaches a supernode, the engine must traverse every single one of its connections to find the next step in the pattern. This can lead to significant latency and high memory consumption. To handle supernodes, developers often use techniques such as relationship indexing, which allows the engine to filter edges based on properties before traversing them.
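As an illustrative sketch of that mitigation, filtering on a relationship property before expanding can prune a supernode's edge list; the since property and ids here are hypothetical:

```cypher
// Only follow FOLLOWS edges created recently, rather than
// expanding all of a celebrity's millions of follower edges
MATCH (fan:Person)-[r:FOLLOWS]->(c:Person {id: 'celebrity-1'})
WHERE r.since >= date('2024-01-01')
RETURN fan.name
LIMIT 50;
```

Whether this filter is applied before or during expansion depends on the engine; databases that index relationship properties can avoid touching most of the supernode's edges at all.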

Memory management is another critical factor in graph performance. Because traversals happen in memory to ensure speed, large-scale queries that involve many intermediate paths can quickly exhaust the available heap space. It is essential to write specific queries that limit the depth and breadth of the search whenever possible.

Cypher — Optimizing with Limits

```cypher
// Avoid scanning too many paths by limiting the search space
MATCH (n:Person {id: 'p-55'})-[:FOLLOWS*1..3]->(m:Person)
WITH m, count(m) AS influence
ORDER BY influence DESC
LIMIT 100
RETURN m.name, influence;
```

By using the with clause in Cypher, you can pipe the results of one part of a query into the next, effectively creating a processing pipeline. This allows you to aggregate or filter data early in the execution process, reducing the amount of data that needs to be carried through to the final stages of the query.

Indexing in a Graph World

Even though graph databases use index-free adjacency for traversals, they still rely on traditional indexes to find the starting points for those traversals. Without an index on a property like a user ID or an email address, the engine would have to perform a full label scan to find the initial node.

Developers should proactively create indexes on properties that are frequently used in where clauses or as entry points for queries. Most modern graph databases support range indexes, full-text indexes, and even spatial indexes to support a wide variety of search patterns.
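In Neo4j-style Cypher, for instance, such entry-point indexes might be created as follows (the index and property names are illustrative):

```cypher
// Range index for exact and range lookups on User.email
CREATE INDEX user_email_idx IF NOT EXISTS
FOR (u:User) ON (u.email);

// Full-text index for fuzzy search across product names
CREATE FULLTEXT INDEX product_name_ft IF NOT EXISTS
FOR (p:Product) ON EACH [p.name];
```

These indexes only locate the starting nodes of a traversal; once the engine is inside the graph, it follows adjacency directly rather than consulting the index again.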

Choosing Between Graph and Relational

The decision to use a graph database should be driven by the shape of your data and the nature of your queries. If your data is largely disconnected and your primary use cases involve simple lookups and aggregations over flat tables, a relational database is likely more efficient.

However, if your business logic depends on understanding the relationships between entities, or if your schema is constantly evolving to include new types of connections, the graph model provides the flexibility and performance you need. Modern architectures often use a hybrid approach, using relational databases for transactional systems of record and graph databases for specialized analytical or social features.
