
Auto-Scaling Systems

Implementing Event-Driven Scaling with KEDA and Prometheus

Deploy advanced scaling patterns like scale-to-zero by integrating Kubernetes Event-Driven Autoscaling (KEDA) with external custom metric sources.

Cloud & Infrastructure · Intermediate · 12 min read

Beyond Reactive Infrastructure: The Shift to Event-Driven Scaling

Traditional cloud infrastructure relies heavily on resource metrics like CPU and memory utilization to decide when to scale. While this approach works for predictable workloads, it is inherently reactive because resource spikes often occur after the system is already under stress. By the time a new instance is provisioned and ready to serve traffic, the user experience has likely already suffered from increased latency or timeouts.

Modern distributed systems require a more proactive approach that looks at the source of the work rather than the symptoms of the load. This is the core philosophy behind event-driven scaling, where the infrastructure reacts to the actual pressure within a system, such as the number of messages waiting in a queue or the count of active connections. By shifting the focus to these upstream signals, engineers can ensure that capacity is available exactly when demand arrives.

One of the most powerful tools in the Kubernetes ecosystem for implementing this pattern is Kubernetes Event-Driven Autoscaling, commonly known as KEDA. It acts as a lightweight controller that extends the native Horizontal Pod Autoscaler by providing custom metrics based on external event sources. This allows developers to define scaling rules based on real-time application state rather than just hardware telemetry.

CPU and memory are lagging indicators of system load; to build truly responsive systems, you must scale based on the leading indicators found in your event streams and message queues.

The Limitations of Resource-Based Metrics

In a typical microservices architecture, a service might spend most of its time waiting for I/O operations or network calls, meaning its CPU usage remains low even as a massive backlog of work builds up. If you only scale on CPU, your system might never trigger an upscale event despite thousands of pending tasks. This leads to a bottleneck where the processing throughput remains stagnant while the queue grows indefinitely.

Conversely, some applications have high baseline resource usage even when idle due to background tasks or runtime overhead. This makes it difficult to find a reliable CPU threshold that accurately represents actual workload pressure. Using event-based signals removes this ambiguity by providing a direct correlation between the number of pending events and the number of workers needed.

Architecting with KEDA and ScaledObjects

KEDA operates by introducing a custom resource definition called a ScaledObject, which maps a specific event source to a Kubernetes deployment. It effectively bridges the gap between external event providers like RabbitMQ, Kafka, or AWS SQS and the Kubernetes metrics pipeline. The controller monitors the external source and periodically updates the metrics that the Horizontal Pod Autoscaler uses to determine the desired pod count.

A critical feature of KEDA is its ability to scale workloads down to zero instances when no events are detected. This is a significant improvement over standard Kubernetes autoscaling, which requires at least one pod to stay active to report metrics. By scaling to zero, organizations can drastically reduce cloud costs for intermittent workloads that only run occasionally throughout the day.

Example KEDA ScaledObject for RabbitMQ:

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: order-processor-scaler
  namespace: production
spec:
  scaleTargetRef:
    name: order-worker-deployment # The deployment we want to scale
  pollingInterval: 15  # How often KEDA checks the queue, in seconds
  cooldownPeriod: 300  # Wait 5 minutes before scaling back to zero
  minReplicaCount: 0   # Enable scale-to-zero
  maxReplicaCount: 50  # Prevent runaway costs
  triggers:
  - type: rabbitmq
    metadata:
      queueName: incoming-orders
      queueLength: '20' # Target 20 messages per pod
      hostFromEnv: RabbitMqConnectionSetting # Env var holding the AMQP connection string
```
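In production the connection string usually lives in a Kubernetes Secret rather than an environment variable. KEDA's TriggerAuthentication resource covers this case; a sketch, where the Secret name and key are hypothetical:

```yaml
apiVersion: keda.sh/v1alpha1
kind: TriggerAuthentication
metadata:
  name: rabbitmq-auth
  namespace: production
spec:
  secretTargetRef:
  - parameter: host        # Supplies the 'host' value to the rabbitmq trigger
    name: rabbitmq-secret  # Hypothetical Secret holding the AMQP URL
    key: connectionString
```

The trigger then points at it with an `authenticationRef` (`authenticationRef: {name: rabbitmq-auth}`) instead of embedding credentials in the ScaledObject.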

The Role of the Metrics Adapter

KEDA serves as an implementation of the Kubernetes External Metrics API, which allows it to provide data to the Horizontal Pod Autoscaler without replacing it. When a ScaledObject is created, KEDA automatically generates an HPA resource if one does not already exist. This creates a clean separation of concerns where KEDA handles the event integration and the HPA handles the scaling logic.

A KEDA installation consists of two main components: the agent (the operator) and the metrics adapter. The agent is responsible for activating and deactivating deployments, moving them between zero and one replicas. The metrics adapter serves the granular metric values needed for scaling from one replica up to the defined maximum, based on the specific trigger configuration.

Scaling Based on Queue Depth and Processing Latency

To implement a robust scaling strategy, developers must determine the optimal queue depth per pod, which is the number of messages a single instance can handle comfortably before latency increases. Setting this value too low causes aggressive scaling and resource waste, while setting it too high leads to processing delays. This often requires load testing to find the sweet spot where throughput is maximized and costs are minimized.

In a real-world scenario like an image processing pipeline, you might want to scale workers based on how many images are waiting in an S3-compatible storage bucket. By feeding the count of objects under a specific prefix into KEDA as a custom metric, you can spin up workers to process them in parallel. This ensures that a sudden batch upload of thousands of images is handled quickly without manual intervention to increase capacity.
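KEDA has no built-in object-storage scaler, but its metrics-api scaler can consume a count exposed over HTTP. A minimal sketch, assuming a small hypothetical counter service that reports pending objects as JSON:

```yaml
triggers:
- type: metrics-api
  metadata:
    # Hypothetical in-cluster service that returns e.g. {"pending_objects": 4200}
    url: http://object-counter.production.svc.cluster.local/pending
    valueLocation: pending_objects  # JSON path to the numeric value
    targetValue: '100'              # Aim for one worker per 100 pending images
```

The counter service itself is an assumption; anything that can count the bucket prefix and serve a number works, and the same metric could equally be scraped by Prometheus and consumed through the Prometheus scaler.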

  • Target Value: The specific metric value that triggers a scale-up, such as 50 messages per worker.
  • Polling Interval: The frequency at which KEDA checks the external metric source to determine if scaling is necessary.
  • Cooldown Period: The duration the system should wait after the last event before scaling back down to the minimum replica count.
  • Activation Phase: The process of moving from zero to one replica, which often requires careful handling of cold starts.

Handling the Scale-to-Zero Trade-off

While scaling to zero offers maximum cost efficiency, it introduces the challenge of cold start latency. When a new message arrives and the pod count is zero, Kubernetes must pull the container image, start the pod, and wait for the application to initialize. For time-sensitive operations, this delay might be unacceptable, and keeping a minimum of one pod might be a better architectural decision.

To mitigate cold start issues, engineers should optimize container images for size and use fast-starting runtimes. Additionally, configuring a longer cooldown period in the ScaledObject ensures that the system doesn't immediately shut down during brief periods of inactivity. This creates a buffer that balances the need for cost savings with the requirement for high responsiveness.
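These trade-offs map directly onto a couple of ScaledObject fields; a sketch with illustrative values:

```yaml
spec:
  minReplicaCount: 1   # Keep one warm pod if cold-start latency is unacceptable
  cooldownPeriod: 600  # Require 10 minutes of inactivity before scaling to minimum
```

Setting `minReplicaCount` back to 0 restores full scale-to-zero once the workload can tolerate the startup delay.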

Advanced Patterns: Scaling with Custom Prometheus Queries

Sometimes the built-in KEDA scalers for queues or databases are not enough to capture complex business logic. In these cases, the Prometheus scaler is an incredibly flexible alternative that allows you to scale based on any metric already being collected by your monitoring stack. You can write complex PromQL queries that combine multiple metrics, such as error rates and request durations, to drive scaling decisions.

For example, a service might need to scale up if the 99th percentile latency exceeds a certain threshold, but only if the success rate is also high. This prevents the system from scaling up in response to a dependency failure that is causing fast-failing errors. By using custom queries, the autoscaler becomes an intelligent component of the application lifecycle rather than a blunt instrument.

Prometheus-based scaling query:

```yaml
triggers:
- type: prometheus
  metadata:
    serverAddress: http://prometheus-operated.monitoring.svc.cluster.local:9090
    metricName: http_request_duration_seconds
    # Scale if the p95 latency of the 'checkout' route exceeds 500ms
    query: >
      histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket{path="/api/checkout"}[2m])) by (le))
    threshold: '0.500'
```
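Gating a latency-driven scale-up on a healthy success rate, as described above, can be expressed in a single query by multiplying the latency by a boolean success-rate check; a sketch, where the `http_requests_total` metric and its labels are assumptions about your instrumentation:

```yaml
triggers:
- type: prometheus
  metadata:
    serverAddress: http://prometheus-operated.monitoring.svc.cluster.local:9090
    # p95 latency, zeroed out (so no scale-up) unless the success rate exceeds 95%
    query: >
      histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket{path="/api/checkout"}[2m])) by (le))
      * (sum(rate(http_requests_total{path="/api/checkout", status=~"2.."}[2m]))
         / sum(rate(http_requests_total{path="/api/checkout"}[2m])) > bool 0.95)
    threshold: '0.500'
```

The PromQL `bool` modifier turns the comparison into 0 or 1, so a fast-failing dependency outage drives the whole expression to zero instead of triggering a pointless scale-up.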

When implementing Prometheus-based scaling, it is vital to ensure the queries are performant and return values quickly. A slow query can cause the HPA to receive outdated or missing data, leading to erratic scaling behavior or 'flapping' where pods are rapidly added and removed. Always test your scaling queries in a dashboard before applying them to a live ScaledObject definition.

Preventing Scaling Flapping

Flapping occurs when the metric value fluctuates around the threshold, causing the autoscaler to constantly change the replica count. This creates instability and puts unnecessary load on the Kubernetes control plane. KEDA and the HPA provide stabilization windows that allow you to define how quickly a downscale or upscale event can occur after a previous change.

Setting a longer stabilization window for downscaling is generally safer as it prevents the system from killing pods during a temporary dip in traffic. For upscale events, the window should be shorter to ensure the system can react quickly to genuine spikes. Tuning these parameters is a continuous process that depends on the specific traffic patterns of your application.
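KEDA exposes the HPA's scaling behavior through the ScaledObject's `advanced` section; a sketch with illustrative window values:

```yaml
spec:
  advanced:
    horizontalPodAutoscalerConfig:
      behavior:
        scaleDown:
          stabilizationWindowSeconds: 600  # Be conservative about removing pods
        scaleUp:
          stabilizationWindowSeconds: 0    # React immediately to genuine spikes
```

The asymmetry mirrors the advice above: a long downscale window absorbs temporary dips, while an instant upscale window keeps the system responsive.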
