Scaling our background job system
Ali Deishidi
For our background job system, we use Sidekiq. It picks up jobs from Redis and processes them—millions of jobs every day. However, the default queuing system did not scale well for our product. Notably, we faced the following issues:
- A developer submits a job into the system, but has no idea when it will be processed. What is the expected wait time for each queue?
- Sidekiq provides classic queues such as default, low, critical, and so on. We also had specific queues like mailers. How should developers decide which queue a job belongs to?
- Sidekiq is a shared resource across the entire product. A noisy neighbour can degrade performance or completely break the system. Thousands of long-running jobs—each taking minutes—can clog the queues, block all workers, and effectively bring the system down.
- A user exports a large file as a background job. It may take 20 minutes. Is this normal? Should the system alert the owners (developers or users)? Or is 20 minutes an acceptable duration for large file generation?
These are system design and UX challenges for both product developers and background system owners (Infrastructure and Developer Experience teams at Factorial).
A one-line solution
Our solution is a one-line change that addresses all of these problems.
Previously, jobs looked like this:
```ruby
class TypicalJob < BaseJob
  queue_as :default

  def perform; end
end
```

We changed it to this:
```ruby
class TypicalJob < BaseJob
  slo ToBeDone::WITHIN_10_SECONDS

  def perform; end
end
```

This change gives us the following benefits:
- The ability to define time expectations for jobs as code
- Removing the ability for developers to manually choose queue names
- Automatic anti-clogging protection
- Automatic monitoring and alerting
Let’s dive into each one.
Define expectations as code
We are usually very sensitive to web system metrics: page load time, time to first paint, and so on. But when it comes to background jobs, we are often clueless.
What is missing is expectation.
As the owner of a job, a developer or designer should decide how long it is expected to take from a user's perspective. This is a UX decision, not just a technical one. Going back to the file export example, there should be a clear definition: should this job complete within one minute from the moment the user clicks the download button, or is a longer wait acceptable?
In classic Sidekiq usage, we often encode this expectation implicitly via queue names. A default queue means “soon,” a critical queue means “now,” and so on. This does not scale well when you have hundreds of jobs with different performance characteristics and datasets.
The line:

```ruby
slo ToBeDone::WITHIN_10_SECONDS
```

explicitly defines the expectation of the job. Instead of deciding which queue to enqueue into, we define how long the job is allowed to take. These SLOs are available as predefined options:
- within 10 seconds
- within 30 seconds
- within 5 minutes
- within 24 hours
Developers choose the appropriate SLO based on historical data or product requirements. This directly solves the problem of defining expectations.
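As an illustration, the predefined options can be thought of as constants in a small module. The sketch below is an assumption about the shape of ToBeDone, not our exact implementation:

```ruby
# Illustrative sketch only: the predefined SLO options, expressed in seconds.
# The real ToBeDone module may be structured differently.
module ToBeDone
  WITHIN_10_SECONDS = 10
  WITHIN_30_SECONDS = 30
  WITHIN_5_MINUTES  = 5 * 60
  WITHIN_24_HOURS   = 24 * 60 * 60
end
```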
Hide the ability to choose queues
We created queues with matching names: within_10_seconds, within_30_seconds, and so on. When developers set an SLO like within_10_seconds, the job is automatically enqueued into the corresponding queue.
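Conceptually, the slo macro resolves the declared SLO to one of these queues. Here is a minimal sketch of that mapping, assuming an ActiveJob-style BaseJob; the lookup table and helper names are hypothetical:

```ruby
# Hypothetical sketch: the declared SLO, not the developer, picks the queue.
class BaseJob < ActiveJob::Base
  QUEUES_BY_SLO = {
    ToBeDone::WITHIN_10_SECONDS => :within_10_seconds,
    ToBeDone::WITHIN_30_SECONDS => :within_30_seconds,
    ToBeDone::WITHIN_5_MINUTES  => :within_5_minutes,
    ToBeDone::WITHIN_24_HOURS   => :within_24_hours
  }.freeze

  def self.slo(value)
    @declared_slo = value
    queue_as QUEUES_BY_SLO.fetch(value) # unknown SLOs raise, so every job carries a valid contract
  end

  def self.declared_slo
    @declared_slo
  end
end
```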
However, this is not just a renaming of queue_as to slo.
The SLO defined in code represents the expected behaviour of the job. The developer states, “I expect this job to complete within 10 seconds,” and the system treats this as a contract. Through monitoring and autoscaling, the background system ensures that this promise is upheld.
For the job, the SLO is an expectation. For the system, it is metadata and a contract.
At this point, the job is placed in the appropriate queue, but that’s not the end of the story.
Automatic anti-clogging protection
When a Sidekiq worker picks up a job from Redis, it checks the contract. A middleware evaluates whether the job type has been able to meet its defined SLO recently. Specifically, it checks the p95 execution time over the last hour.
If the job meets its SLO, it is processed as usual. If not, the job is automatically moved to the least time-sensitive queue (within_24_hours) and the original job execution is canceled.
For example, consider an EmailExport job with:

```ruby
slo ToBeDone::WITHIN_30_SECONDS
```

The job is initially enqueued into within_30_seconds. At processing time, the middleware checks the last hour of execution data:
- If p95 is under 30 seconds, the job proceeds.
- If p95 exceeds 30 seconds, the middleware re-enqueues the job into within_24_hours and cancels the current attempt.
This approach protects fast queues from being clogged while ensuring that jobs are not abandoned—they are delayed, not dropped.
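Wired into Sidekiq, the check could look roughly like the server middleware below. SloRegistry and SloMetrics stand in for whatever stores the declared SLOs and recent execution times; they are illustrative helpers, not real library APIs:

```ruby
require "sidekiq"

# Rough sketch of the anti-clogging check as Sidekiq server middleware.
# SloRegistry and SloMetrics are assumed helpers, not part of Sidekiq.
class AntiCloggingMiddleware
  FALLBACK_QUEUE = "within_24_hours".freeze

  def call(_worker, job, queue)
    job_class = job["wrapped"] || job["class"]      # ActiveJob wraps the real class name
    slo       = SloRegistry.slo_for(job_class)      # declared SLO, in seconds
    p95       = SloMetrics.p95_last_hour(job_class) # observed p95, in seconds

    if slo && p95 && p95 > slo && queue != FALLBACK_QUEUE
      # The job type is currently violating its SLO: move this job to the
      # least time-sensitive queue and skip the current attempt.
      Sidekiq::Client.push(job.merge("queue" => FALLBACK_QUEUE))
      return
    end

    yield # SLO is being met (or we are already in the fallback queue): run the job.
  end
end

Sidekiq.configure_server do |config|
  config.server_middleware do |chain|
    chain.add AntiCloggingMiddleware
  end
end
```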
Automatic monitoring and alerting
Our codebase is large, but we use CODEOWNERS files to define which team owns which parts of the system. Leveraging this, along with some Terraform work and DataDog integration, we automatically create monitors for each product team based on the jobs they own and the SLOs they define.
A simplified example of a DataDog query looks like this:

```
sidekiq.jobs.perform_total.95percentile{worker:EmailExport AND slo:within_10_seconds} > 10000
```

When this condition is met, we send an alert to the relevant team:

```
Alert CODEOWNERS(owner):
The EmailExport job is violating its 10-second SLO.
Check the dashboard here: <link>
```

This entire pipeline is automated. Product developers only need to define the SLO; monitoring and alerting are set up automatically as soon as the code is merged.
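As a rough illustration of that automation (not our actual Terraform), the monitor definitions can be derived from the job classes and the SLOs they declare. CodeownersLookup and the exact query format below are hypothetical:

```ruby
# Hypothetical sketch: derive one Datadog-style monitor per job from its declared SLO.
# CodeownersLookup is an assumed helper that maps a job class to its owning team.
def monitor_definitions(job_classes)
  job_classes.map do |job|
    slo   = job.declared_slo                  # e.g. ToBeDone::WITHIN_10_SECONDS
    queue = BaseJob::QUEUES_BY_SLO.fetch(slo) # e.g. :within_10_seconds

    {
      name:   "#{job.name} SLO violation",
      query:  "sidekiq.jobs.perform_total.95percentile" \
              "{worker:#{job.name} AND slo:#{queue}} > #{slo * 1000}", # threshold in ms
      notify: CodeownersLookup.team_for(job)
    }
  end
end
```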
Final note
The idea of using queue names as time periods is not originally ours. We were inspired by this excellent article from Gusto.
We approached the problem from a different angle, but the core idea came from their work—credit where it’s due.
