Scaling our background job system
Ali Deishidi
For our background job system, we use Sidekiq. It picks up jobs from Redis and processes them—millions of jobs every day. However, the default queuing system did not scale well for our product. Notably, we faced the following issues:
- A developer submits a job into the system, but has no idea when it will be processed. What is the expected wait time for each queue?
- Sidekiq provides classic queues such as default, low, critical, and so on. We also had specific queues like mailers. How should developers decide which queue a job belongs to?
- Sidekiq is a shared resource across the entire product. A noisy neighbour can degrade performance or completely break the system. Thousands of long-running jobs—each taking minutes—can clog the queues, block all workers, and effectively bring the system down.
- A user exports a large file as a background job. It may take 20 minutes. Is this normal? Should the system alert the owners (developers or users)? Or is 20 minutes an acceptable duration for large file generation?
These are system design and UX challenges for both product developers and background system owners (Infrastructure and Developer Experience teams at Factorial).
A one-line solution
Our solution is a one-line change that addresses all of these problems.
Previously, jobs looked like this:
```ruby
class TypicalJob < BaseJob
  queue_as :default

  def perform; end
end
```

We changed it to this:
```ruby
class TypicalJob < BaseJob
  slo ToBeDone::WITHIN_10_SECONDS

  def perform; end
end
```

This change gives us the following benefits:
- The ability to define time expectations for jobs as code
- Removing the ability for developers to manually choose queue names
- Automatic anti-clogging protection
- Automatic monitoring and alerting
Let’s dive into each one.
Define expectations as code
We are usually very sensitive to web system metrics: page load time, time to first paint, and so on. But when it comes to background jobs, we are often clueless.
What is missing is expectation.
As the owner of a job, a developer or designer should decide how long it is expected to take from a user's perspective. This is a UX decision, not just a technical one. Going back to the file export example, there should be a clear definition: should this job complete within one minute from the moment the user clicks the download button, or is a longer wait acceptable?
In classic Sidekiq usage, we often encode this expectation implicitly via queue names. A default queue means “soon,” a critical queue means “now,” and so on. This does not scale well when you have hundreds of jobs with different performance characteristics and datasets.
The line:

```ruby
slo ToBeDone::WITHIN_10_SECONDS
```

explicitly defines the expectation of the job. Instead of deciding which queue to enqueue into, we define how long the job is allowed to take. These SLOs are available as predefined options:
- within 10 seconds
- within 30 seconds
- within 5 minutes
- within 24 hours
Developers choose the appropriate SLO based on historical data or product requirements. This directly solves the problem of defining expectations.
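As an illustration, the predefined options can be thought of as constants in a small module. The sketch below is an assumption about the shape of ToBeDone, not our exact implementation:

```ruby
# Illustrative sketch only: the predefined SLO options, expressed in seconds.
# The real ToBeDone module may be structured differently.
module ToBeDone
  WITHIN_10_SECONDS = 10
  WITHIN_30_SECONDS = 30
  WITHIN_5_MINUTES  = 5 * 60
  WITHIN_24_HOURS   = 24 * 60 * 60
end
```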
Hide the ability to choose queues
We created queues with matching names: within_10_seconds, within_30_seconds, and so on. When developers set an SLO like within_10_seconds, the job is automatically enqueued into the corresponding queue.
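Conceptually, the slo macro resolves the declared SLO to one of these queues. Here is a minimal sketch of that mapping, assuming an ActiveJob-style BaseJob; the lookup table and helper names are hypothetical:

```ruby
# Hypothetical sketch: the declared SLO, not the developer, picks the queue.
class BaseJob < ActiveJob::Base
  QUEUES_BY_SLO = {
    ToBeDone::WITHIN_10_SECONDS => :within_10_seconds,
    ToBeDone::WITHIN_30_SECONDS => :within_30_seconds,
    ToBeDone::WITHIN_5_MINUTES  => :within_5_minutes,
    ToBeDone::WITHIN_24_HOURS   => :within_24_hours
  }.freeze

  def self.slo(value)
    @declared_slo = value
    queue_as QUEUES_BY_SLO.fetch(value) # unknown SLOs raise, so every job carries a valid contract
  end

  def self.declared_slo
    @declared_slo
  end
end
```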
However, this is not just a renaming of queue_as to slo.
The SLO defined in code represents the expected behaviour of the job. The developer states, “I expect this job to complete within 10 seconds,” and the system treats this as a contract. Through monitoring and autoscaling, the background system ensures that this promise is upheld.
For the job, the SLO is an expectation. For the system, it is metadata and a contract.
At this point, the job is placed in the appropriate queue, but that’s not the end of the story.
Automatic anti-clogging protection
When a Sidekiq worker picks up a job from Redis, it checks the contract. A middleware evaluates whether the job type has been able to meet its defined SLO recently. Specifically, it checks the p95 execution time over the last hour.
If the job meets its SLO, it is processed as usual. If not, the job is automatically moved to the least time-sensitive queue (within_24_hours) and the original job execution is canceled.
For example, consider an EmailExport job with:

```ruby
slo ToBeDone::WITHIN_30_SECONDS
```

The job is initially enqueued into within_30_seconds. At processing time, the middleware checks the last hour of execution data:
- If p95 is under 30 seconds, the job proceeds.
- If p95 exceeds 30 seconds, the middleware re-enqueues the job into within_24_hours and cancels the current attempt.
This approach protects fast queues from being clogged while ensuring that jobs are not abandoned—they are delayed, not dropped.
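Wired into Sidekiq, the check could look roughly like the server middleware below. SloRegistry and SloMetrics stand in for whatever stores the declared SLOs and recent execution times; they are illustrative helpers, not real library APIs:

```ruby
require "sidekiq"

# Rough sketch of the anti-clogging check as Sidekiq server middleware.
# SloRegistry and SloMetrics are assumed helpers, not part of Sidekiq.
class AntiCloggingMiddleware
  FALLBACK_QUEUE = "within_24_hours".freeze

  def call(_worker, job, queue)
    job_class = job["wrapped"] || job["class"]      # ActiveJob wraps the real class name
    slo       = SloRegistry.slo_for(job_class)      # declared SLO, in seconds
    p95       = SloMetrics.p95_last_hour(job_class) # observed p95, in seconds

    if slo && p95 && p95 > slo && queue != FALLBACK_QUEUE
      # The job type is currently violating its SLO: move this job to the
      # least time-sensitive queue and skip the current attempt.
      Sidekiq::Client.push(job.merge("queue" => FALLBACK_QUEUE))
      return
    end

    yield # SLO is being met (or we are already in the fallback queue): run the job.
  end
end

Sidekiq.configure_server do |config|
  config.server_middleware do |chain|
    chain.add AntiCloggingMiddleware
  end
end
```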
Automatic monitoring and alerting
Our codebase is large, but we use CODEOWNERS files to define which team owns which parts of the system. Leveraging this, along with some Terraform work and DataDog integration, we automatically create monitors for each product team based on the jobs they own and the SLOs they define.
A simplified example of a DataDog query looks like this:

```
sidekiq.jobs.perform_total.95percentile{worker:EmailExport AND slo:within_10_seconds} > 10000
```

When this condition is met, we send an alert to the relevant team:

```
Alert CODEOWNERS(owner):
The EmailExport job is violating its 10-second SLO.
Check the dashboard here: <link>
```

This entire pipeline is automated. Product developers only need to define the SLO; monitoring and alerting are set up automatically as soon as the code is merged.
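As a rough illustration of that automation (not our actual Terraform), the monitor definitions can be derived from the job classes and the SLOs they declare. CodeownersLookup and the exact query format below are hypothetical:

```ruby
# Hypothetical sketch: derive one Datadog-style monitor per job from its declared SLO.
# CodeownersLookup is an assumed helper that maps a job class to its owning team.
def monitor_definitions(job_classes)
  job_classes.map do |job|
    slo   = job.declared_slo                  # e.g. ToBeDone::WITHIN_10_SECONDS
    queue = BaseJob::QUEUES_BY_SLO.fetch(slo) # e.g. :within_10_seconds

    {
      name:   "#{job.name} SLO violation",
      query:  "sidekiq.jobs.perform_total.95percentile" \
              "{worker:#{job.name} AND slo:#{queue}} > #{slo * 1000}", # threshold in ms
      notify: CodeownersLookup.team_for(job)
    }
  end
end
```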
Final note
The idea of using queue names as time periods is not originally ours. We were inspired by this excellent article from Gusto.
We approached the problem from a different angle, but the core idea came from their work—credit where it’s due.
