How we eliminated 100,000 unnecessary background jobs while keeping dashboards fresh - ZapDigits

At ZapDigits, users connect data sources like Meta Ads, Search Console, Google Analytics, Google Ads and GitHub to build live dashboards for marketing agencies.
As usage grew, we ran into a quiet but dangerous scaling problem.
If we continued with our original architecture, we would soon be running nearly 100,000 background jobs per day just to keep dashboards fresh.
That was not sustainable.
Here’s how we redesigned our system and reduced that number to around 8,000 jobs per day, while making dashboards faster and more reliable.
The naive model: Compute on read
In the early version, our system behaved like this:
- User opens a dashboard
- Metrics trigger background jobs
- Jobs fetch data from external APIs
- Dashboard updates once jobs complete
We also ran daily refresh jobs to ensure dashboards were not stale.
Now imagine this at moderate scale:
- 350 users
- 5 dashboards each
- 50 metrics per dashboard
That’s 87,500 metric evaluations.
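The arithmetic is easy to verify:

```python
# Back-of-the-envelope math from the scenario above.
users = 350
dashboards_per_user = 5
metrics_per_dashboard = 50

evaluations = users * dashboards_per_user * metrics_per_dashboard
print(evaluations)  # 87500
```

And that is before daily refresh jobs multiply the count further.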
Even worse, many metrics required hitting different endpoints:
- Stripe invoices
- Stripe charges
- Stripe subscriptions
- Google Ads campaigns
- GitHub events
We weren’t just running jobs. We were repeatedly calling external APIs, often for overlapping data.
The math was getting scary.
The core realization: Metrics should not drive API calls
The first architectural shift was simple but critical: metrics should never call external APIs.

Instead of:

Metric → API call

We moved to:

Clients → Data source sync → Metrics computed from stored data
External APIs are treated as upstream data stores that we replicate into our system.
For example, for Stripe we created sync modules per entity:
- Charges
- Invoices
- Subscriptions
- Balance transactions
Each module performs incremental sync and stores data in our database.
All Stripe metrics are then computed from this local data.
This immediately eliminated massive duplication.
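As a rough sketch of what one such sync module can look like (the cursor scheme and names like `ChargeSyncModule` and `fetch_page` are illustrative, not ZapDigits' actual code):

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class SyncState:
    # Cursor of the last record stored; None means full backfill.
    last_synced_cursor: Optional[str] = None

@dataclass
class ChargeSyncModule:
    """Incrementally replicates one upstream entity (e.g. Stripe charges)
    into a local store, so metrics never touch the external API."""
    state: SyncState = field(default_factory=SyncState)
    store: dict = field(default_factory=dict)  # stand-in for a DB table

    def run(self, fetch_page):
        # fetch_page(cursor) -> (records, next_cursor) stands in for a
        # paginated upstream API call returning only records after cursor.
        while True:
            records, next_cursor = fetch_page(self.state.last_synced_cursor)
            for rec in records:
                self.store[rec["id"]] = rec  # upsert by external id
            if next_cursor is None:
                break
            self.state.last_synced_cursor = next_cursor
```

Because the cursor is persisted, each run only pulls what changed since the last sync, and every metric for that data source reads from `store` instead of issuing its own API call.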
The bigger shift: Client-scoped activation
Originally, background jobs were tied to users.
But users can have multiple clients, and they don’t use all of them daily.
So we moved job orchestration to the client level.
We added the following fields to our clients table:
- jobs_active (boolean)
- last_synced_at (timestamp)
- refresh_status ("idle" | "refreshing" | "failed")
- last_seen_at (timestamp)
Now the logic is simple:
When a client is accessed:
- Mark jobs_active = true
- Update last_seen_at
We use a heartbeat mechanism to detect inactivity. If a client has not been seen in a short window, we mark it inactive.
Only active clients are eligible for background refresh.
Inactive clients consume zero compute.
This alone removed a large percentage of unnecessary jobs.
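The activation and heartbeat logic can be sketched in a few lines (the 30-minute inactivity window here is an assumed value, not the one we run in production):

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import List, Optional

INACTIVITY_WINDOW = timedelta(minutes=30)  # hypothetical cutoff

@dataclass
class Client:
    jobs_active: bool = False
    last_seen_at: Optional[datetime] = None

def touch(client: Client, now: datetime) -> None:
    """Called whenever a client's dashboard is accessed."""
    client.jobs_active = True
    client.last_seen_at = now

def heartbeat_sweep(clients: List[Client], now: datetime) -> None:
    """Periodic job: deactivate clients not seen within the window."""
    for c in clients:
        if c.last_seen_at is None or now - c.last_seen_at > INACTIVITY_WINDOW:
            c.jobs_active = False
```

Only clients with `jobs_active = True` are ever considered by the scheduler; everything else costs nothing.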
Lazy refresh with staleness windows
Even active clients do not refresh constantly. When a client becomes active, we check:
- Is refresh_status not "refreshing"?
- Is last_synced_at older than 15 minutes?
Only if both conditions are true do we enqueue a refresh job.
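The gate is a single pure function; a minimal sketch (the function name is illustrative):

```python
from datetime import datetime, timedelta
from typing import Optional

STALENESS_WINDOW = timedelta(minutes=15)

def should_enqueue_refresh(refresh_status: str,
                           last_synced_at: Optional[datetime],
                           now: datetime) -> bool:
    """Enqueue only when no refresh is in flight and the data is stale."""
    if refresh_status == "refreshing":
        return False  # a job is already running; never double-enqueue
    if last_synced_at is None:
        return True   # never synced: always stale
    return now - last_synced_at > STALENESS_WINDOW
```

Because the `"refreshing"` check comes first, a burst of dashboard opens from the same client collapses into at most one queued job.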
This gives us:
- No duplicate jobs
- No refresh storms
- No blocking UI
- No redundant API calls
Dashboards always read cached metrics from the database.
If a refresh is happening, we simply show a “Refreshing” indicator and display the last updated timestamp.
The UI never waits for background work.
Adding Redis: Making reads instant
Once job orchestration was under control, the next bottleneck appeared: read performance.
Even with precomputed metrics, complex dashboards can require multiple queries and aggregations.
So we introduced Redis as a hot cache layer.
When a background job completes:
- It updates Postgres
- It writes a serialized metric snapshot to Redis
- It sets a TTL aligned with the refresh interval
Dashboard loads now follow this path:
Redis → Render
If Redis misses, we fall back to Postgres.
This made dashboard loads consistently fast and predictable.
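The write and read paths above can be sketched as follows. A plain dict stands in for Redis here so the example is self-contained; in production the same two operations map to redis-py's `setex()` and `get()`:

```python
import json
from typing import Callable, Optional

REFRESH_INTERVAL_SECONDS = 15 * 60  # TTL aligned with the staleness window

class MetricCache:
    """Dict-backed stand-in for Redis."""
    def __init__(self):
        self._data = {}

    def write_snapshot(self, key: str, snapshot: dict) -> None:
        # Serialized snapshot plus a TTL, so entries expire roughly
        # when the next background refresh would land anyway.
        self._data[key] = (json.dumps(snapshot), REFRESH_INTERVAL_SECONDS)

    def read(self, key: str) -> Optional[dict]:
        entry = self._data.get(key)
        return json.loads(entry[0]) if entry else None

def load_metric(cache: MetricCache, key: str,
                read_postgres: Callable[[str], dict]) -> dict:
    """Read path: Redis first, fall back to Postgres on a miss."""
    snapshot = cache.read(key)
    if snapshot is not None:
        return snapshot
    return read_postgres(key)  # stand-in for the Postgres query
```

The dashboard never computes anything on this path; it only deserializes whichever snapshot it finds first.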
Introducing ClickHouse for Heavy Aggregations
As some clients grew, especially those with large payment volumes or high event throughput, aggregation queries became heavier.
Instead of pushing Postgres beyond its comfort zone, we introduced ClickHouse as our analytics engine.
The architecture became layered:
- Postgres: relational state and core data
- ClickHouse: high-volume analytical queries
- Redis: low-latency metric reads
Raw or semi-processed data is streamed into ClickHouse.
Complex time-range metrics are computed there using columnar storage, which dramatically reduces query times.
This separation allows:
- Independent scaling of analytics
- Fast large-range aggregations
- No locking or stress on transactional tables
Each component has a clear responsibility.
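To give a flavor of the kind of query that moves to ClickHouse, here is a sketch that builds a daily time-range aggregation; the `occurred_at` column and table name are hypothetical, not our actual schema:

```python
def timeseries_query(metric_column: str, table: str,
                     start: str, end: str) -> str:
    """Builds a daily aggregation of the sort ClickHouse's columnar
    storage handles cheaply, even over very large ranges."""
    return (
        f"SELECT toDate(occurred_at) AS day, sum({metric_column}) AS total "
        f"FROM {table} "
        f"WHERE occurred_at >= '{start}' AND occurred_at < '{end}' "
        f"GROUP BY day ORDER BY day"
    )
```

Running this against ClickHouse instead of Postgres means a month-long scan over millions of rows never competes with transactional traffic.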
The Final Architecture
At a high level:
- A client becomes active
- Staleness is evaluated
- Background job runs per client per data source
- Data syncs incrementally
- Metrics are computed
- Results stored in Postgres
- Snapshot written to Redis
- Dashboards read instantly
No dashboard triggers external APIs.
No compute happens on read.
No refresh runs unless it is actually needed.
The Result
Before:
~100,000 background jobs per day at projected scale.
After:
~8,000 jobs per day, scoped to active clients and stale data only.
But the job count reduction is not the most important outcome.
What actually improved:
- Dashboards load instantly
- API rate limits are stable
- Background queues remain shallow
- Compute scales with active clients, not total users
- The system behaves predictably
We did not just reduce jobs.
We reduced architectural noise.
What We Learned
- Never compute on read for dashboards.
- Metrics should not trigger API calls.
- Scope background work to active entities, not all entities.
- Use staleness windows instead of blind schedules.
- Separate sync, compute, and read layers.
That shift took us from 100k background jobs to 8k, without sacrificing freshness, speed, or reliability.
And it made ZapDigits fundamentally more scalable going forward.
Thanks for reading...

