How we eliminated 100,000 unnecessary background jobs while keeping dashboards fresh - ZapDigits

At ZapDigits, users connect data sources like Meta Ads, Search Console, Google Analytics, Google Ads and GitHub to build live dashboards for marketing agencies.
As usage grew, we ran into a quiet but dangerous scaling problem.
If we continued with our original architecture, we would soon be running nearly 100,000 background jobs per day just to keep dashboards fresh.
That was not sustainable.
Here’s how we redesigned our system and reduced that number to around 8,000 jobs per day, while making dashboards faster and more reliable.
The naive model: Compute on read
In the early version, our system behaved like this:
- User opens a dashboard
- Metrics trigger background jobs
- Jobs fetch data from external APIs
- Dashboard updates once jobs complete
We also ran daily refresh jobs to ensure dashboards were not stale.
Now imagine this at moderate scale:
- 350 users
- 5 dashboards each
- 50 metrics per dashboard
That’s 87,500 metric evaluations.
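The arithmetic is easy to verify:

```python
# Back-of-the-envelope math from the scenario above.
users = 350
dashboards_per_user = 5
metrics_per_dashboard = 50

evaluations = users * dashboards_per_user * metrics_per_dashboard
print(evaluations)  # 87500
```

And that is before daily refresh jobs multiply the count further.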
Even worse, many metrics required hitting different endpoints:
- Stripe invoices
- Stripe charges
- Stripe subscriptions
- Google Ads campaigns
- GitHub events
We weren’t just running jobs. We were repeatedly calling external APIs, often for overlapping data.
The math was getting scary.
The core realization: Metrics should not drive API calls
The first architectural shift was simple but critical: metrics should never call external APIs.

Instead of:

Metric → API call

We moved to:

Clients → Data source sync → Metrics computed from stored data
External APIs are treated as upstream data stores that we replicate into our system.
For example, for Stripe we created sync modules per entity:
- Charges
- Invoices
- Subscriptions
- Balance transactions
Each module performs incremental sync and stores data in our database.
All Stripe metrics are then computed from this local data.
This immediately eliminated massive duplication.
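As a rough sketch of what one such sync module can look like (the cursor scheme and names like `ChargeSyncModule` and `fetch_page` are illustrative, not ZapDigits' actual code):

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class SyncState:
    # Cursor of the last record stored; None means full backfill.
    last_synced_cursor: Optional[str] = None

@dataclass
class ChargeSyncModule:
    """Incrementally replicates one upstream entity (e.g. Stripe charges)
    into a local store, so metrics never touch the external API."""
    state: SyncState = field(default_factory=SyncState)
    store: dict = field(default_factory=dict)  # stand-in for a DB table

    def run(self, fetch_page):
        # fetch_page(cursor) -> (records, next_cursor) stands in for a
        # paginated upstream API call returning only records after cursor.
        while True:
            records, next_cursor = fetch_page(self.state.last_synced_cursor)
            for rec in records:
                self.store[rec["id"]] = rec  # upsert by external id
            if next_cursor is None:
                break
            self.state.last_synced_cursor = next_cursor
```

Because the cursor is persisted, each run only pulls what changed since the last sync, and every metric for that data source reads from `store` instead of issuing its own API call.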
The bigger shift: Client-scoped activation
Originally, background jobs were tied to users.
But users can have multiple clients, and they don’t use all of them daily.
So we moved job orchestration to the client level.
We added the following fields to our clients table:
- jobs_active (boolean)
- last_synced_at (timestamp)
- refresh_status ("idle" | "refreshing" | "failed")
- last_seen_at (timestamp)
Now the logic is simple:
When a client is accessed:
- Mark jobs_active = true
- Update last_seen_at
We use a heartbeat mechanism to detect inactivity. If a client has not been seen in a short window, we mark it inactive.
Only active clients are eligible for background refresh.
Inactive clients consume zero compute.
This alone removed a large percentage of unnecessary jobs.
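The activation and heartbeat logic can be sketched in a few lines (the 30-minute inactivity window here is an assumed value, not the one we run in production):

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import List, Optional

INACTIVITY_WINDOW = timedelta(minutes=30)  # hypothetical cutoff

@dataclass
class Client:
    jobs_active: bool = False
    last_seen_at: Optional[datetime] = None

def touch(client: Client, now: datetime) -> None:
    """Called whenever a client's dashboard is accessed."""
    client.jobs_active = True
    client.last_seen_at = now

def heartbeat_sweep(clients: List[Client], now: datetime) -> None:
    """Periodic job: deactivate clients not seen within the window."""
    for c in clients:
        if c.last_seen_at is None or now - c.last_seen_at > INACTIVITY_WINDOW:
            c.jobs_active = False
```

Only clients with `jobs_active = True` are ever considered by the scheduler; everything else costs nothing.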
Lazy refresh with staleness windows
Even active clients do not refresh constantly. When a client becomes active, we check:
- Is refresh_status not "refreshing"?
- Is last_synced_at older than 15 minutes?
Only if both conditions are true do we enqueue a refresh job.
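The gate is a single pure function; a minimal sketch (the function name is illustrative):

```python
from datetime import datetime, timedelta
from typing import Optional

STALENESS_WINDOW = timedelta(minutes=15)

def should_enqueue_refresh(refresh_status: str,
                           last_synced_at: Optional[datetime],
                           now: datetime) -> bool:
    """Enqueue only when no refresh is in flight and the data is stale."""
    if refresh_status == "refreshing":
        return False  # a job is already running; never double-enqueue
    if last_synced_at is None:
        return True   # never synced: always stale
    return now - last_synced_at > STALENESS_WINDOW
```

Because the `"refreshing"` check comes first, a burst of dashboard opens from the same client collapses into at most one queued job.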
This gives us:
- No duplicate jobs
- No refresh storms
- No blocking UI
- No redundant API calls
Dashboards always read cached metrics from the database.
If a refresh is happening, we simply show a “Refreshing” indicator and display the last updated timestamp.
The UI never waits for background work.
Adding Redis: Making reads instant
Once job orchestration was under control, the next bottleneck appeared: read performance.
Even with precomputed metrics, complex dashboards can require multiple queries and aggregations.
So we introduced Redis as a hot cache layer.
When a background job completes:
- It updates Postgres
- It writes a serialized metric snapshot to Redis
- It sets a TTL aligned with the refresh interval
Dashboard loads now follow this path:
Redis → Render
If Redis misses, we fall back to Postgres.
This made dashboard loads consistently fast and predictable.
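The write and read paths above can be sketched as follows. A plain dict stands in for Redis here so the example is self-contained; in production the same two operations map to redis-py's `setex()` and `get()`:

```python
import json
from typing import Callable, Optional

REFRESH_INTERVAL_SECONDS = 15 * 60  # TTL aligned with the staleness window

class MetricCache:
    """Dict-backed stand-in for Redis."""
    def __init__(self):
        self._data = {}

    def write_snapshot(self, key: str, snapshot: dict) -> None:
        # Serialized snapshot plus a TTL, so entries expire roughly
        # when the next background refresh would land anyway.
        self._data[key] = (json.dumps(snapshot), REFRESH_INTERVAL_SECONDS)

    def read(self, key: str) -> Optional[dict]:
        entry = self._data.get(key)
        return json.loads(entry[0]) if entry else None

def load_metric(cache: MetricCache, key: str,
                read_postgres: Callable[[str], dict]) -> dict:
    """Read path: Redis first, fall back to Postgres on a miss."""
    snapshot = cache.read(key)
    if snapshot is not None:
        return snapshot
    return read_postgres(key)  # stand-in for the Postgres query
```

The dashboard never computes anything on this path; it only deserializes whichever snapshot it finds first.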
Introducing ClickHouse for Heavy Aggregations
As some clients grew, especially those with large payment volumes or high event throughput, aggregation queries became heavier.
Instead of pushing Postgres beyond its comfort zone, we introduced ClickHouse as our analytics engine.
The architecture became layered:
- Postgres: relational state and core data
- ClickHouse: high-volume analytical queries
- Redis: low-latency metric reads
Raw or semi-processed data is streamed into ClickHouse.
Complex time-range metrics are computed there using columnar storage, which dramatically reduces query times.
This separation allows:
- Independent scaling of analytics
- Fast large-range aggregations
- No locking or stress on transactional tables
Each component has a clear responsibility.
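To give a flavor of the kind of query that moves to ClickHouse, here is a sketch that builds a daily time-range aggregation; the `occurred_at` column and table name are hypothetical, not our actual schema:

```python
def timeseries_query(metric_column: str, table: str,
                     start: str, end: str) -> str:
    """Builds a daily aggregation of the sort ClickHouse's columnar
    storage handles cheaply, even over very large ranges."""
    return (
        f"SELECT toDate(occurred_at) AS day, sum({metric_column}) AS total "
        f"FROM {table} "
        f"WHERE occurred_at >= '{start}' AND occurred_at < '{end}' "
        f"GROUP BY day ORDER BY day"
    )
```

Running this against ClickHouse instead of Postgres means a month-long scan over millions of rows never competes with transactional traffic.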
The Final Architecture
At a high level:
- A client becomes active
- Staleness is evaluated
- Background job runs per client per data source
- Data syncs incrementally
- Metrics are computed
- Results stored in Postgres
- Snapshot written to Redis
- Dashboards read instantly
No dashboard triggers external APIs.
No compute happens on read.
No refresh runs unless it is actually needed.
The Result
Before:
~100,000 background jobs per day at projected scale.
After:
~8,000 jobs per day, scoped to active clients and stale data only.
But the job count reduction is not the most important outcome.
What actually improved:
- Dashboards load instantly
- API rate limits are stable
- Background queues remain shallow
- Compute scales with active clients, not total users
- The system behaves predictably
We did not just reduce jobs.
We reduced architectural noise.
What We Learned
- Never compute on read for dashboards.
- Metrics should not trigger API calls.
- Scope background work to active entities, not all entities.
- Use staleness windows instead of blind schedules.
- Separate sync, compute, and read layers.
That shift took us from 100k background jobs to 8k, without sacrificing freshness, speed, or reliability.
And it made ZapDigits fundamentally more scalable going forward.
Thanks for reading...

