The Problem
A transactional web application with several years of production data had progressively slowed to the point where certain user flows were timing out. The engineering team had added caching layers to mask the symptoms, but the underlying queries continued to degrade as the dataset grew.
Response times on key pages had become a support issue. The on-call rotation was dealing with database-related alerts weekly. The team understood it was a database problem but lacked the in-house database expertise to diagnose and fix it systematically.
Constraints
- Production database could not be taken offline — all analysis and changes had to be non-disruptive.
- The ORM (Eloquent) was in use throughout — raw SQL rewrites were acceptable only where necessary.
- Schema migrations needed to be safe against production traffic, with rollback capability.
- The team needed to understand the changes, not just apply them.
Approach
The engagement started with a structured audit rather than immediately proposing solutions. The slow query log was enabled on a replica, and a week of real traffic was analysed to build a prioritised list of problem queries by frequency and execution time.
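The exact capture settings depend on the engine, but on MySQL (the likely engine here, given the online-DDL syntax used later) enabling the slow query log on a replica is a matter of a few dynamic variables. The 500 ms threshold and log path below are illustrative choices, not values from the engagement:

```sql
-- Run on the replica only; these are dynamic variables, no restart needed.
SET GLOBAL slow_query_log = 'ON';
SET GLOBAL long_query_time = 0.5;  -- log anything slower than 500 ms (illustrative)
SET GLOBAL log_output = 'FILE';
SET GLOBAL slow_query_log_file = '/var/log/mysql/slow.log';
```

A tool such as pt-query-digest (from Percona Toolkit) can then aggregate a week of the resulting log by normalised query fingerprint, giving the prioritised list by frequency and total execution time.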
Each problem query was examined using EXPLAIN ANALYZE before and after changes. This created a clear, reviewable record of every change and its measured impact.
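As a sketch of what that before/after record looks like (the table and column names here are hypothetical, not taken from the engagement):

```sql
-- Before adding an index on orders.customer_id:
EXPLAIN ANALYZE
SELECT * FROM orders WHERE customer_id = 42;
-- The plan output shows a full table scan with actual row counts and timings.
-- After creating the index, re-running the same statement shows an index
-- lookup with far fewer rows examined, making the improvement measurable.
```

Because EXPLAIN ANALYZE actually executes the statement and reports real timings, the before/after pair is evidence rather than an estimate.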
The remediation work was grouped into three categories:
- Missing indexes — the most common and highest-impact category. Several frequently queried foreign keys and filter columns had no indexes.
- N+1 query patterns — Eloquent relationships loaded lazily in loops, generating hundreds of queries per page load. Resolved with eager loading and query consolidation.
- Schema structural issues — two tables with very wide rows were queried entirely when only a small number of columns were needed. These were addressed with select-specific queries and, in one case, a denormalised summary table for the reporting path.
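The N+1 fix in Eloquent typically amounts to replacing lazy relationship access with eager loading via with(). The model and relation names below are hypothetical, chosen only to illustrate the pattern:

```php
<?php
// N+1 pattern: one query for the posts, then one additional query
// per post when $post->author is accessed inside the loop.
$posts = Post::all();
foreach ($posts as $post) {
    echo $post->author->name;
}

// Eager loaded: two queries total, regardless of how many posts exist.
$posts = Post::with('author')->get();
foreach ($posts as $post) {
    echo $post->author->name;
}
```

The wide-row issue in the third category has a similar ORM-level expression: selecting only the needed columns, e.g. Post::select('id', 'title', 'author_id'), instead of fetching every column on a wide table.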
All index additions were deployed during low-traffic windows using ALTER TABLE ... ALGORITHM=INPLACE, LOCK=NONE to avoid table locks.
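A representative index addition under that pattern (again with a hypothetical table and column) looks like this:

```sql
-- ALGORITHM=INPLACE, LOCK=NONE makes the statement fail fast with an
-- error, rather than silently falling back to a locking table copy,
-- if the server cannot perform this change online.
ALTER TABLE orders
    ADD INDEX idx_orders_customer_id (customer_id),
    ALGORITHM=INPLACE, LOCK=NONE;
```

Stating the algorithm and lock mode explicitly turns an operational assumption ("this will be non-blocking") into something the database enforces.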
Technical Decisions
A slow query monitoring dashboard was set up as part of the engagement so the team could observe query performance over time after the remediation was complete. Preventing regression required visibility, not just a one-time fix.
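One low-overhead way to back such a dashboard on MySQL is the statement digest summary in performance_schema. A query along these lines (a sketch, not the engagement's actual dashboard query) surfaces the heaviest normalised statements:

```sql
SELECT DIGEST_TEXT,
       COUNT_STAR          AS executions,
       SUM_TIMER_WAIT/1e12 AS total_seconds,  -- timer columns are in picoseconds
       AVG_TIMER_WAIT/1e12 AS avg_seconds
FROM performance_schema.events_statements_summary_by_digest
ORDER BY SUM_TIMER_WAIT DESC
LIMIT 10;
```

Polling this periodically and charting the top digests over time gives the regression visibility described above without relying on the slow query log in production.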
The caching layers that had been added to mask symptoms were reviewed — some were necessary and retained, others were removed because the underlying queries no longer warranted the added complexity.
Outcomes
- The timeouts in the affected user flows were resolved without application code changes in those areas.
- Database-related alerts dropped substantially in the weeks following the engagement.
- The engineering team can now read and interpret EXPLAIN output, a lasting capability improvement.
- A prioritised backlog of remaining lower-priority schema improvements was left as documented follow-on work.
- Caching layer simplified — fewer moving parts to maintain.