A lot of AI coding agents are surprisingly decent at adding a column, updating an ORM model, and wiring a feature flag. They are much worse at respecting the ugly part of schema changes: production data outlives any single version of the code.
The failure mode is predictable. The model sees a rename, generates a destructive migration, updates the app code, and moves on. Meanwhile, background workers, cached binaries, old jobs, BI queries, and rollback paths still depend on the old shape.
If you want AI assistance on database changes without turning every migration into a trust exercise, you need a workflow that makes the unsafe path the hard one. This is the workflow I keep coming back to.
## Why this matters
Schema changes are one of the easiest places for an AI coding workflow to look correct in review while still being operationally unsafe. A text diff can hide the blast radius, especially when the generated patch touches migrations, app code, workers, and dashboards in the same PR.
The right question is not whether the model can generate SQL. It usually can. The right question is whether the system can survive partial rollout, retries, stale readers, and rollback.
## Workflow overview
1. **Expand.** Add the new column or table without breaking old readers.
2. **Dual-write.** Deploy app code that writes both shapes and reads safely.
3. **Backfill + verify.** Move old data in batches, then prove counts and drift are clean.
```mermaid
flowchart LR
  A[Agent plan] --> B[Expand schema]
  B --> C[Deploy dual-read and dual-write app]
  C --> D[Run bounded backfill job]
  D --> E[Verify counts and null drift]
  E --> F[Flip reads to new column]
  F --> G[Contract old schema later]
```
- Expand the schema with additive changes only.
- Deploy code that tolerates both the old and new shapes.
- Backfill in bounded chunks with observable progress.
- Verify counts, drift, and read-path correctness.
- Flip reads behind a flag.
- Contract the old schema later, never in the same rush.
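The flip step above can be sketched as a flag-gated read path. This is a minimal illustration, not a prescription: `readNewDisplayName` is a hypothetical flag name, and the inline `flags` object stands in for whatever flag client you actually use.

```typescript
// Sketch of "flip reads behind a flag": both read paths stay deployable,
// so the flip is reversible without a schema change or a rollback deploy.
type User = { display_name: string | null; full_name: string | null };

// Stand-in for a real feature-flag client; the flag name is illustrative.
const flags = { readNewDisplayName: true };

export function displayNameFor(user: User): string {
  if (flags.readNewDisplayName) {
    // New path: prefer the new column, fall back while backfill completes.
    return user.display_name ?? user.full_name ?? "Anonymous";
  }
  // Old path stays intact until the flip is proven clean.
  return user.full_name ?? "Anonymous";
}
```

Because the old path is still in the binary, turning the flag off is an instant rollback rather than a redeploy.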
## Implementation details

### Start with an additive migration
If an AI agent proposes a direct rename in production, I would reject it by default.
```sql
-- 20260423_expand_users_display_name.sql
-- Note: CREATE INDEX CONCURRENTLY cannot run inside a transaction block,
-- so this migration must run non-transactionally.
ALTER TABLE users
  ADD COLUMN display_name TEXT;

CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_users_display_name
  ON users (display_name);
```

This keeps rollback options alive. Old code still works. New code can write the new field. Analysts do not wake up to a broken dashboard just because the app diff looked tidy.
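One cheap way to make "reject destructive first steps by default" mechanical is a CI check that scans new migration files for destructive DDL. This is a naive regex sketch, not a SQL parser, and the patterns are assumptions — treat it as a speed bump that forces a human conversation, not a proof of safety.

```typescript
// Guard for CI: flag migrations that contain destructive statements.
// Regex matching is deliberately crude; it only needs to catch the
// obvious AI-generated rename/drop before a reviewer rubber-stamps it.
const DESTRUCTIVE_PATTERNS = [
  /\bDROP\s+(TABLE|COLUMN)\b/i,
  /\bALTER\s+TABLE\b[\s\S]*?\bRENAME\b/i,
  /\bTRUNCATE\b/i,
];

export function findDestructiveStatements(sql: string): string[] {
  // Return the source of each pattern that matched, for the CI error message.
  return DESTRUCTIVE_PATTERNS.filter((re) => re.test(sql)).map((re) => re.source);
}
```

Wired into the pipeline, a matched pattern fails the build and forces the expand-contract conversation before anything ships.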
### Force a dual-write period
The dual-write window is the part AI-generated patches most often skip.
```typescript
// Write path: keep both columns aligned during the transition window.
export async function updateProfile(userId: string, input: { fullName?: string }) {
  const displayName = normalizeDisplayName(input.fullName ?? null);
  await db.user.update({
    where: { id: userId },
    data: {
      full_name: input.fullName,
      display_name: displayName,
    },
  });
}

// Read path: prefer the new column, but tolerate rows not yet backfilled.
export function readPreferredDisplayName(user: {
  display_name: string | null;
  full_name: string | null;
}) {
  return user.display_name ?? user.full_name ?? "Anonymous";
}
```

Reads should prefer the new field but tolerate the old one. Writes should keep both fields aligned until backfill and rollout are proven clean.
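A table-driven check keeps the read-path tolerance honest across every row shape the database can still contain. This sketch restates the same fallback order as the read path above so it is self-contained; the sample rows are illustrative.

```typescript
// Check the read path against all three shapes that coexist mid-migration:
// backfilled rows, not-yet-backfilled rows, and legacy rows with no name at all.
type UserShape = { display_name: string | null; full_name: string | null };

// Same fallback order as readPreferredDisplayName: new column, old column, default.
const preferred = (u: UserShape) => u.display_name ?? u.full_name ?? "Anonymous";

const cases: Array<[UserShape, string]> = [
  [{ display_name: "Ada L.", full_name: "Ada Lovelace" }, "Ada L."], // post-backfill
  [{ display_name: null, full_name: "Ada Lovelace" }, "Ada Lovelace"], // pre-backfill
  [{ display_name: null, full_name: null }, "Anonymous"], // legacy empty row
];

for (const [user, expected] of cases) {
  if (preferred(user) !== expected) {
    throw new Error(`fallback broke for ${JSON.stringify(user)}`);
  }
}
```

If the flip ships before backfill finishes, the middle case is the one production will hit constantly, so it belongs in the test suite, not just in a reviewer's head.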
### Backfill in bounded chunks
A one-shot backfill is exactly the kind of thing a model writes when nobody forces it to think about locks, retryability, or pause points.
```typescript
const BATCH_SIZE = 1000;

export async function backfillDisplayNames() {
  // Keyset pagination: resumable from the last processed id after a pause or crash.
  let cursor: string | null = null;
  while (true) {
    const rows = await db.$queryRaw<Array<{ id: string; full_name: string | null }>>`
      SELECT id, full_name
      FROM users
      WHERE display_name IS NULL
        AND (${cursor}::uuid IS NULL OR id > ${cursor}::uuid)
      ORDER BY id
      LIMIT ${BATCH_SIZE}
    `;
    if (rows.length === 0) break;

    for (const row of rows) {
      await db.user.update({
        where: { id: row.id },
        data: { display_name: normalizeDisplayName(row.full_name) },
      });
    }

    cursor = rows[rows.length - 1].id;
    // Optional: sleep briefly between batches to ease lock pressure and replica lag.
  }
}
```

### Leave behind verification queries
```sql
SELECT
  COUNT(*) AS total_users,
  COUNT(*) FILTER (WHERE display_name IS NULL) AS missing_display_name,
  COUNT(*) FILTER (
    WHERE COALESCE(display_name, '') <> COALESCE(full_name, '')
  ) AS drifted_rows
FROM users;
```

```shell
$ pnpm tsx scripts/backfill-display-name.ts
processed=1000 remaining=42120 rate=910 rows/s
processed=1000 remaining=41120 rate=928 rows/s
processed=1000 remaining=40120 rate=905 rows/s
verification total_users=982144 missing_display_name=0 drifted_rows=0
```
If the PR does not tell reviewers how to verify migration success, it is incomplete.
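One way to turn the verification query into an artifact rather than a suggestion is a small gate script that runs it and fails the pipeline on drift. A minimal sketch, assuming the Prisma-style `$queryRaw` client used elsewhere in this post; how you inject the client and where the script runs are up to your setup.

```typescript
// Verification gate: run the drift query and throw (exit nonzero) unless clean.
// The table and column names come from the migration above; the db parameter
// shape is an assumption standing in for your real Prisma-style client.
type VerifyRow = { missing_display_name: bigint; drifted_rows: bigint };

export async function verifyBackfill(db: {
  $queryRaw: (q: TemplateStringsArray, ...values: unknown[]) => Promise<VerifyRow[]>;
}): Promise<void> {
  const [row] = await db.$queryRaw`
    SELECT
      COUNT(*) FILTER (WHERE display_name IS NULL) AS missing_display_name,
      COUNT(*) FILTER (
        WHERE COALESCE(display_name, '') <> COALESCE(full_name, '')
      ) AS drifted_rows
    FROM users
  `;
  if (row.missing_display_name > 0n || row.drifted_rows > 0n) {
    throw new Error(
      `backfill not clean: missing=${row.missing_display_name} drifted=${row.drifted_rows}`
    );
  }
}
```

Run it as the last step of the backfill job, and the flip PR can link to a green run instead of a promise.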
## What went wrong, and the tradeoffs
- Destructive first step: the agent renames or drops before compatibility code ships.
- Unbounded backfill: one giant update creates lock pressure or replica lag.
- Half migration: writers update both fields, but readers still hardcode the old one.
- Rollback illusion: code rollback works, but the schema is now in an awkward intermediate state.
- Hidden dependency drift: dashboards, exports, and sidecar jobs keep reading the old field.
| Strategy | Speed | Rollback safety | Operational load | Good fit |
|---|---|---|---|---|
| Direct rename/drop | High | Poor | Low at first, high later | Local or dev only |
| Expand-contract | Medium | Strong | Medium | Most production apps |
| Shadow table copy | Low | Strongest | High | Very large or high-risk rewrites |
## References
- Martin Fowler, Parallel Change
- GitHub code scanning and protected review patterns
- PostgreSQL ALTER TABLE documentation
## Practical checklist
- [ ] Keep the first migration additive only.
- [ ] Make the app read both shapes before switching fully.
- [ ] Dual-write during the transition window.
- [ ] Backfill in resumable chunks with visible progress.
- [ ] Add verification SQL directly in the PR.
- [ ] Delay the contract step until after a clean release cycle.
- [ ] Treat AI-generated migration plans as drafts, not authority.
## Conclusion
AI coding agents can help with database work, but only when the workflow is opinionated enough to resist the model’s tendency toward short, clean, unsafe patches.
Use expand-contract, make the dual-write period explicit, require verification artifacts, and postpone destructive cleanup. The safest migration is usually the one that looks slightly overcautious in review and pleasantly boring in production.