I Replaced Debezium with a Cron Job. Here’s How I Made It Reliable.
Debezium was going down 3–4 times a month. Its only job in our system was publishing events to Kafka — order created, PO created, that kind of thing. Every time it went down, events were lost.
So I replaced it with a cron job and a Postgres table.
That probably sounds too simple. But there are a few non-obvious things you need to get right, otherwise you just trade one failure mode for another. This post covers those.
Why Not Just Call producer.send() Directly?
The obvious fix is to publish to Kafka right inside your service, after the DB write:
await db.query('INSERT INTO orders ...');
await producer.send({ topic: 'order-events', ... });The problem: these two operations aren’t atomic. The DB write can succeed, and the Kafka publish can fail. If your service restarts before the retry fires, the event is gone. You’ll never know.
The safer approach is to write the event to your database first, in the same transaction as your business record, and let a background job handle the Kafka publish separately. This way, the event survives in Postgres even if Kafka is down, and gets delivered when things recover.
This is called the Transactional Outbox Pattern. Fancy name, straightforward idea.
The Table
CREATE TABLE outbox_events (
id SERIAL PRIMARY KEY,
payload JSONB NOT NULL,
event_type TEXT NOT NULL,
status TEXT NOT NULL DEFAULT 'PENDING',
retries INTEGER,
scheduled_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
updated_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
When an order is created, your service does two inserts in one transaction:
await db.query('INSERT INTO orders ...');
await db.query(`INSERT INTO outbox_events (event_type, payload) VALUES ($1, $2)`,
['ORDER_CREATED', { orderId: 123, ... }]
);If the transaction fails, both roll back. If it succeeds, the event is durably stored and will eventually reach Kafka, even if Kafka is currently down.
The Cron Job
A cron fires every 30 seconds, picks up pending events, publishes them to Kafka, and marks them done.
const schedules = [
{
name: 'process_outbox_kafka',
cronTime: '*/30 * * * * *',
enabled: true
}
];
One thing you need: a concurrency guard. Cron fires on a fixed interval regardless of whether the last run finished. If processing 50 events takes longer than 30 seconds, you’ll have two runs overlapping and potentially publishing the same events twice.
Fix is simple — a Set that tracks what's currently running:
const runningTasks = new Set();
cron.schedule('*/30 * * * * *', async () => {
if (runningTasks.has('process_outbox_kafka')) {
return; // previous run still going, skip this tick
} runningTasks.add('process_outbox_kafka');
try {
await processOutboxEvents();
} finally {
runningTasks.delete('process_outbox_kafka');
}
});The Claim Query
This is the most important part. When the cron fires, it needs to:
- Find events that need processing
- Mark them as claimed so another instance doesn’t grab the same ones
- Return them — all in one query
UPDATE outbox_events
SET status = 'PROCESSING', updated_at = NOW()
WHERE id IN (
SELECT id FROM outbox_events
WHERE (
status IN ('PENDING', 'FAILED')
OR (status = 'PROCESSING' AND updated_at < NOW() - INTERVAL '5 minutes')
)
AND scheduled_at <= NOW()
AND COALESCE(retries, 0) < 5
LIMIT 50
FOR UPDATE SKIP LOCKED
)
RETURNING id, payload, event_type;
A few things worth explaining here:
FOR UPDATE SKIP LOCKED — If you ever run multiple instances of this service, two cron jobs could try to claim the same rows simultaneously. This tells Postgres to lock the rows I'm selecting, and if any row is already locked by someone else, skip it rather than waiting.
status = 'PROCESSING' AND updated_at < NOW() - 5 minutes — If a cron run crashes halfway through, those rows stay stuck in PROCESSING forever. This condition reclaims them after 5 minutes. Crash recovery in one SQL clause.
COALESCE(retries, 0) < 5 — Hard cap of 5 attempts. After that, the event stays FAILED and never retries again. Prevents an endlessly broken event from blocking the queue.
LIMIT 50 — Without this, a large backlog could pull thousands of rows into memory in one go. 50 per tick keeps things predictable (as per my server).
Publishing to Kafka
Once you have the batch, publish everything concurrently. Use Promise.allSettled not Promise.allYou want every event attempted, even if one fails:
const results = await Promise.allSettled(
events.map(({ id, payload, event_type }) => {
return producer.send({
topic: event_type,
messages: [{ key: String(id), value: JSON.stringify(payload) }],
});
})
);
const successIds = [];
const failIds = [];
results.forEach((result, index) => {
if (result.status === 'fulfilled') {
successIds.push(events[index].id);
} else {
failIds.push(events[index].id);
}
});
Then update both groups in one query each — not one query per row:
if (successIds.length > 0) {
await db.query(
`UPDATE outbox_events SET status = 'COMPLETED', updated_at = NOW() WHERE id = ANY($1::int[])`,
[successIds]
);
}if (failIds.length > 0) {
await db.query(
`UPDATE outbox_events SET status = 'FAILED', retries = COALESCE(retries, 0) + 1, updated_at = NOW() WHERE id = ANY($1::int[])`,
[failIds]
);
}Two DB round-trips for the entire batch, regardless of size.
Keep the Kafka Producer as a Singleton
Don’t reconnect on every cron tick. The broker handshake is expensive. Create the producer once and reuse it:
let producerInstance = null;
const getKafkaProducer = async () => {
if (producerInstance) return producerInstance;
const producer = kafka.producer();
await producer.connect();
producerInstance = producer;
return producerInstance;
};And disconnect it cleanly on shutdown:
process.on('SIGTERM', async () => {
if (producerInstance) await producerInstance.disconnect();
process.exit(0);
});What This Gets You
If Kafka goes down, events queue up in Postgres and drain automatically when it recovers. Your application services don’t need to know anything happened.
If the cron crashes mid-batch, the 5-minute stuck-processing recovery reclaims those events on the next tick.
If you scale to multiple instances,FOR UPDATE SKIP LOCKED handles the coordination.
The whole thing is ~150 lines of Node.js and one table. In our case, it also meant we could remove Debezium entirely, a piece of infrastructure that required Kafka Connect, connector config, WAL-level Postgres permissions. Still, it went down every week or two (Suggest improvements if you think you could make the system more reliable or reduce downtime further.).
Things to Watch
Events that fail 5 times stay failed permanently. You need an alert on COUNT(*) WHERE status = 'FAILED' AND retries >= 5. If that number climbs, something is structurally broken.
This is at least once delivery, not exactly once. If a publish succeeds but the service crashes before marking it COMPLETED, it'll be published again on the next retry. Your consumers should handle duplicates.
That’s it. A cron job, a Postgres table, and a few SQL clauses that handle the edge cases.