Replies: 5 comments 8 replies
-
Hi @Coder1221, Thanks for submitting your draft so early! You've clearly grasped the core goal of the project, which gives us a great starting point. To be candid, this draft currently reads more like a statement of intent than a technical design document. Because this is a 350-hour project, your proposal needs to be highly technical. It should serve as proof that you have a solid architectural plan before the coding period begins. The real complexity here isn't just the class structure - it's database management and concurrency. To make this proposal competitive, please update your next draft to include:
Take your time to think through these challenges, and feel free to ask questions as you work on Version 2. Looking forward to your next iteration!
-
Draft-V2

About Me

My name is Abdur Rehman. I have been a professional software engineer since 2021, currently working full-time at a health tech company. I have one year of hands-on experience with Ruby, and I have also worked with Node.js. I have free time outside of work that I'd like to dedicate to contributing to open-source projects, and Google Summer of Code 2026 is the perfect opportunity to do that.

Problem Understanding
Technical Approach

1. Database Schema

CREATE TABLE rage_deferred_tasks (
id UUID PRIMARY KEY,
payload JSONB NOT NULL,
status VARCHAR(20) NOT NULL DEFAULT 'pending',
created_at TIMESTAMPTZ NOT NULL DEFAULT now(),
locked_at TIMESTAMPTZ,
completed_at TIMESTAMPTZ
);

-- indexes
CREATE INDEX idx_rage_deferred_pending ON rage_deferred_tasks (created_at) WHERE status = 'pending';

Key columns in rage_deferred_tasks:
2. Race Condition

The problem: two fibers/threads in Rage could see the same task as pending and try to claim it. The solution is row-level locking.

SELECT * FROM rage_deferred_tasks
WHERE status = 'pending'
ORDER BY created_at ASC
LIMIT 1
FOR UPDATE SKIP LOCKED
3. Crash Recovery

This is the trickiest part. Consider a scenario where the server crashes or we have to restart it. On server boot-up, we will reset every task still marked as running back to pending:

-- called on startup
update rage_deferred_tasks
set status = 'pending',
locked_at = null
where status = 'running';

Consider another scenario:
The solution for this problem: periodically scan for crashed jobs whose locked_at timestamp is older than the timeout.

-- called periodically
update rage_deferred_tasks
set status = 'pending',
locked_at = null
where status = 'running' and locked_at < now() - INTERVAL '5 minutes';

Any task that has been running for more than 5 minutes is considered crashed: its status is reset to pending so another worker can pick it up. This periodic re-queuing adds a complication: some tasks legitimately need more than 5 minutes, and they must not be mistaken for crashed ones. To handle this, a heartbeat mechanism is used. While a task is being processed, the worker periodically refreshes the task's locked_at timestamp; as long as the timestamp keeps being refreshed within the configured timeout window, the system considers the task active and does not re-queue it. If locked_at stops being refreshed within the timeout, the system assumes the worker crashed or stalled and marks the task as pending so it can be retried.

def run_with_heart_beat(task)
  heartbeat = Thread.new do
    loop do
      sleep(2.5 * 60) # half of the 5-minute crash-detection interval above
      refresh_lock(task[:id]) # UPDATE ... SET locked_at = NOW() WHERE id = task_id
    end
  end
  # task execution
ensure
  heartbeat.kill
end

This way, locked_at stays fresh while the task is still running.

4. Table Growth

Completed tasks need to be cleaned up after they complete, or the table will grow forever. Here are three possibilities with trade-offs.

Option 1: Delete on Completion (Simplest)

delete from rage_deferred_tasks where id = task_id

Clean and simple. Downside: you lose all history - no visibility into what ran.

Option 2: Delete after some interval, e.g. 6 hours

Delete all completed jobs older than 6 hours:

delete from rage_deferred_tasks where status = 'completed' and completed_at < now() - interval '6 hours'

This keeps a short audit window (the last 6 hours) while preventing table growth.

Option 3: Archive on completion

Move completed rows to a separate archive table before deleting them.

Database Backend Implementation

The Database backend will implement the interface currently defined by the existing disk backend:

Rage::Deferred::Backends::Disk # existing implementation, unchanged
Rage::Deferred::Backends::Database # new core logic

class Rage::Deferred::Backends::Database
  def add
  end

  def remove
  end

  def pending_tasks
  end
end

Database Adapters

For the database backend, three adapters will be implemented:
While Active Record provides an abstraction over databases such as MySQL and PostgreSQL, dedicated MySQL and PostgreSQL adapters are still required. These adapters support environments where Active Record is not used and the application talks to the database directly through lower-level driver gems.

class Rage::Deferred::Backends::Database::ActiveRecord < Rage::Deferred::Backends::Database
  def insert_task
  end

  def remove_task
  end

  def pending_task
  end
end

Milestones & Timeline

This is a rough outline and subject to refinement:
Deliverables

A pull request containing the database persistence backend, updated documentation, and well-written test cases.

Validation

A demo video demonstrating the feature, along with any additional validation artifacts requested by the reviewers.
-
Hi @Coder1221, you've significantly improved the draft!
-
Draft-V3

About Me

My name is Abdur Rehman. I have been a professional software engineer since 2021, currently working full-time at a health tech company. I have one year of hands-on experience with Ruby, and I have also worked with Node.js. I have free time outside of work that I'd like to dedicate to contributing to open-source projects, and Google Summer of Code 2026 is the perfect opportunity to do that.

Problem Understanding
Technical Approach

1. Database Schema

create_table :rage_deferred_tasks, if_not_exists: true, id: :uuid do |t|
t.jsonb :payload, null: false
t.string :status, default: 'pending', limit: 20, null: false
t.datetime :locked_at
t.integer :attempt_count, default: 0
t.text :error_message
t.datetime :failed_at
t.datetime :retry_at
t.datetime :completed_at
t.timestamps
end
add_index :rage_deferred_tasks, :created_at, where: "status = 'pending'"

Key columns in rage_deferred_tasks:
2. Race Condition

The problem: two fibers/threads in Rage could see the same task as pending and try to claim it. The solution is row-level locking.

# Fetch and lock the first pending task
task = DeferredTask.where(status: 'pending').order(created_at: :asc).lock.first

Active Record's lock scope appends FOR UPDATE to the generated SQL, so inside a transaction the first worker to lock the row blocks competing workers until it commits.
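One caveat: a bare lock issues plain FOR UPDATE, which blocks, unlike the SKIP LOCKED variant in the earlier SQL draft. A second worker reaching the same row waits for the first transaction, then must re-check the row's status, because the row may have been claimed while it was waiting. A pure-Ruby sketch of that re-check pattern (illustrative only; Mutex#synchronize stands in for the row lock):

```ruby
# One "pending row"; the mutex models the row-level FOR UPDATE lock.
row = { status: "pending", lock: Mutex.new }
winners = Queue.new

2.times.map { |w|
  Thread.new do
    # FOR UPDATE analogy: block until the lock is free, then re-check the
    # status, since the row may have been claimed while we were waiting.
    row[:lock].synchronize do
      if row[:status] == "pending"
        row[:status] = "running" # claim succeeded
        winners << w
      end
      # a claimer that finds the row no longer pending simply moves on
    end
  end
}.each(&:join)

puts winners.size # => 1 - only one worker wins the claim
```

The re-check after acquiring the lock is essential: without it, both workers would "claim" the task. In SQL terms, the claiming UPDATE should keep status = 'pending' in its WHERE clause.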
3. Crash Recovery

This is the trickiest part. Consider a scenario where the server crashes or we have to restart it. On server boot-up, we will reset every task still marked as running back to pending:

# Called on startup
DeferredTask.where(status: 'running').update_all(status: 'pending', locked_at: nil)

Consider another scenario:
The solution for this problem: periodically scan for crashed jobs whose locked_at timestamp is older than the timeout.

# Called periodically
crash_timeout = config.task_timeout # default 5 minutes
DeferredTask.where(status: 'running')
.where('locked_at < ?', Time.current - crash_timeout)
  .update_all(status: 'pending', locked_at: nil)

Any task that has been running for more than 5 minutes is considered crashed: its status is reset to pending so another worker can pick it up. This periodic re-queuing adds a complication: some tasks legitimately need more than 5 minutes, and they must not be mistaken for crashed ones. To handle this, a heartbeat mechanism is used. While a task is being processed, the worker periodically refreshes the task's locked_at timestamp; as long as the timestamp keeps being refreshed within the configured timeout window, the system considers the task active and does not re-queue it. If locked_at stops being refreshed within the timeout, the system assumes the worker crashed or stalled and marks the task as pending so it can be retried.

def run_with_heart_beat(task)
heartbeat_interval = config.task_timeout / 2 # half of crash detection timeout
heartbeat_thread = Thread.new do
loop do
sleep(heartbeat_interval)
# Refresh the lock timestamp to indicate task is still running
DeferredTask.where(id: task.id).update_all(locked_at: Time.current)
end
end
# task execution
ensure
heartbeat_thread.kill
end

This way, locked_at stays fresh while the task is still running.

4. Cleanup Strategy & Table Growth

Completed tasks need to be cleaned up, or the table will grow unbounded. Three approaches were considered.

Option 1: Delete on Completion (Simplest)

DeferredTask.find(task_id).delete

Pros: Simple, minimal storage

Option 2: Delete after time interval (6 hours)

DeferredTask.where(status: 'completed')
.where('completed_at < ?', Time.current - 6.hours)
  .delete_all

Pros: Short audit window for debugging; prevents unbounded growth

Option 3: Archive on completion

Move completed rows to a separate archive table before deleting them.

Selected Approach: Option 2 (Delete after interval)

We will implement an automatic cleanup job that deletes completed tasks older than a configurable interval (default: 6 hours). Benefits:
The cleanup will run periodically (e.g., every hour) as a background maintenance task:

# Cleanup background job
def cleanup_completed_tasks
threshold = config.task_retention_window # default 6 hours
  DeferredTask.where("status = 'completed' AND completed_at < ?", Time.current - threshold).delete_all
end

5. Dead-Letter Queue for Failed Tasks

Tasks may fail due to transient or permanent errors. Without proper handling, failed tasks could:
Solution: Implement a dead-letter queue (DLQ) mechanism.

Failed task workflow:
Dead-letter queue table:

create_table :rage_deferred_tasks_dlq, if_not_exists: true, id: :uuid do |t|
t.uuid :original_id, null: false
t.jsonb :payload, null: false
t.text :error_message
t.integer :attempt_count
t.datetime :failed_at, null: false
t.timestamps
end

This table captures all failed tasks that have exhausted their retry attempts, allowing for:
def execute_task(task)
run_with_heart_beat(task)
rescue => e
task.attempt_count += 1
if task.attempt_count <= config.max_task_retries
# Retry with exponential backoff
backoff = (2 ** task.attempt_count) * 60 # seconds
task.update(
status: 'pending',
error_message: e.message,
attempt_count: task.attempt_count,
retry_at: Time.current + backoff
)
else
# Move to DLQ
FailedDeferredTask.create!(
original_id: task.id,
payload: task.payload,
error_message: e.message,
attempt_count: task.attempt_count,
failed_at: Time.current
)
task.delete
end
end

Database Backend Implementation

The Database backend will implement the interface currently defined by the existing disk backend:

Rage::Deferred::Backends::Disk # existing implementation, unchanged
Rage::Deferred::Backends::Database # new core logic

class Rage::Deferred::Backends::Database
  def add
  end

  def remove
  end

  def pending_tasks
  end
end

Database Adapter

The implementation will use Active Record, which provides a built-in abstraction over database differences, including JSON column support across PostgreSQL and MySQL:

class Rage::Deferred::Backends::Database::ActiveRecord < Rage::Deferred::Backends::Database
  def pending_task
    # Active Record handles SQL generation and JSON serialization across databases
    RageDeferredTask.where(status: 'pending')
                    .where('retry_at IS NULL OR retry_at <= ?', Time.current)
                    .order(created_at: :asc).lock.first
  end
end

Using Active Record ensures:
Milestones & Timeline

This is a rough outline and subject to refinement:
Deliverables

A pull request containing the database persistence backend, updated documentation, and well-written test cases.

Validation

A demo video demonstrating the feature, along with any additional validation artifacts requested by the reviewers.
-
About Me

My name is Abdur Rehman. I have been a professional software engineer since 2021, currently working full-time at a health tech company. I have one year of hands-on experience with Ruby, and I have also worked with Node.js. I have free time outside of work that I'd like to dedicate to contributing to open-source projects, and Google Summer of Code 2026 is the perfect opportunity to do that.

Problem Understanding
Technical Approach

1. Database Schema

create_table :rage_deferred_tasks, if_not_exists: true do |t|
t.text :payload, null: false
t.string :status, default: 'pending', limit: 20, null: false
t.datetime :locked_at
t.string :locked_by # identifies which worker claimed the task (used by claim_batch)
t.integer :attempt_count, default: 0
t.text :error_message
t.datetime :failed_at
t.datetime :retry_at
t.datetime :completed_at
t.timestamps
end
add_index :rage_deferred_tasks, [:status, :created_at]

Key columns in rage_deferred_tasks:
2. Crash Recovery and Restart Distribution

This is the trickiest part. Consider a scenario:
The solution for this problem: periodically scan for crashed jobs whose locked_at timestamp is older than the timeout.

# Called periodically
crash_timeout = config.task_timeout # default 5 minutes
DeferredTask.where(status: 'scheduled')
.where('locked_at < ?', Time.current - crash_timeout)
  .update_all(status: 'pending', locked_at: nil)

Any task that has been running for more than 5 minutes is considered crashed: its status is reset to pending so another worker can pick it up. This periodic re-queuing adds a complication: some tasks legitimately need more than 5 minutes, and they must not be mistaken for crashed ones. To handle this, a heartbeat mechanism is used. While a task is being processed, the worker periodically refreshes the task's locked_at timestamp; as long as the timestamp keeps being refreshed within the configured timeout window, the system considers the task active and does not re-queue it. If locked_at stops being refreshed within the timeout, the system assumes the worker crashed or stalled and marks the task as pending so it can be retried.

def run_with_heart_beat(task)
heartbeat_interval = config.task_timeout / 2 # half of crash detection timeout
heartbeat_thread = Thread.new do
loop do
sleep(heartbeat_interval)
# Refresh the lock timestamp to indicate task is still running
DeferredTask.where(id: task.id).update_all(locked_at: Time.current)
end
end
  yield # task execution
ensure
heartbeat_thread.kill
end

This way, locked_at stays fresh while the task is still running.

2.1 Restart Distribution

Consider a multi-server environment like Kubernetes where we have multiple pods. If all pending tasks are claimed in one batch on startup, one pod will be overwhelmed with every pending task while the others sit idle. To prevent this, we will implement a claiming strategy that balances the load across workers/pods: on startup, each worker claims a batch of pending tasks and starts executing them. After the initial batch, a background thread continues to claim the remaining pending tasks in batches until none are left.

def on_boot
  recover_pending_tasks # immediate batch claim
  start_drain_thread # background sweep for remaining pending tasks, if any
end

def recover_pending_tasks(batch_size)
  claimed = claim_batch(batch_size)
  claimed.each { |task| schedule_task(task) }
end

def claim_batch(batch_size)
ApplicationRecord.transaction do
rows = RageDeferredTask
.where(status: "pending")
.order(created_at: :asc)
.limit(batch_size)
.lock("FOR UPDATE SKIP LOCKED") # to prevent race conditions between workers
.to_a
    next [] if rows.empty? # use next, not return, to exit the transaction block cleanly
ids = rows.map(&:id)
RageDeferredTask
.where(id: ids)
.update_all(
      status: "scheduled", # lowercase, to match the recovery queries
locked_at: Time.current,
locked_by: worker_id
)
rows
end
end

def start_drain_thread
Thread.new do
loop do
sleep(drain_interval) # e.g. 30 seconds
claimed = claim_batch(batch_size)
if claimed.empty?
# Nothing left unclaimed — recovery is complete, stop polling
break
end
claimed.each { |task| schedule_task(task) }
end
end
end

This approach gives a more balanced distribution of pending tasks across workers on startup: claiming in batches, with a background thread draining whatever remains, makes recovery smoother and spreads the load among workers.

2.2 Background Recovery Thread for Crashed Tasks

In addition to the startup recovery process, a background thread will periodically check for crashed tasks and re-enqueue them into the fiber scheduler.

def start_recovery_thread
Thread.new do
loop do
sleep(config.task_timeout / 2) # check for crashed tasks every half of the crash timeout period
tasks = ApplicationRecord.transaction do
rows = RageDeferredTask
.where(status: "scheduled")
.where("locked_at < ?", Time.current - config.task_timeout)
.limit(batch_size)
.lock("FOR UPDATE SKIP LOCKED")
.to_a
      next [] if rows.empty? # `return` inside this thread block would raise LocalJumpError
RageDeferredTask
.where(id: rows.map(&:id))
.update_all(status: "pending", locked_at: nil)
rows
end
tasks.each { |task| schedule_task(task) } # re-enqueue into fiber scheduler for execution
end
end
end

3. Cleanup Strategy & Table Growth

Completed tasks need to be cleaned up, or the table will grow unbounded. Three approaches were considered.

Option 1: Delete on Completion (Simplest)

DeferredTask.find(task_id).delete

Pros: Simple, minimal storage

Option 2: Delete after time interval (6 hours)

# Deletes in batches to avoid long-held locks and to maintain performance for active workers
DeferredTask.where(status: 'completed')
.where('completed_at < ?', Time.current - 6.hours)
  .in_batches(of: 1000).delete_all

Pros: Short audit window for debugging; prevents unbounded growth; batched deletion avoids long-held table locks

Option 3: Archive on completion

Move completed rows to a separate archive table before deleting them.

Selected Approach: Option 2 (Delete after interval)

We will implement an automatic cleanup job that deletes completed tasks older than a configurable interval (default: 6 hours). Benefits:
The cleanup will run periodically (e.g., every hour) as a background maintenance task:

def cleanup_completed_tasks
threshold = config.task_retention_window
batch_size = config.cleanup_batch_size
loop do
deleted_count = RageDeferredTask
.where(status: "completed")
.where("completed_at < ?", Time.current - threshold)
.limit(batch_size)
.delete_all # returns integer count directly
break if deleted_count == 0
sleep(0.5)
end
end

Why Batched Deletion is Critical:
4. Dead-Letter Queue for Failed Tasks

Tasks may fail due to transient or permanent errors. Without proper handling, failed tasks could:
Solution: Implement a dead-letter queue (DLQ) mechanism.

Failed task workflow:
Dead-letter queue table:

create_table :rage_deferred_tasks_dlq, if_not_exists: true do |t|
t.bigint :original_id, null: false
t.text :payload, null: false
t.text :error_message
t.integer :attempt_count
t.datetime :failed_at, null: false
t.timestamps
end

This table captures all failed tasks that have exhausted their retry attempts, allowing for:
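The retry logic below backs off exponentially via (2 ** attempt_count) * 60 seconds. As a standalone sanity check of the schedule this formula produces (the value 3 for max retries is just an example, not a default from the proposal):

```ruby
max_task_retries = 3 # example value for config.max_task_retries

# Delay before each retry attempt, in seconds: 2^attempt * 60
schedule = (1..max_task_retries).map { |attempt| (2**attempt) * 60 }

puts schedule.inspect # => [120, 240, 480] - i.e. 2, 4, and 8 minutes between attempts
```

Each delay doubles the previous one, so transient failures get quick retries while persistently failing tasks are pushed further out before landing in the DLQ.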
def execute_task(task)
run_with_heart_beat(task) { perform_task(task.payload) }
task.update!(status: 'completed', completed_at: Time.current)
rescue => e
task.attempt_count += 1
if task.attempt_count <= config.max_task_retries
# Retry with exponential backoff
backoff = (2 ** task.attempt_count) * 60 # seconds
task.update(
status: 'pending',
error_message: e.message,
attempt_count: task.attempt_count,
retry_at: Time.current + backoff
)
else
# Move to DLQ
FailedDeferredTask.create!(
original_id: task.id,
payload: task.payload,
error_message: e.message,
attempt_count: task.attempt_count,
failed_at: Time.current
)
task.delete
end
end

Database Backend Implementation

The Database backend will implement the interface currently defined by the existing disk backend:

Rage::Deferred::Backends::Disk # existing implementation, unchanged
Rage::Deferred::Backends::Database # new core logic

class Rage::Deferred::Backends::Database
def initialize
on_boot
end
def add
end
def remove
end
def pending_tasks
end
private
def on_boot
    recover_pending_tasks # initial batch claim on startup to balance load across workers
    start_drain_thread # background thread that claims remaining pending tasks in batches until none are left
    start_recovery_thread # background thread that recovers crashed tasks periodically (lives for the process lifetime)
    start_cleanup_thread # background thread that cleans up completed tasks periodically (lives for the process lifetime)
end
def recover_pending_tasks
claim_batch(config.startup_batch_size).each { |task| schedule_task(task) }
end
def schedule_task(task)
Fiber.schedule do
execute_task(task)
end
end
def start_drain_thread
Thread.new do
loop do
sleep(drain_interval) # e.g. 30 seconds
claimed = claim_batch(batch_size)
if claimed.empty?
break
end
claimed.each { |task| schedule_task(task) }
end
end
end
def start_recovery_thread
Thread.new do
loop do
sleep(config.orphan_check_interval)
fetch_orphaned_tasks.each { |task| schedule_task(task) }
end
end
end
def start_cleanup_thread
Thread.new do
loop do
sleep(config.cleanup_interval) # e.g. every hour
cleanup_completed_tasks
end
end
end
def cleanup_completed_tasks
threshold = config.task_retention_window
batch_size = config.cleanup_batch_size
loop do
rows = fetch_completed_tasks(batch_size, threshold)
deleted_count = rows.delete_all
break if deleted_count < batch_size
sleep(0.5)
end
end
# adapter will implement these
def claim_batch(size)
raise NotImplementedError
end
def fetch_orphaned_tasks
raise NotImplementedError
end
def fetch_completed_tasks(batch_size, threshold)
raise NotImplementedError
end
end

Database Adapter

The implementation will use Active Record, which provides a built-in abstraction over database differences:

class Rage::Deferred::Backends::Database::ActiveRecord < Rage::Deferred::Backends::Database
  def add(task)
    RageDeferredTask.create!(payload: task.payload, status: 'pending', ...)
  end
end

Using Active Record ensures:
Milestones & Timeline

This is a rough outline and subject to refinement:
Deliverables

A pull request containing the database persistence backend, updated documentation, and well-written test cases.

Validation

A demo video demonstrating the feature, along with any additional validation artifacts requested by the reviewers.
-
Draft-V1
About Me
My name is Abdur Rehman. I have been a professional software engineer since 2021, currently working full-time at a health tech company. I have one year of hands-on experience with ruby, apart from that I worked with Node.js as well. I have free time outside of work that I'd like to dedicate to contributing to open source projects, and Google Summer of Code 2026 is the perfect opportunity to do that.
Problem Understanding
Rage::Deferred currently provides task execution backed by a Write-Ahead Log (WAL) that stores metadata on disk. The issue arises in modern infrastructures such as Docker containers that have ephemeral disk storage. If a container is restarted or redeployed, the log file is lost. To prevent this data loss, we need a database persistence layer that survives container restarts and ensures no background jobs are dropped.
Technical Approach
Rage::Deferred::Backends::Disk - existing implementation, unchanged
Rage::Deferred::Backends::Database - new core logic

The Database backend will implement the interfaces currently defined in
nil.rb. It will serve as a base class with three concrete adapters: one for :mysql, one for :postgres, and one for :active_record. The structure will look roughly like this:
And the config will look like this:

config.deferred.backend = :database
config.deferred.backend.adapter = :active_record

Milestones & Timeline
This is a rough outline and subject to refinement:
Week 1: Deep dive into the Rage codebase, read documentation, and engage with the community
Week 2: Finalize the technical approach and begin implementation
Week 3: Complete the base Database adapter and implement mysql, postgres and active_record adapters
Week 4: Write comprehensive test cases for adapters
Week 5: Write documentation, address reviewer feedback, and get the PR merged
Deliverables
A pull request containing the database persistence backend, updated documentation, and well-written test cases.
Validation
A demo video demonstrating the feature, along with any additional validation artifacts requested by the reviewers.