Replies: 5 comments 8 replies
-
Hi @Coder1221, Thanks for submitting your draft so early! You've clearly grasped the core goal of the project, which gives us a great starting point. To be candid, this draft currently reads more like a statement of intent than a technical design document. Because this is a 350-hour project, your proposal needs to be highly technical. It should serve as proof that you have a solid architectural plan before the coding period begins. The real complexity here isn't just the class structure - it's database management and concurrency. To make this proposal competitive, please update your next draft to include:
Take your time to think through these challenges, and feel free to ask questions as you work on Version 2. Looking forward to your next iteration!
-
Draft-V2

About Me

My name is Abdur Rehman. I have been a professional software engineer since 2021, currently working full-time at a health tech company. I have one year of hands-on experience with Ruby, and I have also worked with Node.js. I have free time outside of work that I'd like to dedicate to contributing to open-source projects, and Google Summer of Code 2026 is the perfect opportunity to do that.

Problem Understanding
Technical Approach

1. Database Schema

CREATE TABLE rage_deferred_tasks (
id UUID PRIMARY KEY,
payload JSONB NOT NULL,
status VARCHAR(20) NOT NULL DEFAULT 'pending',
created_at TIMESTAMPTZ NOT NULL DEFAULT now(),
locked_at TIMESTAMPTZ,
completed_at TIMESTAMPTZ
);

-- indexes
CREATE INDEX idx_rage_deferred_pending ON rage_deferred_tasks (created_at) WHERE status = 'pending';

Key columns in rage_deferred_tasks:
2. Race Condition

The problem: two fibers/threads in Rage could see the same task as pending and try to claim it. The solution is row-level locking.

SELECT * FROM rage_deferred_tasks
WHERE status = 'pending'
ORDER BY created_at ASC
LIMIT 1
FOR UPDATE SKIP LOCKED
3. Crash Recovery

This is the trickiest part. Consider a scenario where the server crashes or we have to restart it. On server boot-up, we will reset every task still marked as running back to pending:

-- called on startup
update rage_deferred_tasks
set status = 'pending',
locked_at = null
where status = 'running';

Consider another scenario:
The solution for this problem: periodically scan for crashed jobs whose locked_at timestamp is older than the timeout.

-- called periodically
update rage_deferred_tasks
set status = 'pending',
locked_at = null
where status = 'running' and locked_at < now() - INTERVAL '5 minutes';

Any task that has been running for more than 5 minutes is considered crashed: its status is reset to pending so another worker can pick it up. This periodic re-queuing adds a complication: some tasks legitimately need more than 5 minutes, and they must not be mistaken for crashed ones. To handle this, a heartbeat mechanism is used. While a task is being processed, the worker periodically refreshes the task's locked_at timestamp; as long as the timestamp keeps being refreshed within the configured timeout window, the system considers the task active and does not re-queue it. If locked_at stops being refreshed within the timeout, the system assumes the worker crashed or stalled and marks the task as pending so it can be retried.

def run_with_heart_beat(task)
  heartbeat = Thread.new do
    loop do
      sleep(2.5 * 60) # half of the 5-minute crash-detection interval above
      refresh_lock(task[:id]) # UPDATE ... SET locked_at = NOW() WHERE id = task_id
    end
  end
  # task execution
ensure
  heartbeat.kill
end

This way, locked_at stays fresh while the task is still running.

4. Table Growth

Completed tasks need to be cleaned up after they complete, or the table will grow forever. Here are three possibilities with trade-offs.

Option 1: Delete on Completion (Simplest)

delete from rage_deferred_tasks where id = task_id

Clean and simple. Downside: you lose all history - no visibility into what ran.

Option 2: Delete after some interval, e.g. 6 hours

Delete all completed jobs older than 6 hours:

delete from rage_deferred_tasks where status = 'completed' and completed_at < now() - interval '6 hours'

This keeps a short audit window (the last 6 hours) while preventing table growth.

Option 3: Archive on completion

Move completed rows to a separate archive table before deleting them.

Database Backend Implementation

The Database backend will implement the interface currently defined by the existing disk backend:

Rage::Deferred::Backends::Disk # existing implementation, unchanged
Rage::Deferred::Backends::Database # new core logic

class Rage::Deferred::Backends::Database
  def add
  end

  def remove
  end

  def pending_tasks
  end
end

Database Adapters

For the database backend, three adapters will be implemented:
While Active Record provides an abstraction over databases such as MySQL and PostgreSQL, dedicated MySQL and PostgreSQL adapters are still required. These adapters support environments where Active Record is not used and the application talks to the database directly through lower-level driver gems.

class Rage::Deferred::Backends::Database::ActiveRecord < Rage::Deferred::Backends::Database
  def insert_task
  end

  def remove_task
  end

  def pending_task
  end
end

Milestones & Timeline

This is a rough outline and subject to refinement:
Deliverables

A pull request containing the database persistence backend, updated documentation, and well-written test cases.

Validation

A demo video demonstrating the feature, along with any additional validation artifacts requested by the reviewers.
-
Hi @Coder1221, you've significantly improved the draft!
-
Draft-V3

About Me

My name is Abdur Rehman. I have been a professional software engineer since 2021, currently working full-time at a health tech company. I have one year of hands-on experience with Ruby, and I have also worked with Node.js. I have free time outside of work that I'd like to dedicate to contributing to open-source projects, and Google Summer of Code 2026 is the perfect opportunity to do that.

Problem Understanding
Technical Approach

1. Database Schema

create_table :rage_deferred_tasks, if_not_exists: true, id: :uuid do |t|
t.jsonb :payload, null: false
t.string :status, default: 'pending', limit: 20, null: false
t.datetime :locked_at
t.integer :attempt_count, default: 0
t.text :error_message
t.datetime :failed_at
t.datetime :retry_at
t.datetime :completed_at
t.timestamps
end
add_index :rage_deferred_tasks, :created_at, where: "status = 'pending'"

Key columns in rage_deferred_tasks:
2. Race Condition

The problem: two fibers/threads in Rage could see the same task as pending and try to claim it. The solution is row-level locking.

# Fetch and lock the first pending task
task = DeferredTask.where(status: 'pending').order(created_at: :asc).lock.first

Active Record's lock scope appends FOR UPDATE to the generated SQL, so inside a transaction the first worker to lock the row blocks competing workers until it commits.
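One caveat: a bare lock issues plain FOR UPDATE, which blocks, unlike the SKIP LOCKED variant in the earlier SQL draft. A second worker reaching the same row waits for the first transaction, then must re-check the row's status, because the row may have been claimed while it was waiting. A pure-Ruby sketch of that re-check pattern (illustrative only; Mutex#synchronize stands in for the row lock):

```ruby
# One "pending row"; the mutex models the row-level FOR UPDATE lock.
row = { status: "pending", lock: Mutex.new }
winners = Queue.new

2.times.map { |w|
  Thread.new do
    # FOR UPDATE analogy: block until the lock is free, then re-check the
    # status, since the row may have been claimed while we were waiting.
    row[:lock].synchronize do
      if row[:status] == "pending"
        row[:status] = "running" # claim succeeded
        winners << w
      end
      # a claimer that finds the row no longer pending simply moves on
    end
  end
}.each(&:join)

puts winners.size # => 1 - only one worker wins the claim
```

The re-check after acquiring the lock is essential: without it, both workers would "claim" the task. In SQL terms, the claiming UPDATE should keep status = 'pending' in its WHERE clause.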
3. Crash Recovery

This is the trickiest part. Consider a scenario where the server crashes or we have to restart it. On server boot-up, we will reset every task still marked as running back to pending:

# Called on startup
DeferredTask.where(status: 'running').update_all(status: 'pending', locked_at: nil)

Consider another scenario:
The solution for this problem: periodically scan for crashed jobs whose locked_at timestamp is older than the timeout.

# Called periodically
crash_timeout = config.task_timeout # default 5 minutes
DeferredTask.where(status: 'running')
.where('locked_at < ?', Time.current - crash_timeout)
  .update_all(status: 'pending', locked_at: nil)

Any task that has been running for more than 5 minutes is considered crashed: its status is reset to pending so another worker can pick it up. This periodic re-queuing adds a complication: some tasks legitimately need more than 5 minutes, and they must not be mistaken for crashed ones. To handle this, a heartbeat mechanism is used. While a task is being processed, the worker periodically refreshes the task's locked_at timestamp; as long as the timestamp keeps being refreshed within the configured timeout window, the system considers the task active and does not re-queue it. If locked_at stops being refreshed within the timeout, the system assumes the worker crashed or stalled and marks the task as pending so it can be retried.

def run_with_heart_beat(task)
heartbeat_interval = config.task_timeout / 2 # half of crash detection timeout
heartbeat_thread = Thread.new do
loop do
sleep(heartbeat_interval)
# Refresh the lock timestamp to indicate task is still running
DeferredTask.where(id: task.id).update_all(locked_at: Time.current)
end
end
# task execution
ensure
heartbeat_thread.kill
end

This way, locked_at stays fresh while the task is still running.

4. Cleanup Strategy & Table Growth

Completed tasks need to be cleaned up, or the table will grow unbounded. Three approaches were considered.

Option 1: Delete on Completion (Simplest)

DeferredTask.find(task_id).delete

Pros: Simple, minimal storage

Option 2: Delete after time interval (6 hours)

DeferredTask.where(status: 'completed')
.where('completed_at < ?', Time.current - 6.hours)
  .delete_all

Pros: Short audit window for debugging; prevents unbounded growth

Option 3: Archive on completion

Move completed rows to a separate archive table before deleting them.

Selected Approach: Option 2 (Delete after interval)

We will implement an automatic cleanup job that deletes completed tasks older than a configurable interval (default: 6 hours). Benefits:
The cleanup will run periodically (e.g., every hour) as a background maintenance task:

# Cleanup background job
def cleanup_completed_tasks
threshold = config.task_retention_window # default 6 hours
  DeferredTask.where("status = 'completed' AND completed_at < ?", Time.current - threshold).delete_all
end

5. Dead-Letter Queue for Failed Tasks

Tasks may fail due to transient or permanent errors. Without proper handling, failed tasks could:
Solution: Implement a dead-letter queue (DLQ) mechanism.

Failed task workflow:
Dead-letter queue table:

create_table :rage_deferred_tasks_dlq, if_not_exists: true, id: :uuid do |t|
t.uuid :original_id, null: false
t.jsonb :payload, null: false
t.text :error_message
t.integer :attempt_count
t.datetime :failed_at, null: false
t.timestamps
end

This table captures all failed tasks that have exhausted their retry attempts, allowing for:
def execute_task(task)
run_with_heart_beat(task)
rescue => e
task.attempt_count += 1
if task.attempt_count <= config.max_task_retries
# Retry with exponential backoff
backoff = (2 ** task.attempt_count) * 60 # seconds
task.update(
status: 'pending',
error_message: e.message,
attempt_count: task.attempt_count,
retry_at: Time.current + backoff
)
else
# Move to DLQ
FailedDeferredTask.create!(
original_id: task.id,
payload: task.payload,
error_message: e.message,
attempt_count: task.attempt_count,
failed_at: Time.current
)
task.delete
end
end

Database Backend Implementation

The Database backend will implement the interface currently defined by the existing disk backend:

Rage::Deferred::Backends::Disk # existing implementation, unchanged
Rage::Deferred::Backends::Database # new core logic

class Rage::Deferred::Backends::Database
  def add
  end

  def remove
  end

  def pending_tasks
  end
end

Database Adapter

The implementation will use Active Record, which provides a built-in abstraction over database differences, including JSON column support across PostgreSQL and MySQL:

class Rage::Deferred::Backends::Database::ActiveRecord < Rage::Deferred::Backends::Database
  def pending_task
    # Active Record handles SQL generation and JSON serialization across databases
    RageDeferredTask.where(status: 'pending')
                    .where('retry_at IS NULL OR retry_at <= ?', Time.current)
                    .order(created_at: :asc).lock.first
  end
end

Using Active Record ensures:
Milestones & Timeline

This is a rough outline and subject to refinement:
Deliverables

A pull request containing the database persistence backend, updated documentation, and well-written test cases.

Validation

A demo video demonstrating the feature, along with any additional validation artifacts requested by the reviewers.
-
About Me

My name is Abdur Rehman. I have been a professional software engineer since 2021, currently working full-time at a health tech company. I have one year of hands-on experience with Ruby, and I have also worked with Node.js. I have free time outside of work that I'd like to dedicate to contributing to open-source projects, and Google Summer of Code 2026 is the perfect opportunity to do that.

Problem Understanding
Technical Approach

1. Database Schema

create_table :rage_deferred_tasks, if_not_exists: true do |t|
t.text :payload, null: false
t.string :status, default: 'pending', limit: 20, null: false
t.datetime :locked_at
t.string :locked_by # identifies which worker claimed the task (used by claim_batch)
t.integer :attempt_count, default: 0
t.text :error_message
t.datetime :failed_at
t.datetime :retry_at
t.datetime :completed_at
t.timestamps
end
add_index :rage_deferred_tasks, [:status, :created_at]

Key columns in rage_deferred_tasks:
2. Crash Recovery and Restart Distribution

This is the trickiest part. Consider a scenario:
The solution for this problem: periodically scan for crashed jobs whose locked_at timestamp is older than the timeout.

# Called periodically
crash_timeout = config.task_timeout # default 5 minutes
DeferredTask.where(status: 'scheduled')
.where('locked_at < ?', Time.current - crash_timeout)
  .update_all(status: 'pending', locked_at: nil)

Any task that has been running for more than 5 minutes is considered crashed: its status is reset to pending so another worker can pick it up. This periodic re-queuing adds a complication: some tasks legitimately need more than 5 minutes, and they must not be mistaken for crashed ones. To handle this, a heartbeat mechanism is used. While a task is being processed, the worker periodically refreshes the task's locked_at timestamp; as long as the timestamp keeps being refreshed within the configured timeout window, the system considers the task active and does not re-queue it. If locked_at stops being refreshed within the timeout, the system assumes the worker crashed or stalled and marks the task as pending so it can be retried.

def run_with_heart_beat(task)
heartbeat_interval = config.task_timeout / 2 # half of crash detection timeout
heartbeat_thread = Thread.new do
loop do
sleep(heartbeat_interval)
# Refresh the lock timestamp to indicate task is still running
DeferredTask.where(id: task.id).update_all(locked_at: Time.current)
end
end
  yield # task execution
ensure
heartbeat_thread.kill
end

This way, locked_at stays fresh while the task is still running.

2.1 Restart Distribution

Consider a multi-server environment like Kubernetes where we have multiple pods. If all pending tasks are claimed in one batch on startup, one pod will be overwhelmed with every pending task while the others sit idle. To prevent this, we will implement a claiming strategy that balances the load across workers/pods: on startup, each worker claims a batch of pending tasks and starts executing them. After the initial batch, a background thread continues to claim the remaining pending tasks in batches until none are left.

def on_boot
  recover_pending_tasks # immediate batch claim
  start_drain_thread # background sweep for remaining pending tasks, if any
end

def recover_pending_tasks(batch_size)
  claimed = claim_batch(batch_size)
  claimed.each { |task| schedule_task(task) }
end

def claim_batch(batch_size)
ApplicationRecord.transaction do
rows = RageDeferredTask
.where(status: "pending")
.order(created_at: :asc)
.limit(batch_size)
.lock("FOR UPDATE SKIP LOCKED") # to prevent race conditions between workers
.to_a
    next [] if rows.empty? # use next, not return, to exit the transaction block cleanly
ids = rows.map(&:id)
RageDeferredTask
.where(id: ids)
.update_all(
      status: "scheduled", # lowercase, to match the recovery queries
locked_at: Time.current,
locked_by: worker_id
)
rows
end
end

def start_drain_thread
Thread.new do
loop do
sleep(drain_interval) # e.g. 30 seconds
claimed = claim_batch(batch_size)
if claimed.empty?
# Nothing left unclaimed — recovery is complete, stop polling
break
end
claimed.each { |task| schedule_task(task) }
end
end
end

This approach gives a more balanced distribution of pending tasks across workers on startup: claiming in batches, with a background thread draining whatever remains, makes recovery smoother and spreads the load among workers.

2.2 Background Recovery Thread for Crashed Tasks

In addition to the startup recovery process, a background thread will periodically check for crashed tasks and re-enqueue them into the fiber scheduler.

def start_recovery_thread
Thread.new do
loop do
sleep(config.task_timeout / 2) # check for crashed tasks every half of the crash timeout period
tasks = ApplicationRecord.transaction do
rows = RageDeferredTask
.where(status: "scheduled")
.where("locked_at < ?", Time.current - config.task_timeout)
.limit(batch_size)
.lock("FOR UPDATE SKIP LOCKED")
.to_a
      next [] if rows.empty? # `return` inside this thread block would raise LocalJumpError
RageDeferredTask
.where(id: rows.map(&:id))
.update_all(status: "pending", locked_at: nil)
rows
end
tasks.each { |task| schedule_task(task) } # re-enqueue into fiber scheduler for execution
end
end
end

3. Cleanup Strategy & Table Growth

Completed tasks need to be cleaned up, or the table will grow unbounded. Three approaches were considered.

Option 1: Delete on Completion (Simplest)

DeferredTask.find(task_id).delete

Pros: Simple, minimal storage

Option 2: Delete after time interval (6 hours)

# Deletes in batches to avoid long-held locks and to maintain performance for active workers
DeferredTask.where(status: 'completed')
.where('completed_at < ?', Time.current - 6.hours)
  .in_batches(of: 1000).delete_all

Pros: Short audit window for debugging; prevents unbounded growth; batched deletion avoids long-held table locks

Option 3: Archive on completion

Move completed rows to a separate archive table before deleting them.

Selected Approach: Option 2 (Delete after interval)

We will implement an automatic cleanup job that deletes completed tasks older than a configurable interval (default: 6 hours). Benefits:
The cleanup will run periodically (e.g., every hour) as a background maintenance task:

def cleanup_completed_tasks
threshold = config.task_retention_window
batch_size = config.cleanup_batch_size
loop do
deleted_count = RageDeferredTask
.where(status: "completed")
.where("completed_at < ?", Time.current - threshold)
.limit(batch_size)
.delete_all # returns integer count directly
break if deleted_count == 0
sleep(0.5)
end
end

Why Batched Deletion is Critical:
4. Dead-Letter Queue for Failed Tasks

Tasks may fail due to transient or permanent errors. Without proper handling, failed tasks could:
Solution: Implement a dead-letter queue (DLQ) mechanism.

Failed task workflow:
Dead-letter queue table:

create_table :rage_deferred_tasks_dlq, if_not_exists: true do |t|
t.bigint :original_id, null: false
t.text :payload, null: false
t.text :error_message
t.integer :attempt_count
t.datetime :failed_at, null: false
t.timestamps
end

This table captures all failed tasks that have exhausted their retry attempts, allowing for:
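The retry logic below backs off exponentially via (2 ** attempt_count) * 60 seconds. As a standalone sanity check of the schedule this formula produces (the value 3 for max retries is just an example, not a default from the proposal):

```ruby
max_task_retries = 3 # example value for config.max_task_retries

# Delay before each retry attempt, in seconds: 2^attempt * 60
schedule = (1..max_task_retries).map { |attempt| (2**attempt) * 60 }

puts schedule.inspect # => [120, 240, 480] - i.e. 2, 4, and 8 minutes between attempts
```

Each delay doubles the previous one, so transient failures get quick retries while persistently failing tasks are pushed further out before landing in the DLQ.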
def execute_task(task)
run_with_heart_beat(task) { perform_task(task.payload) }
task.update!(status: 'completed', completed_at: Time.current)
rescue => e
task.attempt_count += 1
if task.attempt_count <= config.max_task_retries
# Retry with exponential backoff
backoff = (2 ** task.attempt_count) * 60 # seconds
task.update(
status: 'pending',
error_message: e.message,
attempt_count: task.attempt_count,
retry_at: Time.current + backoff
)
else
# Move to DLQ
FailedDeferredTask.create!(
original_id: task.id,
payload: task.payload,
error_message: e.message,
attempt_count: task.attempt_count,
failed_at: Time.current
)
task.delete
end
end

Database Backend Implementation

The Database backend will implement the interface currently defined by the existing disk backend:

Rage::Deferred::Backends::Disk # existing implementation, unchanged
Rage::Deferred::Backends::Database # new core logic

class Rage::Deferred::Backends::Database
def initialize
on_boot
end
def add
end
def remove
end
def pending_tasks
end
private
def on_boot
    recover_pending_tasks # initial batch claim on startup to balance load across workers
    start_drain_thread # background thread that claims remaining pending tasks in batches until none are left
    start_recovery_thread # background thread that recovers crashed tasks periodically (lives for the process lifetime)
    start_cleanup_thread # background thread that cleans up completed tasks periodically (lives for the process lifetime)
end
def recover_pending_tasks
claim_batch(config.startup_batch_size).each { |task| schedule_task(task) }
end
def schedule_task(task)
Fiber.schedule do
execute_task(task)
end
end
def start_drain_thread
Thread.new do
loop do
sleep(drain_interval) # e.g. 30 seconds
claimed = claim_batch(batch_size)
if claimed.empty?
break
end
claimed.each { |task| schedule_task(task) }
end
end
end
def start_recovery_thread
Thread.new do
loop do
sleep(config.orphan_check_interval)
fetch_orphaned_tasks.each { |task| schedule_task(task) }
end
end
end
def start_cleanup_thread
Thread.new do
loop do
sleep(config.cleanup_interval) # e.g. every hour
cleanup_completed_tasks
end
end
end
def cleanup_completed_tasks
threshold = config.task_retention_window
batch_size = config.cleanup_batch_size
loop do
rows = fetch_completed_tasks(batch_size, threshold)
deleted_count = rows.delete_all
break if deleted_count < batch_size
sleep(0.5)
end
end
# adapter will implement these
def claim_batch(size)
raise NotImplementedError
end
def fetch_orphaned_tasks
raise NotImplementedError
end
def fetch_completed_tasks(batch_size, threshold)
raise NotImplementedError
end
end

Database Adapter

The implementation will use Active Record, which provides a built-in abstraction over database differences:

class Rage::Deferred::Backends::Database::ActiveRecord < Rage::Deferred::Backends::Database
  def add(task)
    RageDeferredTask.create!(payload: task.payload, status: 'pending', ...)
  end
end

Using Active Record ensures:
Milestones & Timeline

This is a rough outline and subject to refinement:
Deliverables

A pull request containing the database persistence backend, updated documentation, and well-written test cases.

Validation

A demo video demonstrating the feature, along with any additional validation artifacts requested by the reviewers.
-
Draft-V1
About Me
My name is Abdur Rehman. I have been a professional software engineer since 2021, currently working full-time at a health tech company. I have one year of hands-on experience with ruby, apart from that I worked with Node.js as well. I have free time outside of work that I'd like to dedicate to contributing to open source projects, and Google Summer of Code 2026 is the perfect opportunity to do that.
Problem Understanding
Rage::Deferred currently provides task execution backed by a Write-Ahead Log (WAL) that stores metadata on disk. The issue arises in modern infrastructures such as Docker containers that have ephemeral disk storage. If a container is restarted or redeployed, the log file is lost. To prevent this data loss, we need a database persistence layer that survives container restarts and ensures no background jobs are dropped.
Technical Approach
Rage::Deferred::Backends::Disk - existing implementation, unchanged
Rage::Deferred::Backends::Database - new core logic

The Database backend will implement the interfaces currently defined in
nil.rb. It will serve as a base class with three concrete adapters: one for :mysql, one for :postgres, and one for :active_record. The structure will look roughly like this:
And the config will look like this:

config.deferred.backend = :database
config.deferred.backend.adapter = :active_record

Milestones & Timeline
This is a rough outline and subject to refinement:
Week 1: Deep dive into the Rage codebase, read documentation, and engage with the community
Week 2: Finalize the technical approach and begin implementation
Week 3: Complete the base Database adapter and implement mysql, postgres and active_record adapters
Week 4: Write comprehensive test cases for adapters
Week 5: Write documentation, address reviewer feedback, and get the PR merged
Deliverables
A pull request containing the database persistence backend, updated documentation, and well-written test cases.
Validation
A demo video demonstrating the feature, along with any additional validation artifacts requested by the reviewers.