Feature request: Introduce SyncStatus and InvalidSyncPoint for enhanced synchronization feedback #248

@Vanuan

Problem statement

Blade’s current wait_for API returns a boolean to indicate whether a synchronization point was reached within the specified timeout. However, this design has significant limitations:

  1. A false return value could mean either a timeout or an error.
  2. In scenarios like suspend/resume, the GPU may be reinitialized, causing sync points to become invalid. The current API cannot distinguish between these cases.
  3. Developers lack detailed feedback to handle synchronization outcomes effectively, making debugging and error recovery challenging.

To address these limitations, I propose introducing a new method alongside the existing one: SyncPoint::wait_for_detailed(timeout) -> SyncStatus (the name is tentative).

Proposed solution

Keep the existing API

The existing wait_for API will remain unchanged to ensure backward compatibility. It will continue to return a boolean:

  • true: Synchronization completed successfully.
  • false: Synchronization failed (timeout or error).

Introduce a new API

A new API, sync_point.wait_for_detailed(timeout), will be introduced to provide detailed feedback through a SyncStatus enum. This API will explicitly distinguish between:

  1. Completed: The synchronization point was reached successfully.
  2. Timeout: The operation timed out before the synchronization point was reached.
  3. InvalidSyncPoint: The sync point became invalid (e.g., due to GPU reinitialization).
  4. Error: Any other errors or unexpected issues during synchronization, with a detailed error message.

New SyncStatus enum

The SyncStatus enum will provide detailed feedback for the new API.

pub enum SyncStatus {
    Completed,
    Timeout,
    InvalidSyncPoint,
    Error { error_string: String },
}

New wait_for_detailed method

A new wait_for_detailed method will be added to the SyncPoint trait, next to the existing wait_for, and will return SyncStatus.

trait SyncPoint {
    fn wait_for(&self, timeout_ms: u32) -> bool;
    fn wait_for_detailed(&self, timeout_ms: u32) -> SyncStatus;
}
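
One way to keep the two methods consistent is to make wait_for a default method that delegates to wait_for_detailed, so each backend implements only the detailed variant. The following is a minimal sketch under that assumption. The derives on SyncStatus and the FixedSyncPoint test double are hypothetical additions for illustration and are not part of Blade.

```rust
/// Outcome of waiting on a sync point, as proposed above.
/// The derives are an addition for this sketch so values can be compared and printed.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum SyncStatus {
    Completed,
    Timeout,
    InvalidSyncPoint,
    Error { error_string: String },
}

pub trait SyncPoint {
    /// Backends implement only the detailed wait.
    fn wait_for_detailed(&self, timeout_ms: u32) -> SyncStatus;

    /// Backward-compatible wrapper: every outcome other than `Completed` becomes `false`.
    fn wait_for(&self, timeout_ms: u32) -> bool {
        matches!(self.wait_for_detailed(timeout_ms), SyncStatus::Completed)
    }
}

/// Hypothetical test double that always reports a fixed status.
pub struct FixedSyncPoint(pub SyncStatus);

impl SyncPoint for FixedSyncPoint {
    fn wait_for_detailed(&self, _timeout_ms: u32) -> SyncStatus {
        self.0.clone()
    }
}
```

With this shape, existing callers of wait_for keep their boolean result, while new callers can match on the detailed status.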

Example usage

Simple usage

let sync_point = device.create_sync_point();
// ...
if sync_point.wait_for(1000) {
    println!("GPU work completed!");
} else {
    println!("Timeout or error occurred."); // Cannot distinguish between timeout, error, or invalid sync point
}

Enhanced usage

let sync_point = device.create_sync_point();
// ...
match sync_point.wait_for_detailed(1000) {
    SyncStatus::Completed => println!("GPU work completed!"),
    SyncStatus::Timeout => println!("Timeout while waiting for GPU work."),
    SyncStatus::InvalidSyncPoint => println!("Sync point is invalid (e.g., GPU reinitialized)."),
    SyncStatus::Error { error_string } => println!("An error occurred: {}", error_string),
}

How different APIs handle invalid sync points

1. Vulkan

In Vulkan, synchronization primitives like semaphores and fences are tied to the logical device. If the device is lost (e.g., due to a GPU crash or driver issue), all synchronization primitives become invalid. Vulkan provides explicit mechanisms to detect device loss:

  • Device Lost Error: When a device is lost, Vulkan operations return VK_ERROR_DEVICE_LOST. This can be used to detect invalid sync points.
  • Timeline Semaphores: Waiting on a timeline semaphore with vkWaitSemaphores can itself return VK_ERROR_DEVICE_LOST, so a lost device surfaces directly at the wait call.

Example:

unsafe {
    match self.device.timeline_semaphore.wait_semaphores(&wait_info, timeout_ns) {
        Ok(_) => SyncStatus::Completed,
        Err(vk::Result::TIMEOUT) => SyncStatus::Timeout,
        Err(vk::Result::ERROR_DEVICE_LOST) => SyncStatus::InvalidSyncPoint,
        Err(err) => SyncStatus::Error {
            error_string: format!("Vulkan error: {:?}", err),
        },
    }
}

2. Metal

In Metal, command buffers and their associated synchronization primitives are tied to the command queue and device. If the GPU is reset or the device is reinitialized, command buffers and their sync points may become invalid. Metal provides status checks for command buffers:

  • Command Buffer Status: A command buffer can be in states like NotEnqueued, Enqueued, Committed, Scheduled, Completed, or Error.
  • Invalid State: If a command buffer is in an invalid state (e.g., NotEnqueued after GPU reinitialization), it can be treated as an invalid sync point.

Example:

match sync_point.cmd_buf.status() {
    metal::MTLCommandBufferStatus::Completed => SyncStatus::Completed,
    metal::MTLCommandBufferStatus::Error => {
        let error_message = sync_point.cmd_buf.error()
            .map(|e| e.to_string())
            .unwrap_or_else(|| "Unknown Metal error".to_string());
        SyncStatus::Error {
            error_string: error_message,
        }
    }
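    // Never enqueued: treated here as a sync point that is no longer valid.
    metal::MTLCommandBufferStatus::NotEnqueued => SyncStatus::InvalidSyncPoint,
    // Enqueued, Committed, or Scheduled: still in flight when the wait ended.
    _ => SyncStatus::Timeout,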
}
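
The status match above only classifies a single snapshot of the command buffer, so the Timeout arm makes sense once the status has been polled until a deadline. Below is a rough sketch of such a polling loop. It does not use the metal crate: BufferState and poll_until are hypothetical stand-ins, and the status query is passed in as a closure so the loop can be exercised without a GPU.

```rust
use std::time::{Duration, Instant};

/// Proposed status enum; the derives are added for this sketch only.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum SyncStatus {
    Completed,
    Timeout,
    InvalidSyncPoint,
    Error { error_string: String },
}

/// Hypothetical, simplified view of a command buffer's state.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum BufferState {
    NotEnqueued,
    InFlight, // stands in for Enqueued, Committed, and Scheduled
    Completed,
    Failed,
}

/// Polls `query` every `interval` until it reports a final state or `timeout` elapses.
pub fn poll_until<F>(mut query: F, timeout: Duration, interval: Duration) -> SyncStatus
where
    F: FnMut() -> BufferState,
{
    let deadline = Instant::now() + timeout;
    loop {
        match query() {
            BufferState::Completed => return SyncStatus::Completed,
            BufferState::NotEnqueued => return SyncStatus::InvalidSyncPoint,
            BufferState::Failed => {
                return SyncStatus::Error {
                    error_string: "command buffer failed".to_string(),
                }
            }
            BufferState::InFlight => {} // keep waiting
        }
        if Instant::now() >= deadline {
            return SyncStatus::Timeout;
        }
        std::thread::sleep(interval);
    }
}
```

The query always runs at least once, so a zero timeout still reports Completed for work that has already finished.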

3. GLES

In GLES, synchronization relies on sync objects (e.g., created with glFenceSync), which are tied to the GL context. If the context is lost, such as during suspend/resume or GPU reinitialization, all sync objects become invalid. The glow crate provides abstractions for working with GLES sync operations. The glClientWaitSync function waits for a sync object to be signaled and returns one of these statuses:

  • GL_ALREADY_SIGNALED or GL_CONDITION_SATISFIED: the sync completed successfully.
  • GL_TIMEOUT_EXPIRED: the wait timed out.
  • GL_WAIT_FAILED: the sync object is invalid, often due to context loss.

This mechanism allows for explicit handling of synchronization outcomes, including errors and invalid states.

Example:

impl SyncPoint for GLESSyncPoint {
    fn wait_for(&self, timeout_ms: u32) -> bool {
        matches!(self.wait_for_detailed(timeout_ms), SyncStatus::Completed)
    }

    fn wait_for_detailed(&self, timeout_ms: u32) -> SyncStatus {
        let gl = self.lock();
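        // u32::MAX (!0) means "wait forever"; otherwise convert milliseconds to nanoseconds.
        let timeout_ns = if timeout_ms == !0 { !0 } else { timeout_ms as u64 * 1_000_000 };
        // glow takes an i32 timeout, so anything above i32::MAX ns (about 2.1 s) is clamped.
        let timeout_ns_i32 = timeout_ns.min(i32::MAX as u64) as i32;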

        let status = unsafe {
            gl.client_wait_sync(self.fence, glow::SYNC_FLUSH_COMMANDS_BIT, timeout_ns_i32)
        };

        match status {
            glow::ALREADY_SIGNALED | glow::CONDITION_SATISFIED => SyncStatus::Completed,
            glow::TIMEOUT_EXPIRED => SyncStatus::Timeout,
            glow::WAIT_FAILED => SyncStatus::InvalidSyncPoint,
            _ => SyncStatus::Error { error_string: "GLES sync failed".to_string() },
        }
    }
}
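
The timeout conversion and the status mapping in the example above can be pulled out into pure functions, which makes them testable without a GL context. This is a sketch only: timeout_ms_to_ns_i32 and status_from_gl are hypothetical helper names, and the constants are written out with their values from the OpenGL ES 3.0 headers instead of being imported from glow.

```rust
/// Proposed status enum; the derives are added for this sketch only.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum SyncStatus {
    Completed,
    Timeout,
    InvalidSyncPoint,
    Error { error_string: String },
}

// glClientWaitSync return values, as defined in the OpenGL ES 3.0 headers.
pub const ALREADY_SIGNALED: u32 = 0x911A;
pub const TIMEOUT_EXPIRED: u32 = 0x911B;
pub const CONDITION_SATISFIED: u32 = 0x911C;
pub const WAIT_FAILED: u32 = 0x911D;

/// Converts a millisecond timeout to nanoseconds, clamped to the i32 range.
/// `u32::MAX` is treated as "wait as long as possible".
pub fn timeout_ms_to_ns_i32(timeout_ms: u32) -> i32 {
    let timeout_ns = if timeout_ms == u32::MAX {
        u64::MAX
    } else {
        timeout_ms as u64 * 1_000_000
    };
    timeout_ns.min(i32::MAX as u64) as i32
}

/// Maps a raw glClientWaitSync result to the proposed SyncStatus.
pub fn status_from_gl(code: u32) -> SyncStatus {
    match code {
        ALREADY_SIGNALED | CONDITION_SATISFIED => SyncStatus::Completed,
        TIMEOUT_EXPIRED => SyncStatus::Timeout,
        WAIT_FAILED => SyncStatus::InvalidSyncPoint,
        other => SyncStatus::Error {
            error_string: format!("unexpected glClientWaitSync result: {other:#x}"),
        },
    }
}
```

Because of the clamp, a single call can wait at most i32::MAX nanoseconds (about 2.1 seconds), so a longer timeout would need repeated waits to be honored in full.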

Is "invalid sync point" a universal concept?

While the term invalid sync point isn’t explicitly defined in graphics APIs, the concept exists in practice. Each API has its own way of handling scenarios where synchronization primitives become unusable:

  • Vulkan: device loss (VK_ERROR_DEVICE_LOST)
  • Metal: command buffer invalidation (NotEnqueued or Error)
  • GLES: context loss (glow::WAIT_FAILED)

By introducing the SyncStatus enum with an InvalidSyncPoint variant, we provide a unified way to handle these scenarios across all backends.

Use cases

The InvalidSyncPoint variant is particularly useful for handling suspend/resume scenarios, where the GPU may be reinitialized, causing sync points to become invalid. Without this variant, developers cannot distinguish between:

  • A legitimate timeout (e.g., the GPU is busy but still operational), where we can present the user with a choice to continue waiting or terminate (perhaps through an operating system API, if the desktop environment is not itself affected by the busy GPU).
  • An invalid sync point (e.g., the GPU was reinitialized, and the sync point is no longer valid), where we need to either recover from the error state or shut down gracefully.

By explicitly including InvalidSyncPoint, we enable developers to handle these cases appropriately, improving robustness and debuggability.
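
To make the distinction concrete, here is a sketch of how an application might turn each status into a recovery decision. The Recovery enum and plan_recovery function are hypothetical application-side names, not part of the proposed API, and the chosen reactions are only one possible policy.

```rust
/// Proposed status enum; the derives are added for this sketch only.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum SyncStatus {
    Completed,
    Timeout,
    InvalidSyncPoint,
    Error { error_string: String },
}

/// Hypothetical application-level reactions to a wait outcome.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum Recovery {
    /// GPU work finished; carry on.
    Continue,
    /// GPU is busy but alive; let the user choose to keep waiting or quit.
    AskUserToKeepWaiting,
    /// GPU was reinitialized; rebuild GPU resources and resubmit the work.
    RecreateGpuResources,
    /// Unrecoverable error; exit cleanly and report the reason.
    ShutDown { reason: String },
}

/// Chooses a reaction for each wait outcome.
pub fn plan_recovery(status: &SyncStatus) -> Recovery {
    match status {
        SyncStatus::Completed => Recovery::Continue,
        SyncStatus::Timeout => Recovery::AskUserToKeepWaiting,
        SyncStatus::InvalidSyncPoint => Recovery::RecreateGpuResources,
        SyncStatus::Error { error_string } => Recovery::ShutDown {
            reason: error_string.clone(),
        },
    }
}
```

With the boolean API, the Timeout, InvalidSyncPoint, and Error arms would all collapse into a single false branch, and the application would have to guess which reaction applies.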
