From 424985f38d286d0f7380bc6793cb73480a6a0c17 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?M=C3=A1t=C3=A9=20Ferenc=20Nagy-Egri?= <mate@streamhpc.com>
Date: Mon, 15 Nov 2021 13:18:33 +0100
Subject: [PATCH 1/2] Initial interop doc

---
 chapters/how_does_opencl-opengl_interop.md | 47 ++++++++++++++++++++++
 1 file changed, 47 insertions(+)
 create mode 100644 chapters/how_does_opencl-opengl_interop.md

diff --git a/chapters/how_does_opencl-opengl_interop.md b/chapters/how_does_opencl-opengl_interop.md
new file mode 100644
index 0000000..152f3a6
--- /dev/null
+++ b/chapters/how_does_opencl-opengl_interop.md
@@ -0,0 +1,47 @@
+# OpenCL-OpenGL interop
+
+Both OpenCL and OpenGL have specific extensions targeting resource sharing and synchronization between the two runtimes. Sharing resources lets an application avoid fetching data from the device only to send it straight back, which can result in significant performance gains. Because of the way the two APIs work, there are a few things to keep in mind when designing applications that intend to interoperate.
+
+## How is it different than using OpenGL compute shaders?
+
+OpenGL compute shaders are slightly more restricted than OpenCL compute kernels. This is also reflected in the duality of the intermediate formats they can be compiled to. When using SPIR-V as an intermediate representation (IR), compute shaders are compiled to the graphics flavor of SPIR-V, which must exhibit structured control flow and must not use pointer arithmetic. Neither unstructured control flow nor pointer arithmetic can arise when using GLSL or other traditional shading languages. OpenCL C, being a C derivative, is far more liberal in the language constructs it can express than shading languages are, and as such requires a more feature-complete intermediate representation, the so-called compute flavor of SPIR-V. Different compiler infrastructure is required behind the scenes to process these two types of workloads, irrespective of whether it ingests IR or compiles from source.
+
+Besides the OpenCL ecosystem having far more libraries and utilities tailored toward compute tasks, applications that are heavy on compute and graphically less intensive may be better served by formulating the bulk of the application in a pure compute fashion with a few graphics extensions, rather than dealing with render pipelines only to use one pipeline stage almost exclusively.
+
+## Setting up interop
+
+The core of the OpenGL API has remained backward compatible with itself all the way back to its initial incarnations. This feature of OpenGL imposes some restrictions on how interoperability can be set up.
+
+In layman's terms, OpenCL is the "smarter" API: OpenGL does part of its initialization unaware of OpenCL, possibly before any OpenCL API function has been invoked. Once all the shared resources (buffers and textures) have been created in OpenGL, _only then_ is the OpenCL interop context created. While OpenGL creates resources as normal, OpenCL (and only OpenCL) has special functions which take a `GLuint` as input to designate exactly which OpenGL resource is being given a corresponding OpenCL handle.
+
+## Using shared resources
+
+The asymmetry in responsibilities is visible in how resources are accessed as well. OpenGL rendering (without further extensions) is conducted as normal; once again, only OpenCL has specific functionality to mark shared resource usage. OpenCL may use shared resources only after it has signaled its use of them on the device (through a command queue) via explicit acquire/release semantics, using the `clEnqueueAcquireGLObjects`/`clEnqueueReleaseGLObjects` functions.
+
+## Synchronizing the two APIs
+
+There are a handful of ways the two APIs may be synchronized, depending on how your application is designed and what level of OpenCL-OpenGL interoperability is supported by the runtimes.
+
+The following sections are practical paraphrases of the OpenCL Extensions specification section [Synchronizing OpenCL and OpenGL Access to Shared Objects](https://www.khronos.org/registry/OpenCL/specs/3.0-unified/html/OpenCL_Ext.html#cl_khr_gl_sharing__memobjs-synchronizing-opencl-and-opengl-access-to-shared-objects) and of the changes to this behavior when event sharing is supported, as described in section [Additions to the OpenCL Extension Specification](https://www.khronos.org/registry/OpenCL/specs/3.0-unified/html/OpenCL_Ext.html#cl_khr_gl_event-additions-to-extension-specification).
+
+### Basic sync
+
+The most basic level of synchronization is when only `cl_khr_gl_sharing` is supported. In such cases the only portable sync pattern uses the most heavyweight sync operations. Rendering in OpenGL and compute in OpenCL must not overlap, and the developer must ensure this using `glFinish()`/`clFinish()`. Each function blocks until all previously issued commands in its respective API (for `clFinish()`, in the given command queue) have completed.
+
+_(Note: glFinish() synchronizes the OpenGL client and server too, and as such requires OS intervention. In remote scenarios (such as X-forwarding) this requires network communication as well.)_
+
+### Implicit sync
+
+If `cl_khr_gl_event` is supported, a faster sync is available even without making use of the added API surface, provided the application is designed in a compatible manner. If the OpenGL context is bound on the thread where acquire/release and compute kernels are enqueued, the OpenCL runtime has a chance to observe the state of the OpenGL context. In such cases, acquiring the OpenGL objects waits for all OpenGL commands that used the acquired resources to finish, _and_ OpenGL calls using these resources which are issued after the release command will not start executing until the effects of the release are visible to the OpenGL context.
+
+From the code's perspective, implicit sync resembles the previous approach, except that the queues are only flushed (`glFlush()`/`clFlush()`) instead of finished. (Flushing a queue in OpenGL does not involve the OpenGL server.)
+
+_(Note: If one calls GL-CL-GL-CL... commands in succession in a loop, one blocking sync somewhere will still be required. Otherwise such loops on the host may spin faster than rendering and compute commands are processed on the device, overflowing the limit of commands in the queues. Blocking can be done either on OpenGL sync objects or on OpenCL events.)_
+
+### Explicit sync
+
+When `cl_khr_gl_event` is supported but the OpenGL context cannot be made current on the thread enqueueing OpenCL commands, one may still sync faster than by invoking `glFinish()`/`clFinish()`. Because the OpenCL runtime cannot directly observe the OpenGL context, some channel of information needs to be made explicit for syncing to occur. As its name suggests, this extension involves events: specifically, one can create an OpenCL event from an OpenGL sync object.
+
+By mapping a sync object that is enqueued after a render command using some shared resource to an OpenCL event, one can pass such events in the event wait list of the call to `clEnqueueAcquireGLObjects`. That way `glFinish()` may be omitted, as OpenCL can explicitly wait on certain parts of the rendering queue to complete. Note that when using only this extension, `clFinish()` is, strictly speaking, still required.
+
+The counterpart to this extension is `GL_ARB_cl_event`, which allows syncing in the "reverse direction" by mapping OpenCL events to OpenGL sync objects. That way OpenGL too gets the chance to invoke a lightweight sync operation to make sure that relevant OpenCL operations have completed.
\ No newline at end of file

From 5b8451f816d9d7ebe4a093901808a0e681176383 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?M=C3=A1t=C3=A9=20Ferenc=20Nagy-Egri?= <mate@streamhpc.com>
Date: Wed, 16 Feb 2022 09:08:24 +0100
Subject: [PATCH 2/2] Vulkan URL moved

---
 chapters/how_does_opencl_compare.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/chapters/how_does_opencl_compare.md b/chapters/how_does_opencl_compare.md
index c435ec4..a61f4ed 100644
--- a/chapters/how_does_opencl_compare.md
+++ b/chapters/how_does_opencl_compare.md
@@ -14,7 +14,7 @@ Khronos has two high-level standards that focus on ease of programming with effe
 
 SYCL and OpenVX implementations can be accelerated over lower level Khronos APIs such as Vulkan and OpenCL – though that is not mandated. Both Vulkan and OpenCL provide lower-level, explicit access to hardware resources for maximum flexibility and control. 
 
-[Vulkan](https://www.khronos.org/vulkan/) is a widely used new generation GPU API that can accelerate compute operations on any compatible GPU using compute shaders (shaders are the graphics equivalent of OpenCL's kernels), as well as rendering 3D graphics. When comparing OpenCL and GPU APIs such as Vulkan, some developers that are just interested in compute find that OpenCL provides a more straightforward programming model, a lighter weight runtime, more language flexibility compared to graphics shading languages - for example OpenCL C has pointers - and more rigorously defined numerical precision for math operations that can be critical for many applications. And of course, Vulkan can only be used to program GPUs, whereas OpenCL can be used to program heterogeneous accelerators.
+[Vulkan](https://www.vulkan.org/) is a widely used new generation GPU API that can accelerate compute operations on any compatible GPU using compute shaders (shaders are the graphics equivalent of OpenCL's kernels), as well as rendering 3D graphics. When comparing OpenCL and GPU APIs such as Vulkan, some developers that are just interested in compute find that OpenCL provides a more straightforward programming model, a lighter weight runtime, more language flexibility compared to graphics shading languages - for example OpenCL C has pointers - and more rigorously defined numerical precision for math operations that can be critical for many applications. And of course, Vulkan can only be used to program GPUs, whereas OpenCL can be used to program heterogeneous accelerators.
 
 Vulkan and many implementations of OpenCL use Khronos’ [SPIR-V](https://www.khronos.org/spir/) standard as a programming language intermediate representation that enables significant language compiler tooling flexibility.