Is this a new feature, an improvement, or a change to existing functionality?
New Feature
How would you describe the priority of this feature request?
Low (would be nice)
Please provide a clear description of the problem this feature solves
According to the test_bytecode.py file, cuTile supports launching kernels provided in CUBIN format, which enables execution of CUDA C kernels compiled offline. However, this creates a strict separation between Python-authored cuTile kernels and CUDA C kernels. Users must choose one approach or the other, with no supported mechanism to combine them. As a result, it is difficult to reuse existing CUDA C or PTX code, or to optimize performance-critical regions within an otherwise Python-based cuTile kernel.
Feature Description
Add support for embedding or injecting CUDA C or PTX code into a Python-authored cuTile kernel. This would enable a hybrid programming model where most kernel logic is expressed in Python, while selected sections can be implemented in CUDA C or PTX for fine-grained performance tuning or access to low-level hardware features. This capability would improve cuTile's flexibility, allow reuse of existing CUDA C/PTX code, and make cuTile a more powerful tool for advanced CUDA kernel development.
Describe your ideal solution
Provide an API that allows directly inserting CUDA C or PTX instructions into a Python-authored cuTile kernel, analogous to asm volatile(...) in CUDA C. This API would act as a low-level escape hatch, enabling users to inline raw code at specific points in the kernel.
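For reference, the CUDA C escape hatch this request mirrors looks like the kernel below, which uses inline PTX to read the warp lane index. This is a standalone CUDA C illustration of the existing mechanism, not cuTile code; a cuTile API could expose equivalent semantics from Python.

```cuda
#include <cstdio>

__global__ void lane_id_kernel(unsigned int *out) {
    unsigned int lane;
    // Inline PTX: move the special register %laneid into a C variable.
    // The "=r" constraint binds `lane` to a 32-bit register output.
    asm volatile("mov.u32 %0, %%laneid;" : "=r"(lane));
    out[blockIdx.x * blockDim.x + threadIdx.x] = lane;
}
```

A Python-side equivalent would let the same one-off PTX instruction be dropped into a cuTile kernel at a specific point, with inputs and outputs bound to kernel-local values.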
Describe any alternatives you have considered
No response
Additional context
No response
Contributing Guidelines