Is this a new feature, an improvement, or a change to existing functionality?
New Feature
How would you describe the priority of this feature request?
Low (would be nice)
Please provide a clear description of the problem this feature solves
According to the test_bytecode.py file, cuTile supports launching kernels provided in CUBIN format, which enables execution of CUDA C kernels compiled offline. However, this creates a strict separation between Python-authored cuTile kernels and CUDA C kernels. Users must choose one approach or the other, with no supported mechanism to combine them. As a result, it is difficult to reuse existing CUDA C or PTX code, or to optimize performance-critical regions within an otherwise Python-based cuTile kernel.
Feature Description
Add support for embedding or injecting CUDA C or PTX code into a Python-authored cuTile kernel. This would enable a hybrid programming model where most kernel logic is expressed in Python, while selected sections can be implemented in CUDA C or PTX for fine-grained performance tuning or access to low-level hardware features. This capability would improve cuTile's flexibility, allow reuse of existing CUDA C/PTX code, and make cuTile a more powerful tool for advanced CUDA kernel development.
Describe your ideal solution
Provide an API that allows directly inserting CUDA C or PTX instructions into a Python-authored cuTile kernel, analogous to asm volatile(...) in CUDA C. This API would act as a low-level escape hatch, enabling users to inline raw code at specific points in the kernel.
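For reference, the CUDA C escape hatch this request mirrors looks like the kernel below, which uses inline PTX to read the warp lane index. This is a standalone CUDA C illustration of the existing mechanism, not cuTile code; a cuTile API could expose equivalent semantics from Python.

```cuda
#include <cstdio>

__global__ void lane_id_kernel(unsigned int *out) {
    unsigned int lane;
    // Inline PTX: move the special register %laneid into a C variable.
    // The "=r" constraint binds `lane` to a 32-bit register output.
    asm volatile("mov.u32 %0, %%laneid;" : "=r"(lane));
    out[blockIdx.x * blockDim.x + threadIdx.x] = lane;
}
```

A Python-side equivalent would let the same one-off PTX instruction be dropped into a cuTile kernel at a specific point, with inputs and outputs bound to kernel-local values.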
Describe any alternatives you have considered
No response
Additional context
No response
Contributing Guidelines