⚡️ Speed up method ResizeGenerator.forward by 99% #34
Closed
codeflash-ai[bot] wants to merge 1 commit into main from
Conversation
The optimized code achieves a **98% speedup** (from 26.9 ms to 13.6 ms) through two key optimizations in tensor creation and manipulation.

## Primary Optimization: Efficient Tensor Broadcasting in `bbox_generator`

The original code creates bounding boxes inefficiently:

1. Creates a template tensor `[[[0, 0], [0, 0], [0, 0], [0, 0]]]`
2. Calls `.repeat(batch_size, 1, 1)` to replicate it
3. Performs 6 separate in-place operations (`+=`) with `.view(-1, 1)` reshaping

The optimized version:

1. Pre-computes corner coordinates (`x1 = x_start + width - 1`, `y1 = y_start + height - 1`)
2. Uses `torch.stack` to directly construct the bbox tensor in a single operation
3. Eliminates all in-place modifications

**Why this is faster**:

- Avoids memory allocation overhead from `.repeat()` (the profiler shows ~4.5 ms in `torch.tensor().repeat()` in the original)
- Reduces 6 indexing + in-place operations (~18 ms total) to 2 arithmetic ops + 1 vectorized stack (~7.5 ms total)
- Better utilizes PyTorch's vectorized operations instead of element-wise modifications

## Secondary Optimization: Batch-Aware Tensor Creation in `ResizeGenerator.forward`

The original creates scalar tensors, then broadcasts them:

```python
torch.tensor(0, device=_device, dtype=_dtype)  # scalar
).repeat(batch_size, 1, 1)  # then broadcast
```

The optimized version uses batch-sized tensors directly:

```python
torch.full((batch_size,), 0, device=_device, dtype=_dtype)  # already batched
```

**Why this is faster**:

- `torch.full` creates the correctly-sized tensor in one allocation
- Eliminates the `.repeat()` operation entirely (saves ~2 ms per `bbox_generator` call)
- Reduces tensor creation overhead by ~40% (from ~1.5 ms to ~0.4 ms per scalar tensor)

## Impact Analysis

Based on annotated tests, the optimization delivers:

- **64-103% speedup** for typical batch sizes (1-100)
- **Best performance** on single-image workloads (81-106% faster), which are common in inference pipelines
- **Consistent gains** across all image sizes and aspect ratios
- **No degradation** on edge cases (empty batches and error paths remain similar)

The optimization is particularly valuable for data augmentation pipelines where `ResizeGenerator.forward` is called repeatedly during training, as the ~50% reduction in per-call latency compounds over thousands of iterations.
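The two patterns above can be sketched together as follows. This is a hedged reconstruction from the PR description, not the repository's exact code: `bbox_generator_fast` is a hypothetical name, and the corner ordering is assumed to be clockwise from the top-left.

```python
import torch

def bbox_generator_fast(x_start, y_start, width, height):
    """Vectorized bbox construction as described above (sketch, not kornia's exact code).

    All arguments are 1-D tensors of shape (B,). Returns a (B, 4, 2) tensor of
    corner coordinates, clockwise from the top-left.
    """
    # Pre-compute the far corner coordinates once (inclusive convention).
    x1 = x_start + width - 1
    y1 = y_start + height - 1
    # One vectorized stack instead of a template + .repeat() + in-place adds.
    return torch.stack([
        torch.stack([x_start, y_start], dim=-1),  # top-left
        torch.stack([x1, y_start], dim=-1),       # top-right
        torch.stack([x1, y1], dim=-1),            # bottom-right
        torch.stack([x_start, y1], dim=-1),       # bottom-left
    ], dim=1)

# Batch-aware creation (secondary optimization): allocate at the final size
# instead of creating a scalar tensor and repeating it.
batch_size = 4
zeros = torch.full((batch_size,), 0.0)  # shape (4,), one allocation
```

For example, with `x_start = [1, 5]` and `width = [3, 4]`, the right edge lands at `x_start + width - 1 = [3, 8]`, matching the inclusive-coordinate arithmetic in the summary.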
This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs within 7 days. Thank you for your contributions!

This pull request has been automatically closed due to inactivity. Feel free to reopen it if you would like to continue working on it.
📄 99% (0.99x) speedup for `ResizeGenerator.forward` in `kornia/augmentation/random_generator/_2d/resize.py`

⏱️ Runtime: 26.9 milliseconds → 13.6 milliseconds (best of 5 runs)

📝 Explanation and details
✅ Correctness verification report:
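The secondary pattern is easy to sanity-check in isolation: `torch.full` at the batch size yields a tensor identical to the scalar-then-`.repeat()` construction. A minimal sketch, assuming CPU float32 tensors (the shapes in the real generator may differ):

```python
import torch

batch_size = 8
device, dtype = torch.device("cpu"), torch.float32

# Original pattern: allocate a scalar, then copy it batch_size times.
scalar_then_repeat = torch.tensor(0, device=device, dtype=dtype).repeat(batch_size)

# Optimized pattern: one allocation at the final size.
batched_directly = torch.full((batch_size,), 0, device=device, dtype=dtype)

# Same values, same shape, same dtype -- only the allocation path differs.
assert torch.equal(scalar_then_repeat, batched_directly)
```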
⚙️ Existing Unit Tests

🌀 Generated Regression Tests
To edit these changes, run `git checkout codeflash/optimize-ResizeGenerator.forward-mkdw96qi` and push.