Summary of Changes

Hello @kkimmk, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request significantly expands the MAX framework's capabilities by introducing a complete image-generation workflow through the Flux1 diffusion pipeline. It integrates several deep learning models (VAE, CLIP, T5, and a custom Flux Transformer) and a specialized diffusion scheduler. The changes enable users to generate high-quality images from textual prompts with fine-grained control over various parameters, while also enhancing core utilities for model compilation and image handling. This lays a robust foundation for future generative AI applications within the MAX ecosystem.
Code Review
This is a large but well-structured pull request that adds support for the FLUX.1 diffusion model. The changes span new model architecture implementations, a new diffusion pipeline, CLI extensions, and various supporting utilities. The code is generally of high quality. My review focuses on improving robustness, correctness, and maintainability in a few key areas. I've identified a potential bug in the CLIP text encoder's pooling logic, suggested improvements for command-line argument parsing and exception handling, and recommended using pathlib for more robust path manipulation.
```python
if self.eos_token_id == 2:
    eos_token_indices = ops.argmax(input_ids, axis=-1).cast(DType.int32)
else:
    eos_token_indices = ops.argmax(
        ops.equal(input_ids, self.eos_token_id).cast(DType.int32),
        axis=-1,
    ).cast(DType.int32)
```
The special handling for self.eos_token_id == 2 using ops.argmax(input_ids, ...) is brittle and likely incorrect. It assumes that the token with the highest ID in the sequence is the EOS token, which is not a safe assumption. This can lead to incorrect pooling by selecting the wrong token's hidden state.
The logic in the else block, which explicitly finds the eos_token_id, is much more robust. I recommend using this robust logic for all cases to prevent potential bugs.
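A quick pure-Python sketch (a stand-in for the graph `ops` API, with hypothetical token IDs) shows how the two approaches diverge whenever some non-EOS token has a larger ID than the EOS token:

```python
def argmax(values: list[int]) -> int:
    """Index of the first maximum, mirroring ops.argmax semantics."""
    return max(range(len(values)), key=values.__getitem__)

EOS_TOKEN_ID = 2

# Hypothetical sequence: EOS (id 2) sits at index 3, but token id 7
# (index 2) is the numerically largest ID in the sequence.
input_ids = [5, 6, 7, 2, 0, 0]

# Brittle: argmax over raw IDs selects the highest-ID token, index 2.
brittle_index = argmax(input_ids)

# Robust: argmax over an equality mask selects the first EOS, index 3.
robust_index = argmax([int(t == EOS_TOKEN_ID) for t in input_ids])

print(brittle_index, robust_index)  # 2 3
```

The brittle variant only happens to work when the EOS token's ID is the largest value in the sequence, which the vocabulary does not guarantee.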
```python
eos_token_indices = ops.argmax(
    ops.equal(input_ids, self.eos_token_id).cast(DType.int32),
    axis=-1,
).cast(DType.int32)
```

```python
parser.add_argument(
    "--model-path", type=str, default="black-forest-labs/FLUX.1-dev"
)
parser.add_argument("--use-torch-randn", type=bool, default=True)
```
Using type=bool with argparse can lead to unexpected behavior. For instance, python script.py --use-torch-randn False would result in True because bool('False') evaluates to True.
For boolean flags, it's better to use action=argparse.BooleanOptionalAction (for Python 3.9+) which automatically creates --use-torch-randn and --no-use-torch-randn flags. This makes the CLI behavior explicit and less error-prone.
```diff
- parser.add_argument("--use-torch-randn", type=bool, default=True)
+ parser.add_argument("--use-torch-randn", action=argparse.BooleanOptionalAction, default=True)
```
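A minimal self-contained sketch (flag name borrowed from the diff) demonstrates the pitfall and the fix:

```python
import argparse

# Pitfall: type=bool applies bool() to the raw string, so any non-empty
# value, including the string "False", parses as True.
buggy = argparse.ArgumentParser()
buggy.add_argument("--use-torch-randn", type=bool, default=True)
args = buggy.parse_args(["--use-torch-randn", "False"])
print(args.use_torch_randn)  # True — not what the user intended

# Fix: BooleanOptionalAction (Python 3.9+) creates paired
# --use-torch-randn / --no-use-torch-randn flags with explicit semantics.
fixed = argparse.ArgumentParser()
fixed.add_argument(
    "--use-torch-randn", action=argparse.BooleanOptionalAction, default=True
)
print(fixed.parse_args(["--no-use-torch-randn"]).use_torch_randn)  # False
```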
```python
except Exception as e:
    raise ValueError(
        f"Failed to load configuration from {config_path}: {e}"
    ) from e
```
Catching a generic Exception is too broad and can hide unexpected errors. It's better to catch more specific exceptions that you expect to handle, such as json.JSONDecodeError for parsing issues and IOError for file reading problems.
```diff
- except Exception as e:
-     raise ValueError(
-         f"Failed to load configuration from {config_path}: {e}"
-     ) from e
+ except (json.JSONDecodeError, IOError) as e:
+     raise ValueError(
+         f"Failed to load configuration from {config_path}: {e}"
+     ) from e
```
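A self-contained sketch of the narrowed handler (the `load_config` helper and the path are hypothetical, not from the PR):

```python
import json

def load_config(config_path: str) -> dict:
    """Load a JSON config, wrapping only expected failures in ValueError."""
    try:
        with open(config_path) as f:
            return json.load(f)
    except (json.JSONDecodeError, IOError) as e:
        # Only expected failure modes (bad JSON, unreadable file) are
        # wrapped; programming errors still propagate for easier debugging.
        raise ValueError(
            f"Failed to load configuration from {config_path}: {e}"
        ) from e

try:
    load_config("/nonexistent/config.json")
except ValueError as e:
    print(e)
```

Since `IOError` is an alias of `OSError`, this also covers `FileNotFoundError` and `PermissionError`.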
```python
weight_paths = [
    Path(pretrained_model_name_or_path) / weight_path
    for weight_path in self.pipeline_config.model.weight_path
    if weight_path.split("/")[0] == name
]
```
Using weight_path.split("/")[0] for path manipulation is not platform-agnostic and can fail on operating systems that use a different path separator (like Windows). It's more robust to use pathlib for this, as it handles path components correctly across different OS environments.
```diff
- if weight_path.split("/")[0] == name
+ if Path(weight_path).parts[0] == name
```
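A short sketch of the difference, using a hypothetical weight path (`PureWindowsPath` stands in for running the same code on Windows):

```python
from pathlib import Path, PureWindowsPath

weight_path = "transformer/diffusion_model.safetensors"

# String splitting is tied to one separator: a backslash-separated path
# does not split at all, so the "first component" is the whole string.
print(weight_path.split("/")[0])                # transformer
print("transformer\\model.bin".split("/")[0])   # transformer\model.bin

# pathlib decomposes the path into components for the relevant platform.
print(Path(weight_path).parts[0])               # transformer
print(PureWindowsPath("transformer\\model.bin").parts[0])  # transformer
```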
The `InlineArray` copy/move constructors' previous implementation simply
looped through each element and moved/copied it. This produced poor codegen,
as it did not take advantage of loop unrolling or check whether the
element types are trivial, in which case the underlying MLIR storage can
simply be copied.
Previously, this simple function:
```mojo
fn return_array() -> InlineArray[Int32, 4]:
var arr = InlineArray[Int32, 4](fill=0)
return arr^
```
produces this codegen:
```
define dso_local void @"test_inline_array::return_array"(ptr noalias noundef nonnull writeonly captures(none) %0) #0 !dbg !5 {
tail call void @llvm.memset.p0.i64(ptr noundef nonnull align 4 dereferenceable(16) %0, i8 0, i64 16, i1 false), !dbg !90
tail call void asm sideeffect "nop", ""() #2, !dbg !91
ret void, !dbg !32
}
```
but now, we get this improved LLVM IR:
```
define dso_local void @"test_inline_array::_return_array"(ptr noalias noundef nonnull writeonly captures(none) initializes((0, 16)) %0) #0 !dbg !5 {
tail call void @llvm.memset.p0.i64(ptr noundef nonnull align 4 dereferenceable(16) %0, i8 0, i64 16, i1 false), !dbg !51
ret void, !dbg !32
}
```
Notably, there is no `asm sideeffect "nop"`, and the `ptr` argument now
carries `initializes((0, 16))`, showing that the values do get initialized.
MODULAR_ORIG_COMMIT_REV_ID: 14efb85038ec4697b7e52a8b7beb76e2b48b09bc
token when context is reset
When a preemption occurs while using the overlap scheduler, MAX Serve
crashes in the subsequent iteration.
For example:
```
Requesting API: 51%|█████ | 674/1319 [04:05<02:01, 5.31it/s]01:00:23.055 INFO: Executed TG batch with 64 reqs | Terminated: 0 reqs, Pending: 0 reqs | Input Tokens: 64/128 toks | Context Tokens: 91369/10880 toks | Prompt Tput: 1.5K tok/s, Generation Tput: 1.5K tok/s | Batch creation: 474.32us, Execution: 42.34ms | KVCache usage: 51.3% of 680 blocks, Cache hit rate: 0.0% | All Preemptions: 0 reqs
Requesting API: 51%|█████ | 675/1319 [04:05<03:02, 3.53it/s]01:00:24.328 INFO: Preempted a request due to lack of KV pages. This can affect the end-to-end performance. Consider increasing device-memory-utilization via `--device-memory-utilization` to provide more KV cache memory. Total Preemption Count: 1
...
File "/home/runner/.cache/bazel/_bazel_runner/e5185c6d580a8eeb67820192d4a9baf7/execroot/_main/bazel-out/k8-opt-ci-build/bin/max/tests/integration/accuracy/pipelines-lm-eval.runfiles/_main/max/python/max/profiler/tracing.py", line 111, in wrapper
return func(*args, **kwargs)
File "/home/runner/.cache/bazel/_bazel_runner/e5185c6d580a8eeb67820192d4a9baf7/execroot/_main/bazel-out/k8-opt-ci-build/bin/max/tests/integration/accuracy/pipelines-lm-eval.runfiles/_main/max/python/max/pipelines/lib/pipeline_variants/overlap_text_generation.py", line 810, in execute
context.update_with_future_token()
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^
File "/home/runner/.cache/bazel/_bazel_runner/e5185c6d580a8eeb67820192d4a9baf7/execroot/_main/bazel-out/k8-opt-ci-build/bin/max/tests/integration/accuracy/pipelines-lm-eval.runfiles/_main/max/python/max/pipelines/core/context.py", line 315, in update_with_future_token
raise ValueError("Cannot have multiple future tokens.")
ValueError: Cannot have multiple future tokens.
```
This is what happens:
```
iteration 1: [100, 101, ..., 116] (generated_length = 0)
iteration 2: [100, 101, ..., 116, FUTURE_TOKEN] (generated_length = 1)
iteration 3: [100, 101, ..., 116, 117, FUTURE_TOKEN] (generated_length = 2)
iteration 4: [100, 101, ..., 116, 117, 118, FUTURE_TOKEN] (generated_length = 0)
**PREEMPT!!!**
iteration 5: [100, 101, ..., 116, 117, 118, 119, FUTURE_TOKEN, FUTURE_TOKEN] (generated_length = 1)
^^^^ Cannot have multiple future tokens.
```
Notice that we have this logic:
```python
# If generated_length is still 0, then there is no placeholder
# future token. This is possible due to chunked prefill.
if context.tokens.generated_length:
    context.realize_future_token(
        new_token=next_token, log_probabilities=log_probs
    )
```
So when a preemption occurs, the `generated_length` is reset to 0. Thus
the `context.realize_future_token` is not run after the preemption. This
causes us to have multiple future tokens in a row.
There are two options to fix this:
1. When a preemption occurs, keep the future token. Then realize the future token in the next iteration even if `generated_length` is 0. This means that we will not lose the progress from that iteration.
2. When a preemption occurs, delete the future token. This means that we will lose the progress from that iteration.

I picked option 2. This loses an iteration of progress when a preemption occurs relative to option 1. However, it is simpler: in option 1, the placeholder future token would become part of the prompt despite still needing to be associated with log_probs and still needing to be streamed to the user in `consume_recently_generated_tokens`.
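A minimal sketch of option 2, with a hypothetical toy `Context` class standing in for MAX's real context object (the real logic lives in `max/python/max/pipelines/core/context.py`; names and the placeholder value are illustrative):

```python
FUTURE_TOKEN = -1  # hypothetical placeholder value

class Context:
    """Toy generation context with a single-slot placeholder future token."""

    def __init__(self, prompt: list[int]) -> None:
        self.tokens = list(prompt)
        self.generated_length = 0

    def update_with_future_token(self) -> None:
        # The crash from the traceback: two placeholders in a row.
        if self.tokens and self.tokens[-1] == FUTURE_TOKEN:
            raise ValueError("Cannot have multiple future tokens.")
        self.tokens.append(FUTURE_TOKEN)
        self.generated_length += 1

    def realize_future_token(self, new_token: int) -> None:
        assert self.tokens[-1] == FUTURE_TOKEN
        self.tokens[-1] = new_token

    def reset(self) -> None:
        """Preemption path: option 2 drops the unrealized future token."""
        if self.tokens and self.tokens[-1] == FUTURE_TOKEN:
            self.tokens.pop()  # lose that iteration's progress, stay consistent
        self.generated_length = 0

ctx = Context([100, 101])
ctx.update_with_future_token()   # [100, 101, FUTURE_TOKEN]
ctx.reset()                      # preempted: placeholder removed
ctx.update_with_future_token()   # no ValueError after the fix
print(ctx.tokens)                # [100, 101, -1]
```

Without the pop in `reset`, the second `update_with_future_token` would append a second placeholder on top of the stale one, reproducing the `ValueError` above.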
MODULAR_ORIG_COMMIT_REV_ID: 0a21e63721148e2c2c85f9037f0d68b27eaeb405