🛡️ Jit Security Scan Results: ✅ No security findings were detected in this PR.
Security scan by Jit
Pull request overview
Adds a first-class index migration workflow to RedisVL, centered on a plan-first drop_recreate strategy (with the associated CLI, docs, and tests), including an interactive wizard and validation/report artifacts.
Changes:
- Introduce the `redisvl.migration` module (planner/patch merge + diff classification, executor, validator, wizard, utilities, and Pydantic models).
- Add the `rvl migrate` CLI surface (`helper`, `list`, `plan`, `wizard`, `apply`, `validate`) and wire it into the main CLI.
- Add substantial unit/integration test coverage and end-user documentation pages for migrations.
Reviewed changes
Copilot reviewed 35 out of 36 changed files in this pull request and generated 16 comments.
| File | Description |
|---|---|
| tests/unit/test_migration_wizard.py | Unit tests for interactive wizard patch-building, including vector config prompts and edge cases |
| tests/unit/test_migration_planner.py | Unit tests for planner patch merge + diff classification and supported/blocked change detection |
| tests/integration/test_migration_v1.py | Integration test for plan→apply→validate flow using drop_recreate |
| tests/integration/test_field_modifier_ordering_integration.py | Additional integration coverage for field modifiers (INDEXEMPTY/UNF/NOINDEX) |
| redisvl/migration/wizard.py | Interactive wizard to build schema patches and plans |
| redisvl/migration/validation.py | Migration validation logic + optional query checks |
| redisvl/migration/utils.py | YAML helpers, schema canonicalization/equality, readiness polling, index listing |
| redisvl/migration/planner.py | Source snapshotting, patch normalization/merge, diff classification, plan writing |
| redisvl/migration/models.py | Pydantic models for patches, plans, reports, validation policy/results |
| redisvl/migration/executor.py | Plan execution for drop_recreate, including optional vector datatype conversion |
| redisvl/migration/__init__.py | Migration package exports |
| redisvl/cli/utils.py | Refactor CLI option helpers (redis connection options reused by index parsing options) |
| redisvl/cli/migrate.py | New rvl migrate command implementation |
| redisvl/cli/main.py | Register migrate as a top-level CLI command |
| nitin_docs/index_migrator/90_prd.md | PRD for index migrator initiative |
| nitin_docs/index_migrator/22_v2_iterative_shadow_tests.md | Planned Phase 2 tests doc |
| nitin_docs/index_migrator/21_v2_iterative_shadow_tasks.md | Planned Phase 2 tasks doc |
| nitin_docs/index_migrator/20_v2_iterative_shadow_spec.md | Planned Phase 2 spec doc |
| nitin_docs/index_migrator/12_v1_drop_recreate_tests.md | Phase 1 test plan doc |
| nitin_docs/index_migrator/11_v1_drop_recreate_tasks.md | Phase 1 tasks doc |
| nitin_docs/index_migrator/10_v1_drop_recreate_spec.md | Phase 1 spec doc |
| nitin_docs/index_migrator/03_benchmarking.md | Benchmarking guidance doc |
| nitin_docs/index_migrator/02_architecture.md | Architecture doc |
| nitin_docs/index_migrator/01_context.md | Context doc |
| nitin_docs/index_migrator/00_index.md | Workspace index doc |
| docs/user_guide/index.md | Add migrations to user guide landing page taxonomy |
| docs/user_guide/how_to_guides/migrate-indexes.md | New “Migrate an Index” how-to guide |
| docs/user_guide/how_to_guides/index.md | Link the new migration how-to guide |
| docs/user_guide/cli.ipynb | Update CLI notebook to include rvl migrate |
| docs/concepts/search-and-indexing.md | Mention migrator workflow + link to concepts/how-to |
| docs/concepts/index.md | Add “Index Migrations” concept entry |
| docs/concepts/index-migrations.md | New migrations concept page (supported/blocked + mode explanation) |
| docs/concepts/field-attributes.md | Extend vector datatype docs + wizard migration support notes |
| CONTRIBUTING.md | Add conventional branch naming + Conventional Commits guidance |
| AGENTS.md | Add contributor/developer workflow reference and repo conventions |
| .gitignore | Ignore migration-generated YAML artifacts and a local notebook |
```python
client = source_index._redis_client
prefix = plan.source.schema_snapshot["index"]["prefix"]
storage_type = (
    plan.source.schema_snapshot["index"].get("storage_type", "hash").lower()
)
```
_quantize_vectors() builds the SCAN match pattern from plan.source.schema_snapshot["index"]["prefix"], but IndexSchema allows prefix to be a list and also uses key_separator to delimit IDs. With multiple prefixes (or a prefix missing the separator), f"{prefix}*" will scan the wrong keys (or none), causing datatype changes to be applied inconsistently and potentially corrupting vector bytes relative to the new schema. Use plan.source.keyspace.prefixes + plan.source.keyspace.key_separator (same pattern as _sample_keys) and iterate prefixes when scanning.
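A minimal sketch of the suggested fix, assuming the `plan.source.keyspace.prefixes` and `key_separator` attributes named in this comment; `scan_patterns`/`scan_all_keys` are illustrative helper names, not existing RedisVL functions:

```python
# Build one SCAN MATCH pattern per prefix, using the index's key
# separator, instead of assuming a single string prefix.
def scan_patterns(prefixes, key_separator):
    """E.g. prefixes=['doc'], separator=':' -> ['doc:*']."""
    return [f"{p}{key_separator}*" for p in prefixes]


def scan_all_keys(client, prefixes, key_separator, count=500):
    """Iterate every prefix so multi-prefix indexes are fully covered."""
    for pattern in scan_patterns(prefixes, key_separator):
        yield from client.scan_iter(match=pattern, count=count)
```

This mirrors the `_sample_keys` approach the comment points to: each prefix gets its own scan, so no keys are missed or wrongly included.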
```python
logger.warning(
    f"JSON storage quantization for key {key} - "
    "vectors stored as arrays may not need re-encoding"
)
```
Datatype migrations for JSON storage are currently treated as supported by the planner, but executor quantization is a no-op (it only logs a warning). For JSON indexes, changing datatype to INT8/UINT8 requires rewriting the stored JSON arrays to ints (and float16/float32 may be a safe no-op). Either implement JSON vector conversion here or have the planner block/limit datatype changes for JSON storage types so apply doesn’t produce a schema/data mismatch.
Suggested change — replace the warning above with an actual JSON conversion:

```python
json_client = client.json()
for field_name, change in datatype_changes.items():
    # We expect the vector to be stored as a JSON array under the
    # top-level property named after the field.
    json_path = f".{field_name}"
    try:
        value = json_client.get(key, json_path)
    except Exception as exc:
        logger.warning(
            "Failed to read JSON vector field '%s' for key %s: %s",
            field_name,
            key,
            exc,
        )
        continue
    if value is None:
        # Field not present on this document
        continue
    # RedisJSON can return either the array directly or wrapped;
    # attempt to extract the actual numeric array.
    array_value = value
    if (
        isinstance(value, list)
        and len(value) == 1
        and isinstance(value[0], list)
    ):
        array_value = value[0]
    if not isinstance(array_value, list):
        logger.warning(
            "Expected JSON array for vector field '%s' on key %s, "
            "got %r instead; skipping",
            field_name,
            key,
            type(array_value),
        )
        continue
    target_dtype = str(change.get("target", "")).lower()
    # For INT8/UINT8 targets, coerce values to ints in the valid range.
    if target_dtype in ("int8", "uint8"):
        converted = []
        if target_dtype == "int8":
            min_val, max_val = -128, 127
        else:
            min_val, max_val = 0, 255
        for v in array_value:
            try:
                num = float(v)
            except (TypeError, ValueError):
                logger.warning(
                    "Non-numeric value %r in JSON vector field '%s' "
                    "for key %s; leaving value unchanged",
                    v,
                    field_name,
                    key,
                )
                converted.append(v)
                continue
            iv = int(round(num))
            if iv < min_val:
                iv = min_val
            elif iv > max_val:
                iv = max_val
            converted.append(iv)
    else:
        # For floating-point datatype changes (e.g., float32/float16),
        # JSON numerics are already type-agnostic; treat as no-op.
        converted = array_value
    try:
        json_client.set(key, json_path, converted)
        keys_to_update.append(key)
    except Exception as exc:
        logger.warning(
            "Failed to write converted JSON vector field '%s' for "
            "key %s: %s",
            field_name,
            key,
            exc,
        )
```
```python
if storage_type == "hash":
    # Read all vector fields that need conversion
    for field_name, change in datatype_changes.items():
        field_data = client.hget(key, field_name)
        if field_data:
            # Convert: source dtype -> array -> target dtype -> bytes
            array = buffer_to_array(field_data, change["source"])
            new_bytes = array_to_buffer(array, change["target"])
            pipe.hset(key, field_name, new_bytes)
            keys_to_update.append(key)
```
Quantization reads use client.hget inside the per-key loop, which is one network round-trip per (key, field) and will be very slow for large indexes. Consider batching reads (e.g., pipeline/hmget for all vector fields per key, or pipelined HGETs for the whole batch) to avoid O(N) RTT behavior during migrations.
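A sketch of the batching idea, assuming a `redis.Redis` client; `batched_hget` is an illustrative helper, not existing code:

```python
# Batch all HGET reads for a SCAN batch through one pipeline, so the
# whole batch costs a single round-trip instead of one per (key, field).
def batched_hget(client, keys, fields):
    pipe = client.pipeline(transaction=False)
    for key in keys:
        for field in fields:
            pipe.hget(key, field)
    flat = pipe.execute()  # one round-trip for the entire batch
    # Re-shape the flat reply list back into {key: {field: value}}.
    it = iter(flat)
    return {key: {field: next(it) for field in fields} for key in keys}
```

The per-key conversion loop can then work from the returned mapping without touching the network again until the pipelined HSET writes.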
```python
if field_data:
    # Convert: source dtype -> array -> target dtype -> bytes
    array = buffer_to_array(field_data, change["source"])
    new_bytes = array_to_buffer(array, change["target"])
    pipe.hset(key, field_name, new_bytes)
    keys_to_update.append(key)
```
For float→int8/uint8 changes, the current conversion path is effectively a direct numpy cast (float values truncated into the int range). That is not a valid embedding quantization strategy for typical float embeddings and will silently destroy vector similarity behavior. It’s safer to either (a) restrict migrator auto-conversion to float↔float dtype changes, or (b) require an explicit quantization strategy/scale parameters before allowing float→int8/uint8 conversions.
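To illustrate why a raw cast is lossy, here is a minimal sketch of symmetric scale-based int8 quantization — the kind of explicit strategy option (b) refers to. The function names and the max-abs scale choice are assumptions for illustration, not RedisVL APIs:

```python
import numpy as np

# Symmetric int8 quantization: map the observed float range onto
# [-127, 127] via a scale factor, rather than truncating floats
# directly (which collapses typical [-1, 1] embeddings to 0/±1).
def quantize_int8(vec: np.ndarray) -> tuple[np.ndarray, float]:
    scale = float(np.max(np.abs(vec))) or 1.0
    q = np.clip(np.round(vec / scale * 127), -127, 127).astype(np.int8)
    return q, scale


def dequantize_int8(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale / 127
```

A migrator would need the user to supply or confirm the scale (or a calibration sample) before such a conversion is safe, which is why restricting auto-conversion to float↔float changes is the conservative default.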
```python
print(f"{label} options: {', '.join(choices)}")
value = input(f"{label}: ").strip().lower()
if value not in choices:
    print(block_message)
    return None
return value
```
_prompt_from_choices() always prints the provided block_message on invalid input, even when the user simply mistypes a valid option. This results in misleading errors (e.g., saying “Vector fields cannot be added” when the input was just not in the list). Consider separating “invalid selection” messaging from the “blocked” explanation so users get accurate feedback.
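A sketch of the separation, with a hypothetical `blocked` set and retry loop (the parameter names and injection of `input_fn`/`print_fn` are illustrative, not the wizard's actual signature):

```python
# Distinguish "not a valid choice" (typo -> retry) from "choice is
# blocked" (genuinely unsupported -> show the block message and abort).
def prompt_from_choices(label, choices, blocked=(), block_message="",
                        attempts=3, input_fn=input, print_fn=print):
    for _ in range(attempts):
        print_fn(f"{label} options: {', '.join(choices)}")
        value = input_fn(f"{label}: ").strip().lower()
        if value in blocked:
            print_fn(block_message)  # accurate: the option is unsupported
            return None
        if value in choices:
            return value
        # A typo gets a retry with accurate feedback, not the block message.
        print_fn(f"'{value}' is not one of: {', '.join(choices)}")
    return None
```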
```python
live_schema = target_index.schema.to_dict()
validation.schema_match = schemas_equal(live_schema, plan.merged_target_schema)

source_num_docs = int(plan.source.stats_snapshot.get("num_docs", 0) or 0)
target_num_docs = int(target_info.get("num_docs", 0) or 0)
validation.doc_count_match = source_num_docs == target_num_docs

source_failures = int(
    plan.source.stats_snapshot.get("hash_indexing_failures", 0) or 0
)
target_failures = int(target_info.get("hash_indexing_failures", 0) or 0)
validation.indexing_failures_delta = target_failures - source_failures

key_sample = plan.source.keyspace.key_sample
if not key_sample:
    validation.key_sample_exists = True
else:
    existing_count = target_index.client.exists(*key_sample)
    validation.key_sample_exists = existing_count == len(key_sample)

if query_check_file:
    validation.query_checks = self._run_query_checks(
        target_index,
        query_check_file,
    )

if not validation.schema_match:
    validation.errors.append("Live schema does not match merged_target_schema.")
if not validation.doc_count_match:
    validation.errors.append(
        "Live document count does not match source num_docs."
    )
if validation.indexing_failures_delta != 0:
```
MigrationValidator.validate() ignores the plan.validation policy (require_doc_count_match / require_schema_match) and always enforces both checks, adding errors unconditionally. If these policy fields are part of the plan schema, validation should respect them (e.g., skip the error when the corresponding require_* flag is False).
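A minimal sketch of policy-gated error collection, treating the policy as a plain dict for illustration (the real fields would live on the hypothetical `plan.validation` Pydantic model this comment names):

```python
# Only raise an error when the corresponding require_* policy flag is
# set; defaulting to True preserves the current strict behavior.
def collect_errors(validation, policy):
    errors = []
    if policy.get("require_schema_match", True) and not validation["schema_match"]:
        errors.append("Live schema does not match merged_target_schema.")
    if policy.get("require_doc_count_match", True) and not validation["doc_count_match"]:
        errors.append("Live document count does not match source num_docs.")
    return errors
```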
```python
for doc_id in query_checks.get("fetch_ids", []):
    fetched = target_index.fetch(doc_id)
    results.append(
        QueryCheckResult(
            name=f"fetch:{doc_id}",
            passed=fetched is not None,
            details=(
                "Document fetched successfully"
                if fetched
                else "Document not found"
            ),
        )
    )
```
In _run_query_checks(), the details message uses if fetched which will treat empty dict/empty payloads as falsy, producing “Document not found” even though passed is True (passed is fetched is not None). Use the same condition (fetched is not None) for details to keep result reporting consistent.
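The fix is small enough to sketch directly; `fetch_result` is an illustrative stand-in for building the `QueryCheckResult`:

```python
# Derive both `passed` and `details` from the same `is not None` test,
# so an empty-but-present payload ({} or "") reports consistently.
def fetch_result(doc_id, fetched):
    passed = fetched is not None
    details = "Document fetched successfully" if passed else "Document not found"
    return {"name": f"fetch:{doc_id}", "passed": passed, "details": details}
```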
```python
assert plan.diff_classification.supported is False
assert any(
    "prefix" in reason.lower() and "iterative_shadow" in reason
    for reason in plan.diff_classification.blocked_reasons
)
```
This assertion expects blocked_reasons strings to mention iterative_shadow, but MigrationPlanner.classify_diff() currently uses messages like “requires document migration (not yet supported)” and does not include that substring. Either adjust the blocked-reason wording in the planner (preferred if you want to guide users to Phase 2) or relax the test to match the actual message content.
```python
assert plan.diff_classification.supported is False
assert any(
    "dims" in reason and "iterative_shadow" in reason
    for reason in plan.diff_classification.blocked_reasons
)
```
Same as the prefix test above: the planner’s dims blocked-reason text doesn’t currently include iterative_shadow, so this assertion will fail. Align the test expectations with the actual message, or update the planner messages to explicitly route users to the Phase 2 iterative_shadow path.
```text
Supported changes:
- Adding or removing non-vector fields (text, tag, numeric, geo)
- Changing field options (sortable, separator, weight)
- Changing vector algorithm (FLAT, HNSW, SVS_VAMANA)
```
The helper text lists the SVS algorithm as SVS_VAMANA (underscore), but throughout the codebase and docs the canonical name is SVS-VAMANA (hyphen). Since users will copy/paste this, it’s worth aligning the help output with the accepted/normalized value to avoid confusion.
Suggested change:

```diff
- - Changing vector algorithm (FLAT, HNSW, SVS_VAMANA)
+ - Changing vector algorithm (FLAT, HNSW, SVS-VAMANA)
```
- Add rvl migrate subcommand (helper, list, plan, apply, validate)
- Implement MigrationPlanner for schema diff classification
- Implement MigrationExecutor with drop_recreate mode
- Support vector quantization (float32 <-> float16) during migration
- Add MigrationValidator for post-migration validation
- Show error messages prominently on migration failure
- Add migration temp files to .gitignore

- Add MigrationWizard for guided schema changes
- Support add/update/remove field operations
- Algorithm-specific datatype prompts (SVS-VAMANA vs HNSW/FLAT)
- SVS-VAMANA params: GRAPH_MAX_DEGREE, COMPRESSION
- HNSW params: M, EF_CONSTRUCTION
- Normalize SVS_VAMANA -> SVS-VAMANA input
- Preview patch as YAML before finishing

- Add conceptual guide: how migrations work (Diataxis explanation)
- Add task guide: step-by-step migration walkthrough (Diataxis how-to)
- Expand field-attributes.md with migration support matrix
- Add vector datatypes table with algorithm compatibility
- Update navigation indexes to include new guides
- Normalize SVS-VAMANA naming throughout docs

- Unit tests for MigrationPlanner diff classification
- Unit tests for MigrationWizard (41 tests incl. adversarial inputs)
- Integration test for drop_recreate flow
- Field modifier ordering integration tests (INDEXEMPTY, INDEXMISSING, etc.)
Force-pushed the branch from 6a61cfd to 8fffbef.