🛡️ Jit Security Scan Results: ✅ No security findings were detected in this PR.
Security scan by Jit
Pull request overview
Adds a first-class index migration workflow to RedisVL, centered on a plan-first drop_recreate strategy (with the associated CLI, docs, and tests), including an interactive wizard and validation/report artifacts.
Changes:
- Introduce the `redisvl.migration` module (planner/patch merge + diff classification, executor, validator, wizard, utilities, and Pydantic models).
- Add the `rvl migrate` CLI surface (`helper`, `list`, `plan`, `wizard`, `apply`, `validate`) and wire it into the main CLI.
- Add substantial unit/integration test coverage and end-user documentation pages for migrations.
Reviewed changes
Copilot reviewed 35 out of 36 changed files in this pull request and generated 16 comments.
| File | Description |
|---|---|
| tests/unit/test_migration_wizard.py | Unit tests for interactive wizard patch-building, including vector config prompts and edge cases |
| tests/unit/test_migration_planner.py | Unit tests for planner patch merge + diff classification and supported/blocked change detection |
| tests/integration/test_migration_v1.py | Integration test for plan→apply→validate flow using drop_recreate |
| tests/integration/test_field_modifier_ordering_integration.py | Additional integration coverage for field modifiers (INDEXEMPTY/UNF/NOINDEX) |
| redisvl/migration/wizard.py | Interactive wizard to build schema patches and plans |
| redisvl/migration/validation.py | Migration validation logic + optional query checks |
| redisvl/migration/utils.py | YAML helpers, schema canonicalization/equality, readiness polling, index listing |
| redisvl/migration/planner.py | Source snapshotting, patch normalization/merge, diff classification, plan writing |
| redisvl/migration/models.py | Pydantic models for patches, plans, reports, validation policy/results |
| redisvl/migration/executor.py | Plan execution for drop_recreate, including optional vector datatype conversion |
| redisvl/migration/__init__.py | Migration package exports |
| redisvl/cli/utils.py | Refactor CLI option helpers (redis connection options reused by index parsing options) |
| redisvl/cli/migrate.py | New rvl migrate command implementation |
| redisvl/cli/main.py | Register migrate as a top-level CLI command |
| nitin_docs/index_migrator/90_prd.md | PRD for index migrator initiative |
| nitin_docs/index_migrator/22_v2_iterative_shadow_tests.md | Planned Phase 2 tests doc |
| nitin_docs/index_migrator/21_v2_iterative_shadow_tasks.md | Planned Phase 2 tasks doc |
| nitin_docs/index_migrator/20_v2_iterative_shadow_spec.md | Planned Phase 2 spec doc |
| nitin_docs/index_migrator/12_v1_drop_recreate_tests.md | Phase 1 test plan doc |
| nitin_docs/index_migrator/11_v1_drop_recreate_tasks.md | Phase 1 tasks doc |
| nitin_docs/index_migrator/10_v1_drop_recreate_spec.md | Phase 1 spec doc |
| nitin_docs/index_migrator/03_benchmarking.md | Benchmarking guidance doc |
| nitin_docs/index_migrator/02_architecture.md | Architecture doc |
| nitin_docs/index_migrator/01_context.md | Context doc |
| nitin_docs/index_migrator/00_index.md | Workspace index doc |
| docs/user_guide/index.md | Add migrations to user guide landing page taxonomy |
| docs/user_guide/how_to_guides/migrate-indexes.md | New “Migrate an Index” how-to guide |
| docs/user_guide/how_to_guides/index.md | Link the new migration how-to guide |
| docs/user_guide/cli.ipynb | Update CLI notebook to include rvl migrate |
| docs/concepts/search-and-indexing.md | Mention migrator workflow + link to concepts/how-to |
| docs/concepts/index.md | Add “Index Migrations” concept entry |
| docs/concepts/index-migrations.md | New migrations concept page (supported/blocked + mode explanation) |
| docs/concepts/field-attributes.md | Extend vector datatype docs + wizard migration support notes |
| CONTRIBUTING.md | Add conventional branch naming + Conventional Commits guidance |
| AGENTS.md | Add contributor/developer workflow reference and repo conventions |
| .gitignore | Ignore migration-generated YAML artifacts and a local notebook |
```python
client = source_index._redis_client
prefix = plan.source.schema_snapshot["index"]["prefix"]
storage_type = (
    plan.source.schema_snapshot["index"].get("storage_type", "hash").lower()
)
```
_quantize_vectors() builds the SCAN match pattern from plan.source.schema_snapshot["index"]["prefix"], but IndexSchema allows prefix to be a list and also uses key_separator to delimit IDs. With multiple prefixes (or a prefix missing the separator), f"{prefix}*" will scan the wrong keys (or none), causing datatype changes to be applied inconsistently and potentially corrupting vector bytes relative to the new schema. Use plan.source.keyspace.prefixes + plan.source.keyspace.key_separator (same pattern as _sample_keys) and iterate prefixes when scanning.
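A minimal sketch of the suggested fix, assuming the `plan.source.keyspace.prefixes` and `key_separator` attributes named in this comment; `scan_patterns`/`scan_all_keys` are illustrative helper names, not existing RedisVL functions:

```python
# Build one SCAN MATCH pattern per prefix, using the index's key
# separator, instead of assuming a single string prefix.
def scan_patterns(prefixes, key_separator):
    """E.g. prefixes=['doc'], separator=':' -> ['doc:*']."""
    return [f"{p}{key_separator}*" for p in prefixes]


def scan_all_keys(client, prefixes, key_separator, count=500):
    """Iterate every prefix so multi-prefix indexes are fully covered."""
    for pattern in scan_patterns(prefixes, key_separator):
        yield from client.scan_iter(match=pattern, count=count)
```

This mirrors the `_sample_keys` approach the comment points to: each prefix gets its own scan, so no keys are missed or wrongly included.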
```python
logger.warning(
    f"JSON storage quantization for key {key} - "
    "vectors stored as arrays may not need re-encoding"
)
```
Datatype migrations for JSON storage are currently treated as supported by the planner, but executor quantization is a no-op (it only logs a warning). For JSON indexes, changing datatype to INT8/UINT8 requires rewriting the stored JSON arrays to ints (and float16/float32 may be a safe no-op). Either implement JSON vector conversion here or have the planner block/limit datatype changes for JSON storage types so apply doesn’t produce a schema/data mismatch.
Suggested change — replace the warning above with an actual JSON conversion:

```python
json_client = client.json()
for field_name, change in datatype_changes.items():
    # We expect the vector to be stored as a JSON array under the
    # top-level property named after the field.
    json_path = f".{field_name}"
    try:
        value = json_client.get(key, json_path)
    except Exception as exc:
        logger.warning(
            "Failed to read JSON vector field '%s' for key %s: %s",
            field_name,
            key,
            exc,
        )
        continue
    if value is None:
        # Field not present on this document
        continue
    # RedisJSON can return either the array directly or wrapped;
    # attempt to extract the actual numeric array.
    array_value = value
    if (
        isinstance(value, list)
        and len(value) == 1
        and isinstance(value[0], list)
    ):
        array_value = value[0]
    if not isinstance(array_value, list):
        logger.warning(
            "Expected JSON array for vector field '%s' on key %s, "
            "got %r instead; skipping",
            field_name,
            key,
            type(array_value),
        )
        continue
    target_dtype = str(change.get("target", "")).lower()
    # For INT8/UINT8 targets, coerce values to ints in the valid range.
    if target_dtype in ("int8", "uint8"):
        converted = []
        if target_dtype == "int8":
            min_val, max_val = -128, 127
        else:
            min_val, max_val = 0, 255
        for v in array_value:
            try:
                num = float(v)
            except (TypeError, ValueError):
                logger.warning(
                    "Non-numeric value %r in JSON vector field '%s' "
                    "for key %s; leaving value unchanged",
                    v,
                    field_name,
                    key,
                )
                converted.append(v)
                continue
            iv = int(round(num))
            if iv < min_val:
                iv = min_val
            elif iv > max_val:
                iv = max_val
            converted.append(iv)
    else:
        # For floating-point datatype changes (e.g., float32/float16),
        # JSON numerics are already type-agnostic; treat as no-op.
        converted = array_value
    try:
        json_client.set(key, json_path, converted)
        keys_to_update.append(key)
    except Exception as exc:
        logger.warning(
            "Failed to write converted JSON vector field '%s' for "
            "key %s: %s",
            field_name,
            key,
            exc,
        )
```
```python
if storage_type == "hash":
    # Read all vector fields that need conversion
    for field_name, change in datatype_changes.items():
        field_data = client.hget(key, field_name)
        if field_data:
            # Convert: source dtype -> array -> target dtype -> bytes
            array = buffer_to_array(field_data, change["source"])
            new_bytes = array_to_buffer(array, change["target"])
            pipe.hset(key, field_name, new_bytes)
            keys_to_update.append(key)
```
Quantization reads use client.hget inside the per-key loop, which is one network round-trip per (key, field) and will be very slow for large indexes. Consider batching reads (e.g., pipeline/hmget for all vector fields per key, or pipelined HGETs for the whole batch) to avoid O(N) RTT behavior during migrations.
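A sketch of the batching idea, assuming a `redis.Redis` client; `batched_hget` is an illustrative helper, not existing code:

```python
# Batch all HGET reads for a SCAN batch through one pipeline, so the
# whole batch costs a single round-trip instead of one per (key, field).
def batched_hget(client, keys, fields):
    pipe = client.pipeline(transaction=False)
    for key in keys:
        for field in fields:
            pipe.hget(key, field)
    flat = pipe.execute()  # one round-trip for the entire batch
    # Re-shape the flat reply list back into {key: {field: value}}.
    it = iter(flat)
    return {key: {field: next(it) for field in fields} for key in keys}
```

The per-key conversion loop can then work from the returned mapping without touching the network again until the pipelined HSET writes.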
```python
if field_data:
    # Convert: source dtype -> array -> target dtype -> bytes
    array = buffer_to_array(field_data, change["source"])
    new_bytes = array_to_buffer(array, change["target"])
    pipe.hset(key, field_name, new_bytes)
    keys_to_update.append(key)
```
For float→int8/uint8 changes, the current conversion path is effectively a direct numpy cast (float values truncated into the int range). That is not a valid embedding quantization strategy for typical float embeddings and will silently destroy vector similarity behavior. It’s safer to either (a) restrict migrator auto-conversion to float↔float dtype changes, or (b) require an explicit quantization strategy/scale parameters before allowing float→int8/uint8 conversions.
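To illustrate why a raw cast is lossy, here is a minimal sketch of symmetric scale-based int8 quantization — the kind of explicit strategy option (b) refers to. The function names and the max-abs scale choice are assumptions for illustration, not RedisVL APIs:

```python
import numpy as np

# Symmetric int8 quantization: map the observed float range onto
# [-127, 127] via a scale factor, rather than truncating floats
# directly (which collapses typical [-1, 1] embeddings to 0/±1).
def quantize_int8(vec: np.ndarray) -> tuple[np.ndarray, float]:
    scale = float(np.max(np.abs(vec))) or 1.0
    q = np.clip(np.round(vec / scale * 127), -127, 127).astype(np.int8)
    return q, scale


def dequantize_int8(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale / 127
```

A migrator would need the user to supply or confirm the scale (or a calibration sample) before such a conversion is safe, which is why restricting auto-conversion to float↔float changes is the conservative default.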
```python
print(f"{label} options: {', '.join(choices)}")
value = input(f"{label}: ").strip().lower()
if value not in choices:
    print(block_message)
    return None
return value
```
_prompt_from_choices() always prints the provided block_message on invalid input, even when the user simply mistypes a valid option. This results in misleading errors (e.g., saying “Vector fields cannot be added” when the input was just not in the list). Consider separating “invalid selection” messaging from the “blocked” explanation so users get accurate feedback.
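A sketch of the separation, with a hypothetical `blocked` set and retry loop (the parameter names and injection of `input_fn`/`print_fn` are illustrative, not the wizard's actual signature):

```python
# Distinguish "not a valid choice" (typo -> retry) from "choice is
# blocked" (genuinely unsupported -> show the block message and abort).
def prompt_from_choices(label, choices, blocked=(), block_message="",
                        attempts=3, input_fn=input, print_fn=print):
    for _ in range(attempts):
        print_fn(f"{label} options: {', '.join(choices)}")
        value = input_fn(f"{label}: ").strip().lower()
        if value in blocked:
            print_fn(block_message)  # accurate: the option is unsupported
            return None
        if value in choices:
            return value
        # A typo gets a retry with accurate feedback, not the block message.
        print_fn(f"'{value}' is not one of: {', '.join(choices)}")
    return None
```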
```python
live_schema = target_index.schema.to_dict()
validation.schema_match = schemas_equal(live_schema, plan.merged_target_schema)

source_num_docs = int(plan.source.stats_snapshot.get("num_docs", 0) or 0)
target_num_docs = int(target_info.get("num_docs", 0) or 0)
validation.doc_count_match = source_num_docs == target_num_docs

source_failures = int(
    plan.source.stats_snapshot.get("hash_indexing_failures", 0) or 0
)
target_failures = int(target_info.get("hash_indexing_failures", 0) or 0)
validation.indexing_failures_delta = target_failures - source_failures

key_sample = plan.source.keyspace.key_sample
if not key_sample:
    validation.key_sample_exists = True
else:
    existing_count = target_index.client.exists(*key_sample)
    validation.key_sample_exists = existing_count == len(key_sample)

if query_check_file:
    validation.query_checks = self._run_query_checks(
        target_index,
        query_check_file,
    )

if not validation.schema_match:
    validation.errors.append("Live schema does not match merged_target_schema.")
if not validation.doc_count_match:
    validation.errors.append(
        "Live document count does not match source num_docs."
    )
if validation.indexing_failures_delta != 0:
```
MigrationValidator.validate() ignores the plan.validation policy (require_doc_count_match / require_schema_match) and always enforces both checks, adding errors unconditionally. If these policy fields are part of the plan schema, validation should respect them (e.g., skip the error when the corresponding require_* flag is False).
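A minimal sketch of policy-gated error collection, treating the policy as a plain dict for illustration (the real fields would live on the hypothetical `plan.validation` Pydantic model this comment names):

```python
# Only raise an error when the corresponding require_* policy flag is
# set; defaulting to True preserves the current strict behavior.
def collect_errors(validation, policy):
    errors = []
    if policy.get("require_schema_match", True) and not validation["schema_match"]:
        errors.append("Live schema does not match merged_target_schema.")
    if policy.get("require_doc_count_match", True) and not validation["doc_count_match"]:
        errors.append("Live document count does not match source num_docs.")
    return errors
```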
```python
for doc_id in query_checks.get("fetch_ids", []):
    fetched = target_index.fetch(doc_id)
    results.append(
        QueryCheckResult(
            name=f"fetch:{doc_id}",
            passed=fetched is not None,
            details=(
                "Document fetched successfully"
                if fetched
                else "Document not found"
            ),
        )
    )
```
In _run_query_checks(), the details message uses if fetched which will treat empty dict/empty payloads as falsy, producing “Document not found” even though passed is True (passed is fetched is not None). Use the same condition (fetched is not None) for details to keep result reporting consistent.
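The fix is small enough to sketch directly; `fetch_result` is an illustrative stand-in for building the `QueryCheckResult`:

```python
# Derive both `passed` and `details` from the same `is not None` test,
# so an empty-but-present payload ({} or "") reports consistently.
def fetch_result(doc_id, fetched):
    passed = fetched is not None
    details = "Document fetched successfully" if passed else "Document not found"
    return {"name": f"fetch:{doc_id}", "passed": passed, "details": details}
```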
```python
assert plan.diff_classification.supported is False
assert any(
    "prefix" in reason.lower() and "iterative_shadow" in reason
    for reason in plan.diff_classification.blocked_reasons
)
```
This assertion expects blocked_reasons strings to mention iterative_shadow, but MigrationPlanner.classify_diff() currently uses messages like “requires document migration (not yet supported)” and does not include that substring. Either adjust the blocked-reason wording in the planner (preferred if you want to guide users to Phase 2) or relax the test to match the actual message content.
```python
assert plan.diff_classification.supported is False
assert any(
    "dims" in reason and "iterative_shadow" in reason
    for reason in plan.diff_classification.blocked_reasons
)
```
Same as the prefix test above: the planner’s dims blocked-reason text doesn’t currently include iterative_shadow, so this assertion will fail. Align the test expectations with the actual message, or update the planner messages to explicitly route users to the Phase 2 iterative_shadow path.
```text
Supported changes:
- Adding or removing non-vector fields (text, tag, numeric, geo)
- Changing field options (sortable, separator, weight)
- Changing vector algorithm (FLAT, HNSW, SVS_VAMANA)
```
The helper text lists the SVS algorithm as SVS_VAMANA (underscore), but throughout the codebase and docs the canonical name is SVS-VAMANA (hyphen). Since users will copy/paste this, it’s worth aligning the help output with the accepted/normalized value to avoid confusion.
Suggested change:

```diff
- - Changing vector algorithm (FLAT, HNSW, SVS_VAMANA)
+ - Changing vector algorithm (FLAT, HNSW, SVS-VAMANA)
```
- Add rvl migrate subcommand (helper, list, plan, apply, validate)
- Implement MigrationPlanner for schema diff classification
- Implement MigrationExecutor with drop_recreate mode
- Support vector quantization (float32 <-> float16) during migration
- Add MigrationValidator for post-migration validation
- Show error messages prominently on migration failure
- Add migration temp files to .gitignore

- Add MigrationWizard for guided schema changes
- Support add/update/remove field operations
- Algorithm-specific datatype prompts (SVS-VAMANA vs HNSW/FLAT)
- SVS-VAMANA params: GRAPH_MAX_DEGREE, COMPRESSION
- HNSW params: M, EF_CONSTRUCTION
- Normalize SVS_VAMANA -> SVS-VAMANA input
- Preview patch as YAML before finishing

- Add conceptual guide: how migrations work (Diataxis explanation)
- Add task guide: step-by-step migration walkthrough (Diataxis how-to)
- Expand field-attributes.md with migration support matrix
- Add vector datatypes table with algorithm compatibility
- Update navigation indexes to include new guides
- Normalize SVS-VAMANA naming throughout docs

- Unit tests for MigrationPlanner diff classification
- Unit tests for MigrationWizard (41 tests incl. adversarial inputs)
- Integration test for drop_recreate flow
- Field modifier ordering integration tests (INDEXEMPTY, INDEXMISSING, etc.)
Force-pushed the branch from 6a61cfd to 8fffbef.