Feat/index migrator #534

Draft

nkanu17 wants to merge 6 commits into main from feat/index-migrator-v0-lc
Conversation

@nkanu17 (Collaborator) commented Mar 19, 2026

No description provided.

Copilot AI review requested due to automatic review settings on March 19, 2026 22:24.
@jit-ci (bot) commented Mar 19, 2026

🛡️ Jit Security Scan Results

✅ No security findings were detected in this PR


Security scan by Jit

Copilot AI (Contributor) left a comment

Pull request overview

Adds a first-class index migration workflow to RedisVL, centered around a plan-first drop_recreate strategy (and associated CLI/docs/tests), including an interactive wizard and validation/report artifacts.

Changes:

  • Introduce redisvl.migration module (planner/patch merge + diff classification, executor, validator, wizard, utilities, and Pydantic models).
  • Add rvl migrate CLI surface (helper, list, plan, wizard, apply, validate) and wire it into the main CLI.
  • Add substantial unit/integration test coverage and end-user documentation pages for migrations.

Reviewed changes

Copilot reviewed 35 out of 36 changed files in this pull request and generated 16 comments.

• tests/unit/test_migration_wizard.py: Unit tests for interactive wizard patch-building, including vector config prompts and edge cases
• tests/unit/test_migration_planner.py: Unit tests for planner patch merge + diff classification and supported/blocked change detection
• tests/integration/test_migration_v1.py: Integration test for plan→apply→validate flow using drop_recreate
• tests/integration/test_field_modifier_ordering_integration.py: Additional integration coverage for field modifiers (INDEXEMPTY/UNF/NOINDEX)
• redisvl/migration/wizard.py: Interactive wizard to build schema patches and plans
• redisvl/migration/validation.py: Migration validation logic + optional query checks
• redisvl/migration/utils.py: YAML helpers, schema canonicalization/equality, readiness polling, index listing
• redisvl/migration/planner.py: Source snapshotting, patch normalization/merge, diff classification, plan writing
• redisvl/migration/models.py: Pydantic models for patches, plans, reports, validation policy/results
• redisvl/migration/executor.py: Plan execution for drop_recreate, including optional vector datatype conversion
• redisvl/migration/__init__.py: Migration package exports
• redisvl/cli/utils.py: Refactor CLI option helpers (redis connection options reused by index parsing options)
• redisvl/cli/migrate.py: New rvl migrate command implementation
• redisvl/cli/main.py: Register migrate as a top-level CLI command
• nitin_docs/index_migrator/90_prd.md: PRD for the index migrator initiative
• nitin_docs/index_migrator/22_v2_iterative_shadow_tests.md: Planned Phase 2 tests doc
• nitin_docs/index_migrator/21_v2_iterative_shadow_tasks.md: Planned Phase 2 tasks doc
• nitin_docs/index_migrator/20_v2_iterative_shadow_spec.md: Planned Phase 2 spec doc
• nitin_docs/index_migrator/12_v1_drop_recreate_tests.md: Phase 1 test plan doc
• nitin_docs/index_migrator/11_v1_drop_recreate_tasks.md: Phase 1 tasks doc
• nitin_docs/index_migrator/10_v1_drop_recreate_spec.md: Phase 1 spec doc
• nitin_docs/index_migrator/03_benchmarking.md: Benchmarking guidance doc
• nitin_docs/index_migrator/02_architecture.md: Architecture doc
• nitin_docs/index_migrator/01_context.md: Context doc
• nitin_docs/index_migrator/00_index.md: Workspace index doc
• docs/user_guide/index.md: Add migrations to the user guide landing page taxonomy
• docs/user_guide/how_to_guides/migrate-indexes.md: New “Migrate an Index” how-to guide
• docs/user_guide/how_to_guides/index.md: Link the new migration how-to guide
• docs/user_guide/cli.ipynb: Update CLI notebook to include rvl migrate
• docs/concepts/search-and-indexing.md: Mention migrator workflow + link to concepts/how-to
• docs/concepts/index.md: Add “Index Migrations” concept entry
• docs/concepts/index-migrations.md: New migrations concept page (supported/blocked + mode explanation)
• docs/concepts/field-attributes.md: Extend vector datatype docs + wizard migration support notes
• CONTRIBUTING.md: Add conventional branch naming + Conventional Commits guidance
• AGENTS.md: Add contributor/developer workflow reference and repo conventions
• .gitignore: Ignore migration-generated YAML artifacts and a local notebook


Comment on lines +254 to +259
client = source_index._redis_client
prefix = plan.source.schema_snapshot["index"]["prefix"]
storage_type = (
    plan.source.schema_snapshot["index"].get("storage_type", "hash").lower()
)

Copilot AI commented Mar 19, 2026

_quantize_vectors() builds the SCAN match pattern from plan.source.schema_snapshot["index"]["prefix"], but IndexSchema allows prefix to be a list and also uses key_separator to delimit IDs. With multiple prefixes (or a prefix missing the separator), f"{prefix}*" will scan the wrong keys (or none), causing datatype changes to be applied inconsistently and potentially corrupting vector bytes relative to the new schema. Use plan.source.keyspace.prefixes + plan.source.keyspace.key_separator (same pattern as _sample_keys) and iterate prefixes when scanning.
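A minimal sketch of the suggested fix. The function name and the assumption that keys are written as `prefix<separator>id` are illustrative, not the executor's actual API; the point is simply to derive one pattern per prefix from the keyspace fields rather than a single `f"{prefix}*"`:

```python
def build_scan_patterns(prefixes, key_separator=":"):
    """Build one SCAN MATCH pattern per index prefix.

    Normalizes a bare string prefix to a one-element list, and appends
    the key separator (if missing) so the pattern matches keys of the
    form 'prefix<sep>id' rather than unrelated keys sharing the prefix.
    """
    if isinstance(prefixes, str):
        prefixes = [prefixes]
    patterns = []
    for prefix in prefixes:
        if prefix and not prefix.endswith(key_separator):
            prefix = f"{prefix}{key_separator}"
        patterns.append(f"{prefix}*")
    return patterns
```

The caller would then SCAN once per pattern and union the results, mirroring `_sample_keys`.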

Comment on lines +297 to +301
logger.warning(
f"JSON storage quantization for key {key} - "
"vectors stored as arrays may not need re-encoding"
)

Copilot AI commented Mar 19, 2026

Datatype migrations for JSON storage are currently treated as supported by the planner, but executor quantization is a no-op (it only logs a warning). For JSON indexes, changing datatype to INT8/UINT8 requires rewriting the stored JSON arrays to ints (and float16/float32 may be a safe no-op). Either implement JSON vector conversion here or have the planner block/limit datatype changes for JSON storage types so apply doesn’t produce a schema/data mismatch.

Suggested change
logger.warning(
    f"JSON storage quantization for key {key} - "
    "vectors stored as arrays may not need re-encoding"
)
json_client = client.json()
for field_name, change in datatype_changes.items():
    # We expect the vector to be stored as a JSON array under the
    # top-level property named after the field.
    json_path = f".{field_name}"
    try:
        value = json_client.get(key, json_path)
    except Exception as exc:
        logger.warning(
            "Failed to read JSON vector field '%s' for key %s: %s",
            field_name,
            key,
            exc,
        )
        continue
    if value is None:
        # Field not present on this document
        continue
    # RedisJSON can return either the array directly or wrapped;
    # attempt to extract the actual numeric array.
    array_value = value
    if (
        isinstance(value, list)
        and len(value) == 1
        and isinstance(value[0], list)
    ):
        array_value = value[0]
    if not isinstance(array_value, list):
        logger.warning(
            "Expected JSON array for vector field '%s' on key %s, "
            "got %r instead; skipping",
            field_name,
            key,
            type(array_value),
        )
        continue
    target_dtype = str(change.get("target", "")).lower()
    # For INT8/UINT8 targets, coerce values to ints in the valid range.
    if target_dtype in ("int8", "uint8"):
        converted = []
        if target_dtype == "int8":
            min_val, max_val = -128, 127
        else:
            min_val, max_val = 0, 255
        for v in array_value:
            try:
                num = float(v)
            except (TypeError, ValueError):
                logger.warning(
                    "Non-numeric value %r in JSON vector field '%s' "
                    "for key %s; leaving value unchanged",
                    v,
                    field_name,
                    key,
                )
                converted.append(v)
                continue
            iv = int(round(num))
            if iv < min_val:
                iv = min_val
            elif iv > max_val:
                iv = max_val
            converted.append(iv)
    else:
        # For floating-point datatype changes (e.g., float32/float16),
        # JSON numerics are already type-agnostic; treat as no-op.
        converted = array_value
    try:
        json_client.set(key, json_path, converted)
        keys_to_update.append(key)
    except Exception as exc:
        logger.warning(
            "Failed to write converted JSON vector field '%s' for "
            "key %s: %s",
            field_name,
            key,
            exc,
        )

Comment on lines +285 to +294
if storage_type == "hash":
    # Read all vector fields that need conversion
    for field_name, change in datatype_changes.items():
        field_data = client.hget(key, field_name)
        if field_data:
            # Convert: source dtype -> array -> target dtype -> bytes
            array = buffer_to_array(field_data, change["source"])
            new_bytes = array_to_buffer(array, change["target"])
            pipe.hset(key, field_name, new_bytes)
            keys_to_update.append(key)
Copilot AI commented Mar 19, 2026

Quantization reads use client.hget inside the per-key loop, which is one network round-trip per (key, field) and will be very slow for large indexes. Consider batching reads (e.g., pipeline/hmget for all vector fields per key, or pipelined HGETs for the whole batch) to avoid O(N) RTT behavior during migrations.
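One way to batch the reads, sketched against a generic redis-py client (`pipeline`, `hget`, and `execute` are standard redis-py calls; the grouping helper itself is hypothetical):

```python
def batch_read_vector_fields(client, keys, field_names):
    """Fetch every (key, field) pair in one pipelined round-trip
    instead of issuing one HGET per pair."""
    pipe = client.pipeline(transaction=False)
    for key in keys:
        for field in field_names:
            pipe.hget(key, field)
    flat = pipe.execute()
    # Pipeline results come back in issue order; regroup them by pair.
    out = {}
    it = iter(flat)
    for key in keys:
        for field in field_names:
            out[(key, field)] = next(it)
    return out
```

The per-key conversion loop can then look values up in the returned dict, turning O(keys × fields) round-trips into one per batch.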

Comment on lines +289 to +294
if field_data:
    # Convert: source dtype -> array -> target dtype -> bytes
    array = buffer_to_array(field_data, change["source"])
    new_bytes = array_to_buffer(array, change["target"])
    pipe.hset(key, field_name, new_bytes)
    keys_to_update.append(key)
Copilot AI commented Mar 19, 2026

For float→int8/uint8 changes, the current conversion path is effectively a direct numpy cast (float values truncated into the int range). That is not a valid embedding quantization strategy for typical float embeddings and will silently destroy vector similarity behavior. It’s safer to either (a) restrict migrator auto-conversion to float↔float dtype changes, or (b) require an explicit quantization strategy/scale parameters before allowing float→int8/uint8 conversions.
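For contrast, a minimal symmetric scale quantizer illustrating option (b). The corpus-level `scale` parameter is an assumption the migrator would have to collect or compute; this is a sketch, not the executor's conversion path:

```python
import numpy as np

def quantize_float_to_int8(vectors, scale=None):
    """Symmetric quantization into [-127, 127].

    A direct astype(np.int8) cast truncates typical unit-norm embedding
    values (|v| <= 1) to 0; dividing by a corpus-level scale first
    preserves relative magnitudes.
    """
    arr = np.asarray(vectors, dtype=np.float32)
    if scale is None:
        # Fallback: scale by this batch's max magnitude (assumption;
        # a real migration should use a corpus-wide scale).
        scale = float(np.max(np.abs(arr))) or 1.0
    q = np.clip(np.round(arr / scale * 127.0), -127, 127)
    return q.astype(np.int8), scale
```

Dequantization for search would multiply back by `scale / 127.0`; without storing that scale alongside the index, the int8 vectors are not interpretable.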

Comment on lines +488 to +493
print(f"{label} options: {', '.join(choices)}")
value = input(f"{label}: ").strip().lower()
if value not in choices:
    print(block_message)
    return None
return value
Copilot AI commented Mar 19, 2026

_prompt_from_choices() always prints the provided block_message on invalid input, even when the user simply mistypes a valid option. This results in misleading errors (e.g., saying “Vector fields cannot be added” when the input was just not in the list). Consider separating “invalid selection” messaging from the “blocked” explanation so users get accurate feedback.
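A sketch of the separation, with a hypothetical `blocked` parameter (the real wizard's signature may differ); `input_fn`/`print_fn` are injected only to make the behavior testable:

```python
def prompt_from_choices(label, choices, blocked=(), block_message="",
                        input_fn=input, print_fn=print):
    """Return a valid choice, or None.

    A typo gets generic 'invalid selection' feedback; only options the
    wizard deliberately blocks get the block_message explanation.
    """
    print_fn(f"{label} options: {', '.join(choices)}")
    value = input_fn(f"{label}: ").strip().lower()
    if value in blocked:
        print_fn(block_message or f"'{value}' is not supported here.")
        return None
    if value not in choices:
        print_fn(f"Invalid selection '{value}'; expected one of: "
                 f"{', '.join(choices)}")
        return None
    return value
```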

Comment on lines +33 to +65
live_schema = target_index.schema.to_dict()
validation.schema_match = schemas_equal(live_schema, plan.merged_target_schema)

source_num_docs = int(plan.source.stats_snapshot.get("num_docs", 0) or 0)
target_num_docs = int(target_info.get("num_docs", 0) or 0)
validation.doc_count_match = source_num_docs == target_num_docs

source_failures = int(
    plan.source.stats_snapshot.get("hash_indexing_failures", 0) or 0
)
target_failures = int(target_info.get("hash_indexing_failures", 0) or 0)
validation.indexing_failures_delta = target_failures - source_failures

key_sample = plan.source.keyspace.key_sample
if not key_sample:
    validation.key_sample_exists = True
else:
    existing_count = target_index.client.exists(*key_sample)
    validation.key_sample_exists = existing_count == len(key_sample)

if query_check_file:
    validation.query_checks = self._run_query_checks(
        target_index,
        query_check_file,
    )

if not validation.schema_match:
    validation.errors.append("Live schema does not match merged_target_schema.")
if not validation.doc_count_match:
    validation.errors.append(
        "Live document count does not match source num_docs."
    )
if validation.indexing_failures_delta != 0:
Copilot AI commented Mar 19, 2026

MigrationValidator.validate() ignores the plan.validation policy (require_doc_count_match / require_schema_match) and always enforces both checks, adding errors unconditionally. If these policy fields are part of the plan schema, validation should respect them (e.g., skip the error when the corresponding require_* flag is False).
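A sketch of policy-aware error collection, assuming the plan exposes boolean `require_*` flags as described above (the dataclass here is a stand-in for the actual Pydantic policy model):

```python
from dataclasses import dataclass

@dataclass
class ValidationPolicy:
    require_schema_match: bool = True
    require_doc_count_match: bool = True

def collect_validation_errors(policy, schema_match, doc_count_match):
    """Report failures only for checks the plan's policy requires;
    a check with require_* = False is still measured but never errors."""
    errors = []
    if policy.require_schema_match and not schema_match:
        errors.append("Live schema does not match merged_target_schema.")
    if policy.require_doc_count_match and not doc_count_match:
        errors.append("Live document count does not match source num_docs.")
    return errors
```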

Comment on lines +84 to +95
for doc_id in query_checks.get("fetch_ids", []):
    fetched = target_index.fetch(doc_id)
    results.append(
        QueryCheckResult(
            name=f"fetch:{doc_id}",
            passed=fetched is not None,
            details=(
                "Document fetched successfully"
                if fetched
                else "Document not found"
            ),
        )
Copilot AI commented Mar 19, 2026

In _run_query_checks(), the details message uses if fetched which will treat empty dict/empty payloads as falsy, producing “Document not found” even though passed is True (passed is fetched is not None). Use the same condition (fetched is not None) for details to keep result reporting consistent.
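The fix is a one-liner: derive both fields from the same condition. A self-contained sketch (function name is illustrative):

```python
def fetch_check(fetched):
    """Pass/fail and details agree even when the payload is an empty
    (but present) document, e.g. {}."""
    passed = fetched is not None
    details = ("Document fetched successfully" if passed
               else "Document not found")
    return passed, details
```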

Comment on lines +313 to +317
assert plan.diff_classification.supported is False
assert any(
    "prefix" in reason.lower() and "iterative_shadow" in reason
    for reason in plan.diff_classification.blocked_reasons
)
Copilot AI commented Mar 19, 2026

This assertion expects blocked_reasons strings to mention iterative_shadow, but MigrationPlanner.classify_diff() currently uses messages like “requires document migration (not yet supported)” and does not include that substring. Either adjust the blocked-reason wording in the planner (preferred if you want to guide users to Phase 2) or relax the test to match the actual message content.

Comment on lines +467 to +471
assert plan.diff_classification.supported is False
assert any(
    "dims" in reason and "iterative_shadow" in reason
    for reason in plan.diff_classification.blocked_reasons
)
Copilot AI commented Mar 19, 2026

Same as the prefix test above: the planner’s dims blocked-reason text doesn’t currently include iterative_shadow, so this assertion will fail. Align the test expectations with the actual message, or update the planner messages to explicitly route users to the Phase 2 iterative_shadow path.

Supported changes:
- Adding or removing non-vector fields (text, tag, numeric, geo)
- Changing field options (sortable, separator, weight)
- Changing vector algorithm (FLAT, HNSW, SVS_VAMANA)
Copilot AI commented Mar 19, 2026

The helper text lists the SVS algorithm as SVS_VAMANA (underscore), but throughout the codebase and docs the canonical name is SVS-VAMANA (hyphen). Since users will copy/paste this, it’s worth aligning the help output with the accepted/normalized value to avoid confusion.

Suggested change
- Changing vector algorithm (FLAT, HNSW, SVS_VAMANA)
- Changing vector algorithm (FLAT, HNSW, SVS-VAMANA)

nkanu17 added 6 commits on March 19, 2026 18:59:
- Add rvl migrate subcommand (helper, list, plan, apply, validate)
- Implement MigrationPlanner for schema diff classification
- Implement MigrationExecutor with drop_recreate mode
- Support vector quantization (float32 <-> float16) during migration
- Add MigrationValidator for post-migration validation
- Show error messages prominently on migration failure
- Add migration temp files to .gitignore
- Add MigrationWizard for guided schema changes
- Support add/update/remove field operations
- Algorithm-specific datatype prompts (SVS-VAMANA vs HNSW/FLAT)
- SVS-VAMANA params: GRAPH_MAX_DEGREE, COMPRESSION
- HNSW params: M, EF_CONSTRUCTION
- Normalize SVS_VAMANA -> SVS-VAMANA input
- Preview patch as YAML before finishing
- Add conceptual guide: how migrations work (Diataxis explanation)
- Add task guide: step-by-step migration walkthrough (Diataxis how-to)
- Expand field-attributes.md with migration support matrix
- Add vector datatypes table with algorithm compatibility
- Update navigation indexes to include new guides
- Normalize SVS-VAMANA naming throughout docs
- Unit tests for MigrationPlanner diff classification
- Unit tests for MigrationWizard (41 tests incl. adversarial inputs)
- Integration test for drop_recreate flow
- Field modifier ordering integration tests (INDEXEMPTY, INDEXMISSING, etc.)
@nkanu17 force-pushed the feat/index-migrator-v0-lc branch from 6a61cfd to 8fffbef on March 19, 2026 23:04