Feat : Row data type support in Rust #442

Open
hemanthsavasere wants to merge 3 commits into apache:main from hemanthsavasere:388-row-datatype-support

Conversation

@hemanthsavasere hemanthsavasere commented Mar 15, 2026

Purpose

Linked issue: close #388

DataType Row was defined in the schema layer but the serialization stack was entirely missing. CompactedRowWriter and Reader panicked, FieldGetter and ValueWriter hit unimplemented macros, and Datum had no Row variant. This PR wires up the full stack.

Brief change log

  • Add Datum Row variant and as_row accessor.
  • Add get_row to InternalRow trait with a default error implementation.
  • Implement get_row on GenericRow and CompactedRow.
  • Implement ColumnarRow get_row via Arrow StructArray extraction with OnceLock caching per column, invalidated on set_row_id.
  • Add InnerValueWriter Row which serializes into a temporary CompactedRowWriter then calls write_bytes, matching the Java wire format.
  • Add DataType Row arm in CompactedRowDeserializer using read_bytes and recursive deserialize.
  • Add InnerFieldGetter Row which automatically enables ROW in CompactedKeyEncoder.
  • Handle Datum Row in C++ resolve_row_types.
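
The per-column cache described for ColumnarRow can be pictured with the following minimal sketch. It is not the PR's code: `CachedRowView`, `get_nested`, and the `String` payload are hypothetical stand-ins, and only `std::sync::OnceLock` is a real API. The idea shown is one lazily filled slot per column, with every slot reset when the row position changes.

```rust
use std::sync::OnceLock;

/// Hypothetical stand-in for a row view over a columnar batch.
struct CachedRowView {
    row_id: usize,
    /// One lazily filled slot per column (a String here, a nested row in the PR).
    nested_cache: Vec<OnceLock<String>>,
}

impl CachedRowView {
    fn new(num_columns: usize) -> Self {
        Self {
            row_id: 0,
            nested_cache: (0..num_columns).map(|_| OnceLock::new()).collect(),
        }
    }

    /// Moving to another row invalidates every cached value.
    fn set_row_id(&mut self, row_id: usize) {
        self.row_id = row_id;
        for slot in &mut self.nested_cache {
            // Replace the slot with an empty one so the next read recomputes.
            *slot = OnceLock::new();
        }
    }

    /// Returns the cached value for `column`, computing it on first access.
    fn get_nested(&self, column: usize) -> Option<&String> {
        let slot = self.nested_cache.get(column)?;
        let row_id = self.row_id;
        Some(slot.get_or_init(|| format!("row {row_id}, column {column}")))
    }
}
```

Resetting requires `&mut self`, which is why invalidation fits naturally in a `set_row_id`-style method, while reads can stay on `&self`.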

Tests

  • test_row_simple_nesting: round-trip ROW of INT and STRING.
  • test_row_deep_nesting: round-trip ROW of ROW of INT.
  • test_row_with_nullable_fields: null field inside nested row and null outer ROW column.
  • test_row_as_primary_key: ROW through CompactedKeyEncoder; asserts non-empty, deterministic, and distinguishable output.
  • columnar_row_reads_nested_row, columnar_row_reads_deeply_nested_row, and columnar_row_get_row_cache_invalidated_on_set_row_id for Arrow StructArray extraction.

API and Format

API: Datum Row is a new variant. Exhaustive match on Datum will require a new arm. InternalRow get_row is additive with a default implementation.
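
To illustrate the two API points above, here is a small sketch with hypothetical types (`Value` standing in for Datum, `RowLike` for InternalRow; neither is the crate's real definition). It shows why a trait method with a default body is additive, and why a new enum variant forces exhaustive matches to gain an arm.

```rust
/// Hypothetical value type standing in for `Datum`.
#[derive(Debug, PartialEq)]
enum Value {
    Int(i32),
    Row(Vec<Value>), // the newly added variant
}

/// Hypothetical trait standing in for `InternalRow`.
trait RowLike {
    fn get_int(&self, pos: usize) -> Result<i32, String>;

    /// Added later with a default body, so existing implementors still compile.
    fn get_row(&self, _pos: usize) -> Result<&[Value], String> {
        Err("get_row is not supported by this row type".to_string())
    }
}

/// Written before `get_row` existed; inherits the default error.
struct FlatRow(Vec<i32>);

impl RowLike for FlatRow {
    fn get_int(&self, pos: usize) -> Result<i32, String> {
        self.0
            .get(pos)
            .copied()
            .ok_or_else(|| format!("no field at position {pos}"))
    }
}

struct NestedRow(Vec<Value>);

impl RowLike for NestedRow {
    fn get_int(&self, pos: usize) -> Result<i32, String> {
        // Exhaustive match: once `Value::Row` exists, this arm must be written.
        match self.0.get(pos) {
            Some(Value::Int(v)) => Ok(*v),
            Some(Value::Row(_)) => Err(format!("field {pos} is a row, not an int")),
            None => Err(format!("no field at position {pos}")),
        }
    }

    fn get_row(&self, pos: usize) -> Result<&[Value], String> {
        match self.0.get(pos) {
            Some(Value::Row(fields)) => Ok(fields.as_slice()),
            Some(Value::Int(_)) => Err(format!("field {pos} is an int, not a row")),
            None => Err(format!("no field at position {pos}")),
        }
    }
}
```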

Wire format: Unchanged. ROW uses varint-length plus CompactedRow blob, identical to String or Bytes. This matches the Java reference byte-for-byte.
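
For readers unfamiliar with length-prefixed framing, the sketch below shows the general shape of "varint length, then blob". It assumes an unsigned LEB128-style varint (7 payload bits per byte, high bit set when more bytes follow); the exact varint flavour used by CompactedRow is not stated here, and all function names are hypothetical.

```rust
/// Append `value` as an unsigned LEB128-style varint (assumed encoding).
fn write_varint(buf: &mut Vec<u8>, mut value: u32) {
    while value >= 0x80 {
        buf.push(((value & 0x7F) as u8) | 0x80);
        value >>= 7;
    }
    buf.push(value as u8);
}

/// Read a varint at `*pos`, advancing `*pos`. Returns None on truncated or overlong input.
fn read_varint(buf: &[u8], pos: &mut usize) -> Option<u32> {
    let mut result = 0u32;
    let mut shift = 0u32;
    loop {
        if shift > 28 {
            return None; // more than 5 bytes cannot fit in a u32
        }
        let byte = *buf.get(*pos)?;
        *pos += 1;
        result |= u32::from(byte & 0x7F) << shift;
        if byte & 0x80 == 0 {
            return Some(result);
        }
        shift += 7;
    }
}

/// Write a nested payload as: varint(length) followed by the raw bytes.
fn write_length_prefixed(buf: &mut Vec<u8>, blob: &[u8]) {
    write_varint(buf, blob.len() as u32);
    buf.extend_from_slice(blob);
}

/// Read one length-prefixed payload, returning a borrowed slice of it.
fn read_length_prefixed<'a>(buf: &'a [u8], pos: &mut usize) -> Option<&'a [u8]> {
    let len = read_varint(buf, pos)? as usize;
    let end = (*pos).checked_add(len)?;
    let blob = buf.get(*pos..end)?;
    *pos = end;
    Some(blob)
}
```

Because the nested row is just an opaque blob at this layer, the same read and write path can serve String, Bytes, and ROW, which is consistent with the claim that the wire format is unchanged.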

Documentation

No new documentation needed.

hemanthsavasere and others added 3 commits March 15, 2026 18:27
- Add `Datum::Row(Box<GenericRow>)` variant with `as_row()` accessor
- Add `get_row()` to `InternalRow` trait with default error impl
- Implement `GenericRow::get_row()` and `CompactedRow::get_row()` delegation
- Implement `ColumnarRow::get_row()` with Arrow StructArray extraction + OnceLock caching
- Add `InnerValueWriter::Row(RowType)` and write path via nested CompactedRowWriter
- Add `DataType::Row` arm in `CompactedRowDeserializer` for eager nested decode
- Add `InnerFieldGetter::Row` and hook up FieldGetter/ValueWriter pipeline
- Handle `Datum::Row` in `resolve_row_types` (C++ bindings)
- Add round-trip tests: simple nesting, deep nesting, nullable fields, ROW as primary key

Wire format matches Java: varint-length-prefixed blob of a complete CompactedRow.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@hemanthsavasere hemanthsavasere changed the title from "feat: add end-to-end ROW (nested struct) column serialization support" to "Row data type support in Rust" on Mar 15, 2026
@hemanthsavasere hemanthsavasere changed the title from "Row data type support in Rust" to "[Feat] Row data type support in Rust" on Mar 15, 2026
@hemanthsavasere hemanthsavasere changed the title from "[Feat] Row data type support in Rust" to "Feat : Row data type support in Rust" on Mar 15, 2026
@hemanthsavasere
Author

Hi @fresh-borzoni,
Could you please review this PR? Thanks.

@leekeiabstraction
Contributor

@charlesdong1991, it would be good if you could review this as well, since you've worked on the array type.

@leekeiabstraction leekeiabstraction left a comment

Thank you for your contribution! Left some comments/questions, PTAL.

(InnerValueWriter::TimestampLtz(p), Datum::TimestampLtz(ts)) => {
    writer.write_timestamp_ltz(ts, *p);
}
(InnerValueWriter::Row(row_type), Datum::Row(inner_row)) => {
Contributor

+1

Currently, I think a new writer is created per write call, which is not ideal.


match array.data_type() {
    ArrowDataType::Boolean => {
        let a = array.as_any().downcast_ref::<BooleanArray>().unwrap();
Contributor

Let's return an error with an appropriate message instead of unwrapping/panicking?

Contributor

+1
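
A minimal sketch of the suggestion in this thread, returning an error instead of unwrapping a failed downcast. It deliberately uses `std::any::Any` and `Vec<bool>` rather than Arrow types so that it is self-contained; `as_bool_column` and the error text are hypothetical, and the real code would presumably map to the crate's own error type.

```rust
use std::any::Any;

/// Downcast a type-erased column to the expected concrete type,
/// reporting a descriptive error instead of panicking on mismatch.
fn as_bool_column(array: &dyn Any, column: usize) -> Result<&Vec<bool>, String> {
    array
        .downcast_ref::<Vec<bool>>()
        .ok_or_else(|| format!("column {column}: expected a boolean array"))
}
```

The same `downcast_ref(..).ok_or_else(..)?` shape applies to each arm of a match over array data types, so a schema mismatch surfaces as an `Err` to the caller.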

);
}

fn make_struct_batch(
Contributor

Is this for tests only? If so, move it under '#[test]'/mod?


// Access outer struct at column 0, row 0
let outer = row.get_row(0).unwrap();
assert_eq!(outer.get_int(0).unwrap(), 1);
Contributor

Nit: should we assert the second row as well?

    0,
    nested_bytes.len(),
);
let nested_deser = CompactedRowDeserializer::new_from_owned(row_type.clone());
Contributor

Can we use 'new', which borrows, instead? It seems nested_deser does not live beyond the current scope anyway.

@leekeiabstraction
Contributor

Additionally, could you please add this to the existing integration test? TY

@charlesdong1991 charlesdong1991 left a comment

Thanks for the PR! Left a couple of comments.

(InnerValueWriter::TimestampLtz(p), Datum::TimestampLtz(ts)) => {
    writer.write_timestamp_ltz(ts, *p);
}
(InnerValueWriter::Row(row_type), Datum::Row(inner_row)) => {
Contributor

+1

Currently, I think a new writer is created per write call, which is not ideal.

    // Validation is done at TimestampLTzType construction time
    Ok(InnerValueWriter::TimestampLtz(t.precision()))
}
DataType::Row(row_type) => Ok(InnerValueWriter::Row(row_type.clone())),
Contributor

I think we should not store a clone in InnerValueWriter::Row; storing pre-built child writers is probably better?
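
One way to read this suggestion, sketched with hypothetical types (`FieldType`, `Value`, and `FieldWriter` are illustrative, not the crate's definitions): build the tree of child writers once from the schema, then reuse it for every row, so nothing is constructed per write call. The sketch writes fields back to back and omits the length prefix and null handling of the real format.

```rust
/// Hypothetical schema type.
#[derive(Debug, Clone)]
enum FieldType {
    Int,
    Row(Vec<FieldType>),
}

/// Hypothetical value type.
#[derive(Debug)]
enum Value {
    Int(i32),
    Row(Vec<Value>),
}

/// Writer tree built once from the schema; the Row variant owns its children.
enum FieldWriter {
    Int,
    Row(Vec<FieldWriter>),
}

impl FieldWriter {
    /// Recursively pre-build child writers at construction time.
    fn build(ty: &FieldType) -> FieldWriter {
        match ty {
            FieldType::Int => FieldWriter::Int,
            FieldType::Row(children) => {
                FieldWriter::Row(children.iter().map(FieldWriter::build).collect())
            }
        }
    }

    /// Write one value; no writers are created on this path.
    fn write(&self, value: &Value, out: &mut Vec<u8>) -> Result<(), String> {
        match (self, value) {
            (FieldWriter::Int, Value::Int(v)) => {
                out.extend_from_slice(&v.to_le_bytes());
                Ok(())
            }
            (FieldWriter::Row(writers), Value::Row(values)) => {
                if writers.len() != values.len() {
                    return Err(format!(
                        "row has {} values but type declares {} fields",
                        values.len(),
                        writers.len()
                    ));
                }
                for (writer, value) in writers.iter().zip(values) {
                    writer.write(value, out)?;
                }
                Ok(())
            }
            _ => Err("value does not match the declared type".to_string()),
        }
    }
}
```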

Comment on lines +253 to +257
        InnerValueWriter::create_inner_value_writer(&field.data_type, None)
            .expect("create_inner_value_writer failed for nested row field");
    vw.write_value(&mut nested, i, datum)
        .expect("write_value failed for nested row field");
}
Contributor

I think that, as written, this will panic inside a function that returns a Result. Does that mean those errors will be hidden from users?
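
A small sketch of the alternative this comment points towards: propagating failures with `?` so callers receive an `Err` rather than a panic. `WriteError`, `encode_field`, and `encode_nested` are hypothetical names and the type strings are placeholders; the point is only the control flow.

```rust
/// Hypothetical error type for the sketch.
#[derive(Debug, PartialEq)]
enum WriteError {
    UnsupportedType(String),
}

/// Encode one field, failing with an error for types the sketch does not know.
fn encode_field(type_name: &str, value: i64, out: &mut Vec<u8>) -> Result<(), WriteError> {
    match type_name {
        "int" => {
            out.extend_from_slice(&(value as i32).to_le_bytes());
            Ok(())
        }
        "bigint" => {
            out.extend_from_slice(&value.to_le_bytes());
            Ok(())
        }
        other => Err(WriteError::UnsupportedType(other.to_string())),
    }
}

/// Encode all nested fields; `?` forwards the first failure to the caller
/// instead of aborting the thread the way `.expect(..)` would.
fn encode_nested(fields: &[(&str, i64)]) -> Result<Vec<u8>, WriteError> {
    let mut out = Vec::new();
    for (type_name, value) in fields {
        encode_field(type_name, *value, &mut out)?;
    }
    Ok(out)
}
```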


match array.data_type() {
    ArrowDataType::Boolean => {
        let a = array.as_any().downcast_ref::<BooleanArray>().unwrap();
Contributor

+1

let field_count = row_type.fields().len();
let mut nested = CompactedRowWriter::new(field_count);
for (i, field) in row_type.fields().iter().enumerate() {
    let datum = &inner_row.values[i];
Contributor

Potential panic on out-of-bounds (OOB) access?
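
One hedged way to avoid the indexing panic raised here is to check the lengths up front and then iterate fields and values together. `pair_fields` is a hypothetical helper over plain slices, not code from the PR.

```rust
/// Pair each declared field with its value, rejecting a length mismatch
/// with an error instead of indexing out of bounds.
fn pair_fields<'a>(
    field_names: &[&'a str],
    values: &[i64],
) -> Result<Vec<(&'a str, i64)>, String> {
    if field_names.len() != values.len() {
        return Err(format!(
            "row type declares {} fields but the row holds {} values",
            field_names.len(),
            values.len()
        ));
    }
    // zip never indexes, so no out-of-bounds access is possible here.
    Ok(field_names
        .iter()
        .copied()
        .zip(values.iter().copied())
        .collect())
}
```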

})?;
let batch = Arc::clone(&self.record_batch);
let row_id = self.row_id;
Ok(lock.get_or_init(|| {
Contributor

Maybe better to use get_or_try_init here?
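
As far as I am aware, `get_or_try_init` on `std::sync::OnceLock` is still nightly-only at the time of writing, while the `once_cell` crate offers it on stable. If neither is an option, a fallible initialiser can be approximated on stable with the sketch below. `try_init_once` is a hypothetical helper name.

```rust
use std::sync::OnceLock;

/// Fallible lazy initialisation on stable Rust.
/// On error nothing is stored, so a later call can retry.
/// Under contention two threads may both run `init`; only one result is kept.
fn try_init_once<T, E>(
    cell: &OnceLock<T>,
    init: impl FnOnce() -> Result<T, E>,
) -> Result<&T, E> {
    if let Some(existing) = cell.get() {
        return Ok(existing);
    }
    let value = init()?;
    // If another thread won the race, its value is returned and ours is dropped.
    Ok(cell.get_or_init(|| value))
}
```

The trade-off against a true `get_or_try_init` is the possible duplicate computation under contention, which is harmless when the initialiser is pure.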

@leekeiabstraction
Contributor

FYI: we're cutting a release branch soon hence we are taking time to review/merge. Just wanted contributors to stay informed about longer turnaround.
