Feat : Row data type support in Rust#442
Feat : Row data type support in Rust#442hemanthsavasere wants to merge 3 commits intoapache:mainfrom
Conversation
- Add `Datum::Row(Box<GenericRow>)` variant with `as_row()` accessor - Add `get_row()` to `InternalRow` trait with default error impl - Implement `GenericRow::get_row()` and `CompactedRow::get_row()` delegation - Implement `ColumnarRow::get_row()` with Arrow StructArray extraction + OnceLock caching - Add `InnerValueWriter::Row(RowType)` and write path via nested CompactedRowWriter - Add `DataType::Row` arm in `CompactedRowDeserializer` for eager nested decode - Add `InnerFieldGetter::Row` and hook up FieldGetter/ValueWriter pipeline - Handle `Datum::Row` in `resolve_row_types` (C++ bindings) - Add round-trip tests: simple nesting, deep nesting, nullable fields, ROW as primary key Wire format matches Java: varint-length-prefixed blob of a complete CompactedRow.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
Hi @fresh-borzoni, |
|
@charlesdong1991 would be good if you can review this as well as you've worked on the array type. |
leekeiabstraction
left a comment
There was a problem hiding this comment.
Thank you for your contribution! Left some comments/questions, PTAL.
| (InnerValueWriter::TimestampLtz(p), Datum::TimestampLtz(ts)) => { | ||
| writer.write_timestamp_ltz(ts, *p); | ||
| } | ||
| (InnerValueWriter::Row(row_type), Datum::Row(inner_row)) => { |
There was a problem hiding this comment.
Why don't we delegate like on Java side?
There was a problem hiding this comment.
+1
currently, i think a new writer is created per write call, which is not ideal
|
|
||
| match array.data_type() { | ||
| ArrowDataType::Boolean => { | ||
| let a = array.as_any().downcast_ref::<BooleanArray>().unwrap(); |
There was a problem hiding this comment.
Let's error with appropriate message instead of unwrapping/panic?
| ); | ||
| } | ||
|
|
||
| fn make_struct_batch( |
There was a problem hiding this comment.
Is this for test only? If so, move under '#[test]'/mod?
|
|
||
| // Access outer struct at column 0, row 0 | ||
| let outer = row.get_row(0).unwrap(); | ||
| assert_eq!(outer.get_int(0).unwrap(), 1); |
There was a problem hiding this comment.
Not: should we assert second row as well?
| 0, | ||
| nested_bytes.len(), | ||
| ); | ||
| let nested_deser = CompactedRowDeserializer::new_from_owned(row_type.clone()); |
There was a problem hiding this comment.
Can we use from 'new' which borrows instead? Seems like nested_deser does not live beyond current scope anyway.
|
Additionally, please can you add to existing integration test? TY |
charlesdong1991
left a comment
There was a problem hiding this comment.
thanks for the PR! left couple comments
| (InnerValueWriter::TimestampLtz(p), Datum::TimestampLtz(ts)) => { | ||
| writer.write_timestamp_ltz(ts, *p); | ||
| } | ||
| (InnerValueWriter::Row(row_type), Datum::Row(inner_row)) => { |
There was a problem hiding this comment.
+1
currently, i think a new writer is created per write call, which is not ideal
| // Validation is done at TimestampLTzType construction time | ||
| Ok(InnerValueWriter::TimestampLtz(t.precision())) | ||
| } | ||
| DataType::Row(row_type) => Ok(InnerValueWriter::Row(row_type.clone())), |
There was a problem hiding this comment.
i think we should not store clone in innervaluewriter::row, probabl store in pre built child writer is better?
| InnerValueWriter::create_inner_value_writer(&field.data_type, None) | ||
| .expect("create_inner_value_writer failed for nested row field"); | ||
| vw.write_value(&mut nested, i, datum) | ||
| .expect("write_value failed for nested row field"); | ||
| } |
There was a problem hiding this comment.
i think in current way, it will panic inside a Result, does it mean those will be hidden for users?
|
|
||
| match array.data_type() { | ||
| ArrowDataType::Boolean => { | ||
| let a = array.as_any().downcast_ref::<BooleanArray>().unwrap(); |
| let field_count = row_type.fields().len(); | ||
| let mut nested = CompactedRowWriter::new(field_count); | ||
| for (i, field) in row_type.fields().iter().enumerate() { | ||
| let datum = &inner_row.values[i]; |
There was a problem hiding this comment.
potential panic on OOB?
| })?; | ||
| let batch = Arc::clone(&self.record_batch); | ||
| let row_id = self.row_id; | ||
| Ok(lock.get_or_init(|| { |
There was a problem hiding this comment.
maybe better to use get_or_try_init here?
|
FYI: we're cutting a release branch soon hence we are taking time to review/merge. Just wanted contributors to stay informed about longer turnaround. |
Purpose
Linked issue: close #388
DataType Row was defined in the schema layer but the serialization stack was entirely missing. CompactedRowWriter and Reader panicked, FieldGetter and ValueWriter hit unimplemented macros, and Datum had no Row variant. This PR wires up the full stack.
Brief change log
Tests
API and Format
API: Datum Row is a new variant. Exhaustive match on Datum will require a new arm. InternalRow get_row is additive with a default implementation.
Wire format: Unchanged. ROW uses varint-length plus CompactedRow blob, identical to String or Bytes. This matches the Java reference byte-for-byte.
Documentation
No new documentation needed.