Skip to content

Add support for structured dtypes to zarr3 driver, open structs as void#271

Open
BrianMichell wants to merge 83 commits intogoogle:masterfrom
BrianMichell:v3_structs_and_void
Open

Add support for structured dtypes to zarr3 driver, open structs as void#271
BrianMichell wants to merge 83 commits intogoogle:masterfrom
BrianMichell:v3_structs_and_void

Conversation

@BrianMichell
Copy link
Contributor

Supersedes #264

Resolves comments 1 and 2

Implement shim for `open_as_void` driver level flag
* Begin removing void field shim

* Fully removed void string shim

* Cleanup debug prints

* Remove shimmed validation

* Remove unnecessary comment

* Prefer false over zero for ternary clarity
* Implement a more general and portable example set

* Fix driver cache bug

* Update example for template

* Cleanup example

* Remove testing examples from source
* Use the appropriate fill value for open_as_void structured data

* Cleanup
@laramiel
Copy link
Collaborator

I'll try to get to this in about a week, before I look this one over, please double check that the prior PR works for you. Also look over this one and see if any of the suggestions from the other one applies.

Matches the pattern from zarr v2 driver (PR google#272). When both "field"
and "open_as_void" are specified in the spec, return an error since
these options are mutually exclusive - field selects a specific field
from a structured array, while open_as_void provides raw byte access
to the entire structure.
The zarr3 URL syntax cannot represent field selection or void access
mode. Following the pattern from zarr v2 driver (PR google#272), ToUrl() now
returns an error when either of these options is specified instead of
silently ignoring them.
…trip

Following the pattern from zarr v2 driver (PR google#272), override
GetBoundSpecData in ZarrDataCache to set spec.open_as_void from
ChunkCacheImpl::open_as_void_. This ensures that when you open a
store with open_as_void=true and then call spec(), the resulting
spec correctly has open_as_void=true set.

Without this fix, opening a store with open_as_void=true and then
getting its spec would lose the open_as_void flag, causing incorrect
behavior if the spec is used to re-open the store.
@laramiel
Copy link
Collaborator

laramiel commented Mar 17, 2026

Due to limitations on our copybara import settings, is it possible for you to squash your commits into a single commit?

// Element [0,0] should be 0xAABBCCDD
EXPECT_EQ(static_cast<int32_t>(0xAABBCCDD), typed_ptr[0]);
}

Copy link
Collaborator

@laramiel laramiel Mar 17, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably ought to test edge cases more. And more constraints.

TEST(Zarr3OpenAsVoidTest, OpenAsVoidWithRankConstraint) {

  // Create an array with rank 2.
  ::nlohmann::json create_spec{
      {"driver", "zarr3"},
      {"kvstore", {{"driver", "memory"}, {"path", "prefix/"}}},
      {"metadata",
       {
           {"data_type", "int32"},
           {"shape", {4, 4}},
           {"chunk_grid",
            {{"name", "regular"},
             {"configuration", {{"chunk_shape", {2, 2}}}}}},
       }},
  };

  // Attempt to open with open_as_void=true and rank=3 constraint.
  ::nlohmann::json open_as_void_spec{
      {"driver", "zarr3"},
      {"kvstore", {{"driver", "memory"}, {"path", "prefix/"}}},
      {"open_as_void", true},
      {"rank", 3},
  };
  // ... create and then open.
  // Should this pass or fail?
}

TEST(Zarr3OpenAsVoidTest, OpenAsVoidWithMetadataShape) {

  ::nlohmann::json create_spec{
      {"driver", "zarr3"},
      {"kvstore", {{"driver", "memory"}, {"path", "prefix/"}}},
      {"metadata",
       {
           {"data_type", "int32"},
           {"shape", {4, 4}},
           {"chunk_grid",
            {{"name", "regular"},
             {"configuration", {{"chunk_shape", {2, 2}}}}}},
       }},
  };
  ::nlohmann::json open_as_void_spec{
      {"driver", "zarr3"},
      {"rank", 3},
      {"kvstore", {{"driver", "memory"}, {"path", "prefix/"}}},
      {"open_as_void", true},
      {"metadata", {{"shape", {4, 4}}}},
  };

  // ... create and then open.
  EXPECT_THAT(void_store.domain().shape(), ElementsAre(4, 4, 4));
}

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see these; I've suggested other variants above.

{"kvstore", {{"driver", "memory"}, {"path", "abc.zarr3/def/"}}}});
}

// Tests for open_as_void functionality
Copy link
Collaborator

@laramiel laramiel Mar 17, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where are the structured tests which are independent of open_as_void?

There should be tests for read/write of structures with fields, nested structs, etc. Kind of like this:

TEST(Zarr3StructuredTest, StructuredFieldWithArrayField) {
  // dtype: { "x": int32[2, 2] }
  ::nlohmann::json create_spec{
      {"driver", "zarr3"},
      {"rank", 4},
      {"field", "x"},
      {"kvstore", {{"driver", "memory"}, {"path", "prefix/"}}},
      {"metadata",
       {
           {"data_type",
            {{"name", "structured"},
             {"configuration", {{"fields", {{"x", "int32", {2, 2}}}}}}}},
           {"shape", {4, 2}},
           {"chunk_grid",
            {{"name", "regular"},
             {"configuration", {{"chunk_shape", {2, 2}}}}}},
       }},
  };

  TENSORSTORE_ASSERT_OK_AND_ASSIGN(
      auto store, tensorstore::Open(create_spec, context,
                                    tensorstore::OpenMode::create)
                      .result());

  auto data = tensorstore::MakeArray<int32_t>(
      {{{{1, 2}, {3, 4}}, {{5, 6}, {7, 8}}},
       {{{9, 10}, {11, 12}}, {{13, 14}, {15, 16}}},
       {{{17, 18}, {19, 20}}, {{21, 22}, {23, 24}}},
       {{{25, 26}, {27, 28}}, {{29, 30}, {31, 32}}}});

  TENSORSTORE_ASSERT_OK(tensorstore::Write(data, store).result());

  TENSORSTORE_ASSERT_OK_AND_ASSIGN(auto read_result,
                                   tensorstore::Read(store).result());

  EXPECT_EQ(data, read_result);
}

Copy link
Collaborator

@laramiel laramiel Mar 21, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like fields can no longer be array-like. But there still are array-like fields using the "r16" syntax, so an equivalent test should exist, right?

TEST(Zarr3DriverTest, StructuredFieldWithFieldShape) {
  // Test reading and writing with a structured field that has a field_shape

  auto context = Context::Default();

  // Struct layout: a (int32, 4 bytes) + b (r16, 2 bytes/elements [2]) = 6 bytes
  ::nlohmann::json create_spec{
      {"driver", "zarr3"},
      {"kvstore", {{"driver", "memory"}, {"path", "prefix_field_shape/"}}},
      {"field", "b"},
      {"metadata",
       {
           {"data_type",
            {{"name", "struct"},
             {"configuration",
              {{"fields", ::nlohmann::json::array(
                              {{{"name", "a"}, {"data_type", "int32"}},
                               {{"name", "b"}, {"data_type", "r16"}}})}}}}},
           {"shape", {4, 4}},  /* rank 2 */
           {"chunk_grid",
            {{"name", "regular"},
             {"configuration", {{"chunk_shape", {2, 2}}}}}},
       }},
  };

  TENSORSTORE_ASSERT_OK_AND_ASSIGN(
      auto store,
      tensorstore::Open(create_spec, context, tensorstore::OpenMode::create,
                        tensorstore::ReadWriteMode::read_write)
          .result());

  // Field 'b' (r16) should have rank original_rank + 1 = 3
  ASSERT_EQ(3, store.rank());
  EXPECT_THAT(store.domain().shape(), ::testing::ElementsAre(4, 4, 2));

  // Write some data to field b
  auto data = tensorstore::MakeArray<uint8_t>(
      {{{1, 2}, {3, 4}}, {{5, 6}, {7, 8}}});
  TENSORSTORE_ASSERT_OK(
      tensorstore::Write(
          data, store | tensorstore::Dims(0, 1).SizedInterval({0, 0}, {2, 2})));

  // Read it back
  TENSORSTORE_ASSERT_OK_AND_ASSIGN(
      auto read_data,
      tensorstore::Read(store | tensorstore::Dims(0, 1).SizedInterval({0, 0},
                                                                     {2, 2}))
          .result());

  EXPECT_EQ(data, read_data);
}

Decode directly into byte array for open on void
EXPECT_EQ(output["configuration"]["fields"][0]["name"], "x");
EXPECT_EQ(output["configuration"]["fields"][0]["data_type"], "uint8");
}

Copy link
Collaborator

@laramiel laramiel Mar 21, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TEST(ParseDType, Raw64BitFieldShape) {
  ::nlohmann::json input = {
      {"name", "struct"},
      {"configuration",
       {{"fields",
         ::nlohmann::json::array({{{"name", "x"}, {"data_type", "r64"}}})}}}};
  // r64 results in field_shape [8]
  TENSORSTORE_ASSERT_OK_AND_ASSIGN(auto dtype, ParseDType(input));
  ASSERT_EQ(dtype.fields.size(), 1);
  EXPECT_THAT(dtype.fields[0].field_shape, ::testing::ElementsAre(8));
  EXPECT_EQ(dtype.fields[0].num_bytes, 8);
}

TEST(ParseDType, ManyFieldsOffsets) {
  // How many fields are allowed?
  ::nlohmann::json::array_t fields;
  for (int i = 0; i < 1000; ++i) {
    fields.push_back({{"name", absl::StrCat("f", i)}, {"data_type", "int64"}});
  }
  ::nlohmann::json input = {{"name", "struct"},
                            {"configuration", {{"fields", fields}}}};
  TENSORSTORE_ASSERT_OK_AND_ASSIGN(auto dtype, ParseDType(input));
  EXPECT_EQ(dtype.fields.size(), 1000);
  EXPECT_EQ(dtype.bytes_per_outer_element, 8000);
  EXPECT_EQ(dtype.fields[99].byte_offset, 99 * 8);
  EXPECT_EQ(dtype.fields[999].byte_offset, 999 * 8);
}

/*.bytes_per_outer_element=*/3,
});
}

Copy link
Collaborator

@laramiel laramiel Mar 21, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TEST(ParseDType, StructuredWithFieldShape) {
  ::nlohmann::json input = {
      {"name", "struct"},
      {"configuration",
       {{"fields", ::nlohmann::json::array(
                       {{{"name", "scalar"}, {"data_type", "int32"}},
                        {{"name", "array"}, {"data_type", "r16"}}})}}}};

  ZarrDType expected{
      /*.has_fields=*/true,
      /*.fields=*/
      {
          {{
               /*.encoded_dtype=*/"int32",
               /*.dtype=*/dtype_v<int32_t>,
               /*.flexible_shape=*/{},
           },
           /*.name=*/"scalar",
           /*.field_shape=*/{},
           /*.num_inner_elements=*/1,
           /*.byte_offset=*/0,
           /*.num_bytes=*/4},
          {{
               /*.encoded_dtype=*/"r16",
               /*.dtype=*/dtype_v<tensorstore::dtypes::byte_t>,
               /*.flexible_shape=*/{2},
           },
           /*.name=*/"array",
           /*.field_shape=*/{2},
           /*.num_inner_elements=*/2,
           /*.byte_offset=*/4,
           /*.num_bytes=*/2},
      },
      /*.bytes_per_outer_element=*/6,
  };

  CheckDType(input, expected);
}

absl::StatusCode::kInvalidArgument,
".*open_as_void is only supported for structured dtypes.*"));
}

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If someone provides constraints when using open_as_void, then they should be properly validated against the schema, right? Do you have a test equivalent to this? The errors below are approximations of what they should be.

TEST(Zarr3OpenAsVoidTest, InvalidSchema) {
  auto context = Context::Default();

  // Create a structured array
  ::nlohmann::json create_spec{
      {"driver", "zarr3"},
      {"kvstore", {{"driver", "memory"}, {"path", "prefix_void_invalid/"}}},
      {"field", "x"},
      {"metadata",
       {
           {"data_type",
            {{"name", "struct"},
             {"configuration",
              {{"fields", ::nlohmann::json::array(
                              {{{"name", "x"}, {"data_type", "uint8"}},
                               {{"name", "y"}, {"data_type", "int16"}}})}}}}},
           {"shape", {4, 4}},
           {"chunk_grid",
            {{"name", "regular"},
             {"configuration", {{"chunk_shape", {2, 2}}}}}},
       }},
  };

  TENSORSTORE_ASSERT_OK_AND_ASSIGN(
      auto store,
      tensorstore::Open(create_spec, context, tensorstore::OpenMode::create,
                        tensorstore::ReadWriteMode::read_write)
          .result());

  ::nlohmann::json void_spec{
      {"driver", "zarr3"},
      {"kvstore", {{"driver", "memory"}, {"path", "prefix_void_invalid/"}}},
      {"open_as_void", true},
  };

  // Test rank mismatch: RankConstraint should be 3 for this open_as_void:
  EXPECT_THAT(
      tensorstore::Open(void_spec, context, tensorstore::RankConstraint(2))
          .result(),
      tensorstore::MatchesStatus(absl::StatusCode::kFailedPrecondition,
                        "Rank specified by schema \\(2\\) does not match rank "
                        "specified by metadata \\(3\\)"));

  // Test dtype mismatch: open_as_void dtype must be byte.
  EXPECT_THAT(tensorstore::Open(void_spec, context, dtype_v<int16_t>).result(),
              tensorstore::MatchesStatus(absl::StatusCode::kFailedPrecondition,
                                "data_type from metadata \\(uint8\\) does not "
                                "match dtype in schema \\(int16\\)"));
}

return bytes[i * s0 + j * s1 + k * s2];
};

// [0,0]: x=0 (fill), y=100 (0x0064 LE: 0x64, 0x00)
Copy link
Collaborator

@laramiel laramiel Mar 21, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some of these would be nicer as array comparisons, IMO. Why are you not reading from the array instead of computing the values directly? This applies to similar lines above and below here.

https://google.github.io/tensorstore/cpp/api/tensorstore.Array.operator-call-indices.html
https://google.github.io/tensorstore/cpp/api/tensorstore.Array.operator-subscript-indices.html

auto expected_y = tensorstore::MakeArray<int16_t>({{200}});
EXPECT_EQ(expected_y, y_data);
}

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do the above but also with r16 (or r64). This could maybe be a TEST_P.

@@ -123,14 +131,48 @@ class ZarrDriverSpec
"metadata",
jb::Validate(
[](const auto& options, auto* obj) -> absl::Status {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Move the validation into the jb::Initialize() call, so that all the checks happen in the same location here.

}

TEST(ParseDType, SimpleStringBool) {
CheckDType("bool", ZarrDType{
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are several places a ZarrDType with fields is fully spelled out.
I think that I'd like a method like:

AddFieldToZarrDType(dtype, Field{...})

Where the method updates bits like byte_offsets and bytes_per_outer_elements (depending on has_fields).

FillValueDataTypeFunctions::Make<::tensorstore::dtypes::T>(); \
/**/
TENSORSTORE_ZARR3_FOR_EACH_DATA_TYPE(TENSORSTORE_INTERNAL_DO_DEF)
// Add char_t support for string data types
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit #undef TENSORSTORE_INTERNAL_DO_DEF immediately after use.

.result());

EXPECT_EQ(3, byte_array.rank());
EXPECT_EQ(2, byte_array.shape()[0]);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

EXPECT_THAT(byte_array.shape(), ElementsAre(2, 2, 3));

EXPECT_EQ(2, byte_array.shape()[1]);
EXPECT_EQ(3, byte_array.shape()[2]);

// Verify bytes - we use the array's data() and strides
Copy link
Collaborator

@laramiel laramiel Mar 21, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  EXPECT_EQ(tensorstore::MakeArray<uint8_t>(
                {{{0, 0x64, 0x00}, {0, 0xC8, 0x00}},
                 {{0, 0x2C, 0x01}, {0, 0x90, 0x01}}}),
            tensorstore::StaticCast<const uint8_t>(byte_array));

Though probably keep comments.


namespace tensorstore {
namespace internal_zarr3 {

Copy link
Collaborator

@laramiel laramiel Mar 21, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

kVoidFieldIndex appears in more than one location; please use a single definition.

// Sentinel value for field_index indicating access to the raw byte "void" view
// of a structured array.
constexpr size_t kVoidFieldIndex = size_t(-1);

if (obj->open_as_void && !obj->selected_field.empty()) {
return absl::InvalidArgumentError(
"\"field\" and \"open_as_void\" are mutually exclusive");
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There should probably be a method for this. I'm using auto to avoid typing out types here.
And it's probably not quite correct.

TENSORSTORE_RETURN_IF_ERROR(TrySetMetadataConstraints(obj->metadata_constraints, obj->schema, obj->selected_field, obj->open_as_void));

absl::Status TrySetMetadataConstraints(
   const ZarrMetadataConstraints& metadata_constraints,
   auto& schema,
   auto selected_field,
   bool open_as_void)
   
  auto& metadata_constraints = obj->metadata_constraints;

  auto rank = metadata_constraints.rank;
  if (rank == dynamic_rank && metadata_constraints.shape) {
    rank = metadata_constraints.shape->size();
  }
  auto add_rank = [&rank](auto n) { if (rank != dynamic_rank) rank += n; };

  if (open_as_void) {
    add_rank (1);
    TENSORSTORE_RETURN_IF_ERROR(schema.Set(dtype_v<std::byte>));
  } else if (metadata_constraints.dtype) {
    // get field index
    if (field_index) {
      field =  ... get field from metadata ...
      TENSORSTORE_RETURN_IF_ERROR(schema.Set(field->dtype));
      add_rank(field->field_shape.size());
    } else {
      // not sure; shouldn't be void.
    }
  }
  if (rank != dynamic_rank) {
    TENSORSTORE_RETURN_IF_ERROR(obj->schema.Set(
                    RankConstraint{rank}));
  }
  // ... maybe more?
  return absl::OkStatus();
}

Or maybe it should return something like a modified struct SpecRankAndFieldInfo?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants