Added --detect-timestamps flag to preserve Ion timestamps during JSON roundtrip#206
Added --detect-timestamps flag to preserve Ion timestamps during JSON roundtrip#206jobarr-amzn merged 28 commits intomainfrom
Conversation
…ve deserialization
src/bin/ion/commands/from/json.rs
Outdated
| use anyhow::{Context, Result}; | ||
| use clap::{arg, ArgMatches, Command}; | ||
| use ion_rs::Element; | ||
| use serde_json::{Deserializer, Value}; |
There was a problem hiding this comment.
It is not necessary to use a JSON serde, the Ion reader can already read the JSON. Legal JSON is also legal Ion. You can handle this type conversion entirely value by value reading "Ion" values.
Perhaps the best way to handle this would be to extend the CatCommand infrastructure to allow a supplied mapping function. Alternatively, this could be done via a jq command (if only we had e.g. some timestamp_from_string function available).
src/bin/ion/commands/from/json.rs
Outdated
| s.len() >= 4 | ||
| && s[..4].chars().all(|c| c.is_ascii_digit()) | ||
| && (s.contains('T') || (s.len() == 10 && s.matches('-').count() == 2)) | ||
| && !(s.len() == 4 && s.chars().all(|c| c.is_ascii_digit())) |
There was a problem hiding this comment.
I like some elements of the logic here- .len() is the cheapest check. We could apply some simplifying assumptions (e.g. we only detect timestamps < 9999 A.D and only to nanosecond precision. Then you can add a max length check and also check specific offsets in the string rather than using .contains or .matches.
E.g. I believe that timestamps must be:
len < 10 && last char is 'T' // 2007T, 2007-01T
|| len == 10 && form is YYYY-MM-DD // 2007-01-01 ...
|| len > 10 && s[10] is 'T' // FWIW this should be far and away the most common case
This function could use documentation explaining the constraints and referencing the relevant specifications. You could start from here: https://amazon-ion.github.io/ion-docs/docs/spec.html#timestamp
…entation explaining timestamp heuristic logic
…nction for the --detect-timestamp flag
Changes
|
src/bin/ion/commands/cat.rs
Outdated
| let mapper = if detect_timestamps { | ||
| Some(convert_timestamps as fn(Element) -> Result<Element>) | ||
| } else { | ||
| None | ||
| }; |
There was a problem hiding this comment.
What if instead of plumbing optionality all the way down you defaulted to essentially an identity mapper? (i -> Ok(i)).
You can get rid of branches at several levels that way. I don't know which would be more performant, but I think that would end up with fewer code changes and be easier to follow.
There was a problem hiding this comment.
Removing the optionality makes the code more readable and makes it easier to implement changes in the future without having to worry about the optionality logic, but it adds an overhead to each value in the dataset - each value will have an identity mapper on it even when the --detect-timestamps is not enabled.
There was a problem hiding this comment.
Of course that's correct. Quantified, is it significant? How does that work compare to the work hiding behind any other statement executed for each value in the stream?
| /// Recursively converts timestamp-like strings to Ion timestamps in the given element. | ||
| pub fn convert_timestamps(element: Element) -> Result<Element> { |
There was a problem hiding this comment.
Two comments on recursion here:
Structural recursion: we have only a couple use cases at the moment, but a general purpose structural recursion function would be useful. There's at least one more place where we do this, in stats.rs. If we allow the mapper to handle only top level values than any mapper implementer must separately implement recursion.
Procedural recursion: Whenever we implement procedural recursion anywhere it allows user-submitted data to blow the stack. Pass in a kilobyte[1] of nested parentheses and poof. For that reason we should prefer an iterative approach to managing structural recursion. Again see stats.rs, specifically the top level depth calculation.
[1]: Your mileage may vary depending on programming language, runtime, architecture, and the value of $CURRENT_YEAR. There's always some number though.
| /// Iteratively applies a mapper to all elements in an Ion structure | ||
| /// This function uses an explicit work stack instead of recursion | ||
| /// It processes the structure in post-order (children before parents) to enable reconstruction |
There was a problem hiding this comment.
Post-order makes it more difficult to create a mapper which efficiently drops or mutates or members. Would it be better to map each value and then descend into the result?
As an aside, this also makes me wonder ControlFlow could be useful for signaling when the mapper knows that further recursion is not needed. I'm not suggesting that you use ControlFlow now- we have only two extant use cases and both depend on total exploration of the value.
|
Using an identity mapper |
|
Reverted from using an identity mapper to using optional plumbing, with a difference, where now the mapper can be passed as none and we are not using 2 separate methods for mapper and non-mapper. |
jobarr-amzn
left a comment
There was a problem hiding this comment.
I still need to take a closer look at the structural recursion method, and I'm not sure about adding the flag to cat. Otherwise, this looks good.
|
|
||
| /// Heuristic to identify strings that could be Ion timestamps | ||
| /// | ||
| /// Ion timestamps follow ISO 8601 format with these constraints: |
There was a problem hiding this comment.
Not exactly. See: https://amazon-ion.github.io/ion-docs/docs/spec.html#timestamp
In the text format, timestamps follow the W3C note on date and time formats, but they must end with the literal “T” if not at least whole-day precision. Fractional seconds are allowed, with at least one digit of precision and an unlimited maximum. Local-time offsets may be represented as either hour:minute offsets from UTC, or as the literal “Z” to denote a local time of UTC. They are required on timestamps with time and are not allowed on date values.
Ion follows the “Unknown Local Offset Convention” of RFC3339:
To see RFC3339 contrasted with ISO-8601, see: https://ijmacd.github.io/rfc3339-iso8601/
Description of changes:
Added the
--detect-timestamps/-tflag to thefrom jsoncommand, which preserves the timestamp data type when converting an Ion value to JSON and back to Ion.The timestamp detection and conversion logic is implemented in the
convert_timestampsfunction, which performs two checks:is_timestamp_like- A heuristic filter that quickly identifies potential timestamp strings to avoid unnecessary processing of non-timestamp valuesion_rs::IonType::Timestamp- Uses Ion-rust's implementation to validate and convert valid timestamp strings back to Ion timestamp typesThe implementation uses an optional mapper approach in the transcribe module, where:
When
-tflag is present: Each Ion element is processed throughconvert_timestampsWhen
-tflag is absent: Elements are written directly without materialization for optimal performanceExample
Input
echo '{created: 2025-12-25T10:30:00Z}' | ion to json | ion from json --detect-timestamps --format prettyOutput
{ created: 2025-12-25T10:30:00+00:00 }By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.