Skip to content

Added --detect-timestamps flag to preserve Ion timestamps during JSON roundtrip#206

Merged
jobarr-amzn merged 28 commits intomainfrom
shaurya/detect-timestamps
Sep 30, 2025
Merged

Added --detect-timestamps flag to preserve Ion timestamps during JSON roundtrip#206
jobarr-amzn merged 28 commits intomainfrom
shaurya/detect-timestamps

Conversation

@Shaurya0108
Copy link
Copy Markdown
Contributor

@Shaurya0108 Shaurya0108 commented Sep 12, 2025

Description of changes:

Added the --detect-timestamps/-t flag to the from json command, which preserves the timestamp data type when converting an Ion value to JSON and back to Ion.

The timestamp detection and conversion logic is implemented in the convert_timestamps function, which performs two checks:

is_timestamp_like - A heuristic filter that quickly identifies potential timestamp strings to avoid unnecessary processing of non-timestamp values

ion_rs::IonType::Timestamp - Uses Ion-rust's implementation to validate and convert valid timestamp strings back to Ion timestamp types

The implementation uses an optional mapper approach in the transcribe module, where:

When -t flag is present: Each Ion element is processed through convert_timestamps

When -t flag is absent: Elements are written directly without materialization for optimal performance

Example

Input
echo '{created: 2025-12-25T10:30:00Z}' | ion to json | ion from json --detect-timestamps --format pretty

Output
{ created: 2025-12-25T10:30:00+00:00 }


By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

use anyhow::{Context, Result};
use clap::{arg, ArgMatches, Command};
use ion_rs::Element;
use serde_json::{Deserializer, Value};
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is not necessary to use a JSON serde, the Ion reader can already read the JSON. Legal JSON is also legal Ion. You can handle this type conversion entirely value by value reading "Ion" values.

Perhaps the best way to handle this would be to extend the CatCommand infrastructure to allow a supplied mapping function. Alternatively, this could be done via a jq command (if only we had e.g. some timestamp_from_string function available).

Comment on lines +118 to +121
s.len() >= 4
&& s[..4].chars().all(|c| c.is_ascii_digit())
&& (s.contains('T') || (s.len() == 10 && s.matches('-').count() == 2))
&& !(s.len() == 4 && s.chars().all(|c| c.is_ascii_digit()))
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like some elements of the logic here- .len() is the cheapest check. We could apply some simplifying assumptions (e.g. we only detect timestamps < 9999 A.D and only to nanosecond precision. Then you can add a max length check and also check specific offsets in the string rather than using .contains or .matches.

E.g. I believe that timestamps must be:

   len < 10 && last char is 'T' // 2007T, 2007-01T
|| len == 10 && form is YYYY-MM-DD // 2007-01-01 ... 
|| len > 10 && s[10] is 'T' // FWIW this should be far and away the most common case

This function could use documentation explaining the constraints and referencing the relevant specifications. You could start from here: https://amazon-ion.github.io/ion-docs/docs/spec.html#timestamp

Shaurya Dwivedi added 2 commits September 12, 2025 14:41
@Shaurya0108
Copy link
Copy Markdown
Contributor Author

Shaurya0108 commented Sep 12, 2025

Changes

  • Added more in depth timestamp identification logic to filter out unwanted data types.
  • Changed the JSON parser from using JSON serde to using Ion's ability to read JSON.
  • Implemented --detect-timestamp flag in the CatCommand infrastructure, it now uses with_mapper functions for timestamp related data because Ion's lazy evaluation doesn't support partial string inspection. There might be work arounds/different implementations for this? (using with_mapper increases the space used with cat commands, but only is triggered for timestamps when the flag is used)
  • Added shorter command for --detect-timestamp: -t

Comment on lines +41 to +45
let mapper = if detect_timestamps {
Some(convert_timestamps as fn(Element) -> Result<Element>)
} else {
None
};
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if instead of plumbing optionality all the way down you defaulted to essentially an identity mapper? (i -> Ok(i)).
You can get rid of branches at several levels that way. I don't know which would be more performant, but I think that would end up with fewer code changes and be easier to follow.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removing the optionality makes the code more readable and makes it easier to implement changes in the future without having to worry about the optionality logic, but it adds an overhead to each value in the dataset - each value will have an identity mapper on it even when the --detect-timestamps is not enabled.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Of course that's correct. Quantified, is it significant? How does that work compare to the work hiding behind any other statement executed for each value in the stream?

Comment on lines +4 to +5
/// Recursively converts timestamp-like strings to Ion timestamps in the given element.
pub fn convert_timestamps(element: Element) -> Result<Element> {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two comments on recursion here:

Structural recursion: we have only a couple use cases at the moment, but a general purpose structural recursion function would be useful. There's at least one more place where we do this, in stats.rs. If we allow the mapper to handle only top level values than any mapper implementer must separately implement recursion.

Procedural recursion: Whenever we implement procedural recursion anywhere it allows user-submitted data to blow the stack. Pass in a kilobyte[1] of nested parentheses and poof. For that reason we should prefer an iterative approach to managing structural recursion. Again see stats.rs, specifically the top level depth calculation.

[1]: Your mileage may vary depending on programming language, runtime, architecture, and the value of $CURRENT_YEAR. There's always some number though.

Comment on lines +17 to +19
/// Iteratively applies a mapper to all elements in an Ion structure
/// This function uses an explicit work stack instead of recursion
/// It processes the structure in post-order (children before parents) to enable reconstruction
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Post-order makes it more difficult to create a mapper which efficiently drops or mutates or members. Would it be better to map each value and then descend into the result?

As an aside, this also makes me wonder ControlFlow could be useful for signaling when the mapper knows that further recursion is not needed. I'm not suggesting that you use ControlFlow now- we have only two extant use cases and both depend on total exploration of the value.

@Shaurya0108
Copy link
Copy Markdown
Contributor Author

Using an identity mapper |element| Ok(element) when no transformation is needed, eliminating the optionality and branching.

@Shaurya0108
Copy link
Copy Markdown
Contributor Author

Reverted from using an identity mapper to using optional plumbing, with a difference, where now the mapper can be passed as none and we are not using 2 separate methods for mapper and non-mapper.

@Shaurya0108 Shaurya0108 self-assigned this Sep 18, 2025
@Shaurya0108 Shaurya0108 marked this pull request as ready for review September 18, 2025 22:03
Copy link
Copy Markdown
Contributor

@jobarr-amzn jobarr-amzn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still need to take a closer look at the structural recursion method, and I'm not sure about adding the flag to cat. Otherwise, this looks good.


/// Heuristic to identify strings that could be Ion timestamps
///
/// Ion timestamps follow ISO 8601 format with these constraints:
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not exactly. See: https://amazon-ion.github.io/ion-docs/docs/spec.html#timestamp

In the text format, timestamps follow the W3C note on date and time formats, but they must end with the literal “T” if not at least whole-day precision. Fractional seconds are allowed, with at least one digit of precision and an unlimited maximum. Local-time offsets may be represented as either hour:minute offsets from UTC, or as the literal “Z” to denote a local time of UTC. They are required on timestamps with time and are not allowed on date values.

Ion follows the “Unknown Local Offset Convention” of RFC3339:

To see RFC3339 contrasted with ISO-8601, see: https://ijmacd.github.io/rfc3339-iso8601/

@jobarr-amzn jobarr-amzn merged commit f1e3a0a into main Sep 30, 2025
6 checks passed
@jobarr-amzn jobarr-amzn deleted the shaurya/detect-timestamps branch September 30, 2025 22:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants