
feat: Support $ref to load from external JSON/YAML file#105

Merged
njzjz merged 10 commits into master from copilot/support-ref-external-json on Feb 24, 2026

Conversation

Contributor

Copilot AI commented Feb 24, 2026

Adds support for loading dict values from external JSON or YAML files via a "$ref" key. Before validation or normalization, dargs resolves any "$ref" key by loading the referenced file and merging its contents into the dict, with local keys taking precedence over loaded ones. Chained $ref values (where a loaded file itself contains $ref) are fully resolved, with cycle detection to prevent infinite loops.

Loading from external files is disabled by default (secure by default). Users must explicitly opt in by passing allow_ref=True to the relevant method.
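
The merge rule and the opt-in guard described above can be illustrated with a minimal sketch. The helper name `merge_ref` is hypothetical and is not the function added in this PR; it handles JSON only and resolves a single level, purely to show that loaded keys form the base, local keys override them, and nothing is read unless `allow_ref=True`.

```python
import json
import os
import tempfile


def merge_ref(local: dict, allow_ref: bool = False) -> dict:
    """Hypothetical helper: return a new dict with one "$ref" resolved."""
    if "$ref" not in local:
        return dict(local)
    if not allow_ref:
        # Secure by default: refuse to touch the filesystem.
        raise ValueError("$ref found but allow_ref is False")
    rest = {k: v for k, v in local.items() if k != "$ref"}
    with open(local["$ref"], encoding="utf-8") as f:
        loaded = json.load(f)
    # Loaded content is the base; keys written locally override it.
    return {**loaded, **rest}


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "base.json")
    with open(path, "w", encoding="utf-8") as f:
        json.dump({"a": 1, "b": 2}, f)

    merged = merge_ref({"$ref": path, "b": 20}, allow_ref=True)

    try:
        merge_ref({"$ref": path})
        refused = False
    except ValueError:
        refused = True
```

Here `merged` keeps `a` from the file and takes `b` from the local dict, while the second call is rejected because the opt-in flag was not passed.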

Changes Made

  • dargs/dargs.py: Added _load_ref and _resolve_ref helper functions.
    • _load_ref opens files with explicit encoding="utf-8" and validates the loaded value is a dict, raising a clear ValueError for non-mapping content.
    • _resolve_ref raises ValueError when $ref is found but allow_ref=False; resolves chained $ref values in a loop with cycle detection.
    • check and check_value deepcopy input when allow_ref=True so the caller's data is never mutated.
    • The allow_ref parameter is threaded through _traverse_sub, traverse_value, traverse, and all public methods (check, check_value, normalize, normalize_value).
  • dargs/check.py: Added allow_ref=False parameter to the check helper, passed through to normalize_value and check_value.
  • dargs/cli.py: Added --allow-ref flag to the dargs check CLI subcommand.
  • dargs/notebook.py: Added allow_ref=False to JSON(), print_html(), and ArgumentData; resolves $ref on a copy of the data in _init_subdata to avoid side effects on the caller's data.
  • pyproject.toml: Added pyyaml as an optional dependency under a new [yaml] extra; also added to [test] extra.
  • tests/test_ref.py: New tests covering secure-by-default behavior, JSON loading, YAML loading (.yml/.yaml), local key override, unsupported file extension, non-dict file content, cyclic ref detection, chained ref resolution, no-mutation of caller data in check/check_value, and normalize_value.
  • docs/ref.md: New dedicated documentation page explaining the $ref feature, allow_ref=True requirement, supported formats, and a usage example.
  • docs/index.rst: Added ref to the Sphinx toctree.
  • README.md: Replaced the full $ref section with a single bullet point in "Additional features" linking to the docs.

Supported File Formats

| Extension | Requires |
| --- | --- |
| .json | (built-in) |
| .yaml, .yml | pip install pyyaml or pip install dargs[yaml] |
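
A minimal sketch of how dispatch on the file extension might look. The helper `loader_kind` is hypothetical, and raising `ValueError` for an unsupported extension is an assumption: the thread only states that the unsupported-extension case is tested, not which exception it raises.

```python
import os


def loader_kind(ref_path: str) -> str:
    """Hypothetical helper: map a $ref path to the loader it would need."""
    ext = os.path.splitext(ref_path)[1].lower()
    if ext == ".json":
        return "json"  # standard library, always available
    if ext in (".yml", ".yaml"):
        return "yaml"  # needs the optional pyyaml dependency
    raise ValueError(f"Unsupported $ref file extension: {ext!r}")


kinds = [loader_kind(p) for p in ("conf/base.JSON", "a.yml", "b.yaml")]

try:
    loader_kind("settings.toml")
    unsupported = ""
except ValueError as exc:
    unsupported = str(exc)
```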

Usage

argument.check(data, allow_ref=True)
argument.normalize(data, allow_ref=True)
argument.check_value(value, allow_ref=True)
argument.normalize_value(value, allow_ref=True)

CLI:

dargs check --allow-ref -f my_module.my_args config.json
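
As a rough illustration of an opt-in flag such as `--allow-ref`, the sketch below uses `argparse` with `action="store_true"`, so the option is `False` unless it is given. The parser layout, the `--func` long option name, and the `jdata` positional are assumptions made for illustration and may not match `dargs/cli.py`.

```python
import argparse

# Hypothetical parser layout; only the --allow-ref flag is taken from the PR.
parser = argparse.ArgumentParser(prog="dargs")
subparsers = parser.add_subparsers(dest="command")
check = subparsers.add_parser("check")
check.add_argument("-f", "--func", required=True)
check.add_argument("jdata", nargs="+")
# store_true makes the flag default to False (secure by default).
check.add_argument("--allow-ref", action="store_true")

off = parser.parse_args(["check", "-f", "my_module.my_args", "config.json"])
on = parser.parse_args(
    ["check", "--allow-ref", "-f", "my_module.my_args", "config.json"]
)
```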
Original prompt

This section details the original issue you should resolve

<issue_title>Support $ref to load from external JSON file</issue_title>
<issue_description>Before checking/normalizing the input dict, check if there is a $ref key in the dict. If so, load the dict from the external file path.

Use the proper loading method according to the file extension (.json, .yml, or .yaml). pyyaml can be an optional dependency in pyproject.toml; give the user a tip if it is not installed.

Add the documentation for this behavior.</issue_description>

Comments on the Issue (you are @copilot in this section)



Co-authored-by: njzjz <9496702+njzjz@users.noreply.github.com>
Copilot AI changed the title from "[WIP] Add support for $ref to load from external JSON file" to "Support $ref to load from external JSON/YAML file" Feb 24, 2026
Copilot AI requested a review from njzjz February 24, 2026 02:30
Co-authored-by: njzjz <9496702+njzjz@users.noreply.github.com>
Copilot AI requested a review from njzjz February 24, 2026 02:35
@njzjz
Member

njzjz commented Feb 24, 2026

pre-commit.ci autofix

pre-commit-ci bot and others added 2 commits February 24, 2026 02:37
Signed-off-by: Jinzhe Zeng <jinzhe.zeng@ustc.edu.cn>
@codecov

codecov bot commented Feb 24, 2026

Codecov Report

❌ Patch coverage is 87.09677% with 8 lines in your changes missing coverage. Please review.
✅ Project coverage is 75.14%. Comparing base (2ed9020) to head (a261505).
⚠️ Report is 2 commits behind head on master.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| dargs/check.py | 0.00% | 2 Missing ⚠️ |
| dargs/cli.py | 0.00% | 2 Missing ⚠️ |
| dargs/dargs.py | 95.45% | 2 Missing ⚠️ |
| dargs/notebook.py | 85.71% | 2 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##           master     #105      +/-   ##
==========================================
+ Coverage   74.22%   75.14%   +0.91%     
==========================================
  Files           9        9              
  Lines         811      853      +42     
==========================================
+ Hits          602      641      +39     
- Misses        209      212       +3     


Contributor

@njzjz-bot njzjz-bot left a comment

Thanks for implementing $ref support! However, there is a security concern that should be addressed before merging.

Security Issue: Arbitrary File Read

The current implementation allows reading arbitrary files from the filesystem:

def _load_ref(ref_path: str) -> dict:
    with open(ref_path) as f:  # No path validation!
        return json.load(f)

An attacker who can control the input JSON (e.g., from untrusted user input) could read sensitive files:

{"$ref": "/home/user/.config/credentials.json"}

Suggested Fix: Add allow_ref Parameter

I recommend adding an allow_ref parameter (defaulting to False) to explicitly enable this feature. This follows the principle of secure by default.

The parameter would need to be added to:

  • Argument.check(), Argument.normalize(), Argument.check_value(), Argument.normalize_value()
  • Argument._traverse_sub() and _resolve_ref()
  • Potentially Variant methods as well

Example implementation:

def _resolve_ref(d: dict, allow_ref: bool = False) -> None:
    if "$ref" not in d:
        return
    if not allow_ref:
        raise ValueError(
            "$ref is not allowed by default. "
            "Pass allow_ref=True to enable loading from external files."
        )
    ref_path = d.pop("$ref")
    # ... rest of the implementation

This way, existing code remains secure, and users who need $ref can explicitly opt-in.

Would you be open to adding this safeguard?

@njzjz
Member

njzjz commented Feb 24, 2026

@copilot Please also scan other methods that call these methods.

Co-authored-by: njzjz <9496702+njzjz@users.noreply.github.com>
Contributor Author

Copilot AI commented Feb 24, 2026

Thanks for implementing $ref support! However, there is a security concern that should be addressed before merging.

Security Issue: Arbitrary File Read...

Done in 12d1d49. Added allow_ref: bool = False threaded through all relevant methods:

  • _resolve_ref — raises ValueError when $ref is found and allow_ref is False
  • _traverse_sub, traverse_value, traverse — all accept and propagate allow_ref
  • Public API: check, check_value, normalize, normalize_value — all accept allow_ref=False
  • dargs/check.py::check — also updated with allow_ref=False

Users must now explicitly opt in with allow_ref=True to enable file loading.

@njzjz
Member

njzjz commented Feb 24, 2026

@copilot you may also need to add this argument to CLI and notebook.

Co-authored-by: njzjz <9496702+njzjz@users.noreply.github.com>
Contributor Author

Copilot AI commented Feb 24, 2026

@copilot you may also need to add this argument to CLI and notebook.

Done in 782d6e8.

CLI (dargs/cli.py): Added --allow-ref flag to the check subcommand:

dargs check --allow-ref -f my_module.my_args config.json

Notebook (dargs/notebook.py): Added allow_ref=False to JSON(), print_html(), and ArgumentData. _resolve_ref is called in _init_subdata when allow_ref=True, resolving any $ref in nested dicts before rendering.

@njzjz
Member

njzjz commented Feb 24, 2026

pre-commit.ci autofix

@njzjz njzjz marked this pull request as ready for review February 24, 2026 07:29
Copilot AI review requested due to automatic review settings February 24, 2026 07:29

Copilot AI left a comment

Pull request overview

Adds opt-in support for resolving "$ref" keys by loading and merging external JSON/YAML files before validation/normalization.

Changes:

  • Implemented $ref loading/resolution helpers and threaded allow_ref through traversal, public APIs, CLI, and notebook rendering.
  • Added YAML optional dependency via a new [yaml] extra and extended type-check workflow extras.
  • Added documentation and new unit tests for secure-by-default behavior and supported formats.

Reviewed changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 8 comments.

| File | Description |
| --- | --- |
| dargs/dargs.py | Core $ref loader/resolver and allow_ref plumbing through traversal/check/normalize. |
| dargs/check.py | Adds allow_ref parameter and passes it through normalize/check helpers. |
| dargs/cli.py | Adds --allow-ref flag and threads it into CLI check flow. |
| dargs/notebook.py | Adds allow_ref to notebook helpers and resolves $ref during rendering. |
| pyproject.toml | Introduces [yaml] extra (pyyaml) and includes it in [test]. |
| .github/workflows/type-check.yml | Adds yaml extra to type-check install set. |
| tests/test_ref.py | New tests for $ref resolution across check/normalize/value APIs and formats. |
| docs/ref.md | New documentation page for $ref feature and allow_ref=True. |
| docs/index.rst | Adds ref page to Sphinx toctree. |
| README.md | Mentions $ref feature in additional features list. |


f"key `{path[-1]}` gets wrong value type, "
f"requires dict but {type(value).__name__} is given",
)
_resolve_ref(value, allow_ref)
Copilot AI Feb 24, 2026

Argument.check() / check_value() now mutate user-provided data when allow_ref=True because _resolve_ref() runs inside traversal and modifies dicts in place (pops $ref, merges loaded keys). Checking is typically expected to be side-effect free; consider resolving refs on a copy for check/check_value (e.g., deepcopy at entry when allow_ref=True) or adding an inplace/mutate option to _resolve_ref so validation can use a merged view without modifying the input.
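
One way to keep checking free of side effects, sketched under stated assumptions: `resolve_in_place` stands in for the in-place resolver, `files` is a dict that fakes the filesystem, and `check_without_mutation` is a hypothetical entry point rather than the actual `Argument.check()`. A deep copy is used because a shallow copy would still share the nested `sub` dict with the caller.

```python
import copy


def resolve_in_place(d: dict, files: dict) -> None:
    """Stand-in for in-place $ref resolution; `files` fakes the filesystem."""
    if "$ref" in d:
        loaded = files[d.pop("$ref")]
        merged = {**loaded, **d}
        d.clear()
        d.update(merged)


def check_without_mutation(data: dict, files: dict) -> dict:
    """Hypothetical check entry point: resolve refs on a deep copy."""
    work = copy.deepcopy(data)
    resolve_in_place(work["sub"], files)
    return work


files = {"base.json": {"a": 1}}
user_data = {"sub": {"$ref": "base.json", "b": 2}}
resolved = check_without_mutation(user_data, files)
```

After the call, `user_data` still contains its `$ref` key, and only the returned copy holds the merged content.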

dargs/dargs.py Outdated
Comment on lines +1125 to +1138
ext = os.path.splitext(ref_path)[1].lower()
if ext == ".json":
    with open(ref_path) as f:
        return json.load(f)
elif ext in (".yml", ".yaml"):
    try:
        import yaml
    except ImportError as e:
        raise ImportError(
            "pyyaml is required to load YAML files referenced by $ref. "
            "Install it with: pip install pyyaml"
        ) from e
    with open(ref_path) as f:
        return yaml.safe_load(f)
Copilot AI Feb 24, 2026

_load_ref() is documented to return a dict, but json.load() / yaml.safe_load() can legally return non-dicts (e.g., list/scalar) or None (empty YAML). This will raise at merge time ({**loaded, **d}) with a TypeError instead of a clear error. Validate the loaded value is a dict and raise a ValueError with a helpful message when the referenced file doesn't contain a mapping at the top level.
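
The validation asked for here could look roughly like the following. `ensure_mapping` is a hypothetical helper name and the message wording is an assumption; the point is only that a top-level list, scalar, or `None` is rejected with a `ValueError` before any merge is attempted.

```python
import json


def ensure_mapping(loaded: object, ref_path: str) -> dict:
    """Hypothetical helper: accept only a top-level mapping."""
    if not isinstance(loaded, dict):
        raise ValueError(
            f"Referenced file {ref_path!r} must contain a mapping at the "
            f"top level, got {type(loaded).__name__}"
        )
    return loaded


ok = ensure_mapping(json.loads('{"a": 1}'), "good.json")

try:
    # A JSON array parses fine but is not a mapping.
    ensure_mapping(json.loads("[1, 2]"), "bad.json")
    message = ""
except ValueError as exc:
    message = str(exc)
```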

dargs/dargs.py Outdated
Comment on lines +1177 to +1179
loaded = _load_ref(ref_path)
# Merge: loaded content as base, local keys take precedence
merged = {**loaded, **d}
Copilot AI Feb 24, 2026

_load_ref() is documented to return a dict, but json.load() / yaml.safe_load() can legally return non-dicts (e.g., list/scalar) or None (empty YAML). This will raise at merge time ({**loaded, **d}) with a TypeError instead of a clear error. Validate the loaded value is a dict and raise a ValueError with a helpful message when the referenced file doesn't contain a mapping at the top level.

)


def _resolve_ref(d: dict, allow_ref: bool = False) -> None:
Copilot AI Feb 24, 2026

_resolve_ref() only resolves a single $ref occurrence and does not re-run if the loaded content itself contains $ref. This can leave $ref keys in the final merged dict, which can break strict-mode checks (undefined key) and prevents ref-chaining. Consider resolving in a loop (while "$ref" in d) and add cycle detection (e.g., track visited ref paths) to avoid infinite self-references.

dargs/dargs.py Outdated
Comment on lines +1176 to +1183
ref_path = d.pop("$ref")
loaded = _load_ref(ref_path)
# Merge: loaded content as base, local keys take precedence
merged = {**loaded, **d}
d.clear()
d.update(merged)


Copilot AI Feb 24, 2026

_resolve_ref() only resolves a single $ref occurrence and does not re-run if the loaded content itself contains $ref. This can leave $ref keys in the final merged dict, which can break strict-mode checks (undefined key) and prevents ref-chaining. Consider resolving in a loop (while "$ref" in d) and add cycle detection (e.g., track visited ref paths) to avoid infinite self-references.

Suggested change
-    ref_path = d.pop("$ref")
-    loaded = _load_ref(ref_path)
-    # Merge: loaded content as base, local keys take precedence
-    merged = {**loaded, **d}
-    d.clear()
-    d.update(merged)
+    # Track visited refs to avoid infinite loops in case of cyclic references.
+    visited_refs: set[str] = set()
+    # Resolve chained `$ref`s until no more are present at the top level.
+    while "$ref" in d:
+        ref_path = d.pop("$ref")
+        if ref_path in visited_refs:
+            raise ValueError(f"Cyclic $ref detected for path: {ref_path!r}")
+        visited_refs.add(ref_path)
+        loaded = _load_ref(ref_path)
+        if not isinstance(loaded, dict):
+            raise ValueError(
+                f"Referenced file {ref_path!r} must contain a mapping/object."
+            )
+        # Merge: loaded content as base, local keys take precedence
+        merged = {**loaded, **d}
+        d.clear()
+        d.update(merged)
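
To see the loop and the cycle check behave end to end, here is a self-contained sketch that follows the suggestion's logic but reads JSON directly instead of calling `_load_ref`. The function name `resolve_chain` and the temporary files are illustrative assumptions, not part of the PR.

```python
import json
import os
import tempfile


def resolve_chain(d: dict) -> None:
    """Resolve chained "$ref" keys in place, rejecting cycles (JSON only)."""
    visited: set[str] = set()
    while "$ref" in d:
        ref_path = d.pop("$ref")
        if ref_path in visited:
            raise ValueError(f"Cyclic $ref detected for path: {ref_path!r}")
        visited.add(ref_path)
        with open(ref_path, encoding="utf-8") as f:
            loaded = json.load(f)
        # Loaded content is the base; keys already in d take precedence.
        merged = {**loaded, **d}
        d.clear()
        d.update(merged)


with tempfile.TemporaryDirectory() as tmp:
    a = os.path.join(tmp, "a.json")
    b = os.path.join(tmp, "b.json")
    c = os.path.join(tmp, "c.json")
    for path, content in [
        (a, {"$ref": b, "x": 1}),  # a.json points at b.json
        (b, {"x": 0, "y": 2}),
        (c, {"$ref": c}),  # c.json points at itself
    ]:
        with open(path, "w", encoding="utf-8") as f:
            json.dump(content, f)

    chained = {"$ref": a, "z": 3}
    resolve_chain(chained)

    try:
        resolve_chain({"$ref": c})
        cycle_error = ""
    except ValueError as exc:
        cycle_error = str(exc)
```

In the chained case, `x` comes from `a.json` rather than `b.json`, because keys gathered earlier in the chain count as local when the next file is merged.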

dargs/dargs.py Outdated
"""
ext = os.path.splitext(ref_path)[1].lower()
if ext == ".json":
    with open(ref_path) as f:
Copilot AI Feb 24, 2026

File reads in _load_ref() rely on platform default encoding. To make behavior consistent across environments (especially for YAML), open the referenced files with an explicit encoding (commonly encoding="utf-8").
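
A small sketch of the point about encodings, using a temporary JSON file with a non-ASCII value. With `encoding="utf-8"` given on both the write and the read, the round trip does not depend on the platform's locale; the file name and content are made up for illustration.

```python
import json
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "ref.json")
    # ensure_ascii=False writes the character itself, not a \u escape.
    with open(path, "w", encoding="utf-8") as f:
        json.dump({"unit": "Å"}, f, ensure_ascii=False)

    # Explicit encoding: the result does not depend on the locale default.
    with open(path, encoding="utf-8") as f:
        loaded = json.load(f)
```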

dargs/dargs.py Outdated
        "pyyaml is required to load YAML files referenced by $ref. "
        "Install it with: pip install pyyaml"
    ) from e
with open(ref_path) as f:
Copilot AI Feb 24, 2026

File reads in _load_ref() rely on platform default encoding. To make behavior consistent across environments (especially for YAML), open the referenced files with an explicit encoding (commonly encoding="utf-8").

    and isinstance(self.arg, Argument)
    and not (self.arg.repeat and not self.repeat)
):
    _resolve_ref(self.data, self.allow_ref)
Copilot AI Feb 24, 2026

ArgumentData resolves $ref by mutating self.data in place during notebook rendering. For a display/helper API (JSON() / print_html()), mutating the caller’s original data can be surprising. Consider copying the dict before resolving refs (or resolving into a derived structure) so rendering has no side effects.

@njzjz
Member

njzjz commented Feb 24, 2026

@copilot apply changes based on the comments in this thread

…ection, no mutation

Co-authored-by: njzjz <9496702+njzjz@users.noreply.github.com>
Contributor Author

Copilot AI commented Feb 24, 2026

@copilot apply changes based on the comments in this thread

Done in 193d0f9. Applied all review comments:

  • _load_ref: explicit encoding="utf-8" on file opens; validates loaded value is a dict and raises a clear ValueError otherwise
  • _resolve_ref: resolves chained $ref values in a loop; detects cyclic references and raises ValueError
  • check/check_value: deepcopy input when allow_ref=True so the caller's data is never mutated
  • notebook.py: copy self.data before calling _resolve_ref in _init_subdata to avoid side effects

New tests added for: no-mutation of caller data, non-dict file content, cyclic ref detection, and chained $ref resolution.

@njzjz
Member

njzjz commented Feb 24, 2026

pre-commit.ci autofix

@njzjz njzjz changed the title from "Support $ref to load from external JSON/YAML file" to "feat: Support $ref to load from external JSON/YAML file" Feb 24, 2026
@njzjz njzjz merged commit 52a3484 into master Feb 24, 2026
14 checks passed
@njzjz njzjz deleted the copilot/support-ref-external-json branch February 24, 2026 09:59
Development

Successfully merging this pull request may close these issues.

Support $ref to load from external JSON file

4 participants