Fix from_serializable_dict to ignore plain data dicts with "type" key#1803

Merged
unclecode merged 1 commit into develop from fix/deserialize-schema-type-false-positive
Mar 7, 2026

@SohamKukreti
Collaborator

Summary

The typed-object entry condition ("type" in data) was too broad: it also matched plain business dicts that happen to carry a "type" key, such as JsonCssExtractionStrategy field specs ({"type": "text"}) and LLMExtractionStrategy JSON Schema fragments ({"type": "string"}). These were never config objects, but the deserializer tried to treat them as such, hit the ALLOWED_DESERIALIZE_TYPES allowlist, and raised a ValueError — causing /crawl to return HTTP 500 for perfectly valid extraction-strategy payloads.

Fix: narrow the entry condition to require "params" (or "type":"dict" + "value"), matching only the shapes that to_serializable_dict() actually produces. Dicts with "type" but no "params"/"value" fall through to the raw-dict path and are passed as plain data.
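A minimal sketch of the narrowed entry condition, for illustration only. The helper name `is_typed_payload` and the sample payloads are hypothetical and are not taken from the patch; the actual check lives inside from_serializable_dict and may be structured differently.

```python
from typing import Any


def is_typed_payload(data: Any) -> bool:
    """Hypothetical predicate mirroring the narrowed entry condition."""
    if not isinstance(data, dict) or "type" not in data:
        return False
    # Serialized objects carry "type" plus "params".
    if "params" in data:
        return True
    # Serialized dicts carry "type": "dict" plus "value".
    return data["type"] == "dict" and "value" in data


# Plain data that merely happens to contain a "type" key.
field_spec = {"name": "title", "selector": "h1", "type": "text"}
schema_fragment = {"type": "string"}

# Shapes of the kind to_serializable_dict() is described as producing.
config_object = {"type": "BrowserConfig", "params": {"headless": True}}
wrapped_dict = {"type": "dict", "value": {"type": "text"}}
```

Under this predicate, `field_spec` and `schema_fragment` are treated as plain data, while `config_object` and `wrapped_dict` still enter the typed path.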

The RCE protection from commit 0104db6 is fully preserved: any real class-instantiation attack still requires "type" + "params", still enters the typed path, and is still blocked by the allowlist.
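The interaction between the narrowed condition and the allowlist can be sketched with a toy deserializer. Everything here is an assumption made for illustration: `DemoConfig`, `ALLOWED`, and `deserialize` are hypothetical stand-ins, and the real from_serializable_dict and ALLOWED_DESERIALIZE_TYPES may differ in contents and control flow (for example, in whether the raw-dict path recurses).

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class DemoConfig:
    """Hypothetical config class used only for this sketch."""
    headless: bool = True


# Toy stand-in for ALLOWED_DESERIALIZE_TYPES (real contents not shown in this PR).
ALLOWED = {"DemoConfig": DemoConfig}


def deserialize(data: Any) -> Any:
    if isinstance(data, list):
        return [deserialize(item) for item in data]
    if not isinstance(data, dict):
        return data
    # Typed path 1: a serialized dict, {"type": "dict", "value": {...}}.
    if data.get("type") == "dict" and "value" in data:
        return {k: deserialize(v) for k, v in data["value"].items()}
    # Typed path 2: a serialized object, "type" plus "params".
    if "type" in data and "params" in data:
        name = data["type"]
        if not isinstance(name, str) or name not in ALLOWED:
            raise ValueError(f"type {name!r} is not allowed")
        params = {k: deserialize(v) for k, v in data["params"].items()}
        return ALLOWED[name](**params)
    # Raw-dict path: "type" without "params"/"value" is plain data.
    return {k: deserialize(v) for k, v in data.items()}
```

In this sketch, `{"type": "text"}` is returned unchanged, an allowlisted `"type"` + `"params"` payload is instantiated, and a payload naming a class outside the allowlist still raises ValueError.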

Fixes #1797

How Has This Been Tested?

  • Ran deploy/docker/tests/run_security_tests.py
  • Ran deploy/docker/tests/test_security_fixes.py
  • Ran tests/docker_example.py for testing CSS Extraction and LLM extraction with docker

Checklist:

  • My code follows the style guidelines of this project
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have added/updated unit tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

@unclecode unclecode merged commit b008671 into develop Mar 7, 2026
1 check passed
@unclecode
Owner

Thanks @SohamKukreti — merged into develop, will be in the next release. Clean fix for the deserialization false positive. We've added you to CONTRIBUTORS.md.
