Fix `from_serializable_dict` to ignore plain data dicts with "type" key by SohamKukreti · Pull Request #1803 · unclecode/crawl4ai

SohamKukreti · 2026-03-06T07:49:01Z

Summary

The typed-object entry condition ("type" in data) was too broad: it also matched plain business dicts that happen to carry a "type" key, such as JsonCssExtractionStrategy field specs ({"type": "text"}) and LLMExtractionStrategy JSON Schema fragments ({"type": "string"}). These were never config objects, but the deserializer tried to treat them as such, hit the ALLOWED_DESERIALIZE_TYPES allowlist, and raised a ValueError — causing /crawl to return HTTP 500 for perfectly valid extraction-strategy payloads.

Fix: narrow the entry condition to require "params" (or "type":"dict" + "value"), matching only the shapes that to_serializable_dict() actually produces. Dicts with "type" but no "params"/"value" fall through to the raw-dict path and are passed as plain data.
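A minimal sketch of the narrowed entry condition, for illustration only. It assumes the two serialized shapes are `{"type": <name>, "params": {...}}` and `{"type": "dict", "value": {...}}` as described above; `is_typed_envelope` is a hypothetical helper name and not a function in crawl4ai, and the real check in `from_serializable_dict` may be structured differently.

```python
def is_typed_envelope(data):
    """Return True only for shapes that look like serializer output.

    Hypothetical helper for illustration; not part of crawl4ai.
    """
    if not isinstance(data, dict) or "type" not in data:
        return False
    # Typed object envelope: {"type": "<ClassName>", "params": {...}}
    if "params" in data:
        return True
    # Wrapped plain dict: {"type": "dict", "value": {...}}
    return data["type"] == "dict" and "value" in data


# Plain data that merely carries a "type" key is no longer treated as typed.
css_field_spec = {"type": "text"}        # JsonCssExtractionStrategy field spec
json_schema_fragment = {"type": "string"}  # LLMExtractionStrategy schema fragment
print(is_typed_envelope(css_field_spec))        # False
print(is_typed_envelope(json_schema_fragment))  # False

# Shapes with "params", or "type": "dict" plus "value", still qualify.
print(is_typed_envelope({"type": "SomeConfig", "params": {}}))       # True
print(is_typed_envelope({"type": "dict", "value": {"type": "text"}}))  # True
```

Under the old condition (`"type" in data`), the first two payloads would have entered the typed path and been rejected by the allowlist.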

The RCE protection from commit 0104db6 is fully preserved: any real class-instantiation attack still requires "type" + "params", still enters the typed path, and is still blocked by the allowlist.
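The argument above can be sketched as a small routing function. This is an assumption-laden illustration rather than the project's code: `route` is a hypothetical name, `ALLOWED_TYPES` is a stand-in for ALLOWED_DESERIALIZE_TYPES with placeholder entries, and the order of the checks is a guess. It only shows that a payload carrying "type" plus "params" still reaches the allowlist, while plain data does not.

```python
# Stand-in for ALLOWED_DESERIALIZE_TYPES; the entries here are placeholders.
ALLOWED_TYPES = frozenset({"BrowserConfig", "CrawlerRunConfig"})


def route(data):
    """Classify a payload as 'typed', 'dict', or 'raw'.

    Hypothetical sketch: raises ValueError when a typed envelope names
    a class that is not on the allowlist.
    """
    if isinstance(data, dict) and "type" in data:
        type_name = data["type"]
        # Wrapped plain dict produced by the serializer.
        if type_name == "dict" and "value" in data:
            return "dict"
        # Typed object envelope: must pass the allowlist before instantiation.
        if "params" in data:
            if not isinstance(type_name, str) or type_name not in ALLOWED_TYPES:
                raise ValueError(f"Deserialization of type {type_name!r} is not allowed")
            return "typed"
    # Everything else, including {"type": "text"}, is passed through as data.
    return "raw"


print(route({"type": "text"}))                         # raw
print(route({"type": "BrowserConfig", "params": {}}))  # typed

try:
    route({"type": "SomeDangerousClass", "params": {"cmd": "id"}})
except ValueError as exc:
    print("blocked:", exc)
```

An attacker who drops "params" to avoid the allowlist gains nothing in this model, because the payload then takes the raw path and no class is instantiated.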

Fixes #1797

How Has This Been Tested?

Ran deploy/docker/tests/run_security_tests.py
Ran deploy/docker/tests/test_security_fixes.py
Ran tests/docker_example.py for testing CSS Extraction and LLM extraction with docker

Checklist:

My code follows the style guidelines of this project
I have performed a self-review of my own code
I have commented my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation
I have added/updated unit tests that prove my fix is effective or that my feature works
New and existing unit tests pass locally with my changes


unclecode · 2026-03-07T03:44:12Z

Thanks @SohamKukreti — merged into develop, will be in the next release. Clean fix for the deserialization false positive. We've added you to CONTRIBUTORS.md.

unclecode added a commit that referenced this pull request Mar 7, 2026

Update PR-TODOLIST and CONTRIBUTORS for merged PRs #1805, #1763, #1803

d458890

unclecode merged commit b008671 into develop Mar 7, 2026
1 check passed

unclecode merged 1 commit into develop from fix/deserialize-schema-type-false-positive
