# SchemaForge: Intelligent JSON Schema Discovery & Data Transformation
Features • Installation • Quick Start • Documentation • Use Cases
SchemaForge is a schema-first data pipeline tool that automatically discovers JSON structures and converts them to analytics-ready formats. Stop wasting time on manual schema definitions and data wrangling—let SchemaForge do the heavy lifting.
```
Traditional Workflow:          SchemaForge Workflow:
─────────────────────          ─────────────────────
📄 JSON Files                  📄 JSON Files
      ↓                              ↓
⚙️ Manual Analysis             🔍 Automatic Scan
      ↓                              ↓
📝 Write Schemas               📊 Schema Report
      ↓                              ↓
💻 Write Code                  🔨 One Command
      ↓                              ↓
🐛 Debug Type Errors           ✅ Parquet/CSV
      ↓
⏰ Hours Later...
      ↓
✅ Parquet/CSV

Time:   Hours → Minutes
Errors: Many  → Zero
```
- Advanced Type Detection: Strings, integers, floats, booleans, timestamps, URLs, emails, UUIDs, IP addresses, arrays, objects
- Smart String Analysis: Detects URLs, email addresses, UUIDs, IP addresses, and numeric strings
- Enhanced Timestamp Detection: Supports ISO dates, Unix timestamps, and multiple date formats
- Enum Detection: Automatically identifies fields with limited distinct values (enum-like fields)
- Statistical Analysis: Collects min/max values for numbers, length statistics for strings
- Nested Structure Handling: Flattens nested JSON with dot notation (`user.address.city`)
- Nullable Field Detection: Identifies which fields can be null
- Mixed Type Recognition: Detects and reports inconsistent types across records
- Embedded JSON Parsing: Automatically detects and parses JSON strings embedded in fields
Handles 11+ JSON formats automatically:
- ✅ Standard JSON Arrays: `[{...}, {...}]`
- ✅ NDJSON (Newline-Delimited): One object per line
- ✅ Wrapper Objects: `{"data": [...]}`, `{"results": [...]}`, etc.
- ✅ Array-Based Tabular: Socrata/OpenData format with metadata
- ✅ GeoJSON: FeatureCollection format
- ✅ Single Objects: Single-record datasets
- ✅ Python Literal Format: `{'key': 'value'}` with single quotes (Python dict/list syntax)
- ✅ Embedded JSON Strings: JSON stored as string values (auto-parsed)
- ✅ Numeric Strings: String values that represent numbers (auto-detected)
- ✅ Mixed Format Files: Handles files with inconsistent structures
- ✅ JSON with Comments: Basic support for comment-like structures
- Scan once → Generate comprehensive schema reports
- Review → Human-readable Markdown + machine-readable JSON
- Convert everywhere → Consistent schemas across all conversions
- Robust Error Handling: Graceful failures, detailed logging
- Sampling Support: Process large files efficiently
- Batch Processing: Convert multiple files in one command
- Type Coercion: Intelligent type conversion with fallbacks
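The type-coercion-with-fallback behavior can be pictured as follows. This is a simplified sketch, not SchemaForge's actual implementation; `coerce` is an illustrative name:

```python
def coerce(value, target_type):
    """Best-effort coercion to a target type, falling back to the raw string."""
    coercers = {
        "integer": int,
        "float": float,
        "boolean": lambda v: str(v).strip().lower() in ("true", "1", "yes"),
    }
    try:
        return coercers.get(target_type, str)(value)
    except (ValueError, TypeError):
        return str(value)  # fallback: keep the raw value as text

print(coerce("42", "integer"))   # 42
print(coerce("n/a", "integer"))  # 'n/a' (fallback)
```

The point of the fallback is that a single malformed value downgrades gracefully instead of aborting the whole conversion.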
- Markdown Report: Beautiful, human-readable documentation
- JSON Report: Machine-readable schema for programmatic use
- Python 3.8 or higher
- pip package manager
```bash
# Clone the repository
git clone https://github.com/yourusername/schemaforge.git
cd schemaforge

# Install required packages
pip install -r requirements.txt
```

`requirements.txt`:

```
pandas>=2.0.0    # Data manipulation
pyarrow>=12.0.0  # Parquet support
pytest>=7.0.0    # Testing framework
```
```bash
# Copy your JSON files to the data directory
cp your_data/*.json data/

# Scan all JSON files and generate schema reports
python -m src.cli scan-schemas
```

Output:
- `reports/schema_report.md` - Beautiful, human-readable report
- `reports/schema_report.json` - Machine-readable schema definitions

```bash
# Check the generated report
cat reports/schema_report.md

# Convert to Parquet (recommended for analytics)
python -m src.cli convert --format parquet

# Or convert to CSV
python -m src.cli convert --format csv
```

That's it! Your data is now in the `output/` directory, ready for analysis.
```bash
python -m src.cli scan-schemas [OPTIONS]
```

Options:

| Option | Description | Default |
|---|---|---|
| `--data-dir` | Input directory containing JSON files | `data` |
| `--output-report` | Path for Markdown report | `reports/schema_report.md` |
| `--max-sample-size` | Max records to analyze per file | All records |
| `--sampling-strategy` | Sampling method: `first` or `random` | `first` |
Examples:

```bash
# Basic usage
python -m src.cli scan-schemas

# Analyze only first 1000 records per file
python -m src.cli scan-schemas --max-sample-size 1000

# Use random sampling for better representation
python -m src.cli scan-schemas --sampling-strategy random --max-sample-size 500

# Custom data directory
python -m src.cli scan-schemas --data-dir my_json_data
```

```bash
python -m src.cli convert --format [parquet|csv] [OPTIONS]
```

Options:

| Option | Description | Default |
|---|---|---|
| `--format` | Output format: `parquet` or `csv` | Required |
| `--data-dir` | Input directory | `data` |
| `--output-dir` | Output directory | `output` |
| `--schema-report` | JSON schema report path | `reports/schema_report.json` |
Examples:

```bash
# Convert to Parquet
python -m src.cli convert --format parquet

# Convert to CSV with custom directories
python -m src.cli convert --format csv \
    --data-dir my_data \
    --output-dir csv_output

# Use custom schema report
python -m src.cli convert --format parquet \
    --schema-report custom_schemas/report.json
```

1️⃣ Standard JSON Array
```json
[
  {"id": 1, "name": "Alice", "age": 30},
  {"id": 2, "name": "Bob", "age": 25}
]
```

Use case: Most common JSON format from APIs and exports
2️⃣ Newline-Delimited JSON (NDJSON)
```
{"id": 1, "name": "Alice", "age": 30}
{"id": 2, "name": "Bob", "age": 25}
```
Use case: Log files, streaming data, large datasets
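Detecting NDJSON typically works as a fallback after standard JSON parsing fails on the whole document. A minimal sketch of that idea (`load_json_records` is an illustrative name, not SchemaForge's actual loader):

```python
import json

def load_json_records(text: str):
    """Try a standard JSON document first, then fall back to NDJSON."""
    try:
        doc = json.loads(text)
        return doc if isinstance(doc, list) else [doc]
    except json.JSONDecodeError:
        # NDJSON: one object per non-empty line
        return [json.loads(line) for line in text.splitlines() if line.strip()]

ndjson = '{"id": 1}\n{"id": 2}\n'
print(load_json_records(ndjson))  # [{'id': 1}, {'id': 2}]
```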
3️⃣ Wrapper Objects
```json
{
  "data": [
    {"id": 1, "name": "Alice"},
    {"id": 2, "name": "Bob"}
  ],
  "metadata": {...}
}
```

Auto-detected fields: `data`, `results`, `items`, `records`, `rows`, `entries`
Use case: API responses with metadata
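Unwrapping these payloads amounts to probing the known wrapper keys for a list value. A simplified sketch, assuming the key list above (`unwrap` is an illustrative name):

```python
WRAPPER_KEYS = ("data", "results", "items", "records", "rows", "entries")

def unwrap(doc: dict):
    """Return the first list found under a known wrapper key, else treat the doc as one record."""
    for key in WRAPPER_KEYS:
        if isinstance(doc.get(key), list):
            return doc[key]
    return [doc]

payload = {"data": [{"id": 1}], "metadata": {"count": 1}}
print(unwrap(payload))  # [{'id': 1}]
```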
4️⃣ Array-Based Tabular Data
```json
{
  "meta": {
    "view": {
      "columns": [
        {"name": "id", "fieldName": "id", "dataTypeName": "number"},
        {"name": "name", "fieldName": "name", "dataTypeName": "text"}
      ]
    }
  },
  "data": [
    [1, "Alice"],
    [2, "Bob"]
  ]
}
```

Use case: Socrata, CKAN, and other open data portals
Features:
- ✅ Extracts column definitions from metadata
- ✅ Converts arrays to objects using column names
- ✅ Skips hidden/meta columns automatically
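The array-to-object conversion boils down to zipping each row with the column names from the metadata. A minimal sketch (hidden-column handling omitted; `tabular_to_records` is an illustrative name, not SchemaForge's API):

```python
def tabular_to_records(doc: dict):
    """Zip Socrata-style row arrays with column names from the metadata block."""
    columns = doc["meta"]["view"]["columns"]
    names = [c["fieldName"] for c in columns]
    return [dict(zip(names, row)) for row in doc["data"]]

doc = {
    "meta": {"view": {"columns": [
        {"name": "id", "fieldName": "id", "dataTypeName": "number"},
        {"name": "name", "fieldName": "name", "dataTypeName": "text"},
    ]}},
    "data": [[1, "Alice"], [2, "Bob"]],
}
print(tabular_to_records(doc))  # [{'id': 1, 'name': 'Alice'}, {'id': 2, 'name': 'Bob'}]
```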
5️⃣ GeoJSON Format
```json
{
  "type": "FeatureCollection",
  "features": [
    {
      "type": "Feature",
      "properties": {"name": "Location 1", "value": 100},
      "geometry": {"type": "Point", "coordinates": [-122.4, 37.8]}
    }
  ]
}
```

Use case: Geographic data from mapping APIs
Note: Extracts the `properties` field from each feature
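That extraction step can be sketched as follows (`geojson_to_records` is an illustrative name, not the library's API):

```python
def geojson_to_records(doc: dict):
    """Pull the properties object out of each feature in a FeatureCollection."""
    return [f.get("properties", {}) for f in doc.get("features", [])]

fc = {"type": "FeatureCollection", "features": [
    {"type": "Feature",
     "properties": {"name": "Location 1", "value": 100},
     "geometry": {"type": "Point", "coordinates": [-122.4, 37.8]}},
]}
print(geojson_to_records(fc))  # [{'name': 'Location 1', 'value': 100}]
```

Note that geometry is dropped in this sketch; only the tabular `properties` data survives.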
6️⃣ Single JSON Object
```json
{
  "id": 1,
  "name": "Alice",
  "address": {
    "city": "New York",
    "zip": "10001"
  }
}
```

Use case: Configuration files, single-record exports
Note: Treated as a single-record dataset
SchemaForge detects these types automatically:
| Type | Description | Example |
|---|---|---|
| `string` | Text data | `"Alice"` |
| `integer` | Whole numbers | `42` |
| `float` | Decimal numbers | `3.14` |
| `boolean` | True/false | `true` |
| `timestamp` | Date/time strings | `"2023-01-01T10:00:00Z"`, `"2023/01/01"`, Unix timestamps |
| `url` | Web URLs | `"https://example.com"` |
| `email` | Email addresses | `"user@example.com"` |
| `uuid` | UUID identifiers | `"550e8400-e29b-41d4-a716-446655440000"` |
| `ip_address` | IP addresses (IPv4/IPv6) | `"192.168.1.1"`, `"2001:db8::1"` |
| `numeric_string` | String values representing numbers | `"123"`, `"45.67"` |
| `json_string` | Embedded JSON stored as string | `"{\"key\": \"value\"}"` |
| `array<T>` | Lists of values | `["a", "b", "c"]` |
| `object` | Nested structures | `{"key": "value"}` |
Nested objects are flattened with dot notation:
Input:

```json
{
  "user": {
    "name": "Alice",
    "address": {
      "city": "NYC"
    }
  }
}
```

Output columns:
- `user.name` (string)
- `user.address.city` (string)
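Dot-notation flattening is a small recursion over nested dicts. A minimal sketch (illustrative, not the exact implementation):

```python
def flatten(obj: dict, prefix: str = "") -> dict:
    """Flatten nested dicts into a single level with dot-notation keys."""
    flat = {}
    for key, value in obj.items():
        path = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, f"{path}."))
        else:
            flat[path] = value
    return flat

record = {"user": {"name": "Alice", "address": {"city": "NYC"}}}
print(flatten(record))  # {'user.name': 'Alice', 'user.address.city': 'NYC'}
```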
Fields containing null values are marked as nullable in the schema.
SchemaForge automatically collects statistical information for each field:
- Numeric Statistics: Min/max values for integer and float fields
- String Statistics: Min/max/average length for string fields
- Enum Detection: Fields with limited distinct values (≤20) are flagged as enum-like
- Value Distribution: Distinct value sets are tracked for enum detection
Example Report Output:
| Field Name | Type | Statistics | Notes |
|------------|------|------------|-------|
| `age` | integer | min: 18, max: 65 | nullable |
| `email` | email | len: 10-50 (avg: 25.3) | nullable |
| `status` | string | enum: active, inactive, pending | enum-like |
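The enum heuristic reduces to counting distinct non-null values against the threshold of 20 mentioned above. A simplified sketch (`looks_like_enum` is an illustrative name):

```python
def looks_like_enum(values, max_distinct: int = 20) -> bool:
    """Flag a field as enum-like when it has few distinct non-null values."""
    distinct = {v for v in values if v is not None}
    return 0 < len(distinct) <= max_distinct

statuses = ["active", "inactive", "pending", "active"] * 250
print(looks_like_enum(statuses))  # True
```

A field like a free-text comment, with hundreds of distinct values, would fail this check and stay a plain `string`.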
Problem: Building data pipelines with inconsistent JSON from multiple sources
Solution: Automatic schema discovery and Parquet conversion
Benefit: 80% faster pipeline development, consistent data types
```bash
# Example workflow
python -m src.cli scan-schemas --data-dir api_exports/
python -m src.cli convert --format parquet --output-dir data_lake/
```

Problem: Diverse JSON datasets from experiments, surveys, APIs
Solution: One-command conversion to analysis-ready formats
Benefit: More time for research, less time on data wrangling
Example use cases:
- Social media data analysis
- Scientific instrument outputs
- Survey response processing
- Open data portal research
Problem: Socrata/CKAN array-based format is difficult to work with
Solution: Automatic column extraction and conversion
Benefit: Easy access to government and public datasets
Supported portals:
- data.gov datasets
- City open data portals
- Research institution repositories
Problem: REST APIs return JSON in various formats
Solution: Schema-first approach ensures consistency
Benefit: Reliable data integration into warehouses
Problem: Need efficient storage format for JSON in data lakes
Solution: Convert to Parquet with preserved schemas
Benefit: Better compression, faster queries, lower costs
Problem: Migrating from JSON-based systems to columnar formats
Solution: Intelligent schema inference preserves data semantics
Benefit: Accurate migrations without data loss
```
schemaforge/
├── 📁 data/                    # Input JSON files (place your data here)
│   └── *.json                  # Your JSON data files
├── 📁 output/                  # Converted output files
│   ├── *.parquet               # Parquet output files
│   └── *.csv                   # CSV output files
├── 📁 reports/                 # Generated schema reports
│   ├── schema_report.md        # Human-readable report
│   └── schema_report.json      # Machine-readable report
├── 📁 src/
│   ├── __init__.py
│   ├── schema_reader.py        # Schema inference engine
│   ├── converter.py            # Format conversion module
│   └── cli.py                  # Command-line interface
├── 📁 tests/
│   ├── test_schema_reader.py
│   └── test_converter.py
├── README.md
├── requirements.txt
└── pytest.ini
```
For very large JSON files, use sampling:

```bash
# Analyze first 10,000 records only
python -m src.cli scan-schemas --max-sample-size 10000

# Random sample for better representation
python -m src.cli scan-schemas \
    --sampling-strategy random \
    --max-sample-size 5000
```

Use SchemaForge as a Python library:
```python
from src.schema_reader import SchemaReader
from src.converter import Converter
from pathlib import Path

# Discover schemas
reader = SchemaReader(
    data_dir=Path("data"),
    max_sample_size=1000
)
schemas = reader.scan_directory()
reader.generate_report(schemas, output_path=Path("reports/schema.md"))

# Convert with schema
converter = Converter(
    data_dir=Path("data"),
    output_dir=Path("output")
)
converter.convert_all(format="parquet", schema_report_path=Path("reports/schema.json"))
```

Extend type inference for custom formats:
```python
from src.schema_reader import SchemaReader

class CustomSchemaReader(SchemaReader):
    def _infer_type(self, value):
        # Add custom type detection
        if isinstance(value, str) and value.startswith("http"):
            return "url"
        return super()._infer_type(value)
```

Run the full test suite:
```bash
# Run all tests
pytest tests/

# Run with verbose output
pytest tests/ -v

# Run specific test file
pytest tests/test_schema_reader.py

# Run with coverage
pytest tests/ --cov=src --cov-report=html
```

```
┌─────────────────────────────────────────────────────┐
│ Phase 1: Schema Discovery                           │
│                                                     │
│ JSON Files → Format Detection → Schema Inference    │
│     ↓              ↓                  ↓             │
│ Load Data → Extract Columns → Analyze Types         │
│                      ↓                              │
│              Generate Reports                       │
│             (Markdown + JSON)                       │
└─────────────────────────────────────────────────────┘
                       ↓
┌─────────────────────────────────────────────────────┐
│ Phase 2: Conversion                                 │
│                                                     │
│ Schema Report → Load Data → Apply Schema            │
│     ↓              ↓             ↓                  │
│ Type Coercion → Flatten → Convert Format            │
│                      ↓                              │
│              Parquet/CSV Output                     │
└─────────────────────────────────────────────────────┘
```
```
              ┌──────────────┐
              │  CLI Layer   │ ← User commands (scan-schemas, convert)
              └──────┬───────┘
                     │
   ├─────────────────┼─────────────────┤
   ▼                 ▼                 ▼
┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│Schema Reader │ │  Converter   │ │ JSON Loader  │
└──────────────┘ └──────────────┘ └──────────────┘
   │                 │                 │
   └─────────────────┴─────────────────┘
                     ▼
              ┌──────────────┐
              │  JSON Files  │
              └──────────────┘
```
| Limitation | Description | Workaround |
|---|---|---|
| Memory Usage | Large files loaded into memory | Use `--max-sample-size` for schema inference |
| Array of Objects | Stored as JSON strings in output | Design choice for flat file compatibility |
| Type Coercion | Best-effort conversion | Manual validation recommended |
| Timestamp Detection | Pattern-based recognition | May miss custom formats |
| Encoding | Assumes UTF-8 | Convert files to UTF-8 first |
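For the `random` sampling strategy on large inputs, reservoir sampling is the standard way to draw a uniform sample in one pass with bounded memory. A generic sketch (not necessarily how SchemaForge implements it):

```python
import random

def reservoir_sample(iterable, k: int, seed: int = 0):
    """Uniform random sample of k items from a stream, using O(k) memory."""
    rng = random.Random(seed)
    sample = []
    for i, item in enumerate(iterable):
        if i < k:
            sample.append(item)          # fill the reservoir
        else:
            j = rng.randint(0, i)        # replace with decreasing probability
            if j < k:
                sample[j] = item
    return sample

print(len(reservoir_sample(range(1_000_000), 5)))  # 5
```

Because each record is seen exactly once, this works even when the input cannot fit in memory.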
We welcome contributions! Here are some ideas:
- Avro and ORC output formats
- Schema validation against inferred schemas
- Streaming processing for very large files
- Schema versioning and migration tools
- Database export capabilities (PostgreSQL, MySQL)
- GUI/Web interface
- Docker support
- Schema diff tool
- Add DuckDB conversion
- Add ML task feature
- Workload testing & benchmarking
1. Fork the repository
2. Create a feature branch (`git checkout -b feature/amazing-feature`)
3. Commit your changes (`git commit -m 'Add amazing feature'`)
4. Push to the branch (`git push origin feature/amazing-feature`)
5. Open a Pull Request
This project is licensed under the MIT License - see the LICENSE file for details.
Built with love for:
- Data engineers struggling with inconsistent JSON
- Researchers drowning in data wrangling
- Developers tired of manual schema definitions
- Anyone who's ever said "I wish this JSON had a schema"
- Documentation: Full documentation
- Issues: GitHub Issues
- Discussions: GitHub Discussions
If SchemaForge saved you time, consider giving it a star! ⭐
Made with 🔨 by developers, for developers