feat: add France Market Scanner with INPI data loader#1
Open
Acid3croco wants to merge 5 commits into master
Add complete data collection system for French company analysis:

## Data Sources
- SIRENE (INSEE): 29M companies, 42M establishments
- BODACC: Legal announcements with date windowing (7-day chunks)
- INPI: Annual accounts via data.cquest.org mirror (2017-2023)

## Features
- DuckDB database with optimized schema
- CLI commands for download, load, sync operations
- Support for both Complete (C) and Simplified (S) bilan types
- XML parser for INPI liasse fiscale codes
- Automatic date windowing to handle API limits

## Key Files
- cli.py: Click-based CLI interface
- src/extractors/inpi.py: INPI data loader with mirror support
- src/extractors/bodacc.py: BODACC API client with windowing
- src/extractors/sirene.py: SIRENE bulk data loader
- src/core/database.py: DuckDB schema and manager

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
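The 7-day date windowing mentioned for BODACC could look roughly like the sketch below. It is an illustration only, not the code in src/extractors/bodacc.py: the function name `date_windows` and its parameters are assumptions, and the actual API call is left out.

```python
from datetime import date, timedelta
from typing import Iterator, Tuple


def date_windows(start: date, end: date, days: int = 7) -> Iterator[Tuple[date, date]]:
    """Yield inclusive (window_start, window_end) pairs covering [start, end].

    Hypothetical helper: each window spans at most `days` calendar days, so a
    long date range can be requested from an API in small chunks.
    """
    if days < 1:
        raise ValueError("days must be >= 1")
    cursor = start
    while cursor <= end:
        # Clamp the last window so it never runs past the requested end date.
        window_end = min(cursor + timedelta(days=days - 1), end)
        yield cursor, window_end
        cursor = window_end + timedelta(days=1)


# Example: 20 days split into 7-day chunks, the last one shorter.
windows = list(date_windows(date(2024, 1, 1), date(2024, 1, 20)))
for window_start, window_end in windows:
    print(window_start.isoformat(), "->", window_end.isoformat())
```

Each window would then be passed as the date filter of one request, so no single call exceeds the API's result limit.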
- Document SIRENE vs INPI data reliability (payroll > employee brackets)
- Add profit/payroll ratio as primary metric (no circular assumptions)
- Document holding company detection (profit > revenue = dividends)
- Add analysis scripts for finding PME gems by sector
- Best sectors: medical labs (86.90B), software publishing (58.29C)
- Note: 80% of small PMEs file confidential accounts

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
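A minimal sketch of the two rules in this commit, the profit/payroll ratio and the "profit > revenue" holding heuristic. The field names follow the key columns listed in the PR summary; the `Accounts` class, both function names, and the sample figures are made up for illustration and are not taken from the analysis scripts.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Accounts:
    """One filed annual account; None means the value is missing or confidential."""
    siren: str
    chiffre_affaires: Optional[float]   # revenue
    resultat_net: Optional[float]       # net profit
    charges_personnel: Optional[float]  # payroll


def profit_payroll_ratio(a: Accounts) -> Optional[float]:
    """Net profit divided by payroll, or None when it cannot be computed."""
    if a.resultat_net is None or a.charges_personnel is None:
        return None
    if a.charges_personnel <= 0:
        return None
    return a.resultat_net / a.charges_personnel


def looks_like_holding(a: Accounts) -> bool:
    """Heuristic from the commit: profit above revenue suggests dividend income."""
    if a.resultat_net is None or a.chiffre_affaires is None:
        return False
    return a.resultat_net > 0 and a.resultat_net > a.chiffre_affaires


# Made-up figures, for illustration only.
operating = Accounts("000000001", 1_000_000, 200_000, 400_000)
holding = Accounts("000000002", 50_000, 300_000, 20_000)

print(profit_payroll_ratio(operating), looks_like_holding(operating))
print(profit_payroll_ratio(holding), looks_like_holding(holding))
```

Filtering out likely holdings before ranking by the ratio keeps dividend-driven profits from dominating the results.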
- SIRENE unités légales: 25 columns, coverage stats
- SIRENE établissements: 36 columns, address fields
- INPI comptes: 29 columns, balance sheet + income statement
- BODACC annonces: 24 columns, event types
- Data quality summary highlighting key gaps

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
The payroll trick (employees = payroll / 70K) is mathematically circular:
- profit_per_employee = profit × 70K / payroll = just profit/payroll scaled
- Biased toward low-wage sectors (look like gems)
- Misses high-wage tech gems (look mediocre)

Bottom line: without real employee counts, we can only find high-margin or high-profit/payroll businesses, not "small teams"

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
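The circularity can be checked numerically. In the sketch below, the three companies, their figures, and their "true" headcounts are invented for illustration; only the 70K assumed salary comes from the commit message. Ranking by profit per estimated employee gives the same order as ranking by profit/payroll, and that order can invert the ranking based on real headcount.

```python
ASSUMED_SALARY = 70_000  # the flat salary assumed by the payroll trick


def profit_per_estimated_employee(profit: float, payroll: float) -> float:
    """Profit divided by the headcount estimated as payroll / 70K."""
    estimated_employees = payroll / ASSUMED_SALARY
    return profit / estimated_employees


# Hypothetical companies: (profit, payroll, true headcount).
companies = {
    "low_wage": (300_000, 600_000, 20),     # 30K average salary
    "high_wage": (900_000, 3_000_000, 20),  # 150K average salary
    "mid": (200_000, 500_000, 10),          # 50K average salary
}

by_estimate = sorted(
    companies,
    key=lambda n: profit_per_estimated_employee(companies[n][0], companies[n][1]),
    reverse=True,
)
by_ratio = sorted(companies, key=lambda n: companies[n][0] / companies[n][1], reverse=True)
by_true = sorted(companies, key=lambda n: companies[n][0] / companies[n][2], reverse=True)

print("estimated headcount:", by_estimate)
print("profit / payroll:   ", by_ratio)
print("true headcount:     ", by_true)
```

With these numbers the estimate puts the low-wage company first, while the real profit per employee puts the high-wage company first, which is the bias the commit describes.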
- Pipeline now saves both raw JSON and flattened Parquet
- Schema-agnostic flattening based on JSON structure
- Recursively flattens nested STRUCT columns
- Extracts fields from JSON string columns (jugement, acte, depot)
- Handles column name collisions with distinguishing prefixes
- ZSTD compression reduces 665MB JSON to 36MB Parquet

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
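The flattening step can be sketched in plain Python on one record. This is not the pipeline's implementation: the `flatten` function is hypothetical, the nested field names in the sample are invented (only `jugement` and `depot` come from the commit), and it always prefixes child fields with the parent name, which is a simpler rule than prefixing only on collision.

```python
import json
from typing import Any, Dict


def flatten(record: Dict[str, Any], prefix: str = "") -> Dict[str, Any]:
    """Recursively flatten nested dicts and JSON-object strings into one level."""
    flat: Dict[str, Any] = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        # A string that looks like a JSON object is parsed and treated as nested.
        if isinstance(value, str) and value.lstrip().startswith("{"):
            try:
                value = json.loads(value)
            except json.JSONDecodeError:
                pass  # not valid JSON after all: keep the raw string
        if isinstance(value, dict):
            # Prefixing with the parent name keeps e.g. jugement_date and
            # depot_date apart instead of colliding on "date".
            flat.update(flatten(value, prefix=f"{name}_"))
        else:
            flat[name] = value
    return flat


# Invented sample record; real BODACC records have different fields.
record = {
    "id": "A1",
    "jugement": '{"date": "2024-01-05", "nature": "x"}',
    "depot": {"date": "2023-12-31", "type": {"code": "C"}},
}
flat = flatten(record)
print(flat)
```

The resulting flat rows have only scalar columns, which is the shape a columnar format such as Parquet compresses well.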
Summary
Tools to identify French company "gems": profitable companies that are candidates for AI automation or acquisition.
Data Added (2.9GB Parquet files)
Key Columns
- `siren`
- `denomination`
- `activite_principale`
- `tranche_effectifs`
- `chiffre_affaires`
- `resultat_net`
- `charges_personnel`
- `confidentialite`

Key Limitations
- `tranche_effectifs`: useless

The Payroll Trap
Using `payroll / 70K` to estimate employees is circular: the result is just `profit/payroll` with extra steps, and it is biased toward low-wage sectors.

What We CAN Find
- `profit / revenue`
- `profit / payroll`

Best Sectors (visible data)
Files
- `README.md` - Full data dictionary & honest limitations
- `find_*.py` - Analysis scripts
- `data/parquet/` - 2.9GB (gitignored)

🤖 Generated with Claude Code