
Add logo URL checker script #26152

Open
rursache wants to merge 9 commits into iptv-org:master from rursache:add-logo-checker

Conversation

@rursache

Summary

  • Adds scripts/check_logos.py — an async Python script that scans data/logos.csv and identifies broken logo URLs
  • Adds dead_logos*.json to .gitignore so output files are never accidentally committed

How it works

A logo URL is considered dead if:

  • The connection fails or times out
  • The HTTP response is not 2xx
  • The Content-Type is not image/*

429 responses are retried with exponential backoff (respects Retry-After header). HEAD requests fall back to GET automatically on 405.
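The retry and fallback policy above can be sketched roughly as follows (shown in JavaScript to match the repo's tooling; `backoffMs` and `probe` are illustrative names, not the script's actual identifiers):

```javascript
// Compute the wait before retry number `attempt` (0-based): honor the
// server's Retry-After header when present, otherwise back off
// exponentially, capped at one minute.
function backoffMs(attempt, retryAfterHeader, baseMs = 1000, capMs = 60000) {
  const retryAfter = Number(retryAfterHeader)
  if (Number.isFinite(retryAfter) && retryAfter > 0) return retryAfter * 1000
  return Math.min(baseMs * 2 ** attempt, capMs)
}

// Try HEAD first; fall back to GET when the server rejects HEAD with 405.
async function probe(url, fetchImpl = fetch) {
  let res = await fetchImpl(url, { method: 'HEAD' })
  if (res.status === 405) res = await fetchImpl(url, { method: 'GET' })
  return res
}
```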

Usage

# pip install aiohttp

# Full scan (~40k URLs)
python3 scripts/check_logos.py

# Re-check a previous result with lower concurrency to avoid rate limits
python3 scripts/check_logos.py --recheck dead_logos.json --concurrency 10 --delay 500

# Keep re-checking until the dead list stabilizes
python3 scripts/check_logos.py --recheck dead_logos.json --loop --concurrency 10 --delay 500

Output is a JSON array of dead logo entries (all original CSV fields preserved) with an added _reason field explaining why the URL failed.
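For illustration, a dead entry might look like this (the field names other than `_reason` depend on the columns in data/logos.csv; the values here are made up):

```json
[
  {
    "channel": "Example.us",
    "url": "https://example.com/logos/example.png",
    "_reason": "HTTP 404"
  }
]
```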

Requires: Python 3.10+, aiohttp

@rursache rursache force-pushed the add-logo-checker branch 5 times, most recently from ff7972e to 4b54494 Compare March 18, 2026 21:01
@rursache
Author

Example of the resulting dead_logos.json file:

Reason breakdown:
        55  HTTP 404
        33  connection error
        26  HTTP 403
        15  bad content-type
        14  other
        14  timeout
         2  HTTP 409
         1  HTTP 521
         1  HTTP 402
         1  HTTP 400
         1  HTTP 502
         1  HTTP 500

dead_logos.json

Contributor

@BellezaEmporium BellezaEmporium left a comment


I'm not against the idea, but since most of the scripts in this repo are written in JS (via Node), I'll see if @freearhey is open to adding a Python script.

@freearhey
Contributor

Of course, it would be better to rewrite the code in JavaScript, primarily to avoid complicating the maintenance of the repository as a whole.

scripts/check_logos.js scans data/logos.csv for broken logo URLs and
writes the dead entries to dead_logos.json.

Features:
- Async worker-pool (concurrency bounded, EMFILE retry)
- HEAD with automatic fallback to GET on 405
- 429 retries rescheduled via setTimeout so workers never stall
- Exponential backoff with Retry-After header support
- Magic-byte sniffing for application/octet-stream responses
- --recheck: re-verify a previous dead_logos.json in place
- --loop: keep re-checking until the list stabilizes
- --delay: per-request throttle to reduce rate limiting
- Live progress with rate and ETA, reason breakdown on completion

Requires: Node.js 18+ (no external dependencies)
Also adds dead_logos*.json to .gitignore.
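The magic-byte sniffing mentioned in the feature list could look roughly like this (an illustrative sketch, not the script's actual code; a later revision replaces it with the repo's existing probe-image-size dependency):

```javascript
// When a server answers application/octet-stream, inspect the first
// bytes of the body to decide whether it is really an image.
function sniffImageType(buf) {
  // PNG: 89 50 4E 47
  if (buf.length >= 8 &&
      buf[0] === 0x89 && buf[1] === 0x50 && buf[2] === 0x4e && buf[3] === 0x47) {
    return 'image/png'
  }
  // JPEG: FF D8 FF
  if (buf.length >= 3 && buf[0] === 0xff && buf[1] === 0xd8 && buf[2] === 0xff) {
    return 'image/jpeg'
  }
  // GIF: "GIF87a" or "GIF89a"
  if (buf.length >= 6 && buf.toString('latin1', 0, 3) === 'GIF') {
    return 'image/gif'
  }
  // SVGs are plain text; look for the opening tag near the start.
  if (buf.toString('utf8', 0, 256).includes('<svg')) return 'image/svg+xml'
  return null
}
```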
@rursache
Author

I migrated the script from Python to Node.js (no dependencies) as proposed. I used Claude Code, as I'm not fluent in JS.

@freearhey
Contributor

The script has flagged over 10,000 links as dead. Is that supposed to happen?

node scripts/check_logos.js
...
Finished in 7.6min — 10434/40707 still dead
  Reason breakdown:
     10117  HTTP 429 (gave up after retries)
       188  connection error
        55  HTTP 404
        41  HTTP 403
        15  timeout
        11  bad content-type
         2  HTTP 418
         1  HTTP 521
         1  HTTP 402
         1  HTTP 400
         1  HTTP 500
         1  HTTP 502

@rursache
Author

@freearhey judging by the 10k+ HTTP 429 errors, the Wikipedia CDN rate-limited you. Reduce the request rate using the CLI params to prevent this.

@freearhey
Contributor

@rursache so you don't get a 429 error with the default settings?

@rursache
Author

rursache commented Mar 23, 2026

@freearhey I get some, but it really depends on your IP, reputation, etc. There are settings to configure everything, including retrying in chunks and resume-on-fail. A "safe" config would take hours and is probably unnecessary for most users; that's why we have params.
a complete run example is here: #26152 (comment)

When a server returns HTTP 429, the Retry-After delay is now applied
to all pending requests for that domain, not just the individual URL.
This prevents other concurrent workers from continuing to hammer a
rate-limited server.

Reduce DEFAULT_CONCURRENCY from 50 to 10 and set DEFAULT_DELAY_MS
to 200 to avoid triggering rate limits with default settings.

Use the same csvtojson library already used across the repo instead
of a hand-rolled CSV parser.

Use the repo's existing probe-image-size dependency to verify
ambiguous content types instead of manual magic-byte sniffing.
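The per-domain pause described in the first commit message can be sketched as follows (hypothetical helper names; the actual script integrates this with its worker pool):

```javascript
// hostname -> timestamp (ms) before which no request to that host
// should be sent. A 429 on one URL pauses the whole domain.
const pausedUntil = new Map()

function pauseDomain(url, retryAfterSec) {
  const host = new URL(url).hostname
  const until = Date.now() + retryAfterSec * 1000
  // Never shorten an existing pause.
  pausedUntil.set(host, Math.max(pausedUntil.get(host) ?? 0, until))
}

// How long a worker should wait before touching this URL's host.
function domainWaitMs(url) {
  const host = new URL(url).hostname
  return Math.max(0, (pausedUntil.get(host) ?? 0) - Date.now())
}
```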
Contributor

@freearhey freearhey left a comment


Also, it would be nice to have at least a basic test to make sure the script correctly parses the response from the server and doesn't crash at the end of the check, because I'm not ready to wait half an hour just to find that out:

node scripts/check_logos.js
  [336/40707] 0.8%  dead: 1  22.4/s  eta: 30.1min
  [500/40707] 1.2%  dead: 3  19.7/s  eta: 34.0min
  [598/40707] 1.5%  dead: 4  19.9/s  eta: 33.5min
  [927/40707] 2.3%  dead: 7  20.6/s  eta: 32.2min
  [1000/40707] 2.5%  dead: 8  20.9/s  eta: 31.6min
...

Spins up a local HTTP server to verify the script correctly handles:
- Valid image URLs (200 + image content-type)
- HTTP errors (404, 403, 500)
- Bad content-type responses
- HEAD 405 fallback to GET
- Persistent 429 rate limiting
- Empty URLs
- Mixed alive/dead batches
@rursache
Author

Updated to apply all the requested changes, plus a small test suite. I've also made the defaults a bit slower to reduce the chance of 429 errors. I feel this can be merged now, as it contains everything needed to see which logos are no longer available and need fixing.

freearhey
freearhey previously approved these changes Mar 24, 2026
Contributor

@freearhey freearhey left a comment


Well, after a 6-hour scan, I ended up with a list of 5552 dead links:

Finished in 345.1min — 5552/40707 still dead
  Reason breakdown:
      5068  HTTP 429 (gave up after retries)
       219  connection error
       190  HTTP 404
        39  HTTP 403
        15  timeout
        13  bad content-type
         2  HTTP 418
         1  HTTP 521
         1  HTTP 402
         1  HTTP 400
         1  HTTP 500
         1  HTTP 409
         1  HTTP 502

And I have no idea what to do with this information now. But I definitely don’t plan on repeating this process in the future.

Although, who knows—maybe this script will come in handy for someone else.

@rursache
Author

rursache commented Mar 24, 2026

Ignoring the HTTP 429 errors, all the others should get replacement logos. That's the whole point: finding dead links so we have a healthy database. The script could be run on CI or somewhere in the background where it's fine for it to be slow.

I'm already working on getting fresh images for the dead links and will have a PR fixing most of them soon.

@BellezaEmporium
Contributor

BellezaEmporium commented Mar 25, 2026

TypeScript complains because the file is not typed. Making the necessary changes.

@BellezaEmporium
Contributor

Tested and working on my side.

Please try with "npx tsx scripts/commands/db/check_logos.ts" if necessary.

Contributor

@freearhey freearhey left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Test failed:

npm test --- check_logos

> test
> npx vitest run check_logos

 ❯ tests/commands/check_logos.test.js:3:1
      1| import { describe, it, expect, afterAll, beforeAll } from 'vitest'
      2| import { createServer } from 'node:http'
      3| import { checkAll } from '../../scripts/check_logos.js'
       | ^
      4| 
      5| const REMOTE_IMAGE = 'https://i.imgur.com/7oNe8xj.png'

⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯[1/1]⎯


 Test Files  1 failed (1)
      Tests  no tests
   Duration  417ms (transform 51ms, setup 0ms, collect 0ms, tests 0ms, environment 0ms, prepare 11ms)

@BellezaEmporium
Contributor

> Test failed: npx vitest run check_logos — 1 test file failed at the import in tests/commands/check_logos.test.js, no tests collected

That's a whoopsie from my side.

@StrangeDrVN
Collaborator

StrangeDrVN commented Mar 26, 2026

And I have no idea what to do with this information now. But I definitely don’t plan on repeating this process in the future.

This could be useful for removing dead URLs from sources other than Wikimedia.

  1. Should the name of the script be different, in line with the ones in the contributing guide?

To run scripts use the npm run <script-name> command.

  • act:check: allows to run the check workflow locally. Depends on nektos/act.
  • act:update: allows to run the update workflow locally. Depends on nektos/act.
  • act:deploy: allows to run the deploy workflow locally. Depends on nektos/act.
  • db:validate: checks the integrity of data.
  • db:export: saves all data in JSON format to the /.api folder.
  • db:update: triggers a data update using approved requests from issues.
  • lint: checks the scripts for syntax errors.
  • test: runs a test of all the scripts described above.


const DEFAULT_CONCURRENCY = 10
const DEFAULT_TIMEOUT = 15_000 // ms per request
const DEFAULT_DELAY_MS = 200
Contributor


Is a delay really necessary by default?
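For context, the delay mainly bounds the worst-case request rate of the worker pool. A back-of-the-envelope model (illustrative arithmetic, not the script's exact scheduler):

```javascript
// Each of `concurrency` workers completes one request per
// (avgLatencyMs + delayMs) milliseconds, so the pool's throughput
// is capped even against servers that respond instantly.
function maxRequestsPerSecond(concurrency, delayMs, avgLatencyMs = 0) {
  return (concurrency * 1000) / (delayMs + avgLatencyMs)
}
```

With the defaults above (concurrency 10, delay 200 ms), the scan is capped at 50 req/s even with instant responses, and roughly 20 req/s at ~300 ms average latency, which is why the delay helps avoid blanket rate limits.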

@StrangeDrVN
Collaborator

And I have no idea what to do with this information now. But I definitely don’t plan on repeating this process in the future.

I agree, it seems useful maybe once a year.

Found these regarding Wikimedia rate limits, could they be of any help?
https://www.mediawiki.org/wiki/API:Etiquette
https://www.mediawiki.org/wiki/Manual:Rate_limits
https://www.mediawiki.org/wiki/API:Query#Generators

https://commons.wikimedia.org/w/api.php?action=query&titles=File:Sharq_TV_Logo_2020.jpg|File:Ora_News_(Albania).svg&prop=imageinfo&iiprop=url&format=json
It seems we can use the API to examine 50 images per request; it returns JSON with the URL and a 404/missing flag, among other information.
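A sketch of that batched lookup (hypothetical helper names; only the API parameters from the example URL above are assumed):

```javascript
// Split a list of File: titles into batches of up to 50,
// the per-request limit mentioned above.
function chunk(items, size = 50) {
  const out = []
  for (let i = 0; i < items.length; i += size) out.push(items.slice(i, i + size))
  return out
}

// Build one imageinfo query URL for a batch of titles.
function commonsQueryUrl(titles) {
  const params = new URLSearchParams({
    action: 'query',
    titles: titles.join('|'),
    prop: 'imageinfo',
    iiprop: 'url',
    format: 'json'
  })
  return `https://commons.wikimedia.org/w/api.php?${params}`
}

// Usage idea: for (const batch of chunk(allTitles)) fetch(commonsQueryUrl(batch)) ...
// In the response, pages carrying a "missing" key would be the dead ones.
```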
