Conversation
Force-pushed from ff7972e to 4b54494
BellezaEmporium left a comment
I'm not against the idea, but since most of the scripts' code is in JS (via Node), I'll see if @freearhey is up for adding a Python script.
Of course, it would be better to rewrite the code in JavaScript, primarily to avoid complicating the maintenance of the repository as a whole.
`scripts/check_logos.js` scans `data/logos.csv` for broken logo URLs and writes the dead entries to `dead_logos.json`.

Features:
- Async worker pool (bounded concurrency, EMFILE retry)
- HEAD with automatic fallback to GET on 405
- 429 retries rescheduled via setTimeout so workers never stall
- Exponential backoff with Retry-After header support
- Magic-byte sniffing for application/octet-stream responses
- `--recheck`: re-verify a previous `dead_logos.json` in place
- `--loop`: keep re-checking until the list stabilizes
- `--delay`: per-request throttle to reduce rate limiting
- Live progress with rate and ETA, reason breakdown on completion

Requires: Node.js 18+ (no external dependencies)

Also adds `dead_logos*.json` to `.gitignore`.
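The backoff feature listed above (exponential backoff that honors Retry-After) can be sketched as a small pure function. The names and constants here are illustrative, not the script's actual internals:

```javascript
const BASE_DELAY_MS = 1000   // assumed base delay, for illustration
const MAX_DELAY_MS = 60_000  // cap so a bad header cannot stall a worker forever

// How long to wait before retry number `attempt` (0-based).
// A Retry-After header may be delta-seconds or an HTTP-date; if it is
// absent or unparseable, fall back to capped exponential backoff.
function backoffDelay(attempt, retryAfter) {
  if (retryAfter != null) {
    const seconds = Number(retryAfter)
    if (Number.isFinite(seconds)) return Math.min(seconds * 1000, MAX_DELAY_MS)
    const date = Date.parse(retryAfter)
    if (!Number.isNaN(date)) return Math.min(Math.max(0, date - Date.now()), MAX_DELAY_MS)
  }
  return Math.min(BASE_DELAY_MS * 2 ** attempt, MAX_DELAY_MS)
}
```

Scheduling the retry via `setTimeout(retry, backoffDelay(attempt, header))` rather than sleeping inside the worker is what keeps the pool from stalling.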
Force-pushed from 4b54494 to aaf4265
I migrated the script from Python to Node.js (no deps) as proposed. I used Claude Code since I'm not fluent in JS.
The script has flagged over 10,000 links as dead. Is that supposed to happen?

```
node scripts/check_logos.js
...
Finished in 7.6min — 10434/40707 still dead
Reason breakdown:
10117 HTTP 429 (gave up after retries)
  188 connection error
   55 HTTP 404
   41 HTTP 403
   15 timeout
   11 bad content-type
    2 HTTP 418
    1 HTTP 521
    1 HTTP 402
    1 HTTP 400
    1 HTTP 500
    1 HTTP 502
```
@freearhey judging by the 10k+

@rursache so you don't get a 429 error with the default settings?

@freearhey I get some, but it really depends on your IP, reputation, etc. There are settings to configure everything, including retrying in chunks and resume-on-fail. A "safe" config would take hours and is probably unnecessary for most users; that's why we have params.
When a server returns HTTP 429, the Retry-After delay is now applied to all pending requests for that domain, not just the individual URL. This prevents other concurrent workers from continuing to hammer a rate-limited server.
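A minimal sketch of that domain-wide cooldown (hypothetical names; the script's actual bookkeeping may differ):

```javascript
// hostname -> epoch ms until which the whole domain should be left alone
const cooldownUntil = new Map()

// Called when any URL on a host returns 429 with a Retry-After of
// `retryAfterMs`. `now` is injectable to keep the logic testable.
function markRateLimited(url, retryAfterMs, now = Date.now()) {
  const host = new URL(url).hostname
  const until = now + retryAfterMs
  // Never shorten an existing cooldown for the host.
  if (until > (cooldownUntil.get(host) ?? 0)) cooldownUntil.set(host, until)
}

// Workers consult this before every request: a non-zero result means
// reschedule the request that far into the future instead of sending it.
function cooldownRemaining(url, now = Date.now()) {
  const host = new URL(url).hostname
  return Math.max(0, (cooldownUntil.get(host) ?? 0) - now)
}
```

Keying on hostname means one 429 pauses every pending URL on that server, which is exactly the "stop hammering" behavior described.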
Reduce DEFAULT_CONCURRENCY from 50 to 10 and set DEFAULT_DELAY_MS to 200 to avoid triggering rate limits with default settings.
Use the same csvtojson library already used across the repo instead of a hand-rolled CSV parser.
Use the repo's existing probe-image-size dependency to verify ambiguous content types instead of manual magic-byte sniffing.
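For reference, the hand-rolled magic-byte check being replaced looks roughly like this (a sketch, not the script's exact code):

```javascript
// Inspect the first bytes of an application/octet-stream response to
// decide whether it is really an image. Covers the common logo formats;
// the SVG check is deliberately naive (an SVG may start with an XML
// declaration instead), which is one reason to prefer a real library.
function sniffImageType(buf) {
  if (buf.length >= 4 &&
      buf[0] === 0x89 && buf[1] === 0x50 && buf[2] === 0x4e && buf[3] === 0x47) return 'png'
  if (buf.length >= 3 && buf[0] === 0xff && buf[1] === 0xd8 && buf[2] === 0xff) return 'jpeg'
  if (buf.length >= 6 && buf.toString('latin1', 0, 3) === 'GIF') return 'gif'
  if (buf.length >= 12 && buf.toString('latin1', 0, 4) === 'RIFF' &&
      buf.toString('latin1', 8, 12) === 'WEBP') return 'webp'
  return null
}
```

Swapping this for `probe-image-size` also yields dimensions for free and handles formats this sketch misses.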
freearhey left a comment
Also, it would be nice to have at least a basic test to make sure the script correctly parses the response from the server and doesn't crash at the end of the check, because I'm not ready to wait half an hour just to find that out:
```
node scripts/check_logos.js
[336/40707] 0.8% dead: 1 22.4/s eta: 30.1min
[500/40707] 1.2% dead: 3 19.7/s eta: 34.0min
[598/40707] 1.5% dead: 4 19.9/s eta: 33.5min
[927/40707] 2.3% dead: 7 20.6/s eta: 32.2min
[1000/40707] 2.5% dead: 8 20.9/s eta: 31.6min
...
```

Spins up a local HTTP server to verify the script correctly handles:
- Valid image URLs (200 + image content-type)
- HTTP errors (404, 403, 500)
- Bad content-type responses
- HEAD 405 fallback to GET
- Persistent 429 rate limiting
- Empty URLs
- Mixed alive/dead batches
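The pass/fail rules such a test suite pins down could be captured in a pure helper along these lines (a hypothetical sketch; the script's internals may differ):

```javascript
// Classify one HTTP response. Returning a verdict instead of acting
// immediately keeps the logic trivially unit-testable without a server.
function classifyResponse(status, contentType) {
  if (status === 405) return { verdict: 'retry-get' }      // HEAD not allowed: retry with GET
  if (status === 429) return { verdict: 'retry-backoff' }  // rate limited: back off and retry
  if (status < 200 || status >= 300) return { verdict: 'dead', reason: `HTTP ${status}` }
  const type = (contentType ?? '').split(';')[0].trim().toLowerCase()
  if (type === 'application/octet-stream') return { verdict: 'sniff' } // need magic bytes
  if (!type.startsWith('image/')) return { verdict: 'dead', reason: 'bad content-type' }
  return { verdict: 'alive' }
}
```

With this shape, the local-server tests only need to exercise the transport; the decision table can be covered with plain assertions.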
Force-pushed from 1f34000 to 52e4d2a
Updated to apply all the requested changes, added a small test suite, and made the defaults a bit slower to reduce the chance of 429 errors. I feel this can be merged now: it contains everything needed to see which logos are no longer available and need fixing.
freearhey left a comment
Well, after a 6-hour scan, I ended up with a list of 5552 dead links:

```
Finished in 345.1min — 5552/40707 still dead
Reason breakdown:
 5068 HTTP 429 (gave up after retries)
  219 connection error
  190 HTTP 404
   39 HTTP 403
   15 timeout
   13 bad content-type
    2 HTTP 418
    1 HTTP 521
    1 HTTP 402
    1 HTTP 400
    1 HTTP 500
    1 HTTP 409
    1 HTTP 502
```

And I have no idea what to do with this information now. But I definitely don't plan on repeating this process in the future.
Although, who knows—maybe this script will come in handy for someone else.
Ignoring the HTTP 429 errors, all the others should have replacement logos provided. That's the whole point: finding dead links so we have a healthy database. The script could be run on CI or somewhere in the background where it can be slow. I'm already working on getting fresh images for the dead links and will have a PR fixing most of them soon.
TypeScript complains because the file is not typed. Making the necessary changes.
Tested and working on my side. Please try with `npx tsx scripts/commands/db/check_logos.ts` if necessary.
freearhey left a comment
Test failed:

```
npm test -- check_logos

> test
> npx vitest run check_logos

❯ tests/commands/check_logos.test.js:3:1
 1| import { describe, it, expect, afterAll, beforeAll } from 'vitest'
 2| import { createServer } from 'node:http'
 3| import { checkAll } from '../../scripts/check_logos.js'
  |          ^
 4|
 5| const REMOTE_IMAGE = 'https://i.imgur.com/7oNe8xj.png'

Test Files  1 failed (1)
     Tests  no tests
  Duration  417ms (transform 51ms, setup 0ms, collect 0ms, tests 0ms, environment 0ms, prepare 11ms)
```
That's a whoopsie from my side.
this could be useful to remove dead urls from source other than
To run scripts use the
```js
const DEFAULT_CONCURRENCY = 10
const DEFAULT_TIMEOUT = 15_000 // ms per request
const DEFAULT_DELAY_MS = 200
```
Is a delay really necessary by default?
I agree, it seems useful like once a year. Found this regarding Wikimedia rate limits; could it be of any help? https://commons.wikimedia.org/w/api.php?action=query&titles=File:Sharq_TV_Logo_2020.jpg|File:Ora_News_(Albania).svg&prop=imageinfo&iiprop=url&format=json
Summary

- `scripts/check_logos.py`: an async Python script that scans `data/logos.csv` and identifies broken logo URLs
- Adds `dead_logos*.json` to `.gitignore` so output files are never accidentally committed

How it works

A logo URL is considered dead if:
- the `Content-Type` is not `image/*`

429 responses are retried with exponential backoff (respects the `Retry-After` header). HEAD requests fall back to GET automatically on 405.

Usage

Output is a JSON array of dead logo entries (all original CSV fields preserved) with an added `_reason` field explaining why the URL failed.

Requires: Python 3.10+, `aiohttp`