Conversation
Force-pushed from ff7972e to 4b54494
BellezaEmporium left a comment
I'm not against the idea, but since most of the scripts' code is in JS (via Node), I'll see if @freearhey is up for adding a Python script.
Of course, it would be better to rewrite the code in JavaScript, primarily to avoid complicating the maintenance of the repository as a whole.
`scripts/check_logos.js` scans `data/logos.csv` for broken logo URLs and writes the dead entries to `dead_logos.json`.

Features:
- Async worker pool (bounded concurrency, EMFILE retry)
- HEAD with automatic fallback to GET on 405
- 429 retries rescheduled via setTimeout so workers never stall
- Exponential backoff with Retry-After header support
- Magic-byte sniffing for application/octet-stream responses
- `--recheck`: re-verify a previous `dead_logos.json` in place
- `--loop`: keep re-checking until the list stabilizes
- `--delay`: per-request throttle to reduce rate limiting
- Live progress with rate and ETA, reason breakdown on completion

Requires: Node.js 18+ (no external dependencies)

Also adds `dead_logos*.json` to `.gitignore`.
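The backoff feature listed above (exponential backoff that honors Retry-After) can be sketched as a small pure function. The names and constants here are illustrative, not the script's actual internals:

```javascript
const BASE_DELAY_MS = 1000   // assumed base delay, for illustration
const MAX_DELAY_MS = 60_000  // cap so a bad header cannot stall a worker forever

// How long to wait before retry number `attempt` (0-based).
// A Retry-After header may be delta-seconds or an HTTP-date; if it is
// absent or unparseable, fall back to capped exponential backoff.
function backoffDelay(attempt, retryAfter) {
  if (retryAfter != null) {
    const seconds = Number(retryAfter)
    if (Number.isFinite(seconds)) return Math.min(seconds * 1000, MAX_DELAY_MS)
    const date = Date.parse(retryAfter)
    if (!Number.isNaN(date)) return Math.min(Math.max(0, date - Date.now()), MAX_DELAY_MS)
  }
  return Math.min(BASE_DELAY_MS * 2 ** attempt, MAX_DELAY_MS)
}
```

Scheduling the retry via `setTimeout(retry, backoffDelay(attempt, header))` rather than sleeping inside the worker is what keeps the pool from stalling.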
Force-pushed from 4b54494 to aaf4265
I migrated the script from Python to Node.js (no deps) as proposed. I used Claude Code since I'm not fluent in JS.
The script has flagged over 10,000 links as dead. Is that supposed to happen?

```
node scripts/check_logos.js
...
Finished in 7.6min — 10434/40707 still dead
Reason breakdown:
10117 HTTP 429 (gave up after retries)
  188 connection error
   55 HTTP 404
   41 HTTP 403
   15 timeout
   11 bad content-type
    2 HTTP 418
    1 HTTP 521
    1 HTTP 402
    1 HTTP 400
    1 HTTP 500
    1 HTTP 502
```
@freearhey judging by the 10k+

@rursache so you don't get a 429 error with the default settings?

@freearhey I get some, but it really depends on your IP, reputation, etc. There are settings to configure everything, including retrying in chunks and resume-on-fail. A "safe" config would take hours and is probably unnecessary for most users; that's why we have params.
When a server returns HTTP 429, the Retry-After delay is now applied to all pending requests for that domain, not just the individual URL. This prevents other concurrent workers from continuing to hammer a rate-limited server.
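A minimal sketch of that domain-wide cooldown (hypothetical names; the script's actual bookkeeping may differ):

```javascript
// hostname -> epoch ms until which the whole domain should be left alone
const cooldownUntil = new Map()

// Called when any URL on a host returns 429 with a Retry-After of
// `retryAfterMs`. `now` is injectable to keep the logic testable.
function markRateLimited(url, retryAfterMs, now = Date.now()) {
  const host = new URL(url).hostname
  const until = now + retryAfterMs
  // Never shorten an existing cooldown for the host.
  if (until > (cooldownUntil.get(host) ?? 0)) cooldownUntil.set(host, until)
}

// Workers consult this before every request: a non-zero result means
// reschedule the request that far into the future instead of sending it.
function cooldownRemaining(url, now = Date.now()) {
  const host = new URL(url).hostname
  return Math.max(0, (cooldownUntil.get(host) ?? 0) - now)
}
```

Keying on hostname means one 429 pauses every pending URL on that server, which is exactly the "stop hammering" behavior described.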
Reduce DEFAULT_CONCURRENCY from 50 to 10 and set DEFAULT_DELAY_MS to 200 to avoid triggering rate limits with default settings.
Use the same csvtojson library already used across the repo instead of a hand-rolled CSV parser.
Use the repo's existing probe-image-size dependency to verify ambiguous content types instead of manual magic-byte sniffing.
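For reference, the hand-rolled magic-byte check being replaced looks roughly like this (a sketch, not the script's exact code):

```javascript
// Inspect the first bytes of an application/octet-stream response to
// decide whether it is really an image. Covers the common logo formats;
// the SVG check is deliberately naive (an SVG may start with an XML
// declaration instead), which is one reason to prefer a real library.
function sniffImageType(buf) {
  if (buf.length >= 4 &&
      buf[0] === 0x89 && buf[1] === 0x50 && buf[2] === 0x4e && buf[3] === 0x47) return 'png'
  if (buf.length >= 3 && buf[0] === 0xff && buf[1] === 0xd8 && buf[2] === 0xff) return 'jpeg'
  if (buf.length >= 6 && buf.toString('latin1', 0, 3) === 'GIF') return 'gif'
  if (buf.length >= 12 && buf.toString('latin1', 0, 4) === 'RIFF' &&
      buf.toString('latin1', 8, 12) === 'WEBP') return 'webp'
  return null
}
```

Swapping this for `probe-image-size` also yields dimensions for free and handles formats this sketch misses.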
freearhey left a comment
Also, it would be nice to have at least a basic test to make sure the script correctly parses the response from the server and doesn't crash at the end of the check, because I'm not ready to wait half an hour just to find that out:
```
node scripts/check_logos.js
[336/40707] 0.8% dead: 1 22.4/s eta: 30.1min
[500/40707] 1.2% dead: 3 19.7/s eta: 34.0min
[598/40707] 1.5% dead: 4 19.9/s eta: 33.5min
[927/40707] 2.3% dead: 7 20.6/s eta: 32.2min
[1000/40707] 2.5% dead: 8 20.9/s eta: 31.6min
...
```

Spins up a local HTTP server to verify the script correctly handles:
- Valid image URLs (200 + image content-type)
- HTTP errors (404, 403, 500)
- Bad content-type responses
- HEAD 405 fallback to GET
- Persistent 429 rate limiting
- Empty URLs
- Mixed alive/dead batches
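The pass/fail rules such a test suite pins down could be captured in a pure helper along these lines (a hypothetical sketch; the script's internals may differ):

```javascript
// Classify one HTTP response. Returning a verdict instead of acting
// immediately keeps the logic trivially unit-testable without a server.
function classifyResponse(status, contentType) {
  if (status === 405) return { verdict: 'retry-get' }      // HEAD not allowed: retry with GET
  if (status === 429) return { verdict: 'retry-backoff' }  // rate limited: back off and retry
  if (status < 200 || status >= 300) return { verdict: 'dead', reason: `HTTP ${status}` }
  const type = (contentType ?? '').split(';')[0].trim().toLowerCase()
  if (type === 'application/octet-stream') return { verdict: 'sniff' } // need magic bytes
  if (!type.startsWith('image/')) return { verdict: 'dead', reason: 'bad content-type' }
  return { verdict: 'alive' }
}
```

With this shape, the local-server tests only need to exercise the transport; the decision table can be covered with plain assertions.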
Force-pushed from 1f34000 to 52e4d2a
Updated to apply all the requested changes, added a small test suite, and made the defaults a bit slower to reduce the chance of 429 errors. I feel this can be merged now: it contains everything needed to see which logos are no longer available and need fixing.
freearhey left a comment
Well, after a 6-hour scan, I ended up with a list of 5552 dead links:

```
Finished in 345.1min — 5552/40707 still dead
Reason breakdown:
 5068 HTTP 429 (gave up after retries)
  219 connection error
  190 HTTP 404
   39 HTTP 403
   15 timeout
   13 bad content-type
    2 HTTP 418
    1 HTTP 521
    1 HTTP 402
    1 HTTP 400
    1 HTTP 500
    1 HTTP 409
    1 HTTP 502
```

And I have no idea what to do with this information now. But I definitely don't plan on repeating this process in the future.
Although, who knows—maybe this script will come in handy for someone else.
Ignoring the HTTP 429 errors, all the others should have replacement logos provided. That's the whole point: finding dead links so we have a healthy database. The script could be run on CI or somewhere in the background where it can be slow. I'm already working on getting fresh images for the dead links and will have a PR fixing most of them soon.
TypeScript complains because the file is not typed. Making the necessary changes.
Tested and working on my side. Please try with `npx tsx scripts/commands/db/check_logos.ts` if necessary.
freearhey left a comment
Test failed:

```
npm test -- check_logos

> test
> npx vitest run check_logos

❯ tests/commands/check_logos.test.js:3:1
 1| import { describe, it, expect, afterAll, beforeAll } from 'vitest'
 2| import { createServer } from 'node:http'
 3| import { checkAll } from '../../scripts/check_logos.js'
  |          ^
 4|
 5| const REMOTE_IMAGE = 'https://i.imgur.com/7oNe8xj.png'

Test Files  1 failed (1)
     Tests  no tests
  Duration  417ms (transform 51ms, setup 0ms, collect 0ms, tests 0ms, environment 0ms, prepare 11ms)
```
That's a whoopsie from my side.
this could be useful to remove dead urls from source other than
To run scripts use the
```js
const DEFAULT_CONCURRENCY = 10
const DEFAULT_TIMEOUT = 15_000 // ms per request
const DEFAULT_DELAY_MS = 200
```
Is a delay really necessary by default?
I agree, it seems useful like once a year. Found this regarding Wikimedia rate limits; could it be of any help? https://commons.wikimedia.org/w/api.php?action=query&titles=File:Sharq_TV_Logo_2020.jpg|File:Ora_News_(Albania).svg&prop=imageinfo&iiprop=url&format=json
Summary

- `scripts/check_logos.py`: an async Python script that scans `data/logos.csv` and identifies broken logo URLs
- Adds `dead_logos*.json` to `.gitignore` so output files are never accidentally committed

How it works

A logo URL is considered dead if:
- the `Content-Type` is not `image/*`

429 responses are retried with exponential backoff (respects the `Retry-After` header). HEAD requests fall back to GET automatically on 405.

Usage

Output is a JSON array of dead logo entries (all original CSV fields preserved) with an added `_reason` field explaining why the URL failed.

Requires: Python 3.10+, `aiohttp`