zimcheck: fix isOutofBounds bug related to URL queries #488
IMayBeABitShy wants to merge 2 commits into openzim:main
|
I think that fixing |
|
It also seems that there is another problem with |
|
I've changed this PR so that it instead attempts to properly parse the URL. |
|
@veloman-yunkan Could you elaborate on the changes needed for
//Removes extra spaces from URLs. Usually done by the browser, so web authors sometimes tend to ignore it.
//Converts the %20 to space.Essential for comparing URLs.
The function already takes care of the space-related issues and properly ignores the query string and fragment in the URL. We could limit it so that it only removes spaces from the path component of a link; however, the comments indicate that the function is used for comparing paths. Normalizing the links by stripping stuff like "https://" may lead to wrong comparison results. How should a link like "https://example.org/some%20path/?k=v" be normalized?
The doc comment appears to be outdated. |
|
BTW, I remember that there was once a similar discussion of relative links going out of bounds (ascending with a series of |
@IMayBeABitShy After re-reading your bug report, I think that my initial interpretation of it was different from what you had in mind. To the best of my understanding (I haven't consulted any specification, but my intuition tells me so) the search (query) component of a URL isn't subject to relative URL resolution. Thus the URL in your example should resolve to |
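To illustrate the point that the query component is left alone during relative resolution, here is a minimal sketch. The helper names `splitOnSlash` and `resolveRelative` are hypothetical and not part of zimcheck; the sketch assumes an absolute base path starting with `/`, a link without a scheme or host, and it returns an empty string when the path would climb above the root. It also drops a trailing `/` from the resolved path, which a real resolver would keep.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: split a path on '/' (empty segments are kept).
std::vector<std::string> splitOnSlash(const std::string& s)
{
    std::vector<std::string> parts;
    std::size_t start = 0;
    while (true) {
        const std::size_t pos = s.find('/', start);
        if (pos == std::string::npos) {
            parts.push_back(s.substr(start));
            break;
        }
        parts.push_back(s.substr(start, pos - start));
        start = pos + 1;
    }
    return parts;
}

// Hypothetical sketch: resolve `link` against `basePath`.
// Only the path part takes part in ".." handling; the query and
// fragment are copied through unchanged.
// Returns "" if the link would ascend above the root.
std::string resolveRelative(const std::string& basePath, const std::string& link)
{
    const std::size_t cut = link.find_first_of("?#");
    const std::string relPath = link.substr(0, cut);
    const std::string suffix = (cut == std::string::npos) ? "" : link.substr(cut);

    // A link consisting only of a query/fragment refers to the base itself.
    if (relPath.empty())
        return basePath + suffix;

    // Directory stack of the base: drop the last segment (the file name).
    std::vector<std::string> stack;
    if (relPath[0] != '/') {
        std::vector<std::string> baseParts = splitOnSlash(basePath);
        baseParts.pop_back();
        for (const std::string& p : baseParts)
            if (!p.empty())
                stack.push_back(p);
    }

    for (const std::string& seg : splitOnSlash(relPath)) {
        if (seg.empty() || seg == ".")
            continue;
        if (seg == "..") {
            if (stack.empty())
                return "";  // would leave the root: out of bounds
            stack.pop_back();
        } else {
            stack.push_back(seg);
        }
    }

    std::string result;
    for (const std::string& seg : stack)
        result += "/" + seg;
    if (result.empty())
        result = "/";
    return result + suffix;
}
```

With the link from the bug report and the base `/dir/subdir/index.html`, this sketch yields `/tools/tool.html?path=../dir/subdir/file.html`: the `../` inside the query never reaches the segment loop.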
I guess the possibility of links containing search and/or fragment components has been completely overlooked. We now have to define how to handle them. My feeling is that in the context of |
I strongly agree. The specification does not, to the best of my knowledge, guarantee any URL scheme for how ZIM files should be served. I think a ZIM file may be served on any path, including
Off-topic (can be skipped): Cross-ZIM-file links are in theory quite useful though. If anyone cares for my opinion on this, I'd propose that this should be done using a proper definition in the spec. I think that ideally each ZIM should be able to define some metadata indicating what content it provides, which can then be used by the reader to dynamically open the ZIM from other ZIM files. If, for example, we had a sotoki ZIM that contains a wikipedia link, then ideally the wikipedia link should work regardless of which other ZIMs are present on the device. Thus, the sotoki ZIM should contain regular wikipedia links. The ZIM reader could then check if there's another ZIM that provides that path (e.g. a wikipedia ZIM) using the metadata and open that ZIM. Just an idea I had some time ago.
Ah, sorry about that, that's a minor mistake on my part. I rewrote the example above a couple of times trying to improve clarity, one of those edits must have been incomplete. You're right that the query part shouldn't be resolved, but that shouldn't have a bearing on this issue+fix. The key issue is still that
There's actually a short section about this in the zim wiki:
In other words, an HTML link to
Thus, my interpretation of the correct behavior would be:
|
Greetings,
it's been a while since I've done anything with C++, so feel free to tear the suggested changes of this PR apart.
This is just a simple "I've found a bug and fixed it" PR otherwise.
Bug report (fix follows in the next section)
Observed Behavior:
I've recently found zimcheck failing with an error message indicating that a URL was out of bounds despite being valid.
Let's assume the following ZIM structure:
The URL was something along the lines of ../../tools/tool.html?path=../dir/subdir/file.html with the base directory being /dir/subdir/index.html. Now, this URL should resolve to /tools/tool.html?path=../dir/subdir/file.html. Entirely valid, yet zimcheck fails due to wrongly identifying the above link as out of bounds (meaning that it reaches beyond the ZIM root directory).
Here's a minimal example of the failing code:
Expected behavior:
As you can probably guess from the previous subsection, the expected behavior is that the link shouldn't be identified as an out of bounds link. Consequently, zimcheck should not fail the zim.
The fix
The problem lies in the isOutofBounds function, which doesn't properly parse the HTML links. Rather, it interprets the links as filesystem paths, which is probably sort-of valid for some file types, but for HTML files the links must be interpreted as hyperlinks. The isOutofBounds function is rather crude in that it only compares the number of occurrences of / and ../ against each other rather than actually understanding how a hyperlink URL may be structured.
The bug in this PR can be fixed by limiting the range in which the ../ occurrences are counted by the first occurrence of ?.
The bug in this PR can be fixed by properly extracting the path component of the URL and only checking isOutOfBounds on that part.
Important note: this fix is just as "crude" as the original function. It doesn't really resolve the underlying issue - that being the lack of understanding of how a URL works - and thus doesn't fix like a dozen other potential issues I can think of. For example, should a URL specify a fragment containing ../, this same issue may occur.
Other stuff
This PR contains a new function extractPathFromLink, which is used to get the path component of an HTTP URL, as well as tests for both this new function and the bug described in this issue.
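To make the difference between a counting heuristic and a path-based check concrete, here is a minimal sketch. All three functions are hypothetical illustrations and are not the actual isOutofBounds or extractPathFromLink code from this PR. `naiveOutOfBounds` assumes a base path that starts with `/` and simply compares the number of `../` in the link with the directory depth of the base; `pathOnly` cuts the link at the first `?` or `#` and does not handle a scheme or host.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: count non-overlapping occurrences of `needle` in `s`.
std::size_t countOccurrences(const std::string& s, const std::string& needle)
{
    std::size_t n = 0;
    for (std::size_t pos = s.find(needle);
         pos != std::string::npos;
         pos = s.find(needle, pos + needle.size()))
        ++n;
    return n;
}

// Hypothetical counting heuristic: a link is flagged when it contains
// more "../" than the base path has directories. It treats the whole
// link as a filesystem path, including any query or fragment.
bool naiveOutOfBounds(const std::string& link, const std::string& basePath)
{
    const std::size_t slashes = countOccurrences(basePath, "/");
    // "/dir/subdir/index.html" has 3 slashes and 2 directories.
    const std::size_t depth = slashes > 0 ? slashes - 1 : 0;
    return countOccurrences(link, "../") > depth;
}

// Hypothetical path extraction: keep everything before the first '?'
// or '#'. Assumes a relative link without scheme or host.
std::string pathOnly(const std::string& link)
{
    return link.substr(0, link.find_first_of("?#"));
}
```

Under these assumptions, `naiveOutOfBounds("../../tools/tool.html?path=../dir/subdir/file.html", "/dir/subdir/index.html")` returns true, because the `../` inside the query is counted as a third ascent. Running the same check on `pathOnly(...)` of that link returns false, and a fragment containing `../` is ignored in the same way.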