Skip to content

zimcheck: fix isOutofBounds bug related to URL queries#488

Open
IMayBeABitShy wants to merge 2 commits intoopenzim:mainfrom
IMayBeABitShy:main
Open

zimcheck: fix isOutofBounds bug related to URL queries#488
IMayBeABitShy wants to merge 2 commits intoopenzim:mainfrom
IMayBeABitShy:main

Conversation

@IMayBeABitShy
Copy link
Copy Markdown

@IMayBeABitShy IMayBeABitShy commented Jan 11, 2026

Greetings,

it's been a while since I've done anything with C++, so feel free to tear the suggested changes of this PR apart.
This is just a simple "I've found a bug and fixed it" PR otherwise.

Bug report (fix follows in the next section)

Observed Behavior:

I've recently found zimcheck failing with the error message indicating that an URL was out of bounds despite being valid.
Let's assume the following ZIM structure:

ZIM_ROOT (/)
 - tools/
    - tool.html
 - dir/
    - subdir/
      - index.html
      - file.html

The URL was something along the lines of ../../tools/tool.html?path=../dir/subdir/file.html with the base directory being /dir/subdir/index.html. Now, this URL should resolve to /tools/tool.html?path=../dir/subdir/file.html. Entirely valid, yet zimcheck fails due to wrongly identifying the above link as out of bounds (meaining that it reaches beyond the ZIM root directory).

Here's a minimal example of the failing code:

int main() {
    std::string inp = "../../tool.html?path=../dir/subdir/file.html";
    std::string path = "dir/subdir/index.html";
    auto baseUrl = path;
    auto pos = baseUrl.find_last_of('/');
    baseUrl.resize( pos==baseUrl.npos ? 0 : pos );

    cout << "path: " << path << "\n";
    cout << "base: " << baseUrl << "\n";
    cout << "inp:  " << inp << "\n";

    if (isOutofBounds(inp, baseUrl)) {
        cout << "OOB: true\n";
    } else {
        cout << "OOB: false\n";
    }
    return 0;
}

Expected behavior:

As you can probably guess from the previous subsection, the expected behavior is that the link shouldn't be identified as an out of bounds link. Consequently, zimcheck should not fail the zim.

The fix

The problem lies in the isOutofBounds function, which doesn't properly parse the HTML links. Rather, it interprets the links as filesystem paths. Which is probably sort-of valid for some file types, but for HTML files the links must be interpreted as hyperlinks. The isOutofBounds function is rather crude in that it only compares to number of occourences of / and ../ against each other rather than actualyl understanding how a hyperlink URL may be structured.

The bug in this PR can be fixed by limiting the range in which the ../ occurences are counted by the first occurence of ?. The bug in this PR can be fixed by properly extracting the path component of the URL and only checking isOutOfBounds on that part.

Important note:: this fix is just as "crude" as the original function. It doesn't really resolve the underlying issue - that being the lack of understanding of how a URL works - and thus doesn't fix like a dozen other potential issues I can think of. For example, should a URL specify a fragment containing ../, this same issue may occur.

Other stuff

This PR contains a new function extractPathFromLink, which is used to get the path component of a http URL, as well as tests for both this new function and the bug described in this issue.

@kelson42 kelson42 added this to the 3.7.0 milestone Jan 12, 2026
@veloman-yunkan
Copy link
Copy Markdown
Collaborator

I think that fixing isOutOfBounds() in that way is not enough - normalize_link() will have to be fixed too (since it relies on the structure of relative URLs to conform to the requirements imposed by the current implementation of isOutOfBounds()).

@veloman-yunkan
Copy link
Copy Markdown
Collaborator

It also seems that there is another problem with isOutOfBounds() - it can be fooled by the presence of ../ in the search component (that after the ? symbol) of the URL. isOutOfBounds() should only analyze the path component.

@IMayBeABitShy
Copy link
Copy Markdown
Author

I've changed this PR so that it instead attempts to properly parse the URL. isOutOfBounds will now call the new extractPathFromLink() function and only process the actual path of the URL. I have yet to take a look at normalie_link(), though.

@IMayBeABitShy
Copy link
Copy Markdown
Author

@veloman-yunkan Could you elaborate on the changes needed for normalize_link()? I am unsure what the intended behavior is. The comments in the header file is as follows:

//Removes extra spaces from URLs. Usually done by the browser, so web authors sometimes tend to ignore it.
//Converts the %20 to space.Essential for comparing URLs.

The function already takes care of the space related issues as well as properly ignoring the query string and fragment in the URL. We could limit it so that it only removes spaces from the path component of a link, however the comments indicate that the function is used for comparing paths. Normalizing the links by stripping stuff like "https://" may lead to wrong comparison results. How should a link like "https://example.org/some%20path/?k=v" be normalized?

@veloman-yunkan
Copy link
Copy Markdown
Collaborator

@veloman-yunkan Could you elaborate on the changes needed for normalize_link()? I am unsure what the intended behavior is. The comments in the header file is as follows:

//Removes extra spaces from URLs. Usually done by the browser, so web authors sometimes tend to ignore it.
//Converts the %20 to space.Essential for comparing URLs.

The function already takes care of the space related issues as well as properly ignoring the query string and fragment in the URL. We could limit it so that it only removes spaces from the path component of a link, however the comments indicate that the function is used for comparing paths. Normalizing the links by stripping stuff like "https://" may lead to wrong comparison results. How should a link like "https://example.org/some%20path/?k=v" be normalized?

The doc comment appears to be outdated. normalize_link() converts links into a canonical form, a most important part of which is turning relative links into absolute links.

@veloman-yunkan
Copy link
Copy Markdown
Collaborator

BTW, I remember that there once has been a similar discussion of relative links going out of bounds (ascending with a series of ../ above the top location) and it was pointed out that it is a perfectly valid URL according to the spec. I will try to find that comment exchange, however I still want to point out that maybe in the context of ZIM files we shouldn't allow such "overflowing" relative URLs (and, hence, handling them in the scrapers) for the following reason. Suppose that a ZIM file is served by kiwix-serve. Then such (internal) links will resolve differently depending on the value of the --urlRootLocation option of kiwix-serve. Though it can also be taken advantage of - by allowing cross-ZIM-file links (assuming that the ZIM files are served by the same instance of kiwix-serve).

@kelson42 @rgaudin @benoit74 What do you think?

@veloman-yunkan
Copy link
Copy Markdown
Collaborator

The URL was something along the lines of ../../tools/tool.html?path=../dir/subdir/file.html with the base directory being /dir/subdir/index.html. Now, this URL should resolve to /tools/tool.html?path=../../file.html

@IMayBeABitShy After re-reading your bug report, I think that my initial interpretation of it was different from what you had in mind. To the best of my understanding (I haven't consulted any specification, but my intuition tells me so) the search (query) component of a URL isn't subject to relative URL resolution. Thus the URL in your example should resolve to /tools/tool.html?path=../dir/subdir/file.html.

@veloman-yunkan
Copy link
Copy Markdown
Collaborator

@veloman-yunkan Could you elaborate on the changes needed for normalize_link()? I am unsure what the intended behavior is. The comments in the header file is as follows:

//Removes extra spaces from URLs. Usually done by the browser, so web authors sometimes tend to ignore it.
//Converts the %20 to space.Essential for comparing URLs.

The function already takes care of the space related issues as well as properly ignoring the query string and fragment in the URL. We could limit it so that it only removes spaces from the path component of a link, however the comments indicate that the function is used for comparing paths. Normalizing the links by stripping stuff like "https://" may lead to wrong comparison results. How should a link like "https://example.org/some%20path/?k=v" be normalized?

The doc comment appears to be outdated. normalize_link() converts links into a canonical form, a most important part of which is turning relative links into absolute links.

I guess the possibility of links containing search and/or fragment components has been completely overlooked. We now have to define how to handle them. My feeling is that in the context of zimcheck they should be simply discarded.

@IMayBeABitShy
Copy link
Copy Markdown
Author

@veloman-yunkan

[...] however I still want to point out that maybe in the context of ZIM files we shouldn't allow such "overflowing" relative URLs (and, hence, handling them in the scrapers) for the following reason. Suppose that a ZIM file is served by kiwix-serve. Then such (internal) links will resolve differently depending on the value of the --urlRootLocation option of kiwix-serve. Though it can also be taken advantage of - by allowing cross-ZIM-file links (assuming that the ZIM files are served by the same instance of kiwix-serve).

I strongly agree. The specification does not, for the best of my knowledge, guarantee any URL scheme on how ZIM files should be served. I think a ZIM file may be served on any path, including / and there's no guarantee that the first section of the path is the name of the ZIM file (similiar to how kiwix-serve operates). Also, how would cross-ZIM-file links reliable work with readers that are not / don't use webservers?

Off-topic (can be skipped): Cross-ZIM-file links are in theory quite useful though. If anyone cares for my opinion on this, I'd propose that this should be done using a proper definition in the spec. I think that ideally each ZIM should be able to define some metadata indicating what content it provides, which can then be used by the reader to dynamically open the ZIM from other ZIM files. If, for example, we had a sotoki ZIM that contains a wikipedia link, then ideally the wikipedia link should work regardless of which other ZIMs are present on the device. Thus, the sotoki ZIM should contain regular wikipedia links. The ZIM reader could then check if there's another ZIM that provides that path (e.g. a wikipedia ZIM) using the metadata and open that ZIM. Just an idea I had some time ago.

The URL was something along the lines of ../../tools/tool.html?path=../dir/subdir/file.html with the base directory being /dir/subdir/index.html. Now, this URL should resolve to /tools/tool.html?path=../../file.html

@IMayBeABitShy After re-reading your bug report, I think that my initial interpretation of it was different from what you had in mind. To the best of my understanding (I haven't consulted any specification, but my intuition tells me so) the search (query) component of a URL isn't subject to relative URL resolution. Thus the URL in your example should resolve to /tools/tool.html?path=../dir/subdir/file.html.

Ah, sorry about that, that's a minor mistake on my part. I rewrote the example above a couple of times trying to improve clarity, one of those edits must have been incomplete. You're right that the query part shouldn't be resolved, but that shouldn't have a bearing on this issue+fix. The key issue is still that zimcheck counts the occurences of ../ in the query and fragment parts of the URL when trying to determine wheter a link is out of bound or not.

I guess the possibility of links containing search and/or fragment components has been completely overlooked. We now have to define how to handle them. My feeling is that in the context of zimcheck they should be simply discarded.

There's actual a short section about this in the zim wiki:

Or if you want to store a ZIM entry at index.html?param=value, the HTML document pointing to it will have to use the index.html%3Fparam%3Dvalue href. [...] When serving web content (which is usually the case), some readers process the requests and already do the url decoding internally, whereas most readers will handle the paths directly. The same applies to querystring which might be absorbed by some webservers and not passed to the libzim. [...]

In other words, a HTML link to dir/file?param=value does not point to the file dir/file?param=value, as in this case ?param=value is still to be interpreted as a regular parameters. Actually linking to such a file would require the a link to dir/file%3Fparam%3Dvalue.

Thus, my interpretation of the correct behavior would be:

  • zim paths are allowed to have a form like index.html?param=value, which would be a different file than index.html
  • HTML links like href="index.html?param=value" should still point to index.html with the parameters "param=value", whereas href="index.html%3Fparam%3Dvalue" should point to index.html?param=value with no parameters.
    • consequently, zimcheck should dismiss any non-encoded parameters from the URL when checking if a URL exists

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants