Use file hashes in addition to timestamps, fixes #1459#2735
Use file hashes in addition to timestamps, fixes #1459#2735jhasse wants to merge 5 commits intoninja-build:masterfrom
Conversation
Save file hashes together with their timestamp in .ninja_hashes. Then on the next build read that file and if only the timestamp changed (i.e. the generated hash of the content is the same as in the file) "fake" its mtime internally. So when comparing the file against others it's treated as if it was in the same state as before (which it is, because the content didn't change). To make sure that the mtime isn't used anywhere introduce OlderThan and NewerThan in Node. Node::exists_ and mtime_ (which had magic numbers for its state) are replaced by an algebraic data type: std::variant<Unknown, Missing, Exists> status_; A new class HashCache handles .ninja_hashes and the calculation of hashes via FNV-1a. It currently uses C file IO instead of DiskInterface to read files, so there are no unit tests yet. To make finding bugs easier DiskInterface::Stat now uses exceptions for error handling. TODO: * Make this optional (`--hash`?) * Save .ninja_hashes in builddir * Add (integration?) tests and fix bugs * Correctly detect directories vs empty files * Use the same hash function (i.e. rapidhash) as elsewhere?
| }; | ||
| ExistenceStatus exists_ = ExistenceStatusUnknown; | ||
| struct Exists { TimeStamp mtime; }; | ||
| std::variant<Unknown, Missing, Exists> status_ = Unknown{}; |
There was a problem hiding this comment.
It's a great idea to use algebraic types but using std::variant<> and std::optional<> in this implementation makes each Node instance's size increase by about 20 bytes, which is a lot compared to using a char enum + a 64-bit timestamp value.
Consider something simpler, for example:
-
Make Timestamp a proper 64-bit algebraic class, with sentinel values to model Missing / Error / Value (don't use std::variant<> which will make that type 12 bytes). That's where the NewerThan() and OlderThan() methods belong, as they will make everything more readable.
-
Keep the
status_char enum and ensure it is never accessed directly, just like the internal timestamp_ value, and use Accessor methods to set or retrieve both in consistent ways.
This will keep the implentation price low while making all uses easier to understand.
There was a problem hiding this comment.
Hm ... why is it 20 bytes? Doesn't std::optional add 1 byte and with aliasing 8?
I could introduce MissingAndNotSetYet to get rid of the std::optional.
In any case: As this is WIP I didn't want to do any premature optimizations and focus on ergonomics first :)
There was a problem hiding this comment.
You're right, it's only 16 extra bytes, not 20, which is still a lot. What happens is the following:
-
std::optional<>adds one boolean, which ends up taking 8 extra bytes due to alignment. -
std::variant<>does the same, and you have an std::optional<> in one of your std::variant<> sub-types, for a total of 16 extra bytes, in additional to theuint64_ttimestamp, so a total size of 24 bytes.
Demonstrated here https://godbolt.org/z/5MWqqsed3
So in the end, before your change, the in-memory layout of the type, starting from the mtime_ field looked like:
offset size name
0 8 mtime_
8 1 exists_
9 1 dirty_
10 1 dyndep_pending_
11 1 generated_by_dep_loader_
12 4 id_
16 8 in_edge_
24
After you change, this becomes:
offset size name
0 24 status_ (includes timestamp)
24 1 dirty_
25 1 dyndep_pending_
26 1 generated_by_dep_loader_
27 1 <padding>
28 4 id_
32 8 in_edge_
40
So 16 extra bytes for no more information stored in the class. It might be simpler to make Timestamp a real struct with internal sentinel values like -1 and -2 to indicate unknown and missing, respectively.
| } | ||
| fclose(f); | ||
| if (hash == 14695981039346656037ULL) { | ||
| return std::nullopt; // File is empty or directory, don't cache this |
There was a problem hiding this comment.
An empty file is a perfectly valid input or output, there is no reason to return std::nullopt for them.
There was a problem hiding this comment.
Yes, this is a big TODO. We need to check if the path is a directory.
src/disk_interface.h
Outdated
| } | ||
|
|
||
| /// stat() a file, returning the mtime. Throws std::runtime_error if missing | ||
| /// or other error. |
There was a problem hiding this comment.
first, the comment is wrong since the function doesn't throw if the file is missing.
Second, please do not introduce random exception-throwing functions in the source code, as it makes reasoning about all possible exit paths impossible in a code base that doesn't strictly use exceptions and RAII values everywhere.
For example, you had to introduce various try { .. } catch { .. } statements in this PR, but the calls in line 117 of graph.cc is not protected, meaning that now Ninja will not report the error properly as it used to do, but instead crash badly, which is really bad user experience. There are many other calls like that, which change the runtime behavior of the functions that contain them.
For simplicity Stat() should continue to report an error. An improvement would be to use algebraic types like std::expected<R,E> to let functions return either a result value or an error condition, but for C++17 this would require writing a custom template.
In all cases, this has absolutely nothing to do with implementing a hash cache, maybe put this type of changes in a separate PR, or at a minimum into a separate commit.
There was a problem hiding this comment.
first, the comment is wrong since the function doesn't throw if the file is missing.
Right, it returns std::nullopt in that case. I will update the comment, thanks :)
Second, please do not introduce random exception-throwing functions in the source code, as it makes reasoning about all possible exit paths impossible in a code base that doesn't strictly use exceptions and RAII values everywhere.
As I said in the commit message I used exceptions here to make finding bugs easier and it helped me a lot: I wanted to see where errors are okay and where they are not.
It helped me understand how this function is used in various places and I actually found a bug on the master branch with this: 27e545a
The current error handling in Ninja is very error-prone, I guess there are lots of hidden bugs left. Return codes are often unchecked and not propagated. Unit tests (and normal code, too) often reuse std::string err and don't assert that it's empty. It's hard to do refactoring like this.
For example, you had to introduce various try { .. } catch { .. } statements in this PR, but the calls in line 117 of graph.cc is not protected, meaning that now Ninja will not report the error properly as it used to do, but instead crash badly, which is really bad user experience. There are many other calls like that, which change the runtime behavior of the functions that contain them.
That is easily fixable by catching in main. Having integration test for such cases would also be nice.
In all cases, this has absolutely nothing to do with implementing a hash cache, maybe put this type of changes in a separate PR, or at a minimum into a separate commit.
You're right. I wanted to do that at first, but I was too lazy in the end and wanted to get this PR started. Sorry! Will continue working on this and see where it goes :)
There was a problem hiding this comment.
This is not so easily fixable. For example, catching the exception in main during the build would mean the lock file would never be removed, and every new invocation of Ninja would fail. There are probably plenty of other subtle issues.
Generally speaking, it's very hard to ensure exception safety if RAII types aren't used consistently across the whole code base, which is clearly not the case in Ninja (e.g. the ownership of the Subprocess pointer values returned by various SubprocessSet functions is not documented and error prone, and there are other examples where stuff is allocated and their raw pointers are passed around).
I.e. this requires discipline that does not exist in the code, and you are just introducing random untested failure case by sprinkling exceptions here and there in this code.
In addition, this makes reasoning about what the code does or should do much more difficult. It's unclear how errors should be reported or processed now, since to be safe you should now consider both the error code returned by functions, and possible exceptions paths.
It would be much easier to first get rid of all the non-RAII values in Ninja first, then introduce exceptions. Alternatively, use the equivalent of std::expected<R,E> to return results or error conditions and treat them properly. Either method is good as long as it is used consistently in the code. Mixing them just makes things confusing and difficult to maintain over the long term.
In all cases, getting rid of the various raw pointer passing and memory leaks would be nice too, independent from these considerations.
There was a problem hiding this comment.
Well resources get leaked when doing
if (!err.empty()) {
Fatal(...);
}anywhere in the code, too, don't they? And with exceptions there's a least a way to fix that thanks to stack unwinding.
std::expected<R, E> is C++23 so not an option.
I agree with most of what you said though. Current state of this PR is WIP :)
| if (auto* exists = std::get_if<Exists>(&status_)) { | ||
| return exists->mtime; | ||
| } | ||
| throw std::runtime_error(std::holds_alternative<Unknown>(status_) |
There was a problem hiding this comment.
nit: consider using the helper methods like status_known() or exists() to reduce visual clutter, as std::holds_alternative<Unknown>(status_) is not the most readable expression.
| bool Node::OlderThan(const Node* other) const { | ||
| if (auto* missing = std::get_if<Missing>(&status_)) { | ||
| if (auto* other_missing = std::get_if<Missing>(&other->status_)) { | ||
| return missing->latest_mtime_of_deps < other_missing->latest_mtime_of_deps; |
There was a problem hiding this comment.
isn't latest_mtime_of_deps an std::optional<>, why not check for std::nullopt to avoid runtime errors?
There was a problem hiding this comment.
good question. I wonder how comparing two std::optionals works? I will add error checking to have no surprises.
There was a problem hiding this comment.
Generally speaking, if the two values are of different types, and not easily convertible, the result is undefined behavior :-/
|
Thanks for the comments @digit-google! I forgot to mention one FIXME of this PR: It currently enables "restat" on all edges. This makes sense when using file hashes because the case that the hash of an output does NOT change is way more common than the mtime of an output not changing. It obviously can't stay like this: restat should only be enabled when hashing is active, too. I will fix this with the introduction of the |
|
my two cents really quick: this should be possible as a user integration, perhaps even already possible and just needs someone to write a guide for it, and if not quite possible, I think it would be a healthier change to the project to just add that functionality to specify forcing a build based on hash in the build.ninja file for example it's already possible with Make to call a script to compare hashes of source files and then add a variable FORCE dependency based on that result |
|
@mcprat Yes, but I think there's huge value in moving the decision whether to use hashes from configure time to build time. Often I don't need it, but want to be able to turn it for a short period of time, e.g. when switch branches a lot (yes I know there are workarounds). Implementing this in build.ninja would also come with other disadvantages, e.g. every command has to be wrapped making it harder to inspect what is going on. Having this in Ninja itself is an often requested feature for a reason. |
|
I was thinking more of just adding an option per target, so that for whatever target that option is enabled, ninja will refer to the checksums saved for each source file after the target was last built, rebuild it, then save the new checksums so it would still be a change to ninja, but not a command line flag or automatic, so default behavior and invocation is the same |
As someone observing this from the peanut gallery and fantasizing about this work someday being ported to and beneficial to AOSP - which has an incredible number of targets all over the place - I would definitely prefer not needing to modify all those targets individually. |
Thoughts on the hash optionAccording to the discussion in issue #1459, the key insight is that only files checked into version control (pure inputs) meaningfully influence the hash. Intermediate files generated during the build and not marked as restat do not add significant value when hashed, because the dependency chain will not stop rebuilding even if their hashes remain unchanged. An option
|
Save file hashes together with their timestamp in .ninja_hashes. Then on the next build read that file and if only the timestamp changed (i.e. the generated hash of the content is the same as in the file) "fake" its mtime internally. So when comparing the file against others it's treated as if it was in the same state as before (which it is, because the content didn't change).
To make sure that the mtime isn't used anywhere introduce OlderThan and NewerThan in Node. Node::exists_ and mtime_ (which had magic numbers for its state) are replaced by an algebraic data type:
A new class HashCache handles .ninja_hashes and the calculation of hashes via FNV-1a. It currently uses C file IO instead of DiskInterface to read files, so there are no unit tests yet.
To make finding bugs easier DiskInterface::Stat now uses exceptions for error handling.
TODO:
--hash?)Looking for feedback :)
Fixes #1459.