Skip to content

Use file hashes in addition to timestamps, fixes #1459#2735

Draft
jhasse wants to merge 5 commits intoninja-build:masterfrom
jhasse:file-hashing
Draft

Use file hashes in addition to timestamps, fixes #1459#2735
jhasse wants to merge 5 commits intoninja-build:masterfrom
jhasse:file-hashing

Conversation

@jhasse
Copy link
Copy Markdown
Collaborator

@jhasse jhasse commented Feb 25, 2026

Save file hashes together with their timestamp in .ninja_hashes. Then on the next build read that file and if only the timestamp changed (i.e. the generated hash of the content is the same as in the file) "fake" its mtime internally. So when comparing the file against others it's treated as if it was in the same state as before (which it is, because the content didn't change).

To make sure that the mtime isn't used anywhere introduce OlderThan and NewerThan in Node. Node::exists_ and mtime_ (which had magic numbers for its state) are replaced by an algebraic data type:

  std::variant<Unknown, Missing, Exists> status_;

A new class HashCache handles .ninja_hashes and the calculation of hashes via FNV-1a. It currently uses C file IO instead of DiskInterface to read files, so there are no unit tests yet.

To make finding bugs easier DiskInterface::Stat now uses exceptions for error handling.

TODO:

  • Make this optional (--hash?)
  • Save .ninja_hashes in builddir
  • Add (integration?) tests and fix bugs
  • Correctly detect directories vs empty files
  • Use the same hash function (i.e. rapidhash) as elsewhere?

Looking for feedback :)

Fixes #1459.

Save file hashes together with their timestamp in .ninja_hashes. Then on
the next build read that file and if only the timestamp changed (i.e.
the generated hash of the content is the same as in the file) "fake" its
mtime internally. So when comparing the file against others it's treated
as if it was in the same state as before (which it is, because the
content didn't change).

To make sure that the mtime isn't used anywhere introduce OlderThan and
NewerThan in Node. Node::exists_ and mtime_ (which had magic numbers for
its state) are replaced by an algebraic data type:
  std::variant<Unknown, Missing, Exists> status_;

A new class HashCache handles .ninja_hashes and the calculation of
hashes via FNV-1a. It currently uses C file IO instead of DiskInterface
to read files, so there are no unit tests yet.

To make finding bugs easier DiskInterface::Stat now uses exceptions for
error handling.

TODO:
* Make this optional (`--hash`?)
* Save .ninja_hashes in builddir
* Add (integration?) tests and fix bugs
* Correctly detect directories vs empty files
* Use the same hash function (i.e. rapidhash) as elsewhere?
};
ExistenceStatus exists_ = ExistenceStatusUnknown;
struct Exists { TimeStamp mtime; };
std::variant<Unknown, Missing, Exists> status_ = Unknown{};
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a great idea to use algebraic types but using std::variant<> and std::optional<> in this implementation makes each Node instance's size increase by about 20 bytes, which is a lot compared to using a char enum + a 64-bit timestamp value.

Consider something simpler, for example:

  • Make Timestamp a proper 64-bit algebraic class, with sentinel values to model Missing / Error / Value (don't use std::variant<> which will make that type 12 bytes). That's where the NewerThan() and OlderThan() methods belong, as they will make everything more readable.

  • Keep the status_ char enum and ensure it is never accessed directly, just like the internal timestamp_ value, and use Accessor methods to set or retrieve both in consistent ways.

This will keep the implentation price low while making all uses easier to understand.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm ... why is it 20 bytes? Doesn't std::optional add 1 byte and with aliasing 8?

I could introduce MissingAndNotSetYet to get rid of the std::optional.

In any case: As this is WIP I didn't want to do any premature optimizations and focus on ergonomics first :)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right, it's only 16 extra bytes, not 20, which is still a lot. What happens is the following:

  • std::optional<> adds one boolean, which ends up taking 8 extra bytes due to alignment.

  • std::variant<> does the same, and you have an std::optional<> in one of your std::variant<> sub-types, for a total of 16 extra bytes, in additional to the uint64_t timestamp, so a total size of 24 bytes.

Demonstrated here https://godbolt.org/z/5MWqqsed3

So in the end, before your change, the in-memory layout of the type, starting from the mtime_ field looked like:

offset   size   name

     0      8   mtime_
     8      1   exists_
     9      1   dirty_
    10      1   dyndep_pending_
    11      1   generated_by_dep_loader_
    12      4   id_
    16      8   in_edge_
    24

After you change, this becomes:

offset   size   name

     0     24   status_ (includes timestamp)
    24      1   dirty_
    25      1   dyndep_pending_
    26      1   generated_by_dep_loader_
    27      1   <padding>
    28      4   id_
    32      8   in_edge_
    40

So 16 extra bytes for no more information stored in the class. It might be simpler to make Timestamp a real struct with internal sentinel values like -1 and -2 to indicate unknown and missing, respectively.

}
fclose(f);
if (hash == 14695981039346656037ULL) {
return std::nullopt; // File is empty or directory, don't cache this
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

An empty file is a perfectly valid input or output, there is no reason to return std::nullopt for them.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this is a big TODO. We need to check if the path is a directory.

}

/// stat() a file, returning the mtime. Throws std::runtime_error if missing
/// or other error.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

first, the comment is wrong since the function doesn't throw if the file is missing.

Second, please do not introduce random exception-throwing functions in the source code, as it makes reasoning about all possible exit paths impossible in a code base that doesn't strictly use exceptions and RAII values everywhere.

For example, you had to introduce various try { .. } catch { .. } statements in this PR, but the calls in line 117 of graph.cc is not protected, meaning that now Ninja will not report the error properly as it used to do, but instead crash badly, which is really bad user experience. There are many other calls like that, which change the runtime behavior of the functions that contain them.

For simplicity Stat() should continue to report an error. An improvement would be to use algebraic types like std::expected<R,E> to let functions return either a result value or an error condition, but for C++17 this would require writing a custom template.

In all cases, this has absolutely nothing to do with implementing a hash cache, maybe put this type of changes in a separate PR, or at a minimum into a separate commit.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

first, the comment is wrong since the function doesn't throw if the file is missing.

Right, it returns std::nullopt in that case. I will update the comment, thanks :)

Second, please do not introduce random exception-throwing functions in the source code, as it makes reasoning about all possible exit paths impossible in a code base that doesn't strictly use exceptions and RAII values everywhere.

As I said in the commit message I used exceptions here to make finding bugs easier and it helped me a lot: I wanted to see where errors are okay and where they are not.
It helped me understand how this function is used in various places and I actually found a bug on the master branch with this: 27e545a

The current error handling in Ninja is very error-prone, I guess there are lots of hidden bugs left. Return codes are often unchecked and not propagated. Unit tests (and normal code, too) often reuse std::string err and don't assert that it's empty. It's hard to do refactoring like this.

For example, you had to introduce various try { .. } catch { .. } statements in this PR, but the calls in line 117 of graph.cc is not protected, meaning that now Ninja will not report the error properly as it used to do, but instead crash badly, which is really bad user experience. There are many other calls like that, which change the runtime behavior of the functions that contain them.

That is easily fixable by catching in main. Having integration test for such cases would also be nice.

In all cases, this has absolutely nothing to do with implementing a hash cache, maybe put this type of changes in a separate PR, or at a minimum into a separate commit.

You're right. I wanted to do that at first, but I was too lazy in the end and wanted to get this PR started. Sorry! Will continue working on this and see where it goes :)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not so easily fixable. For example, catching the exception in main during the build would mean the lock file would never be removed, and every new invocation of Ninja would fail. There are probably plenty of other subtle issues.

Generally speaking, it's very hard to ensure exception safety if RAII types aren't used consistently across the whole code base, which is clearly not the case in Ninja (e.g. the ownership of the Subprocess pointer values returned by various SubprocessSet functions is not documented and error prone, and there are other examples where stuff is allocated and their raw pointers are passed around).

I.e. this requires discipline that does not exist in the code, and you are just introducing random untested failure case by sprinkling exceptions here and there in this code.

In addition, this makes reasoning about what the code does or should do much more difficult. It's unclear how errors should be reported or processed now, since to be safe you should now consider both the error code returned by functions, and possible exceptions paths.

It would be much easier to first get rid of all the non-RAII values in Ninja first, then introduce exceptions. Alternatively, use the equivalent of std::expected<R,E> to return results or error conditions and treat them properly. Either method is good as long as it is used consistently in the code. Mixing them just makes things confusing and difficult to maintain over the long term.

In all cases, getting rid of the various raw pointer passing and memory leaks would be nice too, independent from these considerations.

Copy link
Copy Markdown
Collaborator Author

@jhasse jhasse Mar 9, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well resources get leaked when doing

if (!err.empty()) {
    Fatal(...);
}

anywhere in the code, too, don't they? And with exceptions there's a least a way to fix that thanks to stack unwinding.

std::expected<R, E> is C++23 so not an option.

I agree with most of what you said though. Current state of this PR is WIP :)

if (auto* exists = std::get_if<Exists>(&status_)) {
return exists->mtime;
}
throw std::runtime_error(std::holds_alternative<Unknown>(status_)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: consider using the helper methods like status_known() or exists() to reduce visual clutter, as std::holds_alternative<Unknown>(status_) is not the most readable expression.

bool Node::OlderThan(const Node* other) const {
if (auto* missing = std::get_if<Missing>(&status_)) {
if (auto* other_missing = std::get_if<Missing>(&other->status_)) {
return missing->latest_mtime_of_deps < other_missing->latest_mtime_of_deps;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

isn't latest_mtime_of_deps an std::optional<>, why not check for std::nullopt to avoid runtime errors?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good question. I wonder how comparing two std::optionals works? I will add error checking to have no surprises.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally speaking, if the two values are of different types, and not easily convertible, the result is undefined behavior :-/

@jhasse
Copy link
Copy Markdown
Collaborator Author

jhasse commented Feb 27, 2026

Thanks for the comments @digit-google!

I forgot to mention one FIXME of this PR: It currently enables "restat" on all edges. This makes sense when using file hashes because the case that the hash of an output does NOT change is way more common than the mtime of an output not changing.
For example when changing a comment in a source file the object timestamp changes but not its hash so linking can be skipped. Without this, hashing would be way less useful and it's also how other build systems like SCons work.

It obviously can't stay like this: restat should only be enabled when hashing is active, too. I will fix this with the introduction of the --hash flag.

@mcprat
Copy link
Copy Markdown
Contributor

mcprat commented Feb 27, 2026

my two cents really quick:

this should be possible as a user integration, perhaps even already possible and just needs someone to write a guide for it, and if not quite possible, I think it would be a healthier change to the project to just add that functionality to specify forcing a build based on hash in the build.ninja file

for example it's already possible with Make to call a script to compare hashes of source files and then add a variable FORCE dependency based on that result

@jhasse
Copy link
Copy Markdown
Collaborator Author

jhasse commented Mar 3, 2026

@mcprat Yes, but I think there's huge value in moving the decision whether to use hashes from configure time to build time. Often I don't need it, but want to be able to turn it for a short period of time, e.g. when switch branches a lot (yes I know there are workarounds).

Implementing this in build.ninja would also come with other disadvantages, e.g. every command has to be wrapped making it harder to inspect what is going on.

Having this in Ninja itself is an often requested feature for a reason.

@mcprat
Copy link
Copy Markdown
Contributor

mcprat commented Mar 4, 2026

I was thinking more of just adding an option per target, so that for whatever target that option is enabled, ninja will refer to the checksums saved for each source file after the target was last built, rebuild it, then save the new checksums

so it would still be a change to ninja, but not a command line flag or automatic, so default behavior and invocation is the same

@t-m-w
Copy link
Copy Markdown

t-m-w commented Mar 6, 2026

I was thinking more of just adding an option per target

As someone observing this from the peanut gallery and fantasizing about this work someday being ported to and beneficial to AOSP - which has an incredible number of targets all over the place - I would definitely prefer not needing to modify all those targets individually.

@moritzx22
Copy link
Copy Markdown
Contributor

moritzx22 commented Mar 17, 2026

Thoughts on the hash option

According to the discussion in issue #1459, the key insight is that only files checked into version control (pure inputs) meaningfully influence the hash. Intermediate files generated during the build and not marked as restat do not add significant value when hashed, because the dependency chain will not stop rebuilding even if their hashes remain unchanged.

An option --hash=[all|input] could be introduced to control which files contribute to the build‑graph hash.

--hash=input therefore hashes only true input files and reduces the hashing overhead, while --hash=all includes all files.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Option to use file characteristics instead of timestamps

5 participants