* On discrete GPUs that expose PCI-E window memory (device local but still host
visible) this memory is extremely slow to read from on the CPU. It's
significantly faster to issue a command buffer to get the GPU to copy into CPU
memory and wait on that command buffer to finish, then read from the copy.
* We do this for detected coherent writes in queue submit, issuing the copy on
the queue being submitted to. We do *not* do this for memory unmaps or
explicit application flushes, so those will remain slow. With no queue to
use in those paths the synchronisation challenges become more significant,
and most applications leave memory persistently mapped.
* We only care about tracking two things:
1. Resources that have been written very recently. These should not be
postponed as there's a high chance they'll be written mid-frame and so we'd
need their initial contents.
2. Resources whose last non-complete-write reference was a while ago.
However, in the second case we can acceptably ignore any resources that haven't
been written recently either, since if the resource hasn't been written and
also hasn't been complete-written then it hasn't been used at all.
* So when updating the non-complete-write time we only do this if the resource
has had a write reference, and intermittently we remove any resources that
haven't had a write at all.
* Postponed resources will be exactly the same set, because we treat a resource
as postponable if we have no write time for it at all so it's fine to remove
old resources from the list. Fewer resources will be skipped, as we now treat
resources that have no known age as non-skippable. However, in the majority of
these cases we expect either that the resource is not used at all (so the
postpone will never be forced to prepare and we won't serialise anything), or,
if it is used, that it will most likely be used read-only, so the postpone
will still be enough.
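
The age tracking described above can be sketched roughly as follows. This is a minimal illustration and not the actual implementation: `AgeTracker`, its method names, the use of `double` timestamps, and the choice to prune with the same window used for "recently written" are all assumptions made for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <unordered_map>

// Hypothetical names throughout; nothing here is taken from the real code.
using ResourceId = std::uint64_t;

class AgeTracker
{
public:
  // recentWindow: how long after a write a resource counts as "recently written".
  // staleAfter: how long without a non-complete-write reference before a
  // resource is considered skippable.
  AgeTracker(double recentWindow, double staleAfter)
      : m_RecentWindow(recentWindow), m_StaleAfter(staleAfter)
  {
    assert(recentWindow > 0.0 && staleAfter > 0.0);
  }

  // Any write reference creates or refreshes the entry. A new entry starts with
  // its non-complete-write time set to 'now' (a conservative assumption).
  void MarkWrite(ResourceId id, double now)
  {
    auto it = m_Ages.find(id);
    if(it == m_Ages.end())
      m_Ages.emplace(id, Age{now, now});
    else
      it->second.lastWrite = now;
  }

  // Only updated if the resource has already had a write reference, so
  // resources that are never written never get an entry.
  void MarkNonCompleteWriteRef(ResourceId id, double now)
  {
    auto it = m_Ages.find(id);
    if(it != m_Ages.end())
      it->second.lastNonCompleteRef = now;
  }

  // No write time at all, or an old write time, both mean postponable.
  bool IsPostponable(ResourceId id, double now) const
  {
    auto it = m_Ages.find(id);
    return it == m_Ages.end() || now - it->second.lastWrite > m_RecentWindow;
  }

  // Resources with no known age are treated as non-skippable.
  bool IsSkippable(ResourceId id, double now) const
  {
    auto it = m_Ages.find(id);
    return it != m_Ages.end() && now - it->second.lastNonCompleteRef > m_StaleAfter;
  }

  // Run intermittently: drop entries with no recent write. This cannot change
  // the result of IsPostponable, since both cases already return true.
  void Prune(double now)
  {
    for(auto it = m_Ages.begin(); it != m_Ages.end();)
    {
      if(now - it->second.lastWrite > m_RecentWindow)
        it = m_Ages.erase(it);
      else
        ++it;
    }
  }

  std::size_t Size() const { return m_Ages.size(); }

private:
  struct Age
  {
    double lastWrite;
    double lastNonCompleteRef;
  };

  double m_RecentWindow;
  double m_StaleAfter;
  std::unordered_map<ResourceId, Age> m_Ages;
};
```

In this sketch pruning only ever removes entries that `IsPostponable` already reports as postponable, which is why the postponed set stays the same, while `IsSkippable` becomes stricter because a pruned or never-written resource has no entry.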
* This means we don't have to iterate the whole bindrefs array every time we
want to propagate references in the background, and can instead submit them in
batches.
* Almost all dirty-able resources (memory and images) become dirty almost
immediately, so spending time tracking dirty state is wasted. Instead we treat
these resources as dirty at creation and rely on the postponing logic to avoid
preparing initial states for newly created resources that are not used in the
frame.
* This may cause more 'last-minute' postponed prepares for newly created
resources, which would previously.
* Technically the code is incorrect, because the C++ spec is terrible and makes
completely normal things illegal. GCC decides that a couple of % more perf is
worth breaking lots of code, so instead we disable this class of
"optimisation".