* Having a single subresource (range) is a common case, so avoid allocating
storage in an array for that; only switch to the array when we have more than
one range to store.
* We don't expect contention on this lock. The only time it will contend is
while actively capturing a frame, between descriptor updates and submits that
read the descriptor contents. We penalise that case to make the background
case faster, since a spinlock is 'free' to take when there's no contention.
* If an application allocates from and resets descriptor pools at very high
frequency, the overhead of freeing and reallocating those descriptor sets can
be high. Instead, use the descriptor pool as a pool for children and check
the freelist first for an existing descriptor set before trying to allocate a
new one.
* This is still accurate; what we're missing is "read data as int, then cast to
float", which is represented by setting 'floatCast' to true. A normalized cast
or interpret is accurately represented by saying the input is snorm/unorm
typed.
* When we roll over from one binding to another because the descriptor count is
larger than a single binding, we need to update the frame reftype, since it
might go from storage to sampled or vice versa and so change from read-only to
read-write.
* While actively capturing we might do significant work to flush coherent mapped
memory regions and prepare initial contents for postponed resources that are
about to be write-referenced. We need to do that before submitting the actual
work to the queue, or else the contents may be corrupted.
* We track memory bindings to see which regions of a memory object are only used
for tiled images, and discard any writes to those regions in case they are
accidentally detected changes made by the GPU, which we don't want to replay.
If a region is aliased by both linear and tiled resources, we still replay the
writes.
* Note that we have to take a slower path involving a copy since we can't
serialise straight into memory in this case, so applications should avoid
mapping memory behind
* We work around a GNOME bug here by ignoring a selected filter if it's the
empty string. For all other unknown filters we try to determine the suffix on
the fly.
* When we changed to serialise render target descriptor contents at list record
time, we also updated all descriptor writes to happen immediately so we'd get
the latest contents. However, we didn't also update copies, so copies before
OMSetRenderTargets weren't properly reflected.
* There's nothing that needs the 'old' copy of descriptors so we can remove any
pending/deferring of updates and do it immediately, which also saves some
tracking.
* The function is illegal to call regardless of whether we get a non-NULL
function pointer. Core GLES doesn't support glBindFragDataLocation, but
fortunately we don't need to call it ourselves unless the user has done some
dynamic binding, which assumes glBindFragDataLocation is available.
* Resources which aren't referenced in the frame don't need initial states
unless we have 'Ref All Resources' enabled. These initial states can be
stripped on replay as they aren't needed.
* We also renamed WrittenRecords to make it more explicit that this is the
list of resources needing initial contents, whether because they were dirty
(and so had initial contents) or because they were written mid-frame and so
need to be reset.
* Instead of waiting for idle, we allocate a command buffer per swapchain image
to render the text overlay and use semaphores and fences to properly
synchronise with other ongoing GPU work.
* On discrete GPUs that expose PCI-E window memory (device local but still host
visible) this memory is extremely slow to read from on the CPU. It's
significantly faster to issue a command buffer to get the GPU to copy into CPU
memory and wait on that command buffer to finish, then read from the copy.
* We do this for detected coherent writes in queue submit, issuing the copy on
the queue being submitted to. We do *not* do this for memory unmaps or
explicit application flushes. This does mean those will remain slow. However,
with no queue to use, the synchronisation challenges become more significant,
and most applications leave memory persistently mapped.
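
As an illustration of the first note (keeping a single range inline and only
switching to an array for more than one), here is a minimal sketch. The names
SubresourceRange, RangeList and their members are hypothetical and are not
taken from the actual code; the real range type and storage will differ.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical range type; the fields are illustrative only.
struct SubresourceRange
{
  uint32_t baseMip;
  uint32_t mipCount;
};

// Stores the first range inline and only uses the heap-backed array once a
// second range is added.
class RangeList
{
public:
  void Add(const SubresourceRange &range)
  {
    if(m_Count == 0)
    {
      // common case: a single range, no array storage needed
      m_Single = range;
    }
    else
    {
      // first overflow: move the inline range into the array
      if(m_Count == 1)
        m_Many.push_back(m_Single);
      m_Many.push_back(range);
    }
    m_Count++;
  }

  size_t Count() const { return m_Count; }

  const SubresourceRange &At(size_t i) const
  {
    assert(i < m_Count);
    return m_Count == 1 ? m_Single : m_Many[i];
  }

  // true while the array has not been needed
  bool IsInline() const { return m_Count <= 1; }

private:
  size_t m_Count = 0;
  SubresourceRange m_Single = {};
  std::vector<SubresourceRange> m_Many;
};
```

Callers only go through Add/At/Count, so the switch from inline storage to the
array is invisible to them; the cost is one extra branch on each access.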
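
The descriptor pool note above (reusing child descriptor sets through a
freelist instead of freeing and reallocating them on every reset) could look
roughly like the following. This is a sketch under assumptions: DescriptorSet,
DescriptorPool, Allocate and Reset are hypothetical stand-ins for the wrapped
objects, not the real API or implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical stand-in for a wrapped descriptor set.
struct DescriptorSet
{
  int layoutId = 0;
};

class DescriptorPool
{
public:
  DescriptorSet *Allocate(int layoutId)
  {
    DescriptorSet *set = nullptr;
    if(!m_FreeList.empty())
    {
      // reuse a set left over from before the last reset
      set = m_FreeList.back();
      m_FreeList.pop_back();
    }
    else
    {
      // nothing to reuse, so allocate a new child owned by the pool
      m_Children.push_back(std::unique_ptr<DescriptorSet>(new DescriptorSet()));
      set = m_Children.back().get();
    }
    set->layoutId = layoutId;
    return set;
  }

  // Resetting keeps the children alive and marks all of them as free.
  void Reset()
  {
    m_FreeList.clear();
    for(const std::unique_ptr<DescriptorSet> &child : m_Children)
      m_FreeList.push_back(child.get());
  }

  size_t ChildCount() const { return m_Children.size(); }

private:
  std::vector<std::unique_ptr<DescriptorSet>> m_Children;
  std::vector<DescriptorSet *> m_FreeList;
};
```

With this shape, an allocate/reset loop at high frequency stops allocating
once the pool has grown to its peak number of sets; the trade-off is that the
children stay alive until the pool itself is destroyed.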