diff options
author | Tvrtko Ursulin <tvrtko.ursulin@intel.com> | 2016-02-26 16:58:32 +0000 |
---|---|---|
committer | Tvrtko Ursulin <tvrtko.ursulin@intel.com> | 2016-03-01 10:36:02 +0000 |
commit | c6a2ac712d7dee13c13e44c4c4184478853dcb37 (patch) | |
tree | bf3ebd11d24cd370214eb93b80bf05202443e08c /drivers/gpu/drm/i915/intel_ringbuffer.h | |
parent | 3ba86073edcbe2be53d9862d5a3098f0ebf8ae9a (diff) | |
download | linux-c6a2ac712d7dee13c13e44c4c4184478853dcb37.tar.gz |
drm/i915: Execlists small cleanups and micro-optimisations
Assorted changes in the areas of code cleanup, reduction of invariant conditional in the interrupt handler and lock contention and MMIO access optimisation. * Remove needless initialization. * Improve cache locality by reorganizing code and/or using branch hints to keep unexpected or error conditions out of line. * Favor busy submit path vs. empty queue. * Less branching in hot-paths. v2: * Avoid mmio reads when possible. (Chris Wilson) * Use natural integer size for csb indices. * Remove useless return value from execlists_update_context. * Extract 32-bit ppgtt PDPs update so it is out of line and shared with two callers. * Grab forcewake across all mmio operations to ease the load on uncore lock and use chepear mmio ops. v3: * Removed some more pointless u8 data types. * Removed unused return from execlists_context_queue. * Commit message updates. v4: * Unclumsify the unqueue if statement. (Chris Wilson) * Hide forcewake from the queuing function. (Chris Wilson) Version 3 now makes the irq handling code path ~20% smaller on 48-bit PPGTT hardware, and a little bit less elsewhere. Hot paths are mostly in-line now and hammering on the uncore spinlock is greatly reduced together with mmio traffic to an extent. Benchmarking with "gem_latency -n 100" (keep submitting batches with 100 nop instruction) shows approximately 4% higher throughput, 2% less CPU time and 22% smaller latencies. This was on a big-core while small-cores could benefit even more. Most likely reason for the improvements are the MMIO optimization and uncore lock traffic reduction. One odd result is with "gem_latency -n 0" (dispatching empty batches) which shows 5% more throughput, 8% less CPU time, 25% better producer and consumer latencies, but 15% higher dispatch latency which is yet unexplained. Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com> Cc: Chris Wilson <chris@chris-wilson.co.uk> Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk> Link: http://patchwork.freedesktop.org/patch/msgid/1456505912-22286-1-git-send-email-tvrtko.ursulin@linux.intel.com
Diffstat (limited to 'drivers/gpu/drm/i915/intel_ringbuffer.h')
-rw-r--r-- | drivers/gpu/drm/i915/intel_ringbuffer.h | 3 |
1 files changed, 2 insertions, 1 deletions
diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.h b/drivers/gpu/drm/i915/intel_ringbuffer.h index 566b0ae10ce0..dd910d30a380 100644 --- a/drivers/gpu/drm/i915/intel_ringbuffer.h +++ b/drivers/gpu/drm/i915/intel_ringbuffer.h @@ -271,7 +271,8 @@ struct intel_engine_cs { spinlock_t execlist_lock; struct list_head execlist_queue; struct list_head execlist_retired_req_list; - u8 next_context_status_buffer; + unsigned int next_context_status_buffer; + unsigned int idle_lite_restore_wa; bool disable_lite_restore_wa; u32 ctx_desc_template; u32 irq_keep_mask; /* bitmask for interrupts that should not be masked */ |