gpu: nvgpu: fix deadlock on railgate_lock during race condition
We have below race condition during __gk20a_do_idle()
and force_reset case :
- before execution of __gk20a_do_idle(), a process drops the last
usage count of GPU, which triggers GPU railgate process
- but before GPU is really railgated (there is 500 mS delay),
some process calls __gk20a_do_idle()
- in __gk20a_do_idle(), we first take railgate_lock
- then we check if GPU is already railgated or not
- since it is not railgated yet (due to 500 mS delay), this
returns false
- then we call pm_runtime_get_noresume() which just increases the
usage counter
- in this particular case, this call just increases usage count to
1 from 0, but whereas GPU is already on its way to railgate
- while we check if GPU usage count drops to one, GPU gets railgated
- now if we have force_reset=true case, we will end up calling
pm_runtime_get_sync() which will take railgate_lock lock _again_
and try to unrailgate GPU
- this causes a deadlock on railgate_lock
To fix this, use below sequence :
- take railgate_lock
- check if GPU is already railgated
- release railgate_lock
- call pm_runtime_get_sync() which will keep GPU active even if
railgating is already triggered
- take railgate_lock again to prevent unrailgate in futher process
Also, add more descriptive comments to explain the flow
Bug
1624537
Change-Id: I0febc65d7bfac03ee738be200cf321322ffbe5a6
Signed-off-by: Deepak Nibade <dnibade@nvidia.com>
Reviewed-on: http://git-master/r/719625
(cherry picked from commit
480284eda16e2b50ee6368bad3d15574e098b231)
Reviewed-on: http://git-master/r/719620
Reviewed-by: Sachin Nikam <snikam@nvidia.com>