Thomas Gleixner [Tue, 31 Jan 2012 12:01:27 +0000 (13:01 +0100)]
genirq: Allow disabling of softirq processing in irq thread context
The processing of softirqs in irq thread context is a performance gain
for the non-rt workloads of a system, but it's counterproductive for
interrupts which are explicitly related to the realtime
workload. Allow such interrupts to prevent softirq processing in their
thread context.
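For illustration, a rough sketch of how a driver might request such an interrupt (the flag name IRQF_NO_SOFTIRQ_CALL is assumed from the RT series and may differ):

    /* Hedged sketch: request a threaded interrupt whose handler thread
     * should not process pending softirqs when it finishes. */
    ret = request_threaded_irq(irq, hard_handler, thread_handler,
                               IRQF_NO_SOFTIRQ_CALL, "rt-critical-dev", dev);
    if (ret)
        pr_err("failed to request irq %d: %d\n", irq, ret);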
Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: stable-rt@vger.kernel.org
Ingo Molnar [Wed, 30 Nov 2011 01:18:22 +0000 (20:18 -0500)]
tasklet: Prevent tasklets from going into infinite spin in RT
When CONFIG_PREEMPT_RT_FULL is enabled, tasklets run as threads,
and spinlocks turn into mutexes. But this can cause issues with
tasks disabling tasklets. A tasklet runs under ksoftirqd, and
if a tasklet is disabled with tasklet_disable(), the tasklet
count is increased. When a tasklet runs, it checks this counter
and if it is set, it adds itself back on the softirq queue and
returns.
The problem arises in RT because ksoftirqd will see that a softirq
is ready to run (the tasklet softirq just re-armed itself), and will
not sleep, but instead run the softirqs again. The tasklet softirq
will still see that the count is non-zero and will not execute
the tasklet and will requeue itself on the softirq queue again, which will
cause ksoftirqd to run it again and again and again.
It gets worse because ksoftirqd runs as a real-time thread.
If it preempted the task that disabled tasklets, and that task
has migration disabled, or can't run for other reasons, the tasklet
softirq will never run because the count will never be zero, and
ksoftirqd will go into an infinite loop. As ksoftirqd runs as an RT
task, this becomes a big problem.
This is a hack solution to have tasklet_disable stop tasklets, and
when a tasklet runs, instead of requeueing it on the softirq queue,
it is delayed. When tasklet_enable() is called, and tasklets are
waiting, then the tasklet_enable() will kick the tasklets to continue.
This prevents the lockup caused by ksoftirqd going into an infinite loop.
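A minimal, hedged sketch of the mechanism described above (simplified, not the exact kernel code; mark_tasklet_deferred() is a hypothetical stand-in for the "delay instead of requeue" logic):

    /* tasklet_disable() bumps the count; a tasklet whose count is
     * non-zero must not run its function. */
    atomic_inc(&t->count);

    /* In the tasklet softirq handler (sketch of old vs. new behaviour): */
    if (!atomic_read(&t->count)) {
        t->func(t->data);               /* enabled: run it */
    } else {
        /*
         * Old: tasklet_schedule(t) here re-raises the softirq, and a
         * high-priority ksoftirqd spins on it forever.
         * New: leave the tasklet marked as deferred; tasklet_enable()
         * kicks the softirq once the count drops back to zero.
         */
        mark_tasklet_deferred(t);       /* hypothetical helper */
    }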
[ rostedt@goodmis.org: ported to 3.0-rt ]
Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable-rt@vger.kernel.org Signed-off-by: Mike Galbraith <umgwanakikbuti@gmail.com> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
sched: dont calculate hweight in update_migrate_disable()
Proposal for a minor optimization in update_migrate_disable() - it's only a few
instructions saved, but those are in the hot path of locks, so it might be worth
it.
When being scheduled out while migrate_disable > 0 and migrate_disabled_updated
is not yet set, we end up here (kernel/sched/core.c):
    if (p->sched_class->set_cpus_allowed)
        p->sched_class->set_cpus_allowed(p, mask);
    p->nr_cpus_allowed = cpumask_weight(mask);
As we can only get here if migrate_disable > 0, there is no need to calculate
cpumask_weight(mask): tsk_cpus_allowed() in that case will return
cpumask_of(task_cpu(p)), which can only have a Hamming weight of 1 anyway.
So we can simply do:
    p->nr_cpus_allowed = 1;
without changing the behavior.
Reviewed-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Nicholas Mc Guire <der.herr@hofr.at> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Peter Zijlstra [Tue, 27 Sep 2011 12:40:25 +0000 (08:40 -0400)]
sched: Have migrate_disable ignore bounded threads
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Clark Williams <williams@redhat.com> Link: http://lkml.kernel.org/r/20110927124423.567944215@goodmis.org Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Peter Zijlstra [Tue, 27 Sep 2011 12:40:24 +0000 (08:40 -0400)]
sched: Do not compare cpu masks in scheduler
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Clark Williams <williams@redhat.com> Link: http://lkml.kernel.org/r/20110927124423.128129033@goodmis.org Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
allow preemption in recursive migrate_disable call
Minor cleanup in migrate_disable/migrate_enable. The recursive case
does not need to disable preemption, as the task is "pinned" to the current
CPU anyway, so it is safe to preempt it.
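A hedged sketch of the fast path this cleanup enables (simplified, not the exact kernel code):

    void migrate_disable(void)
    {
        struct task_struct *p = current;

        if (p->migrate_disable) {
            /*
             * Recursive call: the task is already pinned to this CPU,
             * so being preempted here is harmless and the
             * preempt_disable()/preempt_enable() pair can be dropped
             * from this path.
             */
            p->migrate_disable++;
            return;
        }

        preempt_disable();
        /* ... first-level work: pin the task, update bookkeeping ... */
        p->migrate_disable = 1;
        preempt_enable();
    }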
Signed-off-by: Nicholas Mc Guire <der.herr@hofr.at> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Steven Rostedt [Tue, 27 Sep 2011 12:40:23 +0000 (08:40 -0400)]
sched: Postpone actual migration disable to schedule
migrate_disable() can cause a bit of overhead on the RT kernel,
as changing the affinity is expensive to do at every lock encountered.
As a running task can not migrate, the actual disabling of migration
does not need to occur until the task is about to schedule out.
In most cases, a task that disables migration will enable it again before
it schedules out, which makes this change improve performance tremendously.
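Roughly, the idea is that migrate_disable() only touches a per-task counter, and the expensive affinity update runs from schedule() only if the task actually switches out while migration is disabled. A hedged sketch under that assumption:

    /* Fast path: no cpumask or runqueue work here. */
    void migrate_disable(void)
    {
        current->migrate_disable++;
    }

    /* In schedule(), before switching away (sketch): */
    if (prev->migrate_disable && !prev->migrate_disabled_updated)
        update_migrate_disable(prev);   /* restrict affinity to this CPU now */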
[ Frank Rowand: UP compile fix ]
Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Clark Williams <williams@redhat.com> Link: http://lkml.kernel.org/r/20110927124422.779693167@goodmis.org Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Peter Zijlstra [Thu, 11 Aug 2011 13:14:58 +0000 (15:14 +0200)]
sched: Generic migrate_disable
Make migrate_disable() be a preempt_disable() for !rt kernels. This
allows generic code to use it but still enforces that these code
sections stay relatively small.
A preemptible migrate_disable() accessible for general use would allow
people to grow arbitrary per-cpu crap instead of cleaning these things
up.
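A hedged sketch of the mapping described above (the exact config symbol and placement in the RT patch may differ):

    #ifdef CONFIG_PREEMPT_RT_FULL
    extern void migrate_disable(void);
    extern void migrate_enable(void);
    #else
    /* On !RT kernels, keep such sections small by making them
     * preempt-disabled regions. */
    # define migrate_disable()      preempt_disable()
    # define migrate_enable()       preempt_enable()
    #endif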
Steven Rostedt [Wed, 16 Nov 2011 18:19:35 +0000 (13:19 -0500)]
tracing: Show padding as unsigned short
RT added a two-byte migrate-disable counter to the trace events
and used two bytes of the padding to make the change. The structures and
all were updated correctly, but the display in the event formats was
not:
The field for common_padding has the correct size and offset, but the
use of "int" might confuse some parsers (and people that are reading
it). This needs to be changed to "unsigned short".
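For illustration (offsets and sizes vary per event and architecture; this line is assumed, not quoted from the patch), the entry in an event's format file goes from:

    field:int common_padding;    offset:10;    size:2;    signed:0;

to:

    field:unsigned short common_padding;    offset:10;    size:2;    signed:0;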
hotplug: Reread hotplug_pcp on pin_current_cpu() retry
When a retry happens, it is likely that the task has been migrated to
another CPU (unless the unplug failed), but it still dereferences the
original hotplug_pcp per-CPU data.
Update the pointer to hotplug_pcp in the retry path, so it points to
the current cpu.
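A hedged sketch of the retry path in pin_current_cpu() after this change (simplified; wait_for_unplug() is a hypothetical stand-in for the slow-path synchronization):

    void pin_current_cpu(void)
    {
        struct hotplug_pcp *hp;

    retry:
        hp = &__get_cpu_var(hotplug_pcp);       /* re-read on every pass */

        if (!hp->unplug || hp->refcount || hp->unplug == current) {
            hp->refcount++;
            return;
        }

        /*
         * Slow path: wait for the unplug attempt on this CPU to finish.
         * We may get migrated while waiting, so go back and pick up the
         * hotplug_pcp of the CPU we are now running on rather than keep
         * using the stale pointer.
         */
        wait_for_unplug(hp);                    /* hypothetical helper */
        goto retry;
    }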
Thomas Gleixner [Wed, 15 Jun 2011 10:36:06 +0000 (12:36 +0200)]
hotplug: Lightweight get online cpus
get_online_cpus() is a heavyweight function which involves a global
mutex. migrate_disable() wants a simpler construct which only prevents
a CPU from going down while a task is in a migrate-disabled section.
Implement a per cpu lockless mechanism, which serializes only in the
real unplug case on a global mutex. That serialization affects only
tasks on the cpu which should be brought down.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Steven Rostedt [Mon, 18 Mar 2013 19:12:49 +0000 (15:12 -0400)]
sched/workqueue: Only wake up idle workers if not blocked on sleeping spin lock
In -rt, most spin_locks() turn into mutexes. One of these spin_lock
conversions is performed on the workqueue gcwq->lock. When the idle
worker is woken, the first thing it will do is grab that same lock, and
it too will block, possibly jumping into the same code; but because
nr_running would already be decremented, an infinite loop is prevented.
But this is still a waste of CPU cycles, and it doesn't follow the method
of mainline, as new workers should only be woken when a worker thread is
truly going to sleep, and not just blocked on a spin_lock().
Check the saved_state too before waking up new workers.
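A hedged sketch of the resulting check in __schedule() (the workqueue callback names follow the mainline kernel of that era; treat the exact condition as approximate):

    /*
     * Only ask the workqueue code to wake an idle worker if the previous
     * task is really going to sleep, i.e. it is not merely blocked on a
     * converted (sleeping) spin_lock with its real state parked in
     * saved_state.
     */
    if (prev->flags & PF_WQ_WORKER && !prev->saved_state) {
        struct task_struct *to_wakeup;

        to_wakeup = wq_worker_sleeping(prev, cpu);
        if (to_wakeup)
            try_to_wake_up_local(to_wakeup);
    }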
Cc: stable-rt@vger.kernel.org Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Thomas Gleixner [Tue, 13 Dec 2011 20:42:19 +0000 (21:42 +0100)]
sched: ttwu: Return success when only changing the saved_state value
When a task blocks on a rt lock, it saves the current state in
p->saved_state, so a lock related wake up will not destroy the
original state.
When a real wakeup happens while the task is already running due to a
lock wakeup, we update p->saved_state to TASK_RUNNING, but we do not
return success. This might cause another wakeup in the waitqueue code,
and the task remains on the waitqueue list. Return success in that case
as well.
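A hedged sketch of the resulting logic in try_to_wake_up() (fragment only, surrounding code omitted):

    raw_spin_lock_irqsave(&p->pi_lock, flags);
    if (!(p->state & state)) {
        /*
         * The task is running because of a lock wakeup, and its real
         * state is parked in saved_state.  If the real wakeup matches
         * that saved state, update it and report success so waitqueue
         * users dequeue the task.
         */
        if (p->saved_state & state) {
            p->saved_state = TASK_RUNNING;
            success = 1;
        }
        goto out;
    }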
Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: stable-rt@vger.kernel.org
Thomas Gleixner [Mon, 18 Jul 2011 15:03:52 +0000 (17:03 +0200)]
sched: Disable CONFIG_RT_GROUP_SCHED on RT
Carsten reported problems when running:
    taskset 01 chrt -f 1 sleep 1
from within rc.local on a F15 machine. The task stays running and
never gets on the run queue because some of the run queues have
rt_throttled=1 which does not go away. It works fine from an ssh login
shell. Disabling CONFIG_RT_GROUP_SCHED solves that as well.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
clock_was_set_delayed() is called in a hard IRQ handler (the timer interrupt),
which calls schedule_work().
Under PREEMPT_RT_FULL, schedule_work() takes spinlocks which could sleep, so it
is not safe to call schedule_work() in interrupt context.
Reference upstream commit b68d61c705ef02384c0538b8d9374545097899ca
(rt,ntp: Move call to schedule_delayed_work() to helper thread)
from git://git.kernel.org/pub/scm/linux/kernel/git/rt/linux-stable-rt.git, which
makes a similar change.
Add a helper thread which does the call to schedule_work(), and wake up that
thread instead of calling schedule_work() directly.
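A hedged sketch of the helper-thread approach (names here are illustrative, not the ones used in the referenced commit):

    static void clock_set_work_func(struct work_struct *work)
    {
        clock_was_set();
    }
    static DECLARE_WORK(clock_set_work, clock_set_work_func);

    static struct task_struct *clock_set_helper;
    static bool clock_set_pending;

    static int clock_set_helper_fn(void *unused)
    {
        while (!kthread_should_stop()) {
            set_current_state(TASK_INTERRUPTIBLE);
            if (clock_set_pending) {
                clock_set_pending = false;
                __set_current_state(TASK_RUNNING);
                schedule_work(&clock_set_work); /* safe: thread context */
                continue;
            }
            schedule();
        }
        return 0;
    }

    /* Called from the hard interrupt handler instead of schedule_work():
     * wake_up_process() only takes raw locks and is fine from hard IRQ. */
    void clock_was_set_delayed(void)
    {
        clock_set_pending = true;
        smp_wmb();
        wake_up_process(clock_set_helper);
    }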
Cc: stable-rt@vger.kernel.org Signed-off-by: Yang Shi <yang.shi@windriver.com> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Thomas Gleixner [Wed, 25 Jan 2012 10:08:40 +0000 (11:08 +0100)]
timer-fd: Prevent live lock
If hrtimer_try_to_cancel() requires a retry, then depending on the
priority setting the retry loop might prevent timer callback completion
on RT. Prevent that by waiting for completion on RT; no change for a
non-RT kernel.
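A hedged sketch of the resulting cancel loop in the timerfd code (hrtimer_wait_for_timer() is the RT-patch helper assumed here, and ctx->tmr stands in for the timerfd context's timer):

    for (;;) {
        if (hrtimer_try_to_cancel(&ctx->tmr) >= 0)
            break;
        /*
         * The callback is running right now.  On RT, sleep until it has
         * completed instead of spinning, which could otherwise starve
         * the thread that runs the callback.
         */
        hrtimer_wait_for_timer(&ctx->tmr);
    }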
Reported-by: Sankara Muthukrishnan <sankara.m@gmail.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: stable-rt@vger.kernel.org
Which indicated the cpu_load computation was hosed, the 177522 value
indicates that there is one RT task runnable. Initially I thought the
old problem of calculating the cpu_load from a softirq had re-surfaced,
however looking at the code shows it's being done from scheduler_tick().
[ we really should fix this RT/cfs interaction some day... ]
We see that we always raise the timer softirq before doing the load
calculation. Avoid this by re-ordering the scheduler_tick() call in
update_process_times() to occur before we deal with timers.
This lowers the load back to sanity and restores regular load-balancing
behaviour.
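An abbreviated, hedged sketch of the reordering in update_process_times() (several calls omitted; the point is only that scheduler_tick() now runs before the timer softirq is raised):

    void update_process_times(int user_tick)
    {
        struct task_struct *p = current;

        account_process_tick(p, user_tick);
        scheduler_tick();       /* moved up: sample cpu_load before ... */
        run_local_timers();     /* ... the timer softirq is raised here */
        rcu_check_callbacks(smp_processor_id(), user_tick);
        run_posix_cpu_timers(p);
    }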
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
When softirqs can be preempted we need to make sure that cancelling
the timer from the active thread can not deadlock vs. a running timer
callback. Add a waitqueue to resolve that.
Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Remove timer calls (!!!) from deep within the tracing infrastructure.
This was totally bogus code that can cause lockups and worse. Poll
the buffer every 2 jiffies for now.
Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Frank Rowand [Sun, 2 Oct 2011 01:58:13 +0000 (18:58 -0700)]
ARM: Initialize ptl->lock for vector page
Without this patch, ARM cannot use SPLIT_PTLOCK_CPUS if
PREEMPT_RT_FULL=y because vectors_user_mapping() creates a
VM_ALWAYSDUMP mapping of the vector page (address 0xffff0000), but no
ptl->lock has been allocated for the page. An attempt to coredump
that page will result in a kernel NULL pointer dereference when
follow_page() attempts to lock the page.
The underlying problem is exposed by mm-shrink-the-page-frame-to-rt-size.patch.
Signed-off-by: Frank Rowand <frank.rowand@am.sony.com> Cc: Frank <Frank_Rowand@sonyusa.com> Cc: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/4E87C535.2030907@am.sony.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
thus there should be no need to call migrate_disable/enable recursively in
spin_try/lock/unlock. This patch adds a spin_trylock_local and replaces
the migration-disabling calls with the local calls.
This patch is incomplete, as it does not yet cover the _irq/_irqsave variants
with local locks. This patch requires the API cleanup in kernel/softirq.c or
it would break softirq_lock/unlock with respect to migration.
Signed-off-by: Nicholas Mc Guire <der.herr@hofr.at> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Each per-queue lock is taken with spin_lock_irqsave() except in the case
where all of them are taken for some kind of serialisation. As an
optimisation local_irq_save() is used so that lock_tx_qs() and
lock_rx_qs() can use just the spin_lock() variant instead.
On RT local_irq_save() behaves differently so we use the nort()
variant.
Lockdep screams easily when running "ethtool -K eth0 rx off tx off".
What remains is a missing lockdep annotation; without it, lockdep thinks
lock_tx_qs() may cause a deadlock.
Cc: stable-rt@vger.kernel.org Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Steven Rostedt [Fri, 3 Jul 2009 13:30:00 +0000 (08:30 -0500)]
drivers/net: vortex fix locking issues
Argh, cut and paste wasn't enough...
Use this patch instead. It needs an irq disable. But, believe it or not,
on SMP this is actually better. If the irq is shared (as it is in Mark's
case), we don't stop the irq of other devices from being handled on
another CPU (unfortunately for Mark, he pinned all interrupts to one CPU).
Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
drivers/net/ethernet/3com/3c59x.c | 8 ++++----
1 file changed, 4 insertions(+), 4 deletions(-)
Thomas Gleixner [Sat, 20 Jun 2009 09:36:54 +0000 (11:36 +0200)]
drivers/net: fix livelock issues
Preempt-RT runs into a live lock issue with the NETDEV_TX_LOCKED micro
optimization. The reason is that the softirq thread is rescheduling
itself on that return value. Depending on priorities, it starts to
monopolize the CPU and livelocks on UP systems.
Remove it.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
genirq: do not invoke the affinity callback via a workqueue
Joe Korty reported that __irq_set_affinity_locked() schedules a
workqueue while holding a raw lock, which results in a might_sleep()
warning.
This patch moves the invocation into process context so that we only
wake up a process while holding the lock.
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Paul Gortmaker [Fri, 21 Jun 2013 19:07:25 +0000 (15:07 -0400)]
list_bl.h: make list head locking RT safe
As per changes in include/linux/jbd_common.h for avoiding the
bit_spin_locks on RT ("fs: jbd/jbd2: Make state lock and journal
head lock rt safe") we do the same thing here.
We use the non atomic __set_bit and __clear_bit inside the scope of
the lock to preserve the ability of the existing LIST_DEBUG code to
use the zero'th bit in the sanity checks.
As a bit spinlock, we had no lockdep visibility into the usage
of the list head locking. Now, if we were to implement it as a
standard non-raw spinlock, we would see:
Since we are only taking the lock during short-lived list operations,
let's assume for now that it being raw won't be a significant latency
concern.
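A hedged sketch of the locking helpers after the change (conditionals simplified; the raw spinlock is assumed to be added to struct hlist_bl_head under RT):

    static inline void hlist_bl_lock(struct hlist_bl_head *b)
    {
    #ifndef CONFIG_PREEMPT_RT_BASE
        bit_spin_lock(0, (unsigned long *)b);
    #else
        raw_spin_lock(&b->lock);
    #if defined(CONFIG_SMP) || defined(CONFIG_DEBUG_SPINLOCK)
        __set_bit(0, (unsigned long *)b);       /* keep LIST_DEBUG checks working */
    #endif
    #endif
    }

    static inline void hlist_bl_unlock(struct hlist_bl_head *b)
    {
    #ifndef CONFIG_PREEMPT_RT_BASE
        __bit_spin_unlock(0, (unsigned long *)b);
    #else
    #if defined(CONFIG_SMP) || defined(CONFIG_DEBUG_SPINLOCK)
        __clear_bit(0, (unsigned long *)b);
    #endif
        raw_spin_unlock(&b->lock);
    #endif
    }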
Cc: stable-rt@vger.kernel.org Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
mm/workingset: do not protect workingset_shadow_nodes with irq off
workingset_shadow_nodes is protected by local_irq_disable(). Some users
use spin_lock_irq().
Replace the irq off/on with a local_lock(). Rename workingset_shadow_nodes
so I catch users of it which will be introduced later.
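A hedged sketch of the conversion (lock and variable names are placeholders; the renamed __workingset_shadow_nodes spelling is an assumption):

    static DEFINE_LOCAL_IRQ_LOCK(workingset_shadow_lock);

    /* was: local_irq_disable(); ...; local_irq_enable(); */
    local_lock(workingset_shadow_lock);
    list_lru_add(&__workingset_shadow_nodes, &node->private_list);
    local_unlock(workingset_shadow_lock);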
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
infiniband: Mellanox IB driver patch use _nort() primitives
Fixes an in_atomic stack dump seen when the Mellanox module is loaded into
the RT kernel.
Michael S. Tsirkin <mst@dev.mellanox.co.il> sayeth:
"Basically, if you just make spin_lock_irqsave (and spin_lock_irq) not disable
interrupts for non-raw spinlocks, I think all of infiniband will be fine without
changes."
Signed-off-by: Sven-Thorsten Dietrich <sven@thebigcorporation.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Thomas Gleixner [Tue, 21 Jul 2009 20:34:14 +0000 (22:34 +0200)]
rt: local_irq_* variants depending on RT/!RT
Add local_irq_*_(no)rt variants which are mainly used to break
interrupt-disabled sections on PREEMPT_RT or to explicitly disable
interrupts on PREEMPT_RT.
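A hedged sketch of the _nort family (the _rt variants are the mirror image: no-ops on !RT, real irq disabling on RT; exact macro bodies may differ from the patch):

    #ifdef CONFIG_PREEMPT_RT_FULL
    # define local_irq_disable_nort()       do { } while (0)
    # define local_irq_enable_nort()        do { } while (0)
    # define local_irq_save_nort(flags)     local_save_flags(flags)
    # define local_irq_restore_nort(flags)  (void)(flags)
    #else
    # define local_irq_disable_nort()       local_irq_disable()
    # define local_irq_enable_nort()        local_irq_enable()
    # define local_irq_save_nort(flags)     local_irq_save(flags)
    # define local_irq_restore_nort(flags)  local_irq_restore(flags)
    #endif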
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>