I spent last weekend tracking down some of the remaining SMP deadlocks in Linux
2.0.30. The latest version of my deadlock patch now corrects all the problems
I've found to date. Since it's grown a bit larger, the patch is now available
from my Linux web page at URL http://www.dandelion.com/Linux/. Specifically,
the following improvements have been implemented:
- The earlier versions of the patch were only effective on systems where the
boot CPU was CPU #0. This version correctly handles a non-zero boot CPU.
The Tyan Titan Pro and Tomcat IIID seem to always have CPU #0 as the boot
CPU, whereas many of the AIR, SuperMicro, and ASUS boards have CPU #1 as
the boot CPU.
- An additional form of deadlock occurs when kernel code running on a
non-boot CPU waits for the jiffies variable to be incremented. This deadlock is
now avoided by having the spin loops in ENTER_KERNEL increment jiffies
approximately every 10 milliseconds. This approach avoids having to track
down every place in the kernel where such waiting loops occur.
- Finally, if approximately 60 seconds elapse while waiting for the kernel
lock, a message will be printed if possible to indicate that a deadlock
has been detected. This will help differentiate between SMP lockups and
hardware lockups.
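To illustrate the last two items, here is a rough user-space sketch of the idea, not the code from the patch itself. The names spin_for_kernel_lock, try_lock_fn, SPINS_PER_TICK and DEADLOCK_TICKS are made up for this example, the jiffies variable is only a stand-in for the kernel's counter, and the spin count per 10 millisecond tick is an arbitrary assumption. Returning after the message is printed is also just a choice to make the sketch terminate.

```c
#include <stdbool.h>
#include <stdio.h>

#define HZ 100                      /* timer ticks per second, 10 ms per tick */
#define SPINS_PER_TICK 1000UL       /* assumed: loop iterations in ~10 ms */
#define DEADLOCK_TICKS (60UL * HZ)  /* ~60 seconds of waiting */

/* Stand-in for the kernel's tick counter. */
static volatile unsigned long jiffies;

/* Hypothetical lock probe: returns true once the lock can be taken. */
typedef bool (*try_lock_fn)(void *ctx);

/*
 * Spin until try_lock succeeds. Roughly every 10 ms worth of spinning,
 * advance jiffies so that code waiting on it elsewhere can make progress.
 * After about 60 seconds of waiting, report a suspected deadlock.
 * Returns true if the lock was obtained, false if a deadlock was reported.
 */
static bool spin_for_kernel_lock(try_lock_fn try_lock, void *ctx)
{
    unsigned long spins = 0, ticks_waited = 0;

    while (!try_lock(ctx)) {
        if (++spins < SPINS_PER_TICK)
            continue;
        spins = 0;
        jiffies++;                  /* keep time moving while we spin */
        if (++ticks_waited >= DEADLOCK_TICKS) {
            fprintf(stderr, "kernel lock: SMP deadlock suspected "
                            "after ~60 seconds\n");
            return false;
        }
    }
    return true;
}

/* Example probe: the lock holder lets go once jiffies reaches *ctx. */
static bool free_after_jiffies(void *ctx)
{
    return jiffies >= *(unsigned long *)ctx;
}

/* Example probe: the lock is never released (a true deadlock). */
static bool never_free(void *ctx)
{
    (void)ctx;
    return false;
}
```

With free_after_jiffies as the probe, the spinner itself advances jiffies until the holder's wait is satisfied, which is the situation the second item describes. With never_free, the loop gives up after DEADLOCK_TICKS ticks and prints the message, as in the third item.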
I suspect the first item, the boot CPU assumption, is the reason that earlier
versions of this patch seemed to be effective for some people and yet
completely ineffective for others.
If people still encounter lockups with this patch, I also fixed the bug in
Ingo's deadlock detection patch, so we should be able to use that for further
debugging.
12-Apr-97
Enclosed below is my patch to Linux 2.0.30 to avoid the interrupt/paging
deadlocks that have been reported in Linux 2.0.x/SMP. Look in the patched
"linux/kernel/sched.c" for a large comment with a full explanation of the
deadlock conditions I've addressed and how they are resolved. Please report on
whether this resolves any lockups you've seen or if you have any problems with
it installed. With luck, this will be reliable enough for 2.0.31 and will
remove the black mark from Linux/SMP.
Thanks to Bill Reynolds for his "kill_kernel" package which made testing this
easier, and to Linus for the 2.1.28/29 SMP fix which provided the locations for
the necessary allow_interrupts calls.