How to troubleshoot/bypass locking problem which appears to be GIL related
An obscure lock-up between two threads seems to be related to the Global Interpreter Lock or some other "behind-the-scenes-lock", and I am at a loss how to continue troubleshooting. Any tips how to eliminate the lock-up would be appreciated.
The issue reproduces (erratically and somewhat random) in a larger set of code. The code is strictly python. Python version is 2.6.5 (on Linux). Hours of troubleshooting has reduced the problem when the lock-up occurs to the following:
- The program has only two running threads
- The threads concurrently call a method protected by a single threading.RLock
- Thread 1 has acquired the lock [plus some other locks] via acquire()
- Thread 2 has called acquire() and is confirmed to be waiting for the lock
- Thread 1 is able to do print to the console with print(), however it is getting locked up by a simple non-blocking library call
The offensive call in #5 is the function unicode.encode, which should be non-blocking. The following code in Thread 1 at the position where the thread locks will (as expected) print 'A' and 'B':
print('A')
print('B')
However, the following will just print 'A' and block the thread:
print('A')
u'hello'.encode('utf8') # This dummy (non-blocking) call locks up Thread 1
print('B')
This makes no sense to me at all. There is no logical dead-lock condition existing between the two threads. Thread 1 is being blocked by a non-blocking library call that is not in any way interfering with Thread 2, which is just silently waiting to acquire an RLock. The only cause I can think of for Thread 1 being blocked is that it is waiting for the GIL.
Any thoughts how to troubleshoot this further, or any mechanisms to somehow control or manipulate GIL operation as a workaround?
Edit: Some additional information in response to samplebias (and thank you for the reply). I had problems getting a trace as the issue seems to be very sensitive to anything that may disturb the timing between the two threads. However running strace with only the -f option, after a few iterations I was able to get a trace.
Thread 1 contains these three debug statements which should print two lines 'CHECK_IN' and 'CHECK_TEST' to the console:
print('CHECK IN')#DEBUG
u'hello'.encode('utf8')
print('CHECK TEST')#DEBUG
Here is the last page of the strace:
8605 mmap2(NULL, 266240, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb753d000
8605 mmap2(NULL, 8392704, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_STACK, -1, 0) = 0xb6d3c000
8605 mprotect(0xb6d3c000, 4096, PROT_NONE) = 0
8605 clone(child_stack=0xb753c494, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0xb753cbd8, {entry_number:6, base_addr:0xb753cb70, limit:1048575, seg_32bit:1, contents:0, read_exec_only:0, limit_in_pages:1, seg_not_present:0, useable:1}, child_tidptr=0xb753cbd8) = 8606
8606 set_robust_list(0xb753cbe0, 0xc <unfinished ...>
8605 futex(0xa239138, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
8606 <... set_robust_list resumed> ) = 0
8606 futex(0xa239138, FUTEX_WAKE_PRIVATE, 1) = 1
8605 <... futex resumed> ) = 0
8606 gettimeofday( <unfinished ...>
8605 futex(0xa272398, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
8606 <... gettimeofday resumed> {1301528807, 326496}, NULL) = 0
8606 futex(0xa272398, FUTEX_WAKE_PRIVATE, 1) = 1
8605 <... futex resumed> ) = 0
8606 futex(0xa272398, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
8605 gettimeofday( <unfinished ...>
8606 <... futex resumed> ) = 0
8605 <... gettimeofday resumed> {1301528807, 326821}, NULL) = 0
8606 futex(0xa272398, F开发者_C百科UTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
8605 futex(0xa272398, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
8606 <... futex resumed> ) = -1 EAGAIN (Resource temporarily unavailable)
8605 <... futex resumed> ) = 0
8606 gettimeofday( <unfinished ...>
8605 futex(0xa272398, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
8606 <... gettimeofday resumed> {1301528807, 326908}, NULL) = 0
8606 futex(0xa272398, FUTEX_WAKE_PRIVATE, 1) = 1
8605 <... futex resumed> ) = 0
8606 futex(0xa272398, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
8605 futex(0xa1b0d70, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
8606 <... futex resumed> ) = 0
8606 stat64("/etc/localtime", {st_mode=S_IFREG|0644, st_size=2225, ...}) = 0
8606 fstat64(1, {st_mode=S_IFCHR|0620, st_rdev=makedev(136, 1), ...}) = 0
8606 mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb6d3b000
8606 write(1, "CHECK IN\n", 9) = 9
8606 futex(0xa115270, FUTEX_WAIT_PRIVATE, 0, NULL
And the output of the three lines of code is just the following, before the program locks up:
CHECK IN
So the strace shows how Thread 1 (#8606) writes the 'CHECK_IN' string, and when reaching the unicode.encode call enters a waiting state which never returns.
By the way I am making a couple future imports in all modules to keep with some newer python conventions ...
from __future__ import print_function, unicode_literals
... but I cannot see they should make any difference - especially as the u'hello' string is called out explicitly as a unicode string.
I can find nothing in the Python source which would cause unicode.encode()
to block, and the dummy program I wrote to try to reproduce this runs as expected. You mentioned that Thread 1 has acquired more than 1 lock - have you eliminated those locks as the source of the lock-up?
Does the test case below show the same lock-up in your environment?
import time
import threading
def worker(tid):
_lock.acquire()
if not tid:
# wait for rest of threads to enter acquire
time.sleep(0.5)
print('%d: A' % tid)
u'hello'.encode('utf-8')
print('%d: B' % tid)
_lock.release()
def start(tid):
th = threading.Thread(target=worker, args=(tid,))
th.start()
return th
_num = 2
_lock = threading.RLock()
workers = [start(n) for n in range(_num)]
while all(w.isAlive() for w in workers):
time.sleep(1)
Output:
0: A
0: B
1: A
1: B
You can also run strace
on your program to work out where the process is blocked. For example with the above script:
% strace -fTr -o trace.out python lockup.py
The -o trace.out
flag tells strace to write the output to a file. You can omit that and strace will print to stderr.
The contents of trace.out
should show you all of the system calls made by the program, with each line prefixed with thread id and the relative time between system calls. The end of the line will have the time spent inside that system call. I've annotated the last few syscalls with the corresponding Python code:
# thread 0 time.sleep(0.5) completes
24778 0.500124 <... select resumed> ) = 0 (Timeout) <0.500599>
# preparing to print()
24778 0.000071 fstat(1, {st_mode=S_IFCHR|0620, st_rdev=makedev(136, 0), ...}) = 0 <0.000017>
24778 0.000058 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f8fe90a6000 <0.000018>
# print("0: A\n")..
24778 0.000079 write(1, "0: A\n", 5) = 5 <0.000023>
24778 0.000106 write(1, "0: B\n", 5) = 5 <0.000056>
# thread 0 _lock.release()
24778 0.000114 futex(0xe0f3c0, FUTEX_WAKE_PRIVATE, 1) = 1 <0.000024>
24778 0.000108 madvise(0x7f8fe7266000, 8368128, MADV_DONTNEED) = 0 <0.000030>
# thread 0 exit
24778 0.000072 _exit(0) = ?
# thread 1 _lock.acquire()
24779 0.000050 <... futex resumed> ) = 0 <0.500774>
# thread 1 print("1: A\n") and so on..
24779 0.000052 write(1, "1: A\n", 5) = 5 <0.000026>
24779 0.000086 write(1, "1: B\n", 5) = 5 <0.000026>
24779 0.000099 madvise(0x7f8fe6a65000, 8368128, MADV_DONTNEED) = 0 <0.000024>
24779 0.000064 _exit(0) = ?
24777 0.499956 <... select resumed> ) = 0 (Timeout) <1.001138>
24777 0.000132 rt_sigaction(SIGINT, {SIG_DFL, [], SA_RESTORER, 0x7f8fe8c7c8f0}, {0x4d9a90, [], SA_RESTORER, 0x7f8fe8c7c8f0}, 8) = 0 <0.000025>
# main thread process exit
24777 0.002349 exit_group(0) = ?
精彩评论