I am using ZeroMQ 4.3.1. In our design, each microservice sends periodic
heartbeats to its peers (ROUTER->DEALER model). The sockets are configured
with ZMQ_IMMEDIATE and ZMQ_SNDTIMEO, so a send does not block when the peer
is down. We are seeing cases where the zmq I/O thread aborts with "Bad file
descriptor". It only happens for a peer that is not reachable. The abort is
triggered by epoll_ctl() failing with EBADF for EPOLL_CTL_DEL or
EPOLL_CTL_ADD. Our application only uses zmq sockets; it does not use ZMQ_FD.
I don't see where a race condition could come from, but it looks as if the
socket file descriptor gets closed after the epoll_wait() event. The problem
is rare but does happen, and I don't have a recipe to reproduce it. There is
no issue with peers that are reachable.
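
For reference, the errno itself is easy to provoke outside of zmq. The following is a minimal sketch, assuming Linux; the helper name epoll_ctl_errno_on_closed_fd is hypothetical and not part of libzmq. It closes a descriptor and then hands it to epoll_ctl(), which is the condition the backtraces seem to point at:

```c
#include <errno.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Hypothetical helper (not part of libzmq): calls epoll_ctl() with the
 * given op on a descriptor that has already been closed, and returns the
 * resulting errno (0 if the call unexpectedly succeeds). */
int epoll_ctl_errno_on_closed_fd(int op)
{
    int epfd = epoll_create1(0);
    if (epfd < 0)
        return errno;

    int pipefd[2];
    if (pipe(pipefd) != 0) {
        int err = errno;
        close(epfd);
        return err;
    }

    struct epoll_event ev = {0};
    ev.events = EPOLLIN;
    ev.data.fd = pipefd[0];

    /* For the DEL case, register the descriptor while it is still open. */
    if (op == EPOLL_CTL_DEL &&
        epoll_ctl(epfd, EPOLL_CTL_ADD, pipefd[0], &ev) != 0) {
        int err = errno;
        close(pipefd[0]);
        close(pipefd[1]);
        close(epfd);
        return err;
    }

    /* Simulate the descriptor being closed behind the poller's back. */
    close(pipefd[0]);

    int rc = epoll_ctl(epfd, op, pipefd[0], &ev);
    int err = (rc == -1) ? errno : 0;

    close(pipefd[1]);
    close(epfd);
    return err;
}
```

In a single-threaded program the helper should return EBADF for both EPOLL_CTL_ADD and EPOLL_CTL_DEL. It only illustrates the failure mode; it says nothing about how the descriptor ends up closed inside libzmq.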
Any pointers would be helpful. Thanks,
Hadi-

(gdb) bt
#0  0x00003fff7cdee530 in __libc_signal_restore_set (set=0x3fff712e8040) at ../sysdeps/unix/sysv/linux/internal-signals.h:84
#1  __GI_raise (sig=<optimized out>) at ../sysdeps/unix/sysv/linux/raise.c:48
#2  0x00003fff7cdd4648 in __GI_abort () at abort.c:79
#3  0x00003fff7c971818 in zmq::zmq_abort (errmsg_=<optimized out>) at /usr/src/debug/zeromq/4.3.1-r0/zeromq-4.3.1/src/err.cpp:88
#4  0x00003fff7c970d88 in zmq::epoll_t::add_fd (this=0x104797d0, fd_=<optimized out>, events_=<optimized out>) at /usr/src/debug/zeromq/4.3.1-r0/zeromq-4.3.1/src/epoll.cpp:100
#5  0x00003fff7c972438 in zmq::io_object_t::add_fd (this=<optimized out>, fd_=<optimized out>) at /usr/src/debug/zeromq/4.3.1-r0/zeromq-4.3.1/src/io_object.cpp:65
#6  0x00003fff7c9b1e98 in zmq::tcp_connecter_t::start_connecting (this=0x3fff385bfe70) at /usr/src/debug/zeromq/4.3.1-r0/zeromq-4.3.1/src/tcp_connecter.cpp:203
#7  zmq::tcp_connecter_t::start_connecting (this=0x3fff385bfe70) at /usr/src/debug/zeromq/4.3.1-r0/zeromq-4.3.1/src/tcp_connecter.cpp:190
#8  0x00003fff7c9b1fe4 in zmq::tcp_connecter_t::timer_event (this=<optimized out>, id_=<optimized out>) at /usr/src/debug/zeromq/4.3.1-r0/zeromq-4.3.1/src/tcp_connecter.cpp:186
#9  0x00003fff7c98dad0 in zmq::poller_base_t::execute_timers (this=0x104797d0) at /usr/src/debug/zeromq/4.3.1-r0/zeromq-4.3.1/src/poller_base.cpp:103
#10 0x00003fff7c9709c4 in zmq::epoll_t::loop (this=0x104797d0) at /usr/src/debug/zeromq/4.3.1-r0/zeromq-4.3.1/src/epoll.cpp:173
#11 0x00003fff7c98d3ac in zmq::worker_poller_base_t::worker_routine (arg_=<optimized out>) at /usr/src/debug/zeromq/4.3.1-r0/zeromq-4.3.1/src/poller_base.cpp:139
#12 0x00003fff7c9b3658 in thread_routine (arg_=0x10479828) at /usr/src/debug/zeromq/4.3.1-r0/zeromq-4.3.1/src/thread.cpp:182
#13 0x00003fff7cfabb14 in start_thread (arg=0x0) at pthread_create.c:486
#14 0x00003fff7cec72e8 in .__clone () at ../sysdeps/unix/sysv/linux/powerpc/powerpc64/clone.S:82

#0  0x00003fffb323e530 in __libc_signal_restore_set (set=0x3fff6afe7060) at ../sysdeps/unix/sysv/linux/internal-signals.h:84
#1  __GI_raise (sig=<optimized out>) at ../sysdeps/unix/sysv/linux/raise.c:48
#2  0x00003fffb3224648 in __GI_abort () at abort.c:79
#3  0x00003fffb2dc1818 in .zmq::zmq_abort(char const*) () from /usr/lib64/libzmq.so.5
#4  0x00003fffb2dc155c in .zmq::epoll_t::rm_fd(void*) () from /usr/lib64/libzmq.so.5
#5  0x00003fffb2dc2474 in .zmq::io_object_t::rm_fd(void*) () from /usr/lib64/libzmq.so.5
#6  0x00003fffb2e011e0 in .zmq::tcp_connecter_t::rm_handle() () from /usr/lib64/libzmq.so.5
#7  0x00003fffb2e01c3c in .zmq::tcp_connecter_t::out_event() () from /usr/lib64/libzmq.so.5
#8  0x00003fffb2e00cbc in .zmq::tcp_connecter_t::in_event() () from /usr/lib64/libzmq.so.5
#9  0x00003fffb2dc0a9c in ?? () from /usr/lib64/libzmq.so.5
#10 0x00003fffb2ddd3ac in .zmq::worker_poller_base_t::worker_routine(void*) () from /usr/lib64/libzmq.so.5
#11 0x00003fffb2e03658 in ?? () from /usr/lib64/libzmq.so.5
#12 0x00003fffb33fbb14 in start_thread (arg=0x0) at pthread_create.c:486
#13 0x00003fffb33172e8 in .__clone () at ../sysdeps/unix/sysv/linux/powerpc/powerpc64/clone.S:82