Marcos Juarez
2016-11-03 17:02:16 UTC
We're running into a recurrent deadlock issue in both our production and
staging clusters, both running the latest 0.10.1 release. The symptom we
noticed is that on servers where Kafka producer connections are short-lived,
every other day or so file descriptors start being exhausted, and they keep
climbing until either the broker is restarted or it runs out of file
descriptors entirely and goes down. None of the clients are on 0.10.1 Kafka
jars; they're all using earlier versions.
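In case it helps anyone reproduce, here's a rough sketch of how the broker's
file descriptor count could be tracked over JMX (the service URL/host/port is
a placeholder and the broker needs remote JMX enabled; this isn't what we run,
just the idea):

import javax.management.JMX;
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;
import com.sun.management.UnixOperatingSystemMXBean;

public class BrokerFdWatcher {
    public static void main(String[] args) throws Exception {
        // Placeholder host/port; assumes the broker was started with remote JMX enabled.
        String url = "service:jmx:rmi:///jndi/rmi://broker-host:9999/jmxrmi";
        try (JMXConnector connector = JMXConnectorFactory.connect(new JMXServiceURL(url))) {
            MBeanServerConnection mbs = connector.getMBeanServerConnection();
            // Proxy the standard OperatingSystem MXBean exposed by the broker's JVM.
            UnixOperatingSystemMXBean os = JMX.newMXBeanProxy(
                mbs, new ObjectName("java.lang:type=OperatingSystem"),
                UnixOperatingSystemMXBean.class);
            while (true) {
                System.out.printf("open fds: %d / %d%n",
                    os.getOpenFileDescriptorCount(),
                    os.getMaxFileDescriptorCount());
                Thread.sleep(10_000L);  // poll every 10 seconds
            }
        }
    }
}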
While diagnosing the issue, we found that when the system is in that state,
consuming file descriptors at a very fast rate, the JVM is actually
deadlocked. We took thread dumps with both jstack and VisualVM and attached
them to this email.
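For what it's worth, the same deadlock check jstack performs can also be run
programmatically through ThreadMXBean; a rough sketch is below (run as-is it
only inspects its own JVM, so it would have to be pointed at the broker over
remote JMX like the watcher above):

import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

public class DeadlockCheck {
    public static void main(String[] args) {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        long[] ids = threads.findDeadlockedThreads();  // null when no deadlock is detected
        if (ids == null) {
            System.out.println("no deadlocked threads");
            return;
        }
        for (ThreadInfo info : threads.getThreadInfo(ids, Integer.MAX_VALUE)) {
            System.out.printf("%s blocked on %s held by %s%n",
                info.getThreadName(),
                info.getLockName(),
                info.getLockOwnerName());
        }
    }
}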
This is the interesting bit from the jstack thread dump:
Found one Java-level deadlock:
=============================
"executor-Heartbeat":
  waiting to lock monitor 0x00000000016c8138 (object 0x000000062732a398, a kafka.coordinator.GroupMetadata),
  which is held by "group-metadata-manager-0"
"group-metadata-manager-0":
  waiting to lock monitor 0x00000000011ddaa8 (object 0x000000063f1b0cc0, a java.util.LinkedList),
  which is held by "kafka-request-handler-3"
"kafka-request-handler-3":
  waiting to lock monitor 0x00000000016c8138 (object 0x000000062732a398, a kafka.coordinator.GroupMetadata),
  which is held by "group-metadata-manager-0"
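To make sure I'm reading the dump right: "group-metadata-manager-0" holds the
GroupMetadata lock and is waiting for the LinkedList lock, while
"kafka-request-handler-3" holds the LinkedList lock and is waiting for the
GroupMetadata lock, with "executor-Heartbeat" also queued behind the
GroupMetadata lock. A stripped-down sketch of that lock ordering (made-up
names, not the actual broker code) would be:

import java.util.LinkedList;

public class LockCycleSketch {
    private final Object groupMetadata = new Object();              // stands in for GroupMetadata
    private final LinkedList<String> pending = new LinkedList<>();  // stands in for the LinkedList monitor

    void groupMetadataManager() {
        synchronized (groupMetadata) {       // acquires the group lock first...
            pause();
            synchronized (pending) {         // ...then wants the list lock
                pending.poll();
            }
        }
    }

    void requestHandler() {
        synchronized (pending) {             // acquires the list lock first...
            pause();
            synchronized (groupMetadata) {   // ...then wants the group lock -> cycle
                // update group state
            }
        }
    }

    void heartbeatExecutor() {
        synchronized (groupMetadata) {       // blocked behind the group lock as well
            // expire members
        }
    }

    private static void pause() {
        try { Thread.sleep(100); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
    }

    public static void main(String[] args) {
        LockCycleSketch s = new LockCycleSketch();
        new Thread(s::groupMetadataManager, "group-metadata-manager-0").start();
        new Thread(s::requestHandler, "kafka-request-handler-3").start();
        new Thread(s::heartbeatExecutor, "executor-Heartbeat").start();
    }
}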
I also noticed that the background heartbeat thread (I'm guessing the one
called "executor-Heartbeat" above) is new in this release, introduced under
KAFKA-3888 - https://issues.apache.org/jira/browse/KAFKA-3888
We hadn't seen this problem with earlier Kafka broker versions, so my guess
is that this new background heartbeat thread is what introduced the
deadlock.
That same broker is still in the deadlocked state; we haven't restarted it
yet, so let me know if you'd like more info/logs/stats from the system before
we do.
Thanks,
Marcos Juarez