Discussion: Deadlock using latest 0.10.1 Kafka release
Marcos Juarez
2016-11-03 17:02:16 UTC
We're running into a recurrent deadlock issue in both our production and
staging clusters, both using the latest 0.10.1 release. The symptom we
noticed was that, on servers where Kafka producer connections are
short-lived, every other day or so file descriptors would be steadily
exhausted until either the broker was restarted or it ran out of file
descriptors and went down. None of the clients are on 0.10.1 Kafka jars;
they're all using previous versions.

While diagnosing the issue, we found that when the system is in that state,
consuming file descriptors at a very fast rate, the JVM is actually
deadlocked. We took thread dumps with both jstack and VisualVM and attached
them to this email.

This is the interesting bit from the jstack thread dump:


Found one Java-level deadlock:
=============================
"executor-Heartbeat":
waiting to lock monitor 0x00000000016c8138 (object 0x000000062732a398, a
kafka.coordinator.GroupMetadata),
which is held by "group-metadata-manager-0"

"group-metadata-manager-0":
waiting to lock monitor 0x00000000011ddaa8 (object 0x000000063f1b0cc0, a
java.util.LinkedList),
which is held by "kafka-request-handler-3"

"kafka-request-handler-3":
waiting to lock monitor 0x00000000016c8138 (object 0x000000062732a398, a
kafka.coordinator.GroupMetadata),
which is held by "group-metadata-manager-0"
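
As a side note, the same Java-level deadlock that jstack reports can also be
detected programmatically through the standard java.lang.management API.
This is just a minimal sketch of something that could run inside the broker
JVM (or read the same MBean remotely over JMX) for alerting; it is not part
of Kafka itself:

    import java.lang.management.ManagementFactory;
    import java.lang.management.ThreadInfo;
    import java.lang.management.ThreadMXBean;

    public class DeadlockProbe {
        public static void main(String[] args) {
            ThreadMXBean mx = ManagementFactory.getThreadMXBean();
            // Thread IDs blocked in a cycle on object monitors or ownable
            // synchronizers, or null if no deadlock exists.
            long[] ids = mx.findDeadlockedThreads();
            if (ids == null) {
                System.out.println("No Java-level deadlock detected");
                return;
            }
            for (ThreadInfo info : mx.getThreadInfo(ids, Integer.MAX_VALUE)) {
                // e.g. "executor-Heartbeat waiting on
                // kafka.coordinator.GroupMetadata@... held by group-metadata-manager-0"
                System.out.printf("%s waiting on %s held by %s%n",
                        info.getThreadName(),
                        info.getLockName(),
                        info.getLockOwnerName());
            }
        }
    }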


I also noticed that the background heartbeat thread (I'm guessing the one
called "executor-Heartbeat" above) is new in this release, added under the
KAFKA-3888 ticket - https://issues.apache.org/jira/browse/KAFKA-3888

We haven't noticed this problem with earlier Kafka broker versions, so I'm
guessing this new background heartbeat thread is what introduced the
deadlock.
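
Reading the dump, the cycle itself is a classic lock-ordering inversion: the
group-metadata-manager thread holds the GroupMetadata monitor and wants the
LinkedList, while a request-handler thread holds the LinkedList and wants
the GroupMetadata monitor. A stripped-down illustration of that shape (not
the actual broker code, just two placeholder locks and threads named after
the ones in the dump):

    import java.util.LinkedList;

    public class LockOrderInversion {
        // Stand-ins for the two monitors seen in the jstack output above.
        private static final Object groupMetadata = new Object();
        private static final LinkedList<Object> watchers = new LinkedList<>();

        public static void main(String[] args) {
            // Takes the group lock first, then the list.
            Thread manager = new Thread(() -> {
                synchronized (groupMetadata) {
                    pause();
                    synchronized (watchers) {
                        watchers.clear();
                    }
                }
            }, "group-metadata-manager-0");

            // Takes the list first, then the group lock: opposite order.
            Thread handler = new Thread(() -> {
                synchronized (watchers) {
                    pause();
                    synchronized (groupMetadata) {
                        watchers.add(new Object());
                    }
                }
            }, "kafka-request-handler-3");

            manager.start();
            handler.start();
            // A jstack of this JVM reports the same "Found one Java-level deadlock".
        }

        private static void pause() {
            try { Thread.sleep(100); } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
    }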

That same broker is still in the deadlocked state; we haven't restarted
it, so let me know if you'd like more info/logs/stats from the system before
we restart it.

Thanks,

Marcos Juarez
Hans Jespersen
2016-11-03 18:42:52 UTC
The 0.10.1 broker will use more file descriptors than previous releases
because of the new timestamp indexes. You should expect and plan for ~33%
more file descriptors to be open.

-hans

/**
* Hans Jespersen, Principal Systems Engineer, Confluent Inc.
* ***@confluent.io (650)924-2670
*/
Lawrence Weikum
2016-11-03 19:53:57 UTC
We saw that increase when upgrading from 0.9.0.1 to 0.10.0.1.
We're now running on 0.10.1.0, and this FD increase is due to a deadlock, not to new functionality or features.

Lawrence Weikum | Software Engineer | Pandora
1426 Pearl Street, Suite 100, Boulder CO 80302
m 720.203.1578 | ***@pandora.com

Marcos Juarez
2016-11-03 20:32:45 UTC
Just to expand on Lawrence's answer: our file descriptor usage goes from
2-3K under normal conditions to 64K+ under deadlock, which it hits within a
couple of hours; at that point the broker goes down, because 64K is our
OS-defined limit.

If it were only the ~33% increase from the new timestamp indexes, we would
top out at around 4-5K file descriptors in use, not 64K+.
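
For anyone wanting to catch this before the limit is hit, the JVM exposes
its open-FD count through the com.sun.management OperatingSystem MBean
(also reachable over JMX). A rough sketch; the 80% threshold is just an
arbitrary alerting point, not anything Kafka-defined:

    import com.sun.management.UnixOperatingSystemMXBean;
    import java.lang.management.ManagementFactory;

    public class FdWatch {
        public static void main(String[] args) throws InterruptedException {
            UnixOperatingSystemMXBean os = (UnixOperatingSystemMXBean)
                    ManagementFactory.getOperatingSystemMXBean();
            while (true) {
                long open = os.getOpenFileDescriptorCount();
                long max = os.getMaxFileDescriptorCount();
                // Healthy brokers sit in the low thousands for us; a deadlocked
                // one climbs toward the OS limit within a couple of hours.
                if (open > max * 0.8) {
                    System.err.printf("WARN: %d of %d file descriptors in use%n",
                            open, max);
                }
                Thread.sleep(60_000);
            }
        }
    }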

Marcos
Jason Gustafson
2016-11-03 23:19:57 UTC
Hey Marcos,

Thanks for the report. Can you check out
https://issues.apache.org/jira/browse/KAFKA-3994 and see if it matches? At
a glance, it looks like the same problem. We tried pretty hard to get the
fix into the release, but it didn't quite make it. A few questions:

1. Did you not see this in prior releases? As far as we can tell, it is
possible going back to 0.9.0.0, but there could be a subtle difference in
0.10.1.0 which increases its likelihood.
2. How often are you hitting this problem? You might try the patch that's
available if it's occurring frequently and you can't downgrade. I think the
existing patch is incomplete, but it should still reduce the likelihood.
3. Out of curiosity, what is the size of your cluster, and how many
consumers do you have?

Thanks!
Jason
Marcos Juarez
2016-11-04 16:48:36 UTC
Jason,

Thanks for that link. It does appear to be a very similar issue, if not
identical. In our case, the deadlock is reported across three threads, one
of them being a group-metadata-manager thread. Otherwise, it looks the same.

On your questions:

- We did not see this in prior releases, but we have been ramping up usage
of our Kafka clusters lately, so maybe we didn't have the volume needed to
trigger it before.

- Across our multiple staging and production clusters, we're seeing the
problem roughly once or twice a day.

- Our clusters are small at the moment. The two experiencing the issue have
5 and 8 brokers, respectively. The number of consumers is small as well,
fewer than 20 at the moment. The amount of data being produced is also
small, in the tens of megabytes per second, but the number of
connects/disconnects is really high, because the producers are usually
short-lived processes. Our guess at the moment is that this churn is what
triggers the bug; we have a separate cluster without short-lived producers,
and that one has been rock solid.
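
For a sense of the workload, the short-lived producers are essentially the
following pattern, repeated by many independent processes (illustrative
only; the broker address and topic name are placeholders): connect, send a
handful of records, disconnect.

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class ShortLivedProducer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "broker1:9092");  // placeholder
            props.put("key.serializer",
                    "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer",
                    "org.apache.kafka.common.serialization.StringSerializer");

            // Each process opens fresh connections, sends, and closes again,
            // so the brokers see constant connect/disconnect churn.
            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                producer.send(new ProducerRecord<>("some-topic", "key", "value"));
                producer.flush();
            }  // close() tears the connections back down
        }
    }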


We'll look into applying the patch, which could help reduce the problem.
The ticket says it's being targeted for the 0.10.2 release. Any rough
estimate of a timeline for that to come out?

Thanks!

Marcos
Jason Gustafson
2016-11-04 19:02:53 UTC
Hi Marcos,

I think we'll try to get this into 0.10.1.1 (I updated the JIRA). Since
we're now seeing users hit this in practice, I'll definitely bump up the
priority on a fix. I can't say for sure when the release will be, but we'll
merge the fix into the 0.10.1 branch and you can build from there if you
need something in a hurry.

Thanks,
Jason
Marcos Juarez
2016-11-04 20:05:31 UTC
That's great, thanks Jason.

We'll try to apply the patch in the meantime, and wait for the official
0.10.1.1 release.

Please let us know if you need more details about the deadlocks on our side.

Thanks again!

Marcos
Marcos Juarez
2016-11-07 17:47:31 UTC
We ran into this issue several more times over the weekend. Basically, FDs
are exhausted so fast now that we can't even get to the server in time; the
JVM goes down in less than 5 minutes.

I can send the whole thread dumps if needed, but for brevity's sake I just
copied the relevant deadlock segments and concatenated them together in the
attached text file.

Do you think this is something I should add to the KAFKA-3994 ticket? Or is
the information in that ticket enough for now?

Thanks,

Marcos
Becket Qin
2016-11-07 18:25:01 UTC
Hi Marcos,

Is it possible for you to apply the patch from KAFKA-3994 and see if the
issue is still there? The current patch should work; the only reason we
haven't checked it in is that, when we ran stress tests, it showed a
noticeable performance impact when producers are producing with acks=all.
So if you are blocked on this issue, you could pick up the patch as a
short-term solution. Meanwhile we will prioritize the ticket.
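
(For reference, acks=all is the producer-side setting where the leader waits
for the full set of in-sync replicas to acknowledge each write; a minimal
config sketch, with a placeholder broker address:)

    import java.util.Properties;
    import org.apache.kafka.clients.producer.ProducerConfig;

    public class AcksAllConfig {
        public static Properties producerProps() {
            Properties props = new Properties();
            props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092"); // placeholder
            // acks=all: wait for the full in-sync replica set before acking,
            // the case where the stress test showed the performance impact.
            props.put(ProducerConfig.ACKS_CONFIG, "all");
            props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
                    "org.apache.kafka.common.serialization.StringSerializer");
            props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
                    "org.apache.kafka.common.serialization.StringSerializer");
            return props;
        }
    }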

Thanks,

Jiangjie (Becket) Qin
Marcos Juarez
2016-11-08 00:47:25 UTC
Thanks Becket.

I was working on that today. I have a working jar built from the 0.10.1.0
branch with that specific KAFKA-3994 patch applied. I've left it running on
one test broker today; tomorrow I'll try to trigger the issue and test both
the patched and unpatched versions.

I'll let you know what we find.

Thanks,

Marcos
Post by Jason Gustafson
Hi Marcos,
Is it possible for you to apply the patch of KAFKA-3994 and see if the
issue is still there. The current patch of KAFKA-3994 should work, the only
reason we haven't checked that in was because when we ran stress test it
shows noticeable performance impact when producers are producing with
acks=all. So if you are blocking on this issue maybe you can pick up the
patch as a short term solution. Meanwhile we will prioritize the ticket.
Thanks,
Jiangjie (Becket) Qin
Post by Marcos Juarez
We ran into this issue several more times over the weekend. Basically,
FDs are exhausted so fast now, we can't even get to the server in time,
the
Post by Marcos Juarez
JVM goes down in less than 5 minutes.
I can send the whole thread dumps if needed, but for brevity's sake, I
just copied over the relevant deadlock segment, and concatenated them all
together in the attached text file.
Do you think this is something I should add to KAFKA-3994 ticket? Or is
the information in that ticket enough for now?
Thanks,
Marcos
Post by Marcos Juarez
That's great, thanks Jason.
We'll try and apply the patch in the meantime, and wait for the official
release for 0.10.1.1.
Please let us know if you need more details about the deadlocks on our side.
Thanks again!
Marcos
Marcos Juarez
2016-11-11 17:51:07 UTC
Permalink
Becket/Jason,

We deployed a jar with the base 0.10.1.0 release plus the KAFKA-3994 patch,
but we're seeing the same exact issue. It doesn't seem like the patch
fixes the problem we're seeing.

At this point, we're considering downgrading our prod clusters back to
0.10.0.1. Are there any concerns/issues we should be aware of when
downgrading the cluster like that?

Thanks,

Marcos Juarez
Post by Marcos Juarez
Thanks Becket.
I was working on that today. I have a working jar, created from the
0.10.1.0 branch, and that specific KAFKA-3994 patch applied to it. I've
left it running in one test broker today, will try tomorrow to trigger the
issue, and try it with both the patched and un-patched versions.
I'll let you know what we find.
Thanks,
Marcos
Becket Qin
2016-11-11 18:27:16 UTC
Permalink
Hi Marcos,

Thanks for the update. It looks like the deadlock you saw was another one. Do
you mind sending us a full stack trace after this happens?

Regarding the downgrade, the steps would be the following:
1. change inter.broker.protocol.version to 0.10.0
2. rolling bounce the cluster
3. deploy the 0.10.0.1 code

There might be a bunch of .timeindex files left over, but that should be fine.

Thanks,

Jiangjie (Becket) Qin
Post by Marcos Juarez
Becket/Jason,
We deployed a jar with the base 0.10.1.0 release plus the KAFKA-3994 patch,
but we're seeing the same exact issue. It doesn't seem like the patch
fixes the problem we're seeing.
At this point, we're considering downgrading our prod clusters back to
0.10.0.1. Are there any concerns/issues we should be aware of when
downgrading the cluster like that?
Thanks,
Marcos Juarez
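
A minimal sketch of what step 1 of the downgrade above could look like on each
broker, assuming the stock config/server.properties layout (the file path and
comments are illustrative; only the property name and value are documented
broker settings):

  # config/server.properties - set while the broker is still running 0.10.1.0,
  # so the cluster keeps speaking the 0.10.0 inter-broker protocol
  inter.broker.protocol.version=0.10.0

Brokers would then be restarted one at a time for the rolling bounce, and the
binaries swapped for 0.10.0.1 only after the whole cluster is on the older
protocol version.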
Marcos Juarez
2016-11-11 18:56:05 UTC
Permalink
Thanks Becket,

We should get a full thread dump the next time, so I'll send it as soon as
that happens.

Marcos
Post by Jason Gustafson
Hi Marcos,
Thanks for the update. It looks like the deadlock you saw was another one. Do
you mind sending us a full stack trace after this happens?
Regarding the downgrade, the steps would be the following:
1. change inter.broker.protocol.version to 0.10.0
2. rolling bounce the cluster
3. deploy the 0.10.0.1 code
There might be a bunch of .timeindex files left over, but that should be fine.
Thanks,
Jiangjie (Becket) Qin
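
For anyone wanting to capture such a dump programmatically, a minimal Java
sketch using the standard java.lang.management API is below. It is only an
illustration: it dumps whatever JVM it runs in, so it would need to run inside
the broker (or be adapted to attach over JMX), and the dumps in this thread
were in fact taken with jstack and VisualVM.

import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

public class DeadlockProbe {
    public static void main(String[] args) {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();

        // Dump every live thread, including locked monitors and ownable
        // synchronizers (roughly what jstack -l prints, though
        // ThreadInfo.toString() truncates very deep stacks).
        for (ThreadInfo info : threads.dumpAllThreads(true, true)) {
            System.out.print(info);
        }

        // Thread IDs deadlocked on monitors or synchronizers, or null if none.
        long[] deadlocked = threads.findDeadlockedThreads();
        if (deadlocked != null) {
            System.out.println("Found " + deadlocked.length + " deadlocked threads:");
            for (ThreadInfo info : threads.getThreadInfo(deadlocked, true, true)) {
                System.out.print(info);
            }
        }
    }
}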
Marcos Juarez
2016-11-18 22:59:55 UTC
Permalink
Becket/Jason,

So, it turns out the server where we saw the recurring FD issue was not
patched correctly, which is why we saw the deadlock again. We caught that,
and after testing over the last few days, we feel pretty confident, I'd say
99% sure, that the patch in KAFKA-3994 does fix the problem for good. All
the evidence right now indicates that any broker that had been patched
correctly never experienced the deadlock again. If we put stock Kafka
0.10.1.0 on those brokers again, we get deadlocks within a day or two.

Any idea of roughly when this patch will make it into a Kafka release?

Thanks for your help!

Marcos
Post by Marcos Juarez
Thanks Becket,
We should get a full thread dump the next time, so I'll send it as soon as
that happens.
Marcos
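
A small sketch of the kind of file-descriptor watch that can help confirm a
patched broker stays healthy. It assumes a HotSpot/OpenJDK JVM on a Unix-like
OS (the com.sun.management extension) and reports the FD usage of the JVM it
runs in, so it is purely illustrative; the numbers quoted in this thread were
observed at the OS level.

import java.lang.management.ManagementFactory;
import com.sun.management.UnixOperatingSystemMXBean;

public class FdWatch {
    public static void main(String[] args) throws InterruptedException {
        // The cast is only valid on Unix-like platforms with the
        // HotSpot/OpenJDK com.sun.management extensions.
        UnixOperatingSystemMXBean os =
                (UnixOperatingSystemMXBean) ManagementFactory.getOperatingSystemMXBean();

        while (true) {
            long open = os.getOpenFileDescriptorCount();
            long max = os.getMaxFileDescriptorCount();
            System.out.println("open FDs: " + open + " / " + max);
            // A deadlocked broker climbed from ~2-3K FDs toward its 64K limit
            // within a couple of hours, so a simple threshold alert suffices.
            if (open > max * 0.8) {
                System.out.println("WARNING: FD usage above 80% of the limit");
            }
            Thread.sleep(60_000L);
        }
    }
}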
Post by Jason Gustafson
Hi Marcos,
Thanks for the update. It looks the deadlock you saw was another one. Do
you mind sending us a full stack trace after this happens?
1. change the inter.broker.protocol to 0.10.0
2. rolling bounce the cluster
3. deploy the 0.10.0.1 code
There might be a bunch of .timeindex file left over but that should be fine.
Thanks,
Jiangjie (Becket) Qin
Post by Marcos Juarez
Becket/Jason,
We deployed a jar with the base 0.10.1.0 release plus the KAFKA-3994
patch,
Post by Marcos Juarez
but we're seeing the same exact issue. It doesnt' seem like the patch
fixes the problem we're seeing.
At this point, we're considering downgrading our prod clusters back to
0.10.0.1. Is there any concern/issues we should be aware of when
downgrading the cluster like that?
Thanks,
Marcos Juarez
Post by Marcos Juarez
Thanks Becket.
I was working on that today. I have a working jar, created from the
0.10.1.0 branch, and that specific KAFKA-3994 patch applied to it.
I've
Post by Marcos Juarez
Post by Marcos Juarez
left it running in one test broker today, will try tomorrow to trigger
the
Post by Marcos Juarez
issue, and try it with both the patched and un-patched versions.
I'll let you know what we find.
Thanks,
Marcos
Post by Jason Gustafson
Hi Marcos,
Is it possible for you to apply the patch of KAFKA-3994 and see if
the
Post by Marcos Juarez
Post by Marcos Juarez
Post by Jason Gustafson
issue is still there. The current patch of KAFKA-3994 should work,
the
Post by Marcos Juarez
Post by Marcos Juarez
Post by Jason Gustafson
only
reason we haven't checked that in was because when we ran stress
test it
Post by Marcos Juarez
Post by Marcos Juarez
Post by Jason Gustafson
shows noticeable performance impact when producers are producing with
acks=all. So if you are blocking on this issue maybe you can pick up
the
Post by Marcos Juarez
Post by Marcos Juarez
Post by Jason Gustafson
patch as a short term solution. Meanwhile we will prioritize the
ticket.
Post by Marcos Juarez
Post by Marcos Juarez
Post by Jason Gustafson
Thanks,
Jiangjie (Becket) Qin
Post by Marcos Juarez
We ran into this issue several more times over the weekend.
Basically,
Post by Marcos Juarez
Post by Jason Gustafson
Post by Marcos Juarez
FDs are exhausted so fast now, we can't even get to the server in
time,
Post by Marcos Juarez
Post by Jason Gustafson
the
Post by Marcos Juarez
JVM goes down in less than 5 minutes.
I can send the whole thread dumps if needed, but for brevity's
sake, I
Post by Marcos Juarez
Post by Marcos Juarez
Post by Jason Gustafson
Post by Marcos Juarez
just copied over the relevant deadlock segment, and concatenated
them
Post by Marcos Juarez
Post by Marcos Juarez
Post by Jason Gustafson
all
Post by Marcos Juarez
together in the attached text file.
Do you think this is something I should add to KAFKA-3994 ticket?
Or
Post by Marcos Juarez
is
Post by Marcos Juarez
Post by Jason Gustafson
Post by Marcos Juarez
the information in that ticket enough for now?
Thanks,
Marcos
Post by Marcos Juarez
That's great, thanks Jason.
We'll try and apply the patch in the meantime, and wait for the official
release of 0.10.1.1.
Please let us know if you need more details about the deadlocks on our side.
Thanks again!
Marcos
On Fri, Nov 4, 2016 at 1:02 PM, Jason Gustafson <
Post by Jason Gustafson
Hi Marcos,
I think we'll try to get this into 0.10.1.1 (I updated the JIRA). Since
we're now seeing users hit this in practice, I'll definitely bump up the
priority on a fix. I can't say for sure when the release will be, but we'll
merge the fix into the 0.10.1 branch and you can build from there if you
need something in a hurry.
Thanks,
Jason
Post by Marcos Juarez
Jason,
Thanks for that link. It does appear to be a very similar issue, if not
identical. In our case, the deadlock is reported as across 3 threads, one of
them being a group_metadata_manager thread. Otherwise, it looks the same.
- We did not see this in prior releases, but we are ramping up usage of our
kafka clusters lately, so maybe we didn't have the needed volume before to
trigger it.
- Across our multiple staging and production clusters, we're seeing the
problem roughly once or twice a day.
- Our clusters are small at the moment. The two that are experiencing the
issue are 5 and 8 brokers, respectively. The number of consumers is small,
I'd say less than 20 at the moment. The amount of data being produced is
small also, in the tens of megabytes per second range, but the number of
connects/disconnects is really high, because they're usually short-lived
processes. Our guess at the moment is that this is triggering the bug. We
have a separate cluster where we don't have short-lived producers, and that
one has been rock solid.
We'll look into applying the patch, which could help reduce the problem.
The ticket says it's being targeted for the 0.10.2 release. Any rough
estimate of a timeline for that to come out?
Thanks!
Marcos
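
A note on the short-lived producer pattern described above: every
KafkaProducer instance opens its own set of broker connections, so creating
and closing a producer per task produces exactly this kind of
connect/disconnect churn. A minimal sketch of the long-lived alternative,
with placeholder broker addresses, topic, and class name; reusing one
producer per process reduces the churn on the broker, though it is of course
not a fix for the broker-side deadlock itself.

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class SharedProducer {
    // One producer per process, reused for all sends; KafkaProducer is thread-safe.
    private static final KafkaProducer<String, String> PRODUCER = create();

    private static KafkaProducer<String, String> create() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1:9092,broker2:9092"); // placeholder brokers
        props.put("key.serializer",
                  "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                  "org.apache.kafka.common.serialization.StringSerializer");
        return new KafkaProducer<>(props);
    }

    public static void send(String key, String value) {
        // Reusing the same instance keeps a small, stable set of broker connections
        // instead of a new connect/disconnect cycle per message or per task.
        PRODUCER.send(new ProducerRecord<>("example-topic", key, value)); // placeholder topic
    }

    public static void close() {
        // Close once, on process shutdown.
        PRODUCER.close();
    }
}
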
On Thu, Nov 3, 2016 at 5:19 PM, Jason Gustafson <
Post by Jason Gustafson
Hey Marcos,
Thanks for the report. Can you check out
https://issues.apache.org/jira/browse/KAFKA-3994 and see if it matches? At a
glance, it looks like the same problem. We tried pretty hard to get the fix
into the release, but it didn't quite make it. A few questions:
1. Did you not see this in prior releases? As far as we can tell, it is
possible going back to 0.9.0.0, but there could be a subtle difference in
0.10.1.0 which increases its likelihood.
2. How often are you hitting this problem? You might try the patch that's
available if it's occurring frequently and you can't downgrade. I think the
existing patch is incomplete, but it should still reduce the likelihood.
3. Out of curiosity, what is the size of your cluster and how many consumers
do you have in your cluster?
Thanks!
Jason
On Thu, Nov 3, 2016 at 1:32 PM, Marcos Juarez <
Post by Marcos Juarez
Just to expand on Lawrence's answer: the increase in file descriptor usage
goes from 2-3K under normal conditions to 64K+ under deadlock, which it hits
within a couple of hours, at which point the broker goes down, because
that's our OS-defined limit.
If it was only a 33% increase from the new timestamp indexes, we should be
going to a max of 4K-5K file descriptors in use, not 64K+.
Marcos
On Thu, Nov 3, 2016 at 1:53 PM, Lawrence Weikum <
Post by Lawrence Weikum
We saw this increase when upgrading from 0.9.0.1 to 0.10.0.1.
We're now running on 0.10.1.0, and the FD increase is due to a deadlock,
not functionality or new features.
Lawrence Weikum | Software Engineer | Pandora
1426 Pearl Street, Suite 100, Boulder CO 80302