Discussion:
Kafka Streams Session store performance degradation from 0.10.2.1 to 0.11.0.3
Jonathan Gordon
2018-11-06 17:49:19 UTC
I have a Kafka Streams app that I'm trying to upgrade from 0.10.2.1 to
0.11.0.3 but when I do I notice that CPU goes way up and consumption goes
down. A thread profile indicates that the most expensive task is during our
aggregation, fetching from the cache.

Thread profile with caching:
https://imgur.com/l5VEsC2

If I disable the cache both performance and consumption are good but we are
producing every single aggregation modification, which is not what we want.
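
For reference, the difference between the two profiles comes down to the size of the record cache. Below is a minimal sketch, assuming the cache was disabled through the `cache.max.bytes.buffering` setting (the post does not say how it was done). The class name, helper names, and values are illustrative only, and the config keys are written as plain strings so the snippet runs without Kafka on the classpath:

```java
import java.util.Properties;

final class CacheConfigSketch {
    // Kafka Streams config keys, spelled out as strings to avoid a dependency.
    static final String CACHE_MAX_BYTES = "cache.max.bytes.buffering";
    static final String COMMIT_INTERVAL_MS = "commit.interval.ms";

    // Builds the two settings that control how often cached updates are flushed.
    static Properties withCache(long cacheBytes, long commitIntervalMs) {
        Properties props = new Properties();
        props.setProperty(CACHE_MAX_BYTES, Long.toString(cacheBytes));
        props.setProperty(COMMIT_INTERVAL_MS, Long.toString(commitIntervalMs));
        return props;
    }

    // A cache size of 0 bytes turns record caching off.
    static Properties withoutCache(long commitIntervalMs) {
        return withCache(0L, commitIntervalMs);
    }

    public static void main(String[] args) {
        // Illustrative values: a 10 MiB cache flushed at least every 30 seconds.
        System.out.println(withCache(10L * 1024 * 1024, 30_000L));
        System.out.println(withoutCache(30_000L));
    }
}
```

With the cache at 0 bytes, every update to an aggregate is forwarded downstream. With a non-zero cache, updates to the same key are compacted until the cache fills up or the commit interval elapses.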

Thread profile without caching:
https://imgur.com/a/JK3nkou

I read this thread, which seems relevant:

https://lists.apache.org/thread.html/2b44e74eaec7172b107bcff96861cf8b4837f55a44714f69d033cc2e@%3Cusers.kafka.apache.org%3E

Notably: "Note, that caching was _not_ introduced to reduce the writes to
RocksDB, but to reduce the write the the changelog topic and to reduce the
number of records send downstream."

So how can we reduce the number of records sent downstream while
maintaining the same performance characteristics that we have with caching
turned off? Or put another way, how can I upgrade my app without taking a
hit in performance or behavior?

Thanks!
Matthias J. Sax
2018-11-06 19:22:16 UTC
Not sure atm why you see a performance degradation. Would need to dig
into the details.

However, did you consider upgrading to 2.0 instead of 0.11?

Also note that we added a new operator `suppress()` in the upcoming 2.1
release that allows you to do rate control without caching:
https://cwiki.apache.org/confluence/display/KAFKA/KIP-328%3A+Ability+to+suppress+updates+for+KTables
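
To illustrate the idea behind rate control without a cache, here is a rough model in plain Java. It is not the Kafka Streams API: the real operator is the `suppress()` described in KIP-328, while the class `SuppressSketch`, its methods, and the time-limit rule below are hypothetical stand-ins chosen only to show the "keep the latest update per key, emit later" behavior:

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class SuppressSketch {
    private final long emitAfterMs;
    // Latest pending value per key, and the time the key was first buffered.
    private final Map<String, Long> latest = new LinkedHashMap<>();
    private final Map<String, Long> firstSeen = new LinkedHashMap<>();

    SuppressSketch(long emitAfterMs) {
        this.emitAfterMs = emitAfterMs;
    }

    // Buffers an update, overwriting any earlier pending update for the key.
    void update(String key, long value, long nowMs) {
        latest.put(key, value);
        firstSeen.putIfAbsent(key, nowMs);
    }

    // Emits one record per key whose first buffered update is old enough.
    List<String> flush(long nowMs) {
        List<String> out = new ArrayList<>();
        Iterator<Map.Entry<String, Long>> it = firstSeen.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<String, Long> e = it.next();
            if (nowMs - e.getValue() >= emitAfterMs) {
                out.add(e.getKey() + "=" + latest.remove(e.getKey()));
                it.remove();
            }
        }
        return out;
    }

    public static void main(String[] args) {
        SuppressSketch s = new SuppressSketch(50L);
        s.update("a", 1L, 0L);
        s.update("a", 2L, 10L); // replaces the pending value for "a"
        s.update("b", 5L, 20L);
        System.out.println(s.flush(50L));  // only "a" is old enough here
        System.out.println(s.flush(100L)); // "b" follows later
    }
}
```

Three updates go in and two records come out, one per key, which is the reduction in downstream records that the original question asks for.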

Hope this helps.


-Matthias
j***@newrelic.com
2018-11-07 23:47:27 UTC
Hi Matthias,

I upgraded to 2.0.0 and we're experiencing the same problem. I've posted a new screengrab of a thread profile:

https://imgur.com/a/2wncPHw

From our perspective, it appears something happened after 0.10.2.1 that made the LRU Cache much slower for our use case. What would you recommend for our next steps?

Jonathan
Matthias J. Sax
2018-11-08 00:13:39 UTC
Thanks for verifying.
Post by j***@newrelic.com
From our perspective, it appears something happened after 0.10.2.1 that made the LRU Cache much slower for our use case.
That is what I am trying to figure out. I went over the 0.10.2.2 to 0.11.0.3
Jiras but found nothing I could point to. There are a couple of
SessionStore-related tickets, but none of them should have an effect
like this.

To narrow it down, it would be helpful to test with other versions, too.
Maybe 0.10.2.2 and 0.11.0.0 to see when the issue was introduced.

Can you also profile v0.10.2.1 so we can compare?
Post by j***@newrelic.com
What would you recommend for our next steps?
Not sure. If you could help us track down the issue, that would be
most helpful to get a fix (and you could run from a SNAPSHOT version to
get the fix -- not sure if this would be an option for you).


-Matthias
Guozhang Wang
2018-11-17 00:26:56 UTC
Hi Jonathan,

Could you create a JIRA with all the current available information uploaded
on the ticket for me to further investigate the issue? This way we will not
lose track of it (email list is not the best venue for potential bug
investigation :).

In the meantime, I will try to compare the source code of 0.10.2 and 2.0
and see if I can eyeball any obvious issues.

Guozhang
j***@newrelic.com
2018-11-17 19:43:56 UTC
Post by Guozhang Wang
Could you create a JIRA with all the current available information uploaded
on the ticket for me to further investigate the issue? This way we will not
lose track of it (email list is not the best venue for potential bug
investigation :).
Here you go. I've added some logs which show the issue pretty clearly:

https://issues.apache.org/jira/browse/KAFKA-7652
Post by Guozhang Wang
In the meantime, I will try to compare the source code of 0.10.2 and 2.0
and see if I can eyeball any obvious issues.
Great. Please let me know if there's any way I can assist.

j***@newrelic.com
2018-11-16 00:33:24 UTC
Post by Matthias J. Sax
That is what I am trying to figure out. I went over the 0.10.2.2 to 0.11.0.3
Jiras but found nothing I could point to. There are a couple of
SessionStore-related tickets, but none of them should have an effect
like this.
To narrow it down, it would be helpful to test with other versions, too.
Maybe 0.10.2.2 and 0.11.0.0 to see when the issue was introduced.
Done. So far here's what my tests have shown:
With 0.10.2.1 (the version we're currently running) and 0.10.2.2, the local cache works properly and we see thread profiles similar to what I posted earlier, where the majority of time is spent in RocksDB and there's no lag.

Tests with 0.11.0.0, 0.11.0.3, 1.1.1, 2.0.0 and 2.0.1 all show us spending the majority of time in the local cache, and we lag considerably:

https://imgur.com/l5VEsC2
Post by Matthias J. Sax
Can you also profile v0.10.2.1 so we can compare?
Here's a recent profile for 0.10.2.1:

https://imgur.com/a/Sto636s
Post by Matthias J. Sax
Post by j***@newrelic.com
What would you recommend for our next steps?
Not sure. If you could help us track down the issue, that would be
most helpful to get a fix (and you could run from a SNAPSHOT version to
get the fix -- not sure if this would be an option for you).
Another developer took a look at the code and had some thoughts:

"It appears we're scanning an order of magnitude more keys for every call to `findSessions`. You can see this manifest in the flush logs where version 0.11.0.3 and later will have a billion hits on the cache in 10 minutes, even though the number of events consumed is only 1M. It seems like when they made some fixes to make sure all possible windows for a session merge are found that resulted in having to scan every entry in the cache."

Is there a way for us to refine the cache search so we're not searching the entire key space?
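
To put numbers on the quote above: a billion cache hits for 1M consumed events is roughly 1,000 cache lookups per event. The sketch below shows why bounding the search by key matters. It is not Kafka's implementation; the `SessionKey` layout (ordered by key, then session end, then session start), the class and method names, and the sample data are all assumptions made for illustration. Only the query shape (key, earliest session end time, latest session start time) follows `findSessions`.

```java
import java.util.TreeMap;

final class SessionScanSketch {
    // Hypothetical composite cache key, ordered by key, then end, then start.
    static final class SessionKey implements Comparable<SessionKey> {
        final String key;
        final long start;
        final long end;

        SessionKey(String key, long start, long end) {
            this.key = key;
            this.start = start;
            this.end = end;
        }

        @Override
        public int compareTo(SessionKey o) {
            int c = key.compareTo(o.key);
            if (c != 0) return c;
            c = Long.compare(end, o.end);
            return c != 0 ? c : Long.compare(start, o.start);
        }
    }

    static boolean matches(SessionKey s, String key, long earliestEnd, long latestStart) {
        return s.key.equals(key) && s.end >= earliestEnd && s.start <= latestStart;
    }

    // Visits every cached entry. Returns {entries visited, sessions matched}.
    static int[] fullScan(TreeMap<SessionKey, Long> cache, String key,
                          long earliestEnd, long latestStart) {
        int visited = 0, found = 0;
        for (SessionKey s : cache.keySet()) {
            visited++;
            if (matches(s, key, earliestEnd, latestStart)) found++;
        }
        return new int[] {visited, found};
    }

    // Visits only entries for this key whose end time is at least earliestEnd.
    static int[] rangeScan(TreeMap<SessionKey, Long> cache, String key,
                           long earliestEnd, long latestStart) {
        SessionKey from = new SessionKey(key, 0L, earliestEnd);
        SessionKey to = new SessionKey(key, Long.MAX_VALUE, Long.MAX_VALUE);
        int visited = 0, found = 0;
        for (SessionKey s : cache.subMap(from, true, to, true).keySet()) {
            visited++;
            if (matches(s, key, earliestEnd, latestStart)) found++;
        }
        return new int[] {visited, found};
    }

    // Builds sample data: sessions [0,50], [100,150], [200,250], ... per key.
    static TreeMap<SessionKey, Long> sample(int keys, int sessionsPerKey) {
        TreeMap<SessionKey, Long> cache = new TreeMap<>();
        for (int k = 0; k < keys; k++) {
            for (int i = 0; i < sessionsPerKey; i++) {
                long start = i * 100L;
                cache.put(new SessionKey("k" + k, start, start + 50L), 1L);
            }
        }
        return cache;
    }

    public static void main(String[] args) {
        TreeMap<SessionKey, Long> cache = sample(1000, 3);
        int[] full = fullScan(cache, "k42", 100L, 160L);
        int[] ranged = rangeScan(cache, "k42", 100L, 160L);
        System.out.println("full scan visited " + full[0] + ", matched " + full[1]);
        System.out.println("range scan visited " + ranged[0] + ", matched " + ranged[1]);
    }
}
```

Both scans find the same session, but the full scan touches every cached entry while the range scan touches only the entries for one key. Whether the real cache can be queried this way depends on how its keys are serialized and ordered, which this sketch does not model.
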
Jonathan Gordon
2018-11-07 23:28:21 UTC
Hi Matthias,

I upgraded to 2.0.0 and we're experiencing the same problem. I've posted a
new screengrab of a thread profile:

https://imgur.com/a/2wncPHw

From our perspective, it appears something happened after 0.10.2.1 that
made the LRU Cache much slower for our use case. What would you recommend
for our next steps?

Jonathan