log truncation did not happen on old leader?
Zaiming Shi
2018-11-14 16:52:54 UTC
Hi there!

We are running Kafka 0.11.0 with the 0.10.0 message format configured for a
topic.
The topic has 1 partition and 3 replicas, and unclean.leader.election.enable
is set to false.

We have reason to believe that an old partition leader did not truncate
its dirty log tail
before syncing with the new leader.

Each message we produce has a unique ID together with a sequence number
(seqno)
generated by the producer. When the producer restarts, seqno starts over
from 0.
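
To illustrate the numbering scheme, here is a minimal sketch in Python. The class name SeqnoStamper, the JSON envelope, and the ids "a" to "e" are assumptions made for illustration only; our real producer code is not shown here. The counter lives only in memory, which is why a restarted producer begins again at 0.

```python
import itertools
import json


class SeqnoStamper:
    """Hypothetical helper: stamps each message with a per-process seqno."""

    def __init__(self):
        # In-memory counter, so a restarted process begins again at 0.
        self._counter = itertools.count(0)

    def stamp(self, msg_id, payload):
        # Returns the bytes that would be handed to the Kafka producer.
        record = {"id": msg_id, "seqno": next(self._counter), "payload": payload}
        return json.dumps(record, sort_keys=True).encode("utf-8")


# First producer incarnation: five earlier messages, then id=x gets seqno=5.
first = SeqnoStamper()
for earlier_id in ["a", "b", "c", "d", "e"]:
    first.stamp(earlier_id, "data")
before_crash = json.loads(first.stamp("x", "data"))

# After a restart, the same id=x is retried and the counter starts over.
second = SeqnoStamper()
after_restart = json.loads(second.stamp("x", "data"))
```

With this scheme the same id can legitimately appear with two different seqno values, one per producer incarnation.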

Something like this happened:
The producer crashed with a 'connection closed' exception (broker restart)
while trying to produce the message (id=x, seqno=5) to node-3.

After the new leader (node-1) was discovered, the producer produced (id=x,
seqno=0),
followed by many more messages such as (id=y, seqno=1) ...

Many consumers fetched (id=x, seqno=0), (id=y, seqno=1) ... as expected.
However, a while later, the leader moved back to node-3, and
a slower consumer fetched (id=x, seqno=5), (id=y, seqno=1) instead.
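
To make the suspected sequence concrete, here is a toy model in Python. It is only a sketch under stated assumptions: the resync function and the five earlier messages are hypothetical, and this is not Kafka's actual replication code. It shows that a follower which keeps its unreplicated tail and simply fetches from its own log end would end up with exactly the records the slower consumer saw.

```python
def resync(follower_log, leader_log, truncate_to=None):
    """Toy follower catch-up: optionally truncate, then copy the leader's tail."""
    log = list(follower_log)
    if truncate_to is not None:
        # Drop the unreplicated ("dirty") tail before fetching.
        log = log[:truncate_to]
    # Fetch everything from the follower's own log end offset onward.
    log.extend(leader_log[len(log):])
    return log


# (id, seqno) pairs that all replicas agree on (hypothetical earlier messages).
committed = [("a", 0), ("b", 1), ("c", 2), ("d", 3), ("e", 4)]

# node-3 appended (x, 5) locally, but it was never replicated.
node3 = committed + [("x", 5)]

# node-1 became leader and accepted the messages from the restarted producer.
node1 = committed + [("x", 0), ("y", 1)]

with_truncation = resync(node3, node1, truncate_to=len(committed))
without_truncation = resync(node3, node1)
```

In this model, with_truncation equals node-1's log, while without_truncation keeps (x, 5) at the contested offset and still has consecutive offsets and the same set of ids.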

The consumers persist the Kafka offset, id, and seqno to a database.
We can see that ALL consumers stored consecutive Kafka offsets,
and they also saw all (unique) message IDs.
The only difference is that seqno=0 was fetched from node-1 but seqno=5
from node-3.
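
The comparison of the stored rows can be sketched as follows. The function name find_divergence and the offsets 100 and 101 are hypothetical; the real rows live in our database and are not reproduced here. The sketch assumes each consumer's rows are available as (kafka_offset, id, seqno) tuples.

```python
def find_divergence(rows_a, rows_b):
    """Return offsets at which two consumers stored a different (id, seqno)."""
    by_offset_a = {offset: (msg_id, seqno) for offset, msg_id, seqno in rows_a}
    by_offset_b = {offset: (msg_id, seqno) for offset, msg_id, seqno in rows_b}
    diverged = []
    # Only offsets that both consumers have stored can be compared.
    for offset in sorted(by_offset_a.keys() & by_offset_b.keys()):
        if by_offset_a[offset] != by_offset_b[offset]:
            diverged.append((offset, by_offset_a[offset], by_offset_b[offset]))
    return diverged


# Hypothetical rows: (kafka_offset, id, seqno).
fast_consumer = [(100, "x", 0), (101, "y", 1)]
slow_consumer = [(100, "x", 5), (101, "y", 1)]

mismatches = find_divergence(fast_consumer, slow_consumer)
```

For the hypothetical rows above, the only mismatch is at offset 100, where the ids agree but the seqno values differ.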

We would like to get some insight into this. Is it a Kafka bug, a
misconfiguration, or something else?

Some warning logs from Kafka:
WARN [Channel manager on controller 1]: Not sending request (type=StopReplicaRequest, controllerId=1, controllerEpoch=267, deletePartitions=false, partitions=my-topic-0, ..... to broker 3, since it is offline. (kafka.controller.ControllerChannelManager)
WARN [Controller 1]: Cannot remove replica 3 from ISR of partition [my-topic,0] since it is not in the ISR. Leader = 1 ; ISR = List(1, 2) (kafka.controller.KafkaController)

Regards
-Zaiming
