Discussion:
stuck re-balance
Tom Raney
2017-01-27 18:16:05 UTC
After adding a new Kafka node, I ran the kafka-reassign-partitions.sh tool
to redistribute topics onto the new machine. Some of the migrations appeared
to be stuck for over 24 hours, so I cancelled the reassignment by deleting
the zk node (/admin/reassign_partitions) and ran
kafka-preferred-replica-election.sh to try to resolve it. It didn't work.

Now I have partitions in a weird state. For example, one partition lists
broker 1003 as a replica, but it shouldn't be there. The partition
directory on 1003 is still growing, but it is way behind the leader and the
other in-sync replica on 1001.

Topic: foo Partition: 2 Leader: 1004 Replicas: 1003,1004,1001 Isr: 1004,1001

When I force a leader election for that partition, it fails because 1003
is not in sync.

kafka.common.StateChangeFailedException: encountered error while electing
leader for partition [foo,2] due to: Preferred replica 1003 for partition
[foo,2] is either not alive or not in the isr. Current leader and ISR:
[{"leader":1004,"leader_epoch":11,"isr":[1004,1001]}].
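The failure follows from how preferred replica election works: the preferred replica is the first broker in the Replicas list, and it can only be elected if it is currently in the ISR. A minimal Python sketch of that check, assuming the one-line describe format shown above (the helper names are made up for illustration and are not part of any Kafka tool):

```python
import re

def parse_describe_line(line):
    """Parse one partition line in the describe format quoted in this thread."""
    pattern = (r"Topic:\s*(\S+)\s+Partition:\s*(\d+)\s+Leader:\s*(-?\d+)\s+"
               r"Replicas:\s*([\d,]+)\s+Isr:\s*([\d,]*)")
    match = re.search(pattern, line)
    if match is None:
        raise ValueError("unrecognized describe line: %r" % line)

    def to_ids(text):
        # Turn "1003,1004" into [1003, 1004]; an empty ISR yields [].
        return [int(x) for x in text.split(",") if x]

    return {
        "topic": match.group(1),
        "partition": int(match.group(2)),
        "leader": int(match.group(3)),
        "replicas": to_ids(match.group(4)),
        "isr": to_ids(match.group(5)),
    }

def preferred_replica_status(line):
    """Report whether the preferred (first-listed) replica could be elected."""
    info = parse_describe_line(line)
    preferred = info["replicas"][0]
    return {
        "preferred": preferred,
        "electable": preferred in info["isr"],
        "out_of_sync": [r for r in info["replicas"] if r not in info["isr"]],
    }

line = "Topic: foo Partition: 2 Leader: 1004 Replicas: 1003,1004,1001 Isr: 1004,1001"
status = preferred_replica_status(line)
print(status)
```

For the partition above, the preferred replica is 1003 and it is the only replica missing from the ISR, which matches the exception text.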

When I try to reassign with the config...

{"version":1,"partitions":[{"topic":"foo","partition":2,"replicas":[1004,1001]}]}

The reassignment never completes:

Status of partition reassignment:
Reassignment of partition [foo,2] is still in progress

I would expect it to complete, since 1001 is already in the ISR and 1004
is already the leader.
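For reference, the reassignment config above can be generated rather than typed by hand. This is only a sketch: the function name is hypothetical, and it assumes the version 1 JSON layout shown in this post.

```python
import json

def build_reassignment(topic, partition, replicas):
    """Build a single-partition reassignment plan as a compact JSON string."""
    plan = {
        "version": 1,
        "partitions": [
            {"topic": topic, "partition": partition, "replicas": list(replicas)}
        ],
    }
    # Compact separators reproduce the one-line form used in this thread.
    return json.dumps(plan, separators=(",", ":"))

plan_json = build_reassignment("foo", 2, [1004, 1001])
print(plan_json)
```

The output would be saved to a file and passed to kafka-reassign-partitions.sh with --reassignment-json-file, using --execute to submit it and --verify to check status.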

How do I resolve this?
Todd Palino
2017-01-27 18:20:56 UTC
Did you move the controller (by deleting the /controller znode) after
removing the reassign_partitions znode? If not, the controller is probably
still trying to do that move, and is not going to accept a new move request.
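A rough sketch of that sequence in Python, for anyone scripting it. It assumes a connected ZooKeeper client object exposing exists/get/delete the way kazoo's KazooClient does; the function name, the in-memory stand-in class, and the controller broker id in the sample data are all made up for illustration. Deleting /controller makes the brokers elect a new controller, which starts without the stale in-progress move.

```python
import json

CONTROLLER_PATH = "/controller"
REASSIGN_PATH = "/admin/reassign_partitions"

def force_controller_reelection(zk):
    """Drop the pending reassignment, then delete /controller.

    `zk` is assumed to behave like a connected kazoo KazooClient.
    Returns the broker id of the old controller, or None if there was none.
    """
    if zk.exists(REASSIGN_PATH):
        zk.delete(REASSIGN_PATH)
    if not zk.exists(CONTROLLER_PATH):
        return None
    data, _stat = zk.get(CONTROLLER_PATH)
    old_controller = json.loads(data.decode("utf-8")).get("brokerid")
    zk.delete(CONTROLLER_PATH)
    return old_controller

class FakeZk:
    """In-memory stand-in so the sketch can be dry-run without ZooKeeper."""
    def __init__(self, nodes):
        self.nodes = dict(nodes)
    def exists(self, path):
        return path in self.nodes
    def get(self, path):
        return self.nodes[path], None
    def delete(self, path):
        del self.nodes[path]

# Sample data only: the controller id here is hypothetical.
fake = FakeZk({
    CONTROLLER_PATH: b'{"version":1,"brokerid":1004,"timestamp":"0"}',
    REASSIGN_PATH: b'{"version":1,"partitions":[]}',
})
old_controller = force_controller_reelection(fake)
print(old_controller, sorted(fake.nodes))
```

Against a real cluster, a connected client would be passed in place of FakeZk. Once the new controller is up, the reassignment can be submitted again.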
--
*Todd Palino*
Staff Site Reliability Engineer
Data Infrastructure Streaming
linkedin.com/in/toddpalino
Tom Raney
2017-01-27 18:57:02 UTC
Thanks, Todd! Deleting the /controller znode worked.