Discussion:
Kafka in multi-dc deployment
igor polyakov
2018-11-19 14:45:47 UTC
Permalink
What is the best way for Kafka multi-dc deployments to pick up processing at another location if a primary location fails? What features of Kafka are the most relevant for continuous processing?

Sincerely,
Igor
Check these out !
[Ad Placement 1]<http://mailwords.us/mailwords/consumer/PickYourAddServlet?userID=user1&imageID=b21f8e89-4a11-4b8a-9276-66a068c4f231> [Ad Placement 2] <http://mailwords.us/mailwords/consumer/PickYourAddServlet?userID=user1&imageID=6588f140-01db-43c2-9c34-36edd8fcfec3> [Ad Placement 3] <http://mailwords.us/mailwords/consumer/PickYourAddServlet?userID=user1&imageID=c8a1ae81-ffe9-4a63-96d8-2ef1a9588e16>
Ryanne Dolan
2018-11-19 18:10:08 UTC
Permalink
Igor, currently the best available disaster recovery strategy looks
something like this:

- Use MirrorMaker to replicate data from a source cluster to a target
cluster. MM should be co-located with the target cluster to minimize
producer lag. If you have multiple active data centers, each would need
it's own set of MM clusters to copy from the other DCs.
- Use a topic naming convention and MM whitelists to prevent cycles;
otherwise, messages would be replicated back-and-forth indefinitely between
clusters. Alternatively you can use passive "aggregation clusters" in each
DC to prevent cycles.
- To failover producers, you can usually just point them at another
cluster, unless your use-case would not allow this. Depending on your
multi-cluster topic naming convention, you may need to change the topic
that the producer sends to.
- To failover consumers, you will need to point them at another cluster
_and_ figure out what offsets they should resume consuming from, since
offsets are not consistent between mirrored clusters. To do this, use
timestamp-based offset translation as described below. Depending on your
multi-cluster topic naming convention, you may need to change the topic the
consumer reads from.
- To failback consumers, you'll need to 1) ensure that the original
cluster's MM clusters are caught up s.t. the original clusters have the
latest records. Then use timestamp-based offset translation as described
below.
- To failback producers, just point them at their primary clusters again.
Don't failback producers until after you failback consumers, or you'll have
out-of-order issues.

To translate consumers offsets based on timestamps, you need three pieces
of information: 1) the max replication lag, i.e. how far behind MM could
have been from real-time; 2) the max consumer lag, i.e. how far behind your
consumers could have been from real-time; and 3) the time at which the
failure occurred. Subtract the consumer lag and replication lag from the
failure timestamp to get a recovery timestamp. This is the point in time
you'll need to rewind your consumers to prevent them from skipping over
records during failover. Then:

1) Bring up the consumer in the new cluster with auto.offset.reset=earliest.

2) Use kafka-consumer-groups.sh tool to "reset" each migrated consumer to
the recovery timestamp. Something like:

$ ./bin/kafka-consumer-groups.sh --reset-offsets --group foo --topic bar
--to-datetime 123456 --boostrap-server xyz:9092

Do this for every consumer x every topic. Your consumers will consume any
records that were replicated to the target cluster prior to the failure,
and will process any new records produced to the target cluster during
failover. The consumers will see records in roughly the right order -- new
records won't be processed before old records.

N.B. the following issues:

- Your consumers will see duplicate records, since they will have been
reset to a previous point in time.
- The order of records in the source and target clusters will not be
consistent
- If you failback producers before consumers, your consumers will see new
records intermixed with old records.
- The # of partitions and partitioning schemes may not be consistent
between source and target clusters across all topics.
- There is a lot of manual intervention to get this to work. The process
does not scale well when many consumers and topics are involved.

I have published KIP-382 "MirrorMaker 2.0" which directly addresses this
mess. With MM2, it's possible to automate failover and failback based on
replication checkpoints (not timestamps) while maintaining consistent
record order and partitioning, among other features. Would love your
support for the KIP!

Thanks.
Ryanne

--
www.ryannedolan.info
Post by igor polyakov
What is the best way for Kafka multi-dc deployments to pick up processing
at another location if a primary location fails? What features of Kafka are
the most relevant for continuous processing?
Sincerely,
Igor
Check these out !
[Ad Placement 1]<
http://mailwords.us/mailwords/consumer/PickYourAddServlet?userID=user1&imageID=b21f8e89-4a11-4b8a-9276-66a068c4f231>
[Ad Placement 2] <
http://mailwords.us/mailwords/consumer/PickYourAddServlet?userID=user1&imageID=6588f140-01db-43c2-9c34-36edd8fcfec3>
[Ad Placement 3] <
http://mailwords.us/mailwords/consumer/PickYourAddServlet?userID=user1&imageID=c8a1ae81-ffe9-4a63-96d8-2ef1a9588e16
Loading...