This repository has been archived by the owner on Mar 24, 2021. It is now read-only.
PyKafka version: 2.8.0 (using SimpleConsumer with rdkafka support)
Kafka version: 1.0.1 (not reproducible on 0.8.2.2)
There is a rare case in our production environment: when one of the brokers goes down, our application restarts the consumer, and at that point we lose the committed offsets and start consuming from the very beginning.
The only error we see in logs is:
After a deep investigation we came to the conclusion that this is related to the implementation of the OffsetFetchResponseV2 message. V2 added a response-level error code to the message, unlike V1, where error codes were only reported for each partition separately.
see here: https://cwiki.apache.org/confluence/display/KAFKA/KIP-88%3A+OffsetFetch+Protocol+Update
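To make the V1/V2 difference concrete, here is a minimal sketch of the two response shapes. The class names, the `first_error` helper and the example error value are hypothetical stand-ins for illustration, not pykafka's actual classes; only the `err` and `topics` attribute names follow the code quoted in this issue.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class PartitionResult:
    offset: int
    metadata: str
    err: int  # per-partition error code, present in both V1 and V2


@dataclass
class OffsetFetchV1Like:
    # topic name -> partition id -> per-partition result
    topics: Dict[str, Dict[int, PartitionResult]] = field(default_factory=dict)


@dataclass
class OffsetFetchV2Like(OffsetFetchV1Like):
    err: int = 0  # response-level error code added by KIP-88


def first_error(response):
    """Return the response-level error if set, else the first partition error, else 0."""
    top_level = getattr(response, "err", 0)
    if top_level != 0:
        return top_level
    for parts in response.topics.values():
        for pres in parts.values():
            if pres.err != 0:
                return pres.err
    return 0


# Example value only: 16 is Kafka's NOT_COORDINATOR error code.
NOT_COORDINATOR = 16

v1_ok = OffsetFetchV1Like(topics={"t": {0: PartitionResult(42, "", 0)}})
v2_failed = OffsetFetchV2Like(topics={}, err=NOT_COORDINATOR)

print(first_error(v1_ok))      # no error anywhere
print(first_error(v2_failed))  # error visible only at the response level
```

A consumer that only walks `topics` would see nothing wrong with `v2_failed`, which matches the symptom described in this report.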
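Basically, pykafka decodes this error code field here:
pykafka/pykafka/protocol/offset_commit.py, line 367 in e7665bf
but it is not used afterwards around here:
pykafka/pykafka/simpleconsumer.py, line 660 in ebbc5c7
    log.error("Error fetching offsets for topic '%s' (errors: %s)",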
So fetch_offsets() gets an empty list of partitions, does nothing (it does not even retry the fetch) and exits without setting offsets.
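The effect can be illustrated with stand-in objects. This is a simplified sketch, assuming the grouping step only looks at per-partition results; `group_by_partition_error` is a hypothetical helper, not pykafka's real function, and the error value 16 is just an example.

```python
from collections import defaultdict
from types import SimpleNamespace


def group_by_partition_error(response, partitions_by_id):
    """Group partition results by their per-partition error code only."""
    parts_by_error = defaultdict(list)
    for topic_name, parts in response.topics.items():
        for partition_id, pres in parts.items():
            owned = partitions_by_id.get(partition_id)
            parts_by_error[pres.err].append((owned, pres))
    return parts_by_error


# Stand-in for a failed V2 response: top-level error set, no partition entries.
response = SimpleNamespace(err=16, topics={})
partitions_by_id = {0: "owned-partition-0", 1: "owned-partition-1"}

grouped = group_by_partition_error(response, partitions_by_id)
print(dict(grouped))  # {}
```

Since the grouped result is empty, neither a success handler nor an error handler has anything to act on, so the caller has no reason to retry.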
We've made some changes to the build_parts_by_error() function to work around the problem.
When the response-level error code is set, it is artificially copied into the partition-specific error codes to simulate the V1 behavior, so the downstream code can handle it without modifications.
We are not sure this change is correct, since build_parts_by_error() is used in other cases as well.
    def build_parts_by_error(response, partitions_by_id):
        """Separate the partitions from a response by their error code

        :param response: a Response object containing partition responses
        :type response: :class:`pykafka.protocol.Response`
        :param partitions_by_id: a dict mapping partition ids to OwnedPartition
            instances
        :type partitions_by_id: dict
            {int: :class:`pykafka.simpleconsumer.OwnedPartition`}
        """
        # group partition responses by error code
        parts_by_error = defaultdict(list)
        if getattr(response, 'err', 0) != 0:
            # for OffsetFetchResponseV2 error processing - duplicate generic
            # error into all partitions
            if partitions_by_id is not None:
                for partition_id, owned_partition in iteritems(partitions_by_id):
                    parts_by_error[response.err].append((owned_partition, None))
        for topic_name in response.topics.keys():
            for partition_id, pres in iteritems(response.topics[topic_name]):
                if partitions_by_id is not None and partition_id in partitions_by_id:
                    owned_partition = partitions_by_id[partition_id]
                parts_by_error[pres.err].append((owned_partition, pres))
        return parts_by_error
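A quick way to check the workaround without a broker is to run the same logic against stand-in objects. The sketch below is a Python 3 rewrite of the function above under a hypothetical name (`build_parts_by_error_sketch`), using `dict.items()` instead of `iteritems` and initialising `owned_partition` to `None` on each iteration, which the version above does not do. The responses are `SimpleNamespace` stand-ins and the error value 16 is an example.

```python
from collections import defaultdict
from types import SimpleNamespace


def build_parts_by_error_sketch(response, partitions_by_id):
    """Group partitions by error code, fanning out a response-level error."""
    parts_by_error = defaultdict(list)
    top_level_err = getattr(response, "err", 0)
    if top_level_err != 0 and partitions_by_id is not None:
        # Copy the response-level error onto every owned partition (V1-like view).
        for owned_partition in partitions_by_id.values():
            parts_by_error[top_level_err].append((owned_partition, None))
    for parts in response.topics.values():
        for partition_id, pres in parts.items():
            owned_partition = None  # assumption: unknown partitions map to None
            if partitions_by_id is not None and partition_id in partitions_by_id:
                owned_partition = partitions_by_id[partition_id]
            parts_by_error[pres.err].append((owned_partition, pres))
    return parts_by_error


partitions_by_id = {0: "owned-0", 1: "owned-1"}
failed = SimpleNamespace(err=16, topics={})
ok = SimpleNamespace(err=0, topics={"t": {0: SimpleNamespace(err=0, offset=42)}})

grouped_failed = dict(build_parts_by_error_sketch(failed, partitions_by_id))
grouped_ok = dict(build_parts_by_error_sketch(ok, partitions_by_id))
print(grouped_failed)
print(sorted(grouped_ok))
```

Note that the fanned-out entries carry `None` in place of a partition response, so any error handler that reads fields from the partition response would need to tolerate that. Whether every existing handler does has not been verified here.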
We cannot provide runnable code at the moment.
We were able to reproduce a similar error by forcibly sending the request to the "wrong" group coordinator.
So OffsetFetchResponseV2 in this case was: