Segfault whenever an on_conf_destroy callback is registered #4142

@Quuxplusone

Description

$ cd examples
$ cat >consumer.c <<EOF
#include <assert.h>
#include <stdio.h>
#include "rdkafka.h"

rd_kafka_resp_err_t callback(void *opaque) {
        printf("callback called with %p\n", opaque);
        return RD_KAFKA_RESP_ERR_NO_ERROR;
}

int main(int argc, char **argv) {
        rd_kafka_conf_res_t err; /* rd_kafka_conf_set() result code */
        char errstr[512];        /* librdkafka API error reporting buffer */

        rd_kafka_conf_t *conf = rd_kafka_conf_new();

        err = rd_kafka_conf_set(conf, "security.protocol", "ssl", errstr, sizeof(errstr));
        assert(err == RD_KAFKA_CONF_OK);

        err = rd_kafka_conf_set(conf, "ssl.ca.location", "/dev/null", errstr, sizeof(errstr));
        assert(err == RD_KAFKA_CONF_OK);

        rd_kafka_resp_err_t rk_err = rd_kafka_conf_interceptor_add_on_conf_destroy(conf, "testing", callback, NULL);
        assert(rk_err == RD_KAFKA_RESP_ERR_NO_ERROR);

        rd_kafka_t *rk = rd_kafka_new(RD_KAFKA_CONSUMER, conf, errstr, sizeof(errstr));
        assert(rk == NULL);
        fprintf(stderr, "%% Failed to create new consumer: %s\n", errstr);

        // Since rk==NULL, the conf is still owned by us, and must be cleaned up to avoid leaks.
        rd_kafka_conf_destroy(conf);

        return 0;
}
EOF
$ make consumer
gcc -g -O2 -fPIC -Wall -Wsign-compare [...] -ldl -lpthread
$ ./consumer
% Failed to create new consumer: ssl.ca.location failed: No further error information available
Segmentation fault: 11

[...]
(lldb) bt
* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=EXC_I386_GPFLT)
  * frame #0: 0x0000000100094950 consumer`rd_kafka_interceptors_on_conf_destroy + 48
    frame #1: 0x000000010002ac9f consumer`rd_kafka_conf_destroy + 15
    frame #2: 0x00000001000030f9 consumer`main(argc=<unavailable>, argv=<unavailable>) at consumer.c:30:9 [opt]
    frame #3: 0x00000001004b152e dyld`start + 462

The problem seems to be that rd_kafka_interceptors_on_conf_destroy is called twice: first (incorrectly) from the failing rd_kafka_new,

(lldb) bt
* thread #1, queue = 'com.apple.main-thread', stop reason = breakpoint 1.1
  * frame #0: 0x0000000100094920 consumer`rd_kafka_interceptors_on_conf_destroy
    frame #1: 0x000000010002aa3d consumer`rd_kafka_anyconf_destroy + 29
    frame #2: 0x000000010000442b consumer`rd_kafka_destroy_final + 827
    frame #3: 0x00000001000057a6 consumer`rd_kafka_new + 3062
    frame #4: 0x00000001000030cd consumer`main(argc=<unavailable>, argv=<unavailable>) at consumer.c:25:26 [opt]
    frame #5: 0x00000001004b152e dyld`start + 462

and then a second time (correctly) from rd_kafka_conf_destroy.
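To make the failure mode concrete, here is a minimal standalone model of that double destruction. This is NOT librdkafka source; every name in it is invented for illustration. The first destroy walks the interceptor chain and frees it; the second destroy walks freed memory, which matches the shape of the crash in rd_kafka_interceptors_on_conf_destroy above:

#include <stdio.h>
#include <stdlib.h>

/* Hypothetical stand-ins for the conf and its interceptor chain. */
typedef void (*on_conf_destroy_cb)(void *opaque);

struct model_conf {
        on_conf_destroy_cb *interceptors; /* registered callbacks */
        size_t count;
};

static void model_conf_destroy(struct model_conf *c) {
        /* Run every on_conf_destroy interceptor, then free the chain.
         * Safe exactly once; a second call dereferences freed memory. */
        for (size_t i = 0; i < c->count; i++)
                c->interceptors[i](NULL);
        free(c->interceptors);
}

static void cb(void *opaque) {
        printf("on_conf_destroy called with %p\n", opaque);
}

int main(void) {
        struct model_conf c = {malloc(sizeof(on_conf_destroy_cb)), 1};
        c.interceptors[0] = cb;
        model_conf_destroy(&c); /* models rd_kafka_new's failure path */
        model_conf_destroy(&c); /* models the caller's rd_kafka_conf_destroy:
                                 * use-after-free, typically a segfault */
        return 0;
}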

It appears that rd_kafka_new sometimes returns NULL without destroying the conf, and sometimes returns NULL after destroying it (i.e. on any codepath that passes through goto fail). So the caller has no way to know whether it still owns the conf and must destroy it, or whether destroying it again will touch freed state.
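Until the ownership contract is made consistent, the only caller-side mitigation I can see is to deliberately leak the conf when rd_kafka_new fails. A sketch, replacing the tail of main() in the repro above (the assumption being that a leak is preferable to the use-after-free):

        rd_kafka_t *rk = rd_kafka_new(RD_KAFKA_CONSUMER, conf, errstr, sizeof(errstr));
        if (!rk) {
                fprintf(stderr, "%% Failed to create new consumer: %s\n", errstr);
                /* Deliberately do NOT call rd_kafka_conf_destroy(conf):
                 * rd_kafka_new may already have destroyed conf on this
                 * path, so destroying it again can re-run the interceptors
                 * on freed state. Leaking it is the lesser evil for now. */
        }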

This is similar to the "should never happen" bug in #4100, but worse.

Checklist

  • librdkafka version (release number or git tag): 0e4b551
  • Apache Kafka version: doesn't matter, Kafka needn't be running
  • librdkafka client configuration: as shown above
  • Operating system: macOS, but this probably reproduces everywhere
  • Critical issue
