Python + Kafka community unification ideas #559
Comments
+1, but I’m not sure how this could happen naturally. Each of the In any case, something using librdkafka is likely the smartest way to go. On Sun, Jun 5, 2016 at 9:05 PM, Andrew Montalenti [email protected]
|
@ottomata I don't think I'm looking for a "grand merge". Instead, I'm trying to think about how to make sure there isn't wasted effort from this point forward. For example, pykafka and kafka-python provided varying degrees of production support for Kafka during its "stumbling infant" period in open source, and that's fine -- natural exploration/competition is what open source does best. But now we have a stable librdkafka, and relatively mature pure Python projects. We have an opportunity to do things like: a) evaluate the relative pros and cons of each API; b) benchmark their real-world performance; and c) think about long-term maintenance. There are also practical concerns: the limited maintainer hours of open source contributors and the migration path for existing users of each project. I think what the community wants in the long term is something like what we have with Redis, Cassandra and Elasticsearch -- there is a "clear" choice for the Python module/API to use (redis-py, cassandra-driver, and elasticsearch-py respectively), even though there are historical projects that supported older versions of each API but now point people to the right place (e.g. pycassa now points people to cassandra-driver). |
Aye, having used all 3 clients, here are some passing opinions of them:

pykafka

For a time pykafka was simply better than kafka-python. It was the first to have a balanced consumer. pykafka does not have good dynamic topic production support (#354), which makes it hard to use for some use cases. Now that the Kafka API supports managing and balancing consumer groups itself, pykafka's interface feels a little fragmented. The community around pykafka is excellent. We have had several production issues, all of which have been pretty quickly responded to and fixed.

kafka-python

I actually like kafka-python's interface the best of all. It feels the cleanest. However, even the newer version seems to perform fairly poorly for synchronous production (yes I know, never the answer bla bla), so I don't think we will use it in the future. Plus, I like the idea of outsourcing the work of the client to librdkafka, and kafka-python does not have this.

confluent-kafka-python

So far so good. It is very new, but I like using it and it seems to perform really well. It'd be nice if the consumer interface returned an iterable.

I think Wikimedia will end up using confluent-kafka-python in the near future. pykafka's lack of dynamic topic support is a no-go for us, and kafka-python's performance isn't good enough. Using librdkafka seems like the right thing to do, as it is so widely used in many different languages and environments that you can generally expect it to be very solid. Wikimedia has an interest in a robust nodejs Kafka client as well, and I expect we will leverage librdkafka for that too. |
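On the wish for an iterable consumer: a poll-style consumer can be wrapped in a generator in user code. The sketch below is only an illustration, not part of any of the three libraries. `iter_messages`, `max_idle_polls`, and `FakeConsumer` are hypothetical names, and the only assumption about the consumer is that it has a `poll(timeout)` method returning a message, or `None` when nothing arrived in time (which, as far as I know, is how confluent-kafka-python's `Consumer.poll` behaves).

```python
from typing import Any, Iterator, Optional, Protocol


class PollingConsumer(Protocol):
    """Hypothetical interface: anything with poll(timeout) -> message or None."""

    def poll(self, timeout: float) -> Optional[Any]: ...


def iter_messages(
    consumer: PollingConsumer,
    poll_timeout: float = 1.0,
    max_idle_polls: Optional[int] = None,
) -> Iterator[Any]:
    """Yield messages from a poll-based consumer as an iterator.

    Stops after `max_idle_polls` consecutive empty polls, or runs forever
    when `max_idle_polls` is None.
    """
    idle = 0
    while max_idle_polls is None or idle < max_idle_polls:
        msg = consumer.poll(poll_timeout)
        if msg is None:
            idle += 1  # nothing arrived within the timeout
            continue
        idle = 0  # reset the idle counter on every delivered message
        yield msg


class FakeConsumer:
    """Stand-in consumer for demonstration; replays a fixed list of poll results."""

    def __init__(self, results):
        self._results = list(results)

    def poll(self, timeout: float):
        return self._results.pop(0) if self._results else None


if __name__ == "__main__":
    fake = FakeConsumer(["a", None, "b"])
    for message in iter_messages(fake, poll_timeout=0.0, max_idle_polls=2):
        print(message)
```

With a real client the caller would still need to inspect each yielded message for errors before using its payload; the wrapper deliberately leaves that policy to the caller.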
@ottomata Thanks for this write-up. @jofusa recently wrote up and shared this handy benchmark comparing pykafka, kafka-python, and confluent-kafka-python. The full thing is worth a read, but these timings speak for themselves. I've cleaned up the tables (rounded the numbers) and sorted each by msgs/s. Consumer timings:
Producer timings:
It seems like his timings have confirmed what our benchmarks also showed: that librdkafka provides huge speedups across the board. So based on the data, that would confirm @ottomata's thought that whatever the Python community unifies around, it should be something librdkafka-based, as these benchmarks are pretty hard to ignore. |
Pulling @dpkp if he doesn't mind. He shared some good comments about Here's what strikes me about where we are between these 3 projects:
Meanwhile, it seems clear that:
Obviously, this is sort of just providing the "lay of the land"... where we go from here, I'm not sure. @emmett9001, @kbourgoin and I are having a little get-together in NYC next week and maybe we can chat about some plans and ideas to share with the community here after we talk it through a bit. But I'd love to hear other feedback. |
also pulling @yungchin here, one of the other "volunteers" :) |
I have fun working on kafka-python. I hope others enjoy it. If there's a better driver out there, huzzah! My one piece of advice is that all of you should get more involved on the kafka-dev mailing list, particularly with respect to API KIPs and wire protocol issues. The core team is very focused on the java ecosystem, and that can lead to api designs that force client drivers into less than ideal positions. If nothing else, getting more non-java perspectives into the mix would be a great improvement. I know Magnus is working very hard on the librdkafka side and it does appear that Confluent is paying a few more people to work on wrapper libraries. That will be a great benefit for users that want higher performance than they can get from a single python process w/ GIL restrictions. But I do think one of the great benefits of a community is a diversity of approaches and view points. And python is in a very unique position because we can generally develop faster than other languages and I believe we should continue to leverage that to the benefit of the entire community. So with that said, yay for options! Have fun writing software. We only live once. |
The splintering is something that I've found to be an irritation; @amontalenti, thank you for opening up a discussion. Echoing @dpkp, I'm not sure this is necessarily the proper location for the discussion, but at least there's a discussion somewhere.

Kafka's rapid release cycle is at least partially a reason for the splintering. Keeping up with the feature creep while maintaining production-level code is tough. Everything has changed within the past 8 months. I think there should probably be a frank discussion about pykafka specifically in this thread or, lacking consensus, a push for everyone to help develop Confluent's API. Long term, it feels like their API will become the standard.

I feel that Confluent's entry, while performant, is not really usable for rapid iteration. It appears to be built by a team that knows C really well and has jumped into Python's C API but hasn't really engaged the Python community at any length. Without knowing the librdkafka API, this library in its current state is unusable. In my ideal world, there would be a strong Python interface around Confluent-Kafka's current librdkafka skeleton that would also be compatible with PyPy. I think the best option here would be a better C interface with a ctypes and/or cffi version. This makes Confluent accessible not only to Python but also to JavaScript, Lua, Ruby, Lisp, Haskell, etc.: any language that has a libffi interface. For us, Confluent-Kafka is a wait-and-see on updates, adoption and traction.

I prefer kafka-python's interface most of all, but at the same time, I think kafka-python's performance is lacking and the updates are more sparse. In addition, while I appreciate the work done by the author, it's just one person.

When we started putting together our code, we began with kafka-python and quickly moved to pykafka. We are currently using pykafka because it has the most features and still appears to be performant. It is built with Python in mind and has been tested on a variety of environments. But I really dislike the threading and the current API. I also feel that there should be one obvious way to create a Consumer, as opposed to three variants. I am currently supporting all three libraries in my backend library, with a minor change in a configuration file to select which library should be used. However, this is obviously less than ideal. Moving forward, though, our support will be primarily with pykafka because it has the most Python support with the widest feature set. |
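The "one config switch picks the client library" arrangement described above could look roughly like the sketch below. This is not the commenter's actual code: `register_backend`, `make_producer`, the `"backend"` config key, and `InMemoryProducer` are all hypothetical names. Adapters for the real clients are deliberately left out, since each would wrap that library's own producer API behind the same `send` method.

```python
from typing import Callable, Dict, List, Protocol, Tuple


class Producer(Protocol):
    """Hypothetical common interface every backend adapter would implement."""

    def send(self, topic: str, value: bytes) -> None: ...


# Registry mapping a backend name (as written in the config) to a factory.
_BACKENDS: Dict[str, Callable[[dict], Producer]] = {}


def register_backend(name: str):
    """Decorator that registers a producer factory under `name`."""

    def decorator(factory):
        _BACKENDS[name] = factory
        return factory

    return decorator


def make_producer(config: dict) -> Producer:
    """Build the producer named by config['backend']."""
    name = config["backend"]
    try:
        factory = _BACKENDS[name]
    except KeyError:
        known = ", ".join(sorted(_BACKENDS)) or "<none>"
        raise ValueError(f"unknown backend {name!r}; known: {known}") from None
    return factory(config)


@register_backend("memory")
class InMemoryProducer:
    """Demonstration backend that records messages instead of sending them."""

    def __init__(self, config: dict):
        self.sent: List[Tuple[str, bytes]] = []

    def send(self, topic: str, value: bytes) -> None:
        self.sent.append((topic, value))


if __name__ == "__main__":
    producer = make_producer({"backend": "memory"})
    producer.send("events", b"hello")
    print(producer.sent)
```

The cost of this approach is the one noted in the comment: the common interface can only expose what all backends share, so library-specific features end up hidden or leaked through the config.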
@brianbruggeman what performance do you expect from a python driver? I can push 100Mb/sec w/ a single core running a kafka-python producer. I am skeptical that anyone really needs more than that. I generally expect kafka brokers to be network bound first. Any normal deployment would have more producers than brokers, and if a single producer can max a network card, I think you've already hit overkill. |
@dpkp One of our data streams is currently sitting at about 30 MB/sec with spikes over 60 MB/sec. This is expected to grow for the second half of 2016. |
We are currently using pykafka. Our use case is more focused on producing to and consuming from topics dynamically. To do that, we have a simple wrapper that creates a producer/consumer for each topic within one process. However, we have run into some serious issues:
The Java interface looks much nicer. I did a quick check of the confluent-kafka interface, and it looks like what we need. One concern is how librdkafka can keep up with Kafka development and new features. |
I've recently been intrigued by the efforts of libraries like hyper-h2, h11 and wsproto to implement canonical, pure python protocols and state machines without any networking. This seems like the most flexible approach, allowing a wide range of networking and opinionated frameworks to be layered on top. I am just starting to dive into these libraries but curious if anyone has an opinion on which would be easier to rip out all of the networking code :-) |
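For readers unfamiliar with the no-networking style those libraries use, here is a minimal sketch of the idea applied to message framing. It assumes only that each message on the wire is preceded by a 4-byte big-endian length, which is how Kafka frames requests and responses. `FrameDecoder` and `encode_frame` are hypothetical names, and this is not code from kafka-python or any other client. The decoder never touches a socket: the caller feeds it whatever bytes arrived and gets back any complete frames.

```python
import struct
from typing import List


def encode_frame(payload: bytes) -> bytes:
    """Prefix a payload with its length as a 4-byte big-endian signed int."""
    return struct.pack(">i", len(payload)) + payload


class FrameDecoder:
    """Sans-I/O decoder: bytes in, complete frames out, no networking."""

    _HEADER = struct.Struct(">i")

    def __init__(self) -> None:
        self._buffer = bytearray()

    def receive_data(self, data: bytes) -> List[bytes]:
        """Buffer `data` and return every frame that is now complete."""
        self._buffer += data
        frames: List[bytes] = []
        while len(self._buffer) >= self._HEADER.size:
            (size,) = self._HEADER.unpack_from(self._buffer)
            if size < 0:
                raise ValueError(f"negative frame size: {size}")
            end = self._HEADER.size + size
            if len(self._buffer) < end:
                break  # frame body not fully received yet
            frames.append(bytes(self._buffer[self._HEADER.size:end]))
            del self._buffer[:end]  # drop the consumed frame from the buffer
        return frames


if __name__ == "__main__":
    decoder = FrameDecoder()
    wire = encode_frame(b"hello") + encode_frame(b"kafka")
    # Feed the bytes in two arbitrary chunks, as a socket might deliver them.
    print(decoder.receive_data(wire[:6]))
    print(decoder.receive_data(wire[6:]))
```

Because the decoder holds only state, the same class can sit behind blocking sockets, asyncio, or a test harness that feeds it bytes directly.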
I've mostly done this at the low level of kafka-python. Would be happy to
work with you on further destructuring. I agree it is a powerful approach,
though managing the cluster connection routing makes it a bit trickier than
simple point-to-point protocols like http.
-Dana
On Apr 17, 2017 5:31 PM, "Brian Merrell" <[email protected]> wrote:
I've recently been intrigued by the efforts of libraries like hyper-h2, h11
and wsproto to implement canonical, pure python protocols and state
machines without any networking. This seems like the most flexible
approach, allowing a wide range of networking and opinionated frameworks to
be layered on top. I am just starting to dive into these libraries but
curious if anyone has an opinion on which would be easier to rip out all of
the networking code :-)
|
Here's @mrocklin's blog post with an overview of the current state of the ecosystem: http://matthewrocklin.com/blog/work/2017/10/10/kafka-python |
@emmett9001 Thank you for that link; the thoughts there echo my impressions as well. Ultimately, I think confluent-kafka is the long-term future for Python. Kafka is intended to be fast, and we need to take advantage of Python's C API to reach the needed performance level; the library essentially exposes librdkafka through CPython's C API. But the package itself is very unfriendly to Python developers, and the API, while functional, isn't designed with a Python developer in mind. It takes more digging than should be necessary to put the package into use. In contrast, pykafka is definitely the friendliest Python package. It's relatively performant, but threading makes it clunky, and for any serious usage of Kafka streaming you'll need those threads to be performant. FWIW (circling back from a year+ ago): we went with confluent-kafka. |
My humble perspective is that fewer people really care about performance than say they do. It makes for good blog posts, but really what matters is stability, bugfixes, and keeping up with server features / development. I think kafka-python manages to tackle these quite well. Also, fwiw, I can consume almost 100Mb/sec on a single core running kafka-python on pypy, which is certainly in line with the raw performance you see from librdkafka. |
Closing for inactivity. |
In the last couple of years, we can’t help but notice that there has been quite a lot of fragmentation in the Python community around Kafka. As far as we can tell, there are three major projects as of this writing:
kafka-python
pykafka (… librdkafka extension for speedups; evolved from Kafka 0.7 driver; has tested Kafka 0.8.2, 0.9.x support)
confluent-kafka-python (… librdkafka; Kafka 0.9+ focused)

Each project has a different history, level of current support for Kafka, and set of features, and of course a different API. This is obviously not an ideal state for the Python user community around Kafka, both for those who currently have Kafka clusters in production and for those looking to adopt Kafka for new projects.
I wonder if anyone has any ideas for what should happen -- if anything -- to unify the Python community around Kafka.