Increasing the number of replication jobs #3185

tudordumitriu · 2020-10-01T14:54:32Z

tudordumitriu
Oct 1, 2020

Hi there,
We use the _replicate endpoint rather intensively, and even though the replicator max_jobs setting is 500, total_rows in _scheduler/jobs never goes above 100.

To give a bit of context:
For now, we use _replicate, mostly with a single doc_id, to replicate documents between "friend" databases. The business logic lives in our API, which calls the _replicate endpoint. This puts considerable load on the CouchDB _replicate endpoint in our load tests, and we are looking for a way to optimize it.
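For readers following along, a request of the kind described here might be built as in the sketch below. It only constructs the JSON body. The URLs, the document ID and the helper name `build_replicate_body` are made up for illustration; `source`, `target` and `doc_ids` are the documented `_replicate` fields.

```python
import json


def build_replicate_body(source, target, doc_id):
    """Return the JSON body for a one-shot replication of a single document."""
    return {"source": source, "target": target, "doc_ids": [doc_id]}


# Hypothetical databases and document ID, for illustration only.
body = build_replicate_body(
    "http://localhost:5984/friend_a",
    "http://localhost:5984/friend_b",
    "doc-123",
)
print(json.dumps(body, sort_keys=True))
```

The resulting body would then be POSTed to the `_replicate` endpoint with `Content-Type: application/json`.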

Of course, one alternative would be to buffer those calls, but before doing that we would like to optimize the CouchDB replication as much as possible.

Thanks

nickva · 2020-10-01T15:12:13Z

nickva
Oct 1, 2020
Collaborator

Jobs POSTed to the _replicate endpoint are considered transient, and if they complete they disappear from the system. The same thing happens if they have an error or if the node where they run crashes. It's the trade-off of them running completely in memory.

Could it be that your jobs crash or complete faster than you can create them, so their number stays below some threshold (100)?

As an experiment, try using _replicator docs and see whether that changes the behavior; at least you'd be able to see their termination state. The trade-off there is having to do some cleanup and manage _replicator docs yourself.
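A minimal sketch of what the `_replicator` variant of such a job could look like follows. The helper names, the document ID and the database URLs are hypothetical. The document would be saved with a PUT to the URL returned by `replicator_doc_url`, and its state could afterwards be inspected through `_scheduler/docs`.

```python
from urllib.parse import quote


def build_replicator_doc(doc_id, source, target, doc_ids):
    """Return a replication document to store in the _replicator database."""
    return {
        "_id": doc_id,
        "source": source,
        "target": target,
        "doc_ids": list(doc_ids),
    }


def replicator_doc_url(base_url, doc_id):
    """Return the URL to PUT the replication document to."""
    return f"{base_url.rstrip('/')}/_replicator/{quote(doc_id, safe='')}"


# Hypothetical names, for illustration only.
doc = build_replicator_doc(
    "rep-doc-123",
    "http://localhost:5984/friend_a",
    "http://localhost:5984/friend_b",
    ["doc-123"],
)
print(replicator_doc_url("http://localhost:5984/", doc["_id"]))
```

Unlike jobs created through `_replicate`, these documents persist, so they need to be deleted once they are no longer of interest.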

1 reply

tudordumitriu Oct 2, 2020
Author

Hi @nickva
I know the jobs are transient and that they disappear, my bad. I keep refreshing the _scheduler/jobs endpoint and the values vary from 0 to 100, but never go above 100.
Most of the time it is 0 (when there is no load on the system), but during load tests, when a few thousand replications are pending, total_rows is never above 100.
The error rate is also small: when inserting 50k docs that trigger replication, there are, say, 1-2 errors, and from a db consistency point of view everything is OK.

We chose the _replicate endpoint over _replicator db docs intentionally, because we wanted the replications to be transient and their history is not needed. Usually the HTTP response code is enough, so there is no need to overload the system: I estimate we can easily reach billions of replications, and we don't want to manage those as docs in a db as well.

Again, maybe it is because we are replicating via doc_ids with only a single doc, but the limit of 100 looks very much like a limit in the settings to me.

Thanks

nickva · 2020-10-04T17:33:33Z

nickva
Oct 4, 2020
Collaborator

Hi @tudordumitriu,

Thanks for explaining and for being patient. After looking at the source, I think I know what might be causing it. I forgot we had a default limit of returned jobs set to 100 for _scheduler/jobs output. It's applied here:

https://github.com/apache/couchdb/blob/3.x/src/couch_replicator/src/couch_replicator_httpd.erl#L42

It basically acts as if ?limit=100 was passed in as a request parameter. To verify that's the case, you can set ?limit=500 and see if more (up to 500) jobs are returned. I think the original idea was to let users page through the results using offset and limit.
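Paging through the job list could be sketched as below. This assumes the query parameters are named `limit` and `skip`, as the CouchDB API reference lists them for `_scheduler/jobs`, and that the response carries `total_rows` and `jobs`; verify both against your version. `iter_scheduler_jobs` and `fake_fetch` are hypothetical names, and `fake_fetch` stands in for a real HTTP call so the example runs without a server.

```python
def iter_scheduler_jobs(fetch_page, page_size=100):
    """Yield every job by requesting successive pages.

    fetch_page(limit, skip) must return the decoded JSON body of
    GET /_scheduler/jobs?limit=<limit>&skip=<skip>.
    """
    skip = 0
    while True:
        page = fetch_page(page_size, skip)
        jobs = page.get("jobs", [])
        yield from jobs
        skip += len(jobs)
        # Stop on an empty page or once all reported rows were seen.
        if not jobs or skip >= page.get("total_rows", 0):
            break


def fake_fetch(limit, skip):
    """Stand-in for the HTTP request: pretends 250 jobs are running."""
    all_jobs = [{"id": f"job{i}"} for i in range(250)]
    return {
        "total_rows": len(all_jobs),
        "offset": skip,
        "jobs": all_jobs[skip:skip + limit],
    }


jobs = list(iter_scheduler_jobs(fake_fetch, page_size=100))
print(len(jobs))  # prints 250
```

Because transient jobs come and go between requests, pages fetched this way are not a consistent snapshot; the count is only approximate under load.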


tudordumitriu · 2020-10-05T05:52:59Z

tudordumitriu
Oct 5, 2020
Author

Hi @nickva
No worries at all, and I truly appreciate you getting back to me.
Unfortunately, even with limit set to 500 in the query string, total_rows doesn't go over 100. Maybe there is something else?
_scheduler/jobs?limit=500 => total_rows: from 0 to 100

Thanks


nickva · 2020-10-05T17:32:20Z

nickva
Oct 5, 2020
Collaborator

Thanks for double-checking @tudordumitriu.

Check whether there are any errors in the logs indicating that jobs are crashing soon after starting.

Another thing to keep in mind is that POST-ing a replication job that has the same replication parameters as an existing job is basically a no-op, so check whether that could be the case. For example, creating a replication job from a -> b and then creating the same job again will still leave only one running a -> b job in the system.
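One way to check for repeated requests on the client side is sketched below. The key here is just a hash of the canonical JSON of the request body. It is an illustration only and is not how CouchDB derives its replication IDs; `replication_key` and `find_repeats` are hypothetical helper names.

```python
import hashlib
import json


def replication_key(params):
    """Return a stable fingerprint for a replication request body."""
    canonical = json.dumps(params, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def find_repeats(requests):
    """Return the indices of requests whose parameters were already seen."""
    seen = set()
    repeats = []
    for index, params in enumerate(requests):
        key = replication_key(params)
        if key in seen:
            repeats.append(index)
        else:
            seen.add(key)
    return repeats


# Hypothetical request bodies; the third repeats the first with reordered keys.
requests = [
    {"source": "a", "target": "b", "doc_ids": ["doc1"]},
    {"source": "a", "target": "b", "doc_ids": ["doc2"]},
    {"doc_ids": ["doc1"], "target": "b", "source": "a"},
]
print(find_repeats(requests))  # prints [2]
```

Running something like this over the request bodies logged during a load test would show whether many of the POSTs were candidates for being treated as the same job.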

I wonder if you have a script to reproduce the issue, a Python script or similar that can be run against a dev instance of CouchDB.


tudordumitriu · 2020-10-06T03:11:08Z

tudordumitriu
Oct 6, 2020
Author

Thanks @nickva
It all makes sense, and so far I haven't seen anything strange in the logs.
As for replicating the same thing multiple times, I doubt that's the case, because we are using a load test (k6, JavaScript) that inserts new documents, which get replicated from a source db to a central db and from there to a destination db.

I will try to work on such a script. Because we are using spiegel to replicate, I have to think a bit about how to create a script that reproduces the environment, so it will take a while (at least a couple of days), but I will definitely get back to you.

