Optimization of data distribution in cluster #50

@JayiceZ

Description

In VictoriaTraces' cluster mode, the data distribution scheme is similar to VictoriaLogs': each write is stored on one of the vtstorage nodes (chosen at random), while queries must fan out to all vtstorage nodes. If one vtstorage node fails, vtselect cannot serve reads, because it could return incomplete results if it did. vtinsert, however, can still serve writes, since the surviving vtstorage nodes can accept them. In other words, VictoriaTraces' cluster mode lacks read HA; if you really need it, you have to use vmagent for cross-cluster HA, see https://docs.victoriametrics.com/victoriatraces/cluster/#high-availability

However, VT and VL differ in their data models. Logs cannot be easily deduplicated (we cannot safely assume that two logs with identical content are duplicates), so replicating logs across vlstorage nodes for HA may be impractical. VT, on the other hand, can deduplicate data by SpanID and Timestamp (or some similar key). This means it could replicate data across multiple vtstorage nodes and deduplicate at query time, achieving cluster-level HA, just like VictoriaMetrics does.
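To make the idea concrete, here is a minimal sketch of query-time deduplication keyed by (SpanID, Timestamp). The `span` struct and field names are hypothetical placeholders, not the actual VictoriaTraces data model; this only illustrates what vtselect could do when merging results from replicated vtstorage nodes:

```go
package main

import "fmt"

// span is a minimal stand-in for a trace span; the field names are
// hypothetical, not the real VictoriaTraces data model.
type span struct {
	SpanID    string
	Timestamp int64 // unix nanoseconds
	Payload   string
}

// dedupSpans drops duplicate spans that were replicated across several
// vtstorage nodes, keyed by (SpanID, Timestamp).
func dedupSpans(spans []span) []span {
	type key struct {
		id string
		ts int64
	}
	seen := make(map[key]struct{}, len(spans))
	out := make([]span, 0, len(spans))
	for _, s := range spans {
		k := key{s.SpanID, s.Timestamp}
		if _, ok := seen[k]; ok {
			continue // already returned by another replica
		}
		seen[k] = struct{}{}
		out = append(out, s)
	}
	return out
}

func main() {
	// Simulated merged result set from two vtstorage replicas.
	merged := []span{
		{"a1", 100, "x"},
		{"a1", 100, "x"}, // replica copy of the first span
		{"b2", 200, "y"},
	}
	fmt.Println(len(dedupSpans(merged))) // prints 2
}
```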

I am not sure whether the current cluster implementation is the final design. If not, I personally think it would be a good idea to implement HA in the VT cluster the same way VM does.

The distribution optimization should include three tasks:

  • Use consistent hashing to distribute traces across vtstorage nodes
  • vtselect supports deduplication when merging query results
  • vtinsert replicates traces to N vtstorage nodes
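The first and third tasks could be combined in one routine: pick the N replica nodes for a trace from a consistent hash of its identity. The sketch below uses rendezvous (highest-random-weight) hashing, which is one way to get consistent placement; the node names, the use of TraceID as the hash key, and the `replicationFactor` parameter are assumptions for illustration, not the proposed vtinsert implementation:

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
)

// pickNodes returns the replicationFactor vtstorage nodes responsible
// for a given trace, via rendezvous hashing: score every node against
// the trace ID and take the highest scores. Adding or removing a node
// only remaps the traces whose top-N set actually changes.
func pickNodes(traceID string, nodes []string, replicationFactor int) []string {
	type scored struct {
		node  string
		score uint64
	}
	scores := make([]scored, 0, len(nodes))
	for _, n := range nodes {
		h := fnv.New64a()
		h.Write([]byte(traceID))
		h.Write([]byte(n))
		scores = append(scores, scored{n, h.Sum64()})
	}
	// Highest score first; ties are practically impossible with FNV-64.
	sort.Slice(scores, func(i, j int) bool { return scores[i].score > scores[j].score })
	if replicationFactor > len(scores) {
		replicationFactor = len(scores)
	}
	out := make([]string, replicationFactor)
	for i := range out {
		out[i] = scores[i].node
	}
	return out
}

func main() {
	nodes := []string{"vtstorage-0", "vtstorage-1", "vtstorage-2"}
	// The same trace always maps to the same replica set.
	fmt.Println(pickNodes("trace-42", nodes, 2))
}
```

vtselect would then deduplicate the up-to-N copies at query time, so reads stay correct as long as at least one replica of each trace is reachable.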
