Optimization of data distribution in cluster #50

@JayiceZ

Description

In VictoriaTraces' cluster mode, the data distribution scheme is similar to VictoriaLogs': each write is stored on one of the vtstorage nodes (chosen at random), while queries must fan out to all vtstorage nodes. If one vtstorage node fails, vtselect cannot serve reads, because it could return incomplete results if it did. vtinsert, however, can still serve writes, since the surviving vtstorage nodes can accept them. In other words, VictoriaTraces' cluster mode lacks read HA; if you really need it, you have to use vmagent for cross-cluster HA, see https://docs.victoriametrics.com/victoriatraces/cluster/#high-availability

However, VT and VL differ in their data models. Logs cannot be easily deduplicated (we cannot safely assume that two logs with identical content are duplicates), so replicating logs across vlstorage nodes for HA may be impractical. VT, on the other hand, can deduplicate data by SpanID and Timestamp (or some similar key). This means it could replicate data across multiple vtstorage nodes and deduplicate at query time, achieving cluster-level HA, just like VictoriaMetrics does.
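To make the idea concrete, here is a minimal sketch of query-time deduplication keyed by (SpanID, Timestamp). The `span` struct and field names are hypothetical placeholders, not the actual VictoriaTraces data model; this only illustrates what vtselect could do when merging results from replicated vtstorage nodes:

```go
package main

import "fmt"

// span is a minimal stand-in for a trace span; the field names are
// hypothetical, not the real VictoriaTraces data model.
type span struct {
	SpanID    string
	Timestamp int64 // unix nanoseconds
	Payload   string
}

// dedupSpans drops duplicate spans that were replicated across several
// vtstorage nodes, keyed by (SpanID, Timestamp).
func dedupSpans(spans []span) []span {
	type key struct {
		id string
		ts int64
	}
	seen := make(map[key]struct{}, len(spans))
	out := make([]span, 0, len(spans))
	for _, s := range spans {
		k := key{s.SpanID, s.Timestamp}
		if _, ok := seen[k]; ok {
			continue // already returned by another replica
		}
		seen[k] = struct{}{}
		out = append(out, s)
	}
	return out
}

func main() {
	// Simulated merged result set from two vtstorage replicas.
	merged := []span{
		{"a1", 100, "x"},
		{"a1", 100, "x"}, // replica copy of the first span
		{"b2", 200, "y"},
	}
	fmt.Println(len(dedupSpans(merged))) // prints 2
}
```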

I am not sure whether the current cluster implementation is the final design. If not, I personally think it would be a good idea to implement HA in the VT cluster the same way VM does.

The distribution optimization should include three tasks:

  • Use consistent hashing to distribute traces across vtstorage nodes
  • vtselect supports deduplication when merging query results
  • vtinsert replicates traces to N vtstorage nodes
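The first and third tasks could be combined in one routine: pick the N replica nodes for a trace from a consistent hash of its identity. The sketch below uses rendezvous (highest-random-weight) hashing, which is one way to get consistent placement; the node names, the use of TraceID as the hash key, and the `replicationFactor` parameter are assumptions for illustration, not the proposed vtinsert implementation:

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
)

// pickNodes returns the replicationFactor vtstorage nodes responsible
// for a given trace, via rendezvous hashing: score every node against
// the trace ID and take the highest scores. Adding or removing a node
// only remaps the traces whose top-N set actually changes.
func pickNodes(traceID string, nodes []string, replicationFactor int) []string {
	type scored struct {
		node  string
		score uint64
	}
	scores := make([]scored, 0, len(nodes))
	for _, n := range nodes {
		h := fnv.New64a()
		h.Write([]byte(traceID))
		h.Write([]byte(n))
		scores = append(scores, scored{n, h.Sum64()})
	}
	// Highest score first; ties are practically impossible with FNV-64.
	sort.Slice(scores, func(i, j int) bool { return scores[i].score > scores[j].score })
	if replicationFactor > len(scores) {
		replicationFactor = len(scores)
	}
	out := make([]string, replicationFactor)
	for i := range out {
		out[i] = scores[i].node
	}
	return out
}

func main() {
	nodes := []string{"vtstorage-0", "vtstorage-1", "vtstorage-2"}
	// The same trace always maps to the same replica set.
	fmt.Println(pickNodes("trace-42", nodes, 2))
}
```

vtselect would then deduplicate the up-to-N copies at query time, so reads stay correct as long as at least one replica of each trace is reachable.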
