This repository has been archived by the owner on May 10, 2022. It is now read-only.

tools: support zstd's dict compression/decompression #22

Open
wants to merge 1 commit into
base: thrift-0.11.0-inlined
from

Conversation


@neverchanje neverchanje commented Nov 20, 2018

[Image: compression-dict-256kb]
[Image: compression-dict-32kb]
As the two charts above show, dict-based compression significantly improves the compression ratio (+20%). In addition, a larger dictionary buffer increases the compression ratio further (by 10%).
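
To illustrate why a dictionary helps with small, similar records, here is a minimal sketch. It uses the preset-dictionary support in Python's standard `zlib` module as a stand-in for zstd (zstd's own dictionary API differs and is not shown). The sample records and the hand-written dictionary are hypothetical, and the sketch does not reproduce the figures quoted above.

```python
import zlib

# Hypothetical sample data: many small records sharing the same layout.
records = [
    b'{"user_id": %d, "region": "cn-north", "status": "active", "score": %d}'
    % (i, i % 100)
    for i in range(1000, 1020)
]

# Hypothetical hand-written dictionary holding the substrings the records share.
# (zstd would normally train this from samples; zlib has no trainer.)
dictionary = b'{"user_id": , "region": "cn-north", "status": "active", "score": }'


def compress(data, zdict=None):
    """Compress one record, optionally priming the compressor with a dictionary."""
    c = zlib.compressobj(zdict=zdict) if zdict else zlib.compressobj()
    return c.compress(data) + c.flush()


def decompress(blob, zdict=None):
    """Decompress one record; the same dictionary must be supplied if one was used."""
    d = zlib.decompressobj(zdict=zdict) if zdict else zlib.decompressobj()
    return d.decompress(blob) + d.flush()


# Total compressed size of all records, without and with the dictionary.
plain_size = sum(len(compress(r)) for r in records)
dict_size = sum(len(compress(r, dictionary)) for r in records)
print("without dict:", plain_size, "bytes; with dict:", dict_size, "bytes")
```

Each record is compressed on its own, which is the case where a shared dictionary matters most: the compressor has no earlier data to refer back to, so the dictionary supplies the common substrings.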

@qinzuoyan
Member

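How are users supposed to use this? It doesn't feel very easy to use. For example:

  1. If a user writes data with spark and multiple tasks write concurrently, who does the training? How is the dict built?
  2. When a user reads data from a table, how do they know whether a dict is needed, and which dict to use?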

@neverchanje
Author

neverchanje commented Nov 21, 2018

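In zstd's design, training is done during the testing phase. Once the dict has been trained, it is stored in pegasus. When the user starts spark, they fetch the dict and then use it for both compression and decompression.

There can be one dict per spark task, or the dict can be made a thread-safe singleton. Right now we have not made it a singleton for users, so each task has to fetch its own dict. This can be changed.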
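
The thread-safe singleton option mentioned above could look roughly like the sketch below. This is only an illustration under assumptions: `FAKE_STORE` is a hypothetical in-memory stand-in for the pegasus table that holds the trained dict, `SharedDict` is a made-up name, and `zlib` preset dictionaries stand in for zstd.

```python
import threading
import zlib

# Hypothetical stand-in for the pegasus table that stores the trained dict.
FAKE_STORE = {
    "compression_dict": b'{"user_id": , "region": "cn-north", "status": "active"}'
}


class SharedDict:
    """Process-wide holder that fetches the dict at most once."""

    _lock = threading.Lock()
    _value = None
    load_count = 0  # counts fetches from the store, for illustration only

    @classmethod
    def get(cls):
        if cls._value is None:            # fast path, no lock once loaded
            with cls._lock:
                if cls._value is None:    # re-check while holding the lock
                    cls._value = FAKE_STORE["compression_dict"]
                    cls.load_count += 1
        return cls._value


def task(record, out, i):
    """One simulated task: compress and decompress a record with the shared dict."""
    zdict = SharedDict.get()
    c = zlib.compressobj(zdict=zdict)
    blob = c.compress(record) + c.flush()
    d = zlib.decompressobj(zdict=zdict)
    out[i] = d.decompress(blob) + d.flush()


records = [
    b'{"user_id": %d, "region": "cn-north", "status": "active"}' % i
    for i in range(8)
]
results = [None] * len(records)
threads = [
    threading.Thread(target=task, args=(r, results, i))
    for i, r in enumerate(records)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
print("store fetches:", SharedDict.load_count)
```

The dict bytes are immutable and shared, while each task creates its own compressor and decompressor objects, so no further locking is needed after the first fetch.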

@qinzuoyan
Member

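If there is one per spark task, there will be many dicts. If another application wants to read the data, which dict should it choose? If the data being read was written by different spark jobs, which dict should be used? The whole usage scenario should be thought through so that it is simple and unambiguous for users.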
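
One possible way to remove that ambiguity, sketched here purely as an assumption and not as what this PR implements, is to prefix every stored value with the ID of the dict that compressed it, so that a reader can pick the right dict without knowing who wrote the data. zstd's frame format has an optional Dictionary_ID field that serves a similar purpose. In the sketch, `DICTS`, `encode`, and `decode` are hypothetical names, ID 0 is an assumed convention meaning "no dict", and `zlib` again stands in for zstd.

```python
import struct
import zlib

# Hypothetical registry mapping dict ID -> dict bytes (e.g. loaded from pegasus).
DICTS = {
    1: b'{"user_id": , "region": "cn-north", "status": "active"}',
    2: b'{"order_id": , "currency": "CNY", "paid": true}',
}

NO_DICT = 0  # assumed convention: ID 0 means the value was compressed without a dict


def encode(record, dict_id=NO_DICT):
    """Compress a record and prefix it with the 4-byte big-endian dict ID."""
    if dict_id == NO_DICT:
        c = zlib.compressobj()
    else:
        c = zlib.compressobj(zdict=DICTS[dict_id])
    return struct.pack(">I", dict_id) + c.compress(record) + c.flush()


def decode(blob):
    """Read the dict ID from the prefix, then decompress with the matching dict."""
    (dict_id,) = struct.unpack(">I", blob[:4])
    if dict_id == NO_DICT:
        d = zlib.decompressobj()
    else:
        d = zlib.decompressobj(zdict=DICTS[dict_id])
    return d.decompress(blob[4:]) + d.flush()


user = b'{"user_id": 42, "region": "cn-north", "status": "active"}'
order = b'{"order_id": 7, "currency": "CNY", "paid": true}'

# Values written by different jobs with different dicts can be read the same way.
stored = [encode(user, 1), encode(order, 2), encode(user)]
print([decode(b) for b in stored])
```

With this layout the reader never has to guess: the value itself says whether a dict was used and which one, at a cost of four bytes per value.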

@neverchanje neverchanje force-pushed the thrift-0.11.0-inlined branch from 3af031b to 9c467ac Compare January 17, 2019 03:57

2 participants