From the Arrow format spec:

Note that a dictionary is permitted to contain duplicate values or nulls:

data VarBinary (dictionary-encoded)
   index_type: Int32
   values: [0, 1, 3, 1, 4, 2]

dictionary
   type: VarBinary
   values: ['foo', 'bar', 'baz', 'foo', null]

The null count of such arrays is dictated only by the validity bitmap of its indices, irrespective of any null values in the dictionary.
Arrow Columnar Format – Dictionary-encoded Layout

(Thanks Docs chat bot!)
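
To make the quoted rule concrete, here is a minimal pure-Python sketch of the layout in the spec example. It is not Arrow's implementation: `decode` and `null_count` are hypothetical helper names, and a `None` index stands in for a cleared bit in the indices' validity bitmap.

```python
from typing import List, Optional, Sequence

# Hypothetical model of a dictionary-encoded array (not the Arrow API).
# A None index stands in for a null slot in the indices' validity bitmap.


def decode(indices: Sequence[Optional[int]],
           dictionary: Sequence[Optional[bytes]]) -> List[Optional[bytes]]:
    """Expand each index into the dictionary value it points at."""
    return [None if i is None else dictionary[i] for i in indices]


def null_count(indices: Sequence[Optional[int]]) -> int:
    """Null count as the spec defines it: only the indices' validity counts."""
    return sum(1 for i in indices if i is None)


# The example from the spec: 'foo' appears twice in the dictionary,
# and the dictionary also holds a null at position 4.
indices = [0, 1, 3, 1, 4, 2]
dictionary = [b"foo", b"bar", b"baz", b"foo", None]

decoded = decode(indices, dictionary)
logical_nulls = sum(1 for v in decoded if v is None)

print(decoded)              # index 0 and index 3 both decode to b"foo"
print(null_count(indices))  # every index is valid
print(logical_nulls)        # one slot still decodes to null via the dictionary
```

In this sketch the array's null count stays at zero even though one decoded slot is null, which is the distinction the spec sentence draws. It also shows that a duplicated dictionary entry decodes without any ambiguity.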

So forcing de-duplication seems to go against the spec, and is therefore a bug?

Replies: 4 comments

Answer selected by kdkavanagh
Category
Q&A
4 participants