Design of Str #53

bkamins · 2018-03-09T21:49:58Z

I understand why Str needs the internal fields like cache etc. (100% support 👍) but I do not understand why their types have to be in a signature and not be fixed - what is the value of this flexibility?

In short why the signature struct Str{T} <: AbstractString is not enough for this type?
`

The text was updated successfully, but these errors were encountered:

ScottPJones · 2018-03-10T17:09:06Z

That is so that each of those optional fields can be dispatched on separately, and the types can be different. For the hash value, it might be a UInt32, or a UInt64, the SubStr might be a UnitRange{UInt16}, which could be useful to have a small table (up to 64KB) of "interned" strings, where each one can be packed into an array using only 16 bytes per string (8 bytes pointer, 4 bytes offset/len, 4 bytes CRC32c).
A larger table could be (8 bytes pointer, 8 bytes offset/len, 8 bytes hash) (up to 4GB table).
When those fields are specified (as I always do currently) as Nothing, they take up no space at all.

bkamins · 2018-03-10T19:09:03Z

OK (but this is the stuff that it makes the package harder to digest 😄).
I usually avoid such design unless it is absolutely necessary (but I guess you know the rowadmap and know it is - although it is not implemented yet).

bkamins · 2018-03-10T19:11:00Z

Just a short question - I understand that you want Cache to be Bool or Nothing? (the latter case to save space when we do not want to cache hash?)

bkamins · 2018-03-10T21:24:36Z

Apart from the question above I have just realized that UniStr is a union of 4 types only because of:

    @eval const $sym = Str{$(symstr(nam, "CSE")), Nothing,  Nothing, Nothing}

However, when you start allowing SubStr, Cache, Hash to change we will get UniStr to be an union of 32 types at least. I am not sure if it will be still a small union case for performance. Or you mean UniStr to be still non-cached and not allowing substrings?

ScottPJones · 2018-03-14T11:31:12Z

I've been thinking, since your questions, about another way of handling a "UniStr" type, that might be faster.
Instead of having them as separate types, use a single concrete type, that stores the bit to distinguish which type of string it is (and possibly a validation flag)

String, which I'm using for storage, because it takes a lot less space than a Vector{UInt8} for small strings,
always allocates an extra byte at the end, for a nul byte (it doesn't make much sense to me, because substrings won't have the trailing nul byte when you pass them to C/C++ anyway, so you'll have to copy those), however, since the space is already taken, might as well use it for something!

0 would indicate ASCII, 1 _Latin, 2 _UCS2, 3 _UTF32 (and those would no longer even be visible)
Other types, such as UTF8, UTF16 could use other values, along with Text1, Text2, Text4.
It would even be possible to sneakily improve performance if something is read in as Text1, and later validated as UTF8, Text2 -> UCS2 or UTF16, Text4 -> UTF32, by resetting that byte.

Another idea I had, since I noticed that the beginning of the strings are not nicely aligned, would be to store the hash value for the string in the next 8 bytes (on 64-bit systems), using one of the "type" bits to say if the hash had been calculated.
That both makes the start of the actual string data aligned on a 16-byte boundary (better for SSE2 instructions), but it does mean the minimum size would be 32 bytes for a "UniStr", while String has a minimum size of 16 (storing a UTF8 string of up to 7 bytes).

I also have been thinking about another trick, which would be to store short strings with the data field pointing to a known string, that you can check with === (the real === !), and if it is, then the Str type contains the string directly, so you could store up to 23-byte strings directly in registers. (if you have the 256 AVX/2 registers). Something to think about, when everything is working in Strs.jl as it is now.

bkamins closed this as completed Mar 10, 2018

bkamins reopened this Mar 10, 2018

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Design of Str #53

Design of Str #53

bkamins commented Mar 9, 2018

ScottPJones commented Mar 10, 2018

bkamins commented Mar 10, 2018

bkamins commented Mar 10, 2018

bkamins commented Mar 10, 2018

ScottPJones commented Mar 14, 2018 •

edited

Loading

Design of Str #53

Design of Str #53

Comments

bkamins commented Mar 9, 2018

ScottPJones commented Mar 10, 2018

bkamins commented Mar 10, 2018

bkamins commented Mar 10, 2018

bkamins commented Mar 10, 2018

ScottPJones commented Mar 14, 2018 • edited Loading

ScottPJones commented Mar 14, 2018 •

edited

Loading