Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Design of Str #53

Open
bkamins opened this issue Mar 9, 2018 · 5 comments
Open

Design of Str #53

bkamins opened this issue Mar 9, 2018 · 5 comments

Comments

@bkamins
Copy link

bkamins commented Mar 9, 2018

I understand why Str needs the internal fields like cache etc. (100% support 👍) but I do not understand why their types have to be in a signature and not be fixed - what is the value of this flexibility?

In short why the signature struct Str{T} <: AbstractString is not enough for this type?
`

@ScottPJones
Copy link
Member

That is so that each of those optional fields can be dispatched on separately, and the types can be different. For the hash value, it might be a UInt32, or a UInt64, the SubStr might be a UnitRange{UInt16}, which could be useful to have a small table (up to 64KB) of "interned" strings, where each one can be packed into an array using only 16 bytes per string (8 bytes pointer, 4 bytes offset/len, 4 bytes CRC32c).
A larger table could be (8 bytes pointer, 8 bytes offset/len, 8 bytes hash) (up to 4GB table).
When those fields are specified (as I always do currently) as Nothing, they take up no space at all.

@bkamins
Copy link
Author

bkamins commented Mar 10, 2018

OK (but this is the stuff that it makes the package harder to digest 😄).
I usually avoid such design unless it is absolutely necessary (but I guess you know the rowadmap and know it is - although it is not implemented yet).

@bkamins bkamins closed this as completed Mar 10, 2018
@bkamins
Copy link
Author

bkamins commented Mar 10, 2018

Just a short question - I understand that you want Cache to be Bool or Nothing? (the latter case to save space when we do not want to cache hash?)

@bkamins bkamins reopened this Mar 10, 2018
@bkamins
Copy link
Author

bkamins commented Mar 10, 2018

Apart from the question above I have just realized that UniStr is a union of 4 types only because of:

    @eval const $sym = Str{$(symstr(nam, "CSE")), Nothing,  Nothing, Nothing}

However, when you start allowing SubStr, Cache, Hash to change we will get UniStr to be an union of 32 types at least. I am not sure if it will be still a small union case for performance. Or you mean UniStr to be still non-cached and not allowing substrings?

@ScottPJones
Copy link
Member

ScottPJones commented Mar 14, 2018

I've been thinking, since your questions, about another way of handling a "UniStr" type, that might be faster.
Instead of having them as separate types, use a single concrete type, that stores the bit to distinguish which type of string it is (and possibly a validation flag)

String, which I'm using for storage, because it takes a lot less space than a Vector{UInt8} for small strings,
always allocates an extra byte at the end, for a nul byte (it doesn't make much sense to me, because substrings won't have the trailing nul byte when you pass them to C/C++ anyway, so you'll have to copy those), however, since the space is already taken, might as well use it for something!

0 would indicate ASCII, 1 _Latin, 2 _UCS2, 3 _UTF32 (and those would no longer even be visible)
Other types, such as UTF8, UTF16 could use other values, along with Text1, Text2, Text4.
It would even be possible to sneakily improve performance if something is read in as Text1, and later validated as UTF8, Text2 -> UCS2 or UTF16, Text4 -> UTF32, by resetting that byte.

Another idea I had, since I noticed that the beginning of the strings are not nicely aligned, would be to store the hash value for the string in the next 8 bytes (on 64-bit systems), using one of the "type" bits to say if the hash had been calculated.
That both makes the start of the actual string data aligned on a 16-byte boundary (better for SSE2 instructions), but it does mean the minimum size would be 32 bytes for a "UniStr", while String has a minimum size of 16 (storing a UTF8 string of up to 7 bytes).

I also have been thinking about another trick, which would be to store short strings with the data field pointing to a known string, that you can check with === (the real === !), and if it is, then the Str type contains the string directly, so you could store up to 23-byte strings directly in registers. (if you have the 256 AVX/2 registers). Something to think about, when everything is working in Strs.jl as it is now.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants