-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Enable native byte array fields on structs #766
base: master
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The change looks good in general, and I wanted to add bytes at some point as well, but there are some things we should figure out on RDF encoding first.
Can you please also add the new time to tests like this one? This will ensure that it can be saved to all databases properly.
Reviewed 3 of 3 files at r1.
Reviewable status: all files reviewed, 4 unresolved discussions (waiting on @fungl164)
quad/value.go, line 437 at r1 (raw file):
// ByteArr is representation of []byte as a value type ByteArr string
Just Bytes
quad/value.go, line 453 at r1 (raw file):
return TypedString{ // TODO(dennwc): this is used to compute hash Value: String(b),
To be compatible with RDF we will probably need to encode it differently.
This may act as a reference, although I'm not sure it's the right one:
https://www.w3.org/TR/Content-in-RDF10/#bytesProperty
So maybe the "canonical RDF encoding" is the base64 with the type mentioned in that section.
schema/writer.go, line 95 at r1 (raw file):
} case saveRule: if f.Type.Kind() == reflect.Slice && f.Type != reflect.TypeOf([]byte(nil)) {
It's better to save the result of reflect.TypeOf([]byte(nil))
to a global and use it here
voc/schema/schema.go, line 29 at r1 (raw file):
Text = Prefix + `Text` // Data type: ByteArr. ByteArr = Prefix + "ByteArr"
This file corresponds to the types defined by https://schema.org, and there is no such type as "ByteArr".
quad/value.go, line 453 at r1 (raw file): Previously, dennwc (Denys Smirnov) wrote…
I thought about that. I implemented base64 encoding/decoding. there is teh https://www.w3.org/TR/xmlschema-2/#base64Binary definition we can use. How should I define it in the code? |
schema/writer.go, line 95 at r1 (raw file): Previously, dennwc (Denys Smirnov) wrote…
Done. |
voc/schema/schema.go, line 29 at r1 (raw file): Previously, dennwc (Denys Smirnov) wrote…
I just made it up so I could test and run the code. I don't think there is a definition for just plain byte array data. Will it break things if we just take it out? |
For the tests, I might need your help with the protobuf marshalling/unmarshalling. With Bolt, I see the marshalling of quad.Bytes to ValueRaw going through, but the return ToNative() returns an empty array so the test fails. Here's what I have so far, but you may be faster at figuring out where/what's failing... : ) func MakeValue(qv quad.Value) *Value {
...
case quad.Bytes:
return &Value{&Value_Raw{v.Native().([]byte)}}
...
}
// ToNative converts protobuf Value to quad.Value.
func (m *Value) ToNative() (qv quad.Value) {
...
case *Value_Raw:
return quad.Bytes(v.Raw)
...
} |
Inching along...Defined The Thnxs! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you please rebase on top of the latest master? The PR now includes changes from other recently merged PRs.
Reviewed 6 of 12 files at r4, 13 of 13 files at r5.
Reviewable status: all files reviewed, 10 unresolved discussions (waiting on @fungl164)
graph/graphtest/graphtest.go, line 710 at r5 (raw file):
quad.Bool(true), quad.Time(time.Now()), quad.Bytes([]byte{'b', 'y', 't', 'e', 's'}),
Makes sense to add some "raw" bytes like 0x00 and similar.
graph/kv/indexing.go, line 833 at r5 (raw file):
} if bytes, ok := v.(quad.Bytes); ok { v = bytes.TypedString()
This will definitely force it to be stored as "xxxx"^^<bytes>
string.
I think it might be better to pass it as quad.Bytes
further and convert it to something else when writing the value to the store.
internal/mapset/comparator.go, line 52 at r5 (raw file):
return TimeComparator(a, b) } if a == b {
This case can be used for every other type except Time
and []byte
.
internal/mapset/map.go, line 36 at r5 (raw file):
func NewMapWithComparator(cmp func(a, b interface{}) int) Map { m := &btreeMap{ inner: btree.NewWith(10, cmp),
Do we really need a B-Tree for this simple use case? Why Go maps cannot be used?
internal/mapset/set.go, line 7 at r5 (raw file):
) type Set interface {
Same here, do we really need a generic Set
interface? Why not define a specific implementation?
quad/value.go, line 453 at r1 (raw file):
Previously, fungl164 wrote…
I thought about that. I implemented base64 encoding/decoding. there is teh https://www.w3.org/TR/xmlschema-2/#base64Binary definition we can use. How should I define it in the code?
There should be some canonical URI/IRI for this type. I'm not sure https://www.w3.org/TR/xmlschema-2/#
is the right namespace, though. It may be somewhere in XML namespaces - RDF inherits some of them.
quad/value.go, line 66 at r5 (raw file):
if v != nil { // if vv, ok := v.(Bytes); ok { // h.Write([]byte(vv))
Hmm, right, we may want to write it as raw bytes. It should be fine since the String()
method returns RDF string, so all string values will be quoted.
But it will lead to hash collisions for quad.String("a")
and quad.Bytes("\"a\"")
. This may or may not be desirable. If we allow collisions, the first value that is written will "bind" it to a specific type. So if String("a")
is written first, and the Bytes("\"a\"")
is written next, you will always retrieve a String("a")
for both cases.
To avoid this we should probably change the hash to include the type ID. But this will be a breaking change.
Let me think about it a bit more. There may be a way to store it efficiently while preserving backward compatibility. Or as an option, we can include it in v0.8 that won't be bound by binary compatibility promise.
quad/pquads/quads.go, line 23 at r5 (raw file):
case quad.String: return &Value{&Value_Str{string(v)}} case quad.Bytes:
It may be better to remove this case and add a generic one before default
. This case may check for the TypedString()
method on the value (define an interface somewhere) and convert it.
So if it recognizes a specific type - it converts directly. If it doesn't, but the values can be converted to TypeString
- it will be encoded this way as a fallback.
quad/pquads/quads.go, line 97 at r5 (raw file):
return quad.BNode(v.Bnode) case *Value_TypedStr: if v.TypedStr.Type == schema.Bytes {
Instead of asserting for a specific type, there should be references to TypedString.ParseValue()
call - it will check for the known value types and will parse the string with a specified function. See quad.RegisterStringConversion
.
voc/schema/schema.go, line 29 at r1 (raw file):
Previously, fungl164 wrote…
I just made it up so I could test and run the code. I don't think there is a definition for just plain byte array data. Will it break things if we just take it out?
Yeah, it should be removed from here, but can be added as an unexported constant to quad
package.
9c5a6c6
to
e5da3c5
Compare
internal/mapset/map.go, line 36 at r5 (raw file): Previously, dennwc (Denys Smirnov) wrote…
Go maps cough up when using quad.Bytes as a map key since it is defined by a slice of bytes. Btree is just a choice. It can be replaced with any proper map-like struct. Perhaps we can do some background trickery to convert strings <-> []bytes in the background... |
internal/mapset/set.go, line 7 at r5 (raw file): Previously, dennwc (Denys Smirnov) wrote…
Not used. will take it out. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 1 of 2 files at r6.
Reviewable status: all files reviewed, 11 unresolved discussions (waiting on @dennwc and @fungl164)
internal/mapset/map.go, line 36 at r5 (raw file):
Previously, fungl164 wrote…
Go maps cough up when using quad.Bytes as a map key since it is defined by a slice of bytes. Btree is just a choice. It can be replaced with any proper map-like struct. Perhaps we can do some background trickery to convert strings <-> []bytes in the background...
They have a special optimization for bytes if you use it like this:
m := make(map[string]interface{})
// cast to string should be inline for optimization to work
m[string(b)] = x
// this won't be optimized
k := string(b)
m[s] = x
quad/value.go, line 450 at r6 (raw file):
} func (b Bytes) Native() interface{} { // v, err := base64.StdEncoding.DecodeString(string(b))
It should store raw bytes in the value, not base64. It should convert to base64 only in TypedValue
.
quad/value.go, line 450 at r6 (raw file): Previously, dennwc (Denys Smirnov) wrote…
Done. |
quad/pquads/quads.go, line 23 at r5 (raw file): Previously, dennwc (Denys Smirnov) wrote…
Fixed to store as raw bytes. |
quad/pquads/quads.go, line 97 at r5 (raw file): Previously, dennwc (Denys Smirnov) wrote…
Fixed. Removed. |
voc/schema/schema.go, line 29 at r1 (raw file): Previously, dennwc (Denys Smirnov) wrote…
If we removed this, what does the TypedString type become? |
internal/mapset/comparator.go, line 52 at r5 (raw file): Previously, dennwc (Denys Smirnov) wrote…
Removed. |
internal/mapset/map.go, line 36 at r5 (raw file): Previously, dennwc (Denys Smirnov) wrote…
Removed. |
internal/mapset/set.go, line 7 at r5 (raw file): Previously, fungl164 wrote…
Removed. |
quad/value.go, line 453 at r1 (raw file): Previously, dennwc (Denys Smirnov) wrote…
I'm thinking converting raw bytes to b64 and wrapping the result as CDATA would actually be RDF-compliant. This may actually be a workable solution since CDATA makes no guarantees and leaves it up to the reader to interpret the contents inside the tag...Just need to ensure the docs reflect what the interpretation should be...or we could just skip the CDATA encapsulation and just rely on the docs...what do you think? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 7 of 11 files at r7.
Reviewable status: 7 of 8 files reviewed, 5 unresolved discussions (waiting on @dennwc and @fungl164)
graph/graphtest/graphtest.go, line 710 at r7 (raw file):
quad.Bool(true), quad.Time(time.Now()), quad.IRI("C"), quad.Raw("<bytes>"),
quad.IRI("bytes")
to make the hashes collide
graph/kv/indexing.go, line 145 at r7 (raw file):
continue } } else if byt, ok := d.Val.(quad.Bytes); ok {
Please remove this and the same code below. Bytes should not appear as often as IRIs.
quad/value.go, line 453 at r1 (raw file):
Previously, fungl164 wrote…
I'm thinking converting raw bytes to b64 and wrapping the result as CDATA would actually be RDF-compliant. This may actually be a workable solution since CDATA makes no guarantees and leaves it up to the reader to interpret the contents inside the tag...Just need to ensure the docs reflect what the interpretation should be...or we could just skip the CDATA encapsulation and just rely on the docs...what do you think?
CDATA only works for XML form of RDF, which is rarely used. So we will need a namespace anyway for it to work with NQuads/Turtle/JSON-LD formats.
voc/schema/schema.go, line 29 at r1 (raw file):
Previously, fungl164 wrote…
If we removed this, what does the TypedString type become?
We just need to find the right type, and it can be removed from here (since it's not related to schema.org), and added as a constant directly in quad
package.
quad/value.go, line 453 at r1 (raw file): Previously, dennwc (Denys Smirnov) wrote…
Will need you to take the lead on this. |
voc/schema/schema.go, line 29 at r1 (raw file): Previously, dennwc (Denys Smirnov) wrote…
Ok. I'll remove it when we find the right type. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 2 of 2 files at r8.
Reviewable status: 7 of 8 files reviewed, 4 unresolved discussions (waiting on @dennwc and @fungl164)
graph/graphtest/graphtest.go, line 710 at r7 (raw file):
Previously, dennwc (Denys Smirnov) wrote…
quad.IRI("bytes")
to make the hashes collide
Sorry, I meant quad.IRI("C")
-> quad.IRI("bytes")
. And quad.Raw("<bytes>")
-> quad.Bytes("<bytes>")
. This should create the collision that I was talking about.
quad/value.go, line 453 at r1 (raw file):
Previously, fungl164 wrote…
Will need you to take the lead on this.
OK, sure, will push to this branch when I have a bit of the time.
graph/graphtest/graphtest.go, line 710 at r7 (raw file): Previously, dennwc (Denys Smirnov) wrote…
I think I see what you meant. I faked-forced a different hash for Bytes vs IRI by inserting an extra byte. Dunno if its the best way or if there will be other clashes, but hopefully unlikely if the hash is strong enough... : ) |
graph/graphtest/graphtest.go, line 710 at r7 (raw file): Previously, fungl164 wrote…
If the extra byte is not enough of a differentiator, adding a stronger suffix and/or prefix should make it work... : ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 5 of 8 files reviewed, 3 unresolved discussions (waiting on @dennwc and @fungl164)
quad/value.go, line 66 at r5 (raw file):
Previously, dennwc (Denys Smirnov) wrote…
Hmm, right, we may want to write it as raw bytes. It should be fine since the
String()
method returns RDF string, so all string values will be quoted.But it will lead to hash collisions for
quad.String("a")
andquad.Bytes("\"a\"")
. This may or may not be desirable. If we allow collisions, the first value that is written will "bind" it to a specific type. So ifString("a")
is written first, and theBytes("\"a\"")
is written next, you will always retrieve aString("a")
for both cases.To avoid this we should probably change the hash to include the type ID. But this will be a breaking change.
Let me think about it a bit more. There may be a way to store it efficiently while preserving backward compatibility. Or as an option, we can include it in v0.8 that won't be bound by binary compatibility promise.
Let's make it a \x00
prefix instead of a \x01
suffix. If later we decide to change the hash for other value, we will use this extra byte as a value type ID.
quad/value.go, line 66 at r5 (raw file): Previously, dennwc (Denys Smirnov) wrote…
Done. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, great! We only need a correct RDF type for TypedString
now. I'll take care of it.
Reviewed 1 of 2 files at r9.
Reviewable status: 6 of 8 files reviewed, 2 unresolved discussions (waiting on @dennwc)
What is the status of this PR? |
Need to rebase and check it again. |
Auto-converts []byte arrays to strings behind the scenes so the bytes are not stored as individual integers on the graph.
Examples:
This change is