struct.error: bad char in struct format #27

Open
farscape2012 opened this issue Aug 10, 2016 · 8 comments

@farscape2012

Hi,

I am trying to create a RecordDAWG object whose values are tuples that mix different data types, but I get an error.

Does RecordDAWG only accept numeric tuples?

data = [(u'key1', (1, b'a')), (u'key2', (2, b'b')),(u'key3', (3, b'c'))]

dawg.RecordDAWG(data)
Traceback (most recent call last):
File "", line 1, in
File "dawg.pyx", line 830, in dawg.RecordDAWG.__init__ (src/dawg.cpp:13810)
struct.error: bad char in struct format

Br,
Eric

@superbobry
Member

Hi Eric, you must specify a struct format to use RecordDAWG. Something like "=i1s" should work for your data.
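
For reference, here is a minimal sketch of what the "=i1s" format describes, using only the standard `struct` module. It does not call dawg itself. As far as the dawg README goes, the format string is passed as the first argument, i.e. `dawg.RecordDAWG("=i1s", data)`, but treat that call as an assumption and check it against your installed version.

```python
import struct

# Record layout suggested above: "=" means standard sizes without padding,
# "i" is a 4-byte signed int, "1s" is a bytes string of length 1.
FMT = "=i1s"

data = [(u'key1', (1, b'a')), (u'key2', (2, b'b')), (u'key3', (3, b'c'))]

# Pack every value tuple into a fixed-size bytes record.
packed = {key: struct.pack(FMT, *value) for key, value in data}

record_size = struct.calcsize(FMT)                # bytes per record
roundtrip = struct.unpack(FMT, packed[u'key1'])   # back to a tuple
```

Every record packed with this format has the same size, which is why a format string has to be known up front.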

@farscape2012
Author

Thanks.
I've figured it out, but my application is a bit more complicated.
If I understand correctly, RecordDAWG always stores binary data of a fixed size. My strings (the second element in the tuple) vary in length. Because of that, when I look up a key, the returned value ends with a lot of "\x00" padding, as in the example below. Is it possible for dawg to support binary data of varying size? Is there any plan to move in that direction?

Example:
'Hello\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'.
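
The padding comes from the `struct` module rather than from dawg: an `s` field shorter than its declared width is filled with null bytes. A minimal sketch, assuming a hypothetical format "=i40s" (the format actually used is not shown in the thread):

```python
import struct

FMT = "=i40s"  # hypothetical: one int plus a 40-byte string field

# A value shorter than the field width is padded with \x00 when packed.
raw = struct.pack(FMT, 7, b'Hello')
number, padded = struct.unpack(FMT, raw)

# Stripping the padding recovers the text. This is only safe when the
# real data can never end in a null byte itself.
text = padded.rstrip(b'\x00')
```

Stripping after lookup works around the padding on the Python side, but every record still occupies the full fixed width in storage.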

@kmike
Member

kmike commented Aug 11, 2016

@farscape2012 you can use BytesDAWG and encode/decode the data yourself.

@farscape2012
Author

farscape2012 commented Aug 11, 2016

Thanks kmike for suggestion.
I had already tried BytesDAWG. It seems the values must be bytes; it does not accept an int.
```
data = [(u'key1', b'value1'), (u'key2', b'value2'), (u'key1', 213)]
bytes_dawg = dawg.BytesDAWG(data)
Traceback (most recent call last):
File "", line 1, in
File "dawg.pyx", line 480, in dawg.BytesDAWG.__init__ (src/dawg.cpp:8932)
File "dawg.pyx", line 296, in dawg.CompletionDAWG.__init__ (src/dawg.cpp:6625)
File "dawg.pyx", line 42, in dawg.DAWG.__init__ (src/dawg.cpp:2050)
File "dawg.pyx", line 479, in genexpr (src/dawg.cpp:8735)
TypeError: Expected bytes, got int
```

@kmike
Member

kmike commented Aug 11, 2016

@farscape2012 yes, values should be bytes. The only thing RecordDAWG does differently from BytesDAWG is that it converts data to and from bytes using a predefined record format (it uses the struct module from the standard library: https://docs.python.org/3/library/struct.html). With BytesDAWG you need to convert data to and from bytes yourself.
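
One way to do that conversion by hand, as a sketch: pack the integer into a fixed-size header and append the string bytes unpadded, so each value is only as long as it needs to be. The `encode`/`decode` helpers and the `"<i"` header layout are hypothetical choices for illustration, not part of dawg.

```python
import struct

HEADER = "<i"  # hypothetical layout: 4-byte little-endian int, then raw bytes

def encode(number, payload):
    """Serialize (int, bytes) into one variable-length bytes value."""
    return struct.pack(HEADER, number) + payload

def decode(blob):
    """Inverse of encode: split the header from the trailing payload."""
    size = struct.calcsize(HEADER)
    (number,) = struct.unpack(HEADER, blob[:size])
    return number, blob[size:]

data = [(u'key1', encode(1, b'a')), (u'key2', encode(2, b'Hello'))]

# With the dawg package installed, these pairs are what would be passed on
# (assumed usage, based on the README):
#   bytes_dawg = dawg.BytesDAWG(data)
#   decode(bytes_dawg[u'key2'][0])
```

Because the payload is the last field, no length prefix is needed; with more than one variable-length field per value, each would need its own length or a delimiter.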

@farscape2012
Author

Thanks kmike again.
What about the order of values? Is the order in which values are added preserved?

Does dawg have a class or subclass that supports dictionaries as values, not only bytes and ints? That would make programming far easier.

@kmike
Member

kmike commented Aug 11, 2016

@farscape2012 key/value pairs are sorted by their binary value. Internally there are no values: values are just appended to the corresponding keys after a separator, and the resulting strings are stored in the DAFSA. Storing them in the DAFSA makes sense when you consider that values can be compressed in much the same way as keys. So, for example, adding a unique integer as a value will make the DAFSA "explode" almost into a trie, which is inefficient.
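
A rough sketch of that storage model in plain Python, to show why lookup order follows byte order rather than insertion order. The separator byte and the raw concatenation here are illustrative assumptions; the library's actual encoding of values may differ.

```python
SEP = b'\x01'  # assumed separator byte, for illustration only

pairs = [(u'key1', b'zeta'), (u'key1', b'alpha'), (u'key2', b'alpha')]

# Each pair becomes one byte string: key + separator + value.
# Sorting mimics the byte-wise ordering of the stored strings.
stored = sorted(key.encode('utf8') + SEP + value for key, value in pairs)

# Collect the values for one key by scanning for its prefix.
prefix = b'key1' + SEP
key1_values = [s[len(prefix):] for s in stored if s.startswith(prefix)]
```

Here `b'zeta'` was added before `b'alpha'` but comes back after it. The two keys also share the tail `b'alpha'`, which is the kind of repetition a DAFSA can merge; values that are unique per key leave nothing to share.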

If you want to attach arbitrary data to keys then a DAFSA is likely the wrong data structure. You could try e.g. https://github.com/pytries/marisa-trie or https://github.com/pytries/hat-trie. With marisa-trie you get a unique ID per key, 0 <= key_id < len(trie); to store arbitrary data, just create a Python list of the same length as the trie and put each value at index key_id. HAT-Trie supports Python objects as values natively.
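
The side-list idea can be sketched without installing anything. Below, a plain dict stands in for the trie's key-to-ID mapping; with marisa-trie the ID would come from the trie itself (per its README, something like `trie[key]`, which is an assumption to verify). The `lookup` helper is hypothetical.

```python
keys = [u'key1', u'key2', u'key3']

# Stand-in for the trie: any mapping key -> id with 0 <= id < len(keys).
key_id = {key: i for i, key in enumerate(sorted(keys))}

# One slot per key; slots can hold arbitrary Python objects.
values = [None] * len(key_id)
values[key_id[u'key1']] = (1, b'a')
values[key_id[u'key2']] = {'count': 2, 'label': 'Hello'}

def lookup(key):
    """Resolve the key to its ID, then index into the side list."""
    return values[key_id[key]]
```

The values live outside the trie, so they are not compressed, but they can be of any type and any length.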

@farscape2012
Author

farscape2012 commented Aug 11, 2016

@kmike Thanks again.
Good to know that the values are appended, which means order remains. In my case the integers are not unique; they are just arbitrary. I need to save several values per key. For now each key has three types of values, an integer (arbitrary) and a string. Later the number of elements in the value could increase.

Summarizing what you @kmike and @superbobry have suggested, I could proceed in two ways:

  1. Use BytesDAWG and append the values to the key (the order remains). The values have to be converted into bytes manually.
  2. Continue using RecordDAWG as I did, in which case the memory usage is not that efficient.

Considering scalability, speed, and memory efficiency, which method would you suggest?

Thanks.
BTW, I had checked marisa-trie before I started to use dawg. I felt that DAWG is much easier to use and has better functionality.
