Add support for the indexed users db format #934

Open
wants to merge 1 commit into master

Conversation

DaleFarnsworth

The indexed format is a tree-structured database. Each
unique string is stored only once in the db and referenced through
pointers by each DMR ID entry that uses that string.

The new format uses about half the space of the standard
md380 userdb format.

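Purely for illustration, here is a rough C sketch of the idea; the field names, reference widths, and layout are assumptions made for this sketch, not the actual on-flash format, which is documented in README-INDEXEDDB.md:

```c
#include <stdint.h>

/*
 * Sketch only: every unique string (callsign, name, city, ...) is stored
 * once, and each DMR ID entry refers to those shared strings instead of
 * repeating the text, so entries that share a city or country also share
 * its single stored copy.
 */
struct indexed_entry_sketch {
    uint32_t dmr_id;        /* the DMR ID itself                 */
    uint32_t callsign_ref;  /* reference to the callsign string  */
    uint32_t name_ref;      /* reference to the name string      */
    uint32_t city_ref;      /* reference to the city string      */
    uint32_t state_ref;     /* reference to the state string     */
    uint32_t country_ref;   /* reference to the country string   */
};
```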
@DaleFarnsworth
Author

Hi Travis. Please review and comment. I've been running various iterations of this code and new db format for a couple of months now without any issues. The changes I made to usersdb.c support both the standard userdb format and this new indexed tree-structured format. The new format also begins with a single ASCII line containing "0", so if the new format is installed on a radio running old firmware, it just looks like a zero-length database. After support for the new db format is added to md380tools, the firmware will support either format.

A description of the format is contained in README-INDEXEDDB.md.

The repo at https://github.com/DaleFarnsworth/md380IndexedUserDB contains C programs that convert both ways between the standard db format and this new indexed format. The conversion back and forth is lossless.

Thanks.

@travisgoodspeed
Owner

Any volunteers to review this code? At first glance it's a worthy contribution, but I'm too burned out on this project to review the code thoroughly myself.

@rogerclarkmelbourne

rogerclarkmelbourne commented Oct 7, 2021

There is a further compression that can be applied, as long as you only need upper- and lower-case ASCII letters, digits, space, and comma, because then the total number of unique characters is 64, not 256.

Hence each character fits in 6 bits, and 4 ASCII characters can be packed into 3 bytes.

AFAIK this is the compression method used by some manufacturers like Connect Systems.
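For concreteness, a minimal sketch of that packing in C, assuming a hypothetical char_to_code() mapping from the 64-character alphabet to 6-bit codes (not anything in the md380tools code):

```c
#include <stdint.h>

/* Hypothetical mapping from an allowed ASCII character to a 0..63 code. */
extern uint8_t char_to_code(char c);

/* Pack four 6-bit character codes (24 bits total) into three output bytes. */
void pack4(const char in[4], uint8_t out[3])
{
    uint32_t bits = 0;
    for (int i = 0; i < 4; i++)
        bits = (bits << 6) | (char_to_code(in[i]) & 0x3fu);

    out[0] = (bits >> 16) & 0xff;
    out[1] = (bits >> 8) & 0xff;
    out[2] = bits & 0xff;
}
```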

@DaleFarnsworth
Author

We don't need comma, but currently the users db I use has '#', '&', "'" (single quote), '(', ')', '*', '+', '-', '.', '/', ':', ';', '=', '?', '@', ']', '_', '`', '|', and '$'. We could avoid some of these with cleanup, but I think we'll still benefit from having at least space, dash, ampersand, and period. I find it tough to get down to the required 64-character alphabet.

One of my (admittedly self-imposed) requirements was that the current database contents be fully supported. I don't plan to add any character string compression, but others are welcome to do so.

@rogerclarkmelbourne

rogerclarkmelbourne commented Oct 7, 2021

No worries

It was just a suggestion, as it does yield about 30% extra compression on the entire uncompressed string for each record.

However, it would yield less compression on your shorter sub strings.

BTW, I initially thought your compression also handled the completely duplicate records, where people have 2 or 3 IDs with exactly the same information in each, apart from the ID.

I wonder if you could somehow add that as some sort of special case.
But you'd need to see how much compression that yielded.

There are also a large number of IDs which hardly ever get used. HamDigital.org used to maintain a list of active IDs which could be downloaded with activity range limits of up to 1 year or more.

And as I recall, only about 50% of the IDs were ever active in any given year.

Of course, if DMR MARC supported TA, then none of this would be necessary ;-)

And I don't know why no one has written an extension to MMDVMHost to inject TA, because that would fix the problem for the large number of people using hotspots etc. on DMR MARC, and potentially for all DMR MARC repeaters which use MMDVMHost.

Unfortunately I don't have time to update MMDVMHost, because I'm busy on loads of other projects.

@DaleFarnsworth
Author

My method already stores only one record when multiple DMR IDs have the same callsign, name, etc. It's not a special case.

@rogerclarkmelbourne

OK, that's good to know.

@DaleFarnsworth
Author

It looks like (back-of-the-envelope guesstimate) we could save an extra 10% (on a full database containing names, cities, states, and countries) by encoding the most frequently occurring character pairs as unused character values. That is, encode the current characters into values 0 to <number_of_unique_characters> - 1 and use values <number_of_unique_characters> to 255 to represent the most frequently occurring character pairs. The decoding would be quite simple. I think I'll code it up and see what it gives us.
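A minimal sketch of what the decoding side could look like, assuming hypothetical single_chars and pair_chars tables and a placeholder alphabet size; this is only an illustration of the idea, not code from the PR:

```c
#include <stddef.h>
#include <stdint.h>

#define NUM_SINGLE 96  /* placeholder: number of unique single characters */

/* Hypothetical tables produced alongside the encoder. */
extern const char single_chars[NUM_SINGLE];          /* codes 0..NUM_SINGLE-1 */
extern const char pair_chars[256 - NUM_SINGLE][2];   /* codes NUM_SINGLE..255 */

/* Expand a pair-encoded byte stream into plain characters; returns the
 * number of characters written to out. */
size_t pair_decode(const uint8_t *in, size_t in_len, char *out)
{
    size_t n = 0;
    for (size_t i = 0; i < in_len; i++) {
        uint8_t c = in[i];
        if (c < NUM_SINGLE) {
            out[n++] = single_chars[c];
        } else {
            out[n++] = pair_chars[c - NUM_SINGLE][0];
            out[n++] = pair_chars[c - NUM_SINGLE][1];
        }
    }
    return n;
}
```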

If that's implemented, it will be independent of the current code, so I would still appreciate someone's careful review of this PR as it currently stands.

@rogerclarkmelbourne

I'd not be using it with MD380 tools, and unfortunately I'm also mega busy with other projects, so this one would not get looked at for several months.

@DaleFarnsworth
Author

I prototyped the character-pair compression. My estimate of 10% savings was way off; it's actually only 4%, and that's on the indexed file. The saving relative to the original file is less than 2%. I don't know that it's worth it.
