AssertionError: Low or high out of range #1

Open
shitpoet opened this issue Jun 22, 2023 · 4 comments

shitpoet commented Jun 22, 2023

I'm trying to use this module on enwik5 data (10 000 bytes). But I encounter this error:

AssertionError: Low or high out of range

Are there any additional limitations in the implementation? Or do I do something wrong?

The script below works ok with enwik4 data (1 000 bytes).

I count statistics myself and then use StaticModel, but I encounter either this Low or high out of range error, or ValueError: Symbol has zero frequency error.

Attachment: enwik5.zip

fn = 'enwik5'

print(fn)

def read_bytes(path):
    with open(path, 'rb') as f:
        return list(f.read())

data = read_bytes(fn)
nsyms = 256
stats = [0] * nsyms
for c in data:
    stats[c] += 1

from arithmetic_compressor import AECompressor

from arithmetic_compressor.models.base_adaptive_model import BaseFrequencyTable
from arithmetic_compressor.util import *

SCALE_FACTOR = 4096

class StaticModel:
  """A static model, which does not adapt to input data or statistics."""

  def __init__(self, counts_dict):
    #vals = (v for k, v in counts_dict.items())
    #counts_sum = sum(vals)
    #probability = {k: v / counts_sum for k, v in counts_dict.items()}
    #print(probability)
    probability = counts_dict

    symbols = list(probability.keys())

    self.name = "Static"
    self.symbols = symbols
    self.__prob = dict(probability)

    # compute cdf from given probability
    cdf = {}
    prev_freq = 0
    self.freq = freq = {sym: round(SCALE_FACTOR * prob)
                        for sym, prob in probability.items()}
    for sym, freq in freq.items():
      cdf[sym] = Range(prev_freq, prev_freq + freq)
      prev_freq += freq
    self.cdf_object = cdf

  def cdf(self):
    return self.cdf_object

  def probability(self):
    return self.__prob

  def predict(self, symbol):
    assert symbol in self.symbols
    return self.probability()[symbol]

  def update(self, symbol):
    pass

  def test_model(self, gen_random=True, N=10000, custom_data=None):
    self.name = "Static Model"
    return BaseFrequencyTable.test_model(self, gen_random, N, custom_data)

freq_map = {
    sym: freq for sym, freq in enumerate(stats)
    if freq > 0
}

model = StaticModel(freq_map)
coder = AECompressor(model)

N = len(data)
compressed = coder.compress(data)
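
One possible explanation, offered as a sketch rather than a confirmed diagnosis: the StaticModel above computes `round(SCALE_FACTOR * prob)`. If `prob` is a raw count, the total frequency becomes 4096 times the data length (about 41 million for 10 000 bytes), which may be what trips the range assertion. If `prob` is a normalized probability, any symbol rarer than roughly 1 in 8192 rounds to a frequency of 0, which would match the "zero frequency" error. The helper `quantize_counts` below is hypothetical (it is not part of arithmetic_compressor) and shows one way to get integer frequencies that are all at least 1 and sum exactly to `SCALE_FACTOR`.

```python
SCALE_FACTOR = 4096  # same constant as in the script above


def naive_scaled(counts):
    """Normalize counts, then scale and round as the script's StaticModel does."""
    total = sum(counts.values())
    return {s: round(SCALE_FACTOR * c / total) for s, c in counts.items()}


def quantize_counts(counts, scale=SCALE_FACTOR):
    """Hypothetical helper: integer frequencies, each >= 1, summing to `scale`."""
    if not counts:
        raise ValueError("no symbols to quantize")
    if len(counts) > scale:
        raise ValueError("more symbols than available frequency slots")
    total = sum(counts.values())
    # Floor each share, but never let a seen symbol drop to zero.
    freqs = {s: max(1, c * scale // total) for s, c in counts.items()}
    # Flooring and the minimum of 1 leave a small drift; absorb it in the
    # most frequent symbols without pushing any of them below 1.
    drift = scale - sum(freqs.values())
    for s in sorted(freqs, key=freqs.get, reverse=True):
        if drift == 0:
            break
        step = drift if drift > 0 else -min(-drift, freqs[s] - 1)
        freqs[s] += step
        drift -= step
    return freqs


# Toy data: one symbol seen 9999 times, another seen once (10 000 in total).
counts = {ord('a'): 9999, ord('z'): 1}
print(naive_scaled(counts))     # {97: 4096, 122: 0}  <- 'z' gets zero frequency
print(quantize_counts(counts))  # {97: 4095, 122: 1}

# Dividing by SCALE_FACTOR is exact for a power of two, so the script's
# round(SCALE_FACTOR * prob) would recover these integers unchanged.
probabilities = {s: f / SCALE_FACTOR for s, f in quantize_counts(counts).items()}
```

Passing `probabilities` to the StaticModel defined above would then yield frequencies that sum to 4096. Whether that total is what AECompressor expects is an assumption that has not been checked against the library source.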

kodejuice (Owner) commented Aug 10, 2023

Oh, sorry, I am just seeing this. I have no idea why I wasn't notified 🤔

About the issue: this usually happens when there's a symbol in the data that is not present in your frequency map, most likely a Unicode symbol. So setting nsyms = 256 is not enough; you probably need to go through the data to collect all possible symbols and build a frequency map out of that.

Something like this:

data = read_bytes(fn)
freq_map = {}
for c in data:
    freq_map[c] = freq_map.get(c, 0) + 1

Again, sorry for the late reply.

shitpoet (Author) commented:

Hmm. I checked the maximum symbol value in enwik5.dat as returned by read_bytes, and it is 237. So it seems there are no Unicode symbols after open(..., 'rb').

From Python 3.11 docs:

Files opened in binary mode (including 'b' in the mode argument) return contents as bytes objects without any decoding.
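
A small check of the point quoted above, using the same `read_bytes` as in the original script. The sample text and the temporary file are made up for illustration. Even when the file holds non-ASCII text, binary mode yields integers in the range 0 to 255, so a 256-entry table covers every possible symbol.

```python
import os
import tempfile


def read_bytes(path):
    with open(path, 'rb') as f:
        return list(f.read())


# Made-up sample containing multi-byte UTF-8 characters.
text = "naïve café, 日本語"
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(text.encode("utf-8"))
    path = tmp.name

try:
    data = read_bytes(path)
finally:
    os.remove(path)

# Every element is a single byte value, never a Unicode code point.
print(all(0 <= b <= 255 for b in data))
# Multi-byte characters simply show up as several byte-sized symbols.
print(len(data), "bytes for", len(text), "characters")
```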

kodejuice (Owner) commented Aug 14, 2023

I just tested this code and it works fine for me? 🤔

fn = 'enwik5.zip'

print(fn)

def read_bytes(path):
    with open(path, 'rb') as f:
        return list(f.read())

data = read_bytes(fn)
nsyms = 256
stats = [0] * nsyms
for c in data:
    stats[c] += 1

from arithmetic_compressor import AECompressor

from arithmetic_compressor.models.base_adaptive_model import BaseFrequencyTable
from arithmetic_compressor.models import StaticModel
from arithmetic_compressor.util import *

SCALE_FACTOR = 4096

freq_map = {
    sym: freq for sym, freq in enumerate(stats)
    if freq > 0
}

model = StaticModel(freq_map)
coder = AECompressor(model)

N = len(data)
compressed = coder.compress(data)

print(len(data))
print(len(compressed) // 8)

Output:

enwik5.zip
34850
34823
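
Note that the run above sets `fn = 'enwik5.zip'`, so the 34850 printed is the size of the archive itself, and the symbols being compressed are the zip's bytes rather than the extracted text. The sketch below shows one way to read the original data out of the archive with the standard `zipfile` module. The member name `enwik5` is an assumption about what the attachment contains, and the demo builds its own throwaway archive so it runs on its own.

```python
import os
import tempfile
import zipfile


def read_bytes_from_zip(zip_path, member):
    """Return the uncompressed bytes of one archive member as a list of ints."""
    with zipfile.ZipFile(zip_path) as zf:
        return list(zf.read(member))


# Self-contained demo with a throwaway archive and made-up payload.
payload = b"wiki text " * 1000  # 10 000 bytes of repetitive data
with tempfile.TemporaryDirectory() as tmpdir:
    zip_path = os.path.join(tmpdir, "enwik5.zip")
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("enwik5", payload)  # member name is an assumption

    archive_size = os.path.getsize(zip_path)
    data = read_bytes_from_zip(zip_path, "enwik5")

# The archive and its contents are different byte streams of different sizes.
print(archive_size, len(data))
```

Reading the member instead of the archive should reproduce the input the original report describes, assuming the attachment holds the same file.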

kodejuice (Owner) commented:

I am also on Python 3.11.
