Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

OndaEDF should use EDF-representable encoding parameters (physical min & max) #90

Open
ericphanson opened this issue Oct 18, 2023 · 5 comments
Assignees

Comments

@ericphanson
Copy link
Member

  • EDF.jl requires the caller to encode the data
  • EDF.jl asks the caller for the encoding parameters (physical_minimum and physical_maximum) and then tries to store those in the file as 8 ASCII characters
  • OndaEDF.jl is one such caller
  • OndaEDF.jl performs a new re-encoding phase which can choose difficult-to-represent encoding parameters

One approach here is to make EDF.jl better at representing encoding parameters, e.g. using scientific notation. I think improving EDF.jl's representation is a good thing to do. However, since OndaEDF gets to choose it's encoding, and a wide variety of encoding parameters can perform reasonably well for a given recording, I claim it should choose easy-to-represent ones.

That is, we could round physical_minimum down to the nearest EDF representable value, and physical_maximum up to the nearest EDF representable value before encoding. Thus, there should be no truncation/rounding when EDF encodes that parameter, and thus reduces unnecessary quantization noise.

As a very silly example, the current approach can result in 5 orders of magnitude more encoding noise, simply due to the truncation of encoding parameters:

using OndaEDF, Onda
scale = 1e6
info = SamplesInfoV2(; sensor_type="eeg",
    channels=["fp1"],
    sample_unit="microvolt",
    sample_resolution_in_unit=0.25,
    sample_offset_in_unit=scale,
    sample_type=Int16,
    sample_rate=1)

data = reshape([scale + 0.999999, scale + 1.999999], 1,:)
samples = Samples(data, info, false)

edf = onda_to_edf([samples])
io = IOBuffer()
EDF.write(io, edf)
seekstart(io)
f = EDF.File(io)

(samples2,), plan = edf_to_onda_samples(f)
samples2 = Onda.decode(samples2)

yields

julia> # roundtrip through encoding alone, without EDF truncation of encoding parameters
       quantization_error = abs(samples.data[1] - decode(OndaEDF.reencode_samples(samples)).data[1])
1.00000761449337e-6

julia> # roundtrip through EDF encoding
       edf_error = abs(samples.data[1] - samples2.data[1])
0.375044776359573

julia> log10(edf_error / quantization_error)
5.574079814047871

Note in both cases we are using OndaEDF.reencode_samples to re-encode them. The only difference is passing through EDF.jl writing, which rounds/truncates these parameters.

@klaff
Copy link
Contributor

klaff commented Oct 18, 2023

EDF (I presume) was designed as a format for data generated by A/D converters, which (excluding μ-law or other non-linear types) has a natural resolution (how much is 1 lsb worth?) and a span or range of valid values.

On the other hand, if EDF is used as an output format from some complex process, the contextual clues from the data's origin may be unavailable (or never existed) and the problem changes to one of finding the best mapping given the limits of the format (e.g. 8 char ASCII representation of mapping parameters and Int16 samples).

Those two constraints can play against each other. The EDF error reported above disappears if one chooses sample_resolution_in_unit=1, but that's because we are lucky that a favorable set of physical min/max work out for that case (967232.0f0, 1.032767f6).

The error does not disappear if one naïvely chooses sample_resolution_in_unit=1/2^16. This would maximize the resolution within the span but the physical min/max is broken by the 8-char ASCII limitation (999999.5f0, 1.0f6).

Is there an algorithmic way to find the optimal mapping into EDF? Would optimum be defined as minimum maximum error over a whole signal? Minimum RMS error?

Should there be a function to determine the error in a given case?

@ericphanson
Copy link
Member Author

ericphanson commented Oct 19, 2023

I'm not sure if I'm am following 100% but I think "two constraints can play against each other" refers to the two independent sources of error:

  • quantization error (in general, from encoding & decoding)
  • truncation-of-encoding-parameter-error which is due to EDF's 8-character representation of the encoding parameters
    • note this truncation happens AFTER encoding, meaning the decoding gets slightly-wrong parameters

I don't think these two are really fundamentally in tension however! Why?

  • minimizing quantization error depends on setting relatively tight bounds to maximize the resolution we can represent with Int16 numbers. That is, we want physical_minimum and physical_maximum to be near the actual min/max of the data
    • note: as you pointed out, if the resolution is actually limited by the underlying recording device, we should just use those parameters, which might give less resolution that we can squeeze out of Int16's but represents the actual resolution of the signal. But if we don't know the resolution of the recording device, then maybe it does make sense to squeeze as much resolution as possible out.
  • minimizing additional error incurred due to truncation of encoding parameters depends on the value being close to an 8-character digit
    • but there's a lot of such "exactly representable" values spread throughout the floating point space!

We can use

function _nearest_representable_edf_value(x, rm::RoundingMode=RoundNearest)
    return round(x, rm; digits=(8 - (ndigits(floor(Int, x)) + signbit(x) + !isinteger(x))))
end

to choose one, deciding on whether or not we want to look up or down to get it. (See beacon-biosignals/EDF.jl#76 for the bugfix I've made to this function compared to it's version in EDF.jl).

So I think if we just choose physical_min = _nearest_representable_edf_value(signal_min, RoundDown) and physical_max = _nearest_representable_edf_value(signal_max, RoundUp) we can eliminate truncation error without significantly changing the quantization error.

BTW, I will include some code below I have used while investigating, in case it is useful (and to put it somewhere).

Code
using Printf
function edf_roundtrip(x::Real)
    result = missing
    if isinteger(x)
        str = string(trunc(Int, x))
        if length(str) <= 8
            result = str
        end
    else
        fpart, ipart = modf(x)
        ipart_str = string('-'^signbit(x), Int(abs(ipart))) # handles `-0.0` case
        fpart_str = @sprintf "%.7f" abs(fpart)
        fpart_str = fpart_str[3:end] # remove leading `0.`
        if length(ipart_str) < 7
            result = ipart_str * '.' * fpart_str[1:(7-length(ipart_str))]
        elseif length(ipart_str) <= 8
            result = ipart_str
        end
    end
    if !ismissing(result)
        return parse(Float64, result)
    end
    error("failed to fit header field into EDF's 8 ASCII character limit. Got: $x")
    return nothing
end

edf_roundtrip_error(x) = abs(x - edf_roundtrip(x))

function _nearest_representable_edf_time_value(x, rm::RoundingMode=RoundNearest)
    return round(x, rm; digits=(8 - (ndigits(floor(Int, x)) + signbit(x) + !isinteger(x))))
end
decode(sample, dmin, dmax, pmin, pmax) = ((sample - dmin) / (dmax - dmin)) * (pmax - pmin) + pmin

encode(sample, resolution_in_unit, offset_in_unit) = round(Int16, clamp((sample - offset_in_unit) / resolution_in_unit, typemin(Int16), typemax(Int16)))


function quantization_error(sample, dmin, dmax, pmin, pmax)
    sample_resolution_in_unit = (pmax - pmin) / (dmax - dmin)
    sample_offset_in_unit = pmin - (sample_resolution_in_unit * dmin)

    encoded_sample = encode(sample, sample_resolution_in_unit, sample_offset_in_unit)
    decoded_sample = decode(encoded_sample, dmin, dmax, pmin, pmax)

    return abs(sample - decoded_sample)
end

function quantization_plus_thresholding_error(sample, dmin, dmax, pmin, pmax)
    sample_resolution_in_unit = (pmax - pmin) / (dmax - dmin)
    sample_offset_in_unit = pmin - (sample_resolution_in_unit * dmin)
    encoded_sample = encode(sample, sample_resolution_in_unit, sample_offset_in_unit)
    decoded_sample = decode(encoded_sample, edf_roundtrip(dmin), edf_roundtrip(dmax), edf_roundtrip(pmin), edf_roundtrip(pmax))
    return abs(sample - decoded_sample)
end

additional_error_due_to_truncation = (args...) -> (quantization_plus_thresholding_error(args...) - quantization_error(args...))


additional_quantization_error_due_to_rounding = (sample, dmin, dmax, pmin, pmax) -> (quantization_error(sample, _nearest_representable_edf_time_value(dmin, RoundDown), _nearest_representable_edf_time_value(dmax, RoundUp), _nearest_representable_edf_time_value(pmin, RoundDown), _nearest_representable_edf_time_value(pmax, RoundUp)) - quantization_error(sample, dmin, dmax, pmin, pmax))

function error_curve(f, pmin, pmax)

    dmin, dmax = Float64(typemin(Int16)), Float64(typemax(Int16))
    xs = range(pmin, pmax; length=100)
    ys = [f(sample, dmin, dmax, pmin, pmax) for sample in xs]
    return (xs, ys)
end


using CairoMakie

pmin, pmax = -3200.0, 3200.0 # or pmin, pmax = 1e6 + 0.999999, 1e6 + 1.999999

# periodic, probably an analytic formula
lines(error_curve(quantization_error, pmin, pmax)...)

lines(error_curve(quantization_plus_thresholding_error, pmin, pmax)...)

lines(error_curve(additional_error_due_to_truncation, pmin, pmax)...)

lines(error_curve(additional_quantization_error_due_to_rounding, pmin, pmax)...)

BTW: this is why I've been saying that I don't think stuff like beacon-biosignals/EDF.jl#70 is really getting at the root problem here. If we represent more numbers accurately, that's helpful, but really we should be choosing exactly-representable encoding parameters from the start, since we both get to choose them here and know the target is EDF.jl, and can thus eliminate truncation error completely, just by calling _nearest_representable_edf_value appropriately before encoding. (EDF.jl is too late, it gets the results after encoding).

@klaff
Copy link
Contributor

klaff commented Oct 20, 2023

@ericphanson Did you mean for the funcitons _nearest_representable_edf_time_value and _nearest_representable_edf_value to be different in some way?

@klaff
Copy link
Contributor

klaff commented Oct 20, 2023

@ericphanson I don't have time today to keep up with all the great detail you've put in here, but let me know if I've got the TL;DR summary correct: The root of the problem in your original example was choosing offset and resolution at the Onda stage without considering that the chosen encoding parameters are not exactly representable in EDF.

@ericphanson
Copy link
Member Author

ericphanson commented Oct 20, 2023

@ericphanson Did you mean for the funcitons _nearest_representable_edf_time_value and _nearest_representable_edf_value to be different in some way?

No, sorry, they are the same. The time one is the name in EDF.jl but it has nothing to do with time so I changed the name posthoc in an attempt at clarity here .

@ericphanson I don't have time today to keep up with all the great detail you've put in here, but let me know if I've got the TL;DR summary correct: The root of the problem in your original example was choosing offset and resolution at the Onda stage without considering that the chosen encoding parameters are not exactly representable in EDF.

Yes! That was actually what I was trying to say with this whole issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants