WIN-1255 Encoding for Type 2 Fields #32

Doreen-Schwartz · 2024-08-06T10:03:45Z

Hello, I understand that the only encoding currently supported for fields outside Type 1 is UTF-8. Are there any plans to add support for additional encodings?

If not, I would love to try implementing this myself. I would appreciate pointers on how to go about it: which parts of the encoding process would need adjusting to enable this functionality?

From what I've gathered, the default Node Buffer currently used to construct the NIST file only supports very specific types of encoding. Looking at possible alternatives, iconv-lite seems like a decent package to utilize for this.
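As a quick illustration of the limitation described above, the following sketch asks Node's `Buffer` which encoding names it recognizes. The list of candidate names is only an assumption chosen for this example; `Buffer.isEncoding` is the standard Node API.

```typescript
import { Buffer } from "node:buffer";

// Encoding names to probe: three that Buffer handles natively and two
// spellings of the code page discussed in this issue.
const candidates = ["utf8", "latin1", "ascii", "win1255", "windows-1255"];

const support: Record<string, boolean> = {};
for (const name of candidates) {
  support[name] = Buffer.isEncoding(name);
}

// The Windows-1255 spellings are reported as unsupported, which is why an
// external converter (iconv-lite or similar) is needed for them.
console.log(support);
```

Since `Buffer.write` and `Buffer.toString` only accept the names for which `Buffer.isEncoding` returns true, any other charset has to be converted outside of `Buffer`.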

@JonathanZahavii

ivosh · 2024-08-06T14:41:29Z

Hi @Doreen-Schwartz, this is an interesting subject.
Honestly, throughout various law-enforcement agencies across the world I have not encountered such a requirement.
Nevertheless, adding a conversion process into the library would be fine with me.

A question: how is the charset encoding conveyed between the NIST file producer and the NIST file consumer? Is there a special field in the NIST file? Or is it just by convention? That would somewhat influence the further discussion.

I don't want to tie node-nist to any specific conversion library (such as iconv-lite, node-iconv...).
I am thinking of adding a new property (a callback function) to the NistCodecOptions interface and then using that instead of Buffer.toString and Buffer.write.
If you are willing to prepare a PR, please do it in several steps:

1. Propose the updated interface and example usage.
2. Once the interface is discussed/reviewed/agreed, flesh out the implementation, unit tests and documentation.

Doreen-Schwartz · 2024-08-12T08:45:22Z

As far as I have seen, there is no specific field that the consumer takes into consideration. We have an existing producer that creates NIST files which the consumer reads successfully, and I didn't notice any field that provides this information, neither in the file nor in the documentation for the consumer. So it is just a convention for this specific consumer.

Just to give some context, when I tried uploading a NIST produced using the node-nist library to our consumer, any Hebrew characters (which is the main reason non-ASCII characters are needed) would appear as gibberish. Speaking to someone from the team that worked on the consumer, they said that it only accepts WIN-1255 encoding for Hebrew characters.
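To make the gibberish concrete, here is a small sketch comparing the bytes involved. The sample word is an arbitrary choice for illustration; the Windows-1255 bytes are written out by hand, relying on the fact that the code page stores the Hebrew letters alef through tav at 0xE0 through 0xFA.

```typescript
import { Buffer } from "node:buffer";

// Four Hebrew letters (shin, lamed, vav, final mem), written as escapes so
// the example does not depend on the source file's own encoding.
const text = "\u05e9\u05dc\u05d5\u05dd";

// What a UTF-8 producer writes: two bytes per Hebrew letter.
const utf8Bytes = Buffer.from(text, "utf8");

// What a WIN-1255 consumer expects: one byte per Hebrew letter.
const win1255Bytes = Buffer.from([0xf9, 0xec, 0xe5, 0xed]);

console.log(utf8Bytes.toString("hex"), utf8Bytes.length);
console.log(win1255Bytes.toString("hex"), win1255Bytes.length);
```

A consumer that treats every byte as one WIN-1255 character therefore sees eight unrelated symbols where four letters were intended.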

As for the potential changes, I was thinking I could add callbacks for the NistFieldEncodeOptions and NistFieldDecodeOptions interfaces, one for writing fields into the buffer and one for decoding fields from the buffer.

Something like this, maybe:

/** Encoding options for a single NIST Field. */
interface NistFieldEncodeOptions extends NistFieldCodecOptions {
  formatter?: (field: NistField, nist: NistFile) => NistFieldValue;
  informationWriter?: (informationItem: NistInformationItem, data: EncodeTracking) => EncodeTracking;
}

This would then be used in place of data.buf.write in encodeNistInformationItem, with a parallel callback for decoding, of course. Does this make sense?

Doreen-Schwartz · 2024-08-12T13:17:09Z

Another thing I noted is that there is usage of the Buffer.byteLength function in a few places. I imagine this would also require some changes.

ivosh · 2024-08-12T15:06:47Z

For decoding logic, the proposed interface enhancement looks good, such as:

/** Decoding options for a single NIST Field. */
export interface NistFieldDecodeOptions extends NistFieldCodecOptions {
  parser?: (field: NistField, nist: NistFile) => Result<NistFieldValue, NistParseError>;
  informationDecoder?: (buffer: Buffer, startOffset: number, endOffset: number) => string;
}
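A callback matching the informationDecoder shape above might look like the sketch below. It is not the library's implementation: `decodeWin1255Subset` is a hypothetical name, it only handles ASCII plus the Hebrew letters alef through tav (0xE0 through 0xFA in Windows-1255), and it assumes `endOffset` is exclusive, which the thread does not specify. A real callback would more likely delegate to a full converter such as iconv-lite.

```typescript
import { Buffer } from "node:buffer";

// Hypothetical subset decoder with the (buffer, startOffset, endOffset) => string shape.
// Assumption: endOffset is exclusive.
function decodeWin1255Subset(buffer: Buffer, startOffset: number, endOffset: number): string {
  let out = "";
  for (let i = startOffset; i < endOffset; i++) {
    const byte = buffer[i];
    if (byte < 0x80) {
      out += String.fromCharCode(byte); // plain ASCII
    } else if (byte >= 0xe0 && byte <= 0xfa) {
      out += String.fromCharCode(0x05d0 + (byte - 0xe0)); // alef..tav
    } else {
      out += "\ufffd"; // outside the subset covered by this sketch
    }
  }
  return out;
}

// "A=" followed by four Hebrew letters encoded in WIN-1255.
const raw = Buffer.from([0x41, 0x3d, 0xf9, 0xec, 0xe5, 0xed]);
const decoded = decodeWin1255Subset(raw, 2, raw.length);
console.log(decoded);
```

Passing offsets rather than a pre-sliced buffer lets the callback decode in place without an intermediate copy.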

For the encoding logic, keep in mind that the NistFile tree gets visited a couple of times: first to get the lengths of all the fields, records, the target buffer etc. right, and a second time to actually write the data into the buffer.
So either you need two new properties (one for determining the length and another for writing the converted data), or the new property should follow a more generic shape, such as informationWriter?: (informationItem: NistInformationItem) => Buffer.
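The Buffer-returning variant can be sketched as follows: the same callback is called once per pass, first to add up lengths and then to copy bytes into the target. Everything named here is an assumption made for illustration. `InformationItem` is a local stand-in for the library's NistInformationItem type, and `encodeWin1255Subset` is a hypothetical encoder covering only ASCII and the Hebrew letters alef through tav.

```typescript
import { Buffer } from "node:buffer";

// Hypothetical stand-in for the library's information item type.
type InformationItem = string | Buffer | undefined;

// Hypothetical subset encoder with the (informationItem) => Buffer shape.
function encodeWin1255Subset(item: InformationItem): Buffer {
  if (item === undefined) return Buffer.alloc(0);
  if (Buffer.isBuffer(item)) return item; // binary data passes through untouched
  const out = Buffer.alloc(item.length); // one byte per character in this subset
  for (let i = 0; i < item.length; i++) {
    const code = item.charCodeAt(i);
    if (code < 0x80) out[i] = code; // plain ASCII
    else if (code >= 0x05d0 && code <= 0x05ea) out[i] = 0xe0 + (code - 0x05d0); // alef..tav
    else out[i] = 0x3f; // "?" for anything outside the subset
  }
  return out;
}

const items: InformationItem[] = ["ID", "\u05e9\u05dc\u05d5\u05dd", undefined];

// Pass 1: determine how large the target buffer has to be.
const totalLength = items.reduce((sum, item) => sum + encodeWin1255Subset(item).length, 0);

// Pass 2: convert again and write the bytes at the running offset.
const target = Buffer.alloc(totalLength);
let offset = 0;
for (const item of items) {
  offset += encodeWin1255Subset(item).copy(target, offset);
}

console.log(totalLength, target.toString("hex"));
```

A single callback keeps the two passes consistent by construction, at the cost of converting every item twice.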

Performance-wise, I'd say a charset conversion process is computationally comparable to Buffer.byteLength of an ASCII string. This means it's probably fine to perform the character conversion twice: first for determining the length and then for the serialization. Anyway, you should do at least some rudimentary performance comparison between plain 'UTF-8' and 'WIN-1255' enabled workloads.
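A rudimentary comparison along those lines could start from the sketch below. The `timeIt` helper and the sample workload are assumptions for illustration, and Node's built-in latin1 conversion is used purely as a stand-in for a single-byte conversion; a real measurement would plug in the actual WIN-1255 encoder and representative NIST data.

```typescript
import { Buffer } from "node:buffer";
import { performance } from "node:perf_hooks";

// Runs `fn` over every sample and returns the elapsed time in milliseconds.
function timeIt(samples: string[], fn: (s: string) => number): number {
  const start = performance.now();
  let sink = 0;
  for (const s of samples) sink += fn(s);
  const elapsed = performance.now() - start;
  if (sink < 0) throw new Error("unreachable"); // keeps the results observable
  return elapsed;
}

// Synthetic workload: short text values similar in size to typical field contents.
const samples = Array.from({ length: 50_000 }, (_, i) => `SUBJECT NAME ${i}`);

const timings = {
  // Current behaviour: only the UTF-8 length is computed.
  utf8ByteLength: timeIt(samples, (s) => Buffer.byteLength(s, "utf8")),
  // Stand-in for a charset conversion that is performed just to learn the length.
  singleByteConversion: timeIt(samples, (s) => Buffer.from(s, "latin1").length),
};

console.log(timings);
```

Absolute numbers from a loop like this vary between machines and runs, so only the ratio between the two entries is meaningful.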

ivosh · 2024-09-11T15:01:06Z

@Doreen-Schwartz I've finished updating the documentation. Please check 8473957.
For consistency, I've renamed informationWriter to informationEncoder. I hope it's fine with you.
Once I get a green light from you, I'll publish a new npm version.
Thanks again for the PR.

Doreen-Schwartz · 2024-09-11T16:00:03Z

@ivosh Overall looks great! Only one thing I noticed is that in the JSDoc for informationEncoder, the parameters are the same as for formatter instead of the function's actual parameters.

Otherwise seems to cover things well 👍

ivosh · 2024-09-12T10:57:36Z

@Doreen-Schwartz Thank you for the review. I am always glad to have another pair of eyeballs ;-)
New release 0.10.0 is out.

ivosh self-assigned this Aug 6, 2024

ivosh added the enhancement label Aug 7, 2024

ivosh linked a pull request Sep 11, 2024 that will close this issue

Alternative Field Encoding for Types #33

Merged

ivosh closed this as completed Sep 12, 2024
