Regarding the [missing tokens in the parsed vocabulary](https://github.com/huggingface/swift-transformers/pull/113#issuecomment-2267520368), this is my documentation after tracking down one of the issues.
First, we are parsing the JSON file (`tokenizer.json`) using `JSONSerialization.jsonObject`. This reads the data as Foundation objects, parsing tokens from the vocab dictionary as `NSString` instances. This is a good thing: `String`s cannot be used as keys in the vocab dictionary because Swift `String` equality only considers the Unicode canonical representation, so distinct tokens with the same canonical form would collide. Parsing the JSON and casting to `[String: Int]` would silently drop some entries.
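For illustration, here is a minimal sketch of the key-collision problem (the `é` example strings are mine, not taken from an actual vocab):

```swift
import Foundation

// Two canonically-equivalent spellings of "é": precomposed U+00E9,
// and "e" followed by the combining acute accent U+0301.
let precomposed = "\u{e9}"
let decomposed = "e\u{301}"

// Swift String equality uses Unicode canonical equivalence, so the
// two spellings compare equal...
print(precomposed == decomposed) // true

// ...and collide when used as [String: Int] keys.
var stringVocab: [String: Int] = [:]
stringVocab[precomposed] = 0
stringVocab[decomposed] = 1
print(stringVocab.count) // 1 — the second assignment overwrote the first

// NSString comparison is based on the underlying code units, so the
// two spellings stay distinct, preserving both vocab entries.
print((precomposed as NSString).isEqual(decomposed as NSString)) // false
```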
However, I found that `JSONSerialization` fails to parse some strings correctly. Consider the following test case:
```swift
func testArrayParsingWithBOMPrefix() {
    // The second one starts with a BOM prefix
    let items = ["a", "\u{feff}a"]

    // Neither Strings nor NSStrings are equal
    XCTAssertNotEqual(items[0], items[1])
    XCTAssertNotEqual(items[0] as NSString, items[1] as NSString)

    // JSONDecoder works
    let jsonData = try! JSONSerialization.data(withJSONObject: items, options: [])
    let decoder = JSONDecoder()
    let decoded = try! decoder.decode([String].self, from: jsonData)
    XCTAssertEqual(decoded, items)

    // JSONSerialization seems to ignore the BOM.
    // The decoded array contains two items, but they are the same NSString.
    let ns_decoded = try! JSONSerialization.jsonObject(with: jsonData, options: []) as! NSArray
    XCTAssertEqual(ns_decoded.count, items.count) // passes
    XCTAssertNotEqual(ns_decoded[0] as! NSString, ns_decoded[1] as! NSString) // fails
    XCTAssertEqual(ns_decoded as! [String], items) // fails

    // Compare unicodeScalars
    func scalars(_ string: String) -> [UInt32] {
        string.unicodeScalars.map { $0.value }
    }

    for (decoded, expected) in zip(ns_decoded, items) {
        let decodedScalars = scalars(decoded as! String)
        let expectedScalars = scalars(expected)
        XCTAssertEqual(decodedScalars, expectedScalars) // first passes, second fails
    }
}
```
There are two strings in the test array. The second one starts with a BOM prefix. The prefix is ignored when parsing the two `NSString`s, as confirmed by looking at the Unicode scalars in the debugger. Unfortunately, the Gemma vocab contains some duplicate entries with/without a BOM prefix, so reading them into a dictionary skips some entries.
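The impact on vocab construction can be reproduced with a tiny hypothetical vocab (two made-up tokens, not the actual Gemma file); on platforms where the bug reproduces, one token id silently disappears:

```swift
import Foundation

// Hypothetical two-token vocab reproducing the Gemma failure mode:
// duplicate keys that differ only by a leading BOM.
let vocab: [String: Int] = ["a": 0, "\u{feff}a": 1]
let data = try! JSONSerialization.data(withJSONObject: vocab, options: [])

// Round-trip through JSONSerialization: if the parser strips the BOM,
// the two keys collide and one entry is lost as the NSDictionary is built.
let parsed = try! JSONSerialization.jsonObject(with: data, options: []) as! NSDictionary
print(vocab.count)   // 2
print(parsed.count)  // 1 where the bug reproduces — one token id gone
```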
Interestingly, all the tests pass if the BOM character is in the middle of the string. Replacing the test items with these works fine:

```swift
// If the non-breaking space is inside the String, all tests pass
// let items = ["ab", "a\u{feff}b"]
```
I suspect this is used during parsing, and the stream is incorrectly assumed to start with a BOM even though the character is in the middle of the actual JSON data.
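One way to check this hypothesis is to dump the raw bytes of the serialized array; assuming `JSONSerialization` writes the BOM as raw UTF-8 rather than a `\uFEFF` escape, its bytes should appear mid-stream:

```swift
import Foundation

// Dump the serialized bytes to see where the BOM actually sits.
// Assuming it is written as raw UTF-8 (EF BB BF), it appears after the
// second opening quote — nowhere near the start of the stream, so
// BOM-skipping at that position would be incorrect.
let data = try! JSONSerialization.data(withJSONObject: ["a", "\u{feff}a"], options: [])
print(data.map { String(format: "%02x", $0) }.joined(separator: " "))
```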
Also interestingly, `JSONDecoder` works and can decode the two distinct `String` instances in the array. We are not using `JSONDecoder` in this project because:

- The structure of the JSON files to be parsed is quite open and flexible; I don't think it would be straightforward to write a `Decodable` structure that represents it. Instead, we use dynamic member lookup to navigate the contents (see the sketch after this list).
- We can't use `String` instances for vocab keys, as mentioned above.
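For context, here is a minimal sketch of the dynamic-member-lookup approach; `JSONNode` is a hypothetical stand-in, not the project's actual wrapper type:

```swift
import Foundation

// Sketch of JSON navigation via dynamic member lookup: walk the
// loosely-structured output of JSONSerialization without declaring
// Decodable models for every possible file layout.
@dynamicMemberLookup
struct JSONNode {
    let value: Any

    subscript(dynamicMember key: String) -> JSONNode? {
        guard let dict = value as? [NSString: Any],
              let child = dict[key as NSString] else { return nil }
        return JSONNode(value: child)
    }
}

// Usage: `root.model?.vocab` drills into {"model": {"vocab": ...}},
// and the vocab dictionary keeps its NSString keys intact.
let json = #"{"model": {"vocab": {"a": 0}}}"#.data(using: .utf8)!
let root = JSONNode(value: try! JSONSerialization.jsonObject(with: json))
if let vocab = root.model?.vocab?.value {
    print(vocab) // { a = 0; }
}
```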
I'm not sure how to deal with this.
Originally posted by @pcuenca in #113 (comment)