Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Failed to querying with like operator for a record contains Chinese chars #170

Open
GoogleCodeExporter opened this issue Jun 24, 2015 · 10 comments

Comments

@GoogleCodeExporter
Copy link

What steps will reproduce the problem?
1. Download the source code which help you to reproduce the issue. Download 
link: https://docs.google.com/open?id=0BwKD9xrjSqz0Vzd6LUV5aGhkeFU

2. Open the solution CsharpSqlite_LikeOperatorTest.sln.

3. In the Visual Studio, run  
CsharpSqlite_LikeOperatorTest\Debug\TestSqliteCInterface.exe. This is an app 
which uses C interface of sqlite3. It query the database with the statement: 
select * from Stock where Name like '%翡翠%'. And It will successfully get 1 
row result.

Note: please don't run directly in the explorer's folder since the app relys on 
Visual Studio's environment path settings.

4.Also in the Visual Studio, run 
CsharpSqlite_LikeOperatorTest\TestCsharpSqliteInterface\bin\Debug\TestCsharpSqli
teInterface.exe. This is an app which uses Csharp-Sqlite. It query the database 
uses the same sql query as step 3:
 select * from Stock where Name like '%翡翠%'.
But no records will be returned.



What is the expected output? What do you see instead?
In step 4, a record should be returned:
35|110000020|满绿翡翠钻石黄金戒指|oeifjasdjfadsjfakdjfalkjaljdg|12000.
0|1331834280|10.0|只|1|0|1331834280|1|0|1|



What version of the product are you using? On what operating system?
SQLite version 3.7.10 2012-01-16 13:28:40
csharp-sqlite_3_7_7_1_71.zip



Please provide any additional information below.
The database Ptolemaic.db have a table to be used in the reproducing code, is 
Stock. In the table there is only one record:
35|110000020|满绿翡翠钻石黄金戒指|oeifjasdjfadsjfakdjfalkjaljdg|12000.
0|1331834280|10.0|只|1|0|1331834280|1|0|1|
If you query the table with:
 select * from Stock where Name like '%满绿翡翠钻石黄金戒指%'
CSharp-sqlite will return the only record. But if you query with
 select * from Stock where Name like '%翡翠%'
CSharp-sqlite will return nothing.
In both cases, Sqlite3 C interface will always return the only record correctly.

So many thanks!

Original issue reported on code.google.com by [email protected] on 24 Oct 2012 at 12:55

@GoogleCodeExporter
Copy link
Author

Original comment by [email protected] on 29 Oct 2012 at 7:41

  • Changed state: Reviewed

@GoogleCodeExporter
Copy link
Author

It seems that in file utf_c.cs in function

static u32 sqlite3Utf8Read(
    string zIn,          /* First byte of UTF-8 character */
    ref string pzNext   /* Write first byte past UTF-8 char here */
    )

the line:
          while ( zIndex != zIn.Length && ( zIn[zIndex] & 0xc0 ) == 0x80 )


needs to be changed to: 
          while ( zIndex != zIn.Length && ( zIn[zIndex -1] & 0xc0 ) == 0x80 )



Being new to the coding architecture of SQLite / c#-SQLite, it took me some two 
days to find this. Please help me testing whether this now fits for all UTF-8 
characters.

Best regards

LLD

Original comment by [email protected] on 23 Dec 2012 at 10:46

@GoogleCodeExporter
Copy link
Author

Thanks LLD for your effort!
I've tested your fix and it worked for the query:
select * from Stock where Name like '%翡翠%

But I prepared another database file, and tried this query but failed:
"select * from Stock where Name like \'%玛瑙%\'"

REPRO STEPS:
1. Download the test code 
https://docs.google.com/open?id=0BwKD9xrjSqz0cUNsSEpBSjNxeWM . I've changed the 
code to search %玛瑙%. And a new database file is in the zip file.
2. Run TestCsharpSqliteInterface
3. Check the follow function's result.
private static string TestQuery(Sqlite3.sqlite3 database, string sql)

Expected:
The query should return 2 records.

Actually:
No records returned.

Other Information:

There are two records in the database.
sqlite> select * from Stock;
964|20000458|南红玛瑙挂坠||800.0|1353415371|3.0|件|0|0|1353415803|0|0|2|2
0000458 2000-0458 南红玛瑙挂坠 NA;hong;MA;NAO;GUA;ZHUI;   2012/11/20 
12:42:50 2012/11/20 12:50:03
973|20000458|南红玛瑙挂坠||800.0|1353415371|3.0|件|1|0|1353418029|0|1|2|2
0000458 2000-0458 南红玛瑙挂坠 NA;hong;MA;NAO;GUA;ZHUI;   2012/11/20 
12:42:51 2012/11/20 13:27:09



Original comment by [email protected] on 24 Dec 2012 at 2:56

@GoogleCodeExporter
Copy link
Author

Hm, I admit I only did few testing. Actually, this function is pretty odd. When 
I comment out most of its code it seems to cover all LIKE operations:

      int zIndex = 0;
      u32 c = zIn[zIndex++];

      //if ( c >= 0xc0 )
      //{
      //  //if ( c > 0xff ) c = 0;
      //  //else
      //  {
      //    //c = sqlite3Utf8Trans1[c - 0xc0];            1100 0000  1000 0000
      //    while ( zIndex != zIn.Length && ( zIn[zIndex -1] & 0xc0 ) == 0x80 )
      //    {
      //      c = (u32)( ( c << 6 ) + ( 0x3f & zIn[zIndex++] ) );
      //    }
      //    if ( c < 0x80
      //    || ( c & 0xFFFFF800 ) == 0xD800
      //    || ( c & 0xFFFFFFFE ) == 0xFFFE )
      //    {
      //      c = 0xFFFD;
      //    }
      //  }
      //}
      pzNext = zIn.Substring( zIndex );
      return c;


Of course, comparison is now done in UNICODE, not UTF-8, but for the LIKE 
operator this seems to work.
Sunshak, can you try this out?

I will further look into this during the next days and hopefully will be able 
to come up with a more solid solution.

Original comment by [email protected] on 24 Dec 2012 at 4:14

@GoogleCodeExporter
Copy link
Author

Hi LLD,

Looks nice. I've tested a couple of queries, and did not find any issue yet. I 
will make a deep test later.

Thank you for your help!  :-)

Original comment by [email protected] on 24 Dec 2012 at 8:34

@GoogleCodeExporter
Copy link
Author

I've tried as many keywords as possible, and no issue found.
The fix looks reliable.
I think the fix could be get reviewed by the project owner.

Original comment by [email protected] on 24 Dec 2012 at 12:47

@GoogleCodeExporter
Copy link
Author

Have you tested with the full testsuite or just this one edge case?

Original comment by [email protected] on 24 Dec 2012 at 1:21

@GoogleCodeExporter
Copy link
Author

I've tested it manually with keywords random came into my mind, and as many as 
possible, including this one edge case.

Original comment by [email protected] on 24 Dec 2012 at 1:32

@GoogleCodeExporter
Copy link
Author

The function in comment #4 correctly handles all UNICODE characters in the 
range of UNICODE code points U+0000 to U+FFFF (which covers the most frequently 
used characters). However, I think it won't correctly handle characters in the 
code point range U+10000 to U+10FFFF. This conclusion is not (yet) based on 
tests but on the following literature:

The Microsoft docs say: 

"Indexes [of Strings]
An index is the position of a Char object, not a Unicode character, in a 
String. An index is a zero-based, nonnegative number starting from the first 
position in the string, which is index position zero. Consecutive index values 
might not correspond to consecutive Unicode characters because a Unicode 
character might be encoded as more than one Char object. To work with each 
Unicode character instead of each Char object, use the 
System.Globalization.StringInfo class. 
(http://msdn.microsoft.com/en-us/library/system.string(v=vs.80).aspx)"

Since the function gets an index value (u32 c = zIn[zIndex++];), this will 
always be a 16 bit value that may not equal a character's UNICODE value in case 
the character is a surrogate or combining character.

The Microsoft docs for the StringInfo class 
(http://msdn.microsoft.com/en-us/library/system.globalization.stringinfo(v=vs.80
).aspx) include a sample that I think serves as a good starting point to make 
function sqlite3Utf8Read capable of handling characters beyond code point 
U+FFFF correctly.

BUT then again: The original SQLite c code comments say that "this routine 
never allows a UTF16 surrogate value to be encoded [...]" and instead returns 
0xfffd in such cases. The question is whether C#-SQLite should correctly 
implement the UNICODE standard here or rather act the same way SQLite does. 

I will conduct further tests on this issue and work out a fix if necessary. 
Unfortunately, I won't be able to do this before something like mid of January.

Best regards

LLD

Original comment by [email protected] on 26 Dec 2012 at 2:57

@GoogleCodeExporter
Copy link
Author

I really appreciate your work. Hope you will be back after the new year 
vacation.

Happy new year!  :-)

Xueqing Sun

Original comment by [email protected] on 27 Dec 2012 at 2:08

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

1 participant