Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add string.HashLower() and string.Equals(other, caseSensitive) or other solution? #330

Open
GWRon opened this issue Sep 3, 2024 · 3 comments

Comments

@GWRon
Copy link
Contributor

GWRon commented Sep 3, 2024

For now any string comparison is already sped up by a "hash" (generated during compilation for defined strings and during runtime for others).

But what happens if you want to eg use a custom sort "by string value" in a case insensitive manner? Such a sort function would look like this:

	Function SortByName:Int(o1:Object, o2:Object)
		Local p1:TProgrammeLicence = TProgrammeLicence(o1)
		Local p2:TProgrammeLicence = TProgrammeLicence(o2)
		If Not p2 Then Return 1
		If Not p1 Then Return -1
		
		'remove "ToLower" for case sensitive comparison
		Local t1:String = p1.title.ToLower()
		Local t2:String = p2.title.ToLower()
		
        If t1 > t2
			Return 1
        ElseIf t1 < t2
			Return -1
		Else
			Return p1.guid > p2.guid
		EndIf
	End Function

So for each time an object is asked to get "compared with another" it will create a "to lower" string.

Couldn't we also get a "case-insensitive hash" or do you think this is too much (as it adds another property to all strings) ?

I know I could - in this example case - have "Field titleLowerCaseHash:Long" which I manually create and recreate on each "title"-change but it does not look "that smart" either.

Do you (others) have a suggestion on how to solve such stuff (a "performant" case insensitive sort implementable in the "custom sort function"-fashion BlitzMax offers) ?

@HurryStarfish
Copy link
Member

I definitely wouldn't want an additional hash stored in strings. Strings are used all over, and this would add overhead to all of them. For the relatively rare case of case-insensitive comparisons being needed (and even rarer, being a performance problem), that does seem worth the cost.
I could see something like this being added (as an overload for the Compare method?), giving you a parameter to specify how the comparison should be done (and could be extended later for stuff like culture-sensitive comparisons).

Do you (others) have a suggestion on how to solve such stuff (a "performant" case insensitive sort implementable in the "custom sort function"-fashion BlitzMax offers) ?

With what's currently available? Your code sample looks like you have (somewhat long-lived?) TProgrammeLicence objects storing the specific strings that you want to compare. Instead of your TProgrammeLicence.title being a string, you could make it a SProgrammeLicenceName struct storing both the original string and the lower-case string or hash. You can then give it struct a Compare method and/or operators to do the case-insensitive comparison you want. The lower-case data can be initialized in the constructor if you expect most titles to be compared, or on the first comparison if only some of them will be.

@GWRon
Copy link
Contributor Author

GWRon commented Sep 26, 2024

Regarding structs - for this your reflection branch needs to be rolled out as I serialize these kind of objects via reflection (Brucey's TPersistence) which does not work with structs in its current incarnation.

Using a special "lookup-lowercase-copy" (which I mark as not necessary for serialization, so only used during "runtime) is an idea which I had too - but it of course would require a "synchronization" (title might change randomly - eg in my case containing actual score of a sports match too). So changes would require "compares" and invalidate the "lookup-lowercase-copy" on changes.

This is why I thought a "in the core" solution would be the better approach - it hides away the issues you will face if you want a faster and lightweight "case insensitive compare".
(But I will still consider adding the cache stuff ... it would not be the only piece in my code doing such caching to speed up things)

Maybe a "not the best but at least better" solution would be to provide a c-written function comparing passed strings case-insensitively - maybe even with handling "utf8" too (or for performance reasons provide two functions - for "normal, ignoring utf8" and one for "including utf8").
My current approach:

		'remove "ToLower" for case sensitive comparison
		Local t1:String = p1.GetTitle().ToLower()
		Local t2:String = p2.GetTitle().ToLower()

		If t1 > t2
			Return 1
		ElseIf t1 < t2
			Return -1
		Else
			Return p1.GetGUID() > p2.GetGUID()
		EndIf

creates 2 new gc objects (the strings) for each compare it does.

But I assume it could be "more lightweight / less gc stress" if we could do something like:

		Local sComparison:Int = p1.GetTitle().Compare( p2.GetTitle(), False)
		'or
		'Local sComparison:Int = StringCompare(p1.GetTitle(), p2.GetTitle(), False)

		If sComparison > 0
			Return 1
		ElseIf sComparison < 0
			Return -1
		Else
			Return p1.GetGUID() > p2.GetGUID()
		EndIf

Of course this would not have the speed beenfits of the hash comparison but at least would avoid affecting the GC by outsourcing the comparison to C.
I assume "case insensitive string comparison" is not only used by me (but maybe I am the only one fearing the GC affection :-)).

@HurryStarfish
Copy link
Member

HurryStarfish commented Sep 26, 2024

Using a special "lookup-lowercase-copy" (which I mark as not necessary for serialization, so only used during "runtime) is an idea which I had too - but it of course would require a "synchronization" (title might change randomly - eg in my case containing actual score of a sports match too).

That's why I suggested creating a struct that contains both. Then setting the title shouldn't require anything changes other than writing p.title = New SProgrammeLicenceName(someString) instead of p.title = someString. The lower-case string stored in the struct needs to be created at most once per title change, not on every comparison.

But I assume it could be "more lightweight / less gc stress" if we could do something like

Local sComparison:Int = p1.GetTitle().Compare( p2.GetTitle(), False)

That's what I was referring to at the start (preferrably with an EStringComparison enum rather than a boolean parameter). Whether this would give you better or worse performance than the custom struct solution would probably depend on the specifics of your use case, e.g. how often do you compare the strings vs. how often to you change them.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants