
initial proof of concept: less string allocations #1826

Draft · wants to merge 1 commit into master

Conversation

@JanEggers commented Jul 7, 2021

Fixes #1825.

I used the benchmark project to generate a file with 1,000,000 records and then ran the benchmark to read them with:

```csharp
using (var parser = new CsvReader(reader, config))
{
    foreach (var item in parser.GetRecords<Program.Columns50>())
    {
        // intentionally empty: the loop only measures read + materialization cost
    }
}
```
| Method | Mean    | Error    | StdDev   | Gen 0        | Gen 1 | Gen 2 | Allocated | Branch |
|--------|---------|----------|----------|--------------|-------|-------|-----------|--------|
| Parse  | 7.599 s | 0.1678 s | 0.4759 s | 1176000.0000 | -     | -     | 4.58 GB   | master |
| Parse  | 4.746 s | 0.0875 s | 0.1622 s | 338000.0000  | -     | -     | 1.32 GB   | pr     |

- +40% speed
- -100% string allocations when reading records
- -100% Type[] allocations when using the default constructor

If the general design is accepted, I will implement more type converters and come up with a way to check whether a row should be skipped without allocating.
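
For illustration, a minimal sketch of the opt-in shape such a design could take: existing string-based converters keep working, and the reader only takes the allocation-free path when a converter also implements a span interface. The `ISpanConverter` name and signature here are hypothetical, not necessarily what this PR actually adds.

```csharp
using System;

// Hypothetical opt-in interface; existing string-based converters are untouched.
public interface ISpanConverter
{
    object ConvertFromSpan(ReadOnlySpan<char> field);
}

public class Int32SpanConverter : ISpanConverter
{
    public object ConvertFromSpan(ReadOnlySpan<char> field)
    {
        // Parses directly from the buffer; no intermediate string is allocated.
        // (Boxing the int still allocates, same as the existing converters.)
        return int.Parse(field);
    }
}
```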

@JanEggers changed the title from "initial proof of concept" to "initial proof of concept: less string allocations" on Jul 7, 2021
@JoshClose (Owner)

Can you also benchmark with the config option `CacheFields = true` for each of those and add them to your results?

@JanEggers (Author)

Sure.

@JanEggers (Author)

I modified the test set to generate 100,000 rows with random int values, as just having 50 values repeating over and over is not realistic.

Here are the numbers with caching enabled:

| Method | Mean    | Error    | StdDev   | Gen 0      | Gen 1      | Gen 2     | Allocated | Branch |
|--------|---------|----------|----------|------------|------------|-----------|-----------|--------|
| Parse  | 4.165 s | 0.0830 s | 0.1558 s | 72000.0000 | 29000.0000 | 7000.0000 | 708.96 MB | master |
| Parse  | 1.138 s | 0.0226 s | 0.0532 s | 33000.0000 | -          | -         | 135.36 MB | pr     |

@JoshClose (Owner)

Dang. I was planning on doing this at some point, so it's great that you could do it. This will be a breaking change, so I may deploy it with some other changes. We'll see.

@JanEggers (Author)

I would really love to also support netstandard and net4.5 properly, but I didn't find a parser; the span-based parser overloads are only present in netcoreapp.

@JoshClose (Owner)

You might be able to use this. https://www.nuget.org/packages/System.Memory/

@JoshClose (Owner)

There is a lot more work that needs to go into this. All the type converters would need to be switched to use Span. I would think the parser could just return a ReadOnlySpan instead of strings. The consumer could just call .ToString() if they want.
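
A sketch of the consumption pattern that implies, with a hypothetical span-returning accessor (`GetFieldSpan` is illustrative; today's parser returns string fields); only callers that actually need a string pay for the allocation:

```csharp
// Hypothetical span-returning field accessor.
ReadOnlySpan<char> field = parser.GetFieldSpan(0);

// Numeric consumers can parse straight from the span...
int id = int.Parse(field);

// ...and anyone who really wants a string opts into the allocation explicitly.
string name = parser.GetFieldSpan(1).ToString();
```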

@JoshClose (Owner)

CacheFields does nothing when using Span. The field cache stores created instances of strings and returns them instead of calling new. I'm curious about the speed difference between getting records normally with CacheFields on and using Span. Will caching the strings be faster than not allocating the strings to begin with?
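
For context, the field cache behaves roughly like a string interning pool: hash the characters, look for an existing string with the same contents, and only allocate on a miss. A simplified sketch, not the actual CsvHelper implementation:

```csharp
using System;
using System.Collections.Generic;

// Simplified sketch: identical char sequences map to one shared string
// instance, so a duplicated field value is only ever allocated once.
public class FieldCache
{
    private readonly Dictionary<int, List<string>> buckets = new Dictionary<int, List<string>>();

    public string GetField(ReadOnlySpan<char> chars)
    {
        int hash = ComputeHash(chars);
        if (!buckets.TryGetValue(hash, out var bucket))
        {
            buckets[hash] = bucket = new List<string>();
        }

        // Collisions are resolved by comparing the actual characters.
        foreach (var cached in bucket)
        {
            if (chars.SequenceEqual(cached.AsSpan()))
            {
                return cached;
            }
        }

        // Miss: allocate once and hand out this instance from now on.
        var s = chars.ToString();
        bucket.Add(s);
        return s;
    }

    private static int ComputeHash(ReadOnlySpan<char> chars)
    {
        // FNV-1a; any stable hash over the chars works here.
        unchecked
        {
            int hash = (int)2166136261;
            foreach (var c in chars)
            {
                hash = (hash ^ c) * 16777619;
            }
            return hash;
        }
    }
}
```

This also shows where the cost goes: every lookup hashes the field and compares characters even on a hit, which is the trade-off discussed below.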

@JoshClose (Owner)

Seems to be a lot slower when using the field cache. If the data is duplicated, the field cache should help; otherwise it's a lot more processing.

@JanEggers (Author) commented Jul 13, 2021

> You might be able to use this. https://www.nuget.org/packages/System.Memory/

I will try it, but Utf8Parser assumes UTF-8 encoding, so we would need to convert a span of char to a span of byte, which needs to be allocated / array-pooled and also takes some time.
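
To make that transcoding cost concrete: Utf8Parser in that package works on ReadOnlySpan<byte> of UTF-8, so char data coming out of a TextReader has to be converted first. A sketch of the round trip, assuming a framework where the span-based Encoding overloads are available:

```csharp
using System;
using System.Buffers;
using System.Buffers.Text;
using System.Text;

static bool TryParseInt32(ReadOnlySpan<char> field, out int value)
{
    // Utf8Parser only understands UTF-8 bytes, so the chars must be
    // transcoded first; this rent/encode step is the extra cost in question.
    byte[] rented = ArrayPool<byte>.Shared.Rent(Encoding.UTF8.GetMaxByteCount(field.Length));
    try
    {
        int byteCount = Encoding.UTF8.GetBytes(field, rented);
        return Utf8Parser.TryParse(rented.AsSpan(0, byteCount), out value, out _);
    }
    finally
    {
        ArrayPool<byte>.Shared.Return(rented);
    }
}
```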

> There is a lot more work that needs to go into this.

I agree; I was just proposing the design to get feedback.

> I would think the parser could just return a ReadOnlySpan instead of strings.

Sure, that's possible, but breaking. With my approach all old converters (also the ones of library consumers) still work, and the span-based converter is an opt-in. I'm also fine with just converting all string-returning methods to span, as that would not pollute the API with a bunch more overloads.

> Will caching the strings be faster than not allocating the strings to begin with?

No, caching strings is not faster; see my second benchmark post. I didn't use the field cache because it makes no sense to cache a span: it's just a pointer into existing data (very low cost), and just calculating the hash and SequenceEqual in the cache lookup is more expensive than just getting the span. Also, using the field cache with many different values, you get Gen 2 allocations, which is not desirable.

Another idea would be to cache the converted object instead of the raw value; that way you could eliminate the converter call. But that may need some benchmarking as well, as I'm not sure whether int.Parse and similar are actually slower than the cache lookup.
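
A sketch of that converted-value idea: the same hash-plus-SequenceEqual lookup a field cache needs, but the bucket stores the boxed converted value, so the converter call is skipped on a hit. Hypothetical code, not part of this PR; whether it beats int.Parse is exactly what would need benchmarking:

```csharp
using System;
using System.Collections.Generic;

public class ConvertedValueCache
{
    private readonly Dictionary<int, List<(string Text, object Value)>> buckets =
        new Dictionary<int, List<(string Text, object Value)>>();

    public object GetOrConvert(ReadOnlySpan<char> field, Func<string, object> convert)
    {
        int hash = 0;
        foreach (var c in field)
        {
            hash = hash * 31 + c; // cheap stable hash over the raw chars
        }

        if (!buckets.TryGetValue(hash, out var bucket))
        {
            buckets[hash] = bucket = new List<(string Text, object Value)>();
        }

        foreach (var (text, value) in bucket)
        {
            if (field.SequenceEqual(text.AsSpan()))
            {
                return value; // hit: no parse and no allocation
            }
        }

        var s = field.ToString();   // miss: allocate the key once
        var converted = convert(s); // and pay the converter call once
        bucket.Add((s, converted));
        return converted;
    }
}
```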

@JoshClose (Owner)

> Another idea would be to cache the converted object instead of the raw value; that way you could eliminate the converter call.

That could be an issue if people are expecting a new object each time. If they have a list of objects and edit one, they could be editing multiple without knowing.

Also, that is exactly what `EnumerateRecords<T>(T record)` does. You supply the object and it gets filled in.

Maybe you were talking about the member values and not the object itself. That's an interesting idea. I think that would work fine with Span also.
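
For reference, the reuse pattern that method enables, along with the aliasing caveat just described (`Columns50`, `reader`, and `config` are the benchmark names from earlier in the thread):

```csharp
var record = new Program.Columns50();

using (var csv = new CsvReader(reader, config))
{
    // The same instance is re-hydrated on every iteration, so nothing new is
    // allocated per row. Do NOT store the yielded references: they all point
    // at the single reused object.
    foreach (var row in csv.EnumerateRecords(record))
    {
        // consume row here; copy values out if they must outlive the loop
    }
}
```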

@JanEggers (Author)

> Maybe you were talking about the member values and not the object itself.

Yup, that is what I meant to say.

@JoshClose (Owner)

> I will try it, but Utf8Parser assumes UTF-8 encoding, so we would need to convert a span of char to a span of byte, which needs to be allocated / array-pooled and also takes some time.

I don't understand. What is Utf8Parser?

@JanEggers (Author)

The parser in that NuGet package.

@JoshClose (Owner)

Oh... I just meant to be able to use Span on the older frameworks.

@JanEggers (Author)

That's working fine with the dependencies we have, but we need something that turns the spans into ints, longs, DateTimes, and so on.
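
On netcoreapp those conversions already exist as span overloads on the numeric and date types themselves; it is only the older targets that lack them. For example:

```csharp
using System;
using System.Globalization;

// .NET Core 2.1+ ships span-based TryParse overloads, so no string is needed:
int.TryParse("123".AsSpan(), out int i);
long.TryParse("456".AsSpan(), NumberStyles.Integer, CultureInfo.InvariantCulture, out long l);
DateTime.TryParse("2021-07-13".AsSpan(), CultureInfo.InvariantCulture, DateTimeStyles.None, out DateTime d);
```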


@JoshClose (Owner)

Ha. I completely forgot about the converters needing a TryParse overload that takes in Span. Oops.

@JoshClose (Owner) commented Jul 13, 2021

Maybe in the case of older frameworks the converters can do a ToString.
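
A sketch of what that fallback could look like inside a converter, using conditional compilation; the symbol and member names are illustrative:

```csharp
public object ConvertFromSpan(ReadOnlySpan<char> field)
{
#if NETCOREAPP
    // Modern targets: parse straight from the span, allocation-free.
    return int.Parse(field);
#else
    // Older targets (netstandard2.0 / net4.5 via System.Memory): fall back
    // to a string. Correct everywhere, just not allocation-free.
    return int.Parse(field.ToString());
#endif
}
```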

@JanEggers (Author)

Yep, that will always work.

@JoshClose (Owner)

When I get some time I'll play around with this idea and see whether I want the parser to return only Span or have another method. I see the parser as mostly an internal thing. Changing all the converters would break any custom converter people have created, but at minimum they could just do a ToString and it should behave the same. I'd have to check how this would all work with mapping `Convert`.

@Rand-Random commented Mar 3, 2023

I am currently reading a large CSV file (~8 million records) and analyzing performance bottlenecks with dotTrace (JetBrains). While looking, I found these performance metrics:

[screenshot of dotTrace performance metrics]

After a Google search I came across this PR, and would like to ask if there is any progress on this front?
