sequence() was the bottleneck in my program #451

KJ7LNW · 2023-08-14T20:59:52Z

I was surprised to find that the sequence() function was the bottleneck in a Particle Swarm optimization. Ultimately I cached $sequence = sequence(@foo) instead of generating it every time it was used.

Not sure if there is room to optimize sequence or not, but here is the nytprof output.

Before caching:

After caching:

Caching fixed it for me in my own program, but FYI in case it helps others, or if it can be addressed somehow.

The text was updated successfully, but these errors were encountered:

Optimizations: * cache sequence(). See PDLPorters/pdl#451 * Use scalars when comparing 1,1-piddles * Only evaluate slices when necessary. For example, bestPos is only used when a better fit is found ($sfit < $bestfit) * Remove ->copy from temp variable, there is no re-assignment. * Group scalar math together with parens _before_ applying that result to a PDL. Possible performance side-effects: * When logging is enabled, some PDL operations happen more than once, but I bet the printf takes longer than the PDL op. * getBestFit() creates a PDL on return, so maybe a touch slower, but it drastically speeds up internals with scalar comparisons.

mohawk2 · 2024-02-06T21:42:06Z

Notes on profiling C/XS code: https://www.perlmonks.org/?node_id=791611

mohawk2 · 2024-02-17T19:14:22Z

With the above, with_time { sequence(1e2) for 0..6e4 } went from (on my machine) 850ms to 790ms.

mohawk2 · 2024-02-17T20:44:36Z

Ironically, the topdl change, which is causing a segfault thanks to a fault in the var-args stuff for barf, doesn't make sequence go any faster once the inplace change is made (since the XS inplace obviously doesn't use topdl at all), which at this writing is the only one on master, and is getting tried a third time against downstream. With that, the above snippet takes more like 690ms.

mohawk2 · 2024-02-29T22:09:34Z

With the changes linked above, on my machine it's now about twice as quick (450ms) as when I started.

mohawk2 · 2024-03-18T21:34:14Z

With the above-linked commit, this is now taking about 380ms here.

Notes on identifying this and probably similar performance bottlenecks, using cachegrind:

$ rm cachegrind.out.*
$ make core && valgrind --tool=cachegrind perl -Mblib -MPDL -e 'sequence(1e2) for 1..1e4'
$ cg_annotate cachegrind.out.*
[...]
113,127,912 (14.74%) 463,865 ( 2.48%)  89 ( 0.59%) 40,271,896 (18.65%)   819,306 (13.70%)   698 ( 1.61%) 13,008,932 (11.12%)  47,211 ( 5.03%)      23 ( 0.01%)  ???:Perl_hv_common
 28,971,054 ( 3.77%) 118,977 ( 0.64%) 140 ( 0.93%)  6,323,461 ( 2.93%)   218,400 ( 3.65%)   314 ( 0.73%)  3,345,234 ( 2.86%)   7,931 ( 0.84%)     511 ( 0.30%)  ???:Perl_yyparse
[...]
  8,390,000 ( 1.09%) 110,000 ( 0.59%)  11 ( 0.07%)    560,000 ( 0.26%)    20,001 ( 0.33%)     2 ( 0.00%)  1,090,000 ( 0.93%)       0                0           /home/osboxes/pdl-code/Basic/Primitive/pp-axisvalues.c:pdl_axisvalues_readdata
[...]

The first two are the top-two expensive functions, the most so being the unexpected Perl_hv_common at nearly 15%. The first directly PDL-defined one, the readdata one, only uses 1.1% of CPU. To find out what was calling that took a bit of trickery, on the suspicion that it was SvPDLV that was slow, running inplace twice and inserting a break on Perl_hv_common after the first one, since Perl_hv_common gets run during Perl startup:

$ make core && gdb perl -ex 'break XS_PDL_inplace' -ex 'run -Mblib -MPDL -e "pdl(5)->inplace for 1..2"'
[...]
Function "XS_PDL_inplace" not defined.
Make breakpoint pending on future shared library load? (y or [n]) y
[...]
(gdb) break Perl_hv_common
Breakpoint 2 at 0x55555564c840
(gdb) c
Continuing.

Breakpoint 2, 0x000055555564c840 in Perl_hv_common ()
(gdb) bt
#0  0x000055555564c840 in Perl_hv_common ()
#1  0x00005555556d9f76 in S_isa_lookup ()
#2  0x00005555556da129 in S_sv_derived_from_svpvn ()
#3  0x00007ffff7333e94 in pdl_SvPDLV (sv=sv@entry=0x5555558c4600) at pdlcore.c:37
#4  0x00007ffff7300c85 in XS_PDL_inplace (cv=<optimised out>) at /home/osboxes/pdl-code/Basic/Core/Core.xs:192
[...]
(gdb) frame 3
#3  0x00007ffff7333e94 in pdl_SvPDLV (sv=sv@entry=0x5555558c4600) at pdlcore.c:37
37	   if(sv_derived_from(sv, "PDL") && !SvROK(sv)) {

That confirmed it was indeed coming from SvPDLV, which calls the function as part of an inheritance lookup. This is how long 10m inplace calls take before the immediate change:

$ make core && time perl -Mblib -MPDL -e '$p = pdl(5); $p->inplace for 1..1e7'
real	0m3.776s
[...]
real	0m3.663s
[...]
real	0m3.529s

This is how long after, about 20% less time:

real	0m2.871s
[...]
real	0m2.849s
[...]
real	0m2.895s

…#451

mohawk2 · 2024-03-18T21:51:04Z

With that commit, the 10m inplace now take about 1.2s, about 3 times quicker than before today.

The sequence stuff now takes about 320ms.

mohawk2 · 2024-06-18T02:45:38Z

For more performance issues, see the benchmark on nrdvana/perl-Math-3Space#8. The stripped-down timings:

pdl> $N = 1e6
pdl> with_time { $vecs_1pdl = pdl(([1,1,1]) x $N) }
1260.28 ms
pdl> p $basis = identity(3)
[
 [1 0 0]
 [0 1 0]
 [0 0 1]
]
pdl> with_time { $vecs_1pdl x $basis }
40.527 ms
pdl> with_time { $vecs_1pdl x $basis }
46.82 ms
pdl> with_time { $vecs_1pdl x $basis }
21.061 ms
pdl> p $vecs_1pdl->info
PDL: Double D [3,1000000]
pdl> use PDL::LinearAlgebra
pdl> with_time { $vecs_1pdl x $basis }
17.646 ms
pdl> with_time { $vecs_1pdl x $basis }
12.661 ms
pdl> with_time { $vecs_1pdl x $basis }
12.916 ms
pdl> with_time { $vecs_1pdl x $basis }
13.946 ms

Here, PDL::LinearAlgebra does make things quicker.

nrdvana · 2025-01-04T18:22:42Z

It appears that in the most common case of a PDL method being called on an existing PDL instance, this code runs:

pdl/lib/PDL/Core/pdlcore.c

Lines 39 to 41 in fc7e9b6

    
           if ( !SvROK(sv) ) { 
        
              if (sv_derived_from(sv, "PDL")) /* object method called as class method */ 
        
                 pdl_pdl_barf("called object method on 'PDL' or similar");

This means every single call to any pdl method will perform an "sv_derived_from" check, which isn't cheap. (I mean it's not terrible, but it's more than walking a few quick pointers) If that logic could be changed to lead with if (SvMAGICAL(sv)) and then look up the PDL magic first, the common case would get on with things while the edge cases would take the detour to figure out what happened.

I can't easily judge whether you could put the magic directly onto the hashref without rearranging a bunch of other assumptions in the code. I see it doing interesting things like checking if ->{PDL} is a coderef and then executing that coderef... Hopefully you have a better idea of how much other code it would touch, but yes looking up a MAGIC record on a hashref should be faster than looking up a hash bucket by a string.

mohawk2 · 2025-01-04T20:11:47Z

@nrdvana It's great to have your thoughts on this! Can you confirm one thing: line 39 above constrains that bit to only non-ref - i.e. strings. Therefore, this section will only get run for class methods (i.e. not every single call of instance methods - that was one of the optimisations I made, referred to earlier), which is different from your thought above?

nrdvana · 2025-01-04T21:04:53Z

@mohawk2 Indeed, I misread what was going on. But, the end conclusion is the same because it calls sv_derived_from lower down:

pdl/lib/PDL/Core/pdlcore.c

Lines 113 to 118 in fc7e9b6

    
           if (SvTYPE(SvRV(sv)) != SVt_PVMG) 
        
             croak("Error - tried to use an unknown data structure as a PDL"); 
        
           if (!sv_derived_from( sv, "PDL")) 
        
             croak("Error - tried to use an unknown Perl object type as a PDL"); 
        
           sv2 = (SV*) SvRV(sv);

Another idea I just had is that you could attach new PDL magic to the ndarray, and still leave the pointer stored in the SV and/or HV->{PDL} to provide backward compatibility. The detection of the magic would go first in this function (make the common case fast) and then fall back to the other stuff.

mohawk2 · 2025-01-04T21:16:51Z

That does sound like the natural conclusion from your XS guide. And an obvious place to insert that would be in pdlcore.[ch] (which contain the more Perl-specific bits of PDL, which this definitely is). Are you interested in having a go? Note that there's pdlmagic.*, which are a bit of PDL-specific "magic" (based on, but independent of, Perl's magic) and some POSIX threading stuff, because that's where it got put back in the day.

nrdvana · 2025-01-09T18:17:05Z

@mohawk2 I didn't get far, but this change doesn't break any tests. Want to benchmark it?

mohawk2 · 2025-01-12T21:25:27Z

@nrdvana Before applying your change, I got these results:

pdl> $p = pdl(0); with_time { $p->inplace for 1..1e7 } for 1..5
1232.44 ms
1222.37 ms
1227.75 ms
1221.71 ms
1219.91 ms

After:

pdl> $p = pdl(0); with_time { $p->inplace for 1..1e7 } for 1..5
818.978 ms
797.802 ms
794.102 ms
799.054 ms
795.801 ms

A little over 50% faster for a simple-case method call, very nice! I've cherry-picked it and tweaked it so it doesn't give compiler warnings.

I've removed the MGf_DUP stuff because as you noted in comments, PDL does its own thing re threads. Currently it has a CLONE_SKIP method returning 1. I couldn't find any docs at all on what the API is for such a magic type, but if you know better and have advice, I'm all ears. I'm not opposed to adding the rest of the stuff you put in your commit, but I'd like to understand it better first.

This links the pdl to the SV un-ambiguously, so that looking up the pdl from an SV can skip checking the inheritance hierarchy. More functionality could be moved into the extension magic later, but this is just a preliminary test to see if it speeds things up.

mohawk2 added a commit that referenced this issue Feb 17, 2024

XS topdl - #451

0129b99

mohawk2 added a commit that referenced this issue Feb 17, 2024

XS inplace - #451

a696aae

mohawk2 added a commit that referenced this issue Feb 17, 2024

XS inplace - #451

af807df

mohawk2 added a commit that referenced this issue Feb 17, 2024

XS inplace - #451

b5f578f

mohawk2 added a commit that referenced this issue Feb 17, 2024

XS topdl - #451

6d015dc

mohawk2 added a commit that referenced this issue Feb 18, 2024

XS new_from_specification - #451

95e4ee1

mohawk2 added a commit that referenced this issue Feb 18, 2024

XS new_from_specification - #451

aff3a76

mohawk2 added a commit that referenced this issue Feb 25, 2024

no clump(-1) if already 1D - #451

ba3cae0

mohawk2 added a commit that referenced this issue Mar 18, 2024

SvPDLV do cheaper SvROK before expensive sv_derived_from - #451

7b93a62

mohawk2 added a commit that referenced this issue Mar 18, 2024

SvPDLV move expensive Math::Complex check to if hashref but no {PDL} - …

2b24aa7

…#451

mohawk2 mentioned this issue Mar 31, 2024

better document debug macros #393

Open

mohawk2 added a commit that referenced this issue Mar 31, 2024

add pdl.ntrans_children to speed up destroying ndarrays - #421 #451

91802e9

mohawk2 mentioned this issue Mar 31, 2024

PDL::Core::dog doesn't scale #421

Closed

mohawk2 mentioned this issue Jan 7, 2025

Add benchmarks to check performance #518

Open

6 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

sequence() was the bottleneck in my program #451

sequence() was the bottleneck in my program #451

KJ7LNW commented Aug 14, 2023

mohawk2 commented Feb 6, 2024

mohawk2 commented Feb 17, 2024

mohawk2 commented Feb 17, 2024

mohawk2 commented Feb 29, 2024

mohawk2 commented Mar 18, 2024

mohawk2 commented Mar 18, 2024

mohawk2 commented Jun 18, 2024

nrdvana commented Jan 4, 2025 •

edited

Loading

mohawk2 commented Jan 4, 2025

nrdvana commented Jan 4, 2025

mohawk2 commented Jan 4, 2025

nrdvana commented Jan 9, 2025

mohawk2 commented Jan 12, 2025 •

edited

Loading

sequence() was the bottleneck in my program #451

sequence() was the bottleneck in my program #451

Comments

KJ7LNW commented Aug 14, 2023

mohawk2 commented Feb 6, 2024

mohawk2 commented Feb 17, 2024

mohawk2 commented Feb 17, 2024

mohawk2 commented Feb 29, 2024

mohawk2 commented Mar 18, 2024

mohawk2 commented Mar 18, 2024

mohawk2 commented Jun 18, 2024

nrdvana commented Jan 4, 2025 • edited Loading

mohawk2 commented Jan 4, 2025

nrdvana commented Jan 4, 2025

mohawk2 commented Jan 4, 2025

nrdvana commented Jan 9, 2025

mohawk2 commented Jan 12, 2025 • edited Loading

nrdvana commented Jan 4, 2025 •

edited

Loading

mohawk2 commented Jan 12, 2025 •

edited

Loading