
attoparsec now seems to be slower than joyent http parser #153

Open
flip111 opened this issue Mar 2, 2019 · 8 comments

flip111 commented Mar 2, 2019

Since the 2014 blog post http://www.serpentine.com/blog/2014/05/31/attoparsec/ things seem to have changed. For another project I benchmarked a few HTTP parsers and posted the results here: rust-bakery/parser_benchmarks#25

I get a 99290 / 64687 = 1.53x slowdown on one test and a 759000 / 282021 = 2.69x slowdown on the other (measured in ns) compared to the more recent version of Joyent's http parser.

Perhaps there is some more room for improvement? I've been looking at the Core, but with my untrained eye I couldn't make much of it. All I could find was potentially #152.

bgamari (Collaborator) commented Mar 30, 2019

What is the joyent HTTP parser?

Regardless, I'll have a look at the Core tonight. Sounds like a fun puzzle.

flip111 commented Mar 30, 2019

Hi @bgamari, when you look at the blog post, there is a link "the hand-written C code that powers Node.js's HTTP parser". This link now forwards to https://github.com/nodejs/http-parser/blob/master/http_parser.c. I believe Node.js used to be a Joyent product and has now been moved into its own foundation.

I'd love to know how you go about reading core. I've been having difficulties with it. I was trying to read some core to see why WAI/WARP is slower than other languages in framework benchmarks. But i just don't have the skills to make much sense of it ...

bgamari (Collaborator) commented Mar 30, 2019

@flip111, yes, I've been meaning to write a bit about how I read Core for a long time now.

In general I start by moving the bits that I'm interested in benchmarking into a new module. In the case of the attoparsec benchmark that you linked to, I moved all of the parsers into a new module which exported only a single binding:

parse :: ByteString -> Either String [(Request, [Header])]
parse = P.parseOnly allRequests

This ensures that I'm just looking at the Core of the bindings of interest; these are otherwise often drowned out by other clutter (especially when using a benchmarking framework like criterion).

Next I work out how to compile the program using GHC alone; when working on a performance issue I don't like working through a build tool like cabal or stack. These days this typically involves building it with cabal new-build; this produces a package environment file which allows me to run ghc with all of the necessary dependencies in scope.

Next I dump the Core of the module of interest. When looking at a performance issue the GHC command line will almost always look something like:

$ ghc src/Main.hs  -O -isrc -fforce-recomp \
  -ddump-to-file -ddump-simpl \
  -dsuppress-coercions -dsuppress-idinfo -dsuppress-type-applications

Another useful trick is to enable debug information with -g3. This enables the production of source tick AST nodes which can be helpful in determining the origin of a particular snippet of Core.

Once I have the Core I'll jump into a good editor; syntax highlighting is important since Core tends to be quite large. Typically just using the Haskell lexer is good enough but haskell-emacs also ships a ghc-core-mode. The syntax is pretty close to plain Haskell without infix function application.

When browsing the Core the usual thing I look for is allocation. This is represented in Core by lifted let bindings. For instance:

  let {
    x :: Int
    x = I# y
  } ...

Corresponds to an allocation of two words (one word for the I# info table pointer and another for the unlifted value y). By contrast,

  let {
    x :: Int#
    x = +# y 1#
  } ...

Will be lowered to computing y+1 and storing the result to a register. This generally isn't something you need to worry about.
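
To connect this back to source code, here is a small made-up example (not from the benchmark): compiled with -O, the strict counting loop below becomes a worker over Int# values, so its Core contains only lets of the second, register-level kind. When you instead see lets of the first kind around a number, it usually means the value really is being boxed, e.g. because it is stored in a data structure or passed to a function that is lazy or polymorphic in that argument.

{-# LANGUAGE BangPatterns #-}

-- A made-up example (not from the benchmark): with -O this compiles to a
-- tail-recursive worker over Int#, so the lets in its Core look like
-- `x = +# y 1#` rather than `x = I# y`, i.e. no per-iteration allocation.
sumTo :: Int -> Int
sumTo n = go 0 1
  where
    go !acc i
      | i > n     = acc
      | otherwise = go (acc + i) (i + 1)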

Aside: Since we are interested primarily in allocations, you might think that we would be better off looking at the -ddump-stg output. However, in practice I find that GHC's STG output is a tad too noisy to be worth the effort; Core is usually easier to read, even if allocation isn't quite as straightforward to see.

Ticky profiling

Another trick that is often helpful in identifying allocation (particularly in programs which should have very little of it) is GHC's ticky profiler. For this I would compile the program with ticky support:

$ ghc src/Main.hs  -O -isrc -fforce-recomp -ticky -rtsopts \
  -ddump-to-file -ddump-simpl -ddump-stg \
  -dsuppress-coercions -dsuppress-idinfo -dsuppress-type-applications

Then run the program with +RTS -rticky. This will produce a text file ./ticky containing a report of all allocation (and closure-entry) counts. For instance, here are some excerpted sections from the http parser example:

    Entries      Alloc    Alloc'd  Non-void Arguments      STG Name
--------------------------------------------------------------------------------
         54          0          0   2 TI                   sat_sfC4{v} (main:Main) (fun) in raFr
      23097     554328          0   4 MSEL                 sat_s8Pj{v} (main:Parser) (fun) in s8Pk
    1670738   40097712          0   4 MSEL                 sat_s8ON{v} (main:Parser) (fun) in s8OO
    1670738   40097712          0   4 MSET                 sat_s8OO{v} (main:Parser) (fun) in r1T
    1693835   94854760          0   9 pMiiiiiE>            main:Parser.$wmany_v{v r1T} (fun)
      23097     554328          0   4 MSET                 sat_s8Pk{v} (main:Parser) (fun) in r1I
    1693835   81304080          0   9 pMiiiiiEL            $wlvl{v s8It} (main:Parser) (fun) in s8Ii
   11826048  283825152          0   4 MSEL                 sat_s8B7{v} (main:Parser) (fun) in s8AO
   11826048  851475456          0   9 pMiiiiiEL            $wlvl{v s8AO} (main:Parser) (fun) in s8AE
   11826048 2459817984          0   9 pMiiiiiE>            $wmany_v1{v r8kY} (main:Parser) (fun)
   13519883 2379499408          0   9 pMiiiiiES            $wsucc{v s8Au} (main:Parser) (fun) in r8l1
   13519883 3547321648          0   9 pMiiiiiE>            $wmany_v2{v r8l1} (main:Parser) (fun)
    1693835  162608160          0   9 pMiiiiiES            $wsucc2{v s8Ii} (main:Parser) (fun) in s8I8

Focus on the Alloc column, which shows how many bytes each of the listed closures allocated over the course of the program run. This makes it clear that the $wmany_v1, $wmany_v2, and $wsucc functions are pretty allocation-heavy. We can grep for these names in the Parser.dump-stg output to get a sense of what is being allocated.

$wmany is a worker (a la the "worker/wrapper transformation") for a function called many, which suggests that this allocation is coming from attoparsec itself. At this point it's worthwhile to check out attoparsec alongside the testcase repository and add it to our GHC command line (a rough sketch of why many allocates follows the command below):

$ git clone git@github.com:bgamari/attoparsec
$ ghc src/Main.hs  -O -isrc -fforce-recomp -ticky -rtsopts -iattoparsec \
  -ddump-to-file -ddump-simpl -ddump-stg \
  -dsuppress-coercions -dsuppress-idinfo -dsuppress-type-applications
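
For intuition about where those $wmany allocations come from: a many-style combinator has roughly the shape sketched below (this is not attoparsec's actual definition, just the general idea), and each successful iteration allocates a (:) cell for the result plus a closure for the rest of the parse.

import Control.Applicative ((<|>))
import Data.Attoparsec.ByteString.Char8 (Parser)

-- Not attoparsec's real code, just the general shape of a many-style
-- combinator: every iteration builds a (:) cell and a new continuation
-- for the remainder of the parse, which is roughly the kind of
-- allocation ticky attributes to $wmany above.
manyP :: Parser a -> Parser [a]
manyP p = go
  where
    go = ((:) <$> p <*> go) <|> pure []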

flip111 commented Mar 30, 2019

Hi @bgamari, thanks for the write-up.

So far I've looked into refactoring warp's code into a small snippet, then went on to obtaining the "package environment file".

I've always used stack, not cabal or ghc directly. From what I could find, stack does not generate the package environment file and also does not install cabal-install. cabal-install is best installed globally through stack install cabal-install outside of a project, because when running it from within a project you get: cabal: Use of GHC's environment variable GHC_PACKAGE_PATH is incompatible with Cabal. Use the flag --package-db to specify a package database (it can be used multiple times). This is unfortunate because now cabal uses a different compiler (and I guess also different packages) than the stack LTS release.

After that I found a flag on cabal for the new-build command: --write-ghc-environment-files=always. I looked around in cabal's (and, before that, stack's) build artifacts, but I'm not sure which file is the package environment file. My guess is it's cache/plan.json (which was already there before using this flag).

I guess now this file has to be loaded with -package-env (I got this from https://downloads.haskell.org/~ghc/master/users-guide/packages.html#package-environments ), but I don't see this flag in the command you gave:

$ ghc src/Main.hs  -O -isrc -fforce-recomp \
  -ddump-to-file -ddump-simpl \
  -dsuppress-coercions -dsuppress-idinfo -dsuppress-type-applications

So maybe there is another way to load the package environment file.

Now I'm left with the question of whether I got the right file (cache/plan.json) and the right way to load it, and also how to get the same ghc & packages as stack was using. It's a bit of a rough experience this way; perhaps I will open an issue on the stack repo to see if anything can be done about it (like generating that package environment file). Maybe the right environment will already be loaded when I just call stack runghc, but I'm not sure about all that, so I'd rather follow the exact procedure you are using and try to call ghc directly.

llelf (Member) commented Mar 30, 2019

@flip111 TL;DR is:
do cabal new-build (or, for example, cabal new-build -w ghc-8.6.3) and you will get a .ghc.environment.⟨smth⟩ file. Then ghc and ghci will see the packages and just work™️.

$ ghci -v0
Loaded package environment from /path/to/package/.ghc.environment.x86_64-darwin-8.6.3

cache/plan.json is unrelated; it's the build plan.

flip111 commented Mar 30, 2019

Right, it was hidden, so that's why I didn't see it :P Thanks @llelf!

flip111 commented Mar 31, 2019

I ended up calling ghc through stack, like this:

stack exec -- ghc src/Main.hs -O -isrc -fforce-recomp -ddump-to-file -ddump-simpl -dsuppress-coercions -dsuppress-idinfo -dsuppress-type-applications

which takes care of getting me the right ghc version, and I assume also the right packages now. The binary gets written out into the source directory instead of somewhere like .stack-work/dist/x86_64-linux-tinfo6/Cabal-2.4.0.1/build/program/program .. seems OK to me.

I installed Spacemacs but I didn't get the ghc-core-mode syntax highlighting to load (I don't know anything about Spacemacs). I saw this package https://github.com/yav/dump-core; might give that a try. Anyway, I went on to read the Core in Sublime ...

main :: IO ()
main = Main.main1 `cast` <Co:3>

Not sure which module this cast function is from. And a symbol like <Co:3> I've never seen before; no idea what it is.

Main.main1
  :: GHC.Prim.State# GHC.Prim.RealWorld
     -> (# GHC.Prim.State# GHC.Prim.RealWorld, () #)
Main.main1
  = \ (s_a3NY :: GHC.Prim.State# GHC.Prim.RealWorld) ->
      case Control.Concurrent.runInUnboundThread1
             (Main.main2 `cast` <Co:3>) s_a3NY
      of
      { (# ipv_a3O1, ipv1_a3O2 #) ->
      (# ipv_a3O1, GHC.Tuple.() #)
      }

The (# #) are unboxed tuples, I believe. Here is some other info about GHC.Prim. OK, so far all good.
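
My understanding (sketched below, mirroring how GHC.Types defines IO) is that an IO action is just a state-passing function, which would explain both the State# RealWorld arguments and the unboxed (# , #) result tuples in the Core of main:

{-# LANGUAGE MagicHash, UnboxedTuples #-}

import GHC.Exts (RealWorld, State#)

-- My understanding, mirroring GHC.Types.IO: an IO action is a function
-- from the RealWorld state token to an unboxed pair of the new token and
-- the result, hence all the (# State# RealWorld, a #) in the Core above.
newtype MyIO a = MyIO (State# RealWorld -> (# State# RealWorld, a #))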

There isn't a let binding in the Core output, so nothing to optimize there, I guess? The flamegraph shows a lot of time being taken in throwSocketErrorIfMinus1RetryMayBlock of Network.Socket.Internal. I'd like to zoom in on that, but I don't have the Core file for it, so I'll leave that for now and see what I get with ticky.

The ticky file shows:

    Entries      Alloc    Alloc'd  Non-void Arguments      STG Name
--------------------------------------------------------------------------------
    1036374          0          0   2 >S                   main:Main.main5{v rT} (fun)
    1036374          0          0   2 S>                   main:Main.main3{v rM} (fun)

Those two functions are:

-- RHS size: {terms: 2, types: 0, coercions: 0, joins: 0/0}
Main.main5
  :: forall r.
     Data.ByteString.Builder.Internal.BuildStep r
     -> Data.ByteString.Builder.Internal.BufferRange
     -> GHC.Prim.State# GHC.Prim.RealWorld
     -> (# GHC.Prim.State# GHC.Prim.RealWorld,
           Data.ByteString.Builder.Internal.BuildSignal r #)
Main.main5 = Network.Wai.responseLBS1 Main.main6

-- RHS size: {terms: 4, types: 4, coercions: 0, joins: 0/0}
Main.main3
  :: Network.Wai.Internal.Request
     -> (Network.Wai.Internal.Response
         -> IO Network.Wai.Internal.ResponseReceived)
     -> IO Network.Wai.Internal.ResponseReceived
Main.main3
  = \ _
      (respond_a2I6
         :: Network.Wai.Internal.Response
            -> IO Network.Wai.Internal.ResponseReceived) ->
      respond_a2I6 Main.main4
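
If I read the Core right, Main.main3 is just the WAI Application from my source, i.e. something along these lines (the status, headers, and body here are my own placeholders, not taken from the dump):

{-# LANGUAGE OverloadedStrings #-}

import Network.HTTP.Types (status200)
import Network.Wai (Application, responseLBS)

-- My reading of Main.main3 above: it is simply the Application handler,
-- ignoring the request and passing a constant response to `respond`
-- (the concrete status/headers/body are placeholders, not from the dump).
app :: Application
app _req respond = respond (responseLBS status200 [] "Hello, World!")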

I don't see that responseLBS1 of Network.Wai in my profiling output (when compiling with profiling). Maybe Core/ticky is a bit more precise here about finding the right function? I can see where that respond_a2I6 is coming from (it's being passed down through several functions), but I'm not able to find anything that indicates it's a "heavy" function. main5 and main3 seem to work in tandem, since they have exactly the same entry counts in the ticky output.

I googled and asked around a bit; so far I have not found out how to get Core/ticky output for dependencies.

andrewthad (Contributor) commented

#139

Keep in mind that the http parser used to be faster because it didn’t work correctly.
