See if use of vectors can improve performance of listy code #165

Open
barrucadu opened this issue Dec 10, 2017 · 1 comment
There's quite a bit of very core code in dejafu that uses lists in a known-bad way: lots of appending, iteration, and index-based updates. The findBacktrackSteps function is a representative example:

findBacktrackSteps backtrack boundKill = go initialDepState S.empty initialThread [] . F.toList where
  -- Walk through the traces one step at a time, building up a list of
  -- new backtracking points.
  go state allThreads tid bs ((e,i):is) ((d,_,a):ts) =
    let tid'   = tidOf tid d
        state' = updateDepState state tid' a
        this = BacktrackStep
          { bcktThreadid   = tid'
          , bcktDecision   = d
          , bcktAction     = a
          , bcktRunnable   = M.fromList e
          , bcktBacktracks = M.fromList $ map (\i' -> (i', False)) i
          , bcktState      = state
          }
        bs' = doBacktrack killsEarly allThreads' e (bs++[this])
        runnable = S.fromList (M.keys $ bcktRunnable this)
        allThreads' = allThreads `S.union` runnable
        killsEarly = null ts && boundKill
    in go state' allThreads' tid' bs' is ts
  go _ _ _ bs _ _ = bs

  -- Find the prior actions dependent with this one and add
  -- backtracking points.
  doBacktrack killsEarly allThreads enabledThreads bs =
    let tagged = reverse $ zip [0..] bs
        idxs   = [ (ehead "doBacktrack.idxs" is, False, u)
                 | (u, n) <- enabledThreads
                 , v <- S.toList allThreads
                 , u /= v
                 , let is = idxs' u n v tagged
                 , not $ null is]

        idxs' u n v = go' True where
          {-# INLINE go' #-}
          go' final ((i,b):rest)
            -- Don't cross subconcurrency boundaries
            | isSubC final b = []
            -- If this is the final action in the trace and the
            -- execution was killed due to nothing being within bounds
            -- (@killsEarly == True@) assume worst-case dependency.
            | bcktThreadid b == v && (killsEarly || isDependent b) = i : go' False rest
            | otherwise = go' False rest
          go' _ [] = []

          {-# INLINE isSubC #-}
          isSubC final b = case bcktAction b of
            Stop -> not final && bcktThreadid b == initialThread
            Subconcurrency -> bcktThreadid b == initialThread
            _ -> False

          {-# INLINE isDependent #-}
          isDependent b
            -- Don't impose a dependency if the other thread will
            -- immediately block already. This is safe because a
            -- context switch will occur anyway so there's no point
            -- pre-empting the action UNLESS the pre-emption would
            -- possibly allow for a different relaxed memory stage.
            | isBlock (bcktAction b) && isBarrier (simplifyLookahead n) = False
            | otherwise = dependent' (bcktState b) (bcktThreadid b) (bcktAction b) u n
    in backtrack bs idxs

In every iteration, it computes bs ++ [this] and reverse $ zip [0..] bs (both O(n), so quadratic over the whole trace) and passes this bs list, indirectly, to pBacktrack, which computes reverse (take (i-1) bs), and to backtrackAt, which updates one or two points by index towards the end of that list. These are all operations that lists are terrible at!
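To make the asymptotics concrete, here is a minimal standalone sketch of those patterns (illustrative names, not dejafu's):

import qualified Data.Vector as V

-- Appending to the end of a list copies the whole accumulator:
-- O(n) per step, so O(n^2) over a trace of n steps.
appendStep :: [a] -> a -> [a]
appendStep bs this = bs ++ [this]

-- Rebuilding the index-tagged, reversed list is another full O(n)
-- traversal on every iteration.
taggedSteps :: [a] -> [(Int, a)]
taggedSteps bs = reverse (zip [0..] bs)

-- With a vector the same right-to-left indexed walk is a single
-- pass with no intermediate list: ifoldl' with (:) yields the pairs
-- in the same descending-index order as taggedSteps.
taggedStepsV :: V.Vector a -> [(Int, a)]
taggedStepsV = V.ifoldl' (\acc i x -> (i, x) : acc) []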

I wonder if using vectors here would give a performance boost. go could be replaced with a list fold over the function parameters to construct an initial bs, followed by a vector map-with-index over that; the list-comprehension-over-reversed-list in doBacktrack would become a vector fold-right-with-index, as would the recursion in pBacktrack; and backtrackAt would just use vector update.
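A rough sketch of the shapes those replacements could take with Data.Vector (hypothetical helper names, not the actual patch):

import Data.List (foldl')
import qualified Data.Vector as V

-- Build the accumulator once: cons into a list (O(1) per step) and
-- do a single O(n) conversion, instead of (++) on every iteration.
-- (In the real go, mkStep would also thread the dependency state
-- through, so the fold would carry that state as an accumulator.)
buildSteps :: (a -> b) -> [a] -> V.Vector b
buildSteps mkStep = V.fromList . reverse . foldl' (\acc x -> mkStep x : acc) []

-- The list-comprehension-over-reversed-list becomes an indexed fold,
-- collecting matching indices in descending order like tagged does.
indicesWhere :: (Int -> a -> Bool) -> V.Vector a -> [Int]
indicesWhere p = V.ifoldl' (\acc i x -> if p i x then i : acc else acc) []

-- backtrackAt's update-by-index becomes a bulk vector update: one
-- O(n) copy per call, rather than rebuilding a prefix of the list.
updateAt :: Int -> (a -> a) -> V.Vector a -> V.Vector a
updateAt i f v = v V.// [(i, f (v V.! i))]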


barrucadu commented Dec 10, 2017

I had a quick go at this; linked is an almost-working patch against bd64dfe. There's one test failure:

    Independent Read Independent Write:
      :SQ: [Failed]
Failed:
Expected: Right ((0,1),(0,0))
Expected: Right ((0,1),(0,1))

      :TSO: [OK]
      :PSO: [OK]
Benchmark results are not that great: less allocation and less time in the garbage collector, but maximum resident memory is basically the same, and more time is spent in the mutator:

original:

 106,023,792,976 bytes allocated in the heap
  24,972,461,280 bytes copied during GC
      17,829,800 bytes maximum residency (8026 sample(s))
         523,456 bytes maximum slop
              50 MB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0     196429 colls,     0 par   43.323s  43.246s     0.0002s    0.0015s
  Gen  1      8026 colls,     0 par   17.711s  17.706s     0.0022s    0.0406s

  TASKS: 205044 (205040 bound, 4 peak workers (4 total), using -N1)

  SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)

  INIT    time    0.001s  (  0.001s elapsed)
  MUT     time  142.002s  (134.145s elapsed)
  GC      time   61.035s  ( 60.952s elapsed)
  EXIT    time    0.001s  (  0.001s elapsed)
  Total   time  203.119s  (195.098s elapsed)

  Alloc rate    746,633,705 bytes per MUT second

  Productivity  70.0% of total user, 72.8% of total elapsed

vector:

  89,709,834,032 bytes allocated in the heap
  15,715,802,120 bytes copied during GC   
      17,561,104 bytes maximum residency (3588 sample(s))
         578,704 bytes maximum slop
              45 MB total memory in use (0 MB lost due to fragmentation)
                                    
                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0     171402 colls,     0 par   33.321s  33.260s     0.0002s    0.0033s
  Gen  1      3588 colls,     0 par    7.655s   7.651s     0.0021s    0.0451s
                                               
  TASKS: 222854 (222850 bound, 4 peak workers (4 total), using -N1)
                                     
  SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)
                                       
  INIT    time    0.001s  (  0.001s elapsed)
  MUT     time  160.987s  (152.481s elapsed)
  GC      time   40.976s  ( 40.911s elapsed)
  EXIT    time    0.001s  (  0.001s elapsed)
  Total   time  202.046s  (193.394s elapsed)
                                   
  Alloc rate    557,247,973 bytes per MUT second
                                           
  Productivity  79.7% of total user, 83.3% of total elapsed

Measured by compiling with -rtsopts and running dejafu-tests +RTS -s.
