
Parser makes Ubuntu crash on 1.2G file #130

Open
apraga opened this issue Jun 29, 2017 · 4 comments

@apraga

apraga commented Jun 29, 2017

Hi,

I've implemented a parser using Attoparsec, which works very well. Unfortunately, running it on a large file (1.2 GB) makes my machine crash on Ubuntu 16.04.2 LTS. I'm using attoparsec 0.13.1.0 with stack.

Below is the complete code for the parser. As an example, a small file is also attached to give an idea of the file format: small_test.txt

If anyone is interested, I can provide the large file that makes the parser crash. Thanks.

{-# LANGUAGE DeriveDataTypeable, OverloadedStrings #-}
import Control.Applicative
import Control.Monad (void)
import Data.List
import Data.Scientific as S hiding (scientific)
import Data.Text.Lazy as T hiding (map, count)
import Data.Text.Lazy.IO as TIO
import Prelude hiding (exponent, id)
import Data.Attoparsec.Text.Lazy
import System.Console.CmdArgs
import System.Environment

-- Reading ploc using Attoparsec: fast, but error messages are unhelpful.
-- For debug, use parseTest and ghci for each component.
--
-- The file format is 
-- TIME
-- HEADER
-- [PARTICLE]
--
-- with 
--
-- TIME = realtime = FLOAT [gamt = FLOAT]
-- HEADER = PART # XX YY ANGZ | ZZ  ALPHA BETA GAMMA ADX ADY ADZ
-- PARTICLE = INT FLOAT*9
data Particle = Particle {
  id :: Integer,
  pos :: [Scientific],
  ad :: [Integer]
}

data Iteration = Iteration {
  realtime :: Scientific,
  particles :: [Particle]
}

toText :: Show a => a -> T.Text
toText = T.pack . show

addComma x = T.intercalate "," $ map toText x

printPart :: Particle -> T.Text
printPart (Particle i p a) =  T.intercalate "," l
    where l = [toText i, addComma p, addComma a]

printIter :: Iteration -> T.Text
printIter (Iteration t p) = T.intercalate "\n" $ map format p
      where format x = T.concat [toText t, ",", printPart x]
     
signedInt :: Parser Integer
signedInt = signed decimal

mySep1 = some $ char ' '
 
mySep = many space 

gamt = mySep >> asciiCI "gamt =" >> mySep >> scientific

-- time :: Parser Scientific
time = do
  mySep >> asciiCI "REALTIME =" >> mySep 
  t <- scientific 
  _ <- option 0 gamt -- optional "gamt = ..." part, discarded
  return t

-- Helper
stringify x = mySep >> asciiCI x
 
-- Two headers are possible : the 5th column can be "zz" or "ANGZ"
header = do 
  mapM_ stringify header0 
  mySep *> (asciiCI "zz" <|> asciiCI "ANGZ")
  mapM_ stringify header1 
  where 
    header0 = [ "PART", "#" , "XX", "YY"]
    header1 = [ "ALPHA", "BETA", "GAMMA" , "ADX", "ADY", "ADZ"]

-- Read a particle coordinates
part :: Parser Particle
part = do
  id <- mySep >> decimal <* mySep1
  coord <- count 6 (scientific <* mySep1)
  asd <-  sepBy signedInt mySep1 
  return $ Particle id coord asd

emptyLine = mySep >> endOfLine

 -- Read an iteration
iter :: Parser Iteration
iter = do 
  t <- time <* endOfLine
  header  >> endOfLine
  allPart <- sepBy part endOfLine
  return $ Iteration t allPart
 
parseExpr = space >> sepBy iter space 
 
readExpr input = case eitherResult . parse parseExpr $ input of
  Left err -> error ("failed to read: " ++ err)
  Right val -> val
-- 
data ParserArgs = ParserArgs { input :: String
                             , output :: FilePath } 
                   deriving (Show, Data, Typeable)

parserArgs = ParserArgs { 
                input = def &= argPos 0 &= typ "INPUT"
                , output = def &= argPos 1 &= typ "OUTPUT"
                }

main = do
  args <- cmdArgs parserArgs
  txt <- TIO.readFile $ input args
  let d = readExpr txt
  let result = T.intercalate "\n" $ map printIter d
  TIO.writeFile (output args) result
  print "done"
@bgamari
Collaborator

bgamari commented Jun 29, 2017

What precisely do you mean by crash? Keep in mind that heap representations (especially your particular Particle representation) are generally larger than their on-disk representation. Are you certain you aren't simply running out of memory?

@apraga
Author

apraga commented Jun 30, 2017

Thanks for the quick answer. By "crash", I mean the computer freezes and becomes unresponsive.

I've monitored memory usage and you are right: I'm running out of memory. Is there a way to decrease my program's memory usage?

@bgamari
Collaborator

bgamari commented Jun 30, 2017

Looking at your program, a few things stand out:

  • You have several fields of type Scientific. Each of these will take at least three machine words. If you need the precision then this might be acceptable, but if not you are likely going to be better off using Double.
  • You are encoding the particle position and orientation vectors as linked lists. Lists are extremely inefficient for this sort of thing as each list element incurs a cost of three machine words (one to identify the constructor, one pointer to the element, and one pointer to the tail of the list) in addition to the element itself. This means that each particle's pos field will require 5 words for each element, or 15 words in total.
  • Your particle numbers are Integers. Integers are only slightly larger than Int (two words instead of one), but you likely don't need the range here.
  • The particles of each iteration are likewise stored in a linked list. You may be better off using an unboxed vector.
  • All of your fields are lazy. You likely want to make them strict unless you really need the laziness. Strictness allows the compiler to unpack them, eliminating a few pointers.
  • Depending upon how many iterations you need to work with, you might consider instead using a streaming approach to avoid having to keep your entire dataset in memory at once.
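
The first few points above could be sketched like this (a minimal sketch only; the field names and the use of `Data.Vector.Unboxed` from the `vector` package are illustrative, not from the original code):

```haskell
import qualified Data.Vector.Unboxed as U

-- Strict, compact variants of the original types:
--   Int instead of Integer, Double instead of Scientific,
--   unboxed vectors instead of linked lists.
data Particle = Particle
  { pId  :: {-# UNPACK #-} !Int     -- one unpacked word instead of an Integer
  , pPos :: !(U.Vector Double)      -- unboxed Doubles, not [Scientific]
  , pAd  :: !(U.Vector Int)         -- unboxed Ints, not [Integer]
  }

data Iteration = Iteration
  { iRealtime  :: {-# UNPACK #-} !Double
  , iParticles :: ![Particle]
  }

main :: IO ()
main = do
  let p = Particle 1 (U.fromList [1.0, 2.0, 3.0]) (U.fromList [0, 0, 1])
  print (U.sum (pPos p))
```

Strict (`!`) fields let GHC unpack the scalar components directly into the constructor, so each Particle carries no extra thunks or pointer indirections for its id and time stamp.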

@bgamari
Collaborator

bgamari commented Jun 30, 2017

@alexDarcy, see https://github.com/bgamari/memory-reduction for a few examples. Come find me in #haskell on irc.freenode.net if you want to chat about your problem.
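
As for the streaming point, one minimal sketch using attoparsec's lazy-Text interface (the `streamWith` helper and the toy line parser here are hypothetical, not taken from the repository above): parse one record at a time and hand it to a sink, so only one record is resident instead of the whole list of iterations.

```haskell
import qualified Data.Attoparsec.Text as A
import qualified Data.Attoparsec.Text.Lazy as AL
import qualified Data.Text.Lazy as TL

-- Repeatedly run a parser over lazy Text, passing each result to a
-- sink, instead of accumulating everything with sepBy.
streamWith :: A.Parser a -> (a -> IO ()) -> TL.Text -> IO ()
streamWith p sink = go
  where
    go input
      | TL.null input = pure ()
      | otherwise = case AL.parse p input of
          AL.Done rest x  -> sink x >> go rest
          AL.Fail _ _ err -> error ("parse failure: " ++ err)

-- Toy demo: one decimal per line, printed as soon as it is parsed.
main :: IO ()
main = streamWith ((A.decimal :: A.Parser Int) <* A.endOfLine)
                  print
                  (TL.pack "1\n2\n3\n")
```

In the original program this would mean calling the sink (e.g. writing the formatted iteration to the output file) from inside the loop, rather than building the full `[Iteration]` before printing.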
