
Parser makes Ubuntu crash on 1.2G file #130

Open
apraga opened this issue Jun 29, 2017 · 4 comments

@apraga

apraga commented Jun 29, 2017

Hi,

I've implemented a parser using Attoparsec, which works very well. Unfortunately, running it on a large file (1.2 GB) makes my machine crash on Ubuntu 16.04.2 LTS. I'm using attoparsec 0.13.1.0 with stack.

Below is the complete code for the parser. As an example, a small file is also attached to give an idea of the file format: small_test.txt

If anyone is interested, I can provide the large file that makes the parser crash. Thanks.

{-# LANGUAGE DeriveDataTypeable, OverloadedStrings #-}
import Control.Applicative
import Control.Monad (void)
import Data.List
import Data.Scientific as S hiding (scientific)
import Data.Text.Lazy as T hiding (map, count)
import Data.Text.Lazy.IO as TIO
import Prelude hiding (exponent, id)
import Data.Attoparsec.Text.Lazy
import System.Console.CmdArgs
import System.Environment

-- Reading ploc using Attoparsec: fast, but error messages are unhelpful.
-- For debug, use parseTest and ghci for each component.
--
-- The file format is 
-- TIME
-- HEADER
-- [PARTICLE]
--
-- with 
--
-- TIME = realtime = FLOAT [gamt = FLOAT]
-- HEADER = PART # XX YY ANGZ | ZZ  ALPHA BETA GAMMA ADX ADY ADZ
-- PARTICLE = INT FLOAT*9
data Particle = Particle {
  id :: Integer,
  pos :: [Scientific],
  ad :: [Integer]
}

data Iteration = Iteration {
  realtime :: Scientific,
  particles :: [Particle]
}

toText :: Show a => a -> T.Text
toText = T.pack . show

addComma x = T.intercalate "," $ map toText x

printPart :: Particle -> T.Text
printPart (Particle i p a) =  T.intercalate "," l
    where l = [toText i, addComma p, addComma a]

printIter :: Iteration -> T.Text
printIter (Iteration t p) = T.intercalate "\n" $ map format p
      where format x = T.concat [toText t, ",", printPart x]
     
signedInt :: Parser Integer
signedInt = signed decimal

mySep1 = some $ char ' '
 
mySep = many space 

gamt = mySep >> asciiCI "gamt =" >> mySep >> scientific

-- time :: Parser Scientific
time = do
  mySep >> asciiCI "REALTIME =" >> mySep 
  t <- scientific 
  _ <- option 0 gamt -- optional "gamt = ..." part, discarded
  return t

-- Helper
stringify x = mySep >> asciiCI x
 
-- Two headers are possible : the 5th column can be "zz" or "ANGZ"
header = do 
  mapM_ stringify header0 
  mySep *> (asciiCI "zz" <|> asciiCI "ANGZ")
  mapM_ stringify header1 
  where 
    header0 = [ "PART", "#" , "XX", "YY"]
    header1 = [ "ALPHA", "BETA", "GAMMA" , "ADX", "ADY", "ADZ"]

-- Read a particle coordinates
part :: Parser Particle
part = do
  id <- mySep >> decimal <* mySep1
  coord <- count 6 (scientific <* mySep1)
  asd <-  sepBy signedInt mySep1 
  return $ Particle id coord asd

emptyLine = mySep >> endOfLine

 -- Read an iteration
iter :: Parser Iteration
iter = do 
  t <- time <* endOfLine
  header  >> endOfLine
  allPart <- sepBy part endOfLine
  return $ Iteration t allPart
 
parseExpr = space >> sepBy iter space 
 
readExpr input = case eitherResult . parse parseExpr $ input of
  Left err -> error ("failed to read: " ++ err)
  Right val -> val
-- 
data ParserArgs = ParserArgs { input :: String
                             , output :: FilePath } 
                   deriving (Show, Data, Typeable)

parserArgs = ParserArgs { 
                input = def &= argPos 0 &= typ "INPUT"
                , output = def &= argPos 1 &= typ "OUTPUT"
                }

main = do
  args <- cmdArgs parserArgs
  txt <- TIO.readFile $ input args
  let d = readExpr txt
  let result = T.intercalate "\n" $ map printIter d
  TIO.writeFile (output args) result
  print "done"
@bgamari
Collaborator

bgamari commented Jun 29, 2017

What precisely do you mean by crash? Keep in mind that heap representations (especially your particular Particle representation) are generally larger than their on-disk representation. Are you certain you aren't simply running out of memory?

@apraga
Author

apraga commented Jun 30, 2017

Thanks for the quick answer. By "crash", I mean the computer freezes and becomes unresponsive.

I've monitored memory usage and you are right: I'm running out of memory. Is there a way to decrease my program's memory usage?

@bgamari
Collaborator

bgamari commented Jun 30, 2017

Looking at your program, a few things stand out:

  • You have several fields of type Scientific. Each of these will take at least three machine words. If you need the precision then this might be acceptable, but if not you are likely going to be better off using Double.
  • You are encoding the particle position and orientation vectors as linked lists. Lists are extremely inefficient for this sort of thing as each list element incurs a cost of three machine words (one to identify the constructor, one pointer to the element, and one pointer to the tail of the list) in addition to the element itself. This means that each particle's pos field will require 5 words for each element, or 15 words in total.
  • Your particle numbers are Integers. Integers are only slightly larger than Int (two words instead of one), but you likely don't need the range here.
  • The particles of each iteration are likewise stored in a linked list. You may be better off using an unboxed vector.
  • All of your fields are lazy. You likely want to make them strict unless you really need the laziness. Strictness allows the compiler to unpack them, eliminating a few pointers.
  • Depending upon how many iterations you need to work with, you might consider instead using a streaming approach to avoid having to keep your entire dataset in memory at once.
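
The first few points above could be sketched like this (a minimal sketch only; the field names and the use of `Data.Vector.Unboxed` from the `vector` package are illustrative, not from the original code):

```haskell
import qualified Data.Vector.Unboxed as U

-- Strict, compact variants of the original types:
--   Int instead of Integer, Double instead of Scientific,
--   unboxed vectors instead of linked lists.
data Particle = Particle
  { pId  :: {-# UNPACK #-} !Int     -- one unpacked word instead of an Integer
  , pPos :: !(U.Vector Double)      -- unboxed Doubles, not [Scientific]
  , pAd  :: !(U.Vector Int)         -- unboxed Ints, not [Integer]
  }

data Iteration = Iteration
  { iRealtime  :: {-# UNPACK #-} !Double
  , iParticles :: ![Particle]
  }

main :: IO ()
main = do
  let p = Particle 1 (U.fromList [1.0, 2.0, 3.0]) (U.fromList [0, 0, 1])
  print (U.sum (pPos p))
```

Strict (`!`) fields let GHC unpack the scalar components directly into the constructor, so each Particle carries no extra thunks or pointer indirections for its id and time stamp.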

@bgamari
Collaborator

bgamari commented Jun 30, 2017

@alexDarcy, see https://github.com/bgamari/memory-reduction for a few examples. Come find me in #haskell on irc.freenode.net if you want to chat about your problem.
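
As for the streaming point, one minimal sketch using attoparsec's lazy-Text interface (the `streamWith` helper and the toy line parser here are hypothetical, not taken from the repository above): parse one record at a time and hand it to a sink, so only one record is resident instead of the whole list of iterations.

```haskell
import qualified Data.Attoparsec.Text as A
import qualified Data.Attoparsec.Text.Lazy as AL
import qualified Data.Text.Lazy as TL

-- Repeatedly run a parser over lazy Text, passing each result to a
-- sink, instead of accumulating everything with sepBy.
streamWith :: A.Parser a -> (a -> IO ()) -> TL.Text -> IO ()
streamWith p sink = go
  where
    go input
      | TL.null input = pure ()
      | otherwise = case AL.parse p input of
          AL.Done rest x  -> sink x >> go rest
          AL.Fail _ _ err -> error ("parse failure: " ++ err)

-- Toy demo: one decimal per line, printed as soon as it is parsed.
main :: IO ()
main = streamWith ((A.decimal :: A.Parser Int) <* A.endOfLine)
                  print
                  (TL.pack "1\n2\n3\n")
```

In the original program this would mean calling the sink (e.g. writing the formatted iteration to the output file) from inside the loop, rather than building the full `[Iteration]` before printing.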
