Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Streaming parser #205

Open
wants to merge 6 commits into
base: master
Choose a base branch
from
Open

Streaming parser #205

wants to merge 6 commits into from

Conversation

Daniel-Diaz
Copy link
Contributor

I added the ability to consume yaml files incrementally using conduits. This is exposed by the parseStream function in the module Data.Yaml.Internal.

The motivation was simply wanting to consume large files without having to have them fully in memory. I don't know what the demand for something like this is, but I wanted it.

I was thinking maybe to define a wrapper in Data.Yaml that would be easier to use and discover. However, I wasn't sure what that would look like. In any case, exporting the conduit works well enough for myself.

I also had to export ParseState or otherwise I wasn't able to run the parser.

To try it out, I wrote the following code:

{-# LANGUAGE ImportQualifiedPost, OverloadedStrings #-}

module Main (main) where

import Data.Void (Void)
import Data.Aeson (FromJSON, parseJSON, withObject, (.:), fromJSON)
import Data.Aeson qualified as JSON
import Conduit (ConduitM, runConduit, (.|), runResourceT)
import Data.Conduit.List qualified as CL
import Data.Yaml.Internal (Parse, ParseState (..), parseStream)
import Control.Monad.Reader (runReaderT)
import Control.Monad.State.Strict (evalStateT)
import Text.Libyaml (decodeFile)

data ID = ID { id_field :: Int }

instance FromJSON ID where
  parseJSON = withObject "ID" $ \o -> fmap ID $ o .: "id"

main :: IO ()
main = do
  let f :: Int -> JSON.Value -> Int
      f i v =
        case fromJSON v of
          JSON.Error err -> error err
          JSON.Success x -> i + id_field x
  let conduit :: ConduitM () Void Parse Int
      conduit = decodeFile "big.yaml"
             .| runReaderT parseStream []
             .| CL.fold f 0
  r <- runResourceT $ evalStateT (runConduit conduit) $ ParseState mempty []
  print r

And then I created a 1,5Gb file (big.yaml). The memory still goes up as the program runs, but does so very slowly and the program finishes before it is a problem (it ends up taking close to 3Gb though). Trying to parse the entire file quickly used all of my available memory (31,2Gb). I was expecting the conduit version to take a roughly constant amount of memory. I wonder what causes the slow build up. Could it be related to the parser state?

Thanks in advance for your review.

me <- lift CL.head
case me of
Just (EventSequenceStart _ _ a) -> parseArrayStreaming 0 a
_ -> liftIO $ throwIO $ UnexpectedEvent me $ Just $ EventSequenceStart NoTag AnySequence Nothing
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added some values here for the "expected" event, hoping that would be more helpful that using Nothing.

-- In combination with 'Y.decodeFile', it can be used to consume
-- a yaml file that consists in a very large list of values.
parseStream :: ReaderT JSONPath (ConduitM Event Value Parse) ()
parseStream = do
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can change the name of this if there's a better name for it.

@@ -5,7 +5,7 @@ cabal-version: 1.12
-- see: https://github.com/sol/hpack

name: yaml
version: 0.11.7.0
version: 0.11.8.0
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't change this myself. I don't know why it kept happening.

@snoyberg
Copy link
Owner

I don't think this would make a good addition here at this time. Such an approach would make more sense as a separate library to allow the API design to evolve instead of trying to promote it to a stable package immediately.

One thought is that it seems to me like the ReaderT is misplaced. It would logically make more sense to me to have the ReaderT wrapping around the Parse monad inside the ConduitT instead. There may be some limitations I didn't think of that make that difficult/impossible.

@Daniel-Diaz
Copy link
Contributor Author

Daniel-Diaz commented May 15, 2022

I don't think this would make a good addition here at this time. Such an approach would make more sense as a separate library to allow the API design to evolve instead of trying to promote it to a stable package immediately.

One thought is that it seems to me like the ReaderT is misplaced. It would logically make more sense to me to have the ReaderT wrapping around the Parse monad inside the ConduitT instead. There may be some limitations I didn't think of that make that difficult/impossible.

OK. It might be a good idea, but I don't have the time right now to create a separate project. Perhaps another time. For now, I will just use my fork and merge any updates from your repository.

The reason I added it here instead of on my own code is that it uses internal functions that are not exported, and I also thought it might be useful to other people. I simply copied parseAll and modified it to yield the values downstream instead of gathering them in a list. So it's just a modification of one the functions that's already there. This is also why ReaderT is in the place it is.

Thanks for the prompt reply!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants