Geny

// Mill
ivy"com.lihaoyi::geny:1.1.1"
ivy"com.lihaoyi::geny::1.1.1" // Scala.js / Native

// SBT
"com.lihaoyi" %% "geny" % "1.1.1"
"com.lihaoyi" %%% "geny" % "1.1.1" // Scala.js / Native

Geny is a small library that provides push-based versions of common standard library interfaces: Generator, Writable, and Readable.

More background on the Writable and Readable interfaces can be found in the accompanying blog post.

Generator

Generator is essentially the inverse of a scala.Iterator: rather than being built around the pull-based hasNext and next: T methods, it is built around the push-based generate method, which behaves like foreach with some tweaks.

Unlike a scala.Iterator, subclasses of Generator can guarantee that any cleanup logic is performed, by placing it after the generate call is made. This is useful when using Generators to model streaming data from files or other sources that require cleanup: the most common alternative, scala.Iterator, has no way of guaranteeing that the file gets properly closed after reading. Even so-called "self-closing iterators" that close the file once the iterator is exhausted fail to close it if the developer uses .head or .take to access only the first few elements and never exhausts the iterator.
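
As a rough illustration of the push-based idea (a conceptual sketch with simplified signatures, not geny's actual API), compare the two styles: a pull-based source hands control to the consumer, while a push-based source keeps control of the loop and can therefore always run its cleanup.

// Conceptual sketch only: simplified signatures, not geny's actual API.

// Pull-based: the consumer drives the loop, so the source cannot tell
// when it is safe to release its resources.
trait PullInts { def hasNext: Boolean; def next(): Int }

// Push-based: the source drives the loop, so cleanup can simply be
// placed after (or in a finally around) the traversal.
trait PushInts { def generate(handleItem: Int => Unit): Unit }

object TenInts extends PushInts {
  def generate(handleItem: Int => Unit): Unit = {
    val resource = new java.io.ByteArrayInputStream(new Array[Byte](0)) // hypothetical stand-in for a real resource
    try (1 to 10).foreach(handleItem)
    finally resource.close() // guaranteed to run once the traversal finishes or fails
  }
}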

Although geny.Generator is not part of the normal collections hierarchy, the API is intentionally modelled after that of scala.Iterator and should be mostly drop-in, with conversion functions provided where you need to interact with APIs using the standard Scala collections.

Geny is intentionally a tiny library with one file and zero dependencies, so you can depend on it (or even copy-paste it into your project) without fear of taking on unknown heavyweight dependencies.

Construction

The two simplest ways to construct a Generator are via the Generator(...) and Generator.from constructors:

import geny.Generator

scala> Generator(0, 1, 2)
res1: geny.Generator[Int] = Generator(WrappedArray(0, 1, 2))

scala> Generator.from(Seq(1, 2, 3)) // pass in any iterable or iterator
res2: geny.Generator[Int] = Generator(List(1, 2, 3))

If you need a Generator for a source that needs cleanup (closing file-handles, database connections, etc.) you can use the Generator.selfClosing constructor:

scala> class DummyCloseableSource {
     |   val iterator = Iterator(1, 2, 3, 4, 5, 6, 7, 8, 9)
     |   var closed = false
     |   def close() = {
     |     closed = true
     |   }
     | }
defined class DummyCloseableSource

scala> val g = Generator.selfClosing {
     |   val closeable = new DummyCloseableSource()
     |   (closeable.iterator, () => closeable.close())
     | }
g: geny.Generator[Int] = Gen.SelfClosing(...)

This constructor takes a block that will be called to produce a tuple of an Iterator[T] and a cleanup function of type () => Unit. Each time the Generator is evaluated (as illustrated in the sketch after this list):

  • A new pair of (Iterator[T], () => Unit) is created using this block

  • The iterator is used to generate however many elements are necessary

  • The cleanup function is called.
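
For example (a sketch reusing the g defined above; exact REPL output omitted), every terminal operation re-runs this open/drain/close cycle:

// Each terminal operation re-runs the selfClosing block: a fresh
// DummyCloseableSource is created, just enough elements are pulled,
// and the cleanup function is then invoked.
val firstTwo = g.take(2).toSeq      // Seq(1, 2); close() has already been called
val total    = g.foldLeft(0)(_ + _) // re-opens the source, sums 1 to 9 = 45, closes it again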

Terminal Operations

Transformations on a Generator are lazy: calling methods like filter or map does not evaluate the entire Generator, but instead constructs a new Generator that delegates to the original. The only methods that evaluate the Generator are the "terminal operation" methods like foreach/find, or the conversion methods like toArray and similar. In this way, Generator behaves similarly to Iterator, whose map/filter methods are also lazy until a terminal operation is called.

Terminal operations include the following:

scala> Generator(0, 1, 2).toSeq
res3: Seq[Int] = ArrayBuffer(0, 1, 2)

scala> Generator(0, 1, 2).reduceLeft(_ + _)
res4: Int = 3

scala> Generator(0, 1, 2).foldLeft(0)(_ + _)
res5: Int = 3

scala> Generator(0, 1, 2).exists(_ == 3)
res6: Boolean = false

scala> Generator(0, 1, 2).count(_ > 0)
res7: Int = 2

scala> Generator(0, 1, 2).forall(_ >= 0)
res8: Boolean = true

Overall, they behave mostly the same as on the standard Scala collections. Not every method is supported, but even those that aren’t provided can easily be re-implemented using foreach and the other methods available.
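
For instance, here is a hypothetical mkString-style helper built purely on foreach (an illustration only; check the Generator API in case an equivalent already exists):

// Re-implementing a "missing" terminal operation on top of foreach:
// walk the generator once and accumulate into a StringBuilder.
def mkStringOf[A](gen: geny.Generator[A], sep: String): String = {
  val sb = new StringBuilder
  var first = true
  gen.foreach { x =>
    if (!first) sb.append(sep)
    sb.append(x)
    first = false
  }
  sb.toString
}

mkStringOf(geny.Generator(0, 1, 2), ", ") // "0, 1, 2"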

Transformations

Transformations on a Generator are lazy: they do not immediately return a result, and only build up a computation:

scala> Generator(0, 1, 2).map(_ + 1)
res9: geny.Generator[Int] = Generator(WrappedArray(0, 1, 2)).map(<function1>)

scala> Generator(0, 1, 2).map { x => println(x); x + 1 }
res10: geny.Generator[Int] = Generator(WrappedArray(0, 1, 2)).map(<function1>)

This computation will be evaluated when one of the Terminal Operations described above is called:

scala> res10.toSeq
0
1
2
res11: Seq[Int] = ArrayBuffer(1, 2, 3)

Most of the common operations on the Scala collections are supported:

scala> (Generator(0, 1, 2).filter(_ % 2 == 0).map(_ * 2).drop(2) ++
       Generator(5, 6, 7).map(_.toString.toSeq).flatMap(x => x))
res12: geny.Generator[AnyVal] = Generator(WrappedArray(0, 1, 2)).filter(<function1>).map(<function1>).slice(2, 2147483647) ++ Generator(WrappedArray(5, 6, 7)).map(<function1>).map(<function1>)

scala> res12.toSeq
res13: Seq[AnyVal] = ArrayBuffer(5, 6, 7)

scala> Generator(0, 1, 2, 3, 4, 5, 6, 7, 8, 9).flatMap(i => i.toString.toSeq).takeWhile(_ != '6').zipWithIndex.filter(_._1 != '2')
res14: geny.Generator[(Char, Int)] = Generator(WrappedArray(0, 1, 2, 3, 4, 5, 6, 7, 8, 9)).map(<function1>).takeWhile(<function1>).zipWithIndex.filter(<function1>)

scala> res14.toVector
res15: Vector[(Char, Int)] = Vector((0,0), (1,1), (3,3), (4,4), (5,5))

As you can see, you can flatMap, filter, map, drop, takeWhile, ++ and call other methods on the Generator, and it simply builds up the computation without running it. Only when a terminal operation like toSeq or toVector is called is it finally evaluated into a result.

Note that a geny.Generator is immutable, and is thus never exhausted. However, it also does not perform any memoization or caching, and so calling a terminal operation like .toSeq on a Generator multiple times will evaluate any preceding transformations multiple times. If you do not want this to be the case, call .toSeq to turn it into a concrete sequence and work with that.
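
A quick sketch of the difference (expensiveStep is a hypothetical stand-in for any costly transformation):

// No memoization: every terminal operation re-runs the upstream work.
def expensiveStep(x: Int): Int = { println(s"computing $x"); x * 10 } // hypothetical costly step

val gen = geny.Generator(1, 2, 3).map(expensiveStep)
gen.toSeq // prints "computing 1", "computing 2", "computing 3"; yields Seq(10, 20, 30)
gen.sum   // prints all three lines again; yields 60

// To pay the cost only once, materialize the result and reuse it:
val materialized = gen.toSeq
materialized.sum // no recomputation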

Self Closing Generators

One major use case of geny.Generator is to ensure that resources involved in streaming results from some external source get properly cleaned up. Using scala.io.Source, for instance, we can get a scala.Iterator over the lines of a file, and we may be tempted to define a helper function like this:

// assumes a scala.io.Codec named `charSet` is in scope
def getFileLines(path: String): Iterator[String] = {
  val s = scala.io.Source.fromFile(path)(charSet)
  s.getLines()
}

However, this is incorrect: you never close the source s, and thus if you call this lots of times, you end up leaving tons of open file handles! If you are lucky this will crash your program; if you are unlucky it will hang your kernel and force you to reboot your computer.

One solution would be to simply not write helper functions: everyone who wants to read from a file must instantiate the scala.io.Source themselves and clean it up manually. This works, but is tedious and annoying. Another possibility is to have the Iterator close the io.Source itself when exhausted, but this still leaves open the possibility that the caller will use .head or .take on the iterator: a perfectly reasonable thing to do if you don't need all the output, but one that would leave such a "self-closing" iterator open and still leak file handles.

Using geny.Generators, the helper function can instead return a Generator.selfClosing:

def getFileLines(path: String): geny.Generator[String] = Generator.selfClosing {
  val s = scala.io.Source.fromFile(path)(charSet)
  (s.getLines(), () => s.close())
}

The caller can then use normal collection operations on the returned geny.Generator: map it, filter it, take, toSeq, etc. The underlying source will always be opened when a terminal operation is called, the required operations performed, and the source properly closed when everything is done.
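
For example (the paths here are hypothetical), the caller can treat the result like any ordinary collection of lines:

// Each terminal operation opens the file, reads only what it needs,
// and closes it again, even when only a prefix is consumed.
val firstTen   = getFileLines("/var/log/app.log").take(10).toSeq
val errorCount = getFileLines("/var/log/app.log").count(_.contains("ERROR"))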

Writable

geny.Writable is a minimal interface that can be implemented by any data type that writes binary output to a java.io.OutputStream:

trait Writable {
  def writeBytesTo(out: OutputStream): Unit
}

Writable allows for zero-friction, zero-overhead streaming data exchange between libraries that support it, e.g. allowing you to pass Scalatags Frags directly to os.write:

@ import $ivy.`com.lihaoyi::scalatags:0.12.0`, scalatags.Text.all._
import $ivy.$                             , scalatags.Text.all._

@ os.write(os.pwd / "hello.html", html(body(h1("Hello"), p("World!"))))

@ os.read(os.pwd / "hello.html")
res1: String = "<html><body><h1>Hello</h1><p>World!</p></body></html>"

Or sending ujson.Values directly to requests.post:

@ requests.post("https://httpbin.org/post", data = ujson.Obj("hello" -> 1))

@ res2.text
res3: String = """{
  "args": {},
  "data": "{\"hello\":1}",
  "files": {},
  "form": {},
...

Or serializing Scala data types directly to disk:

@ os.write(os.pwd / "two.json", upickle.default.stream(Map((1, 2) -> (3, 4), (5, 6) -> (7, 8))))

@ os.read(os.pwd / "two.json")
res5: String = "[[[1,2],[3,4]],[[5,6],[7,8]]]"

Or streaming file uploads over HTTP:

@ requests.post("https://httpbin.org/post", data = os.read.stream(os.pwd / "two.json")).text
res6: String = """{
  "args": {},
  "data": "[[[1,2],[3,4]],[[5,6],[7,8]]]",
  "files": {},
  "form": {},

All of this data exchange happens efficiently in a streaming fashion, without unnecessarily buffering data in memory.

geny.Writable also allows an implementation to ensure cleanup code runs after all data has been written (e.g. closing file handles, freeing managed resources), and it is much easier to implement than java.io.InputStream.

Writable has implicit constructors from the following types:

  • String

  • Array[Byte]

  • java.io.InputStream

It is implemented by libraries including Scalatags (Frag), uJSON/uPickle (ujson.Value, upickle.default.stream), and OS-Lib (os.read.stream), as the examples above show.

And is accepted by the following libraries:

  • Requests-Scala takes Writable in the data = field of requests.post and requests.put

  • OS-Lib accepts a Writable in os.write and in the stdin parameter of os.proc(...).call or os.proc(...).spawn

  • Cask: supports returning a Writable from any Cask endpoint

Any data type that writes bytes out to a java.io.OutputStream, java.io.Writer, or StringBuilder can be trivially made to implement Writable, which allows it to output data in a streaming fashion without needing to buffer it in memory. You can also implement Writable in your own data types, or accept it in your own methods, if you want to inter-operate with this existing ecosystem of libraries.
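
A minimal sketch of what that might look like for a hypothetical CSV report type (writeBytesTo is the only abstract member of the trait shown above):

import java.io.OutputStream
import java.nio.charset.StandardCharsets

// A hypothetical report type that implements Writable by streaming its
// rows to the OutputStream, instead of first building one large String.
case class CsvReport(rows: Seq[Seq[String]]) extends geny.Writable {
  def writeBytesTo(out: OutputStream): Unit = {
    for (row <- rows) {
      out.write(row.mkString(",").getBytes(StandardCharsets.UTF_8))
      out.write('\n')
    }
  }
}

// It can now be passed anywhere a Writable is accepted, for example:
// os.write(os.pwd / "report.csv", CsvReport(Seq(Seq("a", "b"), Seq("1", "2"))))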

Readable

trait Readable extends Writable {
  def readBytesThrough[T](f: InputStream => T): T
  def writeBytesTo(out: OutputStream): Unit = readBytesThrough(Internal.transfer(_, out))
}

Readable is a subtype of Writable that provides an additional guarantee: not only can it be written to a java.io.OutputStream, it can also be read from via a java.io.InputStream. Note that the InputStream is scoped and only available within the readBytesThrough callback: after the callback returns, the InputStream is closed and any associated resources (HTTP connections, file handles, etc.) are released.
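
As a small sketch (assuming the built-in String support listed below is provided via an implicit conversion), reading a prefix of the bytes looks like this:

// The InputStream passed to the callback is scoped: it is valid inside
// the function and closed once readBytesThrough returns.
val readable: geny.Readable = "hello world" // via the built-in String support
val prefix = readable.readBytesThrough { is =>
  val buf = new Array[Byte](5)
  val n = is.read(buf)
  new String(buf, 0, n, "UTF-8")
}
// prefix == "hello"; the stream and any backing resources are now released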

Readable is supported by the following built-in types:

  • String

  • Array[Byte]

  • java.io.InputStream

It is implemented by libraries such as Requests-Scala, whose requests.get.stream returns a Readable (as shown below).

And is accepted by the following libraries:

  • uPickle: upickle.default.read, upickle.default.readBinary, ujson.read, and upack.read all support Readable

  • FastParse: fastparse.parse accepts parsing streaming input from any Readable

Readable can be used to allow handling of streaming input, e.g. parsing JSON directly from a file or HTTP request, without needing to buffer the whole file in memory:

@ val data = ujson.read(requests.get.stream("https://api.github.com/events"))
data: ujson.Value.Value = Arr(
  ArrayBuffer(
    Obj(
      LinkedHashMap(
        "id" -> Str("11169088214"),
        "type" -> Str("PushEvent"),
        "actor" -> Obj(
...

You can also implement Readable in your own data types, to allow them to be seamlessly passed into uPickle or FastParse to be parsed in a streaming fashion.
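
A minimal sketch, assuming readBytesThrough is the only member you need to implement (writeBytesTo has the default shown above):

import java.io.{ByteArrayInputStream, InputStream}
import java.nio.charset.StandardCharsets

// A hypothetical wrapper exposing an in-memory JSON payload as a
// Readable, so it can be handed straight to a streaming parser.
case class JsonBlob(bytes: Array[Byte]) extends geny.Readable {
  def readBytesThrough[T](f: InputStream => T): T = {
    val is = new ByteArrayInputStream(bytes)
    try f(is)
    finally is.close() // release whatever backs the stream once the callback is done
  }
}

// e.g. ujson.read(JsonBlob("""{"hello": 1}""".getBytes(StandardCharsets.UTF_8)))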

Note that in exchange for the reduced memory usage, parsing streaming data via Readable in uPickle or FastParse typically comes with a 20-40% CPU performance penalty over parsing data already in memory, due to the additional book-keeping necessary with streaming data. Whether it is worthwhile or not depends on your particular usage pattern.

Changelog

1.1.1 - 2024-06-14

  • Implement .grouped and .sliding operators

1.1.0 - 2024-04-14

  • Support for Scala-Native 0.5.0

  • Minimum version of Scala 3 increased from 3.1.3 to 3.3.1

  • Minimum version of Scala 2 increased from 2.11.x to 2.12.x

1.0.0 - 2022-09-15

  • Support Semantic Versioning

  • Removed deprecated API

0.7.1 - 2022-01-23

  • Support Scala Native for Scala 3

0.7.0 - 2021-12-10

Re-release of 0.6.11

Older Versions

0.6.11 - 2021-11-26

  • Add httpContentType to inputStreamReadable

  • Improved Build and CI setup

  • Added MiMa checks

0.6.10 - 2021-05-14

  • Add support for Scala 3.0.0

0.6.9 - 2021-04-28

  • Add support for Scala 3.0.0-RC3

0.6.8 - 2021-04-28

  • Add support for Scala 3.0.0-RC2

0.6.4

  • Scala-Native 0.4.0 support

0.6.2

  • Improve performance of writing small strings via StringWritable

0.5.0

  • Improve streaming of InputStreams to OutputStreams by dynamically sizing the transfer buffer.

0.4.2

  • Standardize geny.Readable as well

0.2.0

0.1.8

  • Support for Scala 2.13.0 final

0.1.6 - 2019-01-15

  • Add scala-native support

0.1.5

  • Add .withFilter

0.1.4

  • Add .collect, .collectFirst, .headOption methods

0.1.3

  • Allow calling .count() without a predicate to count the total number of items in the generator

0.1.2

  • Add .reduce, .fold, .sum, .product, .min, .max, .minBy, .maxBy

  • Rename .fromIterable to .from, make it also take Iterators

0.1.1

  • Publish for Scala 2.12.0

0.1.0

  • First release