Scalding is a Scala library that makes it easy to specify Hadoop MapReduce jobs. Scalding is built on top of Cascading, a Java library that abstracts away low-level Hadoop details. Scalding is comparable to Pig, but offers tight intergation with Scala, bringing advantages of Scala to your MapReduce jobs.
Current version: 0.8.2
Hadoop is a distributed system for counting words. Here is how it's done in Scalding.
package com.twitter.scalding.examples
import com.twitter.scalding._
class WordCountJob(args : Args) extends Job(args) {
TextLine( args("input") )
.flatMap('line -> 'word) { line : String => tokenize(line) }
.groupBy('word) { _.size }
.write( Tsv( args("output") ) )
// Split a piece of text into individual words.
def tokenize(text : String) : Array[String] = {
// Lowercase each word and remove punctuation.
text.toLowerCase.replaceAll("[^a-zA-Z0-9\\s]", "").split("\\s+")
}
}
Notice that the tokenize
function, which is standard Scala, integrates naturally with the rest of the MapReduce job. This is a very powerful feature of Scalding. (Compare it to the use of UDFs in Pig.)
You can find more example code under examples/. If you're interested in comparing Scalding to other languages, see our Rosetta Code page, which has several MapReduce tasks in Scalding and other frameworks (e.g., Pig and Hadoop Streaming).
- Getting Started page on the Scalding Wiki
- Runnable tutorials in the source.
- The API Reference, including many example Scalding snippets:
- Scalding Scaladocs provide details beyond the API References
- The Matrix Library provides a way of working with key-attribute-value scalding pipes:
- The Introduction to Matrix Library contains an overview and a "getting started" example
- The Matrix API Reference contains the Matrix Library API reference with examples
- Install sbt 0.11.3 (sorry, but the assembly plugin is sbt version dependent).
sbt update
(takes 2 minutes or more)sbt test
sbt assembly
(needed to make the jar used by the scald.rb script)
The test suite takes a while to run. When you're in sbt, here's a shortcut to run just one test:
> test-only com.twitter.scalding.FileSourceTest
We use Travis CI to verify the build:
The current version is 0.8.2 and is available from maven central: org="com.twitter", artifact="scalding_2.9.2".
Currently we are using the cascading-user mailing list for discussions: http://groups.google.com/group/cascading-user
In the remote possibility that there exist bugs in this code, please report them to: https://github.com/twitter/scalding/issues
Follow @Scalding on Twitter for updates.
- Avi Bryant http://twitter.com/avibryant
- Oscar Boykin http://twitter.com/posco
- Argyris Zymnis http://twitter.com/argyris
Thanks for assistance and contributions:
- Chris Wensel http://twitter.com/cwensel
- Ning Liang http://twitter.com/ningliang
- Dmitriy Ryaboy http://twitter.com/squarecog
- Dong Wang http://twitter.com/dongwang218
- Edwin Chen http://twitter.com/edchedch
- Sam Ritchie http://twitter.com/sritchie09
- Flavian Vasile http://twitter.com/flavianv
Copyright 2012 Twitter, Inc.
Licensed under the Apache License, Version 2.0: http://www.apache.org/licenses/LICENSE-2.0