Apache Pig Features
The Hadoop Plugin comes with features that should make it much easier for you to quickly run and debug Pig scripts.
The main one is the ability to quickly run Pig scripts on your Hadoop cluster through a gateway machine. Having a gateway machine is a common setup for Hadoop clusters, especially secure clusters (this is the setup we have at LinkedIn).
The tasks the Plugin generates will build your project, rsync your Pig script and all of its transitive runtime dependencies to the gateway, and execute your script for you.
(Since version 0.3.9) If for some reason you need to disable the Plugin, you can pass `-PdisablePigPlugin` on the Gradle command line or add `disablePigPlugin=true` to your `gradle.properties` file.
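For example, a one-off build with the Plugin disabled might look like the following (assuming you invoke Gradle through the wrapper; the `build` task here is just illustrative):
./gradlew build -PdisablePigPlugin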
If you are using the Hadoop Plugin at LinkedIn, you don't need to do anything to set it up! By default, the Plugin comes set up out-of-the-box to run Pig scripts on the cluster gateway.
In the out-of-the-box setup, the Plugin will use the directory `~/.hadoopPlugin` on your local box and the directory `/export/home/${user.name}/.hadoopPlugin` on the gateway for its working files.
The Plugin will automatically rsync all of the dependencies in the `hadoopRuntime` Gradle dependency configuration to the gateway before it executes your script. If you want to use another dependency configuration instead, you can tell the Plugin which configuration to use in the `.pigProperties` file described below.
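For reference, here is a minimal sketch of what declaring dependencies in the `hadoopRuntime` configuration might look like in your `build.gradle`; the artifact coordinates and project name below are placeholders, not something the Plugin requires:
// Illustrative only: placeholder coordinates for jars your Pig script needs at runtime
dependencies {
    hadoopRuntime 'org.apache.pig:piggybank:0.15.0'  // example third-party UDF jar
    hadoopRuntime project(':my-udfs')                 // example sibling project containing your own UDFs
}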
If you wish to customize the setup, add a file called `.pigProperties` to your Hadoop Plugin project with the following:
# Example custom setup that runs on the cluster gateway at LinkedIn. In the file <projectDir>/.pigProperties:
dependencyConf=hadoopRuntime
pigCacheDir=/home/your-user-name/.hadoopPlugin
pigCommand=/export/apps/pig/latest/bin/pig
pigOptions=
remoteCacheDir=/export/home/your-user-name/.hadoopPlugin
remoteHostName=theGatewayNode.linkedin.com
remoteSshOpts=-q -K
We recommend adding this file to your project's `.gitignore`, so that every developer on the project can have their own `.pigProperties` file.
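For example, the `.gitignore` entry is just one line:
# In <projectDir>/.gitignore
.pigProperties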
- `dependencyConf` - Specifies which Gradle configuration will be used to determine your transitive jar dependencies. By default, this is set to `hadoopRuntime`. All dependencies in the specified configuration will be rsync'd to the remote machine and passed to Pig through its `pig.additional.jars` command line property. This line can also be deleted if your project has no runtime dependencies.
- `pigCacheDir` - Directory on your local machine that will be rsync'd to the remote machine
- `pigCommand` - Command that runs Apache Pig
- `pigOptions` - Additional options to pass to Pig
- `remoteCacheDir` - Directory on the remote machine that will be rsync'd with your local machine
- `remoteHostName` - Name of the remote machine on which you will run Pig
- `remoteSshOpts` - `ssh` options that will be used when logging into the remote machine
If you have a local install of Apache Pig, you can easily set up the Plugin to call your local install of Pig instead of Pig on a remote machine.
# Example setup to run on a local install of Pig. In the file <projectDir>/.pigProperties:
dependencyConf=hadoopRuntime
pigCacheDir=/home/abain/.hadoopPlugin
pigCommand=/home/abain/pig-0.12.0/bin/pig
pigOptions=-Dudf.import.list=org.apache.pig.builtin.:oink.:com.linkedin.pig.:com.linkedin.pig.date. -x local
# Must blank these out to override the out of the box LinkedIn setup for the cluster gateway
remoteCacheDir=
remoteHostName=
remoteSshOpts=
Note that in the local setup, you must specify the default UDF imports yourself (with `-Dudf.import.list`), as this is something the `bin/pig` script on the cluster gateway does for you. Additionally, you must specify `udf.import.list` before specifying `-x local`, as Pig is sensitive to the order of its arguments.
If you have a local install of Hadoop and HDFS, you might run with different arguments, such as `-x mapreduce`.
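For instance, with a local Hadoop and HDFS install, the `pigOptions` line in your `.pigProperties` might look something like this (the import list here is trimmed to the Pig built-ins and is purely illustrative):
pigOptions=-Dudf.import.list=org.apache.pig.builtin. -x mapreduce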
Now that you have the Plugin set up, you can quickly run parameter-less Pig scripts with the `run_your_script_name.pig` tasks, or run Pig scripts that require parameters by configuring them with the Hadoop DSL.
In your Hadoop Plugin project, one of these tasks is generated for each `.pig` file found (recursively) under `<projectDir>/src/main`. These tasks run the given Pig script with NO Pig parameters and NO special JVM properties. The task does the following:
- Copies the Pig scripts under `<projectDir>/src/main`, the dependencies for the configuration `${dependencyConf}`, and the directory `<projectDir>/resources` (if it exists) to the local directory `${pigCacheDir}`
- Generates the script `${pigCacheDir}/run_your_script_name.sh` (that executes the following steps in this list)
- Creates the directory `${remoteCacheDir}` on the host `${remoteHostName}` if it does not already exist
- rsync's the directory `${pigCacheDir}` to `${remoteCacheDir}`
- ssh's into `${remoteHostName}`, changes the directory to `${remoteCacheDir}`, and executes `${pigCommand}` on the host `${remoteHostName}` with your script, passing all the jar dependencies that were rsync'd to `${remoteCacheDir}` as `-Dpig.additional.jars=${remoteCacheDir}/*.jar`
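For example, if your project contained a script at `<projectDir>/src/main/pig/count_by_country.pig` (a hypothetical name), the Plugin would generate a task you could run with something like:
./gradlew run_count_by_country.pig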
You should occasionally blow away `${pigCacheDir}` and `${remoteCacheDir}`, or they will keep getting bigger. If you get stuck in a series of errors where it seems like you have the wrong jar dependencies (like your script is calling old versions of your UDFs), you might try this as well.
The `showPigJobs` task will show all Pig jobs configured with the Hadoop DSL. Configuring a Pig job with the DSL allows you to specify Pig parameters and JVM properties to be used with your script.
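For example:
./gradlew showPigJobs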
The `runPigJob` task runs a Pig job configured with the Hadoop DSL. You must pass the fully qualified name of the Pig job you want to run as the job name. The fully qualified job names may not be obvious, so run the `showPigJobs` task to see them.
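A typical invocation might look like the following; the `-Pjob` property name and the fully qualified job name shown here are illustrative assumptions, so check the output of `showPigJobs` for the exact names your project uses:
./gradlew runPigJob -Pjob=hadoop.countFlow.countByCountry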
This task will execute the same steps as the `run_your_script_name.pig` tasks, except that it will pass the Pig parameters and JVM properties configured in the Hadoop DSL on the command line when it executes Pig. To see examples of how to configure Pig jobs with the Hadoop DSL, see the Hadoop DSL Language Reference.
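As a rough sketch only (the workflow name, job name, script path, and parameters below are placeholders, and the authoritative syntax lives in the Hadoop DSL Language Reference), a Pig job configuration might look something like this in one of your project's Gradle files:
hadoop {
  buildPath "azkaban"
  workflow('countFlow') {
    pigJob('countByCountry') {
      // Placeholder script path and Pig parameters, for illustration only
      uses 'src/main/pig/count_by_country.pig'
      set parameters: [
        'inputPath': '/data/events',
        'outputPath': '/user/abain/count_output'
      ]
    }
    targets 'countByCountry'
  }
}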
You should occasionally blow away the local directory `~/.hadoopPlugin` and the directory `${remoteCacheDir}` on the gateway, or they will keep getting bigger. If you get stuck in a series of errors where it seems like you have the wrong dependencies, you might try this as well.
When you run on the gateway, you will see the message `tcgetattr: Invalid argument` go by. This message is safe to ignore.