Apache Pig Features

The Hadoop Plugin comes with features that should make it much easier for you to quickly run and debug Pig scripts.

The main one is the ability to quickly run Pig scripts on your Hadoop cluster through a gateway machine. Having a gateway machine is a common setup for Hadoop clusters, especially secure clusters (this is the setup we have at LinkedIn).

The Plugin generates tasks that build your project, rsync your Pig script and all of its transitive runtime dependencies to the gateway, and execute the script for you.

(Since version 0.3.9) If for some reason you need to disable the Plugin, you can pass -PdisablePigPlugin on the Gradle command line or add disablePigPlugin=true to your gradle.properties file.
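
For example, either of the following disables the Pig features (the build task below is just an illustrative target; any Gradle invocation works):

# On the Gradle command line:
./gradlew build -PdisablePigPlugin

# Or in <projectDir>/gradle.properties:
disablePigPlugin=true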

Set Up the Plugin

If you are using the Hadoop Plugin at LinkedIn, you don't need to do anything to set it up! By default, the Plugin comes configured out-of-the-box to run Pig scripts on the cluster gateway.

In the out-of-the-box setup, the Plugin will use the directory ~/.hadoopPlugin on your local box and the directory /export/home/${user.name}/.hadoopPlugin on the gateway for its working files.

The Plugin will automatically rsync all of the dependencies in the hadoopRuntime Gradle dependency configuration to the gateway before it executes your script. If you want to use another dependency configuration instead, you can specify it in the .pigProperties file described below.
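
For example, a project might declare the jars its Pig scripts need (such as UDF jars) in the hadoopRuntime configuration of its build.gradle. The module and version names below are hypothetical:

# Example hadoopRuntime dependencies in <projectDir>/build.gradle (module and version names are hypothetical)
dependencies {
  hadoopRuntime project(':my-pig-udfs')
  hadoopRuntime 'com.example.analytics:pig-helpers:1.0.0'
}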

Customize the Plugin Setup

If you wish to customize the setup, in your Hadoop Plugin project add a file called .pigProperties with the following:

# Example custom setup that runs on the cluster gateway at LinkedIn. In the file <projectDir>/.pigProperties:
dependencyConf=hadoopRuntime
pigCacheDir=/home/your-user-name/.hadoopPlugin
pigCommand=/export/apps/pig/latest/bin/pig
pigOptions=
 
remoteCacheDir=/export/home/your-user-name/.hadoopPlugin
remoteHostName=theGatewayNode.linkedin.com
remoteSshOpts=-q -K

We recommend adding this file to your project's .gitignore, so that every developer on the project can have their own .pigProperties file.
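
For example, in <projectDir>/.gitignore:

# Ignore each developer's personal Pig plugin settings
.pigProperties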

  • dependencyConf - Specifies which Gradle configuration will be used to determine your transitive jar dependencies. By default, this is set to hadoopRuntime. All dependencies in this specified configuration will be rsync'd to the remote machine and passed to Pig through its pig.additional.jars command line property. This line can also be deleted if your project has no runtime dependencies.
  • pigCacheDir - Directory on your local machine that will be rsync'd to the remote machine
  • pigCommand - Command that runs Apache Pig
  • pigOptions - Additional options to pass to Pig
  • remoteCacheDir - Directory on the remote machine that will be rsync'd with your local machine
  • remoteHostName - Name of the remote machine on which you will run Pig
  • remoteSshOpts - ssh options that will be used when logging into the remote machine
Set Up the Plugin to Run Pig Locally

If you have a local install of Apache Pig, you can easily set up the Plugin to call your local install of Pig instead of running Pig on a remote machine.

# Example setup to run on a local install of Pig. In the file <projectDir>/.pigProperties:
dependencyConf=hadoopRuntime
pigCacheDir=/home/abain/.hadoopPlugin
pigCommand=/home/abain/pig-0.12.0/bin/pig
pigOptions=-Dudf.import.list=org.apache.pig.builtin.:oink.:com.linkedin.pig.:com.linkedin.pig.date. -x local

# Must blank these out to override the out of the box LinkedIn setup for the cluster gateway
remoteCacheDir=
remoteHostName=
remoteSshOpts=

Note that in the local setup, you must specify the default UDF imports yourself (with -Dudf.import.list), as this is something the bin/pig script on the cluster gateway does for you. Additionally, you must specify udf.import.list before specifying -x local, as Pig is sensitive to the order of its arguments.

If you have a local install of Hadoop and HDFS, you might run with different arguments, such as -x mapreduce.
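
For example, a hypothetical variation of the pigOptions line above that targets a local Hadoop install (note that udf.import.list still comes before -x):

pigOptions=-Dudf.import.list=org.apache.pig.builtin.:oink.:com.linkedin.pig.:com.linkedin.pig.date. -x mapreduce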

Run Pig Scripts on the Gateway

Now that you have the Plugin set up, you can quickly run parameter-less Pig scripts with the run_your_script_name.pig tasks, or run Pig scripts that require parameters by configuring them with the Hadoop DSL.

Generated run_your_script_name.pig Tasks

In your Hadoop Plugin project, one of these tasks is generated for each .pig file found (recursively) under <projectDir>/src/main. These tasks run the given Pig script with NO Pig parameters and NO special JVM properties (see the example invocation after the list below). The task does the following:

  • Copies the Pig scripts under <projectDir>/src/main, the dependencies for the configuration ${dependencyConf}, and the directory <projectDir>/resources (if it exists) to the local directory ${pigCacheDir}
  • Generates the script ${pigCacheDir}/run_your_script_name.sh (which performs the remaining steps in this list)
  • Creates the directory ${remoteCacheDir} on the host ${remoteHostName} if it does not already exist
  • rsyncs the directory ${pigCacheDir} to ${remoteCacheDir} on the host ${remoteHostName}
  • sshes into ${remoteHostName}, changes to the directory ${remoteCacheDir}, and executes ${pigCommand} with your script, passing all the jar dependencies that were rsync'd to ${remoteCacheDir} as -Dpig.additional.jars=${remoteCacheDir}/*.jar
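
For example, if your project contains a script at src/main/pig/count_clicks.pig (a hypothetical name), you would invoke the generated task roughly like this; run ./gradlew tasks to see the exact names of the tasks generated for your project:

# Hypothetical invocation of the generated task for src/main/pig/count_clicks.pig
./gradlew run_count_clicks.pig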

You should occasionally blow away ${pigCacheDir} and ${remoteCacheDir}, or they will keep getting bigger. If you get stuck in a series of errors where it seems like you have the wrong jar dependencies (such as your script calling old versions of your UDFs), you might try this as well.

showPigJobs Task

This task will show all Pig jobs configured with the Hadoop DSL. Configuring a Pig job with the DSL will allow you to specify Pig parameters and JVM properties to be used with your script.
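
As a rough sketch (all names below are hypothetical; see the Hadoop DSL Language Reference for the exact syntax, including how to set JVM properties), a Pig job with parameters might be declared like this in your workflow definition:

// Hypothetical Hadoop DSL sketch of a Pig job with parameters
workflow('countClicksFlow') {
  pigJob('countClicks') {
    uses 'src/main/pig/count_clicks.pig'
    set parameters: [
      'inputPath'  : '/data/tracking/clicks',
      'outputPath' : '/user/abain/click-counts'
    ]
  }
}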

runPigJob -PjobName=<jobName> Task

This task runs a Pig job configured with the Hadoop DSL.

You must pass the fully qualified name of the Pig job you want to run as the job name. The fully qualified job names may not be obvious, so run the showPigJobs task to see them.
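
For example, if showPigJobs lists a fully qualified name like countClicksFlow_countClicks (hypothetical; the actual naming scheme may differ), you would run:

# Hypothetical invocation; use a fully qualified name reported by showPigJobs
./gradlew runPigJob -PjobName=countClicksFlow_countClicks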

This task will execute the same steps as the run_your_script_name.pig tasks, except that it will pass the Pig parameters and JVM properties configured in the Hadoop DSL on the command line when it executes Pig. To see examples of how to configure Pig jobs with the Hadoop DSL, see the Hadoop DSL Language Reference.

Known Issues

You should occasionally blow away the local directory ~/.hadoopPlugin and the directory ${remoteCacheDir} on the gateway, or they will keep getting bigger. If you get stuck in a series of errors where it seems like you have the wrong dependencies, you might try this as well.
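
A minimal cleanup, assuming the default directory locations and the hypothetical gateway host name from the examples above, might look like this:

# Remove the local cache directory
rm -rf ~/.hadoopPlugin

# Remove the remote cache directory on the gateway (default is /export/home/<your-user-name>/.hadoopPlugin)
ssh theGatewayNode.linkedin.com 'rm -rf ~/.hadoopPlugin'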

When you run on the gateway, you will see the message tcgetattr: Invalid argument go by. This message is safe to ignore.