-
Notifications
You must be signed in to change notification settings - Fork 31
Convert the rows in an Excel document to CSV
This is a MapReduce application demonstrating some of the capabilities of the hadoopoffice library. It takes as input a set of files in Excel format (.xls,.xlsx) on HDFS containing. As output it converts them to a CSV file / sheet. It has successfully been tested with the Cloudera Quickstart VM 5.5, but other Hadoop distributions should work equally well.
You can create yourself an Excel document in LibreOffice or Microsoft Excel. Alternatively, you can download an Excel document that is used for unit testing of hadoopoffice library by executing the following command:
wget --no-check-certificate https://github.com/ZuInnoTe/hadoopoffice/raw/master/fileformat/src/test/resources/excel2013test.xlsx
You can put it on your HDFS cluster by executing the following commands:
hadoop fs -mkdir -p /user/office/excel/input
hadoop fs -put ./excel2013test.xlsx /user/office/excel/input
After it has been copied you are ready to use the example.
Note the fileformat is available on Maven Central and you do NOT need to build and publish the Hadoop File Format anymore.
Execute
git clone https://github.com/ZuInnoTe/hadoopoffice.git hadoopoffice
You can build the application by changing to the directory hadoopoffice/examples/mapreduce-excelinput and using the following command:
../gradlew clean build
Make sure that the output directory is clean:
hadoop fs -rm -R /user/office/output
Execute the following command:
hadoop jar ./build/libs/example-ho-mr-excelinput-0.1.0.jar /user/office/excel/input /user/office/output
After the map/reduce job has completed, you find the result (ie the Excel input file converted into a CSV output file) in /user/cloudera/office/output. You can display it using the following command:
hadoop fs -cat /user/office/output/part-r-00000
Keep in mind that CSV requires that each row should have all the columns, which is not the case for Excel. Hence, you need to have an Excel sheet with all columns filled for each row or you need to make sure in your application that each row has all the columns. The example can be easily extended to fulfil this requirement.
Find here further configuration options of the HadoopOffice library, such as encryption, decryption, locale, meta data filter, linked workbooks and filtering by sheets.