spark-xml-utils

This site offers background on using the spark-xml-utils library within an Apache Spark application, along with some Scala examples that leverage XPath, XSLT, and XQuery. I have modified the spark-xml-utils APIs in this new version. The previous version was really a work in progress, whereas the new version incorporates some of the experience I have gained with both Spark and the spark-xml-utils package. My hope is that the new version will be simpler to use and more performant. As time permits, I plan to optimize the implementation and add some features.

Spark-xml-utils is not meant for processing a single large XML record that is gigabytes in size. However, if you have many XML records (we have millions), each in the megabyte range or smaller, then this should be a handy tool.

The javadoc for spark-xml-utils is available and may help in understanding how the classes interact.

Motivation

The spark-xml-utils library was developed because there is a large amount of XML in our big datasets and I felt this data could be better served by providing some helpful XML utilities. These include the ability to filter documents based on an XPath expression, return specific nodes for an XPath/XQuery expression, or transform documents using an XSLT stylesheet. By providing some basic wrappers around Saxon, the spark-xml-utils library exposes basic XPath, XQuery, and XSLT functionality that can readily be leveraged by any Spark application.

Examples

The basic examples included only scratch the surface of what is possible with spark-xml-utils and XPath, XQuery, and XSLT. I have used spark-xml-utils to transform millions of XML documents to JSON and HTML, to perform a simple batch search against millions of XML documents, and more. Some more complex examples are available to further showcase the power of spark-xml-utils.
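To give a feel for the kind of XML-to-JSON transformation mentioned above, here is a minimal sketch. It does not use the spark-xml-utils API. It relies only on the JDK's built-in `javax.xml.transform` classes (XSLT 1.0) as a stand-in. The object name `XsltToJsonSketch`, the sample record, and the stylesheet are all hypothetical and exist only for illustration.

```scala
import java.io.{StringReader, StringWriter}
import javax.xml.transform.TransformerFactory
import javax.xml.transform.stream.{StreamResult, StreamSource}

// Illustration only: uses the JDK XSLT 1.0 processor, not spark-xml-utils.
object XsltToJsonSketch {

  // Hypothetical stylesheet: emits a one-field JSON object for a <doc> record.
  // It performs no JSON escaping, so it only suits titles without quotes or backslashes.
  val stylesheet: String =
    """<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
      |  <xsl:output method="text"/>
      |  <xsl:template match="/doc">{"title":"<xsl:value-of select="title"/>"}</xsl:template>
      |</xsl:stylesheet>""".stripMargin

  // Applies the stylesheet to one XML string and returns the result as a string.
  def transform(xml: String, xslt: String): String = {
    val transformer = TransformerFactory.newInstance()
      .newTransformer(new StreamSource(new StringReader(xslt)))
    val out = new StringWriter()
    transformer.transform(new StreamSource(new StringReader(xml)), new StreamResult(out))
    out.toString
  }

  def main(args: Array[String]): Unit =
    println(transform("<doc><title>Spark</title></doc>", stylesheet))
}
```

In a Spark job, a function like `transform` would typically be applied to the XML value of each record. Compiling the stylesheet once per partition rather than once per record is the usual way to keep that affordable.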

The sequence file used in all of the examples is publicly available at s3://spark-xml-utils/xml/part*. In the sequence file, the key is a unique identifier for the record and the value is the XML (as a string). This should allow you to try out the examples as well as experiment with your own expressions.
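The key/value layout described above lends itself to filtering records by an XPath expression and pulling values out of the ones that match. The sketch below shows that idea on a small in-memory collection. It is not the spark-xml-utils API: it uses the JDK's `javax.xml.xpath` classes (XPath 1.0) as a stand-in, and the object name `XPathFilterSketch`, the record keys, and the `<doc>` structure are hypothetical.

```scala
import java.io.StringReader
import javax.xml.xpath.{XPathConstants, XPathFactory}
import org.xml.sax.InputSource

// Illustration only: uses the JDK XPath 1.0 engine, not spark-xml-utils.
object XPathFilterSketch {

  // Hypothetical records shaped like the sequence file: (unique key, XML string).
  val records: Seq[(String, String)] = Seq(
    "rec-1" -> "<doc><title>Spark</title><year>2015</year></doc>",
    "rec-2" -> "<doc><title>Hadoop</title><year>2011</year></doc>"
  )

  // True when the expression evaluates to boolean true for the given XML string.
  def matches(xml: String, expression: String): Boolean = {
    val xpath = XPathFactory.newInstance().newXPath()
    val result = xpath.evaluate(
      expression, new InputSource(new StringReader(xml)), XPathConstants.BOOLEAN)
    result.asInstanceOf[java.lang.Boolean].booleanValue()
  }

  // String value of the expression (for a node set, the first node's text).
  def evaluateString(xml: String, expression: String): String = {
    val xpath = XPathFactory.newInstance().newXPath()
    xpath.evaluate(expression, new InputSource(new StringReader(xml)))
  }

  // Keys of the records whose XML satisfies the expression.
  def matchingKeys(expression: String): Seq[String] =
    records.filter { case (_, xml) => matches(xml, expression) }.map(_._1)

  def main(args: Array[String]): Unit = {
    println(matchingKeys("/doc[year > 2012]"))
    println(records.map { case (key, xml) => key -> evaluateString(xml, "/doc/title") })
  }
}
```

With Spark, the same predicate shape would apply to an RDD of key/value pairs read from the sequence file, with `filter` and `map` running on the cluster instead of on a local `Seq`.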

Maven Coordinate

<dependency> 
	<groupId>com.elsevier</groupId>
	<artifactId>spark-xml-utils</artifactId>
	<version>1.8.0</version>
</dependency>
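
For Scala projects built with sbt, the same coordinate can be written as a single setting. This is a direct translation of the Maven dependency above, assuming the artifact resolves from a repository your build already has configured. A single `%` is used because the artifact name carries no Scala version suffix.

```scala
// build.sbt: same group, artifact, and version as the Maven dependency
libraryDependencies += "com.elsevier" % "spark-xml-utils" % "1.8.0"
```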
