Command-line tool to convert OpenData XML from Microsoft MarketPlace to Apache Avro files

datadudes/odata2avro

odata2avro

odata2avro is a Python command-line tool that automatically converts OData datasets to Avro. Combined with standard Hadoop tooling, it is intended to make ingesting OData data from Microsoft Azure DataMarket into Hadoop simple.

Usage:

$ odata2avro ODATA_XML AVRO_SCHEMA AVRO_FILE

This command reads data from ODATA_XML and creates two files: AVRO_SCHEMA and AVRO_FILE. The Avro schema is in JSON format.
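To give a feel for what such a conversion involves, here is a minimal sketch of deriving an Avro schema (as JSON) from an OData Atom feed using only the Python standard library. It is not odata2avro's actual implementation: the function name `infer_avro_schema`, the partial `EDM_TO_AVRO` type mapping, the choice to make every field nullable, and the sample feed with its `Brand` and `Doors` properties are all assumptions made for illustration.

```python
import json
import xml.etree.ElementTree as ET

# XML namespaces used by OData (v2) Atom feeds.
NS = {
    "atom": "http://www.w3.org/2005/Atom",
    "d": "http://schemas.microsoft.com/ado/2007/08/dataservices",
    "m": "http://schemas.microsoft.com/ado/2007/08/dataservices/metadata",
}

# Hypothetical, partial mapping from EDM types to Avro primitive types.
# Anything not listed here falls back to "string".
EDM_TO_AVRO = {
    "Edm.Int32": "int",
    "Edm.Int64": "long",
    "Edm.Double": "double",
    "Edm.Boolean": "boolean",
}


def infer_avro_schema(odata_xml, record_name):
    """Build an Avro record schema (as a dict) from the first feed entry."""
    root = ET.fromstring(odata_xml)
    props = root.find("atom:entry/atom:content/m:properties", NS)
    fields = []
    if props is not None:
        for prop in props:
            # Strip the "{namespace}" prefix from the element tag.
            name = prop.tag.split("}", 1)[-1]
            # Properties without an m:type attribute are treated as strings.
            edm_type = prop.get("{%s}type" % NS["m"], "Edm.String")
            avro_type = EDM_TO_AVRO.get(edm_type, "string")
            # Assumption: every field is nullable.
            fields.append({"name": name, "type": ["null", avro_type]})
    return {"type": "record", "name": record_name, "fields": fields}


# Made-up sample feed; real DataMarket feeds carry more metadata per entry.
SAMPLE_FEED = """\
<feed xmlns="http://www.w3.org/2005/Atom"
      xmlns:d="http://schemas.microsoft.com/ado/2007/08/dataservices"
      xmlns:m="http://schemas.microsoft.com/ado/2007/08/dataservices/metadata">
  <entry>
    <content type="application/xml">
      <m:properties>
        <d:Brand>VOLVO</d:Brand>
        <d:Doors m:type="Edm.Int32">4</d:Doors>
      </m:properties>
    </content>
  </entry>
</feed>
"""

schema = infer_avro_schema(SAMPLE_FEED, "cars")
print(json.dumps(schema, indent=2))
```

The sketch only inspects the first entry and only handles a few primitive types; a real converter also has to write the records themselves into the Avro data file.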

Example: Ingest data from Azure DataMarket to Hive/Impala

# Download OData data in XML format
$ curl 'https://api.datamarket.azure.com/opendata.rdw/VRTG.Open.Data/v1/KENT_VRTG_O_DAT?$top=100' > cars.xml

# Convert data to Avro
$ odata2avro cars.xml cars.avsc cars.avro

# Upload to HDFS
$ hdfs dfs -put cars.avro cars.avsc /tmp

# Create Avro-backed Hive table using Avro schema stored in /tmp/cars.avsc
$ hive -e "
  CREATE TABLE cars
  ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
  STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
  TBLPROPERTIES ('avro.schema.url'='hdfs:///tmp/cars.avsc');"

# Load data from /tmp/cars.avro to the cars table
$ hive -e "LOAD DATA INPATH '/tmp/cars.avro' INTO TABLE cars"

# Query with Impala
$ impala-shell -i <impala-daemon-ip> -q "REFRESH cars; select count(*) from cars"
+----------+
| count(*) |
+----------+
|      100 |
+----------+
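Because the Avro schema is plain JSON, it can be inspected with the Python standard library before it is uploaded to HDFS. The following is a small sketch under stated assumptions: the helper `summarize_avro_schema` is hypothetical and not part of odata2avro, and the demo schema with its `Brand` and `Doors` fields is made up rather than taken from the real cars dataset.

```python
import json
import os
import tempfile


def summarize_avro_schema(path):
    """Return (record_name, [field names]) for an Avro record schema file."""
    with open(path) as handle:
        schema = json.load(handle)
    if schema.get("type") != "record":
        raise ValueError("expected a top-level Avro record schema")
    return schema["name"], [field["name"] for field in schema["fields"]]


# Made-up schema standing in for a generated cars.avsc.
demo_schema = {
    "type": "record",
    "name": "cars",
    "fields": [
        {"name": "Brand", "type": ["null", "string"]},
        {"name": "Doors", "type": ["null", "int"]},
    ],
}

# Write the demo schema to a temporary file and read it back.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "cars.avsc")
    with open(demo_path, "w") as handle:
        json.dump(demo_schema, handle)
    record_name, field_names = summarize_avro_schema(demo_path)

print(record_name, field_names)
```

Pointing the helper at a real cars.avsc instead of the temporary file would list the columns that the Hive table is expected to expose.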

Installation:

pip install odata2avro

Contributions:

Please create an issue if you spot any problem or bug. We'll try to get back to you as soon as possible.

Authors:

Created with passion by Marcel and Daan.
