odata2avro is a Python command-line tool that automatically
converts OData datasets to Avro. Combined with standard
Hadoop tooling, it makes it simple to ingest OData data from
Microsoft Azure DataMarket into Hadoop.
$ odata2avro ODATA_XML AVRO_SCHEMA AVRO_FILE
This command reads data from ODATA_XML and creates two files:
AVRO_SCHEMA, the Avro schema in JSON format, and AVRO_FILE,
the Avro-encoded data.
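To give an idea of what the conversion involves: an OData Atom feed stores each record as typed `<d:...>` properties, which map naturally onto an Avro record schema. The sketch below derives a schema from a single entry using only the standard library. It is illustrative only — the EDM-to-Avro type mapping, the sample field names, and the `schema_from_entry` helper are assumptions, not the tool's actual implementation.

```python
import json
import xml.etree.ElementTree as ET

# Standard namespace URIs used by OData (v2/v3) Atom feeds.
NS = {
    "m": "http://schemas.microsoft.com/ado/2007/08/dataservices/metadata",
    "d": "http://schemas.microsoft.com/ado/2007/08/dataservices",
}

# Illustrative EDM -> Avro primitive type mapping; odata2avro's actual
# mapping may differ.
EDM_TO_AVRO = {"Edm.String": "string", "Edm.Int32": "int", "Edm.Double": "double"}

def schema_from_entry(xml_text, record_name):
    """Derive an Avro record schema from the first <m:properties> element."""
    root = ET.fromstring(xml_text)
    props = root.find(".//m:properties", NS)
    fields = []
    for prop in props:
        # Strip the namespace prefix from the tag to get the property name.
        name = prop.tag.split("}", 1)[1]
        edm_type = prop.get("{%s}type" % NS["m"], "Edm.String")
        avro_type = EDM_TO_AVRO.get(edm_type, "string")
        # Union with "null" so missing values stay representable.
        fields.append({"name": name, "type": ["null", avro_type]})
    return {"type": "record", "name": record_name, "fields": fields}

# A tiny hand-written entry standing in for real DataMarket output.
sample = """<entry xmlns="http://www.w3.org/2005/Atom"
  xmlns:m="http://schemas.microsoft.com/ado/2007/08/dataservices/metadata"
  xmlns:d="http://schemas.microsoft.com/ado/2007/08/dataservices">
  <content type="application/xml">
    <m:properties>
      <d:Kenteken m:type="Edm.String">ABC123</d:Kenteken>
      <d:Cilinderinhoud m:type="Edm.Int32">1600</d:Cilinderinhoud>
    </m:properties>
  </content>
</entry>"""

print(json.dumps(schema_from_entry(sample, "cars"), indent=2))
```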
# Download OData data in XML format
$ curl 'https://api.datamarket.azure.com/opendata.rdw/VRTG.Open.Data/v1/KENT_VRTG_O_DAT?$top=100' > cars.xml
# Convert data to Avro
$ odata2avro cars.xml cars.avsc cars.avro
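Before pushing the output to HDFS, you can sanity-check that the conversion produced a valid Avro container: per the Avro specification, object container files begin with the 4-byte magic `b"Obj\x01"`. A minimal stdlib sketch (the `looks_like_avro` helper and the probe file are illustrative, not part of odata2avro):

```python
import os
import tempfile

def looks_like_avro(path):
    """Check the 4-byte magic that opens every Avro object container file."""
    with open(path, "rb") as f:
        return f.read(4) == b"Obj\x01"

# Demonstrate on a throwaway file carrying just the magic bytes.
tmp = tempfile.NamedTemporaryFile(delete=False, suffix=".avro")
tmp.write(b"Obj\x01" + b"\x00" * 16)
tmp.close()
print(looks_like_avro(tmp.name))  # True
os.unlink(tmp.name)
```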
# Upload to HDFS
$ hdfs dfs -put cars.avro cars.avsc /tmp
# Create Avro-backed Hive table using Avro schema stored in /tmp/cars.avsc
$ hive -e "
CREATE TABLE cars
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
TBLPROPERTIES ('avro.schema.url'='hdfs:///tmp/cars.avsc');"
# Load data from /tmp/cars.avro to the cars table
$ hive -e "LOAD DATA INPATH '/tmp/cars.avro' INTO TABLE cars"
# Query with Impala
$ impala-shell -i <impala-daemon-ip> -q "REFRESH cars; select count(*) from cars"
+----------+
| count(*) |
+----------+
| 100      |
+----------+
$ pip install odata2avro
Please create an issue if you spot a problem or bug; we'll try to get back to you as soon as possible.