This project brings in-DBMS data analytics to Impala. This leverages previous work done by two projects:
- MADlib (http://madlib.net/)
- Bismarck (http://hazy.cs.wisc.edu/hazy/victor/bismarck/)
Each of these projects use User Defined Aggregates (UDAs) to train analytic models using an existing DBMS's data management and processing ability.
Dependencies: yum install -y eigen3-devel.noarch
Also, install boost 1.54.0
This code base includes the following components.
There is a fork of MADlib 1.0 which has been modified for use with impala. The specific changes were:
- madlib/test with tests for the new code
- madlib/Makefile to make the tests
- madlib/src/ports/metaport which is a modified MADlib backend for main memory
To run the example SVM,
- Create database toysvm
- to register the UDFs with a database (without re-making the binaries), execute: python python/deploy.py -mp toysvm
- create a synthetic table of examples in the database toysvm with the table toy: python python/gen_classify_data.py toysvm toy
- python python/impala_svm.py lbl e0 e1 e2 --db toysvm --table toy -e 1
- impala-shell -q 'use toysvm; select iter, printarray(model) from history;'