
Neural-network native operations

Small Java lib with a few neural-network operations:

  • ReLU
  • ELU
  • linearForward (matrix-by-vector aka gemv)
  • linearBatchForward (matrix-by-matrix aka gemm)
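
For reference, the two activations are simple element-wise functions. A plain-Java sketch (the ELU alpha of 1.0 is an assumption, and these loops illustrate the math rather than the library's native code):

// Plain-Java sketch of the two element-wise activations, applied in place.
// The ELU alpha of 1.0 is an assumption; the library may use a different
// or configurable value.
public class Activations {
    static void relu(float[] v) {
        for (int i = 0; i < v.length; i++) {
            v[i] = Math.max(0f, v[i]);
        }
    }

    static void elu(float[] v) {
        final float alpha = 1.0f;
        for (int i = 0; i < v.length; i++) {
            if (v[i] < 0f) {
                v[i] = alpha * ((float) Math.exp(v[i]) - 1f);
            }
        }
    }
}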

Behind the scenes it uses the OpenBLAS native library, hence it can be up to an order of magnitude faster than a pure Java implementation.

It's also faster than the netlib-java package.

Full performance is achieved only with direct float buffers; since these are expensive to create, they should be reused across calls.
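
A minimal sketch of that reuse pattern, using only JDK classes. The commented-out call is a placeholder for the library's API, not its verified class or method name:

import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.FloatBuffer;

public class BufferReuse {
    // Allocate a direct float buffer in native byte order.
    static FloatBuffer direct(int floats) {
        return ByteBuffer.allocateDirect(floats * Float.BYTES)
                .order(ByteOrder.nativeOrder())
                .asFloatBuffer();
    }

    public static void main(String[] args) {
        int inputSize = 300, outputSize = 150;
        // Create the direct buffers once, up front -- creation is the
        // expensive part.
        FloatBuffer weights = direct(inputSize * outputSize);
        FloatBuffer input = direct(inputSize);
        FloatBuffer output = direct(outputSize);

        for (int i = 0; i < 1000; i++) {
            input.clear();
            // ...refill `input` here, then reuse the same buffers for
            // every native call instead of allocating new ones, e.g.:
            // NeuralNetworkNativeOps.gemv(weights, input, output);
        }
    }
}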

Building the library

We supply a Maven artifact precompiled for Linux on 64-bit Sandy Bridge processors, with the SSE and AVX instruction sets enabled but without AVX2.

To run on another processor or architecture, follow the steps below:

  1. Get OpenBLAS

  2. Compile it:

    • for a single-threaded application: make COMMON_OPT="-O3 -g -fno-omit-frame-pointer"
    • for a multi-threaded application:
      • make USE_THREAD=0 NUM_THREADS=500 COMMON_OPT="-O3 -g -fno-omit-frame-pointer" or
      • make USE_THREAD=0 NUM_THREADS=500 CC=gcc-7 HOSTCC=gcc-7 COMMON_OPT="-O3 -g -fno-omit-frame-pointer" TARGET=HASWELL

      (USE_THREAD=0 builds OpenBLAS itself single-threaded, so its threads do not compete with your application's; NUM_THREADS then sets the number of internal buffers available to concurrent callers.)

    (we use frame pointers to be able to produce flame graphs (http://techblog.netflix.com/2015/07/java-in-flames.html); you can disable them to save one register)

  3. Install it, using exactly this prefix (the build expects OpenBLAS under ~/OpenBLASlib):

    make PREFIX=~/OpenBLASlib install

  4. Finally, go to the neural-network-native-ops directory and run

    mvn clean compile exec:exec install

    The exec:exec goal runs JavaCPP post-processing to generate a C++ file and finally invokes the g++ compiler to produce the JNI lib (.so).

To use the Intel Math Kernel Library (MKL) instead of OpenBLAS, use the <!-- Intel MKL --> section in pom.xml instead of the <!-- OpenBLAS --> section.

Performance

Benchmarks were run on a desktop machine:

$ cat /proc/cpuinfo
...
model name	: Intel(R) Core(TM) i5-4460  CPU @ 3.20GHz
cpu MHz		: 3288.750
cache size	: 6144 KB
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm ida arat xsaveopt pln pts dtherm tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid
bogomips	: 6400.01

gemv

Results below (table and graph) show the performance ratio between the native and pure-Java implementations of matrix-by-vector multiplication (gemv); values above 1 mean the native version is faster.

input vector size →      1     10     20     50    100    200    500   1000   2000
output vector size ↓
    1                  0.1    0.1    0.2    0.5    0.9    1.7    2.8    4.2    5.3
   10                  0.3    0.6    1.1    2.4    4.7    6.4    9.8   10.4   10.7
   20                  0.5    0.9    1.8    3.8    7.2   10.1   12.3   14.6   12.2
   50                  1.3    1.3    2.7    5.0    8.3    9.3   11.9   11.5    9.9
  100                  1.9    1.7    3.3    5.7    6.5    8.5   13.1    9.7    9.8
  200                  2.8    1.9    3.7    5.5    8.8    8.7    9.5    9.9   10.0
  500                  4.2    2.1    4.1    5.9    8.6    8.0    9.7    9.4    9.3
 1000                  4.4    2.1    4.3    5.7    7.8    7.8    9.9    8.7    5.3
 2000                  5.4    2.1    4.3    4.9    7.5    8.0    8.9    4.8    3.6

(graph: gemv benchmarks vs pure Java)

The results show that native code pays off when at least one dimension exceeds 20. When both dimensions are between 50 and 1000, native multiplication outperforms pure Java by an order of magnitude. The best performance is observed when the input vector is large and the output vector is small. Surprisingly, when both dimensions exceed 1000, the gain starts to diminish.
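
For context, the pure-Java baseline in such comparisons is essentially a pair of nested loops. A sketch, assuming a row-major weight matrix (an illustration, not the repository's exact benchmark code):

// Plain-Java gemv sketch: y = A * x, with the matrix stored row-major
// as a flat float[] of length outputSize * inputSize. An illustration
// of the baseline being measured, not the repository's benchmark code.
public class PureJavaGemv {
    static void gemv(float[] a, float[] x, float[] y) {
        int outputSize = y.length;
        int inputSize = x.length;
        for (int row = 0; row < outputSize; row++) {
            float sum = 0f;
            int offset = row * inputSize;
            for (int col = 0; col < inputSize; col++) {
                sum += a[offset + col] * x[col];
            }
            y[row] = sum;
        }
    }
}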

comparison with netlib-java

nnno is faster than the popular netlib-java package: as you can see below, for dimensions up to 100 it can be twice as fast.

input vector size →      1     10     20     50    100    200    500   1000   2000
output vector size ↓
    1                  2.8    2.5    2.5    2.6    2.6    2.3    2.0    1.8    1.5
   10                  2.7    2.2    2.2    1.9    1.8    1.8    1.4    1.2    1.3
   20                  2.6    1.9    1.9    1.7    1.9    1.6    1.3    1.3    1.1
   50                  2.4    1.6    1.5    1.4    1.4    1.4    1.2    1.1    1.0
  100                  2.2    1.4    1.3    1.3    1.2    1.1    1.0    1.0    1.0
  200                  1.8    1.2    1.1    1.1    1.4    1.0    1.0    1.0    1.0
  500                  1.5    1.1    1.1    1.1    1.0    1.0    1.0    1.0    1.0
 1000                  1.3    1.0    1.0    1.0    1.0    1.0    1.0    1.0    1.0
 2000                  1.1    1.0    1.0    1.0    1.0    1.1    1.0    1.1    0.9

(graph: gemv benchmarks vs netlib-java)
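
For comparison, the netlib-java side of such a benchmark goes through the standard BLAS sgemv routine, roughly like the sketch below (column-major matrix per BLAS convention; an illustration, not the repository's exact benchmark code):

import com.github.fommil.netlib.BLAS;

// Sketch of a netlib-java gemv call computing y = A * x.
// Standard BLAS sgemv semantics: A is column-major, "N" = no transpose,
// alpha = 1 and beta = 0 reduce the general alpha*A*x + beta*y to A*x.
public class NetlibGemv {
    public static void main(String[] args) {
        int m = 150, n = 300;            // output and input sizes
        float[] a = new float[m * n];    // column-major m x n matrix
        float[] x = new float[n];
        float[] y = new float[m];
        BLAS.getInstance().sgemv("N", m, n, 1.0f, a, m, x, 1, 0.0f, y, 1);
    }
}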

linearForward

Here are the same ratios for the linearForward operation:

input vector size →      1     10     20     50    100    200    500   1000   2000
output vector size ↓
    1                  0.1    0.1    0.2    0.4    0.8    1.4    2.6    3.7    5.0
   10                  0.3    0.5    0.9    2.0    4.1    6.1   10.2    9.3   11.5
   20                  0.6    0.8    1.7    3.2    6.8    9.0   11.2   13.8   14.9
   50                  1.2    1.3    2.6    4.4    8.2    9.2   12.8   13.2    9.8
  100                  1.9    1.7    3.2    5.5    7.9   10.0   11.3    9.8    9.7
  200                  2.8    1.9    3.5    5.4    8.7    9.8    9.6   10.0    9.5
  500                  4.0    2.1    3.9    5.5    8.5    8.3    9.3   10.0    9.1
 1000                  4.5    2.1    4.0    5.3    7.5    8.6    9.9    9.3    4.7
 2000                  4.9    2.1    4.1    5.1    6.0    8.1    8.7    5.5    3.9

(graph: linearForward benchmarks vs pure Java)
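
Conceptually, linearForward is the gemv loop above plus a bias term, y = A*x + b. A plain-Java sketch (the library call's exact argument order and semantics are assumptions here):

// Plain-Java sketch of linearForward as gemv plus bias: y = A * x + b,
// with A stored row-major. The library's exact argument order and
// semantics are assumed, not verified.
public class LinearForward {
    static void linearForward(float[] a, float[] bias, float[] x, float[] y) {
        int outputSize = y.length;
        int inputSize = x.length;
        for (int row = 0; row < outputSize; row++) {
            float sum = bias[row];
            int offset = row * inputSize;
            for (int col = 0; col < inputSize; col++) {
                sum += a[offset + col] * x[col];
            }
            y[row] = sum;
        }
    }
}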

detailed benchmarks

Finally, some detailed benchmarks for the preselected dimensions 300x150. You can see that using heap float buffers slows native processing down dramatically.

# JMH 1.11.3 (released 42 days ago)
# VM version: JDK 1.8.0_45, VM 25.45-b02
# VM invoker: /opt/java/jdk1.8.0_45/jre/bin/java
# VM options: <none>
...
# Parameters: (inputSize = 300, outputSize = 150)
...
Benchmark                                (inputSize)  (outputSize)   Mode  Cnt         Score         Error  Units
NNNOBenchmark.pureJavaReLU                       300           150  thrpt    5  10322450,628 ±  575321,058  ops/s
NNNOBenchmark.nativeHeapReLU                     300           150  thrpt    5   2304726,292 ±  133278,375  ops/s
NNNOBenchmark.nativeDirectReLU                   300           150  thrpt    5  14078625,780 ± 2194065,899  ops/s

NNNOBenchmark.pureJavaGemv                       300           150  thrpt    5     24516,760 ±    3118,883  ops/s
NNNOBenchmark.nativeHeapGemv                     300           150  thrpt    5     29610,671 ±    3712,852  ops/s
NNNOBenchmark.nativeDirectGemv                   300           150  thrpt    5    295503,692 ±   30550,677  ops/s
NNNOBenchmark.netlibJavaGemv                     300           150  thrpt    5    280583,873 ±   16496,446  ops/s

NNNOBenchmark.pureJavaLinearForward              300           150  thrpt    5     24543,647 ±    1257,571  ops/s
NNNOBenchmark.nativeHeapLinearForward            300           150  thrpt    5     29064,199 ±    2509,292  ops/s
NNNOBenchmark.nativeDirectLinearForward          300           150  thrpt    5    302488,678 ±   26459,281  ops/s
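
The heap-versus-direct gap comes from the JNI boundary: native code can read and write a direct buffer's memory in place, while a heap buffer's contents generally have to be copied on each call. A minimal sketch of the two allocation styles, using only JDK classes:

import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.FloatBuffer;

public class BufferKinds {
    public static void main(String[] args) {
        // Heap buffer: backed by an ordinary Java float[]. JNI code
        // generally has to copy its contents to native memory and back,
        // which is why the nativeHeap* results above lag so badly.
        FloatBuffer heap = FloatBuffer.allocate(300 * 150);

        // Direct buffer: allocated outside the Java heap, so native
        // code can work on it in place, with no per-call copying.
        FloatBuffer direct = ByteBuffer.allocateDirect(300 * 150 * Float.BYTES)
                .order(ByteOrder.nativeOrder())
                .asFloatBuffer();

        System.out.println(heap.isDirect());   // prints: false
        System.out.println(direct.isDirect()); // prints: true
    }
}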

run benchmarks

Running all benchmarks takes more than an hour, hence they are turned off by default. To run them, type

mvn test -P benchmarks

Installation

Releases are distributed on Maven Central:

<dependency>
    <groupId>com.rtbhouse.model</groupId>
    <artifactId>neural-network-native-ops</artifactId>
    <version>${nnno.version}</version>
</dependency>
