12 changes: 12 additions & 0 deletions java/source/c_data.rst
@@ -0,0 +1,12 @@
.. _c-data-java:

==================
C Data Integration
==================

The C Data Interface is an important part of supporting multiple languages in Apache Arrow.
A Java program can work seamlessly with C++ and Python programs. The following examples
demonstrate how this can be done.

:ref:`arrow-python-java`
------------------------
1 change: 1 addition & 0 deletions java/source/index.rst
@@ -43,6 +43,7 @@ This cookbook is tested with Apache Arrow |version|.
data
avro
jdbc
c_data

Indices and tables
==================
254 changes: 254 additions & 0 deletions java/source/python_java.rst
@@ -0,0 +1,254 @@
.. _arrow-python-java:

========================
PyArrow Java Integration
========================

The PyArrow library offers a powerful API for Python that can be integrated with Java applications.
This document provides a guide on how to enable seamless data exchange between Python and Java components using PyArrow.

.. contents::

Dictionary Data Roundtrip
=========================

This section demonstrates a data roundtrip, where a dictionary array is created in Python, accessed and updated in Java,
and finally re-accessed and validated in Python for data consistency.
Member

Suggested change
This section demonstrates a data roundtrip, where a dictionary array is created in Python, accessed and updated in Java,
and finally re-accessed and validated in Python for data consistency.
This section demonstrates a data roundtrip, where a dictionary array is created in Python, accessed and updated in Java,
and finally re-accessed and validated in Python for data consistency.

Member

This description is misleading. You cannot mutate data in the C Data Interface.

Contributor Author

I changed the wording.

This section demonstrates a data roundtrip where the C Data Interface is used to provide
seamless access to data across language boundaries.



Python Component:
-----------------

The Python code uses jpype to start the JVM and make the Java class MapValuesConsumer available to Python.
Data is generated in PyArrow and exported through C Data to Java.

.. code-block:: python
Member

Run your code through a formatter so that it's consistent. (You can use Sphinx directives that include code from files instead of having to inline the code here, to make it easier.)


    import jpype
    import jpype.imports
    from jpype.types import *
    import pyarrow as pa
    from pyarrow.cffi import ffi as arrow_c

    # Init the JVM and make MapValuesConsumer class available to Python.
    jpype.startJVM(classpath=["../target/*"])
    java_c_package = jpype.JPackage("org").apache.arrow.c
    MapValuesConsumer = JClass("MapValuesConsumer")
    CDataDictionaryProvider = JClass("org.apache.arrow.c.CDataDictionaryProvider")

    # Starting from Python and generating data

    # Create a Python DictionaryArray
    dictionary = pa.dictionary(pa.int64(), pa.utf8())
    array = pa.array(["A", "B", "C", "A", "D"], dictionary)
    print("From Python")
    print("Dictionary Created: ", array)

    # Create the CDataDictionaryProvider instance, which Java
    # requires in order to import the dictionary array
    c_provider = CDataDictionaryProvider()

    consumer = MapValuesConsumer(c_provider)

    # Export the Python array through C Data
    c_array = arrow_c.new("struct ArrowArray*")
    c_array_ptr = int(arrow_c.cast("uintptr_t", c_array))
    array._export_to_c(c_array_ptr)

    # Export the Schema of the Array through C Data
    c_schema = arrow_c.new("struct ArrowSchema*")
    c_schema_ptr = int(arrow_c.cast("uintptr_t", c_schema))
    array.type._export_to_c(c_schema_ptr)

    # Send Array and its Schema to the Java function
    # that will update the dictionary
    consumer.update(c_array_ptr, c_schema_ptr)
Member

misleading wording/naming

Contributor Author

I assume this is about the comment, how about

# update values in Java

Member

It is about both. See above.

Contributor Author

Got it.


    # Importing updated values from Java to Python

    # Allocate an empty ArrowArray struct for Java to export into
    updated_c_array = arrow_c.new("struct ArrowArray*")
    updated_c_array_ptr = int(arrow_c.cast("uintptr_t", updated_c_array))

    # Allocate an empty ArrowSchema struct for Java to export into
    updated_c_schema = arrow_c.new("struct ArrowSchema*")
    updated_c_schema_ptr = int(arrow_c.cast("uintptr_t", updated_c_schema))

    java_wrapped_array = java_c_package.ArrowArray.wrap(updated_c_array_ptr)
    java_wrapped_schema = java_c_package.ArrowSchema.wrap(updated_c_schema_ptr)

    java_c_package.Data.exportVector(
        consumer.getAllocatorForJavaConsumer(),
        consumer.getVector(),
        c_provider,
        java_wrapped_array,
        java_wrapped_schema,
    )

    print("From Java back to Python")
    updated_array = pa.Array._import_from_c(updated_c_array_ptr, updated_c_schema_ptr)

    # Java and Python access the same memory through the C Data interface,
    # so the array imported from Java and the array created in Python
    # should hold the same data.
    assert updated_array.equals(array)
    print("Updated Array: ", updated_array)

    del updated_array
Member

Explicit del should be unnecessary.

Contributor Author

I get the following warning when I remove that line (I added it for this reason, but I may be missing something on the Java end).

WARNING: Failed to release Java C Data resource: Failed to attach the current thread to a Java VM

Member

In that case document why it is necessary.

Contributor Author

Got it. I have one question, since this API is pretty new to me.

So what happens here is that we call Java from Python. The Python VM is up first, and from the Python VM we start a JVM. Then we access the memory from Java, and from that we create a Python object. So the Python object and the Java object point to the same memory. Is this statement correct?

Then what could happen is that Python shuts down its VM, and in the process it tries to shut down the JVM first. The exportVector call to Java would then call a function named release_exported, and this is where we see that warning.

Further, according to a comment in release_exported in jni_wrapper.cc:

// It is possible for the JVM to be shut down when this is called;
// guard against that.  Example: Python code using JPype may shut
// down the JVM before releasing the stream.

I believe the warning above could be caused by attempting to delete global references?
Please correct me if I am wrong. If there is a better and more accurate explanation, I would appreciate learning about it.
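
The ordering concern raised in this comment can be pictured with a toy model. The sketch below is plain Python and does not use JPype or Arrow; `FakeRuntime` and `ExportedArray` are hypothetical names, and the assumption that a release callback needs its producing runtime to still be running is taken from the quoted jni_wrapper.cc comment, not verified here.

```python
class FakeRuntime:
    """Hypothetical stand-in for the JVM; only tracks whether it is running."""

    def __init__(self):
        self.running = True
        self.released = []

    def shutdown(self):
        self.running = False


class ExportedArray:
    """Hypothetical stand-in for data exported by the runtime."""

    def __init__(self, runtime, name):
        self.runtime = runtime
        self.name = name

    def release(self):
        # Assumption: the release callback needs the producing runtime alive.
        if not self.runtime.running:
            return "WARNING: failed to release " + self.name
        self.runtime.released.append(self.name)
        return "released " + self.name


# Release first, then shut down: the callback succeeds.
runtime = FakeRuntime()
early = ExportedArray(runtime, "updated_array").release()
runtime.shutdown()

# Shut down first, then release: the callback can only warn.
runtime = FakeRuntime()
late_array = ExportedArray(runtime, "updated_array")
runtime.shutdown()
late = late_array.release()

print(early)
print(late)
```

If this model matches what actually happens, an explicit `del updated_array` would correspond to the first ordering, where the release happens while the producer is still running.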


.. code-block:: shell

    From Python
    Dictionary Created:
    -- dictionary:
      [
        "A",
        "B",
        "C",
        "D"
      ]
    -- indices:
      [
        0,
        1,
        2,
        0,
        3
      ]
    Doing work in Java
    From Java back to Python
    Updated Array:
    -- dictionary:
      [
        "A",
        "B",
        "C",
        "D"
      ]
    -- indices:
      [
        2,
        1,
        2,
        0,
        3
      ]
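
To read this output, it helps to recall that a dictionary array stores each distinct value once and refers to it by index. The following is a minimal pure-Python sketch, not PyArrow code; the `decode` helper is a hypothetical name introduced only for illustration. It decodes the two index lists printed above.

```python
# Dictionary and indices copied from the printed output.
dictionary = ["A", "B", "C", "D"]
indices_before = [0, 1, 2, 0, 3]
indices_after = [2, 1, 2, 0, 3]


def decode(dictionary, indices):
    """Map each index to the dictionary value it refers to."""
    return [dictionary[i] for i in indices]


decoded_before = decode(dictionary, indices_before)
decoded_after = decode(dictionary, indices_after)

print(decoded_before)
print(decoded_after)
```

Only the first index differs between the two lists, so the first logical value changes from "A" to "C" while the dictionary itself stays the same.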

In the Python component, the following steps are executed to demonstrate the data roundtrip:

1. Create data in Python
2. Export data to Java
3. Import updated data from Java
4. Validate the data consistency
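
The export and import steps above rely on passing a struct address across the language boundary as a plain integer. The sketch below illustrates only that idea, using the standard-library `ctypes` module within a single process. `Int64Buffer` is a hypothetical struct and is not the real `ArrowArray` layout; no Arrow or JPype API is used.

```python
import ctypes


class Int64Buffer(ctypes.Structure):
    """Hypothetical stand-in for a C struct; NOT the real ArrowArray layout."""

    _fields_ = [("length", ctypes.c_int64), ("values", ctypes.c_int64 * 5)]


# Producer side: allocate the struct and fill it.
produced = Int64Buffer(5, (ctypes.c_int64 * 5)(0, 1, 2, 0, 3))

# Hand over only the address, as a plain integer.
address = ctypes.addressof(produced)

# Consumer side: wrap the address again without copying.
# `produced` must stay alive for as long as `consumed` is in use.
consumed = Int64Buffer.from_address(address)
consumed.values[0] = 2

# Both views refer to the same memory.
values_seen_by_producer = list(produced.values)
print(values_seen_by_producer)
```

Because nothing is copied, a write through one view is visible through the other, which is why the lifetime of the underlying memory matters on both sides.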


Java Component:
---------------

In the Java component, the MapValuesConsumer class receives data from the Python component through C Data.
It then updates the data and sends it back to the Python component.
Member

Suggested change
In the Java component, the MapValuesConsumer class receives data from the Python component through C Data.
It then updates the data and sends it back to the Python component.
In the Java component, the MapValuesConsumer class receives data from the Python component through C Data.
It then updates the data and sends it back to the Python component.

Member

Also, misleading wording

Contributor Author

Changed the wording. Thanks for catching this.


.. testcode::
Member

Format the code here as well.


    import org.apache.arrow.c.ArrowArray;
    import org.apache.arrow.c.ArrowSchema;
    import org.apache.arrow.c.CDataDictionaryProvider;
    import org.apache.arrow.c.Data;
    import org.apache.arrow.memory.BufferAllocator;
    import org.apache.arrow.memory.RootAllocator;
    import org.apache.arrow.vector.BigIntVector;
    import org.apache.arrow.vector.FieldVector;

    public class MapValuesConsumer {
        private static final BufferAllocator allocator = new RootAllocator();
        private static final BigIntVector intVector = new BigIntVector("internal_test_vector", allocator);
        private final CDataDictionaryProvider provider;
        private FieldVector vector;

        public MapValuesConsumer(CDataDictionaryProvider provider) {
            this.provider = provider;
        }

        public static BufferAllocator getAllocatorForJavaConsumer() {
            return allocator;
        }

        public FieldVector getVector() {
            return this.vector;
        }

        public void update(long c_array_ptr, long c_schema_ptr) {
Contributor

Could it be an option to also validate the Python call as part of Java testing, using the .. testcode:: and .. testoutput:: directives?

Something like this:

import org.apache.arrow.c.ArrowArray;
import org.apache.arrow.c.ArrowSchema;
import org.apache.arrow.c.CDataDictionaryProvider;
import org.apache.arrow.c.Data;
import org.apache.arrow.memory.BufferAllocator;
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.BigIntVector;
import org.apache.arrow.vector.FieldVector;


public class MapValuesConsumer {
    private final static BufferAllocator allocator = new RootAllocator();
    private final CDataDictionaryProvider provider;
    private FieldVector vector;

    public MapValuesConsumer(CDataDictionaryProvider provider) {
        this.provider = provider;
    }

    public MapValuesConsumer() {
        this.provider = null;
    }

    public static BufferAllocator getAllocatorForJavaConsumer() {
        return allocator;
    }

    public FieldVector getVector() {
        return this.vector;
    }

    public void update(long c_array_ptr, long c_schema_ptr) {
        ArrowArray arrow_array = ArrowArray.wrap(c_array_ptr);
        ArrowSchema arrow_schema = ArrowSchema.wrap(c_schema_ptr);
        this.vector = Data.importVector(allocator, arrow_array, arrow_schema, this.provider);
        this.doWorkInJava(vector);
    }

    public void update2(long c_array_ptr, long c_schema_ptr) {
        ArrowArray arrow_array = ArrowArray.wrap(c_array_ptr);
        ArrowSchema arrow_schema = ArrowSchema.wrap(c_schema_ptr);
        this.vector = Data.importVector(allocator, arrow_array, arrow_schema, null);
        this.doWorkInJava(vector);
    }

    private void doWorkInJava(FieldVector vector) {
        System.out.println("Doing work in Java");
        BigIntVector bigIntVector = (BigIntVector)vector;
        bigIntVector.setSafe(0, 2);
    }

    public static void main(String[] args) {
        simulateAsAJavaConsumers();
    }

    final static BigIntVector intVector =
            new BigIntVector("internal_test", allocator);

    public static BigIntVector getIntVectorForJavaConsumers() {
        intVector.allocateNew(3);
        intVector.set(0, 1);
        intVector.set(1, 7);
        intVector.set(2, 93);
        intVector.setValueCount(3);
        return intVector;
    }

    public static void simulateAsAJavaConsumers() {
        MapValuesConsumer mvc = new MapValuesConsumer();//FIXME! Use constructor with dictionary provider
        try (
            ArrowArray arrowArray = ArrowArray.allocateNew(allocator);
            ArrowSchema arrowSchema = ArrowSchema.allocateNew(allocator)
        ) {
            //FIXME! Add custom logic to emulate a dictionary provider adding
            Data.exportVector(allocator, getIntVectorForJavaConsumers(), null, arrowArray, arrowSchema);
            mvc.update2(arrowArray.memoryAddress(), arrowSchema.memoryAddress());
            try (FieldVector valueVectors = Data.importVector(allocator, arrowArray, arrowSchema, null);) {
                System.out.print(valueVectors); //FIXME! Validate on .. testoutput::
            }
        }
        intVector.close(); //FIXME! Expose this method also to be called by end python program
        allocator.close();
    }
}

Contributor Author

Thank you for the insight, @davisusanibar. I will work on this.

Contributor Author

@davisusanibar this code doesn't seem to be working, but I get your idea.

            ArrowArray arrow_array = ArrowArray.wrap(c_array_ptr);
            ArrowSchema arrow_schema = ArrowSchema.wrap(c_schema_ptr);
            this.vector = Data.importVector(allocator, arrow_array, arrow_schema, this.provider);
            this.doWorkInJava(vector);
        }

        public FieldVector updateFromJava(long c_array_ptr, long c_schema_ptr) {
            ArrowArray arrow_array = ArrowArray.wrap(c_array_ptr);
            ArrowSchema arrow_schema = ArrowSchema.wrap(c_schema_ptr);
            vector = Data.importVector(allocator, arrow_array, arrow_schema, null);
            this.doWorkInJava(vector);
            return vector;
        }

        private void doWorkInJava(FieldVector vector) {
            System.out.println("Doing work in Java");
            BigIntVector bigIntVector = (BigIntVector) vector;
            bigIntVector.setSafe(0, 2);
        }

        private static BigIntVector getIntVectorForJavaConsumers() {
            intVector.allocateNew(3);
            intVector.set(0, 1);
            intVector.set(1, 7);
            intVector.set(2, 93);
            intVector.setValueCount(3);
            return intVector;
        }

        public static void simulateAsAJavaConsumers() {
            CDataDictionaryProvider provider = new CDataDictionaryProvider();
            // Use the constructor that takes a dictionary provider.
            MapValuesConsumer mvc = new MapValuesConsumer(provider);
            try (
                ArrowArray arrowArray = ArrowArray.allocateNew(allocator);
                ArrowSchema arrowSchema = ArrowSchema.allocateNew(allocator)
            ) {
                Data.exportVector(allocator, getIntVectorForJavaConsumers(), provider, arrowArray, arrowSchema);
                FieldVector updatedVector = mvc.updateFromJava(arrowArray.memoryAddress(), arrowSchema.memoryAddress());
                try (
                    ArrowArray usedArray = ArrowArray.allocateNew(allocator);
                    ArrowSchema usedSchema = ArrowSchema.allocateNew(allocator)
                ) {
                    Data.exportVector(allocator, updatedVector, provider, usedArray, usedSchema);
                    try (FieldVector valueVectors = Data.importVector(allocator, usedArray, usedSchema, provider)) {
                        System.out.println(valueVectors);
                    }
                }
            }
        }

        public static void close() {
            intVector.close();
        }

        public static void main(String[] args) {
            simulateAsAJavaConsumers();
            close();
        }
    }

.. testoutput::

    Doing work in Java
    [2, 7, 93]


The Java component performs the following actions:

1. Receives data from the Python component.
2. Updates the data.
3. Exports the updated data back to Python.

By integrating PyArrow in Python and Java components, this example demonstrates that
a system can be created where data is shared and updated across both languages seamlessly.