[Java] How dictionaries work - roundtrip Java-Python #327
@@ -0,0 +1,12 @@
.. _c-data-java:

==================
C Data Integration
==================

The C Data interface is an important aspect of supporting multiple languages in Apache Arrow.
A Java programme can seamlessly work with C++ and Python programmes. The following examples
demonstrate how it can be done.

:ref:`arrow-python-java`
------------------------
@@ -0,0 +1,254 @@
.. _arrow-python-java:

========================
PyArrow Java Integration
========================

The PyArrow library offers a powerful API for Python that can be integrated with Java applications.
This document provides a guide on how to enable seamless data exchange between Python and Java components using PyArrow.

.. contents::

Dictionary Data Roundtrip
=========================

This section demonstrates a data roundtrip, where a dictionary array is created in Python, accessed and updated in Java,
and finally re-accessed and validated in Python for data consistency.
This description is misleading. You cannot mutate data in the C Data Interface.
I changed the wording.
This section demonstrates a data roundtrip where the C Data interface is used to provide
seamless access to data across language boundaries.
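For readers new to the term, a dictionary array stores each distinct value once and represents the column as integer indices into that list of values. The sketch below is plain Python with no Arrow dependency; the helper names `dictionary_encode` and `dictionary_decode` are made up for illustration and are not PyArrow APIs.

```python
def dictionary_encode(values):
    """Return (indices, dictionary) for a list of hashable values."""
    dictionary = []   # distinct values, in order of first appearance
    positions = {}    # value -> its index in `dictionary`
    indices = []      # one small integer per input value
    for value in values:
        if value not in positions:
            positions[value] = len(dictionary)
            dictionary.append(value)
        indices.append(positions[value])
    return indices, dictionary


def dictionary_decode(indices, dictionary):
    """Rebuild the original values by looking each index up in the dictionary."""
    return [dictionary[i] for i in indices]


indices, dictionary = dictionary_encode(["A", "B", "A", "C", "B"])
print(indices)     # [0, 1, 0, 2, 1]
print(dictionary)  # ['A', 'B', 'C']
```

Because the indices and the dictionary are separate pieces, a consumer in another language needs both to reconstruct the values.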
Run your code through a formatter so that it's consistent. (You can use Sphinx directives that include code from files instead of having to inline the code here, to make it easier.)
misleading wording/naming
I assume this is about the comment, how about
# update values in Java
It is about both. See above.
Got it.
Explicit del should be unnecessary.
I get the following warning when I remove that line (I added it for this reason, but I may be missing something on the Java end).
WARNING: Failed to release Java C Data resource: Failed to attach the current thread to a Java VM
In that case document why it is necessary.
Got it. I have one question, since this API is pretty new to me.
So what happens here is that we call Java from Python. The Python VM is up first, and from the Python VM we bring up a JVM. Then we access the memory from Java and create a Python object from it, so the Python object and the Java object point to the same memory. Is this statement correct?
Then what could happen is that Python shuts down its VM and, in the process, tries to shut down the JVM first. The exportVector function call to Java would call a function called release_exported. This is where we see that warning.
Further, according to a comment in release_exported in jni_wrapper.cc:
// It is possible for the JVM to be shut down when this is called;
// guard against that. Example: Python code using JPype may shut
// down the JVM before releasing the stream.
I believe the warning above could be caused by attempting to delete global references?
Please correct me if I am wrong. If there is a better and more accurate explanation, I would appreciate learning about it.
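The ordering concern raised here can be illustrated without Arrow or JPype. In the sketch below, `FakeJvm` and `ExportedArray` are hypothetical stand-ins, not real APIs. The only point is that a release callback which needs a live JVM fails if the JVM is shut down first, which is what releasing explicitly before shutdown avoids. Whether this matches the actual cause of the warning is an assumption.

```python
class FakeJvm:
    """Hypothetical stand-in for a JVM; NOT the JPype API."""

    def __init__(self):
        self.running = True

    def shutdown(self):
        self.running = False


class ExportedArray:
    """Hypothetical object whose release callback must call back into the JVM."""

    def __init__(self, jvm, log):
        self._jvm = jvm
        self._log = log
        self._released = False

    def release(self):
        if self._released:
            return  # releasing twice is a no-op
        self._released = True
        if self._jvm.running:
            self._log.append("released cleanly")
        else:
            self._log.append("WARNING: failed to release, JVM already shut down")


def run(release_before_shutdown):
    """Simulate program exit, optionally releasing the array before the JVM stops."""
    log = []
    jvm = FakeJvm()
    array = ExportedArray(jvm, log)
    if release_before_shutdown:
        array.release()  # plays the role of the explicit `del` in the example
    jvm.shutdown()
    array.release()      # late release at exit; no-op if already released
    return log


print(run(release_before_shutdown=True))   # ['released cleanly']
print(run(release_before_shutdown=False))  # ['WARNING: failed to release, JVM already shut down']
```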
In the Java component, the MapValuesConsumer class receives data from the Python component through C Data.
It then updates the data and sends it back to the Python component.
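Receiving data "through C Data" means the producer hands over the address of a C struct as a plain integer, which is why the consumer method quoted in this thread takes `long c_array_ptr, long c_schema_ptr`. A minimal sketch of that handoff using only Python's standard `ctypes` module follows; `FakeCArray` is a hypothetical two-field struct for illustration, not the real `ArrowArray` layout, and `consumer_length` is an invented helper.

```python
import ctypes


class FakeCArray(ctypes.Structure):
    """Hypothetical stand-in for a C struct; NOT the real ArrowArray layout."""

    _fields_ = [
        ("length", ctypes.c_int64),
        ("null_count", ctypes.c_int64),
    ]


def consumer_length(struct_address):
    """Play the consumer: wrap the raw address and read a field without copying."""
    view = FakeCArray.from_address(struct_address)
    return view.length


# The producer owns the struct and must keep it alive while the consumer reads it.
producer_struct = FakeCArray(length=3, null_count=0)
address = ctypes.addressof(producer_struct)  # a plain int, like c_array_ptr

print(consumer_length(address))  # 3
```

Only the integer address crosses the boundary; the consumer reads the same memory the producer wrote.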
Also, misleading wording
Changed the wording. Thanks for catching this.
Format the code here as well.
Could it be an option to also validate the Python call as part of Java testing, using the .. testcode:: and .. testoutput:: directives?
Something like this:
import org.apache.arrow.c.ArrowArray;
import org.apache.arrow.c.ArrowSchema;
import org.apache.arrow.c.CDataDictionaryProvider;
import org.apache.arrow.c.Data;
import org.apache.arrow.memory.BufferAllocator;
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.BigIntVector;
import org.apache.arrow.vector.FieldVector;

public class MapValuesConsumer {
    private final static BufferAllocator allocator = new RootAllocator();
    private final CDataDictionaryProvider provider;
    private FieldVector vector;

    public MapValuesConsumer(CDataDictionaryProvider provider) {
        this.provider = provider;
    }

    public MapValuesConsumer() {
        this.provider = null;
    }

    public static BufferAllocator getAllocatorForJavaConsumer() {
        return allocator;
    }

    public FieldVector getVector() {
        return this.vector;
    }

    public void update(long c_array_ptr, long c_schema_ptr) {
        ArrowArray arrow_array = ArrowArray.wrap(c_array_ptr);
        ArrowSchema arrow_schema = ArrowSchema.wrap(c_schema_ptr);
        this.vector = Data.importVector(allocator, arrow_array, arrow_schema, this.provider);
        this.doWorkInJava(vector);
    }

    public void update2(long c_array_ptr, long c_schema_ptr) {
        ArrowArray arrow_array = ArrowArray.wrap(c_array_ptr);
        ArrowSchema arrow_schema = ArrowSchema.wrap(c_schema_ptr);
        this.vector = Data.importVector(allocator, arrow_array, arrow_schema, null);
        this.doWorkInJava(vector);
    }

    private void doWorkInJava(FieldVector vector) {
        System.out.println("Doing work in Java");
        BigIntVector bigIntVector = (BigIntVector) vector;
        bigIntVector.setSafe(0, 2);
    }

    public static void main(String[] args) {
        simulateAsAJavaConsumers();
    }

    final static BigIntVector intVector =
        new BigIntVector("internal_test", allocator);

    public static BigIntVector getIntVectorForJavaConsumers() {
        intVector.allocateNew(3);
        intVector.set(0, 1);
        intVector.set(1, 7);
        intVector.set(2, 93);
        intVector.setValueCount(3);
        return intVector;
    }

    public static void simulateAsAJavaConsumers() {
        MapValuesConsumer mvc = new MapValuesConsumer(); // FIXME! Use constructor with dictionary provider
        try (
            ArrowArray arrowArray = ArrowArray.allocateNew(allocator);
            ArrowSchema arrowSchema = ArrowSchema.allocateNew(allocator)
        ) {
            // FIXME! Add custom logic to emulate a dictionary provider adding
            Data.exportVector(allocator, getIntVectorForJavaConsumers(), null, arrowArray, arrowSchema);
            mvc.update2(arrowArray.memoryAddress(), arrowSchema.memoryAddress());
            try (FieldVector valueVectors = Data.importVector(allocator, arrowArray, arrowSchema, null);) {
                System.out.print(valueVectors); // FIXME! Validate on .. testoutput::
            }
        }
        intVector.close(); // FIXME! Expose this method also to be called by end python program
        allocator.close();
    }
}
Thank you for the insight, @davisusanibar. I will work on this.
@davisusanibar this code doesn't seem to be working, but I get your idea.