This project contains Python notebooks that load data into a Neo4j database or an in-memory graph projection from large datasets consisting of Apache Parquet files.
One of the main capabilities of Apache Arrow is its ability to work with memory-mapped files,
which allows Arrow to read datasets that are larger than the available RAM without incurring any additional memory cost.
With the PyArrow ParquetDataset module we can create a dataset of memory-mapped parquet files, which can then be split into fragments and processed in batches.
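As a rough illustration of that pattern (the local path below is a placeholder, not a path from this repository), a memory-mapped ParquetDataset can be walked fragment by fragment and read as record batches:

```python
import pyarrow.parquet as pq

# Placeholder path; point this at any local directory of parquet files.
dataset = pq.ParquetDataset("/data/pqdataset-projection/nodes", memory_map=True)

# Each fragment maps to one parquet file; reading it as record batches keeps
# only a bounded slice of the data materialised in memory at any time.
for fragment in dataset.fragments:
    for batch in fragment.to_batches():
        print(fragment.path, batch.num_rows)
```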
In addition to local and HDFS storage, PyArrow also supports cloud object storage such as:
- Amazon Simple Storage Service (S3)
- Google Cloud Storage (GCS)
- Microsoft Azure Blob Storage
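For example (the bucket names below are placeholders), the filesystem can either be passed to the dataset explicitly or inferred from the URI scheme:

```python
import pyarrow.parquet as pq
from pyarrow import fs

# Explicit filesystem object; credentials are resolved from the environment.
gcs = fs.GcsFileSystem()
nodes = pq.ParquetDataset("your-bucket/pqdataset-projection/nodes", filesystem=gcs)

# Or let PyArrow infer the filesystem from the URI scheme (s3://, gs://, hdfs://).
nodes = pq.ParquetDataset("s3://your-bucket/pqdataset-projection/nodes")
```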
Please see the Apache Arrow installation instructions and requirements in the Neo4j GDS documentation.
- `neo4j_arrow_create_database.ipynb`: Used to create a Neo4j database from a PyArrow ParquetDataset
- `neo4j_arrow_create_projection.ipynb`: Used to create a Neo4j in-memory graph projection from a PyArrow ParquetDataset
The configuration is defined in the `config.yaml` file with the parameters below:
neo4j:
  host: 'your-neo4j-host-ip'
  port: 8491
  user: 'your-neo4j-username'
  password: 'your-password'
  tls: False
  concurrency: 10
  database: 'neo4j'
  #projection: 'mygraph'
dataset:
  node_parquet_files_uri: 'gs://your-bucket/pqdataset-projection/nodes'
  relationship_parquet_files_uri: 'gs://your-bucket/pqdataset-projection/relationships'
worker:
  concurrency: 10
  arrow_table_size: 1_000_000
- `host`: Server address of your Neo4j installation
- `port`: Port on which the Arrow Flight server listens for incoming connections
- `user`: Username of the Neo4j user
- `password`: Password for the Neo4j user
- `tls`: Sets the encryption between the client and the Neo4j server; for more information please see Arrow Encryption
- `concurrency`: Number of threads to use for the database/projection creation process
- `database`: The name of the import process and the resulting database name
- `projection`: The name of the resulting in-memory graph in the graph catalog (required only when creating an in-memory projection)
- `node_parquet_files_uri`: Node parquet files location URI used by PyArrow to create the dataset (required, and it must contain the string `nodes`)
- `relationship_parquet_files_uri`: Relationship parquet files location URI used by PyArrow to create the dataset (required, and it must contain the string `relationships`)
- `concurrency`: Number of workers used by the notebook to process the datasets (it is recommended that this matches the concurrency set above for the Neo4j process)
- `arrow_table_size`: Controls the Arrow table size sent to Neo4j by the workers
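As an illustrative sketch (not necessarily the exact code used in the notebooks), the configuration can be loaded with PyYAML and used to open an Arrow Flight connection to the Neo4j GDS server:

```python
import yaml
import pyarrow.flight as flight

with open("config.yaml") as f:
    config = yaml.safe_load(f)

neo4j = config["neo4j"]
scheme = "grpc+tls" if neo4j["tls"] else "grpc+tcp"
client = flight.FlightClient(f"{scheme}://{neo4j['host']}:{neo4j['port']}")

# Basic authentication against the Arrow Flight server; returns a token
# pair that is attached to subsequent Flight calls.
token_pair = client.authenticate_basic_token(neo4j["user"], neo4j["password"])
```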
- The PyArrow ParquetDataset module parses the URI to determine the filesystem (`s3`, `gcs`, `hdfs`, `local`, etc.).
- To make your datasets easier to manage, especially large ones, it is recommended to follow the subfolder structure below:
- Under your required `nodes` directory, have a subdirectory for each node label with its parquet files
- Under your required `relationships` directory, have a subdirectory for each relationship type with its parquet files
- PyArrow will automatically detect parquet files located in subdirectories (see the sketch after the layout below)
├── parquet-dataset-root-folder
│   ├── nodes
│   │   ├── label1
│   │   │   ├── label1-00000.parquet
│   │   │   ├── label1-00001.parquet
│   │   │   ├── label1-00002.parquet
│   │   ├── label2
│   │   │   ├── label2-00000.parquet
│   │   │   ├── label2-00001.parquet
│   │   │   ├── label2-00002.parquet
│   ├── relationships
│   │   ├── type1
│   │   │   ├── type1-00000.parquet
│   │   │   ├── type1-00001.parquet
│   │   │   ├── type1-00002.parquet
│   │   ├── type2
│   │   │   ├── type2-00000.parquet
│   │   │   ├── type2-00001.parquet
│   │   │   ├── type2-00002.parquet
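With a layout like the one above, one dataset per URI is enough; the label and type subdirectories do not need to be listed explicitly (the bucket name below is a placeholder):

```python
import pyarrow.parquet as pq

nodes = pq.ParquetDataset("gs://your-bucket/parquet-dataset-root-folder/nodes")

# Every parquet file under label1/, label2/, ... shows up as a fragment.
for fragment in nodes.fragments:
    print(fragment.path)
```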