34 changes: 16 additions & 18 deletions .github/workflows/publish-docs.yml → .github/workflows/docs.yml
@@ -1,30 +1,16 @@
-name: Publish Docs
+name: Build documentation

on:
  push:
    branches:
      - master
-
-permissions:
-  actions: read
-  pages: write
-  id-token: write
+  pull_request:

jobs:
-  build-and-deploy:
+  build:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout Repository
        uses: actions/checkout@v6

-      - name: Setup Python
-        uses: actions/setup-python@v6
-        with:
-          python-version: 3.x
-
-      - name: Run Preprocessing Script
-        run: python docs/tools/preprocess_docs.py
-
      - name: Setup .NET
        uses: actions/setup-dotnet@v5
        with:
@@ -34,13 +20,25 @@ jobs:
        run: dotnet tool update -g docfx

      - name: Build Documentation
-        run: docfx docfx.json
+        run: docfx --warningsAsErrors docfx.json
+        working-directory: ./docs

      - name: Upload Site Artifact
        uses: actions/upload-pages-artifact@v4
        with:
          path: './docs/_site'

+  deploy:
+    if: github.event_name == 'push' && github.ref == 'refs/heads/master' && !github.event.repository.fork
+    runs-on: ubuntu-latest
+    needs: build
+    permissions:
+      pages: write
+      id-token: write
+    environment:
+      name: github-pages
+      url: ${{ steps.deployment.outputs.page_url }}
+    steps:
      - name: Deploy to GitHub Pages
        id: deployment
        uses: actions/deploy-pages@v4
28 changes: 14 additions & 14 deletions docs/guides/Arrow.md
@@ -4,14 +4,14 @@ The Apache Parquet C++ library provides APIs for reading and writing data in the
These are wrapped by ParquetSharp using the [Arrow C data interface](https://arrow.apache.org/docs/format/CDataInterface.html)
to allow high performance reading and writing of Arrow data with zero copying of array data between C++ and .NET.

-The Arrow API is contained in the `ParquetSharp.Arrow` namespace,
+The Arrow API is contained in the @ParquetSharp.Arrow namespace,
and included in the [ParquetSharp NuGet package](https://www.nuget.org/packages/ParquetSharp/).

## Reading Arrow data

-Reading Parquet data in Arrow format uses a `ParquetSharp.Arrow.FileReader`.
-This can be constructed using a file path, a .NET `System.IO.Stream`,
-or a subclass of `ParquetSharp.IO.RandomAccessFile`.
+Reading Parquet data in Arrow format uses a @ParquetSharp.Arrow.FileReader.
+This can be constructed using a file path, a .NET @System.IO.Stream,
+or a subclass of @ParquetSharp.IO.RandomAccessFile.
In this example, we'll open a file using a path:

```csharp
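// The original example is collapsed in this diff view; a minimal sketch,
// following the surrounding guide text:
using var fileReader = new FileReader("data.parquet");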
@@ -68,9 +68,9 @@ the reader properties, discussed below.

### Reader properties

-The `ParquetSharp.Arrow.FileReader` constructor accepts an instance of
-`ParquetSharp.ReaderProperties` to control standard Parquet reading behaviour,
-and additionally accepts an instance of `ParquetSharp.Arrow.ArrowReaderProperties`
+The @ParquetSharp.Arrow.FileReader constructor accepts an instance of
+@ParquetSharp.ReaderProperties to control standard Parquet reading behaviour,
+and additionally accepts an instance of @ParquetSharp.Arrow.ArrowReaderProperties
to customise Arrow specific behaviour:

```csharp
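// Collapsed in this diff view; a sketch in which the two factory method
// names are assumptions, not taken from the original example:
using var readerProperties = ReaderProperties.GetDefaultReaderProperties();
using var arrowProperties = ArrowReaderProperties.GetDefault();
using var fileReader = new FileReader(
    "data.parquet", readerProperties, arrowProperties);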
@@ -94,7 +94,7 @@ using var fileReader = new FileReader(

## Writing Arrow data

-The `ParquetSharp.Arrow.FileWriter` class allows writing Parquet files
+The @ParquetSharp.Arrow.FileWriter class allows writing Parquet files
using Arrow format data.

In this example we'll walk through writing a file with a timestamp,
@@ -134,15 +134,15 @@ RecordBatch GetBatch(int batchNumber) =>
}, numIds);
```

-Now we create a `ParquetSharp.Arrow.FileWriter`, specifying the path to write to and the
+Now we create a @ParquetSharp.Arrow.FileWriter, specifying the path to write to and the
file schema:

```csharp
using var writer = new FileWriter("data.parquet", schema);
```

-Rather than specifying a file path, we could also write to a .NET `System.IO.Stream`
-or a subclass of `ParquetSharp.IO.OutputStream`.
+Rather than specifying a file path, we could also write to a .NET @System.IO.Stream
+or a subclass of @ParquetSharp.IO.OutputStream.
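
As a rough sketch of the stream-based variant (the `ManagedOutputStream` wrapper is
an assumption based on the IO types named in these guides, not part of the example above):

```csharp
// Write to a .NET stream rather than a file path.
using var stream = new FileStream("data.parquet", FileMode.Create);
using var outputStream = new ManagedOutputStream(stream);
using var writer = new FileWriter(outputStream, schema);
```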

### Writing data in batches

@@ -207,9 +207,9 @@ writer.Close();

### Writer properties

-The `ParquetSharp.Arrow.FileWriter` constructor accepts an instance of
-`ParquetSharp.WriterProperties` to control standard Parquet writing behaviour,
-and additionally accepts an instance of `ParquetSharp.Arrow.ArrowWriterProperties`
+The @ParquetSharp.Arrow.FileWriter constructor accepts an instance of
+@ParquetSharp.WriterProperties to control standard Parquet writing behaviour,
+and additionally accepts an instance of @ParquetSharp.Arrow.ArrowWriterProperties
to customise Arrow specific behaviour:

```csharp
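// The original example is collapsed in this diff view. A sketch under
// assumed builder calls: configure standard writer properties and pass
// them to the FileWriter alongside the schema.
using var writerProperties = new WriterPropertiesBuilder()
    .Compression(Compression.Snappy)
    .Build();
using var writer = new FileWriter("data.parquet", schema, writerProperties);
```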
34 changes: 17 additions & 17 deletions docs/guides/Encryption.md
@@ -27,7 +27,7 @@ Double wrapping is enabled by default.
For further details, see the
[Key Management Tools design document](https://docs.google.com/document/d/1bEu903840yb95k9q2X-BlsYKuXoygE4VnMDl9xz_zhk).

-The Key Management Tools API is contained in the `ParquetSharp.Encryption` namespace.
+The Key Management Tools API is contained in the @ParquetSharp.Encryption namespace.
In order to use this API,
a client for a Key Management Service must be implemented:

@@ -55,7 +55,7 @@ internal sealed class MyKmsClient : IKmsClient
```

The main entrypoint for the Key Management Tools API is the
-`ParquetSharp.Encryption.CryptoFactory` class.
+@ParquetSharp.Encryption.CryptoFactory class.
This requires a factory method for creating KMS clients,
which are cached internally and periodically recreated:

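A minimal sketch of that setup (the original example is collapsed below, and the
delegate signature here is assumed from the description above):

```c#
using var cryptoFactory = new CryptoFactory(
    kmsConnectionConfig => new MyKmsClient(kmsConnectionConfig));
```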
@@ -76,7 +76,7 @@ kmsConnectionConfig.KmsInstanceUrl = ...;
kmsConnectionConfig.KeyAccessToken = ...;
```

-Then to configure how the file is encrypted, an `ParquetSharp.Encryption.EncryptionConfiguration` is created:
+Then to configure how the file is encrypted, a @ParquetSharp.Encryption.EncryptionConfiguration is created:

```c#
string footerKeyId = ...;
@@ -113,7 +113,7 @@ encryptionConfig.PlaintextFooter = true;
```

The `kmsConnectionConfig` and `encryptionConfiguration` are used to generate
-file encryption properties, which are used to build the `ParquetSharp.WriterProperties`:
+file encryption properties, which are used to build the @ParquetSharp.WriterProperties:

```c#
using var fileEncryptionProperties = cryptoFactory.GetFileEncryptionProperties(
@@ -126,7 +126,7 @@ using var writerProperties = writerPropertiesBuilder
    .Build();
```
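
Filled out, that collapsed sequence plausibly reads as the following sketch
(the `.Encryption(...)` builder method name is an assumption):

```c#
using var fileEncryptionProperties = cryptoFactory.GetFileEncryptionProperties(
    kmsConnectionConfig, encryptionConfig);

using var writerPropertiesBuilder = new WriterPropertiesBuilder();
using var writerProperties = writerPropertiesBuilder
    .Encryption(fileEncryptionProperties)
    .Build();
```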

-Finally, the Parquet file can be written using the `ParquetSharp.WriterProperties`:
+Finally, the Parquet file can be written using the @ParquetSharp.WriterProperties:

```c#
Column[] columns = ...;
@@ -136,9 +136,9 @@ using var fileWriter = new ParquetFileWriter(parquetFilePath, columns, writerPro

### Reading Encrypted Files

-Reading encrypted files requires creating `ParquetSharp.FileDecryptionProperties`
-with a `ParquetSharp.Encryption.CryptoFactory`, and adding these to the
-`ParquetSharp.ReaderProperties`:
+Reading encrypted files requires creating @ParquetSharp.FileDecryptionProperties
+with a @ParquetSharp.Encryption.CryptoFactory, and adding these to the
+@ParquetSharp.ReaderProperties:

```c#
using var decryptionConfig = new DecryptionConfiguration();
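// Continuation sketch: the rest of this example is collapsed in the diff,
// and the method and property names below are assumptions based on the
// surrounding text.
using var fileDecryptionProperties = cryptoFactory.GetFileDecryptionProperties(
    kmsConnectionConfig, decryptionConfig);
using var readerProperties = ReaderProperties.GetDefaultReaderProperties();
readerProperties.FileDecryptionProperties = fileDecryptionProperties;
using var fileReader = new ParquetFileReader("encrypted.parquet", readerProperties);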
Expand All @@ -164,16 +164,16 @@ Key material is stored inside the Parquet file metadata by default,
but key material can also be stored in separate JSON files alongside Parquet files,
to allow rotation of master keys without needing to rewrite the Parquet files.

-This is configured in the `ParquetSharp.Encryption.EncryptionConfiguration`:
+This is configured in the @ParquetSharp.Encryption.EncryptionConfiguration:

```c#
using var encryptionConfig = new EncryptionConfiguration(footerKeyId);
encryptionConfig.InternalKeyMaterial = false; // External key material
```

When using external key material, the path to the Parquet file being written or read
-must be specified when creating `ParquetSharp.FileEncryptionProperties` and
-`ParquetSharp.FileDecryptionProperties`:
+must be specified when creating @ParquetSharp.FileEncryptionProperties and
+@ParquetSharp.FileDecryptionProperties:

```c#
using var fileEncryptionProperties = cryptoFactory.GetFileEncryptionProperties(
@@ -247,7 +247,7 @@ using var fileDecryptionProperties = builder
```

Rather than having to specify decryption keys directly, a
-`ParquetSharp.DecryptionKeyRetriever` can be used to retrieve keys
+@ParquetSharp.DecryptionKeyRetriever can be used to retrieve keys
based on the key metadata, to allow more flexibility:

```c#
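// Sketch of a retriever; the original example is collapsed in this diff,
// and the GetKey signature is an assumption based on the description above.
internal sealed class MyKeyRetriever : DecryptionKeyRetriever
{
    private readonly IReadOnlyDictionary<string, byte[]> _keys;

    public MyKeyRetriever(IReadOnlyDictionary<string, byte[]> keys) => _keys = keys;

    // Return the decryption key identified by the key metadata.
    public override byte[] GetKey(string keyMetadata) => _keys[keyMetadata];
}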
@@ -298,7 +298,7 @@

If the AAD prefix doesn't match the expected prefix an exception will be thrown when reading the file.

-Alternatively, you can implement an `ParquetSharp.AadPrefixVerifier` if you have more complex verification logic:
+Alternatively, you can implement a @ParquetSharp.AadPrefixVerifier if you have more complex verification logic:

```c#
internal sealed class MyAadVerifier : ParquetSharp.AadPrefixVerifier
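{
    // Sketch of the collapsed class body; the Verify signature is an
    // assumption. Throw to reject a file whose AAD prefix is unexpected.
    public override void Verify(string aadPrefix)
    {
        if (aadPrefix != "expected-prefix")
        {
            throw new Exception($"Unexpected AAD prefix: {aadPrefix}");
        }
    }
}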
@@ -324,8 +324,8 @@ using var fileDecryptionProperties = builder

## Arrow API Compatibility

-Note that the above examples use the `ParquetSharp.ParquetFileReader` and
-`ParquetSharp.ParquetFileWriter` classes, but encryption may also be used with the Arrow API.
-The `ParquetSharp.Arrow.FileReader` and `ParquetSharp.Arrow.FileWriter` constructors
-accept `ParquetSharp.ReaderProperties` and `ParquetSharp.WriterProperties` parameters
+Note that the above examples use the @ParquetSharp.ParquetFileReader and
+@ParquetSharp.ParquetFileWriter classes, but encryption may also be used with the Arrow API.
+The @ParquetSharp.Arrow.FileReader and @ParquetSharp.Arrow.FileWriter constructors
+accept @ParquetSharp.ReaderProperties and @ParquetSharp.WriterProperties parameters
respectively, which can have encryption properties configured.
10 changes: 5 additions & 5 deletions docs/guides/Nested.md
@@ -7,7 +7,7 @@ but the Parquet format can be used to represent data with a complex nested structure

In order to write a file with nested columns,
we must define the Parquet file schema explicitly as a graph structure using schema nodes,
-rather than using ParquetSharp's `ParquetSharp.Column` type.
+rather than using ParquetSharp's @ParquetSharp.Column type.

Imagine we have the following JSON object we would like to store as Parquet:

@@ -41,8 +41,8 @@ or we had a non-null object with a null `message` and null `ids`.
Instead, we will represent this data in Parquet with a single
`objects` column.

-In order to define the schema, we will be using `ParquetSharp.Schema.PrimitiveNode`
-and `ParquetSharp.Schema.GroupNode`.
+In order to define the schema, we will be using @ParquetSharp.Schema.PrimitiveNode
+and @ParquetSharp.Schema.GroupNode.

In the Parquet schema, we have one top-level group node named `objects`,
which contains two nested fields, `ids` and `message`.
@@ -74,7 +74,7 @@ using var schema = new GroupNode(

### Writing data

-We can then create a `ParquetSharp.ParquetFileWriter` with this schema:
+We can then create a @ParquetSharp.ParquetFileWriter with this schema:

```csharp
using var propertiesBuilder = new WriterPropertiesBuilder();
@@ -85,7 +85,7 @@ using var fileWriter = new ParquetFileWriter("objects.parquet", schema, writerPr

When writing data to this file,
the leaf-level values written must be nested within ParquetSharp's
-`ParquetSharp.Nested` type to indicate they are contained in a group,
+@ParquetSharp.Nested type to indicate they are contained in a group,
and allow nullable nested structures to be represented unambiguously.

For example, both the `objects` and `message` fields are optional,
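
The rest of that example is collapsed here, but the writing pattern it describes
is roughly the following sketch (the element types and values are assumptions):

```csharp
using var rowGroupWriter = fileWriter.AppendRowGroup();

// Values for three outer objects: one fully populated, one with nulls
// inside the group, and one null object (no group at all).
using var idsWriter = rowGroupWriter.NextColumn().LogicalWriter<Nested<long[]?>?>();
idsWriter.WriteBatch(new Nested<long[]?>?[]
{
    new Nested<long[]?>(new long[] {1, 2, 3}),
    new Nested<long[]?>(null), // object with null ids
    null,                      // null object
});

using var messageWriter = rowGroupWriter.NextColumn().LogicalWriter<Nested<string?>?>();
messageWriter.WriteBatch(new Nested<string?>?[]
{
    new Nested<string?>("hello"),
    new Nested<string?>(null), // object with a null message
    null,                      // null object
});

fileWriter.Close();
```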
6 changes: 3 additions & 3 deletions docs/guides/PowerShell.md
@@ -1,6 +1,6 @@
# ParquetSharp in PowerShell

-The main requirement to using ParquetSharp from PowerShell is that `ParquetSharpNative.dll` is in the `PATH` or in the same directory as `ParquetSharp.dll`. The following guide shows one possible approach to achieve this:
+The main requirement for using ParquetSharp from PowerShell is that @ParquetSharpNative.dll is in the `PATH` or in the same directory as @ParquetSharp.dll. The following guide shows one possible approach to achieve this:

### Installation

@@ -23,7 +23,7 @@ Copy-Item -Path ".\lib\System.Runtime.CompilerServices.Unsafe.4.5.3\lib\net461\S
Copy-Item -Path ".\lib\System.ValueTuple.4.5.0\lib\net461\System.ValueTuple.dll" -Destination ".\bin"
```

-Finally, copy `ParquetSharp.dll` and `ParquetSharpNative.dll` into `bin`. This will depend on the current version of ParquetSharp, as well as your architecture and OS:
+Finally, copy @ParquetSharp.dll and @ParquetSharpNative.dll into `bin`. This will depend on the current version of ParquetSharp, as well as your architecture and OS:

```powershell
# Replace path with the appropriate version of ParquetSharp
@@ -36,7 +36,7 @@ Copy-Item -Path ".\lib\ParquetSharp.12.1.0\runtimes\win-x64\native\ParquetSharpN
The available runtime architectures are `win-x64`, `linux-x64`, `linux-arm64`, `osx-x64`, and `osx-arm64`.

### Usage
-Use `Add-Type` to load `ParquetSharp.dll`. Note that we're using custom directories:
+Use `Add-Type` to load @ParquetSharp.dll. Note that we're using custom directories:

```powershell
# Replace path with the appropriate versions of ParquetSharp
16 changes: 8 additions & 8 deletions docs/guides/Reading.md
@@ -1,8 +1,8 @@
# Reading Parquet files

-The low-level ParquetSharp API provides the `ParquetSharp.ParquetFileReader` class for reading Parquet files.
+The low-level ParquetSharp API provides the @ParquetSharp.ParquetFileReader class for reading Parquet files.
This is usually constructed from a file path, but may also be constructed from a
-`ParquetSharp.IO.ManagedRandomAccessFile`, which wraps a .NET `System.IO.Stream` that supports seeking.
+@ParquetSharp.IO.ManagedRandomAccessFile, which wraps a .NET @System.IO.Stream that supports seeking.

```csharp
using var fileReader = new ParquetFileReader("data.parquet");
@@ -15,7 +15,7 @@ using var fileReader = new ParquetFileReader(input);

### Obtaining file metadata

-The `ParquetSharp.FileMetaData` property of a `ParquetFileReader` exposes information about the Parquet file and its schema:
+The @ParquetSharp.FileMetaData property of a `ParquetFileReader` exposes information about the Parquet file and its schema:

```csharp
int numColumns = fileReader.FileMetaData.NumColumns;
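// Collapsed continuation, sketched: other metadata available from
// FileMetaData (property names per the ParquetSharp API).
long numRows = fileReader.FileMetaData.NumRows;
int numRowGroups = fileReader.FileMetaData.NumRowGroups;
var schema = fileReader.FileMetaData.Schema;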
@@ -34,7 +34,7 @@ for (int columnIndex = 0; columnIndex < schema.NumColumns; ++columnIndex) {

Parquet files store data in separate row groups, which all share the same schema,
so if you wish to read all data in a file, you generally want to loop over all of the row groups
-and create a `ParquetSharp.RowGroupReader` for each one:
+and create a @ParquetSharp.RowGroupReader for each one:

```csharp
for (int rowGroup = 0; rowGroup < fileReader.FileMetaData.NumRowGroups; ++rowGroup) {
@@ -45,10 +45,10 @@ for (int rowGroup = 0; rowGroup < fileReader.FileMetaData.NumRowGroups; ++rowGro

### Reading columns directly

-The `Column` method of `RowGroupReader` takes an integer column index and returns a `ParquetSharp.ColumnReader` object,
+The `Column` method of `RowGroupReader` takes an integer column index and returns a @ParquetSharp.ColumnReader object,
which can read primitive values from the column, as well as raw definition level and repetition level data.
Usually you will not want to use a `ColumnReader` directly, but instead call its `LogicalReader` method to
-create a `ParquetSharp.LogicalColumnReader` that can read logical values.
+create a @ParquetSharp.LogicalColumnReader that can read logical values.
There are two variations of this `LogicalReader` method; the plain `LogicalReader` method returns an abstract
`LogicalColumnReader`, whereas the generic `LogicalReader<TElement>` method returns a typed `LogicalColumnReader<TElement>`,
which reads values of the specified element type.
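
Putting those pieces together, a compact sketch (the column index and element
type here are assumptions):

```csharp
using var rowGroupReader = fileReader.RowGroup(0);
long numRows = rowGroupReader.MetaData.NumRows;

using var columnReader = rowGroupReader.Column(0);
using var logicalReader = columnReader.LogicalReader<double>();
double[] values = logicalReader.ReadAll((int) numRows);
```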
@@ -96,7 +96,7 @@ When reading Timestamp to a DateTime, ParquetSharp sets the DateTimeKind based o

If `IsAdjustedToUtc` is `true` the DateTimeKind will be set to `DateTimeKind.Utc` otherwise it will be set to `DateTimeKind.Unspecified`.

-This behavior can be overwritten by setting the AppContext switch `ParquetSharp.ReadDateTimeKindAsUnspecified` to `true`, so the DateTimeKind will be always set to `DateTimeKind.Unspecified` regardless of the value of `IsAdjustedToUtc`.
+This behavior can be overridden by setting the AppContext switch @ParquetSharp.ReadDateTimeKindAsUnspecified to `true`, so the DateTimeKind will always be set to `DateTimeKind.Unspecified` regardless of the value of `IsAdjustedToUtc`.
This also matches the old behavior of [ParquetSharp < 7.0.0](https://github.com/G-Research/ParquetSharp/pull/261).

```csharp
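// Presumably what the collapsed example shows: opt in to the old behaviour
// before reading any files.
AppContext.SetSwitch("ParquetSharp.ReadDateTimeKindAsUnspecified", true);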
@@ -117,7 +117,7 @@ Some legacy implementations of Parquet write timestamps using the Int96 primitiv
which has been [deprecated](https://issues.apache.org/jira/browse/PARQUET-323).
ParquetSharp doesn't support reading Int96 values as .NET `DateTime`s
as not all Int96 timestamp values are representable as a `DateTime`.
-However, there is limited support for reading raw Int96 values using the `ParquetSharp.Int96` type
+However, there is limited support for reading raw Int96 values using the @ParquetSharp.Int96 type
and it is left to applications to decide how to interpret these values.

## Long path handling
4 changes: 2 additions & 2 deletions docs/guides/RowOriented.md
@@ -70,8 +70,8 @@ using (var rowReader = ParquetFile.CreateRowReader<MyRow>("example.parquet"))

## Reading and writing custom types

-The row-oriented API supports reading and writing custom types by providing a `ParquetSharp.LogicalTypeFactory`
-and a `ParquetSharp.LogicalReadConverterFactory` or `ParquetSharp.LogicalWriteConverterFactory`.
+The row-oriented API supports reading and writing custom types by providing a @ParquetSharp.LogicalTypeFactory
+and a @ParquetSharp.LogicalReadConverterFactory or @ParquetSharp.LogicalWriteConverterFactory.

### Writing custom types

4 changes: 2 additions & 2 deletions docs/guides/TimeSpan.md
@@ -110,6 +110,6 @@ Note that when using this approach, if you read the file back with
ParquetSharp the data will be read as `long` values as there's no
way to tell it was originally `TimeSpan` data.
To read the data back as `TimeSpan`s, you'll also need to implement
-a custom `ParquetSharp.LogicalReadConverterFactory` and use the `LogicalReadOverride` method
-or provide a custom `ParquetSharp.LogicalTypeFactory`.
+a custom @ParquetSharp.LogicalReadConverterFactory and use the `LogicalReadOverride` method
+or provide a custom @ParquetSharp.LogicalTypeFactory.
See the [type factories documentation](TypeFactories.md) for more details.