Docs/perf best practices update #7586

Open. Wants to merge 3 commits into master.
64 changes: 57 additions & 7 deletions docs/understand/performance-best-practices.md
@@ -17,11 +17,6 @@ Concurrent commits/merges on the same branch result in a race. The first operati
## Perform meaningful commits
It's a good idea to perform commits that are meaningful in the sense that they represent a logical point in your data's lifecycle. While lakeFS supports arbitrarily large commits, avoiding commits with a huge number of objects will result in a more comprehensible commit history.

## Use zero-copy import
To import objects into lakeFS, either a single time or regularly, lakeFS offers a [zero-copy import][zero-copy-import] feature.
Use this feature to import a large number of objects to lakeFS, instead of simply copying them into your repository.
This feature creates references to the existing objects in your bucket and avoids the copy.

## Read data using the commit ID
In cases where you are only interested in reading committed data:
* Use a commit ID (or a tag ID) in your path (e.g. `lakefs://repo/a1b2c3`).
@@ -31,9 +26,62 @@ When accessing data using the branch name (e.g. `lakefs://repo/main/path`) lakeF
For more information, see [how uncommitted data is managed in lakeFS][representing-refs-and-uncommitted-metadata].

## Operate directly on the storage
Contributor:

If we want to break this section down into sub-sections, I'd suggest that each sub-section describe a way to operate directly on the storage, rather than distinguishing between reads and writes, which most of these ways support. That is, I'd use a structure like:

- Operate directly on the storage
  - Pre-sign URLs
  - lakeFS Hadoop Filesystem
  - Staging API

WDYT?

Author:

I think that can work but can't personally commit to that as I don't quite grok the details of the staging API and the lakeFS HDFS setup (sadly, I know more than I would like to about HDFS itself :) ).

Is this something you'd like to address in this PR or is it possible to address in a subsequent one?

Contributor:

I created a PR with the suggestions.

Sometimes, storage operations can become a bottleneck. For example, when your data pipelines upload many big objects.
Storage operations can become a bottleneck when operating on large datasets.

In such cases, it can be beneficial to perform only versioning operations on lakeFS, while performing storage reads/writes directly on the object store.
lakeFS offers multiple ways to do that:
lakeFS offers multiple ways to do that.

### Use zero-copy import
Contributor:

I'm not sure that this is the appropriate location for this section. For example, I could import data to lakeFS and then read it directly from lakeFS, so that flow isn't "operating directly on the object store".

Contributor:

+1, The zero-copy import section talks about how to regularly feed data into lakeFS rather than how to interact with data already managed by lakeFS.

Author:

How would you prefer to organize this information?

The existing flow makes sense to me but I'm not dead set on it.

The reason it makes sense to me is that when I looked at the LakeFS architecture diagram, the first question in my mind was "how do I bypass the lakefs service for bulk operations on data", as that's the obvious bottleneck assuming the object storage is S3 or similar. This then leads to 3 questions, which are addressed in this section:

  1. If I have data already in S3, how do I make lakeFS aware of it without re-copying? Answer: use zero-copy writes / imports. (This is where DVC fell out of our evaluation, btw...)
  2. If I already have a large dataset in lakeFS, how do I add new data? Two answers: another zero-copy import is fine, or write by performing metadata operations in lakeFS and writing directly to storage.
  3. How do I do efficient, scalable reads? Answer: get the URLs from the metadata service, talk to S3/GCS directly using those URLs.

And the fuse stuff got in here because the follow up to the third question is, what if I am not writing my own reader, but rather using fuse to mount a bucket - does this mean I'm SOL for using LakeFS at all, and if not, do all the reads go through the slow way, streaming data through the LakeFS service? And the answer to that is also no, LakeFS thought of that, it's all done via symlinks and you can read directly from the storage.

Contributor:

@dvryaboy, apologies for the delay, and thanks for elaborating on the thought process behind your suggestion. You are making good points, and I'd like to suggest a structure that will make the most sense.

I created PR #8359 with the changes I suggest to your PR. I included most of your changes but changed the order. Let me know your thoughts.

If you are ok with the changes, you are welcome to apply them to your pr and we can approve this pr.

Thanks again for your contribution!

To import objects into lakeFS, either a single time or regularly, lakeFS offers a [zero-copy import][zero-copy-import] feature.
Use this feature to import a large number of objects to lakeFS, instead of simply copying them into your repository.
This feature creates references to the existing objects in your bucket and avoids the copy.

The lakeFS blog documents a number of patterns for importing data: [Import Data to lakeFS: Effortless, Fast, and Zero Copy](https://lakefs.io/blog/import-data-lakefs/).

An import essentially scans the supplied dataset (e.g., a path in a GCS or S3 bucket) and records various metadata about the files in lakeFS. This is purely a metadata operation, and does not require copying any data.

From that point on, the source data should be considered frozen and immutable. Any changes to an imported object in the origin location will result in read failures when reading that object from lakeFS.

You can re-import a dataset to capture newly added files (this is use case 3 / option 2 in the blog). What a re-import will not do is deal nicely with updated or overwritten files.

This pattern will work nicely for append-only setups.
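
As a rough sketch, a recurring import of such an append-only prefix might look like the following. This assumes the high-level `lakefs` Python SDK and its `import_data` API; the repository, branch, bucket, and prefix names are placeholders:

```
import lakefs  # high-level lakeFS Python SDK (assumed installed and configured)

# Placeholder repository and branch names.
branch = lakefs.repository("my-repo").branch("main")

# Zero-copy import: lakeFS records metadata pointing at the existing objects;
# no object data is copied into the repository's storage namespace.
importer = branch.import_data(commit_message="Nightly import of raw images")
importer.prefix("s3://my-bucket/datasets/images/", destination="datasets/images/")
importer.run()  # starts the import and waits for the import commit to be created
```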

### Writing directly to the object store
In addition to importing large, batch-generated datasets, we may want to add a few new files after an initial import, or to modify existing files. lakeFS allows “uploading” changes to a dataset.

Unlike an import, in the case of an upload, lakeFS controls the actual location at which the file is stored in the backing object store. This will not modify any directories you may have originally “imported” the dataset from, and you will need to use lakeFS to get consistent views of the data; see the next subsection for advice on scalable reads.

If we need to upload a lot of files, we likely want to avoid streaming them through the lakeFS service. lakeFS allows this if we follow a particular pattern: we request a location from lakeFS to which a new file should go, and then use a regular object store (S3, GCS, etc.) client to upload the data directly.

This is achieved by `lakectl fs upload --pre-sign` ([docs][lakectl-upload]). The equivalent OpenAPI endpoint will return a URL to which the user can upload the file(s) in question; other clients, such as the Java client and Python SDK, also expose pre-sign functionality; consult the relevant client documentation for details.
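
As an illustration of the second half of that flow, here is a minimal Python sketch that uploads a local file to a pre-signed URL using a plain HTTP client. Obtaining the pre-signed URL from lakeFS (for example via the staging API) and registering the uploaded object in lakeFS afterwards are assumed to happen separately; the URL and file path below are placeholders:

```
import requests

# Assumed to have been obtained from lakeFS beforehand; placeholder value.
presigned_url = "https://my-bucket.s3.amazonaws.com/data/abc123?X-Amz-Signature=..."

with open("images/001.jpg", "rb") as f:
    resp = requests.put(presigned_url, data=f)  # bytes go straight to the object store
resp.raise_for_status()

# The object still needs to be linked in lakeFS so it shows up on the branch;
# `lakectl fs upload --pre-sign` handles both steps for you.
```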

### Read directly from the object store
lakeFS maintains versions by keeping track of each file; the commit and branch paths (`my-repo/commits/{commit_id}`, `my-repo/branches/main`) are virtual and are resolved by the lakeFS service to iterate through the appropriate files.
Contributor:

I believe that this part belongs to the concepts and model page.

Author:

I think a version of this is in the concepts and model page ("A lot of what lakeFS does is to manage how lakeFS paths translate to physical paths on the object store.", etc).

Do you feel this sort of thing needs to be DRYed up? My thought was that "repetition doesn't spoil the prayer", as the saying goes, and a brief mention here helps set the context without expecting the reader to have gone in detail through other pages.


lakeFS does allow you to either read directly from the lakeFS API (or the S3 gateway), or to query the lakeFS API for the actual underlying locations of the files that constitute a particular commit.

To read data from lakeFS without the data being transferred through lakeFS:
* Read an object using getObject (lakeFS OpenAPI) and add `--presign`. You'll get a link to download the object.
* Use statObject (lakeFS OpenAPI), which returns a `physical_address`; that is the actual S3/GCS path, which you can read with any S3/GCS client (a sketch of this option follows the list).
* Use the `lakectl fs presign` command ([docs][lakectl-fs-presign]).
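
As a rough sketch of the second option, here is what statObject followed by a direct object-store read might look like in Python. The endpoint layout is assumed from the lakeFS OpenAPI spec, the physical address is assumed to be an S3 URI, and the endpoint URL, credentials, repository, and object path are all placeholders:

```
import boto3
import requests
from urllib.parse import urlparse

LAKEFS_API = "https://lakefs.example.com/api/v1"   # placeholder endpoint
AUTH = ("<access-key-id>", "<secret-access-key>")  # placeholder lakeFS credentials

# Ask lakeFS where the object physically lives (statObject).
stats = requests.get(
    f"{LAKEFS_API}/repositories/my-repo/refs/main/objects/stat",
    params={"path": "datasets/images/001.jpg"},
    auth=AUTH,
)
stats.raise_for_status()
physical_address = stats.json()["physical_address"]  # e.g. s3://bucket/key

# Read the bytes straight from the object store, bypassing lakeFS entirely.
addr = urlparse(physical_address)
obj = boto3.client("s3").get_object(Bucket=addr.netloc, Key=addr.path.lstrip("/"))
image_data = obj["Body"].read()
```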

#### Reading directly from GCS when using GCS-Fuse
GCP users commonly mount GCS buckets using `gcs-fuse`, particularly when using GCP's Vertex AI pipelines. The lakeFS Fuse integration is written in a way that ensures reads are scalable and work directly off GCS, without involving any lakeFS services in the read path. This is achieved by using hooks to automatically create symlinks that reflect the virtual directory structure of branches and commits. See the [gcs-fuse integration][gcs-fuse] documentation for details.

The end result is that you can read the branches or commits directly from the file system, and not involve lakeFS in the read path at all:

```
with open('/gcs/my-bucket/exports/my-repo/branches/main/datasets/images/001.jpg', 'rb') as f:  # binary mode for image data
    image_data = f.read()
```

```
commit_id = 'abcdef123deadbeef567'
with open(f'/gcs/my-bucket/exports/my-repo/commits/{commit_id}/datasets/images/001.jpg', 'rb') as f:  # binary mode for image data
    image_data = f.read()
```

### Read More
* The [`lakectl fs upload --pre-sign`][lakectl-upload] command (or [download][lakectl-download]).
* The lakeFS [Hadoop Filesystem][hadoopfs].
* The [staging API][api-staging], which can be used to add lakeFS references to objects after they have been written to the storage.
@@ -50,5 +98,7 @@ It will also lower the storage cost.
[zero-copy-import]: {% link howto/import.md %}#zero-copy-import
[lakectl-upload]: {% link reference/cli.md %}#lakectl-fs-upload
[lakectl-download]: {% link reference/cli.md %}#lakectl-fs-download
[lakectl-fs-presign]: {% link reference/cli.md %}#lakectl-fs-presign
[api-staging]: {% link reference/api.md %}#operations-objects-stageObject
[representing-refs-and-uncommitted-metadata]: {% link understand/how/versioning-internals.md %}#representing-references-and-uncommitted-metadata
[gcs-fuse]: {% link integrations/vertex_ai.md %}#using-lakefs-with-cloud-storage-fuse