Merge branch 'master' into ompp-model-docs

Souheil-Yazji committed Jan 29, 2024
2 parents 0e5e03f + b7aa33a commit 3cdaebc
Showing 41 changed files with 558 additions and 395 deletions.
69 changes: 38 additions & 31 deletions docs/dev/features/object-storage/blobcsi.md

#### c. Add bucket info:


Add the following to `resource "kubectl_manifest" "fdi-aaw-configuration-data"`, in one of the following, depending on the classification of the bucket:

1. `fdi-protected-b-external.json: |` or
2. `fdi-unclassified-external.json: |` or
3. `fdi-protected-b-internal.json: |` or
4. `fdi-unclassified-internal.json: |`

```
{
"bucketName": "<should-be-provided-for-you>",
"pvName": "<acronym>-eprotb",
"subfolder": "",
"readers": ["<name-of-kuebeflow-profile>"],
"writers": ["<name-of-kuebeflow-profile>"],
"spn": "aaw-<acronym>-prod-sp"
}
```
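
For illustration, a completed entry for a hypothetical project with acronym `abc` and Kubeflow profile `abc-proj` might look like the following (every value here is made up; the real bucket name and SPN come from the process described below):

```
{
    "bucketName": "abc-prod-storage",
    "pvName": "abc-eprotb",
    "subfolder": "",
    "readers": ["abc-proj"],
    "writers": ["abc-proj"],
    "spn": "aaw-abc-prod-sp"
}
```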

##### Transit Containers

If the storage solution requires transit containers, add the following as well; not all solutions require them.

```
{
"bucketName": "<should-be-provided-for-you>-transit",
"pvName": "<acronym>-inbox-eprotb",
"subfolder": "from-de",
"readers": ["<name-of-kuebeflow-profile>"],
"writers": ["<name-of-kuebeflow-profile>"],
"spn": "aaw-<acronym>-prod-sp"
},
{
"bucketName": "<should-be-provided-for-you>-transit",
"pvName": "<acronym>-outbox-eprotb",
"subfolder": "to-vers",
"readers": ["<name-of-kuebeflow-profile>"],
"writers": ["<name-of-kuebeflow-profile>"],
"spn": "aaw-<acronym>-prod-sp"
}
```

##### Info
>
> `writers:` use the kubeflow profile name for this
>
> `spn:` this has to be obtained by you by sending a Jira ticket to the Cloud Team. See below for an example SPN request.
>
##### Example Cloud Ticket

To obtain the SPN, send a Jira ticket to the Cloud Team using the template below:

> Hi,
>
> Can I get a service principal named aaw-\<acronym\>-prod-sp created please?
>
> The owners should be:
>
> [email protected]
> [email protected]
> - [email protected]
> - [email protected]
>
> More info: https://jirab.statcan.ca/browse/?????-????
>
> Thanks!
Expand Down
124 changes: 61 additions & 63 deletions docs/en/5-Storage/AzureBlobStorage.md
# Azure Blob Storage (Containers)

[Azure Blob Storage](https://learn.microsoft.com/en-us/azure/storage/blobs/storage-blobs-introduction) is Microsoft's object storage solution for the cloud. Blob Storage is optimized for storing massive amounts of unstructured data. Unstructured data is data that doesn't adhere to a particular data model or definition, such as text or binary data.

Azure Blob Storage Containers have the following advantages over Kubeflow Volumes (Disks):

1. **Capacity:** Containers can be huge: way bigger than hard drives. And they are still fast.
2. **Simultaneity:** You can access the same data source from multiple Notebook Servers and pipelines at the same time without needing to duplicate the data.
3. **Shareability:** Project namespaces can share a container. This is great for sharing data with people outside of your workspace.

<!-- prettier-ignore -->
!!! warning "Azure Blob Storage containers and buckets mount will be replacing the Minio Buckets and Minio storage mounts"
Users will be responsible for migrating data from Minio Buckets to the Azure Storage folders. For larger files, users may contact AAW for assistance.
!!! warning "Azure Blob Storage containers and buckets have replaced MinIO storage and buckets."
Users will be responsible for migrating data from MinIO Buckets to the Azure Storage folders. [Click here for instructions on how to migrate!](#how-to-migrate-from-minio-to-azure-blob-storage). For larger files, users may [contact AAW for assistance](https://statcan-aaw.slack.com).

## Setup

<!-- prettier-ignore -->
### Accessing Blob Container from JupyterLab

The Blob CSI volumes are persisted under `~/buckets` when creating a Notebook Server. Files under `~/buckets` are backed by Blob storage. All AAW notebooks have `~/buckets` mounted in the file system, making data accessible from everywhere.

These folders can be used like any other - you can copy files to/from using the file browser, write from Python/R, etc. The only difference is that the data is being stored in the Blob storage container rather than on a local disk (and is thus accessible wherever you can access your Kubeflow notebook).
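
Reading and writing under the mount uses ordinary file APIs; no special client is needed. Here is a minimal Python sketch (the `aaw-unclassified` path is the standard mount location, the file name is made up, and the `makedirs` call is only there so the sketch also runs outside an AAW notebook):

```python
import csv
import os

# The container is mounted like a normal directory. On AAW this path already
# exists; creating it here just lets the sketch run elsewhere too.
bucket = os.path.expanduser("~/buckets/aaw-unclassified")
os.makedirs(bucket, exist_ok=True)

path = os.path.join(bucket, "demo.csv")

# Writing to the path stores the data in the Blob container.
with open(path, "w", newline="") as f:
    csv.writer(f).writerows([["name", "value"], ["a", "1"], ["b", "2"]])

# Reading it back works like any local file.
with open(path, newline="") as f:
    rows = list(csv.reader(f))
```

The same file then appears on any other Notebook Server in your namespace that mounts the container.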

![Blob folders mounted as directories](../images/container-mount.png)

#### Unclassified Containers

Unclassified blob storage containers will appear as follows in the `~/buckets` folder.

![Unclassified notebook folders mounted as directories in JupyterLab](../images/unclassified-mount.png)

#### Protected B Containers

Protected B blob storage containers will appear as follows in the `~/buckets` folder.

![Protected B notebooks mounted as directories in JupyterLab](../images/protectedb-mount.png)

### Container Types

The following Blob containers are available. Accessing any of them works the same way; the difference between containers is the storage type behind them:

- **aaw-unclassified:** By default, use this one to store unclassified data.
- **aaw-protected-b:** Use this one to store sensitive, Protected B data.
- **aaw-unclassified-ro:** This container is Protected B, but with read-only access. This lets users view unclassified data within a Protected B notebook.

### Accessing Internal Data

Accessing internal data uses the DAS common storage connection, which serves internal and external users who require access to unclassified or Protected B data. The following containers can be provisioned:

- **external-unclassified:** Unclassified and accessible by both StatCan and non-StatCan employees.
- **external-protected-b:** Protected B and accessible by both StatCan and non-StatCan employees.
- **internal-unclassified:** Unclassified and accessible by StatCan employees only.
- **internal-protected-b:** Protected B and accessible by StatCan employees only.

The above containers follow the same convention as the AAW containers in terms of data; however, there is a layer of isolation between StatCan employees and non-StatCan employees. Non-StatCan employees are only allowed in **external** containers, while StatCan employees can have access to any container.

AAW has an integration with the FAIR Data Infrastructure team that allows users to transfer unclassified and Protected B data to Azure Storage Accounts, thus allowing users to access this data from Notebook Servers.

<!-- prettier-ignore -->
Please reach out to the FAIR Data Infrastructure team if you have a use case for this data.

## Pricing

<!-- prettier-ignore -->
!!! info "Pricing models are based on CPU and Memory usage"
Pricing is covered by KubeCost for user namespaces (In Kubeflow at the bottom of the Notebooks tab).

In general, Blob Storage is much cheaper than [Azure Manage Disks](https://azure.microsoft.com/en-us/pricing/details/managed-disks/) and has better I/O than managed SSD.

## The Azure Storage Explorer

Our friends over at the Collaborative Analytics Environment (CAE) have some documentation on accessing your Azure Blob Storage from your AVD using the [Azure Storage Explorer](https://statcan.github.io/cae-eac/en/AzureStorageExplorer/).

## How to Migrate from MinIO to Azure Blob Storage

First, `source` the environment variables stored in your secrets vault. You will `source` either **minio-gateway** or **fdi-gateway**, depending on where your data was ingested:

```
source /vault/secrets/fdi-gateway-protected-b
```

Then create an alias to access your data:

```
mc alias set minio $MINIO_URL $MINIO_ACCESS_KEY $MINIO_SECRET_KEY
```

List the contents of your data folder with `mc ls`:

```
mc ls minio
```

Finally, copy your MinIO data into your Azure Blob Storage directory with `mc cp --recursive`:

```
mc cp --recursive minio ~/buckets/aaw-unclassified
```

If you have Protected B data, you can copy your data into the Protected B bucket:

```
mc cp --recursive minio ~/buckets/aaw-protected-b
```
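
After the copy finishes, it is worth spot-checking that everything arrived. Because both the MinIO copy and the destination appear as ordinary directories inside the notebook, a plain Python walk is enough. This helper is illustrative, not part of AAW tooling, and the commented paths are made up:

```python
import os

def list_files(root):
    """All file paths under root, relative to root."""
    found = set()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            found.add(os.path.relpath(os.path.join(dirpath, name), root))
    return found

def missing_files(src, dst):
    """Files present under src but absent under dst."""
    return sorted(list_files(src) - list_files(dst))

# Hypothetical usage after the mc cp above:
# missing = missing_files("/path/to/minio-copy",
#                         os.path.expanduser("~/buckets/aaw-protected-b"))
# if missing:
#     print("Not yet copied:", missing)
```
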
36 changes: 16 additions & 20 deletions docs/en/5-Storage/Disks.md → docs/en/5-Storage/KubeflowVolumes.md
# Kubeflow Volumes (Disks)

Kubeflow Volumes are similar in concept to the hard disk drives you are used to on your Windows, Mac or Linux Desktop. Kubeflow Volumes are sometimes just called disks and are backed by fast solid state drives (SSDs) under the hood!

## Setup

When creating your notebook server, you request disks by adding Data Volumes to your notebook server (pictured below; go to `Advanced Options`). They are automatically mounted at the directory (`Mount Point`) you choose, and serve as a simple and reliable way to preserve data attached to a Notebook Server.

![Adding an existing volume to a new notebook server](../images/kubeflow_existing_volume.png)

<!-- prettier-ignore -->
??? warning "You pay for all disks you own, whether they're attached to a Notebook Server or not"
As soon as you create a disk, you're [paying](#pricing) for it until it is [deleted](#deleting-disk-storage), even if it's original Notebook Server is deleted. See [Deleting Disk Storage](#deleting-disk-storage) for more info
!!! Warning "You pay for all disks you own, whether they're attached to a Notebook Server or not."
As soon as you create a disk, you're [paying](#pricing) for it until it is [deleted](#deleting-disk-storage), even if its original Notebook Server is deleted. See [Deleting Disk Storage](#deleting-disk-storage) for more info.

## Once you've got the basics...

When you delete your Notebook Server, your disks **are not deleted**. This lets you reuse that same disk (with all its contents) on a new Notebook Server later (as shown above with `Type = Existing` and the `Name` set to the volume you want to reuse). If you're done with the disk and its contents, [delete it](#deleting-disk-storage).

To see your disks, check the Notebook Volumes section of the Notebook Server page.
## Pricing

<!-- prettier-ignore -->
??? info "Pricing models are tentative and may change"
??? info "Pricing models are tentative and may change."
As of writing, pricing is covered by the platform for initial users. This guidance explains how things are expected to be priced in the future, but this may change.

When mounting a disk, you get an [Azure Managed Disk](https://azure.microsoft.com/en-us/pricing/details/managed-disks/). The **Premium SSD Managed Disks** pricing shows the cost per disk based on size. Note that you pay for the size of disk requested, not the amount of space you are currently using.

<!-- prettier-ignore -->
??? info "Tips to minimize costs"
As disks can be attached to a Notebook Server and reused, a typical usage pattern could be:

* At 9AM, create a Notebook Server (request 2CPU/8GB RAM and a 32GB attached
disk)
* Do work throughout the day, saving results to the attached disk
* At 5PM, shut down your Notebook Server to avoid paying for it overnight
* NOTE: The attached disk **is not destroyed** by this action
* At 9AM the next day, create a new Notebook Server and **attach your existing
disk**
* Continue your work...

This keeps all your work safe without paying for the computer when you're not using it
??? info "Tips to minimize costs."
You can minimize costs by suspending your notebook servers when not in use. A typical workflow may look like:

- Create a Notebook Server with the appropriate amount of storage allocated to Workspace and Data Volumes.
- Do work throughout the day, saving results to the Data or Workspace Volume, depending on your needs.
- At the end of the workday, suspend your Notebook Server to avoid paying for it overnight.
- At 9AM the next day, resume your Notebook Server and continue your work.
- **Tip:** You can migrate your Workspace or Data Volume to a new Notebook Server without losing data, since deleting a Notebook Server does not affect its attached Workspace or Data Volumes.
