[Add] basic pipelines documentation #480

Open · wants to merge 1 commit into `develop`
29 changes: 29 additions & 0 deletions docs/Pipelines/BranchingStrategy.md

Ideally, your feature branch should be forked off `master`. Suppose you are creating a new pipeline named `pipeline_007` for Jira ticket TKT-162; you would create a branch like this:

``` bash
git checkout master
git checkout -b TKT-162/ftr/pipeline_007_dev
```
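
As a sanity check, you can verify a branch name against this convention with a small shell test. The regex below is an illustrative assumption, not a rule enforced by the repository:

``` bash
# Illustrative pattern: <JIRA-TICKET>/ftr/<pipeline_name>_dev
branch="TKT-162/ftr/pipeline_007_dev"
if [[ "$branch" =~ ^[A-Z]+-[0-9]+/ftr/[A-Za-z0-9_]+_dev$ ]]; then
  echo "branch name follows the convention"
fi
```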
<br>

There are three main branches in the pipelines repository: `develop`, `staging`, and `master`. Your branch should first be merged into `develop`, then into `staging`, and finally, when it is production-ready, a last PR should be raised to `master`. Each PR merge into one of the three main branches, as well as each commit to a `*_dev` branch, triggers a deployment to specific environment(s). Check the following table to understand the deployment triggers:

| Branch | Devpolly | Testpolly | Polly | Description |
| --------- | ----------------- | ----------------- | ----------------- | ----------------------------------------------------------------------------- |
| `*_dev`   | :material-check:  | :material-close:  | :material-check:  | Commits to branches ending with `_dev` are deployed to devpolly and polly      |
| `develop` | :material-check:  | :material-close:  | :material-check:  | PR merges to `develop` are deployed to devpolly and polly                      |
| `staging` | :material-close:  | :material-check:  | :material-check:  | PR merges to `staging` are deployed to testpolly and polly                     |
| `master`  | :material-close:  | :material-close:  | :material-check:  | PR merges to `master` are deployed only to polly                               |
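
The trigger table can also be read as a simple branch-to-environments mapping. The helper below is purely illustrative (it is not a utility that exists in this repository):

``` python
def deployment_targets(branch: str) -> set:
    """Illustrative mapping of the deployment-trigger table above."""
    if branch == "master":
        return {"polly"}
    if branch == "staging":
        return {"testpolly", "polly"}
    if branch == "develop" or branch.endswith("_dev"):
        return {"devpolly", "polly"}
    return set()  # no automatic deployment for other branches

print(sorted(deployment_targets("TKT-162/ftr/pipeline_007_dev")))  # ['devpolly', 'polly']
```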


!!! note "Important Note"
    As you can see, all four branch types (`*_dev`, `develop`, `staging`, `master`) are deployed to production along with other environments. This approach ensures that pipeline developers can test their pipelines without relying solely on the stability of the devpolly and testpolly environments. In the production environment, each pipeline is assigned a 'stage' attribute that reflects its maturity level: the `*_dev` and `develop` branches map to the 'dev' stage, the `staging` branch to the 'test' stage, and the `master` branch to the 'prod' (production) stage.

<br>
<br>
<br>
<br>
<br>
<br>

25 changes: 25 additions & 0 deletions docs/Pipelines/GettingStarted.md


Welcome to Polly Pipelines, a powerful workflow orchestration framework designed to simplify the process of building, managing, and executing complex pipelines. With Polly Pipelines, users can focus on running their pipelines without worrying about the underlying infrastructure.


## Why use Polly Pipelines?
- **Multi-language support:** Write pipelines in either Nextflow or Polly Workflow Language (PWL). Support for Snakemake is planned!
- **GUI and programmatic interface:** User-friendly GUI and polly-python interfaces to monitor and execute pipelines.
- **No infrastructure management:** Abstracts away the complexities of infrastructure management and deployment. As a user, you just need to focus on writing pipelines!
- **Cloud and on-prem execution:** Execute your pipelines on the cloud or on-prem, which helps you save costs. For now, this feature is available only for Nextflow pipelines.


If you are unsure which language to choose for writing pipelines, please [check these guidelines](NextflowVsPWL.md).


<br>

To learn how to write pipelines, please check the following quick start guides:

<div class="grid cards" markdown>

- :material-arrow-right: [__Nextflow__ Quick Start Guide](WritingPipelines/Nextflow/QuickStartNextflow.md)
- :material-arrow-right: [__PWL__ Quick Start Guide](WritingPipelines/PWL/QuickStartPWL.md)

</div>
23 changes: 23 additions & 0 deletions docs/Pipelines/NextflowVsPWL.md


#### Choose Nextflow if:
- You like to use [nf-core](https://nf-co.re/) community pipelines
- You primarily work in bioinformatics and scientific workflows
- You need data-parallelism capabilities
- You're comfortable with learning a new syntax

#### Choose PWL if:
- You're a Python developer comfortable with functional programming
- You need a highly scalable and flexible framework for diverse workflows, including data science
- You prioritize ease of use, user-friendliness, and a rich feature set

<br>

<div class="grid cards" markdown>

- :material-arrow-right: [__Nextflow__ Quick Start Guide](WritingPipelines/Nextflow/QuickStartNextflow.md)
- :material-arrow-right: [__PWL__ Quick Start Guide](WritingPipelines/PWL/QuickStartPWL.md)

</div>


140 changes: 140 additions & 0 deletions docs/Pipelines/WritingPipelines/Nextflow/QuickStartNextflow.md

Welcome to Nextflow quick start guide!

## Setting up the environment

Start by cloning the repository, assuming you have your ElucidataInc GitHub SSH key set up:
``` bash
git clone [email protected]:ElucidataInc/pipelines.git
```

Create a Python virtual environment ([refer to this doc](https://www.freecodecamp.org/news/how-to-setup-virtual-environments-in-python/)) and activate it:

``` bash
cd pipelines
# Activate your virtual env here
```

Install the basic requirements:

``` bash
pip install -r requirements.txt
```

Install pre-commit hooks, which run basic formatting checks on each commit:

``` bash
pre-commit install
```

<hr>


## Understanding the structure of pipelines repo

Let’s go over the structure of the repository in brief. The following schematic shows some important root-level files and folders and their purposes:

``` hl_lines="7 8 9 10 11 12 13 14"
pipelines # the repository
├── .circleci/ # config for CI/CD
├── deployment/ # deployment scripts and utilities
├── orchestration/ # utilities for enabling pipeline development
├── pipelines/
│ │
│ ├── nextflow/ # All Nextflow pipelines
│ │ ├── pipeline_1/
│ │ └── pipeline_2/
│ │
│ └── pwl/ # All PWL pipelines
│ └── pipeline_3/
├── ...
├── requirements.txt # dependencies
├── ...
└── scripts/ # common scripts
```

!!! info
    As a pipeline developer, you only need to care about the `pipelines` directory (highlighted above). It contains both Nextflow and PWL pipelines.

<hr>

Each pipeline follows a specific directory structure. To see this in practice, let's explore the directory structure of a demo pipeline.

```
toy/ # nextflow pipeline named "toy"
├── __init__.py
├── build
│ ├── Dockerfile # For building docker image (must)
│ └── environment.yml # dependencies for pipeline (must)
├── config
│ ├── dev.json # config for devpolly
│ ├── test.json # config for testpolly
│ └── prod.json # config for polly
├── src # Source code
│ ├── main.nf
│ ├── Makefile
│ └── nextflow.config
└── parameter_schema.json # Defines pipeline's parameters (must)

```
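
Before committing a new pipeline, it can help to confirm that the files marked "(must)" above exist. A minimal sketch, using a throwaway `/tmp` directory as a stand-in for a real pipeline folder (the loop is illustrative, not repository tooling):

``` bash
# Illustrative check for the required files of a Nextflow pipeline
pipeline_dir="/tmp/toy_check"   # stand-in for pipelines/nextflow/<name>
mkdir -p "$pipeline_dir/build"
touch "$pipeline_dir/build/Dockerfile" \
      "$pipeline_dir/build/environment.yml" \
      "$pipeline_dir/parameter_schema.json"

for f in build/Dockerfile build/environment.yml parameter_schema.json; do
  [ -f "$pipeline_dir/$f" ] && echo "found $f"
done
```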


## Let's create your first pipeline

1. We will start by forking a branch from `master`:

``` bash
git checkout master
git checkout -b <add_your_branch_name>_dev
# Make sure your branch name ends with _dev.
```

The pipelines repository employs a branching strategy. For more details please refer to [this page](../../BranchingStrategy.md).


2. Instead of creating a pipeline from scratch, let's copy an example pipeline and play with it:

``` bash
cp -r pipelines/nextflow/toy pipelines/nextflow/<name_your_pipeline>
```
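
The copied files still reference the `toy` pipeline in a few places, such as the `COPY` path in the Dockerfile. A hedged sketch of rewriting such a reference with `sed` (the `/tmp` paths below are stand-ins, not real repo files):

``` bash
# Example only: rewrite the old pipeline path inside a copied Dockerfile
mkdir -p /tmp/my_pipeline/build
printf 'COPY pipelines/nextflow/toy/build/environment.yml .\n' \
  > /tmp/my_pipeline/build/Dockerfile
sed -i 's#nextflow/toy/#nextflow/my_pipeline/#g' /tmp/my_pipeline/build/Dockerfile
cat /tmp/my_pipeline/build/Dockerfile
# COPY pipelines/nextflow/my_pipeline/build/environment.yml .
```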

3. Go to `build/Dockerfile` and change the pipeline path in the highlighted `COPY` command

``` hl_lines="4"
FROM nfcore/base:2.1

# Install the conda environment
COPY pipelines/nextflow/toy/build/environment.yml .
RUN pip3 --no-cache-dir install --upgrade awscli

CMD ["bash", "-c", "echo 'ECS_IMAGE_PULL_BEHAVIOR=once' >> /etc/ecs/ecs.config"]
```

4. After all the above changes are done, push your pipeline:

``` bash
git add .
git commit -m 'First pipeline'
git push origin <name_of_your_branch>
```

5. Go to [CircleCI](https://app.circleci.com/pipelines/github/ElucidataInc/pipelines) and approve the hold to deploy your pipeline


Congrats! You have deployed your first pipeline. Once the CircleCI jobs are completed, go to [Polly](https://polly.elucidata.io/manage/pipelines), click on your pipeline, pass in the parameters, and initiate your first run.


<br>
<br>
<br>
<br>
<br>
<br>
157 changes: 157 additions & 0 deletions docs/Pipelines/WritingPipelines/PWL/QuickStartPWL.md

Welcome to PWL quick start guide!

## Setting up the environment

Start by cloning the repository, assuming you have your ElucidataInc GitHub SSH key set up:
``` bash
git clone [email protected]:ElucidataInc/pipelines.git
```

Create a Python virtual environment ([refer to this doc](https://www.freecodecamp.org/news/how-to-setup-virtual-environments-in-python/)) and activate it:

``` bash
cd pipelines
# Activate your virtual env here
```

Install the basic requirements:

``` bash
pip install -r requirements.txt
```

Install pre-commit hooks, which run basic formatting checks on each commit:

``` bash
pre-commit install
```

<hr>


## Understanding the structure of pipelines repo

Let’s go over the structure of the repository in brief. The following schematic shows some important root-level files and folders and their purposes:

``` hl_lines="7 8 9 10 11 12 13 14"
pipelines # the repository
├── .circleci/ # config for CI/CD
├── deployment/ # deployment scripts and utilities
├── orchestration/ # utilities for enabling pipeline development
├── pipelines/
│ │
│ ├── nextflow/ # All Nextflow pipelines
│ │ ├── pipeline_1/
│ │ └── pipeline_2/
│ │
│ └── pwl/ # All PWL pipelines
│ └── pipeline_3/
├── ...
├── requirements.txt # dependencies
├── ...
└── scripts/ # common scripts
```

!!! info
    As a pipeline developer, you only need to care about the `pipelines` directory (highlighted above). It contains both Nextflow and PWL pipelines.

<hr>

Each pipeline follows a specific directory structure. To see this in practice, let's explore the directory structure of a demo pipeline.

```
demo_protein_processing/ # pwl pipeline named "demo_protein_processing"
├── __init__.py
├── build
│ ├── Dockerfile # For building docker image (must be present)
│ └── requirements.txt # dependencies for pipeline (must be present)
├── config
│ ├── dev.json # config for devpolly
│ ├── test.json # config for testpolly
│ └── prod.json # config for polly
├── src # Source code
│ ├── __init__.py
│ └── main.py
└── parameter_schema.json # Defines pipeline's parameters (must be present)
```
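
The per-environment files (`dev.json`, `test.json`, `prod.json`) are plain JSON. Below is a hedged sketch of how pipeline code might load one; the loader is illustrative and not a utility provided by the repository:

``` python
import json
import os
import tempfile

def load_config(config_dir: str, env: str = "dev") -> dict:
    """Load config/<env>.json, mirroring the dev/test/prod layout above."""
    path = os.path.join(config_dir, f"{env}.json")
    with open(path) as f:
        return json.load(f)

# Demonstrate with a throwaway config directory
config_dir = tempfile.mkdtemp()
with open(os.path.join(config_dir, "dev.json"), "w") as f:
    json.dump({"env": "devpolly"}, f)

print(load_config(config_dir, "dev"))  # {'env': 'devpolly'}
```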


## Let's create your first pipeline

1. We will start by forking a branch from `master`:

``` bash
git checkout master
git checkout -b <add_your_branch_name>_dev
# Make sure your branch name ends with _dev.
```

The pipelines repository employs a branching strategy. For more details please refer to [this page](../../BranchingStrategy.md).


2. Instead of creating a pipeline from scratch, let's copy an example pipeline and play with it:

``` bash
cp -r pipelines/pwl/demo_protein_processing pipelines/pwl/<name_your_pipeline>
```

3. Go to `build/Dockerfile` and change the pipeline path in the highlighted `COPY` command

``` hl_lines="3"
FROM mithoopolly/workflows-base:python3.9

COPY pipelines/pwl/demo_protein_processing/build/requirements.txt .

RUN pip install -r requirements.txt

```

4. Change the entrypoint function name in `main.py` to match your pipeline's name. This is important!

``` python hl_lines="2 12"
@workflow(result_serialization=Serialization.JSON)
def demo_protein_processing(exp_id: str = "exp1", pre_process: bool = False):
secret_key = "MY_SECRET_KEY"
secret_value = Secrets.get(secret_key)
Logger.info(f"My secret value: {secret_value}")

##
##
##

if __name__ == "__main__":
demo_protein_processing("exp1.data", True)
```
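
Before pushing, you can sanity-check that the entrypoint was actually renamed. A small standard-library sketch (this check is an illustration, not part of the repository's tooling):

``` python
import ast

def has_entrypoint(main_py_source: str, pipeline_name: str) -> bool:
    """Return True if a top-level function named after the pipeline exists."""
    tree = ast.parse(main_py_source)
    return any(
        isinstance(node, ast.FunctionDef) and node.name == pipeline_name
        for node in tree.body
    )

source = "def demo_protein_processing(exp_id='exp1', pre_process=False):\n    pass\n"
print(has_entrypoint(source, "demo_protein_processing"))  # True
```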


5. After all the above changes are done, push your pipeline:

``` bash
git add .
git commit -m 'First pipeline'
git push origin <name_of_your_branch>
```

6. Go to [CircleCI](https://app.circleci.com/pipelines/github/ElucidataInc/pipelines) and approve the hold to deploy your pipeline


Congrats! You have deployed your first pipeline. Once the CircleCI jobs are completed, go to [Polly](https://polly.elucidata.io/manage/pipelines), click on your pipeline, pass in the parameters, and initiate your first run.


Now that you have deployed your first PWL pipeline, let's take a deeper dive into creating pipelines from scratch: [check this page](UnderstandingTheSyntax.md).

<br>
<br>
<br>
<br>
<br>
<br>