Commit 76b7d37

egrace479, NetZissou, gwtaylor, and hlapp authored
Add FAIR Guide with Checklists (#37)
* Reformulate Metadata Guide as FAIR Guide with checklists (code, data, model, metadata)

explain FAIR and reproducibility, motivate use of checklists to follow FAIR principles
link to relevant lines of templates, specific section of GH guide, other checklists
link to repo issues to encourage dialog in case of questions/comments

---------

Co-authored-by: Net Zhang <[email protected]>
Co-authored-by: Graham Taylor <[email protected]>
Co-authored-by: Hilmar Lapp <[email protected]>
1 parent c89dfb6 commit 76b7d37

10 files changed: +418 -31 lines changed

.markdownlint.json

Lines changed: 5 additions & 0 deletions
@@ -0,0 +1,5 @@
+{
+  "MD007": { "indent": 4 },
+  "no-hard-tabs": false,
+  "MD013": false
+}

docs/index.md

Lines changed: 1 addition & 1 deletion
@@ -15,7 +15,7 @@ Check out our guides to get your project off on the right foot!

 - [The Hugging Face Repo Guide](wiki-guide/Hugging-Face-Repo-Guide.md): Analogous expected and suggested repository contents for Hugging Face repositories; there are notable differences from GitHub in both content and structure.

-- [Metadata Guide](wiki-guide/Metadata-Guide.md): Guide to metadata collection and documentation. This closely follows our [HF Dataset Card Template](wiki-guide/HF_DatasetCard_Template_mkdocs.md) sections.
+- [FAIR Guide](wiki-guide/FAIR-Guide.md): Guide to producing FAIR digital products, from metadata collection through product documentation and publication. This builds on the content in both the GitHub and Hugging Face Repository Guides, providing checklists to ensure [code](wiki-guide/Code-Checklist.md), [data](wiki-guide/Data-Checklist.md), and [model](wiki-guide/Model-Checklist.md) repositories are FAIR. The latter two closely follow our [HF Templates](wiki-guide/About-Templates.md).

 ### Project repo up, what's next?
 Check out our workflow guides for how to interact with your new repo:

docs/wiki-guide/Code-Checklist.md

Lines changed: 107 additions & 0 deletions
@@ -0,0 +1,107 @@
+# Code Checklist
+
+This checklist provides an overview of essential and recommended elements to include in a GitHub repository to ensure that it conforms to FAIR principles and best practices for reproducibility. Along with the generation of a DOI (see [DOI Generation](DOI-Generation.md) and [Digital Products Release and Licensing Policy](Digital-products-release-licensing-policy.md)), following this checklist ensures compliance with the FAIR Principles for research software.[^1]
+[^1]: Barker, M., Chue Hong, N. P., Katz, D. S., Lamprecht, A. L., Martinez-Ortiz, C., Psomopoulos, F., Harrow, J., Castro, L. J., Gruenpeter, M., Martinez, P. A., & Honeyman, T. (2022). Introducing the FAIR Principles for research software. _Scientific Data_, 9(1), 622. [URL](https://doi.org/10.1038/s41597-022-01710-x).
+
+!!! tip "Pro tip"
+
+    Use the eye icon at the top of this page to access the source and copy the markdown for the checklist below into an issue on your GitHub [Repo](GitHub-Repo-Guide.md) or [Project](Guide-to-GitHub-Projects.md) so you can check the boxes as you add each element to your GitHub repository.
+
+## Required Files
+
+- [ ] **License**: Verify and include an appropriate license (e.g., `MIT`, `CC0-1.0`, etc.). See discussion in the [Repo Guide](GitHub-Repo-Guide.md/#license).
+- [ ] **README File**: Following the [Repo Guide](GitHub-Repo-Guide.md/#readme), provide a detailed `README.md` with:
+    - [ ] Overview of the project.
+    - [ ] Installation instructions.
+    - [ ] Basic usage examples.
+    - [ ] Links to related/created dataset(s).
+    - [ ] Links to related/created model(s).
+    - [ ] Acknowledge source code dependencies and contributors.
+    - [ ] Reference related datasets used in training or evaluation.
+- [ ] **Requirements File**: Provide a [file detailing software requirements](GitHub-Repo-Guide.md/#software-requirements-file), such as a `requirements.txt` or `pyproject.toml` for Python dependencies.
+- [ ] **Gitignore File**: GitHub has premade `.gitignore` files ([here](https://github.com/github/gitignore)) tailored to particular languages (e.g., [R](https://github.com/github/gitignore/blob/main/R.gitignore) or [Python](https://github.com/github/gitignore/blob/main/Python.gitignore)), operating systems, etc.
+- [ ] **CITATION CFF**: This facilitates citation of your work; follow the guidance provided in the [Repo Guide](GitHub-Repo-Guide.md/#citation).
+
+### Data-Related
+
+- [ ] Preprocessing code.
+- [ ] Description of dataset(s), including description of training and testing sets (with links to relevant portions of dataset card, which will have more information).
+
+### Model-Related
+
+- [ ] Training code.
+- [ ] Inference/evaluation code.
+- [ ] Model weights (if not in Hugging Face model repository).
+- [ ] Description of model(s)/benchmark(s).
+- [ ] Explanation of training and testing (with links to relevant portions of model card, which will have more information).
+
+!!! note
+    The [bioclip GitHub repository](https://github.com/Imageomics/bioclip) provides an example of incorporating data- and model-related code into a GitHub repository as published open-source code for both data and model development.
+
+## General Information
+
+- [ ] **Repository Structure**: Ensure the code repository follows a clear and logical directory structure. (See [Repo Guide](GitHub-Repo-Guide.md/#general-repository-structure).)
+- [ ] **Code Comments**: Include meaningful inline comments and function descriptions for clarity.
+- [ ] **Random Seed Control**: Save seed(s) for random number generator(s) to ensure reproducible results.
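
Editorial sketch (not part of the commit): one way to satisfy the Random Seed Control item, using only the Python standard library. The helper names `set_and_record_seed` and `draw_sample` and the file name `seed.json` are hypothetical, chosen for illustration; a project that uses NumPy or PyTorch would need to seed those libraries as well.

```python
import json
import random
from pathlib import Path


def set_and_record_seed(seed: int, out_path: str = "seed.json") -> None:
    """Seed Python's random module and save the seed alongside the results."""
    random.seed(seed)
    # Record the seed so the same run can be repeated later.
    Path(out_path).write_text(json.dumps({"seed": seed}))


def draw_sample(n: int = 3) -> list[float]:
    """Draw n pseudo-random numbers from the seeded generator."""
    return [random.random() for _ in range(n)]
```

Calling `set_and_record_seed` with the same value before each run should make `draw_sample` return the same numbers, and the saved file documents which seed produced a given result.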
+
+## Security Considerations
+
+- [ ] **Sensitive Data Handling**: Ensure no hardcoded sensitive information (e.g., API keys, credentials) is included in your repository. These can be shared through a config file on OSC.
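
Editorial sketch (not part of the commit): a common way to keep secrets out of a repository is to read them from the environment at run time. The function name `get_api_key` and the variable name `MY_SERVICE_API_KEY` are hypothetical placeholders.

```python
import os


def get_api_key(var_name: str = "MY_SERVICE_API_KEY") -> str:
    """Read a secret from the environment instead of hardcoding it."""
    key = os.environ.get(var_name)
    if not key:
        # Fail early with a clear message rather than running without a key.
        raise RuntimeError(f"Set the {var_name} environment variable before running.")
    return key
```

With this pattern the key never appears in version control; each collaborator sets the variable in their own shell or job script.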
+
+!!! note
+    The best practices described below will help you meet the above requirements. The more advanced development practices noted further down are included for educational purposes and are highly recommended&mdash;though these may go beyond what is expected for a given project, we advise collaborators to at least have a discussion about the topics covered in [Code Quality](#code-quality) and whether other practices discussed would be appropriate for their project.
+
+---
+
+## Best Practices
+
+The [Repo Guide](GitHub-Repo-Guide.md/) provides general guidance on repository structure, [collaborative workflow](The-GitHub-Workflow.md/), and [how to make and review pull requests (PR)](The-GitHub-Pull-Request-Guide.md/). Below, we highlight some best practices in checklist form to help you meet the requirements described above for a FAIR and Reproducible project.
+
+### Reproducibility
+
+- **Version Control**: Use Git for version control and commit regularly.
+- **Modularization**: Structure code into reusable and independent modules.
+- **Code Execution**: Provide Notebooks to demonstrate how to reproduce results.
+
+### Code Review & Maintenance
+
+- **Code Reviews**: Regular peer reviews for quality assurance. Refer to the [GitHub PR Review Guide](The-GitHub-Pull-Request-Guide.md/#2-review-a-pull-request).
+- **Issue Tracking**: Use GitHub issues for tracking bugs and feature requests.
+- **Versioning**: Tag releases; changelogs can be auto-generated and are informative when PRs are appropriately scoped.
+
+### Installation and Dependencies
+
+- [ ] **Environment Setup**: Include setup instructions (e.g., `conda` environment file, `Dockerfile`).
+- [ ] **Dependency Management**: Use virtual environments and the frameworks that manage them (e.g., `venv`, `conda`, `uv` for Python) to isolate dependencies.
+
+---
+
+## More Advanced Development
+
+### Documentation
+
+- [ ] **API Documentation**: Generate API documentation (e.g., [`MkDocs`](https://www.mkdocs.org) for Python or wiki pages in the repo).
+- [ ] **Docstrings**: Add comprehensive docstrings for all functions, classes, and modules. These can be incorporated to help generate documentation. Note that generative AI tools with access to your code, such as GitHub Copilot, can be quite accurate in generating these, especially if you are using type annotations.
+- [ ] **Example Scripts**: Include example scripts for common use cases.
+- [ ] **Configuration Files**: Use `yaml`, `json`, or `ini` for configuration settings.
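
Editorial sketch (not part of the commit): the Docstrings and Configuration Files items can be combined in a small loader with type annotations and a docstring. The function name `load_config` and the setting names `batch_size` and `learning_rate` are hypothetical.

```python
import json
from pathlib import Path


def load_config(path: str) -> dict:
    """Load settings from a JSON configuration file.

    Args:
        path: Location of the JSON file.

    Returns:
        The parsed settings, with defaults filled in for missing keys.
    """
    defaults = {"batch_size": 32, "learning_rate": 0.001}
    user_settings = json.loads(Path(path).read_text())
    # Values from the file override the defaults.
    return {**defaults, **user_settings}
```

Keeping settings in a file rather than in the code means a run can be reproduced by sharing the configuration alongside the results.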
+
+### Code Quality
+
+- [ ] **Consistent Style**: Follow coding style guidelines (e.g., `PEP 8` for Python).
+- [ ] **Linting**: Ensure the code passes a linter (e.g., `Ruff` for Python).
+- [ ] **Logging**: Use logging instead of print statements for better debugging (e.g., `logging` in Python).
+- [ ] **Error Handling**: Implement robust exception handling to avoid crashes or bogus results from input outside of code expectations.
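
Editorial sketch (not part of the commit): the Logging and Error Handling items might look like the following in practice. The function `safe_ratio` is a hypothetical example; only the standard `logging` module is used.

```python
import logging

logger = logging.getLogger(__name__)


def safe_ratio(numerator: float, denominator: float) -> float:
    """Return numerator / denominator, rejecting input the code cannot handle."""
    if denominator == 0:
        # Log the problem and raise, rather than returning a bogus value.
        logger.error("Denominator is zero; cannot compute ratio.")
        raise ValueError("denominator must be non-zero")
    result = numerator / denominator
    logger.debug("Computed ratio %s", result)
    return result


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    try:
        safe_ratio(1.0, 0.0)
    except ValueError:
        # logger.exception records the message together with the traceback.
        logger.exception("Skipping invalid input")
```

Unlike print statements, log messages carry a severity level and can be silenced or redirected to a file without editing the code.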
+
+### Testing
+
+- [ ] **Unit Tests**: Write unit tests to validate core functionality.
+- [ ] **Integration Tests**: Ensure components work together correctly.
+- [ ] **Test Coverage**: Check test coverage, e.g., using [Coverage](https://coverage.readthedocs.io/).
+- [ ] **Continuous Integration (CI)**: Set up CI/CD pipelines (e.g., [GitHub Actions](https://docs.github.com/en/actions)) for automated testing.
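
Editorial sketch (not part of the commit): a minimal unit test using the standard `unittest` module. The function `normalize` and the test class are hypothetical; in a real repository the function would live in the package and the tests in a separate test module.

```python
import unittest


def normalize(values: list[float]) -> list[float]:
    """Scale values so they sum to 1."""
    total = sum(values)
    if total == 0:
        raise ValueError("values must not sum to zero")
    return [v / total for v in values]


class TestNormalize(unittest.TestCase):
    def test_sums_to_one(self):
        self.assertAlmostEqual(sum(normalize([1.0, 3.0])), 1.0)

    def test_rejects_zero_total(self):
        # Input outside the function's expectations should raise, not return nonsense.
        with self.assertRaises(ValueError):
            normalize([0.0, 0.0])
```

Tests written this way can be run locally with `python -m unittest` and by a CI pipeline on every pull request.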
+
+### Code Distribution & Deployment
+
+- [ ] **Packaging**: Provide installation instructions (e.g., `setup.py`, `hatch`, `poetry`, `uv` for Python).
+- [ ] **Deployment Guide**: Document deployment procedures.
+
+!!! question "[Questions, Comments, or Concerns?](https://github.com/Imageomics/Imageomics-guide/issues)"

docs/wiki-guide/DOI-Generation.md

Lines changed: 7 additions & 10 deletions
@@ -1,31 +1,28 @@
 # DOI Generation

 This guide discusses DOI generation for digital artifacts that may be associated with publications, such as datasets, models, and software.
-You are likely familiar with DOIs from citing (journal/arXiv/conference) papers, for which they are generated by the publisher and regularly used in citations. However, they are also invaluable for proper citation of code, models, and data. One may think of this in the manner they are handled on arXiv, where there are options for "Cite as:" or "for this version" (with the "v#" at the end) option when citing a preprint.
+You are likely familiar with DOIs from citing (journal/arXiv/conference) papers, for which they are generated by the publisher and regularly used in citations. However, they are also invaluable for proper citation of code, models, and data. Similar to how DOIs help track different versions of preprints on repositories like arXiv, they can provide persistent identification and versioning for your research artifacts beyond traditional publications.

 ## What is a DOI?

-A DOI (Digital Object Identifier) is a _persistent_ (permanent) digital identifier for any object (data, model, code, etc.) that _uniquely_ distinguishes it from other objects and links to information&mdash;metadata&mdash;about the object. The International DOI Foundation (IDF) is responsible for developing and administering the DOI system. See their [What is a DOI](https://www.doi.org/the-identifier/what-is-a-doi/) article for more information.
-
+A DOI (Digital Object Identifier) is a _persistent_ (permanent) digital identifier for any object (data, model, code, etc.) that _uniquely_ distinguishes it from other objects and links to information&mdash;metadata&mdash;about the object. The International DOI Foundation (IDF) is responsible for developing and administering the DOI system. See their [What is a DOI?](https://www.doi.org/the-identifier/what-is-a-doi/) article for more information.

 ## How do you generate a DOI?

 When publishing code, data, or models, there are various options for DOI generation, and selecting one is generally dependent on where the object of interest is published. We will go over the two standard methods used by the Institute here, and we mention a third option for completeness. A comparison of these three options is provided in the [Data Archive Options Comparative Overview](../pdfs/Data_Archive-Publication-Options-Comparative-Overview.pdf).

-
 ### 1. Generate a DOI on Hugging Face

-This is the simplest method for generating a DOI for a model or dataset since [Hugging Face partnered with DataCite to offer this option](https://huggingface.co/blog/introducing-doi).
+This is the simplest method for generating a DOI for a model or dataset since [Hugging Face partnered with DataCite to offer this option](https://huggingface.co/blog/introducing-doi).

 !!! warning "Warning"
-    Though it is a very simple process, it is not one to be taken lightly, as there is no removing data once this has been done--any changes require generation of a ***new*** DOI for the updated version: the old version will be maintained in perpetuity!
+    Though it is a very simple process, it is not one to be taken lightly, as there is no removing data once this has been done--any changes require generation of a _**new**_ DOI for the updated version: the old version will be maintained in perpetuity!

 !!! warning "Warning"
     As stated in the [Imageomics Digital Products Release and Licensing Policy](Digital-products-release-licensing-policy.md), DOIs are not to be generated for Imageomics Organization Repositories until approval has been granted by the Senior Data Scientist or Institute Leadership.

 Hugging Face allows for the generation of a DOI through the settings tab on the Model or Dataset. For details on _how_ to generate a DOI with Hugging Face, please see the [Hugging Face DOI Documentation](https://huggingface.co/docs/hub/doi).

-
 ### 2. Generate a DOI with Zenodo

 This is the most common method used for generating a DOI for a GitHub repository, because [Zenodo](https://zenodo.org/) has a [GitHub integration](https://zenodo.org/account/settings/github/), which is accessed through your Zenodo account settings (for more information, please see [GitHub's associated Docs](https://docs.github.com/articles/referencing-and-citing-content)). Zenodo can also be used to generate DOIs for data, as is relatively common in biology. However, for direct use of ML models and datasets, there are many more advantages to using Hugging Face; please see the [Data Archive Options Comparative Overview](../pdfs/Data_Archive-Publication-Options-Comparative-Overview.pdf) for more information.[^1]
@@ -38,11 +35,11 @@ When your GitHub and Zenodo accounts are linked, there will be a list of availab
 ![Zenodo instructions and enabled repos](images/doi-generation/enabled_repos+intstructions.png){ loading=lazy, width="800" }

 !!! info "The Sync now button"
-    There is a "Sync now" button at the top right of the instructions, with information on when the last sync occurred. Observe that a badge appears for the enabled repository that <b>_has_</b> a DOI, while the one without just shows up as enabled; this will also be true for repositories to which you have access but that you did not submit to Zenodo yourself.
+    There is a "Sync now" button at the top right of the instructions, with information on when the last sync occurred. Observe that a badge appears for the enabled repository that **_has_** a DOI, while the one without just shows up as enabled; this will also be true for repositories to which you have access but that you did not submit to Zenodo yourself.

 #### Metadata Tracking

-When automatically generating a DOI with Zenodo, it uses information provided in your `CITATION.cff` file to populate the metadata for the record. However, there is important information that is not supported through this integration despite its inclusion in the `CITATION.cff` format in some cases.
+When automatically generating a DOI with Zenodo, it uses information provided in your `CITATION.cff` file to populate the metadata for the record. However, there is important information that is not supported through this integration despite its inclusion in the `CITATION.cff` format in some cases.

 If your repository is likely to be updated repeatedly (i.e., generating new releases), then you may consider adding a `.zenodo.json` to preserve the remaining metadata on release sync with Zenodo for DOI. This metadata includes grant (funding) information, references (which may be included in your `CITATION.cff`), and a description of your repository/code.
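
Editorial sketch (not part of the commit): a `.zenodo.json` covering the three kinds of metadata named above could be generated as follows. The key names `description`, `grants`, and `references` are assumptions based on Zenodo's deposit metadata format and should be checked against the Zenodo documentation; every value is a placeholder, and the function name `write_zenodo_metadata` is hypothetical.

```python
import json
from pathlib import Path


def write_zenodo_metadata(path: str = ".zenodo.json") -> dict:
    """Write a small Zenodo metadata file and return what was written."""
    # All values are placeholders; replace them with your project's details.
    metadata = {
        "description": "Short description of the repository and its code.",
        "grants": [{"id": "<grant-id>"}],
        "references": ["<citation for a related work>"],
    }
    Path(path).write_text(json.dumps(metadata, indent=2) + "\n")
    return metadata
```

The file sits at the root of the repository, so the metadata is versioned with the code rather than re-entered by hand for each release.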
@@ -70,8 +67,8 @@ Building on the alternate edit options, there is also the option to simply gener

 When creating a new record on Zenodo, please ensure that other members of your project have access, as appropriate. In particular, there should be at least one member of Institute leadership or the Senior Data Scientist added to the record with management permissions. This ensures the ability to maintain the metadata and address matters related to the record (which may extend beyond your tenure with the Institute) in a timely manner.

-
 ### 3. Generate a DOI with Dryad

 [Dryad](https://datadryad.org/stash/about) is another research data repository, similar to Zenodo, through which one can archive digital objects (such as, but not limited to, data) supporting scholarly publications, and obtain a DOI. It has a review process when depositing data and requires dedication to the public domain (CC0) of all digital objects uploaded. Imageomics through OSU is a member organization of Dryad, reducing or eliminating data deposit charge(s). To determine whether Dryad is a suitable archive for Institute data products supporting your publication, please consider the [Data Archive Options Comparative Overview](../pdfs/Data_Archive-Publication-Options-Comparative-Overview.pdf) for more information, and consult with the Institute's Senior Data Scientist.[^1]

+!!! question "[Questions, Comments, or Concerns?](https://github.com/Imageomics/Imageomics-guide/issues)"
