poseidon-framework
diff --git a/‎janno_details.md
+10-10 b/‎janno_details.md
+10-10
diff --git a/‎janno_r_package.md
+2-2 b/‎janno_r_package.md
+2-2
@@ -12,11 +12,11 @@ The column `Alternative_IDs` provides a way to list other IDs used for the respe
 
 The `Collection_ID` column stores an additional, secondary identifier as it is often provided by collaboration partners (archaeologists, museums, collections) that provide specimen for archaeogenetic research. These identifiers might have a very heterogenous structure and may not be unique across different projects or institutions. The `Collection_ID` column is therefore a free form text field.
 
-The `Group_Name` column contains one or multiple group or population names for each individual, separated by `;`. The first entry must be identical to the one used in the genotype data for the respective sample in a Poseidon package. Assigning group and population names is a hard problem in archeogenetics ([@Eisenmann2018](https://doi.org/10.1038/s41598-018-31123-z)), so the `.janno` file allows for more than one identifier.
+The `Group_Name` column contains one or multiple group or population names for each individual, separated by `;`. The first entry must be identical to the one used in the genotype data for the respective sample in a Poseidon package. Assigning group and population names is a hard problem in archeogenetics [@Eisenmann2018](https://doi.org/10.1038/s41598-018-31123-z), so the `.janno` file allows for more than one identifier.
 
 ### Relations among samples/individuals
 
-To systematically document biological relationships uncovered among samples/individuals in one or multiple Poseidon datasets (e.g. with software like READ ([@MonroyKuhn2018](https://doi.org/10.1371/journal.pone.0195491)) or BREADR ([@Rohrlach2023](https://doi.org/10.1101/2023.04.17.537144)), the `.janno` file can be fit with a set of columns featuring the `Relation_*` prefix. They together should be capable to encode all kinds of pairwise, biological relationships an individual might have.
+To systematically document biological relationships uncovered among samples/individuals in one or multiple Poseidon datasets (e.g. with software like READ [@MonroyKuhn2018](https://doi.org/10.1371/journal.pone.0195491) or BREADR [@Rohrlach2023](https://doi.org/10.1101/2023.04.17.537144)), the `.janno` file can be fit with a set of columns featuring the `Relation_*` prefix. They together should be capable to encode all kinds of pairwise, biological relationships an individual might have.
 
 `Relation_To` is a string list column (so: multiple values are possible if separated by `;`) that stores the `Poseidon_ID`s of other samples/individuals to which the current individual has some relationship. 
 
@@ -97,9 +97,9 @@ The `Genetic_Sex` column should encode the biological sex as determined from the
 - `M`: male
 - `U`: unknown 
 
-This limitation stems from the genotype data formats by Plink and the Eigensoft software package. Edge cases (e.g. XXY, XYY, X0, ...) can not be expressed with this format and should be reported as `U` with an additional comment in the free text `Note` field. Genetic sex determination for ancient DNA can be performed for example with Sex.DetERRmine ([@Lamnidis2018](https://doi.org/10.1038/s41467-018-07483-5)).
+This limitation stems from the genotype data formats by Plink and the Eigensoft software package. Edge cases (e.g. XXY, XYY, X0, ...) can not be expressed with this format and should be reported as `U` with an additional comment in the free text `Note` field. Genetic sex determination for ancient DNA can be performed for example with Sex.DetERRmine [@Lamnidis2018](https://doi.org/10.1038/s41467-018-07483-5).
 
-The `MT_Haplogroup` column is meant to store the human mitochondrial DNA haplogroup for the respective individual in a simple string. The entry can be arbitrarily precise. A software tool to determine the MT haplogroup is for example Haplogrep ([@Schnoeherr2023](https://doi.org/10.1093/nar/gkad284)).
+The `MT_Haplogroup` column is meant to store the human mitochondrial DNA haplogroup for the respective individual in a simple string. The entry can be arbitrarily precise. A software tool to determine the MT haplogroup is for example Haplogrep [@Schnoeherr2023](https://doi.org/10.1093/nar/gkad284).
 
 The `Y_Haplogroup` column holds the respective human Y-chromosome DNA haplogroup in a simple string. The notation should follow a syntax with the main branch + the most terminal derived Y-SNP separated with a minus symbol (e.g. R1b-P312).
 
@@ -112,9 +112,9 @@ The `Nr_Libraries` column holds a simple integer value of the number of librarie
 The `Capture_Type` column specifies the general pre-sequencing preparation methods that have been applied to the library. See [@Knapp2010](https://doi.org/10.3390/genes1020227) for a review of the different techniques (not including younger developments). This field can hold one of multiple different values, but also multiple of these separated by `;` if different methods have been applied for different libraries.
 
 - `Shotgun`: Sequencing without any enrichment (whole genome sequencing, screening etc.)
-- `1240k`: Target enrichment with hybridization capture optimised for sequences covering the 1240k SNP array ([@Fu2015](https://doi.org/10.1038/nature14558), [@Haak2015](https://doi.org/10.1038/nature14317), [@Mathieson2015](https://doi.org/10.1038/nature16152))
+- `1240k`: Target enrichment with hybridization capture optimised for sequences covering the 1240k SNP array [@Fu2015](https://doi.org/10.1038/nature14558), [@Haak2015](https://doi.org/10.1038/nature14317), [@Mathieson2015](https://doi.org/10.1038/nature16152)
 - `ArborComplete`, `ArborPrimePlus`, `ArborAncestralPlus`: Target enrichment with hybridization capture as provided by Arbor Biosciences in three different kits branded [myBaits Expert Human Affinities](https://arborbiosci.com/genomics/targeted-sequencing/mybaits/mybaits-expert/mybaits-expert-human-affinities)
-- `TwistAncientDNA`: Target enrichment with hybridization capture as provided by Twist Bioscience ([@Rohland2022](https://doi.org/10.1101/gr.276728.122))
+- `TwistAncientDNA`: Target enrichment with hybridization capture as provided by Twist Bioscience [@Rohland2022](https://doi.org/10.1101/gr.276728.122)
 - `OtherCapture`: Target enrichment with hybridization capture for any other set of sequences
 - `ReferenceGenome`: Modern reference genomes where aDNA fragmentation is not an issue and other sample preparation techniques apply
 
@@ -138,23 +138,23 @@ The `Genotype_Ploidy` column stores a characteristic of the aDNA data treatment.
 - `diploid`: No random read selection
 - `haploid`: Random read selection to produce pseudo-haploid data
 
-The column `Data_Preparation_Pipeline_URL` should finally store an URL that links to a complete and human-readable description of the computational pipeline (for example a specific configuration for nf-core/eager [@FellowsYates2021](https://doi.org/10.7717/peerj.10947)) by which the sample data was processed.
+The column `Data_Preparation_Pipeline_URL` should finally store an URL that links to a complete and human-readable description of the computational pipeline (for example a specific configuration for nf-core/eager [@FellowsYates2021](https://doi.org/10.7717/peerj.10947) by which the sample data was processed.
 
 #### Data yield
 
 The `Endogenous` column holds the percentage of mapped reads over the total amount of reads that went into the mapping pipeline. That boils down to the DNA percentage of the library that matches the (human) reference. It should be determined from Shotgun libraries (so before any hybridization capture), not on target and without any quality filtering. In case of multiple libraries only the highest value should be reported. The % endogenous DNA can be calculated for example with the [endorS.py](https://github.com/aidaanva/endorS.py) script.
 
 The `Nr_SNPs` column gives the number of SNPs reported in the genotype data files for this individual.
 
-The `Coverage_on_Target_SNPs` column reports the mean SNP coverage on the target SNP array (e.g. 1240K) for the merged libraries of this sample. To calculate the coverage it is necessary to determine which SNPs are covered how many times by the mapped reads. Individual SNPs might be covered multiple times, whereas others may not be covered at all by the highly deteriorated ancient DNA. The coverage for each SNP is therefore a number between 0 and n. The statistic can be determined for example with the QualiMap ([@Okonechnikov2015](https://doi.org/10.1093/bioinformatics/btv566)) software package. In case of multiple libraries, the coverage can be given as a mean across all of them.
+The `Coverage_on_Target_SNPs` column reports the mean SNP coverage on the target SNP array (e.g. 1240K) for the merged libraries of this sample. To calculate the coverage it is necessary to determine which SNPs are covered how many times by the mapped reads. Individual SNPs might be covered multiple times, whereas others may not be covered at all by the highly deteriorated ancient DNA. The coverage for each SNP is therefore a number between 0 and n. The statistic can be determined for example with the QualiMap [@Okonechnikov2015](https://doi.org/10.1093/bioinformatics/btv566) software package. In case of multiple libraries, the coverage can be given as a mean across all of them.
 
 #### Data quality
 
 The `Damage` column contains the % damage on the first position of the 5' end for the main Shotgun library used for sequencing or capture. This is an important statistic to verify the age of ancient DNA. In case of multiple libraries you should report a value from the merged read alignment.
 
 ##### Contamination
 
-Contamination of ancient DNA with foreign reads is a major challenge for archaeogenetics. There exist multiple competing ideas, algorithms and software tools to estimate the degree of contamination for individual samples (e.g. ANGSD ([@Korneliussen2014](https://doi.org/10.1186/s12859-014-0356-4)), contamLD ([@Nakatsuka2020](https://doi.org/10.1186/s13059-020-02111-2)) or hapCon ([@Huang2022](https://doi.org/10.1093/bioinformatics/btac390))), with some methods only applicable under certain circumstances (e.g. popular X-chromosome based approaches only work on male individuals). Also the results of different methods tend to differ both in the degree of contamination they estimate and in the way the output is usually encoded. To cover the multitude of methods in this domain, and to make the results representable in the `.janno` file, we offer the `Contamination_*` column family.
+Contamination of ancient DNA with foreign reads is a major challenge for archaeogenetics. There exist multiple competing ideas, algorithms and software tools to estimate the degree of contamination for individual samples (e.g. ANGSD [@Korneliussen2014](https://doi.org/10.1186/s12859-014-0356-4), contamLD [@Nakatsuka2020](https://doi.org/10.1186/s13059-020-02111-2) or hapCon [@Huang2022](https://doi.org/10.1093/bioinformatics/btac390)), with some methods only applicable under certain circumstances (e.g. popular X-chromosome based approaches only work on male individuals). Also the results of different methods tend to differ both in the degree of contamination they estimate and in the way the output is usually encoded. To cover the multitude of methods in this domain, and to make the results representable in the `.janno` file, we offer the `Contamination_*` column family.
 
 `Contamination` is a list column to represent the different contamination values estimated for a sample with one or multiple software tools. As usual multiple values are separated by `;`.
 
@@ -174,7 +174,7 @@ The `Contamination_Note` column is a free text field to add additional informati
 
 ### Context information
 
-The `Genetic_Source_Accession_IDs` column was introduced to link the derived genotype data in Poseidon with the raw sequencing data typically uploaded to archives like the ENA ([@Burgin2022](https://doi.org/10.1093/nar/gkac1051)) or SRA ([@Katz2021](https://doi.org/10.1093/nar/gkab1053)). There projects or even individual samples are given clear identifiers: Accession IDs. This janno column is supposed to store one or multiple of these Accessions IDs for each individual/sample in Poseidon. If multiple are entered, then they should be arranged by descending specificity from left to right (e.g. project id > sample id > sequencing run id).
+The `Genetic_Source_Accession_IDs` column was introduced to link the derived genotype data in Poseidon with the raw sequencing data typically uploaded to archives like the ENA [@Burgin2022](https://doi.org/10.1093/nar/gkac1051) or SRA [@Katz2021](https://doi.org/10.1093/nar/gkab1053). There projects or even individual samples are given clear identifiers: Accession IDs. This janno column is supposed to store one or multiple of these Accessions IDs for each individual/sample in Poseidon. If multiple are entered, then they should be arranged by descending specificity from left to right (e.g. project id > sample id > sequencing run id).
 
 The `Primary_Contact` column is a free form text field that stores the name of the main or the corresponding author of the respective paper for published data.
 
 
@@ -39,7 +39,7 @@ Before loading the `.janno` files they are validated with `janno::validate_janno
 
 Usually the `.janno` files are loaded as normal `.tsv` files with every column type set to `character` and then the columns are transformed to the intended types. This transformation can be turned off with `to_janno = FALSE`.
 
-`read_janno()` returns an object of class `janno`. `janno` objects are derived [`tibble`s](https://tibble.tidyverse.org/), which integrate well with the tidyverse ([@Wickham2019](https://doi.org/10.21105/joss.01686)) and its packages, e.g. `dplyr` or `ggplot2`. As long as the data layout does not change, they will remain `janno` objects and not be transformed to default tibbles.
+`read_janno()` returns an object of class `janno`. `janno` objects are derived [`tibble`s](https://tibble.tidyverse.org/), which integrate well with the tidyverse [@Wickham2019](https://doi.org/10.21105/joss.01686) and its packages, e.g. `dplyr` or `ggplot2`. As long as the data layout does not change, they will remain `janno` objects and not be transformed to default tibbles.
 
 ### Validate janno files
 
@@ -77,7 +77,7 @@ janno::process_age(
 )
 ```
 
-`janno::process_age` includes calibration of radiocarbon dates with the Bchron R package ([@Haslett2008](https://doi.org/10.1111/j.1467-9876.2008.00623.x)). The calibration curve set in `cal_curve` is applied for every date in the `janno` object. If there are multiple radiocarbon dates for one sample they are automatically combined as the normalized sum of all individual post-calibration probability distributions. 
+`janno::process_age` includes calibration of radiocarbon dates with the Bchron R package [@Haslett2008](https://doi.org/10.1111/j.1467-9876.2008.00623.x). The calibration curve set in `cal_curve` is applied for every date in the `janno` object. If there are multiple radiocarbon dates for one sample they are automatically combined as the normalized sum of all individual post-calibration probability distributions. 
 
 The `choices` argument contains the list of columns that should be calculated and added by `janno::process_age`. `n` is the number of samples that should be drawn for `Date_BC_AD_Sample`.