janno_details.md
+10-10
@@ -12,11 +12,11 @@ The column `Alternative_IDs` provides a way to list other IDs used for the respe
12
12
13
13
The `Collection_ID` column stores an additional, secondary identifier as it is often provided by collaboration partners (archaeologists, museums, collections) that provide specimens for archaeogenetic research. These identifiers might have a very heterogeneous structure and may not be unique across different projects or institutions. The `Collection_ID` column is therefore a free form text field.
14
14
15
-
The `Group_Name` column contains one or multiple group or population names for each individual, separated by `;`. The first entry must be identical to the one used in the genotype data for the respective sample in a Poseidon package. Assigning group and population names is a hard problem in archeogenetics ([@Eisenmann2018](https://doi.org/10.1038/s41598-018-31123-z)), so the `.janno` file allows for more than one identifier.
15
+
The `Group_Name` column contains one or multiple group or population names for each individual, separated by `;`. The first entry must be identical to the one used in the genotype data for the respective sample in a Poseidon package. Assigning group and population names is a hard problem in archaeogenetics [@Eisenmann2018](https://doi.org/10.1038/s41598-018-31123-z), so the `.janno` file allows for more than one identifier.
16
16
17
17
### Relations among samples/individuals
18
18
19
-
To systematically document biological relationships uncovered among samples/individuals in one or multiple Poseidon datasets (e.g. with software like READ ([@MonroyKuhn2018](https://doi.org/10.1371/journal.pone.0195491)) or BREADR ([@Rohrlach2023](https://doi.org/10.1101/2023.04.17.537144)), the `.janno` file can be fit with a set of columns featuring the `Relation_*` prefix. They together should be capable to encode all kinds of pairwise, biological relationships an individual might have.
19
+
To systematically document biological relationships uncovered among samples/individuals in one or multiple Poseidon datasets (e.g. with software like READ [@MonroyKuhn2018](https://doi.org/10.1371/journal.pone.0195491) or BREADR [@Rohrlach2023](https://doi.org/10.1101/2023.04.17.537144)), the `.janno` file can be fitted with a set of columns featuring the `Relation_*` prefix. Together they should be able to encode all kinds of pairwise biological relationships an individual might have.
20
20
21
21
`Relation_To` is a string list column (so: multiple values are possible if separated by `;`) that stores the `Poseidon_ID`s of other samples/individuals to which the current individual has some relationship.
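A string list cell such as `Relation_To` can be read by splitting on `;`. The following is a minimal Python sketch, not part of any Poseidon tool: the function name and the example IDs are hypothetical, and treating `n/a` or an empty cell as "no entries" is an assumption about how missing values are encoded.

```python
def parse_string_list(cell: str) -> list[str]:
    """Split a `;`-separated .janno list cell into trimmed entries.

    Assumption: an empty cell or the literal "n/a" means "no values".
    """
    cell = cell.strip()
    if cell in ("", "n/a"):
        return []
    # Drop empty fragments that a trailing or doubled `;` would produce.
    return [entry.strip() for entry in cell.split(";") if entry.strip()]


# Hypothetical Poseidon_IDs, for illustration only.
relation_to = parse_string_list("Ind001; Ind002")
print(relation_to)
```

The same helper would apply to any other `;`-separated column, for example `Group_Name`.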
22
22
@@ -97,9 +97,9 @@ The `Genetic_Sex` column should encode the biological sex as determined from the
97
97
-`M`: male
98
98
-`U`: unknown
99
99
100
-
This limitation stems from the genotype data formats by Plink and the Eigensoft software package. Edge cases (e.g. XXY, XYY, X0, ...) can not be expressed with this format and should be reported as `U` with an additional comment in the free text `Note` field. Genetic sex determination for ancient DNA can be performed for example with Sex.DetERRmine ([@Lamnidis2018](https://doi.org/10.1038/s41467-018-07483-5)).
100
+
This limitation stems from the genotype data formats by Plink and the Eigensoft software package. Edge cases (e.g. XXY, XYY, X0, ...) cannot be expressed with this format and should be reported as `U` with an additional comment in the free text `Note` field. Genetic sex determination for ancient DNA can be performed for example with Sex.DetERRmine [@Lamnidis2018](https://doi.org/10.1038/s41467-018-07483-5).
101
101
102
-
The `MT_Haplogroup` column is meant to store the human mitochondrial DNA haplogroup for the respective individual in a simple string. The entry can be arbitrarily precise. A software tool to determine the MT haplogroup is for example Haplogrep ([@Schnoeherr2023](https://doi.org/10.1093/nar/gkad284)).
102
+
The `MT_Haplogroup` column is meant to store the human mitochondrial DNA haplogroup for the respective individual in a simple string. The entry can be arbitrarily precise. A software tool to determine the MT haplogroup is for example Haplogrep [@Schnoeherr2023](https://doi.org/10.1093/nar/gkad284).
103
103
104
104
The `Y_Haplogroup` column holds the respective human Y-chromosome DNA haplogroup in a simple string. The notation should follow a syntax with the main branch + the most terminal derived Y-SNP separated with a minus symbol (e.g. R1b-P312).
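The recommended notation can be taken apart at the first minus symbol. This is a small Python sketch under that assumption. The helper name is hypothetical, and values that do not follow the recommended syntax are returned without a SNP part rather than rejected.

```python
from typing import Optional, Tuple


def split_y_haplogroup(value: str) -> Tuple[str, Optional[str]]:
    """Split 'main branch-terminal SNP' at the first minus symbol.

    Returns (main_branch, terminal_snp); terminal_snp is None when the
    value contains no minus symbol or nothing follows it.
    """
    main_branch, separator, terminal_snp = value.strip().partition("-")
    if not separator or not terminal_snp:
        return main_branch, None
    return main_branch, terminal_snp


print(split_y_haplogroup("R1b-P312"))
```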
105
105
@@ -112,9 +112,9 @@ The `Nr_Libraries` column holds a simple integer value of the number of librarie
112
112
The `Capture_Type` column specifies the general pre-sequencing preparation methods that have been applied to the library. See [@Knapp2010](https://doi.org/10.3390/genes1020227) for a review of the different techniques (not including younger developments). This field can hold one of multiple different values, but also multiple of these separated by `;` if different methods have been applied for different libraries.
113
113
114
114
-`Shotgun`: Sequencing without any enrichment (whole genome sequencing, screening etc.)
115
-
-`1240k`: Target enrichment with hybridization capture optimised for sequences covering the 1240k SNP array ([@Fu2015](https://doi.org/10.1038/nature14558), [@Haak2015](https://doi.org/10.1038/nature14317), [@Mathieson2015](https://doi.org/10.1038/nature16152))
115
+
-`1240k`: Target enrichment with hybridization capture optimised for sequences covering the 1240k SNP array [@Fu2015](https://doi.org/10.1038/nature14558), [@Haak2015](https://doi.org/10.1038/nature14317), [@Mathieson2015](https://doi.org/10.1038/nature16152)
116
116
-`ArborComplete`, `ArborPrimePlus`, `ArborAncestralPlus`: Target enrichment with hybridization capture as provided by Arbor Biosciences in three different kits branded [myBaits Expert Human Affinities](https://arborbiosci.com/genomics/targeted-sequencing/mybaits/mybaits-expert/mybaits-expert-human-affinities)
117
-
-`TwistAncientDNA`: Target enrichment with hybridization capture as provided by Twist Bioscience ([@Rohland2022](https://doi.org/10.1101/gr.276728.122))
117
+
-`TwistAncientDNA`: Target enrichment with hybridization capture as provided by Twist Bioscience [@Rohland2022](https://doi.org/10.1101/gr.276728.122)
118
118
-`OtherCapture`: Target enrichment with hybridization capture for any other set of sequences
119
119
-`ReferenceGenome`: Modern reference genomes where aDNA fragmentation is not an issue and other sample preparation techniques apply
120
120
@@ -138,23 +138,23 @@ The `Genotype_Ploidy` column stores a characteristic of the aDNA data treatment.
138
138
-`diploid`: No random read selection
139
139
-`haploid`: Random read selection to produce pseudo-haploid data
140
140
141
-
The column `Data_Preparation_Pipeline_URL` should finally store an URL that links to a complete and human-readable description of the computational pipeline (for example a specific configuration for nf-core/eager [@FellowsYates2021](https://doi.org/10.7717/peerj.10947)) by which the sample data was processed.
141
+
The column `Data_Preparation_Pipeline_URL` should finally store a URL that links to a complete and human-readable description of the computational pipeline (for example a specific configuration for nf-core/eager [@FellowsYates2021](https://doi.org/10.7717/peerj.10947)) by which the sample data was processed.
142
142
143
143
#### Data yield
144
144
145
145
The `Endogenous` column holds the percentage of mapped reads over the total amount of reads that went into the mapping pipeline. That boils down to the DNA percentage of the library that matches the (human) reference. It should be determined from Shotgun libraries (so before any hybridization capture), not on target and without any quality filtering. In case of multiple libraries only the highest value should be reported. The % endogenous DNA can be calculated for example with the [endorS.py](https://github.com/aidaanva/endorS.py) script.
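The arithmetic behind this value can be sketched as follows. This is an illustrative Python sketch, not the endorS.py implementation. The function names and the read counts are made up, and the inputs are assumed to be raw counts from Shotgun libraries without quality filtering.

```python
def endogenous_percent(mapped_reads: int, total_reads: int) -> float:
    """Percentage of reads that mapped to the reference."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    return 100.0 * mapped_reads / total_reads


def reported_endogenous(libraries: list[tuple[int, int]]) -> float:
    """With multiple libraries, report only the highest value."""
    return max(endogenous_percent(mapped, total) for mapped, total in libraries)


# Hypothetical (mapped, total) read counts for two Shotgun libraries.
print(reported_endogenous([(250, 10_000), (900, 20_000)]))
```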
146
146
147
147
The `Nr_SNPs` column gives the number of SNPs reported in the genotype data files for this individual.
148
148
149
-
The `Coverage_on_Target_SNPs` column reports the mean SNP coverage on the target SNP array (e.g. 1240K) for the merged libraries of this sample. To calculate the coverage it is necessary to determine which SNPs are covered how many times by the mapped reads. Individual SNPs might be covered multiple times, whereas others may not be covered at all by the highly deteriorated ancient DNA. The coverage for each SNP is therefore a number between 0 and n. The statistic can be determined for example with the QualiMap ([@Okonechnikov2015](https://doi.org/10.1093/bioinformatics/btv566)) software package. In case of multiple libraries, the coverage can be given as a mean across all of them.
149
+
The `Coverage_on_Target_SNPs` column reports the mean SNP coverage on the target SNP array (e.g. 1240K) for the merged libraries of this sample. To calculate the coverage it is necessary to determine which SNPs are covered how many times by the mapped reads. Individual SNPs might be covered multiple times, whereas others may not be covered at all by the highly deteriorated ancient DNA. The coverage for each SNP is therefore a number between 0 and n. The statistic can be determined for example with the QualiMap [@Okonechnikov2015](https://doi.org/10.1093/bioinformatics/btv566) software package. In case of multiple libraries, the coverage can be given as a mean across all of them.
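As a worked example of the statistic itself: given one depth value per target SNP, including zeros for uncovered SNPs, the mean coverage is the sum of depths divided by the number of target SNPs. The Python sketch below assumes the per-SNP depths have already been extracted from the alignment. The function name and the numbers are illustrative, and it does not reproduce how QualiMap computes its report.

```python
def mean_snp_coverage(depths: list[int]) -> float:
    """Mean depth over all target SNPs, counting uncovered SNPs as 0."""
    if not depths:
        raise ValueError("need at least one target SNP")
    return sum(depths) / len(depths)


# Five hypothetical target SNPs: two uncovered, three covered 2, 1 and 5 times.
print(mean_snp_coverage([0, 2, 1, 0, 5]))
```

Leaving the uncovered SNPs out of `depths` would inflate the result, so the zeros need to stay in the input.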
150
150
151
151
#### Data quality
152
152
153
153
The `Damage` column contains the % damage on the first position of the 5' end for the main Shotgun library used for sequencing or capture. This is an important statistic to verify the age of ancient DNA. In case of multiple libraries you should report a value from the merged read alignment.
154
154
155
155
##### Contamination
156
156
157
-
Contamination of ancient DNA with foreign reads is a major challenge for archaeogenetics. There exist multiple competing ideas, algorithms and software tools to estimate the degree of contamination for individual samples (e.g. ANGSD ([@Korneliussen2014](https://doi.org/10.1186/s12859-014-0356-4)), contamLD ([@Nakatsuka2020](https://doi.org/10.1186/s13059-020-02111-2)) or hapCon ([@Huang2022](https://doi.org/10.1093/bioinformatics/btac390))), with some methods only applicable under certain circumstances (e.g. popular X-chromosome based approaches only work on male individuals). Also the results of different methods tend to differ both in the degree of contamination they estimate and in the way the output is usually encoded. To cover the multitude of methods in this domain, and to make the results representable in the `.janno` file, we offer the `Contamination_*` column family.
157
+
Contamination of ancient DNA with foreign reads is a major challenge for archaeogenetics. There exist multiple competing ideas, algorithms and software tools to estimate the degree of contamination for individual samples (e.g. ANGSD [@Korneliussen2014](https://doi.org/10.1186/s12859-014-0356-4), contamLD [@Nakatsuka2020](https://doi.org/10.1186/s13059-020-02111-2) or hapCon [@Huang2022](https://doi.org/10.1093/bioinformatics/btac390)), with some methods only applicable under certain circumstances (e.g. popular X-chromosome based approaches only work on male individuals). Also the results of different methods tend to differ both in the degree of contamination they estimate and in the way the output is usually encoded. To cover the multitude of methods in this domain, and to make the results representable in the `.janno` file, we offer the `Contamination_*` column family.
158
158
159
159
`Contamination` is a list column to represent the different contamination values estimated for a sample with one or multiple software tools. As usual multiple values are separated by `;`.
160
160
@@ -174,7 +174,7 @@ The `Contamination_Note` column is a free text field to add additional informati
174
174
175
175
### Context information
176
176
177
-
The `Genetic_Source_Accession_IDs` column was introduced to link the derived genotype data in Poseidon with the raw sequencing data typically uploaded to archives like the ENA ([@Burgin2022](https://doi.org/10.1093/nar/gkac1051)) or SRA ([@Katz2021](https://doi.org/10.1093/nar/gkab1053)). There projects or even individual samples are given clear identifiers: Accession IDs. This janno column is supposed to store one or multiple of these Accessions IDs for each individual/sample in Poseidon. If multiple are entered, then they should be arranged by descending specificity from left to right (e.g. project id > sample id > sequencing run id).
177
+
The `Genetic_Source_Accession_IDs` column was introduced to link the derived genotype data in Poseidon with the raw sequencing data typically uploaded to archives like the ENA [@Burgin2022](https://doi.org/10.1093/nar/gkac1051) or SRA [@Katz2021](https://doi.org/10.1093/nar/gkab1053). There, projects or even individual samples are given clear identifiers: Accession IDs. This janno column is supposed to store one or multiple of these Accession IDs for each individual/sample in Poseidon. If multiple are entered, then they should be arranged by descending specificity from left to right (e.g. project id > sample id > sequencing run id).
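The ordering from the example (project id, then sample id, then sequencing run id) could be applied automatically by looking at the accession prefix. The Python sketch below is only an illustration. The prefix patterns reflect common INSDC conventions as an assumption and are not exhaustive, the accession numbers are made up, and IDs that match no pattern are simply placed last.

```python
import re

# Assumed prefix conventions, most general level first:
# project (PRJEB/PRJNA/PRJDB), sample (SAMEA/SAMN/SAMD, ERS/SRS/DRS),
# sequencing run (ERR/SRR/DRR).
_LEVELS = [
    re.compile(r"^PRJ[EDN][A-Z]\d+$"),
    re.compile(r"^(SAM[EDN][A-Z]?\d+|[EDS]RS\d+)$"),
    re.compile(r"^[EDS]RR\d+$"),
]


def accession_level(accession: str) -> int:
    """Return 0 for project, 1 for sample, 2 for run, 3 for unknown IDs."""
    for level, pattern in enumerate(_LEVELS):
        if pattern.match(accession):
            return level
    return len(_LEVELS)


def order_accessions(accessions: list[str]) -> list[str]:
    """Arrange IDs as project > sample > run; ties keep their input order."""
    return sorted(accessions, key=accession_level)


# Hypothetical accession numbers.
print(";".join(order_accessions(["ERR123456", "SAMEA1234567", "PRJEB12345"])))
```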
178
178
179
179
The `Primary_Contact` column is a free form text field that stores the name of the main or the corresponding author of the respective paper for published data.
janno_r_package.md
+2-2
@@ -39,7 +39,7 @@ Before loading the `.janno` files they are validated with `janno::validate_janno
39
39
40
40
Usually the `.janno` files are loaded as normal `.tsv` files with every column type set to `character` and then the columns are transformed to the intended types. This transformation can be turned off with `to_janno = FALSE`.
41
41
42
-
`read_janno()` returns an object of class `janno`. `janno` objects are derived [`tibble`s](https://tibble.tidyverse.org/), which integrate well with the tidyverse ([@Wickham2019](https://doi.org/10.21105/joss.01686)) and its packages, e.g. `dplyr` or `ggplot2`. As long as the data layout does not change, they will remain `janno` objects and not be transformed to default tibbles.
42
+
`read_janno()` returns an object of class `janno`. `janno` objects are derived [`tibble`s](https://tibble.tidyverse.org/), which integrate well with the tidyverse [@Wickham2019](https://doi.org/10.21105/joss.01686) and its packages, e.g. `dplyr` or `ggplot2`. As long as the data layout does not change, they will remain `janno` objects and not be transformed to default tibbles.
43
43
44
44
### Validate janno files
45
45
@@ -77,7 +77,7 @@ janno::process_age(
77
77
)
78
78
```
79
79
80
-
`janno::process_age` includes calibration of radiocarbon dates with the Bchron R package ([@Haslett2008](https://doi.org/10.1111/j.1467-9876.2008.00623.x)). The calibration curve set in `cal_curve` is applied for every date in the `janno` object. If there are multiple radiocarbon dates for one sample they are automatically combined as the normalized sum of all individual post-calibration probability distributions.
80
+
`janno::process_age` includes calibration of radiocarbon dates with the Bchron R package [@Haslett2008](https://doi.org/10.1111/j.1467-9876.2008.00623.x). The calibration curve set in `cal_curve` is applied for every date in the `janno` object. If there are multiple radiocarbon dates for one sample they are automatically combined as the normalized sum of all individual post-calibration probability distributions.
81
81
82
82
The `choices` argument contains the list of columns that should be calculated and added by `janno::process_age`. `n` is the number of samples that should be drawn for `Date_BC_AD_Sample`.