This is a summary of the best practices implemented in the local analysis scripts.
Numerical arrays are stored as NumPy arrays of various types (integers, floats, booleans), not as regular Python lists.
NumPy arrays have a number of advantages, including:
- Low memory load: A regular Python list can hold elements of different types, so each element is a separate object that carries its own type information. All elements of a NumPy array must share one type (e.g. integers), so the type is stored only once for the whole array.
- Fast and efficient computation: NumPy provides universal functions (ufuncs), which operate element-wise on whole arrays without explicit Python loops and support broadcasting between arrays of compatible shapes.
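The two advantages above can be illustrated with a small sketch. The data are made up for illustration (the variable names are not taken from the analysis scripts), and the exact size of the list depends on the Python build, so only the array size is stated as a number.

```python
import sys
import numpy as np

# Hypothetical values for illustration only.
depths_list = list(range(1000))
depths_array = np.arange(1000, dtype=np.int64)

# A list stores references to separate Python int objects, so its full
# footprint is the list itself plus every element object.
list_bytes = sys.getsizeof(depths_list) + sum(sys.getsizeof(x) for x in depths_list)

# The array stores raw 8-byte values in one buffer; the dtype is stored once.
array_bytes = depths_array.nbytes  # 1000 elements * 8 bytes

# A ufunc (np.add) applied element-wise, with no explicit Python loop.
doubled = np.add(depths_array, depths_array)

# Broadcasting: a (3, 1) column combined with a (4,) row gives a (3, 4) result.
per_sample_offset = np.array([[0], [10], [20]])
positions = np.array([1, 2, 3, 4])
grid = per_sample_offset + positions
```

Here `list_bytes` comes out several times larger than `array_bytes`, and `grid` has shape `(3, 4)` without either input being copied or tiled by hand.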
Data tables (e.g. reference genome annotations) are stored as pandas dataframes.
Dataframes have a number of advantages, including:
- Labeled axes (e.g. column names can be used like a dictionary)
- Ability to support heterogeneous data (e.g. integers in one column, strings in another column)
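As a minimal sketch of both points, the following builds a small annotation table. The column names and values are assumptions made for illustration and do not reflect the actual annotation tables used by the scripts.

```python
import pandas as pd

# Hypothetical annotation rows; column names are illustrative assumptions.
annotations = pd.DataFrame(
    {
        "contig": [1, 1, 2],
        "start": [100, 950, 20],
        "end": [400, 1300, 310],
        "gene": ["geneA", "geneB", "geneC"],
    }
)

# Columns are accessed by label, much like dictionary keys.
gene_lengths = (annotations["end"] - annotations["start"] + 1).tolist()

# Each column keeps its own type: integers in one, strings in another.
start_is_integer = pd.api.types.is_integer_dtype(annotations["start"])
gene_is_integer = pd.api.types.is_integer_dtype(annotations["gene"])

# Label-based filtering: names of the genes on contig 1.
contig1_genes = annotations.loc[annotations["contig"] == 1, "gene"].tolist()
```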
The following standards have been implemented for counting positions and contigs on the reference genome:
- Positions on the reference genome are counted starting from 1 in order to match the VCF files.
- Contigs on the reference genome are also counted starting from 1.
These standards are implemented in the reference genome class.
(All Python indexing still starts at 0. The above refers only to how positions and contigs are named.)
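One way to keep the two conventions apart is to convert explicitly at the boundary. The helpers below are a hypothetical sketch and are not the interface of the reference genome class; they assume that a per-position array stores position 1 at index 0.

```python
import numpy as np

def position_to_index(position):
    """Convert a 1-based genome position to a 0-based array index."""
    if position < 1:
        raise ValueError("genome positions start at 1")
    return position - 1

def index_to_position(index):
    """Convert a 0-based array index back to a 1-based genome position."""
    return index + 1

# Hypothetical per-position values for a 5-base contig.
coverage = np.array([12, 15, 9, 30, 22])

# Position 3 on the genome lives at array index 2.
value_at_position_3 = coverage[position_to_index(3)]
```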
Basecalls are stored in numerical arrays where:
- 0 = N (ambiguous basecall)
- 1/2/3/4 = A/T/C/G
The Python dictionaries `NTs_to_int_dict` and `int_to_NTs_dict` define the mapping between integers and nucleotides. This mapping should never be hard-coded.
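A minimal sketch of how the two dictionaries might be used follows. Their contents here are assumed from the encoding listed above (0 = N, 1/2/3/4 = A/T/C/G); the real definitions live in the analysis scripts and should be imported from there rather than retyped. The `encode` and `decode` helpers are hypothetical.

```python
import numpy as np

# Assumed contents, following the encoding listed above.
NTs_to_int_dict = {"N": 0, "A": 1, "T": 2, "C": 3, "G": 4}
# Derive the reverse mapping so the two dictionaries cannot drift apart.
int_to_NTs_dict = {value: key for key, value in NTs_to_int_dict.items()}

def encode(sequence):
    """Turn a nucleotide string into an integer array via the dictionary."""
    return np.array([NTs_to_int_dict[nt] for nt in sequence], dtype=np.uint8)

def decode(calls):
    """Turn an integer array back into a nucleotide string."""
    return "".join(int_to_NTs_dict[int(call)] for call in calls)

encoded = encode("ATNCG")
round_trip = decode(encoded)
```

Going through the dictionaries in both directions means a change to the encoding only has to be made in one place.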
Storing Ns as 0s is efficient because built-in functions such as `numpy.count_nonzero` can then count the unambiguous basecalls directly.
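For example, with a made-up basecall array (3 samples by 5 positions, values chosen only for illustration), the number of unambiguous calls can be counted along either axis:

```python
import numpy as np

# Hypothetical basecalls: rows are samples, columns are positions, 0 = N.
calls = np.array(
    [
        [1, 2, 0, 3, 4],
        [0, 0, 1, 1, 2],
        [4, 4, 4, 0, 3],
    ],
    dtype=np.uint8,
)

# Unambiguous calls per sample (count across positions).
called_per_sample = np.count_nonzero(calls, axis=1)

# Unambiguous calls per position (count across samples).
called_per_position = np.count_nonzero(calls, axis=0)

# Fraction of all calls that are N.
n_fraction = 1 - np.count_nonzero(calls) / calls.size
```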
Indexing of arrays is standardized so that each axis (dimension) always has the same meaning:
- Index 0 = sample
- Index 1 = position on genome
- Index 2 = another characteristic (if applicable)
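The sketch below shows what this convention looks like for a three-dimensional array. The third axis is assumed here to hold one count per nucleotide; that choice, the array name, and the values are illustrative only.

```python
import numpy as np

n_samples, n_positions, n_nucleotides = 2, 3, 4

# Hypothetical counts with axes (sample, position, nucleotide).
counts = np.arange(n_samples * n_positions * n_nucleotides).reshape(
    n_samples, n_positions, n_nucleotides
)

# Index 0 selects a sample: all positions and nucleotides for sample 0.
sample0 = counts[0]

# Index 1 selects a position: array index 1 across all samples.
second_position = counts[:, 1]

# Summing over index 2 collapses the extra characteristic,
# leaving a (sample, position) array.
depth = counts.sum(axis=2)
```

Because every array follows the same axis order, expressions such as `counts[:, 1]` or `sum(axis=2)` mean the same thing wherever they appear.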