Bioinformatics is an interdisciplinary field that develops methods and software tools for understanding biological data. — Wikipedia
A curated list of awesome Bioinformatics software, resources, and libraries. Mostly command line based, and free or open-source. Please feel free to contribute!
- datamash - Data transformations and statistics.
- Bioinformatics One Liners - Git repo of useful single line commands.
- CSVKit - Utilities for working with CSV/Tab-delimited files.
- csvtk - Another cross-platform, efficient, practical and pretty CSV/TSV toolkit.
- easy_qsub - Easily submitting PBS jobs with script template. Multiple input files supported.
- GNU parallel - General parallelizer that runs jobs in parallel on a single multi-core machine. Here are some example scripts using GNU parallel.
- zindex - Create an index on a compressed text file.
- tabix - Table file index.
- wormtable - Write-once-read-many table for large datasets.
- grabix - A wee tool for random access into BGZF files.
- BioNode - Modular and universal bioinformatics, Bionode provides pipeable UNIX command line tools and JavaScript APIs for bioinformatics analysis workflows.
- Awesome-Pipeline - A list of pipeline resources.
- Common Workflow Language - A specification for describing analysis workflows and tools that are portable and scalable across a variety of software and hardware environments, from workstations to cluster, cloud, and high performance computing (HPC) environments.
- Cromwell - A Workflow Management System geared towards scientific workflows.
- Ruffus - Computation pipeline library for Python, widely used in science and bioinformatics.
- Snakemake - A workflow management system in Python that aims to reduce the complexity of creating workflows by providing a fast and comfortable execution environment.
- Nextflow - A fluent DSL modelled around the UNIX pipe concept, that simplifies writing parallel and scalable pipelines in a portable manner.
- BigDataScript - A cross-system scripting language for working with big data pipelines in computer systems of different sizes and capabilities.
- Bpipe - A small language for defining pipeline stages and linking them together to make pipelines.
- GATK Queue - A pipelining system built to work natively with GATK as well as other high-throughput sequence analysis software.
- SeqWare - Hadoop Oozie-based workflow system focused on genomics data analysis in cloud environments.
- bcbio-nextgen - Batteries included genomic analysis pipeline for variant and RNA-Seq analysis, structural variant calling, annotation, and prediction.
- Workflow Descriptor Language - Workflow standard developed by the Broad.
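
The workflow tools above share one core idea: stages declare what they depend on, and the system works out a valid execution order. The following is a minimal sketch of that idea using only the Python standard library (`graphlib`, Python 3.9+). The stage names are hypothetical, and this is not how any of the listed tools is implemented or invoked.

```python
from graphlib import TopologicalSorter

# Hypothetical stage names; each stage maps to the set of stages it depends on.
dependencies = {
    "trim": set(),
    "align": {"trim"},
    "sort": {"align"},
    "call_variants": {"sort"},
    "qc_report": {"trim"},
}


def execution_order(deps):
    """Return one valid order in which the stages can run."""
    return list(TopologicalSorter(deps).static_order())


order = execution_order(dependencies)
print(order)
```

Real workflow managers add much more on top of this ordering step, such as file tracking, job submission to clusters, and resuming after failures.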
Sequence processing includes tasks such as demultiplexing raw read data and trimming low-quality bases.
- Fastqp - FASTQ and SAM quality control using Python.
- FastQC - A quality control tool for high throughput sequence data.
- Fastx Toolkit - FASTQ/A short-reads pre-processing tools: Demultiplexing, trimming, clipping, quality filtering, and masking utilities.
- Seqtk - Toolkit for processing sequences in FASTA/Q formats.
- SeqKit - A cross-platform and ultrafast toolkit for FASTA/Q file manipulation in Golang.
- seqmagick - Convenient file format conversion using Biopython.
- AfterQC - Automatic Filtering, Trimming, Error Removing and Quality Control for fastq data
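
As an illustration of what "trimming low-quality bases" means, here is a minimal sketch in plain Python. It assumes Phred+33 quality encoding and a simple rule: drop bases from the 3' end until one meets a threshold. The function names and the default threshold of 20 are choices made for this example, not the behaviour of any tool listed above.

```python
def phred_scores(quality, offset=33):
    """Decode a FASTQ quality string into integer Phred scores (Phred+33 assumed)."""
    return [ord(ch) - offset for ch in quality]


def trim_3prime(sequence, quality, min_quality=20):
    """Drop bases from the 3' end until one has a score of at least min_quality."""
    scores = phred_scores(quality)
    end = len(scores)
    while end > 0 and scores[end - 1] < min_quality:
        end -= 1
    return sequence[:end], quality[:end]


# 'I' decodes to 40, '#' to 2, and '!' to 0 under Phred+33.
seq, qual = trim_3prime("ACGTACGT", "IIIIII#!")
print(seq, qual)
```

Production trimmers typically use sliding windows or running-sum algorithms and also handle adapters; the loop above only shows the basic principle.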
De Novo Alignment
DNA Resequencing
- BWA - Burrows-Wheeler Aligner for pairwise alignment between DNA sequences.
- Bowtie 2 - An ultrafast and memory-efficient tool for aligning sequencing reads to long reference sequences.
- samtools/bcftools/htslib - A suite of tools for manipulating next-generation sequencing data.
- freebayes - Bayesian haplotype-based polymorphism discovery and genotyping.
- GATK - Variant Discovery in High-Throughput Sequencing Data.
- Bamtools - Collection of tools for working with BAM files.
- mergesam - Automate common SAM & BAM conversions.
- SAMstat - Displaying sequence statistics for next-generation sequencing.
- Telseq - Telseq is a tool for estimating telomere length from whole genome sequence data.
- bam toolbox - MtDNA:Nuclear Coverage; BAM Toolbox can output the ratio of MtDNA:nuclear coverage, a proxy for mitochondrial content.
- vcflib - A C++ library for parsing and manipulating VCF files.
- bcftools - Set of tools for manipulating VCF files.
- vcftools - VCF manipulation and statistics (e.g. linkage disequilibrium, allele frequency, Fst).
- vcfanno - Annotate a VCF with other VCFs/BEDs/tabixed files.
- Bedtools2 - A Swiss Army knife for genome arithmetic.
- BEDOPS - The fast, highly scalable and easily-parallelizable genome analysis toolkit.
- gffutils - GFF and GTF file manipulation and interconversion.
- wgsim - Comes with samtools! - Reads simulator.
- Bam Surgeon - Tools for adding mutations to existing .bam files, used for testing mutation callers.
- SIFT - Predicts whether an amino acid substitution affects protein function.
- SnpEff - Genetic variant annotation and effect prediction toolbox.
- cruzdb - Pythonic access to the UCSC Genome database.
- pyensembl - Pythonic Access to the Ensembl database.
- pyfaidx - Pythonic access to FASTA files.
- pyBedTools - Python wrapper for bedtools.
- pysam - Python wrapper for samtools.
- pyVCF - A VCF Parser for Python.
- cyvcf - A port of pyVCF using Cython for speed.
- cyvcf2 - Cython + HTSlib == fast VCF parsing; even faster parsing than pyVCF.
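
To give a sense of what the VCF parsers above do, here is a minimal sketch in plain Python that splits a VCF data line into its eight fixed columns. The function names and the returned dictionary layout are invented for this example; this is not the API of pyVCF, cyvcf, or cyvcf2, and it ignores sample columns and header metadata.

```python
def parse_vcf_line(line):
    """Split one tab-delimited VCF data line into its eight fixed columns."""
    fields = line.rstrip("\n").split("\t")
    chrom, pos, vid, ref, alt, qual, flt, info = fields[:8]
    info_dict = {}
    if info != ".":
        for entry in info.split(";"):
            key, sep, value = entry.partition("=")
            # Keys without '=' are flags; store them as True.
            info_dict[key] = value if sep else True
    return {
        "chrom": chrom,
        "pos": int(pos),
        "id": vid,
        "ref": ref,
        "alt": alt.split(","),
        "qual": None if qual == "." else float(qual),
        "filter": flt,
        "info": info_dict,
    }


def read_variants(lines):
    """Yield parsed records, skipping header lines ('#') and blank lines."""
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue
        yield parse_vcf_line(line)


example = [
    "##fileformat=VCFv4.2\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
    "chr1\t12345\trs1\tA\tG,T\t50\tPASS\tDP=10;DB\n",
]
records = list(read_variants(example))
print(records[0])
```

For real data, the libraries listed above are the better choice: they handle genotype columns, typed INFO fields, and compressed or indexed files.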
The following tools can be used to visualize genomic data or for constructing customized visualizations of genomic data including sequence data from DNA-Seq, RNA-Seq, and ChIP-Seq, variants, and more.
- biodalliance - Embeddable genome viewer. Integrates data from a wide variety of sources, and can load data directly from popular genomics file formats including bigWig, BAM, and VCF.
- IGV js - JavaScript-based browser. Fast, efficient, scalable visualization tool for genomics data and annotations. Handles a large variety of formats.
- Island Plot - D3 JavaScript based genome viewer. Constructs SVGs.
- pileup.js - JavaScript library that can be used to generate interactive and highly customizable web-based genome browsers.
- scribl - JavaScript library for drawing canvas-based gene diagrams. The Homepage has examples.
- DNAism - Horizon chart D3-based JavaScript library for DNA data.
- Circleator - Flexible circular visualization of genome-associated data with BioPerl and SVG.
- BioJS - BioJS is a library of over one hundred JavaScript components enabling you to visualize and process data using current web technologies.
- Circos - Perl package for circular plots, which are well suited for genomic rearrangements.
- J-Circos - A Java application for doing interactive work with circos plots.
- ClicO FS - An interactive web-based service of Circos.
- rCircos - R package for circular plots.
- OmicCircos - R package for circular plots for omics data.
- Entrez Direct: E-utilities on the UNIX command line - UNIX command line tools to access NCBI's databases programmatically. Instructions to install and examples are found in the link.
- What is a bioinformatician
- Bioinformatics Curriculum Guidelines: Toward a Definition of Core Competencies
- Top N Reasons To Do A Ph.D. or Post-Doc in Bioinformatics/Computational Biology
- A 10-Step Guide to Party Conversation For Bioinformaticians - Here is a step-by-step guide on how to convey concepts to people not involved in the field when asked the question: 'So, what do you do?'
- A History Of Bioinformatics (In The Year 2039) - A talk by C. Titus Brown giving his take on bioinformatics as seen looking back from the year 2039. His notes for this talk can be found here.
- A farewell to bioinformatics - A critical view of the state of bioinformatics.
- A Series of Interviews with Notable Bioinformaticians - Dr. Keith Bradnam "thought it might be instructive to ask a simple series of questions to a bunch of notable bioinformaticians to assess their feelings on the current state of bioinformatics research, and maybe get any tips they have about what has been useful to their bioinformatics careers."
- Learning Resources Index - Adrián E. Salatino's attempt at consolidating useful links and resources he has found helpful in his graduate career, ranging from (but not limited to) programming help, bioinformatics software, and even blogs to follow.
- Rosalind - Rosalind is a platform for learning bioinformatics through problem solving.
- A guide for the lonely bioinformatician - This guide is aimed at bioinformaticians, and is meant to guide them towards better career development.
- Next-Generation Sequencing Technologies - Elaine Mardis (2014) [1:34:35] - Excellent (technical) overview of next-generation and third-generation sequencing technologies, along with some applications in cancer research.
- Annotated bibliography of *Seq assays - List of ~100 papers on various sequencing technologies and assays ranging from transcription to transposable element discovery.
- For all you seq... (PDF) (3456x5471) - Massive infographic by Illumina illustrating how many sequencing techniques work. Techniques cover protein-protein interactions, RNA transcription, RNA-protein interactions, RNA low-level detection, RNA modifications, RNA structure, DNA rearrangements and markers, DNA low-level detection, epigenetics, and DNA-protein interactions. References included.
- Review papers on RNA-seq (Biostars) - Includes lots of seminal papers on RNA-seq and analysis methods.
- Informatics for RNA-seq: A web resource for analysis on the cloud - Educational resource on performing RNA-seq analysis in the cloud using Amazon AWS cloud services. Topics include preparing the data, preprocessing, differential expression, isoform discovery, data visualization, and interpretation.
- RNA-seqlopedia - RNA-seqlopedia provides an awesome overview of RNA-seq and of the choices necessary to carry out a successful RNA-seq experiment.
- A survey of best practices for RNA-seq data analysis - Gives an awesome roadmap for RNA-seq computational analyses, including challenges/obstacles and things to look out for, but also how you might integrate RNA-seq data with other data types.
- Stories from the Supplement [46:39] - Dr. Lior Pachter shares his stories from the supplement for the well-known RNA-seq analysis software CuffDiff and Cufflinks and explains some of their methodologies.
- List of RNA-seq Bioinformatics Tools - Extensive list on Wikipedia of RNA-seq bioinformatics tools covering all parts of an analysis pipeline, from quality control and alignment to splice analysis and visualization.
- RNA-seq Analysis - @crazyhottommy's notes on various steps and considerations when doing RNA-seq analysis.
- ChIP-seq analysis notes from Tommy Tang - Resources on ChIP-seq data which include papers, methods, links to software, and analysis.
- Current Topics in Genome Analysis 2016 - Excellent series of fourteen lectures given at NIH about current topics in genomics ranging from sequence analysis, to sequencing technologies, and even more translational topics such as genomic medicine.
- GenomeTV - "GenomeTV is NHGRI's collection of official video resources from lectures, to news documentaries, to full video collections of meetings that tackle the research, issues and clinical applications of genomic research."
- Leading Strand - Keynote lectures from Cold Spring Harbor Laboratory (CSHL) Meetings. More on The Leading Strand.
- Genomics, Big Data and Medicine Seminar Series - "Our seminars are dedicated to the critical intersection of GBM, delving into 'bleeding edge' technology and approaches that will deeply shape the future."
- Rafael Irizarry's Channel - Dr. Rafael Irizarry's lectures and academic talks on statistics for genomics.
- NIH VideoCasting and Podcasting - "NIH VideoCast broadcasts seminars, conferences and meetings live to a world-wide audience over the Internet as a real-time streaming video." Not exclusively genomics and bioinformatics video but many great talks on domain specific use of bioinformatics and genomics.
- ACGT - Dr. Keith Bradnam writes about his "thoughts on biology, genomics, and the ongoing threat to humanity from the bogus use of bioinformatics acronyms."
- Opiniomics - Dr. Mick Watson writes on bioinformatics, genomes, and biology.
- Bits of DNA - Dr. Lior Pachter writes review and commentary on computational biology.
- it is NOT junk - Dr. Michael Eisen writes "a blog about genomes, DNA, evolution, open science, baseball and other important things"
- The Leek group guide to genomics papers - Expertly curated genomics papers to get up to speed on genomics, RNA-seq, statistics (used in genomics), software development, and more.
- A New Online Computational Biology Curriculum - "This article introduces a catalog of several hundred free video courses of potential interest to those wishing to expand their knowledge of bioinformatics and computational biology. The courses are organized into eleven subject areas modeled on university departments and are accompanied by commentary and career advice."
- How Perl Saved the Human Genome Project - An anecdote by Lincoln D. Stein on the importance of the Perl programming language in the Human Genome Project.
- Educational Papers from Nature Biotechnology and PLoS Computational Biology - Page of links to primers and short educational articles on various methods used in computational biology and bioinformatics.