File formats

Mathieu Fourment edited this page May 18, 2015 · 1 revision
  1. Sequence file formats
  2. Tree file formats
## Sequence file formats

It is strongly recommended to avoid sequence names containing any of the characters ( ) : , [ ] and ; because they are part of the NEWICK syntax and will cause errors in phylogenetic analyses. Some programs also reject names containing hyphens (-) or spaces.
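
The recommendation above can be checked automatically before running an analysis. The following is a minimal sketch, not part of any particular program: the helper names `find_problem_chars` and `sanitize_name`, the `strict` flag, and the underscore replacement are all assumptions chosen for illustration.

```python
# Characters that are part of the NEWICK syntax.
NEWICK_RESERVED = set("():,[];")


def _bad_chars(strict):
    # In strict mode, also flag hyphens and spaces, which some programs reject.
    return NEWICK_RESERVED | ({"-", " "} if strict else set())


def find_problem_chars(name, strict=False):
    """Return the sorted list of problematic characters found in `name`."""
    return sorted(set(name) & _bad_chars(strict))


def sanitize_name(name, replacement="_", strict=False):
    """Replace every problematic character in `name` with `replacement`."""
    bad = _bad_chars(strict)
    return "".join(replacement if c in bad else c for c in name)


print(find_problem_chars("seq(1):a"))
print(sanitize_name("A/H1N1 (2009)", strict=True))
```

Replacing characters can make two names identical, so it is worth checking that the sanitized names are still unique.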

FASTA

Names must start with >


>seq1
ATCG
ACCC
>seq2
TCAT
AAAA
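
As an illustration of how this layout can be read, here is a minimal sketch of a FASTA reader. The function name `parse_fasta` is hypothetical, and the sketch assumes that everything after `>` on a header line is the sequence name.

```python
def parse_fasta(text):
    """Return a list of (name, sequence) pairs from FASTA-formatted text."""
    records = []
    name, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if name is not None:
                records.append((name, "".join(chunks)))
            name, chunks = line[1:].strip(), []
        elif name is not None:
            # Sequence lines are concatenated until the next header.
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


example = ">seq1\nATCG\nACCC\n>seq2\nTCAT\nAAAA\n"
print(parse_fasta(example))  # [('seq1', 'ATCGACCC'), ('seq2', 'TCATAAAA')]
```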

PHYLIP

Sequences can be either interleaved or sequential. Unlike in the original format, sequence names are not restricted in length. The first line of the file must contain the number of sequences followed by a space and the alignment length, as in the examples below (2 sequences, 8 sites). Optionally, the layout can be specified by appending a space and i (for interleaved) or s (for sequential). If no layout is specified, the file is assumed to be interleaved.

Interleaved format:


 2 8 i
seq1 ATCG
seq2 TCAT
  
ACCC
AAAA

Sequential format:


 2 8 s
seq1 ATCG
ACCC
seq2 TCAT
AAAA
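
The two layouts can be read with the same function once the header line is known. The sketch below follows the example headers above (2 sequences, 8 sites, optional i or s). The function name `parse_phylip` is hypothetical, and the sketch assumes that names contain no whitespace and are separated from the sequence by whitespace.

```python
def parse_phylip(text):
    """Return a list of (name, sequence) pairs from relaxed PHYLIP text."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    header = lines[0].split()
    ntax, nchar = int(header[0]), int(header[1])
    # Interleaved is the default when no layout flag is given.
    layout = header[2].lower() if len(header) > 2 else "i"
    body = lines[1:]
    names, seqs = [], []

    def start_record(line):
        parts = line.split(None, 1)
        names.append(parts[0])
        seqs.append(parts[1].replace(" ", "") if len(parts) > 1 else "")

    if layout == "i":
        # First block: one "name sequence" line per taxon.
        for line in body[:ntax]:
            start_record(line)
        # Later blocks: sequence only, in the same taxon order.
        for i, line in enumerate(body[ntax:]):
            seqs[i % ntax] += line.replace(" ", "")
    else:
        for line in body:
            # A new record starts once the previous one has nchar sites.
            if not names or len(seqs[-1]) >= nchar:
                start_record(line)
            else:
                seqs[-1] += line.replace(" ", "")

    if len(names) != ntax or any(len(s) != nchar for s in seqs):
        raise ValueError("alignment does not match the header dimensions")
    return list(zip(names, seqs))


interleaved = " 2 8 i\nseq1 ATCG\nseq2 TCAT\n\nACCC\nAAAA\n"
sequential = " 2 8 s\nseq1 ATCG\nACCC\nseq2 TCAT\nAAAA\n"
print(parse_phylip(interleaved) == parse_phylip(sequential))  # True
```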

MEGA

Sequences can be either interleaved or sequential. In the interleaved layout, sequence names cannot contain spaces.

Interleaved format:


#mega
Title: My interleaved alignment

#seq1 ATCG
#seq2 TCAT

#seq1 ACCC
#seq2 AAAA

Sequential format:


#mega
Title: My sequential alignment

#seq1
ATCG
ACCC
#seq2
TCAT
AAAA

GDE flat


#seq1
ATCG
ACCC
#seq2
TCAT
AAAA

CLUSTAL

Names cannot contain spaces.


CLUSTAL W multiple sequence alignment

seq1 ATCG 4
seq2 TCAT 4

seq1 ACCC 8
seq2 AAAA 8

NBRF/PIR

Names must be preceded by > followed by a two-character sequence type and a semicolon. The next line is the sequence description. Each sequence ends with an asterisk (*).

Sequence type:

  • P1 - Protein (complete)
  • F1 - Protein (fragment)
  • D1 - DNA (e.g. EMBOSS seqret output)
  • DL - DNA (linear)
  • DC - DNA (circular)
  • RL - RNA (linear)
  • RC - RNA (circular)
  • N3 - tRNA
  • N1 - Other functional RNA
  • XX - Unknown

>DL;seq1
Sequence 1 description
ATCG
ACCC
*
>DL;seq2
Sequence 2 description
TCAT
AAAA
*
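
The three parts of each record (type and name, description, sequence terminated by an asterisk) can be separated as in the following minimal sketch. The function name `parse_pir` and the dictionary keys are hypothetical, chosen only for illustration.

```python
def parse_pir(text):
    """Return a list of records (dicts) from NBRF/PIR-formatted text."""
    records = []
    lines = iter(text.splitlines())
    for line in lines:
        line = line.strip()
        if not line.startswith(">"):
            continue  # skip anything outside a record
        # Header: ">" + two-character type + ";" + name
        seq_type, _, name = line[1:].partition(";")
        # The line after the header is the description.
        description = next(lines, "").strip()
        chunks = []
        # Read sequence lines up to and including the terminating "*".
        for seq_line in lines:
            seq_line = seq_line.strip()
            if seq_line.endswith("*"):
                chunks.append(seq_line[:-1])
                break
            chunks.append(seq_line)
        records.append({
            "type": seq_type,
            "name": name,
            "description": description,
            "sequence": "".join(chunks).replace(" ", ""),
        })
    return records


example = (
    ">DL;seq1\nSequence 1 description\nATCG\nACCC\n*\n"
    ">DL;seq2\nSequence 2 description\nTCAT\nAAAA\n*\n"
)
for record in parse_pir(example):
    print(record["type"], record["name"], record["sequence"])
```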

Stockholm

Names cannot contain spaces.


# STOCKHOLM 1.0

seq1 ATCG
seq2 TCAT

seq1 ACCC
seq2 AAAA
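
Because the alignment is split into blocks, the pieces belonging to each name have to be joined. A minimal sketch follows; the function name `parse_stockholm` is hypothetical, and the sketch skips every line starting with # and stops at a `//` line, which the Stockholm format uses as an end-of-alignment marker.

```python
def parse_stockholm(text):
    """Return a dict mapping each name to its concatenated sequence."""
    seqs = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank lines, the header and markup lines
        if line == "//":
            break  # end of the alignment
        # Names cannot contain spaces, so the first token is the name.
        name, chunk = line.split(None, 1)
        seqs[name] = seqs.get(name, "") + chunk.replace(" ", "")
    return seqs


example = "# STOCKHOLM 1.0\n\nseq1 ATCG\nseq2 TCAT\n\nseq1 ACCC\nseq2 AAAA\n//\n"
print(parse_stockholm(example))  # {'seq1': 'ATCGACCC', 'seq2': 'TCATAAAA'}
```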

NEXUS

Comments can be inserted using square brackets. Sequence names can contain spaces as long as they are between single or double quotes. The file must contain ntax (number of sequences) and nchar (alignment length) as specified below.

More information is available in this paper


#NEXUS

[That's a comment]

Begin taxa;
dimensions ntax=2;
taxlabels
'seq 1'
'seq 2'
;
end;

Begin characters;
dimensions nchar=8;
format datatype=dna gap=-;
matrix
'seq 1' ATCG AC-C
'seq 2' TCAT AAAA
;
end;

The following example is also valid:


#NEXUS

[That's a comment]

Begin data;
dimensions nchar=8 ntax=2;
format datatype=dna gap=-;
matrix
'seq 1' ATCG AC-C
'seq 2' TCAT AAAA
;
end;
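
Both layouts declare ntax and nchar in a dimensions statement, possibly in different blocks and in either order. The sketch below shows one way to collect them after removing comments. The function names are hypothetical, and the comment stripping assumes that comments are not nested and that no quoted name contains square brackets.

```python
import re


def strip_comments(text):
    """Remove [ ... ] comments (assumes they are not nested)."""
    return re.sub(r"\[[^\]]*\]", "", text)


def read_dimensions(text):
    """Collect ntax and nchar from every dimensions statement in the file."""
    dims = {}
    clean = strip_comments(text)
    for statement in re.findall(r"dimensions\s+([^;]*);", clean, flags=re.IGNORECASE):
        pairs = re.findall(r"(ntax|nchar)\s*=\s*(\d+)", statement, flags=re.IGNORECASE)
        for key, value in pairs:
            dims[key.lower()] = int(value)
    return dims


example = """#NEXUS
[That's a comment]
Begin data;
dimensions nchar=8 ntax=2;
format datatype=dna gap=-;
matrix
'seq 1' ATCG AC-C
'seq 2' TCAT AAAA
;
end;
"""
print(read_dimensions(example))  # {'nchar': 8, 'ntax': 2}
```
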
## Tree file formats

NEWICK

Sequence names can contain spaces as long as they are enclosed in single or double quotes. Sequence names cannot contain any of the characters ( ) : , [ ] and ; because they are part of the NEWICK syntax.

Example:


((taxon1:0.1,taxon2:0.2),taxon3:0.3);
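
To show why these characters are reserved, here is a minimal sketch of a recursive NEWICK reader. Parentheses open and close groups, commas separate siblings, colons introduce branch lengths, and the semicolon ends the tree. The function names and the dictionary layout are hypothetical, and the sketch deliberately ignores quoted names and bracketed comments.

```python
def parse_newick(text):
    """Parse a NEWICK string into nested dicts (name, length, children)."""
    text = text.strip()
    pos = 0

    def parse_node():
        nonlocal pos
        node = {"name": "", "length": None, "children": []}
        if text[pos] == "(":
            pos += 1  # consume "("
            while True:
                node["children"].append(parse_node())
                if text[pos] == ",":
                    pos += 1  # another sibling follows
                elif text[pos] == ")":
                    pos += 1  # end of this group
                    break
                else:
                    raise ValueError(f"unexpected {text[pos]!r} at position {pos}")
        # Optional name: everything up to the next reserved character.
        start = pos
        while text[pos] not in ":,();":
            pos += 1
        node["name"] = text[start:pos]
        # Optional branch length after ":".
        if text[pos] == ":":
            pos += 1
            start = pos
            while text[pos] not in ",();":
                pos += 1
            node["length"] = float(text[start:pos])
        return node

    root = parse_node()
    if text[pos] != ";":
        raise ValueError("tree must end with a semicolon")
    return root


def tip_lengths(node):
    """Return a dict mapping each tip name to its branch length."""
    if not node["children"]:
        return {node["name"]: node["length"]}
    tips = {}
    for child in node["children"]:
        tips.update(tip_lengths(child))
    return tips


tree = parse_newick("((taxon1:0.1,taxon2:0.2),taxon3:0.3);")
print(tip_lengths(tree))  # {'taxon1': 0.1, 'taxon2': 0.2, 'taxon3': 0.3}
```

Note that a decimal comma in a branch length (0,2 instead of 0.2) is read as a sibling separator, so it silently adds an extra tip instead of raising an error.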

NEXUS

Sequence names can contain spaces as long as they are enclosed in single or double quotes. Sequence names cannot contain any of the characters ( ) : , [ ] and ; because they are part of the NEWICK syntax. Comments can be inserted using square brackets.

More information is available in this paper


#NEXUS

[That's a comment]

Begin trees;
TREE tree1 = ((taxon1:[&rate=0.003]0.1,taxon2:[&rate=0.003]0.2),taxon3:[&rate=0.003]0.3);
end;

Another example using a translate block:


#NEXUS

[That's a comment]

Begin trees;
Translate
 1 taxon1,
 2 taxon2,
 3 taxon3
;
TREE tree1 = ((1:[&rate=0.003]0.1,2:[&rate=0.003]0.2),3:[&rate=0.003]0.3);
end;
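
A translate block maps short tokens to taxon names, so the tree string has to be expanded before it can be used as plain NEWICK. The sketch below shows one way to do this. The function names are hypothetical, the bracketed annotations are simply discarded, and translated names are assumed to contain no spaces (otherwise they would need quoting).

```python
import re


def parse_translate(body):
    """Build a token-to-name table from the body of a Translate statement."""
    table = {}
    for entry in body.split(","):
        if not entry.strip():
            continue
        token, name = entry.split(None, 1)
        table[token] = name.strip().strip("'\"")
    return table


def expand_tree(newick, table):
    """Drop [ ... ] annotations and replace tip tokens with taxon names."""
    newick = re.sub(r"\[[^\]]*\]", "", newick)
    # Tip labels directly follow "(" or ","; branch lengths follow ":".
    return re.sub(
        r"(?<=[(,])([^(),:;]+)",
        lambda m: table.get(m.group(1), m.group(1)),
        newick,
    )


table = parse_translate(" 1 taxon1,\n 2 taxon2,\n 3 taxon3\n")
tree = "((1:[&rate=0.003]0.1,2:[&rate=0.003]0.2),3:[&rate=0.003]0.3);"
print(expand_tree(tree, table))  # ((taxon1:0.1,taxon2:0.2),taxon3:0.3);
```
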