# rhmm
This is the documentation of rhmm, a Hidden Markov Model package. It currently supports Baum and Welch training, as well as Viterbi and Posterior decoding.
Install `rhmm.pl` on your PATH and compile `rhmmC.c`:

```
gcc rhmmC.c -o rhmmC -O3 -lm
```
The standard usage is explained here using the occasionally dishonest casino (ODC) example. The ODC is a standard problem where a casino uses two dice (a fair and an unfair one) and switches between them. Only the rolls (emissions) are observed; the purpose of the training and the subsequent decoding is to identify which die is responsible for each emission (the hidden state).
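For intuition, the ODC dynamics can be sketched in a few lines of Python. The probabilities below match the test model shipped with rhmm, but the helper itself is purely illustrative and not part of the package:

```python
import random

# Two dice (hidden states); 5% chance of switching at each roll.
# The unfair die emits a 6 half of the time.
TRANSITIONS = {"fair": {"fair": 0.95, "unfair": 0.05},
               "unfair": {"fair": 0.05, "unfair": 0.95}}
EMISSIONS = {"fair": [1 / 6] * 6,
             "unfair": [0.1, 0.1, 0.1, 0.1, 0.1, 0.5]}

def simulate(n, seed=0):
    """Return (hidden states, observed rolls) for n throws."""
    rng = random.Random(seed)
    state, states, rolls = "fair", [], []
    for _ in range(n):
        states.append(state)
        rolls.append(rng.choices(range(1, 7), weights=EMISSIONS[state])[0])
        # Possibly switch dice before the next throw.
        state = rng.choices(list(TRANSITIONS[state]),
                            weights=list(TRANSITIONS[state].values()))[0]
    return states, rolls
```

Training and decoding then amount to recovering `states` from `rolls` alone.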
Run the following command:

```
rhmm.pl -test
```
It will produce the following two files:

| File | Description |
| --- | --- |
| test.rhmm.model | A model for the ODC |
| test.rhmm.data | Simulated data from the model |
Note that the simulated data has a field named RST (Reference State): it indicates which die was used when running the model.
The following command will estimate a model having the highest probability given the data. The topology of the model remains unchanged: the procedure only re-estimates the transitions and the emissions. Emissions and transitions set to 0 in the seed model keep these values; the other values are re-estimated unless they are fixed (see the Format section for the models). The following command runs the Baum and Welch estimation on your data:
```
rhmm.pl -model test.rhmm.model -data test.rhmm.data -action bw -out trained -nrounds 10 -nit 1000 -evaluate viterbi
```
On this command line:

| Option | Value | Effect |
| --- | --- | --- |
| `-action` | bw | Do Baum and Welch estimation |
| `-nrounds` | 10 | Run 10 independent bw rounds starting from random models, keep the best |
| `-nit` | 1000 | Run up to 1000 iterations on each bw round |
| `-evaluate` | viterbi | Use the RST field (reference) to evaluate your Viterbi decoding |
| `-out` | trained | Prefix for output files |
This command will output the following files:

| File | Description |
| --- | --- |
| trained.rhmm.2_6.decoded | Data with Viterbi and Posterior decoding |
| trained.rhmm.2_6.model | The newly estimated model |
| trained.rhmm.2_6.constraints | A list of the hard-set constraints for the model |
The values `2_6` indicate that your model is made of 2 states with 6 emission bins each.
Note that although you are providing a reference model, the training procedure only uses the topology of this model (connectivity between states and number of emissions) and re-estimates every other parameter. If needed, it is possible to keep some of these values constant (see the input model section).
If your data is already decoded, that is to say, if each bin value is associated with a given state, you can estimate your model directly from the data with the following command:
```
rhmm.pl -data test.rhmm.data -model test.rhmm.model -action decode -evaluate viterbi -out estimated
```
In this case you can also evaluate your model. The output line

```
Data Pred : sc1=0.80 sc2=0.73
```

is an evaluation based on the RST field of your training data:
| Score | Meaning |
| --- | --- |
| sc1 | Correspondence between the RST states and the Viterbi-predicted states |
| sc2 | Minimum Sen, Sp, Sn2 value across all the states |
Assuming you have your own model and you want to decode available data, the following command will do the trick:
```
rhmm.pl -model test.rhmm.model -data test.rhmm.data -action decode -c -out decoded -nrounds 10
```
It will reformat your input data file and add the following fields:
```
#d;0;74;RST;ST::unfair;bin;2;bpost_score;-0.08947;posterior;ST::unfair;viterbi;ST::unfair;
```
| Field | Description |
| --- | --- |
| bpost_score | Best posterior decoding score |
| posterior | State yielding the best posterior decoding score |
| viterbi | Viterbi decoding |
Using your favourite model you can generate simulated data. The state used for emitting each bin is recorded in the `RST` field, while the decoding is in the `viterbi` and `posterior` fields. Generation can be achieved with the following command:
```
rhmm.pl -model test.rhmm.model -action model2data -out replicate -n 10 -len 100 -outdata simulated
```
| Option | Value | Effect |
| --- | --- | --- |
| `-n` | 10 | Generate 10 experiments |
| `-len` | 100 | Each experiment must contain 100 emissions |
Two types of data input formats are supported: comma separated values (csv) and FASTA:

| Option | Format | Usage |
| --- | --- | --- |
| `-data` | CSV files | `-data file1 file2 ...` |
| `-data` | FASTA files with header | `-data file1 file2 ...` |
| `-data2` | FASTA files w/o header | `-data2 file1 field1 file2 field2` |
| Option | Effect |
| --- | --- |
| `-out` | Sets the prefix name for all output files |
| `-output` | `fasta` or `csv` (default); sets the output format |
| `-outdata` | Specifies the output format for the decoded data |
By default, the output name of the data is:

```
out.rhmm.<nst>_<nem>.decoded
```
The standard format for the data is as follows:

```
#d;<exp name>;<rec unique id>;<field1>;<value1>;<field2>;<value2>;...
```
In practice, one can have as many fields as required and those not needed by the HMM will be left untouched. The following fields are needed by the HMM:
| Field | Description | Requirement |
| --- | --- | --- |
| bin | Defines the bin that emitted the observation | Must be provided as an input |
| RST | Defines the state that emitted the bin | Not compulsory; can be provided when using reference data |
| viterbi | Viterbi decoding; indicates which state may have produced the emission in the bin | |
| posterior | Posterior decoding; indicates the most likely emitting state according to posterior decoding | |
For instance, after decoding, a sample record looks like this:

```
#d;1;50;RST;ST::fair;bin;2;bpost_score;0.729904146308265;posterior;ST::unfair;viterbi;ST::unfair;
```
Note that the bin names can be strings, numbers or any combination of the two. The state names are also arbitrary but they must be of the form `ST::<statename>`, for instance `ST::unfair`.
Note that the output can be used as input, but the decoded states will be overwritten.
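As a sketch of how such records can be consumed downstream, the following Python snippet (illustrative, not part of rhmm) splits a `#d;` line into its experiment name, record id, and field/value pairs:

```python
def parse_record(line):
    """Parse one '#d;' record into (exp, rec_id, fields)."""
    parts = line.strip().strip(";").split(";")
    if parts[0] != "#d":
        raise ValueError("not a data record")
    exp, rec_id = parts[1], parts[2]
    # Remaining items alternate field name / field value.
    fields = dict(zip(parts[3::2], parts[4::2]))
    return exp, rec_id, fields
```

Fields not needed by the HMM simply end up as extra entries in the dictionary.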
The FASTA format requires that your states and your emissions can each be represented by a single symbol. Sample FASTA files can be produced with the following command line:
```
rhmm.pl -test -action convert -output fasta -outdata sample.fasta
```
In the file `sample.fasta`, each sequence corresponds to a field, and each symbol of the sequence corresponds to the value of that field in the record at the considered position. All sequences must have the same length:
```
>1 RST
fffffffffffffffffffufffffffffffffffuuuuuu
>1 bin
31422531454235621656326646532425266466664
```
Note that when converting your data into FASTA format, states will be renamed using single-symbol names; the renaming only takes place if the states/emissions use names longer than one character. Also note that using FASTA is only possible with fewer than 62 states and fewer than 62 emissions.
Models can be specified in two ways: either via the command line or via a file.
# Input
| Option | Effect |
| --- | --- |
| `-model` | Specifies the name of the file containing the model |
| `-nst` | Specifies the number of states for the model (with `-model topology`) |
| `-nem` | Specifies the number of emissions for the model (with `-model topology`) |
# Output
| Option | Effect |
| --- | --- |
| `-outmodel` | Name of the output model |
| `-outmodel_f` | `zeros` or `default`; print probabilities equal to zero in the output model |
## Specification via the Command Line

The simplest mode is via the command line. It creates a fully connected model with the same number of emissions for each state. The following command:
```
rhmm.pl -model topology -nst 4 -nem 5 -outmodel xx
```
will output a model with four fully connected states, each emitting 5 symbols with the same probability. If data is provided via `-data` or `-data2`, the emissions can be identified from the data and `-nem` can be omitted.
## Specification via a file
It is possible to describe complex models in a generic way using the following set of instructions:
| Instruction | Effect |
| --- | --- |
| `#set` | Explicit setting of an emission or a transition value |
| `#bset` | Batch setting of transition or emission values |
| `#graph` | Definition of a sub-graph made of combined states and emissions |
Note that emissions and states are distinguished by the `ST::` prefix used for naming the states.
This is the standard way of declaring a model. It involves declaring the probability of each state transition and each emission. The number of emissions must be the same for each state. Any undeclared emission or transition will be forbidden and will stay at 0 during training. For instance, the model for the ODC is as follows:
```
#comment;Reference_test
#set;ST::fair;ST::fair;0.9500;
#set;ST::fair;ST::unfair;0.0500;
#set;ST::unfair;ST::fair;0.0500;
#set;ST::unfair;ST::unfair;0.9500;
#set;ST::fair;1;0.1667;
#set;ST::fair;2;0.1667;
#set;ST::fair;3;0.1667;
#set;ST::fair;4;0.1667;
#set;ST::fair;5;0.1667;
#set;ST::fair;6;0.1667;
#set;ST::unfair;1;0.1000;
#set;ST::unfair;2;0.1000;
#set;ST::unfair;3;0.1000;
#set;ST::unfair;4;0.1000;
#set;ST::unfair;5;0.1000;
#set;ST::unfair;6;0.5000;
```
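A quick way to sanity-check such a file is to verify that, for every state, the transition probabilities and the emission probabilities each sum to about 1. The checker below is a sketch, not part of rhmm; it relies only on the `ST::` prefix convention described above:

```python
from collections import defaultdict

def check_model(lines, tol=1e-3):
    """Return the list of states whose outgoing transitions or
    emissions do not sum to ~1. Targets starting with ST:: are
    transitions; anything else is an emission bin."""
    trans, emit = defaultdict(float), defaultdict(float)
    for line in lines:
        if not line.startswith("#set;"):
            continue  # skip #comment and other instructions
        _, src, dest, prob = line.strip(";").split(";")[:4]
        if dest.startswith("ST::"):
            trans[src] += float(prob)
        else:
            emit[src] += float(prob)
    bad = [s for s in trans if abs(trans[s] - 1) > tol]
    bad += [s for s in emit if abs(emit[s] - 1) > tol]
    return bad
```

The tolerance absorbs rounding such as the fair die's 6 x 0.1667 = 1.0002.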
Note that state names must start with `ST::`.
### Hard-coded constraints

It is possible to make sure that some values remain unchanged when a model is trained. For instance, the following model (saved as `constrained.model`)
```
#comment;Reference_test
#set;ST::fair;ST::fair;0.9500;
#set;ST::fair;ST::unfair;0.0500;
#set;ST::unfair;ST::fair;0.0500;
#set;ST::unfair;ST::unfair;0.9500;
#set;ST::fair;1;0.1667;
#set;ST::fair;2;0.1667;
#set;ST::fair;3;0.1667;
#set;ST::fair;4;0.1667;
#set;ST::fair;5;0.1667;
#set;ST::fair;6;0.1667;
#set;ST::unfair;1;0.1000;
#set;ST::unfair;2;0.1000;
#set;ST::unfair;3;0.1000;
#set;ST::unfair;4;0.1000;
#set;ST::unfair;5;0.1000;
#set;ST::unfair;6;0.5000;fixed;
```
can be trained with the following command:

```
rhmm.pl -data test.rhmm.data -model constrained.model -action bw -c -nit 10000
```
The result is a slightly different model where the fixed value has remained unchanged:
```
#set;ST::fair;ST::fair;0.9532;
#set;ST::fair;ST::unfair;0.0468;
#set;ST::unfair;ST::fair;0.0405;
#set;ST::unfair;ST::unfair;0.9595;
#set;ST::fair;1;0.1687;
#set;ST::fair;2;0.1819;
#set;ST::fair;3;0.1630;
#set;ST::fair;4;0.1633;
#set;ST::fair;5;0.1665;
#set;ST::fair;6;0.1566;
#set;ST::unfair;1;0.0913;
#set;ST::unfair;2;0.0916;
#set;ST::unfair;3;0.1092;
#set;ST::unfair;4;0.1044;
#set;ST::unfair;5;0.1035;
#set;ST::unfair;6;0.5000;fixed;
```
Note that the constraint is applied when doing the Baum and Welch estimation. Each time a model is updated from decoding, the fixed values are set first and the remaining update is spread across the unfixed states (or emissions).
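The renormalisation rule described in this note can be sketched as follows. This is a minimal illustration of the idea, not the rhmm code, and the names are hypothetical:

```python
def apply_fixed(estimated, fixed):
    """Keep the fixed entries at their value and rescale the freshly
    estimated probabilities so the row still sums to 1."""
    free_mass = 1.0 - sum(fixed.values())       # mass left for unfixed entries
    free_total = sum(v for k, v in estimated.items() if k not in fixed)
    out = dict(fixed)
    for k, v in estimated.items():
        if k not in fixed:
            out[k] = v * free_mass / free_total  # spread remaining mass
    return out

# Unfair die: emission 6 is fixed at 0.5, the rest are re-estimated.
row = apply_fixed(
    {"1": 0.12, "2": 0.12, "3": 0.12, "4": 0.12, "5": 0.12, "6": 0.4},
    {"6": 0.5})
```

Here the five unfixed emissions are rescaled to share the remaining 0.5 of probability mass.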
The `#graph` instruction makes it possible to define models in a semi-compact way:

```
#graph;<graph_name>;<number of states>;<number of emissions>;<connect backward>;<connect self>;<connect forward>;<topology>;
```
The connection fields define the way the considered states are connected:
- Connect backward

| Value | Effect |
| --- | --- |
| nb | Connect only to the previous node |
| full | Connect to every backward node |
| none | No connection to any backward node |
- Connect self

| Value | Effect |
| --- | --- |
| full | Connect to itself |
| no | No self connection |
- Connect forward

| Value | Effect |
| --- | --- |
| nb | Connect to the next node |
| full | Connect to every forward node |
| none | No connection to any forward node |
- Topology

| Value | Effect |
| --- | --- |
| linear | Extremities remain unconnected (terminal states) |
| circular | Extremities are connected using the backward and forward rule |
For instance, the following line declares a forward-only circular graph made of 5 states with 2 emissions each:
```
#graph;test;5;2;no;no;nb;circular;
```
This graph can be extended into a full model by saving it into a file (`model.compact`) and by reformatting it:

```
rhmm.pl -model model.compact -action convert -outmodel expanded.model
```
The result is the following graph:
```
#set;ST::test#1;ST::test#2;1.0000;
#set;ST::test#2;ST::test#3;1.0000;
#set;ST::test#3;ST::test#4;1.0000;
#set;ST::test#4;ST::test#5;1.0000;
#set;ST::test#5;ST::test#1;1.0000;
#set;ST::test#1;1;0.5000;
#set;ST::test#1;2;0.5000;
#set;ST::test#2;1;0.5000;
#set;ST::test#2;2;0.5000;
#set;ST::test#3;1;0.5000;
#set;ST::test#3;2;0.5000;
#set;ST::test#4;1;0.5000;
#set;ST::test#4;2;0.5000;
#set;ST::test#5;1;0.5000;
#set;ST::test#5;2;0.5000;
```
States are automatically named:

```
ST::<graph_name>#<state index in graph, starting from 1>
```
Emissions are also automatically numbered starting from 1.
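For this particular topology (forward-only, circular), the expansion shown above can be reproduced with a short Python sketch. It mirrors the example output but is not the rhmm implementation:

```python
def expand_graph(name, nstates, nemissions):
    """Expand a forward-only circular #graph into #set lines:
    each state connects to the next, and the last wraps to the first.
    Emissions start uniform."""
    lines = []
    for i in range(1, nstates + 1):
        nxt = i % nstates + 1  # circular: last state connects back to 1
        lines.append(f"#set;ST::{name}#{i};ST::{name}#{nxt};1.0000;")
    p = 1.0 / nemissions
    for i in range(1, nstates + 1):
        for e in range(1, nemissions + 1):
            lines.append(f"#set;ST::{name}#{i};{e};{p:.4f};")
    return lines
```

Calling `expand_graph("test", 5, 2)` yields exactly the fifteen `#set` lines of the example.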
By default, the transition probabilities of the unspecified states are set to 0. When the model is read and trained, they will result in forbidden transitions. The emission probabilities are all set to equal values. These transitions can then be set using the `#set` instruction, or set in batch using the `#bset` instruction (see below).
Note that `#set` should be used to connect two sub-graphs.
### Batch Setting of Transition Values

The `#bset` lines can be used to set a series of transition and emission values in batch. These states and emissions must have been declared beforehand, either via the `#graph` or the `#set` instruction.
A `#bset` line has the form `#bset;<state1>;<state2>;<value>;<mode>;`, where State 1 can take the following values:

| Value | Effect |
| --- | --- |
| transitions | All starting transitions will be set to the value |
| regular exp | All transitions matching the considered regular expression |

State 2 can take:

| Value | Effect |
| --- | --- |
| transitions | All state1 -> X transitions will be set to the value |
| regular exp | All state1 -> regexp transitions |

The value can be any probability, and the mode can be:

| Value | Effect |
| --- | --- |
| fixed | For a fixed value |
| reverse | Sets everything to the value except st1 -> st2 transitions/emissions |
For instance, in the following model:

```
#graph;g1;3;5;no;no;nb;circular;
#graph;g2;4;5;nb;self;nb;circular
#set;ST::g1#1;ST::g2#4;1;fixed;
#bset;.*g1.*;[1-4];0.1;fixed;
```
- The first line declares g1, made of 3 states having 5 emissions each. The nodes are only linked to their right neighbour and the graph is circular.
- The second line declares the graph g2: 4 states with 5 emissions each. The nodes are linked to their left and right neighbours and they also have a self transition.
- The third line (`#set`) declares a link g1 -> g2.
- The fourth line (`#bset`) sets the emissions 1-4 of g1 to 0.1.
The resulting graph can be produced as follows:

```
rhmm.pl -model sample2.model -action convert -outmodel xx
```
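The batch-setting behaviour can be pictured with a small Python sketch. The exact matching rule rhmm applies is not specified here, so a full regex match on both the source and target names is assumed for illustration:

```python
import re

def bset(entries, src_pat, dst_pat, value):
    """Set every transition/emission whose source matches src_pat
    and whose target matches dst_pat (full regex match assumed)."""
    for (src, dst) in entries:
        if re.fullmatch(src_pat, src) and re.fullmatch(dst_pat, dst):
            entries[(src, dst)] = value
    return entries

# Emissions 1-5 of ST::g1#1, initially uniform:
emissions = {("ST::g1#1", str(e)): 0.2 for e in range(1, 6)}
bset(emissions, r".*g1.*", r"[1-4]", 0.1)  # emissions 1-4 become 0.1
```

Emission 5 is untouched because `[1-4]` does not match it.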
# Going Further

In practice, to produce a model, all you need is to reformat your data so that each element is binned. You can then define your model. When defining a model, it is possible to specify a topology (the connections between states) and to fix the value of some of the emissions. These models can be generated by the user, although some possibilities are given here.

# Using High Dimensionality HMMs

Two possibilities exist to increase the dimensionality of the HMM. One is to increase the number of emissions (for instance, bin each amino acid according to the di-amino acid it belongs to); the other is to increase the number of states.
# Credit

## Authors

This package was developed by Cedric Notredame, Centro de Regulacio Genomica, Barcelona.

## Contact

Please report bugs and requests to: [email protected]