merge_records

Biopiece: merge_records

Description

merge_records merges records in the stream based on two specified keys whose values are used as identifiers. The stream is split so that all records with identifier A are saved to one file and all records with identifier B to another. These files are then sorted on the A and B values and merged according to the chosen merging scheme, of which there are five:

AandB - only emit merged records (Default).
AorB - emit A records or merged records (i.e. all A records and merged records).
BorA - emit B records or merged records (i.e. all B records and merged records).
AnotB - emit A records that could not be merged with B.
BnotA - emit B records that could not be merged with A.
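
The split, sort and merge steps and the five schemes can be sketched in Python as a sort-merge join. This is only an illustration, not the Biopieces implementation: records are modelled as plain dicts, the function name merge_sketch is hypothetical, identifiers are compared as strings, and dropping the B key from merged records is an assumption taken from the example output further down this page.

```python
def merge_sketch(records, key_a, key_b, scheme="AandB"):
    """Illustrative sort-merge join over a list of dict records (hypothetical helper)."""
    # Split the stream into A records and B records, each sorted by identifier.
    a_recs = sorted((r for r in records if key_a in r), key=lambda r: r[key_a])
    b_recs = sorted((r for r in records if key_b in r), key=lambda r: r[key_b])

    out = []
    i = j = 0
    while i < len(a_recs) or j < len(b_recs):
        a = a_recs[i] if i < len(a_recs) else None
        b = b_recs[j] if j < len(b_recs) else None
        if b is None or (a is not None and a[key_a] < b[key_b]):
            # A record with no matching B record.
            if scheme in ("AorB", "AnotB"):
                out.append(a)
            i += 1
        elif a is None or b[key_b] < a[key_a]:
            # B record with no matching A record.
            if scheme in ("BorA", "BnotA"):
                out.append(b)
            j += 1
        else:
            # Identifiers match: combine both records into one.
            if scheme in ("AandB", "AorB", "BorA"):
                merged = dict(a)
                # Assumption: the B key itself is not carried over.
                merged.update((k, v) for k, v in b.items() if k != key_b)
                out.append(merged)
            i += 1
            j += 1
    return out


# The two tables used in the Examples section, as one stream of records.
records = [
    {"A": "2", "V1": "test1:2"},
    {"A": "3", "V1": "test1:3"},
    {"A": "4", "V1": "test1:4"},
    {"V0": "test2:1", "B": "1"},
    {"V0": "test2:2", "B": "2"},
    {"V0": "test2:3", "B": "3"},
]
only_merged = merge_sketch(records, "A", "B")           # default AandB
a_without_b = merge_sketch(records, "A", "B", "AnotB")  # unmatched A records
```

Because both lists are walked in lockstep, the sketch relies on each identifier occurring at most once per side.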

It is important that there are no duplicate identifier values - the behaviour is not guaranteed and your computer will probably explode.
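
One way to guard against duplicates is to check the identifier values before merging. The following is a minimal sketch, assuming the records have been loaded as Python dicts; the helper name duplicate_identifiers is hypothetical and not part of Biopieces.

```python
from collections import Counter


def duplicate_identifiers(records, key):
    """Return the identifier values that occur more than once under `key`."""
    counts = Counter(r[key] for r in records if key in r)
    return sorted(value for value, n in counts.items() if n > 1)


# Hypothetical stream in which the identifier "2" appears twice under key A.
stream = [
    {"A": "2", "V1": "test1:2"},
    {"A": "2", "V1": "test1:2b"},
    {"A": "3", "V1": "test1:3"},
    {"V0": "test2:2", "B": "2"},
]
dups_a = duplicate_identifiers(stream, "A")  # ["2"]
dups_b = duplicate_identifiers(stream, "B")  # []
```

An empty list for both keys means the stream satisfies the uniqueness requirement.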

It is important that there are no common keys in the records that are to be merged, because the values will be overwritten.
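
A key collision can be detected up front by comparing the key sets of the A records and the B records. This is a sketch under the assumption that records are plain Python dicts; colliding_keys is a hypothetical helper, and which side's value survives a collision in merge_records is not stated here.

```python
def colliding_keys(records, key_a, key_b):
    """Return keys that occur in both A records and B records."""
    a_keys, b_keys = set(), set()
    for record in records:
        if key_a in record:
            a_keys.update(record)
        elif key_b in record:
            b_keys.update(record)
    return sorted(a_keys & b_keys)


# Safe naming: the value columns are called V1 and V0.
safe = [{"A": "2", "V1": "test1:2"}, {"V0": "test2:2", "B": "2"}]
# Unsafe naming: both tables call their value column V.
unsafe = [{"A": "2", "V": "test1:2"}, {"V": "test2:2", "B": "2"}]

safe_collisions = colliding_keys(safe, "A", "B")      # []
unsafe_collisions = colliding_keys(unsafe, "A", "B")  # ["V"]
```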

Usage

... | merge_records [options]

Options

[-?          | --help]               #  Print full usage description.
[-k <list>   | --keys=<list>]        #  Keys (A and B) whose values are used for merging. Append n for numeric values.
[-m <string> | --merge=<string>]     #  Merge AandB, AorB, BorA, AnotB, or BnotA  -  Default=AandB
[-I <file!>  | --stream_in=<file!>]  #  Read input from stream file               -  Default=STDIN
[-O <file>   | --stream_out=<file>]  #  Write output to stream file               -  Default=STDOUT
[-v          | --verbose]            #  Verbose output.
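
Since merging relies on sorting the identifier values, the distinction between string and numeric values matters. The Python sketch below only illustrates the general difference between the two orderings; how merge_records sorts internally is an assumption and is not shown here.

```python
ids = ["10", "9", "2"]

# Sorted as strings, "10" comes before "2" because "1" < "2".
lexical = sorted(ids)           # ["10", "2", "9"]

# Sorted as numbers, the order follows the numeric values.
numeric = sorted(ids, key=int)  # ["2", "9", "10"]
```

A merge that walks two sorted lists in lockstep needs both lists ordered by the same rule, which is presumably why numeric identifiers have to be flagged.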

Examples

Consider the following two tables, stored in the files test1.tab and test2.tab:

cat test1.tab
2   test1:2
3   test1:3
4   test1:4

cat test2.tab
test2:1 1
test2:2 2
test2:3 3

We read in the first table from test1.tab using read_tab with the -k switch to name the first column A and the second column V1, remembering that it is important that there are no collisions between any column keys:

read_tab -i test1.tab -k A,V1  

A: 2
V1: test1:2
---
A: 3
V1: test1:3
---
A: 4
V1: test1:4
---

The resulting stream shows a number of table records with a key A and a key V1. Now we read in the next table with another round of read_tab, using the -k switch to name the first column V0 and the second column B, like this:

read_tab -i test1.tab -k A,V1 | read_tab -i test2.tab -k V0,B

A: 2
V1: test1:2
---
A: 3
V1: test1:3
---
A: 4
V1: test1:4
---
V0: test2:1
B: 1
---
V0: test2:2
B: 2
---
V0: test2:3
B: 3
---

Now we can use merge_records to merge the records on key A and key B using the default merge scheme AandB that outputs only merged records:

read_tab -i test1.tab -k A,V1 | read_tab -i test2.tab -k V0,B | merge_records -k A,B

A: 2
V0: test2:2
V1: test1:2
---
A: 3
V0: test2:3
V1: test1:3
---

If we change the merging scheme from AandB to AorB using the -m switch, then all A records and all merged records will be output:

read_tab -i test1.tab -k A,V1 | read_tab -i test2.tab -k V0,B | merge_records -k A,B -m AorB

A: 2
V0: test2:2
V1: test1:2
---
A: 3
V0: test2:3
V1: test1:3
---
A: 4
V1: test1:4
---

Similarly, if we change to BorA we get this:

read_tab -i test1.tab -k A,V1 | read_tab -i test2.tab -k V0,B | merge_records -k A,B -m BorA

V0: test2:1
B: 1
---
A: 2
V0: test2:2
V1: test1:2
---
A: 3
V0: test2:3
V1: test1:3
---

Finally, we can get all records that are in test1.tab but not in test2.tab by using AnotB with the -m switch:

read_tab -i test1.tab -k A,V1 | read_tab -i test2.tab -k V0,B | merge_records -k A,B -m AnotB

A: 4
V1: test1:4
---

Or using BnotA:

read_tab -i test1.tab -k A,V1 | read_tab -i test2.tab -k V0,B | merge_records -k A,B -m BnotA

V0: test2:1
B: 1
---

Author

[email protected]

July 2008

License

GNU General Public License version 2

http://www.gnu.org/copyleft/gpl.html

Help

merge_records is part of the Biopieces framework.

http://www.biopieces.org
