Chunking
"Chunk" is the name for a contiguous group of characters taken from standard input. teip divides standard input into multiple chunks.
Each chunk is either outside or inside a hole in the masking tape.
Only the chunks inside a hole are bypassed to the targeted command.
A chunk that does not match the pattern will be displayed on the standard output as it is. On the other hand, the matched chunk is passed to the standard input of a targeted command. After that, the matched chunk is replaced with the result of the targeted command.
In the next example, the standard input is divided into four chunks as follows.
$ echo ABC100EFG200 | teip -og '\d+' -- sed 's/.*/@@@/g'
ABC => Chunk(1)
100 => Chunk(2) -- Matched
EFG => Chunk(3)
200 => Chunk(4) -- Matched
By default, the matched chunks are joined with line breaks and used as the new standard input for the targeted command.
Imagine that teip executes the following command internally.
$ printf "100\n200\n" | sed 's/.*/@@@/g'
@@@ # => Result of Chunk(2)
@@@ # => Result of Chunk(4)
(This is not technically accurate, but it shows why $1 is used, not $3, in one of the examples in "Getting Started".)
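The point about $1 can be sketched with standard tools only. In the snippet below, matched_chunks is a hypothetical helper, not part of teip: it uses grep -oE (assumed to be GNU or BSD grep) as a stand-in for the matching step, with [0-9]+ in place of \d+. The awk program plays the role of the targeted command and sees one matched chunk per line, so each chunk is field $1 of its own line.

```shell
# Hypothetical stand-in for teip's matching step (assumes grep supports -o).
matched_chunks() { grep -oE '[0-9]+'; }

# The targeted command receives only the matched chunks, one per line.
# Each chunk is therefore field $1, whatever its position in the original input.
printf 'ABC100EFG200\n' | matched_chunks | awk '{ print NR ": " $1 * 2 }'
# 1: 200
# 2: 400
```

This only models what the targeted command receives; it does not put the results back into the original line.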
After that, each matched chunk is replaced with the corresponding line of the result.
ABC => Chunk(1)
@@@ => Chunk(2) -- Replaced
EFG => Chunk(3)
@@@ => Chunk(4) -- Replaced
Finally, all the chunks are concatenated and the following result is printed.
ABC@@@EFG@@@
In practice, the above process is performed asynchronously, and chunks are printed sequentially as they become available.
Let me show you another example.
The following is an example of selecting a number and replacing it with the string "@@@".
However, the output contains too many @ characters: 100 is replaced with "@@@@@@@@@".
$ echo ABC100EFG200 | teip -og '\d' -- sed 's/.*/@@@/g'
ABC@@@@@@@@@EFG@@@@@@@@@
So many @ are printed in the example above because the input is broken up into many chunks, as the following output shows.
$ echo ABC100EFG200 | teip -og '\d'
ABC[1][0][0]EFG[2][0][0]
teip recognizes input matched by the entire regular expression as a single chunk. \d matches a single digit, which results in many chunks.
ABC => Chunk(1)
1 => Chunk(2) -- Matched
0 => Chunk(3) -- Matched
0 => Chunk(4) -- Matched
EFG => Chunk(5)
2 => Chunk(6) -- Matched
0 => Chunk(7) -- Matched
0 => Chunk(8) -- Matched
Therefore, sed receives many newline-separated lines, one for each digit.
$ printf "1\n0\n0\n2\n0\n0\n" | sed 's/.*/@@@/g'
@@@ # => Result of Chunk(2)
@@@ # => Result of Chunk(3)
@@@ # => Result of Chunk(4)
@@@ # => Result of Chunk(6)
@@@ # => Result of Chunk(7)
@@@ # => Result of Chunk(8)
The chunks in their final form are as follows.
ABC => Chunk(1)
@@@ => Chunk(2) -- Replaced
@@@ => Chunk(3) -- Replaced
@@@ => Chunk(4) -- Replaced
EFG => Chunk(5)
@@@ => Chunk(6) -- Replaced
@@@ => Chunk(7) -- Replaced
@@@ => Chunk(8) -- Replaced
And, here is the final result.
ABC@@@@@@@@@EFG@@@@@@@@@
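The difference between the two patterns can be reproduced without teip. The following is a minimal sketch of the chunking model, not of how teip is implemented: emulate_chunking is a hypothetical helper, grep -oE (assumed GNU or BSD grep) stands in for the matching step, the targeted command is hard-coded to sed 's/.*/@@@/g', and POSIX character classes replace \d.

```shell
# Sketch only: emulates the chunking model for a single input line.
# The pattern must be a POSIX ERE that never matches the empty string.
emulate_chunking() {
  line=$1
  pattern=$2

  # Matched chunks, one per line, are fed to the targeted command.
  results=$(printf '%s\n' "$line" | grep -oE "$pattern" | sed 's/.*/@@@/g')

  # Walk the line again and replace each match with the next result line.
  printf '%s\n' "$line" | RESULTS=$results awk -v pat="$pattern" '
    BEGIN { split(ENVIRON["RESULTS"], result, "\n") }
    {
      rest = $0; out = ""; i = 0
      while (match(rest, pat)) {
        out = out substr(rest, 1, RSTART - 1) result[++i]
        rest = substr(rest, RSTART + RLENGTH)
      }
      print out rest
    }'
}

emulate_chunking 'ABC100EFG200' '[0-9]+'   # ABC@@@EFG@@@
emulate_chunking 'ABC100EFG200' '[0-9]'    # ABC@@@@@@@@@EFG@@@@@@@@@
```

With [0-9]+ each number is one chunk and yields one result line; with [0-9] each digit is its own chunk, so every digit is replaced separately.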
The concept of chunking is also used for other options.
For example, if you use -f to specify a range of A-B, each field will be a separate chunk.
Also, the field delimiter is always an unmatched chunk.
$ echo "AA,BB,CC" | teip -f 2-3 -d,
AA,[BB],[CC]
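The same bracketed output can be imitated with awk to make the field rule concrete. This is a sketch under assumptions: bracket_fields is a hypothetical helper, the field range 2-3 and the comma delimiter are hard-coded, and it says nothing about how teip itself splits fields.

```shell
# Sketch only: wrap fields 2 and 3 individually, leaving the delimiters outside.
bracket_fields() {
  awk -F, -v OFS=, '{ for (i = 2; i <= 3; i++) $i = "[" $i "]"; print }'
}

printf 'AA,BB,CC\n' | bracket_fields
# AA,[BB],[CC]
```

Each selected field gets its own pair of brackets, and the comma between BB and CC stays unbracketed, which mirrors the delimiter being an unmatched chunk.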
With the -c option, adjacent characters are treated as the same chunk even if they are listed as separate entries divided by , in the character list.
$ echo "ABCDEFGHI" | teip -c1,2,3,7-9
[ABC]DEF[GHI]
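The merging of adjacent characters can be sketched in awk as well. bracket_chars below is a hypothetical helper that accepts only a comma-separated list of single 1-based positions (ranges such as 7-9 are not parsed in this sketch, so they are spelled out), and it groups consecutive selected positions into one bracketed run.

```shell
# Sketch only: bracket runs of selected character positions.
# $1 is a comma-separated list of 1-based positions, without ranges.
bracket_chars() {
  awk -v list="$1" '
    BEGIN { n = split(list, p, ","); for (i = 1; i <= n; i++) sel[p[i]] = 1 }
    {
      out = ""; inside = 0
      for (i = 1; i <= length($0); i++) {
        if ((i in sel) && !inside) { out = out "["; inside = 1 }   # a run starts
        if (!(i in sel) && inside) { out = out "]"; inside = 0 }   # a run ends
        out = out substr($0, i, 1)
      }
      if (inside) out = out "]"
      print out
    }'
}

printf 'ABCDEFGHI\n' | bracket_chars 1,2,3,7,8,9
# [ABC]DEF[GHI]
```

Positions 1, 2 and 3 are listed separately but form one run, so they end up in a single chunk, just as in the teip output above.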