Weirdly specific failure to parse at exact position based on what's before it, from pipe() input but not from path #1433
If you replace a character before the newline with a Unicode character, then delete another ASCII character, it also has the problem. I think maybe this is related to ASCII characters being one byte while Unicode characters can be more than one byte? Does that seem like a useful model to test? Is there a buffer that's off by one, something in `pipe`?
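That byte-width model can be checked from the shell (a sketch, not part of the original report; the `printf` strings are made up). ASCII characters occupy one byte each in UTF-8, while non-ASCII characters occupy two to four, so swapping one for the other shifts every subsequent byte offset in the file:

```shell
# ASCII "a" is one byte; "é" (U+00E9) is two bytes in UTF-8, so replacing
# an ASCII character with it shifts every later byte offset by one.
printf 'a' | wc -c     # byte count: 1
printf 'é' | wc -c     # byte count: 2
printf 'é' | wc -m     # character count (1 under a UTF-8 locale)
```

This is consistent with the observation above: edits that change the byte count move the suspect newline to a different buffer offset, while same-width ASCII replacements leave it in place.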
I'm able to reproduce the problem you are seeing, where rows starting at 6834 are being parsed incorrectly, but only when using `pipe()`. Reading directly from the path parses these rows correctly:

read_tsv("weird_subset_woheader.txt",
col_names = FALSE,
show_col_types = FALSE
)[6834:6837, ]
#> Warning: One or more parsing issues, call `problems()` on your data frame for details,
#> e.g.:
#> dat <- vroom(...)
#> problems(dat)
#> # A tibble: 4 × 22
#> X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 X11 X12 X13
#> <chr> <dbl> <chr> <dbl> <dbl> <chr> <chr> <dbl> <dbl> <chr> <chr> <chr> <chr>
#> 1 samp… 0 pool… 29 60 211M… * 0 0 AATT… * NM:i… ms:i…
#> 2 samp… 0 pool… 29 60 244M * 0 0 GTCC… * NM:i… ms:i…
#> 3 samp… 0 pool… 29 60 244M * 0 0 ACCA… * NM:i… ms:i…
#> 4 samp… 4 * 0 0 * * 0 0 ATCG… * rl:i… <NA>
#> # … with 9 more variables: X14 <chr>, X15 <chr>, X16 <chr>, X17 <chr>,
#> # X18 <chr>, X19 <chr>, X20 <chr>, X21 <chr>, X22 <chr>
#> # ℹ Use `colnames()` to see all variable names
# odd behavior when using pipe and cat where data is stuffed into incorrect columns
read_tsv(pipe("cat weird_subset_woheader.txt"),
col_names = FALSE,
show_col_types = FALSE
)[6834:6837, ]
#> Warning: One or more parsing issues, call `problems()` on your data frame for details,
#> e.g.:
#> dat <- vroom(...)
#> problems(dat)
#> # A tibble: 4 × 22
#> X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 X11 X12 X13
#> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 * NM:i… ms:i… AS:i… nn:i… tp:A… cm:i… s1:i… s2:i… de:f… cs:Z… rl:i… <NA>
#> 2 * NM:i… ms:i… AS:i… nn:i… tp:A… cm:i… s1:i… s2:i… de:f… cs:Z… rl:i… <NA>
#> 3 * NM:i… ms:i… AS:i… nn:i… tp:A… cm:i… s1:i… s2:i… de:f… cs:Z… rl:i… <NA>
#> 4 * rl:i… <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
#> # … with 9 more variables: X14 <chr>, X15 <chr>, X16 <chr>, X17 <chr>,
#> # X18 <chr>, X19 <chr>, X20 <chr>, X21 <chr>, X22 <chr>
#> # ℹ Use `colnames()` to see all variable names

But I can also see that there are parsing problems, which makes it difficult to discern what the cause is:

problems(read_tsv(pipe("cat weird_subset_woheader.txt"),
num_threads = 1,
col_names = FALSE,
show_col_types = FALSE
))
#> # A tibble: 1,085 × 5
#> row col expected actual file
#> <int> <int> <chr> <chr> <chr>
#> 1 34 12 22 columns 12 columns ""
#> 2 36 12 22 columns 12 columns ""
#> 3 73 12 22 columns 12 columns ""
#> 4 79 12 22 columns 12 columns ""
#> 5 86 12 22 columns 12 columns ""
#> 6 88 12 22 columns 12 columns ""
#> 7 90 23 22 columns 23 columns ""
#> 8 91 23 22 columns 23 columns ""
#> 9 95 12 22 columns 12 columns ""
#> 10 98 12 22 columns 12 columns ""
#> # … with 1,075 more rows
#> # ℹ Use `print(n = ...)` to see more rows

It looks like you already knew this. But I did want to point out that in your example you're losing some information, since the 23rd column is not being parsed. As you noted, this is a situation where rows have varying numbers of columns:

# rows 90 and 91 have 23 columns per row
# read_tsv does not generate a 23rd column
read_tsv(pipe("cat weird_subset_woheader.txt"),
show_col_types = FALSE,
col_names = paste0("V", 1:23)
)[90:93, ]
#> Warning: One or more parsing issues, call `problems()` on your data frame for details,
#> e.g.:
#> dat <- vroom(...)
#> problems(dat)
#> # A tibble: 4 × 22
#> V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13
#> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 samp… 0 pool… 26 60 247S… * 0 0 ACTA… * NM:i… ms:i…
#> 2 samp… 2048 pool… 29 60 250M… * 0 0 ACTA… * NM:i… ms:i…
#> 3 samp… 0 pool… 29 60 244M * 0 0 ACAA… * NM:i… ms:i…
#> 4 samp… 0 pool… 29 60 244M * 0 0 AGAT… * NM:i… ms:i…
#> # … with 9 more variables: V14 <chr>, V15 <chr>, V16 <chr>, V17 <chr>,
#> # V18 <chr>, V19 <chr>, V20 <chr>, V21 <chr>, V22 <chr>
#> # ℹ Use `colnames()` to see all variable names
# read.table handles this better
# if you supply column names
read.table(pipe("cat weird_subset_woheader.txt"),
header = F, fill = T,
col.names = paste0("V", 1:23)
)[90:93, ]
#> V1 V2 V3 V4 V5
#> 90 sample21_A_F3_TCTAAAATCGCAAAAAACTTTACGCA_13to15 0 poolFour_343 26 60
#> 91 sample21_A_F3_TCTAAAATCGCAAAAAACTTTACGCA_13to15 2048 poolOne_99 29 60
#> 92 sample29_A_G12_CATGCAAATTAGAAGACCTTTAGGAA_20to21 0 poolNine_807 29 60
#> 93 sample8_A_O15_CAAGGAACACTAAAATAGGTTAGTAC_11to11 0 poolSeven_655 29 60
#> V6 V7 V8 V9
#> 90 247S247M * 0 0
#> 91 250M244H * 0 0
#> 92 244M * 0 0
#> 93 244M * 0 0
#> V10
#> 90 ACTAATGATCCTTTGGATTTCTGGTGTATTGGTTTTAATGTCTCCTTTTCTGTCTTTGATTTTGTTTATTTGGGTCTTCTCTCGTTTTTCTTAGTCTGGCTAAAGGCTTGTCAATTTTATTTATCTTTTCAAAAATCCAACTTTTTGTTTTGTTGATTCTATTATTTTCTTCACTTCAGTTGCATTTATTTCTGCTCTTATCTTTATTATTTCTTCTGCTAATCTTGAGTTTAGTTTCTCTTGCGCGGCCAAATCCTGAATATCTTTGTTAATTTTCTGTCTCAATGATTTGTATAATATTGACAGTGGGGAGTTAAAGTCTCCCACTATTATTGTGTAGGAGTCGAAGTCTCTTTGTAGGTCTCTAAGAACTTGTTTTATGAATCTGAGTGCTCCTGTATTGGGTACATGTACATTTAGGATAGTTAGCTCTCCTTGTTGAATTGAACCCTTTACCATTACGTAATGCCCTTCTTTGTCTTTTTTAATCTTTG
#> 91 ACTAATGATCCTTTGGATTTCTGGTGTATTGGTTTTAATGTCTCCTTTTCTGTCTTTGATTTTGTTTATTTGGGTCTTCTCTCGTTTTTCTTAGTCTGGCTAAAGGCTTGTCAATTTTATTTATCTTTTCAAAAATCCAACTTTTTGTTTTGTTGATTCTATTATTTTCTTCACTTCAGTTGCATTTATTTCTGCTCTTATCTTTATTATTTCTTCTGCTAATCTTGAGTTTAGTTTCTCTTGCGCGGCC
#> 92 ACAATATTAAGTCTTTCAACCTGTGAACATGGGATGTCTTTCCTTTTATTTGTATCTGCTTTAATTTCCTTCATCAAGGTTTTTTAGTTTCCAGTGTACAAGTCTTACACTTTCTTAAGTTTATTCCTATTTTATTATTTTTAATCCTATTGTAAATGGGATTCTTATGTCCTTTTTGTATAGTTTATTTTTAGTATATAGAAATGTCACTGATTTTTGTATGTTTTGTATGCTGCAACTTAAT
#> 93 AGATAAATTACATTCATGAAAGAAGCATATTATTTTTAAAGTACTTTATTTTGGAAAGGTAAAATGCTTGTGTAGTTATAATTTGGTTACTCTTGATTTCACCTTAGGAAAAACAATATCACCTTCTAACCATTTCTTTTTTAGTCAAATCTCTTGCTTCTATTTCTCTCTGTAGATCCGCTATTAAAGACTGTAATCACTGCTGCATCTTTCCTGTAAGGCTTGATCGCATTGTTAATTTCTT
#> V11 V12 V13 V14 V15 V16 V17 V18 V19
#> 90 * NM:i:1 ms:i:227 AS:i:227 nn:i:0 tp:A:P cm:i:20 s1:i:215 s2:i:0
#> 91 * NM:i:2 ms:i:210 AS:i:210 nn:i:0 tp:A:P cm:i:20 s1:i:226 s2:i:0
#> 92 * NM:i:0 ms:i:244 AS:i:244 nn:i:0 tp:A:P cm:i:21 s1:i:232 s2:i:0
#> 93 * NM:i:0 ms:i:244 AS:i:244 nn:i:0 tp:A:P cm:i:20 s1:i:223 s2:i:0
#> V20 V21 V22 V23
#> 90 de:f:0.0040 SA:Z:poolOne_99,29,+,250M244S,60,2; cs:Z::42*ct:204 rl:i:0
#> 91 de:f:0.0080 SA:Z:poolFour_343,26,+,247S247M,60,1; cs:Z::157*ct*tc:91 rl:i:0
#> 92 de:f:0 cs:Z::244 rl:i:0
#> 93 de:f:0 cs:Z::244 rl:i:0

Here is a related issue thread that might also help: #762.
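One more data point that could be collected (a hypothetical check, not run in the thread; `sample.txt` stands in for weird_subset_woheader.txt): compare the bytes delivered through the pipe against the file on disk. If they match, the discrepancy must be introduced by the reader's buffering, not by `cat`:

```shell
# sample.txt is a small stand-in for the real file.
printf 'a\tb\nc\td\n' > sample.txt

# cmp exits 0 (and prints nothing) when stdin matches the file exactly,
# so any parsing difference arises after the bytes arrive.
cat sample.txt | cmp - sample.txt && echo "pipe delivers identical bytes"
```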
Thanks for filing this bug report! Unfortunately, because it requires such a specific setup to reproduce, we believe it's unlikely to affect many people, and we don't have the development resources to fix it at this time. It's our policy to close such issues to help stay focused on the biggest problems, but the issue is still indexed by Google, so if other people hit it they'll be able to find it, and we can consider reopening it if it turns out to be a common problem. Thanks for reporting, and I'm sorry we couldn't help more 😞.
tl;dr: Parsing seems to fail at a newline on line 6833 of this exact file when it is input using `pipe`, but works whenever I insert, remove, or replace any character before the newline on line 6833 - EXCEPT if the replacement is an ASCII character! This is very weirdly specific, so I thought that maybe it'd be useful as an edge edge edge case...
Hope I'm in the right place.
Background
This is a SAM-format file output by minimap2 (as exactly specified on line 1101): weird_subset_woheader.txt
I have removed the header (lines usually start with "@"). Note that different lines have different numbers of tab-delimited columns!
I got some weird results with this file in my analysis pipeline, so I've spent the past few hours trying to figure out where.
For later: `file -bi weird_subset_woheader.txt` returns `text/plain; charset=us-ascii`.
A one-liner to experiment with
I've narrowed it down to this minimal example if you want it on one line:
Rscript --vanilla -e 'readr::read_tsv(pipe("cat weird_subset_woheader.txt"),col_names=F)[6830:6840,]'
What I expect vs what I get
That should give me a nice table, and every first field should start with "sample". However, it does a thing where it seems to (speculating here) escape the newline from field 12, fill field 13 with NA, then eat the first character of the subsequent line and then parse it along until it hits the next (defective?) newline.
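The "eat the first character" behavior described above is what a one-byte offset error looks like in a tab-delimited stream. A toy illustration (the file name and data are made up, not from the real file):

```shell
# Two well-formed tab-delimited rows.
printf 'sample1\tA\tB\nsample2\tC\tD\n' > rows.txt

# Correct parse: the first field of each row.
cut -f1 rows.txt                 # prints: sample1, sample2

# Drop a single byte from the stream: the damaged row's first field
# loses its leading character, much like the "eaten" character above.
tail -c +2 rows.txt | cut -f1    # prints: ample1, sample2
```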
For an example of a "good" output, see the "reprex" block below for an example using `read.table`.
What doesn't "fix" it
What does "fix" it
- `pipe` the "cat" output into `read.table`, with arguments `fill=T` and `header=F`; then it runs fine (no first column values are exactly "*", all good).
- Not using `pipe`. The failure requires `pipe`!
I'm good for my pipeline, easy fix to work around this. But I'd like to understand what is going on here, or how I can better dig into this specific point in the file to learn more, see if there's better ways of debugging this and making sure I don't run into it again.
Is this weird or did I miss something?
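One way to dig into the exact bytes around line 6833 (a sketch; `demo.txt` and the line range 2,3 stand in for the real file and 6830,6840) is to slice out the suspect window of lines with `sed` and dump it with `od -c`, which renders tabs, newlines, and any stray bytes explicitly:

```shell
# demo.txt is a tiny stand-in for weird_subset_woheader.txt.
printf 'alpha\tbeta\ngamma\tdelta\nepsilon\tzeta\n' > demo.txt

# Extract the suspect window of lines...
sed -n '2,3p' demo.txt

# ...and dump its raw bytes, so \t and \n (or anything odd) are visible.
sed -n '2,3p' demo.txt | od -c
```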
reprex and version info
Created on 2022-09-08 with reprex v2.0.2
R `--version` is 4.2.1.
readr info from `installed.packages()`: