Weirdly specific failure to parse at exact position based on what's before it, from pipe() input but not from path #1433

Closed
darachm opened this issue Sep 8, 2022 · 3 comments
Labels
bug an unexpected problem or unintended behavior

Comments

darachm commented Sep 8, 2022

tl;dr: Parsing seems to fail at the newline on line 6833 of this exact file when it is read through pipe(), but works whenever I insert, remove, or replace any character before the newline on line 6833, EXCEPT if the replacement is an ASCII character!

This is very weirdly specific, so I thought that maybe it'd be useful as an edge edge edge case....

Hope I'm in the right place.


Background

This is a SAM format file output by minimap2 (as exactly specified on line 1101): weird_subset_woheader.txt
I have removed the header (usually starts with "@"). There are variable numbers of tab-delimited columns! Different lines have different numbers of columns!

I got some weird results with this file in my analysis pipeline, so I've spent the past few hours trying to figure out where.

For later, file -bi weird_subset_woheader.txt returns text/plain; charset=us-ascii.

A one-liner to experiment with

I've narrowed it down to this minimal example if you want it on one line:

Rscript --vanilla -e 'readr::read_tsv(pipe("cat weird_subset_woheader.txt"),col_names=F)[6830:6840,]'

What I expect vs what I get

That should give me a nice table, and every first field should start with "sample". However, it does a thing where it seems to (speculating here) escape the newline from field 12, fill field 13 with NA, then eat the first character of the subsequent line and then parse it along until it hits the next (defective?) newline.

For an example of a "good" output, see the read.table call in the "reprex" block below.

What doesn't "fix" it

  • If I change the file in any way past the newline on line 6833, or if I change the file in any way in the "@"-denoted header, including if I insert a Unicode character past this point.

What does "fix" it.....

  • If I use the same code to pipe the "cat" output into read.table, with arguments to fill=T and header=F, then it runs fine (no first column values are exactly "*", all good).
  • If I provide the file as a path, i.e. just put the path name in the first position instead of wrapping it in pipe(). The failure requires pipe()!
  • If I remove line 6833
  • If I remove, insert, or replace any character before this newline with a Unicode character using copy-and-paste into vim insert/replace mode.
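
Since reading from a path works while reading through pipe() does not, one possible way to keep the shell command but still hand the reader a path is to spool the command's output into a temporary file first. The sketch below uses only base R; `spool_to_tempfile` is a hypothetical helper name, not part of readr, and the demo file stands in for weird_subset_woheader.txt.

```r
# Hypothetical helper (not part of readr): copy everything a shell command
# prints into a temporary file and return that file's path.
spool_to_tempfile <- function(cmd) {
  tmp <- tempfile(fileext = ".tsv")
  con <- pipe(cmd, open = "rb")
  on.exit(close(con))
  out <- file(tmp, open = "wb")
  on.exit(close(out), add = TRUE)
  repeat {
    chunk <- readBin(con, what = "raw", n = 65536L)
    if (length(chunk) == 0L) break   # nothing left to read
    writeBin(chunk, out)
  }
  tmp
}

# Small stand-in file so the sketch runs on its own.
src <- tempfile(fileext = ".tsv")
writeLines(c("sample1\t0\tpoolOne_1", "sample2\t4\t*"), src)

path <- spool_to_tempfile(paste("cat", shQuote(src)))
spooled <- readLines(path)
```

The returned path can then be passed to the reader in place of the pipe() connection, e.g. `readr::read_tsv(path, col_names = FALSE)`.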

What I want

I'm good for my pipeline; it's an easy fix to work around this. But I'd like to understand what is going on here, or how I can better dig into this specific point in the file to learn more, and see if there are better ways of debugging this and making sure I don't run into it again.
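
One way to dig into a specific point in the file is to look at the raw bytes around the newline that ends a given line, and at that newline's byte offset. The following is a minimal base-R sketch; `bytes_around_line_end` is a hypothetical helper name, and the tiny demo file only stands in for the real one.

```r
# Hypothetical helper: return the byte offset of the newline ending line `n`
# and the raw bytes in a small window around it.
bytes_around_line_end <- function(path, n, context = 8L) {
  bytes <- readBin(path, what = "raw", n = file.size(path))
  nl <- which(bytes == as.raw(0x0a))   # 1-based positions of every "\n"
  stopifnot(n <= length(nl))
  end <- nl[n]
  window <- seq(max(1L, end - context), min(length(bytes), end + context))
  list(offset = end, bytes = bytes[window])
}

# Stand-in file: two short lines.
demo <- tempfile()
writeBin(charToRaw("ab\ncd\n"), demo)

res <- bytes_around_line_end(demo, n = 1, context = 1L)
res$offset            # bytes up to and including the first newline
rawToChar(res$bytes)  # the characters on either side of it
```

For the real file, `bytes_around_line_end("weird_subset_woheader.txt", 6833)` would show whether anything other than a plain line feed (0a) sits at the end of that line, and `offset` is the same count that `head -n 6833 file | wc -c` reports.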

Is this weird or did I miss something?

reprex and version info

readr::read_tsv(pipe("cat weird_subset_woheader.txt"),col_names=F)[6830:6840,]
#> Warning: One or more parsing issues, see `problems()` for details
#> Rows: 6898 Columns: 22
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: "\t"
#> chr (22): X1, X2, X3, X4, X5, X6, X7, X8, X9, X10, X11, X12, X13, X14, X15, ...
#> 
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> # A tibble: 11 × 22
#>    X1    X2    X3    X4    X5    X6    X7    X8    X9    X10   X11   X12   X13  
#>    <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#>  1 samp… 0     pool… 29    60    244M  *     0     0     ACCA… *     NM:i… ms:i…
#>  2 samp… 0     pool… 29    60    244M  *     0     0     AGAT… *     NM:i… ms:i…
#>  3 samp… 0     pool… 29    60    244M  *     0     0     AAAT… *     NM:i… ms:i…
#>  4 samp… 4     *     0     0     *     *     0     0     ACAA… *     rl:i… <NA> 
#>  5 *     NM:i… ms:i… AS:i… nn:i… tp:A… cm:i… s1:i… s2:i… de:f… cs:Z… rl:i… <NA> 
#>  6 *     NM:i… ms:i… AS:i… nn:i… tp:A… cm:i… s1:i… s2:i… de:f… cs:Z… rl:i… <NA> 
#>  7 *     NM:i… ms:i… AS:i… nn:i… tp:A… cm:i… s1:i… s2:i… de:f… cs:Z… rl:i… <NA> 
#>  8 *     rl:i… <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA> 
#>  9 *     NM:i… ms:i… AS:i… nn:i… tp:A… cm:i… s1:i… s2:i… de:f… cs:Z… rl:i… <NA> 
#> 10 *     NM:i… ms:i… AS:i… nn:i… tp:A… cm:i… s1:i… s2:i… de:f… cs:Z… rl:i… <NA> 
#> 11 *     rl:i… <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA> 
#> # … with 9 more variables: X14 <chr>, X15 <chr>, X16 <chr>, X17 <chr>,
#> #   X18 <chr>, X19 <chr>, X20 <chr>, X21 <chr>, X22 <chr>

readr::problems()

read.table(pipe("cat weird_subset_woheader.txt"),header=F,fill=T)[6830:6840,]
#>                                                    V1 V2            V3 V4 V5
#> 6830    sample3_B_K17_ACAATAAGTGGTAAGAAGGTTCATCC_6to6  0    poolOne_45 29 60
#> 6831  sample25_A_L8_CGATTAACATTTAAGAACTTTTAAAT_19to18  0  poolFour_330 29 60
#> 6832  sample6_B_A14_CAAGTAACCTAAAATGATATTATGTC_14to16  0    poolOne_70 29 60
#> 6833   sample19_A_E23_AGAGGAAATGGTAATGTGCTTGCGAA_7to9  0   poolTwo_130 29 60
#> 6834    sample8_A_H19_CGCGTAATCCACAAAAATCTTTCCGC_7to8  0    poolOne_99 29 60
#> 6835    sample5_A_P13_ATAACAATCGAGAAAAAGGTTAAGGG_8to8  0  poolFour_343 29 60
#> 6836 sample15_A_A16_ACCGCAAACGGTAAATGTATTCTAAG_14to14  0 poolEight_730 29 60
#> 6837  sample22_B_A8_AACGGAACTTTTAATCTCCTTACCAA_27to35  4             *  0  0
#> 6838  sample21_B_N2_ATCCAAAAAAATAATGAATTTGTTAC_21to26  0  poolNine_806 29 60
#> 6839 sample21_B_G15_CGCCAAAACACGAATCCGGTTCCCGA_18to23  0   poolTen_911 29 60
#> 6840  sample30_B_G9_TCCCTAAGGAGAAAAGTGTTTGGCCC_17to15  0  poolNine_889 29 60
#>        V6 V7 V8 V9
#> 6830 244M  *  0  0
#> 6831 244M  *  0  0
#> 6832 244M  *  0  0
#> 6833 244M  *  0  0
#> 6834 244M  *  0  0
#> 6835 244M  *  0  0
#> 6836 244M  *  0  0
#> 6837    *  *  0  0
#> 6838 244M  *  0  0
#> 6839 244M  *  0  0
#> 6840 244M  *  0  0
#>                                                                                                                                                                                                                                                       V10
#> 6830 AGAGCTTCACATGCAGCAGCTGATCTAATTTTCCCATGTGTTCAGTGAGGTTGGGTGCCACGAAGCCTGTTTTATATAGGCAGGTTTTATCTGCATAGGAACTAAGAAAGACATTCTTGCCAAAGACTACAAGGGTCTGCCTGACCTGGCCTCTGCCATTTGGTAAGAAAACATTTAACAGCCATGCCAAAGTGCTTTTTACTTCTTAAACTGCCCAGATTTTCCTAACAAAGCTTTTGTATTC
#> 6831 AAATATCTACTAGATTCATTTGTTCTGTAGTGCAGATGAAGTTTGATGTCTCTGTGTTGATTTTCTCTAAAGAAGATCTGTTCAGTTCTGAAAGTGGGGTGTTGAAGTCTCCAACTATTATTGTATTGGGGCCTATTTCTCTCTTTAGCTCTAATAGTATTTACTTTATTTATCTTGGTGCTTCAGTGTTTGGTGAATATATATTTAAAATTGTTAAAACCTATTGCAGAATTGGCCCTTTTAT
#> 6832 AAAAGTTTCAAAAAAATTAAAATTATATCAAGTATTTTCTCTGACCACAATGGAATAAAGCTAGAAATCAATAACAAGAGGAATTTTGGAAACTATACAGCACATAGAAATTAAACAGTATGCTTCTGAATGAACAGTGTGTCAATGAAGAAATTACTAATAAGAAGGAAATTTTTTAAATTCTTGCAACAAATGAAAATAGAAACACAATATACCAAAAGTATGGAATGCAGTGAAAGCAGTA
#> 6833 ACTCTCACACTCGCACTCACTCTTAAATAAGATTAAGATCCTGTTTCAAATCCTTCTCCTCCTCCTCTTCATTTGATGTGTCGTTTTTCTCTTTATTTCCCTATCACTTGTTTTTTGAAAAACTAAATTGGCTCTGCTATTGACATAACATTATGTGTAATAGATTTCAAATTTGGAGGTATTTGGATTTTTTTCTATTTTAAAATGATTTCCTTGTTTATATTTTTGCTTACTTAGACAAAGA
#> 6834 ACTAATGATCCTTTGGATTTCTGGTGTATTGGTTTTAATGTCTCCTTTTCTGTCTTTGATTTTGTTTATTTGGGTCTTCTCTCGTTTTTCTTAGTCTGGCTAAAGGCTTGTCAATTTTATTTATCTTTTCAAAAATCCAACTTTTTGTTTTGTTGATCTTATTATTTTCTTCACTTCAGTTGCATTTATTTCTGCTCTTATCTTTATTATTTCTTCTGCTAATCTTGAGTTTAGTTTCTCTTGC
#> 6835 AAATCCTGAATATCTTTGTTAATTTTCTGTCTCAATGATTTGTATAATATTGACAGTGGGGAGTTAAAGTCTCCCACTATTATTGTGTAGGAGTCGAAGTCTCTTTGTAGGTCTCTAAGAACTTGTTTTATGAATCTGAGTGCTCCTGTATTGGGTACATGTACATTTAGGATAGTTAGCTCTCCTTGTTGAATTGAACCCTTTACCATTACGTAATGCCCTTCTTTGTCTTTTTTAATCTTTG
#> 6836 AATTACTCATGATTTCTAGTCACTGATGAGACTTTCTCCTATTTTCCAGGGCTTGTTGGGCTTATCCTTGTAATATTAAAAATATTTGGATGGGAGGATCAGCGGTCATCTGTGTTCAGTCTATTATCTTGAACCAATTCCAGTATAGTTGCTGCTGTATTTAAGAAAAATTTTATGCCTGTTTTATAACTTTCTGCCCCATTAAATTATCTTTTTTTGGACTAATAGTTTTGTATTAATTGTT
#> 6837                                                                                                                                                                                                                             CTCGCCTGCAGGATGCCCGGGCAT
#> 6838 ACAATATATTAGTGTTGCATATCCTCAATATTCATTTAGATTAATCCATACAATTACTCTTTCTATTGTCTTTCACTCCTTACTGAATTTCTAGGCTTTTATCTGAAATTATTTTCCTTCTGCCTGAAGAACTACCTTTTATAGTAATTTTAGTGTAATTCTGCTTAGGAAGTATTATCTGTTCTTGTTATTCTGACAACATTTTCATTTCTCCTTCATTTTTGCAGGATGTTTTTGCTGGGTG
#> 6839 ACCAATTTATATTTAAGGCAAAAATACTCAACAATTTACTCGTGTAGATAAAATTTGTTTTTACTGTTGTGTAGTATGCTATTTTATAAATAGGTGATATTTTATGTGTTTCTTTTTCTTGTAGTGTTTGGGATATCGACTAATGTTTATGAGAGACACAATTTTGACTATTCTTGAAAATGAGATTTTGAGTACATGTGACTTTTTGAGTACACATTCTCCAACTTATAAATCTAGAAGATAA
#> 6840 ACATCTTATTAAAATGTTGTTAATAATTACTTGAACAAGTACTATTTGAACGCCTATGATATTCTATGAAAAATTCTATTAATCCTTTACAGCTCAGTTAATACTACCATCTCACCTCAGCTTTATATGAGCATCTCAGCAAAGATTCCTTCCTTCTTTGTGTTTCTGGTCTACTTAATTCATGCTGAGTACTTTGCATAAACCGCATACATAGAACTGTGTATATTTATGTGTATATTCATTT
#>      V11    V12      V13      V14    V15    V16     V17      V18    V19
#> 6830   * NM:i:0 ms:i:244 AS:i:244 nn:i:0 tp:A:P cm:i:25 s1:i:227 s2:i:0
#> 6831   * NM:i:0 ms:i:244 AS:i:244 nn:i:0 tp:A:P cm:i:20 s1:i:216 s2:i:0
#> 6832   * NM:i:0 ms:i:244 AS:i:244 nn:i:0 tp:A:P cm:i:23 s1:i:232 s2:i:0
#> 6833   * NM:i:1 ms:i:224 AS:i:224 nn:i:0 tp:A:P cm:i:20 s1:i:215 s2:i:0
#> 6834   * NM:i:0 ms:i:244 AS:i:244 nn:i:0 tp:A:P cm:i:23 s1:i:237 s2:i:0
#> 6835   * NM:i:1 ms:i:224 AS:i:224 nn:i:0 tp:A:P cm:i:19 s1:i:207 s2:i:0
#> 6836   * NM:i:0 ms:i:244 AS:i:244 nn:i:0 tp:A:P cm:i:24 s1:i:237 s2:i:0
#> 6837   * rl:i:0                                                        
#> 6838   * NM:i:0 ms:i:244 AS:i:244 nn:i:0 tp:A:P cm:i:22 s1:i:230 s2:i:0
#> 6839   * NM:i:0 ms:i:244 AS:i:244 nn:i:0 tp:A:P cm:i:20 s1:i:225 s2:i:0
#> 6840   * NM:i:0 ms:i:244 AS:i:244 nn:i:0 tp:A:P cm:i:22 s1:i:229 s2:i:0
#>              V20             V21    V22
#> 6830      de:f:0       cs:Z::244 rl:i:0
#> 6831      de:f:0       cs:Z::244 rl:i:0
#> 6832      de:f:0       cs:Z::244 rl:i:0
#> 6833 de:f:0.0041 cs:Z::48*ga:195 rl:i:0
#> 6834      de:f:0       cs:Z::244 rl:i:0
#> 6835 de:f:0.0041 cs:Z::39*ct:204 rl:i:0
#> 6836      de:f:0       cs:Z::244 rl:i:0
#> 6837                                   
#> 6838      de:f:0       cs:Z::244 rl:i:0
#> 6839      de:f:0       cs:Z::244 rl:i:0
#> 6840      de:f:0       cs:Z::244 rl:i:0

z <- readr::read_tsv(pipe("cat weird_subset_woheader.txt"),col_names=F)
#> Warning: One or more parsing issues, see `problems()` for details
#> Rows: 6898 Columns: 22
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: "\t"
#> chr (22): X1, X2, X3, X4, X5, X6, X7, X8, X9, X10, X11, X12, X13, X14, X15, ...
#> 
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
z[z$X1=="*",]
#> # A tibble: 65 × 22
#>    X1    X2    X3    X4    X5    X6    X7    X8    X9    X10   X11   X12   X13  
#>    <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#>  1 *     NM:i… ms:i… AS:i… nn:i… tp:A… cm:i… s1:i… s2:i… de:f… cs:Z… rl:i… <NA> 
#>  2 *     NM:i… ms:i… AS:i… nn:i… tp:A… cm:i… s1:i… s2:i… de:f… cs:Z… rl:i… <NA> 
#>  3 *     NM:i… ms:i… AS:i… nn:i… tp:A… cm:i… s1:i… s2:i… de:f… cs:Z… rl:i… <NA> 
#>  4 *     rl:i… <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA> 
#>  5 *     NM:i… ms:i… AS:i… nn:i… tp:A… cm:i… s1:i… s2:i… de:f… cs:Z… rl:i… <NA> 
#>  6 *     NM:i… ms:i… AS:i… nn:i… tp:A… cm:i… s1:i… s2:i… de:f… cs:Z… rl:i… <NA> 
#>  7 *     rl:i… <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA> 
#>  8 *     NM:i… ms:i… AS:i… nn:i… tp:A… cm:i… s1:i… s2:i… de:f… cs:Z… rl:i… <NA> 
#>  9 *     NM:i… ms:i… AS:i… nn:i… tp:A… cm:i… s1:i… s2:i… de:f… cs:Z… rl:i… <NA> 
#> 10 *     NM:i… ms:i… AS:i… nn:i… tp:A… cm:i… s1:i… s2:i… de:f… cs:Z… rl:i… <NA> 
#> # … with 55 more rows, and 9 more variables: X14 <chr>, X15 <chr>, X16 <chr>,
#> #   X17 <chr>, X18 <chr>, X19 <chr>, X20 <chr>, X21 <chr>, X22 <chr>

z <- read.table(pipe("cat weird_subset_woheader.txt"),header=F,fill=T)
z[z$V1=="*",]
#>  [1] V1  V2  V3  V4  V5  V6  V7  V8  V9  V10 V11 V12 V13 V14 V15 V16 V17 V18 V19
#> [20] V20 V21 V22
#> <0 rows> (or 0-length row.names)

Created on 2022-09-08 with reprex v2.0.2

R --version reports 4.2.1

readr info from installed.packages() :

    Package:               "readr"
    LibPath:               "/home/zed/R/x86_64-pc-linux-gnu-library/4.2"
    Version:               "2.1.2"
    Priority:              NA
    Depends:               "R (>= 3.1)"
    Imports:               "cli (>= 3.0.0), clipr, crayon, hms (>= 0.4.1), lifecycle (>=\n0.2.0), methods, R6, rlang, tibble, utils, vroom (>= 1.5.6)"
    LinkingTo:             "cpp11, tzdb (>= 0.1.1)"
    Suggests:              "covr, curl, datasets, knitr, rmarkdown, spelling, stringi,\ntestthat (>= 3.1.2), tzdb (>= 0.1.1), waldo, withr, xml2"
    Enhances:              NA
    License:               "MIT + file LICENSE"
    License_is_FOSS:       NA
    License_restricts_use: NA
    OS_type:               NA
    MD5sum:                NA
    NeedsCompilation:      "yes"
    Built:                 "4.2.0"

darachm (Author) commented Sep 9, 2022

If you replace a character before the newline with a Unicode character and then delete another ASCII character, it also has the problem. I think maybe this is related to byte length: an ASCII character is a single byte in UTF-8, while a non-ASCII Unicode character takes two or more bytes? Does that seem like a useful model to test?

If I use head -n 6833 to get everything up to and including that newline, that is 2752492 bytes. It's divisible by 4... but doesn't look like an interesting number otherwise.

Is there a buffer that's off by one, something in pipe?
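
A small sketch of the byte arithmetic behind that question, in base R. It checks that swapping an ASCII character for a two-byte one and deleting another ASCII character leaves the byte length unchanged, and then looks at where the offset 2752492 falls relative to a fixed read-buffer size. The 131072-byte (128 KiB) chunk size is an assumption here: to my understanding it is the default connection buffer size in vroom (configurable through the VROOM_CONNECTION_SIZE environment variable), but that should be checked against the vroom documentation, and nothing in this thread confirms that a buffer boundary is the cause.

```r
# Number of bytes needed to store a string as UTF-8.
utf8_bytes <- function(x) length(charToRaw(enc2utf8(x)))

utf8_bytes("A")        # ASCII: one byte
utf8_bytes("\u00e9")   # e-acute: two bytes

# Replace one ASCII character with a two-byte character, delete another
# ASCII character: the byte length (and so every later offset) is unchanged.
original <- "ACGT"
edited   <- "\u00e9GT"
same_length <- utf8_bytes(original) == utf8_bytes(edited)

offset     <- 2752492   # bytes through the newline of line 6833 (from this thread)
chunk_size <- 131072    # ASSUMPTION: 128 KiB connection buffer

within_chunk     <- offset %% chunk_size        # position inside its chunk
to_next_boundary <- chunk_size - within_chunk   # bytes left before the next chunk
```

Under that assumed chunk size, the newline sits 20 bytes before a multiple of 131072 (21 × 131072 = 2752512). Re-running the failing call with a different buffer size would be one way to test whether the failure moves with the boundary.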

@sbearrows sbearrows added the bug an unexpected problem or unintended behavior label Sep 12, 2022
sbearrows (Contributor) commented

I'm able to reproduce the problem you are seeing where rows starting at 6834 are being parsed incorrectly but only when using pipe("cat file.txt"):

read_tsv("weird_subset_woheader.txt",
  col_names = FALSE,
  show_col_types = FALSE
)[6834:6837, ]
#> Warning: One or more parsing issues, call `problems()` on your data frame for details,
#> e.g.:
#>   dat <- vroom(...)
#>   problems(dat)
#> # A tibble: 4 × 22
#>   X1       X2 X3       X4    X5 X6    X7       X8    X9 X10   X11   X12   X13  
#>   <chr> <dbl> <chr> <dbl> <dbl> <chr> <chr> <dbl> <dbl> <chr> <chr> <chr> <chr>
#> 1 samp…     0 pool…    29    60 211M… *         0     0 AATT… *     NM:i… ms:i…
#> 2 samp…     0 pool…    29    60 244M  *         0     0 GTCC… *     NM:i… ms:i…
#> 3 samp…     0 pool…    29    60 244M  *         0     0 ACCA… *     NM:i… ms:i…
#> 4 samp…     4 *         0     0 *     *         0     0 ATCG… *     rl:i… <NA> 
#> # … with 9 more variables: X14 <chr>, X15 <chr>, X16 <chr>, X17 <chr>,
#> #   X18 <chr>, X19 <chr>, X20 <chr>, X21 <chr>, X22 <chr>
#> # ℹ Use `colnames()` to see all variable names

# odd behavior when using pipe and cat where data is stuffed into incorrect columns
read_tsv(pipe("cat weird_subset_woheader.txt"),
  col_names = FALSE,
  show_col_types = FALSE
)[6834:6837, ]
#> Warning: One or more parsing issues, call `problems()` on your data frame for details,
#> e.g.:
#>   dat <- vroom(...)
#>   problems(dat)
#> # A tibble: 4 × 22
#>   X1    X2    X3    X4    X5    X6    X7    X8    X9    X10   X11   X12   X13  
#>   <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 *     NM:i… ms:i… AS:i… nn:i… tp:A… cm:i… s1:i… s2:i… de:f… cs:Z… rl:i… <NA> 
#> 2 *     NM:i… ms:i… AS:i… nn:i… tp:A… cm:i… s1:i… s2:i… de:f… cs:Z… rl:i… <NA> 
#> 3 *     NM:i… ms:i… AS:i… nn:i… tp:A… cm:i… s1:i… s2:i… de:f… cs:Z… rl:i… <NA> 
#> 4 *     rl:i… <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA> 
#> # … with 9 more variables: X14 <chr>, X15 <chr>, X16 <chr>, X17 <chr>,
#> #   X18 <chr>, X19 <chr>, X20 <chr>, X21 <chr>, X22 <chr>
#> # ℹ Use `colnames()` to see all variable names

But I can also see that there are parsing problems, which make it difficult to discern what the cause is.

problems(read_tsv(pipe("cat weird_subset_woheader.txt"),
  num_threads = 1,
  col_names = FALSE,
  show_col_types = FALSE
))
#> # A tibble: 1,085 × 5
#>      row   col expected   actual     file 
#>    <int> <int> <chr>      <chr>      <chr>
#>  1    34    12 22 columns 12 columns ""   
#>  2    36    12 22 columns 12 columns ""   
#>  3    73    12 22 columns 12 columns ""   
#>  4    79    12 22 columns 12 columns ""   
#>  5    86    12 22 columns 12 columns ""   
#>  6    88    12 22 columns 12 columns ""   
#>  7    90    23 22 columns 23 columns ""   
#>  8    91    23 22 columns 23 columns ""   
#>  9    95    12 22 columns 12 columns ""   
#> 10    98    12 22 columns 12 columns ""   
#> # … with 1,075 more rows
#> # ℹ Use `print(n = ...)` to see more rows

It looks like you already knew this.

There are variable numbers of tab-delimited columns! Different lines have different numbers of columns!

But I did want to point out that in your example, you're losing some information since it's not parsing the 23rd column. As you noted, this is a situation where read.table() seems to handle your data better, but even better would be to supply as many column names as the widest row in your data has columns.

# rows 90 and 91 have 23 columns per row
# read_tsv does not generate a 23rd column
read_tsv(pipe("cat weird_subset_woheader.txt"),
  show_col_types = FALSE,
  col_names = paste0("V", 1:23)
)[90:93, ]
#> Warning: One or more parsing issues, call `problems()` on your data frame for details,
#> e.g.:
#>   dat <- vroom(...)
#>   problems(dat)
#> # A tibble: 4 × 22
#>   V1    V2    V3    V4    V5    V6    V7    V8    V9    V10   V11   V12   V13  
#>   <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 samp… 0     pool… 26    60    247S… *     0     0     ACTA… *     NM:i… ms:i…
#> 2 samp… 2048  pool… 29    60    250M… *     0     0     ACTA… *     NM:i… ms:i…
#> 3 samp… 0     pool… 29    60    244M  *     0     0     ACAA… *     NM:i… ms:i…
#> 4 samp… 0     pool… 29    60    244M  *     0     0     AGAT… *     NM:i… ms:i…
#> # … with 9 more variables: V14 <chr>, V15 <chr>, V16 <chr>, V17 <chr>,
#> #   V18 <chr>, V19 <chr>, V20 <chr>, V21 <chr>, V22 <chr>
#> # ℹ Use `colnames()` to see all variable names

# read.table handles this better
# if you supply column names
read.table(pipe("cat weird_subset_woheader.txt"),
  header = F, fill = T,
  col.names = paste0("V", 1:23)
)[90:93, ]
#>                                                  V1   V2            V3 V4 V5
#> 90  sample21_A_F3_TCTAAAATCGCAAAAAACTTTACGCA_13to15    0  poolFour_343 26 60
#> 91  sample21_A_F3_TCTAAAATCGCAAAAAACTTTACGCA_13to15 2048    poolOne_99 29 60
#> 92 sample29_A_G12_CATGCAAATTAGAAGACCTTTAGGAA_20to21    0  poolNine_807 29 60
#> 93  sample8_A_O15_CAAGGAACACTAAAATAGGTTAGTAC_11to11    0 poolSeven_655 29 60
#>          V6 V7 V8 V9
#> 90 247S247M  *  0  0
#> 91 250M244H  *  0  0
#> 92     244M  *  0  0
#> 93     244M  *  0  0
#>                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               V10
#> 90 ACTAATGATCCTTTGGATTTCTGGTGTATTGGTTTTAATGTCTCCTTTTCTGTCTTTGATTTTGTTTATTTGGGTCTTCTCTCGTTTTTCTTAGTCTGGCTAAAGGCTTGTCAATTTTATTTATCTTTTCAAAAATCCAACTTTTTGTTTTGTTGATTCTATTATTTTCTTCACTTCAGTTGCATTTATTTCTGCTCTTATCTTTATTATTTCTTCTGCTAATCTTGAGTTTAGTTTCTCTTGCGCGGCCAAATCCTGAATATCTTTGTTAATTTTCTGTCTCAATGATTTGTATAATATTGACAGTGGGGAGTTAAAGTCTCCCACTATTATTGTGTAGGAGTCGAAGTCTCTTTGTAGGTCTCTAAGAACTTGTTTTATGAATCTGAGTGCTCCTGTATTGGGTACATGTACATTTAGGATAGTTAGCTCTCCTTGTTGAATTGAACCCTTTACCATTACGTAATGCCCTTCTTTGTCTTTTTTAATCTTTG
#> 91                                                                                                                                                                                                                                                     ACTAATGATCCTTTGGATTTCTGGTGTATTGGTTTTAATGTCTCCTTTTCTGTCTTTGATTTTGTTTATTTGGGTCTTCTCTCGTTTTTCTTAGTCTGGCTAAAGGCTTGTCAATTTTATTTATCTTTTCAAAAATCCAACTTTTTGTTTTGTTGATTCTATTATTTTCTTCACTTCAGTTGCATTTATTTCTGCTCTTATCTTTATTATTTCTTCTGCTAATCTTGAGTTTAGTTTCTCTTGCGCGGCC
#> 92                                                                                                                                                                                                                                                           ACAATATTAAGTCTTTCAACCTGTGAACATGGGATGTCTTTCCTTTTATTTGTATCTGCTTTAATTTCCTTCATCAAGGTTTTTTAGTTTCCAGTGTACAAGTCTTACACTTTCTTAAGTTTATTCCTATTTTATTATTTTTAATCCTATTGTAAATGGGATTCTTATGTCCTTTTTGTATAGTTTATTTTTAGTATATAGAAATGTCACTGATTTTTGTATGTTTTGTATGCTGCAACTTAAT
#> 93                                                                                                                                                                                                                                                           AGATAAATTACATTCATGAAAGAAGCATATTATTTTTAAAGTACTTTATTTTGGAAAGGTAAAATGCTTGTGTAGTTATAATTTGGTTACTCTTGATTTCACCTTAGGAAAAACAATATCACCTTCTAACCATTTCTTTTTTAGTCAAATCTCTTGCTTCTATTTCTCTCTGTAGATCCGCTATTAAAGACTGTAATCACTGCTGCATCTTTCCTGTAAGGCTTGATCGCATTGTTAATTTCTT
#>    V11    V12      V13      V14    V15    V16     V17      V18    V19
#> 90   * NM:i:1 ms:i:227 AS:i:227 nn:i:0 tp:A:P cm:i:20 s1:i:215 s2:i:0
#> 91   * NM:i:2 ms:i:210 AS:i:210 nn:i:0 tp:A:P cm:i:20 s1:i:226 s2:i:0
#> 92   * NM:i:0 ms:i:244 AS:i:244 nn:i:0 tp:A:P cm:i:21 s1:i:232 s2:i:0
#> 93   * NM:i:0 ms:i:244 AS:i:244 nn:i:0 tp:A:P cm:i:20 s1:i:223 s2:i:0
#>            V20                                   V21                V22    V23
#> 90 de:f:0.0040   SA:Z:poolOne_99,29,+,250M244S,60,2;    cs:Z::42*ct:204 rl:i:0
#> 91 de:f:0.0080 SA:Z:poolFour_343,26,+,247S247M,60,1; cs:Z::157*ct*tc:91 rl:i:0
#> 92      de:f:0                             cs:Z::244             rl:i:0       
#> 93      de:f:0                             cs:Z::244             rl:i:0
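
To find out how many column names to supply without guessing, the fields on each line can be counted first. This is a minimal base-R sketch using `utils::count.fields()`; the three-line demo file is made up and only stands in for the SAM file, and `quote = ""` and `comment.char = ""` are set on the assumption that quotes and "#" in the data should be treated as ordinary characters.

```r
# Stand-in file with a ragged number of tab-separated fields per line.
demo <- tempfile(fileext = ".tsv")
writeLines(c("r1\ta\tb", "r2\ta", "r3\ta\tb\tc"), demo)

# Count the fields on every line, then size the column names to the widest row.
n_fields <- count.fields(demo, sep = "\t", quote = "", comment.char = "")
table(n_fields)
max_cols <- max(n_fields)

wide <- read.table(demo,
  sep = "\t", header = FALSE, fill = TRUE,
  quote = "", comment.char = "",
  col.names = paste0("V", seq_len(max_cols))
)
dim(wide)
```

The same `max_cols` could be used to build the `col.names` vector in the read.table() call above instead of hard-coding 1:23.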

Here is a related issue thread that might also help #762.

hadley (Member) commented Jul 31, 2023

Thanks for filing this bug report! Unfortunately, because it requires such a specific setup to reproduce, we believe it's unlikely to affect many people, and we don't have the development resources to fix it at this time. It's our policy to close such issues to help stay focussed on the biggest problems, but the issue is still indexed by Google, so if other people hit it, they'll be able to find it, and we can consider reopening it if it turns out to be a common problem. Thanks for reporting and I'm sorry we couldn't help more 😞.

@hadley hadley closed this as completed Jul 31, 2023