CountMatrix bugfix and new tests #43

mbabadi · 2018-06-27T21:01:35Z

This PR fixes some algorithmic issues in the CountMatrix class. It also improves related unit tests. The current counting algorithm implemented in CountMatrix.from_sorted_tagged_bam is expected to be functionally similar to CellRanger methods.

Closes #42.

new tests for count matrix computation from BAM updated count matrix test documentation updated count matrix class documentation basic refactoring and code cleanup

codecov-io · 2018-06-27T21:05:44Z

Codecov Report

Merging #43 into master will increase coverage by 0.26%.
The diff coverage is 94.84%.

@@            Coverage Diff             @@
##           master      #43      +/-   ##
==========================================
+ Coverage   94.67%   94.94%   +0.26%     
==========================================
  Files          24       25       +1     
  Lines        1727     1997     +270     
==========================================
+ Hits         1635     1896     +261     
- Misses         92      101       +9

Impacted Files	Coverage Δ
src/sctools/test/test_entrypoints.py	`100% <100%> (ø)`	⬆️
src/sctools/consts.py	`100% <100%> (ø)`
src/sctools/test/test_bam.py	`98.68% <100%> (ø)`	⬆️
src/sctools/test/test_gtf.py	`100% <100%> (ø)`	⬆️
src/sctools/test/test_fastq.py	`100% <100%> (ø)`	⬆️
src/sctools/encodings.py	`96.19% <100%> (ø)`	⬆️
src/sctools/__init__.py	`100% <100%> (ø)`	⬆️
src/sctools/test/test_metrics.py	`100% <100%> (ø)`	⬆️
src/sctools/test/test_barcode.py	`97.64% <100%> (ø)`	⬆️
src/sctools/metrics/aggregator.py	`91.27% <100%> (+0.05%)`	⬆️
... and 9 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 24ed25f...57b9369. Read the comment docs.

ambrosejcarr

This looks awesome. It was really fun to read -- you've obviously done a ton of work with bam files and have added some really cool tricks here.

I left you a bunch of questions and requests which are hopefully not too onerous. :)

In case I missed something, I also reviewed the notebook here: HumanCellAtlas/skylab#118

ambrosejcarr · 2018-06-28T02:59:31Z

src/sctools/count.py

+        return self._col_index
+
+    @staticmethod
+    def _get_alignments_grouped_by_query_name_generator(bam_file: str,


I like arguments on a newline to reduce leading whitespace.

ambrosejcarr · 2018-06-28T03:03:19Z

src/sctools/count.py

+        """
+        with pysam.AlignmentFile(bam_file, mode=open_mode) as bam_records:
+            for alignment in itertools.groupby(bam_records, key=lambda record: record.query_name):
+                query_name: str = alignment[0]


I feel like this section could be just:

for (query_name, grouper) in itertools.groupby(bam_records, key=lambda record: record.query_name): alignments: List[pysam.AlignedSegment] = [a for a in grouper] # or list(grouper)

ambrosejcarr · 2018-06-28T03:06:52Z

src/sctools/gtf.py

+        gene_name = record.get_attribute('gene_name')
+        if gene_name is None:
+            raise ValueError(
+                'Malformed GTF file detected. Record is of type gene but does not have a '


I'm a fan of f-style strings everywhere (although the repository has not yet been converted).

Agreed, and done.

While at it, I converted all exception messages to f-stye in the repo.

ambrosejcarr · 2018-06-28T03:07:19Z

src/sctools/count.py

        sp.save_npz(prefix + '.npz', self._matrix, compressed=True)
        np.save(prefix + '_row_index.npy', self._row_index)
        np.save(prefix + '_col_index.npy', self._col_index)

    @classmethod
-    def load(cls, prefix: str):
+    def load(cls, prefix: str) -> 'CountMatrix':


ambrosejcarr · 2018-06-28T13:49:12Z

src/sctools/test/test_count.py

+
+
+@pytest.mark.parametrize('query_sort_order', ['CB_UB_GE_n', 'n'])
+def test_count_matrix_from_bam(query_sort_order):


Looking at the tests, it looks like the same data is used for each test, but then sorted differently. Could you factor out the synthetic data generation into a pytest.fixture(scope='module') such that the data generation is only done once? This could speed up testing.

Nice suggestion, but let me pass on this one -- it's too much work for little gain. The main speed bottleneck in this test, by and large, is loading chr1 annotations. By the way, it is not uncommon for test suites to take hours to complete! (GATK takes about 1 hour).

Also, had we used a truncated (or synthetic) set of annotations, we would not have discovered the indexing bug caused by multiple entries per gene!

I believe pretty strongly in responsive unit testing -- it aids in a fast development cycle. Mocks and fixtures can help with this, and they are not difficult to work with once you get a hang of it. I would bet that @rexwangcc could definitely help with this.

Given what you said about loading the annotations, I'd suggest cutting down on the chr1 annotation sizes, and loading the annotations as a fixture so they only get loaded once for further tests.

Could you use a smaller subset of the annotations but still maintain the same testing quality? This repo is not a great example, as we obviously have testing holes, but the existing tests complete in < 5s, so I think there's a burden to prove that the tests you're adding need to be so much long er.

Alright, truncated chr1 annotations, changed some signatures, added gene names as a fixture. Test is fast now.

ambrosejcarr · 2018-06-28T16:35:52Z

src/sctools/test/test_count.py

+                                   incomplete_alignment_tags_list,
+                                   multiple_alignment_tags_list)
+
+    def generate_synthetic_bam_and_counts_matrix(self,


Question: for python code, I like to read top to bottom. This function calls a bunch of the private methods below. My tendency would be to move this below those. Do you have any strong opinions on this?

I definitely agree -- I moved things around for better code readability.

ambrosejcarr · 2018-06-28T16:38:14Z

src/sctools/test/test_count.py

+    def _generate_incomplete_alignment_tags(self, num_missing_some_tags: int) -> List[AlignmentRecordTags]:
+        incomplete_alignment_tags_list: List[AlignmentRecordTags] = list()
+        for _ in range(num_missing_some_tags):
+            tag_mask = self.rng.randint(low=0, high=7)


Nice, I like this way of masking. Very efficient!

ambrosejcarr · 2018-06-28T16:41:04Z

src/sctools/test/test_count.py

+        _test_annotation_file,
+        _test_gene_expression_rate)
+
+    with tempfile.TemporaryDirectory() as _test_temp_dir:


I'd add a finally clause to explicitly shutil.rmtree the temp dir -- on some operating systems, these files get left around until restart, so it's nice to clean up after ourselves. :)

I did it just to be safe, but I have a feeling that this is redundant. There are several instances of improper usage of tempfile.TemporaryDirectory() in sctools and that is likely to be the cause of improper cleanup.

Cool, made #45.

ambrosejcarr · 2018-06-28T16:45:48Z

src/sctools/test/test_metrics.py

@@ -1,13 +1,12 @@
+import fileinput


Thanks for sorting 👍

ambrosejcarr · 2018-06-28T16:52:22Z

src/sctools/test/test_count.py

-    assert np.allclose(csr1.indptr, csr2.indptr)
-    assert np.allclose(csr1.data, csr2.data)
+    # assert equality
+    assert np.allclose(sorted_count_matrix_data_from_bam, sorted_count_matrix_data_expected)


Can we add a docstring to this test that's super clear on what we're testing here? My initial concern was that we're generating synthetic bams and count matrices, and using similar code to transition between them, instead of using some kind of external gold standard that we're aiming to hit.

Done well, these tests are probably way better than some small matrix or bam used to smoke test the functionality, but there's also an opportunity to code 2 + 2 = 5 somewhere into both the count matrix generation and the bam_to_count function.

[moonshot] Longer term, when we've figured out all the discrepancies, what do you think about trying to run the cell ranger code in our tests to verify we're equivalent?

Added an extensive preamble documentation block.

ambrosejcarr · 2018-07-01T19:49:52Z

src/sctools/count.py

-            gene_id_tag: str='GE',
-            open_mode: str='rb',
-            ):
+            gene_name_tag: str='GE',


One question I forgot to ask: are these HGNC gene names that are going in the tags, and not the numerical IDs? If so then this is a good change.

If they are IDs, I'd prefer to keep the name as ID.

Unfortunately, they actually gene names and not IDs (we reply on dropseq-tools for tagging genes; it uses gene names by default, there is no way to force it to use gene IDs).

mbabadi · 2018-07-06T01:19:37Z

@ambrosejcarr I'm done -- thanks for the thorough review! let me rebase and regroups the commits and it will be good to go for your next review.

f-style exceptions fastq to FASTQ in fast.py docblocks some code cleanup

- refactoring, rearrangement, and cleanup of test_count.py - documentation in test_count.py - AlignmentSortOrder class in bam.py - refactoring further instances of constants to consts.py

mbabadi · 2018-07-09T20:26:26Z

@ambrosejcarr back to you.

mbabadi · 2018-07-10T16:40:54Z

@ambrosejcarr please feel free to correct my grammar in the docs (I am very sloppy with plain language, specially while coding!) + stylistic inconsistencies -- I could notice a few typos/grammatical issues while browsing the diff! :)

ambrosejcarr

Strongly approve of how you extended from my comments to make some really nice changes. :-)

I want to have a bit more conversation about testing duration, and left a few very minor comments. Besides that I have nothing further.

ambrosejcarr · 2018-07-12T14:23:30Z

src/sctools/bam.py

@@ -231,27 +236,27 @@ def split(in_bam, out_prefix, *tags, approx_mb_per_split=1000, raise_missing=Tru
    if len(tags) == 0:
        raise ValueError('At least one tag must be passed')

-    def _cleanup(files_to_counts, files_to_names, rm_all=False) -> None:
+    def _cleanup(_files_to_counts, _files_to_names, rm_all=False) -> None:


These parameters should have types.

I am unclear on why you'd have a leading underscore here. It seems obvious that these methods are private since it's a function with no internal closures.

Added types.

files_to_counts and files_to_names shadow from the outer scope. Even though this specific is OK, it is generally not a good practice to let inner classes use the outer scope var names (it is a recipe for hard-to-catch bugs!).

Yep, agree. For some reason I'd extrapolated in my mind such that trailing underscores were used, where as weakly private methods have leading underscores.

single_trailing_underscore_: used by convention to avoid conflicts with Python keyword, e.g. Tkinter.Toplevel(master, class_='ClassName')

However, I couldn't find it in pep8.

ambrosejcarr · 2018-07-12T14:25:28Z

src/sctools/bam.py

@@ -260,8 +265,8 @@ def _cleanup(files_to_counts, files_to_names, rm_all=False) -> None:
        warnings.warn('Number of requested subfiles (%d) exceeds 500; this may cause OS errors by '
                      'exceeding fid limits' % n_subfiles)
    if n_subfiles > 1000:
-        raise ValueError('Number of requested subfiles (%d) exceeds 1000; this will usually cause '
-                         'OS errors, think about increasing max_mb_per_split.' % n_subfiles)
+        raise ValueError(f'Number of requested subfiles ({ n_subfiles}) exceeds 1000; this will usually cause '


Why the leading space before n_subfiles?

Also, extracted constants.

ambrosejcarr · 2018-07-12T14:25:46Z

src/sctools/bam.py

    current_tag : str
        the molecule barcode that records in the group all share

    """
-    return iter_tag_groups(tag='UB', bam_iterator=bam_iterator)
+    return iter_tag_groups(tag=consts.MOLECULE_BARCODE_TAG_KEY, bam_iterator=bam_iterator)


ambrosejcarr · 2018-07-12T14:28:56Z

src/sctools/bam.py

@@ -417,9 +421,69 @@ def iter_genes(bam_iterator: Iterator[pysam.AlignedSegment]) -> Generator:
    Yields
    ------
    grouped_by_tag : Iterator[pysam.AlignedSegment]
-        reads sharing a unique gene id (``GE`` tag)
+        reads sharing a unique gene name tag


Yikes I thought these were gene ids, not gene names. Good catch.

ambrosejcarr · 2018-07-12T14:32:11Z

src/sctools/gtf.py

+                    f'malformed GTF file.')
+
+
+def extract_gene_names(files='-', mode='r', header_comment_char='#') -> Dict[str, int]:


What are your feeling on types for parameters with default arguments? Mypy suggests it because it enforces static typing on them (in addition to annotating them as strings, which is obvious given the default). I think I like this, but I'm open to opinions.

Added types.

- faster test_count.py (truncated chr1.gtf.gz, pytest.fixture for gene_name_to_index, changed signature of count.from_sorted_tagged_bam) - misc refactoring and code cleanup

mbabadi · 2018-07-16T19:11:10Z

@ambrosejcarr back to you.

ambrosejcarr

I'm now getting some frustrating and maybe environment-specific installation errors with pysam, but it looks like the tests pass in under 7 seconds, which is good enough for me. Thanks for putting in the extra effort!

[Edit] I was right about environment-specific -- my computer partially uninstalled xcode. 🙃

mbabadi requested a review from ambrosejcarr June 27, 2018 21:01

gene_id -> gene_name

e736196

new tests for count matrix computation from BAM updated count matrix test documentation updated count matrix class documentation basic refactoring and code cleanup

mbabadi force-pushed the mb-cell-ranger-func-equivalence branch from 47eb3c3 to e736196 Compare June 27, 2018 21:03

ambrosejcarr reviewed Jun 28, 2018

View reviewed changes

ambrosejcarr reviewed Jul 1, 2018

View reviewed changes

refactoring out most global constants to a consts.py file

a2117d5

f-style exceptions fastq to FASTQ in fast.py docblocks some code cleanup

mbabadi force-pushed the mb-cell-ranger-func-equivalence branch from 90ae59b to 394fb21 Compare July 9, 2018 20:18

PR review requested changes

702a2a4

- refactoring, rearrangement, and cleanup of test_count.py - documentation in test_count.py - AlignmentSortOrder class in bam.py - refactoring further instances of constants to consts.py

mbabadi force-pushed the mb-cell-ranger-func-equivalence branch from 394fb21 to 702a2a4 Compare July 9, 2018 20:25

ambrosejcarr reviewed Jul 12, 2018

View reviewed changes

changes after PR review round 2

57b9369

- faster test_count.py (truncated chr1.gtf.gz, pytest.fixture for gene_name_to_index, changed signature of count.from_sorted_tagged_bam) - misc refactoring and code cleanup

ambrosejcarr approved these changes Jul 18, 2018

View reviewed changes

mbabadi merged commit 9bff1c6 into master Jul 25, 2018

mbabadi deleted the mb-cell-ranger-func-equivalence branch July 25, 2018 19:29



		@pytest.mark.parametrize('query_sort_order', ['CB_UB_GE_n', 'n'])
		def test_count_matrix_from_bam(query_sort_order):

		f'malformed GTF file.')


		def extract_gene_names(files='-', mode='r', header_comment_char='#') -> Dict[str, int]:

CountMatrix bugfix and new tests #43

CountMatrix bugfix and new tests #43

Conversation

mbabadi commented Jun 27, 2018

codecov-io commented Jun 27, 2018 • edited Loading

Codecov Report

ambrosejcarr left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

mbabadi commented Jul 6, 2018

mbabadi commented Jul 9, 2018

mbabadi commented Jul 10, 2018

ambrosejcarr left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

mbabadi commented Jul 16, 2018

ambrosejcarr left a comment • edited Loading

Choose a reason for hiding this comment

codecov-io commented Jun 27, 2018 •

edited

Loading

ambrosejcarr left a comment •

edited

Loading