Commit
Updated img in source and source/_static
dputhier committed Oct 7, 2020
1 parent 10065b6 commit b4008b3
Showing 87 changed files with 326 additions and 266 deletions.
Binary file modified docs/_images/example_01.png
Binary file modified docs/_images/example_05.png
Binary file modified docs/_images/example_06.png
Binary file modified docs/_images/example_06b.png
Binary file modified docs/_images/example_08.png
Binary file modified docs/_images/example_13.png
83 changes: 50 additions & 33 deletions docs/_sources/ologram.rst.txt
@@ -54,7 +54,7 @@ The program will return statistics for both the number of intersections and the
- H1: The regions of the query (--peak-file) tend to overlap the reference (--inputfile or --more-bed).


.. warning:: The ologram examples below use 8 CPUs. Please adapt.
.. warning:: The ologram examples below use 8 CPUs. Please adapt the number of threads.



@@ -167,33 +167,16 @@ The program will return statistics for both the number of intersections and the
ologram (multiple overlaps)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

While previously we computed paiwise enrichment (ie. Query+A, Query+B ...) , It is also possible to use the **OLOGRAM-MODL** Multiple Overlap Dictionary Learning) plugin to find multiple overlaps (ie. between n>=2 sets) enrichment (ie. Query+A+B, Query+A+C, ...) in order to highlight combinations of genomic regions, such as Transcriptional Regulator complexes.
While previously we computed pairwise enrichment (i.e. Query+A, Query+B, ...), it is also possible to use the **OLOGRAM-MODL** (Multiple Overlap Dictionary Learning) plugin to find enrichment of multiple overlaps (i.e. between n>=2 sets: Query+A+B, Query+A+C, ...) in order to highlight combinations of genomic regions, such as Transcriptional Regulator complexes.

This is done only on custom regions supplied as BED files with the `--more-bed` argument. In most cases you may use the --no-gtf argument and only pass the regions of interest.


For statistical reasons, we recommend shuffling across a relevant subsection of the genome only (e.g. enhancers only) using --bed-excl or --bed-incl, to ensure the longer combinations have a reasonable chance of being randomly encountered in the shuffles.


**MODL itemset mining algorithm:** By default, OLOGRAM-MODL will compute the enrichment of all n-wise combinations that are encountered in the real data it was passed. This however can add up to 2**N combinations and make the result hard to read. Furthermore, in biological data noise is a real problem and can obscure the relevant combinations.

As such, we also give the option to use a custom itemset mining algorithm on the true overlaps to identify interesting combinations.

In broad strokes, this custom algorithm MODL (Multiple Overlap Dictionary Learning) will perform many matrix factorizations on the matrix of true overlaps to identify relevant correlation groups of genomic regions. Then a greedy algorithm based on how much these words improve the reconstruction will select the utmost best words. MODL is only used to filter the output of OLOGRAM : once it returns a list of interesting combination, OLOGRAM will compute their enrichment as usual, but for them only. Each combination is of the form [Query + A + B + C] where A, B and C are BED files given as --more-bed. You can also manually specify the combinations to be studied with the format defined in OLOGRAM notes (below).

Unlike classical association rules mining algorithms, this focuses on mining relevant bio complexes/clusters and correlation groups (item sets), and you should not request more than 20-30 combinations. As a matrix factorization based algorithm, it is designed to be resistant
to noise which is a known problem in biological data. Its goal is to extract meaningful frequent combinations from noisy data. As a result however, it is biased in favor of the most abundant combinations in the data, and may return correlation groups if you ask for too few words (ie. if AB, BC and AC are complexes, ABC might be returned).


This itemset mining algorithm is a work-in-progress. Whether you use MODL will not change the results for each combination, it only changes which combinations are displayed. If you want the enrichment of all combinations, ignore it. To use MODL, use the --multiple-overlap-max-number-of-combinations argument.

For statistical reasons, we recommend shuffling across a relevant subsection of the genome only (ie. enhancers only) using --bed-excl or --bed-incl to ensure the longer combinations have a reasonable chance of being randomly encountered in the shuffles. Conversely, if you do not filter the combinations, keep in mind that the longer ones may be enriched even though they are present only on a few base pairs, because at random they would be even rarer.

**Exact combinations:** By default, OLOGRAM will compute "inexact" combinations, meaning that when encountering an overlap of [Query + A + B + C] it will also count towards [A + B + ...]. For exact intersections (i.e. [Query + A + B + nothing else]), set the --multiple-overlap-target-combi-size flag to the number of --more-bed files plus one. You will know the combinations are computed as inexact by the '...' in their name in the result file. Intersections not including the query file are discarded.





**Simple example:**

Comparing the query (-p) against two other BED files, analyzing multiple overlaps.
@@ -238,19 +221,30 @@ Comparing the query (-p) against two other BED files, analyzing multiple overlap
As the computation of multiple overlaps can be RAM-intensive, if you have a very large number of candidate genomic feature sets (hundreds), we recommend first narrowing them down by running a pairwise analysis.


**MODL algorithm API:** MODL can also be used independantly as a combination mining algorithm.

This can work on any type of data, biological or not, that respects the conventional formatting for lists of transactions: the data needs to be a matrix with one line per transaction and one column per element.
**MODL itemset mining algorithm:** By default, OLOGRAM-MODL will compute the enrichment of all n-wise combinations that are encountered in the real data it was passed. This, however, can add up to 2**N combinations and make the result hard to read. Furthermore, in biological data noise is a real problem and can obscure the relevant combinations. As such, we also give the option to use a custom itemset mining algorithm on the true overlaps to identify interesting combinations.



Details
-----------------

For example, if you have three possible elements A, B and C, a line of [1,0,1] means a transaction containing A and C.

For a factor allowance of k and n final queried words, the matrix will be rebuilt with k*n words in step 1.
factor allowance is K in K*n words in step 1 where n is final queries nb of words.
In broad strokes, the custom itemset mining algorithm MODL (Multiple Overlap Dictionary Learning) will perform many matrix factorizations on the matrix of true overlaps to identify relevant correlation groups of genomic regions. Then a greedy algorithm, based on how much these words improve the reconstruction, will select the best words. MODL is only used to filter the output of OLOGRAM: once it returns a list of interesting combinations, OLOGRAM will compute their enrichment as usual, but for them only. Each combination is of the form [Query + A + B + C] where A, B and C are BED files given as --more-bed. You can also manually specify the combinations to be studied with the format defined in the OLOGRAM notes (below).

MODL and will discard combinations rarer than 1/10000 occurences to reduce computing times and will also reduce the abundance of all unique lines in the matrix to their square roots to reduce the emphasis on the most frequent elements.
However, this can magnify the impact of the noise quadratically as well, and can be disabled when using the manual API.
Unlike classical association rule mining algorithms, this focuses on mining relevant bio complexes/clusters and correlation groups (item sets), and you should not request more than 20-30 combinations. As a matrix factorization-based algorithm, it is designed to be resistant to noise, which is a known problem in biological data. Its goal is to extract meaningful frequent combinations from noisy data. As a result, however, it is biased in favor of the most abundant combinations in the data, and may return correlation groups if you ask for too few words (e.g. if AB, BC and AC are complexes, ABC might be returned).


This itemset mining algorithm is a work in progress. Whether you use MODL will not change the results for each combination; it only changes which combinations are displayed. If you want the enrichment of all combinations, ignore it. To use MODL, use the --multiple-overlap-max-number-of-combinations argument.



**MODL algorithm API:** MODL can also be used independently as a combination mining algorithm.

This can work on any type of data, biological or not, that respects the conventional formatting for lists of transactions: the data needs to be a matrix with one line per transaction and one column per element. For example, if you have three possible elements A, B and C, a line of [1,0,1] means a transaction containing A and C.

For a factor allowance of k and n final queried words, the matrix will be rebuilt with k*n words in step 1. MODL will discard combinations rarer than 1/10000 occurrences to reduce computing time. It will also reduce the abundance of all unique lines in the matrix to their square roots to reduce the emphasis on the most frequent elements. However, the latter can magnify the impact of the noise as well and can be disabled when using the manual API. To de-emphasize longer words, which can help in this case, we can also normalize words by their summed square in step 2.
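
To make the square-root "smothering" step more concrete, here is a minimal NumPy sketch of the idea; it is a conceptual illustration only, not the code used internally by MODL.

.. code-block:: python

    import numpy as np

    # Toy transaction matrix: one line per overlap, one column per region set (A, B, C).
    flags = np.array([[1, 0, 1],
                      [1, 0, 1],
                      [1, 0, 1],
                      [1, 0, 1],
                      [1, 1, 0]])

    # Collapse to unique lines and their abundances.
    uniques, counts = np.unique(flags, axis=0, return_counts=True)

    # Reduce each abundance to (roughly) its square root so that the most frequent
    # combinations do not completely dominate the factorization.
    smothered_counts = np.maximum(1, np.floor(np.sqrt(counts)).astype(int))

    # Rebuild a smothered matrix by repeating each unique line accordingly.
    smothered_flags = np.repeat(uniques, smothered_counts, axis=0)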

If you are passing a custom error function, it must have the signature error_function(X_true, X_rebuilt, code). X_true is the real data, X_rebuilt is the reconstruction to evaluate, and code is the encoded version, which in our case is used to assess sparsity. All are NumPy matrices.
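
As an illustration, a custom error function respecting this signature could look like the sketch below. Only the signature comes from the API; the particular weighting of reconstruction error against sparsity, and the assumption that a single score to be minimized is returned, are choices made for the example.

.. code-block:: python

    import numpy as np

    def my_error_function(X_true, X_rebuilt, code):
        # Squared reconstruction error between the true matrix and its rebuild.
        reconstruction_error = np.sum((np.asarray(X_true) - np.asarray(X_rebuilt)) ** 2)

        # Penalize dense encodings by counting the non-zero coefficients in the code.
        sparsity_penalty = np.count_nonzero(np.asarray(code))

        # Return a single score to be minimized (the 0.5 weight is arbitrary).
        return reconstruction_error + 0.5 * sparsity_penalty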

@@ -272,19 +266,40 @@ Here is an example:
nb_threads = 1,
step_1_factor_allowance = 2, # How many words to ask for in each step 1 rebuilding, as a multiplier of multiple_overlap_max_number_of_combinations
error_function = None, # Custom error function in step 2
smother = True) # Should the smothering (quadratic reduction of abundance) be applied ?
smother = True, # Should the smothering (quadratic reduction of abundance) be applied ?
normalize_words = False) # Normalize words by their summed squared in step 2 ?
interesting_combis = combi_miner.find_interesting_combinations()
For more details about usage and implementation, please read the notes below :
For more details about usage and implementation, please read the notes below.

**Arguments:**

.. command-output:: gtftk ologram -h
:shell:


Since the results of MODL only depend on the true intersections and not on the shuffles, you can run MODL with 1 shuffle to pre-select interesting combinations, and then run the full analysis on many shuffles. We then recommend selecting the combinations that interest you in the resulting tsv, using MODL's selection as a starting point, and adding or removing some combinations based on your own needs (eg. adding all the highest fold changes, or all particular combinations containing the Transcription Factor X that you are studying). Then, run ologram_modl_treeify on the resulting filtered tsv.

**Manual intersection computing:** To manually compute an overlap matrix between any number of BED files, the following Python code can be used.

.. code-block:: python

    import pybedtools
    import numpy as np
    from pygtftk.stats.intersect.overlap_stats_compute import compute_true_intersection

    # Register the BED files as pybedtools.BedTool objects
    bedA = pybedtools.BedTool(path_to_your_query)
    bedsB = [pybedtools.BedTool(bedfilepath) for bedfilepath in list_of_all_paths_to_more_bed]

    # Use our custom intersection computing algorithm to get the matrix of overlaps
    true_intersection = compute_true_intersection(bedA, bedsB)
    flags_matrix = np.array([i[3] for i in true_intersection])
The resulting flags_matrix is a NumPy array that can be edited, and on which MODL can be run.
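
For instance, continuing from the snippet above, a purely illustrative edit could subset the matrix with plain NumPy before mining. The column layout assumed here (query flagged in the first column, then the --more-bed files in the order they were given) is an assumption to verify against your own matrix.

.. code-block:: python

    # Keep only the overlap regions that involve the query
    # (assuming column 0 flags the query BED file).
    query_flags = flags_matrix[flags_matrix[:, 0] == 1]

    # Or drop one of the --more-bed columns, here hypothetically column index 2,
    # to exclude that file from the mining.
    reduced_flags = np.delete(flags_matrix, 2, axis=1)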

Since the results of MODL only depend on the true intersections and not on the shuffles, you can run MODL with 1 shuffle or on a manually computed matrix as above to pre-select interesting combinations, and then run the full analysis on many shuffles. We then recommend selecting the combinations that interest you in the resulting tsv file, using MODL's selection as a starting point and adding or removing some combinations based on your own needs (e.g. adding all the highest fold changes, or all particular combinations containing the Transcription Factor X that you are studying).
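
As a sketch of this kind of manual filtering, assuming pandas is available and that the combination labels are stored in a column named feature_type (check the header of your own result file and adjust the file names, which are placeholders here):

.. code-block:: python

    import pandas as pd

    stats = pd.read_csv("path/to/your_ologram_stats.tsv", sep="\t")

    # Keep, for example, every combination whose label mentions a factor of interest.
    filtered = stats[stats["feature_type"].str.contains("FactorX")]
    filtered.to_csv("filtered_stats.tsv", sep="\t", index=False)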



ologram_merge_stats
@@ -325,11 +340,13 @@ This also works with OLOGRAM-MODL results, since they follow the same basic format
ologram_modl_treeify
~~~~~~~~~~~~~~~~~~~~~~

**Description:** Visualize n-wise enrichment results (OLOGRAM-MODL) as a tree of combinations. Works on the result (tsv file) of an OLOGRAM analysis called with --more-bed-multiple-overlap.
**Description:** Visualize n-wise enrichment results (OLOGRAM-MODL) as a tree of combinations. Works on the result (tsv file) of an OLOGRAM analysis called with --more-bed-multiple-overlap. On the graph, S designates the total number of basepairs in which this combination is encountered in the real data. Fold change gives the ratio with the number of basepairs in the shuffles, along with the associated Negative Binomial p-value.

This recommended representation is useful to find master regulators, by showing which additions to a combination increase its enrichment, and by allowing you to see whether overlaps that contain the element X also contain the element Y (looking at how a child combination accounts for the S of its parent in an inexact counting).

We recommend this representation. The tsv file can be edited before passing it to the command, for example by keeping only the combinations you are interested in.
The tsv result file can be edited before passing it to the command, for example by keeping only the combinations you are interested in, such as all combinations containing the Transcription Factor you are studying. We recommend running MODL to make a pre-selection.

On the graph, S designated the total number of basepairs in which this combinations is encountered in the real data. Fold change gives the ratio with the number of basepairs in the shuffles, with the associated Negative Binomial p-value.
We also recommend discarding the rarest combinations, found on such a small number of basepairs that they are unlikely to be biologically significant. This is mostly relevant when you have many sets (k >= 5), since longer combinations will often be enriched through sheer unlikelihood.

.. command-output:: gtftk ologram_modl_treeify -i multiple_overlap_trivial_ologram_stats.tsv -o treeified.pdf -l ThisWasTheNameOfTheQuery
:shell:
2 changes: 1 addition & 1 deletion docs/_static/documentation_options.js
@@ -1,6 +1,6 @@
var DOCUMENTATION_OPTIONS = {
URL_ROOT: document.getElementById("documentation_options").getAttribute('data-url_root'),
VERSION: '1.2.1',
VERSION: '1.2.3',
LANGUAGE: 'en',
COLLAPSE_INDEX: false,
BUILDER: 'html',
Binary file modified docs/_static/example_01.png
Binary file modified docs/_static/example_01b.png
Binary file modified docs/_static/example_02.png
Binary file modified docs/_static/example_05.png
Binary file modified docs/_static/example_06.png
Binary file modified docs/_static/example_06b.png
Binary file modified docs/_static/example_07.png
Binary file modified docs/_static/example_08.png
Binary file modified docs/_static/example_13.png
Binary file modified docs/_static/example_pa_01.pdf
Binary file not shown.
Binary file modified docs/_static/example_pa_02.pdf
Binary file not shown.
Binary file modified docs/_static/example_pa_03.pdf
Binary file not shown.
Binary file modified docs/_static/example_pa_04.pdf
Binary file not shown.
Binary file modified docs/_static/merge_ologram_stats_01.pdf
Binary file not shown.
Binary file modified docs/_static/treeified.pdf
Binary file not shown.
8 changes: 4 additions & 4 deletions docs/about.html
@@ -16,7 +16,7 @@
var s = document.getElementsByTagName('script')[0]; s.parentNode.insertBefore(ga, s);
})();
</script>
<title>Warning about supported GTF file formats &#8212; gtftk 1.2.1 documentation</title>
<title>Warning about supported GTF file formats &#8212; gtftk 1.2.3 documentation</title>
<link rel="stylesheet" href="_static/nature.css" type="text/css" />
<link rel="stylesheet" href="_static/pygments.css" type="text/css" />
<script id="documentation_options" data-url_root="./" src="_static/documentation_options.js"></script>
@@ -45,7 +45,7 @@ <h3>Navigation</h3>
<li class="right" >
<a href="index.html" title="Welcome to pygtftk documentation page"
accesskey="P">previous</a> |</li>
<li class="nav-item nav-item-0"><a href="index.html">gtftk 1.2.1 documentation</a> &#187;</li>
<li class="nav-item nav-item-0"><a href="index.html">gtftk 1.2.3 documentation</a> &#187;</li>
<li class="nav-item nav-item-this"><a href="">Warning about supported GTF file formats</a></li>
</ul>
</div>
@@ -202,13 +202,13 @@ <h3>Navigation</h3>
<li class="right" >
<a href="index.html" title="Welcome to pygtftk documentation page"
>previous</a> |</li>
<li class="nav-item nav-item-0"><a href="index.html">gtftk 1.2.1 documentation</a> &#187;</li>
<li class="nav-item nav-item-0"><a href="index.html">gtftk 1.2.3 documentation</a> &#187;</li>
<li class="nav-item nav-item-this"><a href="">Warning about supported GTF file formats</a></li>
</ul>
</div>
<div class="footer" role="contentinfo">
&#169; Copyright 2018, F. Lopez and D. Puthier.
Last updated on Sep 16, 2020.
Last updated on Oct 08, 2020.
Created using <a href="https://www.sphinx-doc.org/">Sphinx</a> 3.2.1.
</div>
</body>
8 changes: 4 additions & 4 deletions docs/annotation.html
@@ -16,7 +16,7 @@
var s = document.getElementsByTagName('script')[0]; s.parentNode.insertBefore(ga, s);
})();
</script>
<title>Commands from section ‘annotation’ &#8212; gtftk 1.2.1 documentation</title>
<title>Commands from section ‘annotation’ &#8212; gtftk 1.2.3 documentation</title>
<link rel="stylesheet" href="_static/nature.css" type="text/css" />
<link rel="stylesheet" href="_static/pygments.css" type="text/css" />
<script id="documentation_options" data-url_root="./" src="_static/documentation_options.js"></script>
@@ -45,7 +45,7 @@ <h3>Navigation</h3>
<li class="right" >
<a href="conversion.html" title="Commands from section ‘conversion’"
accesskey="P">previous</a> |</li>
<li class="nav-item nav-item-0"><a href="index.html">gtftk 1.2.1 documentation</a> &#187;</li>
<li class="nav-item nav-item-0"><a href="index.html">gtftk 1.2.3 documentation</a> &#187;</li>
<li class="nav-item nav-item-this"><a href="">Commands from section ‘annotation’</a></li>
</ul>
</div>
@@ -477,13 +477,13 @@ <h3>Navigation</h3>
<li class="right" >
<a href="conversion.html" title="Commands from section ‘conversion’"
>previous</a> |</li>
<li class="nav-item nav-item-0"><a href="index.html">gtftk 1.2.1 documentation</a> &#187;</li>
<li class="nav-item nav-item-0"><a href="index.html">gtftk 1.2.3 documentation</a> &#187;</li>
<li class="nav-item nav-item-this"><a href="">Commands from section ‘annotation’</a></li>
</ul>
</div>
<div class="footer" role="contentinfo">
&#169; Copyright 2018, F. Lopez and D. Puthier.
Last updated on Sep 16, 2020.
Last updated on Oct 08, 2020.
Created using <a href="https://www.sphinx-doc.org/">Sphinx</a> 3.2.1.
</div>
</body>