10 Jun 06:49

Bribak

683d731

v1.3.0 Latest

Change Log

For Version 1.3.0

Added get_heatmap to the glycoworkGUI
Added an “About” tab to the glycoworkGUI, describing the glycowork version that it is running and pointers to the reference and documentation
Added get_lectin_array to the glycoworkGUI
Added a progress bar to lengthier operations in the glycoworkGUI
Reduced filesize of glycoworkGUI by ~20% and filesize of glycowork by >80%
Removed inplace operations from pandas functions, because of PDEP-8
PyTorch (torch) is no longer a mandatory requirement for base glycowork. It has been moved to the setup requirements of the optional glycowork[ml] install. Trying to do machine learning without that install will raise an appropriate ImportError
gdown is now a mandatory requirement for glycowork, to support hosting larger files outside the package itself

glycan_data

Updated glycan_binding by averaging results from duplicate sequences with different formatting
Added processed example glycomics datasets that are available via loader.glycomics_data_loader
Added processed example lectin array datasets that are available via loader.lectin_array_data_loader
Added a bit of fuzziness to the motifs in motif_list to allow for broader capture (e.g., “GalOS” instead of “Gal6S” when appropriate, or “Sia” instead of “Neu5Ac”)
Fixed the definition of Internal_LacNAc_type1 in motif_list

loader

Added glycomics_data_loader as an object for requesting glycomics data. Use dir(glycomics_data_loader) for displaying available glycomics datasets, and then request them via glycomics_data_loader.XXX (same goes for lectin array data, which is requestable via lectin_array_data_loader)
Added human_skin_O_PMC5871710, human_skin_O_PMC5871710_BCC, human_skin_O_PMC5871710_SCC, human_colorectal_O_PMC9254241, human_colorectal_N_PMID26085185, human_colorectal_O_PMID19152289, human_gastric_O_PMC4816881, human_gastric_O_PMID28461410, human_gastric_O_PMC5762837, human_gastric_O_PMC7226152, human_liver_O_PMC9254241, human_liver_O_PMC5383776, human_ovarian_O_PMC4468167, human_prostate_O_PMC8010466, human_prostate_N_PMC8010466, human_retina_GSL_PMC5173345, human_leukemia_O_PMID34646384, human_leukemia_N_PMID34646384, HIV_gagtransfection_N_PMID35112714, HIV_gagtransfection_O_PMID35112714, time_series_N_PMID32149347, human_brain_GSL_PMID38343116, human_brain_N_PMID38343116, human_brain_O_PMID38343116, human_platelets_O_PMID36952551, human_platelets_N_PMID36952551, human_serum_bacteremia_N_PMID33535571, time_series_HMO_PMID22649065, and time_series_O_PMID32149347 as datasets for glycomics_data_loader
Added A549_influenza_PMID33046650 and HEK_XBP1_PMID30305426 as datasets for lectin_array_data_loader
Added lectin_specificity as a resource for documented lectin specificities for lectin array analysis
Switched glycan_binding, df_species, and df_glycan to lazy loading to speed up package import
Added strip_suffixes to strip a column of string values of suffixes such as “.1”, “.2” that pandas may assign to duplicate columns
Added download_model to download hosted large files, such as model weights, when needed
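
To illustrate the kind of clean-up that strip_suffixes performs, here is a minimal sketch in plain Python. The helper name strip_duplicate_suffix and the regular expression are assumptions for illustration only; this is not the glycowork implementation, which operates on a dataframe column.

```python
import re

# Hypothetical helper, not glycowork's strip_suffixes:
# removes a trailing ".<digits>" such as ".1" or ".2".
_SUFFIX = re.compile(r"\.\d+$")

def strip_duplicate_suffix(values):
    """Return the values as strings with any trailing '.1', '.2', ... removed."""
    return [_SUFFIX.sub("", str(v)) for v in values]

# Example: pandas-style names for duplicated glycan columns
columns = [
    "Neu5Ac(a2-3)Gal(b1-3)GalNAc",
    "Neu5Ac(a2-3)Gal(b1-3)GalNAc.1",
    "Gal(b1-3)GalNAc.2",
]
cleaned = strip_duplicate_suffix(columns)
```

Anchoring the pattern at the end of the string keeps dots that occur elsewhere in a name untouched.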

stats

Fixed an issue in test_inter_vs_intra_group in which mean values were not correctly broadcast if “paired = False” and “grouped_BH = True”
Added get_equivalence_test to test for significant equivalence of group means via two one-sided t-tests
Added clr_transformation for the center log ratio transformation of a glycomics dataframe with the addition of scale uncertainty via a gamma parameter (see for instance https://arxiv.org/abs/2201.03616 for the theory behind this)
For impute_and_normalize, the default value for “min_samples” has been changed to 0.1, which now means that at least 10% of the samples (rounded down) need to be non-zero for a glycan to be retained. Further, features for which one group only has zero values will now be imputed with 1e-5 to avoid erroneous homogenization of effects by MissForest
Changed the “min_feature_variance” default from 0.01 to 0.02 in variance_based_filtering; the function now also returns the discarded rows as a second output
Added replace_outliers_winsorization to cap outliers via Winsorization
Fixed numpy random seed to 0
Added anosim for ANOSIM (Analysis of similarities) for the beta-diversity calculation in get_biodiversity
Added alpha_biodiversity_stats for performing an ANOVA on alpha diversity metrics, if groups > 2 in get_biodiversity
Fixed a warning if the standard deviation of a paired sample in cohen_d was exactly zero
Added calculate_permanova_stat and permanova_with_permutation for PERMANOVA (Permutational multivariate analysis of variance) for the beta-diversity calculation in get_biodiversity
Added alr_transformation, get_procrustes_scores, and get_additive_logratio_transformation to find the ALR reference component and perform the ALR transformation for compositional data analysis
Added correct_multiple_testing to centralize multiple testing correction and also add a warning if >90% of features are significant (in which case, Bonferroni correction will be applied to make results more conservative)
Raised tolerance of MissForest from 1e-6 to 1e-5 (as it’s applied to the sum of differences, it’s still very conservative)
Added omega_squared to calculate Omega squared, as an effect size for ANOVA-type analyses
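
As background for clr_transformation, the plain center log ratio transformation of a single sample can be sketched as follows. This is a simplified illustration under stated assumptions: it uses the natural logarithm, requires strictly positive abundances, and deliberately omits the gamma-based scale uncertainty described above. The function name clr is hypothetical and not the glycowork API.

```python
import math

def clr(values):
    """Centered log-ratio transform of one sample (hypothetical helper).

    Assumes strictly positive abundances; no scale uncertainty is added.
    """
    logs = [math.log(v) for v in values]
    mean_log = sum(logs) / len(logs)  # log of the geometric mean
    return [x - mean_log for x in logs]

sample = [50.0, 30.0, 20.0]            # relative abundances of three glycans
transformed = clr(sample)
rescaled = clr([5.0, 3.0, 2.0])        # same composition at a different total
```

The transformed values sum to zero and do not depend on the total signal, which is why the transformation suits compositional data such as relative abundances.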

motif

analysis

Changed get_differential_expression to only call TST_grouped_benjamini_hochberg if “grouped_BH = True”; otherwise it defaults to the scipy two-stage Benjamini-Hochberg procedure
get_differential_expression now also outputs equivalence tests for all cases in which the uncorrected p-value is above 0.05
get_differential_expression, get_glycanova, get_time_series, and get_jtk now will internally CLR- or ALR-transform input glycomics data to appropriately handle compositional data. These functions also newly accept a “gamma” keyword argument to tune the scale uncertainty for lowering the potential for false-positives
get_heatmap will now automatically transpose the input dataframe if it has been provided in the wrong orientation
Added the “transform” keyword argument to get_heatmap, to optionally CLR/ALR-transform the input data by setting ‘transform = “CLR”’ or ‘transform = “ALR”’
The “transform” keyword argument also exists in most other analysis functions and accepts “ALR” and “CLR”, if users wish to override the automatically inferred type of transformation (“Nothing” is accepted for not transforming data at all but this is not recommended in most circumstances)
Changed multiple testing correction to two-stage Benjamini-Hochberg, even if no grouped Benjamini-Hochberg test is being done
Also changed the “min_samples” default to 0.1 in get_differential_expression and other functions
Changed all analysis functions to use Winsorization (glycan_data.stats.replace_outliers_winsorization) instead of IQR capping (glycan_data.stats.replace_outliers_with_IQR_bounds) for outlier treatment
Added get_SparCC to perform SparCC (Sparse Correlations for Compositional Data) to find pairwise associations of glycan sequences, or motifs, between two glycomics datasets, with the typical interface of .analysis functions (note that you can also use a glycomics dataset together with, e.g., a metagenomics dataset, even if “motifs=True” is set)
Removed outlier treatment in get_pvals_motifs to avoid removing actual effects of effect-sparse glycan array data
Added beta-diversity measures (via Euclidean distance on CLR/ALR-transformed data) to get_biodiversity. This function now operates on a shopping cart principle, similar to “feature_set” in the annotation functions. The “metrics” shopping cart currently has “alpha” and “beta” as options. Beta-diversity is tested via ANOSIM (e.g., differences in central tendencies) and PERMANOVA (e.g., variations in dispersions between groups)
In get_heatmap a correct color mapping (ascending or contrastive) is now automatically chosen and applied depending on whether negative values are absent or present in the input data, respectively (transform=“CLR” will introduce negative values in the data and trigger contrastive coloring)
Added the “custom_scale” keyword argument to get_differential_expression, get_glycanova, get_biodiversity, and get_time_series. Only use it if you know what you’re doing. Basically, if you know that the total amount of glycans goes up/down in your condition of interest (in the condition, not in the measurement), then provide the ratio of glycan signal as group2/group1 and that will be used for an informed scale model, as described in https://www.biorxiv.org/content/10.1101/2024.04.01.587602v1 . Alternatively, if you have more than two groups, “custom_scale” can be provided as a dictionary of type: group idx : mean(group)/min(mean(groups)). [In all these cases, “gamma” becomes a parameter describing experimental error in measuring this glycan signal]
In get_volcano the default for “x_thresh” has been changed to 0 (post-hoc filtering of results by fold-change invalidates the FDR guarantee) and a new “n” keyword argument exists to provide the sample size for applying a get_alphaN-calculated alpha threshold
Added get_roc to calculate ROC AUC scores for all features and, optionally, plot the ROC curve of the best feature. Also works in multi-group mode (i.e., best feature to distinguish class A from all other classes) and can use “custom_scale”
Added get_lectin_array to analyze lectin array data to find out what kind of glycan motifs are increasing/decreasing between conditions
Added an optional number of keyword arguments to get_volcano that get directly passed onto the seaborn scatterplot function (**kwargs)
Added the “r...
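
Several analysis functions listed above now cap outliers via Winsorization instead of IQR bounds. The idea can be sketched in a few lines of plain Python. The helper name winsorize, the symmetric 5% default, and the rank-based cut-offs are assumptions for illustration; the actual replace_outliers_winsorization in glycowork may choose its limits differently.

```python
def winsorize(values, limit=0.05):
    """Cap the lowest and highest `limit` fraction of values (hypothetical helper).

    Values beyond the cut-offs are replaced by the nearest retained value,
    so the sample size stays the same.
    """
    n = len(values)
    k = int(limit * n)                  # number of values capped on each side
    if k == 0:
        return list(values)
    ordered = sorted(values)
    low, high = ordered[k], ordered[n - k - 1]
    return [min(max(v, low), high) for v in values]

abundances = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
capped = winsorize(abundances, limit=0.1)
```

Unlike trimming, Winsorization keeps every sample and only pulls extreme values toward the bulk of the distribution.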

15 Mar 14:03

Bribak

v1.2.0

30e64cf

Change Log

For Version 1.2.0

Added glycoworkGUI.py to build the .exe-based GUI for important glycowork endpoint functions: GlycoDraw, plot_glycans_excel, and get_differential_expression
Removed python-louvain as a required dependency for glycowork

glycan_data

loader

Switched from pkg_resources to importlib for loading tabular data into the package

stats

Fixed an issue in TST_grouped_benjamini_hochberg that caused errors if nothing was significantly different in the entire dataset or in any group
test_inter_vs_intra_group is now robust to non-paired data and data with differing sample sizes per condition
Added replace_outliers_with_IQR_bounds to support outlier treatment in motif.analysis
Added sequence_richness, shannon_diversity_index, and simpson_diversity_index to calculate diversity indices of glycomics data
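
For readers unfamiliar with these diversity indices, the following sketch shows one common way to compute them from the abundances of a single sample. The function names mirror the entries above but are re-implemented here in plain Python as an illustration; the choice of natural logarithm and of the Gini-Simpson form (1 minus the sum of squared proportions) is an assumption and may differ from the glycowork code.

```python
import math

def sequence_richness(abundances):
    """Number of glycans with non-zero abundance."""
    return sum(1 for a in abundances if a > 0)

def shannon_diversity_index(abundances):
    """Shannon entropy of the abundance proportions (natural log)."""
    total = sum(abundances)
    proportions = [a / total for a in abundances if a > 0]
    return -sum(p * math.log(p) for p in proportions)

def simpson_diversity_index(abundances):
    """Gini-Simpson index: 1 - sum of squared proportions."""
    total = sum(abundances)
    return 1 - sum((a / total) ** 2 for a in abundances)

even = [25, 25, 25, 25]        # four equally abundant glycans
skewed = [97, 1, 1, 1, 0]      # one dominant glycan, one absent
```

An even sample maximizes both indices for a given richness, whereas a sample dominated by one glycan scores close to zero.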

motif

processing

WURCS handling for universal input now encompasses more monosaccharides
GlycoCT handling for universal input now is robust to the declaration of substituents not immediately following their monosaccharide in the GlycoCT string
Added equal_repeats to check whether two repeating units of a polysaccharide are the same, just shifted
Modified glycan nomenclature detection in canonicalize_iupac to be less prone to overidentifying Oxford when it’s just numbers etc.
Added “ß” to the typo detection in canonicalize_iupac and “(-)” as a variation of linkage uncertainty detection
Made canonicalize_iupac robust to the variation of using {} instead of () for linkages

graph

Removed the required usage of lib in glycan_to_nxGraph, compare_glycans, subgraph_isomorphism, and all downstream functions (lib only remains for stemification and deep learning model training/inference)
The keyword argument “wildcards_ptm” now also works as intended when providing pre-calculated graphs as input to compare_glycans or subgraph_isomorphism
Fixed a rare issue in which subgraph_isomorphism, when “count = False”, would sometimes erroneously output “False” because of a greedy approach to evaluating potential matches

tokenization

Added get_unique_topologies to retrieve all base topologies for a given composition that have been observed for a given taxonomic subset
Added the “obfuscate_ptm” keyword argument to map_to_basic, to allow for mapping Gal6S to Hex6S rather than the default HexOS, if that is required/advantageous
Support mapping of phosphorylated glycans in map_to_basic

draw

Fixed an issue where cross-ring fragments were not correctly rendered in GlycoDraw
plot_glycans_excel can now also be used with filepaths to .xlsx files (in addition to .csv files)
plot_glycans_excel now also supports compact glycan drawing with the “compact” keyword argument
Improved drawing resolution in plot_glycans_excel
GlycoDraw will now more strongly make use of nomenclature canonicalization in case of IUPAC dialects (still not 100%, if you suspect you use a dialect of IUPAC, pass your sequences through canonicalize_iupac first)
If no filepath is specified, GlycoDraw will now also display drawn glycan structures in a non-Jupyter environment (as the classic matplotlib pop-up). Note that this functionality requires the cairosvg dependency (head to https://bojarlab.github.io/glycowork/examples.html#glycodraw-code-snippets if you’re unsure about that)

analysis

Functions able to use .csv paths as input can now also deal with .xlsx paths as input
The new “annotate_volcano” keyword argument now allows for the direct insertion of SNFG images within plots from get_volcano without having to subsequently run draw.annotate_figure
get_pvals_motifs, get_differential_expression, get_glycanova, get_time_series, and get_jtk now use glycan_data.stats.replace_outliers_with_IQR_bounds to auto-smooth outliers
Moved hotellings_t2 to glycan_data.stats
All functions compatible with motif-level analysis now accept the “custom_motifs” keyword argument to be passed to annotate_dataset or quantify_motifs if “custom” is included in “feature_set”
Changed the “mode” keyword argument in get_heatmap to “motifs” as a Boolean argument, like in all other motif.analysis functions
Added a call to clean_up_heatmap to get_jtk to avoid redundant motifs
Added get_biodiversity to compare two groups of glycomics datasets with regard to the sequence diversity that is present (similar to comparable analyses for microbiome data)

regex

Added filter_dealbreakers to allow for the exclusion of identified matches if they have illegal components beyond the identified match (e.g., the forbidden Fuc in "Fuc-([Gal|GalNAc])?-Gal-([!Fuc]){,1}-GlcNAc"). Before this, the sequence context except the Fuc was extracted and returned.
Fixed an edge case in filter_matches_by_location in which internal locations sometimes had to handle triple-nested lists which led to errors
get_match can now also use glycan graphs, such as derived from glycan_to_nxGraph, as input
Added get_match_batch to process a whole list of glycans at once, with some performance improvements via first pre-compiling the pattern
Fixed an edge case in get_match in which pattern components consisting of a single monosaccharide with a specified linkage (e.g., “Fuca3”) could sometimes erroneously output no matches
Added motif_to_regex to convert glycan motifs (e.g., in IUPAC-condensed) into a regular expression suitable for get_match. Limited to simple queries for now.

annotate

get_terminal_structures now has a “size” keyword argument with which users can control the size of the extracted terminal motifs
get_k_saccharides now has a “terminal” keyword argument with which users can filter to only count motifs at non-reducing ends
annotate_dataset and functions using it now can add the “terminal2” and “terminal3” option in “feature_set” to also annotate & analyze terminal motifs of size 2 (e.g., Neu5Ac(a2-3)Gal(b1-4)) or size 3 (e.g., Neu5Ac(a2-3)Gal(b1-4)GlcNAc)

network

biosynthesis

Added the possibility of providing abundances to construct_network that are then stored as node attributes in the network
Added add_high_man_removal as a post-processing step in construct_network to allow for the addition of reactions removing mannoses from high-Man N-glycans occurring during maturation
Added estimate_weights and get_edge_weight_by_abundance to estimate reaction capacities from abundances + estimate missing abundances
Added get_maximum_flow, get_max_flow_path, and get_reaction_flow to calculate maximum flow paths between network root and endpoints as well as aggregate the flow by reaction type
Added get_differential_biosynthesis as a wrapper function to compare two groups of glycomes/networks with regard to their biosynthesis (differential flow paths or differential reaction flows)
Fixed an issue in construct_network in which sometimes nodes with outgoing but no incoming connections were not detected as unconnected nodes, leading to incomplete networks
Added the rescue_glycans decorator to construct_network, to allow for auto-fixing nomenclature variations
Improved performance of construct_network by reducing wasteful computation

evolution

Switched get_communities from using python-louvain to the Louvain implementation in networkx

31 Jan 15:09

Bribak

v1.1.0

d7502e9

Change Log

glycan_data

Updated sugarbase database and all models

stats

Newly added module to glycowork
Moved all the statistics functions from motif.processing into this module: cohen_d, mahalanobis_distance, mahalanobis_variance, variance_stabilization, MissForest, impute_and_normalize, and variance_based_filtering
Added fast_two_sum, two_sum, expansion_sum, hlm, update_cf_for_m_n, jtkdist, jtkinit, jtkstat, and jtkx helper functions for JTK test
Added get_BF to calculate Jeffreys' approximate Bayes factor based on sample size and p-value
Added get_alphaN to calculate sample size-appropriate significance cut-offs informed by Bayesian statistics
Added pi0_tst and TST_grouped_benjamini_hochberg to perform a Two-Stage adaptive Benjamini-Hochberg procedure based on groups (e.g., https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3175141/ or https://www.biorxiv.org/content/10.1101/2024.01.13.575531v1)
Added test_inter_vs_intra_group to estimate intra- versus inter-group correlation with a mixed-effects model for groupings of glycans based on domain expertise
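
The two-stage and grouped procedures above build on the classic Benjamini-Hochberg step-up correction. As a point of reference, here is a minimal sketch of that classic one-stage procedure only; it is not TST_grouped_benjamini_hochberg, and the function name benjamini_hochberg is a hypothetical helper.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order.

    Classic one-stage procedure; the two-stage and grouped variants
    additionally estimate the proportion of true null hypotheses.
    """
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # indices by ascending p
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

raw = [0.01, 0.04, 0.03, 0.005]
corrected = benjamini_hochberg(raw)
```

Adaptive two-stage variants typically gain power by scaling these adjusted values with an estimate of the fraction of true nulls.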

motif

regex

Newly added module to glycowork
Added the get_match function and associated functions to implement a regular expression system for glycans. This allows for powerful queries to detect and extract motifs of arbitrary complexity.

processing

Moved cohen_d, mahalanobis_distance, mahalanobis_variance, variance_stabilization, MissForest, impute_and_normalize, and variance_based_filtering into glycan_data.stats to re-focus processing on processing glycan sequences
Extended canonicalize_composition to cases like ‘5_4_2_1’, ‘5421’, and ‘(Hex)2 (HexNAc)2 (Deoxyhexose)1 (NeuAc)2 + (Man)3(GlcNAc)2’
GlycoCT and WURCS handling for universal input now encompasses more monosaccharides and more modifications
Expanded oxford_to_iupac to handle more complex sequences, including sulfation, LacdiNAc, hybrid structures, extended Neu5Ac, complex fucosylation, more custom linkage specifications
enforce_class can now deal with free glycans regardless of whether they end in ‘-ol’ or not

annotate

annotate_dataset and downstream functions now accept a new keyword in “feature_set”, called “custom”. If “custom” is added to “feature_set”, a list of custom motifs can and must be added via the “custom_motifs” keyword argument. “custom” can be mixed and matched with all other keywords in “feature_set”
annotate_dataset now also accepts glyco-regular expressions via the “custom” keyword in “feature_set”. These expressions need to be added within the “custom_motifs” keyword argument and have to start with an “r”, such as "rHex-HexNAc-([Hex|Fuc]){1,2}-HexNAc". Normal motifs and glyco-regular expressions can be freely mixed within “custom_motifs”
Added group_glycans_core, group_glycans_sia_fuc, and group_glycans_N_glycan_type to group glycans by core structure (for O-glycans), Sia/Fuc/FucSia/Rest, or complex/hybrid/high-man/rest (for N-glycans)
Fixed a bug in get_k_saccharides, in which redundant columns were not always correctly removed

analysis

Added get_jtk to analyze circadian expression of glycans in temporal glycomics datasets using the Jonckheere–Terpstra–Kendall (JTK) algorithm, with the typical interface for motifs, imputation, etc., analogous to differential expression.
get_differential_expression, get_glycanova, and get_jtk now use get_alphaN to calculate a sample size-appropriate significance cut-off (see https://journals.sagepub.com/doi/10.1177/14761270231214429) and add a ‘significant’ column to the output to display whether the corrected p-values lie below this threshold
Added the “zscores” keyword argument to get_pvals_motifs; set “zscores” to False to have input data that are not yet z-score transformed be transformed by the function
For statistical calculations, get_pvals_motifs will now weigh the motif occurrences by z-score magnitude, rather than only using a cut-off for enrichment calculations
Added effect size calculations to get_pvals_motifs, which are also in the output, as Cohen’s d
Changed get_pvals_motifs such that now both enrichments and depletions will be tested (with depletions resulting in negative effect sizes)
Added select_grouping to find out which grouping of glycans has the highest intra- versus inter-group correlation, as estimated by glycan_data.stats.test_inter_vs_intra_group
When “motifs = False” and “grouped_BH = True”, get_differential_expression now tries to use the Two-Stage adaptive Benjamini-Hochberg procedure based on groups for multiple testing correction, if meaningful groups can be found in the glycans [note this makes everything at least one order of magnitude slower, though most datasets should still finish in a few seconds]
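
The effect sizes mentioned above are reported as Cohen’s d. A minimal sketch of the standard pooled-standard-deviation form for two independent groups follows. The function name cohens_d and the sign convention (first group minus second group) are assumptions for illustration, not the signature of glycan_data.stats.cohen_d.

```python
import math
from statistics import mean, variance

def cohens_d(group_a, group_b):
    """Cohen's d with a pooled standard deviation (hypothetical helper).

    Positive values mean group_a is higher; negative values mean it is lower.
    """
    n_a, n_b = len(group_a), len(group_b)
    pooled_var = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / math.sqrt(pooled_var)

enriched = [2.0, 4.0, 6.0]
background = [1.0, 3.0, 5.0]
d_enriched = cohens_d(enriched, background)
d_depleted = cohens_d(background, enriched)
```

Swapping the groups flips the sign, which matches the description that depletions result in negative effect sizes.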

draw

In GlycoDraw, the “highlight_motif” keyword argument can now use glyco-regular expressions in addition to regular motifs (just add a single ‘r’ before your glyco-regular expression to indicate that it is indeed a regular expression)
Added plot_glycans_excel to allow for the automated insertion of GlycoDraw SNFG pictures into an Excel file containing glycan sequences

graph

categorical_node_match_wildcard now uses string ID for matching, instead of integer ID, which means even two graphs, generated with two different libs, can now be successfully compared via compare_glycans or subgraph_isomorphism
compare_glycans and subgraph_isomorphism (and all functions using these functions) now support negation, by prepending “!”. For instance, “!Fuc(a1-?)Gal(b1-4)GlcNAc” will match subsequences that have a monosaccharide that is NOT Fuc before the Gal. It is highly recommended to generate your own lib via get_lib if you use negation, as monosaccharides such as !Fuc are not within lib and will cause indexing errors.
Added “?1-?” as another ultimate wildcard (promoting it from a strong narrow wildcard)
Fixed some cases where “Monosaccharide” was not treated as an ultimate wildcard in graph operations
Fixed an issue in graph_to_string in which glycans of size 1 (e.g., “GalNAc”) sometimes were missing their first character
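
The negation rule described above can be pictured at the level of a single node label. The sketch below is a toy comparison of one pattern token against one monosaccharide; the function name token_matches is hypothetical, and the real matching in glycowork happens on graphs via categorical_node_match_wildcard.

```python
def token_matches(pattern, monosaccharide):
    """Toy node comparison with '!' negation (hypothetical helper).

    '!Fuc' matches every monosaccharide except Fuc;
    any other pattern must match exactly.
    """
    if pattern.startswith("!"):
        return monosaccharide != pattern[1:]
    return monosaccharide == pattern

# "!Fuc" in front of Gal accepts Neu5Ac but rejects Fuc
accepts_sia = token_matches("!Fuc", "Neu5Ac")
rejects_fuc = token_matches("!Fuc", "Fuc")
```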

network

Updated pre-calculated biosynthetic networks for milk oligosaccharides

biosynthesis

Refactored find_diff to make networks compatible with the automated, dynamic wildcards (i.e., ? behave as they should and don’t necessarily cause over-branching of the network)
In highlight_network, the “motif” keyword argument can now use glyco-regular expressions in addition to regular motifs (just add a single ‘r’ before your glyco-regular expression to indicate that it is indeed a regular expression)

ml

model_training

In training_setup, upgraded the loss functions for all classification problems to PolyLoss with label smoothing (see https://arxiv.org/abs/2204.12511 for details).
In training_setup, number of classes (for multiclass or multilabel classification) can now be specified via the new “num_classes” keyword argument

05 Dec 10:25

Bribak

v1.0.1

40b04d7

Change Log

motif

processing

Slightly extended WURCS parsing in wurcs_to_iupac
Fixed an issue in choose_correct_isoform in which errors would be caused if the input list contained only duplicate glycans
Fixed an issue in choose_correct_isoform in which errors would be caused if the input list contained only glycans without branching

draw

Adapted cairosvg imports so that, even without cairosvg dependencies, users can plot glycans inline and export as .svg files (only export as .pdf and export of annotate_figure is still restricted to cairosvg)

network

biosynthesis

Fixed handling of empty outputs of choose_correct_isoform in construct_network

evolution

Fixed dictionary handling in get_communities

04 Dec 12:41

Bribak

v1.0.0

044e18d

Change Log

Added a Zenodo badge, to have a release-specific doi for glycowork

glycan_data

Updated sugarbase database; sugarbase is now pickled, so literal evaluations are necessary
Harmonized glycan column names across generated dataframes; all use ‘glycan’ now, ‘target’ has been deprecated

loader

Updated motif_list to be compatible with new position encoding
Added Internal_LewisX and Internal_LewisA to motif_list (renamed LewisX and LewisA to Terminal_LewisX and Terminal_LewisA, correspondingly)
Made df_species static again to speed up package import
Added find_nth_reverse helper function that finds the starting index of the nth occurrence of a substring from the end of the string
Added remove_unmatched_brackets helper function to strip unmatched opening or closing brackets from glycan strings
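
The two string helpers above are easy to picture with a short sketch. The functions below are re-implementations under assumptions (names, argument order, and the restriction to square brackets are illustrative), not the code shipped in glycowork.

```python
def find_nth_from_end(text, substring, n):
    """Start index of the nth occurrence of substring, counted from the end.

    Returns -1 if there are fewer than n (non-overlapping) occurrences.
    """
    position = len(text)
    for _ in range(n):
        position = text.rfind(substring, 0, position)
        if position == -1:
            return -1
    return position

def drop_unmatched_brackets(text, open_ch="[", close_ch="]"):
    """Remove opening or closing brackets that have no partner."""
    stack, unmatched = [], set()
    for i, ch in enumerate(text):
        if ch == open_ch:
            stack.append(i)
        elif ch == close_ch:
            if stack:
                stack.pop()
            else:
                unmatched.add(i)       # closing bracket without an opener
    unmatched.update(stack)            # openers that were never closed
    return "".join(ch for i, ch in enumerate(text) if i not in unmatched)

glycan = "Gal(b1-4)GlcNAc(b1-2)Man"
last_linkage = find_nth_from_end(glycan, "(", 1)
second_last_linkage = find_nth_from_end(glycan, "(", 2)
repaired = drop_unmatched_brackets("Gal(b1-4)[Fuc(a1-3)GlcNAc")
```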

motif

Added more masses to mz_to_composition.csv / mass_dict: Acetonitrile, Formate, Cl-, HCO3-, and NH4+

processing

Extended canonicalize_iupac to cases like "NeuGcα3Galβ3(NeuAcα6)GalNAcol" and even more modification formulations, e.g., “6S-GlcNAc”
Added canonicalize_composition to convert compositions formatted either in the style of HexNAc2Hex1Fuc3Neu5Ac1 or N2H1F3A1 into dictionaries used by glycowork
Added GalNAc4S to permitted reducing end monosaccharides for O-linked glycans in enforce_class
MissForest now has a maximum number of iterations and will check for convergence each iteration (immediately finishing upon converging), yielding some speed-ups in most cases
The output of min_process_glycans no longer contains empty strings for glycans ending in a linkage
Updated choose_correct_isoform to be compatible with change in min_process_glycans
Added get_possible_linkages to retrieve linkages matching a wildcarded linkage
Added get_possible_monosaccharides to retrieve monosaccharides matching a monosaccharide type (HexNAc, etc.)
Added decorators, rescue_glycans and rescue_compositions, to canonicalize them in case a decorated function errors out
Added linearcode_to_iupac to support LinearCode as input format for glycowork (this will be called within canonicalize_iupac and the decorators); note that for now coverage may not be perfect yet
Added iupac_extended_to_condensed to support IUPAC-extended as input format for glycowork (this will be called within canonicalize_iupac and the decorators); note that for now coverage may not be perfect yet
Added glycoct_to_iupac to support GlycoCT as input format for glycowork (this will be called within canonicalize_iupac and the decorators); note that for now coverage may not be perfect yet
Added wurcs_to_iupac to support WURCS as input format for glycowork (this will be called within canonicalize_iupac and the decorators); note that for now coverage may not be perfect yet
Added oxford_to_iupac to support Oxford as input format for glycowork (this will be called within canonicalize_iupac and the decorators); note that for now coverage is limited
check_nomenclature (formerly in motif.tokenization) now handles outputting warning messages for trying to use non-string, non-graph nomenclatures or SMILES with glycowork functions
Expanded find_isomorphs to generate more isomorphic sequence variants, thereby increasing the chances that choose_correct_isoform will have access to the canonical sequence
Fixed a rare issue with canonicalize_iupac where sequences coming from structure_to_basic would sometimes be formatted incorrectly if they contained dHex
Fixed an issue in find_isomorphs in which double branches were not always correctly swapped
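
To make the two composition styles handled by canonicalize_composition concrete, here is a toy parser that turns either style into a dictionary. It is a sketch under assumptions: the function name parse_composition, the list of recognized monosaccharide names, and the one-letter mapping (for example F to Fuc) are illustrative choices, and the keys that glycowork itself uses may differ.

```python
import re

# Assumed vocabulary for this sketch only
LONG_NAMES = ["HexNAc", "Hex", "Fuc", "Neu5Ac", "Neu5Gc"]
SHORT_NAMES = {"H": "Hex", "N": "HexNAc", "F": "Fuc", "A": "Neu5Ac", "G": "Neu5Gc"}

# Longer names first so that "HexNAc" is not read as "Hex"
_LONG = re.compile(
    "(" + "|".join(sorted(LONG_NAMES, key=len, reverse=True)) + r")(\d+)"
)
_SHORT = re.compile(r"([HNFAG])(\d+)")

def parse_composition(text):
    """Parse 'HexNAc2Hex1Fuc3Neu5Ac1' or 'N2H1F3A1' into {name: count}."""
    if re.fullmatch(r"(?:[HNFAG]\d+)+", text):
        return {SHORT_NAMES[k]: int(v) for k, v in _SHORT.findall(text)}
    return {name: int(count) for name, count in _LONG.findall(text)}

long_style = parse_composition("HexNAc2Hex1Fuc3Neu5Ac1")
short_style = parse_composition("N2H1F3A1")
```

Both inputs describe the same composition, so under the assumed mapping they yield the same dictionary.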

analysis

get_heatmap no longer tries to convert data to relative abundances if negative values are detected in the input
All functions using dataframes as inputs in analysis can now also be used by providing full filepaths to the .csv file instead
Optimized some of the code for readability and speed (everything should be at least a bit faster now)

annotate

get_k_saccharides is now allowed to generate new dynamic motifs with tokens outside of lib (via expand_lib)
annotate_glycan and annotate_dataset now also support narrow wildcards
Fixed an issue in count_unique_subgraphs_of_size_k in which branched motifs were not always correctly formatted (i.e., opening/closing brackets)
get_k_saccharides now outputs dataframes with counts as default and can yield the old nested lists of motifs by setting the new keyword just_motifs to True
Fixed an edge case in which get_k_saccharides sometimes overcounted individual monosaccharides if their strings overlapped

graph

subgraph_isomorphism and compare_glycans now support using wildcards and position encoding at the same time. The extra keyword argument is now deprecated and the functions auto-detect whether anything has been specified in wildcards and/or termini_list
subgraph_isomorphism and compare_glycans now support automatically inferred narrow wildcards to allow for (i) matching linkages like a1-? to only specified linkages within that group (e.g., a1-3 but not b1-3 etc.) and (ii) matching monosaccharide types like HexNAc to only specified monosaccharides of that type (e.g., GlcNAc but not Glc, etc.)
The wildcard_list keyword argument in all graph & annotation functions is now deprecated as wildcards are inferred automatically via narrow wildcards and native full wildcards (?1-? and Monosaccharide)
subgraph_isomorphism now behaves as expected for testing motifs ending in linkages on glycans ending in linkages
subgraph_isomorphism can now return the matched subgraphs in the input glycan with the new return_matches keyword argument
glycan_to_nxGraph is now decorated with the rescue_glycans decorator, which auto-canonicalizes IUPAC strings if they are not in the format preferred by glycowork
Fixed mismatch of labels and string_labels in categorical_node_match_wildcard
Fixed an issue in subgraph_isomorphism in which, when using positional encoding, sometimes the mirror image of a motif was incorrectly captured if the termini aligned
termini_list within subgraph_isomorphism now only requires the specification of monosaccharide positions
Added expand_termini_list helper function to facilitate the expansion of monosaccharide-only termini_list into full termini_list behind the scenes
Added support for shorthand notation of position encoding, now either ‘terminal’ or ‘t’ will work
Improved handling of complex branching in graph_to_string; should be fewer unexpected translations now
Fixed an issue in graph_to_string in which induced subgraphs could cause errors due to unexpected or weirdly sorted node indices
Fixed an edge case in which the reducing end could be sometimes calculated as ‘internal’ when termini=’calc’ in glycan_to_nxGraph
Deprecated a duplicate character_to_label and string_to_labels
Deprecated categorical_termini_match; the functionality is now handled within categorical_node_match_wildcard
Deprecated the wildcards keyword argument from compare_glycans as this will now be detected internally, if wildcards are provided via wildcard_list
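
The narrow and full wildcards described above can be illustrated with a toy matcher for single linkages and monosaccharides. Everything in this sketch is an assumption made for illustration: the helper names, the small type table, and the character-wise comparison. The real logic in glycowork operates on graphs and uses its own, more complete mappings.

```python
# Assumed, deliberately incomplete mapping of monosaccharide types
TYPE_MEMBERS = {
    "HexNAc": {"GlcNAc", "GalNAc", "ManNAc"},
    "Hex": {"Glc", "Gal", "Man"},
}

def linkage_matches(pattern, linkage):
    """'a1-?' matches a1-3 or a1-6 but not b1-3; '?1-?' matches any linkage."""
    if pattern == "?1-?":
        return True
    return len(pattern) == len(linkage) and all(
        p in ("?", c) for p, c in zip(pattern, linkage)
    )

def monosaccharide_matches(pattern, monosaccharide):
    """'HexNAc' matches GlcNAc but not Glc; 'Monosaccharide' matches anything."""
    if pattern == "Monosaccharide":
        return True
    return monosaccharide == pattern or monosaccharide in TYPE_MEMBERS.get(pattern, set())

narrow_hit = linkage_matches("a1-?", "a1-3")
narrow_miss = linkage_matches("a1-?", "b1-3")
type_hit = monosaccharide_matches("HexNAc", "GlcNAc")
type_miss = monosaccharide_matches("HexNAc", "Glc")
```

Because these rules can be read off the pattern itself, no separate wildcard list has to be passed in.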

tokenization

Composition functions (e.g., composition_to_mass) are now decorated with rescue_compositions, which means that they can be used with compositions like “H3N2” (basically anything that canonicalize_composition can handle)
Deprecated character_to_label as it’s now handled within string_to_labels
Moved check_nomenclature into motif.processing
Optimized some of the code for readability and speed (most things should be at least a bit faster now)

draw

Support motif highlighting in GlycoDraw: by providing the highlight_motif keyword argument, motifs can be highlighted (everything else will be set to low opacity). Works with IUPAC-condensed motifs and named motifs from known
Support wildcards in motif highlighting with the highlight_wildcard_list keyword argument, for instance highlighting all Gal(?1-?)GlcNAc subunits (for Gal(b1-?)GlcNAc you don’t need highlight_wildcard_list, as narrow wildcards are handled automatically)
Support positional encoding in motif highlighting with the highlight_termini_list keyword argument, for instance highlighting all terminal, non-reducing end Gal(b1-?)GlcNAc subunits (yes, you can use both wildcards and positional encoding at the same time😊)
Support drawing of repeat structures (indicated by brackets and the number of repeats) via the new repeat keyword argument. Internal repeats can also be specified with the additional repeat_range keyword argument.
Optimized some of the code for readability and speed (most things should be at least a bit faster now)

network

biosynthesis

Optimized some of the code for readability and speed (everything should be up to 2x faster now)

evolution

Optimized some of the code for readability and speed (everything should be at least a bit faster now)

ml

Optimized some of the code for readability and speed (most things should be at least a bit faster now)


25 Oct 05:03

Bribak

v0.8.1-zenodo

3a5a537


No code changes at this point (0.9 is expected in December), but Zenodo requires a new release to mint a DOI


27 Aug 13:55

Bribak

v0.8.1

3a5a537


Change Log

For Version 0.8.1

motif

tokenization

Converted chars into a dict to match libr formatting
Updated constrain_prot to work with the change above

ml

models

Changed prep_model to load trained models onto the CPU if no GPU is available


03 Aug 08:59

Bribak

v0.8.0

a2edec5


Change Log

For Version 0.8.0

Linted the package with flake8
Increased code coverage
Added another optional extras install, [chem], including glyles, requests, and pubchempy

glycan_data

Changed lib to be a dict of type glycoletters:index, as it’s faster to index a dict vs. a long list; also adapted all functions using lib to reflect this change
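
The reasoning behind the dict-based lib can be illustrated with a toy vocabulary. This is a sketch under assumptions: the six glycoletters and the encode helper are made up for the example, and the real lib is much larger.

```python
# Toy vocabulary; the real lib contains many more glycoletters.
glycoletters = ["Fuc", "Gal", "GlcNAc", "Man", "a1-2", "b1-4"]

# dict of type glycoletter:index, built once.
lib = {letter: index for index, letter in enumerate(glycoletters)}


def encode(tokens, lib):
    """Look up each token's index; each dict lookup is O(1) on average,
    whereas list.index scans the list and is O(n) per token."""
    return [lib[token] for token in tokens]


encoded = encode(["Gal", "b1-4", "GlcNAc"], lib)
print(encoded)  # [1, 5, 2]
```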

loader

Added replace_every_second helper function
Updated linkages list
Changed linkages and Hex etc to be sets instead of lists

motif

processing

Added variance_stabilization for variance stabilization normalization, both globally and group-specific
Added in_lib helper function to check whether all glycoletters of glycan are in lib
Deprecated small_motif_find
cohen_d now also returns the variance of the effect size and supports paired samples as well (calculating Cohen’s dz in this case)
Added mahalanobis_distance to calculate Mahalanobis distance as an effect size for multivariate comparisons
Added mahalanobis_variance to estimate variance of Mahalanobis distance via bootstrapping
Added MissForest for random forest based data imputation
Cleaned up canonicalize_iupac and made it slightly faster
Added variance_based_filtering
Added impute_and_normalize and underlying helper functions
Fixed numpy random seed for reproducibility
Sped up presence_to_matrix
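
For orientation, the two effect sizes mentioned for cohen_d can be sketched with the standard library. This is not glycowork's implementation: the function name effect_size is hypothetical, and the variance formulas are common large-sample approximations that may differ from the estimator the package uses.

```python
from math import sqrt
from statistics import mean, stdev, variance


def effect_size(x, y, paired=False):
    """Return (effect size, approximate variance of the effect size).

    Unpaired: Cohen's d with a pooled standard deviation.
    Paired:   Cohen's dz, the mean difference over the SD of differences.
    """
    if paired:
        diffs = [a - b for a, b in zip(x, y)]
        n = len(diffs)
        dz = mean(diffs) / stdev(diffs)
        return dz, 1 / n + dz**2 / (2 * n)
    n1, n2 = len(x), len(y)
    pooled_sd = sqrt(
        ((n1 - 1) * variance(x) + (n2 - 1) * variance(y)) / (n1 + n2 - 2)
    )
    d = (mean(x) - mean(y)) / pooled_sd
    return d, (n1 + n2) / (n1 * n2) + d**2 / (2 * (n1 + n2))
```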

tokenization

Deprecated mz_to_composition
mz_to_composition2 is now the new mz_to_composition
Adapted mz_to_structures, compositions_to_structures, and match_composition_relaxed to work with this change

annotate

Added create_correlation_network to identify clusters of highly correlated glycans/motifs
Added count_unique_subgraphs_of_size_k as a helper function within get_k_saccharides
Refactored get_k_saccharides to be faster and more complete (and to be, effectively, a replacement for motif_matrix)
annotate_dataset now uses get_k_saccharides for mono- and disaccharides, instead of motif_matrix
Deprecated motif_matrix
annotate_dataset now also creates relevant ?-containing motifs if ‘terminal’ in feature_set, even if they don’t explicitly occur in the glycan strings
Big speed-up for annotate_dataset if known=True, as we now cache the precalculated motif graphs
Added quantify_motifs as a wrapper around annotate_dataset to adequately distribute relative abundances across extracted motifs
Deprecated estimate_lower_bound as speed-ups make it no longer necessary

analysis

Renamed make_heatmap to get_heatmap
Renamed make_volcano to get_volcano
Deprecated replace_zero_with_random_gaussian (this is now handled by MissForest in .processing within impute_and_normalize)
Added hotellings_t2 for multivariate comparisons
Changed multiple-testing correction method from Holm-Sidak to Benjamini-Hochberg
Added variance_stabilization in get_differential_expression
Added the option to analyze highly correlated sets of glycans/motifs (via create_correlation_network) within get_differential_expression
Implemented usage of hotellings_t2 and the Mahalanobis distance (as effect size) for usage if sets are analyzed within get_differential_expression
get_heatmap and get_differential_expression now scale abundances by the actual counts of motifs per glycan, not just absence/presence
Added get_meta_analysis to estimate combined effect sizes from the results of multiple studies (both fixed-effects and random-effects models can be estimated)
Added variance_based_filtering in get_differential_expression
Effect size variances can now also be retrieved within get_differential_expression via the effect_size_variance keyword argument
get_differential_expression now also can handle paired samples when paired=True
get_differential_expression now also tests the homogeneity of variances using Levene’s test in all settings (also multiple-testing controlled)
Added get_glycanova to use ANOVA-based analyses on glycomics datasets (uses basically all the improvements of get_differential_expression, including analysis on the motif level)
Added get_pca to plot glycomics data (also has the motif interface)
Added get_pval_distribution to plot the distribution of p-values
Added get_ma to plot a Bland-Altman plot
Added get_glycan_change_over_time to detect significant changes in time-course data via OLS fitting
Added get_time_series as a wrapper around get_glycan_change_over_time to do time series analyses, with all the motif & normalization functionality
Added get_coverage to visualize glycan expression across samples (ordered by average intensity) in a coverage plot
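
The Benjamini-Hochberg correction adopted above can be written in a few lines. The sketch below is a plain-Python illustration of the standard step-up procedure, not the code path glycowork uses internally; the function name benjamini_hochberg is chosen for this example.

```python
def benjamini_hochberg(pvals):
    """Return BH-adjusted p-values in the original order.

    Each p-value is scaled by m / rank (rank in ascending order), and a
    running minimum from the largest rank down enforces monotonicity.
    """
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

Compared with Holm-Sidak, which controls the family-wise error rate, this procedure controls the false discovery rate and is typically less conservative when many glycans or motifs are tested at once.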

draw

Added import warning if draw dependencies are not installed
Removed pycairo from dependencies
Modified annotate_figure to be compatible with .svg files from older Matplotlib versions
Changed “output” to “filepath” in GlycoDraw
If there are “?” in the provided filepath for GlycoDraw, they will now be automatically replaced with “_” to avoid saving errors
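
The filepath clean-up described above amounts to a character substitution, which matters because glycans with unknown linkages contain “?” and that character is not allowed in file names on some systems. A minimal sketch, with the helper name safe_filepath being hypothetical:

```python
def safe_filepath(filepath):
    """Replace '?' (common in glycans with unknown linkages) with '_'."""
    return filepath.replace("?", "_")


print(safe_filepath("Gal(b1-?)GlcNAc.svg"))  # Gal(b1-_)GlcNAc.svg
```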

graph

Sped up glycan_to_graph/glycan_to_nxGraph (and all downstream functions, of which there are many)
Independently of these advances, also improved the runtime of downstream functions such as subgraph_isomorphism
subgraph_isomorphism now also accepts precalculated motif graphs as inputs (in addition to the already supported precalculated glycan graphs)

ml

Rephrased import warnings to reflect optional install strategy for extra dependencies

model_training

Sped up train_ml_model

network

biosynthesis

create_neighbors no longer uses the libr keyword


20 May 05:05

Bribak

v0.7.0

14a52ad


Change Log

For Version 0.7.0

Removed support for Python 3.7; as we use the walrus operator in some of the re-worked functions, Python 3.8+ is now required to use glycowork
Added optional installs for specialized glycowork usage (‘all’, ‘ml’, and ‘draw’; for now), which install additional dependencies for these usages; more details in docs
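
For readers unfamiliar with it, the walrus operator (:=) assigns a value inside an expression, which is why Python 3.8+ is needed. The snippet below is only an illustration of the syntax; the function first_linkage and its regular expression are made up for this example and are not taken from glycowork.

```python
import re


def first_linkage(glycan):
    """Return the first linkage such as 'b1-4' in a glycan string, or None."""
    # The match object is assigned and tested in a single expression.
    if (match := re.search(r"\(([ab?]\d-[\d?])\)", glycan)):
        return match.group(1)
    return None


print(first_linkage("Gal(b1-4)GlcNAc"))  # b1-4
```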

glycan_data

Updated datasets, models, lib to be bigger & better; removed many sequence duplicates with differently written branch orderings

loader

Added multireplace helper function, to map a dictionary of changes to a string
Made build_custom_df faster
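
Mapping a dictionary of changes onto a string, as multireplace does, can be sketched as a loop over str.replace. This is an assumption about the general idea rather than the package's exact code, and the name apply_replacements is hypothetical.

```python
def apply_replacements(string, replacements):
    """Apply each old -> new substitution in dictionary order.

    Replacements run sequentially, so an earlier substitution can affect
    what later ones see; order the dictionary accordingly.
    """
    for old, new in replacements.items():
        string = string.replace(old, new)
    return string


print(apply_replacements("Gal(b1-4)GlcNAc", {"Gal": "Hex", "GlcNAc": "HexNAc"}))
# Hex(b1-4)HexNAc
```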

motif

draw

Added draw as a new submodule of .motif
Added GlycoDraw to draw glycans in SNFG style and save them as .svg/.pdf
Added annotate_figure to replace glycan text with glycan images in .svg figures (heatmaps, volcano plots, etc.)
Added text_to_glycan, which replaces glycan strings in figures with glycan images
Added scale_in_range to normalize a list of numbers within a range
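
Normalizing a list of numbers within a range is usually done by min-max scaling. The sketch below shows that idea with a hypothetical helper named rescale; the signature and the handling of constant input are assumptions and may differ from scale_in_range.

```python
def rescale(values, low, high):
    """Linearly map values so that min(values) -> low and max(values) -> high."""
    v_min, v_max = min(values), max(values)
    if v_min == v_max:
        # Constant input has no spread; map everything to the lower bound.
        return [float(low) for _ in values]
    span = v_max - v_min
    return [low + (v - v_min) * (high - low) / span for v in values]


print(rescale([0, 5, 10], 0, 1))  # [0.0, 0.5, 1.0]
```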

tokenization

Sped up glycan_to_composition by 1000x (avoiding explicit stemification and just doing stemification of the building blocks); also speeds up all functions using glycan_to_composition
Sped up composition_to_mass (independent of the above)
glycan_to_composition (and downstream functions) now can handle more post-biosynthetic modifications: Ac, PCho, PEtN
Renamed calculate_theoretical_mass to glycan_to_mass
Sped up mz_to_composition2 by (i) filtering out duplicate compositions and (ii) selecting compositions from a chosen taxonomic kingdom
Reprioritized mz_to_composition2 by first searching for native compositions and only then looking for compositions + adducts and only then searching for doubly-charged compositions
canonicalize_iupac now also handles floating substituents and can handle many more typos / inconsistencies / IUPAC dialects (such as CFG-coded glycans), including improvements made by Kathryn Klarich
Moved canonicalize_iupac into motif.processing
Expanded get_core (and downstream functions) with HexA, HexNAc, dHex
Expanded map_to_basic to (some) post-biosynthetic modifications
mz_to_structures no longer outright fails if no m/z value can be matched
Deprecated structures_to_motifs; annotate_dataset can do the same

processing

Fixed bug in processing glycans with floating substituents in small_motif_find
Deprecated seed_wildcard
choose_correct_isoform has been updated to keep up with the improved find_isomorphs
Added more informative error message to IUPAC_to_SMILES
get_lib is now slightly faster

graph

Sped up compare_glycans with string inputs, by avoiding graph operations when the two glycans do not have the same composition
Added support for enabling modification wildcards in compare_glycans and subgraph_isomorphism (for instance matching GalOS and Gal6S) by setting wildcards_ptm = True
Sped up glycan_to_nxGraph_int by optimizing node label/attribute assignments
Refactored graph_to_string to be a lot more robust, streamlined, and faster. Its new integration with canonicalize_iupac may also result in string improvement upon back-translation (e.g., branch order canonicalization)
ensure_graph now has **kwargs that get passed to glycan_to_nxGraph
get_possible_topologies now supports internal additions as well, with the keyword argument ‘exhaustive’
possible_topology_check now supports wildcard matching via **kwargs passed on to compare_glycans
Made changes to make glycowork compatible with NetworkX 3.0
Moved bracket_removal to motif.processing
Fixed a small inconsistency in handling floating substituents in glycan_to_nxGraph_int that could have caused issues with custom libs
override_reducing_end is no longer needed in glycan_to_nxGraph to delineate linkage-ending glycans (e.g., Fuc(a1-2)); this is auto-inferred within glycan_to_nxGraph now
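
The composition shortcut described for compare_glycans can be pictured as a cheap pre-check that runs before any graph work. The sketch below is a simplified stand-in: rough_composition and may_match are hypothetical names, the string handling ignores wildcards and modifications, and the real function relies on glycowork's own composition logic.

```python
import re
from collections import Counter


def rough_composition(glycan):
    """Count monosaccharide tokens after stripping linkages and brackets."""
    without_linkages = re.sub(r"\([^)]*\)", " ", glycan)
    tokens = re.sub(r"[\[\]{}]", " ", without_linkages).split()
    return Counter(tokens)


def may_match(glycan_a, glycan_b):
    """Cheap necessary condition: identical glycans share a composition.

    Only if this returns True would an expensive graph isomorphism
    test need to run; False settles the comparison immediately.
    """
    return rough_composition(glycan_a) == rough_composition(glycan_b)
```

For example, may_match("Fuc(a1-2)Gal(b1-4)GlcNAc", "Gal(b1-4)GlcNAc") is False, so no graphs have to be built for that pair.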

annotate

Deprecated convert_to_counts_glycoletter and glycoletter_count_matrix; motif_matrix can do both
Refactored motif_matrix to be substantially faster and more condensed in its output (also speeds up annotate_dataset with the ‘exhaustive’ option in the feature_set argument)
Expanded motif_matrix to implicitly test for subsumption enrichment (e.g., previously we only explicitly looked for “Gal(b1-?)GlcNAc”; now we also count “Gal(b1-4)GlcNAc” toward the former)
annotate_glycan is now dual-compatible with string and networkx graph input
Expanded feature_set in annotate_dataset by the option ‘terminal’, which calls get_terminal_structures
This usage of get_terminal_structures in annotate_dataset now also does the same implicit test for subsumption enrichment as described for motif_matrix above
annotate_dataset now creates its own lib, based on the motif list and the provided glycans
Expanded find_isomorphs to also be able to re-shuffle (some) branched branches
Moved find_isomorphs into motif.processing
Linkages-only are no longer considered by motif_matrix / annotate_dataset

analysis

All functions with the feature_set keyword argument now can also use the ‘terminal’ keyword for analyzing non-reducing end motifs exclusively
Added get_differential_expression to compare glycomics data, including data cleaning and imputation
get_pvals_motifs and make_heatmap no longer have the lib keyword argument, as annotate_dataset will generate a suitable lib internally
Fixed relative abundance summation in motif-mode for make_heatmap
Added the clean_up_heatmap helper function to remove redundant (i.e., identical) rows in heatmaps, with a prioritization of named motifs and longer motifs containing redundant shorter motifs
Added make_volcano, to generate a volcano plot from internally calculated differential expression using the get_differential_expression function
Moved cohen_d into motif.processing

ml

model_training

train_ml_model no longer has the lib keyword argument, as annotate_dataset will generate a suitable lib internally

network

biosynthesis

Refactored construct_network pipeline to be faster and more memory-efficient
reducing_end has been deprecated and is being handled internally
Added infer_roots to auto-infer permitted_roots (also does not need to be specified any longer in construct_network)
Implemented distance limit, to prevent combinatorial explosion when outlier glycans are present
Deprecated subgraph_to_string and make_network_from_edges
Deprecated fill_with_virtuals and make_network_directed
Minor speed-up of process_ptm, by pre-calculating stem_lib once instead of for every glycan in network


09 Dec 13:06

Bribak

v0.6.0

1975edc


Change Log

For Version 0.6.0

Updated nbdev1 to nbdev2
Updated documentation notebooks
Expanded documentation examples for (i) networks and (ii) deep learning models

glycan_data

Updated v7_sugarbase and associated files + models
Improved Cellosaurus ID prefixes
Added glycan composition as a new column to sugarbase
Exchanged ‘z’ with ‘?’ as a linkage uncertainty indicator
Added protein column to glycan_binding, indicating the protein name whose sequence is in the target column

loader

Added “Ins” and “Galf” to Hex list
Added stringify_dict utils function to convert a dictionary into a string

motif

Changed functions to use “?” as a linkage uncertainty indicator rather than “z”

processing

Added enforce_class to check whether glycan is from desired glycan class
Added IUPAC_to_SMILES to convert glycans from IUPAC-condensed into SMILES via GlyLES

graph

glycan_to_nxGraph can now use glycan strings with floating substituents, such as “{Neu5Ac(a2-3)}Gal(b1-4)GlcNAc(b1-6)[Gal(b1-3)]GalNAc”
Added get_possible_topologies and possible_topology_check to probe whether glycans (could) match a glycan with floating substituents
Added ensure_graph to allow functions to be dual-compatible for string & graph inputs
generate_graph_features, largest_subgraph, get_possible_topologies, and possible_topology_check are now dual-compatible with string & graph inputs
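
The curly-brace notation for floating substituents can be separated from the rest of the glycan with a regular expression. This is a minimal sketch of the notation only; split_floating is a hypothetical helper and does not reflect how glycan_to_nxGraph processes such strings internally.

```python
import re


def split_floating(glycan):
    """Return (floating parts, remaining glycan) for a '{...}' annotated string."""
    floating = re.findall(r"\{([^}]*)\}", glycan)
    core = re.sub(r"\{[^}]*\}", "", glycan)
    return floating, core


floating, core = split_floating(
    "{Neu5Ac(a2-3)}Gal(b1-4)GlcNAc(b1-6)[Gal(b1-3)]GalNAc"
)
print(floating)  # ['Neu5Ac(a2-3)']
print(core)      # Gal(b1-4)GlcNAc(b1-6)[Gal(b1-3)]GalNAc
```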

tokenization

Refactored match_composition_relaxed to be slightly faster and much smaller; it now uses glycan_to_composition for matching
Deprecated match_composition accordingly
mz_to_composition is now up to 100x faster, based on much better defaults / assumptions
Added support for free oligosaccharides to mz_to_composition
Added mz_to_composition2 as an alternative way of composition matching; better scaling and “more physiological” as it’s constrained by class-specific existing compositions within sugarbase
glycan_to_composition can now also handle post-biosynthetic modifications such as sulfation
Added composition_to_mass
Improved linkage uncertainty handling in canonicalize_iupac
canonicalize_iupac now can handle sulfation and phosphorylation
Updated stemify_glycan & structure_to_basic to correctly handle glycans of length 1
Updated stemify_glycan to terminate its while loop if it would otherwise run indefinitely
Updated glycan_to_composition to support floating substituents
get_core now also handles “Ins” correctly
calculate_theoretical_mass now can also handle methylation modifications correctly
Improved reducing end calculation for modified glycans in calculate_theoretical_mass
Added speed-up option to calculate_theoretical_mass & glycan_to_composition for non-exotic glycans
Refactored calculate_theoretical_mass to use composition_to_mass

annotate

Added get_terminal_structures to extract monosaccharide+linkage from all non-reducing ends of a glycan
Improved runtime and completeness for get_k_saccharides
get_terminal_structures & get_k_saccharides are now also both dual-compatible with string & graph inputs
Added get_molecular_properties to obtain chemical features of glycans via SMILES
‘chemical’ is a new option in feature_set of annotate_dataset, using get_molecular_properties
Small style fix in motif_matrix to avoid a warning
link_find (and downstream annotation functions) now also supports floating substituents

analysis

Added cohen_d to calculate effect size between two comparison groups
‘chemical’ is a new option in feature_set of get_pvals_motifs and make_heatmap, using get_molecular_properties

ml

model_training

Added the option to use GSAM instead of SAM for the optimizer by specifying alpha in training_setup

models

Streamlined SweetNet architecture (credit to David Alexander) used in SweetNet and LectinOracle, for faster training and clearer code

network

biosynthesis

Added a dictionary of pre-calculated glycan graphs to construct_network and underlying functions, for a ~2x speed-up and better scaling
Various other performance improvements to network construction functions further increase speed
Improved pruning of virtual root nodes in construct_network
Modified export_network to allow for custom node attribute extraction
Generalized find_diamonds to allow for extraction of diamonds, hexagons, etc. with a custom parameter nb_intermediates (default: 2, for diamonds)
Generalized choose_path to compute path probabilities for non-diamond shape motifs

evolution

Small fix in calculate_distance_matrix


Releases: BojarLab/glycowork
