
Releases: gyorilab/adeft

Adeft v0.12.3

10 May 02:54

What's Changed

  • Add Cython file needed to build to MANIFEST.in by @steppi in #78

Full Changelog: 0.11.2...0.12.3

0.11.2

05 Nov 14:28

This release adds support for Python 3.11.

0.11.1

21 May 22:13

This release updates the functions for downloading models and other resources from S3 to use boto3 instead of the unmaintained wget package. Downloads should now be more reliable.

0.11.0

07 May 19:28

This release fixes a bug that caused the grounding GUI to not work when adeft is pip-installed. The adeft folder for the pretrained models is now placed in a platform-specific user data folder by default rather than in a hidden folder in the user's home directory. Users are still able to override this default by setting the environment variable ADEFT_HOME. Tests have been updated to use pytest instead of nose.
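As a minimal sketch of overriding the default location, ADEFT_HOME can be set before adeft is imported; the path below is illustrative, setting the variable in the shell profile works just as well, and the exact point at which adeft reads it is an implementation detail.

```python
import os

# Illustrative custom location; assumed to be set before importing adeft,
# since the package is expected to read ADEFT_HOME when it loads.
os.environ["ADEFT_HOME"] = os.path.expanduser("~/custom_adeft_data")

import adeft  # models and resources should now be placed under the path above
```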

0.10.0

03 Dec 03:02

This release makes several changes concerning model statistics.

  1. The global precision, recall, and F1 scores for a classifier now use micro-averaging to aggregate the scores for different positive class labels, rather than taking an average weighted by the frequency of each positive label. Micro-averaging looks at global counts of true positives, false positives, and false negatives across all positive labels. A true positive is a datapoint with a positive label that is classified correctly. A false positive is a datapoint that is incorrectly assigned one of the positive labels. A false negative is a datapoint with a positive label that is classified incorrectly. Note that a misclassification between two positive labels counts as both a false positive and a false negative, so the two sets can overlap. Micro-averaging is easier to reason about and interpret, and using it allows for some simplification of the implementation in other places. The original decision to use the weighted average was made with little thought at a time when we were making less use of model statistics.

  2. A method has been added to adeft.disambiguate.AdeftDisambiguator that allows the set of positive labels to be updated while recomputing global model statistics; previously this required retraining the model. This is made possible by storing the entire label-vs-label confusion matrix for each CV fold when a model is trained and serializing it when the model is saved (see the sketch following this list).
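To make the micro-averaging concrete, the sketch below computes micro-averaged precision, recall, and F1 from a label-vs-label confusion matrix for a chosen set of positive labels. The function name, matrix layout, and toy numbers are illustrative and are not adeft's internal implementation.

```python
import numpy as np

def micro_stats(conf_mat, labels, positive_labels):
    """Compute micro-averaged precision, recall, and F1 over a chosen set
    of positive labels from a label-vs-label confusion matrix.

    conf_mat[i, j] counts datapoints with true label labels[i] that were
    predicted as labels[j]. This layout is illustrative only.
    """
    index = {label: i for i, label in enumerate(labels)}
    pos = [index[label] for label in positive_labels]
    tp = sum(conf_mat[i, i] for i in pos)
    fp = sum(conf_mat[:, j].sum() - conf_mat[j, j] for j in pos)
    fn = sum(conf_mat[i, :].sum() - conf_mat[i, i] for i in pos)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Toy example with one non-positive ("ungrounded") label and two positive labels.
labels = ['ungrounded', 'HGNC:6091', 'MESH:D011839']
conf = np.array([[50,  2,  1],
                 [ 3, 40,  2],
                 [ 1,  4, 30]])
print(micro_stats(conf, labels, ['HGNC:6091', 'MESH:D011839']))
```

Note that a confusion between the two positive labels contributes to both the false positive and false negative counts, which is the overlap mentioned in item 1.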

Bug fixes and smaller changes were also made:

  1. A bug was fixed that caused the labels in model statistics not to update when adeft.disambiguate.AdeftDisambiguator.modify_groundings was used to update groundings in a model.
  2. A bug was fixed that caused the labels attribute of an adeft.disambiguate.AdeftDisambiguator to not contain labels for which no defining pattern exists. (These labels are typically for texts manually curated in Entrez as mentioning a particular gene that has the shortform of interest as a synonym, even though the shortform is not used as an abbreviation in those texts.)
  3. A new attribute called other_metadata has been added to classifiers. Anything JSON-serializable stored in this attribute will be preserved when a model is serialized (see the sketch after this list). We are using it to store any information needed to retrain a model that does not fit into the existing attributes, which simplifies the retraining process.
  4. Some small updates have been made to the introductory Jupyter notebook.
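A minimal sketch of how the new other_metadata attribute might be used, assuming an already trained classifier object named classifier; the keys and values below are hypothetical.

```python
# Attach retraining provenance to a trained classifier. Anything that is
# JSON-serializable can be stored; these particular keys are hypothetical.
classifier.other_metadata = {
    "corpus_source": "example_curated_corpus",
    "text_identifiers": ["PMID:12345", "PMID:67890"],
}
# The dictionary is kept through model serialization, so it is available
# again after loading and can be used to drive a later retraining run.
```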

0.9.0

19 Nov 05:01

This release makes a number of improvements to the grounding GUI.

  1. Previously, actions such as deleting an entry or toggling a label as positive/negative would cause the scroll position and text entered into the input boxes to be lost. This made using the app tedious since the page would refresh to the top after each action, making it burdensome for example to delete many groundings or toggle many labels in sequence. This has been remedied.
  2. The input boxes at the top are now fixed in a sticky position making it unnecessary to scroll back and forth in order to select rows and then enter groundings. They now follow along as the user scrolls.
  3. Columns of the table are now sortable. The headers for each column are now buttons masquerading as links. Clicking each header will cause the rows to be sorted by that column. This is useful for example to aid in scanning for similar longforms or to group every row together that has the same grounding.
  4. The user may now pass in a CSV file of known groundings with rows of the form namespace, identifier, standard name (e.g. HGNC,6091,INSR); see the example after this list. It is then only necessary to enter the namespace and one of the identifier or standard name into the input boxes for any grounding that has a row in the supplied table.
  5. Entered groundings are now color coded, with one color for groundings where the standard name and identifier match a row in the supplied groundings CSV file, another color for groundings where the standard name and identifier do not match according to the table, and black if there are no rows in the table for the entered standard name and identifier. The colors have been chosen so that the contrast can hopefully be distinguished by most color-blind users; instead of the standard green for a match and red for a mismatch, approximations of these colors from the Wong color palette are used.
  6. Any rows given the grounding "ignore" will have their longforms dropped from the generated grounding map. These are displayed with a special color to highlight this special semantic role.
  7. Labels without a namespace will not appear in the column of labels which can be toggled as positive/negative.
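For illustration, a groundings CSV in the format described in item 4 might look like the following; the HGNC row comes from the example above, while the other rows are hypothetical and only meant to show the three-column layout.

```
HGNC,6091,INSR
CHEBI,CHEBI:15365,acetylsalicylic acid
FPLX,ERK,ERK
```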

These changes should make the GUI much more user-friendly and less tedious to use.

0.8.0

16 Nov 00:20

This release fixes several bugs and makes some small updates.

Fixes have been made for

  1. A bug in AdeftMiner.prune that broke this method but was undiscovered due to lack of testing. The bug has been fixed and a test has been added.
  2. Training adeft models throwing an error for the edge case where there are more than two labels with only one positive label.
  3. The longform scorer throwing an error when there are punctuation characters in the shortform.
  4. The GUI not working when the multiprocessing start method is set to spawn. This caused the GUI to fail on Windows, where fork is unavailable. This should resolve issue #49.
  5. The deprecated parameter iid has been removed from internal use of Scikit-learn's GridSearchCV, eliminating a deprecation warning.

The following other changes have been made

  1. AdeftLabeler now requires unique identifiers along with the texts passed into process_texts. Instead of a list of texts, the process_texts method now takes a list of tuples of the form (text, identifier), and the output list contains tuples of the form (text, label, identifier); see the sketch after this list. This is useful for mapping back from texts in the generated corpus to texts in the input: texts without defining patterns are filtered out completely, and those with defining patterns have the defining patterns replaced with only the shortform, making the mapping nontrivial without the identifiers.
  2. Adeft's home folder can now be specified by setting the environment variable ADEFT_HOME in the user's profile. The default is now the hidden folder ".adeft" in the user's home directory, with subfolders for different adeft versions.
  3. The parameter class_weight from Scikit-learn's implementation of logistic regression is now exposed as a parameter of AdeftClassifier. This allows providing different weights in the loss function for different class labels.
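A minimal sketch of the updated process_texts call mentioned in item 1 above; the grounding map, texts, and identifiers are illustrative, and the return value follows the description in these notes rather than a documented signature.

```python
from adeft.modeling.label import AdeftLabeler

# Illustrative grounding map: shortform -> {longform: grounding}
grounding_dict = {"IR": {"insulin receptor": "HGNC:6091",
                         "ionizing radiation": "MESH:D011839"}}
labeler = AdeftLabeler(grounding_dict)

# Texts are now paired with unique identifiers (here, hypothetical PMIDs).
texts = [("... the insulin receptor (IR) signals ...", "PMID:1"),
         ("... cells exposed to ionizing radiation (IR) ...", "PMID:2")]

corpus = labeler.process_texts(texts)
# Each entry of corpus is a (text, label, identifier) tuple, so results can
# be mapped back to the inputs even though texts without defining patterns
# are dropped and defining patterns are replaced by the shortform.
```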

0.7.0

10 Sep 13:22

This release updates the longform expansion discovery algorithm in AdeftMiner to combine the Acromine-based approach with an alignment-based scorer that we have developed. Alignment-based scoring algorithms look for common subsequences between the shortform and longform candidates, with different approaches scoring matches in a variety of ways. We have combined the two approaches by taking weighted averages of normalized Acromine scores and alignment-based scores for each longform candidate, with the weight assigned to the alignment-based score increasing for rare expansions.
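As a rough sketch of the kind of combination described above; the weighting function here is illustrative and not the formula actually used by AdeftMiner.

```python
def combined_score(acromine_score, alignment_score, count, total_count):
    """Blend a normalized Acromine score with an alignment-based score.

    The weight given to the alignment-based score increases as the
    candidate expansion becomes rarer; the specific weighting below is a
    hypothetical choice for illustration only.
    """
    rarity = 1.0 - count / total_count      # near 1 for rare expansions
    return (1.0 - rarity) * acromine_score + rarity * alignment_score
```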

The AdeftRecognizer has also been updated to allow the raw longform expansion to be recovered as it actually appears in a text. Previously only a normalized expansion was recovered.

0.6.0

30 Jan 17:26

In this release

  • Users may specify a seed that will be used in random number generators involved in adeft model creation, allowing for repeatable model training results.
  • Additional statistics are captured at the time of model training. F1, Precision, and Recall are now captured for each class label separately, allowing users to see how performance compares across labels. These statistics have been propagated to the info string of AdeftDisambiguator.
  • Timestamps and additional metadata are collected at model training time, making changes in models more transparent. AdeftDisambiguator now has an additional method, version, based on some of this metadata.
  • The ".adeft" folder containing models now has the version appended to it. E.g. in this release the folder will be named ".adeft_0.6.0". This will allow different versions of adeft with incompatible models to coexist on the same machine
  • The command python -m adeft.download no longer takes the argument --update; the new behavior is that all existing models are replaced with a fresh copy of the models on S3 when the command is run.
  • The method feature_importances of AdeftClassifier no longer raises an exception if called for a classifier trained before the information necessary to calculate feature importances was included. Now a warning is logged and None is returned.

0.5.5 - JOSS Paper

16 Jan 18:22

Version of Adeft software corresponding to accepted manuscript at the Journal of Open Source Software (JOSS).

Change log from 0.5.3:

  • add shortforms to model stopwords to prevent use of abbreviations as model features
  • capture information about feature importance for each model