PubChem molecular formula database for IDSL.UFA

Sadjad F Baygi edited this page Jan 22, 2023 · 7 revisions

IDSL.UFA incorporates a function to extract molecular formulas from the PubChem database. It can be called with the following command:

UFA_PubChem_formula_extraction(path)

path is the location where the molecular formulas extracted from PubChem are stored in .txt format.

The PubChem database contains many complex formulas that are unlikely to be detected in common mass spectrometry analyses. To generate efficient negative and positive isotopic profile databases (IPDBs) and to reduce their size, the following filters were applied to eliminate these complex formulas:

  1. Formulas that do not contain carbon were excluded
  2. 35 <= monoisotopic mass <= 1250
  3. Number of distinct elements (not atoms) in a molecular formula <= 7
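
The three filters above can be sketched in a few lines of base R. This is only an illustration under stated assumptions, not the IDSL.UFA implementation: `parse_formula`, `passes_filters`, and the small `mono_mass` table are hypothetical helpers, the masses are approximate values for a handful of elements, and the mass window is assumed to apply to the neutral formula.

```r
# Illustrative sketch only: these helpers are hypothetical, not IDSL.UFA functions.
# Approximate monoisotopic masses (Da) for a few common elements.
mono_mass <- c(C = 12.000000, H = 1.007825, N = 14.003074, O = 15.994915,
               P = 30.973762, S = 31.972071, F = 18.998403, Cl = 34.968853,
               Br = 78.918338, I = 126.904472, Na = 22.989769, Si = 27.976927)

# Parse a formula such as "C6H12O6" into a named vector of atom counts.
parse_formula <- function(formula) {
  tokens <- regmatches(formula, gregexpr("[A-Z][a-z]?[0-9]*", formula))[[1]]
  elements <- sub("[0-9]+$", "", tokens)
  counts <- sub("^[A-Za-z]+", "", tokens)
  counts[counts == ""] <- "1"          # an element without a number occurs once
  vapply(split(as.numeric(counts), elements), sum, numeric(1))
}

# Apply the three filters; returns NA when an element is missing from mono_mass.
passes_filters <- function(formula, min_mass = 35, max_mass = 1250,
                           max_elements = 7) {
  atoms <- parse_formula(formula)
  if (!("C" %in% names(atoms))) return(FALSE)        # filter 1: carbon required
  if (length(atoms) > max_elements) return(FALSE)    # filter 3: distinct elements
  if (!all(names(atoms) %in% names(mono_mass))) return(NA)
  mass <- sum(atoms * mono_mass[names(atoms)])       # filter 2: mass window
  mass >= min_mass && mass <= max_mass
}

passes_filters("C6H12O6")  # carbon present, 3 elements, mass within the window
passes_filters("H2O")      # rejected: no carbon
passes_filters("CH4")      # rejected: monoisotopic mass below 35
```

Counting distinct elements rather than atoms means that a large but compositionally simple formula such as a long-chain hydrocarbon still passes filter 3, provided its mass stays within the window.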

The current PubChem IPDBs include 6,426,480 and 10,733,725 molecular formula ions in negative and positive modes, respectively.

Post library molecular formula matching

In addition to the positive and negative IPDBs built from the PubChem molecular formula database, a database of molecular formula frequencies in PubChem was generated using the molecular_formula_library_generator module and is provided in this link. When the ionization pathways of the analytical platform (e.g. c("[M]+", "[M+H]+", "[M+Na]+")) are known, the intact molecular formulas can be obtained and then checked for their presence and frequency in the PubChem molecular formula database. The molecular formula database search can be activated through PARAM0025-PARAM0027 in the UFA parameter spreadsheet for individual file annotations.
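
The idea behind post library matching can be sketched as follows: remove the adduct atoms implied by a known ionization pathway from an ion formula, then look up the resulting intact formula in a frequency table. This is a minimal sketch under stated assumptions and not the package's own code. The helpers `parse_formula`, `format_formula`, `intact_formula`, and `lookup_frequency` are hypothetical, only the three pathways quoted above are covered, and `toy_frequency` holds invented placeholder counts rather than real PubChem frequencies.

```r
# Illustrative sketch only: these helpers are hypothetical, not IDSL.UFA functions.
# Parse a formula such as "C6H12O6Na" into a named vector of atom counts.
parse_formula <- function(formula) {
  tokens <- regmatches(formula, gregexpr("[A-Z][a-z]?[0-9]*", formula))[[1]]
  elements <- sub("[0-9]+$", "", tokens)
  counts <- sub("^[A-Za-z]+", "", tokens)
  counts[counts == ""] <- "1"
  vapply(split(as.numeric(counts), elements), sum, numeric(1))
}

# Write atom counts back as a string: C first, then H, then the rest alphabetically.
format_formula <- function(atoms) {
  atoms <- atoms[atoms > 0]
  ordered <- c(intersect(c("C", "H"), names(atoms)),
               sort(setdiff(names(atoms), c("C", "H"))))
  paste0(ordered, ifelse(atoms[ordered] == 1, "", atoms[ordered]), collapse = "")
}

# Atoms added to the intact molecule M by each ionization pathway (assumed set).
adduct_atoms <- list("[M]+" = numeric(0), "[M+H]+" = c(H = 1), "[M+Na]+" = c(Na = 1))

# Remove the adduct atoms from an ion formula to recover the intact formula.
intact_formula <- function(ion_formula, pathway) {
  stopifnot(pathway %in% names(adduct_atoms))
  atoms <- parse_formula(ion_formula)
  added <- adduct_atoms[[pathway]]
  for (el in names(added)) {
    if (!(el %in% names(atoms)) || atoms[[el]] < added[[el]]) return(NA_character_)
    atoms[[el]] <- atoms[[el]] - added[[el]]
  }
  format_formula(atoms)
}

# Invented placeholder counts for demonstration; NOT real PubChem frequencies.
toy_frequency <- c(C6H12O6 = 3, C2H6O = 2)

# Return the frequency of a formula, or 0 when it is absent from the table.
lookup_frequency <- function(formula, freq) {
  if (!is.na(formula) && formula %in% names(freq)) freq[[formula]] else 0
}

candidate <- intact_formula("C6H13O6", "[M+H]+")
lookup_frequency(candidate, toy_frequency)
```

In practice the lookup table would be the frequency database referred to above, whose actual file layout is not described here.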