McM and XSDB queries

Thanks to cern-get-sso-cookie it is now possible to create a temporary secure connection that can be used to perform queries against various services provided by CERN. The authentication requires a grid certificate (usercert.pem) and an encrypted private key (userkey.pem), which can be obtained by following the instructions detailed here. These are the same files that one needs to set up for submitting jobs to the grid. The secure connection is established by creating a short-lived cookie file (cookie.txt):

cern-get-sso-cookie --cert ~/.globus/usercert.pem \
                    --key  ~/.globus/userkey.pem  \
                    -u <URL>                      \
                    -o cookie.txt

where <URL> corresponds to a website with an API that issues the cookie. The command prompts for your grid password twice. The connection remains valid for as long as the cookie does, typically on the order of 8 hours. Below are two examples that demonstrate how to take advantage of this feature.
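For example, to obtain a cookie that works with McM (used in the next section), one would point the command at the McM homepage:

cern-get-sso-cookie --cert ~/.globus/usercert.pem \
                    --key  ~/.globus/userkey.pem  \
                    -u https://cms-pdmv.cern.ch/mcm/ \
                    -o cookie.txt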

McM

The McM service is a tool for creating and managing MC production requests. For a regular analyst, it can be useful for finding more detailed information about a dataset than is otherwise available in DAS. The most common use case in particular is retrieving the fragment of a dataset, which documents the exact showering settings and the location of its gridpack.

Normally, one would access this information by first finding the preparation ID of a dataset, then pinpointing the corresponding request chain, navigating to the first (LHE) step of the chain and checking its setup command. Thanks to the McM API (more documentation here) it is possible to automate all of these steps. The only catch is that there is no obvious way to obtain the preparation ID of a dataset from the API, given its DBS name. Fortunately, there exists a DBS client API for Python 2 (in addition to the standalone dasgoclient program) that can be used instead to query the preparation ID. The only additional requirement is that a grid proxy is open:

source /cvmfs/cms.cern.ch/common/crab-setup.sh # or similar -- only if you haven't set up your env
voms-proxy-init -voms cms -valid 12:00 # or shorter, depending on the needs
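
For illustration, a hand-rolled version of this lookup might look like the sketch below. It assumes that DAS exposes the preparation ID via a prepid key and that McM serves request JSON from its public REST endpoint /public/restapi/requests/get/ — both are assumptions based on common usage, not the implementation of the script introduced next.

# Sketch only: look up the preparation ID with dasgoclient, then fetch the
# corresponding McM request using the SSO cookie. The "prepid" DAS key and
# the REST endpoint are assumptions and may need adjusting.
DATASET=/ttHJetToNonbb_M125_13TeV_amcatnloFXFX_madspin_pythia8_mWCutfix/RunIISummer16MiniAODv3-PUMoriond17_94X_mcRun2_asymptotic_v3_ext1-v2/MINIAODSIM
PREPID=$(dasgoclient -query "prepid dataset=${DATASET}")
curl -s -L --cookie cookie.txt \
     "https://cms-pdmv.cern.ch/mcm/public/restapi/requests/get/${PREPID}"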

Calls to both APIs are conveniently packaged into the inspect_mcm.py script, which gives the option to obtain the generator metadata (-q generator) or fragment (-q fragment) of a dataset (-i <dbs name>), given a cookie (-c <cookie>, which defaults to cookie.txt in the current working directory). The cookie must be obtained from https://cms-pdmv.cern.ch/mcm/. Prerequisite steps for running the script are documented in its help message (-h).

Here is an example that demonstrates how to find generator information:

inspect_mcm.py -i /ttHJetToNonbb_M125_13TeV_amcatnloFXFX_madspin_pythia8_mWCutfix/RunIISummer16MiniAODv3-PUMoriond17_94X_mcRun2_asymptotic_v3_ext1-v2/MINIAODSIM -q generator

Output:

Madgraph5_aMC@NLO, MadSpin, Pythia8

And how to find its fragment:

inspect_mcm.py -i /ttHJetToNonbb_M125_13TeV_amcatnloFXFX_madspin_pythia8_mWCutfix/RunIISummer16MiniAODv3-PUMoriond17_94X_mcRun2_asymptotic_v3_ext1-v2/MINIAODSIM -q fragment

Output:

import FWCore.ParameterSet.Config as cms

# link to cards:
# https://github.com/cms-sw/genproductions/tree/b0f47cf04cdbdbe18b56ddbc22013cbd86f4a2cd/bin/MadGraph5_aMCatNLO/cards/production/13TeV/higgs/tth01j_5f_ckm_NLO_FXFX_MH125

externalLHEProducer = cms.EDProducer("ExternalLHEProducer",
    args = cms.vstring('/cvmfs/cms.cern.ch/phys_generator/gridpacks/slc6_amd64_gcc481/13TeV/madgraph/V5_2.2.2/tth01j_5f_ckm_NLO_FXFX_MH125/v1/tth01j_5f_ckm_NLO_FXFX_MH125_tarball.tar.xz'),
    nEvents = cms.untracked.uint32(5000),
    numberOfParameters = cms.uint32(1),
    outputFile = cms.string('cmsgrid_final.lhe'),
    scriptName = cms.FileInPath('GeneratorInterface/LHEInterface/data/run_generic_tarball_cvmfs.sh')
)

Provenance

If the intention is just to get the gridpack of a sample, the fragment itself is irrelevant, and the sample files are easily accessible, then querying McM is not necessary. All you need to do is pick a file from any data tier, as long as it is in EDM format, and use the following function to retrieve the gridpack:

get_gridpack() {
  # Dump the provenance of the ExternalLHEProducer module and extract the
  # gridpack path from its arguments.
  edmProvDump -f ExternalLHEProducer "$1" 2>/dev/null | grep gridpack | sed "s/'/ /g" | awk '{print $(NF - 1)}';
}

For example:

get_gridpack root://cms-xrd-global.cern.ch//store/mc/RunIISummer20UL16NanoAODv9/WJetsToLNu_TuneCP5_13TeV-madgraphMLM-pythia8/NANOAODSIM/106X_mcRun2_asymptotic_v17-v1/280000/80640BFB-6356-E548-95D7-33B727585FB1.root
#> /cvmfs/cms.cern.ch/phys_generator/gridpacks/2017/13TeV/madgraph/V5_2.6.1/WJetsToLNu/WJetsToLNu_13TeV-madgraphMLM-pythia8_slc6_amd64_gcc630_CMSSW_9_3_16_tarball.tar.xz

If you want to dump the Pythia showering settings that are typically included in the generator fragments, then you can simply use edmProvDump -f Pythia8HadronizerFilter to get this information. None of this works if the provenance information has been dropped from the input file (for instance during post-processing). Moreover, the EDM utilities are available only within a CMSSW environment.
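
For instance, assuming a CMSSW environment and an input file with intact provenance, the hadronizer settings of the sample used above can be dumped with:

edmProvDump -f Pythia8HadronizerFilter root://cms-xrd-global.cern.ch//store/mc/RunIISummer20UL16NanoAODv9/WJetsToLNu_TuneCP5_13TeV-madgraphMLM-pythia8/NANOAODSIM/106X_mcRun2_asymptotic_v17-v1/280000/80640BFB-6356-E548-95D7-33B727585FB1.root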

XSDB

XSDB is a database for hosting cross sections. If a user wants to find the cross section of a dataset in the database, all they need to provide is the first part of the dataset name (between the first two slashes). Additional conditions, such as the c.o.m. energy, may also be applied to the search query. Its API does not require an open grid proxy; only the cookie, obtained from https://cms-gen-dev.cern.ch/xsdb/, is needed.
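
The cookie can be created analogously to the McM case:

cern-get-sso-cookie --cert ~/.globus/usercert.pem \
                    --key  ~/.globus/userkey.pem  \
                    -u https://cms-gen-dev.cern.ch/xsdb/ \
                    -o cookie.txt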

The inspect_xsdb.py script functions as a command-line equivalent of the XSDB web page. It requires the dataset name (either the full DBS name or only its first part) as input (-i <name>). Additional constraints may be applied in order to refine the search results, such as the c.o.m. energy (-e <in TeV>) or the order of accuracy (-o <order>). By default, the output is formatted as a human-readable table with the following columns: process name, cross section (in pb), total uncertainty, accuracy, c.o.m. energy, comments and references. The table-formatted output can be disabled in favor of parsable output with the -t 0 option, whereby each unique result is printed on a separate line with its attributes separated by semicolons. The choice of attributes and their order can be changed via the -k option.

First example:

inspect_xsdb.py -i TTToSemiLeptonic_TuneCP5_13TeV-powheg-pythia8 -k process_name cross_section total_uncertainty accuracy

Output:

+-----------------------------------------------+---------------+-------------------+----------+
|                  process_name                 | cross_section | total_uncertainty | accuracy |
+-----------------------------------------------+---------------+-------------------+----------+
|                TTToSemiLeptonic               |     365.34    |     +4.8%-6.1%    |   NNLO   |
| TTToSemiLeptonic_TuneCP5_13TeV-powheg-pythia8 |     687.1     |       0.5174      |   NLO    |
| TTToSemiLeptonic_TuneCP5_13TeV-powheg-pythia8 |     687.1     |       0.5174      |   NLO    |
| TTToSemiLeptonic_TuneCP5_13TeV-powheg-pythia8 |     687.1     |       0.5174      |   NLO    |
+-----------------------------------------------+---------------+-------------------+----------+

Second example:

inspect_xsdb.py -i /GluGluToContinToZZTo2mu2tau_13TeV_MCFM701_pythia8/RunIIAutumn18MiniAOD-102X_upgrade2018_realistic_v15-v4/MINIAODSIM -t 0

Output:

GluGluToContinToZZTo2mu2tau_13TeV_MCFM701_pythia8;3.185;0.0117;unknown;13;Automatically computed;
GluGluToContinToZZTo2mu2tau_13TeV_MCFM701_pythia8;3.289;0.001591;unknown;13;Automatically computed;

NB! XSDB is certainly not the definitive source for obtaining cross sections. It makes sense to use it only if no authoritative source is available. If the comment says "Automatically computed", as in the last example, it means that the cross section was estimated with GenXSecAnalyzer, typically from a few thousand MiniAOD events. There are scripts such as dump_xs.py that obtain the cross section from the full available statistics and are therefore slightly more accurate, but they can be cumbersome to run if the samples are huge.
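
For reference, here is a minimal sketch of how one might rerun GenXSecAnalyzer by hand; the configuration below is an assumption modeled on the standard GEN recipe, not the implementation of dump_xs.py:

cat > genxsec_cfg.py << 'EOF'
# Minimal GenXSecAnalyzer configuration (sketch).
import FWCore.ParameterSet.Config as cms
from FWCore.ParameterSet.VarParsing import VarParsing

options = VarParsing('analysis') # provides the inputFiles and maxEvents options
options.parseArguments()

process = cms.Process('XSec')
process.load('FWCore.MessageService.MessageLogger_cfi')
process.maxEvents = cms.untracked.PSet(input = cms.untracked.int32(options.maxEvents))
process.source = cms.Source('PoolSource', fileNames = cms.untracked.vstring(options.inputFiles))
process.genxsec = cms.EDAnalyzer('GenXSecAnalyzer')
process.p = cms.Path(process.genxsec)
EOF

cmsRun genxsec_cfg.py inputFiles=<MiniAOD file(s)> maxEvents=-1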
