Pipeline code guidelines

Basics

  1. Routines should fit on a screen.
  2. Each routine should have a full doc header.
  3. Outputs should be through return values or output keywords.
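
For example, a minimal sketch of a routine following these guidelines, with an IDLdoc-style doc header and an output keyword for status (the routine and keyword names here are hypothetical):

```
;+
; Apply a dark correction to an image.
;
; :Returns:
;   corrected image as `fltarr(nx, ny)`
;
; :Params:
;   im : in, required, type="fltarr(nx, ny)"
;     image to correct
;   dark : in, required, type="fltarr(nx, ny)"
;     dark image
;
; :Keywords:
;   status : out, optional, type=long
;     set to a named variable to retrieve the status of the correction,
;     0 for success
;-
function example_dark_correct, im, dark, status=status
  compile_opt strictarr

  status = 0L
  if (~array_equal(size(im, /dimensions), size(dark, /dimensions))) then begin
    status = 1L
    return, !null
  endif

  return, im - dark
end
```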

Architecture

There are four users of the code in the pipeline:

  1. the full pipeline when run to process a day's data
  2. ad hoc analysis users
  3. the unit tests
  4. regression/integration tests

A few basic principles to make using the pipeline code easier for each of the four users:

  1. The calculations performed by the pipeline will be done by a set of simple routines (all input passed through arguments). These routines will take only simple arrays and scalar parameters, and will return error codes and status rather than logging messages directly (see the sketch after this list).
  2. The higher-level routines, which are passed the run object and use file objects to cache information about a given file, will be used only by the pipeline itself, not by ad hoc analysis users or the unit tests. These routines will send messages to the output log.
  3. The regression/integration tests have the same requirements as the pipeline itself.
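
A sketch of how the two layers might fit together, with hypothetical routine names; `readfits` and `mg_log` stand in for whatever FITS-reading and logging routines the pipeline actually uses:

```
; pipeline-level routine: owns the run object, reads files, logs messages
pro example_l1_process_file, filename, run=run
  compile_opt strictarr

  im = readfits(filename, header)

  ; hypothetical helper that uses the run object to find the matching dark
  dark = example_find_dark(run, header)

  ; low-level calculation routine: simple inputs, status out, no logging
  corrected = example_dark_correct(im, dark, status=status)

  if (status ne 0L) then begin
    mg_log, 'dark correction failed for %s', file_basename(filename), $
            name=run.logger_name, /error
    return
  endif

  ; ...continue processing and write the corrected image (omitted)
end
```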

Data

  1. Do not change anything in the raw directory. If the pipeline modifies the raw directory, then the raw data must be copied to a new location before each run, as is done in the KCor pipeline. Locks and listings of previously processed files can be kept in the process directory instead. This allows the pipeline to run without copying the raw data, and even allows two instances of the pipeline to run on the same raw directory at the same time.
  2. Keep all FITS keyword names, value specifications, and comments in a template file. There will be separate files for primary headers vs. extension headers for each type of file written by the pipeline.
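
The exact template format is up to the pipeline; a sketch of what a primary-header template might look like, with keyword name, value specification, and comment for each entry (the syntax and option names here are only illustrative):

```
[primary]
DATE-OBS : required, type=str, comment="[UT] date/time of observation"
EXPTIME  : required, type=float, format="%0.3f", comment="[s] exposure time"
LEVEL    : value="L1", type=str, comment="processing level"
```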

Testing

  1. A better testing plan is needed.
  2. The unit tests should be able to quickly determine that the functionality of the low-level routines has not been broken by changes.
  3. The regression/integration tests should be able to easily and automatically determine whether the pipeline has changed the results for some short benchmark datasets, and should require that new output be certified as correct before it passes.
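
Because the low-level routines take only simple inputs, their unit tests stay small. A minimal sketch of what such a test could look like, assuming an mgunit-style test class (the class and routine names are hypothetical):

```
function example_dark_correct_ut::test_basic
  compile_opt strictarr

  im = fltarr(10, 10) + 5.0
  dark = fltarr(10, 10) + 1.0

  result = example_dark_correct(im, dark, status=status)

  assert, status eq 0L, 'non-zero status'
  assert, array_equal(result, fltarr(10, 10) + 4.0), 'incorrect result values'

  return, 1
end

pro example_dark_correct_ut__define
  compile_opt strictarr

  define = { example_dark_correct_ut, inherits MGutTestCase }
end
```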

Config file

  1. A config file is used for specifying all the options required to run.
  2. The default for each option, except paths, will be what is used in production.
  3. The production config file will be fairly short, only specifying paths.
  4. Config files for other runs will specify those same paths plus any options that differ from production.
  5. There will be a section for each processing step with constants, options, locations, and diagnostics for that step.
  6. The format of the config files (type and default value of each option) will be defined by a specification file. Before the run starts, the config file will be verified against this specification file.
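
For illustration, a production config file might give only the paths, while a non-production run adds whichever options differ; the section names, option names, and spec syntax below are all hypothetical:

```
# production config file: only the paths, everything else uses its default
[processing]
raw_basedir     : /data/raw
process_basedir : /data/process

# a non-production run's config file adds whichever options differ
[level1]
apply_distortion : NO

# specification file: type and default value for every allowed option
[processing]
raw_basedir      : type=str
process_basedir  : type=str

[level1]
apply_distortion : type=boolean, default=YES
```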

Epochs file

  1. A file containing all the instrument changes that affect data processing, organized by date, is very useful. This includes constants about the instrument which haven't (yet) changed.
  2. Sections of the file are titled by date/times and contain only the values that change in that epoch.
  3. The format of the epoch file (type and default value of each option) will also be defined by a specification file and verified before the run starts.
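
An illustrative sketch of an epochs file, with hypothetical option names:

```
# each section is titled by the date/time the change took effect and lists
# only the values that change in that epoch
[20180101.000000]
plate_scale    : 2.93
use_new_camera : NO

[20210615.191500]
use_new_camera : YES
```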

Logs

  1. Detailed logs are very useful for diagnosing problems.
  2. Stack traces and other information from crashes should be written to the logs.
  3. Each log message should be stamped with a date/time, a level (critical, error, warning, informational, or debug), and the routine it came from.
  4. When filtered at the informational level, the logs should give a fairly concise summary of what the pipeline did.
  5. When filtered at the warning level, the logs should list the problems the pipeline encountered in the data.
  6. Critical and error log messages indicate that the pipeline must be fixed to handle some unexpected input; they should stop the pipeline run immediately.
  7. Warning messages should be minimal, so as not to overwhelm. If a particular warning tends to occur many times whenever it occurs at all, count the occurrences and report only the summary at the warning level, demoting the individual messages to the debug level.
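
An illustrative sketch of log messages with a date/time stamp, level, and originating routine (the exact layout is up to the pipeline, and the routine and file names here are hypothetical):

```
2023-05-23 06:15:02 INFO: example_l1_process_file: processing 20230523.061455.fts
2023-05-23 06:15:03 WARN: example_dark_correct: 12 saturated pixels in 20230523.061455.fts
2023-05-23 06:15:03 DEBUG: example_dark_correct: saturated pixel at [512, 384]
```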

Abstractions

An object representing a run of the pipeline is useful. It is responsible for:

  1. knowing the parameters of the run, i.e., the information available from the config file and the epochs file
  2. doing setup for the run, i.e., setting up the logs, creating directories, etc.

An object representing a file will be useful. It is responsible for:

  1. being able to retrieve any information about the file, i.e., data or metadata
  2. caching metadata that is frequently needed

An object representing a date/time will be useful. It is responsible for:

  1. parsing a given string into a date/time object
  2. returning a string with a given format for a date/time object
  3. doing arithmetic, such as adding/subtracting a time interval, comparing two date/times, etc.
  4. converting to another date/time representation, such as Julian date, epoch time, etc.
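
A sketch of how these three abstractions might be used together in a pipeline step; all class names, properties, and methods here are hypothetical:

```
; run object: knows config/epoch values, sets up logs and directories
run = example_run(date, config_filename=config_filename)

; file object: retrieves and caches data/metadata for a single file
file = example_file(l0_filename, run=run)
file->getProperty, exptime=exptime, n_extensions=n_extensions

; date/time object: parsing, formatting, arithmetic, conversions
dt = example_datetime('2023-05-23T06:15:02')
dt = dt->add(minutes=15)
print, dt->to_string(format='%Y%m%d.%H%M%S')
print, dt->to_julian()
```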

Anti-patterns

  1. Common blocks really are bad: adding a variable to a common block that is used in many routines requires checking each routine to see whether a variable of that name is already in use. A run object is better (see the sketch below).
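
A sketch of the difference, with hypothetical routine and property names; the run object version keeps the shared state in one place behind one name:

```
; anti-pattern: state shared through a common block -- every routine that
; declares example_state must be checked before adding a variable to it
pro example_apply_flat_common, im
  compile_opt strictarr
  common example_state, flat

  im /= flat
end

; preferred: state carried by the run object
pro example_apply_flat, im, run=run
  compile_opt strictarr

  run->getProperty, flat=flat
  im /= flat
end
```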
