Dimitri Papadopoulos Orfanos edited this page Nov 16, 2021 · 28 revisions

Stratify subjects are each associated with one of four recruitment centres (LONDON, SOUTHAMPTON, BERLIN and AACHEN) and are further divided into patients and controls.

PSC1 pseudonym code

Only acquisition centres can convert between identifying recruitment data and PSC1 codes.

PSC1 code structure

We associate a specific prefix P with the PSC1 codes of each of the resulting classes:

Centre       Patient                  Control
LONDON       010001 (600 subjects)    010000 (50 subjects)
SOUTHAMPTON  090001 (600 subjects)    090000 (400 subjects)
Total        1200                     450

After discarding 100 LONDON patient codes and adding 250 BERLIN patient codes on 21/12/2017:

Centre       Patient                  Control
LONDON       010001 (500 subjects)    010000 (50 subjects)
SOUTHAMPTON  090001 (600 subjects)    090000 (400 subjects)
BERLIN       040001 (250 subjects)
Total        1350                     450

After adding 100 AACHEN subjects (40 patients and 60 controls) on 17/09/2018:

Centre       Patient                  Control
LONDON       010001 (500 subjects)    010000 (50 subjects)
SOUTHAMPTON  090001 (600 subjects)    090000 (400 subjects)
BERLIN       040001 (250 subjects)
AACHEN       091001 (40 subjects)     091000 (60 subjects)
Total        1390                     510

After discarding 1 LONDON patient code to deal with a patient who was mistakenly assigned a control code:

Centre       Patient                  Control
LONDON       010001 (499 subjects)    010000 (50 subjects)
SOUTHAMPTON  090001 (600 subjects)    090000 (400 subjects)
BERLIN       040001 (250 subjects)
AACHEN       091001 (40 subjects)     091000 (60 subjects)
Total        1389                     510

After discarding 5 LONDON patient codes to deal with 5 patients who were mistakenly assigned control codes:

Centre       Patient                  Control
LONDON       010001 (494 subjects)    010000 (50 subjects)
SOUTHAMPTON  090001 (600 subjects)    090000 (400 subjects)
BERLIN       040001 (250 subjects)
AACHEN       091001 (40 subjects)     091000 (60 subjects)
Total        1384                     510

After adding 200 new LONDON patient codes in 2021:

Centre       Patient                  Control
LONDON       010001 (494 subjects)    010000 (50 subjects)
             010002 (62 subjects)
             010003 (53 subjects)
             010004 (20 subjects)
             010005 (65 subjects)
SOUTHAMPTON  090001 (600 subjects)    090000 (400 subjects)
BERLIN       040001 (250 subjects)
AACHEN       091001 (40 subjects)     091000 (60 subjects)
Total        1584                     510

Please note that LONDON control subjects should be recruited from the Imagen cohort, so we do not need to generate new pseudonyms for them: their existing Imagen pseudonym codes are simply re-used. A limited set of 50 new control codes has nevertheless been generated, in case some LONDON control subjects are recruited outside the Imagen cohort.
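The prefix table as of the 2021 update can be captured as a small lookup, for instance to check which centre and class a PSC1 code belongs to. This is only a minimal sketch: the names PSC1_PREFIXES and classify_psc1 are hypothetical and are not taken from the project scripts.

```python
# Prefix -> (centre, class), copied from the 2021 table above.
# PSC1_PREFIXES and classify_psc1 are illustrative names (assumptions).
PSC1_PREFIXES = {
    "010000": ("LONDON", "control"),
    "010001": ("LONDON", "patient"),
    "010002": ("LONDON", "patient"),
    "010003": ("LONDON", "patient"),
    "010004": ("LONDON", "patient"),
    "010005": ("LONDON", "patient"),
    "090000": ("SOUTHAMPTON", "control"),
    "090001": ("SOUTHAMPTON", "patient"),
    "040001": ("BERLIN", "patient"),
    "091000": ("AACHEN", "control"),
    "091001": ("AACHEN", "patient"),
}


def classify_psc1(code):
    """Return (centre, class) for a 12-digit PSC1 code, based on its prefix.

    Only the length, the character set and the 6-digit prefix are checked;
    the check digit is not verified here.
    """
    if len(code) != 12 or not (code.isascii() and code.isdigit()):
        raise ValueError("a PSC1 code is expected to be 12 ASCII digits")
    prefix = code[:6]
    if prefix not in PSC1_PREFIXES:
        raise ValueError("unknown PSC1 prefix: " + prefix)
    return PSC1_PREFIXES[prefix]
```

Re-used Imagen codes for LONDON controls would not necessarily carry one of these prefixes, so a real lookup would probably need to handle them separately.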

Pseudonyms are generated for all subjects of the above defined classes. These 12-digit codes are a concatenation of:

  • a prefix P made of 6 digits, as documented in the table above,
  • a main code C made of 5 digits, unique across all subjects (whether Imagen or Stratify),
  • a check digit D made of a single digit, and obtained by applying the Damm algorithm to the concatenation of P and C, to detect invalid codes.
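The check digit step can be sketched as follows. This sketch assumes the commonly published order-10 quasigroup table for the Damm algorithm; the project scripts may use a different table, in which case the resulting digits would differ. The function names are hypothetical.

```python
# Commonly published Damm quasigroup of order 10 (assumption: the project
# scripts may rely on a different, equally valid table).
DAMM_TABLE = (
    (0, 3, 1, 7, 5, 9, 8, 6, 4, 2),
    (7, 0, 9, 2, 1, 5, 4, 8, 6, 3),
    (4, 2, 0, 6, 8, 7, 1, 3, 5, 9),
    (1, 7, 5, 0, 9, 8, 3, 4, 2, 6),
    (6, 1, 2, 3, 0, 4, 5, 9, 7, 8),
    (3, 6, 7, 4, 2, 0, 9, 5, 8, 1),
    (5, 8, 6, 9, 7, 2, 0, 1, 3, 4),
    (8, 9, 4, 5, 3, 6, 2, 0, 1, 7),
    (9, 4, 3, 8, 6, 1, 7, 2, 0, 5),
    (2, 5, 8, 1, 4, 3, 6, 7, 9, 0),
)


def damm_check_digit(digits):
    """Return the Damm check digit for a string of decimal digits."""
    interim = 0
    for ch in digits:
        interim = DAMM_TABLE[interim][int(ch)]
    return interim


def append_check_digit(prefix, main_code):
    """Concatenate prefix P, main code C and check digit D into one code."""
    body = prefix + main_code
    return body + str(damm_check_digit(body))


def is_valid_code(code):
    """A code is valid when processing all digits, D included, yields 0."""
    return damm_check_digit(code) == 0
```

Because the diagonal of the table is all zeros, running the algorithm over a code that already ends with its check digit returns 0, which is what is_valid_code relies on. The Damm algorithm detects every single-digit error and every transposition of two adjacent digits.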

We make sure the Damerau–Levenshtein distance between the concatenation of C and D for any two subjects is at least 3, in order to mitigate the risk of manual input errors.
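One way to enforce such a minimum distance is to compare each candidate against all previously accepted codes. The sketch below uses the optimal string alignment variant of the Damerau–Levenshtein distance (insertions, deletions, substitutions and transpositions of adjacent characters); whether the project scripts use this variant or the unrestricted one is not stated here, and the function names are hypothetical.

```python
def osa_distance(a, b):
    """Optimal string alignment distance between strings a and b."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all of a[:i]
    for j in range(cols):
        d[0][j] = j  # insert all of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # match or substitution
            )
            # Transposition of two adjacent characters.
            if (i > 1 and j > 1
                    and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


def far_enough(candidate, accepted, minimum=3):
    """True if candidate is at least `minimum` edits from every accepted code."""
    return all(osa_distance(candidate, other) >= minimum for other in accepted)
```

Here candidate and accepted would hold the concatenation of C and D (6 digits for PSC1), not the full 12-digit codes. The pairwise comparison is quadratic in the number of codes, which remains manageable for a few thousand subjects.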

PSC1 generation

We ran the Python script stratify_generate_psc1.py and its centre-specific variants as follows:

stratify_generate_psc1.py | sort > stratify_codes_2017-07-20.txt

stratify_generate_psc1_berlin.py | sort > stratify_codes_berlin_2017-12-13.txt

stratify_generate_psc1_aachen.py | sort > stratify_psc2_aachen_2018-09-17.txt

stratify_generate_psc1_london_2021.py | sort | head -200 > stratify_codes_2021-11-16.txt

PSC2 pseudonym code

Only NeuroSpin, acting as a trusted third party, can convert between PSC1 and PSC2 codes.

PSC2 code structure

We associate a specific prefix to the PSC2 codes of patients and controls:

Class    Prefix
Patient  0001
Control  0000

This is consistent with Imagen:

  • Imagen subjects already use 0000 as a PSC2 prefix.
  • Some LONDON Imagen subjects will be used as Stratify controls.

The 12-digit PSC2 pseudonym codes are a concatenation of:

  • a prefix P made of 4 digits, as documented in the table above,
  • a main code C made of 7 digits, unique across all subjects (whether Imagen, c-VEDA or Stratify),
  • a check digit D made of a single digit, and obtained by applying the Damm algorithm to the concatenation of P and C, to detect invalid codes.

We make sure the Damerau–Levenshtein distance between the concatenation of C and D for any two subjects is at least 3, in order to mitigate the risk of manual input errors.

PSC2 generation

We ran Python script stratify_generate_psc2.py as follows:

stratify_generate_psc2.py | sort > stratify_psc2_2017-07-28.txt

PSC1–PSC2 conversion table

Create separate PSC1 and PSC2 files for each patient/control class, and shuffle PSC2 files so that the conversion table cannot be inferred:

grep -e '^0[19]0000' stratify_codes_2017-07-20.txt > controls_psc1.txt
grep -e '^0[19]0001' stratify_codes_2017-07-20.txt > patients_psc1.txt

grep -e '^0000' stratify_psc2_2017-07-28.txt | shuf > controls_psc2.txt
grep -e '^0001' stratify_psc2_2017-07-28.txt | shuf > patients_psc2.txt

Create the Stratify conversion table using dummy 000000 DAWBA codes for now, append this new Stratify table to the existing Imagen conversion table, and finally delete the temporary files:

paste -d '=' controls_psc1.txt controls_psc2.txt | sed 's/=/=000000=/' > controls.txt
paste -d '=' patients_psc1.txt patients_psc2.txt | sed 's/=/=000000=/' > patients.txt

rm controls_psc1.txt patients_psc1.txt controls_psc2.txt patients_psc2.txt

cat controls.txt patients.txt | sort >> psc2psc.csv

rm controls.txt patients.txt
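For readers who prefer to follow the pairing logic in Python, the shell steps above can be approximated as below. This is a sketch, not the procedure that was actually run: the function name is hypothetical, and unlike paste it deliberately refuses inputs of different lengths.

```python
import random


def build_conversion_lines(psc1_codes, psc2_codes, dawba="000000", rng=None):
    """Pair sorted PSC1 codes with shuffled PSC2 codes.

    Returns sorted lines of the form PSC1=DAWBA=PSC2, mirroring the
    paste / sed / sort pipeline. The strict length check is a choice made
    for this sketch only.
    """
    if len(psc1_codes) != len(psc2_codes):
        raise ValueError("PSC1 and PSC2 lists must have the same length")
    # SystemRandom draws from the operating system's entropy source.
    rng = rng or random.SystemRandom()
    shuffled = list(psc2_codes)
    rng.shuffle(shuffled)  # breaks any ordering link between PSC1 and PSC2
    pairs = zip(sorted(psc1_codes), shuffled)
    return sorted(p1 + "=" + dawba + "=" + p2 for p1, p2 in pairs)
```

Calling the function once for controls and once for patients, as the shell commands do, keeps PSC2 prefixes 0000 and 0001 attached to the right class.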