In this practical, you will carry out the steps needed for data ingestion and for setting up your working environment. You will download the test dataset, verify its integrity, and add minimal documentation. At the end, the data will be stored in a way that suits subsequent (re-)use in your later steps.
Participants are required to have several tools installed before the practical. These tools are needed to work with .7z archives and to verify MD5 checksums. Follow the link for your platform:
Registering the data is necessary for compliance with GDPR. The record should also state the actual physical location of the data, which should be known before the actual ingestion.
Please read the email from your collaborator, which contains the download link and the password link.
- Download the data
- Get encryption password
Your collaborator/data provider generated checksums before uploading the data to your shared storage. Checksums are commonly saved in a plain-text file placed next to the actual data.
In our test scenario, the md5sum tool was used for checksum generation.
- Open the downloaded checksum file (EPIC-DREM_chip-seq.7z.md5) with your favorite text editor and inspect the content
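To make the file format concrete, the sketch below generates a checksum file for a small stand-in file and prints it. The name `demo.bin` is a placeholder, not part of the test dataset; the real checksum file should have the same one-line layout: 32 hexadecimal digits, two spaces, then the file name.

```shell
# demo.bin is a hypothetical stand-in for the real archive.
printf 'hello\n' > demo.bin

# md5sum prints "<32 hex digits>  <file name>"; redirect that line into a file.
md5sum demo.bin > demo.bin.md5

# Show the resulting checksum file.
cat demo.bin.md5
```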
Data might have been corrupted already on the server or during the transfer. This step ensures that the data are exactly the same as at the time of the last checksum computation.
- Install PeaZip tool (see instructions)
- Right-click on a file -> PeaZip -> CRC, hash and file tools
- Hit OK
- Go to Clipboard and verify that the MD5 checksum string matches the one in the provided checksum file.
- Install and run Checksum tool (see instructions)
- Choose MD5 algorithm
- Drag and drop the file archive into the Checksum window
- Compare the result with the provided checksum by pasting it into the "Paste reported checksum here" field.
Place the file with checksums next to the file archive and run the following command:
md5sum -c EPIC-DREM_chip-seq.7z.md5
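If the check needs to run inside a script, the exit status of `md5sum -c` can be used instead of reading the output by eye. The following is a minimal sketch; `verify_checksums` is a hypothetical helper name, and it assumes the checksum file lists file names relative to its own directory.

```shell
# verify_checksums is a hypothetical helper, not part of coreutils.
# Usage: verify_checksums <path-to-checksum-file>
verify_checksums() {
    dir=$(dirname -- "$1")
    file=$(basename -- "$1")
    # Run from the checksum file's directory so the listed names resolve.
    # md5sum -c exits with a non-zero status if any file does not match.
    if (cd -- "$dir" && md5sum -c "$file"); then
        echo "OK: all files match $1"
    else
        echo "FAILED: at least one file does not match $1" >&2
        return 1
    fi
}
```

For the test dataset this would be called as `verify_checksums EPIC-DREM_chip-seq.7z.md5`.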
- Right-click on the archive file
- Select PeaZip -> Extract here
- Enter the encryption password
Use the 7z command-line tool to extract the archive.
7z x EPIC-DREM_chip-seq.7z
# Enter the encryption password
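As an optional extra step, the archive can be tested before anything is written to disk. The sketch below wraps this in a hypothetical helper, `test_then_extract`; it relies on the 7z `t` (test) command and the `-o` switch for the output folder. For an encrypted archive, 7z will likely ask for the password once per command.

```shell
# test_then_extract is a hypothetical helper, not part of 7z.
# Usage: test_then_extract <archive> <destination-folder>
test_then_extract() {
    archive=$1
    dest=$2
    # "t" reads the whole archive and checks it without extracting anything.
    7z t "$archive" || return 1
    # "x" extracts with full paths; note there is no space after -o.
    7z x "$archive" -o"$dest"
}
```

An example call would be `test_then_extract EPIC-DREM_chip-seq.7z extracted`, where the folder name `extracted` is only an illustration.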
Write minimal information about the folder and the data you have just downloaded.
The README file should be in a plain-text format (TXT, Markdown) and contain the following information:
- dataset name/title
- project name
- date of creation/download
- data origin
- version of the data
- data owner/responsible
- data structure
- how was the data downloaded/received
- ...
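One possible way to start such a file from the command line is a here-document, sketched below. Every value in angle brackets is a placeholder to be filled in by hand; only the date is inserted automatically. The file name `README.md` and the layout are assumptions, not a required template. Run it from inside the dataset folder, before the folder is made read-only.

```shell
# Write a README skeleton; values in <...> are placeholders to edit by hand.
# The here-document is unquoted so that $(date ...) is expanded.
cat > README.md <<EOF
# <dataset name/title>

- Project: <project name>
- Date of download: $(date +%Y-%m-%d)
- Data origin: <data origin>
- Version: <version of the data>
- Owner/responsible: <data owner/responsible>
- Received via: <how the data was downloaded/received>

## Data structure

<short description of the folders and files>
EOF
```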
To ensure that nobody tampers with the single original copy of the data, it is best practice to make it read-only.
- Right-click on the folder
- Select Properties
- In the Attributes section, check the Read-only checkbox
- Click on the Apply button and confirm
- Right-click on the folder/file and select "Get Info"
- Expand the section Sharing & Permissions
- If the small lock icon at the bottom is locked, click on it to unlock it.
- Set Privilege to Read only for all users/user groups.
- Use the bottom dropdown icon to apply changes to all enclosed items
Navigate to the parent directory and use chmod, the GNU coreutils tool for changing the mode of files and directories, to make the data read-only.
cd ..
chmod -R a-w test-data
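To confirm that the change took effect, one option is to search for anything that still has a write bit set. The sketch below assumes GNU find (the `-perm /222` form means "any write bit set"); `is_read_only` is a hypothetical helper name. Symbolic links are skipped because their own mode bits are not meaningful here.

```shell
# is_read_only is a hypothetical helper, not part of coreutils or findutils.
# Usage: is_read_only <folder>
# Succeeds only if no file or directory under <folder> has any write bit set.
is_read_only() {
    # -perm /222 (GNU find) matches entries writable by user, group, or others.
    [ -z "$(find "$1" ! -type l -perm /222)" ]
}
```

A call such as `is_read_only test-data && echo "read-only"` reports the result. If the data ever need to be changed deliberately, `chmod -R u+w test-data` restores write access for the owner.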
Your task will be to update the dataset and send it to the trainer. To follow best practices, you should:
- Prepare data for transfer
  - include README file for recipient in the folder
  - create an encrypted archive
  - generate checksums
- Send the data, checksums and encryption password securely. Remember to use secure channels designed for data transfer.
- Right-click on the folder and select PeaZip -> Add to archive
- Select archive name and format. Click on "Enter password / keyfile" and enter your encryption phrase.
- For checksum computation you can use the same steps as in the section Verify checksums
- Save the checksum in a file with the same name + .md5 suffix
- Open KEKA tool
- Choose the type of your archive and enter encryption password
- Drag and drop the folder into the window
- Hit OK
- For checksum generation open the Checksum tool and drag and drop the file archive
Create an archive:
7z a <name-of-the-archive>.7z <folder-to-be-archived> -p
# enter the encryption password
Generate a checksum file for the archive:
md5sum <name-of-the-archive>.7z > checksums.md5
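md5sum records the file name exactly as it was given on the command line, so a checksum generated with a long path can only be verified from a matching location. The sketch below is one way to avoid that: it runs from the archive's own directory, names the checksum file after the archive (the same name + .md5 convention used for the downloaded data), and verifies the result immediately. `write_checksum` is a hypothetical helper name.

```shell
# write_checksum is a hypothetical helper, not part of coreutils.
# Usage: write_checksum <path-to-archive>
write_checksum() {
    dir=$(dirname -- "$1")
    name=$(basename -- "$1")
    (
        cd -- "$dir" || exit 1
        # Record only the bare file name so the recipient can verify the
        # archive with the checksum file placed next to it.
        md5sum "$name" > "$name.md5" || exit 1
        # Self-check: the freshly written file must verify.
        md5sum -c "$name.md5"
    )
}
```

The recipient can then place both files in one folder and run `md5sum -c <name-of-the-archive>.7z.md5`.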