more details to the documentation of data preprocessing #94

Mivg · 2024-11-17T01:49:31Z

No description provided.

GeorgiosSmyrnis · 2024-11-17T21:26:14Z

README.md

@@ -98,11 +98,11 @@ To get started with DCLM, follow these steps:
    We recommend the use of Python 3.10 with DCLM.

 ## Selecting Raw Sources
-If you are creating a new source:
+If you are creating a new source (for example, Wikipedia, GitHub, etc.):


Maybe creating -> registering?

GeorgiosSmyrnis · 2024-11-17T21:27:58Z

README.md

- Key names should be consistent with those in [here](baselines/core/constants.py).
- Create a reference JSON in [exp_data/datasets/raw_sources](exp_data/datasets/raw_sources).
+- Ensure your data is stored in JSONL format (ideally compressed with zstandard), where each line correspond to single page.
+- Key names in these JSONL should be consistent with those in [here](baselines/core/constants.py).


Maybe we can explain these in a bit more detail here?

GeorgiosSmyrnis · 2024-12-01T23:40:02Z

README.md

@@ -119,6 +119,7 @@ To process raw data, follow these steps:

 2. **Set up a Ray cluster**:
    The data processing script relies on Ray for distributed processing of data. This cluster can be either launched on a single node (for small scale data processing) or using AWS EC2 instances.
+    There is also work to [deploy Ray on slurm setups](https://docs.ray.io/en/latest/cluster/vms/user-guides/community/slurm.html), though this effort is still a work-in-progres.


I feel like this isn't really WIP anymore, in the sense that it is possible to do so if you follow the guide.

Has any of us been able to get our scripts to work on Slurm?

I've been launching Ray jobs on TACC, so that part is stable - but I suppose you're right, best not to claim something that might be false

GeorgiosSmyrnis · 2024-12-09T06:11:19Z

README.md

- Key names should be consistent with those in [here](baselines/core/constants.py).
- Create a reference JSON in [exp_data/datasets/raw_sources](exp_data/datasets/raw_sources).
+- Ensure your data is stored in JSONL format, ideally compressed with zstandard (though uncompressed or gzip-compressed files will also work), where each line corresponds to a single page/document.
+- Key names in these JSONL should be consistent with those in [here](baselines/core/constants.py). Most importantly, there should be a ``"text"`` key for each line that contains the acftual content of the page.


typo: acftual -> actual

GeorgiosSmyrnis · 2024-12-09T06:14:14Z

README.md

- Key names should be consistent with those in [here](baselines/core/constants.py).
- Create a reference JSON in [exp_data/datasets/raw_sources](exp_data/datasets/raw_sources).
+- Ensure your data is stored in JSONL format, ideally compressed with zstandard (though uncompressed or gzip-compressed files will also work), where each line corresponds to a single page/document.
+- Key names in these JSONL should be consistent with those in [here](baselines/core/constants.py). Most importantly, there should be a ``"text"`` key for each line that contains the acftual content of the page.


I would rephrase as follows:

Each row in these JSONL files corresponds to a document. Each row should contain keys consistent with those at ... and should contain at least a "text" key holding the actual content.
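To make the format discussed in this thread concrete, here is a minimal sketch of writing and checking such a file. It uses gzip from the standard library, since the diff notes that gzip-compressed files also work; zstandard would need a third-party package. The helper names `write_jsonl_gz` and `find_rows_missing_text` and the `"url"` key are illustrative assumptions, not part of the DCLM codebase. The authoritative key names are the ones in `baselines/core/constants.py`.

```python
import gzip
import json
import os
import tempfile


def write_jsonl_gz(path, documents):
    """Write one JSON object per line to a gzip-compressed JSONL file."""
    with gzip.open(path, "wt", encoding="utf-8") as f:
        for doc in documents:
            f.write(json.dumps(doc, ensure_ascii=False) + "\n")


def find_rows_missing_text(path, text_key="text"):
    """Return 1-based line numbers of rows without a non-empty string under text_key."""
    bad_lines = []
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            row = json.loads(line)
            value = row.get(text_key) if isinstance(row, dict) else None
            if not isinstance(value, str) or not value:
                bad_lines.append(lineno)
    return bad_lines


if __name__ == "__main__":
    # Hypothetical documents; "url" is only an example of an extra key.
    docs = [
        {"text": "First page content.", "url": "https://example.com/a"},
        {"text": "Second page content."},
    ]
    with tempfile.TemporaryDirectory() as tmp:
        shard = os.path.join(tmp, "shard_0000.jsonl.gz")
        write_jsonl_gz(shard, docs)
        print(find_rows_missing_text(shard))
```

Running a check like this before registering a source catches rows that would otherwise reach the pipeline without any content.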

more details to the documentation of data preprocessing

57e0151

Mivg requested review from jeffreywpli and GeorgiosSmyrnis November 17, 2024 01:49

Mivg self-assigned this Nov 17, 2024

GeorgiosSmyrnis reviewed Nov 17, 2024


GeorgiosSmyrnis reviewed Dec 1, 2024


jeffreywpli force-pushed the fix/additional_documentation branch from 405842b to 57e0151 Compare December 7, 2024 01:26

Jeffrey and others added 4 commits December 8, 2024 16:09

fix requirements merge

c0ff668

small fixes to address comments

343e2b8

Merge branch 'main' into fix/additional_documentation

bffc3a2

Update requirements.txt

56104bc

GeorgiosSmyrnis reviewed Dec 9, 2024


Update README.md

0dd5372

GeorgiosSmyrnis merged commit 8383011 into main Dec 9, 2024
