Add workflow for reading CSV from s3, cleaning, saving to Parquet #738
base: main
Conversation
…embarrassing-parallel
Thanks @douglasdavis! #724 was just merged, so you should be able to merge main to incorporate those changes. Also, since workflows will tend to be larger and require more resources, we've configured them to only run on PRs when the workflows label has been added (otherwise all tests in tests/workflows will be skipped). I've gone ahead and added the workflows label to this PR.
Awesome, thanks!
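(Aside: a minimal pytest-side sketch of how such label-based skipping could work. It assumes the CI job exposes the PR labels through an environment variable, here called PR_LABELS, which is an assumption; the repository's actual gating mechanism isn't shown in this thread.)

# conftest.py (sketch)
import os

import pytest

def pytest_collection_modifyitems(config, items):
    labels = os.environ.get("PR_LABELS", "").split(",")
    if "workflows" in labels:
        return  # label present: run tests/workflows normally
    skip = pytest.mark.skip(reason="'workflows' label not set on this PR")
    for item in items:
        if "tests/workflows" in str(item.fspath):
            item.add_marker(skip)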
While iterating on this workflow, I'd recommend adding the following (temporary) change to only run test_from_csv_to_parquet.py in CI. That should prevent running the full test suite unnecessarily and make it easier to iterate quickly:
diff --git a/.github/workflows/tests.yml b/.github/workflows/tests.yml
index 973f7f8..b15196e 100644
--- a/.github/workflows/tests.yml
+++ b/.github/workflows/tests.yml
@@ -31,44 +31,44 @@ jobs:
matrix:
os: [ubuntu-latest]
python-version: ["3.9"]
- pytest_args: [tests]
+ pytest_args: [tests/workflows/test_from_csv_to_parquet.py]
runtime-version: [upstream, latest, "0.2.1"]
- include:
- # Run stability tests on Python 3.8
- - pytest_args: tests/stability
- python-version: "3.8"
- runtime-version: upstream
- os: ubuntu-latest
- - pytest_args: tests/stability
- python-version: "3.8"
- runtime-version: latest
- os: ubuntu-latest
- - pytest_args: tests/stability
- python-version: "3.8"
- runtime-version: "0.2.1"
- os: ubuntu-latest
- # Run stability tests on Python 3.10
- - pytest_args: tests/stability
- python-version: "3.10"
- runtime-version: upstream
- os: ubuntu-latest
- - pytest_args: tests/stability
- python-version: "3.10"
- runtime-version: latest
- os: ubuntu-latest
- - pytest_args: tests/stability
- python-version: "3.10"
- runtime-version: "0.2.1"
- os: ubuntu-latest
- # Run stability tests on Python Windows and MacOS (latest py39 only)
- - pytest_args: tests/stability
- python-version: "3.9"
- runtime-version: latest
- os: windows-latest
- - pytest_args: tests/stability
- python-version: "3.9"
- runtime-version: latest
- os: macos-latest
+ # include:
+ # # Run stability tests on Python 3.8
+ # - pytest_args: tests/stability
+ # python-version: "3.8"
+ # runtime-version: upstream
+ # os: ubuntu-latest
+ # - pytest_args: tests/stability
+ # python-version: "3.8"
+ # runtime-version: latest
+ # os: ubuntu-latest
+ # - pytest_args: tests/stability
+ # python-version: "3.8"
+ # runtime-version: "0.2.1"
+ # os: ubuntu-latest
+ # # Run stability tests on Python 3.10
+ # - pytest_args: tests/stability
+ # python-version: "3.10"
+ # runtime-version: upstream
+ # os: ubuntu-latest
+ # - pytest_args: tests/stability
+ # python-version: "3.10"
+ # runtime-version: latest
+ # os: ubuntu-latest
+ # - pytest_args: tests/stability
+ # python-version: "3.10"
+ # runtime-version: "0.2.1"
+ # os: ubuntu-latest
+ # # Run stability tests on Python Windows and MacOS (latest py39 only)
+ # - pytest_args: tests/stability
+ # python-version: "3.9"
+ # runtime-version: latest
+ # os: windows-latest
+ # - pytest_args: tests/stability
+ # python-version: "3.9"
+ # runtime-version: latest
+ # os: macos-latest
steps:
- name: Checkout
Thanks @douglasdavis. I left a few comments / questions.

def test_from_csv_to_parquet(from_csv_to_parquet_client, s3_factory, s3_url):
    s3 = s3_factory(anon=True)
Just checking -- is anon=True needed to access the dataset?
Not sure, I'll give it a test
Looks like it is necessary
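A quick way to check this locally (same bucket path as the script further down; without anon=True, s3fs would look for AWS credentials, which the test environment doesn't have for this public bucket):

import s3fs

fs = s3fs.S3FileSystem(anon=True)
print(fs.ls("s3://gdelt-open-data/events/")[:5])  # should list the first few CSV files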
)

df = df.partitions[-10:]
df = df.map_partitions(drop_dupe_per_partition)
Why apply pandas' drop_duplicates instead of just using dask's? Did you run into an issue with the dask version?
Nothing wrong with the Dask version; it's just that it reduces the graph completely (it does an apply-concat-apply), and at this point I'd like to remove duplicates on a per-file basis.
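A small sketch of the difference (the body of drop_dupe_per_partition isn't shown in this thread, so the definition below is an assumption):

import dask.dataframe as dd
import pandas as pd

def drop_dupe_per_partition(part: pd.DataFrame) -> pd.DataFrame:
    # Deduplicate within a single partition (one CSV file); the graph
    # stays embarrassingly parallel, with no cross-partition work.
    return part.drop_duplicates()

ddf = dd.from_pandas(pd.DataFrame({"a": [1, 1, 2, 2, 3]}), npartitions=2)

per_file = ddf.map_partitions(drop_dupe_per_partition)  # per-partition only, no shuffle
global_dedup = ddf.drop_duplicates()  # apply-concat-apply across all partitions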
Ah, also it looks like …
Hey @jrbourbeau, thanks for the input! I've got a few notes I plan to distill into a short summary status that I'll add here later today. I'll answer a couple of your questions (and I have a couple as well!).
Answered some things inline.
Alright, the current status is that the test passes, but it took a bit of a roundabout path to get there. Like I mentioned above, some files had parsing errors (discovered after trying to churn through the whole dataset for the first time). Also, some of the larger files in the dataset (the early ones that cram a whole year's or month's worth of data into a single file) would cause the worker to run out of memory. I solved the former by essentially creating another mini workflow (code below) which just determines which files cause the parsing exception, using a big list of delayeds; Coiled was magic for this (once the bad files were determined I hardcoded them into the test). I solved the latter by omitting the first ~5% of the files. As mentioned, things pass, but I feel the need to polish this a bit and do something more interesting before saving to Parquet (now that it at least works!), or perhaps cut my losses and find a better public CSV dataset. I'll dive back in when I'm back from leave if this is still open.

Code used to determine bad files

import s3fs
import distributed
import coiled
from dask.delayed import delayed
import pandas as pd
cluster_kwargs = dict(
package_sync=True,
wait_for_workers=False,
scheduler_vm_types=["m6i.large"],
backend_options=dict(region="us-east-1"),
worker_vm_types=["m6i.large"],
)
fs = s3fs.S3FileSystem(anon=True)
raw_list = fs.ls("s3://gdelt-open-data/events/")[120:]
@delayed
def try_to_read_csv(filename):
COLUMNSV1 = {
"GlobalEventID": "Int64",
"Day": "Int64",
"MonthYear": "Int64",
"Year": "Int64",
"FractionDate": "Float64",
"Actor1Code": "string[pyarrow]",
"Actor1Name": "string[pyarrow]",
"Actor1CountryCode": "string[pyarrow]",
"Actor1KnownGroupCode": "string[pyarrow]",
"Actor1EthnicCode": "string[pyarrow]",
"Actor1Religion1Code": "string[pyarrow]",
"Actor1Religion2Code": "string[pyarrow]",
"Actor1Type1Code": "string[pyarrow]",
"Actor1Type2Code": "string[pyarrow]",
"Actor1Type3Code": "string[pyarrow]",
"Actor2Code": "string[pyarrow]",
"Actor2Name": "string[pyarrow]",
"Actor2CountryCode": "string[pyarrow]",
"Actor2KnownGroupCode": "string[pyarrow]",
"Actor2EthnicCode": "string[pyarrow]",
"Actor2Religion1Code": "string[pyarrow]",
"Actor2Religion2Code": "string[pyarrow]",
"Actor2Type1Code": "string[pyarrow]",
"Actor2Type2Code": "string[pyarrow]",
"Actor2Type3Code": "string[pyarrow]",
"IsRootEvent": "Int64",
"EventCode": "string[pyarrow]",
"EventBaseCode": "string[pyarrow]",
"EventRootCode": "string[pyarrow]",
"QuadClass": "Int64",
"GoldsteinScale": "Float64",
"NumMentions": "Int64",
"NumSources": "Int64",
"NumArticles": "Int64",
"AvgTone": "Float64",
"Actor1Geo_Type": "Int64",
"Actor1Geo_Fullname": "string[pyarrow]",
"Actor1Geo_CountryCode": "string[pyarrow]",
"Actor1Geo_ADM1Code": "string[pyarrow]",
"Actor1Geo_Lat": "Float64",
"Actor1Geo_Long": "Float64",
"Actor1Geo_FeatureID": "string[pyarrow]",
"Actor2Geo_Type": "Int64",
"Actor2Geo_Fullname": "string[pyarrow]",
"Actor2Geo_CountryCode": "string[pyarrow]",
"Actor2Geo_ADM1Code": "string[pyarrow]",
"Actor2Geo_Lat": "Float64",
"Actor2Geo_Long": "Float64",
"Actor2Geo_FeatureID": "string[pyarrow]",
"ActionGeo_Type": "Int64",
"ActionGeo_Fullname": "string[pyarrow]",
"ActionGeo_CountryCode": "string[pyarrow]",
"ActionGeo_ADM1Code": "string[pyarrow]",
"ActionGeo_Lat": "Float64",
"ActionGeo_Long": "Float64",
"ActionGeo_FeatureID": "string[pyarrow]",
"DATEADDED": "Int64",
"SOURCEURL": "string[pyarrow]",
}
try:
df = pd.read_csv(
f"s3://{filename}",
sep="\t",
names=COLUMNSV1.keys(),
dtype=COLUMNSV1,
storage_options={"anon": True},
)
good = (True, filename)
del df
except Exception as err:
good = (False, filename, err)
return good
tries = [try_to_read_csv(x) for x in raw_list]
# ipython -i <file.py>
# >>> cluster = coiled.Cluster(**cluster_kwargs)
# >>> client = cluster.get_client()
# >>> tries_computed = client.compute(tries)
# >>> results = [tc.result() for tc in tries_computed]
# >>> bad = [x for x in results if not x[0]]
# >>> bad_files = [x[1] for x in bad]
This workflow reads the GDELT project CSV dataset (tab-separated, hence sep="\t") from S3. The plan is to do some cleaning and save the transformed data to Parquet.
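A rough sketch of what the intended pipeline might look like (the bucket path and dtype mapping come from the snippet above; the file glob, output location, and exact cleaning step are placeholders, not the PR's final code):

import dask.dataframe as dd

# COLUMNSV1 is the column-name -> dtype mapping from the snippet above
df = dd.read_csv(
    "s3://gdelt-open-data/events/*.csv",  # placeholder glob over the dataset
    sep="\t",
    names=list(COLUMNSV1.keys()),
    dtype=COLUMNSV1,
    storage_options={"anon": True},
)
df = df.map_partitions(drop_dupe_per_partition)  # per-file cleaning, as discussed above
df.to_parquet("s3://<scratch-bucket>/gdelt/", write_index=False)  # placeholder destination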