When reading a CSV from MinIO over an S3 connection, column headers are not parsed correctly #2809

Closed
mskyttner opened this issue Jul 21, 2023 · 3 comments
Labels
Type:Bug Something isn't working

Comments

@mskyttner

Unlike with the https connection variant, when making an S3 connection to a CSV served by a MinIO-backed S3 endpoint, the first row (the column headers) is not parsed correctly. Instead, the headers appear as row 1 of the data.

  1. Create an S3 connection to a CSV file on a MinIO server. This cannot be done through the web UI or the Rill CLI, so edit the sources/my_source.yaml file to add the "endpoint" setting, which is required for non-AWS S3 connections:
type: "s3"
uri: "s3://{MY_BUCKET}/{MY_CSV_FILE}"
endpoint: "{FQDN_S3_MINIO}"

Also ensure that at least these environment variables are set:

export AWS_ACCESS_KEY_ID={your_minio_account}
export AWS_SECRET_ACCESS_KEY={your_minio_pass}
export AWS_S3_ENDPOINT={fqdn.without.s3.default.region.of.public.minio.server}
export AWS_DEFAULT_REGION={subdomain.for.s3.default.region}
  2. Inspect row 1 of the data in Rill: the column headings appear there (a quick sanity check of the source object itself is sketched below).
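
To rule out the object store itself, you can stream the first lines of the file through the same endpoint outside Rill. A rough sketch, assuming the aws CLI is available and the placeholders above are filled in:

# stream the object through the MinIO endpoint and print the first two lines
aws --endpoint-url "https://{FQDN_S3_MINIO}" s3 cp "s3://{MY_BUCKET}/{MY_CSV_FILE}" - | head -n 2

If the header line shows up here, the object itself is fine and the problem is in how the CSV is parsed.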

Expected behavior

For column headers to appear "as usual", like when using http(s) to access the file.

Screenshots

(screenshot: the column header values appearing as the first data row)

Desktop (please complete the following information):

  • OS: Linux
  • Browser: Firefox
  • Version: latest nightly

Additional context

Using the MinIO client ("mc", the equivalent of the aws CLI) to view the first row of the file returns the header columns:

$ mc head -n 1 kthb/kthcorpus/projects_case.csv
efecte_id,Project ID,Name,Status,type,Funding Organisation,program,subprogram,subprogram_category,subprogram_category_description,co_funding_org,school,cost_center_school,Primary Researcher,username,Project Number,Agresso Number,beg,end,kth_grant_amount,sdg,related_project,Role,dep,cost_center_dep,Other participating Schools,cost_center_other,external_coordinator,counterpart,duration,dep_slug,dep_code,dep_divaorg,dep_short,dep_desc,school_slug,school_code,school_divaorg,school_short,school_desc

Possibly related issue

#1967

mskyttner added the Type:Bug label on Jul 21, 2023
@k-anshul
Member

@mskyttner This looks like a header-detection limitation in CSV parsing. If all column values in the CSV are text, DuckDB cannot determine whether the first row is a header or not. You can override this behaviour by setting the following in your my_source.yaml file:

type: "s3"
uri: "s3://{MY_BUCKET}/{MY_CSV_FILE}"
endpoint: "{FQDN_S3_MINIO}"
duckdb:
  header: true

This should ensure that the first row is detected as the header.
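
For context, this is roughly equivalent to forcing the header option on DuckDB's CSV reader. As an illustration outside Rill (not the exact SQL Rill generates), assuming a recent duckdb CLI and a local copy of the file:

# illustrative only: force DuckDB to treat the first row as the header
duckdb -c "SELECT * FROM read_csv_auto('projects_case.csv', header=true) LIMIT 5;"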

@mskyttner
Author

Thanks for this, I will try it out on Monday!

@mskyttner
Author

Thanks for the tip. Specifying the setting explicitly for the S3 connection takes care of the issue, in the sense that the header row does not appear as row 1 in the table.

It looks like my remaining issue is with duckdb's read_csv_auto type detection, which treats all my columns as varchar; readr, by contrast, picks up a couple of date and numeric columns for the same CSV.
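
In case it helps someone else, DuckDB does let you pin individual column types when reading a CSV. A rough sketch against a local copy of the file (the column names are just examples from my header row, and I have not checked whether Rill's duckdb: block forwards this option):

# illustrative only: override the detected types for a few columns
duckdb -c "SELECT * FROM read_csv_auto('projects_case.csv', header=true, types={'beg': 'DATE', 'end': 'DATE', 'kth_grant_amount': 'DOUBLE'}) LIMIT 5;"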

I think I will switch to parquet instead to work around the type-detection issue for the CSVs I use, while waiting for future improvements in that area.
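
For reference, the conversion itself is a one-liner. A sketch, assuming the duckdb CLI and a local copy of the CSV:

# illustrative only: write the CSV out as Parquet so the detected types travel with the file
duckdb -c "COPY (SELECT * FROM read_csv_auto('projects_case.csv', header=true)) TO 'projects_case.parquet' (FORMAT PARQUET);"

The resulting .parquet file can then be uploaded to the bucket and referenced from the source uri.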

Please feel free to close the issue.
