When reading a CSV from MinIO over an S3 connection, column headers are not parsed correctly #2809

Closed
mskyttner opened this issue Jul 21, 2023 · 3 comments
Labels
Type:Bug Something isn't working

Comments

@mskyttner

Unlike with the https connection variant, when making an S3 connection to a CSV served by a MinIO-backed S3 endpoint, the first row (the column headers) is not parsed correctly. Instead, the headers appear as row 1 of the data.

  1. Create an S3 connection to a CSV file on a MinIO server. This cannot be done through the web UI or the Rill CLI, so edit the sources/my_source.yaml file to add the "endpoint" setting, which is required for non-AWS S3 connections:
type: "s3"
uri: "s3://{MY_BUCKET}/{MY_CSV_FILE}"
endpoint: "{FQDN_S3_MINIO}"

Also ensure that at least these environment variables are set:

export AWS_ACCESS_KEY_ID={your_minio_account}
export AWS_SECRET_ACCESS_KEY={your_minio_pass}
export AWS_S3_ENDPOINT={fqdn.without.s3.default.region.of.public.minio.server}
export AWS_DEFAULT_REGION={subdomain.for.s3.default.region}
  2. Inspect row 1 of the data in Rill: the column headings appear there (a quick sanity check of the source object itself is sketched below).
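
To rule out the object store itself, you can stream the first lines of the file through the same endpoint outside Rill. A rough sketch, assuming the aws CLI is available and the placeholders above are filled in:

# stream the object through the MinIO endpoint and print the first two lines
aws --endpoint-url "https://{FQDN_S3_MINIO}" s3 cp "s3://{MY_BUCKET}/{MY_CSV_FILE}" - | head -n 2

If the header line shows up here, the object itself is fine and the problem is in how the CSV is parsed.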

Expected behavior

For column headers to appear "as usual", like when using http(s) to access the file.

Screenshots

(screenshot: the column header values appearing as the first data row)

Desktop (please complete the following information):

  • OS: Linux
  • Browser: Firefox
  • Version: latest nightly

Additional context

Using the MinIO client ("mc", the equivalent of the aws CLI) to view the first row of the file returns the header columns:

$ mc head -n 1 kthb/kthcorpus/projects_case.csv
efecte_id,Project ID,Name,Status,type,Funding Organisation,program,subprogram,subprogram_category,subprogram_category_description,co_funding_org,school,cost_center_school,Primary Researcher,username,Project Number,Agresso Number,beg,end,kth_grant_amount,sdg,related_project,Role,dep,cost_center_dep,Other participating Schools,cost_center_other,external_coordinator,counterpart,duration,dep_slug,dep_code,dep_divaorg,dep_short,dep_desc,school_slug,school_code,school_divaorg,school_short,school_desc

Possibly related issue

#1967

mskyttner added the Type:Bug label on Jul 21, 2023
@k-anshul
Member

@mskyttner This looks like a header-detection limitation in CSV parsing. If all column values in the CSV are text, DuckDB cannot determine whether the first row is a header or not. You can override this behaviour by setting the following in your my_source.yaml file:

type: "s3"
uri: "s3://{MY_BUCKET}/{MY_CSV_FILE}"
endpoint: "{FQDN_S3_MINIO}"
duckdb:
  header: true

This should ensure that the first row is detected as the header.
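
For context, this is roughly equivalent to forcing the header option on DuckDB's CSV reader. As an illustration outside Rill (not the exact SQL Rill generates), assuming a recent duckdb CLI and a local copy of the file:

# illustrative only: force DuckDB to treat the first row as the header
duckdb -c "SELECT * FROM read_csv_auto('projects_case.csv', header=true) LIMIT 5;"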

@mskyttner
Author

Thanks for this, I will try it out on Monday!

@mskyttner
Author

Thanks for the tip. Specifying the setting explicitly for the S3 connection takes care of the issue, in the sense that the header row does not appear as row 1 in the table.

It looks like my remaining issue is with duckdb's read_csv_auto type detection, which treats all my columns as varchar; readr, by contrast, picks up a couple of date and numeric columns for the same CSV.
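
In case it helps someone else, DuckDB does let you pin individual column types when reading a CSV. A rough sketch against a local copy of the file (the column names are just examples from my header row, and I have not checked whether Rill's duckdb: block forwards this option):

# illustrative only: override the detected types for a few columns
duckdb -c "SELECT * FROM read_csv_auto('projects_case.csv', header=true, types={'beg': 'DATE', 'end': 'DATE', 'kth_grant_amount': 'DOUBLE'}) LIMIT 5;"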

I think I will switch to parquet instead to work around the type-detection issue for the CSVs I use, while waiting for future improvements in that area.
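
For reference, the conversion itself is a one-liner. A sketch, assuming the duckdb CLI and a local copy of the CSV:

# illustrative only: write the CSV out as Parquet so the detected types travel with the file
duckdb -c "COPY (SELECT * FROM read_csv_auto('projects_case.csv', header=true)) TO 'projects_case.parquet' (FORMAT PARQUET);"

The resulting .parquet file can then be uploaded to the bucket and referenced from the source uri.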

Please feel free to close the issue.
