feat: Added BOM capability for output files (#1267) #1274

Merged on Feb 15, 2025 (7 commits)


alvaro-osvaldo-tm
Contributor

@alvaro-osvaldo-tm alvaro-osvaldo-tm commented Feb 9, 2025

closes #1267

  • Added the '--add-bom' parameter to most utilities

- Added the '--add-bom' parameter to most utilities

Signed-off-by: Álvaro Osvaldo <[email protected]>
@alvaro-osvaldo-tm alvaro-osvaldo-tm changed the title feat: Added BOM capability for output files (1267) feat: Added BOM capability for output files (#1267) Feb 9, 2025
@alvaro-osvaldo-tm
Contributor Author

alvaro-osvaldo-tm commented Feb 9, 2025

Implementation

Implemented an option to add the UTF-8 Byte Order Mark (BOM) to the output of all utilities
except csvpy and sql2csv.

Solution

  • The UTF-8 BOM is added only if the '--add-bom' parameter is specified; otherwise the output is unchanged.
  • The parameter's configuration and execution are implemented in the file csvkit/features/AddBOM.py.
    I used a 'feature' pattern to avoid 'spaghetti code'; it is no problem if the code needs to be moved into the CSVKitUtility class.
    • The advantage of this approach is that the code is clearer.
    • However, CSVToolKit is not prepared for it, as seen in the 'argument' method. It also costs a few more CPU cycles, which may be noticeable if the user processes a HUGE number of files.
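
For readers unfamiliar with the mechanism, here is a minimal sketch of the behaviour described above. It is not the code from this PR. It shows an opt-in flag that, when set, writes the three UTF-8 BOM bytes to the underlying byte stream before any text is written. The helper names `build_parser` and `write_output` are hypothetical, and the sketch assumes the output is a text stream that exposes a `.buffer` attribute, as `sys.stdout` and `io.TextIOWrapper` do.

```python
import argparse
import codecs
import io


def build_parser():
    # Hypothetical parser; only the flag name '--add-bom' comes from this PR.
    parser = argparse.ArgumentParser()
    parser.add_argument(
        '--add-bom', dest='add_bom', action='store_true',
        help='Add the UTF-8 byte order mark (BOM) to the output.')
    return parser


def write_output(rows, output, add_bom=False):
    # Assumption: `output` is a text stream wrapping a byte stream (.buffer).
    if add_bom:
        # The BOM must go out as raw bytes, before any text reaches the stream.
        output.buffer.write(codecs.BOM_UTF8)
    for row in rows:
        output.write(','.join(row) + '\n')
    output.flush()


# Exercise the sketch against an in-memory byte stream.
args = build_parser().parse_args(['--add-bom'])
raw = io.BytesIO()
out = io.TextIOWrapper(raw, encoding='utf-8', newline='')
write_output([['a', 'b'], ['1', '2']], out, add_bom=args.add_bom)
result = raw.getvalue()
```

Without `--add-bom`, `args.add_bom` is `False` and the output bytes are the plain UTF-8 encoding of the rows.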

Tests

  • I attached an end-to-end test script.

  • No unit tests were written because the tests use 'StringIO' as the 'input file', but the BOM needs to be added as bytes using 'TextIOWrapper'.

    • If you want, I can implement a conversion in 'CSVToolkit' and 'LazyFile' to enable the tests.
  • All PyTests and end-to-end tests passed in the following versions:

    • Python 3.8.20
    • Python 3.9.21
    • Python 3.10.16
    • Python 3.11.11
    • Python 3.12.8
      • Except csvgrep and csvcut, due to a CSVToolkit bug.
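
The unit-testing obstacle mentioned above can be illustrated with a short sketch. It assumes the BOM is written through the text stream's `.buffer` attribute, and the helper names `make_byte_backed_output` and `emit_with_bom` are hypothetical. A plain `io.StringIO` has no byte buffer, whereas an `io.TextIOWrapper` around an `io.BytesIO` does, so a test could capture the exact bytes written.

```python
import codecs
import io


def make_byte_backed_output():
    # Hypothetical test helper: a text stream that, unlike StringIO,
    # sits on top of a byte buffer the test can inspect afterwards.
    raw = io.BytesIO()
    text = io.TextIOWrapper(raw, encoding='utf-8', newline='')
    return text, raw


def emit_with_bom(output, text):
    # Assumed shape of the feature: BOM as bytes first, then the text.
    output.buffer.write(codecs.BOM_UTF8)
    output.write(text)
    output.flush()


# StringIO is text-only and exposes no underlying byte buffer.
stringio_has_buffer = hasattr(io.StringIO(), 'buffer')

output, raw = make_byte_backed_output()
emit_with_bom(output, 'a,b\n1,2\n')
written = raw.getvalue()

# Decoding with 'utf-8-sig' strips a leading BOM, recovering the text.
decoded = written.decode('utf-8-sig')
```

A test written this way can assert both that the output starts with `codecs.BOM_UTF8` and that the remaining content is unchanged.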

Checklist

  • Unit Testing
  • End-to-end Testing

Considerations

References

Signed-off-by: Álvaro Osvaldo <[email protected]>
csvkit/cli.py: two review threads (outdated, resolved)
- Code inlined to 'cli.py' script.
- Configured 'csvpy' and 'sql2csv' to ignore 'add-bom'

Signed-off-by: Álvaro Osvaldo <[email protected]>
@alvaro-osvaldo-tm
Contributor Author

alvaro-osvaldo-tm commented Feb 14, 2025

Done. All code has been refactored to fit the current code organisation and passes the 'end-to-end' and 'pytest' tests, except on Python 3.12, as described in the open issues.

Let me know if you need anything more.

@coveralls

coveralls commented Feb 15, 2025

Coverage Status

coverage: 90.435% (+0.03%) from 90.408%
when pulling 03bde26 on alvaro-osvaldo-tm:proposals/1267
into bd6095d on wireservice:master.

@jpmckinney jpmckinney merged commit 19810a3 into wireservice:master Feb 15, 2025
10 of 19 checks passed
Development

Successfully merging this pull request may close these issues.

Can in2csv add a byte order mark (BOM) so that when opening csv in Excel it correctly formats unicode text?
3 participants