Datasets quick fixes + SQLAlchemy addition + CFF updates #339

Merged
merged 4 commits into from
Sep 27, 2024
43 changes: 43 additions & 0 deletions CITATION.cff
@@ -116,6 +116,49 @@ authors:
name-particle: van
given-names: Sander
orcid: "https://orcid.org/0000-0001-6159-041X"
- affiliation: "Journal of Open Source Software"
family-names: Niemeyer
given-names: Kyle
- affiliation: "Netherlands eScience Center"
family-names: Wehner
given-names: Jens
- affiliation: "Netherlands eScience Center"
family-names: Burg
name-particle: van der
given-names: Sven
- affiliation: "Netherlands eScience Center"
family-names: Siqueira
given-names: Abel
- affiliation: "Netherlands eScience Center"
family-names: Vreede
given-names: Barbara
- affiliation: "Netherlands eScience Center"
family-names: Schnober
given-names: Carsten
- affiliation: "Netherlands eScience Center"
family-names: Chandramouli
given-names: Pranav
- affiliation: "Utrecht University"
family-names: Oberman
given-names: Hanne
- affiliation: "Netherlands eScience Center"
family-names: Lüken
given-names: Malte
- affiliation: "Netherlands eScience Center"
family-names: Isazi
given-names: Alessio
- affiliation: "Datadog, Inc."
family-names: Lev
given-names: Ofek
- affiliation: "Netherlands eScience Center"
family-names: Cahen
given-names: Ewan
- affiliation: "Netherlands eScience Center"
family-names: Ali
given-names: Suvayu
- affiliation: "Netherlands eScience Center"
family-names: Hafner
given-names: Flavio
- affiliation: "Netherlands eScience Center"
family-names: Cushing
given-names: Reggie
15 changes: 13 additions & 2 deletions best_practices/datasets.md
@@ -1,5 +1,7 @@
# Working with tabular data

*Page maintainers: Suvayu Ali* [@suvayu](https://github.com/suvayu) *, Flavio Hafner* [@f-hafner](https://github.com/f-hafner) *and Reggie Cushing* [@recap](https://github.com/recap)

There are several solutions available to you as an RSE, each with its own pros and cons. Evaluate which one works best for your project and your project partners, and pick one. Sometimes you may need to combine two different types of technologies. Here are some examples from our experience.

You will encounter datasets in various file formats like:
@@ -40,14 +42,23 @@ SQLite is a transactional database, so if you have a dataset that is changing wi
- For both DuckDB and SQLite, unique indexes allow you to ensure data integrity
- For SQLite, indexes are crucial to improve the performance of queries. However, having more indexes makes writing new records to the database slower. So it's again a trade-off between query and write speed.
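The two points about indexes can be sketched with Python's built-in `sqlite3` module. This is a minimal illustration, not part of the original guide: the table `samples`, its columns, and the index name `idx_samples_code` are hypothetical names chosen for the example.

```python
import sqlite3

# In-memory database with a hypothetical table, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE samples (id INTEGER PRIMARY KEY, sample_code TEXT, value REAL)"
)

# A unique index rejects duplicate values in the indexed column.
conn.execute("CREATE UNIQUE INDEX idx_samples_code ON samples (sample_code)")
conn.execute("INSERT INTO samples (sample_code, value) VALUES ('s-001', 1.5)")

try:
    conn.execute("INSERT INTO samples (sample_code, value) VALUES ('s-001', 2.5)")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    # The second insert violates the unique index and is refused.
    duplicate_rejected = True

# EXPLAIN QUERY PLAN reports whether a lookup uses the index;
# the human-readable detail is the last column of each row.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT value FROM samples WHERE sample_code = 's-001'"
).fetchall()
uses_index = any("idx_samples_code" in row[-1] for row in plan)

row_count = conn.execute("SELECT COUNT(*) FROM samples").fetchone()[0]
```

Every additional index has to be updated on each insert, which is the write-speed cost mentioned above.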

# Useful libraries

## Database APIs

- [SQLAlchemy](https://www.sqlalchemy.org/)
- In Python, interfacing to SQL databases like SQLite, MySQL or PostgreSQL is often done using [SQLAlchemy](https://www.sqlalchemy.org/), which is an Object Relational Mapper (ORM) that allows you to map tables to Python classes. Note that you still need to use a lot of manual SQL outside of Python to manage the database. However, SQLAlchemy allows you to use the data in a Pythonic way once you have the database layout figured out.
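As a rough sketch of what mapping a table to a Python class looks like, the example below uses SQLAlchemy's declarative ORM (assuming SQLAlchemy 1.4 or newer) with an in-memory SQLite database. The `Measurement` class, its columns, and the data are hypothetical and only serve the illustration. For a toy schema like this the table can be created from the class, whereas a real database layout typically needs more deliberate management.

```python
from sqlalchemy import Column, Integer, String, create_engine, select
from sqlalchemy.orm import Session, declarative_base

Base = declarative_base()


class Measurement(Base):
    """Hypothetical table mapped to a Python class."""

    __tablename__ = "measurements"

    id = Column(Integer, primary_key=True)
    station = Column(String, nullable=False)
    value = Column(Integer, nullable=False)


# In-memory SQLite database; swap the URL for MySQL or PostgreSQL.
engine = create_engine("sqlite:///:memory:")
Base.metadata.create_all(engine)  # creates the table from the class definition

with Session(engine) as session:
    # Rows are created as ordinary Python objects.
    session.add_all(
        [
            Measurement(station="a", value=3),
            Measurement(station="b", value=7),
        ]
    )
    session.commit()

    # Queries are built from class attributes instead of SQL strings.
    stmt = select(Measurement).where(Measurement.value > 5)
    stations = [m.station for m in session.execute(stmt).scalars()]
```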

Member

I was thinking maybe piccolo deserves a mention here, considering it's much simpler, and probably better fit for smaller projects.

Member Author

Please, by all means, add it. I am not at all an expert on databases, this is just a thing I happen to have used once.

Member Author

Or perhaps we can do that in another PR and I'll just make an issue about it first to remember (maybe others have additional ideas that they can discuss there).

Member Author

Done #346. Thanks for the review!

## Data processing libraries on a single machine
- Pandas
- The standard tool for working with dataframes, and widely used in analytics or machine learning workflows. Note however how Pandas uses memory, because certain APIs create copies, while others do not. So if you are chaining multiple operations, it is preferable to use APIs that avoid copies.
- Vaex
- Vaex is an alternative that focuses on out-of-core processing (larger than memory), and has some lazy evaluation capabilities.
- Polars
- An alternative to Pandas (started in 2020), which is primarily written in Rust. Compared to Pandas, it is multi-threaded and does lazy evaluation with query optimisation, so it is often much more performant. However, since it is newer, its documentation is not as complete. It also allows you to write your own custom extensions in Rust.
- [Apache Datafusion](https://datafusion.apache.org/)
- A very fast, extensible query engine for building high-quality data-centric systems in [Rust](http://rustlang.org/), using the [Apache Arrow](https://arrow.apache.org/) in-memory format. DataFusion offers SQL and Dataframe APIs, excellent [performance](https://benchmark.clickhouse.com/), built-in support for CSV, Parquet, JSON, and Avro, extensive customization, and a great community.

## Distributed/multi-node data processing libraries
- Dask
- `dask.dataframe` and `dask.array` provide APIs that closely mirror pandas and numpy respectively, making it easy to switch.