Reading specific rows from a large sas7bdat file #42

Open
BERENZ opened this issue Sep 10, 2024 · 5 comments
Comments

@BERENZ

BERENZ commented Sep 10, 2024

Is there a way to add functionality to read specific rows from a large sas7bdat file? The issue I'm facing is that I have large SAS files (around 10GB) along with text files (an exact, flat copy of the SAS file). Based on the text file, I can specify the subset of rows that I'm interested in (around 10% of the file).

Another option would be to specify a filter while reading, for example, selecting rows based on the values in a column. However, I understand that this may be more challenging to implement.

@junyuan-chen
Owner

Hi! Have you tried the keyword arguments row_limit and row_offset? They should allow reading just a portion of the file.

@BERENZ
Author

BERENZ commented Sep 11, 2024

Hi, yes, but that only works if the rows I want to select are consecutive. In my case, they're spread out across the dataset.

@junyuan-chen
Owner

@BERENZ All right, now I see your point. Filtering the rows of the data file by a general condition is not built into the parser. However, a workaround is to cut the file into partitions of consecutive rows, each small enough to fit in memory, and then filter the partitions one by one. Every row of the file is therefore still read into memory at some point, just not all at once.
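
The partition-and-filter workaround could be sketched roughly as follows. This is only an illustration in Python, not code from this package: `read_selected_rows` and `read_chunk` are hypothetical names, and `read_chunk(offset, limit)` stands in for whatever call reads `limit` consecutive rows starting at `offset` (for example, a wrapper around the reader with row_offset and row_limit). Treating row indices as 0-based is also an assumption.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Sequence


def read_selected_rows(
    read_chunk: Callable[[int, int], Sequence],  # hypothetical reader: (offset, limit) -> rows
    wanted: Iterable[int],                       # 0-based row indices to keep (assumption)
    chunk_size: int = 100_000,
) -> List:
    """Collect the wanted rows by reading one partition of consecutive rows at a time."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")

    # Group the wanted indices by the partition they fall into,
    # storing each one as a position relative to the partition start.
    local: Dict[int, List[int]] = defaultdict(list)
    for idx in set(wanted):
        if idx < 0:
            raise ValueError("row indices must be non-negative")
        local[idx // chunk_size].append(idx % chunk_size)

    selected = []
    for part in sorted(local):
        # Only one partition is held in memory at any time.
        chunk = read_chunk(part * chunk_size, chunk_size)
        for pos in sorted(local[part]):
            # The last partition may be shorter than chunk_size.
            if pos < len(chunk):
                selected.append(chunk[pos])
    return selected
```

Partitions that contain none of the wanted rows are never requested, but when the wanted rows are spread across the whole dataset, most partitions will still be read in full.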

@BERENZ
Author

BERENZ commented Sep 12, 2024

Sure, this is what I currently do (splitting the data into chunks). Do I understand correctly that making this possible would require changes to the underlying ReadStat C library?

@junyuan-chen
Owner

Yes. When reading files, the iteration across rows is handled within the C library, and there is no interface for skipping rows based on their values.
