clevercsv sniffer slows to a crawl on large-ish files (e.g. FEC data) #15
Hi @jlumbroso, thanks for the detailed bug report! You're describing an issue that I've been thinking about for a while but never had the time to seriously investigate, so I'm glad you raised this. Yes, CleverCSV can be quite slow on large files, and chunking the file during detection would be a good solution to this. I agree this is something the library should handle if it can. I should mention, though, that you can already run the detection on a smaller sample of the file yourself.

I would definitely be open to adding a fix for this to the package, but I'm a bit undecided about the best approach. In any case I'd want to do some testing on the accuracy of this method compared to reading the full file. An alternative that I've had in mind is to keep the …

All that said, I did some debugging on the specific file you mention, and it appears to suffer from the same issue as #13, but with the URL regex. So I'll first update the package with that regex and improve the documentation w.r.t. using a smaller sample on big files. Then I'd like to add some sort of benchmark to the repo that allows us to evaluate the performance of the chunking method (it'll likely be a few weeks before I get around to this, unfortunately).

If you'd like to work on adding your algorithm to the package, it would be great if you could prepare a pull request. We can discuss the implementation details there more easily (again, I think wrapping the detect method is probably the easiest at this point). How does that sound?
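As a concrete illustration of that workaround, here is a minimal sketch of running the detection on only the first part of a file. It assumes `clevercsv.Sniffer` mirrors the standard library's `csv.Sniffer` interface; the file name and sample size are placeholders.

```python
# Minimal sketch: detect the dialect from a bounded sample instead of the
# whole file. Assumes clevercsv.Sniffer mirrors the stdlib csv.Sniffer
# interface; "indiv80.txt" and SAMPLE_CHARS are placeholders.
import clevercsv

SAMPLE_CHARS = 1_000_000  # number of characters used for detection

with open("indiv80.txt", newline="") as fh:
    sample = fh.read(SAMPLE_CHARS)

dialect = clevercsv.Sniffer().sniff(sample)
print(dialect.delimiter, dialect.quotechar, dialect.escapechar)
```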
The URL regex also leads to catastrophic backtracking for the file in #15, which causes a massive slowdown.
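As a toy illustration of this failure mode (this is not clevercsv's actual URL regex), a nested quantifier over a near-matching input is enough to make the backtracking regex engine take exponential time:

```python
# Toy example of catastrophic backtracking (not clevercsv's URL regex):
# the nested repetition gives the engine exponentially many ways to split
# the input before it can conclude there is no match.
import re
import time

pattern = re.compile(r"^(a+)+$")   # pathological nested quantifier
text = "a" * 25 + "b"              # almost matches, then fails at the end

start = time.perf_counter()
pattern.match(text)                # forces the exponential backtracking
print(f"match attempt took {time.perf_counter() - start:.2f}s")
```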
* Update URL regex to avoid catastrophic backtracking and increase performance. See [issue #13](#13) and [issue #15](#15). Thanks to @kaskawu for the fix and @jlumbroso for re-raising the issue.
* Add ``num_chars`` keyword argument to ``read_as_dicts`` and ``csv2df`` wrappers.
* Improve documentation w.r.t. handling large files. Thanks to @jlumbroso for raising this issue.
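For reference, the ``num_chars`` option listed above can be used roughly as follows. This is only a sketch: the file name is a placeholder and the exact signature may differ between versions, so check the documentation for your release.

```python
# Sketch of the num_chars option: limit dialect detection to the first
# chunk of the file while still reading all of the records afterwards.
# "itcont.txt" is a placeholder file name.
import clevercsv

rows = clevercsv.read_as_dicts("itcont.txt", num_chars=1_000_000)
for row in rows:
    print(row)   # each record as a dict keyed by the header fields
    break
```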
Hi @jlumbroso, this took a bit longer than expected, but I've now added a comparison study to the repo (see here). This experiment evaluates the accuracy and runtime of dialect detection as a function of the number of lines used for the detection. I've included a version of the "infinite binary search" you suggested, dubbed … I changed the parameters of your algorithm a bit when testing, as the results depend on how many lines are used initially and how many steps are taken. In the comparison … I'm curious to hear your thoughts on this!
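For readers who want to reproduce the general idea of such an experiment, a rough sketch (not the repo's actual benchmark code, and again assuming the `clevercsv.Sniffer` interface) is to time the detection on growing prefixes of a file:

```python
# Rough sketch of a lines-used vs. runtime experiment (not the code from the
# repo's comparison study): time dialect detection on growing prefixes.
import time
import clevercsv


def time_detection(path, line_counts=(100, 500, 1000, 5000, 10000)):
    with open(path, newline="") as fh:
        lines = fh.readlines()
    results = []
    for n in line_counts:
        sample = "".join(lines[:n])
        start = time.perf_counter()
        dialect = clevercsv.Sniffer().sniff(sample)
        elapsed = time.perf_counter() - start
        results.append((n, elapsed, dialect.delimiter))
    return results
```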
Dear @GjjvdBurg, First, I apologize for leaving your first message without an answer. I think I wrote a long response and thought I had sent it, but since it was never submitted, I assume it just got lost in my tabs until the browser was closed and then forgotten... So I wanted to express my appreciation to you for following up on this. Next, congratulations on continuing to be data-driven in every aspect of this module: those comparison benchmarks are really very comprehensive and interesting!! 🥇 I think this is interesting enough that it could be a Medium article (let me know if you want to write one together). Here are some of my thoughts, happy to discuss more!

* For autodetection, accuracy >> efficiency. I think that for a robust auto-detection library like `clevercsv`, …
* Why is …
* Importantly, the reason I suggested …
* Minor nitpick: in the …
Hi @jlumbroso, thanks for your response; needless to say, I'm not that fast with responding myself either, for which I apologize. That doesn't mean, however, that I haven't thought about this problem in the meantime. I certainly think this would make an interesting (Medium) article, as there's definitely a non-trivial problem here.

I agree with your point on accuracy being more important than efficiency. Ideally (as we discussed above), CleverCSV should be able to determine when it is confident that the detected dialect is the right one. In general, the problem with reading a file partially is that evidence for a certain dialect can be hidden many rows down in the file. For example, the imdb example only reveals that it uses the escape character far down the file.

Regarding the accuracy figure you show, there are a number of things going on here. First, I made a mistake in the implementation of … What's noticeable is that starting from …

A couple of minor comments:
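To make the "hidden evidence" point above concrete, here is a small synthetic sketch (again assuming the `clevercsv.Sniffer` interface): the only row where quoting matters sits at the very end of the data, so detection on the head of the file has less evidence to work with than detection on the full content.

```python
# Synthetic illustration of evidence hidden deep in a file: only the last
# row uses quoting around an embedded delimiter, so a detector that reads
# just the head never sees it. Exact output depends on the detector version.
import clevercsv

plain = "\n".join(f"{i};item {i};100" for i in range(10_000))
tricky = '10000;"surname; first name";200'  # delimiter hidden inside quotes
content = plain + "\n" + tricky + "\n"

sniffer = clevercsv.Sniffer()
for label, sample in [("head only", content[:5_000]), ("full file", content)]:
    d = sniffer.sniff(sample)
    print(label, repr(d.delimiter), repr(d.quotechar))
```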
For completeness, when I run …, but after that it maxes out in performance. And only after 3,500 lines or so is there a runtime advantage. Now, of course, all of this depends on the specific parameters I've chosen for …
Hello,
This is a very neat project! I was thinking "I should collect a bunch of CSV files from the web and do statistics to see what the dialects are, and their predominance, to be able to better detect them" and then I found your paper and Python package! Congrats on this very nice contribution.
I am trying to see how `clevercsv` performs on FEC data. For instance, let's consider this file: https://www.fec.gov/files/bulk-downloads/1980/indiv80.zip

When I try to open the file with `clevercsv`, it takes an inordinate amount of time and seems to be hanging. So I tried to use the sniffer as suggested in your example Binder. It prints out this:

and then a while later (a few minutes later) starts printing:

I would say this takes about 30 minutes, and finally it concludes:
I think I understand what's going on: you designed this for small-ish datasets, and so you reprocess the whole file for every candidate dialect to determine which one makes the most sense.
I would be tempted to think this is because I feed the data as a variable `content`, following your example, rather than providing the filename directly. However, when I tried the `read_csv` method directly with the filename, it was also really, very, very slow. So I think that, in all situations currently, `clevercsv` trips on this file, and more generally on this type of file.

When I take the initiative to truncate the data arbitrarily, `clevercsv` works beautifully. But shouldn't the truncating be something the library, as opposed to the user, does? On the truncated data, `clevercsv` provides its answer in a few seconds:
@GjjvdBurg If this is not a known problem, may I suggest using some variation of "infinite binary search"?
I have implemented this algorithm here:
https://gist.github.com/jlumbroso/c123a30a2380b58989c7b12fe4b4f49e
When I run it on the above-mentioned file, it immediately (without futzing) produces the correct answer:
And on the off-chance you would like me to add this algorithm to the codebase, where would it go?
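For readers who don't want to follow the gist, here is a minimal sketch of the general idea (not the gist's exact code), assuming the `clevercsv.Sniffer` interface as in the earlier examples: sniff a growing prefix of the file and stop once two consecutive rounds agree on the dialect.

```python
# Illustrative sketch of the "infinite binary search" / doubling idea
# (not the gist's exact code): sniff a growing prefix of the file and stop
# once two consecutive rounds agree on the detected dialect.
import clevercsv


def sniff_incrementally(path, start_lines=100, growth=2, max_lines=100_000):
    sniffer = clevercsv.Sniffer()
    prev_key, best = None, None
    n = start_lines
    while n <= max_lines:
        with open(path, newline="") as fh:
            sample = "".join(line for line, _ in zip(fh, range(n)))
        best = sniffer.sniff(sample)
        key = (best.delimiter, best.quotechar, best.escapechar)
        if key == prev_key:
            break  # two consecutive rounds agree: detection has stabilized
        prev_key = key
        n *= growth
    return best
```

The stopping criterion used here (two consecutive agreements) is only one possible choice; the gist and the comparison study discussed above use their own parameters for the initial sample size and the number of doubling steps.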