Appending string Timeseries damages stored Timeseries #38
Comments
Hello capellino. Myself, I am not using the append() function, because:
To give an example, in my case I have:
Well, after all this talk, the question is: on which equality is pystore's append() based? If you have a look at the drop_duplicates documentation (I only checked pandas', not dask's, but I understand dask mirrors the pandas API), you will see that it identifies duplicates purely by the values in the columns, not by the index. So two rows can have different index entries but the same values, and it will still drop them as duplicates. Back to your example, maybe this is the reason. To finish: in my own case, I only use the write() function and do the appending myself in a few lines:
Hope this gives you some clues.
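The append-via-write() approach described above can be sketched in pandas as follows. This is an illustration, not the commenter's exact code: the rows are made up, and the final `collection.write(...)` call is shown only as a comment with hypothetical names.

```python
import pandas as pd

# Existing data already in the store, and the new rows to append
current = pd.DataFrame(
    {"price": [1.0, 2.0, 3.0]},
    index=pd.to_datetime(["2020-01-01", "2020-01-02", "2020-01-03"]),
)
new = pd.DataFrame(
    {"price": [3.5, 4.0]},
    index=pd.to_datetime(["2020-01-03", "2020-01-04"]),
)

# Combine, then keep only the last row for each duplicated index entry
# (deduplicating on the index, not on column values)
combined = pd.concat([current, new])
combined = combined[~combined.index.duplicated(keep="last")]

# Then overwrite the stored item, e.g. (hypothetical item name):
# collection.write("MY_ITEM", combined, overwrite=True)
```

Unlike drop_duplicates(), this keeps rows whose values repeat at different timestamps, which matters for timeseries.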
Thank you very much for your suggestions. I solved it by not using the append() function at all, as you suggested!
Just a naive question: why are you using drop_duplicates? Or maybe a better question: should we use drop_duplicates at all when the index is part of the data (e.g. a DatetimeIndex for a timeseries)?
Hi, there is a problem with the append function. The line

combined = dd.concat([current.data, new]).drop_duplicates(keep="last")

in the file collection.py should be substituted by code along the lines of:

idx_name = current.data.index.name

For further explanation, please refer to:
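The replacement snippet in this comment is cut off after its first line. A pandas sketch of the index-aware deduplication it hints at (dask mirrors this API; the data and column names here are made up, and this is not pystore's actual code) might look like:

```python
import pandas as pd

current_data = pd.DataFrame(
    {"symbol": ["A", "B", "B2"]},
    index=pd.DatetimeIndex(["2020-01-01", "2020-01-02", "2020-01-02"], name="time"),
)
new = pd.DataFrame(
    {"symbol": ["B3", "C"]},
    index=pd.DatetimeIndex(["2020-01-02", "2020-01-03"], name="time"),
)

# Deduplicate on the index column, not on the value columns:
# move the index into a column, drop duplicates on it, restore it
idx_name = current_data.index.name
combined = pd.concat([current_data, new]).reset_index()
combined = combined.drop_duplicates(subset=[idx_name], keep="last").set_index(idx_name)
```

With dask, `reset_index`/`drop_duplicates(subset=...)`/`set_index` follow the same pattern, though `set_index` triggers a shuffle and is comparatively expensive.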
In general, pandas indexes are not required to be unique and can contain repeated values. Therefore you need to remove duplicated index entries yourself if unique ids are needed.
In the SO post you link, they suggest the simpler and more efficient alternative to remove duplicate indices:

df = df[~df.index.duplicated(keep='first')]
You are right. The method I mentioned is easier for me to understand, but it is less efficient and less compact.
Yet I see that there is already logic to avoid duplicate values in a DatetimeIndex a bit above in the code. So drop_duplicates should probably only be applied when the index is not a DatetimeIndex.
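The guard described above could be sketched like this (a hypothetical helper, not pystore's actual append implementation; shown with pandas, though the same check would apply to a dask index):

```python
import pandas as pd

def append_dedup(current: pd.DataFrame, new: pd.DataFrame) -> pd.DataFrame:
    """Append `new` to `current`, deduplicating appropriately.

    For a DatetimeIndex, the timestamp is the row's identity, so we drop
    duplicated index entries and keep repeated values. Otherwise we fall
    back to value-based drop_duplicates().
    """
    combined = pd.concat([current, new])
    if isinstance(combined.index, pd.DatetimeIndex):
        combined = combined[~combined.index.duplicated(keep="last")]
    else:
        combined = combined.drop_duplicates(keep="last")
    return combined

# Repeated string values at distinct timestamps survive the datetime path
cur = pd.DataFrame({"v": ["x", "x"]},
                   index=pd.DatetimeIndex(["2020-01-01", "2020-01-02"]))
extra = pd.DataFrame({"v": ["x"]}, index=pd.DatetimeIndex(["2020-01-03"]))
result = append_dedup(cur, extra)  # keeps all three rows
```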
Hi everyone. I'm using the useful pystore library to store very long timeseries, including ones with string values.
I noticed very strange behaviour when I try to append to an existing string Timeseries item, as in the following example code. In particular, when appending to an item in the database, the resulting dataframe loses many of its values.
Instead, it works perfectly if I replace the strings with floats:
Is the pystore library also able to store string values?
Thank you
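The data loss described here is consistent with the value-based deduplication discussed in the comments: drop_duplicates compares column values only, so a string series that repeats the same value at different timestamps loses rows. A pandas demonstration of the effect (illustrative data, not pystore itself):

```python
import pandas as pd

# A string timeseries where the value "on" recurs at different timestamps
ts = pd.Series(
    ["on", "off", "on", "on"],
    index=pd.date_range("2020-01-01", periods=4, freq="D"),
    name="state",
)
df = ts.to_frame()

# Value-based dedup ignores the index entirely: every repeated "on"
# except the last one disappears, even though the timestamps differ
deduped = df.drop_duplicates(keep="last")
```

Only two rows survive: "off" (2020-01-02) and the last "on" (2020-01-04). Floats from a real-world series rarely repeat exactly, which is why the same append appears to work when strings are replaced by floats.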