Appending string Timeseries damages stored Timeseries #38
Comments
Hello capellino. Myself, I am not using the append() function, because:
To give an example, in my case I have:
Well, after all this talk, the question is: on which equality is pystore's append() based? If you have a look at the drop_duplicates documentation (I only checked pandas', not dask's, but I understand dask mirrors the pandas API), you will see that it identifies duplicates purely by the values in the columns, not by the index. So two rows can have different index entries but the same values, and it will still drop them as duplicates. Back to your example, maybe this is the reason. To finish: in my own case, I only use the write() function and do the appending myself in a few lines:
Hope this gives you some clues.
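The append-via-write() approach described above can be sketched in pandas as follows. This is an illustration, not the commenter's exact code: the rows are made up, and the final `collection.write(...)` call is shown only as a comment with hypothetical names.

```python
import pandas as pd

# Existing data already in the store, and the new rows to append
current = pd.DataFrame(
    {"price": [1.0, 2.0, 3.0]},
    index=pd.to_datetime(["2020-01-01", "2020-01-02", "2020-01-03"]),
)
new = pd.DataFrame(
    {"price": [3.5, 4.0]},
    index=pd.to_datetime(["2020-01-03", "2020-01-04"]),
)

# Combine, then keep only the last row for each duplicated index entry
# (deduplicating on the index, not on column values)
combined = pd.concat([current, new])
combined = combined[~combined.index.duplicated(keep="last")]

# Then overwrite the stored item, e.g. (hypothetical item name):
# collection.write("MY_ITEM", combined, overwrite=True)
```

Unlike drop_duplicates(), this keeps rows whose values repeat at different timestamps, which matters for timeseries.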
Thank you very much for your suggestions. I solved it by not using the append() function at all, as you suggested!
Just a naive question: why are you using drop_duplicates? Or maybe a better question: should we use drop_duplicates at all when the index is part of the data (e.g. a DatetimeIndex for a timeseries)?
Hi, there is a problem with the append function. The line

combined = dd.concat([current.data, new]).drop_duplicates(keep="last")

in the file collection.py should be substituted by code along the lines of:

idx_name = current.data.index.name

For further explanation, please refer to:
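The replacement snippet in this comment is cut off after its first line. A pandas sketch of the index-aware deduplication it hints at (dask mirrors this API; the data and column names here are made up, and this is not pystore's actual code) might look like:

```python
import pandas as pd

current_data = pd.DataFrame(
    {"symbol": ["A", "B", "B2"]},
    index=pd.DatetimeIndex(["2020-01-01", "2020-01-02", "2020-01-02"], name="time"),
)
new = pd.DataFrame(
    {"symbol": ["B3", "C"]},
    index=pd.DatetimeIndex(["2020-01-02", "2020-01-03"], name="time"),
)

# Deduplicate on the index column, not on the value columns:
# move the index into a column, drop duplicates on it, restore it
idx_name = current_data.index.name
combined = pd.concat([current_data, new]).reset_index()
combined = combined.drop_duplicates(subset=[idx_name], keep="last").set_index(idx_name)
```

With dask, `reset_index`/`drop_duplicates(subset=...)`/`set_index` follow the same pattern, though `set_index` triggers a shuffle and is comparatively expensive.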
In general, pandas indexes are not required to be unique and can contain repeated values. Therefore you need to remove duplicated index entries yourself if unique ids are needed.
In the SO post you link, they suggest the simpler and more efficient alternative to remove duplicate indices:

df = df[~df.index.duplicated(keep='first')]
You are right. The method I mentioned is easier for me to understand, but it is less efficient and less compact.
Yet I see that there is already logic to avoid duplicate values in a DatetimeIndex a bit above in the code. So drop_duplicates should probably only be applied when the index is not a DatetimeIndex.
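The guard described above could be sketched like this (a hypothetical helper, not pystore's actual append implementation; shown with pandas, though the same check would apply to a dask index):

```python
import pandas as pd

def append_dedup(current: pd.DataFrame, new: pd.DataFrame) -> pd.DataFrame:
    """Append `new` to `current`, deduplicating appropriately.

    For a DatetimeIndex, the timestamp is the row's identity, so we drop
    duplicated index entries and keep repeated values. Otherwise we fall
    back to value-based drop_duplicates().
    """
    combined = pd.concat([current, new])
    if isinstance(combined.index, pd.DatetimeIndex):
        combined = combined[~combined.index.duplicated(keep="last")]
    else:
        combined = combined.drop_duplicates(keep="last")
    return combined

# Repeated string values at distinct timestamps survive the datetime path
cur = pd.DataFrame({"v": ["x", "x"]},
                   index=pd.DatetimeIndex(["2020-01-01", "2020-01-02"]))
extra = pd.DataFrame({"v": ["x"]}, index=pd.DatetimeIndex(["2020-01-03"]))
result = append_dedup(cur, extra)  # keeps all three rows
```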
Hi everyone. I'm using the useful pystore library to store very long timeseries, including ones with string values.
I noticed very strange behaviour when I try to append to an existing string Timeseries item, as in the following example code. In particular, when appending to an item in the database, the resulting dataframe loses many of its values.
Instead, it works perfectly if I replace the strings with floats:
Is the pystore library also able to store string values?
Thank you
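The data loss described here is consistent with the value-based deduplication discussed in the comments: drop_duplicates compares column values only, so a string series that repeats the same value at different timestamps loses rows. A pandas demonstration of the effect (illustrative data, not pystore itself):

```python
import pandas as pd

# A string timeseries where the value "on" recurs at different timestamps
ts = pd.Series(
    ["on", "off", "on", "on"],
    index=pd.date_range("2020-01-01", periods=4, freq="D"),
    name="state",
)
df = ts.to_frame()

# Value-based dedup ignores the index entirely: every repeated "on"
# except the last one disappears, even though the timestamps differ
deduped = df.drop_duplicates(keep="last")
```

Only two rows survive: "off" (2020-01-02) and the last "on" (2020-01-04). Floats from a real-world series rarely repeat exactly, which is why the same append appears to work when strings are replaced by floats.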