Crash during resampling #1646

Open
DrNickClarke opened this issue Jun 25, 2024 · 0 comments
Labels: bug

DrNickClarke commented Jun 25, 2024

Describe the bug

The data is 2.1 million rows of intraday FX tick data. Resampling to 10s works. Resampling to 1s crashes.

Original reporter: Tony Roberts from PyXLL.

Possibly an out-of-memory (OOM) condition?

Steps/Code to Reproduce

Download the December 2023 CSV from https://www.histdata.com/download-free-forex-historical-data/?/ascii/tick-data-quotes/eurusd/2023 (this is free sample data).

pandas data prep code:

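import pandas as pd

# `lib` is an ArcticDB library handle opened beforehand (its setup is not shown in this report)
file = "data/DAT_ASCII_EURUSD_T_202312.csv"
df2_raw = pd.read_csv(file, header=None)
# Drop the unused fourth column and name the remaining ones
df2 = df2_raw.drop(columns=3).rename(columns={0: 'timestamp', 1: 'bid', 2: 'ask'})
df2['timestamp'] = pd.to_datetime(df2['timestamp'], format="%Y%m%d %H%M%S%f")
df2['mid'] = 0.5 * (df2['bid'] + df2['ask'])
df2 = df2.set_index('timestamp')

lib.write("EURUSD", df2)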

This should produce a DataFrame with a datetime index and bid, ask and mid columns, with 2,102,540 rows and no missing data.

resample code:

import datetime as dt
import arcticdb as adb

def resampled_tick_data2(lib, symbol, start, end, freq, max_rows=1000):
    qb = adb.QueryBuilder()
    qb = qb.resample(freq, closed='right').agg({
        'high': ('mid', 'max'),
        'low': ('mid', 'min'),
        'open': ('mid', 'first'),
        'close': ('mid', 'last')
    })
    data = lib.read(symbol,
                    date_range=[start, end],
                    query_builder=qb)
    df = data.data.dropna()
    if max_rows is not None and len(df) > max_rows:
        raise RuntimeError("Number of rows is greater than max rows")
    return df

df = resampled_tick_data2(lib,
                          "EURUSD",
                          dt.datetime(2023,1,1),
                          dt.datetime(2023,12,31),
                          "1s",
                          max_rows=None)
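One way to gauge the OOM hypothesis is to compare how many fixed-width buckets the requested range could contain at each frequency. The sketch below is plain arithmetic on the arguments passed in the call; `max_buckets` is a hypothetical helper, and whether ArcticDB allocates anything per bucket over the full requested range is not established in this report.

```python
import datetime as dt

def max_buckets(start, end, width):
    """Upper bound on the number of fixed-width buckets touching [start, end]."""
    # timedelta // timedelta is exact integer floor division
    return (end - start) // width + 1

# Range requested in the call (the data itself only covers December 2023)
start = dt.datetime(2023, 1, 1)
end = dt.datetime(2023, 12, 31)

for label, width in [("10s", dt.timedelta(seconds=10)),
                     ("1s", dt.timedelta(seconds=1)),
                     ("10ms", dt.timedelta(milliseconds=10))]:
    print(label, max_buckets(start, end, width))
```

Narrowing `start` and `end` to the month actually present in the data would show whether the crash depends on the requested range or only on the frequency.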

Expected Results

Either produce correct results or give a clear error message.
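For cross-checking results on a small slice, a pure-Python reference of the same open/high/low/close aggregation might look like the sketch below. It assumes right-closed buckets `(left, left + width]` aligned to multiples of the width since the Unix epoch and labelled by their left edge; both choices are assumptions of this sketch and should be checked against ArcticDB's actual behaviour. `ohlc_right_closed` is a hypothetical helper.

```python
import datetime as dt

EPOCH = dt.datetime(1970, 1, 1)

def ohlc_right_closed(ticks, width):
    """Aggregate (timestamp, mid) pairs into OHLC rows keyed by bucket left edge."""
    buckets = {}
    # Sort by timestamp only, so ticks with equal timestamps keep their input order
    for ts, mid in sorted(ticks, key=lambda t: t[0]):
        # Ceiling division: index of the smallest bucket edge that is >= ts
        n = -((EPOCH - ts) // width)
        left = EPOCH + (n - 1) * width
        row = buckets.get(left)
        if row is None:
            buckets[left] = {"open": mid, "high": mid, "low": mid, "close": mid}
        else:
            row["high"] = max(row["high"], mid)
            row["low"] = min(row["low"], mid)
            row["close"] = mid
    return buckets

# Hypothetical ticks, invented for illustration
base = dt.datetime(2023, 12, 1)
ticks = [
    (base + dt.timedelta(milliseconds=500), 1.0),
    (base + dt.timedelta(seconds=1), 3.0),        # exactly on an edge: belongs to the earlier bucket
    (base + dt.timedelta(milliseconds=1200), 2.0),
]
result = ohlc_right_closed(ticks, dt.timedelta(seconds=1))
for left, row in result.items():
    print(left, row)
```

Only buckets that contain at least one tick appear in the result, so no `dropna()` step is needed for this reference.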

OS, Python Version and ArcticDB Version

Python 3.10, ArcticDB 4.5.0rc, WSL on Windows 11.

Backend storage used

LMDB

Additional Context

The same example runs fine with mimalloc, both with freq='1s' and with all frequencies down to '10ms', at which point the output has the same number of rows as the original data.

DrNickClarke added the bug label Jun 25, 2024