[SNOW-1705797]: Use cached metadata to make repr faster on simple DataFrames #2760

Open · wants to merge 5 commits into main

Conversation

sfc-gh-rdurrani (Contributor)

  1. Which Jira issue is this PR addressing? Make sure that there is an accompanying issue to your PR.

    Fixes SNOW-1705797

  2. Fill out the following pre-review checklist:

    • I am adding a new automated test(s) to verify correctness of my new code
      • If this test skips Local Testing mode, I'm requesting review from @snowflakedb/local-testing
    • I am adding new logging messages
    • I am adding a new telemetry message
    • I am adding new credentials
    • I am adding a new dependency
    • If this is a new feature/behavior, I'm adding the Local Testing parity changes.
    • I acknowledge that I have ensured my changes to be thread-safe. Follow the link for more information: Thread-safe Developer Guidelines
  3. Please describe how your code solves the related issue.

This PR helps alleviate the long repr time on DataFrames. For "simple" DataFrames, i.e. DataFrames that are simple selects/projects off of a table, it's faster to rely on the metadata count to get the row count, rather than using a window function. This PR adds support for doing so on those sorts of DataFrames.
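
As a rough sketch of the dispatch described above (not the PR's actual implementation; only is_projection_of_table comes from this PR's diff, everything else here is illustrative):

from typing import Callable, Optional

def row_count_for_repr(
    ordered_frame,
    eager_metadata_count: Callable[[], int],
) -> Optional[int]:
    # Hedged sketch: `eager_metadata_count` stands in for however the cached
    # metadata count would be fetched; it is not a real Snowpark pandas API.
    if ordered_frame.is_projection_of_table():
        # A simple select/project off a table: the row count can come from
        # cached table metadata, which is cheap to fetch eagerly.
        return eager_metadata_count()
    # Anything more complex keeps the old behavior: the caller attaches a
    # COUNT(*) OVER () window column to the query used for repr.
    return None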

@sfc-gh-rdurrani (Contributor Author) commented Dec 13, 2024

With this PR, and the following benchmarking code, on a 73,137,675 rows × 8 columns DataFrame:

df = pd.read_snowflake("FINANCIAL__ECONOMIC_ESSENTIALS.CYBERSYN.stock_price_timeseries")
df

(timed by running the cells in Jupyter notebooks in VS Code), we see a median 6x improvement.

New codepath times: [1.3s, 1.1s, 1.3s]
Old codepath times: [7.6s, 7.9s, 7.8s]
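
For reproducibility outside a notebook, roughly the same measurement can be made with perf_counter. A sketch, assuming an active Snowpark session and the same table as above:

from time import perf_counter

import modin.pandas as pd
import snowflake.snowpark.modin.plugin  # noqa: F401  # registers the Snowpark pandas backend

# Assumes a Snowpark session has already been created for this account.
df = pd.read_snowflake("FINANCIAL__ECONOMIC_ESSENTIALS.CYBERSYN.stock_price_timeseries")

start = perf_counter()
repr(df)  # what displaying `df` at the end of a notebook cell ultimately triggers
print(perf_counter() - start)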

@sfc-gh-rdurrani (Contributor Author) commented Dec 13, 2024

For more complex operations, we're a tiny bit slower than before:

benchmark code:

df = pd.read_snowflake("FINANCIAL__ECONOMIC_ESSENTIALS.CYBERSYN.stock_price_timeseries")
from time import perf_counter
start = perf_counter()
df = df[(df['TICKER'] == 'GOOG') | (df['TICKER'] == 'MSFT') | (df['TICKER'] == 'SNOW')]
df = df.pivot_table(index=["TICKER", "DATE"], columns="VARIABLE_NAME", values="VALUE")
df = df["All-Day Low"]["GOOG"].resample("91D").min()
repr(df)
end = perf_counter()
print(end - start)

times:

old: [4.256317083010799, 4.085870499999146, 4.003666083997814]
new: [4.786840916000074, 4.763477917003911, 4.6787519170029555]

This is a 0.857749436690795x ratio (median old time over median new time), i.e. the new codepath is roughly 17% slower here.
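
For clarity, that figure is the median of the old times divided by the median of the new times:

from statistics import median

old = [4.256317083010799, 4.085870499999146, 4.003666083997814]
new = [4.786840916000074, 4.763477917003911, 4.6787519170029555]

print(median(old) / median(new))  # ~0.858, the figure quoted above
print(median(new) / median(old))  # ~1.17, i.e. roughly 17% slower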

@@ -2019,3 +2028,23 @@ def sample(self, n: Optional[int], frac: Optional[float]) -> "OrderedDataFrame":
ordering_columns=self.ordering_columns,
)
)

def is_projection_of_table(self) -> bool:
Contributor

What about nested projections? This seems to only handle a single level of projection, right? That should still be fine; we can address nested projections in a follow-up.

@sfc-gh-rdurrani (Contributor Author) Dec 13, 2024

Before putting up this PR, I tested it locally and found that this will handle nested projections - e.g. I tried the following:

df = pd.read_snowflake...
df = df[df.columns[:5:-1]]
df = df.select_dtypes()

and

df = pd.DataFrame(...)
df = df[df.columns[:5:-1]]
df = df.select_dtypes()

and after each of those lines of code (and after the entire block), the format of api_calls remained the same - i.e. this check will work for nested projections, and the cached metadata count is passed on for nested projections of that type.
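
Purely as an illustration of the kind of check being discussed here (this is an assumed sketch, not the code in this PR; the shape of api_calls and the call names below are hypothetical):

# Hypothetical sketch only: assumes the plan exposes `api_calls` as a list of
# dicts with a "name" field, and that simple column projections show up as
# select-like calls after the initial read. The call names are made up.
PROJECTION_LIKE_CALLS = {"DataFrame.select", "DataFrame.drop"}

def looks_like_projection_of_table(plan) -> bool:
    call_names = [call.get("name", "") for call in getattr(plan, "api_calls", [])]
    if not call_names or "read_snowflake" not in call_names[0]:
        return False
    # Nested projections would just append more select-like calls, so the
    # same check keeps working one level deeper.
    return all(name in PROJECTION_LIKE_CALLS for name in call_names[1:])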

@sfc-gh-helmeleegy (Contributor)

> For more complex operations, we're a tiny bit slower than before: [benchmark code and times quoted from the comment above]

This is interesting. It seems to be mostly the cost of performing the check - because otherwise the code used for complex dataframes is exactly the same as before.

@sfc-gh-rdurrani can you try to measure the impact of data size on this regression; i.e., by using input data of different sizes - both smaller and larger? I imagine that as the data size gets bigger the overhead of performing the check becomes less and less significant in the e2e time.

@sfc-gh-rdurrani (Contributor Author)

> This is interesting. It seems to be mostly the cost of performing the check - because otherwise the code used for complex dataframes is exactly the same as before.
>
> @sfc-gh-rdurrani can you try to measure the impact of data size on this regression; i.e., by using input data of different sizes - both smaller and larger?

These numbers are not for the code in this PR. These numbers are for the case where I remove the if-else and always use the new codepath, and were meant to justify having the if-else rather than sending all DataFrames down the new codepath. I haven't measured the latency of benchmarks like this through the new codepath (with the if-else), but in practice the latency should be about the same (i.e. there shouldn't be a measurable impact), since getting api_calls from the plan is inexpensive.

@sfc-gh-helmeleegy (Contributor)

> With this PR, and the following benchmarking code, on a 73,137,675 rows × 8 columns DataFrame ... we see a median 6x improvement.
>
> New codepath times: [1.3s, 1.1s, 1.3s] Old codepath times: [7.6s, 7.9s, 7.8s]

These gains almost exactly match those reported in our previous Slack thread here.

@sfc-gh-helmeleegy (Contributor)

> These numbers are not for the code in this PR. These numbers are for the case where I remove the if-else and always use the new codepath ... since getting api_calls from the plan is inexpensive.

This is much better. Thanks for the clarification. In fact, it was hard to believe that the check would take this much time.

@sfc-gh-rdurrani (Contributor Author)

> df = pd.read_snowflake("FINANCIAL__ECONOMIC_ESSENTIALS.CYBERSYN.stock_price_timeseries")
> ... [the rest of the complex benchmark above]

I stand corrected - @sfc-gh-helmeleegy I tried this benchmark with the new code (including the if-else) and got these times:
[4.397639709000941, 4.303594415992848, 4.002272666999488]
These numbers are in between the old codepath and the proposed (but rejected) new codepath of always doing the count eagerly. It's a little surprising that this codepath is a bit slower than the old codepath, but as you said, that's probably the overhead of getting the plan, although that should be fairly cheap.

@sfc-gh-rdurrani (Contributor Author)

Posting a TL;DR, since I realize the multiple messages may be a little confusing to other reviewers.

To clarify - there are 3 scenarios being tested for the complex benchmark.

Scenario A: Old codepath (before this PR)
Scenario B: New codepath, where instead of adding a window count, we always eagerly call count
Scenario C: Current PR, where we check if the current DataFrame is a projection of a table, and eagerly get count if it is, or use the window count from the old code if it isn't.

In Scenario A and Scenario C, this benchmark takes the old codepath, where we add a window count to the query as our cached row count column. The only difference between Scenario A and Scenario C is that Scenario C requires an additional if-else to check that the current DataFrame is not a projection of a table. In Scenario B, we eagerly materialize the count.

Scenario A numbers: [4.256317083010799, 4.085870499999146, 4.003666083997814]
Scenario B numbers: [4.786840916000074, 4.763477917003911, 4.6787519170029555]
Scenario C numbers: [4.397639709000941, 4.303594415992848, 4.002272666999488]

As expected, Scenario C numbers are lower than Scenario B numbers. What's a little surprising is that Scenario C numbers are higher than Scenario A numbers, since the if-else should be cheap (getting the Snowpark DataFrame plan shouldn't be expensive); but this is a small number of runs, and the last number from Scenario C is in line with what you would expect from Scenario A, so this could just be random variation plus a couple of bad runs.
