Plotting large dataset in Dash app #267

Merged
merged 6 commits into from
Aug 13, 2024
Changes from 5 commits
63 changes: 63 additions & 0 deletions examples/dash/plotly-large-dataset/README.md
@@ -0,0 +1,63 @@
# Plotting large datasets in Dash

Interactive Dash applications that plot large datasets using one of:
- [**WebGL**](https://plotly.com/python/webgl-vs-svg/) (in `webgl` folder): a browser technology that uses the GPU to accelerate rendering, so figures with many markers draw faster than with the default SVG renderer. This method is generally ideal for figures with up to 100,000-200,000 markers (the term for data points in charts), depending on the power of your GPU. For figures larger than that, it's often better to aggregate the data points first.

- [**`plotly-resampler`**](https://github.com/predict-idlab/plotly-resampler) (in `resampler` folder): an external library that dynamically aggregates time-series data according to the current graph view. This approach downsamples your dataset at the cost of losing some detail.

- Combined approach (in `combined` folder).
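
To illustrate what "aggregating the data points" can mean, here is a minimal sketch of min-max bucket downsampling in plain Python. It is only an illustration of the idea: the function name `minmax_downsample` is hypothetical, and this is not the aggregation algorithm that `plotly-resampler` itself uses.

```python
def minmax_downsample(values, n_bins):
    """Keep only the min and max of each of n_bins contiguous buckets.

    Hypothetical helper for illustration; not part of plotly-resampler.
    """
    if n_bins <= 0:
        raise ValueError("n_bins must be positive")
    n = len(values)
    # Nothing to gain if the series is already small enough.
    if n <= 2 * n_bins:
        return list(values)

    out = []
    for i in range(n_bins):
        # Integer arithmetic keeps bucket edges exact and covers every point.
        lo = i * n // n_bins
        hi = (i + 1) * n // n_bins
        bucket = values[lo:hi]
        i_min = min(range(len(bucket)), key=bucket.__getitem__)
        i_max = max(range(len(bucket)), key=bucket.__getitem__)
        # Emit the two extremes in their original order.
        for j in sorted({i_min, i_max}):
            out.append(bucket[j])
    return out


delays = [0, 5, -3, 120, 7, 2, 1, 0, 45, 3, 2, 1]
print(minmax_downsample(delays, 2))
```

Keeping the extremes of each bucket preserves spikes such as unusually long delays, which plain every-nth-point sampling could drop.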

We will be using a commercial flight dataset that documents information such as flight delays in the first half (1/1-6/30) of 2006. You can find it [here](https://github.com/vega/falcon/blob/master/data/flights-3m.csv). For the purpose of this project, we will focus on plotting departure delays.

Once you download the dataset, run `python csv-clean.py flights-3m.csv` to obtain the cleaned CSV file `flights-3m-cleaned.csv`. Move the cleaned file to a `data` folder inside whichever project folder (`webgl`, `resampler` or `combined`) you want to test.
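
The core of the cleaning step is merging the flight date and departure time into a single timestamp. The sketch below shows that conversion with the standard library only. It assumes `FL_DATE` is an integer of the form `YYYYMMDD` and `DEP_TIME` an integer of the form `HHMM`, which is what the arithmetic in `csv-clean.py` suggests; the function name `combine_date_time` is hypothetical.

```python
from datetime import datetime


def combine_date_time(fl_date: int, dep_time: int) -> datetime:
    """Merge a YYYYMMDD date and an HHMM time into one datetime.

    Assumed input formats; hypothetical helper, not part of csv-clean.py.
    """
    # Treat 2400 as 00:00 of the same date, mirroring csv-clean.py.
    if dep_time == 2400:
        dep_time = 0
    # e.g. 20060101 * 10000 + 1530 -> 200601011530
    stamp = fl_date * 10000 + dep_time
    return datetime.strptime(str(stamp), "%Y%m%d%H%M")


print(combine_date_time(20060101, 1530))
```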

## Description

On their home pages, the apps display a scatter plot of departure delay (in minutes), captured below. The dataset contains around 3 million flights; the `webgl` app plots only the first 100,000 rows. In `resampler` and `combined`, you can select the date range you want to visualize.

- `webgl`

![](static/app_webgl.png)

- `resampler`

![](static/app_resampler.png)

- `combined`

![](static/app_combined.png)

You can also click on the graph and drag your cursor around to zoom into any part of the graph you want.

![](static/zoom_in.gif)

To revert the figure to its original state, click on the `Reset axes` button at the upper right corner of the figure.

![](static/zoom_out.gif)


## Local testing

`cd` into the folder of the approach you want to test, then run `gunicorn app:server run --bind 0.0.0.0:80`. You should be able to access the app at `0.0.0.0:80`.

## Upload to Ploomber Cloud

Ensure that you are in the correct project folder.

### Command line

Go to your app folder and set your API key: `ploomber-cloud key YOURKEY`. Next, initialize your app: `ploomber-cloud init` and deploy it: `ploomber-cloud deploy`. For more details, please refer to our [documentation](https://docs.cloud.ploomber.io/en/latest/user-guide/cli.html).

### UI

Zip `app.py` together with `requirements.txt` and the `data` folder, then upload the archive to Ploomber Cloud. For more details, please refer to our [Dash deployment guide](https://docs.cloud.ploomber.io/en/latest/apps/dash.html).

## Interacting with the App

Once the app starts running, you will see a page similar to the above screenshots. You can click on the graph and drag your cursor around to zoom into any part of the graph you want.

![](static/zoom_in.gif)

To revert the figure to its original state, click on the `Reset axes` button at the upper right corner of the figure.

![](static/zoom_out.gif)
67 changes: 67 additions & 0 deletions examples/dash/plotly-large-dataset/combined/app.py
@@ -0,0 +1,67 @@
from dash import dcc, html, Input, Output, Dash
import pandas as pd
from datetime import datetime as dt
import plotly.graph_objects as go
from plotly_resampler import FigureResampler

app = Dash(__name__)
server = app.server

N = 100000

df = pd.read_csv("data/flights-3m-cleaned.csv")

app.layout = html.Div(children=[
    html.H1("Plotting Large Datasets in Dash"),
    html.P("Select range of flight dates to visualize"),
    dcc.DatePickerRange(
        id="date-picker-select",
        start_date=dt(2006, 1, 1),
        end_date=dt(2006, 4, 1),
        min_date_allowed=dt(2006, 1, 1),
        max_date_allowed=dt(2006, 7, 1),
        initial_visible_month=dt(2006, 1, 1),
    ),
    dcc.Graph(id='example-graph'),

])

@app.callback(
    Output("example-graph", "figure"),
    [
        Input("date-picker-select", "start_date"),
        Input("date-picker-select", "end_date"),
    ],
)
def update_figure(start, end):
    start = start + " 00:00:00"
    end = end + " 00:00:00"

    df_filtered = df[(pd.to_datetime(df["DEP_DATETIME"]) >= pd.to_datetime(start)) & \
                     (pd.to_datetime(df["DEP_DATETIME"]) <= pd.to_datetime(end))]

    fig = FigureResampler(go.Figure())

    fig.add_trace(go.Scattergl(
            mode="markers", # Replace with "lines+markers" if you want to display lines between time series data.
            showlegend=False,
            line_width=0.3,
            line_color="gray",
            marker={
                "color": abs(df_filtered["DEP_DELAY"]), # Color each marker by the magnitude of its delay (same rows as hf_y).
                "colorscale": "Portland", # How marker color changes based on data point value.
                "size": abs(5 + df_filtered["DEP_DELAY"] / 50) # Non-negative size of individual data point marker based on the dataset.
            }
        ),
        hf_x=df_filtered["DEP_DATETIME"],
        hf_y=df_filtered["DEP_DELAY"],
        max_n_samples=N
    )

    fig.update_layout(
        title="Flight departure delay",
        xaxis_title="Flight date and time (24h)",
        yaxis_title="Departure delay (minutes)"
    )

    return fig
@@ -0,0 +1,4 @@
dash
plotly-resampler
pandas
gunicorn
27 changes: 27 additions & 0 deletions examples/dash/plotly-large-dataset/csv-clean.py
@@ -0,0 +1,27 @@
import pandas as pd
import sys

if __name__ == "__main__":
    if (len(sys.argv) != 2 or not sys.argv[1].endswith(".csv")):
        raise ValueError("Usage: python csv-clean.py filename.csv")

    in_file = sys.argv[1]
    df = pd.read_csv(in_file)

    # Clean out null values
    df = df[df['DEP_TIME'].notnull() & df['DEP_DELAY'].notnull()]

    # Ensure hour is between 0 and 23 for conversion
    df.loc[df.DEP_TIME == 2400, 'DEP_TIME'] = 0

    # Add time to date and convert
    df["DEP_DATETIME"] = df["FL_DATE"] * 10000 + df["DEP_TIME"]
    df["DEP_DATETIME"] = df["DEP_DATETIME"].apply(lambda x: pd.to_datetime(str(int(x))))

    # Select relevant columns.
    df = df[["DEP_DATETIME", "DEP_DELAY"]].sort_values(["DEP_DATETIME"])
    print("Completed conversion. Resulting DataFrame:\n")
    print(df)

    out_file = in_file[:-4] + "-cleaned.csv"
    df.to_csv(out_file, sep=",")
65 changes: 65 additions & 0 deletions examples/dash/plotly-large-dataset/resampler/app.py
@@ -0,0 +1,65 @@
from dash import dcc, html, Input, Output, Dash
import pandas as pd
from datetime import datetime as dt
import plotly.graph_objects as go
from plotly_resampler import FigureResampler

app = Dash(__name__)
server = app.server

N = 2000

df = pd.read_csv("data/flights-3m-cleaned.csv")

app.layout = html.Div(children=[
    html.H1("Plotting Large Datasets in Dash"),
    html.P("Select range of flight dates to visualize"),
    dcc.DatePickerRange(
        id="date-picker-select",
        start_date=dt(2006, 1, 1),
        end_date=dt(2006, 4, 1),
        min_date_allowed=dt(2006, 1, 1),
        max_date_allowed=dt(2006, 7, 1),
        initial_visible_month=dt(2006, 1, 1),
    ),
    dcc.Graph(id='example-graph'),

])

@app.callback(
    Output("example-graph", "figure"),
    [
        Input("date-picker-select", "start_date"),
        Input("date-picker-select", "end_date"),
    ],
)
def update_figure(start, end):
    start = start + " 00:00:00"
    end = end + " 00:00:00"

    df_filtered = df[(pd.to_datetime(df["DEP_DATETIME"]) >= pd.to_datetime(start)) & \
                     (pd.to_datetime(df["DEP_DATETIME"]) <= pd.to_datetime(end))]

    fig = FigureResampler(go.Figure())

    fig.add_trace(go.Scatter(
            mode="markers", # Replace with "lines+markers" if you want to display lines between time series data.
            showlegend=False,
            line_width=0.3,
            line_color="gray",
            marker_size=abs(5 + df_filtered["DEP_DELAY"] / 50), # Non-negative size of individual data point marker based on the dataset.
            marker_colorscale="Portland", # How marker color changes based on data point value.
            marker_color=abs(df_filtered["DEP_DELAY"]), # Color each marker by the magnitude of its delay (same rows as hf_y).
        ),
        hf_x=df_filtered["DEP_DATETIME"],
        hf_y=df_filtered["DEP_DELAY"],
        max_n_samples=N
    )

    fig.update_layout(
        title="Flight departure delay",
        xaxis_title="Flight date and time (24h)",
        yaxis_title="Departure delay (minutes)"
    )

    return fig
@@ -0,0 +1,4 @@
dash
plotly-resampler
pandas
gunicorn
40 changes: 40 additions & 0 deletions examples/dash/plotly-large-dataset/webgl/app.py
@@ -0,0 +1,40 @@
from dash import dcc, html, Input, Output, Dash
from flask import request
import pandas as pd
import plotly.graph_objects as go

app = Dash(__name__)
server = app.server

N = 100000 # Limit number of rows to plot.

fig = go.Figure() # Initiate the figure.

df = pd.read_csv("data/flights-3m-cleaned.csv")

fig.add_trace(go.Scattergl(
    x=df["DEP_DATETIME"][:N],
    y=df["DEP_DELAY"][:N],
    mode="markers", # Replace with "lines+markers" if you want to display lines between time series data.
    showlegend=False,
    line_width=0.3,
    line_color="gray",
    marker={
        "color": abs(df["DEP_DELAY"][:N]), # Color each marker by the magnitude of its delay.
        "colorscale": "Portland", # How marker color changes based on data point value.
        "size": abs(5 + df["DEP_DELAY"][:N] / 50) # Non-negative size of individual data point marker based on the dataset.
    }
    )
)

fig.update_layout(
    title="Flight departure delay",
    xaxis_title="Flight date and time (24h)",
    yaxis_title="Departure delay (minutes)"
)

app.layout = html.Div(children=[
    html.H1("Plotting Large Datasets in Dash"),
    dcc.Graph(id='example-graph', figure=fig),

])
3 changes: 3 additions & 0 deletions examples/dash/plotly-large-dataset/webgl/requirements.txt
@@ -0,0 +1,3 @@
dash
pandas
gunicorn