Plotting large dataset in Dash app (#267)
* Plotting large dataset in Dash app

* Added dataset instructions

* Wording in README

* Date range in apps and more description in README

* Added description of app

* Added description in app

---------

Co-authored-by: Ben C <[email protected]>
bchen39 and bqd39 authored Aug 13, 2024
1 parent b8aa4c5 commit 6684386
Showing 13 changed files with 296 additions and 0 deletions.
63 changes: 63 additions & 0 deletions examples/dash/plotly-large-dataset/README.md
@@ -0,0 +1,63 @@
# Plotting large datasets in Dash

Interactive Dash applications that plot large datasets using one of:
- [**WebGL**](https://plotly.com/python/webgl-vs-svg/) (in `webgl` folder): a technology that uses the GPU to accelerate rendering, so figures with many points draw faster. This method is generally ideal for figures with up to 100,000-200,000 markers (Plotly's term for the data points in a chart), depending on the power of your GPU. For larger figures, it is often better to aggregate the data points first.

- [**`plotly-resampler`**](https://github.com/predict-idlab/plotly-resampler) (in `resampler` folder): an external library that dynamically aggregates time-series data according to the current graph view. This approach downsamples your dataset at the cost of losing some detail.

- Combined approach (in `combined` folder): `plotly-resampler` aggregation rendered with WebGL (`Scattergl`) traces.
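
As a rough illustration of what aggregating data points means, the sketch below keeps only the minimum and maximum of each bucket of consecutive points, so spikes survive while most points are dropped. It is a simplified stand-in written for this README, not the algorithm `plotly-resampler` itself uses; the function name `minmax_downsample` and the synthetic data are assumptions.

```python
from typing import List, Sequence, Tuple


def minmax_downsample(
    xs: Sequence[float], ys: Sequence[float], n_buckets: int
) -> List[Tuple[float, float]]:
    """Keep the lowest and highest point of each bucket, in x order."""
    if n_buckets <= 0:
        raise ValueError("n_buckets must be positive")
    n = len(xs)
    if n != len(ys):
        raise ValueError("xs and ys must have the same length")
    if n <= 2 * n_buckets:
        # Already small enough: nothing to aggregate.
        return list(zip(xs, ys))

    kept = []
    for b in range(n_buckets):
        # Integer bucket boundaries that cover every index exactly once.
        lo = b * n // n_buckets
        hi = (b + 1) * n // n_buckets
        idx = range(lo, hi)
        i_min = min(idx, key=lambda i: ys[i])
        i_max = max(idx, key=lambda i: ys[i])
        # Emit the two extremes in their original order; drop the duplicate
        # when the bucket's minimum and maximum are the same point.
        for i in sorted({i_min, i_max}):
            kept.append((xs[i], ys[i]))
    return kept


# Hypothetical data: 1,000 delays with one large spike at index 500.
xs = list(range(1000))
ys = [0.0] * 1000
ys[500] = 240.0
sample = minmax_downsample(xs, ys, n_buckets=10)
```

Here 1,000 points shrink to 11, and the spike at x = 500 is still among them. The trade-off is the one described above: the overall shape is preserved, but individual points between the extremes are lost.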

We will be using a commercial flight dataset that documents information such as flight delays in the first half (1/1-6/30) of 2006. You can find it [here](https://github.com/vega/falcon/blob/master/data/flights-3m.csv). For the purpose of this project, we will focus on plotting departure delays.

Once you download the dataset, run `python csv-clean.py flights-3m.csv` to obtain the cleaned CSV file `flights-3m-cleaned.csv`. Move the cleaned file to the `data` folder in any of the project folders (`webgl`, `resampler` or `combined`) you want to test.
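
The core of the cleaning step is turning a date and a departure time into a single timestamp. The sketch below shows that conversion for one row using only the standard library. It assumes, based on how `csv-clean.py` combines the columns, that `FL_DATE` is stored as a `YYYYMMDD` integer and `DEP_TIME` as an `HHMM` number; the helper name `to_dep_datetime` is hypothetical.

```python
from datetime import datetime


def to_dep_datetime(fl_date: int, dep_time: int) -> datetime:
    """Combine a YYYYMMDD date and an HHMM time into one datetime."""
    if dep_time == 2400:
        # 2400 is not a valid hour, so map it to 00:00 of the same date,
        # mirroring what csv-clean.py does.
        dep_time = 0
    # Example: 20060101 * 10000 + 5 -> 200601010005 -> "200601010005"
    stamp = str(fl_date * 10000 + dep_time)
    return datetime.strptime(stamp, "%Y%m%d%H%M")


early = to_dep_datetime(20060101, 5)
afternoon = to_dep_datetime(20060215, 1430)
```

Multiplying the date by 10000 shifts it left by four digits, which leaves room for the hour and minute and yields a fixed 12-digit string that a single format can parse.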

## Description

On their home pages, the apps display a scatter plot of the departure delay (in minutes) of around 3 million flights, as captured below (`webgl` plots only the first 100,000 rows). In `resampler` and `combined`, you can select the date range you want to visualize.

- `webgl`

![](static/app_webgl.png)

- `resampler`

![](static/app_resampler.png)

- `combined`

![](static/app_combined.png)
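
The date-range selection in `resampler` and `combined` comes down to a filter on the departure timestamp. The following is a minimal sketch of that filter, assuming the `DEP_DATETIME` column has already been parsed into datetimes; the helper name `filter_by_date` and the three-row DataFrame are illustrative only.

```python
import pandas as pd


def filter_by_date(df: pd.DataFrame, start: str, end: str) -> pd.DataFrame:
    """Return rows whose DEP_DATETIME lies in [start, end], bounds inclusive."""
    # A date-only string such as "2006-04-01" resolves to midnight of that day.
    start_ts = pd.to_datetime(start)
    end_ts = pd.to_datetime(end)
    mask = (df["DEP_DATETIME"] >= start_ts) & (df["DEP_DATETIME"] <= end_ts)
    return df[mask]


# Hypothetical three-row stand-in for flights-3m-cleaned.csv.
flights = pd.DataFrame(
    {
        "DEP_DATETIME": pd.to_datetime(
            ["2006-01-01 08:15", "2006-02-10 17:40", "2006-05-03 06:05"]
        ),
        "DEP_DELAY": [12.0, -3.0, 45.0],
    }
)
selected = filter_by_date(flights, "2006-01-01", "2006-04-01")
```

Because the end bound is midnight at the start of the end date, flights that depart later on that day fall outside the selection.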

You can also click on the graph and drag your cursor around to zoom into any part of the graph you want.

![](static/zoom_in.gif)

To revert the figure to its original state, click on the `Reset axes` button at the upper right corner of the figure.

![](static/zoom_out.gif)


## Local testing

`cd` into the folder of the approach you want to test, install the dependencies with `pip install -r requirements.txt`, then run `gunicorn app:server run --bind 0.0.0.0:80`. You should be able to access the app at `http://0.0.0.0:80`.

## Upload to Ploomber Cloud

Ensure that you are in the correct project folder.

### Command line

Go to your app folder and set your API key: `ploomber-cloud key YOURKEY`. Next, initialize your app: `ploomber-cloud init` and deploy it: `ploomber-cloud deploy`. For more details, please refer to our [documentation](https://docs.cloud.ploomber.io/en/latest/user-guide/cli.html).

### UI

Zip `app.py` together with `requirements.txt` and `data` folder, then upload to Ploomber Cloud. For more details, please refer to our [Dash deployment guide](https://docs.cloud.ploomber.io/en/latest/apps/dash.html).
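
If you prefer to build the archive from a script, the zipping step could look roughly like the sketch below. The file names come from this README; the helper name `build_upload_zip`, the archive name, and the choice to store the files at the archive root are assumptions, so check the deployment guide if the upload is rejected.

```python
import zipfile
from pathlib import Path


def build_upload_zip(project_dir: str, zip_path: str) -> list:
    """Zip app.py, requirements.txt and data/ and return the archived names."""
    root = Path(project_dir)
    members = [root / "app.py", root / "requirements.txt"]
    # Include every file under data/, keeping the folder structure.
    members += sorted(p for p in (root / "data").rglob("*") if p.is_file())

    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for path in members:
            # Store paths relative to the project folder.
            archive.write(path, path.relative_to(root).as_posix())
        return archive.namelist()
```

For example, `build_upload_zip("webgl", "app.zip")` run from the `plotly-large-dataset` folder would package the WebGL app, provided the cleaned CSV is already in `webgl/data`.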

## Interacting with the App

Once the app starts running, you will see a page similar to the above screenshots. You can click on the graph and drag your cursor around to zoom into any part of the graph you want.

![](static/zoom_in.gif)

To revert the figure to its original state, click on the `Reset axes` button at the upper right corner of the figure.

![](static/zoom_out.gif)
75 changes: 75 additions & 0 deletions examples/dash/plotly-large-dataset/combined/app.py
@@ -0,0 +1,75 @@
from dash import dcc, html, Input, Output, Dash
import pandas as pd
from datetime import datetime as dt
import plotly.graph_objects as go
from plotly_resampler import FigureResampler

app = Dash(__name__)
server = app.server

N = 100000

df = pd.read_csv("data/flights-3m-cleaned.csv")

app.layout = html.Div(children=[
    html.H1("Plotting Large Datasets in Dash"),
    html.H2("""Downsampled figure: Departure delay time of around 3
            million flights in the first half (1/1-6/30) of 2006"""),
    html.P("Select range of flight dates to visualize"),
    dcc.DatePickerRange(
        id="date-picker-select",
        start_date=dt(2006, 1, 1),
        end_date=dt(2006, 4, 1),
        min_date_allowed=dt(2006, 1, 1),
        max_date_allowed=dt(2006, 7, 1),
        initial_visible_month=dt(2006, 1, 1),
    ),
    html.Div("""Click on the graph and drag
             your cursor around to zoom into any part of the graph you want.""",
             style={"margin-top": "10px"}),
    html.Div("""To revert the figure to its original state, click on the
             'Reset axes' button at the upper right corner of the figure.""",
             style={"margin-top": "10px"}),
    dcc.Graph(id='example-graph'),
])

@app.callback(
    Output("example-graph", "figure"),
    [
        Input("date-picker-select", "start_date"),
        Input("date-picker-select", "end_date"),
    ],
)
def update_figure(start, end):
    # pd.to_datetime accepts both "YYYY-MM-DD" and full ISO timestamps;
    # a date-only string resolves to midnight of that day.
    start = pd.to_datetime(start)
    end = pd.to_datetime(end)

    # Keep only flights that departed within the selected range.
    dep_datetime = pd.to_datetime(df["DEP_DATETIME"])
    df_filtered = df[(dep_datetime >= start) & (dep_datetime <= end)]

    fig = FigureResampler(go.Figure())

    fig.add_trace(
        go.Scattergl(
            mode="markers",  # Replace with "lines+markers" if you want to display lines between time series data.
            showlegend=False,
            line_width=0.3,
            line_color="gray",
            # Marker arrays are taken from the filtered rows so they match hf_x and hf_y.
            marker={
                "color": abs(df_filtered["DEP_DELAY"]),  # Convert marker value to color.
                "colorscale": "Portland",  # How marker color changes based on data point value.
                "size": abs(5 + df_filtered["DEP_DELAY"] / 50)  # Non-negative size of individual data point marker based on the dataset.
            }
        ),
        hf_x=df_filtered["DEP_DATETIME"],
        hf_y=df_filtered["DEP_DELAY"],
        max_n_samples=N
    )

    fig.update_layout(
        title="Flight departure delay",
        xaxis_title="Flight date and time (24h)",
        yaxis_title="Departure delay (minutes)"
    )

    return fig
4 changes: 4 additions & 0 deletions examples/dash/plotly-large-dataset/combined/requirements.txt
@@ -0,0 +1,4 @@
dash
plotly-resampler
pandas
gunicorn
27 changes: 27 additions & 0 deletions examples/dash/plotly-large-dataset/csv-clean.py
@@ -0,0 +1,27 @@
import pandas as pd
import sys

if __name__ == "__main__":
    if (len(sys.argv) != 2 or not sys.argv[1].endswith(".csv")):
        raise ValueError("Usage: python csv-clean.py filename.csv")

    in_file = sys.argv[1]
    df = pd.read_csv(in_file)

    # Clean out null values
    df = df[df['DEP_TIME'].notnull() & df['DEP_DELAY'].notnull()]

    # Ensure hour is between 0 and 23 for conversion
    df.loc[df.DEP_TIME == 2400, 'DEP_TIME'] = 0

    # Add time to date and convert
    df["DEP_DATETIME"] = df["FL_DATE"] * 10000 + df["DEP_TIME"]
    df["DEP_DATETIME"] = df["DEP_DATETIME"].apply(lambda x: pd.to_datetime(str(int(x))))

    # Select relevant columns.
    df = df[["DEP_DATETIME", "DEP_DELAY"]].sort_values(["DEP_DATETIME"])
    print("Completed conversion. Resulting DataFrame:\n")
    print(df)

    out_file = in_file[:-4] + "-cleaned.csv"
    df.to_csv(out_file, sep=",")
73 changes: 73 additions & 0 deletions examples/dash/plotly-large-dataset/resampler/app.py
@@ -0,0 +1,73 @@
from dash import dcc, html, Input, Output, Dash
import pandas as pd
from datetime import datetime as dt
import plotly.graph_objects as go
from plotly_resampler import FigureResampler

app = Dash(__name__)
server = app.server

N = 2000

df = pd.read_csv("data/flights-3m-cleaned.csv")

app.layout = html.Div(children=[
    html.H1("Plotting Large Datasets in Dash"),
    html.H2("""Downsampled figure: Departure delay time of around 3
            million flights in the first half (1/1-6/30) of 2006"""),
    html.P("Select range of flight dates to visualize"),
    dcc.DatePickerRange(
        id="date-picker-select",
        start_date=dt(2006, 1, 1),
        end_date=dt(2006, 4, 1),
        min_date_allowed=dt(2006, 1, 1),
        max_date_allowed=dt(2006, 7, 1),
        initial_visible_month=dt(2006, 1, 1),
    ),
    html.Div("""Click on the graph and drag
             your cursor around to zoom into any part of the graph you want.""",
             style={"margin-top": "10px"}),
    html.Div("""To revert the figure to its original state, click on the
             'Reset axes' button at the upper right corner of the figure.""",
             style={"margin-top": "10px"}),
    dcc.Graph(id='example-graph'),
])

@app.callback(
    Output("example-graph", "figure"),
    [
        Input("date-picker-select", "start_date"),
        Input("date-picker-select", "end_date"),
    ],
)
def update_figure(start, end):
    # pd.to_datetime accepts both "YYYY-MM-DD" and full ISO timestamps;
    # a date-only string resolves to midnight of that day.
    start = pd.to_datetime(start)
    end = pd.to_datetime(end)

    # Keep only flights that departed within the selected range.
    dep_datetime = pd.to_datetime(df["DEP_DATETIME"])
    df_filtered = df[(dep_datetime >= start) & (dep_datetime <= end)]

    fig = FigureResampler(go.Figure())

    fig.add_trace(
        go.Scatter(
            mode="markers",  # Replace with "lines+markers" if you want to display lines between time series data.
            showlegend=False,
            line_width=0.3,
            line_color="gray",
            # Marker arrays are taken from the filtered rows so they match hf_x and hf_y.
            marker_size=abs(5 + df_filtered["DEP_DELAY"] / 50),  # Non-negative size of individual data point marker based on the dataset.
            marker_colorscale="Portland",  # How marker color changes based on data point value.
            marker_color=abs(df_filtered["DEP_DELAY"]),  # Convert marker value to color.
        ),
        hf_x=df_filtered["DEP_DATETIME"],
        hf_y=df_filtered["DEP_DELAY"],
        max_n_samples=N
    )

    fig.update_layout(
        title="Flight departure delay",
        xaxis_title="Flight date and time (24h)",
        yaxis_title="Departure delay (minutes)"
    )

    return fig
4 changes: 4 additions & 0 deletions examples/dash/plotly-large-dataset/resampler/requirements.txt
@@ -0,0 +1,4 @@
dash
plotly-resampler
pandas
gunicorn
47 changes: 47 additions & 0 deletions examples/dash/plotly-large-dataset/webgl/app.py
@@ -0,0 +1,47 @@
from dash import Dash, dcc, html
import pandas as pd
import plotly.graph_objects as go

app = Dash(__name__)
server = app.server

N = 100000 # Limit number of rows to plot.

fig = go.Figure() # Initiate the figure.

df = pd.read_csv("data/flights-3m-cleaned.csv")

fig.add_trace(go.Scattergl(
    x=df["DEP_DATETIME"][:N],
    y=df["DEP_DELAY"][:N],
    mode="markers",  # Replace with "lines+markers" if you want to display lines between time series data.
    showlegend=False,
    line_width=0.3,
    line_color="gray",
    marker={
        "color": abs(df["DEP_DELAY"][:N]),  # Convert marker value to color.
        "colorscale": "Portland",  # How marker color changes based on data point value.
        "size": abs(5 + df["DEP_DELAY"][:N] / 50)  # Non-negative size of individual data point marker based on the dataset.
    }
))

fig.update_layout(
    title="Flight departure delay",
    xaxis_title="Flight date and time (24h)",
    yaxis_title="Departure delay (minutes)"
)

app.layout = html.Div(children=[
    html.H1("Plotting Large Datasets in Dash"),
    # Only the first N rows are plotted here, so the heading says so.
    html.H2("""WebGL figure: Departure delay time of the first 100,000 of
            around 3 million flights in the first half (1/1-6/30) of 2006"""),
    html.Div("""Click on the graph and drag
             your cursor around to zoom into any part of the graph you want.""",
             style={"margin-top": "10px"}),
    html.Div("""To revert the figure to its original state, click on the
             'Reset axes' button at the upper right corner of the figure.""",
             style={"margin-top": "10px"}),
    dcc.Graph(id='example-graph', figure=fig),
])
3 changes: 3 additions & 0 deletions examples/dash/plotly-large-dataset/webgl/requirements.txt
@@ -0,0 +1,3 @@
dash
pandas
gunicorn
