Snowflake Cortex Forecasting

Setup Working Environment

Step 1: Choose Your IDE

This demo expects that you have VS Code installed. If you do not have it installed, you can download it here. Install the Python extensions so that you are able to create a Miniconda environment.

Step 2: Install Miniconda

bash install_miniconda.sh

Step 3: Create Conda Environment

conda create --name cortex_forecast --override-channels -c https://repo.anaconda.com/pkgs/snowflake python=3.11 -y
conda activate cortex_forecast
pip install --upgrade pip setuptools wheel
pip install .

Step 4: Run Streamlit App Locally (skip if you plan to deploy to Streamlit in Snowflake)

streamlit run streamlit_app.py

Note: inside streamlit_app.py you can add your Snowflake connection parameters around line 105 of the file to set a default; otherwise, the Streamlit app will ask for your credentials when you run it.

if st.session_state.snowpark_connection is None:
    try:
        # Use default SIS connection parameters
        connection_config = {
            'user': os.getenv('SNOWFLAKE_USER', ''),
            'password': os.getenv('SNOWFLAKE_PASSWORD', ''),
            'account': os.getenv('SNOWFLAKE_ACCOUNT', ''),
            'database': 'CORTEX',
            'warehouse': 'CORTEX_WH',
            'schema': 'DEV',
            'role': 'CORTEX_USER_ROLE'
        }
        # ... the app then builds the Snowpark session from connection_config
    except Exception as e:
        st.error(f"Failed to establish Snowflake connection: {e}")
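These `os.getenv` calls fall back to an empty string whenever a variable is unset, so a missing environment variable never raises an error at config-build time. A small standalone sketch of that behavior (the username value is hypothetical):

```python
import os

# Hypothetical value for illustration; SNOWFLAKE_PASSWORD is deliberately left unset
os.environ["SNOWFLAKE_USER"] = "jdoe"
os.environ.pop("SNOWFLAKE_PASSWORD", None)

connection_config = {
    'user': os.getenv('SNOWFLAKE_USER', ''),          # set -> "jdoe"
    'password': os.getenv('SNOWFLAKE_PASSWORD', ''),  # unset -> "" (fallback)
}
print(connection_config)
```

This is why the Streamlit app can still start without credentials configured: it simply prompts you for the missing values instead.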

Step 5: Deploy Streamlit App to Snowflake

Install the Snowflake CLI: How to Link

Once you have the CLI installed you can run the following command to deploy the streamlit app to Snowflake.

# Set initial snowflake CLI connection
snow connection add

# Set default connection 
snow connection set-default <NEW_CONNECTION_NAME>

# Test snowflake connection
snow connection test

# Deploy streamlit app to Snowflake
snow streamlit deploy --replace

This will provide you with a URL that you can use to access the streamlit app that is running in Snowflake.

Streamlit Application

The Streamlit application provides a user-friendly interface to the SnowflakeMLForecast functionality.

Key Features

  • Choose between your own data or a pre-loaded example dataset
  • Step-by-step forecast configuration
  • Interactive visualization of results

How to Run the Application

To run the Streamlit application, use the following command in your terminal:

streamlit run streamlit_app.py

This will start the Streamlit server and open the application in your default web browser.

How to Use

  1. Data Selection

    • Choose between your Snowflake data or the example dataset

    • If using your data, select the database, schema, and table/view

    • Specify timestamp, target, and optional series/exogenous columns

  2. Forecast Configuration

    • Set model name, tags, and comments

    • Configure forecast parameters (training days, forecast horizon)

    • Set advanced options like evaluation settings

  3. Model Execution

    • Review settings and run the forecast

    • View results including charts and statistics

Tips for Best Results

  • Ensure consistent frequency in your time series data

  • Include sufficient historical data to capture seasonal patterns

  • Experiment with different configuration settings

  • For your own data, carefully select relevant exogenous variables

Remember, these code snippets are integrated into the Streamlit application, allowing for an interactive experience. Users can modify parameters, load their own data, and see results in real-time through the web interface.

Snowflake ML Forecast

Note: This is a POC and is not production ready. It’s designed to get you 80-90% of the way there for simple forecasting use cases, serving as a starting point for your forecasting journey.

Overview

The project consists of three main components:

  1. SnowparkConnection Class: A Python class that manages the connection to Snowflake, handling authentication and session management.
  2. SnowflakeMLForecast Class: A Python class that interfaces with Snowflake to create, manage, and analyze forecast models.
  3. Streamlit Application: A web-based interface that makes it easy to configure and run forecasts without writing code.

SnowflakeMLForecast Class

Features

  • Dynamic forecast model creation using Snowflake’s CREATE SNOWFLAKE.ML.FORECAST functionality
  • Visualization capabilities for forecast results
  • Tag management in Snowflake
  • Configurable via YAML files
  • Robust error handling

Installation

  1. Clone the repository

  2. Install dependencies:

    pip install snowflake-connector-python pandas altair streamlit
    

Basic Usage

from snowflake_ml_forecast import SnowflakeMLForecast

# Setup connection config
connection_config = {
    'user': 'your_user',
    'password': 'your_password',
    'account': 'your_account',
    'database': 'your_database',
    'warehouse': 'your_warehouse',
    'schema': 'your_schema',
    'role': 'your_role'
}

# Create forecast model
forecast_model = SnowflakeMLForecast(
    config_file='path/to/your/config.yaml',
    connection_config=connection_config
)

# Run forecast and visualize
forecast_data = forecast_model.create_and_run_forecast()
forecast_model.generate_forecast_and_visualization()

# Clean up
forecast_model.cleanup()

Forecast Configuration Guide

This guide provides a detailed explanation of all configuration options available for the Snowflake ML Forecast tool. The configuration is typically defined in a YAML file.

Configuration Structure

The configuration is divided into four main sections:

  1. model: Defines the model’s metadata
  2. input_data: Specifies the input data source and structure
  3. forecast_config: Sets up the forecasting parameters
  4. output: Defines where the forecast results will be stored

Below is a detailed breakdown of each section:

1. Model Configuration

model:
  name: my_taxi_forecast_model
  tags:
    environment: production
    team: data_science
  comment: "Forecast model for predicting sales trends."
  • name (required): A unique identifier for your model.
  • tags (optional): Key-value pairs for categorizing and organizing your models.
    • environment: Typically set to ‘development’, ‘production’, or ‘testing’.
    • team: The team responsible for this model.
  • comment (optional): A description of the model’s purpose or any relevant notes.

2. Input Data Configuration

input_data:
  table: ny_taxi_rides_h3_train
  table_type: table
  timestamp_column: pickup_time
  target_column: pickups
  series_column: h3
  exogenous_columns:
  - PUBLIC_HOLIDAY
  - SPORT_EVENT
  • table (required): The name of the table or view containing your input data.
  • table_type (optional): Specifies whether the input is a ‘table’ or a ‘view’. Default is ‘table’.
  • timestamp_column (required): The name of the column containing timestamp data.
  • target_column (required): The name of the column containing the values you want to forecast.
  • series_column (optional): For multi-series forecasting, the column that identifies different series.
  • exogenous_columns (optional): A list of additional columns to be used as features in the forecast model.
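Before launching a forecast, it can help to sanity-check that the required input_data fields are populated. The snippet below is an illustrative helper mirroring the YAML example above; it is not part of the library:

```python
# input_data section as a plain dict, mirroring the YAML example above
input_data = {
    'table': 'ny_taxi_rides_h3_train',
    'table_type': 'table',
    'timestamp_column': 'pickup_time',
    'target_column': 'pickups',
    'series_column': 'h3',
    'exogenous_columns': ['PUBLIC_HOLIDAY', 'SPORT_EVENT'],
}

# table, timestamp_column, and target_column are the required fields
REQUIRED_KEYS = ('table', 'timestamp_column', 'target_column')
missing = [key for key in REQUIRED_KEYS if not input_data.get(key)]
if missing:
    raise ValueError(f"input_data is missing required fields: {missing}")
print("input_data configuration OK")
```

Running the same check against your own config before calling the forecast saves a round trip to Snowflake when a key is misspelled.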

3. Forecast Configuration

forecast_config:
  training_days: 30
  table: ny_taxi_rides_h3_predict
  config_object:
    on_error: skip
    evaluate: true
    evaluation_config:
      n_splits: 2
      gap: 0
      prediction_interval: 0.95
  • training_days (optional): The number of days of historical data to use for training. If not specified, all available data will be used.
  • table (optional): If specified, the model will create predictions for the data in this table instead of forecasting future dates.
  • config_object: Advanced configuration options for the forecasting process.
    • on_error (optional): Determines behavior when an error occurs. Options are ‘skip’ or ‘fail’. Default is ‘skip’.
    • evaluate (optional): Whether to perform model evaluation. Default is true.
    • evaluation_config: Settings for the evaluation process.
      • n_splits (optional): Number of splits for cross-validation. Default is 2.
      • test_size (optional): Size of the test set in days. If not specified, it will be automatically determined.
      • gap (optional): Number of days between the training and test sets. Default is 0.
      • prediction_interval (optional): Confidence level for prediction intervals. Default is 0.95 (95% confidence).
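Behind the scenes, these settings feed Snowflake's CREATE SNOWFLAKE.ML.FORECAST statement. The sketch below shows roughly how such a statement could be assembled from the configuration values; it is a simplified illustration, not the class's actual implementation:

```python
# Configuration values taken from the examples in this guide
model_name = 'my_taxi_forecast_model'
input_table = 'ny_taxi_rides_h3_train'
timestamp_column = 'pickup_time'
target_column = 'pickups'
series_column = 'h3'

# Assemble a CREATE SNOWFLAKE.ML.FORECAST statement (simplified sketch)
sql = (
    f"CREATE OR REPLACE SNOWFLAKE.ML.FORECAST {model_name}(\n"
    f"    INPUT_DATA => TABLE({input_table}),\n"
    f"    SERIES_COLNAME => '{series_column}',\n"
    f"    TIMESTAMP_COLNAME => '{timestamp_column}',\n"
    f"    TARGET_COLNAME => '{target_column}',\n"
    f"    CONFIG_OBJECT => {{'ON_ERROR': 'SKIP'}}\n"
    f")"
)
print(sql)
```

Seeing the generated DDL makes it easier to debug a misconfigured column name: each YAML field maps directly onto one argument of the statement.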

4. Output Configuration

output:
  table: taxi_forecast_results
  • table (required): The name of the table where forecast results will be stored.

Usage Notes

  1. Ensure all required fields are filled out correctly.
  2. For multi-series forecasting, make sure to specify the series_column.
  3. Choose exogenous_columns carefully - they should be variables that you expect to have a significant impact on your target variable.
  4. When using training_days, ensure you have enough historical data to cover the specified period.
  5. If you specify a table in forecast_config, make sure it contains future dates or scenarios you want to predict.
  6. Adjust evaluation_config parameters based on your data characteristics and forecast horizon.

The class uses a YAML file for configuration. Here's a simple example:

model:
  name: my_forecast_model
  tags:
    environment: production
    team: data_science
  comment: "Forecast model for predicting trends."

input_data:
  table: storage_usage_train
  timestamp_column: usage_date
  target_column: storage_gb

forecast_config:
  training_days: 180
  forecast_days: 30
  config_object:
    evaluate: true
    evaluation_config:
      prediction_interval: 0.95

output:
  table: storage_forecast_results
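With training_days: 180, the training window starts 180 days before today. A small sketch of that cutoff computation, mirroring the DATEADD logic used in the SQL example later in this README:

```python
from datetime import date, timedelta

training_days = 180

# Equivalent of SQL: DATEADD(day, -training_days, CURRENT_DATE())
cutoff = date.today() - timedelta(days=training_days)
print(f"Training on data after {cutoff.isoformat()}")
```

If your source table has fewer than 180 days of history, the window simply covers whatever data exists, so double-check that the period you request is actually available.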

Versatility and Adaptability

While this project showcases a powerful Streamlit application for interactive forecasting, it’s important to note that the core functionality is highly versatile and can be easily integrated into various workflows:

  1. Python Scripts and Notebooks: The SnowflakeMLForecast class can be imported and used in your own Python scripts or Jupyter notebooks, allowing for seamless integration with your existing data processing pipelines.

  2. Automated Reporting: Leverage the forecasting capabilities in automated reporting systems, scheduling regular forecast updates without manual intervention.

  3. Custom Applications: Build your own applications around the SnowflakeMLForecast class, tailoring the user interface and functionality to your specific needs.

  4. Baseline for Advanced Models: Use this project as a starting point for more complex forecasting models, adding your own custom features or integrating with other machine learning libraries.

Remember, while the Streamlit application provides a great out-of-the-box experience, the underlying SnowflakeMLForecast class is designed to be flexible and extensible. We encourage you to explore beyond the Streamlit interface and adapt the code to fit your unique forecasting requirements. Whether you're using the ready-made Streamlit application or integrating the core functionality into your own systems, this tool provides a solid foundation for leveraging Snowflake's ML capabilities for time series forecasting.

Project Structure

Here’s a simplified overview of the project structure:

cortex_forecast/
├── cortex_forecast/
│   ├── __init__.py
│   ├── connection.py
│   ├── forecast.py
│   └── files/
│       └── yaml/
│           ├── storage_forecast_config.yaml
│           └── taxi_forecast_config.yaml
├── docs/
│   ├── index.ipynb
│   └── forecast_storage_example.ipynb
├── nbs/
│   ├── 00_connection.ipynb
│   └── 01_cortex_forecast.ipynb
├── pages/
│   ├── 00_table_selection_creation_page.py
│   ├── 01_create_forecast_config.py
│   └── 02_modeling.py
├── streamlit_app.py
├── requirements.txt
└── setup.py

Key components:

  • cortex_forecast/: Core Python package with the main functionality.

  • docs/: Documentation notebooks and examples.

  • nbs/: Development notebooks (using nbdev).

  • pages/: Streamlit application pages.

  • streamlit_app.py: Main Streamlit application entry point.

This structure separates the core functionality (cortex_forecast/) from the Streamlit application (pages/ and streamlit_app.py), while keeping documentation (docs/) and development notebooks (nbs/) organized.

Conclusion

Whether you’re using the SnowflakeMLForecast class in your Python scripts or leveraging the Streamlit application for quick forecasts, this tool provides a flexible and powerful way to generate forecasts using Snowflake’s ML capabilities. Start with the example data to get familiar with the process, then apply it to your own datasets for valuable insights.

Happy forecasting!

Full Example

See the docs/ folder for two examples of this in action: one for storage usage and one for taxi pickups in NYC.

from snowflake.snowpark.version import VERSION
from cortex_forecast.forecast import SnowflakeMLForecast
import os

Create Snowflake Connection Using SnowflakeMLForecast

Note: Make sure you create a YAML configuration file so that SnowflakeMLForecast can read the forecast settings from it and build your forecast.

forecast_model = SnowflakeMLForecast(
    config='./cortex_forecast/files/yaml/storage_forecast_config.yaml',
    connection_config={
        'user': os.getenv('SNOWFLAKE_USER'),
        'password': os.getenv('SNOWFLAKE_PASSWORD'),
        'account': os.getenv('SNOWFLAKE_ACCOUNT'),
        'database': 'CORTEX',
        'warehouse': 'CORTEX_WH',
        'schema': 'DEV',
        'role': 'CORTEX_USER_ROLE'  # Use the desired role
    },
    is_streamlit=False
)

snowflake_environment = forecast_model.session.sql('SELECT current_user(), current_version()').collect()
snowpark_version = VERSION
print('\nConnection Established with the following parameters:')
print('Snowflake version           : {}'.format(snowflake_environment[0][1]))
print('Snowpark for Python version : {}.{}.{}'.format(snowpark_version[0], snowpark_version[1], snowpark_version[2]))
# Create Training Data
training_days = 365

forecast_model.session.sql(f'''CREATE OR REPLACE TABLE storage_usage_train AS
    SELECT 
        TO_TIMESTAMP_NTZ(usage_date) AS usage_date,
        storage_bytes / POWER(1024, 3) AS storage_gb
    FROM 
    (
        SELECT * 
            FROM snowflake.account_usage.storage_usage
            WHERE usage_date < CURRENT_DATE()
    )
    WHERE TO_TIMESTAMP_NTZ(usage_date) > DATEADD(day, -{training_days}, CURRENT_DATE())
''').collect()
forecast_model.session.sql('SELECT * FROM storage_usage_train ORDER BY usage_date DESC LIMIT 10').show()
import matplotlib.pyplot as plt
df = forecast_model.session.sql('SELECT * FROM storage_usage_train ORDER BY usage_date').to_pandas()
df.head()
df = df.set_index('USAGE_DATE')
df['STORAGE_GB'].plot(figsize=(10, 6), title='Storage GB Over Time')

# Show the plot
plt.xlabel('Date')
plt.ylabel('Storage GB')
plt.grid(True)
plt.show()
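The storage_bytes / POWER(1024, 3) expression in the training-data SQL converts raw bytes to gigabytes (strictly, gibibytes). The same conversion in Python, with a hypothetical byte count:

```python
# Hypothetical raw value: 5 GiB of storage expressed in bytes
storage_bytes = 5 * 1024 ** 3

# Same conversion as SQL's storage_bytes / POWER(1024, 3)
storage_gb = storage_bytes / 1024 ** 3
print(storage_gb)  # 5.0
```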

Train a Model

This uses the settings from the YAML file you passed to the SnowflakeMLForecast object.

# Run Forecast
forecast_data = forecast_model.create_and_run_forecast()
forecast_data.head()

Visualize Forecast

forecast_model.generate_forecast_and_visualization(show_historical=True)