From a81e40c794a70b210c07b0ec9c98a2775215c30f Mon Sep 17 00:00:00 2001 From: Anika Shahi <68033298+sfc-gh-ashahi@users.noreply.github.com> Date: Sun, 30 Nov 2025 21:34:06 -0800 Subject: [PATCH 01/11] Create optimizing-llama-snowflake-cookbook.md --- .../optimizing-llama-snowflake-cookbook.md | 1 + 1 file changed, 1 insertion(+) create mode 100644 site/sfguides/src/optimizing-llama-snowflake-cookbook/optimizing-llama-snowflake-cookbook.md diff --git a/site/sfguides/src/optimizing-llama-snowflake-cookbook/optimizing-llama-snowflake-cookbook.md b/site/sfguides/src/optimizing-llama-snowflake-cookbook/optimizing-llama-snowflake-cookbook.md new file mode 100644 index 0000000000..8b13789179 --- /dev/null +++ b/site/sfguides/src/optimizing-llama-snowflake-cookbook/optimizing-llama-snowflake-cookbook.md @@ -0,0 +1 @@ + From 82854ff45c7fe7e5db6eccf69e51869373deb348 Mon Sep 17 00:00:00 2001 From: Anika Shahi <68033298+sfc-gh-ashahi@users.noreply.github.com> Date: Sun, 30 Nov 2025 21:39:38 -0800 Subject: [PATCH 02/11] Update optimizing-llama-snowflake-cookbook.md --- .../optimizing-llama-snowflake-cookbook.md | 18 ++++++++++++++++++ 1 file changed, 18 insertions(+) diff --git a/site/sfguides/src/optimizing-llama-snowflake-cookbook/optimizing-llama-snowflake-cookbook.md b/site/sfguides/src/optimizing-llama-snowflake-cookbook/optimizing-llama-snowflake-cookbook.md index 8b13789179..ae48ade9a2 100644 --- a/site/sfguides/src/optimizing-llama-snowflake-cookbook/optimizing-llama-snowflake-cookbook.md +++ b/site/sfguides/src/optimizing-llama-snowflake-cookbook/optimizing-llama-snowflake-cookbook.md @@ -1 +1,19 @@ +Table of Contents +1. Why Open Source Llama Models? + +2. How Llama Models Can Be Used on Snowflake + +3. AISQL with Llama: A Use Case + +4. Fine-Tuning Llama Models – Best Practices & Customer Support Recipe + +5. Performance Optimizations on the Snowflake Platform + +6. Cost Optimization, Governance, and Prompt Caching + +7. Explainability of Llama-Based Applications + +8. Interoperability with MCP and Partner Tooling + +9. Working with Snowflake Intelligence and Cortex Knowledge Extensions From adea2878f6a8d4cee1d6019a5ae4e7b7192a9bd1 Mon Sep 17 00:00:00 2001 From: Anika Shahi <68033298+sfc-gh-ashahi@users.noreply.github.com> Date: Sun, 30 Nov 2025 21:43:40 -0800 Subject: [PATCH 03/11] Update optimizing-llama-snowflake-cookbook.md --- .../optimizing-llama-snowflake-cookbook.md | 352 ++++++++++++++++++ 1 file changed, 352 insertions(+) diff --git a/site/sfguides/src/optimizing-llama-snowflake-cookbook/optimizing-llama-snowflake-cookbook.md b/site/sfguides/src/optimizing-llama-snowflake-cookbook/optimizing-llama-snowflake-cookbook.md index ae48ade9a2..e33ccc5dd8 100644 --- a/site/sfguides/src/optimizing-llama-snowflake-cookbook/optimizing-llama-snowflake-cookbook.md +++ b/site/sfguides/src/optimizing-llama-snowflake-cookbook/optimizing-llama-snowflake-cookbook.md @@ -1,3 +1,5 @@ +Optimizing Open Source Llama on Snowflake Cookbook + Table of Contents 1. Why Open Source Llama Models? @@ -17,3 +19,353 @@ Table of Contents 8. Interoperability with MCP and Partner Tooling 9. Working with Snowflake Intelligence and Cortex Knowledge Extensions + + + +1. Why Open Source Llama Models? 
+ + + + +Open source Llama models (from Meta) are a strategic choice for enterprises because they provide: + +Transparency and control – Full visibility into model architecture, training data, and safety guardrails, which is critical for regulated industries and internal governance.​ + +Avoiding vendor lock-in – Ability to fine-tune, host, and run Llama variants (e.g., llama3.1-70b, llama3.1-405b) on-prem, in cloud, or in Snowflake, reducing dependency on a single closed LLM provider.​ + +Customization and domain alignment – Llama models can be fine-tuned on proprietary data (e.g., support tickets, internal docs) to match company tone, terminology, and compliance requirements.​​ + +Cost efficiency at scale – Snowflake’s optimized Llama variants (e.g., snowflake-llama-3.3-70b, snowflake-llama-3.1-405b) use SwiftKV and other optimizations to reduce inference cost by up to 75% vs. vanilla Llama, while keeping data in Snowflake.​ + +For Snowflake customers, this means you can run high-quality, open LLMs securely on your governed data platform, combining the flexibility of open source with the performance and governance of the Snowflake AI Data Cloud.​ + +2. How Llama Can Be Used on Snowflake + + + + +Snowflake provides several ways to leverage Llama models, all keeping data in place and under governance: + +Cortex AI Functions (AISQL) – Use AI_COMPLETE with Llama models (e.g., llama3.1-70b, llama3.1-405b) directly in SQL to generate text, summarize, classify, or translate unstructured data stored in tables and stages.​ + +Cortex Analyst – Let business users ask natural language questions over structured data; behind the scenes, Cortex Analyst can use Llama models (e.g., via a Mistral + Llama combo) to generate accurate SQL and answer questions.​ + +Cortex Agents – Build agentic workflows (e.g., customer support bots, document processors) that call Llama models via Cortex Agent APIs, with tools for search, SQL, and external actions.​ + +Custom Llama in Snowpark – Bring your own Llama-based model (e.g., fine-tuned Llama 3) as a containerized UDF or stored procedure in Snowpark, then call it from SQL or Python.​ + +Key benefits on Snowflake: + +Data never leaves the Snowflake environment; prompts and responses stay within your governance boundary.​ + +Native integration with RBAC, masking policies, and data sharing, so Llama apps inherit existing security and compliance controls.​ + +Built-in functions like AI_COUNT_TOKENS and AI_COMPLETE make it easy to prototype and productionize Llama use cases in SQL.​ + + + + + +3. AISQL with Llama – A Use Case +Use Case: Aggregating Insights from Customer Support Tickets + +Imagine a support team that wants to quickly understand common themes across thousands of tickets without writing complex SQL or Python. + +Solution with AISQL + Llama: + +Store support tickets in a Snowflake table (e.g., support.tickets with columns ticket_id, subject, description, created_at). + +Use AI_AGG with a Llama model (e.g., llama3.1-70b) to summarize themes across all tickets in a time window.​ + +"sql +SELECT + AI_AGG( + 'llama3.1-70b', + 'Summarize the main issues and feature requests mentioned in these support tickets. 
Group by category and list the top 5 issues with examples.', + description + ) AS insights +FROM support.tickets +WHERE created_at >= DATEADD('day', -7, CURRENT_DATE());" + +Benefits: + +No need to export data or use external LLM APIs; everything runs inside Snowflake.​ + +The Llama model can handle long context (128K tokens for llama3.1-70b), so it can process many tickets in a single call.​ + +Results can be materialized into a table and exposed via Streamlit or BI tools for self-service analytics.​ + +This pattern works for many AISQL use cases: summarizing feedback, classifying tickets, extracting entities, or translating multilingual support content.​ + + + +4. Fine-Tuning with Best Practices for Llama Recipes – Focus on Customer Support + + + + +Fine-tuning Llama models (e.g., Llama 3.1 8B/70B) on customer support data can dramatically improve response quality, tone, and accuracy. + +Best Practices for Fine-Tuning Llama on Snowflake +Start with a curated dataset + +Collect historical support tickets, agent responses, and approved answers. + +Structure as a prompt-completion pair: + +json +{"prompt": "Customer: How do I reset my password?", "completion": "To reset your password, go to Settings > Account > Reset Password. If you don’t see that option, contact support@company.com."} +Use Snowflake tables to store and version this dataset; apply masking and access controls as needed.​​ + +Choose the right Llama variant + +For most support use cases, llama3.1-8b or llama3.1-70b is ideal: good balance of reasoning, latency, and cost.​ + +Use snowflake-llama-3.3-70b if you want the cost/performance benefits of SwiftKV.​ + +Use parameter-efficient fine-tuning (PEFT) + +Apply LoRA (Low-Rank Adaptation) to fine-tune only a small fraction of parameters, reducing GPU cost and time.​​ + +In Snowflake, this can be done via Snowpark ML or by integrating with external fine-tuning platforms (e.g., Hugging Face, Predibase) that pull data from Snowflake.​​ + +Add guardrails and safety + +Use Cortex Guard (Llama Guard 3) to filter unsafe or harmful responses in production.​ + +Add a post-processing step (e.g., a stored procedure) to validate responses against a list of approved answers or policies.​​ + +Evaluate and iterate + +Define metrics: accuracy, response length, hallucination rate, and customer satisfaction (CSAT).​​ + +Use A/B testing in Snowflake to compare the fine-tuned Llama model against the base model or a human baseline.​ + +Example Customer Support Recipe +python +# Pseudocode: Fine-tuning Llama 3.1 8B for support +from snowflake.snowpark import Session +from transformers import AutoModelForCausalLM, TrainingArguments, Trainer +from peft import LoraConfig, get_peft_model + +# 1. Load support data from Snowflake +session = Session.builder.configs(...).create() +support_data = session.table("support.finetune_dataset").to_pandas() + +# 2. Prepare prompt-completion pairs +# Format: "Customer: {question}\nAgent: {answer}" + +# 3. Load base Llama model and apply LoRA +model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3.1-8B") +lora_config = LoraConfig(r=8, lora_alpha=16, target_modules=["q_proj", "v_proj"], lora_dropout=0.1) +model = get_peft_model(model, lora_config) + +# 4. Train with Hugging Face Trainer +training_args = TrainingArguments( + output_dir="./support-llama", + per_device_train_batch_size=4, + num_train_epochs=3, + save_steps=100, + logging_steps=10, +) +trainer = Trainer(model=model, args=training_args, train_dataset=support_data) +trainer.train() + +# 5. 
Push fine-tuned model to Snowflake (as a containerized UDF or Cortex Agent) +# Then call it via AI_COMPLETE or Cortex Agent API +This recipe can be adapted for any domain (e.g., HR, finance, sales) by changing the training data and prompt templates.​​ + +5. Performance Optimizations on Snowflake Platform + + + +To get the best performance from Llama models on Snowflake, follow these platform-specific optimizations: + +Use the right warehouse size + +For Cortex AI functions (e.g., AI_COMPLETE, AI_AGG), use a warehouse no larger than MEDIUM; larger warehouses do not improve LLM throughput and only increase cost.​ + +For custom Llama UDFs in Snowpark, size the warehouse based on batch size and concurrency needs, and use auto-suspend to avoid idle costs.​​ + +Batch and pipeline calls + +Process many rows in a single query (e.g., SELECT AI_COMPLETE(...) FROM large_table) to maximize throughput.​ + +For interactive apps, use the Cortex REST APIs (Complete, Embed, Agents) instead of SQL functions to reduce latency.​ + +Leverage SwiftKV-optimized models + +Prefer snowflake-llama-3.3-70b and snowflake-llama-3.1-405b over vanilla Llama models; they are optimized for higher throughput and lower latency on Snowflake.​ + +SwiftKV reduces computational overhead during prompt processing, allowing prefill tokens to skip up to half the model’s layers with minimal accuracy loss.​ + +Optimize context and output length + +Use AI_COUNT_TOKENS to estimate token usage and avoid hitting model limits (e.g., 128K context for llama3.1-70b).​ + +For summarization or classification, keep prompts concise and set a reasonable max_tokens to reduce latency and cost.​ + +Use Cortex Search for RAG + +For long-context tasks (e.g., support chatbot), combine Llama with Cortex Search to retrieve relevant documents and inject them into the prompt, rather than passing entire documents to the LLM.​ + +These optimizations ensure that Llama-based workloads are fast, scalable, and cost-effective on Snowflake.​ + +6. Cost Optimization, Governance, and Prompt Caching +Running Llama models at scale requires careful cost and governance controls. 
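
To make spend visible from the start, a minimal monitoring query of the shape below can back the practices that follow. It is a sketch, assuming your role can read the SNOWFLAKE.ACCOUNT_USAGE schema; the columns shown follow the documented CORTEX_FUNCTIONS_USAGE_HISTORY view.

sql
-- Per-model credit and token consumption for Cortex/Llama calls, last 30 days.
SELECT
    model_name,
    function_name,
    SUM(tokens)        AS total_tokens,
    SUM(token_credits) AS total_credits
FROM SNOWFLAKE.ACCOUNT_USAGE.CORTEX_FUNCTIONS_USAGE_HISTORY
WHERE start_time >= DATEADD('day', -30, CURRENT_TIMESTAMP())
GROUP BY model_name, function_name
ORDER BY total_credits DESC;

Materializing this into a small dashboard makes it easy to spot when a workload should move to a cheaper Llama variant.
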
+ +Cost Optimization +Choose the right model + +Use smaller models (e.g., llama3.1-8b, mistral-7b) for simple tasks (summarization, classification) and reserve larger models (e.g., llama3.1-70b, llama3.1-405b) for complex reasoning.​ + +Compare cost/quality trade-offs using benchmarks (e.g., MMLU, HumanEval, GSM8K) and real-world use cases.​ + +Track and monitor usage + +Use SNOWFLAKE.ACCOUNT_USAGE.CORTEX_FUNCTIONS_USAGE_HISTORY and CORTEX_FUNCTIONS_QUERY_USAGE_HISTORY to see per-function, per-model credit and token consumption.​ + +Set up alerts and budgets for AI services to avoid unexpected spend.​ + +Apply FinOps practices + +Tag queries and workloads by team, project, or use case to allocate costs accurately.​ + +Regularly review and retire unused models, stages, and Cortex Search services to reduce idle costs.​ + +Governance and Access Control +Control model access + +Use the account-level CORTEX_MODELS_ALLOWLIST to restrict which models can be used (e.g., allow only llama3.1-70b and snowflake-llama-3.3-70b).​ + +Use RBAC on model objects in SNOWFLAKE.MODELS to grant fine-grained access to specific roles (e.g., GRANT APPLICATION ROLE CORTEX-MODEL-ROLE-LLAMA3.1-70B TO ROLE support_analyst).​ + +Secure data and prompts + +Ensure that all media files (documents, audio) are stored in encrypted stages with directory tables enabled.​ + +Use masking policies and row access policies to protect PII in prompts and responses.​ + +Prompt Caching (Conceptual) +While Snowflake does not yet have a built-in LLM prompt cache, you can simulate caching patterns: + +Cache frequent prompts in a table + +Store common prompts and their responses in a table (e.g., support.prompt_cache) with a TTL. + +Before calling AI_COMPLETE, check if the prompt exists and is fresh; if so, return the cached response.​ + +Use deterministic prompts + +For idempotent tasks (e.g., summarizing a fixed set of tickets), use a deterministic prompt template and key (e.g., ticket_ids + prompt_template_hash) to enable easy caching.​ + +This approach reduces redundant LLM calls and lowers cost, especially for high-frequency, low-variability prompts.​ + +7. Explainability + + +Explainability is critical for trust, debugging, and compliance in Llama-based applications. 
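
One concrete building block for the practices in this section is an audit table that records every Llama call. The sketch below uses illustrative names (support.llm_audit_log, support.tickets) rather than a prescribed schema.

sql
-- Illustrative audit table for Llama calls; adjust names and retention to your standards.
CREATE TABLE IF NOT EXISTS support.llm_audit_log (
    called_at   TIMESTAMP_LTZ DEFAULT CURRENT_TIMESTAMP(),
    model_name  STRING,
    prompt      STRING,
    response    STRING,
    parameters  VARIANT
);

-- Capture the prompt, model, and response at the moment of the call.
INSERT INTO support.llm_audit_log (model_name, prompt, response, parameters)
SELECT
    'llama3.1-70b',
    t.description,
    AI_COMPLETE('llama3.1-70b', t.description),
    OBJECT_CONSTRUCT('use_case', 'ticket_summary')
FROM support.tickets t
WHERE t.created_at >= DATEADD('day', -1, CURRENT_DATE());

Because prompts and responses may contain PII, apply the same masking and row access policies here as on the source tables.
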
+ +For custom ML models (not Llama itself) + +Snowflake Model Registry supports Shapley-based explainability for traditional ML models (e.g., regression, classification).​ + +Use model.explain(input_data) to get feature attributions and understand why a model made a particular prediction.​ + +For Llama-based AISQL and agents + +While Llama models themselves are not directly explainable in Snowflake, you can: + +Log prompts, model names, and responses in an audit table for traceability.​ + +Use structured outputs (e.g., JSON with confidence scores) and post-process them to highlight key factors.​ + +For Cortex Analyst, the generated SQL and semantic model provide a clear “reasoning trace” that can be reviewed and audited.​ + +Best practices for explainability + +Always log the full context (prompt, model, parameters, timestamp) for every LLM call.​ + +For customer-facing apps, provide a “show reasoning” option that displays the retrieved documents (from Cortex Search) and the final prompt sent to the LLM.​ + +Use human-in-the-loop review for high-stakes decisions (e.g., support escalations, financial advice) to validate and explain LLM outputs.​ + +This layered approach ensures that Llama applications are transparent, auditable, and aligned with enterprise AI governance standards.​ + +8. Interoperability with MCP and Partner Tooling + + +Model Context Protocol (MCP) enables Llama-based agents to securely interact with Snowflake and partner tools. + +Snowflake-managed MCP server + +Expose Snowflake capabilities (Cortex Analyst, Cortex Search, SQL execution) as MCP tools, so external agents (e.g., Claude Desktop, Cursor, fast-agent) can call them via a standardized interface.​ + +Configure an MCP server with tools like: + +CORTEX_ANALYST_MESSAGE (for natural language over structured data) + +CORTEX_SEARCH_SERVICE_QUERY (for RAG over unstructured content) + +SYSTEM_EXECUTE_SQL (for direct SQL queries).​ + +Integrating Llama with MCP + +Use an MCP client (e.g., Anthropic Claude, Cursor) that supports Llama models; the client can route prompts to Snowflake tools via the MCP server.​ + +For example, a support agent in Cursor can: + +Ask a question in natural language. + +The MCP client invokes the Snowflake MCP server’s Cortex Search tool to find relevant KB articles. + +The Llama model in the client generates a response using the retrieved context.​ + +Partner tooling + +Combine Snowflake + Llama with: + +BI tools (Tableau, Power BI) via Cortex Analyst for natural language analytics.​ + +MLOps platforms (Weights & Biases, MLflow) to track fine-tuning experiments and model versions.​ + +Agentic frameworks (LangChain, LlamaIndex) that use MCP to call Snowflake tools.​​ + +This interoperability lets you build modular, secure, agentic applications where Llama models orchestrate actions across Snowflake and external systems.​ + +9. Working with Snowflake Intelligence and Cortex Knowledge Extensions +Snowflake Intelligence and Cortex Knowledge Extensions (CKEs) extend Llama-based apps with governed, up-to-date knowledge. 
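
The patterns described below all build on the same retrieve-then-generate flow. As a sketch, assuming a Cortex Search service (or CKE) named support_db.kb.kb_search and the SEARCH_PREVIEW helper behaving as documented, a governed RAG call can look like this:

sql
-- Retrieve a few relevant documents, then let a Llama model answer with citations.
WITH retrieved AS (
    SELECT PARSE_JSON(
        SNOWFLAKE.CORTEX.SEARCH_PREVIEW(
            'support_db.kb.kb_search',
            '{"query": "How do I handle a billing dispute?", "limit": 3}'
        )
    ):results AS results
)
SELECT AI_COMPLETE(
    'llama3.1-70b',
    'Answer the question using only the context below and cite the SOURCE_URL of each document you use.\n' ||
    'Question: How do I handle a billing dispute?\nContext: ' || results::STRING
) AS answer
FROM retrieved;

The same shape works whether the search service is your own or a shared Cortex Knowledge Extension.
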
+ +Snowflake Intelligence + +A natural language interface over your Snowflake data and shared content; it can use Llama models (via Cortex Analyst and Cortex Search) to answer questions across structured and unstructured data.​ + +Use cases: + +“What are the top support issues this month?” (structured data via Cortex Analyst).​ + +“Summarize the latest changes in our API docs” (unstructured data via CKE).​​ + +Cortex Knowledge Extensions (CKEs) + +CKEs are Cortex Search Services shared on the Snowflake Marketplace or via private listings; they let you integrate licensed or proprietary content (e.g., KB articles, market research, API docs) into Llama-based apps.​​ + +How it works: + +A provider creates a Cortex Search Service on a table of documents and shares it as a CKE. + +A consumer uses Snowflake Intelligence, Cortex Agent, or a custom app to query the CKE; the LLM reasons over the retrieved content to generate answers with citations.​​ + +Best practices for CKEs and Intelligence + +For customer support: + +Create a CKE from your internal KB, release notes, and support playbooks. + +Use Snowflake Intelligence to let agents ask questions like “How do I handle a billing dispute?” and get answers with links to the relevant KB article.​​ + +Ensure CKEs include a SOURCE_URL column so LLMs can provide clear attribution and hyperlinks back to the source.​ + +Monitor CKE usage and costs in Provider Studio; set content protection thresholds to prevent abuse.​ + +By combining Llama models with Snowflake Intelligence and CKEs, you can build powerful, governed AI assistants that answer questions across both internal data and external knowledge sources From b69024f9bb035beb89ec3e14a7341ddec21e563d Mon Sep 17 00:00:00 2001 From: Anika Shahi <68033298+sfc-gh-ashahi@users.noreply.github.com> Date: Tue, 2 Dec 2025 18:42:24 -0800 Subject: [PATCH 04/11] Update optimizing-llama-snowflake-cookbook.md --- .../optimizing-llama-snowflake-cookbook.md | 254 +++++++++++++----- 1 file changed, 189 insertions(+), 65 deletions(-) diff --git a/site/sfguides/src/optimizing-llama-snowflake-cookbook/optimizing-llama-snowflake-cookbook.md b/site/sfguides/src/optimizing-llama-snowflake-cookbook/optimizing-llama-snowflake-cookbook.md index e33ccc5dd8..2763a77baa 100644 --- a/site/sfguides/src/optimizing-llama-snowflake-cookbook/optimizing-llama-snowflake-cookbook.md +++ b/site/sfguides/src/optimizing-llama-snowflake-cookbook/optimizing-llama-snowflake-cookbook.md @@ -1,12 +1,21 @@ -Optimizing Open Source Llama on Snowflake Cookbook +author: Anika Shahi +id: optimizing-llama-snowflake-cookbook +categories: snowflake-site:taxonomy/solution-center/certification/quickstart, snowflake-site:taxonomy/product/ai, snowflake-site:taxonomy/product/analytics, snowflake-site:taxonomy/snowflake-feature/cortex-analyst +language: en +summary: Optimizing LLama on Snowflake - A Guide to Performance, Cost, Agents Best Practices +environments: web +status: Published +feedback link: https://github.com/Snowflake-Labs/sfguides/issues +# Optimizing Open Source Llama on Snowflake Cookbook + +## Overview -Table of Contents 1. Why Open Source Llama Models? 2. How Llama Models Can Be Used on Snowflake -3. AISQL with Llama: A Use Case +3. AISQL with Llama 4. Fine-Tuning Llama Models – Best Practices & Customer Support Recipe @@ -20,12 +29,15 @@ Table of Contents 9. 
Working with Snowflake Intelligence and Cortex Knowledge Extensions +### What You Will Learn +- How to optimize llama models usage on Snowflake +- Cost considerations and tools for using llama models +- Performance optimizations for Llama models on the Snowflake platform +- Agentic Best Practices with llama models +- Extensibility with Llama open source models -1. Why Open Source Llama Models? - - - +## Why Choose Open Source Llama? Open source Llama models (from Meta) are a strategic choice for enterprises because they provide: @@ -39,7 +51,8 @@ Cost efficiency at scale – Snowflake’s optimized Llama variants (e.g., snowf For Snowflake customers, this means you can run high-quality, open LLMs securely on your governed data platform, combining the flexibility of open source with the performance and governance of the Snowflake AI Data Cloud.​ -2. How Llama Can Be Used on Snowflake + + ## How Llama Can Be Used on Snowflake @@ -65,8 +78,8 @@ Built-in functions like AI_COUNT_TOKENS and AI_COMPLETE make it easy to prototyp +## AISQL with Llama: Use Case -3. AISQL with Llama – A Use Case Use Case: Aggregating Insights from Customer Support Tickets Imagine a support team that wants to quickly understand common themes across thousands of tickets without writing complex SQL or Python. @@ -98,8 +111,7 @@ Results can be materialized into a table and exposed via Streamlit or BI tools f This pattern works for many AISQL use cases: summarizing feedback, classifying tickets, extracting entities, or translating multilingual support content.​ - -4. Fine-Tuning with Best Practices for Llama Recipes – Focus on Customer Support +## Fine-Tuning with Best Practices for Llama Recipes – Focus on Customer Support @@ -141,26 +153,26 @@ Define metrics: accuracy, response length, hallucination rate, and customer sati Use A/B testing in Snowflake to compare the fine-tuned Llama model against the base model or a human baseline.​ -Example Customer Support Recipe +## Example Customer Support Recipe python -# Pseudocode: Fine-tuning Llama 3.1 8B for support + Pseudocode: Fine-tuning Llama 3.1 8B for support from snowflake.snowpark import Session from transformers import AutoModelForCausalLM, TrainingArguments, Trainer from peft import LoraConfig, get_peft_model -# 1. Load support data from Snowflake +1. Load support data from Snowflake session = Session.builder.configs(...).create() support_data = session.table("support.finetune_dataset").to_pandas() -# 2. Prepare prompt-completion pairs -# Format: "Customer: {question}\nAgent: {answer}" +2. Prepare prompt-completion pairs + Format: "Customer: {question}\nAgent: {answer}" -# 3. Load base Llama model and apply LoRA +3. Load base Llama model and apply LoRA model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3.1-8B") lora_config = LoraConfig(r=8, lora_alpha=16, target_modules=["q_proj", "v_proj"], lora_dropout=0.1) model = get_peft_model(model, lora_config) -# 4. Train with Hugging Face Trainer +4. Train with Hugging Face Trainer training_args = TrainingArguments( output_dir="./support-llama", per_device_train_batch_size=4, @@ -171,31 +183,63 @@ training_args = TrainingArguments( trainer = Trainer(model=model, args=training_args, train_dataset=support_data) trainer.train() -# 5. Push fine-tuned model to Snowflake (as a containerized UDF or Cortex Agent) -# Then call it via AI_COMPLETE or Cortex Agent API +5. 
Push fine-tuned model to Snowflake (as a containerized UDF or Cortex Agent)
 Then call it via AI_COMPLETE or the Cortex Agent API
This recipe can be adapted for any domain (e.g., HR, finance, sales) by changing the training data and prompt templates.

# Llama Performance Optimizations Built on the Snowflake Platform

The Snowflake and Llama integration offers a comprehensive set of inference optimizations aimed at real-time, high-throughput, and efficient LLM inference, especially for massive models such as Llama 3.1 70B and Llama 3.1 405B.

These are the key inference optimization techniques:
1. FP8 ZeroQuant and Mixed Precision
 - Uses hardware-agnostic FP8 (8-bit floating point) quantization to reduce memory usage and accelerate computation without significant loss in model accuracy.
 - Mixed precision enables more efficient GPU utilization, balancing speed and precision.
2. Tensor and Pipeline Parallelism
 - Optimized tensor parallelism splits large model weights efficiently across multiple GPUs.
 - Pipeline parallelism improves throughput by dividing the model into stages for concurrent processing.
3. Single-Node and Multi-Node Scalability
 - Supports efficient inference on a single GPU node with high throughput and low latency.
 - Scales to multi-node setups to handle larger models and longer context windows (up to 128K tokens).
4. Low Latency and High Throughput
 - Achieves up to 3x lower end-to-end latency and 1.4x higher token throughput compared to baseline implementations.
 - Time to first token (TTFT) can be below 750 ms and inter-token latency (ITL) under 70 ms, which is crucial for real-time interactive applications.
5. Optimized KV Cache Management
 - Manages key-value (KV) caches efficiently, which is crucial for large context windows whose memory demands grow linearly with context length.
6. SwiftKV Integration
 - Snowflake's open-source SwiftKV optimizes the key-value cache generation stage, the most compute-intensive part of LLM inference, boosting throughput by up to 50%.
 - This significantly lowers power and compute costs for inference workloads.
7. ZeRO and LoRA for Fine-Tuning
 - ZeRO-3-inspired sharding and CPU offloading, combined with Low-Rank Adaptation (LoRA), allow fine-tuning and inference with a reduced GPU memory footprint.


# Performance Benefits and Use Cases
- Supports real-time use cases that require low-latency responses and high throughput for cost-efficient deployment.
- Handles long sequences efficiently, which matters for tasks that need extensive context understanding.
- Runs flexibly on both legacy hardware (NVIDIA A100) and modern hardware (NVIDIA H100).
- Integrations are accessible via the Snowflake Cortex AI platform, enabling enterprises to deploy Llama models efficiently.

In summary, the Snowflake-Llama integration combines advanced quantization, parallelism, cache optimization (SwiftKV), and fine-tuning memory management to deliver efficient, scalable, low-latency inference for large Llama models, optimizing both performance and cost for enterprise AI workloads.
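
Before the platform-specific checklist that follows, here is a small sketch of what these optimizations look like from SQL: a SwiftKV-optimized variant called in a single set-based query, with AI_COUNT_TOKENS used as a pre-flight guard against oversized prompts. Table and column names are illustrative, and the exact model identifier accepted by AI_COUNT_TOKENS should be verified against your account.

sql
-- Batch inference with a SwiftKV-optimized Llama variant; rows with oversized
-- prompts are skipped rather than failing the whole query.
SELECT
    ticket_id,
    AI_COMPLETE('snowflake-llama-3.3-70b',
                'Summarize this ticket in two sentences: ' || description) AS summary
FROM support.tickets
WHERE AI_COUNT_TOKENS('snowflake-llama-3.3-70b', description) < 4000
  AND created_at >= DATEADD('day', -7, CURRENT_DATE());

Running the call over many rows in one statement is usually more throughput-efficient than issuing one request per row.
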
+ To get the best performance from Llama models on Snowflake, follow these platform-specific optimizations: -Use the right warehouse size +- Use the right warehouse size -For Cortex AI functions (e.g., AI_COMPLETE, AI_AGG), use a warehouse no larger than MEDIUM; larger warehouses do not improve LLM throughput and only increase cost.​ +- For Cortex AI functions (e.g., AI_COMPLETE, AI_AGG), use a warehouse no larger than MEDIUM; larger warehouses do not improve LLM throughput and only increase cost.​ -For custom Llama UDFs in Snowpark, size the warehouse based on batch size and concurrency needs, and use auto-suspend to avoid idle costs.​​ +- For custom Llama UDFs in Snowpark, size the warehouse based on batch size and concurrency needs, and use auto-suspend to avoid idle costs.​​ -Batch and pipeline calls +- Batch and pipeline calls -Process many rows in a single query (e.g., SELECT AI_COMPLETE(...) FROM large_table) to maximize throughput.​ +- Process many rows in a single query (e.g., SELECT AI_COMPLETE(...) FROM large_table) to maximize throughput.​ -For interactive apps, use the Cortex REST APIs (Complete, Embed, Agents) instead of SQL functions to reduce latency.​ +- For interactive apps, use the Cortex REST APIs (Complete, Embed, Agents) instead of SQL functions to reduce latency.​ -Leverage SwiftKV-optimized models +- Leverage SwiftKV-optimized models -Prefer snowflake-llama-3.3-70b and snowflake-llama-3.1-405b over vanilla Llama models; they are optimized for higher throughput and lower latency on Snowflake.​ +- Prefer snowflake-llama-3.3-70b and snowflake-llama-3.1-405b over vanilla Llama models; they are optimized for higher throughput and lower latency on Snowflake.​ SwiftKV reduces computational overhead during prompt processing, allowing prefill tokens to skip up to half the model’s layers with minimal accuracy loss.​ @@ -214,54 +258,135 @@ These optimizations ensure that Llama-based workloads are fast, scalable, and co 6. Cost Optimization, Governance, and Prompt Caching Running Llama models at scale requires careful cost and governance controls. 
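
The prompt caching pattern described later in this section can be sketched as a table plus a small scripting block. The table name, TTL, and prompt are examples only; treat this as a conceptual sketch rather than a built-in feature.

sql
-- Conceptual prompt cache: reuse a fresh cached answer, call the model only on a miss.
CREATE TABLE IF NOT EXISTS support.prompt_cache (
    prompt_hash  STRING,
    prompt       STRING,
    response     STRING,
    cached_at    TIMESTAMP_LTZ DEFAULT CURRENT_TIMESTAMP()
);

EXECUTE IMMEDIATE $$
DECLARE
    v_prompt   STRING DEFAULT 'Summarize the main open billing issues for this week.';
    v_response STRING;
BEGIN
    -- Reuse a cached answer that is less than 24 hours old.
    v_response := (
        SELECT MAX(response)
        FROM support.prompt_cache
        WHERE prompt_hash = MD5(:v_prompt)
          AND cached_at >= DATEADD('hour', -24, CURRENT_TIMESTAMP())
    );

    IF (v_response IS NULL) THEN
        -- Cache miss: call the model once and remember the result.
        v_response := (SELECT AI_COMPLETE('llama3.1-70b', :v_prompt));
        INSERT INTO support.prompt_cache (prompt_hash, prompt, response)
            VALUES (MD5(:v_prompt), :v_prompt, :v_response);
    END IF;

    RETURN v_response;
END;
$$;

This pays off most for high-frequency, low-variability prompts such as scheduled summaries.
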
-Cost Optimization -Choose the right model - -Use smaller models (e.g., llama3.1-8b, mistral-7b) for simple tasks (summarization, classification) and reserve larger models (e.g., llama3.1-70b, llama3.1-405b) for complex reasoning.​ -Compare cost/quality trade-offs using benchmarks (e.g., MMLU, HumanEval, GSM8K) and real-world use cases.​ +# Practical Performance Best Practices for Llama Models on Snowflake -Track and monitor usage +- Choose the right model +- Use smaller models (e.g., llama3.1-8b, mistral-7b) for simple tasks (summarization, classification) and reserve larger models (e.g., llama3.1-70b, llama3.1-405b) for complex reasoning.​ +- Compare cost/quality trade-offs using benchmarks (e.g., MMLU, HumanEval, GSM8K) and real-world use cases.​ +- Track and monitor usage Use SNOWFLAKE.ACCOUNT_USAGE.CORTEX_FUNCTIONS_USAGE_HISTORY and CORTEX_FUNCTIONS_QUERY_USAGE_HISTORY to see per-function, per-model credit and token consumption.​ -Set up alerts and budgets for AI services to avoid unexpected spend.​ - -Apply FinOps practices +- Set up alerts and budgets for AI services to avoid unexpected spend.​ -Tag queries and workloads by team, project, or use case to allocate costs accurately.​ +- Apply FinOps practices -Regularly review and retire unused models, stages, and Cortex Search services to reduce idle costs.​ +- Tag queries and workloads by team, project, or use case to allocate costs accurately.​ -Governance and Access Control -Control model access +- Regularly review and retire unused models, stages, and Cortex Search services to reduce idle costs.​ +# Governance and Access Control +- Control model access Use the account-level CORTEX_MODELS_ALLOWLIST to restrict which models can be used (e.g., allow only llama3.1-70b and snowflake-llama-3.3-70b).​ -Use RBAC on model objects in SNOWFLAKE.MODELS to grant fine-grained access to specific roles (e.g., GRANT APPLICATION ROLE CORTEX-MODEL-ROLE-LLAMA3.1-70B TO ROLE support_analyst).​ +- Use RBAC on model objects in SNOWFLAKE.MODELS to grant fine-grained access to specific roles (e.g., GRANT APPLICATION ROLE CORTEX-MODEL-ROLE-LLAMA3.1-70B TO ROLE support_analyst).​ -Secure data and prompts +# Secure data and prompts Ensure that all media files (documents, audio) are stored in encrypted stages with directory tables enabled.​ Use masking policies and row access policies to protect PII in prompts and responses.​ -Prompt Caching (Conceptual) +# Prompt Caching (Conceptual) While Snowflake does not yet have a built-in LLM prompt cache, you can simulate caching patterns: -Cache frequent prompts in a table +- Cache frequent prompts in a table + +- Store common prompts and their responses in a table (e.g., support.prompt_cache) with a TTL. + +- Before calling AI_COMPLETE, check if the prompt exists and is fresh; if so, return the cached response.​ + +- Use deterministic prompts + +- For idempotent tasks (e.g., summarizing a fixed set of tickets), use a deterministic prompt template and key (e.g., ticket_ids + prompt_template_hash) to enable easy caching.​ + +- This approach reduces redundant LLM calls and lowers cost, especially for high-frequency, low-variability prompts.​ + +#Performance Benefits of Llama Models + +The Llama models provide industry-leading performance on Snowflake’s optimized stack: +- Llama 3.1 405B achieves 3x lower latency and 1.4x higher throughput than baseline, delivering rapid token generation with a time to first token under 750 ms. 
+- Llama 4 Maverick, with 17 billion active parameters, combines state-of-the-art textual and image understanding, support for 12 languages, and high-speed creative content generation. +- Llama 4 Scout features an unprecedented context window of 10 million tokens, enabling complex tasks like multi-document summarization and extensive personalized reasoning. +- These performance gains empower real-time applications ranging from chatbots to AI-powered data agents, all within Snowflake’s environment. + +# Cost Considerations and Tools + +Using open-source Llama models on Snowflake enables business-savvy users and advanced data science teams to maintain data gravity, simplify MLOps, and operate with robust, engineering-grade cost controls. The platform combines optimized Llama variants with native governance features, allowing you to reason about spend in the same way you already manage warehouses and other Snowflake workloads. + +Cost benefits of Llama on Snowflake +Running Llama directly on Snowflake means inference is billed as Snowflake credits, eliminating separate LLM infrastructure, data egress, and cross‑cloud integration overhead. Open source Llama families typically have lower effective price per token than many proprietary models, and Snowflake provides tuned configurations (for example, Llama 3.x and 4 variants) that are optimized for throughput and cost efficiency.​ + +From a FinOps perspective, Llama on Snowflake lets you: + +1. Attribute AI spend to the same cost centers as your analytical workloads.​ + +2. Trade off model size vs. throughput and latency with clear, credit‑based pricing.​ + +3. Consolidate security and governance (no separate vector DBs, ETL pipelines, or hosting). -Store common prompts and their responses in a table (e.g., support.prompt_cache) with a TTL. -Before calling AI_COMPLETE, check if the prompt exists and is fresh; if so, return the cached response.​ +Provisioned throughput for predictable spend -Use deterministic prompts +Provisioned throughput in Cortex lets you reserve Llama inference capacity in discrete units over a fixed term, instead of paying only on a best‑effort, on‑demand basis. You size provisioned throughput units to your expected tokens per second or requests per minute and receive predictable performance and a stable monthly cost envelope. -For idempotent tasks (e.g., summarizing a fixed set of tickets), use a deterministic prompt template and key (e.g., ticket_ids + prompt_template_hash) to enable easy caching.​ +Features Available: +Use historical token logs and load testing to estimate peak TPS/RPS and select an appropriate provisioned tier.​ -This approach reduces redundant LLM calls and lowers cost, especially for high-frequency, low-variability prompts.​ +Put experimentation and low-volume workloads on demand while routing critical production traffic (for example, customer-facing Llama endpoints) to provisioned capacity.​ -7. Explainability +Combine provisioned throughput with autoscaling warehouses handling retrieval and pre/post‑processing, so both LLM and SQL compute scale in a controlled way. + +Resource monitors track credit consumption at account or warehouse level and can trigger notifications or automatically suspend warehouses when defined thresholds are reached. 
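
As a concrete illustration of that guardrail, a resource monitor can be attached to the warehouse that serves Llama retrieval and pre/post-processing. The names and quota below are examples, and creating monitors typically requires the ACCOUNTADMIN role.

sql
-- Guard the warehouse that powers Llama retrieval and pre/post-processing.
CREATE OR REPLACE RESOURCE MONITOR llama_rag_monitor
  WITH CREDIT_QUOTA = 200
       FREQUENCY = MONTHLY
       START_TIMESTAMP = IMMEDIATELY
       TRIGGERS ON 80 PERCENT DO NOTIFY
                ON 100 PERCENT DO SUSPEND;

ALTER WAREHOUSE rag_wh SET RESOURCE_MONITOR = llama_rag_monitor;
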
For Llama, this means the warehouses that power ETL, feature engineering, vectorization, and RAG retrieval are all protected against runaway jobs.​ + +Budgets let you define monthly credit limits for groups of resources and are refreshed multiple times a day, supporting proactive rather than reactive cost management. Combined with tag‑based budgets, you can create dedicated “AI” or “Llama” budgets that automatically aggregate all tagged resources into a single cost view, including serverless features.​ + +A practical pattern is: + +Assign each Llama‑backed product or team its own budget.​ + +Use resource monitors to guard individual warehouses, and budgets to cap the aggregate envelope.​ + +Alert at multiple thresholds (for example, 50%, 80%, 100%) so owners can throttle traffic, switch to a smaller Llama variant, or adjust prompts before overspend. + + +Cost allocation tags and Cortex LLM functions +Object tags and query tags are the backbone of cost attribution in Snowflake. By defining a small, consistent tag taxonomy (such as environment, product, cost_center, and owner), you can classify warehouses, compute pools, databases, and even users or roles involved in Llama workloads.​ + +As Cortex LLM functions add support for cost allocation tags, each Llama call can carry the same tagging semantics. In practice, that enables:​ + +Per‑feature or per‑tenant cost breakdowns using query tags or application‑level tags on LLM function calls.​ + +Seamless aggregation of Llama spend into tag‑based budgets and dashboards with no manual resource enumeration.​ + +Cleaner separation of dev, staging, and prod costs when those are expressed as tags and roles.​ + +For example, an app that routes user queries to a Cortex LLM function using Llama can include a query tag like “product=assistant, tenant_id=1234” so that every token consumed is attributable to that specific customer. + +Prompt caching and key‑value caching exploit the fact that large parts of Llama prompts are stable across calls (system messages, shared documents, instructions). When this context is cached, subsequent calls pay only for the new tokens, which can dramatically reduce costs for chat, support copilots, and RAG workloads where context changes slowly.​ + +Advanced design tips: + +Factor prompts into stable and volatile segments, keeping instructions and long‑lived documents unchanged so they benefit from caching.​ + +Use smaller Llama variants or specialized distilled models for high-volume, low-complexity tasks, and reserve larger variants for complex reasoning or low-volume, high-value paths.​ + +Monitor token distribution (prompt vs. completion) to identify where restructuring prompts or truncating context yields the biggest marginal savings. + +# Practical Cost Workflow for Data Scientists + +Prototype with on‑demand Cortex LLM functions using Llama, logging query tags for experiment tracking and cost visibility.​ + +Analyze token usage and latency to determine the optimal model size, caching strategy, and whether provisioned throughput is warranted.​ + +Once patterns stabilize, allocate provisioned throughput for the core Llama endpoints, size warehouses for retrieval and pre- and post-processing, and attach resource monitors.​ + +Apply object and query tags representing the environment, product, and cost center, and wire these into tag-based budgets dedicated to Llama workloads.​ + +Continuously optimize prompts and routing logic (for example, small Llama vs. 
large) using observability data to drive down cost per successful outcome, not just cost per token.​ + +# Explainability Explainability is critical for trust, debugging, and compliance in Llama-based applications. @@ -272,27 +397,26 @@ Snowflake Model Registry supports Shapley-based explainability for traditional M Use model.explain(input_data) to get feature attributions and understand why a model made a particular prediction.​ -For Llama-based AISQL and agents - +Llama-based AISQL and agents While Llama models themselves are not directly explainable in Snowflake, you can: -Log prompts, model names, and responses in an audit table for traceability.​ +- Log prompts, model names, and responses in an audit table for traceability.​ -Use structured outputs (e.g., JSON with confidence scores) and post-process them to highlight key factors.​ +- Use structured outputs (e.g., JSON with confidence scores) and post-process them to highlight key factors.​ -For Cortex Analyst, the generated SQL and semantic model provide a clear “reasoning trace” that can be reviewed and audited.​ +- For Cortex Analyst, the generated SQL and semantic model provide a clear “reasoning trace” that can be reviewed and audited.​ -Best practices for explainability +# Best Practices for Explainability -Always log the full context (prompt, model, parameters, timestamp) for every LLM call.​ +1. Always log the full context (prompt, model, parameters, timestamp) for every LLM call.​ -For customer-facing apps, provide a “show reasoning” option that displays the retrieved documents (from Cortex Search) and the final prompt sent to the LLM.​ +2. For customer-facing apps, provide a “show reasoning” option that displays the retrieved documents (from Cortex Search) and the final prompt sent to the LLM.​ -Use human-in-the-loop review for high-stakes decisions (e.g., support escalations, financial advice) to validate and explain LLM outputs.​ +3. Use human-in-the-loop review for high-stakes decisions (e.g., support escalations, financial advice) to validate and explain LLM outputs.​ -This layered approach ensures that Llama applications are transparent, auditable, and aligned with enterprise AI governance standards.​ +4. This layered approach ensures that Llama applications are transparent, auditable, and aligned with enterprise AI governance standards.​ -8. Interoperability with MCP and Partner Tooling +# Interoperability with MCP and Partner Tooling Model Context Protocol (MCP) enables Llama-based agents to securely interact with Snowflake and partner tools. @@ -333,7 +457,7 @@ Agentic frameworks (LangChain, LlamaIndex) that use MCP to call Snowflake tools. This interoperability lets you build modular, secure, agentic applications where Llama models orchestrate actions across Snowflake and external systems.​ -9. Working with Snowflake Intelligence and Cortex Knowledge Extensions +# Working with Snowflake Intelligence and Cortex Knowledge Extensions Snowflake Intelligence and Cortex Knowledge Extensions (CKEs) extend Llama-based apps with governed, up-to-date knowledge. 
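
The provider-side setup that this section goes on to describe can be sketched as a Cortex Search service over a knowledge-base table. Service, warehouse, table, and column names here are placeholders, and the SOURCE_URL column is included so downstream Llama answers can cite their sources.

sql
-- A provider builds a search service over KB articles and can then share it
-- as a Cortex Knowledge Extension.
CREATE OR REPLACE CORTEX SEARCH SERVICE kb.search.kb_articles_search
  ON content
  ATTRIBUTES source_url, title
  WAREHOUSE = cke_wh
  TARGET_LAG = '1 day'
AS (
    SELECT content, title, source_url
    FROM kb.articles
    WHERE published = TRUE
);

Consumers then query the service through Snowflake Intelligence, a Cortex Agent, or a custom app, as outlined below.
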
Snowflake Intelligence @@ -356,7 +480,7 @@ A provider creates a Cortex Search Service on a table of documents and shares it A consumer uses Snowflake Intelligence, Cortex Agent, or a custom app to query the CKE; the LLM reasons over the retrieved content to generate answers with citations.​​ -Best practices for CKEs and Intelligence +# Best practices for CKEs and Intelligence For customer support: From d315c78c5efe867cc5e77bbe38530bdd82194bce Mon Sep 17 00:00:00 2001 From: Anika Shahi <68033298+sfc-gh-ashahi@users.noreply.github.com> Date: Tue, 2 Dec 2025 18:44:11 -0800 Subject: [PATCH 05/11] Create cortex_budgets --- .../cortex_budgets | 73 +++++++++++++++++++ 1 file changed, 73 insertions(+) create mode 100644 site/sfguides/src/optimizing-llama-snowflake-cookbook/cortex_budgets diff --git a/site/sfguides/src/optimizing-llama-snowflake-cookbook/cortex_budgets b/site/sfguides/src/optimizing-llama-snowflake-cookbook/cortex_budgets new file mode 100644 index 0000000000..7e120fa45c --- /dev/null +++ b/site/sfguides/src/optimizing-llama-snowflake-cookbook/cortex_budgets @@ -0,0 +1,73 @@ +-- This script creates a stored procedure `mini_budget_on_cortex` to monitor +-- Snowflake Cortex functions credit usage. This is a temporary solution until the native +-- Snowflake Budget feature supports monitoring for Cortex services. +-- +-- +-- The procedure checks usage for the current account and sends a notification +-- if the usage exceeds a specified limit within a given billing period (day, week, or month). +-- It also creates a task `run_mini_budget_on_cortex` to run the check every 30 minutes. + + +CREATE OR REPLACE PROCEDURE mini_budget_on_cortex( + billing_period VARCHAR, + alert_limit FLOAT, + ni_name VARCHAR +) +RETURNS VARCHAR +LANGUAGE SQL +AS +$$ +DECLARE + start_of_period DATE; + INVALID_PERIOD EXCEPTION (-20001, 'Invalid billing_period specified. Must be DAY, WEEK, or MONTH.'); + current_usage_credits FLOAT; + notification_body VARCHAR; +BEGIN + CASE UPPER(billing_period) + WHEN 'DAY' THEN + start_of_period := CURRENT_DATE(); + WHEN 'WEEK' THEN + start_of_period := DATE_TRUNC('week', CURRENT_DATE()); + WHEN 'MONTH' THEN + start_of_period := DATE_TRUNC('month', CURRENT_DATE()); + ELSE + RAISE INVALID_PERIOD; + END CASE; + + SELECT COALESCE(SUM(TOKEN_CREDITS), 0) INTO current_usage_credits + FROM SNOWFLAKE.ACCOUNT_USAGE.CORTEX_FUNCTIONS_USAGE_HISTORY + WHERE START_TIME >= :start_of_period; + + notification_body := 'Account: ' || CURRENT_ACCOUNT() || ' (' || CURRENT_REGION() || ')' + || ' - Cortex functions credit usage in this ' || billing_period + || ' = ' || current_usage_credits; + IF (current_usage_credits >= alert_limit) THEN + IF (ni_name IS NOT NULL AND TRIM(ni_name) <> '') THEN + CALL SYSTEM$SEND_SNOWFLAKE_NOTIFICATION( + '{"text/plain":"' || :notification_body || '"}', + '{"' || :ni_name || '": {}}' + ); + END IF; + END IF; + + RETURN notification_body; +END; +$$; + + +-- Make sure the notification is sent to the correct Slack channel with a low limit. +call mini_budget_on_cortex('MONTH', 10, 'MY_SLACK_NI'); + + + +-- Create a task to run the procedure every 30 minutes. +create or replace task run_mini_budget_on_cortex +schedule = '30 minute' +as +call mini_budget_on_cortex('MONTH', 1000, 'MY_SLACK_NI'); + + + +-- Make sure the task is running. 
+alter task run_mini_budget_on_cortex resume; + From d8bb7464e3bdba9bcb54bfcc2af340bcad4caa5a Mon Sep 17 00:00:00 2001 From: Anika Shahi <68033298+sfc-gh-ashahi@users.noreply.github.com> Date: Tue, 2 Dec 2025 18:45:03 -0800 Subject: [PATCH 06/11] Create few_shot_prompt_recipe --- .../few_shot_prompt_recipe | 17 +++++++++++++++++ 1 file changed, 17 insertions(+) create mode 100644 site/sfguides/src/optimizing-llama-snowflake-cookbook/few_shot_prompt_recipe diff --git a/site/sfguides/src/optimizing-llama-snowflake-cookbook/few_shot_prompt_recipe b/site/sfguides/src/optimizing-llama-snowflake-cookbook/few_shot_prompt_recipe new file mode 100644 index 0000000000..114d0342d2 --- /dev/null +++ b/site/sfguides/src/optimizing-llama-snowflake-cookbook/few_shot_prompt_recipe @@ -0,0 +1,17 @@ +#few shot prompting +# "Customer cannot reset password using the forgot password link." +#"Category: Account Access Issue\nUrgency: High\nResponse: Please verify your email and try resetting your password again." +#"Internet connection keeps dropping every few minutes, affecting video calls." +#"Category: Network Issue\nUrgency: Medium\nResponse: Please restart your router and check for service outages in your area." +#"Application crashes with error code 500 when uploading large files." +#"Category: Application Bug\nUrgency: High\nResponse: Could you please provide the file size and error logs? We will check." + + +-- Create the fine-tuning job pointing at your training data table +SELECT snowflake.cortex.finetune( + 'CREATE', + 'MY_DB.MY_SCHEMA.CUSTOMER_SUPPORT_FINETUNED', + 'llama3.1-8b', + 'SELECT prompt, completion FROM MY_DB.MY_SCHEMA.CUSTOMER_SUPPORT_TRAINING', + NULL -- optional validation data query here +); From f54d5bb933367c2703753ee6ed02ca62afbc56cd Mon Sep 17 00:00:00 2001 From: Anika Shahi <68033298+sfc-gh-ashahi@users.noreply.github.com> Date: Tue, 2 Dec 2025 18:45:38 -0800 Subject: [PATCH 07/11] Create llama fine-tuning sample --- .../llama fine-tuning sample | 30 +++++++++++++++++++ 1 file changed, 30 insertions(+) create mode 100644 site/sfguides/src/optimizing-llama-snowflake-cookbook/llama fine-tuning sample diff --git a/site/sfguides/src/optimizing-llama-snowflake-cookbook/llama fine-tuning sample b/site/sfguides/src/optimizing-llama-snowflake-cookbook/llama fine-tuning sample new file mode 100644 index 0000000000..90244a02c3 --- /dev/null +++ b/site/sfguides/src/optimizing-llama-snowflake-cookbook/llama fine-tuning sample @@ -0,0 +1,30 @@ +import snowflake.connector + +conn = snowflake.connector.connect( + user='YOUR_USER', + password='YOUR_PASSWORD', + account='YOUR_ACCOUNT', + warehouse='YOUR_WAREHOUSE', + database='YOUR_DB', + schema='YOUR_SCHEMA' +) + +cs = conn.cursor() + +cs.execute(""" + SELECT cortex_finetune.create_finetune_job( + model_name => 'llama3.1-8b', + training_table => 'your_schema.your_training_data', + prompt_col => 'prompt', + completion_col => 'completion', + output_model_name => 'custom_llama_model_v1', + max_epochs => 3, + batch_size => 8, + learning_rate => 0.0001, + peft_method => 'lora' + ); +""") + +print(cs.fetchone()) +cs.close() +conn.close() From 841819c6ae50b221f26273c3aa55f4a347d074b7 Mon Sep 17 00:00:00 2001 From: Anika Shahi <68033298+sfc-gh-ashahi@users.noreply.github.com> Date: Tue, 2 Dec 2025 18:46:10 -0800 Subject: [PATCH 08/11] Create chain_of_thought_prompt_recipe --- .../chain_of_thought_prompt_recipe | 12 ++++++++++++ 1 file changed, 12 insertions(+) create mode 100644 
site/sfguides/src/optimizing-llama-snowflake-cookbook/chain_of_thought_prompt_recipe diff --git a/site/sfguides/src/optimizing-llama-snowflake-cookbook/chain_of_thought_prompt_recipe b/site/sfguides/src/optimizing-llama-snowflake-cookbook/chain_of_thought_prompt_recipe new file mode 100644 index 0000000000..f1afaa6be1 --- /dev/null +++ b/site/sfguides/src/optimizing-llama-snowflake-cookbook/chain_of_thought_prompt_recipe @@ -0,0 +1,12 @@ + SELECT snowflake.cortex.complete( + 'llama-3-7b', + PARSE_JSON('[ + {"role": "system", "content": "You are a support triage assistant. For each incoming support ticket, classify the issue category, urgency (Low, Medium, High), and recommend the best team to handle it."}, + {"role": "user", "content": "Customer reports they cannot access their account after multiple password resets."}, + {"role": "assistant", "content": "Category: Account Access\nUrgency: High\nAssign To: Tier 1 Support"}, + {"role": "user", "content": "User reports slow internet speeds when using our VPN service."}, + {"role": "assistant", "content": "Category: Network Performance\nUrgency: Medium\nAssign To: Network Support Team"}, + {"role": "user", "content": "New ticket: Application crashes when uploading files larger than 10MB."}, + {"role": "assistant", "content": ""} + ]') +) AS triage_response; From 51302ed82bb0bc14a80abfa7be74182594c84bfd Mon Sep 17 00:00:00 2001 From: Anika Shahi <68033298+sfc-gh-ashahi@users.noreply.github.com> Date: Tue, 2 Dec 2025 18:50:43 -0800 Subject: [PATCH 09/11] Update optimizing-llama-snowflake-cookbook.md --- .../optimizing-llama-snowflake-cookbook.md | 47 ++++++++++++------- 1 file changed, 31 insertions(+), 16 deletions(-) diff --git a/site/sfguides/src/optimizing-llama-snowflake-cookbook/optimizing-llama-snowflake-cookbook.md b/site/sfguides/src/optimizing-llama-snowflake-cookbook/optimizing-llama-snowflake-cookbook.md index 2763a77baa..7fc9f49440 100644 --- a/site/sfguides/src/optimizing-llama-snowflake-cookbook/optimizing-llama-snowflake-cookbook.md +++ b/site/sfguides/src/optimizing-llama-snowflake-cookbook/optimizing-llama-snowflake-cookbook.md @@ -116,42 +116,57 @@ This pattern works for many AISQL use cases: summarizing feedback, classifying t -Fine-tuning Llama models (e.g., Llama 3.1 8B/70B) on customer support data can dramatically improve response quality, tone, and accuracy. +# Fine-tuning Llama models (e.g., Llama 3.1 8B/70B) on customer support data can dramatically improve response quality, tone, and accuracy. Best Practices for Fine-Tuning Llama on Snowflake -Start with a curated dataset +1. Start with a curated dataset -Collect historical support tickets, agent responses, and approved answers. +2. Collect historical support tickets, agent responses, and approved answers. -Structure as a prompt-completion pair: +3. Structure as a prompt-completion pair: json {"prompt": "Customer: How do I reset my password?", "completion": "To reset your password, go to Settings > Account > Reset Password. If you don’t see that option, contact support@company.com."} Use Snowflake tables to store and version this dataset; apply masking and access controls as needed.​​ -Choose the right Llama variant +4. Choose the right Llama variant -For most support use cases, llama3.1-8b or llama3.1-70b is ideal: good balance of reasoning, latency, and cost.​ +5. For most support use cases, llama3.1-8b or llama3.1-70b is ideal: good balance of reasoning, latency, and cost.​ -Use snowflake-llama-3.3-70b if you want the cost/performance benefits of SwiftKV.​ +6. 
Use snowflake-llama-3.3-70b if you want the cost/performance benefits of SwiftKV.​ -Use parameter-efficient fine-tuning (PEFT) +7. Use parameter-efficient fine-tuning (PEFT) -Apply LoRA (Low-Rank Adaptation) to fine-tune only a small fraction of parameters, reducing GPU cost and time.​​ +8. Apply LoRA (Low-Rank Adaptation) to fine-tune only a small fraction of parameters, reducing GPU cost and time.​​ In Snowflake, this can be done via Snowpark ML or by integrating with external fine-tuning platforms (e.g., Hugging Face, Predibase) that pull data from Snowflake.​​ -In Snowflake, this can be done via Snowpark ML or by integrating with external fine-tuning platforms (e.g., Hugging Face, Predibase) that pull data from Snowflake.​​ +9. Add guardrails and safety -Add guardrails and safety +10. Use Cortex Guard (Llama Guard 3) to filter unsafe or harmful responses in production.​ -Use Cortex Guard (Llama Guard 3) to filter unsafe or harmful responses in production.​ +11. Add a post-processing step (e.g., a stored procedure) to validate responses against a list of approved answers or policies.​​ -Add a post-processing step (e.g., a stored procedure) to validate responses against a list of approved answers or policies.​​ +12. Evaluate and iterate -Evaluate and iterate +13. Define metrics: accuracy, response length, hallucination rate, and customer satisfaction (CSAT).​​ -Define metrics: accuracy, response length, hallucination rate, and customer satisfaction (CSAT).​​ +14. Use A/B testing in Snowflake to compare the fine-tuned Llama model against the base model or a human baseline.​ + + + +Few-shot prompting involves providing the model with a small number of example input-output pairs (prompts and completions) to teach it the desired behavior during fine-tuning. In Snowflake Cortex fine-tuning, you prepare a training dataset of examples reflecting the relevant task. During fine-tuning, the model learns to map new ticket texts to categories, understanding these categories from the few-shot samples. To prepare the fine-tuning data on Snowflake, you structure this as prompt-completion pairs in a table and use the Snowflake Cortex fine-tuning function.​​ + +Chain-of-thought prompting involves including intermediate reasoning steps in the training examples, leading the model to generate explanations or stepwise reasoning before the final answer. This helps improve performance on complex tasks requiring logical inference. +Example: SQL Query Generation with Intermediate Reasoning +By fine-tuning the model on these examples with reasoning steps, it learns to generate not just answers but the reasoning process, which can improve accuracy and explainability. + +Reference both prompt recipes in the repo. + +Example: Customer Support Ticket Categorization +Collect a table of support tickets with columns: prompt (ticket text), and completion (category label or response). +Use 100-500 example pairs representing correct categorizations. + +Reference the Repo for sample code recipe. 
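
As referenced above, the categorization recipe starts from a governed training table of prompt/completion pairs. One hedged sketch of building it from historical tickets (all table and column names are placeholders) is shown here; it produces the MY_DB.MY_SCHEMA.CUSTOMER_SUPPORT_TRAINING table consumed by the FINETUNE call in the repo's few-shot recipe.

sql
-- Build prompt/completion pairs for Cortex fine-tuning from resolved tickets.
CREATE OR REPLACE TABLE my_db.my_schema.customer_support_training AS
SELECT
    'Customer: ' || description                                   AS prompt,
    'Category: ' || category || '\nUrgency: ' || urgency ||
    '\nResponse: ' || agent_response                              AS completion
FROM my_db.my_schema.resolved_tickets
WHERE agent_response IS NOT NULL
QUALIFY ROW_NUMBER() OVER (PARTITION BY category ORDER BY resolved_at DESC) <= 100;

Capping examples per category keeps you within the 100-500 pair guideline above while avoiding class imbalance.
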
-Use A/B testing in Snowflake to compare the fine-tuned Llama model against the base model or a human baseline.​ ## Example Customer Support Recipe python From 9f8d37e37bf1d52cc45b50a87f42dceb0679a862 Mon Sep 17 00:00:00 2001 From: Anika Shahi <68033298+sfc-gh-ashahi@users.noreply.github.com> Date: Tue, 2 Dec 2025 18:52:31 -0800 Subject: [PATCH 10/11] Update optimizing-llama-snowflake-cookbook.md --- .../optimizing-llama-snowflake-cookbook.md | 7 +++++++ 1 file changed, 7 insertions(+) diff --git a/site/sfguides/src/optimizing-llama-snowflake-cookbook/optimizing-llama-snowflake-cookbook.md b/site/sfguides/src/optimizing-llama-snowflake-cookbook/optimizing-llama-snowflake-cookbook.md index 7fc9f49440..f6120ff3da 100644 --- a/site/sfguides/src/optimizing-llama-snowflake-cookbook/optimizing-llama-snowflake-cookbook.md +++ b/site/sfguides/src/optimizing-llama-snowflake-cookbook/optimizing-llama-snowflake-cookbook.md @@ -226,6 +226,13 @@ These are the key inference optimization techniques: - This technology helps lower power and compute costs for inference workloads significantly. 7. ZeRO and LoRA for Fine-Tuning - ZeRO-3 inspired sharding and CPU offloading combined with Low-Rank Adaptation (LoRA) allow fine-tuning and inference with reduced GPU memory load. + + +With Llama + SwiftKV, we achieved: +- Up to 75% reduction in LLM inference costs on Cortex AI by optimizing KV cache generation. +- Improved throughput and reduced latency by reusing hidden states and minimizing redundant computations. +- Maintained accuracy with minimal loss through model rewiring and fine-tuning. +- Enabled significant cost savings and performance gains in production environments, including up to 2x higher throughput on high-end GPUs without sacrificing output quality. # Performance Benefits and Use Cases From 270254126d59c0b951662cfa95a1ca3e04dac381 Mon Sep 17 00:00:00 2001 From: Anika Shahi <68033298+sfc-gh-ashahi@users.noreply.github.com> Date: Tue, 2 Dec 2025 18:55:57 -0800 Subject: [PATCH 11/11] Update optimizing-llama-snowflake-cookbook.md --- .../optimizing-llama-snowflake-cookbook.md | 63 +++++++++++++++++++ 1 file changed, 63 insertions(+) diff --git a/site/sfguides/src/optimizing-llama-snowflake-cookbook/optimizing-llama-snowflake-cookbook.md b/site/sfguides/src/optimizing-llama-snowflake-cookbook/optimizing-llama-snowflake-cookbook.md index f6120ff3da..24f4816f45 100644 --- a/site/sfguides/src/optimizing-llama-snowflake-cookbook/optimizing-llama-snowflake-cookbook.md +++ b/site/sfguides/src/optimizing-llama-snowflake-cookbook/optimizing-llama-snowflake-cookbook.md @@ -408,6 +408,69 @@ Apply object and query tags representing the environment, product, and cost cent Continuously optimize prompts and routing logic (for example, small Llama vs. large) using observability data to drive down cost per successful outcome, not just cost per token.​ +## Model Selection Criteria for LLama Models on Snowflake Platform + +Model selection for Llama on Snowflake should be driven by a clear mapping between task complexity, latency/throughput constraints, cost envelope, and the platform primitives Snowflake gives you for governance and optimization. 
# Performance Benefits and Use Cases

From 270254126d59c0b951662cfa95a1ca3e04dac381 Mon Sep 17 00:00:00 2001
From: Anika Shahi <68033298+sfc-gh-ashahi@users.noreply.github.com>
Date: Tue, 2 Dec 2025 18:55:57 -0800
Subject: [PATCH 11/11] Update optimizing-llama-snowflake-cookbook.md

---
 .../optimizing-llama-snowflake-cookbook.md | 63 +++++++++++++++++++
 1 file changed, 63 insertions(+)

diff --git a/site/sfguides/src/optimizing-llama-snowflake-cookbook/optimizing-llama-snowflake-cookbook.md b/site/sfguides/src/optimizing-llama-snowflake-cookbook/optimizing-llama-snowflake-cookbook.md
index f6120ff3da..24f4816f45 100644
--- a/site/sfguides/src/optimizing-llama-snowflake-cookbook/optimizing-llama-snowflake-cookbook.md
+++ b/site/sfguides/src/optimizing-llama-snowflake-cookbook/optimizing-llama-snowflake-cookbook.md
@@ -408,6 +408,69 @@ Apply object and query tags representing the environment, product, and cost cent
Continuously optimize prompts and routing logic (for example, small Llama vs. large) using observability data to drive down cost per successful outcome, not just cost per token.
+
+
## Model Selection Criteria for Llama Models on Snowflake Platform

Model selection for Llama on Snowflake should be driven by a clear mapping between task complexity, latency/throughput constraints, cost envelope, and the platform primitives Snowflake gives you for governance and optimization. This guide focuses on how an experienced data scientist can pick the right Llama variant and deployment pattern on Snowflake, with example use cases and the technical rationale behind each decision.

### Core selection dimensions

When choosing between Llama sizes (for example, 8B, 70B, 405B and distilled variants) on Snowflake, prioritize:

Task complexity and reasoning depth: complex multi-step reasoning, long-context synthesis, or nuanced code generation benefit more from larger Llama models (70B–405B), while simpler classification, extraction, and templated generation can be served well by smaller models.

Latency and throughput: high QPS or interactive UIs often require smaller or optimized Llama variants, potentially under provisioned throughput to guarantee low p95 latency, whereas batch analytics can tolerate slower large models.

Context window and modality: use long-context Llama (for example, large-context 3.1/4 series) and vision-capable variants where the use case involves multi-document reasoning, large logs, or images.

Cost and hardware footprint: larger Llama models drive higher credit consumption; Snowflake's optimized and distilled Llama variants are designed to deliver near-frontier quality with significantly lower inference cost.

A practical rule of thumb is to start with the smallest model that can meet your acceptance criteria on held-out tasks, and only step up to a larger Llama when you see clear accuracy or reliability gaps that matter to the business.

### Example use case: RAG knowledge assistant

For an internal RAG assistant over enterprise documentation and tickets, the model must handle long contexts, disambiguate subtle domain terms, and follow safety/guardrails. In Snowflake, you would typically:

Start with a mid-to-large Llama (for example, 70B or an optimized large-context variant) exposed via Cortex LLM functions for serverless inference, because this tier balances strong reasoning with manageable cost.

Use provisioned throughput for production traffic so you can guarantee latency and avoid noisy neighbor effects, sized from observed peak tokens per second and request concurrency.

Apply prompt caching for stable system prompts (instructions, policies) and shared reference documents, so repeated queries reuse cached context and reduce input-token cost.

Technical justification: long-context Llama variants on Snowflake exploit platform-level optimizations (like efficient KV caching and inference stack tuning) to support 100k+ token windows while keeping response times and cost acceptable compared with naïvely hosting a frontier model yourself. This makes them an excellent fit for RAG where context quality and adherence to guardrails matter more than raw tokens per second.
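To make the pattern concrete, here is a minimal, illustrative sketch of the answer-generation step. It assumes retrieved chunks already sit in a hypothetical docs.retrieved_chunks table (in practice retrieval would typically come from Cortex Search), and the stable instruction prefix at the start of the prompt is exactly the part that prompt caching would reuse across requests.

```sql
-- Assemble retrieved context, then answer with a mid-to-large Llama.
-- Table, column, and filter values are assumptions for illustration.
WITH ctx AS (
    SELECT LISTAGG(chunk_text, '\n---\n') WITHIN GROUP (ORDER BY chunk_rank) AS context_text
    FROM docs.retrieved_chunks
    WHERE question_id = 42
)
SELECT AI_COMPLETE(
    'llama3.1-70b',
    'You are an internal support assistant. Answer only from the provided context '
    || 'and cite the source document title for each claim.\n\nContext:\n' || context_text
    || '\n\nQuestion: How do I rotate my API keys?'
) AS answer
FROM ctx;
```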
### Example use case: structured extraction and classification

For invoice parsing, PII detection, or ticket triage, you often have clear schemas, limited input length, and strong tolerance for distilled or smaller models as long as accuracy thresholds are met. In Snowflake, the recommended pattern is:

Use small to mid-size Llama models (for example, 8B or pruned/distilled variants) and fine-tune them on a few thousand labeled examples stored in Snowflake tables, leveraging synthetic data and distillation where needed.

Deploy the tuned model through Cortex LLM functions or Snowpark Container Services, depending on whether you need custom weights or are using a managed fine-tuned endpoint.

Run extraction pipelines on warehouses governed by resource monitors, with budgets and cost allocation tags scoped to the owning team or product.

Technical justification: for schema-bound tasks, fine-tuned smaller Llama models can achieve performance comparable to larger general-purpose models at a fraction of the inference cost and latency. Snowflake's support for synthetic data and distillation allows you to leverage a very large Llama (for example, 405B) as a teacher once, then serve nearly all traffic from a smaller student model.

### Example use case: code generation and analytics copilot

For SQL or Python copilots embedded in BI or notebooks, human latency expectations are tight, and quality must be high enough to minimize post-editing time. Recommended approach:

Use a mid-to-large Llama variant tuned or specialized for code and tool use, often exposed via Cortex AI functions so it can call Snowflake-native tools (for example, running queries, inspecting schemas).

Combine provisioned throughput with prompt caching for session-level context (current query history, schema descriptions) to balance responsiveness and cost.

Route non-critical or bulk code generation (for example, unit test expansion, boilerplate comments) to a smaller Llama model if your experimentation shows minimal quality differences; a routing sketch follows below.

Technical justification: tool-using and code-generation tasks benefit from advanced reasoning and planning, where larger Llama models show measurable uplifts in correctness and fewer hallucinations. However, careful separation of “critical path” vs. “nice-to-have” generations lets you use a mixture-of-models strategy to avoid paying large-model prices for every token.
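One way to express that mixture-of-models routing in AISQL, purely as a sketch: the copilot.generation_requests table, the task_type values, and the model tiers are assumptions to be replaced by whatever your own evaluations show.

```sql
-- Route each generation request to a model tier based on how critical it is.
-- Table, task types, and model names are illustrative.
SELECT
    request_id,
    task_type,
    CASE
        WHEN task_type IN ('sql_answer', 'schema_reasoning')
            THEN AI_COMPLETE('llama3.3-70b', prompt_text)   -- critical path: larger model
        ELSE AI_COMPLETE('llama3.1-8b', prompt_text)        -- boilerplate, comments, test stubs
    END AS generated_code
FROM copilot.generation_requests;
```

Keeping the routing rule in SQL makes it easy to revisit as observability data accumulates: if the smaller tier starts failing quality checks for a task type, moving that type to the larger model is a one-line change.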
### Cost, governance, and selection criteria on Snowflake

Model selection on Snowflake is tightly coupled with its cost-governance stack:

Provisioned throughput: choose larger Llama models for workloads where you can justify reserved capacity, then allocate provisioned units to hit your latency SLOs at known peak load; for spiky or exploratory usage, keep smaller models on on-demand capacity.

Resource monitors and budgets: for each Llama tier in production, attach resource monitors to the supporting warehouses and define AI-specific budgets; if a large Llama tier repeatedly drives budget alerts, that is a signal to downshift to a smaller or distilled variant.

Cost allocation tags: tag warehouses, compute pools, and LLM functions by model_name, environment, product, and cost_center so you can compute cost per use case and per model family; models that fail cost-per-outcome targets (for example, cost per resolved ticket) should be candidates for replacement.

Prompt caching: treat cache hit rate and average prompt tokens per request as first-class selection metrics; for workloads with high cache locality, you can tolerate a larger Llama because the marginal cost per interaction is lower.

A rigorous selection process should iterate between offline evaluation (accuracy, calibration, safety benchmarks) and online metrics (task success rate, latency, cost per outcome) across multiple Llama sizes. On Snowflake, the presence of built-in cost governance and optimized Llama variants makes it feasible to systematically test and then lock in the smallest model that meets your product and cost SLOs for each use case.

# Explainability