diff --git a/docs/Networking.md b/docs/Networking.md index af7429e..28de6d1 100644 --- a/docs/Networking.md +++ b/docs/Networking.md @@ -1,15 +1,15 @@ -## Networking +# Networking | ID | Specification | |-------|--------------| -| N-R1 | **DDoS**: The AI Landing Zone should provide guidance and implementation of Azure DDoS protection. In case of existing platform landing zone, the central DDoS service should be used instead.

Best Practice:

[Azure DDoS Protection](https://learn.microsoft.com/en-us/azure/ddos-protection/ddos-protection-overview) should be enabled to safeguard AI services from potential disruptions and downtime caused by distributed denial of service attacks. Enable Azure DDoS protection at the virtual network level to defend against traffic floods targeting internet-facing applications.| -| N-R2 | **Jump box & Bastion**: The AI Landing Zone should provide guidance and implementation of a jumpbox that can be accessed through bastion. In case of existing platform landing zone the central jump box and bastion service should be used instead.

Best Practice:

AI development access should use a jumpbox within the virtual network of the workload or through a connectivity hub virtual network. Use Azure Bastion to securely connect to virtual machines interacting with AI services. Azure Bastion provides secure RDP/SSH connectivity without exposing VMs to the public internet. Enable Azure Bastion to ensure encrypted session data and protect access through TLS-based RDP/SSH connections. | -| N-R3 | **Private Endpoints**: The AI Landing Zone should provide guidance and implementation of Private endpoint for the AI services it deploys.

Best Practice:

No PaaS services or AI model endpoints should be accessible from the public internet. Private endpoints to provide private connectivity to Azure services within a virtual network. Private endpoints provide secure, private access to PaaS portals like Azure AI Foundry and Azure Machine Learning studio. For Azure AI Foundry, Configure the [managed virtual network](https://learn.microsoft.com/en-us/azure/ai-studio/how-to/configure-managed-network) and use [private endpoints](https://learn.microsoft.com/en-us/azure/ai-studio/how-to/configure-private-link). For Azure Open AI, Restrict access to select [virtual networks](https://learn.microsoft.com/en-us/azure/ai-services/cognitive-services-virtual-networks#scenarios) or use [private endpoints](https://learn.microsoft.com/en-us/azure/ai-services/cognitive-services-virtual-networks#use-private-endpoints). For Azure Machine Learning, Create a [secure workspace](https://learn.microsoft.com/en-us/azure/machine-learning/tutorial-create-secure-workspace-vnet) with a virtual network. [Plan for network isolation](https://learn.microsoft.com/en-us/azure/machine-learning/how-to-network-isolation-planning). Follow the [security best practices](https://learn.microsoft.com/en-us/azure/machine-learning/concept-enterprise-security) for Azure Machine Learning. | -| N-R4 | **Network Security Groups**: The AI Landing Zone should provide guidance and implementation of NSGs on all virtual networks implemented as part of the architecture.

Best Practice:

Utilize [network security groups](https://learn.microsoft.com/en-us/azure/virtual-network/network-security-groups-overview) (NSGs) to define and apply access policies that govern inbound and outbound traffic to and from AI workloads. These controls can be used to implement the principle of least privilege, ensuring that only essential communication is permitted. | -| N-R5 | **App Gateway or Azure Front Door with WAF**: The AI Landing Zone should provide guidance and implementation of an application gateway or front door with WAF for the chat application based on regional or global deployment.

Best Practice:

[Azure WAF](https://learn.microsoft.com/en-us/azure/web-application-firewall/overview) helps protect your AI workloads from common web vulnerabilities, including SQL injections and cross-site scripting attacks. Configure Azure WAF on [Application Gateway](https://learn.microsoft.com/en-us/azure/web-application-firewall/ag/ag-overview) for workloads that require enhanced security against malicious web traffic. For Azure AI Services, Restrict access to select [virtual networks](https://learn.microsoft.com/en-us/azure/ai-services/cognitive-services-virtual-networks#scenarios) or use [private endpoints](https://learn.microsoft.com/en-us/azure/ai-services/cognitive-services-virtual-networks#use-private-endpoints)| -| N-R6 | **APIM as AI Gateway:** The AI landing zone must provide guidance & implementation of APIM with AI Foundry.

Best Practice:

The AI Landing Zone should use [Azure API Management](https://learn.microsoft.com/en-us/azure/api-management/genai-gateway-capabilities#backend-load-balancer-and-circuit-breaker) for load balancing API requests to AI endpoints. Consider using Azure API Management (APIM) as a generative AI gateway within your virtual networks. A generative AI gateway sits between your front-end and the AI endpoints. Application Gateway, WAF policies, and APIM within the virtual network is an established [architecture](https://github.com/Azure/apim-landing-zone-accelerator/blob/main/scenarios/workload-genai/README.md#scenario-3-azure-api-management---generative-ai-resources-as-backend) in generative AI solutions. For more information, see [AI Hub architecture](https://github.com/Azure-Samples/ai-hub-gateway-solution-accelerator#ai-hub-gateway-landing-zone-accelerator) and [Deploy Azure API Management instance to multiple Azure regions](https://learn.microsoft.com/en-us/azure/api-management/api-management-howto-deploy-multi-region). A [generative AI gateway](https://learn.microsoft.com/en-us/azure/api-management/genai-gateway-capabilities) allows you to track token usage, throttle token usage, apply circuit breakers, and route to different AI endpoints to control costs.

_Consider a generative AI gateway for monitoring._ A reverse proxy like Azure API Management allows you to implement logging and monitoring that aren't native to the platform. API Management allows you to collect source IPs, input text, and output text. For more information, see [Implement logging and monitoring for Azure OpenAI Service language models](https://learn.microsoft.com/en-us/azure/architecture/ai-ml/openai/architecture/log-monitor-azure-openai)._._ [Azure API Management](https://learn.microsoft.com/en-us/azure/architecture/ai-ml/guide/azure-openai-gateway-guide) (APIM) can help ensure consistent security across AI workloads. Use its built-in policies for traffic control and security enforcement. Integrate APIM with Microsoft Entra ID to centralize authentication and authorization and ensure only authorized users or applications interact with your AI models. Ensure you configure least privilege access on the [reverse proxy’s managed identity](https://learn.microsoft.com/en-us/azure/api-management/api-management-howto-use-managed-service-identity). For more information, see [AI authentication with APIM](https://learn.microsoft.com/en-us/azure/architecture/ai-ml/guide/azure-openai-gateway-custom-authentication#general-recommendations)| -| N-R7 | The AI Landing Zone must implement HTTPS through Azure Application Gateway or Azure Front Door.

Best Practice:

Secure connections using TLS protocols help protect data integrity and confidentiality for AI workloads connecting from the internet. Implement HTTPS through Azure Application Gateway or Azure Front Door. Both services provide encrypted, secure tunnels for internet-originating connections.| -| N-R8 | **Firewall**: The AI Landing Zone must provide guidance & implementation of a UDR to Azure or 3P Firewall.

Best Practice:

[Azure Firewall](https://learn.microsoft.com/en-us/azure/firewall/overview) enforces security policies for outgoing traffic before it reaches the internet. Use it to control and monitor outgoing traffic and enable SNAT to conceal internal IP addresses by translating private IPs to the firewall's public IP. It ensures secure and identifiable outbound traffic for better monitoring and security. | -| N-R9 | **Private DNS Zones**: The AI Landing Zone should provide guidance and implementation of [integrated private endpoints with Private DNS Zones](https://learn.microsoft.com/en-us/azure/private-link/private-endpoint-dns-integration) for proper DNS resolution and successful private endpoint functionality. In case of platform landing zone, the central Private DNS Zones will be leveraged instead.

Best Practice:

Private DNS zones centralize and secure DNS management for accessing PaaS services within your AI network. Set up Azure policies that enforce private DNS zones and require private endpoints to ensure secure, internal DNS resolutions. If you don't have central Private DNS Zones, the DNS forwarding doesn't work until you add conditional forwarding manually. For example, see [using custom DNS](https://learn.microsoft.com/en-us/azure/machine-learning/how-to-custom-dns) with Azure AI Foundry hubs and Azure Machine Learning workspace. Custom DNS servers manage PaaS connectivity within the network, bypassing public DNS. Configure private DNS zones in Azure to resolve PaaS service names securely and route all traffic through private networking channels. | -| N-R10 | **Restrict Outbound by default**: The AI Landing Zone should provide guidance and implementation restricting outbound access by default.

Best Practice:

Limiting outbound traffic from your AI model endpoints helps protect sensitive data and maintain the integrity of your AI models. For minimizing data exfiltration risks, restricting outbound traffic to approved services or fully qualified domain names (FQDNs) and maintain a list of trusted sources. You should only allow unrestricted internet outbound traffic if you need access to public machine learning resources but regularly monitor and update your systems. For more information, see [Azure AI services](https://learn.microsoft.com/en-us/azure/ai-services/cognitive-services-data-loss-prevention), [Azure AI Foundry](https://learn.microsoft.com/en-us/azure/ai-studio/how-to/configure-managed-network), and [Azure Machine Learning.](https://learn.microsoft.com/en-us/azure/machine-learning/how-to-network-isolation-planning#allow-only-approved-outbound-mode) | -| N-R11 | **Virtual Network**: The AI Landing Zone should provide prescriptive guidance virtual networks, subnets and corresponding IP Address planning for them. | \ No newline at end of file +| N-R1 | **DDoS Protection**: The AI Landing Zone should provide guidance and implementation of Azure DDoS protection. In case of existing platform landing zone, the central DDoS service should be used instead.

**Best Practice:**

[Azure DDoS Protection](https://learn.microsoft.com/en-us/azure/ddos-protection/ddos-protection-overview) should be enabled to safeguard AI services from potential disruptions and downtime caused by distributed denial of service attacks. Enable Azure DDoS protection at the virtual network level to defend against traffic floods targeting internet-facing applications.| +| N-R2 | **Jump box & Bastion**: The AI Landing Zone should provide guidance and implementation of a jumpbox that can be accessed through bastion. In case of existing platform landing zone the central jump box and bastion service should be used instead.

**Best Practice:**

AI development access should use a jumpbox within the virtual network of the workload or through a connectivity hub virtual network. Use Azure Bastion to securely connect to virtual machines interacting with AI services. Azure Bastion provides secure RDP/SSH connectivity without exposing VMs to the public internet. Enable Azure Bastion to ensure encrypted session data and protect access through TLS-based RDP/SSH connections. | +| N-R3 | **Private Endpoints**: The AI Landing Zone should provide guidance and implementation of Private endpoint for the AI services it deploys.

**Best Practice:**

No PaaS services or AI model endpoints should be accessible from the public internet. Use private endpoints to provide private connectivity to Azure services within a virtual network. Private endpoints provide secure, private access to PaaS portals like Azure AI Foundry and Azure Machine Learning studio. For Azure AI Foundry, configure the [managed virtual network](https://learn.microsoft.com/en-us/azure/ai-studio/how-to/configure-managed-network) and use [private endpoints](https://learn.microsoft.com/en-us/azure/ai-studio/how-to/configure-private-link). For Azure OpenAI, restrict access to select [virtual networks](https://learn.microsoft.com/en-us/azure/ai-services/cognitive-services-virtual-networks#scenarios) or use [private endpoints](https://learn.microsoft.com/en-us/azure/ai-services/cognitive-services-virtual-networks#use-private-endpoints). For Azure Machine Learning, create a [secure workspace](https://learn.microsoft.com/en-us/azure/machine-learning/tutorial-create-secure-workspace-vnet) with a virtual network. [Plan for network isolation](https://learn.microsoft.com/en-us/azure/machine-learning/how-to-network-isolation-planning). Follow the [security best practices](https://learn.microsoft.com/en-us/azure/machine-learning/concept-enterprise-security) for Azure Machine Learning. | +| N-R4 | **Network Security Groups**: The AI Landing Zone should provide guidance and implementation of NSGs on all virtual networks implemented as part of the architecture.

**Best Practice:**

Utilize [network security groups](https://learn.microsoft.com/en-us/azure/virtual-network/network-security-groups-overview) (NSGs) to define and apply access policies that govern inbound and outbound traffic to and from AI workloads. These controls can be used to implement the principle of least privilege, ensuring that only essential communication is permitted. | +| N-R5 | **App Gateway or Azure Front Door with WAF**: The AI Landing Zone should provide guidance and implementation of an application gateway or front door with WAF for the chat application based on regional or global deployment.

**Best Practice:**

[Azure WAF](https://learn.microsoft.com/en-us/azure/web-application-firewall/overview) helps protect your AI workloads from common web vulnerabilities, including SQL injections and cross-site scripting attacks. Configure Azure WAF on [Application Gateway](https://learn.microsoft.com/en-us/azure/web-application-firewall/ag/ag-overview) for workloads that require enhanced security against malicious web traffic. For Azure AI Services, restrict access to select [virtual networks](https://learn.microsoft.com/en-us/azure/ai-services/cognitive-services-virtual-networks#scenarios) or use [private endpoints](https://learn.microsoft.com/en-us/azure/ai-services/cognitive-services-virtual-networks#use-private-endpoints). | +| N-R6 | **APIM as AI Gateway:** The AI landing zone must provide guidance & implementation of APIM with AI Foundry.

**Best Practice:**

The AI Landing Zone should use [Azure API Management](https://learn.microsoft.com/en-us/azure/api-management/genai-gateway-capabilities#backend-load-balancer-and-circuit-breaker) for load balancing API requests to AI endpoints. Consider using Azure API Management (APIM) as a generative AI gateway within your virtual networks. A generative AI gateway sits between your front-end and the AI endpoints. Application Gateway, WAF policies, and APIM within the virtual network is an established [architecture](https://github.com/Azure/apim-landing-zone-accelerator/blob/main/scenarios/workload-genai/README.md#scenario-3-azure-api-management---generative-ai-resources-as-backend) in generative AI solutions. For more information, see [AI Hub architecture](https://github.com/Azure-Samples/ai-hub-gateway-solution-accelerator#ai-hub-gateway-landing-zone-accelerator) and [Deploy Azure API Management instance to multiple Azure regions](https://learn.microsoft.com/en-us/azure/api-management/api-management-howto-deploy-multi-region). A [generative AI gateway](https://learn.microsoft.com/en-us/azure/api-management/genai-gateway-capabilities) allows you to track token usage, throttle token usage, apply circuit breakers, and route to different AI endpoints to control costs.

_Consider a generative AI gateway for monitoring._ A reverse proxy like Azure API Management allows you to implement logging and monitoring that aren't native to the platform. API Management allows you to collect source IPs, input text, and output text. For more information, see [Implement logging and monitoring for Azure OpenAI Service language models](https://learn.microsoft.com/en-us/azure/architecture/ai-ml/openai/architecture/log-monitor-azure-openai). [Azure API Management](https://learn.microsoft.com/en-us/azure/architecture/ai-ml/guide/azure-openai-gateway-guide) (APIM) can help ensure consistent security across AI workloads. Use its built-in policies for traffic control and security enforcement. Integrate APIM with Microsoft Entra ID to centralize authentication and authorization and ensure only authorized users or applications interact with your AI models. Ensure you configure least privilege access on the [reverse proxy's managed identity](https://learn.microsoft.com/en-us/azure/api-management/api-management-howto-use-managed-service-identity). For more information, see [AI authentication with APIM](https://learn.microsoft.com/en-us/azure/architecture/ai-ml/guide/azure-openai-gateway-custom-authentication#general-recommendations).

**Multi-Region Setup:**
• Use the Azure API Management (APIM) Premium tier to build a multi-region gateway setup
• In the event of a disaster, route traffic to a healthy region
• Use Azure Traffic Manager in conjunction with APIM to route requests to a regional gateway, or use the geographic routing method, in a multi-region deployment scenario

**Single Region Setup:**
• For a single-region scenario, deploy APIM in an availability zone configuration with automatic zone selection | +| N-R7 | **HTTPS Implementation:** The AI Landing Zone must implement HTTPS through Azure Application Gateway or Azure Front Door.

**Best Practice:**

Secure connections using TLS protocols help protect data integrity and confidentiality for AI workloads connecting from the internet. Implement HTTPS through Azure Application Gateway or Azure Front Door. Both services provide encrypted, secure tunnels for internet-originating connections.| +| N-R8 | **Firewall**: The AI Landing Zone must provide guidance & implementation of a UDR to Azure or 3P Firewall.

**Best Practice:**

[Azure Firewall](https://learn.microsoft.com/en-us/azure/firewall/overview) enforces security policies for outgoing traffic before it reaches the internet. Use it to control and monitor outgoing traffic and enable SNAT to conceal internal IP addresses by translating private IPs to the firewall's public IP. It ensures secure and identifiable outbound traffic for better monitoring and security. | +| N-R9 | **Private DNS Zones**: The AI Landing Zone should provide guidance and implementation of [integrated private endpoints with Private DNS Zones](https://learn.microsoft.com/en-us/azure/private-link/private-endpoint-dns-integration) for proper DNS resolution and successful private endpoint functionality. In case of platform landing zone, the central Private DNS Zones will be leveraged instead.

**Best Practice:**

Private DNS zones centralize and secure DNS management for accessing PaaS services within your AI network. Set up Azure policies that enforce private DNS zones and require private endpoints to ensure secure, internal DNS resolutions. If you don't have central Private DNS Zones, the DNS forwarding doesn't work until you add conditional forwarding manually. For example, see [using custom DNS](https://learn.microsoft.com/en-us/azure/machine-learning/how-to-custom-dns) with Azure AI Foundry hubs and Azure Machine Learning workspace. Custom DNS servers manage PaaS connectivity within the network, bypassing public DNS. Configure private DNS zones in Azure to resolve PaaS service names securely and route all traffic through private networking channels. | +| N-R10 | **Restrict Outbound by default**: The AI Landing Zone should provide guidance and implementation restricting outbound access by default.

**Best Practice:**

Limiting outbound traffic from your AI model endpoints helps protect sensitive data and maintain the integrity of your AI models. To minimize data exfiltration risks, restrict outbound traffic to approved services or fully qualified domain names (FQDNs) and maintain a list of trusted sources. Allow unrestricted internet outbound traffic only if you need access to public machine learning resources, and regularly monitor and update your systems. For more information, see [Azure AI services](https://learn.microsoft.com/en-us/azure/ai-services/cognitive-services-data-loss-prevention), [Azure AI Foundry](https://learn.microsoft.com/en-us/azure/ai-studio/how-to/configure-managed-network), and [Azure Machine Learning](https://learn.microsoft.com/en-us/azure/machine-learning/how-to-network-isolation-planning#allow-only-approved-outbound-mode). | +| N-R11 | **Virtual Network**: The AI Landing Zone should provide prescriptive guidance on virtual networks, subnets, and corresponding IP address planning for them. | \ No newline at end of file diff --git a/docs/Operational-Excellence.md b/docs/Operational-Excellence.md index e0c6654..6b80b59 100644 --- a/docs/Operational-Excellence.md +++ b/docs/Operational-Excellence.md @@ -1,11 +1,32 @@ -## Operational Excellence - -| ID | Specification | -|------|--------------| -| | | -| | | -| | | -| | | -| | | -| | | -| | | +# Operational Excellence + +This document outlines the operational excellence specifications for AI Landing Zones. + +## Specifications + +| ID | Specification | +|-----|---------------| +| O-1 | **Infrastructure-as-Code (IaC) & Deployment Automation**
Use Bicep or Terraform templates to automate AI deployments.
Reference: [Azure Verified Modules](https://azure.github.io/Azure-Verified-Modules/)
Learn: [Bicep documentation](https://learn.microsoft.com/en-us/azure/azure-resource-manager/bicep/) | [Terraform on Azure](https://learn.microsoft.com/en-us/azure/developer/terraform/) | +| O-2 | **Monitoring & Observability**
Integrate Azure Monitor natively with services like Azure OpenAI and APIM to track:
• Request/response payloads
• Latency
• Throughput
• Error rates [GenAI gate...using APIM]

Use custom events via Event Hubs for near real-time monitoring and alerting.
Learn: [Azure Monitor](https://learn.microsoft.com/en-us/azure/azure-monitor/) | [Event Hubs](https://learn.microsoft.com/en-us/azure/event-hubs/) | +| O-3 | *To be defined* | +| O-4 | *To be defined* | +| O-5 | *To be defined* | + +## Additional Resources + +### Infrastructure & Deployment +- [Azure Verified Modules](https://azure.github.io/Azure-Verified-Modules/) +- [Bicep documentation](https://learn.microsoft.com/en-us/azure/azure-resource-manager/bicep/) +- [Terraform on Azure](https://learn.microsoft.com/en-us/azure/developer/terraform/) +- [Infrastructure as Code best practices](https://learn.microsoft.com/en-us/azure/architecture/framework/devops/iac) + +### Monitoring & Observability +- [Azure Monitor overview](https://learn.microsoft.com/en-us/azure/azure-monitor/) +- [Azure Event Hubs documentation](https://learn.microsoft.com/en-us/azure/event-hubs/) +- [API Management monitoring](https://learn.microsoft.com/en-us/azure/api-management/api-management-howto-use-azure-monitor) +- [Azure OpenAI monitoring](https://learn.microsoft.com/en-us/azure/ai-services/openai/how-to/monitoring) + +### General Operational Excellence +- [Azure Well-Architected Framework - Operational Excellence](https://learn.microsoft.com/en-us/azure/architecture/framework/devops/overview) +- [Azure landing zones design principles](https://learn.microsoft.com/en-us/azure/cloud-adoption-framework/ready/landing-zone/design-principles) +- [Cloud Adoption Framework - Operational Excellence](https://learn.microsoft.com/en-us/azure/cloud-adoption-framework/manage/) diff --git a/docs/Security.md b/docs/Security.md index 2606961..80ba9ad 100644 --- a/docs/Security.md +++ b/docs/Security.md @@ -1,4 +1,24 @@ -## Security +## Secur| S-R5 | **Monitor outputs and apply prompt shielding:** The AI Landing Zone should implement/guide on using AI Content Safety.

Best Practice:

Regularly inspect the data returned by AI models to detect and mitigate risks associated with malicious or unpredictable user prompts. Implement [Prompt Shields](https://learn.microsoft.com/en-us/azure/ai-services/content-safety/concepts/jailbreak-detection) to scan text for the risk of a user input attack on generative AI models. | +| S-R6 | The AI Landing Zone must provide implementation and guidance on zero trust. | + +## Define and Maintain Data Boundaries + +- Use [Microsoft Purview](https://learn.microsoft.com/en-us/purview/create-sensitivity-labels) to classify data sensitivity and define access policies. +- Implement [Azure RBAC](https://learn.microsoft.com/en-us/azure/role-based-access-control/overview) to restrict data access by workload and user group. +- Use [Azure Private Link](https://learn.microsoft.com/en-us/azure/private-link/private-link-overview) for network-level data isolation between AI applications. + +## Implement Comprehensive Data Loss Prevention + +- Use [Microsoft Purview DLP](https://learn.microsoft.com/en-us/purview/dlp-learn-about-dlp) to scan and block sensitive data in AI workflows. +- Configure [content filtering](https://learn.microsoft.com/en-us/azure/ai-services/content-safety/overview) to prevent leakage of sensitive information. +- Implement custom filters to detect and redact organization-specific sensitive data. +- For [Microsoft Copilot Studio](https://learn.microsoft.com/en-us/microsoft-copilot-studio/dlp-example-6), configure DLP policies for agents. + +## Protect AI Artifacts from Compromise + +- Store models and datasets in [Azure Blob Storage](https://learn.microsoft.com/en-us/azure/storage/blobs/storage-blob-overview) with [private endpoints](https://learn.microsoft.com/en-us/azure/storage/common/storage-private-endpoints). 
+- Apply [encryption at rest](https://learn.microsoft.com/en-us/azure/storage/common/storage-service-encryption) and [in transit](https://learn.microsoft.com/en-us/azure/storage/common/storage-require-secure-transfer). +- Enforce strict access policies and monitor for unauthorized access attempts.y | ID | Specification | |-------|--------------|