This module provisions a Data Lakehouse in Azure using the following Azure resources:
- Datalake as an Azure Storage Account
- ETL via Azure Data Factory
- Data Warehousing via Azure SQL Database
- API Access via Azure API Management
The module provisions Azure Roles and Role Assignments alongside Active Directory Groups to allow for the following access patterns:
Azure Data Factory has a GitHub integration that allows us to store our Data Factory configuration in a GitHub repository. Although we can configure the integration, we cannot authenticate against it automatically.
Authentication is left to the user to do manually, by following the steps outlined in the Azure Data Factory GitHub integration documentation.
Sometimes, when applying, you might get an error like this:
Error: checking for presence of existing Key "[KEY NAME]" (Key Vault "[KEY VAULT FQDN]"):
keyvault.BaseClient#GetKey: Failure responding to request: StatusCode=403 -- Original Error: autorest/azure: Service returned an error.
Status=403 Code="Forbidden" Message="The user, group or application '[...]' does not have keys get permission on key vault '[NAME OF KEYVAULT];location=northeurope'.
[...]
InnerError={"code":"AccessDenied"}
Key Vault access controls often take a while to propagate, which can cause errors when accessing a newly created Key Vault.
Wait a few minutes and try again. If that does not work, review the Key Vault's access policies and make sure they are configured correctly.
Sometimes, when applying, you might get an error like this:
Error: checking for presence of existing Key "[KEY NAME]" (Key Vault "[KEY VAULT FQDN]"):
keyvault.BaseClient#GetKey: Failure responding to request: StatusCode=403 -- Original Error: autorest/azure: Service returned an error.
Status=403 Code="Forbidden" Message="Client address is not authorized and caller is not a trusted service.
[...]
InnerError={"code":"ForbiddenByFirewall"}
The Network Access / Firewall rules for the Key Vault might be blocking your access.
Add your IP address to the Key Vault firewall rules.
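If the Key Vault is the one managed by this module, one way to add the address is through the `keyvault_ip_whitelist` input. The fragment below is a minimal sketch: the module label `datalakehouse` is hypothetical, and `203.0.113.10` is a placeholder from the documentation address range that should be replaced with your own public IP.

```hcl
module "datalakehouse" {
  # ... all other inputs stay unchanged ...

  # Placeholder address (assumption); replace with the public IP you apply from.
  keyvault_ip_whitelist = ["203.0.113.10"]
}
```

Whether a single address or CIDR notation is expected depends on how the underlying keyvault module consumes the list, so check that module before applying.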
Name | Description | Type | Default | Required |
---|---|---|---|---|
budget_contact_emails | Emails to send budget notifications to | list(string) | n/a | yes |
instance | Instance name | string | n/a | yes |
org_code | Organization code | string | n/a | yes |
platform_config | `workload_subscription_id` - The ID of the subscription which we want to provision into.<br>`platform_subscription_id` - The ID of the platform subscription.<br>`workload_management_group_name` (optional) - The name of the management group which we want to provision our workload subscription into. If this is not set, the placement of the workload subscription inside the management group hierarchy will not be changed. | object({...}) | n/a | yes |
tier | Tier of the environment | string | n/a | yes |
warehouse_config | `sku_name` - The sku-name for the datawarehouse. This controls the sku-name that will be used for the SQL Server Database. SKU names vary across regions and offerings; run `az sql db list-editions -l [your region] -o table` to see available options.<br>`max_size_gb` - This controls the max-size-gb setting, i.e. how much storage to allocate for the SQL Server Database.<br>`zone_redundant` - Whether the datawarehouse should be zone-redundant. This controls the zone-redundant setting that will be used for the SQL Server Database. Be aware that this might not be available for all SKUs.<br>`admin_group_id` (optional) - The name of an existing AD Group that should be used as an admin to the datawarehouse. If this is not set, a new AD Group will be created.<br>`collation` (optional) - The collation to use for the datawarehouse. This controls the collation that will be used for the SQL Server Database.<br>`ip_whitelist` (optional) - A list of maps containing ip_address / name pairs to whitelist for the datalakehouse | object({...}) | n/a | yes |
adf_git_backend_config | Configuration for the GitHub repository | object({...}) | null | no |
api_management_config | `sku_name` - SKU name for the API management instance<br>`sku_capacity` - SKU capacity for the API management instance<br>`publisher_name` - Publisher name for the API management instance<br>`publisher_email` - Publisher email for the API management instance | object({...}) | null | no |
budget_for_resource_group | Budget for the resource group | number | 50 | no |
datalake_whitelisted_cidrs | A list of CIDRs to whitelist for the datalake | list(string) | [] | no |
datalakehouse_admins | A list of Azure AD User Principal IDs that are allowed to administer the Data Lakehouse. | list(string) | [] | no |
datalakehouse_contributor_can_contribute_to_keyvault | Whether the data engineers should be able to contribute to the key vault. | bool | false | no |
datalakehouse_contributor_group_name | The name of an existing AD Group that should be used as a contributor to the datalakehouse. If this is not set, a new AD Group will be created. | string | "" | no |
datalakehouse_contributors | Contributors to the datalakehouse | list(string) | [] | no |
existing_audit_keyvault_id | An existing Keyvault to use for audit logs. If not provided, a new one will be created. | string | null | no |
existing_resource_group_info | An existing Resource Group to use. If not provided, a new one will be created. | object({...}) | null | no |
features | Features to enable or disable. | object({...}) | {...} | no |
keyvault_ip_whitelist | IP addresses to whitelist for the keyvault | list(string) | [] | no |
name_overrides | Map of resource names to override. If not set, the name will be generated from the instance name. This variable is an escape hatch for some naming scheme conflicts that can occur and should, ideally, not be used. The schema for this variable is defined inside resource and service modules and is not documented here. | map(string) | {} | no |
tags | Any tags that should be present on created resources. Will get merged with local.default_tags | map(string) | {} | no |
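As a rough illustration of how the required inputs fit together, the sketch below calls the module with placeholder values only. The `source` path, the module label, and every value are assumptions. The attribute types inside `platform_config` and `warehouse_config` are inferred from the descriptions above, so verify them against the module's variable definitions before use.

```hcl
module "datalakehouse" {
  # Hypothetical path; point this at wherever the module lives in your repository.
  source = "./path/to/this/module"

  # Placeholder naming inputs (assumptions).
  org_code = "acme"
  instance = "analytics"
  tier     = "dev"

  budget_contact_emails = ["data-platform@example.com"]

  platform_config = {
    # Placeholder subscription IDs; replace with real ones.
    workload_subscription_id = "00000000-0000-0000-0000-000000000000"
    platform_subscription_id = "00000000-0000-0000-0000-000000000000"
  }

  warehouse_config = {
    # Example SKU only; list the SKUs available in your region with
    # `az sql db list-editions -l [your region] -o table`.
    sku_name       = "S0"
    max_size_gb    = 10
    zone_redundant = false
  }
}
```

The attributes marked optional in the table are left out here. That only works if the module declares them as optional in its object types, which is not confirmed by this document.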
Name | Description |
---|---|
datafactory_info | The Data Factory Info |
datalake_info | The Data Lake Info |
datalakehouse_contributor_group_info | The Data Lakehouse Contributor group |
datalakehouse_warehouse_admin_group_info | The Data Lakehouse Warehouse Admin group |
datalakehouse_warehouse_connection_info | The Data Lakehouse Warehouse Connection Info |
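A caller can pass these outputs on from its own configuration. The sketch below assumes the module was instantiated under the hypothetical label `datalakehouse`. The structure of the `*_info` objects is not documented here, so the whole object is forwarded as-is.

```hcl
# Re-export the warehouse connection details from the calling configuration.
output "warehouse_connection_info" {
  value = module.datalakehouse.datalakehouse_warehouse_connection_info

  # Marked sensitive as a precaution, on the assumption that connection
  # details should not be printed in plan or apply output.
  sensitive = true
}
```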
Name | Type |
---|---|
azurerm_management_group_subscription_association.workload_subscription_association | resource |
azuread_user.datalakehouse_contributors | data source |
azurerm_client_config.current | data source |
azurerm_management_group.workload_mgtm_group | data source |
azurerm_subscription.workload_subscription | data source |
Name | Source | Version |
---|---|---|
api_management | ../../modules/azure/api-management | n/a |
base_setup | ../../modules/azure/base-setup | n/a |
data_engineer_group_role_assignments | ../../modules/azure/role-assignment | n/a |
data_engineer_role | ../../modules/azure/role-definition | n/a |
data_engineer_user_group | ../../modules/azure/ad-group | n/a |
datafactory | ../../modules/azure/datafactory | n/a |
datalake | ../../modules/azure/datalake | n/a |
datawarehouse | ../../modules/azure/datawarehouse | n/a |
keyvault | ../../modules/azure/keyvault | n/a |
warehouse_admin_group | ../../modules/azure/ad-group | n/a |
Name | Version |
---|---|
terraform | >= 1.1 |
azuread | >= 2.47.0 |
azurerm | >= 3.0.0 |
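In a calling configuration, the constraints above could be declared as in the following sketch. The `hashicorp/azuread` and `hashicorp/azurerm` source addresses are the standard registry addresses and are assumed here, since this document does not state them.

```hcl
terraform {
  required_version = ">= 1.1"

  required_providers {
    azuread = {
      source  = "hashicorp/azuread" # assumed registry address
      version = ">= 2.47.0"
    }
    azurerm = {
      source  = "hashicorp/azurerm" # assumed registry address
      version = ">= 3.0.0"
    }
  }
}

# The azurerm 3.x provider requires a features block, even if it is empty.
provider "azurerm" {
  features {}
}
```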
Name | Version |
---|---|
azuread | 2.47.0 |
azurerm | 3.92.0 |