
IAM and Governance

Dev Singh edited this page Mar 22, 2025 · 2 revisions

Core implements an Identity and Access Management (IAM) architecture that follows the principle of least privilege while balancing security with development and operational velocity. This article explores our approach to IAM design, including role-based access control, controlled privilege escalation, and function-scoped resource access.

User-Facing IAM

The Foundation: Role-Based Access Control

At the heart of our system is a role-based access control (RBAC) model (in src/common/roles.ts) that follows the "action:resource" naming pattern:

export enum AppRoles {
  EVENTS_MANAGER = "manage:events",
  TICKETS_SCANNER = "scan:tickets",
  IAM_ADMIN = "admin:iam",
  STRIPE_LINK_CREATOR = "create:stripeLink",
  // etc.
}

Rather than assigning permissions directly to users, we maintain two DynamoDB tables:

  • userroles: Maps individual users to specific roles
  • grouproles: Maps Microsoft Entra ID (Azure AD) groups to roles

This allows us to manage permissions both at the individual level (for exceptions) and at the group level (for organizational roles), providing flexibility while keeping permissions manageable.

User Role Assignment Process

The process of assigning roles to users follows multiple paths to accommodate different organizational structures:

1. Group-Based Role Assignment

The primary method of role assignment is through Microsoft Entra ID groups. This approach leverages the existing organizational structure:

  1. Group Creation: Administrative users create groups in Microsoft Entra ID that represent organizational units (e.g., "Executive Council", "Committee Chairs")
  2. Group-Role Mapping: These groups are mapped to specific application roles via the grouproles DynamoDB table.
  3. User Membership: Users are added to the appropriate Entra ID groups based on their organizational role.
  4. Dynamic Resolution: When a user makes a request, the system automatically resolves their roles based on the JSON Web Token claims returned by Entra ID.

This pattern is particularly effective for roles tied to organizational positions. For example, all members of the "Executive Council" group might automatically receive the EVENTS_MANAGER and STRIPE_LINK_CREATOR roles that their duties require. In contrast, the Infrastructure Committee chairs and Officer Board members are granted full access, including the ability to bypass object-level authorization where applicable.

This process is implemented in the getGroupRoles function found in src/api/functions/authorization.ts, which handles the retrieval and caching of role assignments for groups.
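As a rough illustration of "retrieval and caching", the sketch below wraps a group-to-roles lookup in a small time-based cache. It is not the actual getGroupRoles implementation: the class name, the injected lookup function (standing in for a read of the grouproles table), the synchronous signature, and the TTL value are all assumptions made for the example.

```typescript
// Hypothetical sketch: none of these names come from the repository.
type RoleLookup = (groupId: string) => string[];

interface CacheEntry {
  roles: string[];
  expiresAt: number; // epoch milliseconds
}

class GroupRoleCache {
  private readonly entries = new Map<string, CacheEntry>();
  private readonly lookup: RoleLookup; // stands in for the grouproles table read
  private readonly ttlMs: number;
  private readonly now: () => number;

  constructor(lookup: RoleLookup, ttlMs: number, now: () => number = Date.now) {
    this.lookup = lookup;
    this.ttlMs = ttlMs;
    this.now = now;
  }

  getGroupRoles(groupId: string): string[] {
    const cached = this.entries.get(groupId);
    if (cached !== undefined && cached.expiresAt > this.now()) {
      return cached.roles; // still fresh: skip the table read
    }
    // Missing or expired: read through and remember the result.
    const roles = this.lookup(groupId);
    this.entries.set(groupId, { roles, expiresAt: this.now() + this.ttlMs });
    return roles;
  }
}
```

A shorter TTL means role changes take effect sooner at the cost of more table reads, which is the speed-versus-security trade-off any such cache has to settle.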

2. Direct Role Assignment

For exceptions or special cases, roles can be assigned directly to individual users in the userroles table.

There is currently no UI for this process. It is mostly used for one-off access grants, such as allowing Social Committee members to fulfill purchases.

This approach is useful for temporary access grants or roles that don't align neatly with organizational groups.

The direct assignment is implemented in the getUserRoles function in src/api/functions/authorization.ts, which performs similar logic to group role resolution but for individual users.

3. Administrative Interface

To make role management accessible to non-technical administrators, we provide a dedicated IAM management interface in the application:

  1. Group Management: Administrators can view and manage group membership
  2. Invitation System: New users can be invited to the Entra ID tenant

In the future, we aim to provide:

  1. Role Visualization: The interface will show which roles are assigned to each group
  2. User Management: Individual user roles will be viewable and modifiable

This interface abstracts away the underlying DynamoDB tables and Entra ID API calls, providing a user-friendly way to manage permissions.

The interface is implemented in the ManageIamPage component found in src/ui/pages/iam/ManageIam.page.tsx, which provides comprehensive role and group management capabilities. This page simply calls the IAM routes found in src/api/routes/iam.ts.

4. Role Inheritance and Special Handling

Some special cases in our role system include:

  • All Roles: A special value ["all"] in the roles array grants all application roles
  • Paid Membership Verification: Certain group memberships (like Executive Council) require verification of paid membership status
  • Role Caching: Role assignments are cached for performance, with different expiration times for a balance of speed and security.
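The "all" special value can be pictured as an expansion step applied to a stored roles array. The following is a minimal sketch under stated assumptions: the roles are mirrored as a const object containing only the four values shown earlier, the function name expandRoles is hypothetical, and silently dropping unknown strings is a choice made for the example rather than documented behavior.

```typescript
// Subset of the role strings shown in AppRoles, mirrored as a const object
// so the sketch is self-contained.
const AppRoles = {
  EVENTS_MANAGER: "manage:events",
  TICKETS_SCANNER: "scan:tickets",
  IAM_ADMIN: "admin:iam",
  STRIPE_LINK_CREATOR: "create:stripeLink",
} as const;

type AppRole = (typeof AppRoles)[keyof typeof AppRoles];

const ALL_ROLES: readonly AppRole[] = Object.values(AppRoles);

// Hypothetical helper: turns a stored roles array into a concrete role set.
function expandRoles(stored: readonly string[]): Set<AppRole> {
  if (stored.includes("all")) {
    return new Set(ALL_ROLES); // the special value grants every role
  }
  const known = new Set<string>(ALL_ROLES);
  const result = new Set<AppRole>();
  for (const role of stored) {
    if (known.has(role)) {
      result.add(role as AppRole); // assumption: unknown strings are ignored
    }
  }
  return result;
}
```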

Infrastructure IAM

Controlled Privilege Escalation in Lambda Functions

One of the most interesting aspects of our architecture is how we handle privilege escalation for specific operations. Following AWS best practices, our main Lambda functions run with the minimal permissions needed for day-to-day operations. However, certain operations like Microsoft Entra ID management require additional privileges. Since we use a "lambdalith" architecture, we must manage this privilege escalation ourselves at the route level.

Rather than granting these elevated permissions permanently, we use AWS's STS role assumption pattern to temporarily escalate privileges only when needed. This pattern is implemented in functions like getAuthorizedClients in src/api/routes/iam.ts, which temporarily assumes a role with additional permissions for Entra ID operations.

We store secret strings in AWS Secrets Manager, and only the roles that need a given secret are able to read it.

This pattern significantly enhances security:

  1. Temporal Privilege Minimization: Elevated permissions exist only during the specific operation that requires them.
  2. Reduced Attack Surface: Even if a vulnerability exists elsewhere in the application, the blast radius is limited.

In our CloudFormation templates, we structure the IAM roles to allow this controlled escalation (see src/cloudformation/iam.yml).
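The shape of this escalation pattern can be sketched without depending on the AWS SDK. The helper below is illustrative only and is not the code behind getAuthorizedClients: withElevatedCredentials, the AssumeRole function type, and the credential field names are hypothetical. In a real deployment the injected assumeRole function would typically wrap an STS AssumeRole call.

```typescript
// Hypothetical shapes: these are not taken from the repository.
interface TemporaryCredentials {
  accessKeyId: string;
  secretAccessKey: string;
  sessionToken: string;
  expiration: Date;
}

// Injected so the sketch stays self-contained; assumed to wrap STS AssumeRole.
type AssumeRole = (
  roleArn: string,
  sessionName: string,
) => Promise<TemporaryCredentials>;

async function withElevatedCredentials<T>(
  assumeRole: AssumeRole,
  roleArn: string,
  sessionName: string,
  operation: (creds: TemporaryCredentials) => Promise<T>,
): Promise<T> {
  const creds = await assumeRole(roleArn, sessionName);
  if (creds.expiration.getTime() <= Date.now()) {
    throw new Error("Assumed-role credentials are already expired");
  }
  // The credentials are handed only to this one operation and are never
  // stored on a module-level or otherwise shared object.
  return operation(creds);
}
```

Scoping the credentials to a callback keeps the elevated session from leaking into unrelated routes that share the same Lambda process.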

Function-Scoped Resource Access in SQS Consumers

Another secure design pattern we've implemented is function-scoped resource access for our SQS message processing. Since our system handles a variety of asynchronous tasks (email sending, membership provisioning, payment processing), we need to ensure that each task has access only to the resources it requires. For example, there's no reason for the external merch/ticketing API to also be provisioning memberships.

We achieve this through a combination of:

  1. Message Typing: Each SQS message has a specific function type
  2. Handler Registration: Handlers must be explicitly registered for message types
  3. Permission Boundaries: The SQS Lambda execution role has specific permissions
  4. Task-Specific Queues: Dedicated queues for specific operations with restricted function access

Our SQS message processing system is defined in src/common/types/sqsMessage.ts and src/api/sqs/index.ts. It defines a clear set of available functions (like email sending, membership provisioning, and payment processing) and explicitly maps each message type to a specific handler function. This structure ensures that each message can only trigger its designated handler, preventing unauthorized operations. We also use Zod schemas to strictly validate our payloads.
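To make the idea of explicit handler registration concrete, here is a simplified sketch. It is not the code in src/api/sqs/index.ts: the function names, payload fields, and helper names are hypothetical, and hand-written checks stand in for the Zod schemas so that the example has no dependencies.

```typescript
// Hypothetical sketch: every name below is illustrative.
type Dispatcher = (rawPayload: unknown) => string;

// Pair a payload validator with its handler so a handler never sees
// an unvalidated payload.
function register<P>(
  parse: (raw: unknown) => P,
  handle: (payload: P) => string,
): Dispatcher {
  return (raw) => handle(parse(raw));
}

function requireString(raw: unknown, key: string): string {
  if (typeof raw !== "object" || raw === null) {
    throw new Error("Payload must be an object");
  }
  const value = (raw as Record<string, unknown>)[key];
  if (typeof value !== "string") {
    throw new Error(`Payload field "${key}" must be a string`);
  }
  return value;
}

// Only functions listed here can ever be invoked by a message.
const handlers = new Map<string, Dispatcher>([
  [
    "sendEmail",
    register(
      (raw: unknown) => ({ to: requireString(raw, "to") }),
      (p) => `email queued for ${p.to}`,
    ),
  ],
  [
    "provisionMembership",
    register(
      (raw: unknown) => ({ email: requireString(raw, "email") }),
      (p) => `membership provisioned for ${p.email}`,
    ),
  ],
]);

function dispatch(messageBody: string): string {
  const message: unknown = JSON.parse(messageBody);
  if (typeof message !== "object" || message === null) {
    throw new Error("Message body must be a JSON object");
  }
  const { function: fn, payload } = message as {
    function?: unknown;
    payload?: unknown;
  };
  const handler = typeof fn === "string" ? handlers.get(fn) : undefined;
  if (handler === undefined) {
    throw new Error(`No handler registered for function: ${String(fn)}`);
  }
  return handler(payload);
}
```

Because the lookup goes through a fixed map, a message naming an unregistered function fails before any application code runs.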

Beyond our general-purpose queue, we've implemented task-specific queues (like our sales email queue) that are restricted to invoking only specific functions. This restriction is enforced at two levels:

  1. Handler-Level Validation: The SQS handler in src/api/sqs/index.ts validates that messages from specific queues can only trigger appropriate functions.

  2. IAM Policy Restrictions: Each queue is assigned specific IAM policies that limit which external functions can push messages to it.

This multi-layered approach ensures that even if a service gains the ability to publish to a specific queue, it can only trigger the specific functions that queue is authorized to invoke. For example, our sales email queue can only trigger email sending operations with specific sender addresses and recipient domains.

This design ensures that:

  1. Explicit Authorization: Only registered handlers can process specific message types
  2. Limited Scope: Each handler has a specific, well-defined purpose
  3. Clear Boundaries: Messages can't trigger arbitrary code execution
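The handler-level half of the queue restriction described above could look roughly like the sketch below. The queue names, function names, and helper names are hypothetical. The one detail taken from AWS itself is that the last segment of an SQS ARN, such as the eventSourceARN field on a Lambda SQS record, is the queue name.

```typescript
// Hypothetical allowlist: queue and function names are illustrative.
const queueAllowlist = new Map<string, ReadonlySet<string>>([
  ["general-queue", new Set(["sendEmail", "provisionMembership", "processPayment"])],
  ["sales-email-queue", new Set(["sendEmail"])],
]);

// An SQS ARN looks like arn:aws:sqs:<region>:<account>:<queue-name>.
function queueNameFromArn(eventSourceArn: string): string {
  const name = eventSourceArn.split(":").pop();
  if (name === undefined || name === "") {
    throw new Error(`Cannot derive queue name from ARN: ${eventSourceArn}`);
  }
  return name;
}

function assertAllowedOnQueue(eventSourceArn: string, functionName: string): void {
  const queueName = queueNameFromArn(eventSourceArn);
  const allowed = queueAllowlist.get(queueName);
  // Unknown queues are denied by default rather than treated as unrestricted.
  if (allowed === undefined || !allowed.has(functionName)) {
    throw new Error(`Function ${functionName} may not be invoked from queue ${queueName}`);
  }
}
```

Denying by default means a newly added queue can invoke nothing until it is listed explicitly.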

We further enhance security through our IAM policy structure. For example, our SQS Lambda role that handles email sending has very specific permissions:

- PolicyName: ses-membership
  PolicyDocument:
    Version: "2012-10-17"
    Statement:
      - Action:
          - ses:SendEmail
          - ses:SendRawEmail
        Effect: Allow
        Resource: "*"
        Condition:
          StringEquals:
            ses:FromAddress:
              Fn::Sub: "membership@${SesEmailDomain}"
          ForAllValues:StringLike:
            ses:Recipients:
              - "*@illinois.edu"

This policy constrains not just which actions the function can perform (sending emails), but also the specific parameters it can use (only sending from specific addresses to specific domains). This prevents potential abuse even if the function's logic were compromised.
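For readers less familiar with IAM condition operators, the sketch below mirrors the two conditions in application code. It is only a local approximation for illustration: IAM evaluates the real policy, and the function name and parameters here are hypothetical. Note that ForAllValues is satisfied when every recipient matches, which is also trivially true for an empty recipient list.

```typescript
// Local approximation of the policy conditions, for illustration only.
const ALLOWED_RECIPIENT_SUFFIX = "@illinois.edu"; // from the "*@illinois.edu" pattern

function wouldSatisfyPolicy(
  fromAddress: string,
  recipients: readonly string[],
  sesEmailDomain: string, // stands in for the SesEmailDomain template parameter
): boolean {
  // StringEquals on ses:FromAddress: exact, case-sensitive match.
  const fromOk = fromAddress === `membership@${sesEmailDomain}`;
  // ForAllValues:StringLike on ses:Recipients: every recipient must match the
  // wildcard pattern; Array.prototype.every is also true for an empty list.
  const recipientsOk = recipients.every((r) => r.endsWith(ALLOWED_RECIPIENT_SUFFIX));
  return fromOk && recipientsOk;
}
```

A check like this can give a clearer error message than a denied SES call, but the IAM policy remains the actual enforcement point.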

Authentication Process

Our platform implements authorization at multiple layers:

1. API-Level Authentication

The initial authentication layer validates JWT tokens from Microsoft Entra ID and maps users to roles:

const authPlugin: FastifyPluginAsync = async (fastify, _options) => {
  fastify.decorate(
    "authorize",
    async function (
      request: FastifyRequest,
      _reply: FastifyReply,
      validRoles: AppRoles[]
    ): Promise<Set<AppRoles>> {
      // Token validation logic
      // Role resolution logic
      // Authorization check
    }
  );
};

2. Route-Level Authorization

Each API route explicitly declares which roles can access it:

fastify.get(
  "/events",
  {
    onRequest: async (request, reply) => {
      await fastify.authorize(request, reply, [AppRoles.EVENTS_MANAGER]);
    },
  },
  handler
);

3. UI Component Authorization

The UI applies authorization checks using the AuthGuard component, only showing users resources they have permission to access:

<AuthGuard
  resourceDef={{ service: "core", validRoles: [AppRoles.TICKETS_MANAGER] }}
>
  <TicketManagementComponent />
</AuthGuard>

Benefits and Lessons Learned

Our IAM architecture has provided several key benefits:

  1. Reduced Attack Surface: By limiting permissions to what's strictly necessary, we've minimized potential exploit vectors.

  2. Simplified Reasoning: Clear role names and explicit permission checks make it easier for developers to understand who can do what.

  3. Auditability: Role assumption logs and explicit handler registrations create a clear audit trail of system behaviors.

  4. Scalable Permissions Model: New roles and permissions can be added with minimal changes to the core system.

However, we've also learned some important lessons:

  1. Developer Experience Trade-offs: More granular permissions require more thought during development, creating some additional overhead.

  2. Testing Complexity: Properly testing role-based access requires simulating different user contexts, which adds complexity to test suites.

  3. Documentation Importance: With finer-grained permissions, clear documentation becomes even more crucial for onboarding new developers.

Role Resolution Workflow

When a request comes into the system, a role resolution workflow takes place:

  1. Token Analysis: The system examines the JWT token to extract user identity and group memberships
  2. Group Role Lookup: For each group the user belongs to, the system looks up associated roles
  3. User-Specific Role Lookup: The system checks for any user-specific role assignments
  4. Role Aggregation: All applicable roles are combined into a single set
  5. Authorization Decision: The request is authorized if the required role exists in the user's role set

This flow is implemented in the authorize function found in src/api/plugins/auth.ts. The function performs several crucial steps:

  1. Validates the JWT token from the request
  2. Extracts group memberships from the token
  3. Retrieves roles for each group the user belongs to
  4. Adds any role mappings from Azure configurations
  5. Incorporates user-specific role overrides from the database
  6. Performs authorization by checking if any required role is present in the user's role set
  7. Makes the user's roles available for use in request handlers
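The aggregation and decision steps can be condensed into a short sketch. This is not the authorize function itself: token validation and the Azure configuration mappings are omitted, the claim names (email, groups) are assumptions, and the two lookup functions stand in for the grouproles and userroles reads.

```typescript
// Hypothetical shapes; claim names are assumptions for this sketch.
interface TokenClaims {
  email: string;
  groups: string[];
}

type RoleLookup = (key: string) => readonly string[];

function resolveRoles(
  claims: TokenClaims,
  groupRoles: RoleLookup, // stands in for the grouproles lookup
  userRoles: RoleLookup, // stands in for the userroles lookup
): Set<string> {
  const roles = new Set<string>();
  for (const group of claims.groups) {
    for (const role of groupRoles(group)) {
      roles.add(role);
    }
  }
  for (const role of userRoles(claims.email)) {
    roles.add(role);
  }
  return roles;
}

// Authorized when at least one of the route's valid roles is held.
function isAuthorized(
  held: ReadonlySet<string>,
  validRoles: readonly string[],
): boolean {
  return validRoles.some((role) => held.has(role));
}
```

Using a set makes the result independent of whether a role arrived via a group, a direct assignment, or both.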

Auditing and Governance

Beyond just controlling access, our IAM architecture also facilitates auditing and governance:

  1. Action Logging: Critical actions are logged with the actor's identity
  2. Role Change Tracking: Changes to role assignments are tracked with timestamps
  3. Mandatory Group Requirements: Some actions (like adding users to certain groups) have prerequisite checks

For example, when modifying group membership, the system includes detailed audit logging in the IAM routes (src/api/routes/iam.ts), capturing the actor, target, action type, and timestamp for all permission changes. Each entry is logged with {type: "audit"} in the JSON body of the log message, which lets us quickly filter for these actions in AWS CloudWatch.
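A structured audit entry of this kind might be built as in the sketch below. Only the type: "audit" marker comes from the description above. The remaining field names, the action string, and the helper name are hypothetical.

```typescript
// Hypothetical entry shape; only the "audit" type marker is from the article.
interface AuditEntry {
  type: "audit";
  actor: string;
  target: string;
  action: string;
  timestamp: string; // ISO 8601
}

function buildAuditEntry(
  actor: string,
  target: string,
  action: string,
  now: Date = new Date(),
): AuditEntry {
  return { type: "audit", actor, target, action, timestamp: now.toISOString() };
}

// Emitting one JSON object per line keeps each field individually filterable.
function logAudit(entry: AuditEntry, write: (line: string) => void = console.log): void {
  write(JSON.stringify(entry));
}
```

With entries in this shape, a log filter on the type field is enough to isolate permission changes from ordinary request logs.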

This audit trail provides accountability and helps track actions over time. These logs are retained for one year.