No one has standing access to production. All changes go through pipelines. Break-glass is for active incidents only — automatically tracked, time-bound, and immutably audited.
```bash
# One-time setup
aws configure sso

# Login
aws sso login --profile Dev-tapp
export AWS_PROFILE=Dev-tapp

# Verify
aws sts get-caller-identity
```
```bash
aws eks update-kubeconfig --name dev-eks-cluster --region us-east-1
kubectl get nodes
```
Use the connect.sh pattern — no static passwords:
```bash
# 1. Assume the DB access role
TEMP_CREDS=$(aws sts assume-role \
  --role-arn "arn:aws:iam::<ACCOUNT_ID>:role/developer-rds-access" \
  --role-session-name "dev-psql-$(date +%s)")
export AWS_ACCESS_KEY_ID=$(echo "$TEMP_CREDS" | jq -r '.Credentials.AccessKeyId')
export AWS_SECRET_ACCESS_KEY=$(echo "$TEMP_CREDS" | jq -r '.Credentials.SecretAccessKey')
export AWS_SESSION_TOKEN=$(echo "$TEMP_CREDS" | jq -r '.Credentials.SessionToken')

# 2. Generate an RDS auth token
DB_TOKEN=$(aws rds generate-db-auth-token \
  --hostname "<RDS_ENDPOINT>" --port 5432 \
  --username "<DB_USER>" --region us-east-1)

# 3. Connect
PGPASSWORD="$DB_TOKEN" psql --host="<RDS_ENDPOINT>" --port=5432 \
  --user="<DB_USER>" --dbname="<DB_NAME>"
```
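For convenience, the three steps above can be wrapped into a single function. This is a hypothetical sketch of what `connect.sh` might contain; the role ARN matches the steps above, while `ACCOUNT_ID`, `RDS_ENDPOINT`, `DB_USER`, and `DB_NAME` are assumed to come from the environment.

```bash
#!/usr/bin/env bash
# Hypothetical connect.sh sketch consolidating the three steps above.
# Expects ACCOUNT_ID, RDS_ENDPOINT, DB_USER, DB_NAME in the environment.

connect_db() {
  local creds token

  # 1. Assume the DB access role and export the temporary credentials
  creds=$(aws sts assume-role \
    --role-arn "arn:aws:iam::${ACCOUNT_ID}:role/developer-rds-access" \
    --role-session-name "dev-psql-$(date +%s)")
  AWS_ACCESS_KEY_ID=$(echo "$creds" | jq -r '.Credentials.AccessKeyId')
  AWS_SECRET_ACCESS_KEY=$(echo "$creds" | jq -r '.Credentials.SecretAccessKey')
  AWS_SESSION_TOKEN=$(echo "$creds" | jq -r '.Credentials.SessionToken')
  export AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY AWS_SESSION_TOKEN

  # 2. Generate a short-lived RDS auth token (valid for 15 minutes)
  token=$(aws rds generate-db-auth-token \
    --hostname "$RDS_ENDPOINT" --port 5432 \
    --username "$DB_USER" --region us-east-1)

  # 3. Connect with the token as the password
  PGPASSWORD="$token" psql --host="$RDS_ENDPOINT" --port=5432 \
    --user="$DB_USER" --dbname="$DB_NAME"
}
```

The function name and environment-variable contract are illustrative; the actual `connect.sh` may differ.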
Access is managed via IaC (Terraform) in the Management account. Users are added to groups via PR. No click-ops.
| Group | Console access | Members |
|---|---|---|
| grp-forge-platform-admins | Full · all accounts | Platform engineers |
| grp-forge-developers | Full on Dev + QA | Software engineers |
| grp-forge-qa-engineers | Full on QA, ReadOnly on Dev | QA team |
| grp-forge-9squid-team | Full on Dev | 9Squid product team |
| grp-forge-sre-oncall | PowerUser + break-glass on Prod | On-call engineers |
| grp-forge-wealth-prod | ReadOnly on Prod-Wealth | CTO, key stakeholders |
| grp-forge-data-analysts | S3 Data Access on Prod (Confidential) | Business / BI team |
VPN access is separate from console access. Not every engineer needs console, but all need VPN to reach internal services.
| Group | Network access | Members |
|---|---|---|
| grp-forge-vpn-dev | Dev VPC | All engineers |
| grp-forge-vpn-qa | QA VPC | QA + relevant devs |
| grp-forge-vpn-prod-wealth | Prod-Wealth VPC | Wealth team + SRE only |
| grp-forge-vpn-prod-cash | Prod-Cash VPC | Cash team + SRE only |
| grp-forge-vpn-prod-9squid | Prod-9Squid VPC | 9Squid team + SRE only |
No one has standing access to production. All changes go through pipelines. Break-glass is for active incidents only.
| Tier | Permission set | When | Approval | Duration |
|---|---|---|---|---|
| Tier 0 | Forge-Prod-Emergency | P1 — site down | Self-serve + auto-alert | 1h |
| Tier 1 | Scenario-specific | P2/P3 — investigation | Manager approval | 2-4h |
| Tier 2 | Forge-Prod-ReadOnly | Investigation only | Senior engineer | 4h |
| Tier 3 | IAM user in Bitwarden | Identity Center is down | Physical, 2-person | until rotated |
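As a sketch, Tier 0 entry might look like the following, assuming the Forge-Prod-Emergency permission set is exposed as an SSO profile of the same name (the profile name is an assumption, not a documented value):

```bash
# Hypothetical Tier 0 helper; assumes an SSO profile named Forge-Prod-Emergency
emergency_login() {
  aws sso login --profile Forge-Prod-Emergency
  export AWS_PROFILE=Forge-Prod-Emergency
  # Confirm which role you actually hold before touching anything
  aws sts get-caller-identity
}
```

The MFA prompt and 1-hour expiry come from the permission-set configuration, not from this helper.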
- **EC2**: SSM session to check logs, processes, and env vars: `aws ssm start-session --target <ID>`. 2-hour max.
- **RDS**: IAM-authenticated SQL connection (`connect.sh`) or SSM port-forward fallback. 2-hour max.
- **EKS**: `kubectl logs`, `kubectl exec`, `kubectl describe`. 2-hour max.
- **Console**: ReadOnly access to CloudWatch, EC2, and RDS dashboards. 4-hour max.
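The SSM port-forward fallback for RDS can be sketched as follows. `AWS-StartPortForwardingSessionToRemoteHost` is the standard SSM document for forwarding through a bastion instance; the instance ID and local port here are placeholders.

```bash
# Sketch of the SSM port-forward fallback for RDS.
# $1 = bastion instance ID, $2 = RDS endpoint, $3 = local port (default 15432)
rds_port_forward() {
  local instance_id="$1" rds_endpoint="$2" local_port="${3:-15432}"
  aws ssm start-session \
    --target "$instance_id" \
    --document-name AWS-StartPortForwardingSessionToRemoteHost \
    --parameters "{\"host\":[\"$rds_endpoint\"],\"portNumber\":[\"5432\"],\"localPortNumber\":[\"$local_port\"]}"
}

# Usage: rds_port_forward i-0123456789abcdef0 <RDS_ENDPOINT>
# then connect locally: psql --host=localhost --port=15432 ...
```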
Every break-glass session is automatically archived to an immutable S3 bucket:
How it works:

1. EventBridge detects `AssumeRole` calls for `*BreakGlass*` / `*Emergency*` roles.
2. A Lambda creates the S3 folder, writes session metadata, and sends a Slack alert.
3. SSM session logs stream to the bucket automatically.
4. A Step Function exports the CloudTrail records after the session expires.
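The detection step can be sketched as an EventBridge event pattern. This is a hypothetical reconstruction matching the `*BreakGlass*` / `*Emergency*` role names via EventBridge wildcard matching; the rule name and exact field layout are assumptions.

```bash
# Hypothetical event pattern for the detection rule; matches CloudTrail
# AssumeRole events whose role ARN contains BreakGlass or Emergency
BREAKGLASS_PATTERN='{
  "source": ["aws.sts"],
  "detail-type": ["AWS API Call via CloudTrail"],
  "detail": {
    "eventName": ["AssumeRole"],
    "requestParameters": {
      "roleArn": [{"wildcard": "*BreakGlass*"}, {"wildcard": "*Emergency*"}]
    }
  }
}'

# Attach it to a rule (name is illustrative):
#   aws events put-rule --name detect-breakglass --event-pattern "$BREAKGLASS_PATTERN"
```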
**TIER 0 (P1 — SITE DOWN):**

Assume `Forge-Prod-Emergency` → MFA → work → auto-expires after 1h. Auto-alerts #platform-incidents. File a ticket within 1h.

**TIER 1 (P2/P3 — INVESTIGATION):**

1. Create a ticket.
2. Post the request in #platform-support.
3. Wait for manager approval.
4. A Platform Admin provisions access in Identity Center.
5. Log in via SSO → verify with `sts:GetCallerIdentity`.
6. Diagnose and fix — document everything.
7. Log out → access auto-revokes.

Access paths:

- EC2: `aws ssm start-session --target <instance-id>`
- RDS: `connect.sh` (IAM auth) or SSM port-forward (fallback)
- EKS: `aws eks update-kubeconfig` → `kubectl`
- Console: SSO portal → `Forge-Prod-ReadOnly`

**DURATIONS:** Tier 0 = 1h · EC2/DB/EKS = 2h · Console = 4h
**APPROVERS:** Platform Leads, Eng Managers
**CHANNELS:** #platform-support (requests), #platform-incidents