Kubernetes the Hard Way is a classic learning resource for everyone who wants learn Kubernetes seriously. I also followed that tutorial when I first learned Kubernetes a couple of years ago. It gave me more familarity and a better mental model of the moving pieces in Kubernetes. However, since the explanations in that tutorial were very brief, for many parts I ended up just following along by copying and pasting the documented bash snippets, without really gaining a thorough understanding. In the end, everything worked, but I still didn't feel 100% confident.
And I realised that's very similar to the experience of vibe coding. So I began to wonder - why don't I just vibe code the whole thing from scratch? In the end, I should get something that works as well, but in the process, I would get to make a lot more decisions, and ask a lot more questions - and this would teach me more skills and knowledge.
The start of everything is this prompt to ChatGPT 5:
There is a popular learning resource in the k8s community - kubernetes the hard way, which shows how to bootstrap a cluster in cloud without using a managed offering or out-of-the-box bootstrapper. Now, I want to do something similar, however, instead of following that tutorial, I want to do it by chatting with you.
Our goals are:
- Create a kubernetes cluster on AWS
- We aim to use the minimal offerings from AWS like EC2 and VPCs. We don't use EKS
- We don't use things like kubeadm. Instead, we manually install everything from source by hand
- We need to make sure all control plane components are configured correctly so they can talk with each other. This will include certificates are generated and distributed correctly
- We want to have 3 control plane nodes and 2 worker nodes
- In the end, we need to be able to create a simple service (nginx is fine) running as a Deployment which we can reach to from the internet
Doing everything at once is hard. So, let's first start with a high level overview. Please arrange all the things we need to do into several relatively self-contained "chapters". We'll then dig into each "chapter" to get things working gradually.
From there, ChatGPT created a detailed roadmap, which I saved in SPEC.md. As part of Chapter 0, I also got a detailed project architecture doc which I saved in ARCHITECTURE.md. (This was revised to reflect the latest architecture after everything was completed.)
For each chapter, I used Codex to discuss, plan and execute. After each chapter is done, I asked Codex to generate a summary and append at the end of this document. I used a RULES.md file to specify the behaviour that Codex must adhere to. For design decisions, I asked Codex to create an ADR for each under the ADRs folder. The DECISIONS.md keeps an index of all the ADRs so Codex can digest and access them more efficiently.
I got a Kubernetes cluster set up from bare metal, which has 3 control plane nodes and 2 worker nodes, and the API server can be accessed from the internet:
I can also run workload on this cluster and expose that to the internet:
I also get the metrics server working so we can check node usage:
- Documented the target AWS-based Kubernetes architecture with decisions and rationale in
ARCHITECTURE.md. - Appended an ASCII topology diagram illustrating VPC, AZ, subnet, and node placement.
- Clarified Calico’s VXLAN mode and its alternatives, plus explained kube-proxy’s IPVS mode behavior.
- Captured Chapter 1 prerequisite decisions in
ADRs/000-chapter1-network-prep-decisions.mdand indexed them viaDECISIONS.md. - Scaffolded
chapter1/terraform/with network and security modules, plus documented inputs and notes for future chapters. - Provisioned the AWS network substrate (VPC, subnets, IGW, managed NAT, security groups) using Terraform 1.13.3.
- Recorded Chapter 2 provisioning choices in
ADRs/001-chapter2-node-provisioning-decisions.mdand later right-sized the worker fleet viaADRs/002-chapter2-worker-sizing-adjustment.md, updatingDECISIONS.mdaccordingly. - Built
chapter2/terraform/to launch 3× control-planet3.mediuminstances and 2× workert3.smallinstances with 20 GiB gp3 roots using cloud-init templates and dynamic Ubuntu 22.04 AMI discovery via SSM. - Added committed ops assets: role-specific cloud-init (
chapter2/cloud-init/), a static inventory (chapter2/inventory.yaml), and a bastion-run validation script (chapter2/scripts/validate_nodes.sh).
- Logged PKI strategy in
ADRs/003-chapter3-pki-decisions.mdand expanded.gitignoreto keep generated secrets out of version control. - Produced full PKI inventory under
chapter3/pki/usingcfssl, including apiserver, component, admin, and per-node kubelet certificates with documented SAN coverage. - Generated secrets encryption material in
chapter3/encryption/, drafted distribution/rotation guidance (chapter3/pki/manifest.yaml,chapter3/REVOCATION.md), and captured execution notes inchapter3/README.md.
- Brought up the three-node etcd cluster (v3.5.12) with TLS on all control-plane nodes using the Chapter 3 CA and staged binaries under
chapter4/bin/. - Extended the manifest-driven distribution tooling and added
chapter4/scripts/bootstrap_etcd_node.shto standardise service bring-up. - Documented validation flows (
etcdctl endpoint status/health,member list) and captured implementation notes inchapter4/README.md.
- Finalised control-plane assets under
chapter5/, including env files, systemd units, distribution manifest, and an idempotent bootstrap script aligned with ADR 005. - Removed the obsolete
--cloud-provider=noneflag, converted controller-manager/scheduler units toType=exec, and fixed kube-apiserver advertising/permissions to eliminate handler timeouts. - From the bastion ran
distribute_control_plane.sh --nodes <cp>followed bybootstrap_control_plane.shon each control-plane node (cp-a/b/c), confirmed services viasystemctl status, and verified/healthz?verbose.
- Introduced the internal AWS NLB (
kthw-api-nlb) with TCP 6443 listener and target group coveringcp-a,cp-b, andcp-c, plus a private Route53 zonekthw.labexposingapi.kthw.lab. - Stood up the dedicated
chapter6/terraform/stack and README detailing usage, validation, and outputs for the shared API endpoint. - Expanded Chapter 1 security groups so the control plane accepts kube-apiserver traffic from the NLB subnets, enabling load-balanced access from the bastion and future clients.
- Logged worker runtime decisions in
ADRs/007-chapter7-worker-stack-decisions.mdand updatedDECISIONS.mdto track Chapter 7 scope. - Staged Kubernetes v1.31.1 worker binaries plus containerd v1.7.14/runc v1.1.12/crictl v1.31.1 under
chapter7/bin/, templated configs inchapter7/config/, and defined distribution targets viachapter7/manifest.yaml. - Authored
chapter7/scripts/enable_workers.shand README guidance so workers install the stack, register with the control plane, and remainNotReadypending the Chapter 8 CNI rollout.
- Captured Calico networking choices in
ADRs/008-chapter8-networking-decisions.mdand linked them throughDECISIONS.md, then rendered a VXLAN-tuned manifest atchapter8/calico.yamlalongside the archived upstream source. - Hardened networking prerequisites by expanding security group rules in
chapter1/terraform/modules/security/main.tffor BGP/VXLAN traffic and documenting the dependency inchapter8/README.md. - Added operational artifacts:
chapter8/README.md,chapter8/VALIDATION.md, test fixtures inchapter8/tests/connectivity.yaml, and a kubelet proxy RBAC binding underchapter8/manifests/kube-apiserver-to-kubelet-crb.yaml. - Authored
chapter5/scripts/update_hosts_entries.shto align/etc/hostsacross bastion and nodes, enablingkubectl execvalidation, and trimmed service-DNS tests until Chapter 9 delivers CoreDNS.
- Logged CoreDNS/Metrics Server decisions in
ADRs/009-chapter9-core-addons-decisions.md, populated manifests underchapter9/manifests/, and scripted the request-header ConfigMap refresh viachapter9/scripts/ensure_requestheader_configmap.sh. - Generated the front-proxy CA/client certs, distributed them with the Chapter 5 manifest, and updated
kube-apiserver.envto include--enable-aggregator-routing=trueso aggregated APIs accept proxied requests. - Rolled out CoreDNS and a hostNetworked metrics-server deployment, refreshed
extension-apiserver-authentication, and validated the aggregation layer (k get apiservice v1beta1.metrics.k8s.io,k top nodes).
- Captured ALB-focused exposure decisions in
ADRs/012-chapter10-app-exposure-decisions.mdand staged Terraform underchapter10/terraform/to provision the public ALB, security groups, target group, and instance attachments. - Rendered nginx Deployment/Service manifests in
chapter10/manifests/that surface node identity via Downward API variables, plus documented rollout/cleanup inchapter10/README.md. - Applied the stack and workload from the bastion, then validated internet reachability by curling the ALB DNS name and confirming target health for both workers.
- Logged hardening choices in
ADRs/013-chapter11-security-decisions.mdandADRs/014-metrics-server-worker-placement.md, covering RBAC, NetworkPolicies, and metrics-server worker placement. - Added
chapter11/manifests/RBAC + policy assets, refreshed metrics-server secrets/script, and confirmed hostNetwork metrics-server runs cleanly on workers with the aggregated API reachable. - Documented validation steps in
chapter11/README.mdandchapter11/VALIDATION.md, including kubelet read-only checks and guidance for ingress namespace labelling.
- Captured a documentation-only scope for resiliency work in
ADRs/015-chapter12-backup-upgrade-dr-deferral.md, deferring automation until the resiliency milestone is staffed. - Outlined etcd snapshot automation, control-plane upgrade, and node replacement workstreams with proposed scripts and docs in
chapter12/README.md. - Listed validation drills, ownership prerequisites, and follow-up decisions to unblock future execution of the Chapter 12 plan.
- Added
ADRs/016-chapter13-public-api-exposure-scope.mdto lock public-access constraints (AWS NLB hostname, admin-only CIDR allowlist, dedicated security group) and recorded the entry inDECISIONS.md. - Built
chapter13/terraform/to provision the internet-facing NLB, security group, target group attachments, and ENI SG associations usingadmin_cidr_blockssupplied at plan/apply time viacurl. - Regenerated the kube-apiserver certificate SANs with the NLB hostname, updated the PKI manifest for correct ownership, refreshed the certs on all control planes, and created
chapter13/kubeconfigs/admin-public.kubeconfigfor approved operators. - Authored
chapter13/README.mdand the rotation runbook documenting Terraform usage, allowlist updates, cert distribution, and validation (kubectl --raw=/livezfrom the internet, allowlist enforcement, NLB resilience).
- Captured teardown guidance in
chapter14/docs/cleanup-checklist.md, including ordered Terraform destroy commands, manual AWS verification steps, and ready-to-run CLI snippets for each resource family. - Refreshed
ARCHITECTURE.mdwith Mermaid diagrams, a security group matrix, and full PKI/kubeconfig inventories so the document reflects the final cluster state. - Logged the cleanup decisions in
ADRs/017-chapter14-cleanup-decisions.md, updatedDECISIONS.md, and scaffoldedchapter14/README.mdfor future execution notes.


