Skip to content

Feat/vpn interface#2

Open
NamSupawan wants to merge 131 commits into
mainfrom
feat/vpn_interface
Open

Feat/vpn interface#2
NamSupawan wants to merge 131 commits into
mainfrom
feat/vpn_interface

Conversation

@NamSupawan
Copy link
Copy Markdown

No description provided.

Comment thread app/controllers/vpn_devices_controller.rb Fixed
config_file = params[:config_file]
output, status = Open3.capture2e("sudo wg-quick up #{config_file}")

output, status = Open3.capture2e("sudo wg-quick up #{config_file}")

Check failure

Code scanning / CodeQL

Uncontrolled command line Critical

This command depends on a
user-provided value
.

Copilot Autofix

AI 21 days ago

Use a non-shell invocation and strict input validation/allowlisting before execution.

Best fix here (without changing intended functionality):

  1. Validate config_file against a conservative allowlist (e.g., only letters, numbers, _, -, ., and /) and reject unsafe values.
  2. Invoke Open3.capture2e with argument separation ("sudo", "wg-quick", "up", config_file) instead of a single interpolated string, so the shell is not used.
  3. Keep existing redirect behavior, adding an early return for invalid input.

Edit region: app/controllers/vpn_devices_controller.rb, inside def create (lines shown 107–115).

No new imports or dependencies are required (Open3 is already used in this file/snippet context).

Suggested changeset 1
app/controllers/vpn_devices_controller.rb

Autofix patch

Autofix patch
Run the following command in your local git repository to apply this patch
cat << 'EOF' | git apply
diff --git a/app/controllers/vpn_devices_controller.rb b/app/controllers/vpn_devices_controller.rb
--- a/app/controllers/vpn_devices_controller.rb
+++ b/app/controllers/vpn_devices_controller.rb
@@ -105,8 +105,14 @@
 
   # POST /vpn_devices or /vpn_devices.json
   def create
-    config_file = params[:config_file]
-    output, status = Open3.capture2e("sudo wg-quick up #{config_file}")   
+    config_file = params[:config_file].to_s
+
+    unless config_file.match?(/\A[\w.\-\/]+\z/)
+      redirect_to new_vpn_device_path, alert: 'Invalid configuration file name.'
+      return
+    end
+
+    output, status = Open3.capture2e('sudo', 'wg-quick', 'up', config_file)
     if status.success?
       redirect_to vpn_devices_path, notice: 'WireGuard interface created successfully.'
     else
EOF
@@ -105,8 +105,14 @@

# POST /vpn_devices or /vpn_devices.json
def create
config_file = params[:config_file]
output, status = Open3.capture2e("sudo wg-quick up #{config_file}")
config_file = params[:config_file].to_s

unless config_file.match?(/\A[\w.\-\/]+\z/)
redirect_to new_vpn_device_path, alert: 'Invalid configuration file name.'
return
end

output, status = Open3.capture2e('sudo', 'wg-quick', 'up', config_file)
if status.success?
redirect_to vpn_devices_path, notice: 'WireGuard interface created successfully.'
else
Copilot is powered by AI and may make mistakes. Always verify output.
NamSupawan and others added 17 commits February 6, 2025 23:44
Captures the 6 files that have been modified directly on the running
production EC2 (product-wireguard-vpn / i-05caf7960a63760c6) but never
made it into any branch on GitHub. Some of these date back to early 2025.

Files included:
  .env.sample                          (trailing-newline only — trivial)
  config/credentials.yml.enc           (encrypted Rails credentials)
  config/initializers/omniauth.rb      (removed email_verified: true
                                        from Google OAuth options)
  package.json                         (postcss-cli ^5 -> ^11,
                                        sass ^1.77.6 -> ^1.79.1)
  package-lock.json                    (matches package.json)
  yarn.lock                            (matches package.json)

NOT included (intentionally):
  .gitignore  — reverted; PR #11 ships a strict superset
  backup.rules, firewall.txt, rules.v5, config/credentials.yml.enc.backup
              — operational artifacts, gitignored by PR #11

This branch is for preserving the diff so it can be reviewed and merged
into feat/vpn_interface as a normal PR, instead of getting silently
clobbered the next time someone runs git pull on the server.

No code review has happened on these changes — please review before merge.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Prod-drift snapshot 2026-05: omniauth + package updates from running production
* Hot-reload WireGuard on peer changes, race-safe IP allocation

Rebased onto feat/vpn_interface (the actual production branch) instead
of main, which is 127 commits behind production. Preserves Nam's
existing changes: the .140-.254 allocation range, the #add_with_user
admin flow, and the #create wg-quick-up flow.

Fixes three operational bugs surfaced during the product-wireguard-vpn
SSM investigation (5 users currently in DB but missing from wg0.conf,
conf last written 57 days ago, 4 manual `systemctl restart wg-quick@wg0`
cycles in 80 min today).

1. Restart needed per new user
   - `after_action :update_wireguard_config` was only running on
     `:update`/`:destroy`. Both `:new` (current_user signup) and
     `:add_with_user` (admin-creates-for-user) PERSIST a VpnDevice but
     were missing from the allowlist, so wg0.conf never got regenerated
     for new signups.
   - Replace `systemctl restart wg-quick@wg0` (which deletes the link
     and drops every active session) with
     `wg syncconf wg0 <(wg-quick strip wg0)` — atomic peer
     reconciliation, no session drops. Implemented in
     `WireguardConfigGenerator.reload_wireguard` and called automatically
     from `write_server_configuration`.

2. Deleted IPs occasionally "skip"
   - DB-level unique index on `ip_allocations.ip_address` so concurrent
     signups can't both win the same IP via the model-level uniqueness
     race.
   - `IpAllocation.allocate_ip` runs in a per-attempt transaction and
     retries on `RecordNotUnique`.
   - FK `ip_allocations.vpn_device_id` re-added with `ON DELETE CASCADE`
     so manual/raw deletes always free the IP.
   - Migration de-dupes any existing duplicate IPs before adding the
     unique index (no duplicates exist today, so this is a no-op).
   - Allocation range `(140..254)` preserved from feat/vpn_interface and
     extracted to `ALLOCATION_RANGE` constant with a comment explaining
     why .2-.139 are reserved.

3. Users sometimes can't connect after restart
   - Root cause = bug 1. With the `:new`/`:add_with_user` after_action
     fix, every signup now writes the conf and syncs the kernel
     atomically.

Other changes
   - IP allocation moved from the controllers into `VpnDevice`'s
     `after_create` callback. Pool exhaustion or uniqueness-race
     exhaustion now rolls back the device instead of leaving an orphan
     VpnDevice without an `ip_allocation`.
   - Atomic conf write (`.tmp` + rename) so the kernel never reads a
     half-written file via the
     `/etc/wireguard/wg0.conf -> config/wireguard/wg0.conf` symlink.
   - `lib/tasks/wireguard.rake`:
       * `wireguard:audit`      — DB vs conf vs kernel drift report
       * `wireguard:regenerate` — rewrite conf + wg syncconf
       * `wireguard:fix`        — audit -> regenerate -> audit

Deploy (server is already on feat/vpn_interface)
   git pull
   bundle exec rake db:migrate
   bundle exec rake wireguard:fix    # onboards the 5 currently-stuck users
   sudo systemctl reload puma

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Gitignore operational artifacts produced on production boxes

The production EC2 (product-wireguard-vpn) accumulates a handful of files
during normal operations that are not source code and should never be
committed:

  *.rules            iptables-save dumps
  rules.v*           iptables-save dumps with versioned suffix
  firewall.txt       iptables-save dumps (named variant)
  *.enc.backup       backup of config/credentials.yml.enc
  *.backup           generic backup suffix
  .claude-crew*      Claude Code agent session state
  .fos/              local tooling state

Today these show up as 4 untracked files on the server (backup.rules,
config/credentials.yml.enc.backup, firewall.txt, rules.v5) which would
otherwise need ad-hoc cleanup before every `git pull`.

No actual files are added or removed — this only affects future `git
status` / `git add` behavior.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Reopen ALLOCATION_RANGE to (2..254) now that delete propagation works

Background (per Nam, the original author of the (140..254) workaround):

  > i reserve ip .2 ─ .139 because i cannot delete key on server
  > wireguard but db delete key done

The narrowed allocation range was a workaround for a deeper bug:
destroying a VpnDevice in the UI deleted the DB row but never told the
WireGuard kernel to drop the peer (the conf file was rewritten but
`systemctl restart wg-quick@wg0` was the only way to push it into the
kernel, and restarts dropped every active session — so it was avoided).
That left "phantom" peers in the kernel with stale public keys bound to
.2-.139 IPs. To prevent IP-key collisions when an old IP got
re-allocated to a new user, Nam moved the allocator floor to .140.

This is no longer needed:

* The other commit in this PR ("Hot-reload WireGuard...") wires
  `wg syncconf wg0 <(wg-quick strip wg0)` into the destroy after_action.
  Deletes now atomically reconcile the kernel with the conf without
  dropping any other peer's session.
* With reliable delete propagation, the full .2-.254 range is safe to
  reopen. Effective pool size grows from 115 → 253 addresses, and the
  ~80 gap addresses below .139 (left behind by old deletes that never
  freed their kernel peer) become reusable on the next signup.

Also: improve the wireguard:audit rake task to label each ghost peer
with the "# User: ..." comment line that precedes it in the conf. With
the range reopened, operators need to verify ghosts represent stale
state (safe to clean up) vs. real users (re-add via UI before running
`rake wireguard:fix`).

Before / after:
  before:  ==  Peers in conf but missing from DB (ghosts, ...): 2  ==
           10.42.5.18
           10.42.5.173

  after:   ==  Peers in conf but missing from DB ...: 2  ==
              ⚠  Review each one before running `rake wireguard:fix`.
                 If a peer here represents a real user, re-add their
                 device via the UI first.
           10.42.5.18    # User: Pheeraphat (New) Prisan, Device: NewFlowAccountDevice
           10.42.5.173   # User: <name>

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Prevent duplicate VpnDevice creation from Turbo prefetch / double-submit

Root cause:
  `VpnDevicesController#new` is a GET that persists a record (controller
  pattern preserved on `feat/vpn_interface` and not refactored here).
  Combined with Rails 7's Turbo Drive — which prefetches link targets on
  hover for perceived speed — every mouse hover over the "Let's add a
  device" button silently fires another GET, each of which creates
  another VpnDevice. Browser pre-rendering and double-clicks have the
  same effect.

Evidence (production scan today):
  - Pheeraphat (user#18) created 4 devices within ~1 second after
    re-onboarding, only 1 with the description he typed:
       04:25:05  id=626  "PheeraphatPrisanMacbook"   <- actual click
       04:25:05  id=627  nil                          <- prefetch
       04:25:06  id=628  nil                          <- prefetch
       04:25:06  id=629  nil                          <- prefetch

  - 16 other users have device pairs with sub-second creation deltas,
    many ending in placeholder/nil descriptions. 7 of these are
    confirmed never-handshook prefetch artifacts (kernel rx/tx = 0).

Fix:
  1. View — `app/views/admin/index.html.erb`
     Add `data-turbo-prefetch="false"` to the "Let's add a device" link
     in the empty-state. Stops Turbo from issuing background prefetch
     GETs on hover. Click behavior is unchanged. (The `_vpn_devices`
     partial's admin-only "Add a VPN For YourDevice" link already had
     `data: { turbo: false }`, so it wasn't a vector.)

  2. Controller — `app/controllers/vpn_devices_controller.rb`
     Add a 30-second idempotency guard in `#new` and `#add_with_user`.
     If the target user created a device in the last 30s, short-circuit
     and redirect with a notice instead of persisting another row.
     Catches all the remaining vectors (double-click, refresh, browser
     prefetch from other links, accidental form re-submit). Extracted
     into a private `recent_device_for(user)` helper.

Out of scope:
  Proper REST refactor (`GET /vpn_devices/new` → form only, no DB write;
  `POST /vpn_devices` → actual create). The current controller still has
  `#create` repurposed as a `wg-quick up` shell-out, so untangling it is
  a bigger change. The prefetch fix + idempotency guard resolves the
  user-visible bug without that refactor.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Fix description-update form requiring page refresh to show download button

User-reported bug: when a user with a nil-description device (a prefetch
artifact) types a description into the "Please add description" form and
clicks "Update", the page appears to do nothing — the form stays
visible with the same nil-description device. They have to manually
refresh the browser to see the device move into the card section (which
has the Download Config button).

Root cause:
  `form_with model: vpn_device` in Rails 7 defaults to `local: false`,
  meaning the form submission goes through Turbo Drive as a fetch-style
  request. The controller's `update` action responds with
  `redirect_to root_path`, which Turbo SHOULD follow via a Drive
  navigation that fully re-renders the page from the new URL. In
  practice the page does not update — the user sees the same stale form
  until they refresh.

  The matching `add_with_user` form a few lines below in the same partial
  already has `local: true`, presumably for the same reason — somebody
  hit this before but only fixed one of the two forms.

Fix:
  Add `local: true` to the per-device update form. Submission is now a
  standard HTML POST/PATCH, the response 302 is followed by the browser
  (not Turbo), and the new page renders correctly with the just-named
  device shown in the card section, Download Config button visible.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
#13)

After PR #11 deployed, the very first new signup (Lookplub at .5) hit the
exact bug I called out at the end of the deploy: `wg syncconf` is failing
silently every time Rails regenerates the conf. The user gets created in
DB and conf, but is never added to the kernel. Operator has to manually
`wg syncconf` after every signup — back to the original problem PR #11
was meant to solve.

Two root causes in `WireguardConfigGenerator`, both my own from PR #11:

1. `File.write` + atomic rename in `write_server_configuration` leaves the
   new conf file at mode 0644 (default umask). `wg-quick strip` then warns
   "Warning: '/etc/wireguard/wg0.conf' is world accessible" — which is
   correct, the server's [Interface] private key was sitting in a
   world-readable file (real security issue, not just cosmetic).

2. `reload_wireguard` uses `Open3.capture2e` to run `wg-quick strip`,
   which captures stderr alongside stdout. The "Warning: ..." text from
   (1) ends up in the buffer we then pipe into `wg syncconf`. syncconf
   parses each line as a config directive, hits "Warning:" and aborts:

     [wg-reload] wg syncconf wg0 failed: Line unrecognized:
       `Warning:`/home/ubuntu/gate-wireguard/config/wireguard/wg0.conf`is
       world accessible'
     Configuration parsing error

Fix:

* Force mode 0600 on wg0.conf (and the private.key file, while we're
  there) right after writing them. This eliminates the warning AND
  closes the world-readable-private-key security gap.
* Switch `capture2e` to `capture2` for `wg-quick strip` so any future
  stderr warnings can't poison the stripped output we feed to syncconf.

Verified manually on production today: with 0600 perms, `wg-quick strip`
emits no stderr, syncconf accepts the input, kernel is updated correctly
for new peers. Lookplub (the user who hit this on the deploy) was
unblocked by the same chmod + syncconf pattern this PR codifies.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants