Monitoring Operations

Use the monitoring pages to watch host capacity, deployment resource use, and game health while servers are running.

Swarm Host Actions

Swarm host owners and admins can manage monitoring from the swarm host detail page:

  • enable the monitoring stack
  • disable the monitoring stack
  • restart the monitoring stack
  • repair the monitoring stack
  • run monitoring diagnostics
  • change the local metric retention window

These actions are queued for the swarm host agent. If monitoring is unhealthy, the agent reports that state, but game deployment actions continue to be managed independently.

Retention And Storage

The default retention window is 30d. Source-checkout installs store monitoring data under /var/lib/swarmhost-monitor unless the agent environment overrides that path.

Shorter retention reduces disk use. Longer retention gives more history but needs more local storage on the swarm host. Retention changes affect local metric history only; they do not change deployment backups.
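
As a rough way to reason about that trade-off, the sketch below converts a retention string such as 30d into days and scales it into a disk estimate. The accepted suffixes, the series count, the scrape interval, and the bytes-per-sample figure are all assumptions for illustration, not values taken from the agent; replace them with numbers measured on the swarm host.

```python
import re

# Days per unit for retention strings such as "30d" or "100y".
# This suffix set is an assumption; check what the local metrics store accepts.
_UNIT_DAYS = {"h": 1 / 24, "d": 1, "w": 7, "y": 365}


def retention_days(window: str) -> float:
    """Convert a retention string like '30d' into a number of days."""
    match = re.fullmatch(r"(\d+)([hdwy])", window.strip())
    if match is None:
        raise ValueError(f"unsupported retention window: {window!r}")
    count, unit = match.groups()
    return int(count) * _UNIT_DAYS[unit]


def estimated_bytes(window: str, active_series: int,
                    scrape_interval_s: int = 15,
                    bytes_per_sample: float = 1.0) -> float:
    """Estimate disk use as samples-per-series x series x bytes-per-sample.

    All three inputs are placeholders, not measured values.
    """
    samples_per_series = retention_days(window) * 86_400 / scrape_interval_s
    return samples_per_series * active_series * bytes_per_sample
```

Under this model the estimate grows linearly with the window, so estimated_bytes("60d", n) is twice estimated_bytes("30d", n) for the same series count. Compare the result with the data directory size reported by diagnostics before relying on it.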

Diagnostics

Use the "Run monitoring diagnostics" action on the swarm host detail page when charts stop updating or the stack reports unhealthy collectors.

For source-checkout installs, the same diagnostic can be run locally:

.venv/bin/python agents/swarmhost-agent/agent/agent.py diagnostics monitoring

Diagnostics include collector health, configured retention, data directory size, target count, disk guard state, and the last known successful scrape time when the local metrics store can answer that query.
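
One way to read a diagnostics result is to turn the items above into a short list of findings. The sketch below is illustrative only: the field names, the meaning of an "active" disk guard, and the five-minute staleness threshold are assumptions, and the real output format of the diagnostics command may differ.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class MonitoringDiagnostics:
    # Hypothetical field names mirroring the items listed above.
    collectors_healthy: bool
    retention: str
    data_dir_bytes: int
    target_count: int
    disk_guard_active: bool  # assumed to mean the guard has tripped
    last_successful_scrape: Optional[datetime]  # None if the store cannot answer


def findings(report: MonitoringDiagnostics, now: datetime,
             max_scrape_age: timedelta = timedelta(minutes=5)) -> List[str]:
    """Return human-readable problems found in a diagnostics report."""
    problems = []
    if not report.collectors_healthy:
        problems.append("one or more collectors are unhealthy")
    if report.target_count == 0:
        problems.append("no scrape targets are configured")
    if report.disk_guard_active:
        problems.append("disk guard is active")
    if report.last_successful_scrape is None:
        problems.append("last successful scrape time is unknown")
    elif now - report.last_successful_scrape > max_scrape_age:
        problems.append("last successful scrape is stale")
    return problems
```

An empty list means none of these checks fired; it does not prove the stack is healthy.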

Game Health Metrics

Game health charts are shown only when a deployment has an explicit game metrics target, port, or sidecar configuration. If a game does not expose game metrics, the deployment monitor still shows container resource charts when host-local monitoring is available.
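
The rule above can be sketched as a small decision function. The field names and chart labels are hypothetical, and the sketch assumes that game health charts also depend on host-local monitoring being available, which the text above does not state explicitly.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Deployment:
    # Hypothetical fields standing in for the explicit game metrics settings.
    game_metrics_target: Optional[str] = None
    game_metrics_port: Optional[int] = None
    game_metrics_sidecar: Optional[str] = None


def visible_charts(deployment: Deployment,
                   host_monitoring_available: bool) -> List[str]:
    """Return which chart groups the deployment monitor would show."""
    if not host_monitoring_available:
        # Assumption: without host-local monitoring there is no chart data at all.
        return []
    charts = ["container_resources"]
    has_game_metrics = any(
        value is not None
        for value in (
            deployment.game_metrics_target,
            deployment.game_metrics_port,
            deployment.game_metrics_sidecar,
        )
    )
    if has_game_metrics:
        charts.append("game_health")
    return charts
```

A deployment with no game metrics settings still gets container resource charts, which matches the fallback described above.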

Troubleshooting

If the monitoring page has no chart data:

  • confirm the swarm host is online
  • confirm monitoring is enabled on that swarm host
  • run monitoring diagnostics
  • check that the browser can establish a monitor session
  • wait for the first scrape interval after enabling or repairing the stack

If only game health charts are missing, check whether that game or deployment actually provides a game metrics target.

Product Metrics Stack

The central metrics stack runs on VPS_1 and scrapes the control-plane /metrics endpoint. It is separate from the local monitoring stack on each swarm host.

Deploy or repair it with:

scripts/release/deploy_product_metrics.sh

The stack includes:

  • VictoriaMetrics for product time-series storage
  • vmagent scraping https://swarmhosts.com/metrics with PRODUCT_METRICS_TOKEN from default/swarmhosts-web-secret
  • a separate 30-day VictoriaMetrics store for VPS host metrics
  • a VPS vmagent scraping node-exporter for VPS_1 and VPS_2
  • node-exporter for VPS_1 as a Kubernetes DaemonSet
  • vmalert rules for scrape health, login failures, deployment failures, current deployment errors, offline swarmhosts, and VPS CPU, memory, and disk pressure
  • Alertmanager for alert state
  • Grafana at https://metrics.swarmhosts.com, including the Swarm Hosts VPS Host Overview dashboard

Product metrics retention is currently configured as 100y, which is effectively indefinite. This setting stays in place until the service has enough history to choose a real retention window.

VPS host metrics retention is configured separately as 30d.

VPS_2 is outside the k3s cluster, so the node-exporter DaemonSet does not run there. VPS_2 must run its own node-exporter on TCP 9100 and accept connections to that port only from VPS_1 (15.204.10.9) or over an equivalent private path. On an Ubuntu/Debian VPS with UFW active, install or repair it with:

scripts/ops/install_vps_node_exporter.sh
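
After the script runs, a quick TCP check can help confirm that the port restriction behaves as intended. The helper below is a minimal sketch using only the Python standard library; the function name is hypothetical, and the VPS_2 address must be supplied by the operator.

```python
import socket


def port_reachable(host: str, port: int = 9100, timeout_s: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False
```

Run from VPS_1 against the VPS_2 address, the check is expected to return True once node-exporter is listening. Run from any other host, it should return False if the firewall rule is in place. A True result only shows that the port accepts connections, not that vmagent is scraping it successfully.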

Grafana credentials are stored in the Kubernetes secret default/swarmhosts-product-grafana-secret. Retrieve them through Kubernetes access when needed; do not copy them into docs, issues, or chat.