Day-2 operations

Once Care is running on the cluster (see Deploying Care onto the cluster), day-2 operations cover keeping it observable and recoverable: monitoring, log aggregation, and backups for every stateful component.

This deployment is a work in progress. Some capabilities are running today; others are design goals that are partially built or not yet automated. This page states which is which: treat anything marked "evolving" or "planned" as untested, and verify against the cluster before relying on it.

Monitoring and logs

Observability is provided by a Prometheus + Grafana + Loki stack, deployed through the monitoring module:

python deployer.py apply-tofu monitoring

The pieces fit together like this:

| Component | Role |
| --- | --- |
| Prometheus | Scrapes and stores metrics from cluster components. |
| Grafana | Dashboards and visualization over metrics and logs. |
| Loki | Log aggregation and storage, queried from Grafana. |
| Alloy | Collects logs from pods/nodes and ships them into Loki. |

After the stack is stable, finish the setup in Grafana:

  1. Log in to Grafana using the Kubernetes-autogenerated admin password (read it from the Grafana admin secret in the monitoring namespace — do not hardcode or commit it).
  2. Add Loki as a data source so logs become queryable alongside metrics.
  3. Import the dashboards you need.

Note: Initial Grafana setup (data source + dashboards) is a manual step today. The admin password is generated by Kubernetes; retrieve it from the cluster secret rather than setting a known value.
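The first two Grafana steps can be sketched in Python as follows. This is a minimal illustration, not part of the deployer: the secret name `grafana`, the key `admin-password` (the upstream Grafana Helm chart default), and the Loki URL passed to `loki_datasource_payload` are assumptions, so check the actual names in the monitoring namespace first.

```python
import base64
import json
import subprocess


def decode_secret_field(secret: dict, key: str) -> str:
    """Return one base64-encoded field of a Kubernetes Secret as text."""
    return base64.b64decode(secret["data"][key]).decode("utf-8")


def read_grafana_admin_password(
    secret_name: str = "grafana",   # assumption: depends on your release name
    namespace: str = "monitoring",
    key: str = "admin-password",    # assumption: Grafana Helm chart default key
) -> str:
    """Fetch the secret with kubectl and decode the password field."""
    raw = subprocess.run(
        ["kubectl", "get", "secret", secret_name, "-n", namespace, "-o", "json"],
        check=True, capture_output=True, text=True,
    ).stdout
    return decode_secret_field(json.loads(raw), key)


def loki_datasource_payload(loki_url: str) -> dict:
    """Body for Grafana's POST /api/datasources call that registers Loki."""
    return {"name": "Loki", "type": "loki", "url": loki_url, "access": "proxy"}
```

Run `read_grafana_admin_password()` from a shell that has cluster access and use the value directly at the Grafana login; avoid writing it to a file or log. The payload from `loki_datasource_payload` is the same information you would otherwise enter by hand in the data source form.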

Warning: Alloy emits a noisy fsnotifier-related warning in this environment. This is a known cosmetic issue and does not affect log shipping.

Backups

Each stateful component is backed up on a daily/configurable schedule to S3-compatible external storage. The mechanism differs by component.

PostgreSQL (Care, Odoo, Metabase databases)

PostgreSQL is backed up using the CloudNativePG operator together with the barman-cloud plugin, which performs continuous WAL archiving and base backups to S3-compatible storage. Install the plugin manifest from the CloudNativePG release:

kubectl apply -f \
https://github.com/cloudnative-pg/plugin-barman-cloud/releases/download/v0.10.0/manifest.yaml

The plugin needs credentials and a target bucket. Provide an S3 credentials secret and point the archive at your bucket — keep all values in cluster secrets, never in committed YAML:

kubectl create secret generic s3-secret \
--from-literal=access_key=<s3-access-key> \
--from-literal=secret_key=<s3-secret-key>

Info: The same Postgres + barman-cloud backup pattern applies to every Postgres instance in the deployment: Care's database, Odoo's database, and Metabase's database. Configure each with its own bucket/prefix.
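To make "its own bucket/prefix" concrete, here is a hedged sketch that builds one barman-cloud `ObjectStore` and one `ScheduledBackup` per database and prints them as a single JSON `List` that `kubectl apply -f -` can read. The store names, cluster names (`<db>-postgres`), bucket, endpoint, and 02:00 schedule are hypothetical placeholders. The secret keys `access_key` and `secret_key` match the `s3-secret` created above.

```python
import json

PLUGIN = "barman-cloud.cloudnative-pg.io"


def object_store(name: str, bucket: str, prefix: str, endpoint: str,
                 secret: str = "s3-secret") -> dict:
    """Build an ObjectStore that archives to s3://<bucket>/<prefix>/."""
    return {
        "apiVersion": "barmancloud.cnpg.io/v1",
        "kind": "ObjectStore",
        "metadata": {"name": name},
        "spec": {
            "configuration": {
                "destinationPath": f"s3://{bucket}/{prefix}/",
                "endpointURL": endpoint,
                "s3Credentials": {
                    "accessKeyId": {"name": secret, "key": "access_key"},
                    "secretAccessKey": {"name": secret, "key": "secret_key"},
                },
            }
        },
    }


def scheduled_backup(cluster: str, schedule: str = "0 0 2 * * *") -> dict:
    """Build a ScheduledBackup that takes base backups through the plugin.

    CloudNativePG schedules use six cron fields, with seconds first.
    """
    return {
        "apiVersion": "postgresql.cnpg.io/v1",
        "kind": "ScheduledBackup",
        "metadata": {"name": f"{cluster}-daily"},
        "spec": {
            "schedule": schedule,
            "cluster": {"name": cluster},
            "method": "plugin",
            "pluginConfiguration": {"name": PLUGIN},
        },
    }


if __name__ == "__main__":
    items = []
    for db in ("care", "odoo", "metabase"):
        # Hypothetical names; replace the placeholders with your own values.
        items.append(object_store(f"{db}-store", "<backup-bucket>", db, "<s3-endpoint>"))
        items.append(scheduled_backup(f"{db}-postgres"))
    print(json.dumps({"apiVersion": "v1", "kind": "List", "items": items}, indent=2))
```

Each Postgres `Cluster` also has to reference its store under `.spec.plugins` (plugin name `barman-cloud.cloudnative-pg.io`, `isWALArchiver: true`, parameter `barmanObjectName`) before WAL archiving starts; confirm the exact fields against the plugin version you installed.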

Volumes (Longhorn snapshots and backups)

Persistent volumes are managed by Longhorn, which takes volume snapshots and ships volume backups to an S3 target. The backup target is configured with an S3 credentials secret in the longhorn-system namespace; substitute the endpoint and keys for your own provider:

kubectl create secret generic <backup-secret-name> -n longhorn-system \
--from-literal=AWS_ACCESS_KEY_ID=<s3-access-key> \
--from-literal=AWS_SECRET_ACCESS_KEY=<s3-secret-key> \
--from-literal=AWS_ENDPOINTS=<s3-endpoint> \
--from-literal=AWS_REGION=<s3-region>
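Besides the secret, Longhorn needs a backup target URL and a schedule. The sketch below formats the URL in Longhorn's `s3://<bucket>@<region>/<path>` form and builds a `RecurringJob` manifest for a daily backup. The job name, cron time, and retention count are illustrative assumptions, not values taken from this deployment.

```python
import json


def backup_target_url(bucket: str, region: str, path: str = "") -> str:
    """Format a Longhorn S3 backup target: s3://<bucket>@<region>/<path>."""
    return f"s3://{bucket}@{region}/{path.strip('/')}"


def recurring_backup_job(name: str = "daily-backup",
                         cron: str = "0 2 * * *",
                         retain: int = 7) -> dict:
    """Build a RecurringJob that backs up every volume in the default group."""
    return {
        "apiVersion": "longhorn.io/v1beta2",
        "kind": "RecurringJob",
        "metadata": {"name": name, "namespace": "longhorn-system"},
        "spec": {
            "cron": cron,            # standard five-field cron expression
            "task": "backup",        # "snapshot" would keep data in-cluster only
            "groups": ["default"],
            "retain": retain,        # number of backups kept per volume
            "concurrency": 1,
        },
    }


if __name__ == "__main__":
    # Placeholders: substitute your own bucket and region.
    print(backup_target_url("<backup-bucket>", "<s3-region>", "longhorn"))
    print(json.dumps(recurring_backup_job(), indent=2))
```

The URL goes into Longhorn's backup target setting together with the secret name; the printed manifest can be piped to `kubectl apply -f -`.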

What gets backed up

Daily/configurable backups are intended for each stateful component:

| Component | What is protected | Mechanism |
| --- | --- | --- |
| PostgreSQL (Care) | Application database | CloudNativePG + barman-cloud → S3 |
| PostgreSQL (Odoo) | Odoo database | CloudNativePG + barman-cloud → S3 |
| PostgreSQL (Metabase) | Metabase database | CloudNativePG + barman-cloud → S3 |
| OpenSearch / Elasticsearch | Search index data | Volume backup to external media |
| Object store (RustFS) | Uploads, audit logs, public data | Incremental sync to external store |
| Persistent volumes (Longhorn) | All attached volumes | Longhorn snapshots/backups → S3 |

Note: Redis/Valkey runs in non-persistent mode by design: it holds no durable state and is not backed up.

Restore

A core design goal of this deployment is that a fresh cluster can be restored from backups — rebuild the Kubernetes cluster, then replay Postgres WAL archives, Longhorn volume backups, and the object-store sync to bring Care back up.

Warning: Restore is a design goal, not a finished, tested procedure. End-to-end restore is not yet fully automated or documented in this deployment, and several supporting pieces (such as extending and repopulating Longhorn volumes) are still being worked out. Do not treat restore as production-proven until it has been validated against your own backups. Practice and verify a full restore on a throwaway cluster before depending on it.
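For the Postgres part only, a restore drill on a throwaway cluster could start from something like the sketch below. It builds a CloudNativePG `Cluster` that bootstraps by recovering from an existing barman-cloud object store. It is an untested illustration under stated assumptions, not this deployment's restore procedure: the cluster names, store name, and storage size are hypothetical, and it does not cover Longhorn volumes or the object-store sync.

```python
import json

PLUGIN = "barman-cloud.cloudnative-pg.io"


def recovery_cluster(new_name: str, source_cluster: str, store_name: str,
                     storage_size: str = "10Gi") -> dict:
    """Build a Cluster that recovers from the backups of `source_cluster`."""
    return {
        "apiVersion": "postgresql.cnpg.io/v1",
        "kind": "Cluster",
        "metadata": {"name": new_name},
        "spec": {
            "instances": 1,
            "storage": {"size": storage_size},
            # Recover from the external cluster entry named "origin" below.
            "bootstrap": {"recovery": {"source": "origin"}},
            "externalClusters": [
                {
                    "name": "origin",
                    "plugin": {
                        "name": PLUGIN,
                        "parameters": {
                            "barmanObjectName": store_name,
                            # Server name under which the backups were archived.
                            "serverName": source_cluster,
                        },
                    },
                }
            ],
        },
    }


if __name__ == "__main__":
    # Hypothetical names for a drill; replace with your own.
    print(json.dumps(recovery_cluster("care-restore", "care-postgres", "care-store"), indent=2))
```

Giving the restored cluster a different name from the original keeps the drill from writing into the same archive path as the source.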

Maturity: available today vs evolving

Use this split to set expectations. The "evolving / planned" column is drawn from the deployment's roadmap.

| Available today | Evolving / planned |
| --- | --- |
| Prometheus + Grafana + Loki monitoring | Move Postgres to a cluster operator-managed Helm chart |
| Alloy log shipping into Loki | Generate secrets/passwords via Terraform and auto-configure |
| Postgres backups via CloudNativePG + barman-cloud → S3 | Version-lock all Helm charts |
| Longhorn volume snapshots/backups → S3 | TLS for database and other inter-pod communication |
| Per-component daily/configurable backups | Documented, automated restore-from-backup procedure |
| | Caching layer for the nginx-based frontend |

Tip: Manual Grafana setup, manual secret retrieval, and unautomated restore are the rough edges to plan around. As the roadmap items land (operator-managed Postgres, Terraform-generated secrets, version-locked charts), these workflows will tighten up.