Day-2 operations

Once Care is running on the cluster (see Deploying Care onto the cluster), day-2 operations cover keeping it observable and recoverable: monitoring, log aggregation, and backups for every stateful component.

This deployment is a work in progress. Some capabilities are running today; others are design goals that are partially built or not yet automated. This page states which is which: treat anything marked "evolving" or "planned" as untested, and verify against the cluster before relying on it.

Monitoring and logs

Observability is provided by a Prometheus + Grafana + Loki stack, deployed through the monitoring module:

python deployer.py apply-tofu monitoring

The pieces fit together like this:

Component	Role
Prometheus	Scrapes and stores metrics from cluster components.
Grafana	Dashboards and visualization over metrics and logs.
Loki	Log aggregation and storage, queried from Grafana.
Alloy	Collects logs from pods/nodes and ships them into Loki.
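One way to confirm that the stack has settled before continuing is to wait for its pods to report Ready. The sketch below assumes the stack runs in the monitoring namespace; the function name and the default timeout are illustrative and not part of the deployer.

```shell
# Wait until every pod in the namespace reports the Ready condition.
# Assumptions: namespace defaults to "monitoring", timeout to 300s.
wait_for_monitoring() {
  local namespace="${1:-monitoring}"
  local timeout="${2:-300s}"
  kubectl wait --for=condition=Ready pods --all \
    --namespace "$namespace" --timeout="$timeout"
}
```

Calling wait_for_monitoring with no arguments waits up to five minutes. Pods left behind by completed Jobs never become Ready, so a timeout does not always mean the stack is unhealthy.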

After the stack is stable, finish the setup in Grafana:

1. Log in to Grafana using the Kubernetes-autogenerated admin password. Read it from the Grafana admin secret in the monitoring namespace; do not hardcode or commit it.
2. Add Loki as a data source so logs become queryable alongside metrics.
3. Import the dashboards you need.
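Reading the admin password from the cluster might look like the sketch below. The secret name is deployment-specific and must be supplied. The admin-password key is the Grafana Helm chart's usual default and is an assumption here, so inspect the secret's keys if the lookup returns nothing.

```shell
# Print the Grafana admin password stored in a Kubernetes secret.
# $1: secret name (deployment-specific, no default)
# $2: namespace (assumed to be "monitoring")
# Assumption: the password is stored under the "admin-password" key.
grafana_admin_password() {
  local secret_name="$1"
  local namespace="${2:-monitoring}"
  kubectl get secret "$secret_name" --namespace "$namespace" \
    --output jsonpath='{.data.admin-password}' | base64 --decode
}
```

Running grafana_admin_password <grafana-secret-name> prints the password to the terminal. Avoid redirecting it into a file that could end up committed.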

note

Initial Grafana setup (data source + dashboards) is a manual step today. The admin password is generated by Kubernetes — retrieve it from the cluster secret rather than setting a known value.

warning

Alloy emits a noisy fsnotifier-related warning in this environment. This is a known cosmetic issue and does not affect log shipping.

Backups

Each stateful component is backed up to S3-compatible external storage on a daily (configurable) schedule. The mechanism differs by component.

PostgreSQL (Care, Odoo, Metabase databases)

PostgreSQL is backed up using the CloudNativePG operator together with the barman-cloud plugin, which performs continuous WAL archiving and base backups to S3-compatible storage. Install the plugin manifest from the plugin-barman-cloud release:

kubectl apply -f \
  https://github.com/cloudnative-pg/plugin-barman-cloud/releases/download/v0.10.0/manifest.yaml

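The plugin needs credentials and a target bucket. Provide an S3 credentials secret and point the archive at your bucket. Keep all values in cluster secrets, never in committed YAML. The command below creates the secret in the current namespace; add -n <namespace> if the Postgres cluster lives elsewhere, because the secret must sit in the same namespace as the cluster that uses it: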

kubectl create secret generic s3-secret \
  --from-literal=access_key=<s3-access-key> \
  --from-literal=secret_key=<s3-secret-key>
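Pointing the archive at a bucket is done with the plugin's ObjectStore resource. The following is a minimal sketch, assuming the barmancloud.cnpg.io/v1 API served by the plugin manifest and the s3-secret keys created above. The function name, store name, bucket path, and endpoint are placeholders; verify the field names against the plugin version you installed.

```shell
# Render an ObjectStore manifest for the barman-cloud plugin on stdout.
# $1: ObjectStore name, $2: destination (s3://bucket/prefix), $3: S3 endpoint URL
# Assumption: credentials live in the "s3-secret" secret with the
# access_key / secret_key keys used earlier on this page.
render_object_store() {
  local name="$1" destination="$2" endpoint="$3"
  cat <<EOF
apiVersion: barmancloud.cnpg.io/v1
kind: ObjectStore
metadata:
  name: ${name}
spec:
  configuration:
    destinationPath: ${destination}
    endpointURL: ${endpoint}
    s3Credentials:
      accessKeyId:
        name: s3-secret
        key: access_key
      secretAccessKey:
        name: s3-secret
        key: secret_key
EOF
}
```

For example, render_object_store care-store s3://<bucket>/care <s3-endpoint> | kubectl apply -n <postgres-namespace> -f - would create the store. The Postgres Cluster then references it under spec.plugins, using the plugin name barman-cloud.cloudnative-pg.io, isWALArchiver set to true, and a barmanObjectName parameter set to the store name.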

info

The same Postgres + barman-cloud backup pattern applies to every Postgres instance in the deployment: Care's database, Odoo's database, and Metabase's database. Configure each with its own bucket/prefix.
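A daily base backup for each instance could be declared with a CloudNativePG ScheduledBackup that delegates to the plugin. This is a sketch under assumptions: the cluster names passed in are hypothetical, and the default schedule (02:00 every day) is only an example. CloudNativePG schedules use a six-field cron expression whose first field is seconds.

```shell
# Render a ScheduledBackup manifest that uses the barman-cloud plugin.
# $1: name of the CloudNativePG Cluster to back up (hypothetical)
# $2: six-field cron schedule; defaults to 02:00:00 every day
render_scheduled_backup() {
  local cluster="$1"
  local schedule="${2:-0 0 2 * * *}"
  cat <<EOF
apiVersion: postgresql.cnpg.io/v1
kind: ScheduledBackup
metadata:
  name: ${cluster}-daily
spec:
  schedule: "${schedule}"
  cluster:
    name: ${cluster}
  method: plugin
  pluginConfiguration:
    name: barman-cloud.cloudnative-pg.io
EOF
}
```

Calling the function once per instance (Care, Odoo, Metabase) and piping each result to kubectl apply -f - gives every database its own schedule, which pairs naturally with a separate bucket or prefix per instance.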

Volumes (Longhorn snapshots and backups)

Persistent volumes are managed by Longhorn, which takes volume snapshots and ships volume backups to an S3 target. The backup target is configured with an S3 credentials secret in the longhorn-system namespace. Substitute your own provider's endpoint, region, and keys:

kubectl create secret generic <backup-secret-name> -n longhorn-system \
  --from-literal=AWS_ACCESS_KEY_ID=<s3-access-key> \
  --from-literal=AWS_SECRET_ACCESS_KEY=<s3-secret-key> \
  --from-literal=AWS_ENDPOINTS=<s3-endpoint> \
  --from-literal=AWS_REGION=<s3-region>
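The secret only carries credentials; the bucket itself is named in Longhorn's backup target URL, which for S3 takes the form s3://<bucket>@<region>/<path>. A small helper to compose it, using hypothetical bucket and region values, might look like this:

```shell
# Compose a Longhorn S3 backup target URL.
# $1: bucket name, $2: region, $3: optional path inside the bucket
longhorn_backup_target() {
  local bucket="$1" region="$2" path="${3:-}"
  printf 's3://%s@%s/%s\n' "$bucket" "$region" "$path"
}
```

The resulting string goes into Longhorn's backup target setting, alongside the name of the credentials secret created above. Where that setting lives depends on the Longhorn version, so check the documentation for the release you run.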

What gets backed up

Backups on a daily (configurable) schedule are intended for each stateful component:

Component	What is protected	Mechanism
PostgreSQL (Care)	Application database	CloudNativePG + barman-cloud → S3
PostgreSQL (Odoo)	Odoo database	CloudNativePG + barman-cloud → S3
PostgreSQL (Metabase)	Metabase database	CloudNativePG + barman-cloud → S3
OpenSearch / Elasticsearch	Search index data	Volume backup to external media
Object store (RustFS)	Uploads, audit logs, public data	Incremental sync to external store
Persistent volumes (Longhorn)	All attached volumes	Longhorn snapshots/backups → S3

note

Redis/Valkey runs in non-persistent mode by design — it holds no durable state and is not backed up.

Restore

A core design goal of this deployment is that a fresh cluster can be restored from backups: rebuild the Kubernetes cluster, then replay the Postgres WAL archives, restore the Longhorn volume backups, and re-sync the object store to bring Care back up.

warning

Restore is a design goal, not a finished, tested procedure. End-to-end restore is not yet fully automated or documented in this deployment, and several supporting pieces (such as extending and repopulating Longhorn volumes) are still being worked out. Do not treat restore as production-proven until it has been validated against your own backups. Practice and verify a full restore on a throwaway cluster before depending on it.
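For the Postgres part only, CloudNativePG can bootstrap a new cluster from an object store through the barman-cloud plugin. The sketch below is untested against this deployment and is not its restore procedure; it only illustrates the general shape of a recovery manifest. All names, the instance count, and the storage size are hypothetical.

```shell
# Render a Cluster manifest that bootstraps from a barman-cloud object store.
# $1: name of the new cluster
# $2: name of the ObjectStore resource holding the backups
# $3: server name the backups were written under (the original cluster name)
# $4: storage size for the new cluster (defaults to 10Gi as a placeholder)
render_recovery_cluster() {
  local new_name="$1" store="$2" server="$3" size="${4:-10Gi}"
  cat <<EOF
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: ${new_name}
spec:
  instances: 1
  storage:
    size: ${size}
  bootstrap:
    recovery:
      source: origin
  externalClusters:
    - name: origin
      plugin:
        name: barman-cloud.cloudnative-pg.io
        parameters:
          barmanObjectName: ${store}
          serverName: ${server}
EOF
}
```

Rendering the manifest and applying it on a throwaway cluster is a low-risk way to check that the archived backups are actually restorable before any real incident.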

Maturity: available today vs evolving

Use this split to set expectations. The "evolving / planned" column is drawn from the deployment's roadmap.

Available today	Evolving / planned
Prometheus + Grafana + Loki monitoring	Move Postgres to a cluster operator–managed Helm chart
Alloy log shipping into Loki	Generate secrets/passwords via Terraform and auto-configure
Postgres backups via CloudNativePG + barman-cloud → S3	Version-lock all Helm charts
Longhorn volume snapshots/backups → S3	TLS for database and other inter-pod communication
Per-component daily/configurable backups	Documented, automated restore-from-backup procedure
	Caching layer for the nginx-based frontend

tip

Manual Grafana setup, manual secret retrieval, and unautomated restore are the rough edges to plan around. As the roadmap items land (operator-managed Postgres, Terraform-generated secrets, version-locked charts), these workflows will tighten up.
