Live cluster lynkedup-tech (NYC2) · 2026-06-28 maintenance window · Istio + SPIFFE/SPIRE
STATE: mTLS PERMISSIVE · 9 app services meshed · 0 outages · PAUSED before workers / ingress-mesh / STRICT
We rolled the platform's HTTP application services in the sre namespace into an Istio service mesh, on a dedicated SPIFFE/SPIRE identity foundation, in PERMISSIVE mode (sidecars accept both mTLS and plaintext, so existing plaintext callers keep working). Datastores and messaging were deliberately excluded. This report verifies, step by step, that the current state behaves as expected; every command below was run live and its real output captured.
Flow being tested: a public request enters via ingress-nginx → reaches a meshed app pod (its Envoy sidecar) → the app talks to its dependencies (managed Postgres via a pass-through ServiceEntry, Vault via a PERMISSIVE exception, and the excluded in-namespace datastores in plaintext). Meanwhile istiod configures every sidecar and SPIRE underpins workload identity (spiffe://insignia.tech/…).
Scope note: this verifies the mesh on the already-deployed platform services. The new Insignia identity components we built this week (Session Broker, MDM Auth-Subject Bridge, OPA Policy Gateway, CMP adapter/edge) are code, not yet deployed — they are out of scope here.
istiod is the mesh brain: it pushes config to every sidecar and (today) issues their certificates.
NAME    READY  IMAGE
istiod  1      docker.io/istio/pilot:1.23.2
Verdict: Running, v1.23.2. Healthy.
Every workload identity is scoped to a SPIFFE trust domain. Istio's trust domain must match SPIRE's and the caller IDs the OPA contract expects.
trustDomain: insignia.tech
Verdict: insignia.tech — matches SPIRE and the OPA decision-input contract (caller.spiffe_id).
SPIRE is the dedicated workload-identity provider (the 'mesh-first SPIFFE/SPIRE' goal). Server issues SVIDs; an agent runs on every node.
spire-agent-55q2n               1/1  Running
spire-agent-6nx2n               1/1  Running
spire-agent-9ttjf               1/1  Running
spire-agent-bn5t4               1/1  Running
spire-agent-tlbhg               1/1  Running
spire-agent-zmhql               1/1  Running
spire-server-0                  2/2  Running
spire-spiffe-csi-driver-dcb6w   2/2  Running
spire-spiffe-csi-driver-fhnn2   2/2  Running
spire-spiffe-csi-driver-gjtd6   2/2  Running
spire-spiffe-csi-driver-kwg7l   2/2  Running
spire-spiffe-csi-driver-nlgmg   2/2  Running
spire-spiffe-csi-driver-sdghj   2/2  Running
Verdict: server 2/2, all 6 agents 1/1, CSI on every node. Actively minting SVIDs.
k8s 1.36 supports native sidecar containers — the proxy starts before app init containers, so services with wait-for-db / migration init containers can be meshed.
true
Verdict: ENABLE_NATIVE_SIDECARS=true.
The data-plane proxies. These are the services we deliberately injected (PERMISSIVE).
MESHED: cmp-docs
MESHED: hello-world
MESHED: mdm-ai
MESHED: mdm-kernel
MESHED: mdm-sas
MESHED: opa
MESHED: opal-server
MESHED: poi-api
MESHED: realmdm-opa
MESHED: roof
Verdict: 9 production app services + the hello-world canary, all carrying an istio-proxy.
Services injected after native-sidecar mode was enabled (mdm-kernel, opal-server, poi-api) carry istio-proxy as an init container with restartPolicy=Always, not as a regular container. This is why a naive 'list containers' makes them look unmeshed.
mdm-kernel -> initContainers: istio-init,istio-proxy,
opal-server -> initContainers: istio-init,istio-proxy,
poi-api -> initContainers: istio-init,istio-proxy,
Verdict: CONFUSING BUT CORRECT: native sidecars appear under initContainers, not containers. Both forms are fully meshed.
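To make the two sidecar forms concrete, here is a minimal Python sketch of a check that treats both as meshed. It is not the command used for this report; it assumes pod objects shaped like `kubectl get pod -o json` output, and the three sample pods are hypothetical, trimmed to the fields the check reads.

```python
from typing import Optional

PROXY_NAME = "istio-proxy"  # Istio's default sidecar container name


def sidecar_form(pod: dict) -> Optional[str]:
    """Classify a pod as 'native', 'classic', or None (not meshed).

    'native'  : istio-proxy is an init container with restartPolicy=Always
    'classic' : istio-proxy is listed among the regular containers
    """
    spec = pod.get("spec", {})
    for c in spec.get("initContainers") or []:
        if c.get("name") == PROXY_NAME and c.get("restartPolicy") == "Always":
            return "native"
    for c in spec.get("containers") or []:
        if c.get("name") == PROXY_NAME:
            return "classic"
    return None


# Hypothetical pod objects for illustration only (not captured from the cluster).
native_pod = {
    "spec": {
        "initContainers": [
            {"name": "istio-init"},
            {"name": "istio-proxy", "restartPolicy": "Always"},
        ],
        "containers": [{"name": "app"}],
    }
}
classic_pod = {"spec": {"containers": [{"name": "app"}, {"name": "istio-proxy"}]}}
plain_pod = {"spec": {"containers": [{"name": "app"}]}}

for label, pod in [("native", native_pod), ("classic", classic_pod), ("plain", plain_pod)]:
    print(label, "->", sidecar_form(pod))
```

A check that only looks at `spec.containers` would report the first pod as unmeshed, which is exactly the confusion described above.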
Vault, Neo4j, Redpanda (Kafka), Redis, the 3 Postgres StatefulSets and Temporal are deliberately kept OUT of the mesh — meshing long-lived non-HTTP datastores is the classic cause of subtle outages.
(no LEAK lines above = datastores correctly excluded)
Verdict: No datastore carries a sidecar. Exclusion is clean.
Each sidecar holds an X.509-SVID. This is the identity OPA will consume as caller.spiffe_id for ABAC.
spiffe://insignia.tech/ns/sre/sa/default
Verdict: spiffe://insignia.tech/ns/sre/sa/default — correct trust domain (resolves the earlier cluster.local worry). NOTE: workloads currently share the 'default' ServiceAccount, so all share one identity. Per-workload ServiceAccounts are a later refinement for fine-grained policy.
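As an illustration of why the shared ServiceAccount matters for policy, the sketch below splits a SPIFFE ID into its parts. It assumes the Istio-style path layout `/ns/<namespace>/sa/<service-account>` seen in the output above; the `WorkloadIdentity` type and `parse_spiffe_id` helper are hypothetical names, not part of the platform code.

```python
from typing import NamedTuple
from urllib.parse import urlsplit


class WorkloadIdentity(NamedTuple):
    trust_domain: str
    namespace: str
    service_account: str


def parse_spiffe_id(spiffe_id: str) -> WorkloadIdentity:
    """Parse spiffe://<trust-domain>/ns/<namespace>/sa/<service-account>."""
    parts = urlsplit(spiffe_id)
    if parts.scheme != "spiffe" or not parts.netloc:
        raise ValueError(f"not a SPIFFE ID: {spiffe_id!r}")
    segments = parts.path.strip("/").split("/")
    if len(segments) != 4 or segments[0] != "ns" or segments[2] != "sa":
        raise ValueError(f"unexpected SPIFFE path layout: {parts.path!r}")
    return WorkloadIdentity(parts.netloc, segments[1], segments[3])


identity = parse_spiffe_id("spiffe://insignia.tech/ns/sre/sa/default")
print(identity)
```

Because the service-account segment is `default` for every meshed workload, a policy keyed on caller.spiffe_id can tell the trust domain and namespace apart today, but not one service from another.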
ingress-nginx (NOT yet meshed) → the meshed mdm-kernel pod. Under PERMISSIVE the pod accepts plaintext from the sidecar-less ingress.
HTTP 200
Verdict: HTTP 200 — external access to a meshed service is intact.
Same path for the AI service.
HTTP 200
Verdict: HTTP 200.
Keycloak is in another namespace and not meshed — a control to show non-mesh traffic is unaffected.
HTTP 302
Verdict: HTTP 302 (normal Keycloak redirect).
Another untouched service, confirming the change is scoped to sre.
HTTP 200
Verdict: HTTP 200.
mdm-kernel/mdm-ai talk to the DO managed Postgres. Their traffic now goes through Envoy — the riskiest common dependency.
mdm-kernel readyReplicas=1/1
Verdict: mdm-kernel Ready (readiness gates on deps) + external 200. DB reachable through the mesh.
The managed PG already does TLS; Istio must NOT wrap it again. A ServiceEntry + DestinationRule(tls:DISABLE) makes Envoy pass it through.
db-postgresql-nyc2-43694-do-user-26492239-0.a.db.ondigitalocean.com ports=25060
Verdict: ServiceEntry covers :25060 (what mdm-* uses). FLAG: confirm no meshed service needs the :25061 pool before STRICT.
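For readers unfamiliar with the pass-through pattern, the following Python sketch builds a ServiceEntry and DestinationRule of the kind described and checks which ports a set of clients would need that the entry does not cover. The resource name `managed-postgres`, the port name `tcp-postgres`, and the `MESH_EXTERNAL` / `DNS` settings are assumptions for illustration; the manifests actually applied to the cluster may differ.

```python
import json

PG_HOST = "db-postgresql-nyc2-43694-do-user-26492239-0.a.db.ondigitalocean.com"


def build_passthrough(host: str, port: int, namespace: str = "sre",
                      name: str = "managed-postgres"):
    """Return (ServiceEntry, DestinationRule) dicts for an external TCP host.

    The DestinationRule sets tls.mode=DISABLE so Envoy does not originate
    its own TLS on top of the TLS the database client already speaks.
    """
    meta = {"name": name, "namespace": namespace}
    service_entry = {
        "apiVersion": "networking.istio.io/v1beta1",
        "kind": "ServiceEntry",
        "metadata": meta,
        "spec": {
            "hosts": [host],
            "location": "MESH_EXTERNAL",  # assumption
            "resolution": "DNS",          # assumption
            "ports": [{"number": port, "name": "tcp-postgres", "protocol": "TCP"}],
        },
    }
    destination_rule = {
        "apiVersion": "networking.istio.io/v1beta1",
        "kind": "DestinationRule",
        "metadata": meta,
        "spec": {"host": host, "trafficPolicy": {"tls": {"mode": "DISABLE"}}},
    }
    return service_entry, destination_rule


def uncovered_ports(service_entry: dict, needed: list) -> list:
    """Ports that clients need but the ServiceEntry does not declare."""
    covered = {p["number"] for p in service_entry["spec"]["ports"]}
    return sorted(set(needed) - covered)


se, dr = build_passthrough(PG_HOST, 25060)
print(json.dumps(dr["spec"], indent=2))
print("uncovered:", uncovered_ports(se, [25060, 25061]))
```

The last line mirrors the flag in the verdict: if any meshed service turns out to use the :25061 pool, that port would show up as uncovered.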
ExternalSecrets-Operator (sidecar-less, other ns) calls Vault. A server-side PeerAuthentication keeps Vault PERMISSIVE so ESO survives the future STRICT flip.
vault-permissive: mode=PERMISSIVE selector={"app.kubernetes.io/name":"vault"}
Verdict: vault-permissive applied, selector app.kubernetes.io/name=vault.
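The reasoning behind the exception can be sketched in Python: in Istio, a PeerAuthentication with a workload selector takes precedence over a namespace-wide one, so a selector-scoped PERMISSIVE policy should survive a later namespace-wide STRICT. The `vault-permissive` name, mode and selector come from the output above; the namespace argument, the future STRICT policy, and the mdm-kernel label are hypothetical, and the resolver is a simplified model that ignores mesh-wide policy and port-level overrides.

```python
from typing import Optional


def vault_permissive(namespace: str) -> dict:
    """PeerAuthentication matching the captured name, mode and selector."""
    return {
        "apiVersion": "security.istio.io/v1beta1",
        "kind": "PeerAuthentication",
        "metadata": {"name": "vault-permissive", "namespace": namespace},
        "spec": {
            "selector": {"matchLabels": {"app.kubernetes.io/name": "vault"}},
            "mtls": {"mode": "PERMISSIVE"},
        },
    }


def effective_mode(pod_labels: dict, policies: list) -> Optional[str]:
    """Simplified precedence: workload-selector policy beats namespace-wide.

    All policies are assumed to live in the pod's own namespace.
    """
    workload_mode = None
    namespace_mode = None
    for policy in policies:
        spec = policy.get("spec", {})
        mode = spec.get("mtls", {}).get("mode")
        match = spec.get("selector", {}).get("matchLabels")
        if match:
            if all(pod_labels.get(k) == v for k, v in match.items()):
                workload_mode = mode
        else:
            namespace_mode = mode
    return workload_mode or namespace_mode


# Hypothetical future namespace-wide STRICT policy (not applied yet).
future_strict = {
    "kind": "PeerAuthentication",
    "metadata": {"name": "default", "namespace": "sre"},
    "spec": {"mtls": {"mode": "STRICT"}},
}
policies = [future_strict, vault_permissive("sre")]

print(effective_mode({"app.kubernetes.io/name": "vault"}, policies))
print(effective_mode({"app": "mdm-kernel"}, policies))  # hypothetical label
```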
mdm-sas uses Vault transit for plaintext crypto — a meshed service depending on the excluded Vault.
mdm-sas ready=true
Verdict: meshed + ready — meshed→excluded-Vault path works under PERMISSIVE.
Full disclosure: the mdm-projection-worker is crashlooping (29 restarts).
mdm-projection-worker-f897d6bc5-zxjsp 1/1 Running restarts=29 (NOT meshed; crashloop predates mesh)
Verdict: PRE-EXISTING and NOT mesh-related — it is NOT meshed (no sidecar). It was already restarting before today. Will be assessed when we (carefully) mesh the workers in the next phase.
Working perfectly ✅
Trust domain insignia.tech end-to-end; meshed pods carry real spiffe://insignia.tech/… SVIDs.
Managed Postgres passes through Envoy untouched (ServiceEntry + DestinationRule tls:DISABLE); Vault reachable via the PERMISSIVE exception.

Clarified / flagged ⚠️ (correct, but worth knowing)
Meshed workloads currently run under the default ServiceAccount, so they share one SVID. Fine for mTLS today; give security-sensitive services their own ServiceAccount before relying on per-service ABAC.
The ServiceEntry covers :25060; confirm no meshed service needs the :25061 pool before STRICT.

Pre-existing, not us ℹ️
mdm-projection-worker is crashlooping (29 restarts). It is not meshed and was failing before today; to be assessed when we mesh the workers.

Deliberately NOT done yet (next phase) ⏭
… k8s-pods branch feat/mesh-istio-spire.