Insignia Platform — Service Mesh Rollout Verification

Live cluster lynkedup-tech (NYC2) · 2026-06-28 maintenance window · Istio + SPIFFE/SPIRE

STATE: mTLS PERMISSIVE · 9 app services meshed · 0 outages · PAUSED before workers / ingress-mesh / STRICT

What this document is

We rolled the platform's HTTP application services in the sre namespace into an Istio service mesh, on a dedicated SPIFFE/SPIRE identity foundation, in PERMISSIVE mode (sidecars accept both mTLS and plaintext, so callers without a sidecar keep working). Datastores and messaging were deliberately excluded. This report verifies, step by step, that the current state behaves as expected. Every command below was run live and its real output captured.

Flow being tested: a public request enters via ingress-nginx → reaches a meshed app pod (its Envoy sidecar) → the app talks to its dependencies (managed Postgres via a pass-through ServiceEntry, Vault via a PERMISSIVE exception, and the excluded in-namespace datastores in plaintext). Meanwhile istiod configures every sidecar and SPIRE underpins workload identity (spiffe://insignia.tech/…).

Scope note: this verifies the mesh on the already-deployed platform services. The new Insignia identity components we built this week (Session Broker, MDM Auth-Subject Bridge, OPA Policy Gateway, CMP adapter/edge) exist as code only and are not yet deployed, so they are out of scope here.

Diagram: mesh topology
Public / Internet: *.lynkedup.cloud → ingress-nginx (NOT meshed yet ⚠)
Control plane: istiod 1.23.2 (istio-system) · SPIRE server + 6 agents + CSI (spiffe://insignia.tech)
sre namespace, MESH (mTLS PERMISSIVE) 🔒: mdm-kernel, mdm-ai, mdm-sas, opa, realmdm-opa, opal-server, poi-api, roof, cmp-docs
EXCLUDED (no sidecar), datastores/messaging: Vault, Neo4j, Redpanda, Redis, opa-pg, poi-db, ors, Temporal
DO Managed Postgres: ServiceEntry + tls:DISABLE, :25060
Edge labels: plaintext OK (permissive) · xDS + certs · DB via mesh · plaintext to excluded datastores

Foundation — is the mesh control plane alive?

Istio control plane (istiod) PASS

istiod is the mesh brain: it pushes config to every sidecar and (today) issues their certificates.

$ kubectl -n istio-system get deploy istiod -o "custom-columns=NAME:.metadata.name,READY:.status.readyReplicas,IMAGE:.spec.template.spec.containers[0].image"
NAME     READY   IMAGE
istiod   1       docker.io/istio/pilot:1.23.2

Verdict: Running, v1.23.2. Healthy.

Mesh trust domain PASS

Every workload identity is scoped to a SPIFFE trust domain. It must equal SPIRE's and the OPA contract's expected caller IDs.

$ kubectl -n istio-system get cm istio -o jsonpath="{.data.mesh}" | grep -i trustDomain
trustDomain: insignia.tech

Verdict: insignia.tech — matches SPIRE and the OPA decision-input contract (caller.spiffe_id).
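The trust-domain comparison above can be scripted so that a mismatch fails loudly instead of relying on a visual check. This is a minimal sketch: the function names are hypothetical, and it assumes the mesh config carries trustDomain as a top-level "key: value" line, as in the captured output.

```shell
#!/usr/bin/env bash
# Extract the trustDomain value from Istio mesh config text read on stdin.
mesh_trust_domain() {
  awk '$1 == "trustDomain:" { gsub(/"/, "", $2); print $2; exit }'
}

# Succeed only when the mesh trust domain equals the expected one ($1).
check_trust_domain() {
  local expected="$1" actual
  actual="$(mesh_trust_domain)"
  if [ "$actual" = "$expected" ]; then
    echo "OK trustDomain=$actual"
  else
    echo "MISMATCH trustDomain=${actual:-<unset>} expected=$expected"
    return 1
  fi
}
```

Possible usage: kubectl -n istio-system get cm istio -o jsonpath="{.data.mesh}" | check_trust_domain insignia.tech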

SPIRE (server + agents + CSI) PASS

SPIRE is the dedicated workload-identity provider (the 'mesh-first SPIFFE/SPIRE' goal). Server issues SVIDs; an agent runs on every node.

$ kubectl -n spire get pods --no-headers | awk "{print \$1, \$2, \$3}"
spire-agent-55q2n 1/1 Running
spire-agent-6nx2n 1/1 Running
spire-agent-9ttjf 1/1 Running
spire-agent-bn5t4 1/1 Running
spire-agent-tlbhg 1/1 Running
spire-agent-zmhql 1/1 Running
spire-server-0 2/2 Running
spire-spiffe-csi-driver-dcb6w 2/2 Running
spire-spiffe-csi-driver-fhnn2 2/2 Running
spire-spiffe-csi-driver-gjtd6 2/2 Running
spire-spiffe-csi-driver-kwg7l 2/2 Running
spire-spiffe-csi-driver-nlgmg 2/2 Running
spire-spiffe-csi-driver-sdghj 2/2 Running

Verdict: server 2/2, all 6 agents 1/1, CSI on every node. Actively minting SVIDs.
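The "agent on every node" claim can be checked by counting rather than by eye. The sketch below is an assumption-laden helper (count_ready is a hypothetical name) that counts fully ready, Running pods with a given name prefix from `kubectl get pods --no-headers` output; comparing the result against the node count is left to the caller.

```shell
#!/usr/bin/env bash
# Count pods whose name starts with prefix $1, whose STATUS is Running,
# and whose READY column shows all containers ready (for example 1/1 or 2/2).
# Input on stdin: lines of `kubectl get pods --no-headers` (NAME READY STATUS ...).
count_ready() {
  local prefix="$1"
  awk -v p="$prefix" '
    index($1, p) == 1 && $3 == "Running" {
      split($2, r, "/")
      if (r[1] == r[2]) n++
    }
    END { print n + 0 }'
}
```

Possible usage: compare `kubectl -n spire get pods --no-headers | count_ready spire-agent-` with `kubectl get nodes --no-headers | wc -l`; the two numbers should match if an agent runs on every node.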

Native sidecars enabled PASS

k8s 1.36 supports native sidecar containers — the proxy starts before app init containers, so services with wait-for-db / migration init containers can be meshed.

$ kubectl -n istio-system get deploy istiod -o jsonpath="{.spec.template.spec.containers[0].env[?(@.name==\"ENABLE_NATIVE_SIDECARS\")].value}"
true

Verdict: ENABLE_NATIVE_SIDECARS=true.

Membership — which workloads are in the mesh, and (critically) which are NOT?

Meshed app services PASS

The data-plane proxies. These are the services we deliberately injected (PERMISSIVE).

$ list sre pods that have an istio-proxy container (meshed)
MESHED: cmp-docs
MESHED: hello-world
MESHED: mdm-ai
MESHED: mdm-kernel
MESHED: mdm-sas
MESHED: opa
MESHED: opal-server
MESHED: poi-api
MESHED: realmdm-opa
MESHED: roof

Verdict: 9 production app services + the hello-world canary, all carrying an istio-proxy.

Proxy as a native (init) sidecar CLARIFIED

Services injected after native sidecar mode was enabled (mdm-kernel, opal-server, poi-api) carry istio-proxy as an init container with restartPolicy=Always, not as a regular container. This is why a naive 'list containers' check makes them look unmeshed.

$ kubectl -n sre get pod <svc> -o jsonpath init-containers
mdm-kernel -> initContainers: istio-init,istio-proxy,
opal-server -> initContainers: istio-init,istio-proxy,
poi-api -> initContainers: istio-init,istio-proxy,

Verdict: CONFUSING BUT CORRECT: native sidecars appear under initContainers, not containers. Both forms are fully meshed.
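Because the proxy can live in either place, a membership listing needs to look at both containers and initContainers. The following is a sketch of one way to do that, not the command used for the captured output above; list_meshed is a hypothetical helper, and it prints pod names rather than the app names shown in the report.

```shell
#!/usr/bin/env bash
# Classify pods by where istio-proxy appears.
# Input on stdin: one pod per line, tab-separated:
#   <pod name> TAB <container names, space-separated> TAB <initContainer names>
list_meshed() {
  awk -F '\t' '
    function has_proxy(names,   i, n, parts) {
      n = split(names, parts, " ")
      for (i = 1; i <= n; i++) if (parts[i] == "istio-proxy") return 1
      return 0
    }
    has_proxy($2) { print "MESHED (regular container): " $1; next }
    has_proxy($3) { print "MESHED (native sidecar): " $1 }'
}
```

Possible usage, assuming kubectl's jsonpath range syntax produces the tab-separated input: kubectl -n sre get pods -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.containers[*].name}{"\t"}{.spec.initContainers[*].name}{"\n"}{end}' | list_meshed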

Datastores excluded PASS

Vault, Neo4j, Redpanda (Kafka), Redis, the 3 Postgres StatefulSets and Temporal are deliberately kept OUT of the mesh, because meshing long-lived non-HTTP datastores is a classic cause of subtle outages.

$
  (no LEAK lines above = datastores correctly excluded)

Verdict: No datastore carries a sidecar. Exclusion is clean.
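The command behind this check is not shown above, so the following is only a sketch of how a LEAK check of this kind could be written. find_leaks is a hypothetical helper, and the name pattern passed to it is an assumption about how the excluded pods are named in the cluster.

```shell
#!/usr/bin/env bash
# Print "LEAK: <pod>" for every excluded pod that carries istio-proxy,
# either as a regular container or as an init container.
# $1: regex matching the names of pods that must stay out of the mesh.
# Input on stdin: <pod name> TAB <container names> TAB <initContainer names>
# Exit status: 0 when no leak is found, 1 otherwise.
find_leaks() {
  awk -F '\t' -v pat="$1" '
    $1 ~ pat && (" " $2 " " $3 " ") ~ / istio-proxy / {
      print "LEAK: " $1
      found = 1
    }
    END { exit found + 0 }'
}
```

Possible usage, with an illustrative pattern: feed it the same tab-separated pod listing (name, containers, initContainers) and call find_leaks '^(vault|neo4j|redpanda|redis|opa-pg|poi-db|temporal)'. Empty output and exit status 0 mean the exclusion is clean.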

Identity — do meshed workloads have the right cryptographic identity?

Workload SVID FLAGGED

Each sidecar holds an X.509-SVID. This is the identity OPA will consume as caller.spiffe_id for ABAC.

$ kubectl -n sre exec mdm-kernel-54849fbb75-ckqxg -c istio-proxy -- openssl s_client -showcerts -connect 127.0.0.1:15000 </dev/null 2>/dev/null | openssl x509 -noout -text 2>/dev/null | grep -A1 'Subject Alternative Name' || kubectl -n sre exec mdm-kernel-54849fbb75-ckqxg -c istio-proxy -- curl -s localhost:15000/certs | grep -o 'spiffe://[^"]*' | sort -u | head
spiffe://insignia.tech/ns/sre/sa/default

Verdict: spiffe://insignia.tech/ns/sre/sa/default — correct trust domain (resolves the earlier cluster.local worry). NOTE: workloads currently share the 'default' ServiceAccount, so all share one identity. Per-workload ServiceAccounts are a later refinement for fine-grained policy.
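To make the shared-identity point concrete, the SVID URI can be split into its parts. This is a small sketch with a hypothetical helper name; it assumes the Istio-style layout spiffe://<trust-domain>/ns/<namespace>/sa/<service-account> seen in the output above.

```shell
#!/usr/bin/env bash
# Split an Istio-style SPIFFE ID into trust domain, namespace and service account.
parse_spiffe_id() {
  local id="$1" rest td ns sa
  case "$id" in
    spiffe://*/ns/*/sa/*) ;;
    *) echo "not an Istio-style SPIFFE ID: $id" >&2; return 1 ;;
  esac
  rest="${id#spiffe://}"      # drop the scheme
  td="${rest%%/*}"            # everything up to the first slash
  rest="${rest#*/ns/}"        # drop the trust domain and the /ns/ marker
  ns="${rest%%/sa/*}"         # namespace sits before /sa/
  sa="${rest#*/sa/}"          # service account sits after /sa/
  echo "trust_domain=$td namespace=$ns service_account=$sa"
}
```

For the ID captured above, parse_spiffe_id spiffe://insignia.tech/ns/sre/sa/default prints trust_domain=insignia.tech namespace=sre service_account=default. Any two workloads that yield the same three values are indistinguishable to a policy keyed on caller.spiffe_id.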

External reachability — did meshing break public access? (PERMISSIVE)

mdm-kernel via public ingress PASS

ingress-nginx (NOT yet meshed) → the meshed mdm-kernel pod. Under PERMISSIVE the pod accepts plaintext from the sidecar-less ingress.

$ curl -sS -m 12 -o /dev/null -w "HTTP %{http_code}" https://mdm.lynkedup.cloud/healthz
HTTP 200

Verdict: HTTP 200 — external access to a meshed service is intact.

mdm-ai via public ingress PASS

Same path for the AI service.

$ curl -sS -m 12 -o /dev/null -w "HTTP %{http_code}" https://mdm-ai.lynkedup.cloud/health
HTTP 200

Verdict: HTTP 200.

Keycloak (control: NOT meshed) PASS

Keycloak is in another namespace and not meshed — a control to show non-mesh traffic is unaffected.

$ curl -sS -m 12 -o /dev/null -w "HTTP %{http_code}" https://id.lynkedup.cloud/
HTTP 302

Verdict: HTTP 302 (normal Keycloak redirect).

Gitea (control: other namespace) PASS

Another untouched service, confirming the change is scoped to sre.

$ curl -sS -m 12 -o /dev/null -w "HTTP %{http_code}" https://git.lynkedup.cloud/api/v1/version
HTTP 200

Verdict: HTTP 200.
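The four reachability probes above could be bundled into one repeatable check. The sketch below reuses the URLs and expected status codes from this section; the function names are hypothetical, and treating a curl failure as status 000 is a choice made for this example.

```shell
#!/usr/bin/env bash
# Compare an observed HTTP status with the expected one and report PASS or FAIL.
expect_status() {
  local label="$1" expected="$2" actual="$3"
  if [ "$actual" = "$expected" ]; then
    echo "PASS $label HTTP $actual"
  else
    echo "FAIL $label HTTP $actual (expected $expected)"
    return 1
  fi
}

# Probe each endpoint once; return non-zero if any probe deviates.
run_checks() {
  local failed=0 url expected code
  while read -r url expected; do
    code="$(curl -sS -m 12 -o /dev/null -w '%{http_code}' "$url")" || code="000"
    expect_status "$url" "$expected" "$code" || failed=1
  done <<'EOF'
https://mdm.lynkedup.cloud/healthz 200
https://mdm-ai.lynkedup.cloud/health 200
https://id.lynkedup.cloud/ 302
https://git.lynkedup.cloud/api/v1/version 200
EOF
  return "$failed"
}
```

Calling run_checks after each mesh change gives a quick regression signal for both the meshed services and the untouched controls.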

Dependencies — can meshed services still reach their data + cross-namespace deps?

Database through the sidecar PASS

mdm-kernel/mdm-ai talk to the DO managed Postgres. Their traffic now goes through Envoy — the riskiest common dependency.

$ DB-through-mesh proof: mdm-kernel is Ready (readiness gates on deps) AND external https 200
mdm-kernel readyReplicas=1/1

Verdict: mdm-kernel Ready (readiness gates on deps) + external 200. DB reachable through the mesh.

External-Postgres ServiceEntry + tls:DISABLE FLAGGED

The managed PG already does TLS; Istio must NOT wrap it again. A ServiceEntry + DestinationRule(tls:DISABLE) makes Envoy pass it through.

$ kubectl -n sre get serviceentry do-postgres-external -o jsonpath="{.spec.hosts[*]} ports={range .spec.ports[*]}{.number}{\" \"}{end}"
db-postgresql-nyc2-43694-do-user-26492239-0.a.db.ondigitalocean.com ports=25060

Verdict: ServiceEntry covers :25060 (what mdm-* uses). FLAG: confirm no meshed service needs the :25061 pool before STRICT.
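For readers unfamiliar with the pattern, a ServiceEntry plus a DestinationRule with tls mode DISABLE might look roughly like the manifests rendered below. This is a sketch, not the resources deployed in the cluster: only the ServiceEntry name, namespace, host and port come from this report, while the DestinationRule name, the port name, and the location and resolution settings are assumptions.

```shell
#!/usr/bin/env bash
# Render a ServiceEntry + DestinationRule pair for an external TCP service
# that already speaks its own TLS, so the sidecar must not originate TLS.
# $1: external host, $2: port number.
render_external_pg() {
  local host="$1" port="$2"
  cat <<EOF
apiVersion: networking.istio.io/v1beta1
kind: ServiceEntry
metadata:
  name: do-postgres-external
  namespace: sre
spec:
  hosts:
  - ${host}
  location: MESH_EXTERNAL   # assumption
  resolution: DNS           # assumption
  ports:
  - number: ${port}
    name: tcp-postgres      # hypothetical port name
    protocol: TCP
---
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: do-postgres-external-tls-disable   # hypothetical name
  namespace: sre
spec:
  host: ${host}
  trafficPolicy:
    tls:
      mode: DISABLE   # sidecar forwards the bytes untouched; the app does TLS
EOF
}
```

If a meshed service turns out to need the :25061 pool, one option would be a second ports entry for it, which can be previewed by piping the rendered output into kubectl apply --dry-run=client -f -.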

Vault PERMISSIVE exception PASS

ExternalSecrets-Operator (sidecar-less, other ns) calls Vault. A server-side PeerAuthentication keeps Vault PERMISSIVE so ESO survives the future STRICT flip.

$ kubectl -n sre get peerauthentication vault-permissive -o jsonpath="{.metadata.name}: mode={.spec.mtls.mode} selector={.spec.selector.matchLabels}"
vault-permissive: mode=PERMISSIVE selector={"app.kubernetes.io/name":"vault"}

Verdict: vault-permissive applied, selector app.kubernetes.io/name=vault.
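A workload-scoped PeerAuthentication matching the fields printed above would look approximately like this. The name, namespace, mode and selector are taken from the captured output; the apiVersion and the overall manifest shape are assumptions, since the applied resource itself is not reproduced in this report.

```shell
#!/usr/bin/env bash
# Render a PeerAuthentication that pins the selected workload to PERMISSIVE,
# independent of any namespace-wide or mesh-wide mTLS mode.
render_vault_permissive() {
  cat <<'EOF'
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: vault-permissive
  namespace: sre
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: vault
  mtls:
    mode: PERMISSIVE
EOF
}
```

Because a selector-scoped policy takes precedence over a namespace-wide one, a later STRICT policy for sre would leave this exception in place. The output can be previewed with render_vault_permissive | kubectl apply --dry-run=client -f -.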

mdm-sas (Vault-transit user) meshed PASS

mdm-sas uses Vault transit for plaintext crypto — a meshed service depending on the excluded Vault.

$ mdm-sas readiness
mdm-sas ready=true

Verdict: meshed + ready — meshed→excluded-Vault path works under PERMISSIVE.

Transparency — things that are NOT perfect

mdm-projection-worker crashloop PRE-EXISTING

Full disclosure: this worker is crashlooping (29 restarts).

$ kubectl -n sre get pods -l app=mdm-projection-worker --no-headers | awk "{print \$1,\$2,\$3,\"restarts=\"\$4}"; echo "(NOT meshed; crashloop predates mesh)"
mdm-projection-worker-f897d6bc5-zxjsp 1/1 Running restarts=29
(NOT meshed; crashloop predates mesh)

Verdict: PRE-EXISTING and NOT mesh-related — it is NOT meshed (no sidecar). It was already restarting before today. Will be assessed when we (carefully) mesh the workers in the next phase.
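To keep pre-existing restart noise visible while the next phase proceeds, restart counts can be filtered against a threshold. The helper below is a hypothetical sketch; it assumes the RESTARTS count is the fourth column of `kubectl get pods --no-headers`, as in the command above.

```shell
#!/usr/bin/env bash
# Print pods whose restart count exceeds the threshold in $1.
# Input on stdin: lines of `kubectl get pods --no-headers`
# (NAME READY STATUS RESTARTS ...).
flag_restarts() {
  local threshold="$1"
  awk -v t="$threshold" '$4 + 0 > t + 0 { print $1 " restarts=" $4 }'
}
```

Possible usage: kubectl -n sre get pods --no-headers | flag_restarts 5. Recording the list before and after meshing the workers would show whether the mesh changes the restart pattern.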

Honest assessment

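Working perfectly ✅

Control plane (istiod 1.23.2, trust domain insignia.tech, SPIRE server + 6 agents + CSI, native sidecars enabled); 9 app services plus the hello-world canary meshed; datastores excluded; public access to mdm-kernel and mdm-ai; Keycloak and Gitea controls unaffected; database reachable through the sidecar; Vault PERMISSIVE exception applied; mdm-sas meshed and ready.

Clarified / flagged ⚠️ (correct, but worth knowing)

Native sidecars appear under initContainers, not containers. All meshed workloads currently share the 'default' ServiceAccount and therefore one SPIFFE identity. The external-Postgres ServiceEntry covers only :25060; confirm no meshed service needs :25061 before STRICT.

Pre-existing, not us ℹ️

mdm-projection-worker shows 29 restarts; it is not meshed and was already restarting before the mesh rollout.

Deliberately NOT done yet (next phase) ⏭

Meshing the workers, meshing ingress, and the flip to STRICT.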