We’ve had two major customer-impacting outages over the last few weeks, so I thought it would be helpful to explain what happened and the actions we’re taking to improve our infrastructure to be more resilient.
The first incident happened on February 7, 2026. We’ll call this the “crosslink incident.”
The second incident happened on February 24, 2026. We’ll call this the “cert expiry incident.”
Before I explain these two issues, first a bit of background.
When we first started Pinax, we were supporting a few blockchains and associated services. Our infrastructure used a home-grown configuration based on Linux containers (LXC) to manage services. Over the last few years we’ve been adding more blockchains and more services… We recognized that our original deployment pattern wasn’t going to scale. So for the last 2.5+ years we’ve been on a journey to move most things to Kubernetes (k8s). (I say “most” because we’re not sure we’ll ever get Solana to run in k8s? Has anyone been successful at that?)
Migrating the underlying platform behind the scenes while services are live is of course a “fun” challenge. Part of the platform migration also involves migrating our entire monitoring and incident management stack. Last year we chose incident.io as our partner and continue to migrate services over to it.
Ok, so with the background out of the way, on to the two issues.
Crosslink incident
To provide high availability, we have services running in multiple datacentres, with redundant connections between them. On February 7 the primary link between two of the datacentres failed and traffic was routed over the backup link. Unfortunately, due to a configuration issue, the services in Kubernetes did not work over this backup link. Anything running in LXC worked, but cross-datacentre services in k8s did not. Once the configuration was adjusted, services were restored. The primary link was restored by the telecom service provider a few hours later.
Technical explanation:
The MTU of the backup link was smaller than that of the primary link, so the MTU configured in Cilium (our Container Network Interface, or CNI, plugin) in Kubernetes needed to be adjusted accordingly.
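To make the failure mode concrete, here’s a small sketch of the MTU arithmetic involved. The 50-byte figure is the standard VXLAN-over-IPv4 encapsulation overhead; the link MTU values are illustrative assumptions, not our actual numbers.

```python
# Illustrative MTU arithmetic for tunneled (e.g. VXLAN) pod traffic.
# VXLAN overhead: outer Ethernet (14) + IP (20) + UDP (8) + VXLAN (8) = 50 bytes.
VXLAN_OVERHEAD = 50

def max_pod_mtu(link_mtu: int, overhead: int = VXLAN_OVERHEAD) -> int:
    """Largest pod-level MTU that still fits on the link after encapsulation."""
    return link_mtu - overhead

def fits_on_link(pod_packet_size: int, link_mtu: int,
                 overhead: int = VXLAN_OVERHEAD) -> bool:
    """Does an encapsulated pod packet of this size fit on the link?"""
    return pod_packet_size + overhead <= link_mtu

primary_link_mtu = 1500  # assumed value for illustration
backup_link_mtu = 1400   # assumed smaller MTU on the backup link

pod_mtu = max_pod_mtu(primary_link_mtu)         # 1450: sized for the primary link
print(fits_on_link(pod_mtu, primary_link_mtu))  # True
print(fits_on_link(pod_mtu, backup_link_mtu))   # False: encapsulated packets exceed 1400
```

In other words, packets sized for the primary link no longer fit once encapsulated traffic is rerouted over the smaller backup link; lowering the CNI’s configured MTU to `max_pod_mtu(backup_link_mtu)` (1350 in this sketch) lets pod traffic traverse either link.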
Cert expiry incident
We have multiple Kubernetes clusters. The main one has now been running for more than 1 year. Kubernetes certificates expire after 1 year by default. Our cert was not renewed in time, and the services stopped working.
Technical explanation:
k0s (our Kubernetes distro) is supposed to always renew certs during a controller restart. There’s a known issue where restarting the controller doesn’t actually regenerate the certs.
That explains why our client certificates weren’t regenerated even though we’d had a few controller restarts before (k0s version upgrade, node maintenance, etc.).
A pull request fixing this issue was merged about 2 weeks ago, and the fix is included in the version of k0s released last week.
We have implemented the documented manual workaround for the issue and will upgrade to a newer version of k0s in due course.
Customer impact
Ok, but why did all this impact customers? After all, we have multiple clusters and multiple datacentres, right? Well, the short answer is that not every service is deployed in every datacentre (too expensive). Some services run in only one datacentre or another. And we’re in the process of moving things from LXC to k8s.
Part of the authentication service was migrated from LXC to k8s last year. However, full replication across datacentres has been pending our rollout of Cilium Cluster Mesh. Cluster mesh is coming soon, but until it is fully deployed we’re reviewing which components need to be manually replicated between datacentres to provide better resilience.
Conclusion
We’re on a journey to provide modern and robust blockchain infrastructure. We’ve encountered a few unexpected issues as we move from LXC to k8s. We commit to taking issues that impact customers seriously and to adjusting our rollout plans accordingly. Please get in touch if you need further information. Also follow our status page for updates… the entire engineering team is working to make more details of what is happening available there.


