FurrIX — A Furry, Hobbyist Virtual Internet Exchange » Category

Wednesday, July 22, 2026

[Incident Report #037][vIX] Platform Issues and Tooling….

What are Incident Reports?
As a community‑operated and governed virtual internet exchange, FurrIX maintains
a commitment to open and honest communication with its members. Sometimes
during the operation of the exchange, we run into issues that impact the state of
the exchange and member connectivity. When this happens, the FurrIX operations
team publishes an incident report to ensure all members remain informed. As a
hobbyist‑rooted vIX, we aim to keep communication clear, accessible and practical
to the best of our ability.

Summary

On July 22nd, 2026 at approximately 12:45 EST, the FurrIX vIX entered a degraded
operational state. We lost access to PHYONE’s Proxmox control plane and experienced
partial failure of NS1’s DoT/DoH services. The root cause was a failure in newly
developed SSL certificate distribution tooling, which corrupted certificate stores on
both PHYONE and NS1. This resulted in service outages and exposed gaps in our
recovery lifelines.

What Happened?
Our lead engineer was developing new automation to handle SSL certificate
distribution across the vIX, aiming to reduce manual volunteer workload.
During a trial run on the distro network, the tooling behaved unexpectedly
and corrupted the certificate stores on both PHYONE and NS1.

Furthermore, once we noticed that we no longer had UI access, our volunteers
tried the recovery options we had thought were fully in place and quickly found
that neither lifeline was fully operational. For roughly two hours, the
vIX was running headless with limited administrative control.

We were seeing the following issues:

Control Plane Damage — Proxmox UI on PHYONE offline
NS Partial Failure — NS1’s DoT/DoH endpoints failing
RNDC and other control‑plane operations became unstable

The code appeared correct during review, but a permissions issue slipped through and
only manifested once deployed. A right fucken oops, that was.

What did we do to fix this?
We learned that while our OOB recovery network was operational, our recovery shell
accounts had not been set up on PHYONE. We promptly asked the data center to attach
a KVM to the machine as soon as possible. Once that was in place, our volunteers
deployed our recovery account via the machine shell, then detached from the KVM and
continued the recovery per our DRP for ‘Access Loss, PHYONE’. We also modified our
SSL handling scripts so that NS1 sets the correct permissions on the cert store and
PHYONE validates certs before installing them and reloading services.

During the recovery, we also learned that the same issue was affecting tooling on NS1 itself.
We fixed that tooling so it also validates its certs and keys before reloading any
services.

While the recovery was ongoing, we had a few blips in networking as the KVM was connected
to and disconnected from PHYONE, but we should be fully alive and working again. Oh yeah, we also
added Discord alerts to our scripts so that we can keep an eye on their execution and
catch any problems that might occur.

Response & Recovery Actions
1. Restoring Access to PHYONE
- Verified OOB recovery network was operational
- Discovered recovery shell accounts were not configured
- Contacted the data center and requested KVM attachment
- Once KVM was online, deployed recovery account via local shell
- Detached KVM to minimize network blips and continued
recovery per DRP “Access Loss, PHYONE”

2. Repairing the Certificate Store on PHYONE
- Identified corrupted cert stores on PHYONE
- Updated SSL handling scripts to:
— Validate certs before installation
— Validate key/cert pairing
— Reload services only after validation passes

3. Fixing NS1 Tooling and Cert Store
- Found identical validation/permission issues affecting NS1’s tooling
- Updated scripts to ensure NS1 validates certs and keys before reloading services
- Applied the correct permissions (bind:bind) to the cert store
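
The validate, install, then reload order described in steps 2 and 3 could be sketched roughly as below. This is a minimal illustration, not the actual FurrIX tooling: the function names, the file modes (0o640 for keys, 0o644 for certs), and the `reload_service` callback are all assumptions made for the example, and it assumes unencrypted PEM keys. The pairing check relies on Python's standard `ssl.SSLContext.load_cert_chain`, which raises if the key does not match the certificate.

```python
import os
import shutil
import ssl
import tempfile


def pair_is_valid(cert_path, key_path):
    """Return True only if the cert and key both parse and belong together."""
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
    try:
        # Raises if either file is unreadable or malformed, or if the
        # private key does not match the certificate.
        ctx.load_cert_chain(certfile=cert_path, keyfile=key_path)
    except (ssl.SSLError, OSError):
        return False
    return True


def atomic_install(src, dst, mode, owner=None):
    """Copy src over dst via a temp file so dst is never left half-written."""
    with open(src, "rb") as fh:
        data = fh.read()
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(dst) or ".")
    try:
        with os.fdopen(fd, "wb") as out:
            out.write(data)
        os.chmod(tmp, mode)  # set the mode explicitly on every install
        if owner:
            # e.g. owner="bind" for bind:bind; needs sufficient privileges
            shutil.chown(tmp, user=owner, group=owner)
        os.replace(tmp, dst)  # atomic rename within the same filesystem
    except BaseException:
        os.unlink(tmp)
        raise


def deploy(new_cert, new_key, live_cert, live_key, reload_service, owner=None):
    """Install a new pair and reload only if validation passes first."""
    if not pair_is_valid(new_cert, new_key):
        return False  # live store untouched, nothing reloaded
    atomic_install(new_key, live_key, 0o640, owner)   # assumed key mode
    atomic_install(new_cert, live_cert, 0o644, owner)  # assumed cert mode
    reload_service()
    return True
```

With this shape, a bad or mismatched pair leaves the live cert store exactly as it was, and the service reload never fires.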

Root Cause
A permissions and validation oversight in new SSL distribution tooling caused certificate
corruption on PHYONE and NS1. Lack of fully configured recovery accounts delayed restoration.

Lessons Learned
Going forward with the development and upkeep of the FurrIX vIX,
our volunteers will be applying the following lessons to future scripts
and any custom tooling:
- Automation touching cert stores must validate before overwrite
- Permissions must be explicitly set every time
- Recovery accounts must be deployed and tested on all hypervisors
- Monitoring hooks (Discord alerts) are essential for early detection
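
As an illustration of the monitoring-hook lesson, a script could report its outcome to a Discord webhook along these lines. This is only a sketch under assumptions: the function names, the message format, and the User-Agent string are hypothetical, and the webhook URL would come from your own Discord channel settings. It uses only the Python standard library and posts a JSON body with a `content` field, which is the basic form Discord webhooks accept.

```python
import json
import urllib.request


def build_alert(webhook_url, script, status, detail=""):
    """Build (but do not send) a webhook request describing a script run."""
    text = f"[{status}] {script}"
    if detail:
        text += f": {detail}"
    # Discord limits message content to 2000 characters, so truncate.
    body = json.dumps({"content": text[:2000]}).encode("utf-8")
    return urllib.request.Request(
        webhook_url,
        data=body,
        headers={
            "Content-Type": "application/json",
            "User-Agent": "cert-tooling-alerts",  # hypothetical identifier
        },
        method="POST",
    )


def send_alert(request, timeout=10):
    """Send the alert; a failed alert must never abort the calling script."""
    try:
        with urllib.request.urlopen(request, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except OSError:  # covers URLError, HTTPError and socket timeouts
        return False
```

Keeping the request builder separate from the sender makes the message format easy to check without touching the network.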

For onlookers wondering why this was not sandboxed more aggressively before
being put into production, the hard truth is that FurrIX does not have a replicated
offsite environment for testing these kinds of tools. Much of the time, we can only
comb through our own tooling before we deploy it, and sometimes not-so-obvious
issues crop up that our volunteers had not thought about beforehand.


Wednesday, July 15, 2026

[Incident Report #036][DC] Network Cabling Issues….

What Happened?
On July 14th, 2026 at around 2045 EST, the FurrIX vIX went offline for about thirty-five
minutes. After doing our own troubleshooting, the volunteers at FurrIX reached out to the
data center for some insight. We found that our secondary server was still
alive but our primary was not responding.

We were seeing the following issues:

vIX reachability — The virtual exchange was temporarily in a degraded state.
NS1 Failure — Members were relying on NS2’s zone cache temporarily.
MA BGP Failures — Our BGP sessions with members temporarily failed.
Web/email Failures - These services stopped responding.
Games-3P Failure - Game services went offline.

What did we do to fix this?
We contacted the data center to determine the scope of the event and were informed
that techs at the upstream data center had found a loose network cable, which
was the cause of our server going offline. They reseated the cable, and the exchange
is now back online with services reachable once again.

As of this post, all FurrIX vIX services have recovered.


Wednesday, July 1, 2026

[Transparency Report #012][Documentation] vIX Monitoring

What are Transparency Reports?
As a community‑operated and governed virtual internet exchange, FurrIX maintains
a commitment to open and honest communication with its members. From time to
time, operational work may occur that affects the exchange or its supporting infrastructure.
When this happens, the FurrIX operations team publishes a transparency report to
ensure all members remain informed. As a hobbyist‑rooted vIX, we aim to keep
communication clear, accessible and practical to the best of our ability.

What Happened
During the MFN to FurrIX migration, a number of larger infrastructure projects took priority
and used up our volunteers’ free time to get the exchange ready for full operation.
As a result, the monitoring stack (LibreNMS + graph export scripts) fell out of sync with the
new network layout. After the change to new PI space, a stale firewall rule on PHY Two’s
edge router blocked the monitoring server’s requests, causing all transit graphs to stop updating.
Because this was a volunteer‑run transition with limited available time, the issue persisted
longer than usual, roughly four months, while other critical work was completed.
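
One low-effort way to notice graphs that have silently stopped updating is to check how recently the exported files were written. The sketch below is purely illustrative and not part of the FurrIX monitoring stack: the function name, the paths, and the age threshold are all hypothetical.

```python
import os
import time


def stale_files(paths, max_age_seconds, now=None):
    """Return the paths that are missing or were last modified too long ago."""
    now = time.time() if now is None else now
    stale = []
    for path in paths:
        try:
            age = now - os.path.getmtime(path)
        except OSError:
            stale.append(path)  # a missing graph counts as stale
            continue
        if age > max_age_seconds:
            stale.append(path)
    return stale
```

A scheduled job could run this against the graph export directory and raise an alert whenever the returned list is not empty.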

Changes to the exchange:
The outdated firewall rule was corrected, restoring connectivity between the web server
and LibreNMS. Once access was restored, all graph‑generation scripts came back online
and were patched with new tooling for extended monitoring, both internal and public
facing. All vIX flow‑rate graphs are now current and visible again.

Are exchange operations affected?
Both volunteers and members now have full visibility into how the vIX carries data and
how usage trends evolve over time. Aside from improved monitoring, normal operations
continue as expected.


Monday, June 1, 2026

[Incident Report #035][DC] Power Failures

What Happened?
On June 1st at approximately 0145 EST, the FurrIX virtual exchange became unreachable. Shortly
after, our BGP announcements began withdrawing, causing our prefixes to disappear from upstream
looking glasses. Any members with devices tunneled into the exchange — phones, homelabs or
PCs — temporarily lost internet access and routing through the vIX.

We were seeing the following issues:

vIX reachability — The virtual exchange was fully offline.
NS1/NS2 Failure — Members could not reach either authoritative name server.
Prefixes Left BGP — Our /48 and /44 announcements temporarily stopped.
PHY One and Two power loss — Both hosts experienced unclean reboots.
Backup failures — No backups were generated for June 1st.

What did we do to fix this?
We contacted the data center to determine the scope of the event and were informed
that WII were experiencing major power issues at the data center. Reviewing outage maps
for the region and weather reports, we became aware that severe thunderstorms passed
through the Kansas City area during the same time frame, which may have affected
the region, but we do not have concrete information on this right now.

As of this post, all FurrIX vIX services have recovered, our prefixes are visible in upstream
looking glasses again and member reachability has returned to normal.


Monday, May 25, 2026

[Transparency Report #007][OPERATIONS] Full Environment Rebuild Scheduled WIP

What is happening?
The FurrIX vIX is currently going through a rebuild of the exchange, and it is taking a little
longer than we expected. Due to a miscommunication, reinstalling the physical server’s OS
took a bit of time.

What has been reworked so far:
- PHY One: The Proxmox host has been rebuilt
- Core Router: We condensed our IPv6 edge and core router into one VM
- Nardoragon Router: Our services router is back online with new config
- Catos vIX Access Router: Has been pulled from backup and reconfigured
- NS1/Games-3P: These member facing services are back online
- Web Server: Our websites are back online

Parts of the exchange still being worked on:
- Mail-NG: the mail server has to be brought back online
- Ikus vIX Access Router: Secondary member facing router still being reconfigured
- NMS: We currently have no monitoring; it needs to be reconfigured

Are exchange operations affected?

Yes — temporarily.
During the rebuild window, routing and services will be unavailable as systems are rebuilt
and renumbered. Once the work is complete, normal operations will resume with improved
stability, ease of expansion, better upkeep and clarity.

