Server issues – the end is in sight.

After 5 months of frustration, headaches, nightmares and stress, the BladeForums servers appear to be stable and the remaining tasks are nearly complete. We just need to get DMARC up and running and that will be the final checkbox – stability and speed are already showing great improvements.

I wish this had been a smooth process, but unfortunately it was much more drawn out than it needed to be, largely because of issues that were out of my control.

The server switchover was precipitated by discovering issues with the nightly backup scheme. Because of a hosting provider misconfiguration, MySQL table locking combined with slow I/O transfer rates was causing nightly downtime during peak traffic. We’d specified that backups begin at 0400 across the board, but they configured that only for the web server and had the db server set for midnight.

On top of that, vBulletin 4.x has been End Of Life’d by Internet Brands and further development will not occur as they focus on version 5. Unfortunately, v5 is not scalable for sites of our size, so we had to implement a platform migration strategy to ensure site longevity instead of remaining on legacy software. Step one of this was moving to more up-to-date virtual machines that were capable of running the latest packages and modern software.

So, back in May we had 3 new VMs spun up and anticipated a clean and orderly migration to the new nodes. We configured the new virtual machines as a dedicated www server and two database servers – a master and a slave, where the slave would have the master’s db replicated to it. Backups would then be performed on the slave server, which would completely eliminate the table locking issues that were causing downtime. We did test imports and even did trial runs of a possible successor forums platform to see whether migration would be feasible as an all-in-one effort.
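For the technically curious, here is a rough sketch of the idea behind backing up from the slave rather than the master. It is not our actual backup script: the host name, user and output path are made up for illustration, and it assumes credentials live in a MySQL option file rather than on the command line. The function only builds the `mysqldump` command, so nothing is executed by running it.

```python
def build_backup_command(replica_host, user, output_path):
    """Build a mysqldump argument list that targets the replica (slave).

    Any table locks taken by the dump happen on the replica, so the
    master that serves live forum traffic is not blocked.
    All argument values are hypothetical placeholders.
    """
    return [
        "mysqldump",
        f"--host={replica_host}",        # point at the slave, never the master
        f"--user={user}",
        "--all-databases",               # dump every database on that server
        f"--result-file={output_path}",  # write the dump to a file
    ]


if __name__ == "__main__":
    cmd = build_backup_command(
        "db-replica.example.internal",   # hypothetical host name
        "backup",                        # hypothetical backup user
        "/var/backups/forums.sql",       # hypothetical output path
    )
    print(" ".join(cmd))
```

A scheduler would then run the resulting command at the agreed time (0400 in our case), and the web server and master database carry on serving traffic while the dump runs.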

Instead we experienced a multi-month nightmare. Weeks were spent trying to troubleshoot what was going on, but it boiled down to the servers just… ‘going away’ randomly every few seconds. We proceeded to import the forums data, and what was merely noticeable during soft testing turned into complete site failures under live traffic, so multiple migration attempts (each involving 8+ hours of downtime) failed. We had days of outages as a result, and each time had to revert to the old server architecture.

Eventually it was determined that the VMs were, quite simply, built off bad templates. But instead of completely blowing away the 3 bad VMs and generating 3 fresh ones, our hosting provider chose to create only a new web VM and wanted us to try to make it work. Needless to say, all 3 required replacement, costing me hours of technical support and thousands of dollars in wasted time, customer frustration, and more.

Once we had working VMs, we had a relatively short downtime with an easy migration that went off without any major issues. What should have taken 4 hours had dragged on for more than 4 months.

For the last few weeks we have been battling a persistent issue with Sphinx search – we could manually run the indexer and have new posts appear, but the automated reindexing and delta indexing were not occurring via cron. Yesterday afternoon we began to get that resolved and it was completely fixed by 1 AM this morning. With that, our last obstacle to full operation is implementing DKIM / DMARC to get into compliance with email handling standards.
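For anyone wondering what the DMARC piece actually involves: a DMARC policy is published as a DNS TXT record at `_dmarc.<your domain>`. The small sketch below shows roughly what such a record looks like. The function name, the reporting address and the starting policy of `none` (monitor only) are illustrative assumptions, not our final settings.

```python
def build_dmarc_record(policy="none", report_address=None):
    """Return the text of a basic DMARC TXT record.

    policy: one of "none", "quarantine" or "reject".
    report_address: optional mailbox for aggregate reports (hypothetical here).
    """
    if policy not in {"none", "quarantine", "reject"}:
        raise ValueError(f"unsupported DMARC policy: {policy!r}")
    parts = ["v=DMARC1", f"p={policy}"]  # the version tag must come first
    if report_address:
        parts.append(f"rua=mailto:{report_address}")  # aggregate report destination
    return "; ".join(parts)


if __name__ == "__main__":
    # Prints: v=DMARC1; p=none; rua=mailto:dmarc-reports@example.com
    print(build_dmarc_record("none", "dmarc-reports@example.com"))
```

Starting with `p=none` is a common way to collect reports before tightening the policy to `quarantine` or `reject`, once DKIM signing is confirmed to be working.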

After that is complete, we will be ready to begin Phase II of transitioning away from vBulletin to another forum software platform, one that will offer even greater advantages and vastly benefit the membership in many ways.

For those members reading this, I appreciate your patience and support.

