Post by Zhro » Wed Oct 24, 2018 12:48 am

Hello, everyone. I want to provide an update on the state of the server health-wise. I know that this announcement is very, very long overdue and I apologize. I’m going to be talking a little about the health of our hardware and it’s complicated. So grab yourself some popcorn and sit down.

As most of you already know, the server has been very unstable for the past few months. This has been due to an unknown fault that has been very difficult to track down. Why? Because the fault is intermittent. As far as I can tell, the crashes appear to happen at random. I can't trigger it. That means that I can’t test for it. I can only wait for it to happen. It could be once in two weeks or twice in one day. There is no discernible pattern.

Despite my personal silence, I have actively been trying to fix the problem, and my repeated failures are an embarrassment and something that I’ve taken as a personal failure of my ability to maintain our server. For those of you who don’t know, Minedlands is not hosted in the cloud. One of the ways that we are able to pivot and provide the range of services we do is that we’re hosted on real metal. I have personally poured an enormous amount of time, money, and donations into it to build something that I can be proud to offer as a free service for your enjoyment. And this isn’t just about me. It’s about the countless hours that our staff and players have invested into making Minedlands what it is today. You should be proud of yourselves. I know I am.

The server we run on today is an expensive beast. Allow me to explain a bit about the hardware; I will try to keep this as simple as possible. The server runs on not one but TWO 8-core processors and 256GB (gigabytes) of memory (that's 16GB x 16 sticks). Your data is also served from a mirrored pair of high-performance Samsung solid-state drives, no expense spared, arranged so that if one drive fails then the other will pick up where it left off. Power is provided through a backup battery to keep everything alive even during unavoidable interruptions. We also run on expensive network hardware and a premium internet package to support the many simultaneous connections. There are lots more bits and bobs that blink to make up the rest, but all of this inevitably adds up to a machine that I can't even lift. For those of you who know, it is so large and heavy that moving it means pivoting it around on the ground. It's big. It's heavy. And it's expensive.

We've seen a large number of steady improvements over the years with several complete system upgrades to get to where we are today. This has required a significant investment which has been both a blessing and a curse. Because every individual component is expensive, and downtime is unacceptable, I have slowly built up a small surplus of spare parts. I have, on-hand, spare motherboards, processors, memory, hard drives, and controllers. But what I don't have is a complete MIRROR of the server. What this means is that when a failure presents itself, the server comes down, I work to identify the source of the problem, and I swap out the part. Clean. Efficient. Simple. This means that while we may go down for a few hours, the downtime won’t ever be counted in days.

So why have we been so unstable? To put it simply, I have been unable to locate the cause of the hardware failure. And because we only crash intermittently, I cannot just swap something out and then test it on-demand. We have to wait days, possibly even weeks, to see if we're stable. This is terrible but I don’t have any other option. Even if I wanted to build an exact copy of the server to provide a complete fail-over (this has been considered), not only would it be an enormous expense, but it would only serve to further entrench us in the aging technology that we currently use.

I have wanted us to move to newer and significantly faster hardware for some time now but the expense has always been prohibitive. It means buying all new hardware. This is something that I've been working towards. But any savings that I've made have always had to be put towards spare parts. At this time I SHOULD have enough to keep us stable for the foreseeable future. But this just hasn't been the case.

I want to be honest with everyone and I will be as open as possible about what's actually going on. A few months ago the server started to experience intermittent restarts. At first I thought that it was a power issue as it usually happened while I was asleep. I put us on a backup battery and found that this was not the case. I thought it was the motherboard and swapped that out. Same problem. Possibly the power supply? Swapped that too. We run on very expensive error-correcting memory so it shouldn't be the memory. A processor fault should have fail-over and diagnostics. What's left? Hard drives shouldn't be capable of bringing down the server.

I am at a complete LOSS as to the source and it has been a very sore problem for me, the server, our reputation, and you. I apologize for the terrible experience you've had to endure: the loss of your time and days of work to a crash. I wish I had an answer but I am still, months later, trying to find the cause of the problem.

I will be taking the server down shortly to attempt a very desperate move. I will be removing one of our processors and half of the server's memory. If this doesn't solve the problem then I will swap these out for the other processor and the other half of the memory. I'm seriously running out of ideas, as this will finally be the equivalent of replacing everything in the entire machine.

Please continue to bear with me as I try so very, very hard to make things right. Minedlands is not going away. It is a labor of love. It's just gotten a little sick. Send your prayers that we’ll have an answer soon and that all of this will be a distant memory before long.
