Servers Crashing and Burning
The Thanksgiving break did not go as smoothly for the servers as I had hoped. A problem appeared that kills the manager server and, after it, the entire grid. It has happened several times recently and seems to be slowly increasing in frequency.
This one had me stumped because there was never anything left to trace. One or more servers were usually uncommunicative, requiring remote-hands assistance from the computer center to reboot them. The rest were all dead.
Yesterday, Sunday morning, I checked the server monitor only to see red text and flames from a few servers that weren't completely dead. The rest were gone.
I got to work cleaning up the zombies. Most of them needed only a slight wave to send them on to the promised land. One of them wasn't dead! It wasn't deadlocked either. It was using 37 GB of memory! What could it be doing?
I started sifting through the call stacks of its thirty-plus threads. This revealed a world suffering the effects of an asteroid collision. The world was destroying all of its buildings. One building was stuck in an infinite loop trying to destroy itself. A peculiarity of its tiny design caused the attempt to fail.
Once understood, the bug was easily fixed. It was also fixed in similar code copied from spacecraft, though the condition could never logically occur in a valid spacecraft design.
Four days were added to all active accounts to make up for the server downtime.