MEXAHOTABOP · 08-23-2020, 02:30 PM

I'm surprised no one has asked about this before, and I hope Haxus will answer this question.

Why do you use gdb to freeze a crashed server process, instead of gathering a core dump and restarting it?
If you have the binlog enabled in your SQL database, then I suppose a restart will not lead to any long-term unsolvable problems.

QuakeIV · (This post was last modified: 08-23-2020, 06:33 PM by QuakeIV.)

There is a certain philosophy that auto-restarting a server after a fault is a bad idea, because the fault is likely to recur immediately and crash it again.

I personally think it's a relatively reasonable one. Many faults are repeatable, so trying to reboot won't do much good, and it will do some small degree of harm to the servers to reboot them repeatedly for hours.

MEXAHOTABOP · 08-23-2020, 07:24 PM

Almost any modern database has some kind of binary log or transaction log, which makes it possible to restore a working state from a snapshot or from delayed slave replication. So this should be reasonably safe to use, since he only needs to roll back some of the transactions related to the server where the crash occurred to prevent an instant recurrence.

Also, the information about the crash will be fully saved, since a core dump contains a full dump of allocated memory, process state, and registers.
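
For reference, a minimal sketch of how a launcher script might make sure core dumps are allowed before starting a server process. It assumes a Unix-like host; the function name `enable_core_dumps` is made up for this example and nothing here is taken from the actual Hazeron server setup.

```python
import resource


def enable_core_dumps():
    """Raise this process's soft core-size limit to its hard limit.

    Child processes started afterwards inherit the limit.
    Returns the (soft, hard) pair now in effect.
    """
    _, hard = resource.getrlimit(resource.RLIMIT_CORE)
    # Raising the soft limit up to the hard limit needs no extra privileges.
    resource.setrlimit(resource.RLIMIT_CORE, (hard, hard))
    return resource.getrlimit(resource.RLIMIT_CORE)
```

Whether and where a core file is actually written also depends on system configuration outside the process. A resulting core file can be inspected afterwards by passing the server binary and the core file to gdb.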

QuakeIV · (This post was last modified: 08-23-2020, 11:44 PM by QuakeIV.)

It's true that you could probably deal with most of the downsides. With a limited rollback (maybe 10 minutes or something) and maybe a delay on restarting, you would both reduce the odds of a repeat failure and reduce the fatigue on the servers from repeated restarts.

Notably, a selective rollback on just the database entries that the failed server was working on (rather than bringing everything down to do a global rollback and then coming back up) might be challenging to implement, depending on how the backend works.
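
The "delay on restarting" idea can be sketched as a small supervisor with exponential backoff and a restart cap. This is only an illustration under assumed numbers: the names `restart_delays` and `supervise`, the 5 second base delay, and the cap of 5 restarts are all hypothetical and not part of any existing server tooling.

```python
import subprocess
import time


def restart_delays(base=5.0, factor=2.0, cap=300.0, max_restarts=5):
    """Yield the wait in seconds before each restart, growing up to a cap."""
    delay = base
    for _ in range(max_restarts):
        yield min(delay, cap)
        delay *= factor


def supervise(cmd, delays, sleep=time.sleep):
    """Run cmd and restart it after each abnormal exit.

    Waits the next delay before every restart. Stops on a clean exit or
    when the delays run out, so a repeatable fault cannot loop for hours.
    Returns the number of restarts performed.
    """
    restarts = 0
    delays = iter(delays)
    while True:
        if subprocess.run(cmd).returncode == 0:
            return restarts
        delay = next(delays, None)
        if delay is None:
            return restarts  # give up and leave it for a human
        sleep(delay)
        restarts += 1
```

Capping the number of restarts is what addresses the repeatable-fault concern: after the last attempt the process stays down instead of crashing again in a loop.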

**Deantwo** · (This post was last modified: 08-24-2020, 10:28 AM by Deantwo.)

Duplicate of https://www.hazeron.com/mybb/showthread.php?tid=2204.

Even a selective 10 minute rollback wouldn't save the servers if the issue causing the crash is: the completion of a >10 minute manufacturing process, asteroid impacting a world, player clicking a commonly used button, or similar.

This has been discussed a lot, and there really isn't any good solution that couldn't horribly backfire.

MEXAHOTABOP · (This post was last modified: 08-24-2020, 10:38 AM by MEXAHOTABOP.)

Here the question is to Haxus, about the implementation he uses and why he uses it.
There the request is to do something about it in general.

**Deantwo** · (This post was last modified: 08-24-2020, 10:38 AM by Deantwo.)

(08-24-2020, 10:36 AM)MEXAHOTABOP Wrote: Here the question is to Haxus, about the implementation he uses.
There the request is to do something about it in general.

The optimal solution is to just fix the bugs causing the crashes, which is easiest to do with the servers locked up in GDB.

MEXAHOTABOP · (This post was last modified: 08-24-2020, 11:27 AM by MEXAHOTABOP.)

As I wrote before, a core dump contains all allocated memory and registers.
You will get exactly the same process in the same state. The only thing that changes between the frozen process and the restored one is the system PID.

AnrDaemon · 08-24-2020, 03:21 PM

(08-23-2020, 07:24 PM)MEXAHOTABOP Wrote: Almost any modern database has some kind of binary log or transaction log,

Which is all irrelevant if the error is not in the database logic but in the business logic (i.e. the game code).
Just recall the issue with resource quality being out of place in newly spawned systems after a server restart. The reason? The data was stored in the database with a different precision than expected. That was perfectly fine as far as the database was concerned, but it was an actual bug in the game logic.
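
A precision mismatch of this kind can be illustrated with a short sketch. The thread does not say which types were involved, so this assumes, purely for illustration, a value held as a 64-bit double in game code and stored in a 32-bit float column; `store_as_float32` is a made-up helper that imitates that round trip.

```python
import struct


def store_as_float32(value):
    """Round-trip a number through 4-byte float storage, as a single-precision column would."""
    return struct.unpack("<f", struct.pack("<f", value))[0]


quality = 0.1                       # hypothetical value held as a double in game code
reloaded = store_as_float32(quality)

print(reloaded == quality)              # False: the reloaded value differs slightly
print(abs(reloaded - quality) < 1e-7)   # True: the difference is tiny but real
```

The database accepts and returns the value without complaint, so no transaction log or rollback would flag anything. Only code that compares the value against what it held before the restart sees the difference.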
