DEVBLOG – Technical 2023-10-09 16:00
At a time when everyone is focused on capturing modsters and kromas in-game, this devblog will focus on something a little different. We'll be talking today about certain technical topics that are completed or currently ongoing in DOFUS, and we'll also take the opportunity to look back at a couple of incidents that occurred during the March 2023 update. Our goal in this post is to share our progress with you, and to give you a more detailed look behind the scenes at some of our major technical efforts (and the associated challenges). We'll also be presenting the changes and benefits that have resulted from these efforts, or that will result from them in the future. So without further ado… here we go!
One login server to connect them all
The first topic we want to talk about is the new login server. What is it, exactly? And what's this server for? In a nutshell, when you start the game from the launcher, you don't land directly on the game servers, but are first directed to the login server (as you may have guessed from the name). It plays multiple roles:
- It displays the list of all servers and their status (up, restarting, down, etc.). Unlike the game servers, the login server needs to be aware of all existing servers (whereas the Tylezia server, for example, has no idea that the Hell Mina server even exists).
- It checks which server(s) the player has characters on (to display the number of characters in the server list).
- It collects information about the player's account (e.g. whether they are subscribed, banned, etc.).
- It verifies server restrictions (e.g. single-account or tournament servers).
- And finally, it sends all of this information to the game server to log the character in on their server (including for reconnections when a connection is lost during combat).
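To make these roles a little more concrete, here is a minimal sketch in Python. It is purely illustrative and is not the actual login server code: the class names, fields and the `build_handoff` function are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Account:
    account_id: int
    subscribed: bool = False
    banned: bool = False
    # server name -> number of characters the account has there
    characters: dict[str, int] = field(default_factory=dict)


@dataclass
class GameServer:
    name: str
    status: str = "up"            # e.g. "up", "restarting", "down"
    single_account: bool = False  # hypothetical restriction flag


def server_list(account: Account, servers: list[GameServer]) -> list[dict]:
    """Build the list shown to the player: status plus character count."""
    return [
        {
            "name": s.name,
            "status": s.status,
            "characters": account.characters.get(s.name, 0),
        }
        for s in servers
    ]


def build_handoff(account: Account, server: GameServer,
                  already_connected_here: bool) -> dict:
    """Check account state and server restrictions, then return the
    information that would be passed on to the chosen game server."""
    if account.banned:
        raise PermissionError("account is banned")
    if server.status != "up":
        raise ConnectionError(f"{server.name} is {server.status}")
    if server.single_account and already_connected_here:
        raise PermissionError("single-account restriction")
    return {
        "account_id": account.account_id,
        "server": server.name,
        "subscribed": account.subscribed,
    }


# Example: only the login server knows about both game servers.
servers = [GameServer("Tylezia"), GameServer("Hell Mina", status="restarting")]
player = Account(1, subscribed=True, characters={"Tylezia": 3})
listing = server_list(player, servers)
handoff = build_handoff(player, servers[0], already_connected_here=False)
```

The point of the sketch is the division of labor: the login server aggregates everything that spans servers, and each game server only receives the handoff for a single player.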
Historically, we had three login servers (with the very utilitarian names of co-2, co-3 and co-4).
When the player first entered the game, they were redirected to one of these three servers at random (without worrying about whether it was full or not). We also had a server called co-1, which played a slightly different role. Its main function was to manage "services" in the broadest sense of the word (bans, pending character changes such as recoveries, class, color or name changes, friend/enemy lists, pending gifts and purchases, KE transactions made on the site, etc.).
Why change the system?
We had many reasons for wanting to start over from scratch with the login server system. But what really triggered the change was Temporis 7.
With the old server, changing a server's status (from "Closed" to "Open") required us to shut down and reboot the login server in order for it to notice the change. As a result, all players who were waiting on the server were disconnected when the login server shut down, then all logged back in at once. For a weekly maintenance, this restart occurred at a moment when overall traffic was low, so it didn't cause any real problems.
But at the start of a Temporis event, we try to start things up in the afternoon when the maximum number of players are available.
We tell our players in advance when the servers are going to open, and many of them are in the starting blocks and ready to go a few minutes before the actual start time. The result is that when the starting gun actually fires and the login servers restart, those servers have to manage a huge number of players all at once – which they often struggled to do, with just three servers and an imbalanced distribution of players across the three of them.
The massive rush of logins then caused latency on the login servers, further reducing their ability to process players waiting to get in and creating a snowball effect of ever-worsening lag. The queue would then get longer, creating even more lag, resulting in a longer queue…
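This feedback loop can be illustrated with a toy model. The sketch below is not a measurement of the real servers: the arrival rate, the base processing rate and the slowdown factor are invented numbers, chosen only to show how a big enough initial rush can push a queue past the point where it can ever drain.

```python
def simulate_queue(burst, arrivals=60.0, base_rate=100.0,
                   slowdown=0.001, ticks=200):
    """Return the queue length after `ticks` steps.

    Toy assumptions (all hypothetical):
    - `burst` players are already waiting when the server opens;
    - `arrivals` new players show up on every tick;
    - the server handles `base_rate` logins per tick when idle, but
      slows down as the queue grows:
      rate = base_rate / (1 + slowdown * queue).
    """
    queue = float(burst)
    for _ in range(ticks):
        rate = base_rate / (1.0 + slowdown * queue)
        queue = max(0.0, queue + arrivals - rate)
    return queue


small_burst = simulate_queue(500)    # drains completely
large_burst = simulate_queue(1000)   # keeps growing
print(small_burst, large_burst)
```

With these made-up numbers, the tipping point sits at roughly 667 waiting players: below it, the server processes logins faster than new players arrive and the queue empties. Above it, each tick adds more players than the slowed-down server can handle, which is the snowball effect described above.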
For Temporis 7, we reached an all-time high for queue length on the first night, and we had no solution for the problem. Everything was sorted out the next morning with a restart during a low-traffic period.
Another factor was that the login server has been around since "day one". The core code dated back to the earliest days of DOFUS, and the many changes made here and there over the years had made the system increasingly complicated to understand and maintain. Meanwhile, many technical improvements added to the game servers since 2005 had not been passed on to the login servers.
Finally, we also had other problems less visible to players, relating to admin rights for DOFUS team accounts and an excessive number of calls to various APIs aimed at collecting account data (subscriptions, friends, etc.), each of them contributing in their own ways to an overloaded system. This aging system was clearly in need of an overhaul.
Practically speaking…
The login server overhaul was a major project that kept two developers busy full-time for four months. A number of additions were also made along the way to fix or add various features. We also needed to understand how the system worked as a whole, while managing an IT infrastructure that made sense at the time but isn't entirely ideal for our current needs. In short, we ended up having to update a good 90% of the login server. At this point, only the communication components for the game server and the client are still original code (though not for long…).
As is often the case with major technical projects, the number of changes and specific cases meant that certain issues weren't discovered until they were deployed in-game. This was the case for the March 28 update. Two examples:
- Single-account restrictions: these restrictions have likely been the biggest and most complicated problems to resolve, in no small part because it's not easy for our in-house testers to reproduce the specific conditions that players have encountered. We encountered two major problems in deploying the new login server:
  - When a player lost their in-game Internet connection, their connection to the game was disrupted, and the login server thought they were still in-game. When the player then logged back in, the resulting conflict caused them to be locked out as a result of single-account restrictions.
  - To ensure that multiple accounts aren't connecting from the same computer (forbidden in single-account mode), our usual procedure would be to identify the computer's info in detail. In certain rare cases, we were unable to establish an ID for a given computer. In these situations, we would assign a generic identifier – an identifier shared by all players. As a result, whenever anyone logged in with this generic identifier, no other player with a computer ID problem would be able to connect.
- If a player lost their Internet connection in mid-fight, the login server didn't know that the player was still in combat. As a result, the login server wouldn't reconnect them to the fight when they logged back in, and the character was kicked out of the fight and died immediately.
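The two single-account problems above can be sketched as follows. This is one possible way to avoid them, not a description of the fix that was actually deployed: the `SingleAccountGuard` class, the 30-second timeout and the policy for unidentified computers are all assumptions made for the example.

```python
import time
from typing import Optional


class SingleAccountGuard:
    """Tracks which account is using each computer on a single-account server."""

    def __init__(self, session_timeout: float = 30.0):
        self.session_timeout = session_timeout  # hypothetical value
        # machine_id -> (account_id, time the session was last seen alive)
        self.active: dict[str, tuple[int, float]] = {}

    def try_login(self, machine_id: Optional[str], account_id: int,
                  now: Optional[float] = None) -> bool:
        now = time.monotonic() if now is None else now
        if machine_id is None:
            # No reliable computer ID. Do NOT fall back to one shared
            # placeholder ID, or unrelated players would block each other.
            # Letting the login through is an assumed policy for this sketch.
            return True
        current = self.active.get(machine_id)
        if current is not None:
            other_account, last_seen = current
            stale = now - last_seen > self.session_timeout
            # Same account on the same computer is treated as a reconnection.
            # A different account is refused unless the old session is stale.
            if other_account != account_id and not stale:
                return False
        self.active[machine_id] = (account_id, now)
        return True


guard = SingleAccountGuard(session_timeout=30.0)
first = guard.try_login("pc-1", account_id=1, now=0.0)         # allowed
reconnect = guard.try_login("pc-1", account_id=1, now=5.0)     # allowed
second_account = guard.try_login("pc-1", account_id=2, now=10.0)  # refused
after_timeout = guard.try_login("pc-1", account_id=2, now=100.0)  # allowed
unknown_a = guard.try_login(None, account_id=3, now=0.0)       # allowed
unknown_b = guard.try_login(None, account_id=4, now=0.0)       # allowed
```

The key ideas are that a session which has gone silent eventually stops counting as "in-game", and that a missing computer ID is handled as its own case instead of becoming a key shared by every affected player.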
Opportunities
Needless to say, our goal with these changes was not to introduce new bugs with the update. And while this change has had its share of challenges for our team, it also opens the door to specific improvements for players:
- Server openings are now smoother than ever. Whether for updates, maintenance or Temporis launches, the system is now able to self-manage and automatically adapt to changing player volumes. We also no longer need to log all players out before restarting. We are now more flexible when it comes to adding and removing servers from the login server's server list.
- Debugging and adding features to the login server is now easier. To take one example, we had a problem with logging back into the game when a player was disconnected from the Kolossium. We were unable to fix this problem with the old login server, but it was fairly easy to fix it with the new one.
- We can now do maintenance on the login server much more easily (if we need to fix something urgently, for example). This enabled us to correct a number of problems in the few days after the update without having to do a full maintenance.
Save files and data format
Two years ago, we started work on a second major technical issue: real-time saves. As usual, we'll start with a quick review of how things looked two years ago.
How did saves work in 2021?
Every action taken by a player had to be saved (as you'd expect, since otherwise the player would lose aspects of their progress). This information was saved to databases specific to each game server. In parallel, certain information was also saved to text files (and not to databases).
These different saves were (and, in certain cases, still are) made at various different times:
- When the player logged out, we saved all of their data (e.g. their map position, inventory and quest progress).
- When a server shut down, certain elements like houses or public paddocks were saved.
- Other data elements were saved in specific situations. For example, we saved a player's progress in Infinite Dreams only when they left it.
- And finally, in other situations, these saves were being made in real time, as soon as the player completed the action.
So why these disparities between the different systems? Keep in mind that DOFUS will be 19 years old this year, and that the possibilities and limitations associated with servers and networks have changed quite a bit in that time. Depending on when various features were implemented, certain optimization choices we made may have since become obsolete.
You're likely also familiar with the backup saves that are made at regular intervals (midnight, 6 a.m., 11 a.m. and 4 p.m. Paris time). These backups create a copy of each player's state at a given time T so that we can restore them in case of a rollback. These backups are made in addition to the copy of the database that is generated at the same time, in order to capture all the information that may not already be present in the database.
The 2021 save system had a number of disadvantages:
- If the server didn't shut down properly or if players were not logged out properly, certain data was lost, forcing us to do a rollback. Why? Because we had no way of knowing who had actually been logged out and who hadn't. Let's take an example: suppose that player A had given an item to player B and the server shut down incorrectly while the players were logged in (with player B being logged out correctly, but not player A). In our save file, we would then wind up with player A still having the item (because its disappearance from their inventory hadn't been saved), but with player B having it too. The item would thus be duplicated, with one copy in each inventory (and in the opposite case where A had been logged out correctly but not B, the item would simply have disappeared).
- The system required us to make backups at regular intervals and during maintenance (which made each maintenance longer). Our goal is to eliminate these backups so that maintenance can be done more quickly.
- When we needed to do a rollback, we were forced to use the data from a backup (and therefore potentially go several hours back in time) instead of using data from just before the rollback. For example, if we identified a problem at 3:00 p.m., we'd have to go back to the data from the 11:00 a.m. backup instead of using data from 2:55 p.m.
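The item duplication example above comes down to two halves of one exchange being saved at different times. A minimal sketch of the alternative, where both halves are written in a single database transaction, is shown below using Python's built-in `sqlite3` module. The table layout, the `give_item` function and the `crash_after_debit` switch are hypothetical and only serve the illustration. They say nothing about the actual DOFUS database.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE inventory ("
    "player TEXT, item TEXT, qty INTEGER, PRIMARY KEY (player, item))"
)
db.execute("INSERT INTO inventory VALUES ('A', 'sword', 1)")
db.commit()


def owned(db, player, item):
    row = db.execute(
        "SELECT qty FROM inventory WHERE player = ? AND item = ?",
        (player, item),
    ).fetchone()
    return row[0] if row else 0


def give_item(db, giver, receiver, item, crash_after_debit=False):
    """Move one unit of `item`: both inventories change, or neither does."""
    with db:  # commits on success, rolls back if an exception is raised
        cur = db.execute(
            "UPDATE inventory SET qty = qty - 1 "
            "WHERE player = ? AND item = ? AND qty > 0",
            (giver, item),
        )
        if cur.rowcount != 1:
            raise ValueError(f"{giver} does not own {item}")
        if crash_after_debit:
            # Simulates the server failing halfway through the exchange.
            raise RuntimeError("simulated crash")
        db.execute(
            "INSERT OR IGNORE INTO inventory VALUES (?, ?, 0)",
            (receiver, item),
        )
        db.execute(
            "UPDATE inventory SET qty = qty + 1 WHERE player = ? AND item = ?",
            (receiver, item),
        )


# A failure halfway through leaves the item with player A only.
try:
    give_item(db, "A", "B", "sword", crash_after_debit=True)
except RuntimeError:
    pass
after_crash = (owned(db, "A", "sword"), owned(db, "B", "sword"))

# A normal exchange moves it to player B only.
give_item(db, "A", "B", "sword")
after_transfer = (owned(db, "A", "sword"), owned(db, "B", "sword"))
```

In both cases exactly one sword exists afterwards, which is the property the logout-based saves could not guarantee.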
So what about the data format?
Over time, the data for different elements was stored in different formats (based on our needs, the existing save systems, and the storage capacity available at the time). Some of those formats are still easy to use now. However, in the past we had chosen a compressed and optimized data format that used minimal storage space, especially for large amounts of data. This format now causes lots of problems for us, which is why we also needed to change it at the same time as our changes to how data is saved.
This data format has a number of defects, especially in terms of readability and the ease of finding or adding new information. The compressed format is a simplified representation made up of a series of characters separated by special characters (_, |, etc.). The data string for a character's inventory, for example, can get extremely long, as each item in the inventory is separated from the next item by a special character, and each item is also broken down into multiple sub-records (which are in turn separated by a different special character) to indicate the item's quantity, inventory position, effects, etc. This format prevents us from adding a number of different features that we'd like to implement in-game.
What we've changed
We decided to adopt a different format for some of this data. While less optimized in terms of storage space, it is more appropriate for our needs, easier to read, and allows us to work with this data much more easily. For example, it allows us to easily target and save just one value, instead of having to save all the data whenever anything changes. Transitioning to this new data format was also a good opportunity to clean up the saved data and get rid of certain information that is no longer useful.
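The difference between the two approaches can be sketched in a few lines of Python. The compact string below is invented for the example: the real separators, field order and contents are not documented here, and neither is the actual new format. The sketch only shows why a keyed structure is easier to read, update and extend than a delimiter-separated string.

```python
# Hypothetical compact record: items separated by "|", fields by "_",
# in the fixed order item_id, quantity, position, effects ("," separated).
compact = "1234_1_5_str+10,vit+20|5678_12_63_"


def parse_compact(raw):
    items = []
    for record in raw.split("|"):
        item_id, qty, pos, effects = record.split("_")
        items.append({
            "id": int(item_id),
            "quantity": int(qty),
            "position": int(pos),
            "effects": effects.split(",") if effects else [],
        })
    return items


def serialize_compact(items):
    records = []
    for it in items:
        effects = ",".join(it["effects"])
        records.append(f"{it['id']}_{it['quantity']}_{it['position']}_{effects}")
    return "|".join(records)


def set_quantity_compact(raw, item_id, qty):
    """Changing one value means parsing and rewriting the whole string."""
    items = parse_compact(raw)
    for it in items:
        if it["id"] == item_id:
            it["quantity"] = qty
    return serialize_compact(items)


updated_compact = set_quantity_compact(compact, 5678, 11)

# Structured alternative: one record per item, each field named.
structured = {str(it["id"]): it for it in parse_compact(compact)}
structured["5678"]["quantity"] = 11   # touch a single value
structured["5678"]["locked"] = True   # add a field without affecting other items
```

In the compact version, adding a field would change the meaning of every position in every record. In the structured version, the other items are simply left alone.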
For save files, the change was broken down into multiple stages, and is still in progress as of today.
In 2022, we started by eliminating all the information that had been stored in text files and saving it with the rest of the data in dedicated databases.
In addition, for a year now we've been using our updates to introduce real-time saves of player data and to move straight to the most flexible format possible. We've done this with guilds, alterations, and more recently with alliances. For all the other data (e.g. wells of infinite dreams, houses, marketplaces, paddocks, stables, inventories, and characteristics, to name just a few), we've been gradually going through them one by one to update them. As a result, houses have been saved in real time since September 20, 2022, marketplaces since February 14, 2023, wells of infinite dreams since March 1, 2023, and so on.
However, not everything has been converted yet to the data format we eventually want to use, nor is everything saved in real time yet. We still have to get to some of the most sensitive areas (inventories, banks, characteristics, stables). Some of these updates are already done, and are currently going through waves of intensive testing to be sure we don't miss anything. We're also waiting for the "least disruptive" time to deploy them in-game – we try to choose the least busy periods to minimize the impact of any problems that may arise.
And problems can indeed arise with these changes. This devblog is a good opportunity to look back at three problems that have come up in-game as a result of earlier changes:
- In the update of March 28, 2023, we changed the data format for paddocks and mount certificates. The format for stables remained unchanged. Less than an hour after reopening the servers, we identified a compatibility problem between these two formats. It turned out that when a mount was moved from the stable to the paddock (or vice versa), mounts' pregnancy information was not correctly transferred from one format to the other. The old data format (used by the stables) was unable to handle this loss of information, which resulted in an error when saving it. The stable was then saved as having no mounts inside.
- In connection with our changes to paddock saves in the March 28 update, we noticed that mount fatigue data was not being saved completely. We wanted to correct this problem in the May 16 maintenance. The bug fix went through. However, a few hours after the maintenance was complete, we realized that the save request was being sent every time any of the mounts in a paddock changed their position, which was generating far more requests than the server could handle. As a result, we had to do an exceptional extra maintenance the next day (May 17) to remove the save data update until we could find another solution. Since then, we've optimized the code to send fewer save requests, and the finished update on this point was added to the game without problems as part of the August 8 maintenance.
- For several months now, achievements have been saved in real time in the desired format. However, the conversion from the old data format to the new one only took place when the player logged in. In the May 30 maintenance, we wanted to apply the data format update to all the characters who had never logged in yet since the change. Unfortunately, we realized during the maintenance that storing all achievements for all players took up far too much space, which would make backups (during maintenance or at regular intervals) significantly longer. We therefore had to stop the migration and do a rollback to the pre-maintenance backup (which goes to show why it's a good idea to do one before starting maintenance). We are currently working on optimizing the data to be saved and the format to use so that this data takes up less space.
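The paddock problem described above (one save request for every mount movement) is a classic case for coalescing writes. The sketch below shows one common technique, given purely as an illustration: changes are collected in memory and each paddock is written at most once per flush. It is not the optimization that was actually shipped, and the `PaddockSaver` class and its `write` callback are hypothetical.

```python
class PaddockSaver:
    """Collects changes and writes each paddock at most once per flush."""

    def __init__(self, write):
        self._write = write  # callable(paddock_id, state) doing the real save
        self._dirty = {}     # paddock_id -> {mount_id: latest position}

    def mount_moved(self, paddock_id, mount_id, position):
        # Only remember the change: later moves overwrite earlier ones.
        self._dirty.setdefault(paddock_id, {})[mount_id] = position

    def flush(self):
        """Meant to be called on a timer rather than on every movement."""
        pending, self._dirty = self._dirty, {}
        for paddock_id, state in pending.items():
            self._write(paddock_id, state)
        return len(pending)


writes = []
saver = PaddockSaver(lambda paddock_id, state: writes.append((paddock_id, dict(state))))

# 10 mounts each move 100 times in the same paddock: 1,000 movements.
for step in range(100):
    for mount in range(10):
        saver.mount_moved("paddock-1", mount, (step, mount))

flushed = saver.flush()        # a single save request for the paddock
flushed_again = saver.flush()  # nothing left to save
```

Here 1,000 movements result in one save containing the final position of each mount. The trade-off is that anything changed since the last flush is lost if the server stops unexpectedly, so the flush interval has to be chosen with that in mind.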
Opportunities
Even though these changes caused certain problems when first rolled out, just like the new login server, their purpose is to pave the way for a number of improvements and new features in the future:
- As mentioned earlier, rollbacks should be less frequent, and when they are necessary, they'll use a backup that's not as far back in time.
- Maintenance time will be somewhat reduced. This is an essential first step on the way to drastically reducing the time that each maintenance takes.
- The change in data format also gives us a number of opportunities:
  - Ability to create custom item categories
  - Ability to lock certain items in your inventory
  - Easier injections when a Temporis server closes down
  - Ability to more easily find items in players' inventories in case of any problems
- Once everything is saved in real time, we will finally be able to unlock the ability to perform server transfers at any time, without being dependent on weekly maintenances. This feature is the "final boss" of the save format project.
- …and that's just a partial list of the possibilities that these changes open up for us.
Conclusion
Wow, talk about a wall of text… Thank you for taking the time to read this devblog. We're well aware that technical changes are not the most interesting topic for many of you. In general, all that's visible to players is that certain maintenance updates sometimes introduce new bugs. I hope I've succeeded in explaining the main challenges associated with these changes, along with our reasons for wanting to make them. Although risky at times, these changes are essential to ensuring a long future for the game, and to our ability to add new features that have been impossible in the past.
For those of you who do have an interest in technical topics, I hope I've been able to clearly present all the essential information and keep you interested throughout this long devblog post. And for everyone else, stay tuned – other devblogs on more gameplay-oriented issues are on their way soon…