This is mostly a question for the PWE staff, but I'm sure the community will weigh in as well.
Now that the 5+ hour long outage on 5/2/2013 has finally been resolved, could we get some comments about what happened, and some reassurances about what's being done to prevent it in the future?
Was it a piece of networking gear that failed? A server? A connectivity issue that was more or less outside of your control? Something that caused multiple cascading failures?
Also, as a general question, what is your uptime goal for STO? It's clearly not the "five nines" (99.999% uptime) standard for major online services, as that only allows about 5 minutes of downtime per year, and the weekly maintenance alone is way more than that.
I understand that you run a business, and while you could theoretically have 5 independent data centers all over the globe, any one of which is capable of running the entire system, that would be prohibitively expensive. But neither are you just running the server on a single box plugged into a cable modem. There's a balance point somewhere in between, and it's different for every company and situation.
My question is: how much downtime (planned and unplanned) do you consider "acceptable?" I really hope that the 5 hour unexpected downtime is as unacceptable to you as it is to the players, as reflected by the 200 posts/hour on the two different threads about the outage.
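For anyone who wants to check the "five nines" arithmetic above, here is a minimal sketch that converts an availability target into the downtime it allows per year. The function name and the list of targets are just illustrative choices, not anything PWE or Cryptic has published.

```python
# Average year length in minutes (365.25 days to account for leap years).
MINUTES_PER_YEAR = 365.25 * 24 * 60


def allowed_downtime_minutes(availability_percent: float) -> float:
    """Return the minutes of downtime per year permitted by an availability target."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


# Illustrative targets only: two through five nines.
for target in (99.0, 99.9, 99.99, 99.999):
    minutes = allowed_downtime_minutes(target)
    print(f"{target}% -> {minutes:.1f} min/year ({minutes / 60:.2f} hours)")
```

Five nines works out to a little over 5 minutes per year, which matches the figure quoted above, while three nines allows roughly 8.8 hours.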
Who wants to bet that recent shenanigans in STO are related to the infrastructure for Neverwinter getting some hardcore stress testing with the opening of that game's beta?
Not likely, as each game is hosted on different servers. But what if the client was the issue? It's the launcher: if the launcher goes down, so do all the games, since it's just the client's access to the game.
-Spells
|| Open Door Policy ||
| Dues Ex Mechina |
Fleet Leader
I for one am HAPPY they stayed for overtime to fix the issue at hand.
Stayed for overtime? You do realize that when the crash happened, any 9-5 employee was already in the middle of dinner. The servers are in California, and the crash happened around 6:30 PM Eastern time.
I give kudos to the employees who fixed the server catastrophe, but none to the company itself. There is no excuse for server crashes in this day and age when you release a new IP, and it was the release of the Neverwinter open beta that pushed the system past its limits. It is well known that server overload happens on release day. They could rent servers or a whole farm to prevent such things and still make a huge profit.
Join Date: Nobody cares.
"I'm drunk, what's your excuse for being an idiot?" - Unknown drunk man.
It looks like it's yet another of countless data centre issues, because the forums reached through startrekonline are unreachable but perfectworld is reachable. The data centre they are using has some very serious connectivity issues that are driving players off; they need to pack up and move out.
B
and so no doubt other games as well. So I'd guess the problem was systemic, probably in the network and possibly outside of Cryptic's control.
They still need to make this their top priority to address and see that this does not happen again.
It's pretty clear that they have all sorts of issues, just from small examples like the massive lag with items in/out of bank/account bank.
It would be exceptionally foolish to invest millions developing a new game with a big title like NW and not also invest to make sure the launch is as smooth as possible.
1) They are the same data centers
2) They have shared resources
3) They share a common networking set(Login authentication)
Neverwinter very much CAN overload STO. They're hosted together, they share a chat server (custom channels go to all Cryptic games and can even be reached outside the games), patch server, and account server. That's more integrated than, say, two different domains in the same data center, which itself is a situation where one site can cause an outage on the other.
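To make the shared-resource point concrete, here is a rough sketch of a dependency map. The shared service names (account, chat, patch) come from the post above; the per-game "shards" entries and the whole layout are assumptions for illustration, not Cryptic's actual architecture.

```python
# Services the post describes as shared across Cryptic games.
SHARED_SERVICES = {"account", "chat", "patch"}

# HYPOTHETICAL map: each game depends on the shared services plus its own
# game servers. The "*-shards" names are invented for this example.
GAME_DEPENDENCIES = {
    "STO": SHARED_SERVICES | {"sto-shards"},
    "Neverwinter": SHARED_SERVICES | {"nw-shards"},
    "CO": SHARED_SERVICES | {"co-shards"},
}


def affected_games(failed_service: str) -> list:
    """Return the games that lose a dependency when one service fails."""
    return sorted(
        game for game, deps in GAME_DEPENDENCIES.items() if failed_service in deps
    )


print(affected_games("account"))    # a shared service failing hits every game
print(affected_games("nw-shards"))  # a game-specific failure stays contained
```

Under these assumptions, a failure in any shared service reaches all three games, while a failure in one game's own servers does not spread.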
99.999% is rare for server hosting, generally attained by multicast redundancy which is not feasible for real time applications and not cost feasible for most others. Three nines is considered beyond outstanding for real time applications, including MMOs and other game servers. Counterintuitively, the few MMOs capable of that kind of uptime are not the biggest names, but the smallest - largely unmaintained by their dwindling development teams but not yet abandoned by their players, going months or years without patches and neglecting regular maintenance until the server crashes or dies.
I mean, an extended surprise downtime is frustrating to all involved, especially those affected during peak hours, but let's not use silly and unrealistic standards to judge it.
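Going the other direction, here is a quick sketch that turns hours of downtime into an availability percentage. The 5-hour outage is from this thread; the 2-hour weekly maintenance window is purely an assumed number for illustration, since nobody here has stated the real length.

```python
HOURS_PER_YEAR = 365.25 * 24  # average year, including leap years


def availability_percent(downtime_hours: float, period_hours: float) -> float:
    """Return availability (in percent) given total downtime over a period."""
    return 100 * (1 - downtime_hours / period_hours)


# The single 5-hour outage, measured against a full year.
outage_only = availability_percent(5, HOURS_PER_YEAR)

# ASSUMPTION: 2 hours of planned maintenance per week, 52 weeks per year.
ASSUMED_WEEKLY_MAINTENANCE_HOURS = 2
with_maintenance = availability_percent(
    5 + ASSUMED_WEEKLY_MAINTENANCE_HOURS * 52, HOURS_PER_YEAR
)

print(f"Outage only:      {outage_only:.3f}%")
print(f"With maintenance: {with_maintenance:.3f}%")
```

On its own, one 5-hour outage in a year still lands above three nines; it is the assumed planned maintenance that pulls the total below 99%.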
I was one of the players that unexpectedly got severed from the game due to the outage on 5/2. As I also have a position in the IT field I am familiar with the issues in a client/server model.
I want to say that I have been VERY impressed at both your service availability and the response time of your team in light of the issues you are presented with on a daily basis. I am also amazed that they do not happen more often given the size of the population that is pounding your server farm every minute of every day.
If I were given the chance to grade your performance I would say A+ guys!
Thank you for your continuing efforts to keep the game growing and moving in new and interesting ways.
Randy Wood
AKA Jacob Wood@woodtrandy in game
Admiral Randy Wood (UFP) - 1st Temporal Operations Group
Admiral Womek (RR) - Temporal Operations Group
Admiral Renan Wartok (RR) - Temporal Operations Group
Admiral Raynal Waton (UFP) - Temporal Operations Group
General Kragh (KDF) - House of Ta'al
C'mon. I think the phrase "S**t happens" applies here. Unexpected things happen in life. They just do. I'm just glad that they were able to fix the problem in a timely manner, and I'm sure they want to prevent any such occurrences in the future. After all, it is bad for business if being able to play becomes very unreliable. But seriously, why should they have to give a detailed explanation of exactly what steps they are taking to prevent this in the future? That's pretty much a waste of their time. They will just implement it and move on. And so should we.
_____________________
Come join the 44th Fleet.
startrek.44thfleet.com
+1. (This particular outage didn't affect me, as BOTH my desktop and laptop died within 2 days of each other, so I haven't been able to do anything at home. And yes, I'm posting this from work during a break. But as an IT professional myself, and someone who's played a lot of MMOs, I'd say Cryptic in general has had a very good record of keeping their games up, running, and available for the majority of their players.) 100% uptime is a myth, and unexpected hardware failures (which sometimes also take out the backups) do happen. There is no such thing as 'fail-proof' hardware.
Me? I hope to be back up and running in STO by the end of next week at the latest (assuming none of the parts I want for my new system are backordered).
Formerly known as Armsman from June 2008 to June 20, 2012
PWE ARC Drone says: "Your STO forum community as you have known it is ended...Display names are irrelevant...Any further sense of community is irrelevant...Resistance is futile...You will be assimilated..."
As long as it doesn't reveal any top seekrit/proprietary info, can Cryptic post what happened to cause the huge May 2 outage? Information will go a long way toward calming game-deleting rage, especially if it was something like a transformer blowing that was completely out of Cryptic's control. In fact, posting a quick breakdown of what caused any outage or major game error, if known, would be appreciated by the community. We would probably shout less for more servers, more bandwidth, or keeping the NW, CO, and STO servers separated if we knew those issues weren't factors.
"Susse-thrai" had been the name bestowed upon her, half in anger, half in affection, by her old crew on Bloodwing; the keen-nosed, cranky, wily old she-beast, never less dangerous than when you thought her defenseless, and always growing new teeth far back in her throat to replace the old ones broken in biting out the last foe's heart. Romulans: left one homeworld, lost another, third time's the charm?
I would also like some detail as to the problem and its solution and/or steps to prevent it in the future.
Why? They told us it was a hardware failure already. Do you want the exact part and model number of what went poof? There isn't really much you can do to prevent hardware failure, even at home. Sometimes stuff just breaks.
Stuff happens. Five hours is a long time, and I get that if it's the only time you can play, it's rather frustrating, but you literally CANNOT expect the servers to be running full-tilt all the time.
During the initial downtime, the message was about Neverwinter overloading the servers. Remember that a PW account can be used on ALL games.
So I'm guessing that with Neverwinter just released, the "login" server (where it checks that you have an active PW account) is getting hit hard and probably failed, which means no one can log on. Once the login server authenticates you, it passes you to the proper game server (in our case STO).
Everyone must go through the login server before hitting the game server. I am guessing this is what happened.
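The flow guessed at above can be sketched as a toy model: one shared login server with limited capacity sits in front of every game server. The class, the capacity number, and the account names are all made up for illustration; this is a guess at the shape of the problem, not how PWE's login service actually works.

```python
class LoginServer:
    """Toy shared login server that can only authenticate a fixed number of
    accounts. The capacity limit is an assumption made for this example."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.served = 0

    def authenticate(self, account: str) -> bool:
        # Once the server is saturated, every further login fails,
        # no matter which game the player wants to reach.
        if self.served >= self.capacity:
            return False
        self.served += 1
        return True


def connect(login: LoginServer, account: str, game: str) -> str:
    """Authenticate first, then hand the player off to the game server."""
    if not login.authenticate(account):
        return "login failed"
    return f"{account} -> {game} game server"


login = LoginServer(capacity=3)

# A burst of Neverwinter logins uses up the shared capacity...
results = [connect(login, f"nw_player{i}", "Neverwinter") for i in range(3)]
# ...so the next STO player is locked out even if the STO servers are fine.
results.append(connect(login, "sto_player", "STO"))

for line in results:
    print(line)
```

In this model the STO game server never fails; the STO player is turned away purely because the shared front door is full.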
I've seen triple-A MMOs have unexpected downtimes of 12 hours or more. Granted, that was 3 plus years ago, but stuff does indeed happen. One time an MMO I played fervently went down for 36 hours - the community went into pandemonium, as they couldn't raid Pandemonium for a whopping day and a half (10 points if you know which MMO off the top of your head), but the company did prorate all paid subscriptions for the time lost.
Point is, it happened, the server is back up, and we can play again with a reasonably small inconvenience. Let's not waste time complaining on the forums, we all have alts to level and reps to grind!
My guess is "hope" keeps people not playing but posting on the forums. For others, it's a path of sad realization and closure. Grieving takes time. The worst "haters" here love the game, or did at some point.
They are probably using virtual servers (Hyper-V, KVM, etc.). In my experience, if you have a hardware failure on the server banks that house those, everything comes tumbling down. The upside is that, since the servers are virtual, they'll come back pretty quickly once the main Hyper-V server comes back online.
Haven't played many MMOs, have you? No MMO has 99% uptime. They all go down a lot, and for lots of reasons, some as dumb as a missing image.
Yeah, BUT this recent crash was just one incident in an ever-increasing list: constant server disconnects, lag with bank items, extreme lag in general, and launcher connectivity issues.
Is it possible that Neverwinter might make things worse? Some concern is understandable.
The point being missed here is Resource Allocation.
New Game NWN = Full resources.
Old game STO = whatever is left over.
Blame those supporting NWN for the outages.
This is patently incorrect. NWN and STO are not only run out of the same data center, but they share fairly significant resources - login and authentication server, chat server, necessary networking for game access... there's a fairly high level of integration in the hardware that runs STO, NWN, and CO.
As a matter of fact, I'd be willing to bet that NWN going live and this issue will see upgrades to a fair amount of the infrastructure that all of Cryptic's games share. Sure, NWN going into open beta is probably what overloaded the login/authentication service, but now it's a known weak link in the chain that they'll want to address (most likely with new hardware) before NWN goes live. In the end, NWN is going to do nothing but benefit STO and CO.
__________________________________________________
Joined January 2010.
He may be referring to financial and manpower resources rather than resources of a computational nature.
Comments
I for one am HAPPY they stayed for overtime to fix the issue at hand.
Thank you Cryptic.
-Spells
| Dues Ex Mechina |
Fleet Leader
Completed Starbase, Embassy, Mine, Spire and No Win Scenario
Nothing to do anymore.
http://dtfleet.com/
Visit our Youtube channel
http://nw.perfectworld.com/news/?p=880781
and so no doubt other games as well. So I'd guess the problem was systemic, probably in the network and possibly outside of Cryptic's control.
It's not like you can simply stage new content without at least shutting down every now and then.
http://nw.perfectworld.com/news/?p=880781
Has the STO team offered an explanation?
http://sto-forum.perfectworld.com/showthread.php?t=651451
Pretty much the same thing, right there in the news network forum.
Please review my campaign and I'll return the favor.
Point is, it happened, the server is back up, and we can play again with a reasonably small inconvenience. Let's not waste time complaining on the forums, we all have alts to level and reps to grind!
Cheers!
In regard to hating Star Trek 2009:
He may be referring to financial and manpower resources rather than resources of a computational nature.
You are correct Sir!