outage debrief and general availability policy question

squigish Member Posts: 0 Arc User
This is mostly a question for the PWE staff, but I'm sure the community will weigh in as well.

Now that the 5+ hour long outage on 5/2/2013 has finally been resolved, could we get some comments about what happened, and some reassurances about what's being done to prevent it in the future?

Was it a piece of networking gear that failed? A server? A connectivity issue that was more or less outside of your control? Something that caused multiple cascading failures?

Also, as a general question, what is your uptime goal for STO? It's clearly not the "five nines" (99.999% uptime) standard for major online services, as that only allows about 5 minutes of downtime per year, and the weekly maintenance alone is way more than that.
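For reference, the downtime budget implied by an availability target is simple arithmetic. The sketch below assumes a 365.25-day year, and the function names are illustrative only (they do not come from any Cryptic or PWE tooling).

```python
# Minutes in an average year, assuming 365.25 days (an assumption for this sketch).
MINUTES_PER_YEAR = 365.25 * 24 * 60


def downtime_budget_minutes(availability_percent: float) -> float:
    """Minutes of downtime per year allowed by a given availability target."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


def availability_after_outage(outage_minutes: float) -> float:
    """Best possible yearly availability (percent) after one outage of this length."""
    return 100 * (1 - outage_minutes / MINUTES_PER_YEAR)


if __name__ == "__main__":
    for pct in (99.0, 99.9, 99.99, 99.999):
        print(f"{pct}% uptime allows {downtime_budget_minutes(pct):.1f} min/year")
    # A single 5-hour (300-minute) outage, ignoring all other downtime.
    print(f"After a 5-hour outage: {availability_after_outage(300):.2f}% at best")
```

Five nines works out to roughly 5.3 minutes per year, while a single 5-hour outage on its own already caps a year's availability at about 99.94%, before any weekly maintenance is counted.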

I understand that you run a business, and while you could theoretically have 5 independent data centers all over the globe, any one of which is capable of running the entire system, that would be prohibitively expensive. But neither are you just running the server on a single box plugged into a cable modem. There's a balance point somewhere in between, and it's different for every company and situation.

My question is: how much downtime (planned and unplanned) do you consider "acceptable?" I really hope that the 5 hour unexpected downtime is as unacceptable to you as it is to the players, as reflected by the 200 posts/hour on the two different threads about the outage.

Comments

  • srspells Member Posts: 0 Arc User
    edited May 2013
    Haven't played many MMOs, have you? No MMO has 99% uptime; they all go down a lot, and for lots of reasons, some as dumb as a missing image.

    I for one am HAPPY they stayed for overtime to fix the issue at hand.

    Thank you Cryptic.
    -Spells
    || Open Door Policy ||
    | Dues Ex Mechina |
    Fleet Leader
  • momawmomaw Member Posts: 0 Arc User
    edited May 2013
    Who wants to bet that recent shenanigans in STO are related to the infrastructure for Neverwinter getting some hardcore stress testing with the opening of that game's beta?
  • srspells Member Posts: 0 Arc User
    edited May 2013
    Not likely, as each game is hosted on different servers. But what if the client was the issue? It's the launcher; if the launcher goes down, so do all the games, as it's just the client's access to the game.
    -Spells
    || Open Door Policy ||
    | Dues Ex Mechina |
    Fleet Leader
  • jetwtf Member Posts: 1,207
    edited May 2013
    srspells wrote: »
    I for one am HAPPY they stayed for overtime to fix the issue at hand.

    Stayed for overtime? You do realize that when the crash happened, any 9-5 employee was already in the middle of dinner. The servers are in California, and the crash happened around 6:30 PM Eastern time.

    I give kudos to the employees who fixed the server catastrophe, but none to the company itself. There is no excuse for server crashes in this day and age when you release a new IP, and it was the release of the Neverwinter open beta that pushed the system past its limits. It is well known that server overload happens on release day. They could rent servers or a whole farm to prevent such things and still make a huge profit.
    Join Date: Nobody cares.
    "I'm drunk, whats your excuse for being an idiot?" - Unknown drunk man. :eek:
  • srspells Member Posts: 0 Arc User
    edited May 2013
    Neverwinter may have overloaded the client server, but not STO's. Essentially, it was the client that put Cryptic's games out of order.
    -Spells
    || Open Door Policy ||
    | Dues Ex Mechina |
    Fleet Leader
  • nicha0 Member Posts: 1,456 Arc User
    edited May 2013
    It looks like it's yet another of countless data centre issues, because the forums reached through startrekonline are unreachable but those reached through perfectworld are not. The data centre they are using has some very serious connectivity issues that are driving players off; they need to pack up and move out.
    Delirium Tremens
    Completed Starbase, Embassy, Mine, Spire and No Win Scenario
    Nothing to do anymore.
    http://dtfleet.com/
    Visit our Youtube channel
  • nyxadrill Member Posts: 1,242 Arc User
    edited May 2013
    Being in Europe I must have been in bed when the outage happened :P however it looks like it affected Neverwinter as well:

    http://nw.perfectworld.com/news/?p=880781

    and so no doubt other games as well. So I'd guess the problem was systemic, probably in the network and possibly outside of Cryptic's control.
  • anazonda Member Posts: 8,399 Arc User
    edited May 2013
    99.999% works great for a webpage, or even cloud services... but not for games that need constant updates, additions, and bugfixes.

    It's not like you can simply stage new content without at least shutting down every now and then.
    Don't look silly... Don't call it the "Z-Store/Zen Store"...
    Let me put the rumors to rest: it's definitely still the C-Store (Cryptic Store) It just takes ZEN.
    Like Duty Officers? Support efforts to gather ideas
  • ussultimatum Member Posts: 0 Arc User
    edited May 2013
    nyxadrill wrote: »
    and so no doubt other games as well. So I'd guess the problem was systemic, probably in the network and possibly outside of Cryptic's control.

    They still need to make this their top priority to address and see that this does not happen again.

    It's pretty clear that they have all sorts of issues, just from small examples like the massive lag with items in/out of bank/account bank.


    It would be exceptionally foolish to invest millions developing a new game with a big title like NW and not also invest to make sure the launch is as smooth as possible.
  • soundwisdom Member Posts: 248 Arc User
    edited May 2013
    srspells wrote: »
    Not likely, as each game is hosted on different servers. But what if the client was the issue? It's the launcher; if the launcher goes down, so do all the games, as it's just the client's access to the game.
    -Spells


    1) They are the same data centers
    2) They have shared resources
    3) They share a common networking set (login authentication)
  • hevach Member Posts: 2,777 Arc User
    edited May 2013
    Neverwinter very much CAN overload STO. They're hosted together, they share a chat server (custom channels go to all Cryptic games and can even be reached outside the games), patch server, and account server. That's more integrated than, say, two different domains in the same data center, which itself is a situation where one site can cause an outage on the other.


    99.999% is rare for server hosting, generally attained by multicast redundancy which is not feasible for real time applications and not cost feasible for most others. Three nines is considered beyond outstanding for real time applications, including MMOs and other game servers. Counterintuitively, the few MMOs capable of that kind of uptime are not the biggest names, but the smallest - largely unmaintained by their dwindling development teams but not yet abandoned by their players, going months or years without patches and neglecting regular maintenance until the server crashes or dies.


    I mean, an extended surprise downtime is frustrating to all involved, especially those affected during peak hours, but let's not use silly and unrealistic standards to judge it.
  • avertyoureyess Member Posts: 59 Arc User
    edited May 2013
    So this came off the Neverwinter site... it's one of the first things you see on their board:

    http://nw.perfectworld.com/news/?p=880781

    Has the STO team offered an explanation?
  • capnmanx Member Posts: 1,452 Arc User
    edited May 2013
    So this came off the Neverwinter site... it's one of the first things you see on their board:

    http://nw.perfectworld.com/news/?p=880781

    Has the STO team offered an explanation?

    http://sto-forum.perfectworld.com/showthread.php?t=651451

    Pretty much the same thing, right there in the news network forum.
  • woodrandy Member Posts: 8 Arc User
    edited May 2013
    To Bandon and his peers:

    I was one of the players that unexpectedly got severed from the game due to the outage on 5/2. As I also have a position in the IT field I am familiar with the issues in a client/server model.

    I want to say that I have been VERY impressed at both your service availability and the response time of your team in light of the issues you are presented with on a daily basis. I am also amazed that they do not happen more often given the size of the population that is pounding your server farm every minute of every day.

    If I were given the chance to grade your performance I would say A+ guys!

    Thank you for your continuing efforts to keep the game growing and moving in new and interesting ways.

    Randy Wood
    AKA Jacob Wood@woodtrandy in game
    Admiral Randy Wood (UFP) - 1st Temporal Operations Group
    Admiral Womek (RR) - Temporal Operations Group
    Admiral Renan Wartok (RR) - Temporal Operations Group
    Admiral Raynal Waton (UFP) - Temporal Operations Group
    General Kragh (KDF) - House of Ta'al
  • latinumbar Member Posts: 0 Arc User
    edited May 2013
    Seriously?

    C'mon. I think the phrase "S**t happens" applies here. Unexpected things happen in life. It just does. I'm just glad that they were able to fix the problem in a timely manner, and I'm sure they would want to prevent any future such occurrences from happening. After all, it is bad for business if being able to play became very unreliable. But seriously, why should they have to give a detailed explanation as to exactly what steps they are doing to prevent this in the future. Pretty much a waste of their time. They will just implement it and move on. And so should we.
    _____________________
    Come join the 44th Fleet.
    startrek.44thfleet.com
  • crypticarmsman Member Posts: 4,115 Arc User
    edited May 2013
    woodrandy wrote: »
    To Bandon and his peers:

    I was one of the players that unexpectedly got severed from the game due to the outage on 5/2. As I also have a position in the IT field I am familiar with the issues in a client/server model.

    I want to say that I have been VERY impressed at both your service availability and the response time of your team in light of the issues you are presented with on a daily basis. I am also amazed that they do not happen more often given the size of the population that is pounding your server farm every minute of every day.

    If I were given the chance to grade your performance I would say A+ guys!

    Thank you for your continuing efforts to keep the game growing and moving in new and interesting ways.

    Randy Wood
    AKA Jacob Wood@woodtrandy in game

    +1. (Although this particular outage didn't affect me, as BOTH my desktop and laptop died within 2 days of each other, so I haven't been able to do anything at home. And yes, I'm posting this from work during a break.) As an IT professional myself, and someone who's played a lot of MMOs, I can say Cryptic in general has had a very good record of keeping their games up, running, and available for the majority of their players. 100% uptime is a myth, and unexpected hardware failures (that sometimes also take out the backups) do happen. There is no such thing as 'fail-proof' hardware.

    Me? I hope to be back up and running in STO by the end of next week at the latest (assuming none of the parts I want for my new system are backordered.):eek::D;)
    Formerly known as Armsman from June 2008 to June 20, 2012
    PWE ARC Drone says: "Your STO forum community as you have known it is ended...Display names are irrelevant...Any further sense of community is irrelevant...Resistance is futile...You will be assimilated..."
  • kintisho Member Posts: 1,040 Arc User
    edited May 2013
    I would also like some detail as to the problem and its solution, and/or the steps being taken to prevent it in the future.
  • sussethrai Member Posts: 137 Arc User
    edited May 2013
    As long as it doesn't reveal any top seekrit/proprietary info, can Cryptic post what happened to cause the huge May 2 outage? Information will go a long way toward calming game-deleting rage, especially if it was something like a transformer blowing that was completely out of Cryptic's control. In fact, posting a quick breakdown of what caused any outage or major game error, if known, would be appreciated by the community. We would probably do less shouting for more servers, more bandwidth, or keeping the NW, CO, and STO servers separated if we knew those issues weren't factors.
    "Susse-thrai" had been the name bestowed upon her, half in anger, half in affection, by her old crew on Bloodwing; the keen-nosed, cranky, wily old she-beast, never less dangerous than when you thought her defenseless, and always growing new teeth far back in her throat to replace the old ones broken in biting out the last foe's heart.
    Romulans: left one homeworld, lost another, third time's the charm?
  • hravik Member Posts: 1,203 Arc User
    edited May 2013
    kintisho wrote: »
    I would also like some detail as to the problem and its solution, and/or the steps being taken to prevent it in the future.

    Why? They told us it was a hardware failure already. Do you want the exact part and model number of what went poof? There isn't really much you can do to prevent hardware failure, even at home. Sometimes stuff just breaks.
  • jstewart55 Member Posts: 412 Arc User
    edited May 2013
    Stuff happens. Five hours is a long time, and I get that if it's the only time you can play, it's rather frustrating, but you literally CANNOT expect the servers to be running full-tilt all the time.
  • neoakiraii Member Posts: 7,468 Arc User
    edited May 2013
    S**t happens
  • sasheria Member Posts: 1 Arc User
    edited May 2013
    During the initial downtime, the message was about Neverwinter overloading the servers. Remember that a PW account can be used on ALL games.

    So I'm guessing that with Neverwinter just released, the "login" server (where it checks that you have an active PW account) was getting hit hard and probably failed, which meant no one could log on. Once the login server authenticates you, it passes you to the proper game server (in our case, STO).

    Everyone must go through the login server before hitting the game server. I am guessing this is what happened.
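The login flow guessed at above can be sketched as a toy model. Everything here is hypothetical: the class, the function, and the server names are made up for illustration and say nothing about how Cryptic's infrastructure is actually built.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical game-server names, keyed by game. Not real hostnames.
GAME_SERVERS = {"STO": "sto-shard", "NW": "nw-shard", "CO": "co-shard"}


@dataclass
class LoginServer:
    """Toy stand-in for a login service shared by every game."""
    healthy: bool = True

    def authenticate(self, account: str) -> bool:
        if not self.healthy:
            raise ConnectionError("login server unavailable")
        return True  # assume every account is valid in this sketch


def connect(login: LoginServer, account: str, game: str) -> Optional[str]:
    """Return the game server to join, or None if login cannot be reached."""
    try:
        login.authenticate(account)
    except ConnectionError:
        return None  # the game server may be fine, but nobody can reach it
    return GAME_SERVERS[game]


if __name__ == "__main__":
    login = LoginServer()
    print(connect(login, "player@example", "STO"))
    login.healthy = False  # simulate the shared login service failing
    print({game: connect(login, "player@example", game) for game in GAME_SERVERS})
```

In this model, a failure of the one shared login step locks players out of every game at once, even though each game server is untouched.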
    To grow old is inevitable, to grow up is optional.
    Please review my campaign and I'll return the favor.
  • thlaylierah Member Posts: 2,987 Arc User
    edited May 2013
    The point being missed here is Resource Allocation.

    New Game NWN = Full resources.

    Old game STO = whatever is left over.

    Blame those supporting NWN for the outages.
  • jornado Member Posts: 918 Arc User
    edited May 2013
    I've seen triple-A MMOs have unexpected downtimes of 12 hours or more. Granted, that was 3 plus years ago, but stuff does indeed happen. One time an MMO I played fervently went down for 36 hours - the community went into pandemonium, as they couldn't raid Pandemonium for a whopping day and a half (10 points if you know which MMO off the top of your head), but the company did prorate all paid subscriptions for the time lost.

    Point is, it happened, the server is back up, and we can play again with a reasonably small inconvenience. Let's not waste time complaining on the forums, we all have alts to level and reps to grind!

    Cheers!
    My guess is "hope" keeps people not playing but posting on the forums. For others, it's a path of sad realization and closure. Grieving takes time. The worst "haters" here love the game, or did at some point.
  • nightstalker61 Member Posts: 1 Arc User
    edited May 2013
    They are probably using virtual servers (Hyper-V, KVM, etc.). In my experience, if you have a hardware failure on the server banks that house those, everything comes tumbling down. The upside is that since the servers are virtual, they'll come back pretty quickly once the main Hyper-V server comes back online.
  • abaddon653 Member Posts: 1,144 Arc User
    edited May 2013
    Those interns, always spilling coffee on the servers.
  • vengefuldjinn Member Posts: 1,521 Arc User
    edited May 2013
    srspells wrote: »
    Haven't played many MMOs, have you? No MMO has 99% uptime; they all go down a lot, and for lots of reasons, some as dumb as a missing image.

    Yeah, BUT this recent crash was just one incident among an ever-increasing list of incidents: constant server disconnects, lag with bank items, extreme lag in general, and launcher connectivity issues.

    Is it possible that Neverwinter might make things worse? Some concern is understandable.
  • darkkindness2 Member Posts: 257 Arc User
    edited May 2013
    The point being missed here is Resource Allocation.

    New Game NWN = Full resources.

    Old game STO = whatever is left over.

    Blame those supporting NWN for the outages.

    This is patently incorrect. NWN and STO are not only run out of the same data center, but they share fairly significant resources - login and authentication server, chat server, necessary networking for game access... there's a fairly high level of integration in the hardware that runs STO, NWN, and CO.

    As a matter of fact, I'd be willing to bet that NWN going live and this issue will see upgrades to a fair amount of the infrastructure that all of Cryptic's games share. Sure, NWN going into open beta is probably what overloaded the login/authentication service, but now it's a known weak link in the chain that they'll want to address (most likely with new hardware) before NWN goes live. In the end, NWN is going to do nothing but benefit STO and CO.
    __________________________________________________
    Joined January 2010.

    In regard to hating Star Trek 2009:
    kain9prime wrote: »
    IDIC fail.
  • captainoblivous Member Posts: 2,284 Arc User
    edited May 2013
    This is patently incorrect. NWN and STO are not only run out of the same data center, but they share fairly significant resources - login and authentication server, chat server, necessary networking for game access... there's a fairly high level of integration in the hardware that runs STO, NWN, and CO.

    As a matter of fact, I'd be willing to bet that NWN going live and this issue will see upgrades to a fair amount of the infrastructure that all of Cryptic's games share. Sure, NWN going into open beta is probably what overloaded the login/authentication service, but now it's a known weak link in the chain that they'll want to address (most likely with new hardware) before NWN goes live. In the end, NWN is going to do nothing but benefit STO and CO.

    He may be referring to financial and manpower resources rather than resources of a computational nature.
    I need a beer.

  • thlaylierah Member Posts: 2,987 Arc User
    edited May 2013
    He may be referring to financial and manpower resources rather than resources of a computational nature.

    You are correct Sir!