I saved this from over at Reddit some time ago, and I’m sharing it here because I thought it could use wider exposure. This post is largely for IT professionals, of whom I know a fair community; forgive me if it doesn’t seem relevant.
On the other hand, if you’re a manager or a director responsible for IT¹, you may want to read this with some care.
A question was asked, “Isn’t there a live sandbox environment² you can freely make mistakes in before you jump in the actual live databases or whatever and make changes? If not, why not?”
A comprehensive answer was posted by redditor /u/catherder9000, which I have only bowdlerized a little, and I hope the author is not mortally offended.
It is all about scale.
(Sort of like how this post could have been summed up in 2 sentences, but enjoy it anyway!)
Let’s say you work at a company that is a large small business (40-50 million revenue yearly, 100-200 people). Your IT department is a 1-3 man team, because “you’re an expense”… most business people think only sales people make them money. Don’t worry that you can’t make money if things don’t work, only sales makes you money.
Now let’s pretend your last major upgrade to the servers was accomplished with a $75,000 budget. Getting that budget, with the equipment you insisted was required, was hard fought. Some corners were cut on “not absolutely necessary” things, things like a second slightly smaller and slightly slower server to run as a mirror of the first one, a server you could do all your testing on. That “saved” the company $30,000, right? You just like to spend money, you never make the company any money.
Then, a year later, you have something that absolutely has to be done to the server. You are pretty sure it will work, and your outside support people are confident it will work, but you have no server to test it on because all your other servers are much too small to handle it or are already tasked with other “critical” services. So you go with your best judgement and go live with a big change during the wee hours to cause the least interruption.
1 AM STUFF GOES BAD.
Now you’re scrambling. By 5AM you’re in a frantic attempt to get back online before major business starts. Nothing you or your vendor have tried has worked; they’ve called in a half dozen of their T3s and developers, all to no avail. People are rolling in, things aren’t working. Calls are happening. Pages are going out. 6AM, the owner rolls in. His stuff isn’t working. You’re now thinking about reverting to last night’s backup because the changes you were told would work without a hitch were nothing but a giant frozen boot to the face hitch. People are getting really frantic about not being able to do business: nobody can order anything, nobody can sell anything, nobody can maintain inventory, nobody can do anything but sit around with their thumbs in their ears and surf the web. You’re just an expense, you don’t make the company money.
6:30AM, you make the decision to give up attempts at fixing and instead roll back to the last backup. You start the restore, telling everyone, “This should be resolved by 9:30AM. Everyone we have is on it, and a full restore should take 2 or 3 hours, tops.”
9:35 rolls around, 9:40… At 10:15 the restore from backup fails at the last step. What the…? How the…? This is impossible! You make some calls, you explain that you have to attempt rolling back to the offsite backup, and yes, you understand that will lose half the day’s business and everything will have to be manually entered when the system is back up. You’re given the “Well, for pity’s sake, get it back up! What do we pay you for!?!” (The go-ahead. They have utmost confidence in your abilities.) You start the other restore. It works, but it is much slower than the onsite one because fiber is only so fast. 3:00PM you’re back online, things seem to be stable again.
3:30, nobody in IT has slept in 32 hours. You’re called into a meeting with management. People want answers. You explain that you were assured everything would go smoothly by the vendor, and you tell them that you were confident in your role in the upgrade as well. What should have been a 2-hour downtime during the night turned into a 17-hour ordeal. It was an unforeseeable incident. You mention that, “Had we had a working test environment to try this on first, we would have discovered the problem and avoided it.”
Nobody wants to hear it. Everything is about reentering the previous day’s sales, orders, receivables, inventory adjustments, etc. By 4:30 the business day is basically a wipe. The downtime has cost the company a couple of hundred thousand in lost business for the day. You’re just another expense, you don’t make the company any money.
Nobody learns from it other than yourself, a few other people in IT, and the vendor who “has never seen this problem before”.
Your request for a new sandbox server is declined. Your request for a 2nd local backup server is seen as “another” frivolous idea.
You’re just another expense, you don’t make the company any money.
Welcome to IT.
The Old Wolf has been there.
¹ IT = Information Technology. You know, the computer or data-processing department that doesn’t make any money. When things are going well, they wonder what they pay you for. When things go to hell, they wonder what they pay you for.
² A sandbox is a separate place, a mirror of your computer systems, where software can be tested without impacting your production machine. If things go bad, no harm, no foul.