> "At this point, the engineers in Australia decided that a brute-force approach to their safe problem was warranted and applied a power drill to the task. An hour later, the safe was open—but even the newly retrieved cards triggered the same error message."
What happened here (from what I recall) was far funnier than this gives it credit for.
The SREs first attempted to use a mallet (hammer) on the safe (which they first had to buy from the local hardware store - don't worry, it got expensed later), then after multiple rounds of "persuasion" they eventually called in a professional (a.k.a. a locksmith) who used a drill + crowbar to finally liberate the keycard.
The postmortem had fun step by step photos of the safe in various stages of disassembly.
Sorry for the offtopic comment, but it's bizarre to me that Google is hosting their book on Github with a github.io domain. Their previous two SRE books are hosted at https://sre.google on Google-owned IPs.[0]
What was that decision process? "We're Google, and we're literally writing a book about how good we are at hosting services. But hosting some static HTML files that are almost entirely text? That's a tough one. We'd better outsource that to one of our competitors."
I think one is a portal for GitHub developers, while the other is a public polished site. It reminded me of Google's early forthright attitude, which made life so simple and human.
> It took an additional hour for the team to realize that the green light on the smart card reader did not, in fact, indicate that the card had been inserted correctly.
I'm not sure which is worse: bad UI/UX use of lights, or inadequately trained engineers who misunderstood the lights.
A lot of progress has been made by acknowledging that people are idiots and that the system has to work around that. Toyota, which went from one of the worst to one of the most reliable automakers, is known for formalizing idiot-proofing (poka-yoke).
If the reader had been able to read the card both ways, there wouldn't have been a problem and no training would have been required. The next best thing would be for the card not to fit upside down, or for the reader to show a clear message: "try flipping the card". It is not something you should train people for; it should be obvious.
I also suspect the reader was in an unusual configuration, because everyone knows how to use smart cards; they probably did what they always do instinctively and it didn't work. In the thousands of times I've done it, I don't remember ever inserting my credit card the wrong way, and I don't remember anyone else doing it either - it is just so instinctive. For an entire team to miss that, there must be something wrong with how the reader is set up.
> On the thousands of times I did it, I don't remember having ever inserted my credit card the wrong way and don't remember anyone who did, it is just so instinctive.
I have done it lots of times! With machines where you just dip the tip, you're bound to put the side with the chip in, but most machines want it facing up, and some want it the other way. The iconography is only illustrative once you've messed it up at those machines enough times (around me, Walgreens has difficult machines). Readers where you insert the whole card are easier to mess up, too.
> If the reader was able to read the card both way, there wouldn't have been a problem and no training required. The next best thing would be for the card to not fit upside down. Or have a clear message "try flipping the card". It is not something you should train people for, it should be obvious.
I suspect the HSM was an off-the-shelf component. The real issue with training is that a system with a complex startup procedure hadn't been restarted in 5 years. You should rehearse complex procedures at least once a year; otherwise there's a good chance nobody with experience has done it. Also, maybe someone would have flagged the issue of needing the cards to start the very system that grants access to the cards. (Although drill + 1 hour is a reasonable recovery procedure that was obvious and didn't need training, apparently.)
The fundamental lesson of at least half my information systems undergraduate courses was: adapt the system to observed user behavior; do not expect the user to adapt their behavior to the system.
Of all the companies reputed to have great SRE, I would not have expected Google to be the one where this process was so brutally flawed:
- Storing the safe's combination - which is needed to retrieve the card required for the password manager to start - ... in this very same password manager?
- Failing to try inserting the card into the reader in multiple orientations (it's like USB: you're using it the wrong way around). I would have tried that before (or while) drilling the safe.
- Having no clue (no documentation) how to restart the service, despite it holding passwords? If the passwords are lost, everything encrypted with them is lost, forever.
If there's one thing I think is essential to document (personal or corporate), it is how to get access to passwords _fast and reliably_ whenever there's a disaster recovery.
What is this, sitcom slapstick? The slapstick of storing the combination to the safe on the system that is locked by the card which is inside the safe; and the slapstick of "You're inserting it wrong"...
I don't know anything about Google's internals, but I gather this password manager service was of low importance and shared among employees. I'm thinking this would've been a non-issue with a low-tech solution like a shared document of passwords and services, or a wiki page, which by virtue of being hosted on a more common platform would benefit from a better SLA.
> restart required a hardware security module (HSM) smart card.
Out of curiosity, does anyone know why? My guess would be that the password DB was encrypted with a key derived from this card.
I've had lots of "I have a secret and the server needs it" type problems but I've never been very happy with my solutions- smart cards seem like potentially an elegant solution.
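To make the "I have a secret and the server needs it" pattern concrete, here is a minimal sketch of how an HSM or smart card is typically used for this: the database master key is stored on disk only in wrapped (encrypted) form, and only the card can unwrap it, so a restart stalls until the card is present. All names here are hypothetical, and the XOR "wrapping" is a dependency-free stand-in for a real key-wrap algorithm (e.g. AES key wrap) - this is an illustration of the architecture, not secure code.

```python
import secrets

class SimulatedSmartCard:
    """Stands in for an HSM/smart card: holds a key that never leaves it."""
    def __init__(self):
        self._device_key = secrets.token_bytes(32)  # never exported

    def wrap(self, key: bytes) -> bytes:
        # Real devices use AES key wrap; XOR keeps this sketch stdlib-only.
        return bytes(a ^ b for a, b in zip(key, self._device_key))

    unwrap = wrap  # XOR is its own inverse

# Provisioning: generate the DB master key, persist only the wrapped copy.
card = SimulatedSmartCard()
master_key = secrets.token_bytes(32)
wrapped_key_on_disk = card.wrap(master_key)  # safe to store on disk

# Restart: the server can decrypt nothing until the card unwraps the key.
recovered_key = card.unwrap(wrapped_key_on_disk)
assert recovered_key == master_key
```

The elegance is that the plaintext master key exists only in RAM; the operational cost, as the article shows, is that every restart now has a hardware dependency you must rehearse.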
This article highlights exactly why an HSM may be potentially elegant, but also really, really dependent on embedding the process for using it in your operational processes (which would include performing that operation regularly to ensure it still works and that knowledge of its use is retained).
For a 'best effort' hosted internal service, this is not a good choice.
It's mostly a rather dense engineering textbook, but it contains lots of things I found insightful. I most particularly remember a segment like "We make things more reliable by adding more layers of Swiss cheese, on the assumption that failure modes are uncorrelated and it's only when all the failures take place that the system breaks. But this doesn't work when the system is being attacked by an intelligence, because an intelligence will explicitly correlate failures."
The book is very much designed for Google-scale systems, though: everything is assumed to be microservices, for example.
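The Swiss-cheese point above can be put in rough numbers (illustrative only, with made-up probabilities): if each defensive layer fails independently, overall failure probability shrinks geometrically with the number of layers; if an attacker deliberately correlates the failures, one weakness can line up every hole at once and the extra layers buy little.

```python
# Illustrative-only numbers: each layer independently fails 10% of the time.
p_layer = 0.1
layers = 4

# Uncorrelated failures: all layers must fail together by coincidence.
p_independent = p_layer ** layers   # 0.1^4 = 1e-4

# Adversarially correlated failures: the attacker finds one condition that
# defeats every layer simultaneously, so depth adds little.
p_correlated = p_layer              # roughly back to a single layer

print(p_independent, p_correlated)
```

This is just the arithmetic behind the quoted argument: layering multiplies probabilities only under the independence assumption that an intelligent attacker deliberately violates.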
The power drill mention in the headline is a bit click-baity: while a power drill was used, in the end it was unnecessary and was not the solution to the problem. Had they known how to properly use the hardware security devices they had, the power drill wouldn't have been deployed at all.
But the additional cards may very well have been necessary to understand “there is something wrong with our usage of the cards, this error is not a one-off failure due to corrupted data or broken hardware or other problem local to the California card(s)”. Having multiple independent reproductions of an issue helps you narrow down what the commonalities are!
The text says "Fortunately, another colleague in California had memorized the combination to the on-site safe". You might think that's unlikely and he probably wrote it down, but it's not "clear" from the text.