They just don't make them like they used to

October 28, 2003 - Reading time: 5 minutes

In the past couple of years, national security has been on everyone's mind; laws have been passed, rules have been enacted, and generally life has been made more miserable so that we as a country can feel more secure.  Some of the initiatives we have seen are very visible: airport security, security at federal buildings, and legislation such as the Patriot Act have been widely discussed, and their relative merits are subject to some debate.  There has also been much behind-the-scenes work, such as the Container Security Initiative (CSI), which is designed to protect the transportation of the ubiquitous and increasingly important 40-foot containers that bring us much of what we buy.  All discussion of the merits of these security precautions aside, we can still say that people are actively working to keep our critical infrastructure safe from attack.

But we have been ignoring an important point in the process of securing our national infrastructure, and that overlooked point presented itself to us recently.  The massive power outage in the northeast taught us an important lesson: decreasing margins of safety and error in our infrastructure place critical societal functions at greater risk of significant disruptions from rare occurrences -- accidental, malicious, or otherwise unforeseen.  This is nothing new; it has been going on for decades now, as a series of decisions by policy makers placed the administration of our national infrastructure in the hands of profit-seeking organizations.  This is not necessarily bad, but redefining acceptable levels of risk and protection as the world changes is hard work, and needs to be done carefully.


Cost pressures and tight engineering under benign assumptions over the last few decades have led to thin margins of error in our current infrastructure.  That is to say, certain major failures are assumed to be so unlikely that they are discounted during the design process.  This way of thinking creates systems that tend to be less expensive and are optimized to fit that relatively optimistic set of basic assumptions.  But while optimized engineering keeps most events small in consequence (because the systems are engineered to tolerate them), some rare events that might otherwise have been relatively benign (or at least tolerable) can now lead to massive disruption.  As the margins of safety designed into the large, complex, and poorly understood systems that make up our critical infrastructure (such as the national power grid) are whittled away in the name of cost-effectiveness, the likelihood of massive, uncontrolled failures increases.

It might seem like this is just asking for trouble, but it is considered "bad engineering" to overdesign a system to tolerate very rare events, or events whose specific causes are not well understood, if that tolerance is perceived to cost more than the failures it would prevent (in terms of expected value to the customer), or if the likelihood of the failure seems very remote.  In other words, fragility to extremely rare events is seen as a good business decision.  This is why rare disruptions (like power outages) come as little surprise to insiders of highly optimized or complex infrastructures.  Building excess capacity and redundancy into a system such as the electric power grid is essential to safety and reliability, but it has no market incentive -- safety doesn't sell.


What the market calls "excess capacity" (note the connotations of "excess"), others call a safety net.  When a critical power line fails, parallel lines must have this "excess" capacity to take over the flow, and this safety net must remain intact even when lines are out of service for maintenance.  Such safety is not cheap.  So while adequate margins of safety generally have the side effect of increasing the overall efficiency and reliability of a system, at some point investments in redundancy come to be seen as extravagant and wasteful by stakeholders, whether they are private (i.e. shareholders) or public (i.e. taxpayers).  Those who are out to placate stakeholders favor more visible single-point safety or security measures, which tend to cost more in the long run and are generally less effective.


The invisible hand of economics creates systems designed and optimized under optimistic assumptions of relatively benign environments; these systems are at great risk if new or unexpected threats arise, because the margins that historically made it possible to work around unexpected problems (think of the Apollo 13 near-disaster) are no longer designed in.  The development of our critical infrastructure is subject to these economic motivations, so it is already fragile to rare or unexpected events, and will only become more so.  That is good business paving the road to future vulnerabilities, because the market will not bear the cost of the level of reliability that it expects.  The pace of technological change and our society's growing reliance on these systems only amplify the uncertainty, urgency, and magnitude of the risk.


After 9/11, we can point out how scenarios that were previously almost unthinkable are suddenly possible, and thus engineered defenses against potential attacks are more strongly motivated.  However, to define and quantify threats and their impact, particularly in combination with coordinated physical and psychological attacks and effects, requires deep contemplative research, development, large-scale experimentation, and the like -- all very costly with little to no visible immediate payoff (which makes them politically unpopular).  But given the social and economic consequences that arose from the recent power outage, the national power grid is suddenly a large, inviting target for those who seek to disrupt society because it has demonstrated weaknesses and widespread impact.  It is impossible to protect all important points of such a large system using the standard paradigms of physical security, which is generally designed in isolation from the system it is protecting, and therefore offers little real protection.  Instead we need to fix the basic problems with the infrastructure -- if we can reduce the potential impact of catastrophic events on the power grid by making it more robust and flexible, it will become a less inviting target for catastrophic terrorism.  To achieve this, we must accept that we need non-market investments in the design and implementation of safety, security, and robustness in critical infrastructure.


No shit, there I was ...

June 21, 2003 - Reading time: 3 minutes

So a friend of mine was on TV today -- specifically, he was a guest on the show Tech Support on People TV, which is broadcast live to whoever is watching in metro Atlanta. Since it's not every day that most people get on TV, and there was supposedly room in the studio for 3 friends to watch, I went with two other people to watch David be on TV. Which was going to be fun.

So we get to People TV, which was highly reminiscent of UHF, but it was really neat to be hanging around the studio, and we were going to be sitting in the control room watching the show.

At least that was the plan until about 40 seconds (literally) before the show started, when the producer of the show asked us "are you three on camera?" We thought that he was asking us if we were going to be on the show, so we said "no" -- to which he said "Well, you are now," and started herding us through the door into the studio. We were trying to tell him that we weren't on the show until we realized that what he wanted us to do was operate the cameras.

So we operated the cameras, which was cool. Since none of us knew what we were doing, it was a bit interesting at first, but we had lots of fun and really got the hang of it by the end. And we got on the credits of the show, which was neat even though they spelled my name wrong. Plus we learned lots of neat things like how to zoom and focus and roll the cameras around, plus some cool TV cameraman phrases like "I need a two shot, left."

After the show, we were hanging around outside the building waiting for David to take care of some paperwork to get a VHS copy of the show, but he was taking too long so we went inside to get some free pizza and escape some weird drunk guy. So we're eating pizza in the hallway, and the producer comes out and says "hey, do you want to run cameras for the next show?"


Of course we said yes, and this time even got a more active role in the production process, which is really hectic by the way -- especially for live broadcasts.

Oh, and they spelled my name wrong on the credits again, but differently this time.

So anyway, definitely an interesting day. I wonder if I can go and operate the cameras more; that was fun. He said we could come back; maybe we can take him up on his offer :)


Columbia

February 2, 2003 - Reading time: 6 minutes

Note, 1 May 2020: As I look at this post, almost 2 decades after originally posting it, I no longer agree with a lot of what I said here.  In the intervening 17 years, I started working at NASA, the Shuttle fleet was retired, and then I stopped working at NASA (those three events are unrelated, I swear!).  Anyway, I have a lot to say about the Shuttle program (mostly good things) and NASA in general (a lot of bad things) but some get said in later posts, and some I will keep to myself for now.  Maybe I'll do another NASA post, or set of posts, in the future.

I spent most of this morning and early afternoon glued to the radio, listening to reports and commentary on the loss of the Space Shuttle Columbia. I tried sitting in front of the TV watching CNN as I had done in September of 2001, but CNN's coverage of the event was sickening. NPR ended up having more intelligent coverage than any of the other news sources I tried.

The train of events leading up to the disaster is posted in so many places that I'm not going to bother mentioning it here. I'm also going to refrain from speculating on the direct cause of the disaster, because I don't have the requisite competence in this area. However, there is one nagging issue that I feel bears a closer look -- that of the piece of insulating foam that fell off of the External Tank at launch and apparently impacted the left wing of the OV (Orbiter Vehicle).


What bothers me is not that this appears to be a smoking gun -- as I said, I'm in no position to speculate on that. The part that bothers me is the fact that once the Shuttle had launched, NASA had no way of inspecting the wing to see if it was damaged.


In one of the press conferences, we learned that Columbia was not equipped with a robotic arm, there was no method of getting a view of the sides or bottom of the OV, and EVA was out of the question because even if one of the astronauts could get to the wing (they couldn't), there would be nothing for them to do: the astronauts do not have the training or equipment to make repairs of that nature to the shuttle. Furthermore, if there was in fact visible damage to the OV, the astronauts could do nothing but float around in space, because Columbia would not be able to (for instance) maneuver itself to rendezvous with the ISS, and even if it could, it is not equipped to dock with the station. On top of that, NASA's most optimistic estimate of how long it would take to launch a Shuttle to respond to some emergency is 2-3 weeks -- and that is only if there is already a shuttle on the pad, ready to go, and there are no crew change requirements. Otherwise, your emergency could have to wait 3-4 months while a vehicle and crew are prepared for launch. Hardly a viable option.


Yes, it's true: NASA, which makes backups of backups of backups and contingency plans for contingency plans, has no way of saving astronauts once they are in space. Not only that; they have left themselves a huge blind spot (the physical condition of the bottom of the shuttle).


This blind spot is the cause of much speculation now on the cause of the Columbia disaster -- was there damage to the left wing of the OV from a piece of insulation that fell during launch? We may never know for sure. Any method of showing an image of the Shuttle's wing -- EVA, a camera, whatever -- could have answered many questions, and perhaps saved the lives of seven astronauts. If there are any benefits to be gained from this event, I hope to see:

  • Improved EVA ability. This means better space suits for the astronauts -- suits that allow greater freedom of movement than the current ILC Dover suits (which weigh over 300 pounds). A suit designed for EVA should allow astronauts to move around without depending on tethers and handholds.
  • Visual diagnostic ability for a vehicle in orbit. A picture is worth a thousand words. As we learned today, it could also be worth 2-3 years of investigation, and perhaps seven lives.
  • Quicker launch turnaround. No amount of diagnostic ability will get a disabled vehicle safely back to Earth. It is shameful that after 30 years of developing the shuttle, it still takes about three weeks of work at the pad to launch an OV. In 1981, the United States amazed the world by creating the first reusable launch vehicle. What we didn't create was a practical launch vehicle. Twenty-two years later, we still have a vehicle that weighs more than 4.5 million pounds at launch, and burns over 3.5 million of those pounds getting off the ground. When the Challenger blew up, Ronald Reagan promised us that we would build another Shuttle, and indeed we got a replacement Shuttle. What we need now is not a replacement -- what we need is a new Shuttle. One that is lighter, stronger, more versatile, and more agile. We need to shock the world again, with the first practical reusable launch vehicle.

    Alas, out of the three items on this list, this is the least likely to happen.



It seems fitting at this point to refrain from drawing any conclusions. Hopefully, we will know more about what happened and what could have been done to prevent it in the weeks and months to come. Only then can a backseat engineer like myself feel confident in providing direction for the future of the space program...


Friday's Rant, 11 October 2002

October 11, 2002 - Reading time: 10 minutes

In our rush to fix the problems that led to the recent election-related debacle, several states are trying to implement electronic voting systems to ensure quick and accurate election results. In theory, this seems like an excellent idea -- after all, an all-electronic system means no hanging chads, no butterfly ballots, and no manual recounts. The problem is that so far, we apparently haven't come across a way to do it right.


After the 2000 presidential elections, everyone had an opinion on how our voting system should be improved. Among the worst ideas were internet voting or voting at ATMs. Thankfully, those ideas weren't implemented, but some of what we've seen in 2002 is just as bad. Why bad? After all, an MIT/CalTech press release bubbled on about the wonderful improvement in Florida's voting technology in 2002. "On average," it says, "2.0 percent of Democratic voters recorded no vote for governor in [Brevard, Broward, Duval, Hillsborough, Miami-Dade, Palm Beach, and Pinellas] counties ... this is a 35 percent improvement in performance. ... These results are very encouraging."


I cannot begin to apprehend the confusion of ideas that could provoke such a statement.

Democracy has failed if even a single voter is not heard in an election. Period. According to the 2000 census, the seven counties represented in the MIT/CalTech study contain 6,260,142 residents over the age of 18. I don't know how many registered voters are in those counties, or how many of those are Democrats, so for the sake of argument, I'll say that 5% of those 6,260,142 residents are Democratic voters who went out to vote. (This may be unnecessarily conservative, but it should work for the sake of this argument.) If this were the case, then a 2.0 percent "no vote" rate means that over 6,200 votes were not counted in those seven counties alone. Over 6,200 votes. This is nothing to be proud of, even if it is an improvement.
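
To make the back-of-the-envelope arithmetic explicit, here is a quick sketch. The 5% turnout figure is my own assumption, as noted above; the 2.0 percent rate comes from the MIT/CalTech press release.

```python
# Rough estimate of uncounted votes in the seven counties.
residents_over_18 = 6_260_142  # 2000 census, seven counties combined
assumed_dem_turnout = 0.05     # assumed fraction who cast Democratic votes
no_vote_rate = 0.02            # "no vote for governor" rate from the study

dem_voters = residents_over_18 * assumed_dem_turnout  # ~313,000 voters
uncounted = dem_voters * no_vote_rate                 # ~6,260 votes

print(f"Estimated Democratic votes cast:      {dem_voters:,.0f}")
print(f"Estimated votes recording no choice:  {uncounted:,.0f}")
```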

Problems that did occur in the seven counties were lightly dismissed by Charles Stewart, an MIT professor working on the Voting Technology Project (which released the press release), as "problems encountered preparing for election day, such as training poll workers." Next time you wonder why your computer is so hard to use, keep in mind that Charles Stewart is a professor at one of the nation's most respected engineering schools. Does he tell his students that user interface design and end-user training are unimportant? That engineers design circuits, and problems with the final product can be attributed to incompetent users? Blaming the user is a common fault among engineers who feel that if they understand their product, so should everyone else. Everyone else isn't an engineer, though, or a computer scientist. Not accounting for end users is the biggest mistake an engineer can make.

Blaming hapless poll workers or poorly funded local election commissions, while easy, overlooks two fundamental problems:

  • The voting equipment was unintuitive enough that the average poll worker was unable to administer it. This is unfortunate because those are exactly the people who must administer it -- they are often the same people every time, so it's not like there was a surprise ("What, you mean old people are going to run these?!"). The equipment should not require any specialized training that cannot be fit, in legible text, on a sticker on the back of the machines. Training seminars should not be required.
  • The voting equipment was unintuitive enough that many voters were not able to operate it. Voting should not be hard. Voting should not require training, or practice votes. My grandmother should be able to vote without asking for directions. People select items from lists every day; it is a conceptually easy task. If the user interface is complex enough that people don't know how to do such a simple task as choosing one item from a list, the user interface is a failure.

Having said all of that, we can't dismiss some of the problems so lightly. Here are some of the more spectacular failures that happened in the 2002 Florida elections:

  • The ES&S voting machines that were approved by the state for use take 10 minutes to boot. Machines designed for visually impaired voters take 23 minutes to boot. This is appalling. There is no good reason for a special-purpose computer to take that long to boot. In addition, the machines must be booted sequentially, and many cannot be turned on before 6 AM on election day, so a polling place with 20 voting stations, including 2 stations for the visually impaired, won't be fully operational until at least 9:46 AM (the arithmetic is sketched just after this list).
  • In Union County, 2,700 optically scanned ballots had to be hand counted, because a computer bug resulted in only Republican votes being counted. Was this system tested?
  • In Duval and Orange Counties, optical ballots did not fit in counting machines. Some election officials took to trimming the ballots with scissors or pocket knives to make them fit.
  • In several precincts in Miami-Dade and Broward counties, electronic voting machines showed over 40% residual (lost or missing) votes, and vote data had to be extracted from backup memory inside the machines.
  • One south-Florida precinct showed a 1200% voter turnout, 12 times as many voters as were registered.
  • A state of emergency was declared to extend the election day by two hours so that people who were unable to vote because of equipment problems (or "problems encountered preparing for election day, such as training poll workers") could do so later. (How is the governor declaring a state of emergency "encouraging"?)
  • At the end of the day, some polling place workers in Miami-Dade and Broward counties did not know how to turn off the voting machines and retrieve votes, so an unknown number of votes there went uncounted.
  • Liberty City, a precinct in Miami-Dade, had 1,630 registered voters but only 89 recorded votes.
  • In an election with an average voter turnout of more than 30%, about 60 Miami-Dade precincts showed a turnout of less than 10%. Some showed a turnout of 0%.
  • Miami-Dade county gave its poll workers written instructions for using voting equipment in English. Apparently, some poll workers could not read English; in some cases they could not read at all.
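
To spell out the boot-time arithmetic from the first item above, here is a minimal sketch. It uses the example configuration from that item (20 stations, 2 of them for visually impaired voters) and the boot times reported for the ES&S machines.

```python
# Example polling place: 18 standard machines (10-minute boot) and
# 2 machines for visually impaired voters (23-minute boot), powered on
# one after another starting at 6:00 AM.
standard_machines, standard_boot_min = 18, 10
accessible_machines, accessible_boot_min = 2, 23

total_boot_min = (standard_machines * standard_boot_min
                  + accessible_machines * accessible_boot_min)  # 226 minutes

hours, minutes = divmod(6 * 60 + total_boot_min, 60)
print(f"Last machine ready at {hours}:{minutes:02d} AM")  # 9:46 AM
```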

This list could go on, but there is no point in berating the obvious. The fact is that voting should not require training, but apparently it does. Electronic voting systems should fix this problem, but so far they haven't. An electronic voting system needs to meet several requirements:

  • Accuracy. You should vote for someone by pushing a button. Pushing buttons isn't hard, especially if they are large, evenly spaced, easy-to-press buttons. Resistive and capacitive touch screens are not a good substitute for physical buttons. Resistive touch screens don't work well if you press the pad of your finger on them. Capacitive touch screens don't work well if you press with your fingernail. Both suffer from long-term durability, reliability, and calibration problems. Infrared touch screens are better. All touch screens need to be cleaned frequently to remove grease and smudges and to prevent the spread of infection. Once you've voted, you should be presented with a list so that you can confirm that you voted correctly. There should be an easy way to correct any mistakes. Once the voter is satisfied, the vote must be recorded using at least two methods. At least one of those methods needs to be human readable, for verification and recounts (if necessary).
  • Anonymity. This has been a no-brainer since the early days of voting, when ancient Greeks dropped stones into vases. Now people drop paper ballots into boxes. Exactly how the vote is encoded on paper isn't really relevant, as long as it can be accurately counted and cannot be connected with the voter.
  • Auditability. As stated before, there needs to be a human readable output for every vote that can be verified by the voter and securely stored for use later in the event of a computer foulup. There will be computer foulups. There always are. Be ready for them.
  • Reliability. Maybe this should read "failing gracefully." The equipment should fail gracefully, the whole system should be able to handle individual component failures gracefully, and the system as a whole should fail gracefully. What happens if the power fails? What happens if someone accidentally damages a touch screen? What happens if the software crashes? Why did the software crash, anyway? Was it tested thoroughly? What is the defect rate on the equipment? How do you fix problems that occur on election day? Single-purpose software isn't conceptually hard to program, and given a limited number of input methods, it shouldn't be hard to make software that will get through the day without crashing. Having said that, there needs to be an easy way for poll workers to return the system to working order in case the software does crash.
  • Scalability. The same system that handles two candidates needs to handle 12 candidates. The end result needs to be legible and intuitive. The machine should also have enough capacity to store every vote cast in a day, especially considering that the number of registered voters in every precinct is fixed, finite, and known in advance.
  • Security. Who programs the machines? Who has access to the source code? Who can change the source code? Who confirms that the proper source code is uploaded to the machines? Who supervises the process? Who certifies the source code? How are the results transferred to a central location? Constant supervision is good. Independent code audits are good. Random sampling of the code that is actually on the voting machines is good. Open Source is good. Background checks on the programmers are good. Hiring Russians who outsource programming to undisclosed third parties is bad. Hiring people convicted of vote fraud is bad.
  • Speed. Isn't that one of the great features of an electronic voting system? Let's know the results right away. The voting machines, since they are single-purpose machines, should boot and be ready to use in seconds, not minutes. It's not hard. Really.
  • Ease of use. You shouldn't need training to vote! All you're doing is picking someone out of a list. If it's really so hard, it's the fault of the user interface designer, not the voter or the poll worker. You also shouldn't need training to administer the voting machines. The machines should have an "on" switch and a screen. Nothing else. The people at the polling station should have to do nothing but turn on the machines at the beginning of the day, and turn them off at the end of the day.


None of this is hard, or difficult to fathom. However, I've never accused our elected officials of being competent. I say let's go back to dropping stones in vases.