Struggling with Reporting of EDM’s last moments

This is reposted from LinkedIn, and is honestly not a rant.

Exomars EDM during EDL

It hasn’t been easy to keep from commenting on the ExoMars precursor EDM mission, Schiaparelli, especially when many media reports claimed it would have been the first successful European landing on Mars. I will, once more, patiently explain that while Beagle-2 totally failed in its overall mission and failed to communicate, it successfully landed. Of several bits of evidence for this, here’s the latest, in case. (However, ignore the erroneous assertions about how much science data it would have collected, in silence.) Back to Schiaparelli.

A lot of people have asked me what I think about this. Headline response – Mars is hard. It makes sense to engineer things away to make it as safe and easy as possible. And from here on in I’m just confused… In this Guardian article appear the following quotes:

“The erroneous information generated an estimated altitude that was negative – that is, below ground level,” the ESA said in a statement.

“This in turn successively triggered a premature release of the parachute and the backshell [heat shield], a brief firing of the braking thrusters and finally activation of the on-ground systems as if Schiaparelli had already landed. In reality, the vehicle was still at an altitude of around 3.7km (2.3 miles).”

I am really having a hard time digesting this. It is weird to implement a sensor system likely to saturate its output for a critical action. It is weird to omit (or only partly implement) exception checking on its output (see Ariane V501 for another exception handling failure). It is weird that IMU rate data can in any way influence altitude estimates (given there was a radar). It is weird that an altitude calculation can even generate negative values – or even tolerate discontinuities of any kind in descent rate or altitude. (Yes, I am aware of and very comfortable with negative altitudes relative to the MOLA datum. I suggest that this isn’t relevant for landing events.) It is weird that, even if one allowed all of the above things to happen, and so built software that could find itself underground, it is then considered reasonable to release ‘chutes and other hardware as the next course of action. If you agree, then for ‘weird’ above, feel free to substitute ‘wrong’. (You are of course free to disagree, and I want to clearly state that I am not familiar with the specific design of the EDM systems or software, nor with anyone involved in those things. I would love to be wrong and misunderstand the situation, or to find that the above-mentioned reporting is in itself messed up through an ESA/PR blender and doesn’t reflect reality.)
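To make the complaint concrete, here is a minimal Python sketch of the kind of plausibility check I have in mind on an altitude estimate before anything irreversible acts on it. To be clear, this is my own illustration and not the EDM design: the function names, the `Verdict` categories and every threshold are made up for the example.

```python
import math
from enum import Enum, auto


class Verdict(Enum):
    OK = auto()
    NOT_FINITE = auto()
    NEGATIVE = auto()
    JUMP = auto()
    DISAGREES_WITH_RADAR = auto()


# Hypothetical illustration values, not EDM parameters.
MAX_DESCENT_RATE_M_S = 500.0   # largest believable change in altitude per second
MAX_RADAR_MISMATCH_M = 200.0   # allowed gap between the estimate and the radar


def check_altitude(prev_alt_m, new_alt_m, dt_s, radar_alt_m=None):
    """Classify a new altitude estimate before anything acts on it."""
    if not math.isfinite(new_alt_m):
        return Verdict.NOT_FINITE
    if new_alt_m < 0.0:
        # Height above the local surface cannot be negative during descent.
        return Verdict.NEGATIVE
    if abs(new_alt_m - prev_alt_m) > MAX_DESCENT_RATE_M_S * dt_s:
        # A discontinuity larger than the vehicle could physically manage.
        return Verdict.JUMP
    if radar_alt_m is not None and abs(new_alt_m - radar_alt_m) > MAX_RADAR_MISMATCH_M:
        return Verdict.DISAGREES_WITH_RADAR
    return Verdict.OK


def may_trigger_landing_sequence(verdict):
    """Only a plausible estimate may start irreversible actions."""
    return verdict is Verdict.OK
```

Fed the sort of numbers in the quote above (a previous estimate of 3700 m followed by a negative one), `check_altitude` returns `Verdict.NEGATIVE` and `may_trigger_landing_sequence` says no. What the vehicle should do instead is a separate design decision, and not one this sketch pretends to answer.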

El Reg, always good for a bit of eyebrow inflection upon news of technical mishap, gives the context immediately before the quote:

 [saturated] IMU data was “merged into the navigation system”

Gosh. Unchecked data being dropped into something akin to (let’s say) a Kalman filter for sensor fusion and state estimation, yes, will indeed generally foul up the works. Especially if they’re saturated values or zeros, or anything else that pushes any of the matrices you care about towards singularity. There are solid approaches to control engineering when faced with uncertain parameters in a dynamic environment. Adam Steltzner is an example of someone who gets that. Conversely, “Blend all the data and ignore the planet” is not a good strategy for EDL. Or anything else. (Now that I’ve written that, it sounds like the philosophy of climate change naysayers.)
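As a rough illustration of checking data before it is merged into anything, the Python sketch below gates rate samples ahead of a plain integrator. It is deliberately much simpler than a Kalman filter, and I am not claiming it resembles the EDM navigation software: the full-scale limit, the margin and the function names are all assumptions invented for the example.

```python
import math

# Hypothetical full-scale limit of a rate gyro, in deg/s. Not an EDM figure.
GYRO_FULL_SCALE_DEG_S = 150.0
# Treat anything within this margin of full scale as saturated.
SATURATION_MARGIN_DEG_S = 1.0


def is_usable(rate_deg_s):
    """A sample is usable only if it is finite and clear of the sensor rails."""
    if not math.isfinite(rate_deg_s):
        return False
    return abs(rate_deg_s) < GYRO_FULL_SCALE_DEG_S - SATURATION_MARGIN_DEG_S


def integrate_attitude(angle_deg, samples_deg_s, dt_s):
    """Integrate rate samples into an angle and report whether to trust it.

    Unusable samples are left out of the sum, and the result is flagged as
    untrusted, because the true rate during those samples is unknown.
    """
    trusted = True
    for rate in samples_deg_s:
        if is_usable(rate):
            angle_deg += rate * dt_s
        else:
            trusted = False
    return angle_deg, trusted
```

For example, `integrate_attitude(0.0, [10.0, 150.0], 1.0)` returns `(10.0, False)`. The important part is the flag rather than the number: whatever consumes the attitude downstream is told that the estimate is no longer backed by good data, instead of quietly blending it in.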

Lessons to take forwards for people who haven’t yet tried this.

Regardless of what you’re doing, for any variable in a flight system – if it’s 0, max value, or negative – at least implement placeholder checks for every case – because then it forces your engineers to think “what would happen if…” – and somebody has to see that case during code review, or better, CDR. Do this even if it makes no physical sense at all. Especially if it makes no sense. Why? Because some berk will provide you with utterly invalid data. For sure. Every time. AND THEY SHOULD – because if you can’t tolerate bad data, saturated data, zeroes all the way, test patterns or debug pointers – then your design isn’t good enough for mission-critical functions. And if they don’t – your test plan isn’t good enough either.

if (thing has this unexpected value) {
  // this should not happen, but still
  try {
    // attempt the nominal action
  } catch (..) {
    // always catch
    // throw (..) onwards? sometimes, you need a benign and peaceful handler.
    // e.g. switching engines off isn't a job for an exception handler
    // - that's an intentional decision.
  }
} elseif (same thing has this other unexpected value) {
  // oh this really shouldn't happen.
  // but I've put it in the code in case
  // so it has to be talked about when we get to review.
  // at least it won't be my fault when the whole thing breaks...
  // and actually this doesn't cost us at all because it will be compiled
  // away if it's really not needed.