2014-03-27

Co.Labs

More About Our Methodology: Tracking MH370 With Monte Carlo Data Models

Yesterday's story generated a heated discussion on Reddit, so here are more in-depth explanations of our methodology.



The article we published yesterday on the missing Malaysia Air flight has--big surprise--received some criticism and nitpicking on Reddit, which I’ll address below.

Choosing The Heading

The plane direction model is a state-dependent (Markov) random walk: at each stage it takes the previous heading (plane direction) into account when choosing the new heading. (Specifically, the new heading is normally distributed with a mean of the previous heading and a fixed standard deviation; hence I summarize it as quasi-random due to the weighting.) If the standard deviation is large, the distribution moves closer to a uniform one, resulting in a stateless random walk.


See the post that sparked this discussion here: How I Narrowed Down The Location Of Malaysia Air Using "Monte Carlo" Data Models. Then read our followup to this post: This Data Model Shows MH370 Could Not Have Flown "Accidentally" To Its Destination


But a stateless random walk is only realistic if we think that the plane picked a totally random direction every hour. Planes trying to get somewhere don’t tend to do that. So I make the assumption, through a smaller standard deviation, that the plane will tend to keep flying in its current direction. However, I’ll note that because I use a normal distribution and can vary the standard deviation, my model could be made to resemble a uniform distribution if that were an assumption I chose to make.
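As a rough illustration of the heading step described above, here is a minimal Python sketch. It is not the code behind the model; the function names, the wrap to the 0-360 degree range, and the seed handling are assumptions made for the example. Each new heading is drawn from a normal distribution centered on the previous heading.

```python
import random


def next_heading(prev_heading_deg, sigma_deg, rng):
    """Draw a new heading from a normal distribution centered on the previous one.

    The result is wrapped into [0, 360) degrees (an assumption of this sketch).
    """
    return rng.gauss(prev_heading_deg, sigma_deg) % 360.0


def simulate_headings(start_heading_deg, sigma_deg, n_steps, seed=None):
    """Return the starting heading followed by n_steps Markov-style updates."""
    rng = random.Random(seed)
    headings = [start_heading_deg % 360.0]
    for _ in range(n_steps):
        headings.append(next_heading(headings[-1], sigma_deg, rng))
    return headings


if __name__ == "__main__":
    # Small sigma: the plane mostly keeps its current direction.
    print(simulate_headings(180.0, 10.0, 5, seed=1))
    # Very large sigma: after wrapping, headings look close to uniform.
    print(simulate_headings(180.0, 1000.0, 5, seed=1))
```

With a standard deviation of zero the heading never changes, and as the standard deviation grows well beyond 360 degrees the wrapped headings approach a uniform draw, which is the stateless case.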

Adding Additional States To The Model

The values that I chose for the standard deviation are a reasonable balance between constant flight direction and the ability to explore the space. Ultimately the model of the plane direction--the heading state--is only one input; we also have the ping data to help constrain where the plane goes.

While it is true that more states, like altitude, speed, and remaining fuel, would make for a richer model, that only holds if there were data to constrain them, and I’m not aware of any. For now, heading alone, plus the 5th ping, gives a very reasonable result.

Using The 5th Ping

There were 7 complete pings during the flight. Five occurred after the last radar sighting over Pulau Perak and those are the ones I refer to in the model. Each of these pings has a distance associated with it, and each distance has an error in the distance estimate. Unfortunately, Inmarsat and the Malaysian authorities have only released the distance of the last (5th) ping and have released no error estimates.

My original plan was to constrain the plane position at each stage with the ping distance for that stage. This is done by taking the product of the probabilities from the ping distance and the heading (suitably renormalized). Since the distances for the intermediate pings are unavailable, I instead constrain each stage by the final ping distance.
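The "product of the probabilities, suitably renormalized" step can be sketched as follows. This is an illustrative Python fragment, not the model's actual code: the function name and the idea of scoring a discrete list of candidate positions are assumptions made for the example.

```python
def combine_and_renormalize(heading_probs, ping_probs):
    """Multiply two per-candidate probabilities and rescale them to sum to 1.

    heading_probs[i] and ping_probs[i] are the probabilities that the heading
    model and the ping distance assign to the same candidate position i.
    """
    if len(heading_probs) != len(ping_probs):
        raise ValueError("both lists must score the same candidates")
    products = [h * p for h, p in zip(heading_probs, ping_probs)]
    total = sum(products)
    if total == 0:
        raise ValueError("no candidate is consistent with both constraints")
    return [x / total for x in products]


if __name__ == "__main__":
    # Two equally likely headings; the ping distance favors the second one.
    print(combine_and_renormalize([0.5, 0.5], [0.2, 0.6]))
```

Candidates that either constraint considers unlikely end up with little weight, so the ping distance pulls the random walk toward the arc even when the heading model alone would wander off.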

I assumed the ping error was normally distributed, and ran the model with standard deviations of 5%, 10%, and 20% of the radius. Inmarsat has not stated the error on the 5th ping, and we only have distance data for that last ping, so the larger error estimates effectively cover the other pings as well, judging by the qualitative schematic maps.
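To make the error assumption concrete, here is a small Python sketch of a normal ping-distance likelihood whose standard deviation is a fraction of the ping radius. The function name is hypothetical, and the 1000 km radius in the usage lines is a placeholder chosen purely for illustration, not the released ping distance.

```python
import math


def ping_likelihood(distance_km, ping_radius_km, rel_sigma):
    """Normal density of a candidate's distance around the ping radius.

    rel_sigma is the standard deviation as a fraction of the radius,
    e.g. 0.05, 0.10, or 0.20 for the three cases described above.
    """
    sigma = rel_sigma * ping_radius_km
    z = (distance_km - ping_radius_km) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


if __name__ == "__main__":
    placeholder_radius_km = 1000.0  # illustrative value only, not real data
    for rel_sigma in (0.05, 0.10, 0.20):
        on_arc = ping_likelihood(1000.0, placeholder_radius_km, rel_sigma)
        off_arc = ping_likelihood(1300.0, placeholder_radius_km, rel_sigma)
        print(rel_sigma, on_arc, off_arc)
```

A wider standard deviation lowers the peak on the arc but leaves far more weight on positions well away from it, which is why the larger error settings can stand in for the missing intermediate ping distances.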

Wind Information

The airport locations come from a database for an entertainment-oriented flight simulator (X-Plane), but the wind data--which I decided was insignificant for the model--came from a professional pilot, via a service that is used to file flight plans for real flights.

Since it’s the same information commercial pilots use when planning a flight, I trust it more than an Internet weather service, which is unlikely to have data across the range of latitudes and longitudes, and at the cruising altitude (35,000 ft.), that MH370 is likely to have experienced.

The Overall Goal

The goal of this Monte Carlo model is not to show definitively what happened, but to help explore the space of the most likely scenarios using the information available to me. My model is not supposed to be the most complicated model--it answers a simple question: With a few simple assumptions, how far can we go?

The answer is, about as far as the experts have, but with far fewer resources (data, time, people). If more information on the Doppler analysis and the pings were released, I could incorporate it into my model to give an even better estimate of where the airplane is. It would be great if this data were made available.

I also welcome any other suggestions and critiques, as well as any data I can use.

[Image: Flickr user r2hox]




4 Comments

  • Stephen Stanley

    I have a feeling this will not be understood by a very broad audience. Given the prevalence of casinos, lotteries and bets on billion dollar brackets, I'd state with a fairly high degree of confidence that most of us are statistically illiterate. I understand it but my feeling is you lost most of your audience at "Markov."

  • That's a valid point -- and something I was aware of in preparing this followup, but that perhaps I still underestimated. Some of the more persuasive critiques of the model came from people who used such terminology, though, so in order to be succinct and respond to them I decided to translate what I'd first written in a way that uses such terminology as well. Not because I'm trying to make what I did sound complicated now, but because I'd have to use a lot more words and explanation otherwise. Which is a big reason why terminology and shorthand specific to disciplines exist in the first place.