A Covid test that is 95% accurate? How helpful is that?
Imagine a test that was always positive, whether or not the patient had the infection. It would be correct at least some of the time. So would a test that was always negative regardless of whether the patient was infected. How often these tests were correct would depend on how many people were actually infected. The more people who are infected, the more often the ‘always positive’ test gets it right.
That fact doesn’t change when the tests get it right 95% of the time. Imagine a scenario where 5% of the population, that’s 1 in 20 people, are infected. Suppose the test is wrong 5% of the time in both directions: 5% of those who are infected (actually positive) test negative, and 5% of those who are not infected test positive.
Of every 100,000 people tested, only 5,000 will be infected. But 5% of the 95,000 who are not infected will test positive (4,750). Of the 5,000 who are infected, 95% will test positive (also 4,750). So of the 9,500 people testing positive, only 4,750 (50%) will be infected. The rest are ‘false alarms’, or false positives.
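The arithmetic in this example can be checked with a short sketch. The function and parameter names are illustrative labels only, and the sketch assumes, as the example does, that the test is wrong 5% of the time in both directions. Exact fractions are used so that rounding does not blur the 50% figure.

```python
from fractions import Fraction


def positive_test_breakdown(population, prevalence, sensitivity, specificity):
    """Split the positive results into true and false positives.

    prevalence  : share of the population that is infected (the base rate)
    sensitivity : share of infected people who test positive
    specificity : share of uninfected people who test negative
    """
    infected = population * prevalence
    not_infected = population - infected
    true_positives = infected * sensitivity
    false_positives = not_infected * (1 - specificity)
    share_infected = true_positives / (true_positives + false_positives)
    return true_positives, false_positives, share_infected


# Figures from the example: 100,000 tested, 5% infected, test right 95% of the time.
true_pos, false_pos, share = positive_test_breakdown(
    population=100_000,
    prevalence=Fraction(5, 100),
    sensitivity=Fraction(95, 100),
    specificity=Fraction(95, 100),
)

print(f"True positives:  {int(true_pos):,}")    # 4,750
print(f"False positives: {int(false_pos):,}")   # 4,750
print(f"Positives who are infected: {float(share):.0%}")  # 50%
```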
In this example, if you are infected then the test is likely to be correct (95%), and if you are not, then it is also likely to be correct (95%). But that’s only useful if you already know the answer which the test is supposed to provide – whether you are infected! If you only have the test result to go on, then it is 50/50 whether a positive result means you are infected.
This demonstrates that to understand a positive result we need to know certain things. Fundamentally, we need to know the probability of someone having the condition in the first place. This is called the ‘base rate’ or ‘background rate’, and is a number that affects the entire calculation shown here. Without it almost nothing useful can be determined about the probability that any random person one meets may be infected (which I think is what most people would like to know).
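To see how strongly the base rate drives the answer, the same calculation can be repeated for several base rates. This is a minimal sketch using Bayes' rule. The function name and the list of base rates are illustrative assumptions; the 95% figures are carried over from the example above.

```python
from fractions import Fraction

NINETY_FIVE_PERCENT = Fraction(95, 100)


def chance_infected_given_positive(base_rate,
                                   sensitivity=NINETY_FIVE_PERCENT,
                                   specificity=NINETY_FIVE_PERCENT):
    """Probability that a person who tests positive is actually infected."""
    true_pos = base_rate * sensitivity              # infected and test positive
    false_pos = (1 - base_rate) * (1 - specificity)  # not infected but test positive
    return true_pos / (true_pos + false_pos)


# Illustrative base rates (assumed values, chosen only to show the trend).
for base_rate in (Fraction(1, 1000), Fraction(1, 100),
                  Fraction(5, 100), Fraction(20, 100)):
    chance = chance_infected_given_positive(base_rate)
    print(f"base rate {float(base_rate):.1%}: "
          f"{float(chance):.1%} of positives are infected")
```

At the 5% base rate used in the example this gives exactly one half. With the same test, a lower base rate gives a lower figure and a higher base rate gives a higher one.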
The base rate cannot be discovered by looking at high-risk groups or any other selected group; the sample must be randomly selected from the whole population. Without an estimate of the base rate, decisions regarding risk are impossible. The problems caused by the lack of a random testing programme have been highlighted by others[1],[2].
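Another way of looking at the above consideration is to ask two questions:
1. If I am infected, what is the probability that I test positive?
2. If I test positive, what is the probability that I am infected?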
These are different questions, and produce different answers, as discussed above.
So what has this to do with forensic DNA?
Many crimestain samples nowadays are mixtures of DNA from different people. A DNA profile is a description of the different types of DNA in an individual. We share some of these with our relatives, but we can also share at least a few with other people. This creates a conundrum in understanding the possible contributors to a mixture, because mixing DNA from more than one person creates new possible contributor profiles. Take the words ‘one’ and ‘two’, mix the letters, and we can now make more words than the originals: net, won, ten and so on. The same thing happens with DNA: mixing profiles produces new contributor possibilities. In fact, for standard DNA profiling, billions of new ‘contributor’ profiles are possible just from the DNA of the two actual contributors. All of the profiles which are possible from the combinations in the mixture, other than the two actual contributors, are in effect ‘false positives’. Given these billions of possible contributor profiles, what is the probability that anyone selected at random from the population would ‘fit’, that is, have one of those profiles and therefore be considered a possible contributor? This is the ‘base rate’: the probability of a positive test regardless of whether the person is a contributor or not, in other words a chance match.
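The way a mixture multiplies the number of ‘fitting’ profiles can be sketched in a few lines. This is a deliberately simplified illustration, not a casework method. The locus names and allele numbers are invented, and the sketch assumes that a person ‘fits’ at a locus whenever both of their alleles appear in the mixture, ignoring peak heights and any other information a real analysis might use.

```python
from itertools import combinations_with_replacement
from math import prod

# Hypothetical two-person mixture: the alleles observed at each locus.
# Locus names and allele values are invented for illustration only.
mixture = {
    "locus_A": [12, 14, 15, 17],
    "locus_B": [8, 9, 11],
    "locus_C": [20, 22, 23, 25],
    "locus_D": [6, 9],
}


def fitting_genotypes(alleles):
    """All pairs of alleles (a person carries two per locus) found in the mixture."""
    return list(combinations_with_replacement(sorted(alleles), 2))


def count_fitting_profiles(mixture):
    """Number of whole profiles that fit the mixture at every locus."""
    return prod(len(fitting_genotypes(alleles)) for alleles in mixture.values())


for locus, alleles in mixture.items():
    print(f"{locus}: {len(fitting_genotypes(alleles))} fitting genotypes")

print(f"Profiles fitting the whole mixture: {count_fitting_profiles(mixture):,}")  # 1,800
```

With only four invented loci there are already 1,800 fitting profiles, of which just two belong to the actual contributors. Each extra locus multiplies the total again, which is how the count climbs so quickly when many more loci are examined.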
Current practice in many jurisdictions, including the UK, is to calculate a statistic called a Likelihood Ratio (LR). The LR is a test of only the suspect, not the other possible contributor profiles, assuming that the suspect is indeed a contributor. Although the LR calculation produces false positives, there is no calculation of the ‘base rate’, that is, no calculation of the probability of a suspect fitting the evidence by chance. If you happen to have one of the possible profiles that might be in the mixture, the LR will probably provide compelling evidence that the DNA could have come from you (according to those who believe the LR), even though you are not a contributor. The scope for false accusations is obvious.
Remember the two questions asked about the Covid test? In DNA those become:
Back to virus testing: if the only known fact, the evidence, is the positive test result, then we have shown here that we need to know much more before we can properly interpret what it means. A positive result may be very poor evidence of infection. In DNA profiling, the evidence is the profile result. Unless we know the base rate, that is, the probability that this is a false positive (because many profiles from non-contributors will also produce a ‘positive’), it is impossible to properly evaluate the result.
There is more than one explanation for the evidence, i.e. many possible contributors to the mixture. Calculating the LR involves an answer only to the first question, with respect to just the one person. It is calculating the probability of the evidence, not the probability of the hypothesis that it is the suspect’s DNA. It is ASSUMING what most people believe we are trying to prove: that the suspect’s DNA is there. This is like assuming that you already have Covid before being tested.
Unfortunately, this error, the ‘transposed conditional’ (assuming we are answering question 2), is made again and again in courts throughout the world and causes the evidence to be misunderstood.
[1] https://theconversation.com/coronavirus-country-comparisons-are-pointless-unless-we-account-for-these-biases-in-testing-135464
[3] Strictly speaking, it is the probability that the DNA came from someone other than the suspect.