Most people involved in car accidents have a driver’s license
Does Simpson’s paradox have anything to do with causality, as Judea Pearl claims in The Book of Why? In this book the computer scientist and philosopher of science describes the historical development of a mathematical theory of causation. This new theory licenses the scientist to talk about causes again, after a period in which she could only report in terms of correlations. Will the Causal Revolution, in which Pearl plays a prominent role, eventually lead to a conversational machine that passes the Turing test?
The strange case of the school exam
A school offers courses in statistics. Two Professors are responsible for the courses and the exams. The contingency tables below show statistics about the students’ exam results, passed (Positive) or not passed (Negative), for each of the two Professors.
The school awards the Professor with the best exam results. Professor B claims the award, pointing to the first table. This table indeed shows that the relative frequencies of passing are higher for Professor B (2% negative results) than for Professor A (3% negative results).
Professor A objects to B’s claim. It was recorded which students were well prepared for the exam and which were not, and he compiled a table with the segregated results. Indeed, this second table shows that for both categories of students the results of Professor A are better than those of Professor B.
Which Professor wins the award?
The statistics in the aggregated table clearly show that for the whole group of students Professor B has better results than Professor A, but for both subgroups of students it is reversed: Professor A is better than Professor B.
How is this possible?
This surprising outcome of the statistics exams is my favourite instance of Simpson’s paradox. The paradox is well known among scholars and among most students who have followed a course in statistics. I presented it to my students in a lecture to warn them about hidden variables. I dug up my slides again when I was reading Judea Pearl’s discussion of the paradox in The Book of Why.
Beyond statistics: causal diagrams
After he introduced Bayesian Networks in the field of Artificial Intelligence, Pearl invented causal diagrams and developed algorithms to perform causal inferences on these diagrams. In The Book of Why Pearl presents several instances of Simpson’s paradox to clarify that we cannot draw causal conclusions from data alone. We need causal information in order to do that. In other words: we need to know the mechanism that generated the data.
Causal diagrams are mathematical structures, directed acyclic graphs (DAGs) in which the arrows connecting two nodes represent a causal relation, not just a probabilistic dependency.
Figure 1 shows two possible causal diagrams for the case of the school exams.
Both networks can be extended to a Bayesian network with probabilities that are consistent with the statistics in the tables. In both models the Professor and the student’s preparedness, represented by the node labeled Prepared, are direct causes of the exam result, represented by the node labeled Passed. The diagrams differ in the direction of the arrow between the Prof node and the Prepared node. In the diagram on the left the causal direction is towards the Prof node; in the diagram on the right the causal direction is towards the Prepared node: the Professor determines how well students are prepared for the exam.
If the latter model fits the real situation, the school should award Professor B. The decision should be based on the table with the combined results: the better exam results are to the Professor’s credit.
The diagram on the left models the situation in which the preparedness of the students somehow determines the Professor. In this case the school could award Professor A, based on the results in the lower, segregated, table.
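For concreteness, the two candidate diagrams can be written down as edge lists of cause → effect arrows. The node and variable names below are mine; the point is simply that the two models share every arrow except the one between Prof and Prepared, whose direction is reversed.

```python
# Two candidate causal diagrams (DAGs) for the school-exam case,
# represented as sets of (cause, effect) edges.
model_left = {("Prepared", "Prof"), ("Prof", "Passed"), ("Prepared", "Passed")}
model_right = {("Prof", "Prepared"), ("Prof", "Passed"), ("Prepared", "Passed")}

# The models differ only in the direction of one arrow:
diff = model_left ^ model_right
print(diff)  # the Prof–Prepared edge, in both orientations
```

Both edge sets are acyclic and both can be fitted with conditional probability tables that reproduce the observed statistics, which is exactly why the data alone cannot choose between them.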
What has Simpson’s paradox to do with causality?
What makes Simpson’s paradox a paradox? There has been some discussion about this in the statistical literature. Simpson himself gives two examples of the phenomenon. One is about the chances of survival after a medical treatment, where the contingency tables show that the treatment is good for males as well as for females, but valueless for the population as a whole. Of course, such a treatment cannot exist. But what should we conclude from the tables? Again, the answer depends on the underlying mechanism, which can be represented by a causal diagram. Simpson suggests that the “sensible interpretation” is that we use the segregated results for the genders. It is a bit strange, indeed, to assume that the treatment affects the patient’s gender.
Pearl distinguishes between Simpson’s reversal and Simpson’s paradox. He claims that Simpson’s paradox is a paradox because it “entails a conflict between two deeply held convictions”. Notice that even if there were no reversal, different causal diagrams would still be possible.
What does Simpson’s paradox reveal?
In Causality (2001) Pearl introduces the paradox in terms of conditional probabilities.
“Simpson’s paradox refers to the phenomenon whereby an event C increases the probability of E in a given population p and, at the same time, decreases the probability of E in every subpopulation of p. In other words, if F and ~F are two complementary properties describing two subpopulations, we might well encounter the inequalities
P(E | C) > P(E | ~C)
P(E | C, F) < P(E | ~C, F)
P(E | C, ~F) < P(E | ~C, ~F)
“Although such order reversal might not surprise students of probability, it is paradoxical when given causal interpretation.’’ (Causality, p. 174; italics mine)
From the first inequality we may not conclude that C has a positive effect on E. The effect of C on E might be due to a spurious confounder, e.g., a common cause of E and C.
In our example of Simpson’s paradox we could estimate the conditional probabilities P(Passed|Prof) from the contingency tables.
From the inequality
P(Passed = True | Prof = B) > P(Passed = True | Prof = A)
derived from the combined table, we could conclude that the Professor has a causal influence on Passed, i.e. on the exam results. If we do this, we give the inequality a causal interpretation. And this is clearly wrong! There could be other mechanisms (confounders) that make Passed dependent on Professor.
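A small simulation, with made-up probabilities, illustrates how a confounder alone can create this dependence: in the generative model below, preparedness drives both which Professor a student ends up with and whether the student passes, while the Professor has no direct effect on passing at all. Yet the estimated pass rates per Professor differ clearly.

```python
import random

random.seed(0)

def simulate(n=100_000):
    """Generate exam data in which Prof has NO direct effect on Passed;
    Prepared is a common cause of both Prof and Passed."""
    counts = {"A": [0, 0], "B": [0, 0]}  # prof -> [passed, total]
    for _ in range(n):
        prepared = random.random() < 0.5
        # well-prepared students tend to end up with Professor B
        prof = "B" if random.random() < (0.8 if prepared else 0.2) else "A"
        # passing depends on preparedness only, not on the professor
        passed = random.random() < (0.95 if prepared else 0.7)
        counts[prof][1] += 1
        counts[prof][0] += passed
    return {p: c[0] / c[1] for p, c in counts.items()}

rates = simulate()
# Passed comes out probabilistically dependent on Prof (B looks much
# better than A), without any causal arrow from Prof to Passed.
print(rates)
```

Estimating P(Passed|Prof) from such data and reading it causally would wrongly credit Professor B.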
Why is Simpson’s reversal surprising?
Consider the following statement.
If a certain property holds for all members of a group of entities then that same property also holds for all members of all subgroups of the group and vice versa.
This seems to me logically sound; it holds for any property whatsoever. The statement differs from the following.
If a certain property holds for a group of entities then that same property also holds for all subgroups of the group and vice versa.
The second statement is about properties of aggregates. It is not a sound logical rule: whether it holds depends on the property.
If a student sees the contingency tables of the school exams and notices the reversal, he might perceive this as surprising and see it as contradicting the first statement. On second thought, he might notice that the first statement is not applicable: there is no property that holds for all students. The student might then think that it contradicts the second statement. But then he realizes that this is not sound logic. Simpson’s paradox makes him aware that the second rule, the one about aggregates, does not apply here. The reason is that the property is not “stable’’: it changes when we consider subgroups instead of the whole group. The property is a comparison of relative frequencies of events. In our example:
6/600 < 8/600 and 57/1500 < 8/200
and for the merged group it holds that:
(6+57)/(600+ 1500) > (8+8)/(600+200)
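The reversal can be checked mechanically. Using exact fractions for the failure rates taken from the tables:

```python
from fractions import Fraction

# Failure (Negative) rates per professor, per subgroup, from the tables
a_sub = [Fraction(6, 600), Fraction(57, 1500)]  # Professor A
b_sub = [Fraction(8, 600), Fraction(8, 200)]    # Professor B

# In every subgroup, A's failure rate is lower than B's ...
assert all(a < b for a, b in zip(a_sub, b_sub))

# ... yet in the merged group the comparison flips:
a_all = Fraction(6 + 57, 600 + 1500)  # 63/2100 = 3%
b_all = Fraction(8 + 8, 600 + 200)    # 16/800  = 2%
assert a_all > b_all
```

The flip happens because the merged rates are weighted averages, and the group sizes (600 vs 1500 for A, 600 vs 200 for B) differ wildly between the professors.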
The abstract property hides, in a sense, the differences in the underlying relative frequencies. The situation is like winning a tennis match: a player can win the match although her opponent wins most of the games. The outcomes of the games are hidden by counting the number of sets that each player wins. With set scores 7-5, 0-6 and 7-5, player A wins two sets to one, but player B wins 16 games to 14.
Indeed, “Simpson’s reversal is a purely numerical fact”.
What has Simpson’s paradox to do with causality?
Pearl claims that for those who give the statistical data a physical, causal interpretation, there is a paradox. “Causal paradoxes shine a spotlight onto patterns of intuitive causal reasoning that clash with the logic of probability and statistics” (p. 190).
In The Book of Why he writes that it took him “almost twenty years to convince the scientific community that the confusion over Simpson’s paradox is a result of incorrect application of causal principles to statistical proportions.”
It seems that whether an argument or construct is perceived as a paradox depends not only on the rhetorical way it is presented but also on the receiver.
The heading “Most people involved in car accidents have a driver’s license’’ is perceived as funny by the reader insofar as it suggests a causal relation, i.e. that having a driver’s license causes car accidents.
How would a student of the Jeffreys and Jaynes school, i.e. someone who has an epistemological concept of probability, perceive Simpson’s paradox?
When I saw Simpson’s paradox for the first time I was surprised. Why? Because of the suggestion the tables offer, namely that they tell us something about general categories. Subconsciously we generalize from the finite set of data in the tables to general categories. If we compute (estimate) probabilities based on relative frequencies, we in fact infer general conclusions from finite data counts. The probabilities hide the numbers. In my view the paradox could very well be caused by this inductive step. We need not interpret probabilistic relations as causal to perceive the paradoxical character.
What are probabilities about?
When I was a student, probability theory and statistics were not my favourite topics. On the contrary! My interest in the topic was awakened when I read E.T. Jaynes’ Probability Theory. Jaynes is an out-and-out Bayesian with a logical interpretation of the concept of probability. According to this view probability theory is an extension of classical logic. Probabilities are measures of the plausibility of a statement, expressing a state of mind. P(H|D) denotes the plausibility of our belief in H given that we know D. I use H for Hypotheses and D for Data. P(H|D) can stand for how plausible we find H after having observed D. Bayes’ rule tells us how we should update our beliefs after we have obtained new information. Bayes’ rule is a mathematical theorem within probability theory. It allows us to compute P(H|D) from P(D|H), the probability of D given some hypothesis, and P(H), the prior probability of H.
Jaynes warns his readers to distinguish between the concept of physical (or causal) dependency and the concept of probabilistic dependency. Jaynes’ theory concerns the latter: epistemological (in)dependencies, not causal dependencies.
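As a minimal illustration of this updating rule, with made-up numbers: suppose that before seeing the data D we consider H and ~H equally plausible, and that D is four times as probable under H as under ~H.

```python
def bayes_posterior(prior_h, p_d_given_h, p_d_given_not_h):
    """P(H|D) = P(D|H)P(H) / [P(D|H)P(H) + P(D|~H)P(~H)]."""
    numerator = p_d_given_h * prior_h
    return numerator / (numerator + p_d_given_not_h * (1 - prior_h))

# Observing D raises the plausibility of H from 0.5 to 0.8
post = bayes_posterior(prior_h=0.5, p_d_given_h=0.8, p_d_given_not_h=0.2)
print(post)  # 0.8
```

Nothing in this computation refers to a physical mechanism; it only describes how a state of knowledge changes.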
Neither involves the other. “Two events may be in fact causally dependent (i.e. one influences the other); but for a scientist who has not yet discovered this, the probabilities representing his state of knowledge – which determines the only inferences he is able to make – might be independent. On the other hand, two events may be causally independent in the sense that neither exerts any causal influence on the other (for example, the apple crop and the peach crop); yet we perceive a logical connection between them, so that new information about one changes our state of knowledge about the other. Then for us their probabilities are not independent.’’ (Jaynes, Probability Theory, p. 92).
Jaynes’ Mind Projection Fallacy is the confusion between reality and a state of knowledge about reality. The causal interpretation of probabilistic relations is an instance of this fallacy. Logical inferences can be applied in many cases where there is no assumption of physical causes.
According to Pearl the inequalities of Simpson’s paradox are paradoxical for someone who gives them a causal interpretation. I guess Jaynes would say: the fact that these inequalities hold shows that we cannot give them a causal interpretation; they express different states of knowledge. You cannot be in a knowledge state in which they all hold true.
But how would Jaynes resolve the puzzle of the school exam? Which of the two Professors should win the award? Jaynes was certainly interested in paradoxes, but he didn’t write about Simpson’s paradox, as far as I am aware. I think he would not consider it a well-posed problem. Jaynes considered the following puzzle of Bertrand’s not well-posed: given a circle, draw a chord “at random”; what is the probability that the chord is longer than the side of the inscribed equilateral triangle?
Bertrand’s problem can only be solved when we know the physical process that selects the chord. The Monty Hall paradox, discussed by Pearl, is also not well-posed, and hence unsolvable, if we don’t have information about the way the quiz master decides which door he will open. The outcome depends on the mechanism. Jaynes and Pearl very much agree on this. Jaynes relies on his Principle of Maximum Entropy to “solve” Bertrand’s paradox. I don’t see how this could solve the puzzle of the school exam. Somehow Jaynes would have to put causal information in the priors.
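That the answer depends on the mechanism can be made concrete for Monty Hall. The sketch below (my own illustration, not code from either book) enumerates all equally likely configurations exactly: with a host who knowingly opens a goat door, switching wins with probability 2/3; with a host who opens a random other door that merely happens to show a goat, switching wins with probability only 1/2.

```python
from fractions import Fraction
from itertools import product

def switch_win_prob(host_knows):
    """Exact probability that switching wins, given that a goat was revealed."""
    win, total = Fraction(0), Fraction(0)
    for car, pick in product(range(3), repeat=2):
        p = Fraction(1, 9)  # car position and first pick: independent, uniform
        if host_knows:
            # an informed host always opens a goat door different from the pick
            choices = [d for d in range(3) if d not in (pick, car)]
        else:
            # an ignorant host opens any door different from the pick
            choices = [d for d in range(3) if d != pick]
        for host in choices:
            q = p / len(choices)
            if host == car:
                continue  # the car was revealed; this round is discarded
            total += q
            switched = next(d for d in range(3) if d not in (pick, host))
            if switched == car:
                win += q
    return win / total

print(switch_win_prob(True))   # 2/3
print(switch_win_prob(False))  # 1/2
```

Identical observations, different mechanisms, different correct answers: exactly the point Jaynes and Pearl agree on.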
How can Jaynes’ theory help the scientist find out whether two events are “in fact causally dependent’’ when probabilities are about the scientist’s “state of knowledge’’ and not about reality? After all, scientists aim at knowledge about the real causes. We are not forbidden, Jaynes says, to introduce the notion of physical causation. We can test any well-defined hypothesis. “Indeed, one of the most common and important applications of probability theory is to decide whether there is evidence for a causal influence: is a new medicine more effective, or a new engineering design more reliable?’’ (Jaynes, p. 62).
The only thing we can do is compare hypotheses given some data and compute which of the hypotheses best fits the data. Where do the hypotheses come from? We create them using our imagination and the knowledge we have already gained about the subject.
The validation of causal models
Causal diagrams are hypothetical constructs designed by the scientist based on his state of knowledge. Which of the two causal diagrams of the school exam case fits the data best? We have learned that we cannot tell from the data in the contingency tables: both hypothetical models fit the data. Gathering more data will not help us decide which of the two represents reality. We can only decide when we have extra-statistical information, i.e. information about the processes that generated the data. Jaynes advocates the use of his principle of maximum entropy when we have to choose a prior. But the causal direction is not testable by data, so I do not see how this can solve the school’s problem.
But how does Pearl justify the causal knowledge presented in a causal model? How can we decide that this model is better than that one? The hypothetical causal models are in fact theories about how reality works. We cannot evaluate and compare them by hypothesis testing. Data cannot decide about causation issues. How do we validate such a theory then? It seems that we can at best falsify them.
Pearl doesn’t give an explicit answer to this critical question in The Book of Why. The answer is implicit in the historical episodes of scientific inquiry that he writes about: the quests and quarrels of researchers searching for causes. If there is something like the truth, it is in these historical dialectical processes, not outside them. Although it helps that now and then someone stubbornly believes that she has seen the light and fights against the establishment’s doctrine. Those are the ones that make science progress. The Book of Why contains a few examples of such stubborn characters. To quote Jaynes: “In any field, the Establishment is seldom in pursuit of the truth, because it is composed of those who sincerely believe that they are already in possession of it.” (Jaynes, p. 613). Eventually, it is history that decides about the truth.
The Big Questions: Can machines think?
In the final chapter of The Book of Why Pearl shares some thoughts about what the Causal Revolution might bring to the making of Artificial Intelligence. “Are we getting any closer to the day when computers or robots will understand causal conversations?’’ Although he is of the opinion that machines are not yet able to think, he believes that it is possible to make them think and that we will be able to have causal conversations with machines in the future.
Can we ever build a machine that passes the Turing test, a machine that we can have an intelligent conversation with as we have with other humans? To see what it means to build such a machine and what this has to do with the ability to understand causality, consider the following two sentences (from Terry Winograd, cited in Dennett (2004)).
“The committee denied the group a parade because they advocated violence.’’
“The committee denied the group a parade because they feared violence.’’
If a sentence like these occurs in a conversation with a machine, the machine must figure out the intended referent of the (ambiguous) pronoun “they” if it is to respond intelligently.
It will be clear that in order to do this, the machine must have causal world knowledge, not just about a few sentences, or about some “part or aspect of the world’’ (which part or aspect then?). Such a machine might also be able to see the pun in “Most drivers that are involved in a car accident have a driver’s license.’’
I worked for quite some time in the field of Natural Language Processing, building dialogue systems and artificial conversational agents. We haven’t succeeded up to now in making such a machine, although the results are sometimes impressive. Will we ever be able to build one? It is an academic issue, often leading to quarrels about semantics, something that Turing tried to prevent with his imitation game.
What about responsibility?
What is not an academic issue, but a real practical one, is the responsibility that we have when using machines: computers and robots that we call intelligent and to which we assign more and more autonomy and even moral intelligence.
I end my note about Simpson’s paradox, which became a sort of review of Pearl’s The Book of Why, by emphatically citing another giant in the philosophy of science, Daniel C. Dennett.
“It is of more than academic importance that we learn to think clearly about the actual cognitive powers of computers, for they are now being introduced into a variety of sensitive social roles, where their powers will be put to the ultimate test: In a wide variety of areas, we are on the verge of making ourselves dependent upon their cognitive powers. The cost of overestimating them could be enormous.’’ (D.C. Dennett in: Can Machines Think?).
“The real danger is basically clueless machines being ceded authority far beyond their competence.” (D.C.Dennett in: The Singularity—an Urban Legend? 2015)
Great books are books that make you critically reflect on and revisit your ideas. The Book of Why is a great book and I would definitely recommend that my students read it.
Daniel C. Dennett (2004) Can Machines Think? In: Teuscher, C. (ed.) Alan Turing: Life and Legacy of a Great Thinker. Springer, Berlin, Heidelberg, pp. 295-316. https://link.springer.com/chapter/10.1007%2F978-3-662-05642-4_12
Daniel C. Dennett (2015) The Singularity – an Urban Legend? In: What Do You Think About Machines That Think? Edge.org. https://www.edge.org/response-detail/26035
E.T. Jaynes (2003) Probability Theory: The Logic of Science. Cambridge University Press, UK.
Judea Pearl (2001) Causality: Models, Reasoning, and Inference. Cambridge University Press, UK, reprint 2001.
Judea Pearl and Dana Mackenzie (2019) The Book of Why: The New Science of Cause and Effect. First published by Basic Books, 2018; published by Penguin Random House, UK, 2019.
Stuart Russell and Peter Norvig (2009) Artificial Intelligence: A Modern Approach, 3rd edition. Pearson, 2009.
E.H. Simpson (1951) The Interpretation of Interaction in Contingency Tables. Journal of the Royal Statistical Society, Series B (Methodological), Vol. 13, No. 2, pp. 238-241. http://www.jstor.org/stable/2984065