Can machine learning improve criminal justice and social welfare?

In a paper titled “Human Decisions and Machine Predictions,” a team of researchers from Chicago, Cornell, Harvard and Stanford analyzed whether machine learning can improve on judges’ bail decisions. The problem is as follows: when a person is charged with a crime, some time elapses before they can be brought to trial. Where should they await trial – can they wait at home, or should they be held in jail during this waiting period? An important consideration for the judge (and society) is whether the defendant is a flight risk – how likely is it that the defendant will try to vanish. Letting the defendant await trial at home can be more humane: it can let the defendant continue to work, earn and be with their family. It can also mean lower costs for the state, as it is costly to house defendants awaiting trial in detention facilities. However, if there is a significant chance that the defendant might try to vanish, the judge may lean toward detaining the defendant in jail.

This is a clear-cut prediction task well suited to a machine learning algorithm: the algorithm has to predict whether a defendant will flee or not. There is ample prior data that can be used to train the algorithm: past decisions made by judges, where each defendant is described by variables such as gender, race, educational level, severity of the crime, past history, and so on. In cases where the defendant was allowed to stay at home while awaiting trial, we know which defendants waited without fleeing and which defendants fled. Defendants who were jailed while awaiting trial, on the other hand, obviously could not flee. We have no way of knowing the “counterfactuals” for these detained defendants – i.e. whether they would have fled had they been allowed to stay at home. The implication is that the machine learning algorithm perhaps does not see much data about hardened criminals – those with multiple prior offenses, serious offenses, and so on. This is probably OK, because judges would put them in jail anyway, and perhaps there is no need for a machine learning algorithm to help in those cases. It could be argued that we really need the algorithm’s help in borderline cases.
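To make this data limitation concrete, here is a minimal sketch in Python. The column names (`released`, `fled`, etc.) and the toy values are hypothetical, not taken from the paper; the point is simply that only released defendants can supply labeled training examples.

```python
# Illustrative only: a toy table of past cases. The flight outcome ("fled") is
# observed only for defendants the judge released; for detained defendants it is
# an unknown counterfactual, so those rows cannot serve as labeled training data.
import pandas as pd

cases = pd.DataFrame({
    "prior_offenses":  [0, 2, 5, 1, 0],
    "charge_severity": [1, 2, 3, 2, 1],                   # e.g. 1 = minor ... 3 = serious
    "released":        [True, True, False, True, False],
    "fled":            [False, True, None, False, None],  # None = never observed
})

# Only released defendants have an observed label, so only they enter the training pool.
train_pool = cases[cases["released"]]
X = train_pool[["prior_offenses", "charge_severity"]]
y = train_pool["fled"].astype(int)
```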

Another point to note is that the algorithm will primarily “learn” which characteristics of the defendant and their past history best predict whether the defendant is a flight risk – flight risk is the algorithm’s sole focus. In reality, judges may weigh other considerations in their decisions, such as racial equity or the nature of the crime (e.g. whether it was violent).

So, the researchers trained the algorithm on past cases and their outcomes, and found that the algorithm can do a better job of pinpointing whom to jail and whom to release. As with all machine learning algorithms, the algorithm was trained on a “training set”, typically about 70% of all data. If there are 10,000 cases in the past files, about 7,000 of them are used to train the algorithm to spot the characteristics linked to defendants’ flight risk. In each of these 7,000 training examples, the algorithm “sees” the defendant’s characteristics as well as the outcome (the label) – whether the defendant fled or not. Using this, the algorithm learns how characteristics relate to the probability of fleeing (e.g. a defendant with a given characteristic has an X% probability of fleeing). Next, the algorithm is fed the remaining 3,000 records (the “test” data) and asked to predict whether each person in the test set will flee or not. The algorithm’s predictions are then compared to what actually happened. This results in a “confusion matrix” (fig. below): the rows show predicted flee and predicted not flee, and the columns show actual flee and actual not flee. Analyzing this confusion matrix gives us a sense of the algorithm’s accuracy. For instance, we can add up the true positives (cases where the algorithm correctly predicted that someone would flee) and the true negatives (cases where it correctly predicted that someone would not flee), and divide this sum by the total number of cases (which also includes the false positives and false negatives) to get the overall predictive accuracy of the algorithm.
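As a rough illustration of this train/test workflow – not the researchers’ actual code or data; the feature matrix below is synthetic and the model choice is just an example – the pipeline might look something like this in Python with scikit-learn:

```python
# Sketch of the evaluation loop described above: split the data ~70/30, train a
# classifier on the training set, predict on the test set, and summarize the
# results in a confusion matrix and an overall accuracy figure.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# Stand-in data: 10,000 "cases" with 5 numeric features and a 0/1 flee label.
rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 5))
y = (X[:, 0] + rng.normal(size=10_000) > 1.0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

model = GradientBoostingClassifier()
model.fit(X_train, y_train)        # learn flight-risk patterns from ~7,000 cases
y_pred = model.predict(X_test)     # predict flee / not flee for the remaining ~3,000

# Note: scikit-learn puts actual labels on the rows and predictions on the columns.
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
accuracy = (tp + tn) / (tp + tn + fp + fn)
print(f"accuracy = {accuracy:.3f}")
```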

Does the algorithm result in welfare gains? Using human judgment, in the past we jailed some people and freed others; of those who were freed, some fled. If we had used the algorithm instead, would fewer people have fled? Yes, according to the researchers. They quantify the welfare gains as follows: with no change in jailing rates, crime can be reduced by up to 24.8%. That is, if judges jailed, say, 40% of defendants, and we required the algorithm to select 40% of the test cases for jailing, the algorithm did a better job of identifying those who were genuine flight risks. If we used the algorithm to decide whom to jail, fewer of those not jailed would flee than under human judgment.
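A hedged sketch of this comparison is below. The variable names are assumptions of mine, and the paper’s actual policy simulation is far more careful, particularly about the unobserved outcomes of defendants the judges detained.

```python
# Hold the jailing rate fixed at the judges' rate, but choose whom to jail by
# ranking the model's predicted flight risk; then count how many of the
# remaining (released) defendants actually fled.
import numpy as np

def flights_at_fixed_jail_rate(risk_scores: np.ndarray,
                               fled: np.ndarray,
                               judge_jail_rate: float) -> int:
    n_jail = int(round(judge_jail_rate * len(risk_scores)))
    jailed = np.argsort(risk_scores)[::-1][:n_jail]   # highest predicted risk first
    released = np.ones(len(risk_scores), dtype=bool)
    released[jailed] = False
    return int(fled[released].sum())                  # flights among those released
```

Comparing this count to the number of flights under the judges’ actual release decisions is, in spirit, how a figure like the 24.8% reduction is arrived at.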

To put it another way, if we keep the crime rate fixed (here, crime is narrowly defined as the number of persons who flee while awaiting trial), then the algorithm allows us to reduce the number of persons jailed by 42%: we can jail fewer people while those who are released flee at the same rate as before. In other words, the algorithm is better at identifying the people who would not flee if released, and we can avoid jailing them.
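The mirror-image calculation, again only an illustrative sketch under the same assumed variable names (and ignoring the counterfactual problem the paper handles explicitly), detains defendants in descending order of predicted risk and stops once the flights among those still released match the judges’ baseline:

```python
# Find the smallest detention rate at which the number of flights among released
# defendants is no worse than what the judges achieved.
import numpy as np

def min_jail_rate_for_fixed_flights(risk_scores: np.ndarray,
                                    fled: np.ndarray,
                                    baseline_flights: int) -> float:
    order = np.argsort(risk_scores)[::-1]     # highest predicted risk first
    flights = int(fled.sum())                 # flights if nobody were detained
    for n_jail, idx in enumerate(order):
        if flights <= baseline_flights:
            return n_jail / len(risk_scores)  # fraction of defendants detained
        flights -= int(fled[idx])             # detaining this person prevents their flight (if any)
    return 1.0
```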