Previously, I wrote an article about the theory (and some applications!) of density estimation, and how it is a powerful tool for a variety of methods in statistical analysis. By overwhelmingly popular demand, I thought it would be interesting to use density estimation to derive some insights from interesting data — in this case, data related to legal theory.
Although it’s great to dive deep into the mathematical details behind statistical methods to form a solid understanding of the algorithms, at the end of the day we want to use these tools to derive cool insights from data!
In this article, we’ll use density estimation to analyze data regarding the impact of a two-verdict vs. a three-verdict system on jurors’ perceived confidence in their final verdicts.
Background & Dataset
Our legal system in the US uses a two-option verdict system (guilty/not guilty) in criminal trials. However, some other countries, specifically Scotland, use a three-verdict system (guilty/not guilty/not proven) to determine the fate of a defendant. In this three-verdict system, jurors have the additional choice to choose a verdict of “not proven”, which means that the prosecution has delivered insufficient evidence to determine whether the defendant is guilty or innocent.
Legally, the “not proven” and “not guilty” verdicts are equivalent, as the defendant is acquitted under either outcome. However, the two verdicts carry different semantic meanings: “not proven” is intended to be selected by jurors when they are not convinced that the defendant is either culpable of or innocent of the crime at hand.
Scotland has recently abolished this third verdict due to its confusing nature. Indeed, when reading about this myself, I came across conflicting definitions of the verdict. Some sources defined it as the option to select when the juror believes the defendant is culpable, but the prosecution has failed to deliver sufficient evidence to convict; under this reading, a defendant acquitted via the “not proven” outcome may carry a stigma in the eyes of the public similar to that of a defendant found guilty. In contrast, other sources defined the verdict as a middle ground between guilt and innocence (confusing!).
In this article, we’ll analyze data containing the perceived confidence of verdicts from mock jurors under the two-option and three-option verdict system. The data also contains information regarding whether there was conflicting evidence present in the testimony. These features will allow us to investigate whether the perceived confidence levels of jurors in their final verdicts differ depending on the verdict system and/or the presence of conflicting evidence.
For more information about the data, check out the doc.
Density Estimation for Exploratory Analysis
Without further ado, let’s dive into the data!
# load the packages used throughout this analysis:
# ggplot2 for plotting, sm for the density comparisons later on
library(ggplot2)
library(sm)
mock <- read.csv("data/MockJurors.csv")
summary(mock)

Our data consists of 104 observations and three variables of interest. Each observation corresponds to a mock juror’s verdict. The three variables we’re interested in are described below:
- verdict: whether the juror’s decision was made under the two-option or three-option verdict system.
- conflict: whether conflicting testimonial evidence was present in the trial.
- confidence: the juror’s degree of confidence in their verdict on a scale from 0 to 1, where 0 and 1 correspond to low and high confidence, respectively.
Let’s take a brief look at each of these individual features.
# barplot of verdict
ggplot(mock, aes(x = verdict, fill = verdict)) +
  geom_bar() +
  geom_text(stat = "count", aes(label = after_stat(count)), vjust = -0.5) +
  labs(title = "Count of Verdicts") +
  theme(plot.title = element_text(hjust = 0.5))
# barplot of conflict
ggplot(mock, aes(x = conflict, fill = conflict)) +
  geom_bar() +
  geom_text(stat = "count", aes(label = after_stat(count)), vjust = -0.5) +
  labs(title = "Count of Conflict Levels") +
  theme(plot.title = element_text(hjust = 0.5))
# crosstab: verdict & conflict
# i.e. distribution of conflicting evidence across verdict levels
ggplot(mock, aes(x = verdict, fill = conflict)) +
  geom_bar(position = "dodge") +
  geom_text(
    stat = "count",
    aes(label = after_stat(count)),
    position = position_dodge(width = 0.9),
    vjust = -0.5
  ) +
  labs(title = "Verdict and Conflict") +
  theme(plot.title = element_text(hjust = 0.5))



The observations are evenly split among the verdict levels (52/52) and nearly evenly split across the conflict factor (53 no, 51 yes). Additionally, the distribution of conflict appears to be evenly split across both levels of verdict, i.e. there are approximately an equal number of verdicts made under conflicting/no conflicting evidence recorded for both verdict systems. Thus, we can proceed to compare the distribution of confidence levels across these groups without worrying about imbalanced data affecting the quality of our distribution estimates.
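As a quick numeric check on these counts, a base R cross-tabulation gives the same picture:
# cross-tabulate verdict system against presence of conflicting evidence
table(mock$verdict, mock$conflict)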
Let’s look at the distribution of juror confidence levels.
We can visualize the distribution of confidence levels using density estimates. Density estimates can provide a clear, intuitive display of a variable’s distribution, especially when working with large amounts of data. However, the estimate may vary considerably depending on a few parameters. For instance, let’s look at the density estimates produced by various bandwidth selection methods.
bws <- list("SJ", "ucv", "nrd", "nrd0")
# set up a 2x2 grid for plotting (2 rows, 2 columns)
par(mfrow = c(2, 2))
for (bw in bws) {
  pdf_est <- density(mock$confidence, bw = bw, from = 0, to = 1)
  # plot the density estimate, with a rug of the raw observations
  plot(pdf_est,
       main = paste0("Density Estimate: Confidence (", bw, ")"),
       xlab = "Confidence",
       ylab = "Density",
       col = "blue",
       lwd = 2)
  rug(mock$confidence)
  # polygon(pdf_est, col = rgb(0, 0, 1, 0.2), border = NA)
  grid()
}
# reset plotting layout back to default (optional)
par(mfrow = c(1, 1))

The density estimates produced by the Sheather-Jones, unbiased cross-validation, and normal reference distribution methods are pictured above.
Clearly, the choice of bandwidth can give us a very different picture of the confidence level distribution.
- Using unbiased cross-validation gives the impression that the distribution of confidence is very sparse, which is not surprising considering how small our dataset is (104 observations).
- The density estimates produced by the other bandwidths are fairly similar. The estimates produced by the normal reference distribution methods appear slightly smoother than the one produced by Sheather-Jones, since the normal reference methods derive their bandwidth by assuming the underlying distribution is approximately Gaussian, which tends to yield larger (smoother) bandwidths. Overall, confidence levels appear to be highly concentrated around values of 0.6 or greater, and the distribution has a heavy left tail.
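To make these differences concrete, we can also print the numeric bandwidth each selector chooses — a minimal sketch using the corresponding selectors from base R’s stats package:
# numeric bandwidths chosen by each selector for confidence
sapply(list(SJ = bw.SJ, ucv = bw.ucv, nrd = bw.nrd, nrd0 = bw.nrd0),
       function(f) f(mock$confidence))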
Now, let’s get into the interesting part and examine how juror confidence levels may change depending on the presence of conflicting evidence and the verdict system.
# plot distribution of confidence by conflict,
# using the Sheather-Jones bandwidth for the density estimate
ggplot(mock, aes(x = confidence, fill = conflict)) +
  geom_density(alpha = 0.5, bw = bw.SJ(mock$confidence)) +
  labs(title = "Density: Confidence by Conflict") +
  xlab("Confidence") +
  ylab("Density") +
  theme(plot.title = element_text(hjust = 0.5))

It appears that juror confidence levels do not differ much in the presence of conflicting evidence, as shown by the large overlap in the confidence density estimates above. Perhaps jurors are slightly more confident in their verdicts when no conflicting evidence is present, as the density estimate under no conflict shows a slightly higher concentration of confidence values greater than 0.8 relative to the estimate under conflicting evidence. However, the two distributions appear nearly identical.
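As a rough numeric companion to the plot, we can compare the quartiles of confidence within each conflict group:
# quartiles of confidence, split by presence of conflicting evidence
tapply(mock$confidence, mock$conflict, quantile, probs = c(0.25, 0.5, 0.75))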
Let’s examine whether juror confidence levels vary across two-option vs. three-option verdict systems.
# plot distribution of confidence by verdict,
# using the Sheather-Jones bandwidth for the density estimate
ggplot(mock, aes(x = confidence, fill = verdict)) +
  geom_density(alpha = 0.5, bw = bw.SJ(mock$confidence)) +
  labs(title = "Density: Confidence by Verdict") +
  xlab("Confidence") +
  ylab("Density") +
  theme(plot.title = element_text(hjust = 0.5))

This visual provides more compelling evidence to suggest that confidence levels are not identically distributed across the two verdict systems. It appears that jurors may be slightly less confident in their verdicts under the two-option verdict system relative to the three-option system. This is supported by the fact that the distributions of confidence under the two-option and three-option verdict systems appear to peak around 0.625 and 0.875, respectively. However, there is still significant overlap in the confidence distributions for both verdict systems, so we would need to formally test our claim to conclude whether confidence levels differ significantly across these verdict systems.
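Before moving on, a quick look at the group medians offers a simple numeric complement to the density plot:
# median confidence under each verdict system
aggregate(confidence ~ verdict, data = mock, FUN = median)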
Let’s examine whether the distribution of confidence differs across joint levels of verdict and conflict.
# plot distribution of confidence by conflict, faceted by verdict,
# using the Sheather-Jones bandwidth for the density estimate
ggplot(mock, aes(x = confidence, fill = conflict)) +
  geom_density(alpha = 0.5, bw = bw.SJ(mock$confidence)) +
  facet_wrap(~ verdict) +
  labs(title = "Density: Confidence by Conflict & Verdict") +
  xlab("Confidence") +
  ylab("Density") +
  theme(plot.title = element_text(hjust = 0.5))

Analyzing the distribution of confidence stratified by conflict and verdict gives us some interesting insights.
- Under the two-verdict system, confidence levels of verdicts made under conflicting/no conflicting evidence appear to be very similar. That is, jurors seem to be equally confident in their verdicts in the face of conflicting evidence when working under the traditional guilty/not guilty judgment paradigm.
- In contrast, under the three-option verdict system, jurors seem to be more confident in their verdicts when no conflicting evidence is present. The corresponding density plots show that verdicts with no conflicting evidence have a much higher concentration at high confidence levels (confidence > 0.75) compared to verdicts made with conflicting evidence. Furthermore, there are almost no verdicts made in the absence of conflicting evidence where the juror reported a confidence level below 0.2. In contrast, in the presence of conflicting evidence, there is a much larger concentration of verdicts with low confidence levels (confidence < 0.25).
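We can attach rough numbers to these observations by computing the share of high-confidence verdicts (using the 0.75 cutoff from above) in each verdict-by-conflict cell:
# proportion of high-confidence (> 0.75) verdicts in each subgroup
aggregate(confidence ~ verdict + conflict, data = mock,
          FUN = function(x) mean(x > 0.75))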
Formally Testing Distributional Differences
Our exploratory data analysis showed that juror confidence levels may differ depending on the verdict system and whether there was conflicting evidence. Let’s formally test this by comparing the confidence densities stratified by these factors.
We will carry out tests to compare the distribution of confidence in the following settings (as we did above in a qualitative manner):
- Distribution of confidence across levels of conflict.
- Distribution of confidence across levels of verdict.
- Distribution of confidence across levels of conflict and verdict.
First, let’s compare the distribution of confidence in the presence of conflicting/no conflicting evidence. We can compare the confidence distributions across these conflict levels using the sm.density.compare() function that’s provided as part of the sm package. To carry out this test, we can specify the following key parameters:
- x: vector of data whose density we want to model. For our purposes, this will be confidence.
- group: the factor over which to compare the density of x. For this example, this will be conflict.
- model: setting this to "equal" will conduct a hypothesis test determining whether the distribution of confidence differs across levels of conflict.
Additionally, we will establish a common bandwidth for the density estimates of confidence across the levels of conflict. We’ll do this by computing the Sheather-Jones bandwidth for the confidence levels within each conflict subgroup, taking the harmonic mean of these bandwidths, and using that as the bandwidth for our density comparison.
For all of the hypothesis tests below, we will use the standard α = 0.05 criterion for statistical significance.
set.seed(123)
# define subsets for conflict
no_conflict <- subset(mock, conflict == "no")
yes_conflict <- subset(mock, conflict == "yes")
# compute Sheather-Jones bandwidth for each subset
bw_n <- bw.SJ(no_conflict$confidence)
bw_y <- bw.SJ(yes_conflict$confidence)
bw_h <- 2 / ((1 / bw_n) + (1 / bw_y)) # harmonic mean
# compare densities across conflict levels (bootstrap test of equality)
sm.density.compare(x = mock$confidence,
                   group = mock$conflict,
                   model = "equal",
                   bw = bw_h,
                   nboot = 10000)

The output of our call to sm.density.compare() produces the p-value of the hypothesis test mentioned above, as well as a graphical display overlaying the density curves of confidence across both levels of conflict. The large p-value (p=0.691) suggests that we have insufficient evidence to reject the null hypothesis that the densities of confidence for conflict/no-conflict are equal. In other words, this suggests that jurors in our dataset tend to have similar confidence in their verdicts, regardless of whether there was conflicting evidence in the testimony.
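As a complementary check that isn’t part of the sm workflow, a two-sample Kolmogorov-Smirnov test compares the empirical CDFs of the two groups directly. This is a different test than the one above, offered here only as a sketch (note that ks.test will warn if the data contain ties):
# two-sample KS test of confidence across conflict levels
ks.test(no_conflict$confidence, yes_conflict$confidence)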
Now, we’ll conduct a similar analysis to formally compare juror confidence levels across both verdict systems.
set.seed(123)
# define subsets for verdict
two_verdict <- subset(mock, verdict == "two-option")
three_verdict <- subset(mock, verdict == "three-option")
# compute Sheather-Jones bandwidth for each subset
bw_2 <- bw.SJ(two_verdict$confidence)
bw_3 <- bw.SJ(three_verdict$confidence)
bw_h <- 2 / ((1 / bw_2) + (1 / bw_3)) # harmonic mean
# compare densities across verdict systems (bootstrap test of equality)
sm.density.compare(x = mock$confidence,
                   group = mock$verdict,
                   model = "equal",
                   bw = bw_h,
                   nboot = 10000)

We see that the p-value associated with the comparison of confidence across the two-verdict vs. three-verdict system is much smaller (p=0.069). Although we still fail to reject the null hypothesis, a p-value of 0.069 in this context means that if the true distribution of confidence levels were identical for the two-verdict and three-verdict systems, there would be roughly a 7% chance of observing empirical data in which the distributions of confidence across the two systems differ at least as much as they do here. In other words, our empirical data would be fairly unlikely if jurors were equally confident in their verdicts across both verdict systems.
This conclusion aligns with what we saw in our qualitative analysis above, where it appeared that the confidence levels for verdicts under the two-verdict vs. three-verdict system were different — specifically, verdicts under the three-verdict system seemed to be made with higher confidence than verdicts made under two-verdict systems.
Now, for the purposes of future investigation, it would be great to extend the data to include the final verdict decision (i.e. guilty/not guilty/not proven). Perhaps, this additional data could help shed light on how jurors truly see the “not proven” verdict.
- If we see higher confidence levels in the “guilty”/“not guilty” verdicts under the three-verdict system relative to the two-verdict system, this may suggest that the “not proven” verdict is effectively capturing the uncertainty behind the decision making of the jurors, and that having it as a third verdict provides desirable flexibility that the two-option verdict system lacks.
- If the confidence levels in the “guilty”/“not guilty” verdicts are approximately equal across both verdict systems, and the confidence levels of all three verdicts are approximately equal in the three-verdict system, then this may suggest that the “not proven” verdict is serving as a true third option independent of the typical binary verdicts. That is, jurors are opting to choose “not proven” primarily for reasons other than their uncertainty about classifying the defendant as guilty/not guilty. Perhaps jurors view “not proven” as the verdict to choose when the prosecution has failed to deliver convincing evidence, even when the juror has a sense of the defendant’s true culpability.
Lastly, let’s test whether there are any differences in the distribution of confidence across different levels of conflict and verdict.
To test for differences in the distribution of confidence across these subgroups, we can run a Kruskal-Wallis test. The Kruskal-Wallis test is a non-parametric statistical method for testing whether a variable of interest is distributed differently across groups. It is appropriate when you want to avoid making assumptions about the variable’s distribution (hence non-parametric), the variable is at least ordinal in nature, and the subgroups under comparison are independent of each other. Essentially, you can think of it as the non-parametric counterpart of a one-way ANOVA.
R makes this easy for us via the kruskal.test() API. We can specify the following parameters to carry out our test:
- x: vector of data whose distribution we want to compare across groups. For our purposes, this will be confidence.
- g: factor identifying the groups over which we want to compare the distribution of x. We will set this to group_combo, which contains the subgroups of verdict and conflict.
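Note that the raw data doesn’t include a group_combo column, so we have to construct it ourselves. Presumably it is just the combination of verdict and conflict, which base R’s interaction() builds directly — a minimal sketch:
# build the combined grouping factor: one level per verdict-by-conflict cell
mock$group_combo <- interaction(mock$verdict, mock$conflict)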
kruskal.test(x = mock$confidence,
             g = mock$group_combo) # group_combo: subgroups defined by verdict & conflict

The output of the Kruskal-Wallis test (p=0.189) suggests that we lack sufficient evidence to claim that juror confidence levels differ across levels of verdict and conflict.
This is somewhat unexpected, as our qualitative analysis seemed to suggest that partitioning each verdict group by conflict segmented the confidence values in a meaningful way. It’s worth noting that each of these subgroups contained only a small amount of data (25-27 observations), so collecting more data could be a natural next step for investigating this further.
Future Investigation & Wrap-up
Let’s briefly recap the results of our analysis:
- Our exploratory data analysis seemed to indicate that juror confidence levels differed across verdict systems. Furthermore, the presence of conflicting evidence seemed to affect juror confidence levels in the three-verdict system, but to have little effect in the two-verdict system. However, none of our statistical tests provided significant evidence to support these conclusions.
- Although our statistical tests were not supportive, we should not be so quick to dismiss our qualitative analysis. Next steps for this investigation could include getting more data, as we were working with only 104 observations. Additionally, extending our data to include the verdict decisions of the jurors (guilty/not guilty/not proven) could enable further investigation into when jurors opt to choose the “not proven” verdict.
Thanks for reading! If you have any additional thoughts about how you would’ve carried out this analysis, I would love to hear them in the comments. I’m certainly no domain expert on legal theory, so applying statistical methods to legal data was a great learning experience for me, and I’d love to hear about other interesting problems at the intersection of the two fields. If you’re interested in learning more, I highly recommend checking out the sources below!
The author has created all images in this article.
Sources
Data:
Legal theory:
Statistics: