Replicating ProPublica’s COMPAS data analysis with Python

In May 2016, a ProPublica article (Angwin et al. 2016) reported that the COMPAS Recidivism Algorithm, used by US courts to predict recidivism, is biased against African Americans. The authors published their analysis code in R. Based on that code, I reproduced the first half of ProPublica’s evaluation of the COMPAS Recidivism Algorithm in Python.

1) COMPAS score and risk of recidivism

Load data

import pandas as pd
import numpy as np
pd.options.mode.chained_assignment = None  # default='warn'

dataURL = 'https://raw.githubusercontent.com/propublica/compas-analysis/master/compas-scores-two-years.csv'
dfRaw = pd.read_csv(dataURL)

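Select fields for severity of charge, number of priors, demographics, age, sex, COMPAS scores, and whether each person was accused of a crime within two years. Then keep only rows where the COMPAS screening fell within 30 days of the arrest, is_recid is not -1, c_charge_degree is not 'O', and score_text is not 'N/A'.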

dfFiltered = (dfRaw[['age', 'c_charge_degree', 'race', 'age_cat', 'score_text', 
             'sex', 'priors_count', 'days_b_screening_arrest', 'decile_score', 
             'is_recid', 'two_year_recid', 'c_jail_in', 'c_jail_out']]
             .loc[(dfRaw['days_b_screening_arrest'] <= 30) & (dfRaw['days_b_screening_arrest'] >= -30), :]
             .loc[dfRaw['is_recid'] != -1, :]
             .loc[dfRaw['c_charge_degree'] != 'O', :]
             .loc[dfRaw['score_text'] != 'N/A', :]
             )
print('Number of rows: {}'.format(len(dfFiltered.index)))

Number of rows: 6172
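The row filters above can also be written as one combined boolean mask built on the frame being filtered. The following is a minimal sketch on a made-up toy frame (the values are invented for illustration and are not ProPublica's data); it is an alternative formulation, not the code that produced the row count above.

```python
import pandas as pd

# Toy rows standing in for the ProPublica CSV; all values are made up.
toy = pd.DataFrame({
    'days_b_screening_arrest': [0, 45, -10, 5, None, 12, -30],
    'is_recid': [0, 1, -1, 1, 0, 1, 1],
    'c_charge_degree': ['F', 'M', 'F', 'O', 'M', 'M', 'M'],
    'score_text': ['Low', 'High', 'Medium', 'Low', 'High', 'N/A', 'High'],
})

# One mask, built on the same frame it is applied to.
keep = (
    toy['days_b_screening_arrest'].between(-30, 30)  # inclusive; NaN is False
    & (toy['is_recid'] != -1)
    & (toy['c_charge_degree'] != 'O')
    & (toy['score_text'] != 'N/A')
)
filtered = toy.loc[keep]
print('Number of rows: {}'.format(len(filtered.index)))
```

Only the first and last toy rows pass every condition; each of the others fails exactly one filter.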

Score distribution by race

The distribution of decile scores, which are often presented to judges alongside the risk classification (High, Medium and Low), suggests a disparity. Decile scores for African American defendants show no clear downward trend, unlike those for Caucasian defendants.

COMPAS scores for each defendant ranged from 1 to 10, with ten being the highest risk. Scores 1 to 4 were labeled by COMPAS as “Low”; 5 to 7 were labeled “Medium”; and 8 to 10 were labeled “High.”
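As a small illustration of the labeling rule just described, the sketch below maps decile scores to the three categories with pd.cut. The bin edges are taken from the text; the variable names are my own, and this is not how COMPAS itself assigns labels.

```python
import pandas as pd

deciles = pd.Series(range(1, 11), name='decile_score')

# Bins are right-inclusive: (0, 4] -> Low, (4, 7] -> Medium, (7, 10] -> High
labels = pd.cut(deciles, bins=[0, 4, 7, 10], labels=['Low', 'Medium', 'High'])

for score, label in zip(deciles, labels):
    print(score, label)
```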

Simple cross tabulation of score categories by race

pd.crosstab(dfFiltered['score_text'],dfFiltered['race'])

race	African-American	Asian	Caucasian	Hispanic	Native American	Other
High	845	3	223	47	4	22
Low	1346	24	1407	368	3	273
Medium	984	4	473	94	4	48
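Raw counts are hard to compare because the groups differ in size. A rough sketch of turning the counts above into within-race shares, using only the African-American and Caucasian columns copied from the table:

```python
import pandas as pd

# Counts copied from the cross tabulation above.
counts = pd.DataFrame(
    {'African-American': [845, 1346, 984], 'Caucasian': [223, 1407, 473]},
    index=['High', 'Low', 'Medium'],
)

# Divide each column by its own total to get the share of each category.
shares = counts / counts.sum()
print(shares.round(3))
```

By this calculation, about 42 percent of African-American defendants were labeled Low, compared with about 67 percent of Caucasian defendants. On the full data, passing normalize='columns' to pd.crosstab should give the same shares directly.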

Decile scores that correspond to score categories

pd.crosstab(dfFiltered['score_text'],dfFiltered['decile_score'])

decile_score	1	2	3	4	5	6	7	8	9	10
Low	1286	822	647	666	0	0	0	0	0	0
Medium	0	0	0	0	582	529	496	0	0	0
High	0	0	0	0	0	0	0	420	420	304

Histograms of decile scores

%matplotlib inline
import seaborn as sns
import matplotlib.pyplot as plt

sns.set(style='darkgrid')
sns.countplot(x='decile_score', hue='race', data=dfFiltered.loc[
                (dfFiltered['race'] == 'African-American') | (dfFiltered['race'] == 'Caucasian'),:
            ])

plt.title("Distribution of Decile Scores by Race")
plt.xlabel('Decile Score')
plt.ylabel('Count')

[Figure: Distribution of Decile Scores by Race (count of defendants per decile score, African-American vs. Caucasian)]

Regression analysis - logistic regression

Here, I transform categorical data into dummy variables and run a logistic regression that considers race, age, criminal history, future recidivism, charge degree and gender. The ‘High’ and ‘Medium’ categories are combined, following the ProPublica analysis.

import statsmodels.api as sm
from statsmodels.formula.api import logit
catCols = ['score_text','age_cat','sex','race','c_charge_degree']
dfFiltered.loc[:,catCols] = dfFiltered.loc[:,catCols].astype('category')

# dfDummies = pd.get_dummies(data = dfFiltered.loc[dfFiltered['score_text'] != 'Low',:], columns=catCols)
# dtype=int keeps the dummies numeric (newer pandas versions default to bool)
dfDummies = pd.get_dummies(data = dfFiltered, columns=catCols, dtype=int)

# Clean column names
new_column_names = [col.strip().lower().replace(" ", "_").replace("-", "_") for col in dfDummies.columns]
dfDummies.columns = new_column_names

# We want another variable that combines Medium and High
dfDummies['score_text_medhi'] = dfDummies['score_text_medium'] + dfDummies['score_text_high']
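To make the encoding step concrete, here is a minimal sketch on a made-up four-row frame (the rows are invented for illustration). It applies the same three steps: dummy encoding, column-name cleaning, and combining Medium and High into one 0/1 outcome. The dtype=int argument is my assumption to keep the dummies as integers regardless of pandas version.

```python
import pandas as pd

# Made-up rows; only the column names mirror the real data.
toy = pd.DataFrame({
    'score_text': ['Low', 'Medium', 'High', 'Low'],
    'race': ['African-American', 'Caucasian', 'Caucasian', 'African-American'],
    'priors_count': [0, 2, 5, 1],
})

# One 0/1 column per category level.
dummies = pd.get_dummies(toy, columns=['score_text', 'race'], dtype=int)

# 'race_African-American' becomes 'race_african_american', and so on.
dummies.columns = [c.strip().lower().replace(' ', '_').replace('-', '_')
                   for c in dummies.columns]

# A defendant is in exactly one category, so the sum is still 0 or 1.
dummies['score_text_medhi'] = dummies['score_text_medium'] + dummies['score_text_high']
print(dummies[['score_text_low', 'score_text_medhi', 'race_african_american']])
```

Because each row belongs to exactly one score category, score_text_medhi is simply the complement of score_text_low.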

Logistic regression

# R-style specification
formula = 'score_text_medhi ~ sex_female + age_cat_greater_than_45 + age_cat_less_than_25 + race_african_american + race_asian + race_hispanic + race_native_american + race_other + priors_count + c_charge_degree_m + two_year_recid'

score_mod = logit(formula, data = dfDummies).fit()
print(score_mod.summary())

Optimization terminated successfully.
         Current function value: 0.499708
         Iterations 6
                           Logit Regression Results                           
==============================================================================
Dep. Variable:       score_text_medhi   No. Observations:                 6172
Model:                          Logit   Df Residuals:                     6160
Method:                           MLE   Df Model:                           11
Date:                Sat, 21 Nov 2020   Pseudo R-squ.:                  0.2729
Time:                        16:26:02   Log-Likelihood:                -3084.2
converged:                       True   LL-Null:                       -4241.7
Covariance Type:            nonrobust   LLR p-value:                     0.000
===========================================================================================
                              coef    std err          z      P>|z|      [0.025      0.975]
-------------------------------------------------------------------------------------------
Intercept                  -1.5255      0.079    -19.430      0.000      -1.679      -1.372
sex_female                  0.2213      0.080      2.783      0.005       0.065       0.377
age_cat_greater_than_45    -1.3556      0.099    -13.682      0.000      -1.550      -1.161
age_cat_less_than_25        1.3084      0.076     17.232      0.000       1.160       1.457
race_african_american       0.4772      0.069      6.881      0.000       0.341       0.613
race_asian                 -0.2544      0.478     -0.532      0.595      -1.192       0.683
race_hispanic              -0.4284      0.128     -3.344      0.001      -0.680      -0.177
race_native_american        1.3942      0.766      1.820      0.069      -0.107       2.896
race_other                 -0.8263      0.162     -5.098      0.000      -1.144      -0.509
priors_count                0.2689      0.011     24.221      0.000       0.247       0.291
c_charge_degree_m          -0.3112      0.067     -4.677      0.000      -0.442      -0.181
two_year_recid              0.6859      0.064     10.713      0.000       0.560       0.811
===========================================================================================

Black defendants were 45.3 percent more likely than white defendants to receive a higher score, controlling for prior crimes and future recidivism as well as the other covariates in the model.

control = np.exp(-1.5255) / (1 + np.exp(-1.5255))
np.exp(0.4772) / (1 - control + (control * np.exp(0.4772)))

1.452825407001621

Female defendants were 19.5 percent more likely than male defendants to get a higher score.

np.exp(0.2213) / (1 - control + (control * np.exp(0.2213)))

1.1948243807769987
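Both ratios above come from the same conversion: the coefficient is exponentiated into an odds ratio, which is then rescaled into a ratio of probabilities using the baseline probability implied by the intercept. That baseline is the predicted probability of a Medium or High score for the reference group, which, given the omitted dummy categories in the formula, I take to be a Caucasian male aged 25 to 45 with a felony charge, no priors and no two-year recidivism. The helper below is a sketch with a function name of my own choosing, not part of the original analysis.

```python
import math

def risk_ratio(intercept, coef):
    """Convert a logit coefficient into a ratio of predicted probabilities.

    Hypothetical helper: compares a group with the dummy switched on against
    the reference group, holding all other covariates at their baseline.
    """
    # Baseline probability for the reference group, exp(b0) / (1 + exp(b0)).
    p0 = 1 / (1 + math.exp(-intercept))
    odds_ratio = math.exp(coef)
    # Rescale the odds ratio into a probability ratio.
    return odds_ratio / (1 - p0 + p0 * odds_ratio)

# Intercept and coefficients copied from the regression table above.
black_vs_white = risk_ratio(-1.5255, 0.4772)
female_vs_male = risk_ratio(-1.5255, 0.2213)
print(round(black_vs_white, 4), round(female_vs_male, 4))
```

Note that the ratio depends on the baseline: the same coefficient gives a different percentage for a reference group with a higher or lower starting probability.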

2) COMPAS score and risk of “violent” recidivism

ProPublica authors followed the FBI’s definition of violent crime, a category that includes murder, manslaughter, forcible rape, robbery and aggravated assault.

Load data

dataURL = 'https://raw.githubusercontent.com/propublica/compas-analysis/master/compas-scores-two-years-violent.csv'
dfRaw = pd.read_csv(dataURL)

dfFiltered = (dfRaw[['age', 'c_charge_degree', 'race', 'age_cat', 'v_score_text', 
             'sex', 'priors_count', 'days_b_screening_arrest', 'v_decile_score', 
             'is_recid', 'two_year_recid', 'c_jail_in', 'c_jail_out']]
             .loc[(dfRaw['days_b_screening_arrest'] <= 30) & (dfRaw['days_b_screening_arrest'] >= -30), :]
             .loc[dfRaw['is_recid'] != -1, :]
             .loc[dfRaw['c_charge_degree'] != 'O', :]
             .loc[dfRaw['v_score_text'] != 'N/A', :]
             )
print('Number of rows: {}'.format(len(dfFiltered.index)))

Number of rows: 4020

Score distribution by race

sns.set(style='darkgrid')
sns.countplot(x='v_decile_score', hue='race', data=dfFiltered.loc[
                (dfFiltered['race'] == 'African-American') | (dfFiltered['race'] == 'Caucasian'),:
            ])

plt.title("Distribution of Violent Decile Scores by Race")
plt.xlabel('Decile Score')
plt.ylabel('Count')

[Figure: Distribution of Violent Decile Scores by Race]

COMPAS violent risk scores also show a disparity in distribution between white and black defendants.

catCols = ['v_score_text','age_cat','sex','race','c_charge_degree']
dfFiltered.loc[:,catCols] = dfFiltered.loc[:,catCols].astype('category')

# dfDummies = pd.get_dummies(data = dfFiltered.loc[dfFiltered['v_score_text'] != 'Low',:], columns=catCols)
# dtype=int keeps the dummies numeric (newer pandas versions default to bool)
dfDummies = pd.get_dummies(data = dfFiltered, columns=catCols, dtype=int)

# Clean column names
new_column_names = [col.strip().lower().replace(" ", "_").replace("-", "_") for col in dfDummies.columns]
dfDummies.columns = new_column_names

# We want another variable that combines Medium and High
dfDummies['v_score_text_medhi'] = dfDummies['v_score_text_medium'] + dfDummies['v_score_text_high']

Regression analysis - logistic regression

formula = 'v_score_text_medhi ~ sex_female + age_cat_greater_than_45 + age_cat_less_than_25 + race_african_american + race_asian + race_hispanic + race_native_american + race_other + priors_count + c_charge_degree_m + two_year_recid'

score_mod = logit(formula, data = dfDummies).fit()
print(score_mod.summary())

Optimization terminated successfully.
         Current function value: 0.372983
         Iterations 7
                           Logit Regression Results                           
==============================================================================
Dep. Variable:     v_score_text_medhi   No. Observations:                 4020
Model:                          Logit   Df Residuals:                     4008
Method:                           MLE   Df Model:                           11
Date:                Sat, 21 Nov 2020   Pseudo R-squ.:                  0.3662
Time:                        16:26:03   Log-Likelihood:                -1499.4
converged:                       True   LL-Null:                       -2365.9
Covariance Type:            nonrobust   LLR p-value:                     0.000
===========================================================================================
                              coef    std err          z      P>|z|      [0.025      0.975]
-------------------------------------------------------------------------------------------
Intercept                  -2.2427      0.113    -19.802      0.000      -2.465      -2.021
sex_female                 -0.7289      0.127     -5.755      0.000      -0.977      -0.481
age_cat_greater_than_45    -1.7421      0.184     -9.460      0.000      -2.103      -1.381
age_cat_less_than_25        3.1459      0.115     27.259      0.000       2.920       3.372
race_african_american       0.6589      0.108      6.093      0.000       0.447       0.871
race_asian                 -0.9852      0.705     -1.397      0.162      -2.368       0.397
race_hispanic              -0.0642      0.191     -0.335      0.737      -0.439       0.311
race_native_american        0.4479      1.035      0.433      0.665      -1.582       2.477
race_other                 -0.2054      0.225     -0.914      0.360      -0.646       0.235
priors_count                0.1376      0.012     11.854      0.000       0.115       0.160
c_charge_degree_m          -0.1637      0.098     -1.669      0.095      -0.356       0.029
two_year_recid              0.9345      0.115      8.107      0.000       0.709       1.160
===========================================================================================

Black defendants were 77.4 percent more likely than white defendants to receive a higher violent risk score, controlling for prior crimes and future recidivism as well as the other covariates in the model.

control = np.exp(-2.2427) / (1 + np.exp(-2.2427))
np.exp(0.6589) / (1 - control + (control * np.exp(0.6589)))

1.7738715321327136