Fairness and AI

More and more companies are automating processes with the help of ML. This has tremendous advantages, since an algorithm can scale in an almost unlimited fashion and, when rich datasets are available, can often detect patterns in data that humans cannot discover. As a well-known illustration, think about how DeepMind used an ML algorithm to optimize the energy consumption of Google's data centers. When algorithms are used to make important decisions that impact people's lives, such as deciding on medical treatment, granting a loan, or performing risk assessments in parole hearings, it is of paramount importance that the algorithm is fair. However, because models are becoming ever more complicated, this is not easy to assess. As a consequence, the public, legislators and regulators are all aware of this issue; see e.g. the report on algorithmic systems, opportunities and civil rights by the Obama administration and (in Europe) recital 71 of the GDPR.

In order to detect bias or unfairness, it is important to come up with a proper (mathematical) definition so that deviations can be measured. Below we describe a few alternatives.

Let us assume we are building a model to determine whether somebody may receive a loan. Typically, such a model will use information (so-called attributes) such as your credit history, marital status, education, profession, etc. in order to estimate the probability that you will be able to pay off your debts. The dataset will also include historical data on defaults (i.e. people who were not able to repay their loan). Now, given the above, the financial institution wants to make sure that the algorithm is fair with respect to gender, skin colour, etc. These are called protected attributes.

A traditional solution is to simply remove these protected attributes from the dataset altogether. Of course, when testing such a model, simply changing the value of any of the protected attributes is not going to impact the output of the model. However, through so-called redundant encodings an algorithm might be able to guess the value of the protected attributes from other information. As an example, let us assume we train our credit model on a dataset representing the people of Verona. The dataset has a protected attribute called ancestor that can take two values (Capulet and Montague). If people whose ancestor attribute equals Capulet live primarily in the Eastern part of the city while the Montagues live in the Western half, then our algorithm can infer the ancestor attribute (statistically) by looking at the address. Hence, simply removing the ancestor attribute from the dataset does not help.
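To make this concrete, one simple check is to see how well the protected attribute can be predicted from the remaining features. The sketch below assumes the data sits in a pandas DataFrame with a column named ancestor (mirroring the example above) and uses scikit-learn; it is an illustration, not a complete leakage analysis.

```python
# Sketch: detect redundant encodings by predicting the protected attribute
# from the remaining features. Column names are illustrative assumptions.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def proxy_score(df: pd.DataFrame, protected: str = "ancestor") -> float:
    """Cross-validated accuracy of predicting the protected attribute from
    all other columns; a value well above the base rate signals leakage."""
    X = pd.get_dummies(df.drop(columns=[protected]))  # one-hot encode features such as the address
    y = df[protected]
    clf = LogisticRegression(max_iter=1000)
    return cross_val_score(clf, X, y, cv=5, scoring="accuracy").mean()

# A score close to 1.0 means that, e.g., the address effectively encodes the
# Capulet/Montague split, so dropping the attribute alone is not enough.
```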

Another approach is so-called demographic parity. In that case, one requires that membership of a protected group is uncorrelated with the output of the algorithm. Let us assume that the ground truth is given by a binary target variable Y, with Y = 1 indicating that the loan should be granted, and that the protected (binary) attribute is called A. The forecast of the model is called Z. Demographic parity then means that

Pr[Z=1 | A=0] = Pr[Z=1 | A=1]
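As a quick illustration, demographic parity can be measured directly from model outputs. The minimal sketch below assumes binary NumPy arrays z (model decisions) and a (protected attribute); the sample data is purely illustrative.

```python
import numpy as np

def demographic_parity_gap(z: np.ndarray, a: np.ndarray) -> float:
    """Absolute difference between Pr[Z=1 | A=0] and Pr[Z=1 | A=1]."""
    p0 = z[a == 0].mean()  # acceptance rate in group A=0
    p1 = z[a == 1].mean()  # acceptance rate in group A=1
    return abs(p0 - p1)

# Illustrative data: a gap close to 0 means demographic parity holds.
z = np.array([1, 0, 1, 1, 0, 1, 0, 0])
a = np.array([0, 0, 0, 0, 1, 1, 1, 1])
print(demographic_parity_gap(z, a))  # |3/4 - 1/4| = 0.5
```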

This notion, however, is also flawed. A key issue with the approach is that if the ground truth (i.e. default in our example) does depend on the protected attribute, the perfect predictor (i.e. Z = Y) is ruled out, and the utility and predictive power of the model are therefore reduced. Moreover, by requiring demographic parity, the model has to yield (on average) the same outcome for the different values of the protected attribute. In our example, demographic parity would imply that the model has to refuse good candidates from one category and accept bad candidates from the other category in order to reach the same average level. Concretely, assuming that the Montagues have historically defaulted on 3% of their loans while the Capulets only on 1%, demographic parity would typically work to the disadvantage of the Capulets.

A more subtle definition of fairness, known as equalized odds, was proposed by Hardt et al.:

Pr[Z=1 | Y=y, A=0] = Pr[Z=1 | Y=y, A=1]

for y = 0, 1. In other words, the model has to be right (or wrong) equally often for either value of the protected attribute. This definition incentivises more accurate models, and the property is called oblivious because it depends only on the joint distribution of A, Y and Z.
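This criterion can be checked in the same way as demographic parity, by comparing acceptance rates within each ground-truth class. The sketch below again assumes binary arrays y, z and a.

```python
import numpy as np

def equalized_odds_gaps(y: np.ndarray, z: np.ndarray, a: np.ndarray) -> dict:
    """For y in {0, 1}, the absolute difference between
    Pr[Z=1 | Y=y, A=0] and Pr[Z=1 | Y=y, A=1]."""
    gaps = {}
    for label in (0, 1):
        p0 = z[(y == label) & (a == 0)].mean()
        p1 = z[(y == label) & (a == 1)].mean()
        gaps[label] = abs(p0 - p1)
    return gaps

# Both gaps close to 0 mean the model is right (and wrong) equally often
# for the Capulets and the Montagues.
```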

In this short post, we have reviewed a few different notions of fairness, focussing mostly on the simplest case where the protected attribute and the ground truth are both binary variables. The oblivious property introduced above is an interesting approach that can be computed easily. The simple (binary) set-up we discussed can be extended to the multinomial and continuous cases and, as such, can serve as an accurate test of fairness in algorithms.

AI and model validation

Model validation is a labor-intensive profession that requires specialists who understand both quantitative finance and business practices. Validating a single valuation model typically requires between 3 and 6 weeks of hard work covering many different topics (see the Global Model Practice Survey 2014 by Deloitte). In this first article, I would like to highlight a few ideas on how to automate such analyses.

Data

Models need data. Hence, a large portion of the time in model validation is spent on managing data. To give a few examples, the validator has to make sure that the input data is of good quality, that the test data is sufficiently rich and that there are processes in place to deal with data issues. Broadly speaking, input data comes in two flavours: time series and multi-dimensional data that is not indexed primarily by time.

Time series

Time series are omnipresent in finance. Valuation models typically use quotes (standardized prices of standardized contracts) as input data. Since the market changes continuously, time series of quotes are abundant. To determine the quality of quotes, one can run anomaly detection models on historical data (i.e. on time series). Many such algorithms exist. Simple ones use a time series decomposition such as STL to subtract the predictable part and then study the distribution of the remaining residual to determine the likelihood of an outlier. More modern alternatives use machine learning algorithms. In one approach, a model is trained to detect anomalies using labeled datasets (such as those in the ODDS database). Another family of solutions uses an ML model to forecast the time series; anomalies are then flagged when the prediction differs significantly from the realization.
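As an illustration of the classical decomposition approach, here is a minimal sketch using the STL implementation in statsmodels. The period and threshold are illustrative choices that depend on the quotes at hand.

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.seasonal import STL

def stl_outliers(quotes: pd.Series, period: int = 5, z_threshold: float = 3.0) -> pd.Series:
    """Flag observations whose STL residual deviates more than z_threshold
    robust standard deviations from the median residual."""
    result = STL(quotes, period=period, robust=True).fit()
    resid = result.resid
    # Use a robust scale estimate (MAD) so that the outliers themselves
    # do not inflate the threshold.
    deviation = np.abs(resid - np.median(resid))
    scale = 1.4826 * np.median(deviation)
    return deviation > z_threshold * scale
```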

Data representativeness

In some cases, the time component is less (or not) important. When building regression or classification models, one often uses a large collection of samples (not indexed by time) with many features. In that case, due to the curse of dimensionality, one would need astronomical amounts of data to cover every possibility. Most often, the data is not spread homogeneously over feature space but is instead grouped into similar-looking clusters. When building a model, it is important to make sure that the data on which the model is going to be used is similar to the data on which it has been trained. Hence, one needs to understand whether a new data point is close to the existing data. There are many clustering algorithms (e.g. DBSCAN) that can perform this analysis generically.
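One possible sketch of such a representativeness check, using DBSCAN from scikit-learn: a new point is considered covered when it lies within eps of a core sample of the training data. The eps and min_samples values are illustrative and strongly data-dependent.

```python
import numpy as np
from sklearn.cluster import DBSCAN

def fit_coverage_model(X_train: np.ndarray, eps: float = 0.5, min_samples: int = 5) -> DBSCAN:
    """Cluster the training data; the core samples describe the regions
    of feature space the model has actually seen."""
    return DBSCAN(eps=eps, min_samples=min_samples).fit(X_train)

def is_representative(db: DBSCAN, x_new: np.ndarray) -> bool:
    """A new point is covered if it lies within eps of at least one core
    sample, i.e. it would not be treated as noise by the fitted clustering."""
    distances = np.linalg.norm(db.components_ - x_new, axis=1)
    return bool((distances <= db.eps).any())
```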

Model dependencies

A model inventory is a necessary tool in a model governance process. On top of keeping track of all the models used, it is extremely valuable to also store the dependencies between models. For instance, when computing VaR one needs the historical PnL of all portfolios; if these portfolios contain derivatives, we need models to compute their net present value. All the models used within an enterprise can be represented as a graph in which the nodes are the models and the edges represent the data flowing between them. Understanding this topology makes it easier to trace back model issues.

Figure: model dependency graph
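As a sketch of how such a dependency graph could be stored and queried, one can use a directed graph, e.g. with networkx. The model names below are illustrative.

```python
import networkx as nx

# Directed graph: nodes are models, an edge A -> B means that model B
# consumes data produced by model A.
dependencies = nx.DiGraph()
dependencies.add_edge("swap_pricer", "var_model", data="historical PnL of swap portfolios")
dependencies.add_edge("option_pricer", "var_model", data="historical PnL of option portfolios")
dependencies.add_edge("var_model", "capital_model", data="VaR figures")

# If an issue is found in a pricer, all downstream models that may be
# affected can be listed immediately.
print(nx.descendants(dependencies, "option_pricer"))  # {'var_model', 'capital_model'}

# A topological order gives a safe sequence for recomputation or revalidation.
print(list(nx.topological_sort(dependencies)))
```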

Benchmarking

ML can help with benchmarking models. First of all, in the case of forecasting models, we can easily train alternative algorithms on the realizations. This helps to understand the impact of changing the underlying assumptions of the model.

However, even when such realizations are not available, we can still train a model to mimic the algorithm that is used in production. We can use such surrogate models (see e.g. this paper from our colleagues at IMEC) to detect changes in behaviour. As an example, suppose we are monitoring an XVA model. We can train an ML algorithm to forecast the change in the XVA amounts of a portfolio when the market data changes. Such a model can be used to detect e.g. instabilities in the XVA computation. An additional benefit of this approach is that one can estimate sensitivities from the calibrated surrogate.
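A minimal sketch of such a surrogate, assuming we have collected historical market data moves X and the corresponding XVA changes y produced by the production model; the choice of gradient boosting is illustrative.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

def fit_surrogate(X: np.ndarray, y: np.ndarray) -> GradientBoostingRegressor:
    """Train a surrogate that mimics the production XVA model:
    market data changes in, XVA changes out."""
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
    surrogate = GradientBoostingRegressor().fit(X_train, y_train)
    # A sudden drop in out-of-sample fit over time is a warning sign that the
    # production model's behaviour has changed (e.g. instabilities in the computation).
    print("R^2 on held-out moves:", surrogate.score(X_test, y_test))
    return surrogate

# Sensitivities can be estimated by bumping one market data input
# and re-evaluating the calibrated surrogate.
```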

Jos Gheerardyn

Welcome to our blog

Welcome to the new Yields blog! Our blog will give us a unique opportunity to share news and updates, while also offering a place for us to interact with our model validation community.
