Advocates of algorithmic justice have begun to see their proverbial “days in court” with legal investigations of enterprises like UHG and Apple Card. The Apple Card case is a strong example of how current anti-discrimination laws fall short of the fast pace of scientific research in the emerging field of quantifiable fairness.
While it may be true that Apple and their underwriters were found innocent of fair lending violations, the ruling came with clear caveats that should be a warning sign to enterprises using machine learning within any regulated space. Unless executives begin to take algorithmic fairness more seriously, their days ahead will be full of legal challenges and reputational damage.
What happened with Apple Card?
In late 2019, startup leader and social media celebrity David Heinemeier Hansson raised an important issue on Twitter, to much fanfare and applause. With almost 50,000 likes and retweets, he asked Apple and their underwriting partner, Goldman Sachs, to explain why he and his wife, who share the same financial ability, would be granted different credit limits. To many in the field of algorithmic fairness, it was a watershed moment to see the issues we advocate go mainstream, culminating in an inquiry from the NY Department of Financial Services (DFS).
At first glance, it may seem heartening to credit underwriters that the DFS concluded in March that Goldman’s underwriting algorithm did not violate the strict rules of financial access created in 1974 to protect women and minorities from lending discrimination. While disappointing to activists, this result was not surprising to those of us working closely with data teams in finance.
There are some algorithmic applications for financial institutions where the risks of experimentation far outweigh any benefit, and credit underwriting is one of them. We could have predicted that Goldman would be found innocent, because the laws for fairness in lending (if outdated) are clear and strictly enforced.
And yet, there is no doubt in my mind that the Goldman/Apple algorithm discriminates, along with every other credit scoring and underwriting algorithm on the market today. Nor do I doubt that these algorithms would fall apart if researchers were ever granted access to the models and data we would need to validate this claim. I know this because the NY DFS partially released its methodology for vetting the Goldman algorithm, and as you might expect, their audit fell far short of the standards held by modern algorithm auditors today.
How did DFS (under current law) assess the fairness of Apple Card?
In order to prove the Apple algorithm was “fair,” DFS considered first whether Goldman had used “prohibited characteristics” of potential applicants like gender or marital status. This one was easy for Goldman to pass — they don’t include race, gender or marital status as an input to the model. However, we’ve known for years now that some model features can act as “proxies” for protected classes.
If you’re Black, a woman and pregnant, for instance, your likelihood of obtaining credit may be lower than the average of the outcomes among each overarching protected category.
The DFS methodology, based on 50 years of legal precedent, failed to mention whether they considered this question, but we can guess that they did not. Because if they had, they’d have quickly found that credit score is so tightly correlated to race that some states are considering banning its use for casualty insurance. Proxy features have only stepped into the research spotlight recently, giving us our first example of how science has outpaced regulation.
In the absence of protected features, DFS then looked for credit profiles that were similar in content but belonged to people of different protected classes. In a certain imprecise sense, they sought to find out what would happen to the credit decision were we to “flip” the gender on the application. Would a female version of the male applicant receive the same treatment?
Intuitively, this seems like one way to define “fair.” And it is — in the field of machine learning fairness, there is a concept called a “flip test” and it is one of many measures of a concept called “individual fairness,” which is exactly what it sounds like. I asked Patrick Hall, principal scientist at bnh.ai, a leading boutique AI law firm, about the analysis most common in investigating fair lending cases. Referring to the methods DFS used to audit Apple Card, he called it basic regression, or “a 1970s version of the flip test,” bringing us example number two of our insufficient laws.
A new vocabulary for algorithmic fairness
Ever since Solon Barocas’ seminal paper “Big Data’s Disparate Impact” in 2016, researchers have been hard at work to define core philosophical concepts into mathematical terms. Several conferences have sprung into existence, with new fairness tracks emerging at the most notable AI events. The field is in a period of hypergrowth, where the law has as of yet failed to keep pace. But just like what happened to the cybersecurity industry, this legal reprieve won’t last forever.
Perhaps we can forgive DFS for its softball audit given that the laws governing fair lending are born of the civil rights movement and have not evolved much in the 50-plus years since inception. The legal precedents were set long before machine learning fairness research really took off. If DFS had been appropriately equipped to deal with the challenge of evaluating the fairness of the Apple Card, they would have used the robust vocabulary for algorithmic assessment that’s blossomed over the last five years.
The DFS report, for instance, makes no mention of measuring “equalized odds,” a notorious line of inquiry first made famous in 2018 by Joy Buolamwini, Timnit Gebru and Deb Raji. Their “Gender Shades” paper proved that facial recognition algorithms guess wrong on dark female faces more often than they do on subjects with lighter skin, and this reasoning holds true for many applications of prediction beyond computer vision alone.
Equalized odds would ask of Apple’s algorithm: Just how often does it predict creditworthiness correctly? How often does it guess wrong? Are there disparities in these error rates among people of different genders, races or disability status? According to Hall, these measurements are important, but simply too new to have been fully codified into the legal system.
If it turns out that Goldman regularly underestimates female applicants in the real world, or assigns interest rates that are higher than Black applicants truly deserve, it’s easy to see how this would harm these underserved populations at national scale.
Financial services’ Catch-22
Modern auditors know that the methods dictated by legal precedent fail to catch nuances in fairness for intersectional combinations within minority categories — a problem that’s exacerbated by the complexity of machine learning models. If you’re Black, a woman and pregnant, for instance, your likelihood of obtaining credit may be lower than the average of the outcomes among each overarching protected category.
These underrepresented groups may never benefit from a holistic audit of the system without special attention paid to their uniqueness, given that the sample size of minorities is by definition a smaller number in the set. This is why modern auditors prefer “fairness through awareness” approaches that allow us to measure results with explicit knowledge of the demographics of the individuals in each group.
But there’s a Catch-22. In financial services and other highly regulated fields, auditors often can’t use “fairness through awareness,” because they may be prevented from collecting sensitive information from the start. The goal of this legal constraint was to prevent lenders from discrimination. In a cruel twist of fate, this gives cover to algorithmic discrimination, giving us our third example of legal insufficiency.
The fact that we can’t collect this information hamstrings our ability to find out how models treat underserved groups. Without it, we might never prove what we know to be true in practice — full-time moms, for instance, will reliably have thinner credit files, because they don’t execute every credit-based purchase under both spousal names. Minority groups may be far more likely to be gig workers, tipped employees or participate in cash-based industries, leading to commonalities among their income profiles that prove less common for the majority.
Importantly, these differences on the applicants’ credit files do not necessarily translate to true financial responsibility or creditworthiness. If it’s your goal to predict creditworthiness accurately, you’d want to know where the method (e.g., a credit score) breaks down.
What this means for businesses using AI
In Apple’s example, it’s worth mentioning a hopeful epilogue to the story where Apple made a consequential update to their credit policy to combat the discrimination that is protected by our antiquated laws. In Apple CEO Tim Cook’s announcement, he was quick to highlight a “lack of fairness in the way the industry [calculates] credit scores.”
Their new policy allows spouses or parents to combine credit files such that the weaker credit file can benefit from the stronger. It’s a great example of a company thinking ahead to steps that may actually reduce the discrimination that exists structurally in our world. In updating their policies, Apple got ahead of the regulation that may come as a result of this inquiry.
This is a strategic advantage for Apple, because NY DFS made exhaustive mention of the insufficiency of current laws governing this space, meaning updates to regulation may be nearer than many think. To quote Superintendent of Financial Services Linda A. Lacewell: “The use of credit scoring in its current form and laws and regulations barring discrimination in lending are in need of strengthening and modernization.” In my own experience working with regulators, this is something today’s authorities are very keen to explore.
I have no doubt that American regulators are working to improve the laws that govern AI, taking advantage of this robust vocabulary for equality in automation and math. The Federal Reserve, OCC, CFPB, FTC and Congress are all eager to address algorithmic discrimination, even if their pace is slow.
In the meantime, we have every reason to believe that algorithmic discrimination is rampant, largely because the industry has also been slow to adopt the language of academia that the last few years have brought. Little excuse remains for enterprises failing to take advantage of this new field of fairness, and to root out the predictive discrimination that is in some ways guaranteed. And the EU agrees, with draft laws that apply specifically to AI that are set to be adopted some time in the next two years.
The field of machine learning fairness has matured quickly, with new techniques discovered every year and myriad tools to help. The field is only now reaching a point where this can be prescribed with some degree of automation. Standards bodies have stepped in to provide guidance to lower the frequency and severity of these issues, even if American law is slow to adopt.
Because whether discrimination by algorithm is intentional, it is illegal. So, anyone using advanced analytics for applications relating to healthcare, housing, hiring, financial services, education or government are likely breaking these laws without knowing it.
Until clearer regulatory guidance becomes available for the myriad applications of AI in sensitive situations, the industry is on its own to figure out which definitions of fairness are best.