What happens when AI is used to set students' grades?

One experiment ended unhappily. If only more thought had gone into how the grades came about and the appeals process.

Published Fri, Aug 14, 2020 · 09:50 PM

HOW would you feel if an algorithm determined where your child went to college?

This year, Covid-19 locked down millions of high-school seniors and governments around the world cancelled year-end graduation exams, forcing examining boards everywhere to consider other ways of setting the final grades that would largely determine the future of the class of 2020. One of these boards, the International Baccalaureate Organization (IBO), opted to use artificial intelligence (AI) to set overall scores for high-school graduates based on students' past work and other historical data.

The experiment was not a success, and thousands of unhappy students and parents have since launched a furious protest. So, what went wrong? What does the experience say about the challenges that come with AI-enabled solutions?

In a normal year for IB students, final grades are determined by coursework produced by the students and a final examination administered and corrected by the IBO directly. The coursework counts for 20 to 30 per cent of the overall final grade, and the exam, the remainder.

Before the exam, teachers provide "predicted" grades, so universities can offer places conditional on the candidates' final grades meeting the predictions.

The process is generally regarded as a rigorous assessment protocol. The IBO has collected a substantial amount of data about each subject and school - hundreds of thousands of data points, in some cases going back over 50 years. Significantly, the relationship between predicted and final grades has been tight. At leading IB schools, over 90 per cent of grades have been equal to the predicted grade; over 95 per cent of total scores have been within a point of the predicted total. (Total scores are set on a scale of one to 45.)
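
The two agreement figures above are simple proportions. As a minimal sketch, assuming a hypothetical list of (predicted, final) pairs rather than any real IBO data, they could be computed along these lines; the function names and the sample scores are invented for illustration.

```python
def share_equal(pairs):
    """Fraction of (predicted, final) pairs in which the two values match exactly."""
    return sum(1 for predicted, final in pairs if predicted == final) / len(pairs)


def share_within(pairs, tolerance=1):
    """Fraction of pairs in which final is within `tolerance` points of predicted."""
    return sum(1 for predicted, final in pairs
               if abs(predicted - final) <= tolerance) / len(pairs)


# Hypothetical total scores on the 1-to-45 scale, made up for this example.
totals = [(38, 38), (41, 40), (33, 33), (36, 39)]

print(share_equal(totals))   # 2 of 4 pairs match exactly
print(share_within(totals))  # 3 of 4 pairs are within one point
```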


And then came Covid-19

Cancelling the exams raised the question of how to assign grades, and that was when the IBO turned to AI. Using its trove of historical data about students' coursework and predicted grades, as well as data about the actual grades obtained in exams in previous years, it built a model to calculate an overall score for each student - in a sense predicting what the 2020 students would have obtained in the exams.

A crisis erupted when the results came out in July. Tens of thousands of students across the world received grades that deviated substantially from their predicted grades - and in inexplicable ways. Some 24,000, or more than 15 per cent of all 2020 IB diploma recipients, have since signed a protest. The IBO's social media pages are flooded with furious comments.

Several governments have also launched investigations, and numerous lawsuits are in preparation, some for data abuse under the European Union's General Data Protection Regulation (GDPR). What is more, schools, students, and families involved in other high school programmes that have also adopted AI solutions are raising similar concerns.

One critical and practical question has been consistently raised by frustrated students and parents: How can they appeal the grades?

In normal years, the appeals process was well defined and consisted of several levels, from the re-marking of an individual student's exam to a review of coursework marks by subject at a given school. The process was well understood and produced consistent results, but was not used frequently, largely because, as noted, there were few surprises when the final grades came out.

This year, the IB schools initially treated appeals as requests for re-marks of student work. But this posed a fundamental challenge: the graded papers were not in dispute - it was the AI assessment that was called into question. The AI did not actually correct any papers; it only produced final grades based on the data it was fed, which included teacher-corrected coursework and the predicted grades. Since the specifics of the programme are not disclosed, all people can see are the results, many of which were highly anomalous, with final scores in some cases well below the marks of the teacher-graded coursework of the students involved. Unsurprisingly, the IBO's appeals approach has not met with success - it is in no way aligned with the way in which the AI created the grades.

What can we learn?

The main lesson coming out of this experience is that any organisation that decides to use an AI to produce an outcome as critical and sensitive as a high-school grade marking 12 years of a student's work needs to be very clear about how the outcomes are produced and how they can be appealed in the event that they appear anomalous or unexpected.

From the outside, it looks as though the IBO may have simply plugged the AI into the IB system to replace the exams and then assumed that the rest of the system - in particular the appeals process - could work as before.

So what sort of appeals process should the IBO have designed? First of all, the overall process of scoring and, more importantly, appealing the decision should be easy to explain, so that people understand what each next step will entail. Note that this is not about explaining the AI "black box", which is what regulators currently have in mind when they argue for "explainable AI".

That would be almost impossible in many cases, since understanding the programming used in an AI generally requires a high level of technical sophistication. Rather, it is about making sure that people understand what information is used in assessing grades and what the steps are in the appeal process itself.

So what the IBO could have done was offer appellants the right to a human-led re-evaluation of anomalous grades, specify the input data the appeal panel would look at in re-analysing the case, and say how the problem would be fixed.

How the problem would be fixed would depend on whether it turned out to be student-specific, school-specific, or subject-specific; a single student's appeal might well affect other students, depending on which components of the AI the appeal relates to.

If, for example, a problem with an individual student's grade seems to be driven by the school-level data - possibly a number of students studying in that same school have had final grades that differed markedly from their predicted grades - then the appeal process would look at the grades of all students in that school. If needed, the AI algorithm itself would be adjusted for the school in question, so that the new scores are consistent across all schools while the scores at every other school remain unchanged.

In contrast, if the problem is linked to factors specific to the student, then the analysis would focus on identifying why the AI produced an anomalous outcome for that student and, if needed, on re-scoring that student and any other student whose grades were affected in the same way.
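
The routing described above can be sketched in a few lines of code. This is only an illustration under stated assumptions, not the IBO's method: the `Result` record, the two thresholds (`GAP` and `GROUP_SHARE`) and the sample data are all hypothetical, and a real review would be led by people rather than by a fixed rule.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Result:
    student: str
    school: str
    subject: str
    predicted: int  # teacher-predicted grade
    awarded: int    # grade produced by the model


GAP = 2            # assumption: a gap of 2+ grade points counts as anomalous
GROUP_SHARE = 0.5  # assumption: half a group being anomalous suggests a group-level cause


def is_anomalous(r):
    return abs(r.predicted - r.awarded) >= GAP


def anomalous_share(results, key):
    """Map each group (each school, say) to the fraction of its results that are anomalous."""
    totals, flagged = defaultdict(int), defaultdict(int)
    for r in results:
        group = getattr(r, key)
        totals[group] += 1
        flagged[group] += is_anomalous(r)
    return {group: flagged[group] / totals[group] for group in totals}


def route_appeal(appeal, results):
    """Return 'school', 'subject' or 'student' as the likely scope of the problem."""
    if anomalous_share(results, "school")[appeal.school] >= GROUP_SHARE:
        return "school"
    if anomalous_share(results, "subject")[appeal.subject] >= GROUP_SHARE:
        return "subject"
    return "student"


def affected_by(appeal, results):
    """List every result the review should cover, given the scope of the appeal."""
    scope = route_appeal(appeal, results)
    if scope == "student":
        return [appeal]
    return [r for r in results if getattr(r, scope) == getattr(appeal, scope)]


# Hypothetical results, invented for illustration.
results = [
    Result("a", "North", "Math", 6, 4),
    Result("b", "North", "History", 7, 5),
    Result("c", "North", "Math", 5, 5),
    Result("d", "South", "Math", 6, 6),
    Result("e", "South", "History", 5, 5),
    Result("f", "South", "Physics", 7, 4),
    Result("g", "South", "Physics", 6, 6),
    Result("h", "South", "Physics", 5, 5),
]

print(route_appeal(results[0], results))  # student "a": most of North is anomalous
print(route_appeal(results[5], results))  # student "f": an isolated case
```

In this made-up data, the appeal from student "a" is routed to a school-level review covering all three North results, while the appeal from student "f" stays student-specific.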

Of course, much of this would be true of any grading process. But the way in which the appeal process is designed needs to reflect the different ways in which humans and machines make decisions and the specific design of the AI used as well as how the decisions can be corrected. For example, because AI awards grades on the basis of its model of relationships between various input data, there should generally be no need to look at the actual work of the students concerned, and corrections could be made to all affected students (those with similar input data characteristics) all at once. In fact, appealing an AI grade could be an easier process than appealing a traditional exam-based grade.

What is more, with an AI system, an appeals process along the lines described would enable continuous improvement to the AI. Had the IBO put such a system in place, the results of the appeals would have produced feedback data that could have updated the model for future uses - in the event, say, that examinations are again cancelled next year.

The IBO's experience obviously has lessons for deploying AI in many contexts - from approving credit to job searches and policing. Decisions in all these cases can, as with the IB, have life-altering consequences for the people involved.

It is inevitable that disputes over the outcomes will occur, given the stakes involved. Including AI in the decision-making process without carefully thinking through an appeals process and linking the appeals process to the algorithm design itself will likely end not only with new crises but potentially with a rejection of AI-enabled solutions in general. And that deprives us all of the potential for AI, when combined with humans, to dramatically improve the quality of decision-making.

This article first appeared on the Harvard Business Review blog

*Disclosure: One of the authors of this article is the parent of a student completing the IB programme this year.
