May 31, 2022

Keep Your Confidences In Check So Your Users Don’t Have To

Developing an AI-based solution to automate boring tasks can be quite daunting. Why is that? Let me explain.

Nowadays anyone who hears “AI” wants to use it on everything. But once you start thinking it through, problem by problem, it might sink in that it’s not as simple as sticking AI on top of your application.

Wrong decisions can impact real businesses and people’s lives, and leave customers unhappy. Some outcomes could be far worse than anything we could cover in this blog post. When such risk is involved, there should be a strategy in place to mitigate it.

Paperbox has been implementing and iterating on a people-first strategy since the start. But what does that really mean? And with this in mind, how do you scope and build your AI-focused application from the ground up?

This is the first part of a series of blog posts covering our mission of making Paperbox the most trustworthy and human-centered AI. This first article focuses on how we use model confidences to fit our clients’ wants and needs.

When you fail, fail gracefully

The first thing to do when building an AI-focused application is to map out what tasks the AI will automate and where it should augment manual work done by users. Our partner Google has written a great guidebook on this here.

It is essential to factor in and measure whether users and decision-makers actually want something automated rather than augmented. A good example is when we decided to develop our Policy/Claim ID extraction: we first looked at which parts of claims processing were the most time-consuming.

At the time we saw that a lot of time went into simply understanding and mapping the unstructured text to a more structured format. Today our users love our entity extraction, but at first there was some skepticism, and for good reason.

Artificial intelligence often looks like magic to people, but just because one car is really good doesn't mean all cars are. It is important to create an environment where your AI is allowed to learn and make a positive impact without disrupting the user's experience too much.

Just having that environment lets you go to market much quicker, because you accept that your model will sometimes make wrong decisions. In the long run this pays off: each error and user intervention lets the model adjust itself and improves the experience for all users. To apply this in an application there should be a "happy" and an "unhappy" flow:

The model is sure and correct (happy)

The model is sure but wrong (unhappy)

It should be clear to users where the model is not confident and may have made an incorrect decision. But what does confidence mean to a model, when is it sure or unsure, and what does it mean to your user?

We, as humans, know what confidence means to us.

We need our service and our decisions to be trustworthy. When the model isn't sure about something, we should communicate this to the users of our platform. That uncertainty then serves as a trigger for our users to adjust our predictions or to confirm that we are in fact correct. In a way, it's a collective responsibility of the users and Paperbox to work towards better and smarter automated decision-making.

When our clients use our platform, we want to give them an easy way of grasping complex AI concepts like confidence without needing a data science or software engineering background.

We do that by explicitly explaining each setting and prediction, but also by building an intuitive UI that implicitly translates these concepts into simple yet effective interface elements. But as much as we want these concepts to make sense, out of the box these machine learning models don't produce trustworthy confidences.

The UX is (mostly) about confidences

When an insurer in Paperbox wants to automate a portion of their incoming claims, they have to define a confidence threshold. This threshold governs a tradeoff: a lower threshold means more automation, but also more potential mistakes. Sounds scary, and it is. It means the decision-maker has to try to understand your model, something even data science experts struggle to do.

The administrative panel in the Paperbox application lets clients configure everything concerning their documents. Currently we present all relevant metrics to our users separately, but they will be included in the application in future versions.

So how do you make this process as simple as possible? The first step is to make sure that all the important metrics are presented to your user.

How many false positives and negatives will that threshold let through compared to the number of true positive documents that will be automated? Do you want your product to make more mistakes, with possible repercussions, just for more automation?
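To make the tradeoff concrete, here is a minimal sketch of the arithmetic behind a confidence threshold: given per-document confidences and whether each prediction was actually correct, it reports how much would be automated and how often the automated decisions would be wrong. The function name and example data are illustrative, not Paperbox's actual implementation.

```python
import numpy as np

def automation_tradeoff(confidences, correct, threshold):
    """Summarise what a confidence threshold would automate.

    confidences: model confidence per document (0..1)
    correct: whether the model's prediction was actually right
    """
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=bool)

    automated = confidences >= threshold
    n_auto = automated.sum()
    automation_rate = n_auto / len(confidences)

    # Confident mistakes: documents we would automate despite being wrong.
    errors = (automated & ~correct).sum()
    error_rate = errors / n_auto if n_auto else 0.0
    return automation_rate, error_rate

# Toy example: six documents, one confident mistake at 0.95.
conf = [0.99, 0.97, 0.95, 0.80, 0.60, 0.99]
right = [True, True, False, True, False, True]
auto, err = automation_tradeoff(conf, right, threshold=0.95)
```

Raising the threshold above 0.95 in this toy set would drop the confident mistake, at the cost of automating fewer documents.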

After all these decisions, the question remains whether our users can really capture all of that in a single threshold. For that to work, the model has to keep its confidences in check: the threshold means nothing if our machine learning model is highly confident in something but still wrong.

We can take this even further: how do we keep these confidences stable across different training sessions and different data? Different models can give different results, meaning the threshold the user so carefully chose means nothing unless we can regulate those confidences.

To mitigate that, our ML engineers research and develop ways to let these models learn the task at hand, but also how to fail with grace.

The technical investment

Producing trustworthy confidences is vital if you want to keep the same level of automation without a higher error rate in automated documents. Concretely: when a prediction says it is 50% sure, there should be a 50% chance it is correct. When Paperbox is set up to automate at 98% confidence, there should be only a 2% chance we are wrong.

In machine learning, experts often suggest penalizing false positives, making your AI more cautious and less prone to high-confidence errors. How hard you penalize this behavior depends on how much the reweighting affects the actual true positives and negatives.
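As an illustration of that reweighting idea, here is a toy log loss that multiplies the penalty on misclassified examples. The function name and the weight value are our own illustrative assumptions, not a specific library API or Paperbox's production loss.

```python
import numpy as np

def weighted_log_loss(probs, labels, error_weight=2.0):
    """Cross-entropy that punishes confident mistakes harder.

    probs: predicted probability of the positive class
    labels: true 0/1 labels
    error_weight: extra multiplier applied to wrong predictions
                  (an illustrative knob, not a standard default)
    """
    probs = np.clip(np.asarray(probs, dtype=float), 1e-12, 1 - 1e-12)
    labels = np.asarray(labels, dtype=float)

    # Probability the model assigned to the true class.
    p_true = np.where(labels == 1, probs, 1 - probs)

    # Upweight examples the model got wrong at the 0.5 decision point.
    wrong = (probs >= 0.5) != (labels == 1)
    weights = np.where(wrong, error_weight, 1.0)
    return float(np.mean(weights * -np.log(p_true)))
```

Turning `error_weight` up pushes the model toward lower confidence overall, which is exactly the side effect on true positives and negatives that has to be measured.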

What if we penalize too much and degrade all of the model's results? We really don't think manually re-tuning each model is worth the effort. And even if we get it right once, what happens when we put this model into production and retrain it periodically on new data? What if the reweighting (penalizing vs. encouraging) doesn't work as expected and all confidences shift to be more optimistic?

That would mean that when the application uses the newly trained model, automation will either be higher with more errors, or lower with fewer errors. One sounds worse than the other, but neither is desirable: when deploying a fancy new model revision, we want nothing other than better automation with the same number of errors or fewer.

The only way to fully combat this is to force your model to learn where it needs to be sure and unsure. When the input is out-of-domain, meaning the model has never seen anything like the example, it should be unsure. If it's one of the usual documents the model knows well, it should produce higher confidence.

To visualise and measure this, we have carefully chosen two graphs that tell us the important parts of each model and how they compare.

Calibration Curve 

The proper term for this practice is "calibrating your models", which is where this graph's name comes from. The Y-axis shows the fraction of examples the model was correct on; the X-axis shows how confident the model was about those examples. A properly calibrated model follows the diagonal from (0, 0) to (1, 1). Obviously, this only tells us part of the story.
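The points on such a curve can be computed with a few lines of binning, sketched below (scikit-learn ships an equivalent in `sklearn.calibration.calibration_curve`; this standalone version just makes the mechanics explicit):

```python
import numpy as np

def calibration_curve(confidences, correct, n_bins=10):
    """Bin predictions by confidence and compare mean confidence (X)
    to the fraction the model actually got right (Y) per bin."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)

    bins = np.linspace(0.0, 1.0, n_bins + 1)
    idx = np.clip(np.digitize(confidences, bins) - 1, 0, n_bins - 1)

    xs, ys = [], []
    for b in range(n_bins):
        mask = idx == b
        if mask.any():
            xs.append(confidences[mask].mean())  # X: mean confidence
            ys.append(correct[mask].mean())      # Y: fraction correct
    return xs, ys
```

A calibrated model yields points hugging the diagonal: in the bin of roughly-95%-confident predictions, about 95% should be correct.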

Paperbox Curve

To tell the other part of the story we developed our own Paperbox Curve. Unlike the first graph, this one carries our own name, because it is our most important curve, and it correlates directly with the first graph.

It tells a single story about the model: given a confidence threshold, how much automation potential do we have? On the Y-axis we again have the percentage correct, but this time the X-axis shows the automation percentage.

How good this curve looks will depend entirely on the confidences and the actual performance of the ML model.
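A sketch of how such a curve can be traced, assuming the simplest construction: sort documents by confidence, then for each possible cutoff pair the automation percentage with the accuracy among the automated documents. (This is our reading of the described axes, not Paperbox's published code.)

```python
import numpy as np

def automation_curve(confidences, correct):
    """For every candidate threshold, pair the automation percentage
    (X axis) with accuracy among automated documents (Y axis)."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)

    order = np.argsort(-confidences)      # most confident first
    correct_sorted = correct[order]

    n = len(correct_sorted)
    automated = np.arange(1, n + 1)       # automate the top-k documents
    automation_pct = automated / n                      # X axis
    accuracy = np.cumsum(correct_sorted) / automated    # Y axis
    return automation_pct, accuracy
```

For a well-calibrated, well-performing model the curve stays high for as long as possible as automation grows; a poorly calibrated one dips early, because confident mistakes get automated first.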


For our users to intuitively grasp model confidences, and for automation levels to stay stable between model updates, those confidences require proper calibration.

We at Paperbox have solved this riddle and are making sure that our customers can automate their document processing flow hassle-free. Want to experience Paperbox or talk about fun topics like this? Send us a message using our channels below!