Can code generation be used to create trustworthy decision-making applications?

The latest AI tools built on deep learning models are exciting. They can conduct natural sounding conversations on topics as diverse as politics and medicine, create visual artworks to accompany blog posts in almost any artistic style, and generate code snippets to include in business applications in your coding language of choice.

Part of the power and charm of these tools is that they are non-deterministic. That is, given the same input, they do not produce the same output. In fact, they give you a choice: Choose the image you like best. Generate a different murder mystery book plot. Provide alternative lines of code that solve a similar problem. This randomness within the provided remit is what makes their output feel more authentic; more human.

Oracle is leading the way in applying Artificial Intelligence in our Fusion cloud application suite to deliver real customer value. And Oracle Cloud Infrastructure is already best in class for many AI use cases. Deep learning models that can answer questions and generate content are a major leap forward in AI, and it is tempting to try to use them to solve every problem.

As a business software application development leader, how should I advise my team and our customers to think about when and how to apply deep learning models to complex decision automation? Are we at a point where I can train an AI tool to generate code for me, for example, to

  • accurately calculate the amount of monthly housing benefit a family should receive?
  • determine whether a failed part is covered under a supplier warranty?
  • determine whether a financial transaction meets the risk criteria we’ve established to require investigation for possible money laundering?

Or, in fact, can we skip the middleman and just train an AI tool to make those decisions for me directly, based on the source material?

An experiment

I recently asked a well-known deep learning language model chatbot to generate code for an online form. I didn’t realistically expect it would give me something I could use as is, but I was curious how far I could get towards something that felt genuinely useful. I told it the form should collect visa applications for visitors coming to Australia. I asked it to send the collected data to the Australian Department of Immigration. I also asked it to explain how the information it collected would be used, and to provide links back to the relevant source material for the rules it applied to determine visa eligibility.

At first, the chatbot explained that visa eligibility is a complex process that depends on a variety of factors, and that immigration laws and regulations are subject to change. It declined to provide me with any code for this use case. I actually think this is the most appropriate response from these tools for at least the next decade or more, and I explain why in more detail below. Opponents of this view would argue that these tools are improving at a rapid pace, and that once they are given more input data they will get it right, and soon. While it’s true that the pace of change is astonishing, getting more input data that is relevant to solving targeted use cases in very specific decision domains is hard, and I don’t think it will turn out to be the only thing needed to really solve this problem. In the same way that self-driving cars are proving very difficult to make fully autonomous, getting AI-driven decisions to be 100% accurate and reliable will face similar hurdles.

To see if I could coax an answer from the chatbot, I adjusted my request to say that the code would be used for a student project. It then obliged with an HTML form that collected name, profession, age, reason for visit and a few other fields, and POSTed the form to a (made-up) URL on the Australian Department of Immigration website. With a bit more interaction, it also created some very simple but plausible JavaScript rules to decide eligibility for some different visa types – Student visas, Tourism visas and Skilled Migration visas. Basically it came back with some conditions like this:

function isEligible(reasonForVisit, age, lengthOfVisit) {
  if (reasonForVisit === "Study") return true;
  if (reasonForVisit === "Skilled Migration" && age < 45) return true;
  if (reasonForVisit === "Vacation" && lengthOfVisit < 90) return true;
  return false;
}

Of course, the actual Australian visa rules are a lot more involved than this, but even this level of plausible code generation is a huge leap forward in capability.

Despite my asking it to include an explanation of how each decision was reached, there was no code that did that. It did, however, include unit tests for each of the different types of visa the code was designed to handle, which is also super impressive.

Overall though, the main question this whole exercise raised for me was what criteria should be used to assess the success or failure of applying this technology to decision automation applications?

What principles should guide the development of decision automation applications?

The Australian Commonwealth Government Ombudsman’s excellent publication, the Automated decision-making better practice guide, provides a number of recommendations for how to develop systems that automate government decisions. This 2007 guide was updated in 2020, and its continued relevance highlights the importance of several principles, including:

  • Collect only what is reasonably necessary
  • Ensure each rule in the system can be traced back to the relevant legislation or policy that authorizes it
  • Allow different versions of the rules to be executed as required
  • Generate an audit trail of the decision-making path

There are many other recommendations. But these four struck me as being particularly hard for a generative AI system to address. Certainly, none of these were handled by the code generated in my simple test. Perhaps a bit more cajoling and iteration might have yielded a better result. But I doubt it. Let me explain:

Collecting only what is reasonably necessary

How does a generative AI system know what is reasonable? Is age reasonable? If there are specific visa rules that depend on being under or over a particular age, then probably yes. But what if having a particular illness always makes someone ineligible? Should you always ask both for their age, and whether they have that illness? Or only ask the second question if the answer to the first has not made them ineligible? In that case, what order should you ask them in? There are nuanced policy decisions to be made here that need to be coded into the solution.
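
To make that concrete, here is a minimal hand-written sketch of the kind of “only ask what can still matter” interview logic I have in mind. The field names, the age threshold and the illness rule are all hypothetical, purely to illustrate the ordering decision:

// Hypothetical interview flow: a question is only asked while its answer can
// still change the outcome, so nothing unnecessary is ever collected.
function nextQuestion(answers) {
  if (answers.age === undefined) {
    return "age";                  // age is always relevant to at least one rule
  }
  if (answers.age >= 45) {
    return null;                   // already ineligible, so the illness question is never asked
  }
  if (answers.hasExcludedIllness === undefined) {
    return "hasExcludedIllness";   // only asked because eligibility is still open
  }
  return null;                     // nothing left to ask
}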

Tracing back each rule to the relevant source material

With AI systems that are trained on millions of source data points, there simply isn’t a referenceable source. If such a system were to recommend that a working holiday visa be granted to people under the age of 25, it wouldn’t be able to state why it used 25 as the age. In fact, it might not even pick an age that is specified in any known source material, and it can even invent “sources” that don’t actually exist. I know this is an active area of research, but I contend it is a hard problem. It is possible to imagine a generative system that somehow keeps track of the sources it was trained on, and only generates code/rules that match the authoritative material – such as actual source legislation, or official internal policy documents – used when generating a particular response. But this technology doesn’t exist today.
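
For contrast, hand-written rule systems can carry provenance alongside the logic itself. A minimal sketch of that idea, using an entirely made-up rule and legislative reference:

// Hypothetical rule object: the condition and the authority for it travel
// together, so every decision can cite the clause that authorizes it.
const workingHolidayAgeRule = {
  id: "WHV-AGE-LIMIT",
  source: "Example Migration Act, section 12(3) (hypothetical reference)",
  applies: (applicant) => applicant.age < 25,
};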

Execute different rule versions as needed

Keeping track of different versions of rules sounds pretty straightforward, right? But in practice most software systems don’t work that way: you have version 1 of your system that contains version 1 of the rules, then you update the system to version 2, with version 2 of the rules. Or you add branches to your code in version 2 like if date < effective-date-of-rule-set-2 then use rule-set-1 else use rule-set-2. Generating code that correctly interprets changing logic over time requires accurately pinpointing when each part of every law goes into effect. Can this level of determinism be forced into a generative model? I’m not sure, but I think you’d end up adding a lot of tailored code to the AI system to get it to do so.
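
To illustrate the kind of tailored, deterministic code I mean, here is a small sketch of rule sets selected by effective date. The dates and thresholds are invented for the example:

// Hypothetical rule sets, ordered by the date each one takes effect.
const ruleSets = [
  { effectiveFrom: new Date("2020-07-01"), isEligible: (a) => a.age < 45 },
  { effectiveFrom: new Date("2023-07-01"), isEligible: (a) => a.age < 50 },
];

// Choose the rule set in force on the date the decision relates to,
// not the date the code happens to run.
function ruleSetFor(decisionDate) {
  const applicable = ruleSets.filter((rs) => rs.effectiveFrom <= decisionDate);
  return applicable[applicable.length - 1];
}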

Audit trail for each decision

In modern applications, debug-level logging can be turned on to generate huge volumes of data about exactly which code modules are being called and when – but this type of logging doesn’t focus only on the logic that performs the actual “meat” of the decisioning implemented in that code. Because there are so few real-world code examples that generate explanations for the decision-making part of an application, any generative system is going to struggle to produce code that generates such a human-readable audit trail.
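
To show the difference, here is a minimal hand-written sketch of a decision function that records its own reasoning as it runs, producing the kind of human-readable trail I mean rather than generic debug output. The rules and wording are hypothetical:

// Hypothetical decision function that builds a human-readable audit trail
// alongside the outcome, instead of relying on debug-level logging.
function decideEligibility(applicant) {
  const trail = [];

  if (applicant.reasonForVisit === "Study") {
    trail.push("Reason for visit is Study: student visa rule satisfied.");
    return { eligible: true, trail };
  }

  trail.push("Reason for visit is " + applicant.reasonForVisit + ": no eligibility rule satisfied.");
  return { eligible: false, trail };
}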

In summary

The latest deep learning based AI tools are already incredibly useful to accelerate general coding tasks, and will transform the search market. They are already a useful assistant in many creative and knowledge processes. But for the reasons outlined above, I don’t think they are yet ready to transform the way deterministic decisions that are based on complex legislation and policy are made within business applications.

I am keen to explore experiments like pre-training a generative transformer on an entire body of relevant legislation to provide a tool for teams building decision automation applications. They could interact in natural language with the model to get insight into areas of the legislation that they might be less familiar with. But I certainly wouldn’t want to ask the model for the definitive answer to a question like “how much tax should this person pay?”

For now at least, precision-focused decision automation solutions will continue to require either:

  • Lovingly hand-crafted and maintained code, or
  • A rules-driven application that is specifically designed to make it easy to meet the needs of automated decision making

Oracle Intelligent Advisor is a great example of the latter. Not coincidentally, Intelligent Advisor is very good at making sure

  1. only necessary information is collected
  2. each rule can be traced back to source material
  3. different rules can be applied at different points in time
  4. an audit trail can be generated for every decision

Intelligent Advisor nicely complements the significant investments in machine learning being made at Oracle. As I’ve blogged previously, the two approaches can work well together.

But I’d love to hear what you think. When and how do you think generative AI models will transform automated decision making applications?

 

Title image created with the assistance of DALL·E 2