Teaching AI to learn by positive reinforcement

Training conscious beings to complete a task often entails offering a reward as an incentive: You might offer a puppy a treat if it sits, or you might give a child a lollipop if they stay quiet at a concert. In the realm of computer science, the same is true for unconscious beings. Known as reinforcement learning (RL), the process of incentivizing the machine to complete desired tasks allows artificial intelligence (AI) agents, or autonomous intelligent entities, to ‘learn’ by reward. Given the goal of maximizing its reward function, the machine learns to repeat the task that earns the reward.

Doina Precup, a professor in McGill’s School of Computer Science, compares training AI agents to training animals, where the machines are offered a reward when the task is done correctly and denied it when the task is not.
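The treat-for-sitting idea can be sketched in a few lines of Python. This is only an illustrative toy, not a method described by Precup: the action names, the `train_agent` helper, and the step size and exploration rate are all assumptions made up for the example.

```python
import random

def train_agent(reward_for, actions, steps=200, epsilon=0.1, step_size=0.1, seed=0):
    """Learn a value estimate for each action from reward alone (hypothetical helper)."""
    rng = random.Random(seed)
    values = {action: 0.0 for action in actions}
    for step in range(steps):
        if step < len(actions):
            action = actions[step]            # try every action once at the start
        elif rng.random() < epsilon:
            action = rng.choice(actions)      # occasionally explore at random
        else:
            action = max(values, key=values.get)  # otherwise repeat what has paid off
        reward = reward_for(action)
        # Nudge the estimate for this action toward the reward just received.
        values[action] += step_size * (reward - values[action])
    return values

def treat_for_sitting(action):
    """Reward of 1 when the task is done correctly, 0 when it is not."""
    return 1.0 if action == "sit" else 0.0

values = train_agent(treat_for_sitting, ["wander", "bark", "sit"])
best = max(values, key=values.get)
print(best, values)
```

Nobody tells the agent that "sit" is the right answer. It only sees which actions were rewarded, and its estimates drift toward the action that earns the treat.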

Precup describes RL as the middle ground between two other ways that machines can learn to process information: supervised and unsupervised learning. RL does not use massive amounts of hand-labelled data in order to learn something new, as supervised learning does, nor does it set the agent loose among large pools of unprocessed data for the purpose of discovering new or interesting patterns, as in unsupervised learning. Instead, RL takes a novel approach that is becoming increasingly common in computer scientists’ toolkits.

“You’re not holding the hand of the agent at every step, but you are providing some feedback to help it understand [the task at hand],” Precup said in an interview with The McGill Tribune.

The intuitive reward function, though easy to understand and implement in straightforward, game-like settings, becomes a challenge when the desired goal is multifaceted and complex. One such reward function exists in medicine. A dynamic, long-term medical treatment strategy called an adaptive treatment method is made up of a sequence of treatments that each depend on the patient’s response to previous therapies. According to Precup, the reward should aim to juggle different objectives. 

“The reward should balance the efficacy of the treatment with other considerations, such as side effects or the costs of medication,” Precup said. 
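One simple way to picture such a balance is a weighted sum, sketched below in Python. This is an assumption made for illustration, not the formulation Precup's group uses: the `TreatmentOutcome` fields, the 0-to-1 scales, and the weights are all hypothetical.

```python
from dataclasses import dataclass

@dataclass
class TreatmentOutcome:
    efficacy: float      # 0 (no benefit) to 1 (full benefit), hypothetical scale
    side_effects: float  # 0 (none) to 1 (severe), hypothetical scale
    cost: float          # 0 (free) to 1 (most expensive), hypothetical scale

def treatment_reward(outcome, w_efficacy=1.0, w_side_effects=0.5, w_cost=0.2):
    """Reward efficacy, penalize side effects and cost (weights are assumptions)."""
    return (w_efficacy * outcome.efficacy
            - w_side_effects * outcome.side_effects
            - w_cost * outcome.cost)

aggressive = TreatmentOutcome(efficacy=0.9, side_effects=0.8, cost=0.5)
gentle = TreatmentOutcome(efficacy=0.7, side_effects=0.1, cost=0.5)

print(treatment_reward(aggressive))  # about 0.40
print(treatment_reward(gentle))      # about 0.55
```

With these particular weights, the gentler treatment scores higher even though it is less effective, because its side effects are much milder. Change the weights and the ranking can flip.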

However, incentivizing a computer to take such factors into consideration is not an easy task. Precup described a solution where the AI agent can interact directly with humans while it is operating. 

“When the agent arrives at a situation which it is uncertain about, […] it has the option to ask a person,” Precup said. 

Employing this system of teaching, the agent can then take the feedback from a human operator and use this information to learn.
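A rough sketch of this ask-when-uncertain idea follows. It assumes, purely for illustration, that the agent is "uncertain" when its two best options have nearly equal estimated value. The `choose_action` function, the `margin` and `bonus` parameters, and the dosing options are hypothetical.

```python
def choose_action(values, ask_human, margin=0.1, bonus=0.2):
    """Pick the best-looking action, but defer to a person when the top two are close.

    Returns (action, asked), where asked says whether the human was consulted.
    """
    ranked = sorted(values, key=values.get, reverse=True)
    best, runner_up = ranked[0], ranked[1]
    if values[best] - values[runner_up] < margin:
        advised = ask_human(ranked)   # the agent is unsure, so it asks
        values[advised] += bonus      # and learns from the answer
        return advised, True
    return best, False

# Hypothetical value estimates for three treatment decisions.
values = {"raise_dose": 0.52, "hold_dose": 0.50, "switch_drug": 0.10}

def clinician(options):
    """Stand-in for a human operator who always prefers holding the dose."""
    return "hold_dose"

first = choose_action(values, clinician)   # top two are close: asks the clinician
second = choose_action(values, clinician)  # estimate has shifted: decides alone
print(first, second)
```

The first call consults the clinician. The second does not, because the earlier feedback has already shifted the agent's estimates.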

Precup foresees AI-human interaction as a major field for future research. The emerging narrative is becoming more focussed on determining human preferences in real time and tailoring AI algorithms to match these preferences. 

Stuart Russell, a professor of computer science at the University of California at Berkeley, echoes these sentiments in his three principles of beneficial machines. As Russell notes, the first and foremost job of a ‘beneficial machine’ (that is, one which works to the advantage of its human operator) should be to maximize the realization of human preferences. Because the machine is initially uncertain about what those preferences are, Russell proposes that the ultimate source of information about human preferences is human behaviour itself.

Though these general principles in no way address every potential concern about AI, it is encouraging to see that researchers remain optimistic about creating computer systems that are aligned with the goals of humanity.
