Building Mathematical Reward Models

This web page and its associated links provide an analysis of two basic mathematical models of reward processes, namely Rescorla & Wagner’s reward equation (1972) and Sutton & Barto’s reinforcement learning models (Barto, Sutton, & Watkins, 1990; Sutton & Barto, 1990, 1998). Both models have made appreciable contributions to psychology and computer science and have been widely cited.

Rescorla & Wagner’s (1972) reward learning model used rate parameters that tracked stimulus salience and the breadth of learning characterizing both the unconditioned stimulus (US) and the conditioned stimulus (CS). Their equation also reflected the US’s ability to influence the stimulus qualities of the CS, which is initially an innocuous sensory stimulus. They termed this imputed rewarding influence a gain in the CS’s associative strength. Their model broadened the interpretation of the CS to include all contextual stimuli capable of gaining associative strength in the reward environment through their increasing temporal and spatial proximity to an inherently rewarding US.
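
In its standard presentation, the Rescorla & Wagner update combines a rate parameter for the CS (α), a rate parameter for the US (β), the asymptote of learning the US will support (λ), and the summed associative strength of all stimuli present on the trial (ΣV):

    ΔV_X = α_X · β · (λ - ΣV)

Here ΔV_X is the change in associative strength accruing to stimulus X on that trial.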

A subterm of their formula, the difference λ - V, monitors changes in the US-CS relationship. The smaller the difference, the greater the CS’s imputed associative strength derived from its relationship with the US, and the greater its implied power to predict US occurrence. Conversely, the larger the difference, the weaker the CS’s imputed associative strength and its implied power to predict US occurrence. The series of their equations centers on the stimulus properties of both the US and the CS; stimulus effects on reward learning are therefore implied.
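
A minimal acquisition sketch in Python, using assumed parameter values rather than anything from the original chapter, makes this dynamic concrete: the error term λ - V shrinks across trials as the CS’s associative strength V climbs toward the asymptote λ.

    # Rescorla & Wagner acquisition sketch (assumed parameter values).
    alpha_beta = 0.3   # combined rate parameter (CS salience x US-determined learning rate)
    lam = 1.0          # lambda: asymptote of associative strength the US will support
    V = 0.0            # current associative strength of the CS

    for trial in range(1, 11):
        error = lam - V            # lambda - V: the US-CS discrepancy
        V += alpha_beta * error    # the CS gains a fraction of that discrepancy
        print(f"trial {trial:2d}: lambda - V = {error:.3f}, V = {V:.3f}")

Over ten trials the printed discrepancy falls from 1.000 to about 0.040 while V approaches λ, mirroring the verbal account above.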

Sutton & Barto (1990), on the other hand, offered a distinctive early reinterpretation of Rescorla & Wagner’s model and built elements of it into their time-derivative model. They were guided by Rescorla & Wagner’s notion that learning occurs when events violate expectations. In the λ - V_AX portion of the equation (where V_AX denotes the aggregate associative strength of the stimulus compound AX), Sutton & Barto related λ to the influence of the US and V to the predicted value of the CS. Accordingly, they read λ - V as the discrepancy between anticipated US occurrence (a CR in response to CS input) and actual US occurrence. As noted above, the smaller the difference between the two, the greater the CS’s power to predict future US occurrence; conversely, the larger the difference, the weaker that predictive power. For greater analytic detail, please reference the accompanying PDF document entitled Reward_Equation.
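
For comparison, in Sutton & Barto’s later reinforcement-learning formulation (1998) an analogous discrepancy appears as the temporal-difference error: the received reward plus the discounted prediction for the next state, minus the prediction for the current state,

    δ = r + γ·V(next state) - V(current state)

with learning proceeding by nudging each prediction in the direction of this error, much as the CS’s associative strength is nudged by λ - V above.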

In their early work, Sutton & Barto (1990) held that time-derivative models seek to measure occurrence probability and prediction from the perspective of a theoretical observer of the reward task. An internal working model of the reward task, or inner representation, typically guides the theoretical observer’s task behavior; however, when this internal representation also matches the task’s design, the new state is observable and inferentially undifferentiated from the theoretical observer’s internal representation.

A subject’s internal reward learning model develops from sensory input of reward (US) and other relevant sensory stimuli (CS), together with the learning parameters inherent in the reward task. This internal reward learning model is an imputed cognitive-evaluative response (whether UR or CR) to incoming sensory reward (US) and perceptual (CS) stimuli, from which later behavioral strategies and response selections can be derived. λ - V, as cited by Sutton & Barto (1990), can therefore be reframed as the discrepancy between a theoretical observer’s expected response (CR) to reward occurrence and the actual response (UR) to delivery of the rewarding stimulus (US). The subject’s accuracy (CR) in predicting US occurrence is enhanced as the CS gains temporal proximity to the US.
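
A short temporal-difference sketch in Python, with assumed parameter values rather than anything drawn from the cited chapters, illustrates this last point: time steps closer to the US acquire larger predicted values, so a CS gains predictive strength as its temporal proximity to the US increases.

    # TD(0) sketch (assumed parameters): the US (reward = 1.0) arrives at the last
    # time step of each trial; earlier steps stand in for CS onsets at varying
    # temporal distances from the US.
    alpha = 0.1    # learning-rate parameter (assumed)
    gamma = 0.9    # temporal discount factor (assumed)
    n_steps = 6    # time steps per trial

    V = [0.0] * (n_steps + 1)    # predicted value of each step; the final entry is terminal

    for trial in range(200):     # repeated conditioning trials
        for t in range(n_steps):
            reward = 1.0 if t == n_steps - 1 else 0.0
            delta = reward + gamma * V[t + 1] - V[t]    # actual + discounted future - predicted
            V[t] += alpha * delta                       # move the prediction toward the outcome

    print([round(v, 3) for v in V[:n_steps]])

At convergence the step adjacent to the US approaches 1.0 while the earliest step settles near γ^5 ≈ 0.59, so the closer the CS sits to the US in time, the stronger its acquired prediction.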

Rescorla & Wagner’s and Sutton & Barto’s equations conceptualize different aspects of reward modeling. Rescorla & Wagner’s models center on stimulus qualities: the assumed associative strength of the CS, the stimulus salience of both the US and the CS, and the learning parameters inherent in the reward task’s US and CS (including the effectiveness of reward feedback). The nature of these stimulus characteristics will invariably shape the subject’s inferred approach to a reward task.

Sutton & Barto’s models (1990, 1998), as noted above, center on conceptualizing the learning approaches of a subject, who is also a theoretical observer. The subject’s responses allow the nature of the stimulus characteristics to be inferred. Barto, Sutton, & Watkins (1990) sought to analyze and conceptualize the stimulus qualities and parameters of a reward task through the development of a vector analysis. For greater analytic detail, please reference the accompanying PDF document entitled Vector_Analysis.
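
As an illustration only, and not drawn from Barto, Sutton, & Watkins (1990), one way to picture such a vector treatment in Python is to record the stimuli present on a trial as a feature vector and the learned associative strengths as a weight vector, with the aggregate prediction given by their inner product:

    # Hypothetical stimulus dimensions; none of these names come from the cited chapter.
    stimuli = ["light", "tone", "context"]
    x = [1.0, 0.0, 1.0]    # presence/salience of each stimulus on this trial
    w = [0.6, 0.2, 0.1]    # learned associative strengths (assumed values)

    prediction = sum(wi * xi for wi, xi in zip(w, x))      # V = w . x, the aggregate prediction
    error = 1.0 - prediction                               # lambda - V, with asymptote lambda = 1.0
    w = [wi + 0.3 * error * xi for wi, xi in zip(w, x)]    # present stimuli share in the correction

    print(f"aggregate prediction V = {prediction:.2f}")
    for name, wi in zip(stimuli, w):
        print(f"  {name}: associative strength {wi:.2f}")

Framing the task this way lets a single λ - V discrepancy drive updates to every stimulus present on a trial, which is how the Rescorla & Wagner rule extends to compound and contextual stimuli.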

Bibliography

Barto, A.G., Sutton, R.S., & Watkins, C.J.C.H. (1990), Learning and sequential decision making. In: M. Gabriel and J.W. Moore, Eds., Learning and Computational Neuroscience: Foundations of Adaptive Networks, pp. 539-602, M.I.T. Press, Cambridge, Massachusetts.

Rescorla, R.A. & Wagner, A.R. (1972), A theory of Pavlovian conditioning: variations in the effectiveness of reinforcement and nonreinforcement. In: A.H. Black and W.F. Prokasy, Eds., Classical Conditioning II: Current Research and Theory, pp. 64-99, Appleton-Century-Crofts, New York, New York.

Sutton, R.S. & Barto, A.G. (1990), Time-derivative models of Pavlovian reinforcement. In: M. Gabriel and J.W. Moore, Eds., Learning and Computational Neuroscience: Foundations of Adaptive Networks, pp. 497-537, M.I.T. Press, Cambridge, Massachusetts.

Sutton, R.S. & Barto, A.G. (1998), Reinforcement Learning: An Introduction. M.I.T. Press, Cambridge, Massachusetts.