Introduction

Studies of simpler organisms can provide critical clues by simplifying the mechanism under study and minimizing distractions. For instance, Eric Kandel became a Nobel laureate for elucidating the inner workings of memory formation in snails (Kandel & Schwartz, 1982). In the same spirit, rodent behavior in the operant chamber is one candidate for interpreting human nature. There are two distinct learning strategies in instrumental conditioning. At the beginning of training, a rodent's actions in the experimental chamber appear to be goal-directed, governed by the relationship between its actions and their specific consequences. After a period of training, however, control over behavior shifts to a stimulus-response process, which can be characterized as performance without thinking (Balleine & Dickinson, 1998).

These phenomena can be extended to human behavior. Just as animal behavior can be governed in two distinct ways, by a deliberate goal-directed process and by a stimulus-response habit mechanism, human behavior can be driven both by sequences of actions aimed at a specific goal and by automated habitual activities, reflecting these two types of learning. This seemingly complicated behavioral system allows us to focus on more important things in daily life: we can avoid sudden obstacles in our path because we do not have to think about how to walk. However, this hypothesis demands empirical methods to study it.

As noted above, studying this question requires simplifying the concept. Among many attempts, the formulation of model-free and model-based learning in computational modeling is one adequate tool for capturing this competition between goal-directed and habitual behavior (Daw et al., 2005). In the model-free approach, each action is selected on the basis of a simple association between reward prediction errors and a single signal cue (Dayan & Niv, 2008). The computational cost of selecting an action is thereby minimized, but as a consequence this strategy is less able to adjust to current goals (Dayan & Niv, 2008; Keramati et al., 2011). In the model-based approach, by contrast, continuous adjustment of actions yields a diverse, goal-optimized repertoire of behaviors, at the cost of intensive and sustained computation in the brain before each decision (Daw et al., 2005; Dayan & Niv, 2008; Keramati et al., 2011).
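
To make the distinction concrete, here is a minimal sketch in Python. The state labels, learning rate, and the transition and reward probabilities are illustrative assumptions rather than parameters of this study; the sketch contrasts a model-free temporal-difference update driven by reward prediction errors with a model-based evaluation that plans through a known transition structure.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative two-step structure (assumed values, not this study's task):
# first-step actions 0/1 lead to second-step states A/B with fixed chances.
P_TRANSITION = np.array([[0.7, 0.3],   # action 0 -> state A (70%) or B (30%)
                         [0.3, 0.7]])  # action 1 -> state A (30%) or B (70%)
P_REWARD = np.array([0.8, 0.2])        # chance of reward in states A and B

ALPHA = 0.1  # learning rate

# --- Model-free: cache action values directly from reward prediction errors.
q_mf = np.zeros(2)

def model_free_update(action: int, reward: float) -> float:
    """One temporal-difference update; returns the prediction error."""
    delta = reward - q_mf[action]      # reward prediction error
    q_mf[action] += ALPHA * delta
    return delta

# --- Model-based: evaluate actions by planning through the transition model.
def model_based_values(p_transition: np.ndarray, p_reward: np.ndarray) -> np.ndarray:
    """Expected reward of each first-step action under the known structure."""
    return p_transition @ p_reward

for _ in range(200):
    action = int(rng.integers(2))
    state = rng.choice(2, p=P_TRANSITION[action])
    reward = float(rng.random() < P_REWARD[state])
    model_free_update(action, reward)

print("model-free cached values:", q_mf.round(2))
print("model-based planned values:", model_based_values(P_TRANSITION, P_REWARD).round(2))
```

Under these assumptions, the cached model-free values converge toward the same numbers the model-based computation produces immediately, which illustrates the trade-off: cheap but slow-to-adjust caching versus costly but flexible planning.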

A further background of this study is that the model-based and model-free strategies can both be subsumed under reinforcement learning (Dayan & Niv, 2008). Reinforcement learning is a way of optimizing behavior based on predictions of its consequences (Sutton & Barto, 1998).

There are many approaches to this branch of mathematical psychology. One representative method is the event-related brain potential (ERP). A myriad of ERP studies (Falkenstein et al., 1991; Gehring et al., 1993; Gehring & Willoughby, 2002; Holroyd & Coles, 2002) have supported the idea that reinforcement learning is visible in the ERP waveform as a negative deflection when positive feedback (reward) is compared with negative feedback (non-reward). This negative deflection occurs approximately 250 ms after the feedback and peaks over the frontal-central recording electrodes (Miltner et al., 1997; Holroyd et al., 2009). Its presumed source is the anterior cingulate cortex (ACC), which receives signals from the midbrain dopamine system to evaluate previous actions (Rushworth et al., 2004). This hypothesis is especially well developed in animal studies (Schweimer & Hauber, 2006).

One thing to recall in this context is that another ERP component shares this timing, polarity, and location: the N200. The cognitive representation of the N200 has been interpreted as the detection of a mismatch in the analysis of auditory stimuli (Folstein & Van Petten, 2008). However, Holroyd et al. (2008) raised the possibility that this negative deflection and the N200 are actually the same component. On this view, the N200 represents a mismatch between prediction and feedback (Baker & Holroyd, 2011; Baker et al., 2016). In other words, prediction errors arising from unexpected reward or punishment may be displayed in the N200, and the amplitude of the N200 can contribute to the negative deflection (Baker & Holroyd, 2011; Baker et al., 2016). This gave rise to the idea of the feedback correct-related positivity, or reward positivity (RP), which is considered to be obtained from the difference between the amplitudes associated with the negative deflection, or the N200 (Cohen et al., 2007; Holroyd et al., 2008; Baker & Holroyd, 2011; Baker et al., 2016). In sum, reinforcement learning can be divided into two forms, model-free and model-based learning, and it can be recorded in the ERP waveform. However, the connection between model-free and model-based learning and the ERP recording remains to be elucidated.
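
In computational terms, the RP reduces to a simple difference wave; the following one-line sketch (Python, with empty placeholder arrays standing in for real feedback-locked grand averages) states the measure used throughout this study:

```python
import numpy as np

# Hypothetical grand-average waveforms at one channel (microvolts), one value
# per time sample; placeholders only, standing in for feedback-locked ERPs.
erp_reward = np.zeros(250)     # ERP following positive feedback
erp_no_reward = np.zeros(250)  # ERP following negative feedback

# Reward positivity as a difference wave between the two feedback ERPs.
reward_positivity = erp_no_reward - erp_reward
```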

Hence, in this study the N200, as an indicator of the feedback error-related negativity, was measured in six healthy individuals during a task paradigm involving model-free and model-based learning, in order to study the link between them. We used a modified version of the two-step probabilistic learning paradigm of Smittenaar et al. (2013), which can induce model-free and model-based learning by providing stochastic chances of reward (Fig. 1). In the more frequently presented case (HF), the chance of reward is strongly biased toward the left choice, whereas in the less frequently displayed case (LF) the chance of reward is more random and unbiased. In addition, by switching the background color and displaying previous choices, we tried to give participants information for discriminating the HF from the LF condition (Fig. 1). We subsequently observed significantly different trends in the N200 between the LF and the HF conditions. Moreover, learning progress in the LF condition is less predictable, in the sense that performance accuracy on the task is less correlated with the RP. We therefore conclude that model-free learning is presumably mediated differently in the brain, and that this process can be measured in the ERP waveform.

– Participants

Seven healthy individuals participated in the experiment (1 male, 6 female; age range 18 to 32 years; mean = 22.43, SD = 4.76). All participants had normal or corrected-to-normal vision. Any participant with a history of psychiatric or neurological disorder was excluded. Before the experiment, all participants provided written informed consent, under a protocol approved by the local research ethics committee. The experiment was conducted in accordance with the ethical standards prescribed in the 1964 Declaration of Helsinki.

 

– Reinforcement learning task

The task was modified from Smittenaar et al. (2013). On each trial, participants chose between two fractals, each of which more frequently (70%; Fig. 1) led to a particular second-step fractal. At the second step, a coin (25 cents) was displayed on the screen with a probability (20% to 80%; Fig. 1) determined by the participant's choice at that step. Conversely, a red cross was presented, also probabilistically, in the case of non-reward. Choices at the first stage less frequently (30%; Fig. 1) led to the alternative second state, in which the reward coin was given less frequently (40% to 60%; Fig. 1) than in the other state. No explicit cue on the screen indicated which fractal carried a high chance of reward. Hence, participants were required to use a model-based learning strategy, one sensitive not only to prior reward but also to the transition structure of the task, which is unavailable to a model-free learning strategy that focuses only on whether the last action was rewarded.

Prior to the experimental task, participants were trained on the task. Training consisted of written instructions on the screen, 10 demo trials showing the probabilistic association between the second-stage fractals and coin rewards, and verbal explanation during these demo trials from an assistant seated next to the participant.

Participants were asked to respond within 2.5 s of the presentation of the first-stage choice by pressing a key (left: 1; right: 0). If no response was made within this period, the words "no response" appeared in red at the center of the screen, and the task moved on to the next trial. If the response was made on time, a resized copy of the selected fractal was placed at the top center of the screen as a reminder of the first-stage choice, and the background color changed according to the choice made at the first stage. At the second step, the response window was reduced by 1 s, and a reward coin or the red cross appeared on the screen according to the reward probability (20% or 80%).
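
As a rough illustration of this trial structure, consider the following sketch in Python. The per-option reward probabilities are placeholders drawn from the ranges above, since the exact per-fractal assignment is specified in Fig. 1, and the mapping of choices to states is simplified for brevity.

```python
import numpy as np

rng = np.random.default_rng(1)

# Placeholder values consistent with the ranges described above; the exact
# per-fractal assignment follows Fig. 1 and is assumed here for illustration.
P_COMMON = 0.7                       # chance the usual second state appears
P_REWARD = {"HF": (0.8, 0.2),        # frequent state: strongly biased options
            "LF": (0.6, 0.4)}        # rare, alternative state: more even odds

def run_trial(second_choice: int) -> tuple[str, bool]:
    """Simulate one two-step trial; returns the second state and the outcome."""
    state = "HF" if rng.random() < P_COMMON else "LF"   # 70% / 30% transition
    rewarded = rng.random() < P_REWARD[state][second_choice]
    return state, bool(rewarded)     # reward coin (True) or red cross (False)

state, rewarded = run_trial(second_choice=0)
print(state, "coin" if rewarded else "red cross")
```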

 

– Electrophysiological recordings

The electroencephalogram (EEG) was recorded from a montage of 36 electrodes placed according to the extended international 10-20 system (Jasper, 1958). Recordings were obtained through Ag/AgCl ring electrodes mounted in a nylon electrode cap. A conductive gel (Falk Minow Services, Herrsching, Germany) was applied to the scalp, and inter-electrode impedances were kept below 10 kΩ by this application. Signals were amplified by low-noise differential electrode amplifiers with a frequency response of DC 0.017 to 67.5 Hz (90 dB/octave roll-off) and digitized at a sampling rate of 250 Hz. The digitized signals were recorded to disk using Brain Vision Recorder software (Brain Products GmbH, Munich, Germany). For artifact detection, the vertical electrooculogram (EOG) was computed from a recording beneath the participant's right eye and electrode channel Fp2; the horizontal EOG was recorded from the external canthi of both eyes. The average reference was used, with reference electrodes on the left and right mastoids, and the ground electrode was placed at channel AFz.

 

– Data processing and calculating Reward Positivity

Brain Vision Analyzer (Brain Products GmbH, Munich, Germany) was used for post-processing and data visualization. The digitized recordings were filtered with a 4th-order digital Butterworth filter with a passband of 0.1 to 20 Hz. The recordings were segmented into 1000 ms epochs extending from 200 ms before stimulus onset to 800 ms after it. The segmented evoked potentials were re-referenced to the mastoid electrodes. Baseline correction was performed by subtracting from each electrode its mean amplitude in the 200 ms interval preceding stimulus onset. Blinks and saccades were corrected with the eye-movement correction algorithm of Gratton et al. (1983). Trials with muscular and other artifacts were rejected using a ±150 µV level threshold and a ±35 µV step threshold. Event-related potentials (ERPs) were then obtained by averaging the single-trial EEG for each participant, sorted by feedback type and by the frequency of reward. Reward positivity (RP) was calculated by assessing the difference between the ERP components for positive and negative feedback: a difference wave was computed by subtracting the reward-feedback ERPs from the no-reward-feedback ERPs (Sambrook & Goslin, 2015; Holroyd & Krigolson, 2007). The size of the RP was determined by peak-amplitude detection of the N200 within a 200 to 400 ms window after feedback onset. Peak detection was conducted at channel FCz, where the RP reaches its maximum, and the resulting values were used for statistical analysis. The ERPs and scalp maps were finalized with Illustrator CS5 (Adobe). The significance of the ERP effects was evaluated from the peak-amplitude values with SPSS (IBM) and Excel (Microsoft).
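
A minimal sketch of this pipeline, in Python with NumPy and SciPy, might look as follows; the simulated single-channel signal, trial timing, and outcome labels are assumptions for illustration only, and the actual analysis was performed with the commercial tools named above.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

FS = 250                      # sampling rate (Hz)
PRE, POST = 0.2, 0.8          # epoch: -200 ms to +800 ms around feedback

# 4th-order Butterworth bandpass, 0.1-20 Hz, applied as a zero-phase filter.
sos = butter(4, [0.1, 20.0], btype="bandpass", fs=FS, output="sos")

def epoch(continuous: np.ndarray, onsets: np.ndarray) -> np.ndarray:
    """Cut epochs around feedback onsets and baseline-correct each one."""
    pre, post = int(PRE * FS), int(POST * FS)
    epochs = np.stack([continuous[o - pre:o + post] for o in onsets])
    baseline = epochs[:, :pre].mean(axis=1, keepdims=True)
    return epochs - baseline  # subtract the 200 ms pre-stimulus mean

# Simulated single-channel (FCz) data, for illustration only.
rng = np.random.default_rng(2)
raw = sosfiltfilt(sos, rng.standard_normal(60 * FS))
onsets = np.arange(2 * FS, 58 * FS, FS)          # hypothetical feedback times
is_reward = rng.random(onsets.size) < 0.5        # hypothetical outcome labels

epochs = epoch(raw, onsets)
erp_reward = epochs[is_reward].mean(axis=0)
erp_no_reward = epochs[~is_reward].mean(axis=0)

# RP difference wave: no-reward minus reward, as described in the text above.
rp = erp_no_reward - erp_reward

# Peak detection in the 200-400 ms post-feedback window (N200 range).
t = np.arange(-PRE * FS, POST * FS) / FS * 1000  # time axis in ms
window = (t >= 200) & (t <= 400)
peak_idx = int(np.argmax(np.abs(rp[window])))
print(f"RP peak: {rp[window][peak_idx]:.2f} at {t[window][peak_idx]:.0f} ms")
```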

 

– Negative deflection is more salient during the model-based learning process than in the model-free strategy.

The feedback-evoked ERPs at channel FCz, sorted by the conditions inducing model-free and model-based learning, and the scalp distributions for each, are presented in Figure 2. A significantly different waveform occurred in the less frequently presented situation, which induces model-based learning (Fig. 2C; one-tailed t-test at 320 ms, p = 0.033; mean amplitude of LF-PF = 6.80 µV, SD = 4.24; mean amplitude of LF-NF = 2.59 µV, SD = 1.21). Based on the scalp distributions, the reward-related negative deflection appears to be a centrally oriented process (Fig. 2B and 2D). These observations are congruent with previous studies (Baker & Holroyd, 2011; Baker et al., 2016). In the HF condition, for reasons that are not clear, the ERPs did not differ significantly in N200 amplitude (Fig. 2A). In addition, although its source is hard to identify, there is a slight timing mismatch between PF and NF (Fig. 2A and 2C). The timings for PF and NF are aligned across the model-free and model-based learning conditions (Fig. 5), and the RP occurs earlier in the HF condition than in the LF condition (Fig. 2A and 2C). Moreover, the trends in the scalp distributions are clearly divided between positive and negative feedback, yet this pattern is consistent across the model-free and model-based strategies (Fig. 2B and 2D).

 

– The difference wave and the relationship between performance accuracy and reward positivity show that the model-free setting agrees better with the previous ACC-related reward task paradigm.

Difference waves and scalp distributions for each condition are displayed in Figure 3. The subtractions between the PF and NF scalp distributions agree with the previous literature (Fig. 3B and 3D; Baker & Holroyd, 2011; Baker et al., 2016). However, as Figures 3A and 3C show, the ERPs from the task paradigm provoking model-based learning are less congruent with the prevailing hypothesis of reinforcement learning studies (Baker & Holroyd, 2011; Baker et al., 2016). Unlike the waveforms in the model-free conditions, the LF waveform changes amplitude more than the HF waveform (Fig. 3C). Furthermore, around the P300 the polarity of the ERP crosses the zero line (Fig. 3C). The cause remains to be elucidated, but this may be evidence of an additional neural process beyond the ACC reward-related mechanism.

In addition, the correlation between performance accuracy and the RP indicates that the model-based learning process is less predictable under the current hypothesis that the RP reflects reinforcement learning progress through the amount of negative deflection (Fig. 4A and 4B; Baker & Holroyd, 2011). Performance accuracy was calculated as the percentage of rewarded trials over total trials. This result should be revisited with a larger sample size.
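
For clarity, the accuracy measure and its correlation with the RP could be computed as in the following sketch (Python; the per-participant numbers are hypothetical placeholders, not the study's data).

```python
import numpy as np
from scipy.stats import pearsonr

# Hypothetical per-participant values, for illustration only.
rewarded_trials = np.array([120, 98, 110, 105, 90, 115])
total_trials = np.array([200, 200, 200, 200, 200, 200])
rp_peaks_uv = np.array([3.1, 1.8, 2.6, 2.2, 1.5, 2.9])  # RP peak amplitudes

# Accuracy as the percentage of rewarded trials over total trials.
accuracy = 100.0 * rewarded_trials / total_trials

# Correlation between accuracy and RP amplitude across participants.
r, p = pearsonr(accuracy, rp_peaks_uv)
print(f"r = {r:.2f}, p = {p:.3f}")
```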

 

– The difference between model-free and model-based learning in this task paradigm seems to come from the LF-NF cues.

To identify the factor driving these differences between the model-free and model-based learning signal cues, HF and LF were compared separately for positive feedback and negative feedback (Fig. 5). The centrally oriented activity pattern of the scalp distributions is consistent across conditions (Fig. 5B and 5D). The timing of the N200 is slightly earlier for positive feedback (Fig. 5A and 5C). Moreover, the overall time points of the ERP components, such as the P100, N200, and P300, are more closely aligned than in the comparison between PF and NF (Fig. 5A and 5C). For PF, the ERP components do not differ significantly from each other, whereas from the N200 onward, including the following P300 component, the waveforms differ significantly (Fig. 5A and 5C; one-tailed t-test at 320 ms, p = 0.029; mean amplitude of HF-NF = 4.69 µV, SD = 1.70; mean amplitude of LF-NF = 2.59 µV, SD = 1.21). In particular, around the P300 there is a consistently large difference between the more-frequent and less-frequent conditions (Fig. 5C). This fits well with the classical finding that decision making about, or learning from, rarely occurring events evokes the P300 (Donchin, 1981).

 

The ERPs were recorded during a two-step probabilistic learning paradigm that elicits model-free and model-based learning processes. The N200 and the RP were the major subjects of scrutiny because they can index reinforcement learning (Baker & Holroyd, 2011). The LF signal cues triggering the model-based strategy show a clearer negative-deflection effect relative to positive feedback. Moreover, the task condition with the more commonly presented event and less stochastic reward accords better with the previous account of the relationship between the N200 and the ACC in reward evaluation. The scalp distribution for negative feedback also suggests an additional mechanism affecting the established interplay of the ACC and the midbrain in reward trials. Last but not least, based on the statistics, the difference between model-free and model-based learning in this task paradigm appears biased toward the LF-NF cues.

 

– There is a possibility of additional brain activity beyond the relationship between the ACC and midbrain dopamine.

The N200 carries many different pieces of information. There are several N2 sub-components classified by their characteristics, such as automatic components and those requiring conscious attention (Naatanen & Picton, 1986). The N2b, which differs from the rest of the N2 family in that it responds not only to auditory cues but also to visual and template changes, is seen at a central cortical distribution related to ACC activity only during conscious attention to a stimulus (Pritchard et al., 1991; Baker & Holroyd, 2011). In addition, based on the animal work indicating a relationship between midbrain dopamine activity and ACC activity (Rushworth et al., 2004; Schweimer & Hauber, 2006), we hypothesized that the N200 negative deflection can visualize reinforcement learning through reward prediction errors (Baker & Holroyd, 2011). However, although clear causation cannot be established, three factors imply extra neural activity beyond reward prediction errors in this setting: the constant latency difference between reward and non-reward feedback, the difference in the P300, and the differing polarity between the model-free and model-based approaches after the P300. It is possible that these elements appeared only because of our small sample size. Nonetheless, a plausible alternative account can be drawn from the previous literature. Human aging studies have shown that shifts in N2b latency can be caused by the general decay of attentional processes with age (Czigler et al., 1997; Amenedo & Diaz, 1998). Among our participants, age-related decline of attentional processes is unlikely (1 male, 6 female; age range 18 to 32; mean = 22.43, SD = 4.76 years). Yet there remains the possibility of distraction caused by the stochastic chance of reward on negative-feedback trials. This may indicate that some factor disturbed participants' attention during the non-reward cues of this task paradigm. Moreover, this conclusion fits the general idea that the computational process in goal-directed behavior is delayed relative to habitual learning (Dayan & Niv, 2008). Furthermore, as Figures 2, 3, and 5 consistently show, the P300 components differ significantly between reward and non-reward feedback. Classical studies associate the P300 with broad recognition and memory updating in response to rarely occurring events (Sutton et al., 1965; Donchin, 1981; Naatanen, 1990). This leads to a conclusion similar to the one above: because of a possible additional cognitive process, we need to extend our scope to other components, such as the P300, in this context.

– There are subcortical areas that may explain the subconscious aspect of processing during reinforcement learning.

There are two major subclasses of reinforcement learning, model-free and model-based, both of which include subconscious computation (Dayan & Niv, 2008). The mechanisms shaping our behavior are surely complex, but one factor is reward evaluation by the midbrain dopamine pathway (Doya, 2008). The N200 was therefore measured in this study in relation to the connection between the ACC and the reward prediction errors of midbrain dopamine cells (Brown & Braver, 2005). One more helpful factor to consider is the striatum: one study showed that the striatum also responds topographically to habitual learning and goal-directed learning, respectively (Yin et al., 2005). Many attempts have been made to illuminate the cortico-striatal connection in humans, one of which combines computational modeling and ERPs (Santesso et al., 2009). This leads to the final conclusion that although the N200 and the RP are a great tool for assessing model-free and model-based learning, the task and the data analysis still need to be improved with the aid of computational modeling.