Monetary Policy Under Behavioral Expectations: Theory and Experiment

Expectations play a crucial role in modern macroeconomic models. We consider a New Keynesian framework under a behavioral model of expectation formation and under rational expectations. Contrary to the rational model, the behavioral model predicts that inflation volatility can be lowered if the central bank reacts to the output gap in addition to inflation. We test the opposing theoretical predictions in a learning-to-forecast experiment. In line with the behavioral model, the results support the claim that output stabilization can lead to less volatile inflation.

are fully rational and able to determine the model-consistent expectation of the underlying process governing real-world economic outcomes is highly problematic. A great deal of research has shown that humans generally do not react fully rationally to the world around them. This research ranges from providing evidence for simple biases to showing the inability of humans to work with probabilities and to forecast future economic behavior (Tversky and Kahneman, 1974, and Grether and Plott, 1979, are seminal early contributions, and many have followed since; see Camerer et al., 2011, for an overview). Moreover, the claim based on evolutionary arguments that behavior deviating from the homogeneous rational expectations solution will be driven out of markets over time has not held up to scrutiny (Brock and Hommes, 1997; Brock and Hommes, 1998; De Grauwe, 2012a; see also Arthur et al., 1997).
In this paper we consider a standard macroeconomic model under both behavioral and rational expectations. We examine aggregate macroeconomic behavior and policy implications arising from the alternative assumptions on expectation formation, paying particular attention to price stability. The behavioral model of expectation formation is a heuristic switching model developed over a long period of time based on (mainly microeconomic) research investigating how people form expectations and how they adapt them over time. Models of this kind perform well in describing expectation dynamics using both survey and experimental data (e.g., Branch, 2004; Hommes, 2011; Assenza et al., 2018).
A key difference in outcomes between the macroeconomic models with behavioral and rational expectations concerns price stability, that is, inflation volatility. Assuming rational expectations, there is a clear trade-off for a central bank between fighting inflation volatility and output gap volatility. If the central bank reacts to the output gap in addition to inflation, under rational expectations this will result in an increase of inflation volatility. The outcome is different under behavioral expectations. Starting from a situation in which the central bank does not react to the output gap at all, the central bank can simultaneously decrease inflation volatility and output gap volatility by reacting to the output gap. However, inflation volatility as a function of the extent of output gap reaction is U-shaped. This means that reacting to the output gap on top of inflation will only lower inflation volatility up to a certain point, after which inflation volatility starts to increase again.
These different outcomes regarding inflation volatility can be tested in the laboratory. We design a learning-to-forecast experiment where the only difference between treatments consists in the monetary policy rule used by the central bank. In one treatment, the central bank only reacts to inflation, while in the other it also reacts to the output gap. Our experimental results support the claim that inflation volatility can be lowered when the central bank also reacts to the output gap, in line with the predictions of the behavioral model.
Our results from the behavioral model and the experimental data have clear policy implications for central banks whose sole aim is to achieve price stability, such as the European Central Bank (many other central banks, including those of New Zealand, Canada, England, and Sweden, have a hierarchical mandate with price stability as the primary objective for monetary policy). Even if these banks ultimately only care about price stability, this goal is better achieved if they also react to changes in the output gap. This is important and at odds with standard macroeconomic thinking built upon full rationality.
Our work mainly relates to two streams of literature. Firstly, it relates to the literature on behavioral macroeconomics and learning in macroeconomics, for example Marcet and Nicolini (2003), Orphanides and Williams (2006), McGough (2009, 2010), Woodford (2010), De Grauwe (2011, 2012a, 2012b), De Grauwe and Kaltwasser (2012), Anufriev et al. (2013), Kurz et al. (2013), Benhabib et al. (2014), and Bertasiute et al. (2018); see Evans and Honkapohja (2001) and Woodford (2013) for overviews. In particular, our paper is not the first to derive a non-monotonic relationship between inflation and output gap volatility. De Grauwe (2011, 2012a) obtains a similar result in a different macroeconomic model with simple behavioral rules of expectation formation. Moreover, Kurz et al. (2013) show that the trade-off between inflation and output gap volatility is non-monotonic under rational diverse beliefs (see Kurz, 2009, for a survey on rational diverse beliefs). Similar findings have been reported for sticky information economies in which the degree of attentiveness or the rate at which agents update their information is endogenized. The main contribution of our paper is the empirical test of policy trade-offs by means of a learning-to-forecast laboratory experiment. To our knowledge, our results are the first experimental evidence for the missing trade-off between inflation and output gap volatility.
Secondly, our research relates to the literature on experimental macroeconomics and learning-to-forecast experiments, for example Marimon and Sunder (1993), Kelley and Friedman (2002), Lei and Noussair (2002), Arifovic and Sargent (2003), Adam (2007), Heemeijer et al. (2009), Bao et al. (2012), Kryvtsov and Petersen (2013), Cornand and M'baye (2018), Pfajfar and Zakelj (2014), Assenza et al. (2018), and Hommes et al. (2019); see Duffy (2012), Assenza et al. (2014), and Cornand and Heinemann (2014) for reviews. Most closely related to our paper are the learning-to-forecast experiments of Pfajfar and Zakelj (2014) and Assenza et al. (2018), which are framed in a setup similar to ours. The first paper studies inflation expectation formation under different interest rate rules reacting only to inflation, finding that rationality of expectations can be rejected for the majority of participants. We focus instead on the policy trade-off between inflation and output gap volatility, comparing experimental outcomes when the central bank only targets inflation and when it also targets the output gap. Another important difference concerns the experimental design. While in Pfajfar and Zakelj (2014) participants forecast inflation only, we allow subjects to forecast both inflation and the output gap, in accordance with the theoretical macroeconomic model underlying the experiment. Assenza et al. (2018) consider participants forecasting both inflation and the output gap but focus on the Taylor principle as a device to pin down inflation dynamics.
This paper is organized as follows. In Section 2 we describe how we model the economy and the formation of expectations. We also show the main differences between the rational and behavioral versions. In Section 3 we first describe the experimental design and the procedures. Then we show the experimental results. Section 4 concludes.

Theory
In this section, we first describe the underlying macroeconomic model. Then we introduce the behavioral model of expectation formation. After that, we compare the outcomes of both models and describe the economic intuition behind these outcomes.

Macroeconomic model
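The economic model we use can be described by the following aggregate New Keynesian equations. Note that these aggregate equations are fully microfounded both under rational expectations (e.g., Woodford, 2003; Galí, 2008) and under behavioral expectations. We spell out the microfoundations for our behavioral model of expectation formation in Appendix A (these microfoundations are based on Kurz et al., 2013; for microfounded models under behavioral expectations see also Massaro, 2013).

$$y_t = \bar{y}^e_{t+1} - \phi\,(i_t - \bar{\pi}^e_{t+1}) + g_t, \tag{1}$$

$$\pi_t = \lambda\,y_t + \rho\,\bar{\pi}^e_{t+1} + u_t, \tag{2}$$

$$i_t = \max\{\bar{\pi} + \varphi_\pi(\pi_t - \bar{\pi}) + \varphi_y(y_t - \bar{y}),\; 0\}. \tag{3}$$

$y_t$ and $\bar{y}^e_{t+1}$ are the actual and average expected output gap, $i_t$ is the nominal interest rate, $\pi_t$ and $\bar{\pi}^e_{t+1}$ are the actual and average expected inflation rates. $g_t$ and $u_t$ are exogenous disturbances, and $\phi$, $\lambda$, $\rho$, $\varphi_\pi$ and $\varphi_y$ are positive parameters. Eq. (1) is the dynamic IS equation in which the output gap $y_t$ depends on the average expected future output gap $\bar{y}^e_{t+1}$ and on the real interest rate $i_t - \bar{\pi}^e_{t+1}$. Eq. (2) is the New Keynesian Phillips curve according to which the inflation rate depends on the output gap and on average expected future inflation. Eq. (3) is the monetary policy rule implemented by the central bank describing how it reacts to deviations from the inflation target $\bar{\pi}$ and to deviations from the corresponding equilibrium level of the output gap $\bar{y} \equiv (1 - \rho)\bar{\pi}/\lambda$. The coefficients $\varphi_\pi$ and $\varphi_y$ in this Taylor rule measure how much the central bank adjusts the nominal interest rate $i_t$ in response to deviations of the inflation rate from its target and of the output gap from its equilibrium level. As usual, the interest rate rule is subject to the zero lower bound, that is $i_t \ge 0$. When the zero lower bound is not binding, model (1)-(3) can be rewritten in matrix form as

$$\begin{pmatrix} y_t \\ \pi_t \end{pmatrix} = \Omega \begin{pmatrix} \phi\bar{\pi}(\varphi_\pi - 1) + \phi\varphi_y\bar{y} \\ \lambda\phi\bar{\pi}(\varphi_\pi - 1) + \lambda\phi\varphi_y\bar{y} \end{pmatrix} + \Omega \begin{pmatrix} 1 & \phi(1 - \varphi_\pi\rho) \\ \lambda & \lambda\phi + \rho + \rho\phi\varphi_y \end{pmatrix} \begin{pmatrix} \bar{y}^e_{t+1} \\ \bar{\pi}^e_{t+1} \end{pmatrix} + \Omega \begin{pmatrix} 1 & -\phi\varphi_\pi \\ \lambda & 1 + \phi\varphi_y \end{pmatrix} \begin{pmatrix} g_t \\ u_t \end{pmatrix}, \tag{4}$$

where $\Omega \equiv 1/(1 + \lambda\phi\varphi_\pi + \phi\varphi_y)$.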
We also remark that, although Eqs. (1) and (2) are typically derived by log-linearizing around a steady state with a zero inflation rate, this does not mean that one can only consider policy rules with a zero inflation target. In fact, as argued in Woodford (2003), Eqs. (1) and (2) are valid approximations for the dynamics of inflation and output gap as long as the target inflation $\bar{\pi}$ in the policy rule (3) is not too large (see Appendix A for a further discussion). 1 In the remainder we will only make use of the aggregate equations presented here.
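As a rough illustration of how realizations follow from average expectations, the sketch below solves the IS curve, the Phillips curve, and the Taylor rule for a single period, including the zero lower bound. It assumes the standard forms of these three equations as described in the text. The function name `realize` and the default values $\phi = 1$, $\lambda = 0.3$, $\rho = 0.99$, and $\bar{\pi} = 3.5$ are assumptions made for illustration only, not a statement of the calibration used by the authors.

```python
def realize(y_exp, pi_exp, g=0.0, u=0.0, *, phi=1.0, lam=0.3, rho=0.99,
            phi_pi=1.5, phi_y=0.0, pi_bar=3.5):
    """Return (y, pi, i) for one period given average expectations.

    All default parameter values are illustrative assumptions.
    """
    y_bar = (1.0 - rho) * pi_bar / lam  # steady-state output gap
    omega = 1.0 / (1.0 + lam * phi * phi_pi + phi * phi_y)

    # Solution when the zero lower bound does not bind.
    y = omega * (y_exp
                 + phi * pi_bar * (phi_pi - 1.0)
                 + phi * phi_y * y_bar
                 + phi * (1.0 - phi_pi * rho) * pi_exp
                 + g - phi * phi_pi * u)
    pi = lam * y + rho * pi_exp + u
    i = pi_bar + phi_pi * (pi - pi_bar) + phi_y * (y - y_bar)

    if i < 0.0:
        # Zero lower bound binds: fix i = 0 and re-solve the IS and Phillips curves.
        i = 0.0
        y = y_exp + phi * pi_exp + g
        pi = lam * y + rho * pi_exp + u
    return y, pi, i
```

If expectations equal the steady-state values and there are no shocks, the function returns the steady state itself, which is a quick consistency check on the algebra.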

A behavioral model of expectation formation
Models with rational expectations are based on the assumption that agents have perfect information and a full understanding of the true model underlying the economy. There is, however, a large body of empirical literature documenting departures from this assumption. Furthermore, in survey data there is usually a high variance of inflation forecasts (e.g., Carroll, 2003; Mankiw et al., 2003; Branch, 2004), strongly suggesting that expectations are heterogeneous.
Existing literature on expectation formation shows that many people use heuristics to make forecasts of future (macroeconomic) variables. This behavior is not necessarily a consequence of agents' irrationality; it can also be a "rational" response of agents who face cognitive limitations and have an imperfect understanding of the true model underlying the economy (e.g., Gigerenzer and Todd, 1999; Gigerenzer and Selten, 2002). Next, we introduce a behavioral model of expectation formation for such an environment.
Let $\mathcal{H}$ denote a set of $H$ different heuristics used by agents to make forecasts of variable $x$. A generic forecasting heuristic $h \in \mathcal{H}$ based on available information at time $t$ can be described as

$$x^e_{h,t+1} = f_h(x_{t-1}, x_{t-2}, \ldots;\; x^e_{h,t}, x^e_{h,t-1}, \ldots). \tag{5}$$

In this paper $x$ is either inflation $\pi$ or the output gap $y$. Although agents can use simple rules to predict future inflation and output gap, we impose a certain discipline in the selection of such rules in order to avoid completely irrational behavior. Specifically, we introduce a selection mechanism that disciplines the choice of heuristics by agents according to a fitness criterion. This allows agents to learn from past mistakes and to choose heuristics that have performed well in the (recent) past. $U_h$ denotes the fitness measure of a certain forecasting strategy $h$ defined by

$$U_{h,t-1} = F(x^e_{h,t-1} - x_{t-1}) + \eta\,U_{h,t-2}, \tag{6}$$

where $F$ is a generic function of the forecast error of heuristic $h$, and $0 \le \eta \le 1$ is a memory parameter measuring the relative weight agents give to past errors of heuristic $h$. Performance is completely determined by the most recent forecasting error if $\eta = 0$, while performance depends on all past prediction errors with exponentially declining weights if $0 < \eta < 1$ or with equal weights if $\eta = 1$. If all agents simultaneously update the forecasting rule they use, the fraction of agents choosing rule $h$ in each period $t$ can be described by

$$n_{h,t} = \frac{\exp(\beta U_{h,t-1})}{\sum_{h'=1}^{H} \exp(\beta U_{h',t-1})}. \tag{7}$$

The multinomial logit expression described in Eq. (7) can be derived directly from a random utility model (see Manski and McFadden, 1981 and Hommes, 1997). The parameter $\beta \ge 0$, referred to as "intensity of choice", reflects the sensitivity of agents to selecting the optimal prediction strategy according to the fitness measure $U_h$. 2 If $\beta = 0$, $n_{h,t}$ is constant for all $h$, meaning that agents do not exhibit any willingness to learn from past performance; if $\beta = \infty$, all agents adopt the best performing heuristic with probability one. The reinforcement learning model in Eq. (7) is extended in Hommes et al. (2005a) and Diks and van der Weide (2005) to include asynchronous updating in order to allow for the possibility that not all agents update their rule in every period (consistent with empirical evidence; see Hommes et al., 2005b). This yields a generalized version of Eq. (7) described by

$$n_{h,t} = \delta\,n_{h,t-1} + (1 - \delta)\,\frac{\exp(\beta U_{h,t-1})}{\sum_{h'=1}^{H} \exp(\beta U_{h',t-1})}. \tag{8}$$

The parameter $0 \le \delta \le 1$ introduces persistence in the adoption of forecasting strategies and can be interpreted as the average fraction of individuals who, in each period, stick to their previous strategy. In order to use this behavioral model for policy analyses or predictions, specific assumptions have to be made about the nature of agents' forecasting heuristics (in general, the set $\mathcal{H}$ may contain an arbitrary number of forecasting rules). We restrict our attention to a set of four heuristics described in Table 1.
Notes to Table 1: $x^{av}_{t-1}$ denotes the average of all observations up to time $t - 1$.
The choice of this specific set of heuristics is motivated on empirical grounds. These heuristics were obtained as descriptions of typical individual forecasting behavior observed in Hommes et al. (2005b), Hommes et al. (2008), and Assenza et al. (2018). Based upon the calibration in these papers, we use the parameters $\beta = 0.4$, $\delta = 0.9$, and $\eta = 0.7$. 3 While different heuristic switching models have been employed in the literature, the four-heuristic model that we use has the best empirical support (it fits the data well in learning-to-forecast experiments both on financial asset markets and macroeconomics). We consider it a feature that this model performs well across different settings (for economists worrying about the "wilderness of bounded rationality," it should be useful to see that there are models that perform well in a variety of settings).
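The switching mechanism can be sketched in a few lines. The code below is a minimal illustration, assuming that the generic function $F$ is the negative squared forecast error (the text leaves $F$ unspecified) and that fitness is updated before fractions. The function names are hypothetical; the values of $\beta$, $\delta$, and $\eta$ are those quoted in the text.

```python
import math

BETA, DELTA, ETA = 0.4, 0.9, 0.7  # calibration quoted in the text

def update_fitness(prev_fitness, forecasts, realized, eta=ETA):
    """New fitness per heuristic: -(forecast error)^2 + eta * previous fitness.

    Using the negative squared error for F is an assumption of this sketch.
    """
    return [-(f - realized) ** 2 + eta * u
            for f, u in zip(forecasts, prev_fitness)]

def update_fractions(prev_fractions, fitness, beta=BETA, delta=DELTA):
    """Logit switching with asynchronous updating.

    A share delta of agents keeps last period's rule; the rest choose
    according to multinomial logit probabilities based on fitness.
    """
    top = max(fitness)  # subtracting the maximum avoids overflow in exp
    weights = [math.exp(beta * (u - top)) for u in fitness]
    total = sum(weights)
    return [delta * n + (1.0 - delta) * w / total
            for n, w in zip(prev_fractions, weights)]
```

With $\delta = 0.9$ the fractions move slowly: even a heuristic with clearly better recent performance gains only a few percentage points of followers per period.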

Existence and non-existence of trade-offs
A result derived from Model (4) under rational expectations is a policy trade-off between the volatility of the output gap and the volatility of inflation. A decline in output gap volatility resulting from a more active output stabilization policy comes at the price of an increase in inflation volatility (it is reasonable to focus on volatility because, for the rational and the behavioral models alike, inflation and output gap are on average at their target and steady state level for reasonable values of $\varphi_\pi$ and $\varphi_y$). This policy trade-off is described in Fig. 1(a), where we show the effect of $\varphi_y$ (with which the central bank reacts to deviations of the output gap from its steady state level) on inflation volatility. Higher output stabilization, that is, an increase in the reaction coefficient $\varphi_y$, comes at the price of higher inflation volatility. The immediate policy implication for a central bank whose main objective is price stability is that it is optimal to set $\varphi_y = 0$, that is, not to react to output gap fluctuations at all (cf. Galí, 2008, and Woodford, 2003).
For the simulations of this graph, the parameter $\varphi_\pi$ is equal to 1.5 (different values lead to similar results, see Online Appendix B) and the structural parameters in Eqs. (1)-(3) are as estimated in Clarida et al. (2000). 4 The inflation target used for the simulations is $\bar{\pi} = 3.5$ (this is the same target that will be used in the experiment; a rationale for this value can be found in Section 3.2; the simulations yield similar results for different values of $\bar{\pi}$). This inflation target leads to a steady state level of the output gap of $\bar{y} = 0.1166667$. Inflation volatility is measured by

$$v(\pi) = \frac{1}{T} \sum_{t=2}^{T} (\pi_t - \pi_{t-1})^2,$$

with $T$ denoting the total number of periods. This measure of volatility has some properties that make it preferable to other measures of price instability (the measurement of volatility is discussed in Section 2.3.2 and in Online Appendix C; using alternative measures yields similar results).
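The rational expectations trade-off can be made concrete with a small calculation. The sketch below assumes that expectations are fixed at the target and the steady state, that the zero lower bound does not bind, and that the shocks $g_t$ and $u_t$ are uncorrelated white noise. Under these assumptions the standard model gives $\pi_t - \bar{\pi} = \Omega\,[\lambda g_t + (1 + \phi\varphi_y)\,u_t]$, and with serially uncorrelated shocks the expected squared period-to-period change is twice this variance. The function name, the default parameter values, and the unit shock variances are illustrative assumptions.

```python
def re_inflation_variance(phi_y, *, phi=1.0, lam=0.3, phi_pi=1.5,
                          var_g=1.0, var_u=1.0):
    """Variance of inflation when expectations are pinned at their targets.

    All default values are illustrative assumptions.
    """
    omega = 1.0 / (1.0 + lam * phi * phi_pi + phi * phi_y)
    # pi_t - pi_bar = omega * (lam * g_t + (1 + phi * phi_y) * u_t)
    return omega ** 2 * (lam ** 2 * var_g + (1.0 + phi * phi_y) ** 2 * var_u)

# Inflation variance for increasing output gap reaction coefficients.
values = [re_inflation_variance(c) for c in (0.0, 0.25, 0.5, 1.0)]
```

In this sketch the variance rises with the output gap coefficient because a stronger output response passes more of the supply shock $u_t$ through to inflation. Setting `var_u=0.0` removes that channel, and the function then falls as the coefficient increases.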
In Fig. 1(b), we show the effect of the parameter $\varphi_y$ on inflation volatility when expectations are formed according to the behavioral model described in Section 2.2 (note that the scales in Fig. 1(a) and (b) are different; the overall level of inflation volatility is higher under behavioral expectations than under rational expectations). In contrast to the simulation results under rational expectations, the graph of inflation volatility as a function of $\varphi_y$ has a U-shape. 5 Thus, starting from $\varphi_y = 0$, the central bank can simultaneously decrease inflation and output gap volatility by also reacting with its monetary policy to deviations of the output gap from its steady state level (in addition, reacting to the output gap would also lead to less volatile interest rates). Fig. 2 depicts output gap volatility and interest rate volatility as functions of $\varphi_y$ (as $\varphi_y$ increases, output gap volatility decreases under both rational and behavioral expectations; interest rate volatility decreases continuously in $\varphi_y$ under rational expectations, while it first decreases strongly under behavioral expectations and then slowly increases again). Hence, under behavioral expectations, there is a broader scope for output stabilization.
Now we turn to the intuition of these results. Considering the outcome simulated with rational expectations (Fig. 1(a)), one may be tempted to believe that the following simple rule is correct: "If there are two variables, targeting one variable will always come at the expense of the other variable." In general, this is not the case, however. The intuition is slightly more complex. Homogeneous rational expectations are strictly forward looking and in this model always equal to the inflation target and the corresponding steady state level of the output gap, respectively (assuming that $\varphi_\pi + \varphi_y (1 - \rho)/\lambda > 1$, which ensures a determinate model solution; see, e.g., Woodford, 2003). These expectations do not depend in any way on the current level of inflation and output gap or on any past behavior. It is precisely via the dependence of expectations on (past) actual variables that reacting to the output gap can also pay off in terms of inflation volatility.
To illustrate this, imagine that inflation and output gap are constant at $\bar{\pi}$ and $\bar{y}$, respectively, and that a combination of shocks arrives in one period that would lead (without any reaction by the central bank) to inflation staying constant and the output gap rising above the steady state level. Should the central bank react to this shock if it only cares about inflation? The rational expectations answer would be "no"; inflation is at its target and in the next period one would (assuming no further shocks) again be at the inflation target and the steady state level of the output gap, because expectations do not react to the past. However, under behavioral expectations, what happens now matters for the future. If there is some adaptive or trend-following behavior, a higher output gap now will lead agents to revise their expectations of the future output gap upward, leading to a higher realized output gap in the future, which will in turn lead to upward pressure on inflation. Therefore, it can be beneficial for the central bank to curb the increase of the output gap now (at the expense of slightly lower inflation now) in order to reduce the upward pressure on inflation in the future. However, if the monetary authority puts too much weight on output gap stabilization, the ensuing fluctuations in inflation dominate the stabilization bonus provided by less volatile output, leading to higher inflation volatility.

Robustness and measurement of inflation volatility
The simulation results are qualitatively robust to a wide variety of changes. This includes changes in all parameters of the macroeconomic model. It also includes changes in the parameters of the behavioral model of expectation formation. More interestingly, the results are also robust to other models of behavioral expectation formation, such as a heuristic switching model with fewer and simpler heuristics or adaptive expectations without any switching involved. The results also persist if the central bank uses a slightly different Taylor rule to smooth the interest rate (more precisely, we model the interest rate rule in that exercise as a weighted average of the regular Taylor rule and the previous period's interest rate). Such variations are shown in Online Appendix B. While the results are qualitatively robust to these changes within this macroeconomic framework (which is the most standard framework for macroeconomic policy analysis), it is possible in other macroeconomic frameworks to reverse the results obtained by rational expectations. That is, it is possible to obtain a reduction of inflation volatility by increasing $\varphi_y$; examples are models that only include shocks to the aggregate demand equation (1) such as technology shocks, preference shocks or variations in government purchases, but do not include shocks to the short-run aggregate supply relationship (2) (see Woodford, 2003). In such frameworks, behavioral expectations are an additional reason why inflation volatility decreases when also targeting the output gap.
We focus on inflation volatility as measured by $v(\pi) = \frac{1}{T} \sum_{t=2}^{T} (\pi_t - \pi_{t-1})^2$ for the simulations of the theoretical model (and for the predictions for the experiment). This measure has advantages over alternative measures of price instability. For some economists, the mean squared deviation from the target springs to mind as a measure. However, the measure we use has a few intuitive advantages over the mean squared deviation. For example, the mean squared deviation does not distinguish between erratic behavior around the target with decreasing distance from the target and slow convergence if the absolute distance to the target is always equal. The differences between these two measures and other simple measures are discussed in more detail in Online Appendix C. Note, however, that we obtain similar results when using different measures. When using, for example, the mean squared deviation from the target, the shapes of the graphs showing the trade-offs persist, but differences become smaller, that is, the curves become flatter. The same holds for our experimental results, which are described in the next section: the results go in the same direction but are not quite as strong (though the mean squared deviation from the target in the "inflation targeting only" treatment is still more than 20% above that in the "inflation and output gap targeting" treatment).
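The difference between the two measures can be seen in a small example. The sketch below implements the mean squared period-to-period change and the mean squared deviation from the target, and applies both to two made-up series that are equally far from the target in every period: one converges smoothly, the other jumps from one side of the target to the other. The series and function names are hypothetical and chosen only for illustration.

```python
def volatility(series):
    """Mean squared period-to-period change: (1/T) * sum over t = 2..T."""
    T = len(series)
    return sum((b - a) ** 2 for a, b in zip(series, series[1:])) / T

def mean_squared_deviation(series, target):
    """Mean squared distance of the series from the target."""
    return sum((x - target) ** 2 for x in series) / len(series)

target = 3.5
smooth = [4.5, 4.0, 3.75, 3.625]   # slow convergence from above
erratic = [4.5, 3.0, 3.75, 3.375]  # same distances, alternating sides

msd_smooth = mean_squared_deviation(smooth, target)
msd_erratic = mean_squared_deviation(erratic, target)
vol_smooth = volatility(smooth)
vol_erratic = volatility(erratic)
```

Both series have the same mean squared deviation, while the volatility measure is several times larger for the erratic series.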
One of the reasons why the mean squared deviation from the target may be popular among economists is that it constitutes a welfare criterion under homogeneous expectations. However, as shown in Di Bartolomeo et al. (2016), it is not an appropriate welfare criterion when agents have heterogeneous expectations. In this case, price dispersion arises not only because of the staggered price setting mechanism but also because of the heterogeneity of prices set by reoptimizing firms in the Calvo lottery, which depends on the heterogeneity of firms' expectations of future inflation. In our behavioral model, this heterogeneity increases in the relative changes in inflation. In general, reducing inflation volatility by means of an interest rate rule that reacts to both inflation and output gap fluctuations is thus welfare improving not only because it reduces the price differences between optimizing and non-optimizing firms, but also because it reduces the cross-sectional variance of prices set by optimizing firms. We refrain from using the precise welfare criterion as it depends on more than inflation alone. While welfare criteria derived in particular models may have influenced the fact that price stability is now the sole aim of many central banks, these central banks now have the mandate to achieve price stability and not the aim of maximizing a model-dependent welfare criterion. In addition, using a composite welfare measure would reduce the clarity and readability of the paper. Note that both simulation results and experimental results are similar when considering the precise welfare criterion.

Experiment
The only task for subjects in the experiment is to forecast inflation and output gap. These forecasts are then used to calculate subsequent realizations. The model underlying the experimental economy is the macroeconomic model described in Section 2.1 (with the same calibration of macroeconomic parameters as before). Before we describe the experiment in more detail, we now explain the treatments and hypotheses. The design of the experiment and the hypotheses can be motivated with the theory described in Section 2 .

Treatments and hypotheses
There are two treatments, T1 ("inflation targeting only") and T2 ("inflation and output gap targeting"). The only difference between the treatments lies in the Taylor rule describing monetary policy. In T1, the parameters of the Taylor rule are $\varphi_\pi = 1.5$ and $\varphi_y = 0$, whereas they are $\varphi_\pi = 1.5$ and $\varphi_y = 0.5$ in T2. 6 That is, the only difference between the treatments is that in T1 the central bank only targets inflation, whereas it targets the output gap in addition to inflation in T2.
We are interested in testing the null hypothesis (which can be derived from the rational expectations model in Section 2) that inflation volatility in T1 is less than or equal to inflation volatility in T2 against the alternative hypothesis (which can be derived from the behavioral model) that inflation volatility is greater in T1 than in T2. Fig. 3 summarizes these hypotheses.
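A one-sided comparison of this kind is often carried out with a rank-based statistic computed on group-level observations. The sketch below computes the Mann-Whitney U statistic by counting pairs. This choice of statistic is an assumption made for illustration, since the text above does not state which test is used, and the volatility values are made up.

```python
def mann_whitney_u(sample_a, sample_b):
    """Count pairs (a, b) with a > b; ties count one half."""
    u = 0.0
    for a in sample_a:
        for b in sample_b:
            if a > b:
                u += 1.0
            elif a == b:
                u += 0.5
    return u

# Hypothetical group-level inflation volatilities, made up for illustration.
t1 = [0.30, 0.12, 0.25, 0.08]
t2 = [0.10, 0.05, 0.07, 0.15]

u_stat = mann_whitney_u(t1, t2)
# Share of (T1, T2) pairs in which the T1 group is more volatile.
share = u_stat / (len(t1) * len(t2))
```

A share well above one half points in the direction of the alternative hypothesis; a p-value would additionally require the null distribution of U, which this sketch does not compute.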
In the experiment, the number of subjects per experimental economy is six. Evidence from other experiments indicates that four to six subjects are enough to justify the use of the competitive equilibrium as equilibrium concept (see, e.g., Huck et al., 2004). Note, however, that also in a game theoretic analysis the unique Nash equilibrium is forecasting $\bar{\pi}$ and $\bar{y}$.
6 The optimal choice of $\varphi_y$ in Fig. 1(b) is lower than 0.5. However, the U-shape in that figure is not symmetric: the function is much steeper to the left of the minimum than to the right. If the true function underlying human behavior is a similar asymmetric function with a minimum that is slightly different from the one of the benchmark model, we want to avoid being to the left of the minimum while being to the right of the minimum is less problematic. We have chosen $\varphi_y = 0.5$ as this is a round value often used in economic applications that is to the right of the minimum in the benchmark model and that leads to inflation volatility not much above that at the minimum.

Course of events and implementation
The design is a between-subjects design with within-session randomization. In the beginning, all participants are divided into groups (experimental economies) of six. Subjects only interact with other subjects in their group, without knowing who they are. Subjects are asked to make forecasts of inflation and output gap. The average forecasts of all subjects in one group are then used to calculate the realizations of inflation and output gap according to model equations (1)-(3) (only the average forecasts $\bar{\pi}^e_{t+1}$ and $\bar{y}^e_{t+1}$ are needed to calculate the realizations $\pi_t$ and $y_t$). When making their forecasts for period $t + 1$, the information subjects can see on their screen (as numbers and partly also in graphs) is the following: all realizations of inflation, output gap, and interest rate up to period $t - 1$, their own forecasts of inflation and output gap up to period $t$, and their scores stating how close their past forecasts were to realized values up to period $t - 1$ (these scores determine the payments). As subjects are only informed about realizations up to period $t - 1$, their forecasts for period $t + 1$ are effectively two-period-ahead forecasts. Fig. 4 shows a screenshot of the experiment (a larger version of the same screenshot can be found in Online Appendix E).
The inflation target of the central bank in the experiment is $\bar{\pi} = 3.5$. This target is chosen for two reasons. First, it is distant from the zero lower bound, which is desirable as we do not wish to investigate behavior in a liquidity trap. Second, it is different from focal points such as 2% or 2.5%, which are standard inflation targets in the real world. We avoid these focal points so that learning can be observed in the experiment. Our theory and experiment concern feedback from the monetary policy rule to deviations of inflation and output gap from their target and steady state levels. Laboratory subjects are very heterogeneous, and if most of them start out with their forecasts extremely close to the target already, the feedback plays a smaller role in comparison to subjects' heterogeneity and mistakes.
Subjects' payments depend on their forecasting performance. Whether a participant is paid for inflation forecasting or output gap forecasting is determined randomly at the end of the experiment. The total scores for inflation and output gap forecasting are the sums of the respective forecasting scores over all periods. For subject $i$'s inflation forecast in period $t$, the score equals $100/(1 + |\pi^e_{t,i} - \pi_t|)$, where $\pi^e_{t,i}$ denotes subject $i$'s forecast for period $t$ and $\pi_t$ the realized value of this period. The score for output gap forecasting is calculated analogously. This means that subjects' payments decrease with the distance of the realizations from their forecasts.
In the instructions, subjects receive a qualitative description of the economy that includes an explanation of the mechanisms that govern the model equations. Concerning monetary policy, subjects in both treatments are only told that the central bank decreases the interest rate if it wants to increase inflation or output gap, and that it increases the interest rate if it wants to decrease inflation or output gap. 7 Except for the precise formulation of the equations of the macroeconomic model, the instructions contain full information about the experiment (i.e., on the number of subjects per group, payments, etc.). The complete instructions can be found in Online Appendix D.
The experiment was programmed in Java and conducted at the CREED laboratory at the University of Amsterdam. The experiment was conducted with 258 subjects recruited from the CREED subject pool (43 groups of six subjects each, distributed over thirteen sessions). After each session, participants filled out a short questionnaire. Participants were primarily undergraduate students; the average age was slightly above 22 years. About half of the participants were female, about two-thirds were majoring in economics or business, and about half were Dutch. During the experiment, 'points' were used as currency. These points were exchanged for euros at the end of each session at an exchange rate of 0.75 euros per 100 points. The experiment lasted around two hours, and participants earned on average about 30 euros. The series of error terms used in the model equations ($g_t$ and $u_t$ in Eqs. (1) and (2)) differed across groups within each treatment, but the sets of noise series used in the two treatments were the same. 8
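The scoring rule and the conversion of points into euros described above can be written down directly. The function names below are hypothetical; the formula and the exchange rate are those given in the text.

```python
def forecast_score(forecast, realized):
    """Points for one forecast: 100 / (1 + absolute forecast error)."""
    return 100.0 / (1.0 + abs(forecast - realized))

def payment_in_euros(total_points, euros_per_100_points=0.75):
    """Convert a total score in points into euros."""
    return total_points / 100.0 * euros_per_100_points

perfect = forecast_score(3.5, 3.5)     # forecast equals the realization
off_by_one = forecast_score(4.5, 3.5)  # error of one percentage point
payout = payment_in_euros(4000.0)      # an arbitrary total of 4000 points
```

A perfect forecast earns 100 points and an error of one percentage point halves the score, so the rule penalizes small errors relatively strongly.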

Results
There are data from 43 different groups, 21 in T 1 and 22 in T 2. The groups' actions do not influence one another in any way; thus the observations at the group level are statistically independent. The data for all groups separately, including all individual forecasts, can be found in Online Appendix E.

We excluded two of the groups from the analysis (including these two groups, the experiment was conducted with 270 subjects). One of the groups was excluded because of a very large typo (a forecast of 30 instead of 3.0; the corresponding participant notified us about this typo in the post-experiment questionnaire). The other group was excluded due to a severe misunderstanding on the part of one subject, who systematically stayed very far from the actual realizations (thereby also losing a lot of money). Our conclusions do not change if we include these groups in our analysis. The realizations and forecasts of inflation and output gap for these two groups are shown in Figure 34, Online Appendix E.

7 As the experiment uses two-period-ahead forecasts, after reading the instructions subjects are asked to enter forecasts for periods 1 and 2 simultaneously. Subjects therefore receive some indication of reasonable values by being told in the instructions that in economies similar to the one at hand inflation has historically been between −5% and 10% and the output gap between −5% and 5%.

8 Before conducting the experiment, two pilot sessions were conducted (with a total of six groups). The pilot sessions differ from the actual experiment as follows: the error terms added to the model equations had a larger standard deviation, a different inflation target was used, and subjects in the pilot did not receive any information on the number of participants in each group. For two of the groups, a different combination of parameters for the Taylor rule was used.

Inflation
Fig. 5 gives an overview of realized inflation in all experimental economies, separately for T 1 and T 2. Each line corresponds to the inflation in one economy, tracked over all 50 periods of the experiment. Almost all economies are close to the inflation target after 50 periods, and in the economies with inflation still oscillating around the target the amplitude of these oscillations is decreasing. 9 Groups are heterogeneous, with some groups exhibiting much larger volatility than other groups. In many groups in both treatments, inflation is within one percentage point of the target in most or all periods. Inflation generally fluctuates around its target: mean inflation is between 3.13 and 4.33 in T 1 and between 2.79 and 3.82 in T 2 (treatment averages of mean inflation are 3.55 in T 1 and 3.41 in T 2). On average, inflation fluctuates a bit less in T 2 than in T 1, as predicted by the behavioral model. 10 Average inflation volatility is 0.21 in T 1 and 0.09 in T 2. Inflation volatility in each group can be seen in Fig. 6, where the empirical cumulative distribution functions (ECDFs) are drawn. For each value on the horizontal axis, the ECDF shows on the vertical axis the fraction of groups in each treatment with inflation volatility less than or equal to this value (the colored dots represent the observations; that is, the value on the horizontal axis below a dot is the corresponding group's inflation volatility). 11 The graph clearly shows that inflation volatility is lower in T 2 than in T 1.
In fact, the whole ECDF of observations in T 2 lies to the left of the ECDF of observations in T 1 (the single high value in T 2, the rightmost blue dot, corresponds to the oscillating red line in the right graph of Fig. 5).
In order to test the statistical significance of this finding, we use a Wilcoxon rank-sum test. We test the null-hypothesis that inflation volatility in T 1 is less than or equal to that in T 2 against the alternative hypothesis that inflation volatility is lower in T 2. 12 This test rejects the null-hypothesis ($p < 10^{-3}$). The advantage of the Wilcoxon rank-sum test is that it makes very unrestrictive assumptions on the underlying data. Note, however, that the results are robust to employing different tests. 13

9 That many economies are converging to the steady state over the course of the experiment is not necessarily surprising, as there are 50 periods without any changes to the underlying model (cf. Pfajfar and Zakelj, 2014, and Assenza et al., 2018).

10 In particular, when looking at the many groups that stay within roughly one percentage point of the target, one can see that there is more up-and-down movement of the lines in T 1 than in T 2 (although there is one more observation in T 2). This is also what one can see when one follows single lines from period 1 to 50; the lines of most groups in T 2 are flatter than the lines of most groups in T 1.

11 As in Section 2.3, we use $v(\pi) = \frac{1}{T}\sum_{t=2}^{T}(\pi_t - \pi_{t-1})^2$ as measure of inflation volatility (see Section 2.3.2 and Online Appendix C for a discussion). The ECDFs of other measures of price instability look similar to the one in Fig. 6 and can be found in Online Appendix F (Fig. 36).

12 Strictly speaking, the Wilcoxon rank-sum test tests the null-hypothesis that the distribution shifts to the right (from T 1 to T 2) or that it does not change.

13 The data are not normally distributed, but the logarithms of the data look rather close to a normal distribution (and are statistically not significantly different from it, according to a Kolmogorov-Smirnov test). A t-test on the logarithms of the data also rejects the null-hypothesis ($p = 0.002$).

Output gap and interest rate
Fig. 7 shows the output gap in all experimental economies. Here, the differences are even larger; the output gap is much more volatile in T 1 than in T 2. This was to be expected, as both models predict that the output gap is more stable when it is also targeted by the central bank. The mean of the output gap is between −0.12 and 0.70 in T 1 and between −0.03 and 0.66 in T 2. Fig. 8 shows the ECDFs of output gap volatility. A Wilcoxon rank-sum test rejects the null-hypothesis that output gap volatility in T 1 is less than or equal to that in T 2 ($p < 10^{-4}$).

Similarly, Fig. 9 shows the interest rates in all groups. In addition, it shows a horizontal line at zero. As one can see in these graphs, the zero lower bound is never hit (it is almost hit in one group in T 1, but the lowest interest rate in this group is still slightly above zero). The mean of the interest rate is between 2.94 and 4.74 in T 1 and between 2.70 and 3.92 in T 2. Fig. 10 shows the ECDFs of interest rate volatility. These figures show that the interest rate is much smoother in T 2 than in T 1. This smoothness is achieved without interest rate smoothing in the Taylor rule. Thus, reacting to changes in the output gap on top of inflation not only decreases inflation volatility and output gap volatility simultaneously but also leads to a less volatile interest rate. This can be seen as an additional reason for central banks to react to the output gap on top of inflation (a smooth interest rate may not be included in the mandate of a central bank, but in practice central bankers care about it; for a discussion see Srour, 2001). These differences are also statistically significant: a Wilcoxon rank-sum test rejects the null-hypothesis that interest rate volatility in T 1 is less than or equal to that in T 2 with a p-value of less than $10^{-4}$. 14
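The one-sided rank-sum tests reported in this section can be sketched as follows. This is a minimal illustration using only the standard library and the large-sample normal approximation without a tie correction; it is an assumption that this matches the exact procedure used for the reported p-values, and for samples as small as those in the experiment an exact test from a statistics package would usually be preferable. The function names and the volatility numbers are hypothetical.

```python
import math


def midranks(values):
    """Ranks 1..n, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    return ranks


def rank_sum_test_greater(x, y):
    """One-sided Wilcoxon rank-sum test (normal approximation, no tie correction).

    Null hypothesis: values in x tend to be less than or equal to values in y.
    Alternative: values in x tend to be larger than values in y.
    Returns the U statistic of x and the approximate p-value.
    """
    n_x, n_y = len(x), len(y)
    ranks = midranks(list(x) + list(y))
    rank_sum_x = sum(ranks[:n_x])
    u_x = rank_sum_x - n_x * (n_x + 1) / 2.0
    mean_u = n_x * n_y / 2.0
    sd_u = math.sqrt(n_x * n_y * (n_x + n_y + 1) / 12.0)
    z = (u_x - mean_u) / sd_u
    p_value = 0.5 * math.erfc(z / math.sqrt(2.0))  # upper-tail probability
    return u_x, p_value


# Hypothetical group-level volatilities (made-up numbers, not the experimental data).
volatility_t1 = [0.30, 0.12, 0.25, 0.18, 0.40, 0.22]
volatility_t2 = [0.05, 0.08, 0.10, 0.07, 0.15, 0.06]
u_stat, p_val = rank_sum_test_greater(volatility_t1, volatility_t2)
```

Because the test uses only ranks, a single very volatile group affects the result no more than any other group ranked at the top, which is one reason the test suits heterogeneous group-level data.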

The heuristic switching model as predictor of subjects' forecasts
After having analyzed the economic outcomes, we now examine the performance of the heuristic switching model used to derive the predictions in the experiment. Does this model accurately describe subjects' forecasts? Or does the rational expectation solution or one of the heuristics alone predict subjects' forecasts better than the switching model? In addition, we compare the prediction performance to that of a few parsimonious heuristic switching models.
We report the prediction performance of the heuristic switching model (HSM), the performance of the homogeneous rational agent solution (RE) and the performance of the four heuristics involved in the switching model without any switching: adaptive expectations (ADA), weak trend-following (WTR), strong trend-following (STR), and the learning, anchoring and adjustment rule (LAA). In addition, we compare the model to four parsimonious heuristic switching models, namely one where agents switch between using naive expectations and using a trend-following rule with coefficient one (Naive+Trend), one where agents switch between forecasting the steady state and a trend-following rule with coefficient one (Fundamental+Trend), and models where agents switch between the steady state plus a constant and the steady state minus a constant (Biased Fund.; we use three different constants, 0.25, 0.5, and 1).
We use the mean squared difference between the models' two-period-ahead predictions and the average forecasts in a group as the prediction error. The predictions are thus out-of-sample predictions. In the main heuristic switching model, we use the same parameters as in Section 2 (that is, we use a preexisting calibration that is not influenced by our experimental data; for the parsimonious switching models, we use the same parameters unless otherwise specified). To form the predictions of the heuristic switching model, the fractions of agents using the different heuristics need to be determined. For this, we use the fractions that are implied by the theoretical model with a continuum of agents. We use the squared difference between this prediction and a group's average forecast, because there are no degrees of freedom with this method. 15

Table 2 shows average prediction errors across all periods and all groups in a treatment. Across the board, the benchmark model performs much better than rational expectations. The rational expectations solution is also a worse predictor in all cases than any of the four involved heuristics alone. Furthermore, the switching model is a better predictor in all cases than any of the four heuristics alone. In general, the differences are considerable. The switching model does much better than most of the other models. There are two heuristics that do very well when employed alone: the weak trend-following rule and the anchoring and adjustment rule. Nevertheless, the switching model predicts all four forecasts better than these heuristics. The prediction errors of these two best-performing heuristics when employed alone are always at least 25% greater than the prediction errors of the switching model. In addition, the main heuristic switching model does better throughout than the smaller switching models (which also have prediction errors at least 25% above those of the benchmark model).
Note that we are not attempting to fit the parameters of the heuristic switching model to the data ex post, as this would only give us lower prediction errors at the cost of adding degrees of freedom and thereby make the comparison with the homogeneous rational agent solution uneven. It is noticeable when looking at Table 2 that prediction errors of all models are smaller in T 2 than in T 1. This can be explained by the fact that the realizations of the variables are more volatile in T 1 than in T 2. More volatile realizations and more volatile forecasts naturally go hand in hand. When looking at the data, groups' average forecasts are indeed more volatile in T 1 than in T 2. Inflation forecast volatility is 0.281 in T 1 and 0.134 in T 2; output gap forecast volatility is 0.464 in T 1 and 0.096 in T 2. These differences are statistically significant when tested with two-sided Wilcoxon rank-sum tests (the p-values are 0.006 for inflation and $< 10^{-3}$ for the output gap). It is not surprising that the models have a harder time accurately describing subjects' forecasts when these are more volatile.
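The prediction error used in this comparison, the mean squared difference between a model's predictions and a group's average forecast, can be sketched as follows. The function names, the two toy models, and all numbers are hypothetical and serve only to show the computation.

```python
def group_average_forecasts(individual_forecasts):
    """Average the forecasts of all subjects in a group, period by period.

    individual_forecasts: one list of forecasts per subject, all of equal length.
    """
    n_subjects = len(individual_forecasts)
    n_periods = len(individual_forecasts[0])
    return [
        sum(subject[t] for subject in individual_forecasts) / n_subjects
        for t in range(n_periods)
    ]


def mean_squared_prediction_error(model_predictions, average_forecasts):
    """Mean squared difference between model predictions and average forecasts."""
    squared = [(m - a) ** 2 for m, a in zip(model_predictions, average_forecasts)]
    return sum(squared) / len(squared)


# Hypothetical data: three subjects, four periods (made-up numbers).
forecasts = [
    [3.0, 3.2, 3.4, 3.3],
    [3.2, 3.4, 3.6, 3.5],
    [3.4, 3.6, 3.8, 3.7],
]
averages = group_average_forecasts(forecasts)

model_a = [3.2, 3.4, 3.6, 3.5]  # hypothetical model that tracks the averages
model_b = [3.5, 3.5, 3.5, 3.5]  # hypothetical constant prediction
error_a = mean_squared_prediction_error(model_a, averages)
error_b = mean_squared_prediction_error(model_b, averages)
```

Averaging such errors across periods and groups within a treatment gives one number per model and forecast variable, which is the kind of entry a table of prediction errors reports.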

Fractions of heuristics used
In the following, we consider the fractions of employed heuristics in the model fitted to the experimental data. This can help to understand which heuristics are used most and whether there are patterns concerning the use of the heuristics over time. Fig. 11 shows the fractions of the heuristics over time for inflation and output gap in T 1 and T 2 (the lines represent averages across groups).
One can see that all heuristics have some support in the experiment. One can also see that the graphs of inflation and output gap forecasting in T 1 are very similar. The same holds for the corresponding graphs in T 2. This suggests that while subjects learn and update over time the way that they form expectations, they form expectations on inflation and output gap in similar ways, and change how they form expectations on inflation and output gap in similar ways. This is not self-evident: it could well have been the case that subjects rely more on trend extrapolation for one variable while behaving adaptively when forecasting the other variable.
Regarding the use of the heuristics themselves, the adaptive rule and the anchoring and adjustment rule are used more often than the trend-following rules. The use of the adaptive rule increases over the course of the experiment, partially explaining the learning observed in the experiment. Furthermore, the trend-following rules are used less and less as the experiment proceeds (for inflation and output gap alike in both treatments; there are some small upward movements toward the end of the experiment in T 1, however). This also contributes to the stability of inflation and output gap in the second half of the experiment, as the trend-following rules are destabilizing. The use of the anchoring and adjustment rule follows a less-clear pattern. It increases strongly in the beginning in T 1 and decreases again thereafter. In T 2 the use of this rule increases more slowly; afterwards, it levels off. This rule, which has two components, has a less clear-cut interpretation than the other rules. One component is destabilizing, taking into account last trends rather than predicting a return to the anchor immediately, while the other component, the anchor itself, is stabilizing, as the long-run averages are very close to the inflation target and the steady state of the output gap. It is interesting to see that, overall, relatively little use is made of the weak trend-following rule. While this rule alone predicts group level aggregates of forecasts rather well ( Table 2 ), when looking at it from the point of view of the heuristic switching model, it seems that this is only the case because it approximates the prevailing mixes of the whole set of heuristics.
The fractions from the experimental data in Fig. 11 can be compared with the fractions in the theoretical model (obtained when simulating the model for 50 periods without any input of the experimental data). The fractions from the theoretical model alone are reported in Fig. 12 (the graphs show averages of 10,000 simulations, which are naturally quite smooth; T 1 and T 2 here just refer to the parameter combinations of the monetary policy rule in analogy to the experiment, that is, T 1 stands for φ π = 1.5 and φ y = 0, while T 2 stands for φ π = 1.5 and φ y = 0.5). A comparison of Figs. 11 and 12 shows that the fractions from the experimental data resemble those from the simulations. Over time, both in the experiment and in the simulations, the adaptive rule becomes more important (which can explain the learning that we observe, as described above). The learning, anchoring, and adjustment rule becomes more important over time and is mostly the rule with the greatest support but starts to flatten out and even to decrease again towards the end of the 50 periods. In both the experiment and the simulations, the learning, anchoring, and adjustment rule has greater support when there is no reaction to the output gap in the Taylor rule than when there is. Trend-following behavior (WTR and STR) has considerable support in the beginning (even with a short increase in trend-following behavior in the beginning in most cases, driven by an increase of weak trend following in the experiment and strong trend following in the simulations). However, over time, the use of trend-following behavior decreases (in line with the observed learning). Summing up, the fractions in the experiment and in the simulations from the theoretical model look very similar, lending additional support to the use of the benchmark heuristic switching model.
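How fractions of this kind evolve can be sketched with a small discrete-choice updating step. The four rule names are those used in the text, but everything else here is an assumption made for illustration: the functional forms follow a commonly used one-step-ahead specification of heuristic switching models, the coefficient values are placeholders rather than the calibration of Section 2, and the two-period-ahead timing of the experiment is ignored.

```python
import math

# Placeholder coefficients (assumptions, not the calibration used in the paper).
ADA_WEIGHT = 0.65  # weight on the last observation in the adaptive rule
WTR_COEF = 0.4     # weak trend-following coefficient
STR_COEF = 1.3     # strong trend-following coefficient
BETA = 0.4         # sensitivity of fractions to performance differences
ETA = 0.7          # memory in the performance measure
DELTA = 0.9        # inertia in the fractions


def heuristic_forecasts(history, last_ada_forecast):
    """Forecasts of the four rules from past observations (oldest first)."""
    last, previous = history[-1], history[-2]
    trend = last - previous
    anchor = sum(history) / len(history)  # long-run average of past observations
    return {
        "ADA": ADA_WEIGHT * last + (1.0 - ADA_WEIGHT) * last_ada_forecast,
        "WTR": last + WTR_COEF * trend,
        "STR": last + STR_COEF * trend,
        "LAA": 0.5 * (anchor + last) + trend,
    }


def update_fractions(fractions, performance, squared_errors):
    """One updating step: rules with smaller recent errors gain support."""
    new_performance = {
        h: -squared_errors[h] + ETA * performance[h] for h in fractions
    }
    weights = {h: math.exp(BETA * new_performance[h]) for h in fractions}
    total = sum(weights.values())
    new_fractions = {
        h: DELTA * fractions[h] + (1.0 - DELTA) * weights[h] / total
        for h in fractions
    }
    return new_fractions, new_performance


# Hypothetical step: the adaptive rule was exact, the other rules missed by one.
rules = ("ADA", "WTR", "STR", "LAA")
start_fractions = {h: 0.25 for h in rules}
start_performance = {h: 0.0 for h in rules}
errors = {"ADA": 0.0, "WTR": 1.0, "STR": 1.0, "LAA": 1.0}
next_fractions, next_performance = update_fractions(
    start_fractions, start_performance, errors
)
example = heuristic_forecasts([3.0, 3.2], last_ada_forecast=3.0)
```

With inertia close to one, fractions move slowly even after a large performance gap, which is consistent with the gradual shifts between rules described in this section.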

Concluding remarks
We have conducted a learning-to-forecast experiment to test the predictions of a macroeconomic model with behavioral expectations. This behavioral model yields results that differ from those of the same macroeconomic model based on rational expectations. Namely, the behavioral model yields that inflation volatility can be reduced if the central bank reacts to the output gap on top of inflation. The predictions of the behavioral model are supported by the outcomes of our experiment, in which the only treatment variation consists in a modification of the central bank's monetary policy reaction function.
These results are relevant for monetary policy analysis. They show a different relationship between inflation and the output gap than is usually assumed. The policy implications are particularly straightforward for central banks that aim at price stability alone, such as, for example, the ECB; these central banks should react to the output gap even if they are ultimately only interested in price stability.
Note that our policy recommendation has a broad foundation. We obtain it not only with our behavioral benchmark model but also assuming a variety of other specifications of expectation formation (as outlined in Online Appendix B). Moreover, the same recommendation can arise from slightly different macroeconomic models with different behavioral expectations (De Grauwe, 2011; De Grauwe, 2012a; Kurz et al., 2013). In addition to these theoretical findings, we present empirical evidence from a laboratory experiment leading to the same policy recommendation. We are aware that many macroeconomists are still skeptical about the idea that one can learn about macroeconomic behavior by conducting laboratory experiments (with small group sizes). However, while one cannot mirror a complete macroeconomy with all its decisions in the laboratory, it is possible to shed light on some specific macroeconomic questions. Our experiment is designed such that it abstracts from everything except for the feedback mechanism from expectations to realizations (which is altered by a one-parameter change). All decisions made by agents in the economy are computerized and correspond fully to the underlying macroeconomic model except for the forecasting of future variables. If such forecasts in the laboratory are formed in similar ways as forecasts in the outside world, the results from the laboratory experiment help to understand macroeconomic behavior. There is indeed recent evidence that students' forecasts in the laboratory have very similar characteristics to forecasts in the outside world (Cornand and Hubert, 2018), supporting the external validity of macroeconomic learning-to-forecast experiments.

Appendix A. Microfoundations of the behavioral macroeconomic model
The following derivation follows the work of Kurz et al. (2013). The economy is populated by a continuum of household-producers indexed by $j$. Agents are identical except for the fact that they may have different expectations about future macroeconomic variables. Household $j$ thus chooses consumption $C^j_t$, labor $L^j_t$, and bond holdings $B^j_t$ to maximize

[...]

Here, $E^j_t$ denotes the subjective expectations of household $j$, $\rho$ is the discount factor, $W_t$ is the nominal wage, $R_t$ is gross interest, $P_t$ is the aggregate price level, and $T^j_t$ are lump-sum transfers including profit from firms. We assume that $B^j_0$ is given and that there is no aggregate debt. As in Kurz et al. (2013) we include a penalty term $\tilde{\tau}_b$ in the utility function in place of institutional constraints to limit borrowing (with sufficiently small values of $\tilde{\tau}_b$, solutions with explosive borrowing are not equilibria). The first-order conditions are given by

[...]

For a generic variable $X_t$ we denote the steady-state value by $\bar{X}$ and define $\hat{x}_t = (X_t - \bar{X})/\bar{X}$, while for bond holdings we define $\hat{b}_t = B_t/(P_t \bar{Y})$ (with a steady-state value of zero). Denoting gross inflation $P_t/P_{t-1}$ as $\Pi_t$ and log-linearizing Eq. (A.1) around a zero inflation steady state we get

[...]

It is worth remarking at this point that we log-linearize the system around a zero inflation steady state for the sake of algebraic simplicity. However, as argued in Woodford (2003) and further discussed below, this does not imply that we can only consider policies under which the inflation target is zero, as long as the target inflation rate is not too large. Rewriting the individual consumption function above as

[...]

where $\hat{c}_t = \int \hat{c}^j_t \, dj$ is aggregate consumption, and using both the aggregate market clearing condition $\hat{c}_t = \hat{y}_t$ and the fact that $\hat{b}_t = 0$, we can aggregate the individual consumption functions to get

[...]

where $\bar{E}_t$ is the aggregate expectation operator defined as $\bar{E}_t(x_{t+1}) = \int E^j_t x_{t+1} \, dj$ for a generic variable $x$, and the term $\Phi_t(\hat{c}) \equiv \int E^j_t(\hat{c}^j_{t+1} - \hat{c}_{t+1}) \, dj$ denotes the difference between the average expectation of individual consumption and average consumption.
We now turn to the supply side of the economy. Final consumption of household $j$ is composed of intermediate goods, indexed by $i$ and produced by a continuum of monopolistically competitive firms, so that

[...]

where $P_{it}$ is the price of good $i$ and $P_t$ denotes the aggregate price level defined as

[...]

Aggregating demand for each good $i$ over households and using the aggregate market clearing condition $C_t = Y_t$, we get

[...]

Each firm has a linear production technology using labor as the only input,

[...]

where $A_t$ is aggregate productivity. Given the production function, we can write the expression for real marginal costs as

[...]

so that individual real profits can be expressed as

[...]

We assume staggered price setting as in the Calvo model, where only a fraction $1 - \omega$ of prices are readjusted in every period. Moreover, we consider a scenario in which households have equal ownership shares in all firms (so that income effects of random price adjustments are removed), though each household $j$ manages only one firm (i.e., makes price decisions for only one firm). Since each firm produces a single good and is managed by a single household $j$ (with subjective expectations $E^j$), we can, without loss of generality, use a single index, say $j$, to denote the produced good and the subjective expectations. A firm $j$ adjusting its price in period $t$ maximizes (given its subjective expectations) the present discounted value of profits in all future states prior to the next price readjustment,

[...]

In this expression, $\rho^{\tau} (C^j_{t+\tau}/C^j_t)^{-\sigma}$ is the stochastic discount factor of household $j$ managing the firm. Defining $q^*_{jt} = P^*_{jt}/P_t$ as the optimal price set by firm $j$ relative to the aggregate price level, we can write the first-order condition as

[...] (A.6)

Log-linearizing Eq. (A.6) and using the steady-state relation $mc = (\theta - 1)/\theta$ we get the individual pricing rule

[...]

Assuming, as is standard in the literature, that the law of iterated expectations holds at the individual level (see, e.g., Evans and Honkapohja, 2001, and Kurz et al., 2013), we can rewrite Eq. (A.7) as

[...]

Given the Calvo pricing scheme, in each period only a set of firms $S_t \subset [0, 1]$ of measure $1 - \omega$ adjust prices, while a set $S^c_t \subset [0, 1]$ of measure $\omega$ do not adjust. We assume that the sample of firms allowed to adjust prices in each period is selected independently across agents, so that the distribution of subjective expectations is the same for firms that adjust prices and for those that do not. Using the aggregate price definition we can then write

[...]

which can be rewritten as

[...]

Log-linearizing the above relation we get

[...]

Denoting $\hat{q}_t = \int \hat{q}^*_{jt} \, dj$ and integrating Eq. (A.8) on both sides we get

[...]

which can be rewritten as

[...]

Recalling from Eq. (A.9) that $\hat{q}_t = \frac{\omega}{1 - \omega} \hat{\Pi}_t$ and substituting it in the equation above we get

[...]

where again $\bar{E}_t$ is the aggregate expectation operator and $\Phi_t(\hat{q})$ [...] Eq. (A.11) implies a natural level of output under flexible prices given by

[...]

Plugging Eq. (A.11) into Eq. (A.10) and defining the output gap as $y_t = \hat{y}_t - \hat{y}^n_t$ results in

[...]

Rewriting the aggregate demand relation in terms of the output gap yields

[...]

The behavioral model assumes that agents deviate from fully rational behavior by using the described heuristics to forecast future output gap and inflation. We assume that using the heuristics for these forecasts is the only source of irrationality. More precisely, we assume that agents are not irrational when forming expectations about their own future consumption relative to average consumption and the price set by the firm managed by them relative to the average price. This implies that the terms $\Phi_t(\hat{c})$ and $\Phi_t(\hat{q})$ are equal to zero. 16 We remark that this simplifying assumption makes the implementation of the model in the laboratory easier.
The implementation of a macroeconomic model including the terms $\Phi_t(\hat{c})$ and $\Phi_t(\hat{q})$ would require the elicitation of expectations of individual consumption $\hat{c}^j_{t+1}$ and price $\hat{q}^*_{jt+1}$ as well as higher-order beliefs about average consumption $\hat{c}_{t+1}$ and price $\hat{q}_{t+1}$, which would be cognitively considerably more demanding for experimental subjects than forecasting only inflation and output gap. Nevertheless, we stress that, as argued by Kurz et al. (2013), the terms $\Phi_t(\hat{c})$ and $\Phi_t(\hat{q})$ can play an important role in models with diverse beliefs. We leave the experimental investigation of the impact of these terms for future research. We can therefore rewrite the aggregate demand and supply equations as

[...]

Defining $\pi_t$ as the inflation rate and $i_t \equiv \log(1 + \mathrm{yield}_t) - \gamma \approx \mathrm{yield}_t - \gamma$, where $\mathrm{yield}_t$ denotes the yield on the one-period bond and $\gamma \equiv -\log \rho$, we can write Eqs. (A.14) and (A.15) as

$$y_t = \bar{E}_t y_{t+1} - \phi (i_t - \bar{E}_t \pi_{t+1}) + g_t, \tag{A.16}$$

$$\pi_t = \lambda y_t + \rho \bar{E}_t \pi_{t+1} + u_t, \tag{A.17}$$

with $\phi \equiv \sigma^{-1}$ and with a cost-push shock $u_t$ added to the aggregate supply relation. We close the model with a monetary policy rule of the form

$$i_t = \bar{\pi} + \varphi_\pi (\pi_t - \bar{\pi}) + \varphi_y (y_t - \bar{y}) \tag{A.18}$$

when not at the zero lower bound, where $\bar{\pi}$ is the inflation target and $\bar{y} \equiv (1 - \rho) \bar{\pi} / \lambda$ is the steady-state level of the output gap consistent with the inflation target $\bar{\pi}$.
As argued in Woodford (2003), the fact that the New Keynesian equations above have been log-linearized around a zero inflation steady state does not mean that one can consider only policy rules that involve a target inflation rate of zero. In fact, Eqs. (A.16) and (A.17) are valid approximations as long as the target inflation is not too large. 17 The economy can thus be described by Eqs. (A.16) and (A.17) together with the monetary policy rule in Eq. (A.18), potentially subject to the zero lower bound. These equations correspond to Eqs. (1)-(3) in the main text. When the zero lower bound is not binding, the model can be written in matrix form as

$$
\begin{pmatrix} y_t \\ \pi_t \end{pmatrix}
= \Omega \begin{pmatrix} \phi \bar{\pi} (\varphi_\pi - 1) + \phi \varphi_y \bar{y} \\ \lambda \phi \bar{\pi} (\varphi_\pi - 1) + \lambda \phi \varphi_y \bar{y} \end{pmatrix}
+ \Omega \begin{pmatrix} 1 & \phi (1 - \varphi_\pi \rho) \\ \lambda & \lambda \phi + \rho + \rho \phi \varphi_y \end{pmatrix}
\begin{pmatrix} y^e_{t+1} \\ \pi^e_{t+1} \end{pmatrix}
+ \Omega \begin{pmatrix} 1 & -\phi \varphi_\pi \\ \lambda & 1 + \phi \varphi_y \end{pmatrix}
\begin{pmatrix} g_t \\ u_t \end{pmatrix},
$$

where $\Omega \equiv 1 / (1 + \lambda \phi \varphi_\pi + \phi \varphi_y)$ and $x^e_{t+1} \equiv \bar{E}_t x_{t+1}$ denotes the average expectation about a generic variable $x$. The system above describes the law of motion of the output gap and inflation as a function of agents' average expectations on output gap and inflation (the above matrix equation is identical to Eq. (4) in the main text).

16 These terms also drop out if one assumes instead that agents' expectations of the average future consumption across all agents and of the average price set by all firms equal their expectations of their own future consumption and of the price set by the firm managed by them, respectively.

17 More precisely, for average inflation rates of order $\nu$ (where $\nu$ is an expansion parameter characterizing monetary policy such that the average inflation rate is zero for policies with $\nu = 0$), the error in the characterization of the dynamics of aggregate variables is of order $O(\nu, \xi^2)$, where $\xi$ is a bound on the size of the disturbances in the model (see Woodford, 2003, for details).
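The law of motion can be sketched numerically as follows. This is an illustration under stated assumptions: the aggregate demand relation is taken to have the standard form in which the output gap depends on its expected value, the expected real interest rate, and a shock; the values of `lam`, `rho`, `sigma_inv`, and `pi_target` are placeholders and not the calibration of the paper; and the function names are hypothetical. The two policy settings, φ π = 1.5 with φ y = 0 or φ y = 0.5, are the treatment values stated in the text.

```python
def steady_state_output_gap(pi_target=3.5, lam=0.3, rho=0.99):
    """Output gap consistent with the inflation target: (1 - rho) * target / lam."""
    return (1.0 - rho) * pi_target / lam


def law_of_motion(y_exp, pi_exp, g=0.0, u=0.0, phi_pi=1.5, phi_y=0.0,
                  lam=0.3, rho=0.99, sigma_inv=1.0, pi_target=3.5):
    """Output gap and inflation given average expectations (zero lower bound ignored).

    y_exp, pi_exp: average expectations of next period's output gap and inflation.
    g, u: shocks to the demand and supply relations.
    """
    y_bar = steady_state_output_gap(pi_target, lam, rho)
    omega = 1.0 / (1.0 + lam * sigma_inv * phi_pi + sigma_inv * phi_y)
    const = sigma_inv * pi_target * (phi_pi - 1.0) + sigma_inv * phi_y * y_bar
    y = omega * (const + y_exp + sigma_inv * (1.0 - phi_pi * rho) * pi_exp
                 + g - sigma_inv * phi_pi * u)
    pi = omega * (lam * const + lam * y_exp
                  + (lam * sigma_inv + rho + rho * sigma_inv * phi_y) * pi_exp
                  + lam * g + (1.0 + sigma_inv * phi_y) * u)
    return y, pi


# Expectations at the steady state and no shocks reproduce the steady state.
y_bar_example = steady_state_output_gap()
y_t1, pi_t1 = law_of_motion(y_bar_example, 3.5, phi_y=0.0)
y_t2, pi_t2 = law_of_motion(y_bar_example, 3.5, phi_y=0.5)

# The same unit demand shock moves inflation less when the rule reacts to output.
_, pi_shock_t1 = law_of_motion(y_bar_example, 3.5, g=1.0, phi_y=0.0)
_, pi_shock_t2 = law_of_motion(y_bar_example, 3.5, g=1.0, phi_y=0.5)
```

Holding expectations fixed, a reaction to the output gap dampens the pass-through of a demand shock to inflation; how expectations themselves respond is what distinguishes the behavioral model from the rational one.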

Supplementary material
Supplementary material associated with this article can be found, in the online version, at 10.1016/j.euroecorev.2019.05.009.

Online Appendix for "Monetary Policy under Behavioral Expectations: Theory and Experiment"

Cars Hommes Domenico Massaro Matthias Weber
This online appendix contains material in addition to the manuscript and to the printed appendix. As the online appendix follows the printed appendix (Appendix A), it begins with Appendix B. Similarly, Figure numbers build on the numbers in the article and the printed appendix, therefore starting with Figure 13 in this online appendix. Figure 13 shows inflation volatility as a function of the output gap reaction coefficient φ y for the model assuming rational expectations, similarly to Figure 1a. The graph now shows multiple coefficients of φ π simultaneously (from top to bottom the lines correspond to φ π -values of 1.4, 1.5, 1.6, and 1.7). Figure 14 shows the same graph for the behavioral model (again the lines correspond to φ π -values of 1.4, 1.5, 1.6, and 1.7, from top to bottom).

B.2 Results with Different Behavioral Models of Expectation Formation
The results are robust to variations of the parameters of the behavioral model of expectation formation we employ. Furthermore, the results are qualitatively the same for a wide variety of other behavioral mechanisms. We show two examples here. Figure 19 shows inflation, output gap, and interest volatility as a function of the output gap reaction coefficient. Expectations are not formed according to the main heuristic switching model described in Section 2.2, but according to two simpler models of behavioral expectation formation. On the left side of this figure, it is assumed that agents use a heuristic switching model similar to the one described before but including only two very simple heuristics, naive expectations which always forecast the last observation and a trend-following rule with trend-following coefficient one. On the right side, the graphs show the results from naive expectations alone (thus without any switching).
Here as well, the results look similar to the ones in Figures 1 and 2.

B.3 Results with Different Starting Values of Output Gap and Inflation Forecasts
Figures 20 and 21 show graphs similar to Figure 1b for different combinations of starting values of inflation and output gap (i.e. inflation and output gap are set to these starting values in the first two periods). In all cases the U-shape arises similarly to Figure 1b.

B.4 Results with an Interest Rate Smoothing Taylor Rule
One can also modify the model to include interest rate smoothing by the central bank.
Including interest rate smoothing in the Taylor rule leads to aggregate New Keynesian equations of the form below, where everything is equal to our main model, except for the monetary policy rule that includes an interest rate smoothing parameter µ, with 0 < µ < 1. Note that if the central bank places too much weight on the past interest rate, it loses its ability to steer the economy, so that µ should not be too large (in the extreme case of µ = 1, the interest rate is just a constant without any reaction to economic activity).

Figure 19: Inflation, output gap, and interest rate volatility for a simple HSM of expectation formation (with switching between naive expectations and trend-following) and for naive expectations. Notes: This figure shows the effect of parameter φ y on inflation, output gap, and interest rate volatility for alternative models of expectation formation (φ π = 1.5 throughout).

C Appendix: Discussion of the Measurement of Volatility
In general, different simple measures of price instability (i.e., of volatility, dispersion, or distance from the target) are possible. We mainly discuss two of them here. The first one is the measure that we use, $v(\pi) = \frac{1}{T} \sum_{t=2}^{T} (\pi_t - \pi_{t-1})^2$ (equivalently, one could of course take $v_1(\pi) = \frac{1}{T-1} \sum_{t=2}^{T} (\pi_t - \pi_{t-1})^2$ or even $v_2(\pi) = \sum_{t=2}^{T} (\pi_t - \pi_{t-1})^2$ if the number of periods is fixed). The second one is the mean squared deviation from the target, $msd(\pi) = \frac{1}{T} \sum_{t=1}^{T} (\pi_t - \bar{\pi})^2$, where $\bar{\pi}$ is the inflation target. Other alternatives that one could use are the absolute deviation, $ad(\pi) = \frac{1}{T} \sum_{t=2}^{T} |\pi_t - \pi_{t-1}|$, and the standard deviation, $sd(\pi) = \sqrt{\frac{1}{T} \sum_{t=2}^{T} (\pi_t - \pi^{av})^2}$, where $\pi^{av}$ is the average of inflation in a group taken over the whole time period. We do not discuss these measures here in detail; in general, ad(·) shares many features with v(·), and sd(·) shares many features with msd(·).
The measures v(·) and msd(·) differ in the following ways. The mean squared deviation from the target takes into account only the distance to the target, not whether this distance is positive or negative. In the corresponding figure, the solid red line and the dashed blue line have exactly the same distance from the target in each period. However, it seems clear that the red line is much more volatile than the blue line, which converges slowly but nicely to the target. Any policy maker would prefer inflation as shown by the blue line over inflation as shown by the red line. However, msd(·) does not differentiate between these lines (while v(·) does).
There are other examples one can use to illustrate the differences between the measures. Imagine, for example, inflation staying constant for the first half of a time span at one percentage point below the target and then changing once and staying constant at one percentage point above the target. msd(·) does not distinguish between this very stable series and a series which randomly jumps back and forth between one percentage point below and one above the target (being at either value half of the time). v(·) distinguishes between these time series. v(·) is also not a perfect measure, however. For example, if one were to compare inflation represented by two horizontal lines, one close to the target and the other relatively far from it, v(·) does not distinguish between these lines, while msd(·) does.
From a practitioner's or policy maker's point of view, which measure to use can thus depend on what kind of dynamics are present. For example, if there are many inflation time series which are relatively constant on one side of the target, some of them close to and some far from the target, msd(·) looks like a better measure. If one sees both erratic behavior or oscillations partly below and partly above the target and slow convergence, v(·) is the better measure. The latter case is exactly what we observe in the experiment. Inflation mainly oscillates around the target, with mean values close to the target and some observations converging gradually to the target. From this point of view v(·) is clearly to be preferred.
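The comparisons described in this appendix can be checked numerically. The following is a minimal sketch, not the code used for the paper's analysis. The function names, the target value of 2, the series length of 10, and the use of a deterministic alternation in place of random jumps are all assumptions made for illustration. The sums run from t = 2 and are divided by T, following the formulas for v, ad, and msd given in the paper.

```python
def v(series):
    """Mean squared period-to-period change: (1/T) * sum_{t=2..T} (x_t - x_{t-1})^2."""
    T = len(series)
    return sum((series[t] - series[t - 1]) ** 2 for t in range(1, T)) / T


def ad(series):
    """Mean absolute period-to-period change: (1/T) * sum_{t=2..T} |x_t - x_{t-1}|."""
    T = len(series)
    return sum(abs(series[t] - series[t - 1]) for t in range(1, T)) / T


def msd(series, target):
    """Mean squared deviation from target: (1/T) * sum_{t=2..T} (x_t - target)^2."""
    T = len(series)
    return sum((x - target) ** 2 for x in series[1:]) / T


TARGET = 2.0  # hypothetical inflation target, chosen only for this illustration

# One percentage point below the target for the first half, then one above.
stable = [1.0] * 5 + [3.0] * 5
# Alternates between one point below and one point above the target
# (a deterministic stand-in for the randomly jumping series in the text).
jumpy = [1.0, 3.0] * 5
# Two constant series, one close to the target and one far from it.
flat_near = [2.1] * 10
flat_far = [4.0] * 10

for name, s in [("stable", stable), ("jumpy", jumpy),
                ("flat_near", flat_near), ("flat_far", flat_far)]:
    print(f"{name:10s} v={v(s):.3f} ad={ad(s):.3f} msd={msd(s, TARGET):.3f}")
```

In this illustration msd equals 0.9 for both the stable and the alternating series, while v is 0.4 for the former and 3.6 for the latter. For the two constant series v is zero in both cases, and only msd separates them.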

D Appendix: Instructions in the Experiment
Subjects in the experiment received the following instructions (as subjects only received qualitative information on the model governing the experimental economy, the instructions are the same for both treatments):

Instructions
Welcome to this experiment! The experiment is anonymous, the data from your choices will only be linked to your station ID, not to your name. You will be paid privately at the end, after all participants have finished the experiment. After the main part of the experiment and before the payment you will be asked to fill out a short questionnaire. On your desk you will find a calculator and scratch paper, which you can use during the experiment.
During the experiment you are not allowed to use your mobile phone. You are also not allowed to communicate with other participants. If you have a question at any time, please raise your hand and someone will come to your desk.

General information and experimental economy
All participants will be randomly divided into groups of six people. The group composition will not change during the experiment. You and all other participants will take the roles of statistical research bureaus making predictions of inflation and the so-called "output gap". The experiment consists of 50 periods in total. In each period you will be asked to predict inflation and output gap for the next period. The economy you are participating in is described by three variables: inflation $\pi_t$, output gap $y_t$ and interest rate $i_t$. The subscript $t$ indicates the period the experiment is in. In total there are 50 periods, so $t$ increases during the experiment from 1 to 50.

Inflation
Inflation measures the percentage change in the price level of the economy. In each period, inflation depends on inflation predictions of the statistical research bureaus in the economy (a group of six participants in this experiment), on actual output gap and on a random term. There is a positive relation between the actual inflation and both inflation predictions and actual output gap. This means for example that if the inflation predictions of the research bureaus increase, then actual inflation will also increase (everything else equal). In economies similar to this one, inflation has historically been between −5% and 10%.

Output gap
The output gap measures the percentage difference between the Gross Domestic Product (GDP) and the natural GDP. The GDP is the value of all goods produced during a period in the economy. The natural GDP is the value the total production would have if prices in the economy were fully flexible. If the output gap is positive (negative), the economy therefore produces more (less) than the natural GDP. In each period the output gap depends on inflation predictions and output gap predictions of the statistical bureaus, on the interest rate and on a random term. There is a positive relation between the output gap and inflation predictions and also between the output gap and output gap predictions. There is a negative relation between the output gap and the interest rate. In economies similar to this one, the output gap has historically been between −5% and 5%.

Interest Rate
The interest rate measures the price of borrowing money and is determined by the central bank. If the central bank wants to increase inflation or output gap it decreases the interest rate, if it wants to decrease inflation or output gap it increases the interest rate.

Prediction task
Your task in each period of the experiment is to predict inflation and output gap in the next period. When the experiment starts, you have to predict inflation and output gap for the first two periods, i.e. $\pi^e_1$ and $\pi^e_2$, and $y^e_1$ and $y^e_2$. The superscript $e$ indicates that these are predictions. When all participants have made their predictions for the first two periods, the actual inflation ($\pi_1$), the actual output gap ($y_1$) and the interest rate ($i_1$) for period 1 are announced. Then period 2 of the experiment begins. In period 2 you make inflation and output gap predictions for period 3 ($\pi^e_3$ and $y^e_3$). When all participants have made their predictions for period 3, inflation ($\pi_2$), output gap ($y_2$), and interest rate ($i_2$) for period 2 are announced. This process repeats itself for 50 periods.
Thus, in a certain period t when you make predictions of inflation and output gap in period t + 1, the following information is available to you:
• Values of actual inflation, output gap and interest rate up to period t − 1;
• Your predictions up to period t;
• Your prediction scores up to period t − 1.

Payments
Your payment will depend on the accuracy of your predictions. You will be paid either for predicting inflation or for predicting the output gap. The accuracy of your predictions is measured by the absolute distance between your prediction and the actual values (this distance is the prediction error). For each period the prediction error is calculated as soon as the actual values are known; you subsequently get a prediction score that decreases as the prediction error increases. The table below gives the relation between the prediction error and the prediction score. The prediction error is calculated in the same way for inflation and output gap.

Prediction error   0     1     2       3     4     9
Score              100   50    33.33   25    20    10

Example: If (for a certain period) you predict an inflation of 2%, and the actual inflation turns out to be 3%, then you make an absolute error of 3% − 2% = 1%. Therefore you get a prediction score of 50. If you predict an inflation of 1%, and the actual inflation turns out to be negative 2% (i.e. −2%), you make a prediction error of 1% − (−2%) = 3%. Then you get a prediction score of 25. For a perfect prediction, with a prediction error of zero, you get a prediction score of 100. The figure below shows the relation between your prediction score (vertical axis) and your prediction error (horizontal axis). Points in the graph correspond to the prediction scores in the previous table.
[Figure 25 appears here in the experimental instructions.]
At the end of the experiment, you will have two total scores, one for inflation predictions and one for output gap predictions. These total scores simply consist of the sum of all prediction scores you got during the experiment, separately for inflation and output gap predictions. When the experiment has ended, one of the two total scores will be randomly selected for payment.
Your final payment will consist of 0.75 euro for each 100 points in the selected total score (200 points therefore equals 1.50 euro). This will be the only payment from this experiment, i.e. you will not receive a show-up fee on top of it.
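
As an aside for readers of this appendix (not part of the instructions shown to subjects), the scoring and payment rules can be sketched in a few lines. The functional form 100 / (1 + error) is an assumption inferred from the tabulated values, which it reproduces exactly. The instructions themselves give only the table and the figure, and the function names are hypothetical.

```python
def prediction_score(prediction, actual):
    """Score for one prediction; assumes score = 100 / (1 + |error|),
    a form inferred from the table in the instructions."""
    error = abs(prediction - actual)
    return 100.0 / (1.0 + error)


def payment_eur(total_score):
    """Final payment: 0.75 euro per 100 points of the selected total score."""
    return 0.75 * total_score / 100.0


# Reproduce the table from the instructions (errors in percentage points).
for err in [0, 1, 2, 3, 4, 9]:
    print(err, round(prediction_score(0.0, err), 2))

# The two worked examples from the instructions.
print(prediction_score(2.0, 3.0))   # predicted 2%, actual 3%
print(prediction_score(1.0, -2.0))  # predicted 1%, actual -2%

# 200 points in the selected total score.
print(payment_eur(200.0))
```

Under this assumed form, a subject who predicted perfectly in all 50 periods would collect 5000 points in the selected total score, which corresponds to 37.50 euro.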

Computer interface
The computer interface will be mainly self-explanatory. The top right part of the screen will show you all of the information available up to the period that you are in (in period t, i.e. when you are asked to make your prediction for period t + 1, this will be actual inflation, output gap, and interest rate until period t − 1, your predictions until period t, and the prediction scores arising from your predictions until period t − 1 for both inflation (I) and output gap (O)). The top left part of the screen will show you the information on inflation and output gap in graphs. The axis of a graph shows values in percentage points (i.e. 3 corresponds to 3%). Note that the values on the vertical axes may change during the experiment and that they are different between the two graphs; the values will be such that it is comfortable for you to read the graphs.
In the bottom left part of the screen you will be asked to enter your predictions. When submitting your prediction, use a decimal point if necessary (not a comma). For example, if you want to submit a prediction of 2.5% type "2.5"; for a prediction of −1.75% type "−1.75". The sum of the prediction scores over the different periods are shown in the bottom right of the screen, separately for your inflation and output gap predictions.
At the bottom of the screen there is a status bar telling you when you can enter your predictions and when you have to wait for other participants.

Figure 34 shows the two groups (from T2) that have been excluded from the analysis as explained in Footnote 8. Figure 35 shows a screenshot (a larger version of the screenshot already used in Figure 4).

Figure 36 shows the empirical cumulative distribution functions of price instability when employing different measures. The first graph shows the volatility measure based on the absolute deviation, $ad(\pi) = \frac{1}{T} \sum_{t=2}^{T} |\pi_t - \pi_{t-1}|$. The second graph shows the mean squared deviation from the target, $msd(\pi) = \frac{1}{T} \sum_{t=2}^{T} (\pi_t - \bar{\pi})^2$, with $\bar{\pi}$ denoting the inflation target, and the third graph shows the standard deviation, $sd(\pi) = \frac{1}{T} \sum_{t=2}^{T} (\pi_t - \pi_{av})^2$. The average values for ad are 0.304 in T1 and 0.188 in T2. For msd, the values are 0.402 in T1 and 0.317 in T2, and for sd 0.510 in T1 and 0.419 in T2.

[Figure with ECDF panels for T1 and T2 appears here; recoverable panel label: ECDF (RAD).]
Notes: This graph shows the ECDFs for three different measures. From top to bottom: The volatility measure based on the absolute deviation, the mean squared deviation from target, and the standard deviation. For each value on the horizontal axis, the fraction of observations with the respective measure less than or equal to this value (i.e. the ECDF) is shown on the vertical axis, separately for T1 and T2.