Comparing Predictive Accuracy in Small Samples Using Fixed-Smoothing Asymptotics

We consider fixed-smoothing asymptotics for the Diebold and Mariano (1995) test of predictive accuracy. We show that this approach delivers predictive accuracy tests that are correctly sized even when only a small number of out of sample observations are available. We apply the fixed-smoothing asymptotics to the Diebold and Mariano (1995) test to evaluate the predictive accuracy of the Survey of Professional Forecasters (SPF) and the ECB Survey of Professional Forecasters (ECB SPF) against a simple random walk. Our results show that the predictive abilities of the SPF and the ECB SPF were partially spurious.

Good forecasts are key to good decision making, and the ability to compare predictive accuracy is key to discriminating between good and bad forecasts. To this end, one of the most widely used tests to compare the predictive accuracy of two competing forecasts is the Diebold and Mariano (1995) [DM] test.
The DM test is based on a loss function associated with the forecast errors of each forecast, testing the null of zero expected loss differential of two competing forecasts.
This framework allows one to test for equal predictive accuracy using any loss function, and the test statistic is valid for contemporaneously correlated, serially correlated and non-normal forecast errors. The DM approach takes forecast errors as model-free, and the test remains valid when the forecasts are produced by unknown models, as is the case, for example, with survey forecast data.
When the forecasts are produced by estimated models, nested or non-nested, it is in general necessary to account for the impact of model parameter estimation uncertainty on the distribution of the forecast accuracy test, see West (1996) and Clark and McCracken (2001). In this case, the limiting distribution of the test statistic depends on the specific modelling assumptions made for obtaining the forecast errors, see West (2006) and Clark and McCracken (2013). West (1996) shows that in some cases the DM approach is asymptotically valid even when forecasts are obtained from estimated models. This happens when the number of in sample observations is large relative to the number of out of sample observations or when a quadratic loss function is used to evaluate the accuracy of non-nested models estimated by ordinary least squares. However, in practice it is not uncommon to compare forecasts produced by models for which it is not tractable to account for the model parameter estimation uncertainty. In addition, if the objective is to compare forecasting methods as opposed to forecasting models, then Giacomini and White (2006) show that in an environment with asymptotically nonvanishing estimation uncertainty the DM test can still be applied. For these reasons, the DM test is still widely applied even when forecasts are obtained from estimated models, see Diebold (2015). One reason for the success of the DM test is that the test statistic is simple to compute and asymptotically normally distributed. However, as also noted by DM, the test can be subject to large size distortions in small samples, which can be spuriously interpreted as superior predictive ability for one forecast. This is because in the test statistic the long run variance is replaced by a consistent estimate, while standard limit normality is still employed: this may be unsatisfactory when only a small number of out of sample observations are available.
As remarked by Clark and McCracken (2013), "one unresolved challenge in forecast test inference is achieving accurately sized tests applied at multi-step horizons - a challenge that increases as the forecast horizon grows and the size of the forecast sample declines".
In this paper, we consider two alternative asymptotics for testing assumptions about the expected loss differential of two competing forecasts. The first is the fixed-b approach of Kiefer and Vogelsang (2005), in which the limit properties of the weighted autocovariances estimate of the long run variance are derived assuming that the bandwidth to sample size ratio is constant. With this approach, the test to compare predictive accuracy has a non-standard limit distribution that depends on the bandwidth to sample ratio b and on the kernel used to estimate the long run variance. The second alternative asymptotics that we consider is the fixed-m approach, as in Sun (2013), Hualde and Iacone (2015) and Hualde and Iacone (2017). In this case, the estimate of the long run variance is based on a weighted periodogram estimate with Daniell kernel and a truncation parameter m that is assumed to be constant as the sample size increases. The test to compare predictive accuracy has a t distribution with degrees of freedom that depend on the truncation parameter. This averaged periodogram estimate can be seen as one application of the orthonormal series variance estimate, see Phillips (2005).
Following Sun (2014a) and Sun (2014b), we refer to these two alternative asymptotics, fixed-b and fixed-m, as "fixed-smoothing asymptotics". With this type of asymptotics, the assumption on the bandwidth parameter implies that the estimate of the long run variance is not consistent. However, inference is more accurate than with standard asymptotics based on consistent HAC estimation, and this approach is therefore often referred to as "Heteroskedasticity Autocorrelation Robust" (HAR).
We perform a Monte Carlo analysis and find that: i) fixed-smoothing asymptotics delivers correctly sized predictive accuracy tests for correlated loss differentials even in small samples; ii) the power of the tests with fixed-smoothing asymptotics is comparable to the power of bootstrap tests. We also apply fixed-smoothing asymptotics to the case of comparing forecasts obtained from models. Using the unconditional predictive ability test of Giacomini and White (2006), we perform a novel Monte Carlo exercise extending the design in Giacomini and Rossi (2010) to allow for different degrees of autocorrelation in the loss differential. Our results indicate that, in the presence of serially correlated errors, standard asymptotics causes substantial size distortions. Fixed-smoothing asymptotics, when coupled with an appropriate bandwidth choice, largely resolves this problem.
To illustrate the usefulness of fixed-smoothing asymptotics for equal predictive accuracy tests, we evaluate the predictive accuracy of the Survey of Professional Forecasters (SPF) and the ECB Survey of Professional Forecasters (ECB SPF) against a naive random walk. As for the SPF, we evaluate forecasts for four core macroeconomic indicators (output growth, inflation, the unemployment rate and the three-month Treasury bill rate) for the period from 1985:Q1 until 2014:Q4. Results show that part of the superior predictive accuracy indicated by the DM test is spurious, especially in the most recent subsample. As for the ECB SPF, we evaluate forecasts for year-on-year Euro area GDP growth and year-on-year Euro area HICP inflation for the period from 2006:Q1 to 2016:Q4. With such a small sample size, standard tests of equal predictive ability suffer from large size distortions and provide partially spurious results that are not confirmed when using fixed-smoothing asymptotics.
For high frequency, large sample forecast evaluations, Patton (2015) and Li and Patton (2013) show that fixed-b asymptotics delivers considerable size improvements. For small samples, Harvey, Leybourne and Newbold (1997) propose a modified statistic and critical value: while this is only justified when the loss differential is an independent process, they find that their modified DM test alleviates the size distortion of the original test, even in the presence of weak autocorrelation. The modifications of the DM test based on fixed-smoothing asymptotics that we propose are formally based on asymptotic theory even when the loss differential is a dependent process. Harvey, Leybourne and Whitehouse (2017) perform an extensive Monte Carlo simulation exercise to examine the small sample size and power properties of different approaches. Their results confirm that the fixed-m approach proposed in this paper outperforms standard approaches in small samples. Our Monte Carlo exercise, however, is richer than the one in Harvey, Leybourne and Whitehouse (2017), as they do not consider fixed-b asymptotics. In addition, our contribution lies in focussing on the choices of both the estimate of the long run variance and of the bandwidth, and also in clarifying why fixed-m asymptotics delivers correctly sized tests. This paper is organized as follows. In Section 2, we introduce the test for equal predictive accuracy and, in Section 3, we describe the DM estimate. In Section 4, we detail the tests for equal predictive accuracy using fixed-b asymptotics and fixed-m asymptotics. In Section 5, we present a Monte Carlo study, including a Monte Carlo comparison with the bootstrap. In Section 6, we apply fixed-smoothing asymptotics to the unconditional predictive ability test of Giacomini and White (2006), including a Monte Carlo study. In Section 7, we discuss the empirical applications and, in Section 8, we conclude.

Comparing predictive accuracy
We consider the time series y_1, ..., y_T, for which we want to compare two h-step ahead forecasts y^h_{1,t} and y^h_{2,t} made at time t − h, with forecast errors e^h_{1,t} = y_t − y^h_{1,t} and e^h_{2,t} = y_t − y^h_{2,t}, respectively. We denote by L(e^h_{i,t}), for i = 1, 2, the loss associated with the forecast error e^h_{i,t}; for example, a quadratic loss would be L(e^h_{i,t}) = (e^h_{i,t})². The time-t loss differential between the two forecasts is

d_t = L(e^h_{1,t}) − L(e^h_{2,t}),

and it can be represented as

d_t = μ + u_t,

where u_t has E(u_t) = 0 and is a weakly dependent process, with autocovariances γ_j = E(u_t u_{t+j}) and long run variance σ² = Σ_{j=−∞}^{∞} γ_j, with 0 < σ² < ∞. DM propose to test the hypothesis of equal predictive ability as H_0 : {μ = 0}. Let d̄ = T^{−1} Σ_{t=1}^{T} d_t denote the sample mean of the loss differential. Under regularity conditions, it holds that

√T (d̄ − μ)/σ →_d N(0, 1).

Unfortunately, this statistic is unfeasible to test H_0, because σ² is unknown. However, the parameter σ² can be replaced with an appropriate estimate and, if a consistent estimate is used, then the limit normality is not affected by the replacement.

A typical estimate for the long run variance is the Weighted AutoCovariances Estimate (WCE),

σ̂² = Σ_{|j|≤M} k(j/M) γ̂_j,   γ̂_j = T^{−1} Σ_{t=1}^{T−|j|} (d_t − d̄)(d_{t+|j|} − d̄).   (2)

The parameter M is a bandwidth parameter (or a truncation lag), and for consistency of σ̂² regularity conditions include M → ∞ and M/T → 0 as T → ∞. We refer to Hannan (1970) for a survey of these estimates, and for a discussion of which kernels k ensure that σ̂² ≥ 0. In a variation of this approach, DM note that if y^h_{1t} is an optimal forecast h steps ahead, then e^h_{1t} is at most an MA(h − 1), and then propose to set M = h − 1 and k(j/M) = 1 if |j/M| ≤ 1 and 0 otherwise, so

σ̂²_DM = Σ_{|j|≤h−1} γ̂_j.   (3)

This does not meet the condition M → ∞, but the estimate is nevertheless consistent, because it exploits the assumption that u_t is MA(h − 1), thus ensuring σ̂²_DM →_p σ². The choice of σ̂²_DM may be very appealing, as it exploits information about the structure of u_t. However, the rectangular kernel used in (3) may generate negative estimates σ̂²_DM, which is undesirable. Moreover, the Monte Carlo exercise in DM suggests the possibility of large size distortions in small samples, which would be spuriously interpreted as evidence of superior predictive power for one forecast rule. DM mention the possibility of using alternative kernels and standard asymptotics, to avoid the risk of negative estimates of σ², but simulations in Clark (1999), in which a Bartlett kernel was used, do not suggest that simply replacing the kernel results in a definite improvement of the size distortion.
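For concreteness, the WCE and the DM truncated estimate can be computed in a few lines. The following is a minimal Python sketch (the function names are ours, not from the paper):

```python
import numpy as np

def autocov(d, j):
    """Sample autocovariance of order j of the loss differential d."""
    T = len(d)
    db = d - d.mean()
    return np.sum(db[:T - j] * db[j:]) / T

def wce(d, M, kernel):
    """Weighted autocovariances estimate (WCE) of the long run variance:
    sum over |j| <= M of k(j/M) * gamma_hat_j."""
    g = np.array([autocov(d, j) for j in range(M + 1)])
    w = np.array([kernel(j / M) for j in range(M + 1)])
    return g[0] + 2.0 * np.sum(w[1:] * g[1:])   # gamma_{-j} = gamma_j

def wce_dm(d, h):
    """DM estimate (3): rectangular kernel truncated at M = h - 1."""
    if h == 1:
        return autocov(d, 0)
    return wce(d, h - 1, lambda x: 1.0)

def wce_bartlett(d, M):
    """WCE with the Bartlett kernel, which guarantees a nonnegative estimate."""
    return wce(d, M, lambda x: 1.0 - x)
```

The DM statistic is then simply `np.sqrt(T) * d.mean() / np.sqrt(wce_dm(d, h))`.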
Fixed-smoothing asymptotics

Fixed-b asymptotics

Following the approach of Kiefer and Vogelsang (2005), we consider alternative asymptotics for the estimate (2): for given M, the ratio M/T is taken as fixed as T → ∞.
As M/T is fixed, letting b = M/T, this alternative approach is referred to as fixed-b asymptotics. With this assumption, Kiefer and Vogelsang (2005) show that the estimate of σ² is not consistent and that the standardized sample mean has a non-standard limit distribution that depends on b and on the kernel. Kiefer and Vogelsang (2005) provide a formula to generate quantiles of the limit distribution, which can be used as critical values in tests.
For fixed-b asymptotics and assuming that the Bartlett kernel is used, we introduce the notation

σ̂²_BART = Σ_{|j|≤M} (1 − |j|/M) γ̂_j,   (5)

t_BART = √T d̄ / σ̂_BART.   (6)

Kiefer and Vogelsang (2005) show that if b ∈ (0, 1], then

t_BART ⇒ Φ_BART(b),   (7)

where ⇒ denotes weak convergence in the D[0, 1] space with the Skorohod topology. They characterise the limit distribution Φ_BART(b) and provide formulas to compute quantiles. For the Bartlett kernel with b ≤ 1, these can be obtained using the formula

cv(b) = α_0 + α_1 b + α_2 b² + α_3 b³,

where

α_0 = 1.6449, α_1 = 2.1859, α_2 = 0.3142, α_3 = −0.3427 for the 0.950 quantile,
α_0 = 1.9600, α_1 = 2.9694, α_2 = 0.4160, α_3 = −0.5324 for the 0.975 quantile.

The results of Kiefer and Vogelsang (2005) provide asymptotics that may be valid for any M, even M = T, but notice that Kiefer and Vogelsang (2005) do not automatically recommend using M = bT: rather, they provide alternative asymptotics for a user-chosen bandwidth. So, for example, assuming T = 128 and M = T^{1/3} ≈ 5, then b = 5/128 = 0.039063 and the 5% critical value for a two-sided test is 2.0766 instead of 1.96.
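The cubic approximation to the fixed-b critical values is straightforward to evaluate; the following sketch (our own helper, using the coefficients above) reproduces the numerical example in the text:

```python
def fixed_b_cv(b, quantile=0.975):
    """Fixed-b critical value for the Bartlett kernel via
    cv(b) = a0 + a1*b + a2*b^2 + a3*b^3, for b in (0, 1]."""
    coefs = {
        0.950: (1.6449, 2.1859, 0.3142, -0.3427),
        0.975: (1.9600, 2.9694, 0.4160, -0.5324),
    }
    a0, a1, a2, a3 = coefs[quantile]
    return a0 + a1 * b + a2 * b ** 2 + a3 * b ** 3

# Example from the text: T = 128, M = 5, so b = 5/128
cv = fixed_b_cv(5 / 128)   # ~ 2.0766 for a two-sided 5% test, instead of 1.96
```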
When testing assumptions about the sample mean, Kiefer and Vogelsang (2005) show in Monte Carlo simulations that the fixed-b asymptotics yields a remarkable improvement in size. However, while the empirical size improves (it gets closer to the theoretical size) as b is closer to 1, the power of the test worsens, implying that there is a size-power trade-off. These results are also confirmed analytically by Sun, Phillips and Jin (2008), who prove that the fixed-b limit distribution provides a higher-order correction.

Fixed-m asymptotics
We now consider an alternative estimate of the long run variance, a Weighted Periodogram Estimate (WPE). Letting λ_j = 2πj/T, for j = 0, ±1, ..., ±⌊T/2⌋, denote the Fourier frequencies, and

I(λ) = (2πT)^{−1} |Σ_{t=1}^{T} d_t e^{−iλt}|²

the periodogram of d_t, we consider estimates of the form

σ̂² = Σ_j K_M(λ_j) I(λ_j),

where K_M(λ_j) is a kernel function that is symmetric and M is a bandwidth parameter.
Notice that, as d_t = μ + u_t and, for j ≠ 0, Σ_{t=1}^{T} e^{−iλ_j t} = 0, I(λ_j) is also the periodogram of u_t at these frequencies. Kernels k(j/M) in (2) and K_M(λ_j) are closely related: the WCE in (2) is approximately a weighted average of periodogram ordinates, σ̂² ≈ Σ_j K_M(λ_j) I*(λ_j) with K_M(λ) = (2π/T) Σ_{|j|≤M} k(j/M) cos(jλ), where I*(λ) is the periodogram of d_t − d̄. Weighted covariance estimation and weighted periodogram estimation are therefore very similar, and this suggests for the WPE an alternative theory analogous to fixed-b for the WCE.
The WPE of the long run variance using the Daniell kernel is

σ̂²_DAN = m^{−1} Σ_{j=1}^{m} 2π I(λ_j),   (10)

where m is a function of the bandwidth M (and, with slight abuse of notation, it is usually referred to as the bandwidth itself). Regularity conditions, including m → ∞, ensure that σ̂²_DAN is a consistent estimate of σ²; for fixed m this is no longer the case, but using results from Hannan (1970), it is possible to show that

√T (d̄ − μ) / σ̂_DAN →_d t_{2m},   (11)

where t_{2m} denotes a Student's t distribution with 2m degrees of freedom. This result was anticipated in Sun (2013) and Müller (2014). 1 Monte Carlo simulations in Hualde and Iacone (2015) show that fixed-m asymptotics has the same size-power trade-off documented for fixed-b asymptotics: the smaller the value for m, the better the empirical size, but also the weaker the power.
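A minimal Python sketch of the Daniell WPE and the resulting fixed-m test (function names are ours; the critical value uses the Student-t limit with 2m degrees of freedom discussed above):

```python
import numpy as np
from scipy import stats

def wpe_daniell(d, m):
    """Daniell WPE: average of 2*pi*I(lambda_j) over the first m
    Fourier frequencies, where I is the periodogram of d."""
    T = len(d)
    dft = np.fft.fft(d)                       # dft[j] = sum_t d_t exp(-i*lambda_j*t)
    I = np.abs(dft) ** 2 / (2.0 * np.pi * T)  # periodogram at lambda_j = 2*pi*j/T
    return np.mean(2.0 * np.pi * I[1:m + 1])  # j = 1, ..., m (mean term drops out)

def dm_fixed_m(d, m, alpha=0.05):
    """DM test with fixed-m critical values: compare the t-statistic
    with the quantiles of a Student-t with 2m degrees of freedom."""
    T = len(d)
    t_stat = np.sqrt(T) * d.mean() / np.sqrt(wpe_daniell(d, m))
    cv = stats.t.ppf(1.0 - alpha / 2.0, df=2 * m)
    return t_stat, cv, abs(t_stat) > cv
```

Note that the t critical value exceeds 1.96 for any finite m, which is exactly how the fixed-m approach guards against over-rejection in small samples.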

A Monte Carlo study of the test for predictive accuracy under fixed-smoothing asymptotics
In this section we analyse the size and power properties of the proposed tests of equal predictive accuracy in small samples, both under the null of equal predictive accuracy and under the alternative of superior predictive accuracy of one forecast.

Size analysis
We simulate forecast errors as in DM and Clark (1999). In particular, we first simulate a vector of forecast innovations from a bivariate standard normal, (v 1t , v 2t ) ∼ N (0 2 , I 2 ).
We then introduce contemporaneous correlation by taking

u_{1t} = v_{1t},   u_{2t} = ρ v_{1t} + (1 − ρ²)^{1/2} v_{2t},

and serial correlation by taking

e_{it} = k Σ_{j=0}^{q} θ^j u_{i,t−j},   i = 1, 2,

where k = 1, ρ = 0.5 and θ = 0.75. We provide some details about the derivation of (11) in Appendix A. DM, Clark (1999) and Harvey, Leybourne and Whitehouse (2017) fix q to 1; in our case, instead, q is set to range between 1 and 5. With this design, as q increases, the processes e_{1t} and e_{2t} become similar to an AR(1) with parameter θ. Results in Clark (1999) suggest only limited sensitivity of size to ρ and θ, so we keep these fixed and investigate the effect of increasing the serial correlation with q. 2 In Tables 1-2 we report results of the Monte Carlo, with theoretical size set to 5%.
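The error-generating design can be simulated as follows. This is a sketch under our reading of the design (contemporaneous correlation ρ through the innovations, serial correlation through q lags with geometrically decaying weights θ^j; the function name is ours):

```python
import numpy as np

def simulate_errors(T, q=1, rho=0.5, theta=0.75, rng=None):
    """Simulate two forecast-error series that are contemporaneously
    correlated (rho) and serially correlated (MA(q) with weights theta**j),
    approaching an AR(1) with parameter theta as q grows."""
    if rng is None:
        rng = np.random.default_rng()
    v = rng.standard_normal((T + q, 2))        # forecast innovations
    u1 = v[:, 0]                               # contemporaneous correlation rho:
    u2 = rho * v[:, 0] + np.sqrt(1.0 - rho ** 2) * v[:, 1]
    w = theta ** np.arange(q + 1)              # MA weights 1, theta, ..., theta**q
    e1 = np.convolve(u1, w)[q:T + q]           # drop the first q obs so every
    e2 = np.convolve(u2, w)[q:T + q]           # observation mixes a full window
    return e1, e2
```

Under quadratic loss, the loss differential is then `d = e1**2 - e2**2`, which has zero mean by construction, so rejections measure size.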
In all cases we use 10,000 replications (entries in the tables are rounded to the third decimal digit) and a quadratic loss function. We use T = 40 and T = 120, as these sample sizes correspond to 10 and 30 years of quarterly data, and therefore match the dimension of our sample in the empirical analysis. We consider three estimates of σ²: the WCE using the DM estimate in (3) with h − 1 = q; the WCE using the Bartlett kernel in (5)-(6); and the WPE using the Daniell kernel in (10). We refer to these three estimates as WCE-DM, WCE-B and WPE-D, respectively.
In the first part of the experiment, we study the size properties treating the estimates of σ² as consistent and using standard asymptotics, i.e. the limit normal distribution, to compute the empirical size. In Table 1 we report the empirical size of the tests when the WCE-DM, WCE-B and WPE-D are used to estimate σ². When using the WCE-DM estimate, negative estimates are possible; we treat these instances as rejections of the null hypothesis. 3 For the WCE-B we use M = T 1/3 and M = T 1/2, and for the WPE-D we use m = T 1/3, m = T 1/2 and m = T 2/3. The choice of the first bandwidth for the WCE-B is motivated by the fact that the optimal bandwidth, in the minimum-MSE sense, is obtained by setting M proportional to T 1/3, see for example Newey and West (1994).
We discuss here the naïve choice M = T 1/3. 4 The second bandwidth, M = T 1/2, is chosen because existing Monte Carlo evidence for fixed-b asymptotics suggests that longer bandwidths are associated with better empirical size.

2 In Appendix C.2, we report a size study for θ = 0.5 and θ = 0.
3 We discuss these occurrences in Appendix C.1.
4 In Appendix C.3 we also consider the automatic procedures from Newey and West (1994).
As for the bandwidths for the WPE-D, Delgado and Robinson (1996), Phillips (2005) and Sun (2013) show that the optimal bandwidth, in the MSE sense, is proportional to T 4/5, whereas Abadir, Distaso and Giraitis (2009) recommend m = T 2/3. 5 However, in samples as small as the ones of this exercise, even m = T 2/3 spans a substantial part of the interval (0, π), and the estimate of σ² with this bandwidth may therefore be subject to too much bias. The other two bandwidths are therefore chosen to limit this bias, and to allow comparison with the fixed-m asymptotics.
In general, Table 1 shows that, as the serial correlation increases with q, the size of the test deteriorates, although the size distortion is less serious in the larger sample.
Comparing the results when WCE-B is used, on balance we find that M = T 1/3 yields better size properties, at least for small values of q. The comparison between using the WCE-B with M = T 1/3 and the WCE-DM estimate is less clear cut in this instance.
The DM estimate delivers better size properties in the large sample, but using the WCE with the Bartlett kernel helps avoid the very severe size distortion occurring in the small sample with q = 4 or q = 5 when the DM estimate is used.
For the WPE-D, we find that the bandwidth m = T 2/3 is too long for the small samples used in this investigation: the bandwidth m = T 1/2 yields better size in most cases, although a certain size distortion still occurs, especially in the smallest sample.
Comparing the results for the three cases in which the WPE-D is used, corresponding to the three different bandwidths, the choice m = T 1/2 limits two alternative sources of size distortion: the lower order bias in the estimation of σ² at higher frequencies, which affects m = T 2/3 most, and the high variance of the estimate, which is more of a problem when the shortest bandwidth, m = T 1/3, is used. Bearing in mind that our focus is on small samples, the WPE estimate with bandwidth m = T 1/2 is overall the best choice.

Note: empirical rejection frequencies for tests of equal predictive ability at 5% nominal size using standard normal asymptotics for various MA(q) processes with θ = 0.75 and alternative estimates of the long run variance. For the WCE, DM is the WCE with the truncated kernel as in DM and h − 1 = q; T 1/3 and T 1/2 are the WCE with the Bartlett kernel and M = T 1/3 and M = T 1/2. For the WPE, we use the Daniell kernel with m = T 1/3, m = T 1/2 and m = T 2/3.
In Table 2 we report results when the properties of the estimates of σ² and of the test statistic are derived assuming fixed-smoothing asymptotics. In columns WCE, we use (5)-(6), with M = T 1/3, M = T 1/2 and M = T, and fixed-b asymptotics, with limit (7); in columns WPE, we use the estimate (10) with m = T 1/4, m = T 1/3 and m = T 1/2 and asymptotics from (11). Bandwidths M = T 1/3 and M = T 1/2 for the WCE-B mean that the same test statistic is used both in Table 1 and Table 2, and the difference in the empirical size in the two tables is then due only to the different critical values. Bandwidth M = T, on the other hand, has been proposed for the case in which fixed-b asymptotics is used, by Kiefer and Vogelsang (2002). Likewise, for the WPE-D estimate, bandwidths m = T 1/3 and m = T 1/2 allow for a comparison with results from Table 1. The size distortion for m = T 2/3 documented in Table 1 is due to the bias in the estimation of the long run variance and therefore cannot be improved upon with fixed-m asymptotics. Instead, we consider m = T 1/4: this is too short to be considered under standard asymptotics, as m = 2 when T = 40, but fixed-m asymptotics provides a useful justification for this choice. As the Monte Carlo exercise in Hualde and Iacone (2015) shows that the best size is achieved for the lowest bandwidths, m = T 1/4 is a very interesting choice.
Comparing Tables 1 and 2, we find that fixed-smoothing asymptotics always improves the empirical size, yielding results closer to the prescribed 5%. Moreover, with the WCE-B the empirical size is better the larger is the bandwidth, whereas with the WPE-D the empirical size is more precise the smaller is m. Indeed, we find that the bandwidth M = T 1/3 in the WCE-B still yields some size distortion, even when fixed-b asymptotics is used; results for m = T 1/2 for the WPE-D are also not entirely satisfactory, especially in the T = 40 sample. Overall, then, with fixed-b asymptotics it seems desirable to choose bandwidths M longer than what we would consider when standard asymptotics is used; this result is mirrored when fixed-m asymptotics is used, in which case the bandwidths could be shorter than what is usually recommended under standard asymptotics.
In summary, in our Monte Carlo exercise we find that the DM test with the WCE-DM may be subject to substantial size distortion in small samples, and that alternative estimates of the long run variance may help limit this size distortion, but do not completely restore the theoretical 5% size. Fixed-smoothing asymptotics alleviates the size distortion, and may eliminate it completely, when a long bandwidth is used for the WCE-B or when a short bandwidth is used for the WPE-D.

Power analysis
In the previous exercise, we saw that some tests of equal predictive accuracy give rise to substantial size distortion, and we therefore do not recommend using those tests. To choose between the remaining tests, which are broadly correctly sized, in the second part of the Monte Carlo exercise we study the power of the tests.
In this experiment, we only consider test statistics in which σ² is estimated with the WCE-B or the WPE-D, and only use critical values from fixed-smoothing asymptotics. 6 Notice that we also include two cases in which even the non-standard asymptotics does not completely eliminate the size distortion: when σ² is estimated with M = T 1/3 for the WCE-B and with m = T 1/2 for the WPE-D. In this way, we are able to observe the power loss associated with using M = T 1/2 for the WCE-B instead of M = T 1/3. We keep m = T 1/2 for the WPE-D for a similar power comparison against the case in which the WPE-D with m = T 1/3 is used.
We test H_0 : {μ = 0} in processes with μ = cT^{−1/2}, for c ranging between 0 and 7. Since in this part of the exercise we are interested in power, rather than in size distortion, we use a time series of independent standard normal variates. 7 As in the previous exercise, we use 10,000 repetitions and T = 40 and T = 120. We also compare the tests with fixed-smoothing asymptotics against a benchmark case in which σ² is known. With samples as small as the ones used in our experiment, this benchmark is unfeasible. If a very large sample is available, this situation can be interpreted as a limit case of the test when a WCE-B with b → 0 or a WPE-D with m → ∞ is used, so that the effect of replacing σ² with its estimate is negligible and asymptotic normality is justified. Thus, in our experiment this benchmark should be the upper bound for the empirical power functions.
The simulated empirical power is in Figure 1. Previous simulations in Kiefer and Vogelsang (2005) and in Hualde and Iacone (2015) found that the power is higher the smaller is M or the larger is m, and our results are consistent with these findings. The test statistic with known σ² has the highest power, as expected. It is worth noticing, however, that the power loss due to estimating σ² is minimal, especially when the WCE-B with M = T 1/3 or M = T 1/2 is used. Overall, the only case in which we observe a remarkable power loss is for M = T when the WCE-B is used. For this bandwidth choice, the condition b → 0 as T → ∞ is certainly not justifiable, so the power loss with respect to the unfeasible benchmark is not going to disappear as the sample size increases. We also verify that the power difference between using M = T 1/2 instead of M = T 1/3 for the WCE-B is very limited; to a slightly lesser extent, this is also true of using m = T 1/2 instead of m = T 1/3 for the WPE-D.
7 Results when autocorrelation is preserved under the alternative are reported in Appendix C.6.

Figure 1: Finite sample local power
The figure displays empirical rejection frequencies at 5% nominal size for deviations from the null by cT −1/2 and independent innovations. U refers to the unfeasible benchmark in which the true (unknown in practice) variance is used and the test statistic has standard normal limit distribution. For the feasible tests, fixed-smoothing asymptotics is used. The alternative estimates of the long run variance are: WCE-B for the WCE with Bartlett kernel and M = T 1/3, M = T 1/2 or M = T; WPE-D for the WPE with Daniell kernel and m = T 1/2, m = T 1/3 or m = T 1/4.

Comparison with the bootstrap
The bootstrap is a widely used alternative to asymptotic approximations in tests for equal predictive ability. For this reason, in this section we perform a Monte Carlo analysis of the size and power of the tests for equal predictive ability using bootstrap critical values, and compare the results with those obtained using fixed-smoothing asymptotics.
In the i-th Monte Carlo replication, we simulate forecast errors e^{(i)}_{1t} and e^{(i)}_{2t} as described in Section 5.1 (for the size analysis) or Section 5.2 (for the power analysis), compute the loss differential d^{(i)}_t and the test statistic

t^{(i)} = √T d̄^{(i)} / σ̂^{(i)}.

Then, for each bootstrap replication b, we generate bootstrapped loss differentials using the overlapping stationary block-bootstrap of Politis and Romano (1994) with a circular scheme. In particular, we collate the loss differentials as (d^{(i)}_1, ..., d^{(i)}_T, d^{(i)}_1, ..., d^{(i)}_T). We then draw block sizes L_1, L_2, ... from a discrete uniform distribution with support on {1, ..., 2⌈T^{1/4}⌉}. We also draw random initial indices I_1, I_2, ... from a discrete uniform distribution with support on {1, ..., T}. The series of bootstrapped loss differentials is then given by the first T elements of (d^{(i)}_{I_1}, ..., d^{(i)}_{I_1+L_1−1}, d^{(i)}_{I_2}, ..., d^{(i)}_{I_2+L_2−1}, ...). We finally construct the bootstrapped test statistic as

t^{(i,b)} = √T (d̄^{(i,b)} − d̄^{(i)}) / σ̂^{(i,b)},

where d̄^{(i,b)} is the sample mean of the bootstrapped loss differential and σ̂^{(i,b)} is the estimate of its long run variance constructed using the same formula as in the original data (WCE-B or WPE-D). We perform 10,000 bootstrap replications and use the 95% quantile of the bootstrap distribution of the test statistic, (t^{(i,1)}, ..., t^{(i,10000)}), as critical value cv^{(i)}. We then reject the null of equal predictive ability if |t^{(i)}| > cv^{(i)}. Notice that this is the naive bootstrap also performed by Kiefer and Vogelsang (2005) and Gonçalves and Vogelsang (2011) for the test with the WCE-B estimate of the long run variance using block-bootstrap.
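The circular overlapping block-bootstrap scheme described above can be sketched as follows (a minimal Python illustration; `sigma_hat` stands for whichever long run variance estimator, WCE-B or WPE-D, is applied identically to the original and bootstrapped data):

```python
import numpy as np

def block_bootstrap_stats(d, sigma_hat, n_boot=999, rng=None):
    """Naive circular block-bootstrap of the DM statistic:
    t(b) = sqrt(T) * (bootstrap mean - original mean) / bootstrap s.e."""
    if rng is None:
        rng = np.random.default_rng()
    T = len(d)
    d2 = np.concatenate([d, d])                 # circular scheme
    dbar = d.mean()
    max_block = 2 * int(np.ceil(T ** 0.25))     # block sizes on {1, ..., 2*ceil(T^(1/4))}
    out = np.empty(n_boot)
    for b in range(n_boot):
        pieces, n = [], 0
        while n < T:
            L = int(rng.integers(1, max_block + 1))   # block size
            I = int(rng.integers(0, T))               # random initial index
            pieces.append(d2[I:I + L])
            n += L
        db = np.concatenate(pieces)[:T]         # first T elements
        out[b] = np.sqrt(T) * (db.mean() - dbar) / np.sqrt(sigma_hat(db))
    return out
```

The critical value is then the 95% quantile of the bootstrap statistics, e.g. `cv = np.quantile(np.abs(stats), 0.95)`, with rejection if |t| > cv.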
In Table 3 we report the empirical size of the tests when bootstrap critical values are used.

The figure displays empirical rejection frequencies at 5% nominal size for deviations from the null by cT −1/2 and independent innovations. U refers to the unfeasible benchmark in which the true (unknown in practice) variance is used and the test statistic has standard normal limit distribution. For the feasible tests, fixed-m or bootstrap critical values are used. The long run variance is estimated using the WPE with Daniell kernel with m = T 1/4 or m = T 1/2.
Both figures indicate that the bootstrap local power mimics the fixed-b and the fixed-m local power.
Results for the bootstrap test with the WCE-B estimate of the long run variance are in line with Gonçalves and Vogelsang (2011). They prove that the naive block-bootstrap has the same limiting distribution as the fixed-b asymptotic distribution. They also find that the power of the naive block-bootstrap closely follows the power when using the fixed-b critical value. However, Kiefer and Vogelsang (2005) show that the size properties of the naive block-bootstrap test statistic depend on the choice of the block length.

Comparing forecasts from models
In the previous sections, we assumed that forecast errors were model-free or generated from unknown models, as for example from forecast survey data. In reality, forecasts often rely on forecasting models and estimation procedures. For this reason, we now consider the case in which two alternative models are used to forecast the variable of interest y_t h steps ahead.
Let's assume that model 1 is characterized by parameters δ^{(1)} and model 2 is characterized by parameters δ^{(2)}. We have a sample of size n and use the last T observations for predictions. Each h-step ahead, time-t prediction for t = R + h, ..., n = R + h + T − 1 is based on the parameter estimates δ̂^{(1)}_{t−h,R} and δ̂^{(2)}_{t−h,R}, obtained using the R observations available up to time t − h. We denote the two forecasts by y^h_{1,t}(δ̂^{(1)}_{t−h,R}) and y^h_{2,t}(δ̂^{(2)}_{t−h,R}), respectively. The time-t loss differential between the two forecasts is given by

d_t(δ̂) = L(y_t − y^h_{1,t}(δ̂^{(1)}_{t−h,R})) − L(y_t − y^h_{2,t}(δ̂^{(2)}_{t−h,R})),

and d̄(δ̂) = T^{−1} Σ_t d_t(δ̂) denotes the sample mean of the loss differential at the estimated parameter values.
Without imposing any restrictions on the estimation methods used to produce the forecasts for the two models, a feasible statistic to test H_0 : {E[d_t(δ̂)] = 0} is

D(δ̂) = √T d̄(δ̂) / σ̂(δ̂),

where σ̂²(δ̂) is a WCE or WPE estimate of the long run variance, and γ̂_j and I(λ_j) are the sample autocovariances and periodograms of d_t(δ̂), respectively. Under regularity conditions, Giacomini and White (2006) show that, for b → 0, under the null it holds that

D(δ̂) →_d N(0, 1).

The same applies for m → ∞. For fixed b, proceeding as in Section 4.1 and using the Bartlett kernel, we obtain

D(δ̂) ⇒ Φ_BART(b),

where the quantiles of Φ_BART(b) are reported in Section 4.1. Also, applying fixed-m asymptotics as in Section 4.2 and using the Daniell kernel, we obtain that, for fixed m,

D(δ̂) →_d t_{2m}.

Monte Carlo study with forecasts from models
In this section, we revisit the size and power properties of the Giacomini and White (2006) unconditional predictive ability test, focussing on the case of serial dependence.
We assume the following data-generating process. Giacomini and Rossi (2010) consider the case of ρ = 0, but in our case we are interested in analyzing the performance of standard and fixed-smoothing asymptotics as the autocorrelation of the loss differential increases.
Following Giacomini and Rossi (2010), we use a quadratic loss function to compare one-step-ahead out-of-sample forecasts from two models: model 1, where y_t = β x_t + ε^{(1)}_t, and model 2, where y_t = ε^{(2)}_t. The one-step-ahead forecasts of y_{t+1} implied by the two models are

ŷ^{(1)}_{t+1} = β̂_{t,R} x_{t+1}   and   ŷ^{(2)}_{t+1} = 0,

where β̂_{t,R} is the in-sample parameter estimated based on a rolling window of size R, and where we assume for simplicity that x_{t+1} is known at time t.
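The rolling-window mechanics of this comparison can be sketched as follows (a minimal Python illustration of the forecasting scheme, not the exact data-generating process of the exercise; the function name is ours):

```python
import numpy as np

def rolling_loss_differential(y, x, R):
    """One-step-ahead rolling-window comparison: model 1 forecasts with
    OLS of y on x (no intercept) over the last R observations, model 2
    forecasts zero. Returns the quadratic loss differential d_t."""
    n = len(y)
    d = []
    for t in range(R - 1, n - 1):
        yw, xw = y[t - R + 1:t + 1], x[t - R + 1:t + 1]   # rolling window
        beta_hat = (xw @ yw) / (xw @ xw)                  # OLS slope estimate
        f1 = beta_hat * x[t + 1]    # model 1 (x_{t+1} assumed known at t)
        f2 = 0.0                    # model 2
        d.append((y[t + 1] - f1) ** 2 - (y[t + 1] - f2) ** 2)
    return np.array(d)
```

The resulting series d_t can then be fed to any of the tests of the previous sections; a negative sample mean favours model 1.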
The two models have equal predictive performance at the estimated parameter values when condition (21) holds. When ρ = 0 in (17), Giacomini and Rossi (2010) show that (21) holds for β_{t+1} as given in (22). On the other hand, setting σ^2 = 1 in (22) and σ^2 < 1 in (18), then, due to a reduction in the variance of the parameter estimate β̂_{t,R}, model 1 provides more accurate forecasts.
When ρ ≠ 0 (and |ρ| < 1), the formula for β_{t+1} that yields (21) is derived in Appendix B. With this generalization of β_{t+1} to allow for non-zero ρ, we can analyse the size properties of the proposed tests for values of ρ ranging from 0 to 0.8. To this end, we first generate time series for x_t and ε_t as in (16)-(18). As in Giacomini and Rossi (2010), we initialize β_t with β_t = 0.05 for t = 1, ..., R. Then, using (15), we generate n = R + T observations for y_t that satisfy (21). To study size, we set σ^2 = 1 both in (17) and in the formula for β_{t+1}. In all cases, we use 10,000 replications and in-sample and out-of-sample sizes (R, T) equal to (50, 40) and (150, 120).
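For concreteness, the design just described can be sketched as follows. Because the paper's exact formula for β_{t+1} (Appendix B) is elided in the text above, the sketch simply holds β_t at its initialization value 0.05; it is therefore an illustrative stand-in for the null design, not the paper's exact scheme, and the function name and defaults are ours.

```python
import numpy as np

def simulate_loss_differential(R=50, T=40, rho=0.0, sigma2=1.0, beta0=0.05, rng=None):
    """Illustrative sketch of the Giacomini-Rossi-style Monte Carlo design.
    Model 1 forecasts y_{t+1} with beta_hat*x_{t+1} (rolling OLS, window R);
    model 2 forecasts 0. Returns the quadratic loss differential series."""
    rng = np.random.default_rng(rng)
    n = R + T
    x = rng.standard_normal(n + 1)
    # AR(1) errors with autoregressive coefficient rho, innovation variance sigma2
    eps = np.zeros(n + 1)
    u = rng.standard_normal(n + 1) * np.sqrt(sigma2)
    for t in range(1, n + 1):
        eps[t] = rho * eps[t - 1] + u[t]
    # NOTE: beta held constant at beta0 -- a stand-in for the paper's beta_{t+1}
    y = beta0 * x + eps
    d = []
    for t in range(R, n):
        # rolling-window OLS estimate of beta over the last R observations
        xs, ys = x[t - R + 1:t + 1], y[t - R + 1:t + 1]
        beta_hat = xs @ ys / (xs @ xs)
        e1 = y[t + 1] - beta_hat * x[t + 1]  # model 1 forecast error
        e2 = y[t + 1]                        # model 2 (zero forecast) error
        d.append(e1 ** 2 - e2 ** 2)          # quadratic loss differential
    return np.array(d)
```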
In the top plots of Figure 4, we report results of the Monte Carlo with theoretical size set to 5%. We consider five different tests, two with standard normal limit distribution, one with bootstrap critical values and two that use fixed-smoothing asymptotics.
The WCE-DM test uses the sample variance to estimate the long run variance, as in Giacomini and White (2006), and for this reason it becomes seriously oversized as the degree of autocorrelation of the error increases. This problem is more serious in the Monte Carlo with small in-sample and out-of-sample sizes, but it is substantial also for the rolling window of 150 observations and the out-of-sample size of 120, as shown by the 0.13 empirical size of the test for ρ = 0.8. This problem is partly addressed by the test that uses a WCE of the long run variance with Bartlett kernel, truncation T^{1/3} and standard normal asymptotics. Still, this test is also oversized, especially for R = 50 and T = 40. On the other hand, the tests that use bootstrap critical values and fixed-smoothing asymptotics are correctly sized for all rolling windows and out-of-sample sizes, and for any degree of autocorrelation of the error.

8 See Appendix B for the derivation.

Figure 4: The figure displays empirical rejection frequencies for the Giacomini and White (2006) unconditional predictive ability test at 5% nominal size. The in-sample and out-of-sample sizes (R, T) equal (50, 40) for the left-hand-side plots and (150, 120) for the right-hand-side plots. The top plots report the empirical size for different values of ρ in (17). The bottom plots report the power of the tests for different values of σ^2 in (18). WCE-DM, Standard refers to the test that uses the sample variance to estimate the long run variance and the standard normal limit distribution. WCE-B, M=T^{1/3}, Standard refers to the test that uses a WCE of the long run variance with Bartlett kernel, truncation T^{1/3} and the standard normal limit distribution. WCE-B, M=T^{1/3}, Bootstrap refers to the test that uses a WCE of the long run variance with Bartlett kernel, truncation T^{1/3} and bootstrap critical values, computed as detailed in Section 5.3. WCE-B, M=T^{1/2}, Fixed-b refers to the test that uses a WCE of the long run variance with Bartlett kernel, M = T^{1/2} and fixed-b asymptotics. WPE-D, m=T^{1/3}, Fixed-m refers to the test that uses a WPE of the long run variance with Daniell kernel, m = T^{1/3} and fixed-m asymptotics.
To assess the power properties of the Giacomini and White (2006) unconditional predictive ability test with standard and fixed-smoothing asymptotics, we generate data under the alternative hypothesis that model 1 provides more accurate forecasts: we simulate the data-generating process in (15)-(18) with β_t satisfying (21) with σ^2 set to 1, while σ^2 in (18) decreases from its value of 1 under the null hypothesis to 0.05. Since here we are interested in the power of the test, we set ρ = 0 in all experiments.
Results are reported in the bottom plots of Figure 4. As expected, the power of the tests increases towards 1 as σ 2 decreases from 1 to 0.05. For small rolling windows and out of sample sizes, the tests that use standard asymptotics are oversized and have higher rejection frequencies for any σ 2 . For larger rolling window and out of sample sizes, all tests are correctly sized and the plot shows that the power loss associated with the use of fixed-smoothing asymptotics is small.

Empirical illustration
To illustrate the usefulness of fixed-smoothing asymptotics for equal predictive accuracy tests, we evaluate the predictive accuracy of the Survey of Professional Forecasters (SPF) and of the ECB Survey of Professional Forecasters (ECB SPF).
We perform the DM test, with WCE-DM, WCE-B and WPE-D estimates of the long run variance, using standard and fixed-smoothing asymptotics. To compute the WCE-DM, we use truncation lags equal to the forecast horizon. To select the bandwidths for the WCE-B and for the WPE-D, we use the results of our Monte Carlo exercise. For the WPE-D, we use the bandwidths m = T^{1/4} and m = T^{1/3}, as in our Monte Carlo they always returned good size properties for the DM test when fixed-m asymptotics was used. We omit m = T^{1/2}, as we still found some evidence of size distortion in the Monte Carlo exercise even with fixed-m asymptotics. For the WCE-B, our choice is somewhat more delicate: we omit M = T in view of its low power, but we keep M = T^{1/3}, alongside M = T^{1/2}, despite some residual size distortion for the DM test even under fixed-b asymptotics when this estimate is used. This implies that, for the WCE-B, we should put more weight on M = T^{1/2}.
We use as benchmark a naive random walk, i.e. a no-change benchmark using the vintages of data that were available to the public before the survey's deadline. We denote by e^h_{1,t} the h-step-ahead forecast error of the random walk and by e^h_{2,t} the h-step-ahead forecast error from the SPF. Therefore, a positive loss differential means a higher loss for the forecast made using the random walk, and vice versa for a negative entry. Also, to evaluate the forecasts, we use the first release as realised value and a quadratic loss function. In the next two subsections, we describe the survey data and the empirical results for, respectively, the SPF and the ECB SPF.

Survey of Professional Forecasters

In the SPF, the output price index is the implicit price deflator for GNP in surveys conducted prior to 1992:Q1, the implicit deflator for GDP in surveys from 1992:Q1 to 1995:Q4, and the chain-weighted price index in surveys conducted thereafter. In the same way, real output is defined as fixed-weighted real GNP in surveys conducted before 1992:Q1, fixed-weighted real GDP in surveys from 1992:Q1 to 1995:Q4, and chain-weighted real GDP in surveys conducted thereafter. Real GNP/GDP growth and GNP/GDP inflation are constructed as annualized quarter-over-quarter growth rates. For both variables, we define the corresponding benchmark forecasts and realized values accordingly, as in Stark (2010). Finally, the three-month Treasury bill rate and the unemployment rate are expressed in levels.
Tables 4-7 report the test statistics presented in Section 4 for the null hypothesis of equal predictive accuracy of the SPF forecasts for real GNP/GDP growth, GNP/GDP inflation, the unemployment rate and the three-month T-Bill rate with respect to the random walk. In the tables, we use shades of gray to indicate two-sided significance using standard asymptotics (limit normality) and asterisks to indicate two-sided significance using fixed-smoothing asymptotics.

Note to Table 4: this table reports the predictive accuracy tests for the SPF forecasts of real GNP/GDP growth with respect to a random walk. GNP/GDP growth is defined as the annualized quarter-over-quarter growth rate of fixed-weighted real GNP in the surveys conducted before 1992:Q1, fixed-weighted real GDP in the surveys from 1992:Q1 to 1995:Q4, and chain-weighted real GDP in the surveys thereafter. Random walk predictions and realized values are computed accordingly. A positive entry means a higher average loss for the forecast made using the random walk. ** and * indicate, respectively, two-sided significance at the 5% and 10% level using fixed-b asymptotics for the WCE-B and fixed-m asymptotics for the WPE-D; shades of gray indicate, respectively, two-sided significance at the 5% and 10% level using standard asymptotics (limit normality).

Note to Table 5: this table reports the predictive accuracy tests for the SPF forecasts of GNP/GDP inflation with respect to a random walk. GNP/GDP inflation is defined by the implicit price deflator for GNP in surveys conducted prior to 1992:Q1, the implicit deflator for GDP in the surveys from 1992:Q1 to 1995:Q4, and the chain-weighted price index in the surveys thereafter. Random walk predictions and realized values are computed accordingly. A positive entry means a higher average loss for the forecast made using the random walk. Significance is indicated as in Table 4.

Note to Table 6: this table reports the predictive accuracy tests for the SPF forecasts of the unemployment rate with respect to a random walk. A positive entry means a higher average loss for the forecast made using the random walk. Significance is indicated as in Table 4.

Note to Table 7: this table reports the predictive accuracy tests for the SPF forecasts of the three-month Treasury Bill rate with respect to a random walk. A positive entry means a higher average loss for the forecast made using the random walk. Significance is indicated as in Table 4.

Tables 4-7 show that in the full sample the predictive ability of the SPF is stronger than that of the random walk for all variables at short and medium horizons. The tables also indicate that the subsample 1985:Q1 to 1994:Q4 is characterized by stronger predictive ability of the SPF with respect to the random walk than the other two subsamples. The results for the third subsample are the most interesting, as there we best see how standard asymptotics may lead to spurious rejections of the null hypothesis, and therefore to incorrect conclusions (in this case, a marked resurgence of forecasting power for the SPF). To compare our results with the existing literature, see Demetrescu, Hanck and Kruse (2017). We now discuss details of the various tests. Table 4 shows that the SPF forecasts for real GNP/GDP growth outperform the random walk in the full sample at all forecasting horizons. However, when looking at the three subsamples, the evidence of significant outperformance of the SPF is consistently supported by the tests with fixed-b and fixed-m asymptotics only for the nowcast. For the other horizons, the outperformance of the SPF sharply declined in the last subsample. As for GNP/GDP price inflation, Table 5 shows a much stronger predictive ability of the SPF, especially at short horizons and in the first and last subsamples. Results in Table 6 indicate some predictive ability of the SPF forecasts for the unemployment rate, but the evidence is much weaker when using the proposed tests with fixed-smoothing asymptotics. Finally, Table 7 provides strong evidence of superior predictive accuracy of the SPF forecasts for the three-month Treasury bill rate with respect to the random walk, especially at short horizons. However, this predictive ability sharply declined in the last two subsamples.

Comparing the application of standard asymptotics with fixed-smoothing asymptotics, we reject the null of equal predictive ability more frequently for the tests with standard asymptotics, especially in the subsamples (see for example the bottom panels in Tables 6-7). This is due to the fact that in the subsamples the tests are performed on only 40 observations, exacerbating the size distortions induced by standard asymptotics, see Table 1. For example, Table 6 shows that for unemployment both the test with the WCE-DM and the test with the WCE-B and standard asymptotics reject, at the 10% significance level, the null of equal predictive ability of the SPF and the random walk in the last subsample for almost all forecasting horizons. This could be interpreted as a clear indication of predictive ability of the SPF for the unemployment rate. However, the tests with fixed-smoothing asymptotics fail to reject the null of equal predictive ability for almost all forecasting horizons, especially when fixed-m asymptotics is used, indicating that the SPF did not have any significant predictive ability for the unemployment rate in this period.
Finally, we notice that the results for the nowcasts (and, to a lesser extent, for longer horizons too) are affected by a common pitfall of the WCE-DM: autocorrelation in the loss differential that the WCE-DM estimate does not account for. Consider, for example, the nowcast for unemployment in the last subsample, where T = 40. The WCE-DM test rejects the null of equal predictive ability of the SPF and the random walk with a test statistic equal to 2.84, much larger than the test statistics using the Bartlett kernel, which are 1.96 with M = T^{1/3} = 3 and 1.80 with M = T^{1/2} = 6. This is because the WCE-DM test statistic assumes optimality of the nowcasts and, as a consequence, uses only the sample variance to estimate the long run variance. However, the sample first- and second-order autocorrelations of the loss differential are 0.665 and 0.298, due to autocorrelation both in the forecast errors of the SPF, for which they are respectively 0.294 and 0.224, and, more importantly, in those of the random walk, for which they are respectively 0.690 and 0.602. When we use the Bartlett kernel, the estimates of the long run variance are larger than the sample variance used to construct the WCE-DM statistic, because they also use the first 3 or 6 autocovariances, which, given the degree of autocorrelation of the loss differential, are different from zero.
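The mechanics of this comparison are easy to reproduce on synthetic data: with a positively autocorrelated loss differential, Bartlett-kernel estimates that include the first few autocovariances are typically larger than the sample variance alone, which deflates the standardized statistic. The series below is simulated and merely illustrative, not the paper's data; the kernel-weight convention is a common one, assumed here.

```python
import numpy as np

def bartlett_lrv(u, M):
    # gamma_0 plus Bartlett-weighted autocovariances up to lag M
    T = u.size
    s = u @ u / T
    for j in range(1, M + 1):
        s += 2.0 * (1.0 - j / (M + 1)) * (u[j:] @ u[:-j]) / T
    return s

rng = np.random.default_rng(1)
T = 40
d = np.zeros(T)
for t in range(1, T):  # AR(1) loss differential with positive autocorrelation
    d[t] = 0.6 * d[t - 1] + rng.standard_normal()
u = d - d.mean()

gamma0 = u @ u / T          # sample variance: the only term in the WCE-DM
lrv3 = bartlett_lrv(u, 3)   # adds the first 3 autocovariances
lrv6 = bartlett_lrv(u, 6)   # adds the first 6 autocovariances
```

Dividing √T d̄ by the square root of a larger variance estimate mechanically shrinks the statistic, which is the 2.84 versus 1.96 and 1.80 pattern discussed above.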

ECB Survey of Professional Forecasters
Data on the ECB SPF have been provided by the European Central Bank since 1999. The survey is conducted quarterly and includes expectations for some of the key macroeconomic variables for the Euro area. Between 1999Q1 and 2001Q3, the survey was conducted in the middle month of each quarter, i.e. in February, May, August and November. Since 2001Q4, the survey has been conducted in the first month of the quarter, i.e. in January, April, July and October. The questionnaire is sent to the panelists just after the release of the Harmonized Index of Consumer Prices (HICP), i.e. in the third week of the month before the survey, and the forecasts are collected in the second half of the first month of each quarter. For more details, see Bowles et al. (2007, 2010) and Garcia (2003).
We focus on mean responses about year-on-year GDP growth and year-on-year HICP inflation at rolling horizons of one, two and five years. In the ECB SPF, the rolling horizons are set to one, two and five years ahead of the latest period for which the variable in question is observed when the survey is conducted, and not one or two years ahead of the survey date.

Note to Table 8: the sample comprises a total of 44 observations. A positive entry means a higher average loss for the forecast made using the random walk. ** and * indicate, respectively, two-sided significance at the 5% and 10% level using fixed-b asymptotics for the WCE-B and fixed-m asymptotics for the WPE-D; shades of gray indicate, respectively, two-sided significance at the 5% and 10% level using standard asymptotics (limit normality).
The evaluation sample ends in 2016Q4, for a total of 44 observations. With such a small sample size, standard tests of equal predictive ability suffer from large size distortions, but fixed-smoothing asymptotics can still provide reliable inference.
Since data on GDP and HICP are subject to revisions, following Conflitti, De Mol and Giannone (2015), we use the Euro area real-time database (see Giannone, Henry, Lalik and Modugno, 2012) to match the survey data with the information that was available to the forecasters at the time they submitted their projections. Table 8 reports the test statistics presented in Section 4 for the null hypothesis of equal predictive accuracy of the ECB SPF forecasts for year-on-year GDP growth and year-on-year HICP inflation with respect to the random walk. As in Tables 4-7, we use asterisks to indicate two-sided significance using fixed-smoothing asymptotics, and shades of gray to indicate two-sided significance using standard asymptotics and limit normality. The results in Table 8 indicate that the tests with standard asymptotics reject the null of equal predictive ability of the ECB SPF with respect to the random walk for year-on-year GDP growth at short and medium horizons, and also for year-on-year HICP inflation at the medium horizon. However, these results are partially spurious and demonstrate the risks of using standard asymptotics in a small sample. Indeed, when using fixed-smoothing asymptotics, we only find limited evidence of superior predictive ability of the ECB SPF with respect to the random walk, and only for GDP growth at the nearest horizon.

Conclusion
We propose fixed-smoothing asymptotics to overcome the small sample size distortions of standard tests for predictive accuracy. Our Monte Carlo results show that these alternative asymptotics provide correctly sized tests for autocorrelated loss differentials even when only a small number of out of sample observations are available.
The methodology proposed in this paper is well suited to evaluate the predictive accuracy of surveys with limited samples. As an illustrative example, and to facilitate comparison with other works, we apply our methodology to reassess the predictive accuracy of the SPF. We also include an application to the ECB SPF, whose short time-series dimension makes our approach particularly convenient.
In this paper, we focus on applying the fixed-b and fixed-m asymptotics to the Diebold and Mariano (1995) test. However, these methodologies are of broader applicability in the forecasting literature. For example, Harvey, Leybourne and Whitehouse (2015) apply the fixed-m approach to forecast encompassing tests, and Demetrescu et al. (2017) apply fixed-b asymptotics to the fluctuation test of Giacomini and Rossi (2010). Future work includes applications of fixed-smoothing asymptotics to tests of equal predictive ability in the presence of parameter estimation errors, see West (1996) and Clark and McCracken (2001); to forecast rationality tests, see Granger and Newbold (1986) and Diebold and Lopez (1996); and to forecast breakdown tests, see Giacomini and Rossi (2009).

A Limiting fixed-m asymptotics
where ε_t is an independent, identically distributed process with E(ε_t) = 0, E(ε_t^2) = 1, E(ε_t^4) < ∞, and Σ_{l=0}^∞ l^{1/2} |A_l| < ∞. Define the Fourier frequencies λ_j = 2πj/T for j = 0, ±1, ..., ±⌊T/2⌋, the Fourier transform w_x(λ) = (2πT)^{−1/2} Σ_{t=1}^T x_t e^{iλt}, the periodogram I_x(λ) = |w_x(λ)|^2, the sample mean x̄ = T^{−1} Σ_{t=1}^T x_t, and the statistic τ.
Proof. First, note that, for j = 1, ..., m, (2πT)^{−1/2} Σ_{t=1}^T e^{iλ_j t} = 0, so w_x(λ_j) = w_u(λ_j). Moreover, we proceed as in Hannan (1970), page 247. Sufficient conditions for the central limit theorem are then stated; the first three conditions are easy to establish, and the Liapunov condition can be easily verified, where we also use T^{−1} Σ_{t=1}^T cos^2(2πjt/T) = 1/2, from Gradshteyn and Ryzhik (1994), equation (1.351.2), page 37. Convergence of the remaining term to a bivariate vector of independently normally distributed random variables with diagonal covariance matrix follows from an application of the Cramér-Wold device. Moreover, for integers j, k such that λ_j ∈ [0, π] and λ_k ∈ [0, π], following Giraitis, Koul and Surgailis (2012), page 112, formula (27) yields E(w_ε(λ_j) w_ε(λ_k)*) = 0 for j ≠ k; therefore, with an application of the Cramér-Wold device, it is easy to conclude that the limit is χ^2_{2m}/(2m), a χ^2_{2m}-distributed random variable divided by its degrees of freedom. Using (26) and Σ_{l=0}^∞ A_l e^{iλ_j l} → Σ_{l=0}^∞ A_l = σ, the corresponding result for the linear process also follows. Finally, as in Phillips and Solo (1992), we use the Beveridge-Nelson decomposition of Phillips and Solo (1992), page 978. In view of Remark 3.5 of Phillips and Solo (1992), the condition on the weights A_l is Σ_{l=0}^∞ A_l^2 < ∞, as in equation (16) of Phillips and Solo (1992), and this is implied by Σ_{l=0}^∞ l^{1/2} |A_l| < ∞ (Phillips and Solo, 1992, page 973). Another application of the Cramér-Wold device and of (27) allows us to establish a central limit theorem for the relevant vectors for integer 0 < j < m; the result for dependent ε_t is established in the same way.
Therefore, (25) holds. Remark. The condition Σ_{l=0}^∞ l^{1/2} |A_l| < ∞ is fairly common in the literature, and it holds for any ARMA model. Many of these results are already known in the literature. For example, the limit normality of the Fourier transform is given in Hannan (1970), page 225; see also Kokoszka and Mikosch (2000), page 51, where the asymptotic independence of the periodograms I_ε(λ_j) at different frequencies is also discussed. A result similar to (25) is also in Sun (2013). The main interest of this proof lies in the fact that, using the decompositions (26) and (28), we see that we can treat most weakly dependent processes as independent processes, and derive results for the former from the latter.
These results are then fairly intuitive and easy to establish.

B The unconditional predictive ability test in the presence of serial correlation

Giacomini and Rossi (2010) propose the DGP (15)-(18) with ρ = 0 in (17), and consider the two competing forecasts β̂_{t,R} x_{t+1} and 0. To ensure that the two models have equal predictive performance at the estimated parameter values, they set β_t as in (22). This implies that β_t changes at every point t, even though the estimate is obtained by OLS, which is designed for a β that is constant over the period t − R + 1, ..., t. Moreover, the estimate β̂_{t,R} changes at every point t because of its rolling structure.
To derive the values of β t that ensure that the two models have equal predictive performance at the estimated parameters values in presence of serial dependence, we first revise the derivation of β t in (22) for the case of no serial dependence, as in Giacomini and Rossi (2010).
The forecast error of model 1 is given by y_{t+1} − β̂_{t,R} x_{t+1}; notice that ε_{t+1} and ε_j for j = t − R + 1, ..., t satisfy E(ε_{t+1} ε_j) = 0, so the expected value of the third term is 0.
As for the first term, again the expected value of the third term is 0. The first term is non-random, while for the second, as ε_j is independently distributed, the cross terms with j ≠ k in E(Σ_{j,k=t−R+1}^t x_j x_k ε_j ε_k) vanish, so the expectation of the second term reduces to a sum over the x_j^2. For the other forecast, imposing (21) and solving for β_{t+1}, we obtain (22). For the case in which ε_t is a dependent process, consider again the second term, and note that ε′x = x′ε, as these are scalars. So, letting Ω = E(εε′), the expectation of the second term now involves Ω. For the other forecast, again imposing (21) and solving for β_{t+1}, we get the generalized formula used in Section 5.

C Additional Monte Carlo results
In this appendix, we report additional Monte Carlo results: the frequency of negative estimates of the long run variance using the WCE-DM; the size under standard asymptotics for θ = 0.5 and θ = 0, for the WCE-B with automatic bandwidth selection, and for the WPE with feasible minimum MSE bandwidth; and power comparisons with standard asymptotics and with autocorrelation under the alternative.

C.1 Negative estimates of the long run variance
In Table 9, we study the frequency of negative estimates of σ̂^2_DM, the WCE estimate with the rectangular kernel (WCE-DM) defined in (3). The table shows that the risk of negative long-run variance estimates is higher in the small sample, at large forecasting horizons and for low values of θ. For θ = 0, q = 5 and T = 40, the size distortion due solely to a negative estimate σ̂^2_DM < 0 is actually larger than the nominal size. This is due to the fact that one does not know that θ is 0 and therefore adds more lagged autocovariances, increasing the risk of a negative estimate.
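The possibility of a negative estimate is easy to see directly: the rectangular-kernel estimate behind the WCE-DM (reconstructed here in the usual truncated form γ̂_0 + 2 Σ_{j=1}^{h−1} γ̂_j for an h-step horizon, which may differ in detail from the paper's equation (3)) puts unit weight on every included autocovariance, so negative autocovariances can push the sum below zero.

```python
import numpy as np

def wce_dm(u, h):
    """Rectangular-kernel (truncated) long-run variance estimate,
    gamma_0 + 2*sum_{j=1}^{h-1} gamma_j; unlike kernel estimates with
    decaying weights, it is not guaranteed to be nonnegative."""
    u = np.asarray(u, dtype=float)
    T = u.size
    u = u - u.mean()
    s = u @ u / T
    for j in range(1, h):
        s += 2.0 * (u[j:] @ u[:-j]) / T
    return s

# An alternating series has strong negative first-order autocorrelation,
# so for a two-step horizon the estimate turns negative (about -0.95),
# while for h = 1 it is just the sample variance (1.0 here).
u = np.array([1.0, -1.0] * 20)
print(wce_dm(u, 1), wce_dm(u, 2))
```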
C.2 Size for θ = 0.5 and θ = 0

In Table 10, we study the size properties of the DM test for various estimates of σ when θ = 0 or θ = 0.5, assuming standard asymptotics. This exercise allows a comparison with Table 1, in which θ = 0.75 was used, to appreciate the consequences of altering θ. Consistently with the results in Clark (1999), the size when the WCE-DM is used does not seem to be sensitive to the change of the value of θ to θ = 0.5; on the other hand, the reduction in the dependence is associated with an improvement in the size properties when the WCE with Bartlett kernel (WCE-B) or the WPE with Daniell kernel (WPE-D) is used. In the case θ = 0.5, the evidence that the test whose statistic is standardized by the WPE-D estimate with m = T^{1/2} gives the best size is even more compelling. In the case θ = 0, the simulations show that the test with the WCE-DM is heavily oversized.
This is due to the fact that one does not know that θ is 0 and, therefore, adds more lagged autocovariances, increasing the risk of a negative estimate and of a larger bias.

C.3 Automatic bandwidth selection
In Table 11, we study the application of the automatic bandwidth selection of Newey and West (1994) when θ = 0.75. We compare the performance of the naive M = T^{1/3} bandwidth (already available in Table 1) against the Newey and West (1994) estimate with prewhitening, and against a third estimate in which the same procedure is applied but without prewhitening.

Note to Table 11: empirical rejection frequencies for tests of equal predictive ability at 5% nominal size using standard normal asymptotics for various MA(q) processes with θ = 0.75 and alternative bandwidths for the WCE with the Bartlett kernel: T^{1/3}, the Newey and West (1994) estimate with prewhitening (Prew) and the same procedure without prewhitening (No Pre).
In general, using the NW estimate without prewhitening does not yield size as good as when the naive M = T^{1/3} estimate is employed. Prewhitening, on the other hand, does provide some size correction, but the better size for larger q is mostly offset by worse size when q = 1: this suggests that the automatic NW procedure would not fare well when the dependence is relatively weak, and deteriorating size properties for larger θ are indeed documented in Clark (1999). Table 11 therefore shows that even the automatic bandwidth selection with prewhitening from Newey and West (1994) does not offer a complete correction of the size distortion when standard asymptotics is used.

C.4 Minimum MSE bandwidth for the WPE
In this appendix, we report the empirical size of equal predictive ability tests when the WPE estimate of the long run variance with feasible minimum MSE bandwidth is used.
To derive the minimum MSE bandwidth, we follow Phillips (2005) and Sun (2013).
For the average periodogram with bandwidth m, the bias depends on a factor B. The bias factor B is usually unknown, but when u_t = φ u_{t−1} + ε_t with |φ| < 1 and ε_t iid(0, ω^2), then σ^2 = ω^2/(1 − φ)^2 and B = −(π^2/6) · 2φ/(1 − φ)^4 · ω^2, so we approximate σ^4/B^2 with a common plug-in method: we assume such an AR(1) model, estimate φ and then replace the estimated value in the formula for m_MSE. Finally, the feasible MSE bandwidth m̂_MSE is given by the integer part of m_MSE when this is between 1 and T/2, and by 1 or T/2 otherwise.
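As a sketch, the feasible plug-in rule can be implemented as follows. The closed form for m_MSE is elided in the text above, so the code assumes the common squared-bias-plus-variance trade-off MSE(m) ≈ B^2 (m/T)^4 + σ^4/m, which is minimized at m_MSE = (σ^4/(4B^2))^{1/5} T^{4/5}; that functional form, and the function name, are our assumptions rather than the paper's exact expressions.

```python
import numpy as np

def m_mse(u):
    """Feasible minimum-MSE bandwidth for the WPE via an AR(1) plug-in.
    Assumed trade-off: MSE(m) ~ B^2*(m/T)^4 + sigma^4/m, minimized at
    m = (sigma^4/(4*B^2))**(1/5) * T**(4/5)."""
    u = np.asarray(u, dtype=float)
    T = u.size
    # AR(1) plug-in: least-squares phi, innovation variance omega^2
    phi = (u[1:] @ u[:-1]) / (u[:-1] @ u[:-1])
    resid = u[1:] - phi * u[:-1]
    omega2 = resid @ resid / (T - 1)
    # formulas from the text: sigma^2 = omega^2/(1-phi)^2 and
    # B = -(pi^2/6) * 2*phi/(1-phi)^4 * omega^2
    sigma2 = omega2 / (1.0 - phi) ** 2
    B = -(np.pi ** 2 / 6.0) * 2.0 * phi / (1.0 - phi) ** 4 * omega2
    if B == 0.0:
        return max(1, T // 2)
    m = (sigma2 ** 2 / (4.0 * B ** 2)) ** 0.2 * T ** 0.8
    # integer part, clamped to [1, T/2] as described in the text
    return int(min(max(m, 1.0), T / 2))
```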
Notice that the minimum MSE bandwidth trades off bias and variance, but this may not be the best criterion for application in tests, as in testing we are looking at different properties, namely minimum size distortion and maximum power. With standard asymptotics, both the bias and the variance of the estimate of the long run variance cause size distortion in the test, whereas with fixed-smoothing asymptotics the effect of the variance of the estimate is accounted for by the change in the distribution of the test statistic, and we are only concerned about the effect due to the (lower-order) bias. This bias is stronger the larger is the bandwidth m, as with larger bandwidths periodograms that are more distant from frequency zero are used to estimate the spectral density at frequency zero. Bandwidths proportional to T^{4/5} are therefore more prone to causing size distortion in testing. Indeed, results in Table 12 indicate that, as for the NW automatic bandwidth selection, the test is oversized both when standard and when fixed-m asymptotics are used. This is due to the fact that the feasible minimum MSE bandwidth is larger than the T^{1/4}, T^{1/3} or T^{1/2} used in Table 2, resulting in a larger bias. For this reason, we do not recommend using this bandwidth.

C.5 Power comparison with standard asymptotics
In this appendix, we analyze the power of the tests under standard asymptotics when the long run variance is estimated. Results in Figure 5 refer to the case in which a WCE estimate of the long run variance with Bartlett kernel is used, with M = T^{1/3} (left panels) and M = T^{1/2} (right panels). We take again as benchmark the limit local power function obtained from the normal distribution, and we compare it to the simulated local power functions both when standard asymptotics and when fixed-b asymptotics are used; finally, we also include the simulation of the local power after size adjustment.

Figure 5: Finite sample local power using WCE. The figure displays empirical rejection frequencies at 5% nominal size for deviations from the null by cT^{−1/2} and independent innovations. U refers to the unfeasible case in which the unknown variance is used and the test statistic has a standard normal limit distribution. For the feasible tests, the test statistic uses the WCE estimate of the long run variance with Bartlett kernel with M = T^{1/3} (left panels) and M = T^{1/2} (right panels). We use standard critical values (Standard), size-adjusted critical values (Size-Adjusted) and fixed-b critical values (Fixed-b).

Figure 6: Finite sample local power using WPE. The figure displays empirical rejection frequencies at 5% nominal size for deviations from the null by cT^{−1/2} and independent innovations. U refers to the unfeasible case in which the unknown variance is used and the test statistic has a standard normal limit distribution. For the feasible tests, the test statistic uses the WPE estimate of the long run variance with Daniell kernel with m = T^{1/4} (left panels) and m = T^{1/3} (right panels). We use standard critical values (Standard), size-adjusted critical values (Size-Adjusted) and fixed-m critical values (Fixed-m).
Using standard asymptotics, we would be misled into thinking that we attain power comparable to the limit benchmark. However, this is spurious, as we can see from the size distortion and from the distance to the local power function of the size-adjusted test. Results are similar when a WPE estimate of the long run variance is used, as shown in Figure 6, where we use a WPE estimate with Daniell kernel with m = T^{1/4} (left panels) and m = T^{1/3} (right panels). We can therefore conclude that both fixed-b and fixed-m asymptotics mimic the correct power, as the tests have correct size and power very close to the size-adjusted power.

C.6 Power with autocorrelation under the alternative
In this appendix, we analyze the power of the test of equal forecast accuracy when autocorrelation is preserved under the alternative. To this end, we use the data-generating process d_t = c(1 − ρ)T^{−1/2} + ρ d_{t−1} + u_t, where u_t ~ N(0, 1) and ρ = 0.5. This process has µ = cT^{−1/2}, and we test H_0: µ = 0 when c ranges from 0 to 9.
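A minimal sketch of this DGP (the function name and defaults are ours):

```python
import numpy as np

def simulate_local_alternative(T=120, c=3.0, rho=0.5, rng=None):
    """d_t = c*(1-rho)*T**(-0.5) + rho*d_{t-1} + u_t with u_t ~ N(0,1),
    so the stationary mean is mu = c*T**(-0.5); c = 0 recovers the null."""
    rng = np.random.default_rng(rng)
    d = np.zeros(T)
    drift = c * (1.0 - rho) / np.sqrt(T)
    for t in range(1, T):
        d[t] = drift + rho * d[t - 1] + rng.standard_normal()
    return d
```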
As in Section 5.2, we compare the tests with fixed-smoothing asymptotics against a benchmark case in which σ is known and the standard normal limit distribution is used.

Figure caption: The figure displays empirical rejection frequencies at 5% nominal size for deviations from the null by cT^{−1/2} and autocorrelated innovations. U refers to the unfeasible case in which the unknown variance is used and the test statistic has a standard normal limit distribution. For the feasible tests, size-adjusted power is reported. The alternative estimates of the long run variance are: WCE-B for the WCE with Bartlett kernel with M = T^{1/3}, M = T^{1/2} or M = T; WPE-D for the WPE with Daniell kernel and m = T^{1/2}, m = T^{1/3} or m = T^{1/4}.