# Goodness-of-Fit Analysis for Wind Speed Distributions Using Measurement Data from Domestic Wind Farms in Korea

Chunhyun Paik, Yongjoo Chung, and Young Jin Kim

## Abstract

Many countries have strived to expand the adoption of renewable power generation in the transition to a low-carbon society. Wind power is recognized as one of the most promising and scalable renewable energy sources for power generation, but the amount of wind power generation is heavily dependent on wind speed. Therefore, techniques that enable the reliable estimation of wind speed have long been under focus. In this study, statistically appropriate probability distribution functions were explored using wind speed measurement data from wind farm sites in the Republic of Korea (ROK). In particular, the problem of overfitting was investigated in depth by evaluating the fitness of distributions with different numbers of parameters. The suitability of mixed distributions was examined statistically based on information criteria until suitable distributions were established. The results indicated that monthly wind speed data are well fitted by distinct Weibull distributions; thus, planning for wind power generation in the ROK should consider temporal variations in wind speed as revealed by the distribution analyses.

**Keywords:** Wind farm, Wind speed, Weibull distribution, Distribution fitting, Goodness-of-Fit

## 1. Introduction

Global warming is mainly caused by the excessive emission of greenhouse gases (GHGs) and is considered one of the greatest environmental threats worldwide. The Intergovernmental Panel on Climate Change was formed in 1988 to counter climate change, and the Kyoto Protocol, under the United Nations Framework Convention on Climate Change, was signed in 1997 and superseded by the Paris Agreement in 2016. Under the agreement, member countries of the Conference of the Parties (COP) are required to submit national GHG inventories that account for GHG emissions from different sectors. Based on these inventories, it is known that the power generation sector is responsible for a significant portion of GHG emissions, and many countries are striving for the large-scale adoption of renewable power generation. Although the intermittency of renewables poses great challenges in energy system planning, the share of power generation from renewable resources is increasing at a remarkable pace. In particular, the use of wind power has many environmental and societal benefits, and wind energy is regarded as the most mature form of renewable energy from a techno-economic perspective [1]. It has also been noted that the effect of wind generation on climate change may be low alongside various benefits for the environment, economy, and society [2]. Thus, wind power is considered the most promising renewable resource in terms of potential installed capacity [3].

The Korean government recently announced the 9th Master Plan for Long-Term Electricity Supply and Demand, which calls for approximately 26 % of electricity to come from renewable generation by 2034, of which 91 % will be supplied by solar and wind power [4]. In particular, accounting for more than 30 % of renewable energy generation, wind power is expected to become a critical form of power generation. Accordingly, the estimation and prediction of wind turbine output have been the subject of increasing research to better account for the intermittency of wind and variable wind speeds. A wide variety of models and methods have been proposed to predict the power output from wind generation under a range of circumstances [5–7]. The amount of power that can be harvested from the wind largely depends on wind speed and the given specifications of the wind turbine, such as its size and blade length. Thus, it is imperative to accurately estimate wind speeds to project the power that can be generated. Wind speed can generally be considered a random variable and modeled with different probability distribution functions, of which the Weibull distribution is the most widely adopted [8–12]. In previous studies, 2-parameter, 3-parameter, and mixed Weibull distributions have been investigated for how well they model wind speeds with reference to different criteria [13–15]. It should be noted, however, that most previous studies [8–15] have tested the goodness-of-fit (GOF) of wind speed against a single preselected distribution and employed different criteria, resulting in a lack of consistency in interpretation and comparison. This study was designed to derive wind speed probability distributions based on measurement data from wind farms in the Republic of Korea (ROK) from a statistical point of view.
Specifically, a formal statistical procedure for the GOF test was employed to identify the most appropriate distribution function by considering temporal and seasonal variations. When a suitable single distribution cannot be identified owing to distinct variability over a relatively short time period, a mixture of different distributions can be explored; however, this may lead to overfitting, particularly when comparing the GOF of distributions with different numbers of parameters. Therefore, in this study, the Akaike and Bayesian information criteria are proposed for effectively scrutinizing the GOF of mixed distributions.

## 2. Data and Methods

Even though the effective estimation of wind speed distribution can only be achieved using data with a sufficiently high spatiotemporal resolution, available data related to the operation of wind turbines at wind farm sites are often limited because of the closed nature of the domestic electricity market. In the ROK, as the amount of electricity generation directly affects the purchase price of the government-owned distribution company (the Korea Electric Power Corporation, KEPCO), it is difficult to secure such a dataset because of the business confidentiality policy of private electricity generation companies. Therefore, publicly available datasets form the main source of information about wind power generation in the ROK, complemented by some limited information obtained through restricted access granted by power generation companies. The daily average wind speeds from individual wind turbines at three wind farm sites, Hankyung (HK), Sungsan (SS), and Taebaek (TB), in Korea were obtained from 2018 to 2020, as shown in Table I. Note that one of the nine turbines at the HK site was excluded from the analysis because of data unavailability.

It is assumed that daily wind speed follows an identical distribution within a particular month and thus, its distribution is estimated monthly for each site. This assumption takes seasonal (i.e., monthly) variations in wind speed into account while acquiring a sufficiently large number of samples. For example, 93 samples were available for January at the HK and SS sites (i.e., 93 = 31 days/year × 3 years). Figure 1 depicts the variations in daily wind speeds at the different sites and Fig. 2 compares the monthly average daily wind speeds. Significant differences were observed in the monthly average wind speeds, which indicates that seasonal variations must be accounted for.
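
The pooled sample sizes can be checked with a few lines of code. The sketch below counts the daily observations available for one calendar month over 2018 to 2020, assuming exactly one record per day and no missing days; the function name `pooled_sample_size` is illustrative and does not come from the study.

```python
import calendar


def pooled_sample_size(month, years=(2018, 2019, 2020)):
    """Number of daily records for one calendar month pooled over several years.

    Assumes one observation per day and no missing days (an idealization).
    """
    return sum(calendar.monthrange(year, month)[1] for year in years)


if __name__ == "__main__":
    for month in range(1, 13):
        print(calendar.month_abbr[month], pooled_sample_size(month))
```

Under these assumptions January yields 93 samples, in line with the count quoted above, while February yields 85 because 2020 was a leap year.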

When estimating wind speed distributions, possible candidate distribution functions must first be explored. Generally, descriptive statistics based on empirical distribution functions can be effective in checking the normality of the dataset, as shown in Fig. 3(a). Higher-order moments, such as skewness and kurtosis, are also useful for identifying candidate distributions using the Cullen–Frey relationship depicted in Fig. 3(b) [16]. The optimal candidate group for the wind speed distribution in February at the SS site included a normal distribution with skewness and kurtosis of 0 and 3, respectively, a lognormal distribution, and a Weibull distribution. Further details on the construction and interpretation of these graphs are provided elsewhere [16].
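
As a rough illustration of the moment-based screening, the sketch below computes the skewness and (non-excess) kurtosis that a Cullen–Frey graph plots against each other. It uses plain central-moment estimators, which is an assumption; statistical packages may apply bias corrections by default. The wind speed values are hypothetical and are not taken from the study data.

```python
def moment_summary(samples):
    """Return (skewness, kurtosis) computed from plain central moments.

    Kurtosis is the non-excess version, so a normal distribution gives 3.
    """
    n = len(samples)
    mean = sum(samples) / n
    m2 = sum((x - mean) ** 2 for x in samples) / n
    m3 = sum((x - mean) ** 3 for x in samples) / n
    m4 = sum((x - mean) ** 4 for x in samples) / n
    skewness = m3 / m2 ** 1.5
    kurtosis = m4 / m2 ** 2
    return skewness, kurtosis


# Hypothetical daily wind speeds in m/s (illustrative values only).
speeds = [4.2, 5.1, 6.3, 7.8, 5.5, 9.4, 6.1, 4.9, 8.2, 12.6]
skew, kurt = moment_summary(speeds)
# A Cullen-Frey graph places this point at (squared skewness, kurtosis).
print(f"squared skewness = {skew ** 2:.3f}, kurtosis = {kurt:.3f}")
```

A point close to (0, 3) suggests a normal distribution, whereas points away from it suggest skewed candidates such as the lognormal or Weibull distributions.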

Once several candidate distributions were identified, their parameters were estimated using classical methods, such as maximum likelihood and the method of moments. Maximum likelihood, which is the most popular method when the sample size is greater than 30, was employed in this study unless otherwise specified. Statistical analyses were conducted to estimate the monthly wind speed distribution for each of the wind farm sites using MINITAB Release 19 and R 4.1.2.
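
To make the maximum likelihood step concrete, the following is a minimal sketch for the 2-parameter Weibull distribution. It is not the MINITAB or R routine used in the study; it solves the standard profile likelihood equation for the shape parameter by bisection and then recovers the scale in closed form. The function name and the synthetic data are assumptions for illustration.

```python
import math
import random


def fit_weibull_mle(samples, tol=1e-10):
    """Maximum likelihood estimates (shape, scale) of a 2-parameter Weibull.

    Solves the profile likelihood equation for the shape b by bisection:
        sum(x^b * ln x) / sum(x^b) - 1/b - mean(ln x) = 0,
    whose left-hand side is increasing in b. Samples must be strictly
    positive and not all identical.
    """
    logs = [math.log(x) for x in samples]
    mean_log = sum(logs) / len(logs)
    max_log = max(logs)
    if max_log == min(logs):
        raise ValueError("samples must not all be identical")

    def score(shape):
        # Rescale by the largest sample to avoid overflow for large shapes.
        weights = [math.exp(shape * (lx - max_log)) for lx in logs]
        weighted_log = sum(w * lx for w, lx in zip(weights, logs)) / sum(weights)
        return weighted_log - 1.0 / shape - mean_log

    low, high = 1e-3, 1.0
    while score(high) < 0.0:  # expand until the root is bracketed
        high *= 2.0
    while high - low > tol:
        mid = 0.5 * (low + high)
        if score(mid) < 0.0:
            low = mid
        else:
            high = mid
    shape = 0.5 * (low + high)
    scale = (sum(x ** shape for x in samples) / len(samples)) ** (1.0 / shape)
    return shape, scale


# Synthetic check: draw from a Weibull with scale 7 and shape 2, then refit.
random.seed(0)
synthetic = [random.weibullvariate(7.0, 2.0) for _ in range(5000)]
shape_hat, scale_hat = fit_weibull_mle(synthetic)
print(f"shape = {shape_hat:.3f}, scale = {scale_hat:.3f}")
```

With a few thousand synthetic samples the estimates should land close to the generating values; with roughly 90 samples per month, as in this study, the sampling variability is correspondingly larger.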

## 3. Results and Discussion of GOF Analysis

The GOF for the parameter estimation of distribution functions is often tested using
graphical and analytical approaches. One of the most popular graphical approaches
is to compare the empirical and theoretical distributions, as demonstrated in Fig. 4. The most widely used graphs include probability densities, cumulative distributions,
quantile–quantile (*Q*–*Q*) plots, and probability–probability (*P*–*P*) plots, which are shown in Figs. 4(a)-(d), respectively.

Although providing intuitive insights that can help guide subsequent analysis, these
graphical approaches have limited quantitative value. A more rigorous analytical approach
to evaluating the GOF for parameter estimation often involves statistical hypothesis
testing. In this case, the null hypothesis, *H*_{0}, is ‘The wind speed data follow a specified distribution.’ The Kolmogorov–Smirnov
(K–S) test is one of the most popular approaches for determining whether a sample
comes from a population with a specific distribution [17]. Given *N* data points ordered in ascending order (i.e., *Y*_{1}, *Y*_{2}, … , *Y*_{N}), the K–S statistic quantifies the distance between the empirical distribution of
the sample and the cumulative distribution of the reference distribution. Denoted
by *D*, the K–S test statistic is defined as follows:

$$
D = \max_{1 \le i \le N} \left\{ F(Y_i) - \frac{i-1}{N},\; \frac{i}{N} - F(Y_i) \right\},
$$

where *F* (·) denotes the cumulative distribution function of the tested reference distribution.
It should be noted that the distribution of the K–S statistic itself does not depend
on the underlying cumulative distribution function being tested, and that the test tends to be more
sensitive near the center of the distribution than at the tails. Despite its usefulness
as a nonparametric test, the K–S test may not be the most appropriate choice
for analyzing renewable power generation, because important factors such as the intermittency
and peak contribution (i.e., capacity credit) of renewables are closely related to
the distributional characteristics at the tail [18]. Therefore, the Anderson–Darling (A–D) test is proposed as a potentially more suitable
test [19]. The A–D test makes use of the specific distribution in calculating critical values and
is considered more powerful than the K–S test, which makes it more suitable for
evaluating the fit of wind speed distributions. Denoted by *A*^{2}, the A–D test statistic is defined as follows:

$$
A^2 = -N - \frac{1}{N} \sum_{i=1}^{N} (2i - 1) \left[ \ln F(Y_i) + \ln\bigl(1 - F(Y_{N+1-i})\bigr) \right].
$$

*H*_{0} is rejected when the test statistic calculated from the sample is greater than or
equal to the critical value determined by the prescribed significance level, α.
The significance probability, or *p*-value, can also be calculated; it is defined as the probability of observing a result at least as extreme as the one obtained from the sample if *H*_{0} is true. A smaller *p*-value indicates that the sample provides stronger evidence against *H*_{0}, which is rejected if the *p*-value is lower than or equal to the significance level.
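
The two statistics can be computed directly from the ordered sample. The sketch below uses the usual textbook forms: the K–S statistic as the largest gap between the empirical and reference cumulative distributions, and the A–D statistic as a weighted sum of log terms that emphasizes the tails. Critical values and *p*-values are not computed, because they depend on the tested distribution and on whether its parameters were estimated. The wind speeds and Weibull parameters are hypothetical.

```python
import math


def weibull_cdf(x, shape, scale):
    """CDF of the 2-parameter Weibull distribution."""
    return 1.0 - math.exp(-((x / scale) ** shape)) if x > 0 else 0.0


def ks_statistic(samples, cdf):
    """Largest gap between the empirical CDF and the reference CDF."""
    ordered = sorted(samples)
    n = len(ordered)
    gaps = []
    for i, y in enumerate(ordered, start=1):
        f = cdf(y)
        gaps.append(f - (i - 1) / n)  # gap just before the i-th step
        gaps.append(i / n - f)        # gap just after the i-th step
    return max(gaps)


def ad_statistic(samples, cdf):
    """Anderson-Darling A^2; requires 0 < cdf(y) < 1 for every sample."""
    ordered = sorted(samples)
    n = len(ordered)
    total = 0.0
    for i in range(1, n + 1):
        f_low = cdf(ordered[i - 1])   # F(Y_i)
        f_high = cdf(ordered[n - i])  # F(Y_{N+1-i})
        total += (2 * i - 1) * (math.log(f_low) + math.log(1.0 - f_high))
    return -n - total / n


# Hypothetical daily wind speeds (m/s) and assumed Weibull parameters.
speeds = [3.1, 4.4, 5.0, 5.7, 6.2, 6.9, 7.5, 8.3, 9.8, 13.5]


def reference(x):
    return weibull_cdf(x, 2.0, 7.0)


print("D  =", ks_statistic(speeds, reference))
print("A2 =", ad_statistic(speeds, reference))
```

Because the logarithms grow without bound as the reference CDF approaches 0 or 1, a mismatch in the extreme observations inflates *A*^{2} much more than it inflates *D*.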

Considering that distributions with fewer parameters are preferred in practice, three different 2-parameter distributions, namely normal, lognormal, and 2-parameter Weibull, were explored first to test the GOF for the wind speed data. These candidate distributions were chosen based on a preliminary descriptive statistical analysis and the results of previous studies. Wind speed data from the HK site were used to compare the two GOF tests at a significance level of 5 %.

The candidate 2-parameter distributions were tested for their GOF against the monthly wind speed data, and the results are summarized in Table II. The wind speed data were fitted to each distribution month by month. While the K–S test indicated an acceptable fit to one or more of the three distributions for each month, the A–D test provided more conservative results in that, except for March, June, and November, most of the monthly wind speed data were well-fitted only to the lognormal distribution. For June and November, none of the three distributions passed the A–D test, which may be attributed to the fact that the A–D test assigns more weight to the data at the tail to better represent the distributional behavior of wind speed data. For November at the HK site, none of the three distributions was rejected by the K–S test, whereas all were rejected by the A–D test, and noticeable differences between the tails of the empirical and theoretical distributions were observed (Fig. 5).

Because the A–D test may be considered more rigorous than the K–S test for fitting wind speed data, the wind speed data from the other two sites were also tested against the three 2-parameter distributions using the A–D test, as summarized in Table III. No single distribution was found to fit the wind speed data from these sites throughout the year and, as such, different distributions must be employed to model the wind speeds at different times of the year. In other words, spatiotemporal variations in wind speed may not be properly represented by a single distribution, which contradicts the assumption made in previous studies that wind speed follows a 2-parameter Weibull distribution. It is well known that the GOF can generally be improved by employing distribution functions with more parameters within the same distribution family. Therefore, 3-parameter Weibull distributions were subsequently investigated.

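In addition to the scale and shape parameters of the 2-parameter Weibull distribution,
a threshold parameter is included in the 3-parameter Weibull distribution, of which
the density *f*(*x*) is defined as follows:

$$
f(x) = \frac{\beta}{\lambda} \left( \frac{x - \tau}{\lambda} \right)^{\beta - 1} \exp\left[ -\left( \frac{x - \tau}{\lambda} \right)^{\beta} \right],
$$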

where β, λ, and τ denote the shape, scale, and the threshold parameter, respectively.
Note that β > 0, λ > 0, and *x* ≥ τ. The GOF of the 3-parameter Weibull distribution was tested for the wind speed
data from the three sites, which were all well-fitted except for October at the SS
site and January at the TB site (Table IV).

For example, the wind speed data for January at the HK site were not adequately fitted by the 2-parameter Weibull distribution, but were successfully fitted by the 3-parameter distribution at a significance level of 5 % (Table II and Table IV). On the other hand, the wind speed data for February at the HK site were well-fitted by both the 2- and 3-parameter Weibull distributions. It is intuitive that whenever the 2-parameter Weibull distribution properly represents the data, the 3-parameter distribution will fit at least as well, because the 2-parameter distribution is the special case τ = 0 and the maximized likelihood cannot decrease when a parameter is added within the same distribution family.
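
The nesting argument can be written out explicitly. The short derivation below uses the standard 3-parameter Weibull density with shape β, scale λ, and threshold τ; the notation $\hat{\theta}_k$ for the maximum likelihood estimates of the *k*-parameter model is introduced here for illustration only.

```latex
% Setting the threshold to zero recovers the 2-parameter density:
f(x;\beta,\lambda,\tau)\big|_{\tau=0}
  = \frac{\beta}{\lambda}\left(\frac{x}{\lambda}\right)^{\beta-1}
    \exp\left[-\left(\frac{x}{\lambda}\right)^{\beta}\right]
  = f(x;\beta,\lambda).

% Hence the 3-parameter maximum is taken over a larger parameter set:
\max_{\beta,\lambda,\tau} \ln L(\beta,\lambda,\tau)
  \;\ge\; \max_{\beta,\lambda} \ln L(\beta,\lambda,0)
  \quad\Longrightarrow\quad
  \ln L(\hat{\theta}_3) \ge \ln L(\hat{\theta}_2).
```

A larger maximized likelihood by itself therefore says nothing about whether the extra parameter is warranted; that question requires a formal comparison.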

When both the 2- and 3-parameter Weibull distributions fit the data properly, it is
important to determine which distribution should be used for modeling wind speeds. However, simply
comparing the *p*-values of the respective GOF tests is not appropriate because the numbers of parameters
differ. Instead, an incremental analysis may be conducted to confirm whether the addition
of parameters contributes to a statistically significant improvement in the GOF. This
can be carried out using the popular likelihood ratio test (LRT). In this case, *H*_{0} is that ‘The 2-parameter Weibull distribution fits the data as well as the
3-parameter Weibull distribution’ (i.e., τ = 0). The test statistic of the LRT can be written as
follows:

$$
\mathrm{LRT} = -2 \ln \frac{L(\hat{\theta}_2)}{L(\hat{\theta}_3)} = 2 \left[ \ln L(\hat{\theta}_3) - \ln L(\hat{\theta}_2) \right],
$$

where $L(\hat{\theta}_k)$ denotes the maximized likelihood of the *k*-parameter Weibull distribution (*k* = 2, 3). It should be noted that the LRT
statistic asymptotically follows the chi-square distribution with 1 degree of freedom, that is, LRT
∼ χ^{2}(1). The results of the LRT for the monthly wind speed data are also presented in
Table IV; except for two or three months at each site, the 3-parameter Weibull distributions
generally provided a good fit. As an example, there was no evidence that the addition
of the threshold parameter improved the GOF for February at the HK site (*p*-value = 0.093), whereas the GOF was significantly improved by introducing this parameter
for March (*p*-value = 0.019).
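
The LRT decision can be sketched in a few lines once the two maximized log-likelihoods are available. For 1 degree of freedom, the chi-square survival function reduces to the complementary error function, so no statistics library is needed. The log-likelihood values below are hypothetical and are not taken from Table IV.

```python
import math


def likelihood_ratio_test(loglik_restricted, loglik_full):
    """LRT statistic and p-value for one extra parameter (chi-square, 1 d.o.f.).

    For 1 degree of freedom, P(chi2 > x) = erfc(sqrt(x / 2)).
    """
    # Guard against tiny negative values caused by numerical optimization.
    statistic = max(0.0, 2.0 * (loglik_full - loglik_restricted))
    p_value = math.erfc(math.sqrt(statistic / 2.0))
    return statistic, p_value


# Hypothetical maximized log-likelihoods of the 2- and 3-parameter fits.
stat, p = likelihood_ratio_test(loglik_restricted=-212.4, loglik_full=-209.9)
print(f"LRT = {stat:.2f}, p-value = {p:.3f}")
if p <= 0.05:
    print("The threshold parameter significantly improves the fit.")
else:
    print("No evidence that the threshold parameter improves the fit.")
```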

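It is worth noting that the wind speed data for October at the SS site, and January at the TB site, could not be properly fitted with these distributions; further analysis confirmed that no suitable 2- or 3-parameter distributions could be identified. Indeed, a mixture of distributions may need to be explored when models based on a single distribution provide poor characterization [20]. Here, a mixture of 2-parameter Weibull distributions was subsequently investigated, defined by the weighted sum of two or more 2-parameter Weibull distributions, as follows:

$$
f(x) = \sum_{i=1}^{k} \omega_i f_i(x), \qquad f_i(x) = \frac{\beta_i}{\lambda_i} \left( \frac{x}{\lambda_i} \right)^{\beta_i - 1} \exp\left[ -\left( \frac{x}{\lambda_i} \right)^{\beta_i} \right],
$$

where *k* denotes the number of 2-parameter Weibull distributions to be aggregated, and ω_{i} is the weight of the *i*th distribution, such that

$$
\sum_{i=1}^{k} \omega_i = 1.
$$

Because a mixture involves more parameters than a single distribution, its GOF was compared using the Akaike information criterion (AIC) and the Bayesian information criterion (BIC), which are defined as follows:

$$
\mathrm{AIC} = -2 \ln L(\hat{\theta}) + 2p, \qquad \mathrm{BIC} = -2 \ln L(\hat{\theta}) + p \ln N,
$$

where *N* and *p* denote the sample size and number of parameters involved in each model, respectively,
and $L(\hat{\theta})$ denotes the maximized likelihood of the model.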

The AIC and BIC values for the mixed Weibull distribution were lower than those for the 3-parameter Weibull distribution, indicating that a mixture of 2-parameter Weibull distributions provides a better fit to the data than the 3-parameter Weibull distribution, even after the additional parameters are penalized. Such an improvement in GOF does not necessarily mean that a mixed Weibull distribution provides the best possible fit to the data, but indicates that it can be considered a strong candidate distribution. Following these procedures, the most appropriate distributions for fitting the wind speed data are summarized in Table VI. Notably, the wind speed data for October at the SS site, and January at the TB site, were better represented by a mixture of two 2-parameter Weibull distributions (Fig. 6).
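
The comparison based on information criteria can be illustrated as follows. The sketch evaluates the log-likelihood of a two-component Weibull mixture and converts it into AIC and BIC using their standard definitions. All data and parameter values are hypothetical; in practice, the mixture parameters would first be estimated by maximum likelihood (for example, with an expectation-maximization routine), which is not shown here.

```python
import math


def weibull_pdf(x, shape, scale):
    """Density of the 2-parameter Weibull distribution (zero for x <= 0)."""
    if x <= 0:
        return 0.0
    z = x / scale
    return (shape / scale) * z ** (shape - 1) * math.exp(-(z ** shape))


def mixture_log_likelihood(samples, components):
    """Log-likelihood of a Weibull mixture.

    `components` is a list of (weight, shape, scale) tuples whose weights
    sum to 1.
    """
    return sum(
        math.log(sum(w * weibull_pdf(x, b, s) for w, b, s in components))
        for x in samples
    )


def aic(log_likelihood, n_params):
    """Akaike information criterion: lower is better."""
    return -2.0 * log_likelihood + 2.0 * n_params


def bic(log_likelihood, n_params, n_samples):
    """Bayesian information criterion: lower is better."""
    return -2.0 * log_likelihood + n_params * math.log(n_samples)


# Hypothetical bimodal wind speeds (m/s) and assumed mixture parameters.
speeds = [2.1, 2.8, 3.0, 3.4, 3.9, 4.4, 9.5, 10.2, 11.0, 12.3]
mixture = [(0.6, 4.0, 3.5), (0.4, 8.0, 11.0)]
loglik = mixture_log_likelihood(speeds, mixture)
# Two components: 2 shapes + 2 scales + 1 free weight = 5 parameters.
print("AIC =", aic(loglik, 5))
print("BIC =", bic(loglik, 5, len(speeds)))
```

A mixture of *k* 2-parameter Weibull distributions has 3*k* − 1 free parameters, so a two-component mixture is charged for five parameters against three for the 3-parameter Weibull distribution. BIC applies the heavier penalty whenever ln *N* exceeds 2, which is the case for the monthly samples considered here.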

## 4. Conclusions

The accurate estimation of wind speed distributions is essential for the effective prediction and management of wind power generation, which plays a central role in expanding renewable power generation and mitigating GHG emissions. That there will be spatiotemporal variations in wind speed is intuitive, and therefore, wind speed distribution will vary greatly between regions and over time. This study explored a formal statistical procedure for deriving appropriate distribution functions that fit monthly wind speed data obtained from three different wind farms in the ROK. The candidate distributions were first identified using descriptive statistical approaches, and then tested for GOF following formal statistical tests. The A–D test may be considered more rigorous than the standard K–S test because it better captures the distributional behavior at the tail of the data distribution. Furthermore, 2- and 3-parameter Weibull distributions were compared for their GOF to the wind speed data, and the results indicated that the wind speed data are better fitted by including additional parameters. When a suitable distribution cannot be obtained, the application of mixed Weibull distributions can be considered, and the problem of overfitting can be addressed by testing the obtained GOF with information criteria, such as AIC and BIC.

One of the major shortcomings of this study is its limited ability to account for possible overfitting and underfitting, which may be overcome with more extensive data collection. In addition, filtering outliers via data preprocessing may further improve the understanding of wind-speed distributions, which warrants further investigation. Overall, the results of this study are expected to be extended and linked to the derivation and prediction of power output from wind generation in the ROK and elsewhere.

## Article information

###### Articles from Applied Science and Convergence Technology are provided here courtesy of **Applied Science and Convergence Technology**

## References

- Gómez-Lázaro E., Bueso M. C., Kessler M., Martín-Martínez E., Zhang J., Hodge B.-M., and Molina-García A. Array 2016;9:91.