
Research Paper

Applied Science and Convergence Technology 2022; 31(4): 93-98

Published online July 30, 2022


Copyright © The Korean Vacuum Society.

Goodness-of-Fit Analysis for Wind Speed Distributions Using Measurement Data from Domestic Wind Farms in Korea

Chunhyun Paik^a, Yongjoo Chung^b, and Young Jin Kim^c,∗

^a Division of Industrial Convergence Systems Engineering, Dongeui University, Busan 47340, Republic of Korea
^b Department of e-Business, Busan University of Foreign Studies, Busan 46234, Republic of Korea
^c Department of Systems Management and Engineering, Pukyong National University, Busan 48513, Republic of Korea

∗ Correspondence to: youngk@pknu.ac.kr

Received: July 6, 2022; Accepted: July 29, 2022

Many countries have strived to expand the adoption of renewable power generation in the transition to a low-carbon society. Wind power is recognized as one of the most promising and scalable renewable energy sources for power generation, but the amount of wind power generation is heavily dependent on wind speed. Therefore, techniques that enable the reliable estimation of wind speed have long been under focus. In this study, statistically appropriate probability distribution functions were explored using wind speed measurement data from wind farm sites in the Republic of Korea. In particular, the problem of overfitting was investigated in depth by evaluating the fitness of distributions using different parameters. The suitability of mixed distributions was examined statistically based on the information criteria until suitable distributions were established. The results indicated that monthly wind speed data are a good fit with distinct Weibull distributions; thus, planning for wind power generation in the ROK should consider temporal variations in wind speed as revealed by the distribution analyses.

Keywords: Wind farm, Wind speed, Weibull distribution, Distribution fitting, Goodness-of-Fit

Global warming is mainly caused by the excessive emission of greenhouse gases (GHGs) and is considered one of the greatest environmental threats worldwide. The Intergovernmental Panel on Climate Change was formed in 1988 to counter climate change, and the Kyoto Protocol, under the United Nations Framework Convention on Climate Change, was signed in 1997 and superseded by the Paris Agreement in 2016. Under the agreement, member countries of the Conference of the Parties (COP) are required to submit national GHG inventories that account for GHG emissions from different sectors. Based on these inventories, it is known that the power generation sector is responsible for a significant portion of GHG emissions, and many countries are striving for the large-scale adoption of renewable power generation. Although the intermittency of renewables poses great challenges in energy system planning, the share of power generation from renewable resources is increasing at a remarkable pace. In particular, the use of wind power has many environmental and societal benefits, and wind energy is regarded as the most mature form of renewable energy from a techno-economic perspective [1]. It has also been noted that the effect of wind generation on climate change may be low alongside various benefits for the environment, economy, and society [2]. Thus, wind power is considered the most promising renewable resource in terms of potential installed capacity [3].

The Korean government recently announced the 9th Master Plan for Long-Term Electricity Supply and Demand, which advocates that approximately 26 % of electricity should come from renewable generation by 2034, of which 91 % will be supplied by solar and wind power [4]. In particular, accounting for more than 30 % of renewable energy generation, wind power is expected to become a critical form of power generation. Accordingly, the estimation and prediction of wind turbine output have been the subject of increasing research to better account for the intermittency of wind and variable wind speeds. A wide variety of models and methods have been proposed to predict the power output from wind generation under a range of circumstances [5–7]. The amount of power that can be harvested from the wind largely depends on wind speed and the given specifications of the wind turbine, such as its size and blade length. Thus, it is imperative to accurately estimate wind speeds to project the power that can be generated. Wind speed can generally be considered a random variable following specific distributions and can be modeled with different probability distribution functions, of which the Weibull distribution is the most widely adopted [8–12]. In previous studies, 2-parameter, 3-parameter, and mixed Weibull distributions have been investigated for their fitness in modeling wind speeds with reference to different criteria [13–15]. It should be noted, however, that most previous studies [8–15] have tested the goodness-of-fit (GOF) of wind speed against a single preselected distribution and employed different criteria, resulting in a lack of consistency in interpretation and comparison. This study was designed to derive wind speed probability distributions based on measurement data from wind farms in the Republic of Korea (ROK) from a statistical point of view.
Specifically, a formal statistical procedure for the GOF test was employed to identify the most appropriate distribution function by considering temporal and seasonal variations. Often, as a suitable distribution cannot be identified owing to distinct variability over a relatively short time period, a mixture of different distributions can be explored; however, this may lead to overfitting, particularly when comparing the GOF for distributions with different numbers of parameters. Therefore, in this study, the Akaike and Bayesian information criteria are proposed for effectively scrutinizing the GOF of mixed distributions.

Even though the effective estimation of wind speed distribution can only be achieved using data with a sufficiently high spatiotemporal resolution, available data related to the operation of wind turbines at wind farm sites are often limited because of the closed nature of the domestic electricity market. In the ROK, as the amount of electricity generation directly affects the purchase price of the government-owned distribution company (the Korea Electric Power Corporation, KEPCO), it is difficult to secure such a dataset because of the business confidentiality policy of private electricity generation companies. Therefore, publicly available datasets form the main source of information about wind power generation in the ROK, complemented by some limited information obtained through restricted access granted by power generation companies. The daily average wind speeds from individual wind turbines at three wind farm sites, Hankyung (HK), Sungsan (SS), and Taebaek (TB), in Korea were obtained from 2018 to 2020, as shown in Table I. Note that one of the nine turbines at the HK site was excluded from the analysis because of data unavailability.

Table 1. Daily wind speed data from three wind farms in the Republic of Korea.

Wind farm | Number of turbines | Capacity (MW) | Data collection period

It is assumed that daily wind speed follows an identical distribution within a particular month and thus, its distribution is estimated monthly for each site. This assumption takes seasonal (i.e., monthly) variations in wind speed into account while acquiring a sufficiently large number of samples. For example, 93 samples were available for January at the HK and SS sites (i.e., 93 = 31 days/year × 3 years). Figure 1 depicts the variations in daily wind speeds at the different sites and Fig. 2 compares the monthly average daily wind speeds. Significant differences were observed in the monthly average wind speeds, which indicates that seasonal variations must be accounted for.
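The pooled monthly sample sizes implied by this assumption can be checked with a short sketch. It assumes a complete daily record for 2018 to 2020 with no missing days, which the text confirms only for January at the HK and SS sites; the function name `pooled_sample_size` is illustrative.

```python
import calendar


def pooled_sample_size(month, years):
    """Count daily observations for one calendar month pooled over several years."""
    return sum(calendar.monthrange(year, month)[1] for year in years)


# Assumption: a complete daily record for 2018-2020 with no missing days.
years = [2018, 2019, 2020]
sizes = {month: pooled_sample_size(month, years) for month in range(1, 13)}

for month, size in sizes.items():
    print(calendar.month_abbr[month], size)
```

Under this assumption, every month yields well over 30 pooled observations, which is the practical motivation for pooling the same calendar month across years.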

Figure 1. Variations in daily wind speed.

Figure 2. Comparison of monthly average daily wind speed.

When estimating wind speed distributions, possible candidate distribution functions must first be explored. Generally, descriptive statistics based on empirical distribution functions can be effective in checking the normality of the dataset, as shown in Fig. 3(a). Higher-order moments, such as skewness and kurtosis, are also useful for identifying candidate distributions using the Cullen–Frey relationship depicted in Fig. 3(b) [16]. The optimal candidate group for the wind speed distribution in February at the SS site included a normal distribution with skewness and kurtosis of 0 and 3, respectively, a lognormal distribution, and a Weibull distribution. Further details on the construction and interpretation of these graphs are provided elsewhere [16].
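As an illustration of the moments used in a Cullen–Frey graph, the following sketch computes the sample skewness and (non-excess) kurtosis with simple moment-based estimators. The data values are hypothetical, and statistical packages may apply bias corrections that give slightly different numbers.

```python
def skewness_kurtosis(data):
    """Return moment-based sample skewness and non-excess kurtosis."""
    n = len(data)
    mean = sum(data) / n
    m2 = sum((x - mean) ** 2 for x in data) / n  # second central moment
    m3 = sum((x - mean) ** 3 for x in data) / n  # third central moment
    m4 = sum((x - mean) ** 4 for x in data) / n  # fourth central moment
    skewness = m3 / m2 ** 1.5
    kurtosis = m4 / m2 ** 2  # equals 3 for a normal distribution
    return skewness, kurtosis


# Hypothetical daily average wind speeds (m/s), for illustration only.
speeds = [4.1, 5.3, 6.8, 7.2, 7.9, 8.4, 9.1, 10.5, 12.3, 15.0]
skew, kurt = skewness_kurtosis(speeds)
# A Cullen-Frey graph plots kurtosis against the square of skewness.
print(skew ** 2, kurt)
```

The resulting point is then compared with the locations of the candidate families on the graph, where the normal distribution sits at (0, 3).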

Figure 3. Descriptive statistical tools for exploring candidate distribution functions for February at the Sungsan (SS) site. (a) Empirical distribution functions. (b) Cullen and Frey graph.

Once several candidate distributions were identified, the parameters of the corresponding distributions were estimated using classical statistical methods, such as the maximum likelihood method and the method of moments. The maximum likelihood method, which is the most popular choice when the sample size is greater than 30, was employed in this study unless otherwise specified. Statistical analyses were conducted to estimate the monthly wind speed distribution for each of the wind farm sites using MINITAB Release 19 and R 4.1.2.
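A minimal sketch of maximum likelihood estimation for the 2-parameter Weibull distribution is given below. It is not the MINITAB or R routine used in the study: it solves the standard profile-likelihood equation for the shape parameter by bisection and then obtains the scale in closed form. The synthetic sample and the function name `weibull_mle` are assumptions for illustration.

```python
import math
import random


def weibull_mle(data, lo=0.05, hi=50.0, tol=1e-10):
    """Estimate (shape, scale) of a 2-parameter Weibull by maximum likelihood."""
    logs = [math.log(x) for x in data]
    mean_log = sum(logs) / len(data)

    def score(beta):
        # Profile-likelihood equation for the shape; its root is the MLE.
        powers = [x ** beta for x in data]
        weighted = sum(p * l for p, l in zip(powers, logs))
        return weighted / sum(powers) - 1.0 / beta - mean_log

    # score() is increasing in beta, so bisection on [lo, hi] finds the root.
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if score(mid) > 0.0:
            hi = mid
        else:
            lo = mid
    beta = 0.5 * (lo + hi)
    lam = (sum(x ** beta for x in data) / len(data)) ** (1.0 / beta)
    return beta, lam


# Synthetic sample; random.weibullvariate takes (scale, shape) in that order.
random.seed(0)
sample = [random.weibullvariate(8.491, 2.837) for _ in range(2000)]
beta_hat, lam_hat = weibull_mle(sample)
print(beta_hat, lam_hat)
```

Bisection is chosen here for robustness; a Newton iteration would converge faster but needs a reasonable starting value.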

The GOF for the parameter estimation of distribution functions is often tested using graphical and analytical approaches. One of the most popular graphical approaches is to compare the empirical and theoretical distributions, as demonstrated in Fig. 4. The most widely used graphs include probability densities, cumulative distributions, quantile–quantile (QQ) plots, and probability–probability (PP) plots, which are shown in Figs. 4(a)-(d), respectively.

Figure 4. Useful graphical tools for goodness-of-fit testing. (a) Comparison of probability densities. (b) Comparison of cumulative distributions. (c) Quantile–quantile plot. (d) Probability–probability plot.

Although providing intuitive insights that can help guide subsequent analysis, these graphical approaches have limited quantitative value. A more rigorous analytical approach to evaluating the GOF for parameter estimation often involves statistical hypothesis testing. In this case, the null hypothesis, H0, is ‘The wind speed data follow a specified distribution.’ The Kolmogorov–Smirnov (K–S) test is one of the most popular approaches for determining whether a sample comes from a population with a specific distribution [17]. Given N data points sorted in ascending order (i.e., Y1, Y2, …, YN), the K–S statistic quantifies the distance between the empirical distribution of the sample and the cumulative distribution of the reference distribution. Denoted by D, the K–S test statistic is defined as follows:

$$D = \max_{1 \le i \le N} \left[ F(Y_i) - \frac{i-1}{N},\ \frac{i}{N} - F(Y_i) \right],$$

where F(·) denotes the cumulative distribution function of the tested reference distribution. It should be noted that the distribution of the K–S statistic itself does not depend on the underlying cumulative distribution function being tested, and that the test tends to be more sensitive near the center of the distribution than at the tails. Despite its usefulness as a nonparametric test, the K–S statistic may not be the most appropriate choice for analyzing renewable power generation, because important factors such as the intermittency and peak contribution (i.e., capacity credit) of renewables are closely related to the distributional characteristics at the tail [18]. Therefore, the Anderson–Darling (A–D) test is proposed as a potentially more suitable test [19]. Because the A–D test makes use of the specific distribution in calculating critical values, it is considered more powerful than the K–S test and more suitable for parameter estimation in the case of wind speed distributions. Denoted by A², the A–D test statistic is defined as follows:

$$A^2 = -N - \sum_{i=1}^{N} \frac{2i-1}{N} \left[ \ln F(Y_i) + \ln\left(1 - F(Y_{N+1-i})\right) \right].$$
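Both statistics can be computed directly from the sorted sample, as the sketch below illustrates. It uses the standard textbook forms of D and A², a 2-parameter Weibull reference distribution with illustrative parameters, and hypothetical wind speed values; critical values and p-values are not computed here because they depend on the tested distribution.

```python
import math


def weibull_cdf(x, beta, lam):
    """CDF of the 2-parameter Weibull with shape beta and scale lam."""
    return 1.0 - math.exp(-((x / lam) ** beta))


def ks_statistic(data, cdf):
    """K-S statistic: largest gap between empirical and reference CDFs."""
    y = sorted(data)
    n = len(y)
    # With 0-based index i, the empirical CDF jumps from i/n to (i+1)/n at y[i].
    return max(max(cdf(v) - i / n, (i + 1) / n - cdf(v)) for i, v in enumerate(y))


def ad_statistic(data, cdf):
    """A-D statistic: tail-weighted discrepancy between sample and reference CDF."""
    y = sorted(data)
    n = len(y)
    total = sum(
        (2 * i - 1) * (math.log(cdf(y[i - 1])) + math.log(1.0 - cdf(y[n - i])))
        for i in range(1, n + 1)
    )
    return -n - total / n


# Hypothetical daily wind speeds (m/s) and illustrative Weibull parameters.
speeds = [3.2, 4.8, 5.5, 6.1, 7.0, 7.4, 8.2, 9.0, 10.3, 12.6]
reference = lambda x: weibull_cdf(x, 2.837, 8.491)
print(ks_statistic(speeds, reference), ad_statistic(speeds, reference))
```

The logarithms in the A–D sum grow without bound as F approaches 0 or 1, which is why observations in the tails contribute more to A² than to D.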

H0 is rejected when the test statistic calculated from the sample is greater than or equal to the critical value determined by the prescribed significance level, α. The p-value, defined as the probability of observing a result at least as extreme as that of the sample if H0 is true, can also be calculated. A smaller p-value indicates that the sample provides stronger evidence against H0, which is rejected if the p-value is lower than or equal to the significance level.

Considering that distributions with fewer parameters are preferred in practice, three different 2-parameter distributions, namely normal, lognormal, and 2-parameter Weibull, were explored first to test the GOF for the wind speed data. These candidate distributions were chosen based on a preliminary descriptive statistical analysis and the results of previous studies. Wind speed data from the HK site were used to compare the two GOF tests (K–S and A–D) at a significance level of 5 %.

The candidate 2-parameter distributions were tested for their GOF against the monthly wind speed data, and the results are summarized in Table II. The wind speed data were fitted to the different distributions month by month. While the K–S test revealed that the GOF to one or more of the three distributions for each month was significant, the A–D test provided more conservative results in that, except for March, June, and November, most of the monthly wind speed data were well-fitted only to the lognormal distribution. There were no statistically significant distributions fitted to the wind speed data for June and November, which may be attributed to the fact that the A–D test assigns more weight to the data at the tail to better represent the distributional behavior of wind speed data. For November at the HK site, the data were significantly fitted to each of the three distributions based on the K–S test but not the A–D test, and noticeable differences between the tails of the empirical and theoretical distributions were observed (Fig. 5).

Table 2. Goodness-of-fit test for wind speed data from Hankyung using 2-parameter Weibull distributions.













* Rejection of the null hypothesis at the significance level 5 %. A–D, Anderson–Darling test; K–S, Kolmogorov–Smirnov test.

** p-value.

Figure 5. Quantile–quantile plot of wind speed data for November from the Hankyung (HK) site.

The A–D test may be considered more rigorous than the K–S test to fit the wind speed data, and the wind speed data from the other two sites were also tested for their fitness for the three 2-parameter distributions using the A–D test, as summarized in Table III. No single distribution was found to fit the wind speed data from these sites and, as such, different distributions must be employed to model the wind speeds at different times of the year. In other words, spatiotemporal variations in wind speed may not be properly represented by a specific single distribution, which contradicts the assumptions made in previous studies that wind speed follows a 2-parameter Weibull distribution. It is well known that the GOF can generally be improved by employing distribution functions with more parameters within the same distribution family. Therefore, 3-parameter Weibull distributions were subsequently investigated.

Table 3. A–D test of wind speed data from Sungsan and Taebaek using 2-parameter Weibull distributions.

Mon. | Sungsan site | Taebaek site












* Rejection of the null hypothesis at the significance level 5 %.

** p-value.

In addition to the scale and shape parameters of the 2-parameter Weibull distribution, a threshold parameter is added in the 3-parameter Weibull distribution, of which the density f(x) is defined as follows:

$$f(x) = \frac{\beta}{\lambda} \left( \frac{x-\tau}{\lambda} \right)^{\beta-1} \exp\left[ -\left( \frac{x-\tau}{\lambda} \right)^{\beta} \right],$$

where β, λ, and τ denote the shape, scale, and threshold parameters, respectively. Note that β > 0, λ > 0, and x ≥ τ. The GOF of the 3-parameter Weibull distribution was tested for the wind speed data from the three sites, which were all well-fitted except for October at the SS site and January at the TB site (Table IV).
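For illustration, the 3-parameter Weibull density and its cumulative distribution can be coded as below, using the standard form with shape β, scale λ, and threshold τ. The parameter values are those reported for January at the HK site in Table VI; the numerical integration is only a sanity check that the density and the distribution function agree, and the helper names are hypothetical.

```python
import math


def weibull3_pdf(x, beta, lam, tau=0.0):
    """Density of the 3-parameter Weibull; tau = 0 gives the 2-parameter case."""
    if x <= tau:
        # Simplification: the boundary value is set to 0, which is exact for beta > 1.
        return 0.0
    z = (x - tau) / lam
    return (beta / lam) * z ** (beta - 1.0) * math.exp(-(z ** beta))


def weibull3_cdf(x, beta, lam, tau=0.0):
    """Cumulative distribution function of the 3-parameter Weibull."""
    if x <= tau:
        return 0.0
    return 1.0 - math.exp(-(((x - tau) / lam) ** beta))


def trapezoid(func, a, b, steps=20000):
    """Composite trapezoidal rule for a scalar function on [a, b]."""
    h = (b - a) / steps
    inner = sum(func(a + i * h) for i in range(1, steps))
    return h * (0.5 * func(a) + inner + 0.5 * func(b))


# (beta, lambda, tau) for January at the HK site, taken from Table VI.
beta, lam, tau = 1.833, 5.042, 3.869
area = trapezoid(lambda x: weibull3_pdf(x, beta, lam, tau), tau, 30.0)
print(area, weibull3_cdf(30.0, beta, lam, tau))
```

The threshold τ simply shifts the support of the distribution, so no probability mass is assigned to wind speeds below τ.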

Table 4. A–D test results of wind speed data for 3-parameter Weibull distributions.

Mon. | Wind Farm Site
Statistic | p-value | p-value of LRT | Statistic | p-value | p-value of LRT | Statistic | p-value | p-value of LRT












* Rejection of the null hypothesis at the significance level 5 %.

For example, the wind speed data for January at the HK site were not well fitted by the 2-parameter Weibull distribution, but were successfully fitted using the 3-parameter distribution at a significance level of 5 % (Table II and Table IV). On the other hand, the wind speed data for February at the HK site were well-fitted using both the 2- and 3-parameter Weibull distributions. It is intuitive that whenever the 2-parameter Weibull distribution properly represents the data, the 3-parameter distribution will fit at least as well, because the 2-parameter distribution is the special case of the 3-parameter distribution with τ = 0 and the maximized likelihood cannot decrease as more parameters are included within the same distribution family.

When both the 2- and 3-parameter Weibull distributions fit the data properly, it is important to determine which distribution should be used for modeling wind speeds. Simply comparing the p-values of the respective GOF tests is not appropriate, however, because the numbers of parameters differ. Instead, an incremental analysis may be conducted to confirm whether the addition of parameters contributes to a statistically significant improvement in the GOF. This can be carried out using the popular likelihood ratio test (LRT). In this case, H0 is that ‘The GOF of the 2-parameter Weibull distribution is better than that of the 3-parameter Weibull distribution.’ The test statistic of the LRT can be written as follows:

$$\mathrm{LRT} = -2 \ln \frac{L(\hat{\theta}_2)}{L(\hat{\theta}_3)} = 2\left[ \ln L(\hat{\theta}_3) - \ln L(\hat{\theta}_2) \right],$$

where θ̂_k and L(θ̂_k) denote the maximum likelihood estimator (MLE) and the likelihood function of the k-parameter Weibull distribution, respectively. It should be noted that the LRT test statistic follows the chi-square distribution with 1 degree of freedom, that is, LRT ∼ χ²(1). The results of the LRT for the monthly wind speed data are also presented in Table IV, and except for two or three months at each site, the 3-parameter Weibull distributions generally provided a good fit. As an example, there was no evidence that the addition of the threshold parameter improved the GOF for February at the HK site (p-value = 0.093), whereas the GOF was significantly improved by introducing this parameter for March (p-value = 0.019).
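The LRT decision can be sketched as follows, assuming the maximized log-likelihoods of the two fits are already available. The log-likelihood values below are hypothetical, and the p-value uses the identity that the upper tail of a chi-square distribution with 1 degree of freedom equals erfc(sqrt(s/2)).

```python
import math


def lrt_statistic(loglik_2p, loglik_3p):
    """LRT statistic from the maximized log-likelihoods of the nested fits."""
    return 2.0 * (loglik_3p - loglik_2p)


def chi2_1df_pvalue(stat):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    stat = max(stat, 0.0)  # guard against tiny negative values from round-off
    return math.erfc(math.sqrt(stat / 2.0))


# Hypothetical maximized log-likelihoods, for illustration only.
loglik_2p, loglik_3p = -230.0, -227.2
stat = lrt_statistic(loglik_2p, loglik_3p)
p_value = chi2_1df_pvalue(stat)
alpha = 0.05
print(stat, p_value, "threshold parameter warranted" if p_value <= alpha else "keep 2-parameter")
```

With a 5 % significance level, the threshold parameter is retained only when the statistic exceeds roughly 3.84, the corresponding chi-square critical value.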

It is worth noting that the wind speed data for October at the SS site, and January at the TB site, could not be properly fitted with these distributions; further analysis confirmed that no suitable 2- or 3-parameter distributions could be identified. Indeed, a mixture of distributions may need to be explored when models based on a single distribution provide poor characterization [20]. Here, a mixture of 2-parameter Weibull distributions was subsequently investigated, defined by the weighted sum of two or more 2-parameter Weibull distributions, as follows:

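$$f(x) = \sum_{i=1}^{k} \omega_i \frac{\beta_i}{\lambda_i} \left( \frac{x}{\lambda_i} \right)^{\beta_i - 1} \exp\left[ -\left( \frac{x}{\lambda_i} \right)^{\beta_i} \right],$$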

where k denotes the number of 2-parameter Weibull distributions to be aggregated, and ω_i is the weight of the ith distribution, such that $\sum_{i=1}^{k} \omega_i = 1$. However, when using mixed distributions, more parameters need to be estimated, which may lead to overfitting. Furthermore, calculating the critical values of the LRT test statistic for mixed Weibull distributions is not straightforward. Where a formal procedure such as LRT is inapplicable, information criteria are widely implemented to compare the GOF of different distributions, with the Akaike information criterion (AIC) and Bayesian information criterion (BIC) being most widely used, which are defined as follows:

$$\mathrm{AIC} = -2 \ln L(\hat{\theta}) + 2p, \qquad \mathrm{BIC} = -2 \ln L(\hat{\theta}) + p \ln N,$$

where N and p denote the sample size and number of parameters involved in each model, respectively, and L(θ̂) represents the likelihood function evaluated at the MLE θ̂. These information criteria compare the GOF of the distributions considering the number of parameters to be estimated. Given any two candidate distributions, those involving more parameters are penalized; thus, a lower information criterion value is preferred. Even if several 2-parameter Weibull distributions are aggregated to better fit the data, only a mixture of two 2-parameter Weibull distributions is compared against the 3-parameter Weibull distribution because of concerns related to overfitting. Here, these information criteria were calculated for the wind speed data for October at the SS site and January at the TB site, as shown in Table V.

Table 5. Comparison of the Akaike information criterion (AIC) and Bayesian information criterion (BIC).

Data | 3-parameter Weibull (AIC) | 3-parameter Weibull (BIC) | Mixed 2-parameter Weibull (AIC) | Mixed 2-parameter Weibull (BIC)
Oct. at Sungsan | 452.54 | 460.14 | 438.32 | 450.98
Jan. at Taebaek | 316.64 | 323.02 | 307.37 | 318.00
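The relationship between the two criteria can be illustrated with the October values for Sungsan in Table V. The sketch below assumes the standard definitions AIC = −2 ln L + 2p and BIC = −2 ln L + p ln N, reads each pair of tabulated values as (AIC, BIC), takes N = 93 pooled October observations (31 days over 3 years), and counts p = 3 parameters for the 3-parameter Weibull and p = 5 for the two-component mixture (two shapes, two scales, one free weight). These counts are assumptions, although the tabulated values are consistent with them.

```python
import math


def aic(loglik, p):
    """Akaike information criterion from a maximized log-likelihood."""
    return -2.0 * loglik + 2.0 * p


def bic(loglik, p, n):
    """Bayesian information criterion from a maximized log-likelihood."""
    return -2.0 * loglik + p * math.log(n)


def loglik_from_aic(aic_value, p):
    """Back out the maximized log-likelihood implied by a reported AIC."""
    return (2.0 * p - aic_value) / 2.0


n_october = 93  # assumption: 31 days x 3 years of daily data
for label, p, reported_aic in [("3-parameter Weibull", 3, 452.54),
                               ("mixed 2-parameter Weibull", 5, 438.32)]:
    loglik = loglik_from_aic(reported_aic, p)
    print(label, round(loglik, 2), round(bic(loglik, p, n_october), 2))
```

Because ln N exceeds 2 for any sample larger than 7, the BIC penalizes the two extra mixture parameters more heavily than the AIC does, yet the mixture is still preferred under both criteria in Table V.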

The AIC and BIC values for the mixed Weibull distribution were lower than those for the 3-parameter Weibull distribution, indicating that a mixture of 2-parameter Weibull distributions provides a better fit to the data than the 3-parameter Weibull distribution. Such an improvement in GOF does not necessarily mean that a mixed Weibull distribution provides the best possible fit to the data, but indicates that it can be considered a strong candidate distribution. Following these procedures, the most appropriate distributions for fitting the wind speed data are summarized in Table VI. Notably, the wind speed data for October at the SS site, and January at the TB site, were better represented by a mixture of two 2-parameter Weibull distributions (Fig. 6).
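A fitted mixture of this kind can be evaluated with a few lines of code. The sketch below uses the values reported for October at the SS site in Table VI and assumes that the four numbers in parentheses are ordered as (β1, β2, λ1, λ2) with ω2 = 1 − ω1; this ordering is an interpretation of the table footnote, not something the text states explicitly.

```python
import math


def weibull_pdf(x, beta, lam):
    """Density of the 2-parameter Weibull with shape beta and scale lam."""
    if x <= 0.0:
        return 0.0
    z = x / lam
    return (beta / lam) * z ** (beta - 1.0) * math.exp(-(z ** beta))


def mixture_pdf(x, weights, betas, lams):
    """Weighted sum of 2-parameter Weibull densities."""
    return sum(w * weibull_pdf(x, b, l) for w, b, l in zip(weights, betas, lams))


def mixture_mean(weights, betas, lams):
    """Mean of the mixture: each Weibull component has mean lam * Gamma(1 + 1/beta)."""
    return sum(w * l * math.gamma(1.0 + 1.0 / b) for w, b, l in zip(weights, betas, lams))


# October at Sungsan (Table VI); the parameter ordering is an assumption.
weights = (0.940, 0.060)
betas = (3.722, 18.908)
lams = (7.652, 14.462)

# Riemann-sum check that the density integrates to about 1 over 0-40 m/s.
step = 0.005
area = sum(mixture_pdf(i * step, weights, betas, lams) for i in range(1, 8000)) * step
print(area, mixture_mean(weights, betas, lams))
```

Under this reading, the dominant component describes ordinary October days, while the small second component captures a narrow cluster of high-wind days that a single Weibull distribution cannot represent.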

Table 6. Appropriate distribution functions for fitting monthly wind speed at three wind farm sites in the Republic of Korea.

Mon. | Hankyung (HK) | Sungsan (SS) | Taebaek (TB)
Jan. | 3-parameter (1.833, 5.042, 3.869) | 2-parameter (3.777, 9.338, NA) | Mixed with ω1 = 0.210 (9.621, 3.619, 3.642, 9.505)
Feb. | 2-parameter (2.837, 8.491, NA) | 2-parameter (3.252, 9.012, NA) | 2-parameter (2.994, 7.491, NA)
Mar. | 3-parameter (1.756, 5.737, 1.722) | 3-parameter (1.912, 6.288, 2.282) | 3-parameter (1.399, 4.502, 2.832)
Apr. | 2-parameter (2.413, 6.929, NA) | 3-parameter (1.745, 5.654, 2.246) | 3-parameter (1.288, 3.723, 2.946)
May | 3-parameter (1.361, 3.799, 1.761) | 3-parameter (1.639, 5.080, 2.052) | 3-parameter (1.438, 4.479, 2.379)
Jun. | 3-parameter (1.412, 3.116, 1.371) | 3-parameter (1.308, 3.160, 2.289) | 3-parameter (1.100, 2.311, 2.677)
Jul. | 3-parameter (1.341, 3.819, 1.527) | 3-parameter (1.354, 3.986, 2.425) | 3-parameter (1.615, 4.427, 1.368)
Aug. | 3-parameter (1.282, 4.209, 1.936) | 3-parameter (1.471, 5.171, 2.204) | 3-parameter (1.292, 3.595, 2.714)
Sep. | 3-parameter (1.306, 4.341, 1.759) | 3-parameter (1.401, 4.870, 2.297) | 3-parameter (1.056, 3.122, 2.398)
Oct. | 2-parameter (2.720, 7.578, NA) | Mixed with ω1 = 0.940 (3.722, 18.908, 7.652, 14.462) | 3-parameter (1.258, 3.051, 2.780)
Nov. | 3-parameter (1.770, 5.101, 2.056) | 3-parameter (1.181, 5.203, 2.387) | 3-parameter (1.477, 4.377, 2.799)
Dec. | 3-parameter (1.932, 6.071, 2.480) | 3-parameter (1.519, 4.913, 3.332) | 3-parameter (1.547, 1.372, 4.397)

* For the two- and three-parameter distributions, the values in parentheses correspond to (β, λ, τ).

** For the mixed distributions, the values in parentheses correspond to (β1, β2, λ1, λ2).

*** NA: Not Applicable.

Figure 6. Wind speed data fitted with mixed Weibull distributions. (a) October at Sungsan. (b) January at Taebaek.

The accurate estimation of wind speed distributions is essential for the effective prediction and management of wind power generation, which plays a central role in expanding renewable power generation and mitigating GHG emissions. That there will be spatiotemporal variations in wind speed is intuitive, and therefore, wind speed distribution will vary greatly between regions and over time. This study explored a formal statistical procedure for deriving appropriate distribution functions that fit monthly wind speed data obtained from three different wind farms in the ROK. The candidate distributions were first identified using descriptive statistical approaches, and then tested for GOF following formal statistical tests. The A–D test may be considered more rigorous than the standard K–S test because it better captures the distributional behavior at the tail of the data distribution. Furthermore, 2- and 3-parameter Weibull distributions were compared for their GOF to the wind speed data, and the results indicated that the wind speed data are better fitted by including additional parameters. When a suitable distribution cannot be obtained, the application of mixed Weibull distributions can be considered, and the problem of overfitting can be addressed by testing the obtained GOF with information criteria, such as AIC and BIC.

One of the major shortcomings of this study is its limited ability to account for possible overfitting and underfitting, which may be overcome with more extensive data collection. In addition, filtering outliers via data preprocessing may further improve the understanding of wind-speed distributions, which warrants further investigation. Overall, the results of this study are expected to be extended and linked to the derivation and prediction of power output from wind generation in the ROK and elsewhere.

This work was supported by a research grant from Pukyong National University (2021).

  1. M. S. Nazir, N. Ali, M. Bilal, and H. M. N. Iqbal, Curr. Opin. Environ. Sci. Health 13, 85 (2020).
  2. K. Dai, A. Bergot, C. Liang, W. N. Xiang, and Z. Huang, Renew. Energy 75, 911 (2015).
  3. M. H. Bollen and F. Hassan, Integration of Distributed Generation in the Power System (USA, John Wiley & Sons, 2011).
  4. Ministry of Trade, Industry and Energy (MOTIE), The 9th Master Plan for Long-Term Electricity Supply and Demand (Sejong, Republic of Korea, 2021).
  5. A. T. Abolude and W. Zhou, Energies 11, 1992 (2018).
  6. K.-H. Kim, Y.-C. Ju, and D.-H. Kim, J. Korean Solar Energy Soc. 26, 63 (2006).
  7. C. Paik, J. Korean Solar Energy Soc. 39, 79 (2019).
  8. Z. Qin, W. Li, and X. Xiong, Electr. Power Syst. Res. 81, 2139 (2011).
  9. K. Mohammadi, O. Alavi, A. Mostafaeipour, N. Goudarzi, and M. Jalilvand, Energy Convers. Manag. 108, 322 (2016).
  10. T. B. M. J. Ouarda, C. Charron, and F. Chebana, Energy Convers. Manag. 124, 247 (2016).
  11. N. Y. Yürüşen and J. J. Melero, J. Phys. Conf. Ser. 753, 032067 (2016).
  12. I. Pobočíková, Z. Sedliačková, and M. Michalková, Procedia Eng. 192, 713 (2017).
  13. A. K. Azad, M. G. Rasul, M. M. Alam, S. M. Ameer Uddin, and S. K. Mondal, Procedia Eng. 90, 725 (2014).
  14. X. Qin, J. S. Zhang, and X. D. Yan, J. App. Meteorol. Climatol. 51, 1321 (2012).
  15. K. Sukkiramathi and C. V. Seshaiah, Energy Explor. Exploit. 38, 158 (2020).
  16. A. C. Cullen and H. C. Frey, Probabilistic Techniques in Exposure Assessment: A Handbook for Dealing with Variability and Uncertainty in Models and Inputs (USA, Plenum Press, 1999).
  17. I. M. Chakravarti, R. G. Laha, and J. Roy, Handbook of Methods of Applied Statistics (New York, USA, John Wiley and Sons, 1967).
  18. C. Paik, Y. Chung, and Y. J. Kim, Renew. Energy 164, 833 (2021).
  19. M. A. Stephens, J. Am. Stat. Assoc. 69, 730 (1974).
  20. E. Gómez-Lázaro, M. C. Bueso, M. Kessler, E. Martín-Martínez, J. Zhang, B.-M. Hodge, and A. Molina-García, Energies 9, 91 (2016).
