Data curation and conditioning - In-depth analysis of the global power infrastructure

After the extensive data collection exercise, the first plots and readings of the resulting data revealed a multitude of issues and the relatively poor quality of the source data.

Therefore, further data curation and conditioning were required.

The first step of data curation was the standardisation of entries. This step consisted of removing superfluous characters, such as trailing spaces at the end of data fields, which made, for example, “Mexico” and “Mexico ” appear as distinct entries. The next step was the standardisation of the date format for the year of commissioning, which appeared in several different formats. Although the field is named “Year online”, its content sometimes included the month or day, inconsistently written as numbers or text, which required unification across the data entries.
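The two standardisation steps described above can be sketched as follows. This is a minimal illustration rather than the procedure actually used: the function names and the sample values are hypothetical, and the sketch assumes that the commissioning year always appears as a four-digit number somewhere in the field.

```python
import re

# Assumption: a commissioning year is a four-digit number between 1800 and
# 2099 that is not part of a longer run of digits.
YEAR_PATTERN = re.compile(r"(?<!\d)(18|19|20)\d{2}(?!\d)")


def clean_field(value):
    """Strip leading/trailing whitespace and collapse internal runs of spaces."""
    return " ".join(value.split())


def extract_year(value):
    """Return the first four-digit year found in a free-form field, or None."""
    match = YEAR_PATTERN.search(value)
    return int(match.group(0)) if match else None


# Hypothetical raw entries illustrating the inconsistencies described above.
print(clean_field("Mexico ") == clean_field("Mexico"))
for raw in ["1998", "March 1998", "12/03/1998", "1998-03", ""]:
    print(repr(raw), "->", extract_year(raw))
```

Fields in which no year can be recognised return `None`, which corresponds to entries that would need manual inspection.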

The next issue to be tackled was incomplete fields in the source data. As one of the main objectives of the analysis was the historical development of the power sector, the year of commissioning was one of the most important fields to complete. This issue required the introduction of a new timestamp category for all the power plants with no reported year of commissioning. These capacities were thus aggregated into a “Yearless” category. After the first six months of work to fill the gaps in the data, the “Yearless” category still accounted for around 9% of the data entries and 2.7% of the global capacity, with the initial state having been significantly worse. An example of the effect of this issue is shown in Figure 4.

Figure 4: Brazil as an example of the wide issue of the incomplete source data for the commissioning year.

Figure 4, an excerpt from the earliest set of country-wise plots, illustrates the issue of incomplete data, particularly in the commissioning year field. As shown for the case of Brazil, if the “Yearless” category were a year, it would greatly surpass the level of installations of any other given year. The data condition was similar in several other countries, such as Iran, Guatemala and Kenya.

The last version of the database reduced the “Yearless” entries down to less than 6% of the total entries, and around 2.2% of the global capacity.
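The two shares quoted above measure different things: one counts entries, the other weights them by capacity. A minimal sketch of both calculations is given below; the record layout (`capacity_mw`, `year_online`) and the four sample plants are hypothetical and are not taken from the database.

```python
def yearless_shares(plants):
    """Return (share of entries, share of capacity) lacking a commissioning year."""
    total_entries = len(plants)
    total_capacity = sum(p["capacity_mw"] for p in plants)
    yearless = [p for p in plants if p["year_online"] is None]
    entry_share = len(yearless) / total_entries
    capacity_share = sum(p["capacity_mw"] for p in yearless) / total_capacity
    return entry_share, capacity_share


# Hypothetical records; None marks a missing commissioning year.
plants = [
    {"name": "Plant A", "capacity_mw": 1200.0, "year_online": 1985},
    {"name": "Plant B", "capacity_mw": 600.0, "year_online": 2003},
    {"name": "Plant C", "capacity_mw": 150.0, "year_online": 2010},
    {"name": "Plant D", "capacity_mw": 50.0, "year_online": None},
]

entry_share, capacity_share = yearless_shares(plants)
print(f"{entry_share:.1%} of entries, {capacity_share:.1%} of capacity")
```

An entry share that is larger than the capacity share, as in the figures reported above, means that the average “Yearless” plant is smaller than the average plant in the list.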

The “Yearless” issue also created a conflict with the aggregated capacities. Under the previous definition of aggregated capacities, the “Yearless” capacities are double counted, and thus have to be factored in. Therefore, the following equation is used instead to balance the aggregated capacities against the “Yearless” capacities.

$$Z = AC_{aY_n} - Y_{?a}$$

$$NAC_{aY_x} = \begin{cases} 0 & \text{if } Z \le 0 \\ Z & \text{if } Z > 0 \\ AC_{aY_x} & \text{if } Z > AC_{aY_x} \end{cases} \tag{2}$$

In Equation (2), NACaYx represents the new aggregated capacity of technology a for year x, and Y?a is the “Yearless” capacity for technology a. The equation balances “Yearless” capacities with the aggregated capacities, by compensating first for the earlier years of reported capacities, from 2000 onwards. With the new balancing, Figure 3 is modified into Figure 5.

Figure 5: Example of adjusted aggregated capacities.

In the example of Figure 5, the “Yearless” capacities for technology a account for 500 MW, while the cumulative discrepancy between the power plant list and the reported capacities is 600 MW. Thus, 100 MW is added to the power plant list as aggregated capacities (purple bar in Figure 5). In that example, all discrepancies from 2006 onwards for that technology would be added fully as aggregated capacities, as calculated in Equation (1).
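The balancing of Equation (2) can be sketched in a few lines. The sketch rests on one assumption read from the worked example rather than stated in the equation: Z is taken as the cumulative discrepancy from 2000 up to year x minus the “Yearless” capacity, and the result is clamped between zero and the discrepancy of year x. The function name and the yearly values are hypothetical, chosen only so that the cumulative discrepancy reaches 600 MW against 500 MW of “Yearless” capacity.

```python
def balance_yearless(discrepancies, yearless_mw):
    """Return the new aggregated capacity per year for one technology.

    discrepancies: dict mapping year -> aggregated capacity (MW) before
        balancing, i.e. reported capacity minus the power plant list.
    yearless_mw: "Yearless" capacity (MW) of the same technology.
    """
    balanced = {}
    cumulative = 0.0
    for year in sorted(discrepancies):
        ac = discrepancies[year]
        cumulative += ac
        # Assumption: Z uses the cumulative discrepancy up to this year.
        z = cumulative - yearless_mw
        # Clamp Z to the range [0, AC of this year].
        balanced[year] = min(max(z, 0.0), ac)
    return balanced


# Hypothetical yearly discrepancies (MW) for one technology.
discrepancies = {
    2000: 150.0, 2001: 100.0, 2002: 50.0, 2003: 100.0,
    2004: 50.0, 2005: 150.0, 2006: 80.0,
}

for year, mw in balance_yearless(discrepancies, 500.0).items():
    print(year, mw)
```

With these inputs, nothing is added up to 2004, 100 MW is added in 2005 (600 MW cumulative minus 500 MW “Yearless”), and the full discrepancy is added from 2006 onwards.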

In some cases, the power plant list was so incomplete that the amount of aggregated capacities generated unnatural spikes in the profiles of some countries. Figure 6 shows a clear example of this issue. The example shown is Japan, but the issue was evident in many other countries, including Croatia, Pakistan, the Netherlands, Germany, Ukraine and the United Kingdom. The spike in the year 2000 is the result of a significant lack of entries in the power plant list, particularly for the power plants commissioned before the year 2000. Beyond the missing information regarding the year of commissioning, the gap between the countries’ reported total capacity and the power plant list, added as NACaYx, generates a spike in the year 2000 that, in the worst cases, artificially surpasses the installations of other years by a factor of four.

Figure 6: Japan’s power capacities by year until 2014 as an example of insufficient data entries in the power plant list, generating an issue of aggregated capacities for the year 2000.

At this point, aggregated capacities alone no longer sufficed to balance the discrepancies with the country-reported capacities. To solve this issue, it was necessary to extend the list of power plants, which could only be done by importing the missing information from alternative data sources, such as IRENA (2015), Platts (2009), GRanD (Liermann et al., 2011), Werner et al. (2015) and BMWi (2014).

Of these sources, only Platts (2009) provided an alternative list of power plants, while the rest were used as a reference for the total capacity reported by countries for different technologies. With around 145,000 entries, Platts (2009) appears, at first glance, to be a more comprehensive list than GlobalData (2015). However, the list contains far less detailed information per entry and includes power stations of less than 1 MW of capacity, while missing installations from 2009 onwards.

The comparison between the Platts (2009) and GlobalData (2015) lists was performed manually. All entries of the Platts (2009) database with a registered capacity of less than 5 MW were filtered out because of time constraints, while all entries of more than 5 MW of capacity were compared manually. The manual comparison was necessary in order to avoid double counting caused by transliteration. For example, a power plant in Japan named “Miyako 1” by GlobalData may be spelled “MIYAKO1” or “Miyako Daini” elsewhere, which a script comparing strings would treat as different plants although they are the same power station. Automatic comparison of partial matches was not possible either, as many different power stations have similar names; for example, “Miyazu 1” would be a partial match for a different power station.
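The difficulty with automatic matching can be illustrated with the plant names quoted above. The sketch below is not the method used in the study, which relied on manual comparison; it normalises names and scores them with `difflib.SequenceMatcher` from the Python standard library purely to show how such a script behaves. The helper names are hypothetical.

```python
import re
from difflib import SequenceMatcher


def normalise(name):
    """Lower-case a plant name and drop everything except letters and digits."""
    return re.sub(r"[^a-z0-9]", "", name.lower())


def similarity(a, b):
    """Similarity ratio in [0, 1] between two normalised plant names."""
    return SequenceMatcher(None, normalise(a), normalise(b)).ratio()


reference = "Miyako 1"
for candidate in ["MIYAKO1", "Miyako Daini", "Miyazu 1"]:
    exact = normalise(candidate) == normalise(reference)
    score = similarity(reference, candidate)
    print(f"{candidate!r}: exact={exact}, similarity={score:.2f}")
```

Normalisation catches “MIYAKO1”, but not “Miyako Daini”, and the similarity score ranks the unrelated “Miyazu 1” above the true alias “Miyako Daini”. Any similarity threshold that accepts the alias would therefore also accept the wrong plant.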

Figure 7 shows a simplified view of the data curation process described above. During Google searches for missing data, a commonly used reliable source was the project files of the Clean Development Mechanism (CDM), particularly for renewable energy (RE) based power plants and, in some cases, gas turbines (UNFCCC-CDM, 2005). In the absence of official documents available online, other sources, such as news articles mentioning the opening of the plant, were taken into account.

Figure 7: Data curation process in a nutshell.

Finally, the last data treatment applied was the distribution of hydropower over reservoir and run-of-river (RoR) categories. GlobalData classifies hydropower into four categories: “Hydro”, “Hydro/Reservoir-Based”, “Hydro/Run of River” and “Pumped Hydro Stations”. However, the category “Hydro” exists only because of the lack of information, as hydropower stations, in general, can either be reservoir based or RoR, while pumped-hydro storage is, as the name suggests, storage capacity rather than generation. In order to distribute the hydro capacities of unspecified type, the ratio of the specified capacities was applied. Globally, the specified capacities of hydropower are 68.3% reservoir based and 31.7% RoR. Thus, all the capacities of each power plant with an unspecified hydropower type are distributed according to this ratio.
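As a minimal sketch of this distribution, the function below splits the capacity of one plant of unspecified hydropower type using the global shares quoted above. The function name and the 200 MW example plant are hypothetical.

```python
# Global shares of the specified hydropower capacities, as quoted in the text.
RESERVOIR_SHARE = 0.683
ROR_SHARE = 0.317


def split_unspecified_hydro(capacity_mw, reservoir_share=RESERVOIR_SHARE):
    """Split an unspecified hydro capacity into (reservoir, run-of-river) MW."""
    reservoir = capacity_mw * reservoir_share
    # The remainder goes to RoR so that the two parts add up to the input.
    run_of_river = capacity_mw - reservoir
    return reservoir, run_of_river


# Hypothetical 200 MW plant listed only as "Hydro".
reservoir_mw, ror_mw = split_unspecified_hydro(200.0)
print(f"reservoir: {reservoir_mw:.1f} MW, RoR: {ror_mw:.1f} MW")
```

Assigning the remainder to RoR, rather than multiplying by the second share, keeps the total capacity of each plant unchanged regardless of rounding in the shares.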

After all the patching and curation described above, the data finally reached a quality level at which the issues were significantly reduced. Figure 8 shows the examples presented above of Brazil (top) and Japan (bottom) after the data curation process. When compared with Figure 4 for Brazil and Figure 6 for Japan, the improvement in the data quality is evident.

Figure 8: Effects of data curation on two countries strongly affected by the generally low quality of the data from the original source.

As shown in Figure 8, in the case of Brazil, the “Yearless” capacities are reduced from around 13 GW down to less than 2 GW. Similarly, the spike of the year 2000 for Japan is redistributed and reduced from around 47 GW down to around 15 GW. In both cases, the improvement resulting from the data curation process shows clear benefits.

In document In-depth analysis of the global power infrastructure: Opportunities for sustainable evolution of the power sector (pages 29-36)