Processing of data sources

Methodology

Once sources had been identified and selected for inclusion in the Index, each source was processed so that all source data is ultimately held in the same common format.

The common format is defined as a comma-separated values (CSV) file where each row represents a single data point (i.e. each row relates to a specific country’s value for a specific variable for a specific year). The common format has four columns:

cc_iso3c – a 3-character alphabetic code identifying the country/territory, based on the ISO 3166-1 standard.
ref_year – a 4-digit number providing the reference year of the data point.
variable – a variable label that provides a unique reference to the data series the data point is drawn from.
value – the numeric value of the data point.
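
As an illustration of the four-column layout described above, the following is a minimal sketch of a validator for a file in the common format. It is not taken from the Index's own source code: the function name `validate_common_format`, the sample variable label `example_var`, and the assumption that country codes are stored in upper case are all hypothetical.

```python
import csv
import io

# Column names as listed in the common format definition.
COLUMNS = ["cc_iso3c", "ref_year", "variable", "value"]


def validate_common_format(csv_text):
    """Return a list of (line_number, message) problems found in csv_text."""
    reader = csv.DictReader(io.StringIO(csv_text))
    if reader.fieldnames != COLUMNS:
        return [(1, f"expected header {COLUMNS}, got {reader.fieldnames}")]

    problems = []
    for row in reader:
        line = reader.line_num  # the header is line 1
        code = row["cc_iso3c"] or ""
        year = row["ref_year"] or ""
        variable = row["variable"] or ""
        value = row["value"] or ""

        # Assumption: codes are three upper-case ASCII letters.
        if not (len(code) == 3 and code.isascii() and code.isalpha() and code.isupper()):
            problems.append((line, f"bad country code: {code!r}"))
        if not (len(year) == 4 and year.isascii() and year.isdigit()):
            problems.append((line, f"bad reference year: {year!r}"))
        if not variable.strip():
            problems.append((line, "missing variable label"))
        try:
            float(value)
        except ValueError:
            problems.append((line, f"non-numeric value: {value!r}"))
    return problems


sample = (
    "cc_iso3c,ref_year,variable,value\n"
    "GBR,2022,example_var,0.75\n"
    "FR,2022,example_var,0.61\n"
)
for line, message in validate_common_format(sample):
    print(f"line {line}: {message}")
```

In this sample only the third line is reported, because `FR` is a two-letter code rather than a three-letter one.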

The processing of data sources also involved the creation of a metadata record for each source that provided standardised metadata about the source, a description of the source, documentation of the processing, and metadata relating to the extracted variables.

For many sources, processing was simple: extracting the relevant subset of variables, conforming country identifiers to the Index’s harmonised coding scheme, and transposing the data from a ‘wide’ to a ‘long’ format.
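
The simple case can be sketched as follows. This is a hedged illustration only, not the Index's actual code: it assumes a hypothetical source with one row per country and one column per year, and the function name `wide_to_long`, the label `example_var`, and the contents of the code map are invented for the example.

```python
def wide_to_long(wide_rows, id_column, variable, code_map):
    """Reshape rows with one column per year into one row per data point.

    wide_rows -- list of dicts, e.g. {"country": "GBR", "2021": "0.7"}
    id_column -- name of the column holding the source's country identifier
    variable  -- label to assign to every extracted data point
    code_map  -- source identifier -> harmonised 3-letter code (hypothetical)
    """
    long_rows = []
    for row in wide_rows:
        # Fall back to the source identifier when no mapping is needed.
        code = code_map.get(row[id_column], row[id_column])
        for column, cell in row.items():
            if column == id_column or cell == "":
                continue  # skip the identifier column and empty cells
            long_rows.append(
                {
                    "cc_iso3c": code,
                    "ref_year": int(column),
                    "variable": variable,
                    "value": float(cell),
                }
            )
    return long_rows


# Hypothetical wide-format input: one row per country, one column per year.
wide = [
    {"country": "ROM", "2021": "0.5", "2022": ""},
    {"country": "GBR", "2021": "0.7", "2022": "0.8"},
]
long_rows = wide_to_long(wide, "country", "example_var", {"ROM": "ROU"})
for item in long_rows:
    print(item)
```

Each non-empty cell in the wide table becomes one row in the four-column common format; empty cells are dropped here rather than stored as missing values, which is a design choice of this sketch.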

In some cases more complex processing was required: where data for countries in different regions had been published in separate files and needed consolidation, where data had been published in PDF files, or where it was necessary to re-process the original data to extract relevant scores. See the article on pre-processing for more details.
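
To illustrate the first of these cases, the sketch below consolidates several regional files into a single table. It assumes, hypothetically, that every regional file shares the same columns (`country`, `year`, `question`, `score`); the function name `consolidate` and the sample figures are invented and do not reflect any source's real layout or data.

```python
import csv
import io


def consolidate(regional_files):
    """Merge {region_name: csv_text} into one list of rows.

    Each output row keeps the source columns and records which regional
    file it came from. A data point appearing in two files is an error.
    """
    combined = []
    seen = set()
    for region, text in regional_files.items():
        for row in csv.DictReader(io.StringIO(text)):
            key = (row["country"], row["year"], row["question"])
            if key in seen:
                raise ValueError(f"duplicate data point {key} in {region}")
            seen.add(key)
            combined.append({**row, "region": region})
    return combined


# Hypothetical regional files held in memory for the example.
regional = {
    "africa": "country,year,question,score\nKEN,2019,q1,45\n",
    "europe": "country,year,question,score\nFRA,2021,q1,28\n",
}
combined = consolidate(regional)
print(len(combined), "rows from", len(regional), "regional files")
```

Raising an error on duplicates, rather than silently keeping the first or last copy, makes overlaps between regional files visible so they can be resolved by hand.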

The source code for the processing is available on GitHub.

Type of processing	Sources
Variable extraction and formatting	Bertelsmann Transformation Index (Bertelsmann Stiftung); Doing Business Report (World Bank); Gender Statistics Database (European Institute for Gender Equality); ILOSTAT (International Labour Organization); Logistics Performance Index (World Bank); PARIS21 Statistical Capacity Monitor (OECD); Quality of Government Expert Survey (University of Gothenburg); Rule of Law Index 2023 (World Justice Project); Sendai Framework Monitor (United Nations); Sustainable Governance Indicators (Bertelsmann Stiftung); Varieties of Democracy dataset (University of Gothenburg)
Consolidation of regional data files	Global Corruption Barometer (Transparency International)
Extraction of data from PDF report	Global Cybersecurity Index (International Telecommunication Union)
Re-processing of original data	Global Data Barometer (D4D.net and ILDA); GovTech Maturity Index 2022 (World Bank); International Survey of Revenue Administration (CIAT, IMF, IOTA, and OECD); Open Data Inventory (Open Data Watch)
