Once sources had been identified and selected for inclusion in the Index, each underwent some degree of processing to ensure that all source data is ultimately held in the same common format.
The common format is defined as a comma-separated values (CSV) file where each row represents a single data point (i.e. each row relates to a specific country’s value for a specific variable for a specific year). The common format has four columns:
- cc_iso3c – a 3-character alphabetic code identifying the country/territory, based on the ISO 3166-1 standard.
- ref_year – a 4-digit number providing the reference year of the data point.
- variable – a variable label that provides a unique reference to the data series the data point is drawn from.
- value – the numeric value of the data point.
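As an illustration of the four-column format described above, the following is a minimal sketch of how a row might be checked against it. The variable label `example_var`, the sample values and the `validate_row` helper are hypothetical and are not taken from the Index's own code; the requirement that codes be uppercase is an assumption based on ISO 3166-1 alpha-3 convention.

```python
import csv
import io

# The four columns of the common format, in order.
COLUMNS = ["cc_iso3c", "ref_year", "variable", "value"]


def validate_row(row):
    """Return a list of problems found in one common-format row (empty if none)."""
    problems = []
    code = row["cc_iso3c"]
    # Assumption: codes are stored in uppercase, as in ISO 3166-1 alpha-3.
    if not (len(code) == 3 and code.isascii() and code.isalpha() and code.isupper()):
        problems.append("cc_iso3c must be 3 uppercase letters")
    year = row["ref_year"]
    if not (len(year) == 4 and year.isascii() and year.isdigit()):
        problems.append("ref_year must be a 4-digit number")
    if not row["variable"]:
        problems.append("variable must not be empty")
    try:
        float(row["value"])
    except ValueError:
        problems.append("value must be numeric")
    return problems


# Hypothetical sample data; the values are invented for illustration only.
SAMPLE = """cc_iso3c,ref_year,variable,value
NZL,2019,example_var,0.82
FRA,2019,example_var,0.75
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE)))
all_problems = [validate_row(r) for r in rows]
```

Here `all_problems` holds one list per row, and an empty list means the row conforms to the format as sketched.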
Processing also involved creating a metadata record for each source, providing standardised metadata about the source, a description of the source, documentation of the processing, and metadata relating to the extracted variables.
For many sources, this processing was simple: extracting the relevant subset of variables, conforming country identifiers to the Index’s harmonised coding scheme, and transposing the data from a ‘wide’ to a ‘long’ format.
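The simple case can be sketched roughly as follows. This is an illustrative example only, not the Index's actual processing code: the `COUNTRY_CODES` lookup, the layout of the 'wide' input, the `example_var` label and the decision to skip empty cells are all assumptions made for the sketch.

```python
import csv
import io

# Hypothetical mapping from a source's own country labels to ISO 3166-1 alpha-3 codes.
COUNTRY_CODES = {"New Zealand": "NZL", "France": "FRA"}

# Hypothetical 'wide' source: one row per country, one column per year.
WIDE = """country,2018,2019
New Zealand,0.80,0.82
France,0.74,
"""


def wide_to_long(wide_csv, variable):
    """Turn a wide country-by-year table into common-format rows."""
    long_rows = []
    for row in csv.DictReader(io.StringIO(wide_csv)):
        code = COUNTRY_CODES[row["country"]]  # conform the country identifier
        for column, cell in row.items():
            # Assumption: empty cells are missing data and produce no row.
            if column == "country" or cell == "":
                continue
            long_rows.append(
                {
                    "cc_iso3c": code,
                    "ref_year": int(column),
                    "variable": variable,
                    "value": float(cell),
                }
            )
    return long_rows


long_rows = wide_to_long(WIDE, "example_var")
```

Each country-year cell in the wide table becomes one row of the long table, so the two-country, two-year input above yields three rows once the empty cell is dropped.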
In some situations, more complex processing was undertaken, such as where data for countries in different regions has been published in separate files and needs consolidation, where data has been published in PDF files, or where it has been necessary to re-process the original data to extract relevant scores. See the article on pre-processing for more details.
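As a rough illustration of the consolidation case, the sketch below merges regional extracts into a single set of rows. It assumes the regional files have already been brought into the common format; the region names, sample values and the duplicate check are hypothetical and are not drawn from the Index's source code.

```python
import csv
import io

# Hypothetical regional extracts, each already in the common format.
REGIONAL_FILES = {
    "europe": "cc_iso3c,ref_year,variable,value\nFRA,2019,example_var,0.75\n",
    "oceania": "cc_iso3c,ref_year,variable,value\nNZL,2019,example_var,0.82\n",
}


def consolidate(files):
    """Combine regional extracts, refusing duplicate data points."""
    seen = set()
    combined = []
    for name in sorted(files):  # fixed order keeps the output reproducible
        for row in csv.DictReader(io.StringIO(files[name])):
            key = (row["cc_iso3c"], row["ref_year"], row["variable"])
            if key in seen:
                raise ValueError(f"duplicate data point {key} in {name}")
            seen.add(key)
            combined.append(row)
    return combined


combined = consolidate(REGIONAL_FILES)
```

Raising an error on a duplicate country, year and variable combination is one possible design choice; it surfaces overlaps between regional files rather than silently keeping one value.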
The source code for the processing is available on GitHub.
Type of processing | Sources |
---|---|
Variable extraction and formatting | |
Consolidation of regional data files | |
Extraction of data from PDF report | |
Re-processing of original data | |