Adding Data Sources

This guide explains how to add new data sources to fair-shares.


Overview

Data sources are configured in conf/data_sources/data_sources_unified.yaml and processed through preprocessing notebooks in the notebooks/1xx_*.py series.

Data Types

| Type | Purpose | Current Sources |
| --- | --- | --- |
| emissions | Historical non-LULUCF emissions | PRIMAP-hist |
| gdp | Economic capability | World Bank WDI, IMF |
| population | Per capita calculations | UN/OWID |
| gini | Within-country inequality | UNU-WIDER, WID |
| lulucf | NGHGI-consistent LULUCF emissions | Melo et al. (2026) |
| targets | Global constraints | AR6 scenarios, RCBs |

Step 1: Add Raw Data

Place your data files in the appropriate subdirectory:

Text Only
data/
├── emissions/
│   └── my-source-YYYY/
│       └── raw_data_file.csv
├── gdp/
│   └── my-source-YYYY/
├── population/
├── gini/
├── lulucf/
│   └── my-source-YYYY/
├── scenarios/
└── rcbs/

Use the naming convention {source}-{year}/ for versioning.
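
A directory name can be checked against this convention before it is committed. The sketch below is only an illustration: the regular expression and the helper name `is_valid_source_dir` are assumptions, inferred from names that appear in this guide such as primap-202503 and un-owid-2025 (a 4-digit year or a 6-digit year-month suffix). The project itself may not enforce any pattern.

```python
import re

# Assumed pattern: lowercase alphanumeric words joined by hyphens, ending in a
# 4-digit year (YYYY) or a 6-digit year-month (YYYYMM), e.g. "primap-202503".
SOURCE_DIR_PATTERN = re.compile(r"^[a-z0-9]+(?:-[a-z0-9]+)*-\d{4}(?:\d{2})?$")


def is_valid_source_dir(name: str) -> bool:
    """Return True if name looks like {source}-{year}."""
    return SOURCE_DIR_PATTERN.fullmatch(name) is not None


for name in ["my-source-2026", "primap-202503", "un-owid-2025", "my_source"]:
    print(name, is_valid_source_dir(name))
```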


Step 2: Configure the Source

Add an entry to conf/data_sources/data_sources_unified.yaml:

YAML
# Example: Adding a new emissions source
emissions:
  primap-202503:
    # ... existing source ...

  my-source-2026: # New source
    path: "data/emissions/my-source-2026/emissions_data.csv"
    data_parameters:
      available_categories:
        - co2-ffi
        - all-ghg
      world_key: "WORLD" # How the source identifies global totals
      scenario: "HISTCR" # Historical scenario identifier

Common Configuration Parameters

| Parameter | Purpose |
| --- | --- |
| path | Relative path to data file |
| available_categories | Which emission categories this source provides |
| world_key | String used to identify global totals in the data |

Step 3: Create Preprocessing Notebook

Create a preprocessing notebook in the 1xx series:

Text Only
notebooks/
├── 101_data_preprocess_emiss_primap-202503.py       # Existing
├── 102_data_preprocess_gdp_wdi-2025.py              # Existing
├── 103_data_preprocess_population_un-owid-2025.py   # Existing
├── 1xx_data_preprocess_my_source.py                 # New notebook

Preprocessing Pattern

Python
"""
Preprocess my-source-2026 data.

Input: Raw data file
Output: Standardized DataFrame with proper index structure
"""

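import pandas as pd
from pyprojroot import here

from fair_shares.library.utils import ensure_string_year_columns

# Load raw data
raw_path = here() / "data/emissions/my-source-2026/emissions_data.csv"
df = pd.read_csv(raw_path)

# Standardize country codes to ISO3c.
# convert_to_iso3c is a placeholder for your own mapping helper; it is not
# defined in this snippet.
df["iso3c"] = convert_to_iso3c(df["country_column"])
# Drop the raw identifier so only year columns remain as data
df = df.drop(columns=["country_column"])

# Set standard index
df = df.set_index(["iso3c", "unit", "emission-category"])

# Ensure year columns are strings
df = ensure_string_year_columns(df)

# Add World row if missing
if "World" not in df.index.get_level_values("iso3c"):
    world_row = df.groupby(level=["unit", "emission-category"]).sum()
    world_row["iso3c"] = "World"
    world_row = world_row.set_index("iso3c", append=True)
    # Match the level order of df so the indexes align when concatenated
    world_row = world_row.reorder_levels(df.index.names)
    df = pd.concat([df, world_row])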

# Save processed data
output_path = here() / "data/processed/my-source-2026/emissions.csv"
output_path.parent.mkdir(parents=True, exist_ok=True)
df.to_csv(output_path)

Step 4: Index Structure Requirements

All data must follow standardized MultiIndex structures:

Emissions

Python
# Index: iso3c, unit, emission-category
# Columns: year columns as strings ("1990", "2000", ...)
df.index.names == ["iso3c", "unit", "emission-category"]

GDP / Population

Python
# Index: iso3c, unit
# Columns: year columns as strings
df.index.names == ["iso3c", "unit"]

Gini (Stationary)

Python
# Index: iso3c, unit
# Columns: "gini" (single value, not time-varying)
list(df.index.names) == ["iso3c", "unit"]
list(df.columns) == ["gini"]  # a bare == on df.columns compares elementwise
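
The three structures above can also be checked programmatically at the end of a preprocessing notebook. The following is a minimal sketch, not a project utility: `check_structure` and `EXPECTED_INDEX` are hypothetical names, the country code "AAA" is a placeholder, and the values are made up.

```python
import pandas as pd

# Expected index level names per data type, taken from the structures above.
EXPECTED_INDEX = {
    "emissions": ["iso3c", "unit", "emission-category"],
    "gdp": ["iso3c", "unit"],
    "population": ["iso3c", "unit"],
    "gini": ["iso3c", "unit"],
}


def check_structure(df: pd.DataFrame, data_type: str) -> list[str]:
    """Return a list of readable problems; an empty list means the frame passes."""
    problems = []
    expected = EXPECTED_INDEX[data_type]
    if list(df.index.names) != expected:
        problems.append(f"index names {list(df.index.names)} != {expected}")
    if data_type == "gini":
        if list(df.columns) != ["gini"]:
            problems.append("gini data must have a single 'gini' column")
    else:
        bad = [c for c in df.columns if not (isinstance(c, str) and c.isdigit())]
        if bad:
            problems.append(f"non-string or non-year columns: {bad}")
    return problems


# Illustrative frame with made-up values
good = pd.DataFrame(
    {"1990": [1.0], "2000": [2.0]},
    index=pd.MultiIndex.from_tuples(
        [("AAA", "Mt CO2e", "co2-ffi")],
        names=["iso3c", "unit", "emission-category"],
    ),
)
bad = good.rename(columns={"1990": 1990})  # integer year column

print(check_structure(good, "emissions"))
print(check_structure(bad, "emissions"))
```

Returning a list of problems rather than raising on the first one makes it easier to fix a new source in a single pass.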

Step 5: Integrate with Pipeline

The Snakemake workflow automatically picks up sources from the configuration. Ensure your preprocessing notebook:

  1. Reads from the path specified in the config
  2. Outputs to the standard processed data location
  3. Uses consistent index structures

Step 6: Test

  1. Run preprocessing notebook - Verify it completes without errors
  2. Run allocation with new source - Use in 301 notebook
  3. Check results - Verify country coverage and data ranges

Python
# In 301 notebook:
active_sources = {
    "target": "rcbs",
    "emissions": "my-source-2026",  # Use new source
    "gdp": "wdi-2025",
    "population": "un-owid-2025",
    "gini": "unu-wider-2025",
    "lulucf": "melo-2026",
}

LULUCF Data Sources

LULUCF data provides NGHGI-consistent land-use CO2 emissions that replace the bookkeeping model (BM) estimates in PRIMAP. This is required for total CO2 (co2) and all-GHG (all-ghg) categories — see NGHGI Corrections for the science.

What LULUCF preprocessing produces

Notebook 107 reads the raw LULUCF source and outputs:

  • emiss_co2-lulucf_timeseries.csv — country-level NGHGI LULUCF emissions (overwrites the PRIMAP BM version from notebook 101)
  • world_co2-lulucf_timeseries.csv — world-total LULUCF for RCB corrections
  • bunker_timeseries.csv — international bunker fuel emissions
  • lulucf_metadata.yaml — NGHGI start year (enforces allocation year ≥ 2000 for co2 category)

Adding a new LULUCF source

  1. Place data in data/lulucf/{source-name}/
  2. Add config entry under lulucf: in data_sources_unified.yaml with data_parameters including format, iso3_column, year_column, value_column, category_filter, gas_filter, and exclude_regions
  3. Create 107_data_preprocess_lulucf_{source-name}.py following the pattern of 107_data_preprocess_lulucf_melo-2026.py
  4. The Snakefile will pick it up via active_lulucf_source

Which categories use LULUCF?

| Emission category | Uses LULUCF? | Why |
| --- | --- | --- |
| co2-ffi | No | Fossil fuels only |
| co2 | Yes | Total CO2 = fossil − bunkers + NGHGI LULUCF |
| all-ghg | Yes | Decomposes into co2 (NGHGI) + non-co2 |
| all-ghg-ex-co2-lulucf | No | CO2 component is co2-ffi |
| co2-lulucf | Indirect | IS the LULUCF data |
| non-co2 | No | Derived by subtraction |
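
As a worked illustration of the co2 row, the arithmetic can be written out on aligned year columns. The numbers below are made up, the variable names are hypothetical, and this sketch does not say whether the pipeline applies the formula at country or world level.

```python
import pandas as pd

years = ["2020", "2021"]

# Illustrative values only (Mt CO2), not real data
fossil = pd.Series([100.0, 102.0], index=years)
bunkers = pd.Series([3.0, 3.5], index=years)
nghgi_lulucf = pd.Series([-10.0, -8.0], index=years)  # negative = net sink

# Total CO2 = fossil - bunkers + NGHGI LULUCF
co2 = fossil - bunkers + nghgi_lulucf
print(co2)
```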

Scenario Data and NGHGI Consistency

When adding or updating scenario data (AR6 or custom), ensure the scenarios use NGHGI-consistent emissions conventions. The pipeline applies NGHGI corrections to remaining carbon budgets (RCBs) to account for the gap between bookkeeping model and NGHGI LULUCF estimates, but scenario pathways must already be internally consistent.

For AR6 scenarios: The Gidden et al. reanalysis provides scenarios that are consistent with PRIMAP historical emissions. NGHGI corrections are applied at the RCB level (adjusting the budget), not at the scenario pathway level.

For custom scenarios: If your scenario data uses a different emissions convention than PRIMAP/NGHGI, apply the necessary corrections in the preprocessing notebook (104_data_preprocess_scenarios.py) before the data enters the pipeline. Do not rely on downstream corrections — the allocation functions assume scenario data is already convention-consistent.

Normalised scenario schema

The pipeline normalises all scenario data into a common schema with climate-assessment (temperature target) and quantile (probability percentile) fields. Each scenario source defines its own mapping into this schema. For example, AR6 maps C1 to climate-assessment="1.5C", quantile=0.5. All downstream code works with these normalised fields regardless of the upstream source.

Any new data processing notebook that introduces a scenario source must output data with climate-assessment and quantile columns conforming to this schema. See notebook 104 for the AR6 reference implementation.
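
A source-specific mapping into this schema might look like the sketch below. Only the C1 entry comes from the text above; the function name `normalise_scenarios`, the `category` column name, and the error handling are assumptions for illustration and are not taken from notebook 104.

```python
import pandas as pd

# Only the C1 entry is documented above; a real source would define its full
# mapping in its own preprocessing notebook.
CATEGORY_TO_SCHEMA = {
    "C1": {"climate-assessment": "1.5C", "quantile": 0.5},
}


def normalise_scenarios(df: pd.DataFrame, category_col: str = "category") -> pd.DataFrame:
    """Attach climate-assessment and quantile columns based on a source category."""
    unknown = sorted(set(df[category_col]) - set(CATEGORY_TO_SCHEMA))
    if unknown:
        raise ValueError(f"No schema mapping for categories: {unknown}")
    out = df.copy()
    out["climate-assessment"] = out[category_col].map(
        lambda c: CATEGORY_TO_SCHEMA[c]["climate-assessment"]
    )
    out["quantile"] = out[category_col].map(
        lambda c: CATEGORY_TO_SCHEMA[c]["quantile"]
    )
    return out


raw = pd.DataFrame({"category": ["C1"], "2030": [1.0]})  # made-up value
normalised = normalise_scenarios(raw)
print(normalised)
```

Failing loudly on unmapped categories keeps rows without a defined climate-assessment out of downstream code.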


RCB Sources and All-GHG Pathway Dependency

When adding or updating remaining carbon budget (RCB) data, keep in mind that all-GHG allocations using RCBs require a decomposition into CO₂ (allocated via the budget) and non-CO₂ (allocated via scenario pathways). This means:

  • New RCBs must have matching scenario pathways available in the active data source configuration. Without them the non-CO₂ leg has no data to allocate from and build_data_config() will raise a ConfigurationError.
  • The scenario pathways must cover the same climate assessments as the RCBs (e.g., 1.5 °C / 2 °C categories).
  • Auto-derivation of pathway approaches for non-CO₂ only works when pathway data is present in the active source set.

See Other Operations: Decomposition for the science and Architecture Walkthrough: Composite Category Decomposition for the code path.
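
Before running an all-GHG allocation, the coverage requirement can be checked by hand with a set difference. This is a hypothetical pre-flight check, not the logic inside build_data_config(); the function name and the "2C" label are assumptions, while "1.5C" follows the normalised schema example.

```python
def missing_pathway_assessments(
    rcb_assessments: set[str], pathway_assessments: set[str]
) -> set[str]:
    """Return RCB climate assessments that have no matching scenario pathway."""
    return rcb_assessments - pathway_assessments


# Illustrative labels
rcbs = {"1.5C", "2C"}
pathways = {"1.5C"}

missing = missing_pathway_assessments(rcbs, pathways)
if missing:
    print(f"No scenario pathways for: {sorted(missing)}")
```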


Normative Implications

Some data source choices carry normative weight, so contributors should document the rationale behind them. Decision points include:

  • GDP: PPP vs. MER measurement can significantly affect allocation results — PPP tends to raise developing-country capacity shares [Pelz 2025b].
  • Emissions: Production vs. consumption accounting embeds different theories of responsibility. Production accounting (territorial) excludes embedded imports; consumption accounting includes them.
  • Population: Projection method choices (UN median, SSP scenarios) affect per capita allocations, particularly for countries with high projected growth.

Validation Requirements

New data sources should:

  1. Cover expected countries - At minimum, major emitters
  2. Include World total - Required for validation
  3. Use standard units - Mt CO2e for emissions, persons for population
  4. Handle missing values - Document any gaps
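
Requirements 1, 2 and 4 can be partly automated. The sketch below is an assumption-laden illustration rather than one of the project's validation utilities: `validate_source` is a hypothetical name, the "World" label follows the preprocessing pattern earlier in this guide, and the country codes and values are placeholders. Unit checks are left out because the exact unit strings are not specified here.

```python
import pandas as pd


def validate_source(df: pd.DataFrame, required_countries: list[str]) -> list[str]:
    """Collect validation problems for a processed frame indexed by iso3c."""
    problems = []
    countries = set(df.index.get_level_values("iso3c"))
    missing = sorted(set(required_countries) - countries)
    if missing:
        problems.append(f"missing countries: {missing}")
    if "World" not in countries:
        problems.append("missing World total")
    n_missing = int(df.isna().sum().sum())
    if n_missing:
        problems.append(f"{n_missing} missing values; document these gaps")
    return problems


# Made-up frame: placeholder codes, one gap, no World row
idx = pd.MultiIndex.from_tuples(
    [("AAA", "Mt CO2e"), ("BBB", "Mt CO2e")], names=["iso3c", "unit"]
)
df = pd.DataFrame({"2020": [1.0, None], "2021": [2.0, 3.0]}, index=idx)

problems = validate_source(df, required_countries=["AAA", "CCC"])
for problem in problems:
    print(problem)
```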

Existing Notebooks as Examples

| Notebook | Data Type | Good Example Of |
| --- | --- | --- |
| 101_data_preprocess_emiss_primap-202503.py | Emissions | NetCDF processing, category mapping |
| 102_data_preprocess_gdp_wdi-2025.py | GDP | CSV processing, country code mapping |
| 103_data_preprocess_population_un-owid-2025.py | Population | Combining historical and projected data |
| 105_data_preprocess_gini_unu-wider-2025.py | Gini | Quality filtering, stationary output |
| 105_data_preprocess_gini_wid-2025.py | Gini | WID.world processing, stationary output |
| 107_data_preprocess_lulucf_melo-2026.py | LULUCF | NGHGI corrections, metadata export, world totals |

See Also

  • Data Sources Config: conf/data_sources/data_sources_unified.yaml in the repository
  • Validation Utilities - Data validation functions