Adding Data Sources¶
This guide explains how to add new data sources to fair-shares.
Overview¶
Data sources are configured in conf/data_sources/data_sources_unified.yaml and processed through preprocessing notebooks in the notebooks/1xx_*.py series.
Data Types¶
| Type | Purpose | Current Sources |
|---|---|---|
| `emissions` | Historical non-LULUCF emissions | PRIMAP-hist |
| `gdp` | Economic capability | World Bank WDI, IMF |
| `population` | Per capita calculations | UN/OWID |
| `gini` | Within-country inequality | UNU-WIDER, WID |
| `lulucf` | NGHGI-consistent LULUCF emissions | Melo et al. (2026) |
| `targets` | Global constraints | AR6 scenarios, RCBs |
Step 1: Add Raw Data¶
Place your data files in the appropriate subdirectory:
```
data/
├── emissions/
│   └── my-source-YYYY/
│       └── raw_data_file.csv
├── gdp/
│   └── my-source-YYYY/
├── population/
├── gini/
├── lulucf/
│   └── my-source-YYYY/
├── scenarios/
└── rcbs/
```
Use the naming convention `{source}-{year}/` for versioning.
Step 2: Configure the Source¶
Add an entry to conf/data_sources/data_sources_unified.yaml:
```yaml
# Example: Adding a new emissions source
emissions:
  primap-202503:
    # ... existing source ...
  my-source-2026:  # New source
    path: "data/emissions/my-source-2026/emissions_data.csv"
    data_parameters:
      available_categories:
        - co2-ffi
        - all-ghg
      world_key: "WORLD"  # How the source identifies global totals
      scenario: "HISTCR"  # Historical scenario identifier
```
Common Configuration Parameters¶
| Parameter | Purpose |
|---|---|
| `path` | Relative path to data file |
| `available_categories` | Which emission categories this source provides |
| `world_key` | String used to identify global totals in the data |
Step 3: Create Preprocessing Notebook¶
Create a preprocessing notebook in the 1xx series:
```
notebooks/
├── 101_data_preprocess_emiss_primap-202503.py      # Existing
├── 102_data_preprocess_gdp_wdi-2025.py             # Existing
├── 103_data_preprocess_population_un-owid-2025.py  # Existing
└── 1xx_data_preprocess_my_source.py                # New notebook
```
Preprocessing Pattern¶
"""
Preprocess my-source-2026 data.
Input: Raw data file
Output: Standardized DataFrame with proper index structure
"""
import pandas as pd
from pyprojroot import here
# Load raw data
raw_path = here() / "data/emissions/my-source-2026/emissions_data.csv"
df = pd.read_csv(raw_path)
# Standardize country codes to ISO3c
df["iso3c"] = convert_to_iso3c(df["country_column"])
# Set standard index
df = df.set_index(["iso3c", "unit", "emission-category"])
# Ensure year columns are strings
from fair_shares.library.utils import ensure_string_year_columns
df = ensure_string_year_columns(df)
# Add World row if missing
if "World" not in df.index.get_level_values("iso3c"):
world_row = df.groupby(["unit", "emission-category"]).sum()
world_row["iso3c"] = "World"
df = pd.concat([df, world_row.set_index("iso3c", append=True)])
# Save processed data
output_path = here() / "data/processed/my-source-2026/emissions.csv"
output_path.parent.mkdir(parents=True, exist_ok=True)
df.to_csv(output_path)
Step 4: Index Structure Requirements¶
All data must follow standardized MultiIndex structures:
Emissions¶
```python
# Index: iso3c, unit, emission-category
# Columns: year columns as strings ("1990", "2000", ...)
df.index.names == ["iso3c", "unit", "emission-category"]
```
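The index contract can be asserted directly after preprocessing to catch structure drift early. This is an illustrative check, not a library utility:

```python
import pandas as pd

def check_emissions_structure(df: pd.DataFrame) -> None:
    """Assert the standard emissions MultiIndex and string-year columns."""
    assert list(df.index.names) == ["iso3c", "unit", "emission-category"]
    assert all(isinstance(c, str) and c.isdigit() for c in df.columns)

# Toy frame with the expected structure
df = pd.DataFrame(
    {"1990": [100.0], "2000": [120.0]},
    index=pd.MultiIndex.from_tuples(
        [("USA", "Mt CO2e", "co2-ffi")],
        names=["iso3c", "unit", "emission-category"],
    ),
)
check_emissions_structure(df)  # passes silently
```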
GDP / Population¶
Gini (Stationary)¶
```python
# Index: iso3c, unit
# Columns: "gini" (single value, not time-varying)
df.index.names == ["iso3c", "unit"]
df.columns == ["gini"]
```
Step 5: Integrate with Pipeline¶
The Snakemake workflow automatically picks up sources from the configuration. Ensure your preprocessing notebook:
- Reads from the path specified in the config
- Outputs to the standard processed data location
- Uses consistent index structures
Step 6: Test¶
- Run preprocessing notebook - Verify it completes without errors
- Run allocation with new source - Use in 301 notebook
- Check results - Verify country coverage and data ranges
```python
# In 301 notebook:
active_sources = {
    "target": "rcbs",
    "emissions": "my-source-2026",  # Use new source
    "gdp": "wdi-2025",
    "population": "un-owid-2025",
    "gini": "unu-wider-2025",
    "lulucf": "melo-2026",
}
```
LULUCF Data Sources¶
LULUCF data provides NGHGI-consistent land-use CO2 emissions that replace
the bookkeeping model (BM) estimates in PRIMAP. This is required for total
CO2 (co2) and all-GHG (all-ghg) categories — see
NGHGI Corrections for the science.
What LULUCF preprocessing produces¶
Notebook 107 reads the raw LULUCF source and outputs:
- `emiss_co2-lulucf_timeseries.csv` - country-level NGHGI LULUCF emissions (overwrites the PRIMAP BM version from notebook 101)
- `world_co2-lulucf_timeseries.csv` - world-total LULUCF for RCB corrections
- `bunker_timeseries.csv` - international bunker fuel emissions
- `lulucf_metadata.yaml` - NGHGI start year (enforces allocation year ≥ 2000 for the `co2` category)
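As an illustration, the metadata file might look like the following. The key name is an assumption; only its content (the NGHGI start year and the ≥ 2000 constraint) is documented here:

```yaml
# Hypothetical shape of lulucf_metadata.yaml; the key name is illustrative.
nghgi_start_year: 2000  # earliest allocation year for the co2 category
```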
Adding a new LULUCF source¶
1. Place data in `data/lulucf/{source-name}/`
2. Add a config entry under `lulucf:` in `data_sources_unified.yaml` with `data_parameters` including `format`, `iso3_column`, `year_column`, `value_column`, `category_filter`, `gas_filter`, and `exclude_regions`
3. Create `107_data_preprocess_lulucf_{source-name}.py` following the pattern of `107_data_preprocess_lulucf_melo-2026.py`
4. The Snakefile will pick it up via `active_lulucf_source`
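Putting the first two steps together, a hedged sketch of a new `lulucf:` entry might look like this. The parameter names come from the list above; the values, and whether the entry also takes a top-level `path` key like emissions sources do, are assumptions:

```yaml
lulucf:
  my-lulucf-2026:
    path: "data/lulucf/my-lulucf-2026/lulucf_data.csv"  # assumed, mirroring emissions
    data_parameters:
      format: "long"            # illustrative values throughout
      iso3_column: "iso3"
      year_column: "year"
      value_column: "value"
      category_filter: "LULUCF"
      gas_filter: "CO2"
      exclude_regions: ["WORLD"]
```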
Which categories use LULUCF?¶
| Emission category | Uses LULUCF? | Why |
|---|---|---|
| `co2-ffi` | No | Fossil fuels only |
| `co2` | Yes | Total CO2 = fossil − bunkers + NGHGI LULUCF |
| `all-ghg` | Yes | Decomposes into `co2` (NGHGI) + `non-co2` |
| `all-ghg-ex-co2-lulucf` | No | CO2 component is `co2-ffi` |
| `co2-lulucf` | Indirect | IS the LULUCF data |
| `non-co2` | No | Derived by subtraction |
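The composition rule for the `co2` category (total CO2 = fossil − bunkers + NGHGI LULUCF) can be illustrated with toy numbers; all values below are made up:

```python
import pandas as pd

# Toy series (made-up numbers); real data is country-by-year.
years = ["2019", "2020"]
co2_ffi = pd.Series([10000.0, 9500.0], index=years)      # fossil fuel CO2
bunkers = pd.Series([1100.0, 700.0], index=years)        # international bunkers
nghgi_lulucf = pd.Series([-500.0, -450.0], index=years)  # NGHGI LULUCF (net sink)

# Total CO2 = fossil - bunkers + NGHGI LULUCF
co2_total = co2_ffi - bunkers + nghgi_lulucf
print(co2_total["2019"])  # 8400.0
```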
Scenario Data and NGHGI Consistency¶
When adding or updating scenario data (AR6 or custom), ensure the scenarios use NGHGI-consistent emissions conventions. The pipeline applies NGHGI corrections to remaining carbon budgets (RCBs) to account for the gap between bookkeeping model and NGHGI LULUCF estimates, but scenario pathways must already be internally consistent.
For AR6 scenarios: The Gidden et al. reanalysis provides scenarios that are consistent with PRIMAP historical emissions. NGHGI corrections are applied at the RCB level (adjusting the budget), not at the scenario pathway level.
For custom scenarios: If your scenario data uses a different emissions
convention than PRIMAP/NGHGI, apply the necessary corrections in the
preprocessing notebook (104_data_preprocess_scenarios.py) before
the data enters the pipeline. Do not rely on downstream corrections — the
allocation functions assume scenario data is already convention-consistent.
Normalised scenario schema¶
The pipeline normalises all scenario data into a common schema with
climate-assessment (temperature target) and quantile (probability
percentile) fields. Each scenario source defines its own mapping into this
schema. For example, AR6 maps C1 to climate-assessment="1.5C",
quantile=0.5. All downstream code works with these normalised fields
regardless of the upstream source.
Any new data processing notebook that introduces a scenario source must
output data with climate-assessment and quantile columns conforming to
this schema. See notebook 104 for the AR6 reference implementation.
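The mapping idea can be sketched with a simple record-based representation. Only the C1 → ("1.5C", 0.5) entry is documented above; the `normalise` helper and the record shape are illustrative, not the notebook 104 implementation:

```python
# Hedged sketch: only the C1 -> ("1.5C", 0.5) mapping is documented;
# the record shape and normalise() helper are illustrative.
AR6_CATEGORY_MAP = {
    "C1": {"climate-assessment": "1.5C", "quantile": 0.5},
}

def normalise(records, category_map):
    """Attach the normalised climate-assessment / quantile fields."""
    return [{**rec, **category_map[rec["category"]]} for rec in records]

rows = [{"scenario": "SSP1-1.9", "category": "C1"}]
normalised = normalise(rows, AR6_CATEGORY_MAP)
print(normalised[0]["climate-assessment"])  # 1.5C
```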
RCB Sources and All-GHG Pathway Dependency¶
When adding or updating remaining carbon budget (RCB) data, keep in mind that all-GHG allocations using RCBs require a decomposition into CO₂ (allocated via the budget) and non-CO₂ (allocated via scenario pathways). This means:
- New RCBs must have matching scenario pathways available in the active data source configuration. Without them the non-CO₂ leg has no data to allocate from and `build_data_config()` will raise a `ConfigurationError`.
- The scenario pathways must cover the same climate assessments as the RCBs (e.g., 1.5 °C / 2 °C categories).
- Auto-derivation of pathway approaches for non-CO₂ only works when pathway data is present in the active source set.
See Other Operations: Decomposition for the science and Architecture Walkthrough: Composite Category Decomposition for the code path.
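The coverage constraint can be pre-checked before running an allocation. This sketch mirrors the failure mode described above with placeholder names; it is not the actual `build_data_config()` logic:

```python
# Placeholder names; not the actual build_data_config() logic.
def check_rcb_pathway_coverage(rcb_assessments, pathway_assessments):
    """Raise if any RCB climate assessment lacks a scenario pathway."""
    missing = set(rcb_assessments) - set(pathway_assessments)
    if missing:
        raise ValueError(f"No scenario pathways for: {sorted(missing)}")

check_rcb_pathway_coverage({"1.5C", "2C"}, {"1.5C", "2C"})  # OK, silent
```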
Normative Implications¶
Some data source choices carry normative weight, and contributors should document the rationale for their selections. Key decision points include:
- GDP: PPP vs. MER measurement can significantly affect allocation results — PPP tends to raise developing-country capacity shares [Pelz 2025b].
- Emissions: Production vs. consumption accounting embeds different theories of responsibility. Production accounting (territorial) excludes embedded imports; consumption accounting includes them.
- Population: Projection method choices (UN median, SSP scenarios) affect per capita allocations, particularly for countries with high projected growth.
Validation Requirements¶
New data sources should:
- Cover expected countries - At minimum, major emitters
- Include World total - Required for validation
- Use standard units - Mt CO2e for emissions, persons for population
- Handle missing values - Document any gaps
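These requirements can be sketched as a single validation pass. The helper and the emitter subset below are illustrative, not part of the library:

```python
import pandas as pd

MAJOR_EMITTERS = {"CHN", "USA", "IND"}  # illustrative subset

def validate_source(df: pd.DataFrame) -> list:
    """Return a list of problems found (empty list = all checks pass)."""
    problems = []
    countries = set(df.index.get_level_values("iso3c"))
    if not MAJOR_EMITTERS <= countries:
        problems.append(f"missing emitters: {sorted(MAJOR_EMITTERS - countries)}")
    if "World" not in countries:
        problems.append("no World total row")
    if set(df.index.get_level_values("unit")) != {"Mt CO2e"}:
        problems.append("non-standard units")
    if df.isna().any().any():
        problems.append("undocumented missing values")
    return problems

idx = pd.MultiIndex.from_product(
    [["CHN", "USA", "IND", "World"], ["Mt CO2e"], ["co2-ffi"]],
    names=["iso3c", "unit", "emission-category"],
)
df = pd.DataFrame({"2020": [1.0] * 4}, index=idx)
print(validate_source(df))  # []
```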
Existing Notebooks as Examples¶
| Notebook | Data Type | Good Example Of |
|---|---|---|
| `101_data_preprocess_emiss_primap-202503.py` | Emissions | NetCDF processing, category mapping |
| `102_data_preprocess_gdp_wdi-2025.py` | GDP | CSV processing, country code mapping |
| `103_data_preprocess_population_un-owid-2025.py` | Population | Combining historical and projected data |
| `105_data_preprocess_gini_unu-wider-2025.py` | Gini | Quality filtering, stationary output |
| `105_data_preprocess_gini_wid-2025.py` | Gini | WID.world processing, stationary output |
| `107_data_preprocess_lulucf_melo-2026.py` | LULUCF | NGHGI corrections, metadata export, world totals |
See Also¶
- Data Sources Config: `conf/data_sources/data_sources_unified.yaml` in the repository
- Validation Utilities - Data validation functions