Architecture Walkthrough¶

This document traces the full execution path of a fair-shares allocation, from the notebook configuration cell to the final parquet file on disk. Use it to build a mental model of the codebase before making changes.

Intended audience: Developers who need to understand where code runs and why the layers exist, not the climate science behind the equity principles (see Scientific Documentation for that).

Layer Map¶

The codebase has four layers. A request flows top-to-bottom; data flows bottom-to-top.

Text Only

 Layer 0  Notebook / CLI                 notebooks/301_*.py
 Layer 1  Helpers                        notebook_helpers.py
 Layer 2  Manager                        allocations/manager.py
 Layer 3  Math                           allocations/budgets/per_capita.py
                                         allocations/pathways/per_capita.py
                                         allocations/pathways/per_capita_convergence.py
                                         allocations/pathways/cumulative_per_capita_convergence.py

Layer	Module	Responsibility	Key entry point
0	`notebooks/301_custom_fair_share_allocation.py`	User configuration, data pipeline trigger, visualization	Config cell (lines 57-109), execution cell (lines 240-270)
1	`src/.../notebook_helpers.py`	Extract boilerplate from notebooks; load data, run all allocations, print summary	`load_allocation_data()`, `run_all_allocations()`, `run_and_save_category_allocations()`
2	`src/.../allocations/manager.py`	Approach registry, parameter grid expansion, single allocation dispatch, result saving, budget/pathway classification	`run_parameter_grid()`, `run_allocation()`, `get_function()`, `is_budget_approach()`
3	`src/.../allocations/budgets/per_capita.py`	Actual math: population shares, pre-allocation responsibility/capability adjustments	`_per_capita_budget_core()`, `equal_per_capita_budget()`

Result containers live alongside the math layer:

Container	Module	Holds
`BudgetAllocationResult`	`allocations/results/__init__.py`	`relative_shares_cumulative_emission` -- single-year column, sums to 1.0
`PathwayAllocationResult`	`allocations/results/__init__.py`	`relative_shares_pathway_emissions` -- multi-year columns, each sums to 1.0

Both containers store relative shares (dimensionless fractions). Absolute emissions are computed later by multiplying shares by a global budget or world pathway.

Worked Example 1: Budget Allocation for CO2-FFI under RCBs¶

This traces what happens when a user configures notebook 301 with:

Python

emission_category = "co2-ffi"
active_sources = {"target": "rcbs", ...}
allocations = {
    "equal-per-capita-budget": [{"allocation_year": [2015], "preserve_allocation_year_shares": [False]}]
}

Step 1: Data Pipeline (notebook 301, lines 146-170)¶

The notebook calls setup_data() from src/.../utils/data/setup.py. This function:

Validates config via build_data_config() (utils/data/config.py) which loads conf/data_sources/data_sources_unified.yaml, filters to the selected target, and validates with Pydantic (config/models.py).
Builds paths via build_data_paths() (utils/data/setup.py).
Generates a Snakemake command via generate_snakemake_command() (utils/data/setup.py).
Executes Snakemake via execute_snakemake_setup() (utils/data/setup.py). This runs the Snakefile which orchestrates the 100-series preprocessing notebooks (emissions, GDP, population, Gini, scenarios) and produces CSV files under output/<source_id>/intermediate/processed/.

The Snakefile (Snakefile, line 208 rule all) chains: compose_config -> preprocess_emiss -> preprocess_gdp -> preprocess_population -> preprocess_gini -> preprocess_lulucf -> master_preprocess.

For target=rcbs, the master notebook is 100_data_preprocess_rcbs (Snakefile lines 131-135). No scenario notebook runs because RCBs are cumulative budgets, not pathways.

Output: output/<source_id>/intermediate/processed/ containing:

country_emissions_co2-ffi_timeseries.csv
world_emissions_co2-ffi_timeseries.csv
rcbs_co2-ffi.csv
country_gdp_timeseries.csv
country_population_timeseries.csv
country_gini_stationary.csv

Step 2: Load Data (notebook 301, lines 240-245)¶

load_allocation_data() (notebook_helpers.py) reads the processed CSVs into DataFrames. For RCB runs, it loads:

emissions_data["co2-ffi"] -- country emissions indexed by [iso3c, unit, emission-category]
rcbs_data["co2-ffi"] -- RCB table with columns: source, climate-assessment, quantile, rcb_2020_mt
world_emissions_data["co2-ffi"] -- world historical emissions (used to convert RCB values to total budgets)
Socioeconomic DataFrames: GDP, population, Gini (with validation)

Step 3: Run Allocations (notebook 301, lines 262-270)¶

run_all_allocations() (notebook_helpers.py) orchestrates the full run:

Splits approaches into budget vs pathway using is_budget_approach() from the manager.
Iterates over final_categories. For co2-ffi with RCBs, there is one category: ("co2-ffi",).
Delegates to run_and_save_category_allocations().

Step 4: Category-Level Budget Runner (notebook_helpers.py)¶

_run_budget_allocations() does two things:

4a. Compute share allocations:

Calls run_parameter_grid() from manager.py with all budget approach configs. This is called once per category because shares depend only on socioeconomic data and approach parameters, not on individual RCB values.

4b. Iterate over RCB rows:

For each row in the RCB table (e.g., "1.5C|0.5 from IPCC AR6"):

Convert the RCB value (GtCO2 remaining from 2020) to a total budget for the allocation year using calculate_budget_from_rcb().
Multiply relative shares by the total budget: result.get_absolute_budgets(cumulative_budget) -- calls BudgetAllocationResult.get_absolute_budgets().
Save via save_allocation_result() to a parquet file.

Step 5: Parameter Grid Expansion (manager.py)¶

run_parameter_grid() expands the config:

Python

{"equal-per-capita-budget": [{"allocation_year": [2015], "preserve_allocation_year_shares": [False]}]}

Validates target-source compatibility and allocation years.
Iterates approaches. For each approach:
Converts kebab-case keys to snake_case.
Validates parameters.
Expands parameter lists into combinations via _expand_parameters(). Here: 1 year x 1 preserve setting = 1 combination.
Calls run_allocation() for each combination.

Step 6: Single Allocation Dispatch (manager.py)¶

run_allocation():

Looks up the function: get_function("equal-per-capita-budget") from the approach registry in manager.py, which returns equal_per_capita_budget.
Builds func_args dict with all data + parameters.
Validates (validate_function_parameters()) and filters to only the parameters the function accepts (filter_function_parameters()).
Calls the math function.

Step 7: The Math (budgets/per_capita.py)¶

equal_per_capita_budget() delegates to _per_capita_budget_core() with pre_allocation_responsibility_weight=0.0 and capability_weight=0.0.

Inside _per_capita_budget_core():

Filter population to allocation year onwards.
Convert units to common scale.
Since no adjustments, skip pre-allocation responsibility/capability blocks.
Calculate shares:
group_totals = base_population.sum(axis=1) -- sum each country's population from allocation year onward.
world_totals = groupby_except_robust(group_totals, group_level) -- sum across all countries.
shares = group_totals / world_totals -- each country's fraction.
Apply deviation constraint if max_deviation_sigma is set.
Return a BudgetAllocationResult with the shares DataFrame.

Step 8: Result Serialization¶

Back in _run_budget_allocations() (notebook_helpers.py), save_allocation_result() (manager.py) delegates to results/serializers.py. The serializer:

Adds metadata columns (approach, climate-assessment, quantile, data sources) from results/metadata.py.
Appends rows to the consolidated parquet files:
allocations_relative.parquet -- dimensionless shares
allocations_absolute.parquet -- shares multiplied by the global budget (MtCO2)

After all RCB rows and approaches, create_param_manifest() writes param_manifest.csv and generate_readme() writes documentation text files.

Worked Example 2: Pathway Allocation for all-GHG (Decomposition)¶

This traces a more complex case:

Python

emission_category = "all-ghg"
active_sources = {"target": "rcbs", ...}
allocations = {
    "equal-per-capita-budget": [{"allocation_year": [2015], "preserve_allocation_year_shares": [False]}]
}

Why decomposition?¶

RCBs only constrain CO2. For all-GHG, the system must decompose into:

CO2 component (co2): allocated via budget approach (RCBs)
non-CO2 component (non-co2): allocated via pathway approach (e.g. AR6 scenarios)

This logic lives in utils/data/config.py:

is_composite_category("all-ghg") returns True
needs_decomposition("rcbs", "all-ghg") returns True
get_final_categories("rcbs", "all-ghg") returns ("co2", "non-co2")
get_co2_component("all-ghg") returns "co2"

Step 1: Data Pipeline (Snakefile decomposition)¶

The Snakefile detects is_multi_category = True and:

Runs emissions preprocessing for all source categories (e.g. PRIMAP): co2-ffi, co2, co2-lulucf, all-ghg-ex-co2-lulucf (Snakefile lines 262-286).
Runs scenario preprocessing (e.g. AR6) for derivation sources: co2-ffi and all-ghg-ex-co2-lulucf (Snakefile lines 433-483).
Derives non-CO2 by subtraction (Snakefile lines 493-538):
Historical: non-co2 = all-ghg-ex-co2-lulucf - co2-ffi
Scenarios: same subtraction on scenario data.
Runs master preprocessing twice (Snakefile lines 548-597):
CO2 pass: 100_data_preprocess_rcbs.ipynb for co2
non-CO2 pass: 100_data_preprocess_pathways.ipynb for non-co2

Step 2: Auto-Derive Pathway Approaches¶

In run_all_allocations() (notebook_helpers.py):

The user only defined budget approaches (equal-per-capita-budget). But non-CO2 needs pathway approaches. The helper auto-derives them:

Python

if budget_allocs and not pathway_allocs:
    pathway_allocs = derive_pathway_allocations(budget_allocs)

derive_pathway_allocations() (manager.py) maps:

equal-per-capita-budget -> equal-per-capita
allocation_year -> first_allocation_year
preserve_allocation_year_shares -> preserve_first_allocation_year_shares

Step 3: Category Loop¶

run_all_allocations() iterates over final_categories = ("co2", "non-co2"):

Pass 1: co2 (budget)

is_budget_target("rcbs", "co2") returns True (config.py)
Uses budget_allocs dict
Follows the same path as Worked Example 1

Pass 2: non-co2 (pathway)

is_budget_target("rcbs", "non-co2") returns False
Uses auto-derived pathway_allocs dict
Calls _run_pathway_allocations() (notebook_helpers.py)

Step 4: Pathway Runner¶

_run_pathway_allocations() (notebook_helpers.py):

Groups scenarios by climate-assessment and quantile.
For each group, extracts the World totals timeseries.
Calls run_parameter_grid() with pathway approaches and the world scenario data.
For each result:
PathwayAllocationResult.get_absolute_emissions(world_ts) multiplies year-by-year shares by the global pathway.
Saves to parquet.

Step 5: Pathway Math¶

Pathway allocations flow through pathways/per_capita.py (_per_capita_core()). The key difference from budget math:

Uses first_allocation_year instead of allocation_year.
Produces multi-year shares (one column per year from first_allocation_year onward).
If preserve_first_allocation_year_shares=False, each year's shares reflect that year's population (dynamic shares).
Returns PathwayAllocationResult instead of BudgetAllocationResult.

Step 6: Scenario Labels¶

Each scenario source's native categories are mapped to normalised climate-assessment and quantile fields during preprocessing (notebook 104). This normalisation happens once at data loading time, so all downstream code — including non-CO₂ pathways — works with a consistent schema regardless of the upstream source. New scenario sources define their own mapping into the same normalised format, and any new data processing notebook must output data in this schema.

For example, AR6 categories map as: C1 → climate-assessment="1.5C", quantile=0.5; C3 → climate-assessment="2C", quantile=0.66; C2 → climate-assessment="2C", quantile=0.83.

Data Preprocessing Pipeline¶

The preprocessing pipeline transforms raw data sources into analysis-ready CSVs. It runs via Snakemake, orchestrated by the Snakefile.

Pipeline Architecture¶

Text Only

conf/data_sources/data_sources_unified.yaml   (source of truth for config)
        |
        v
   Snakefile                                   (DAG orchestration)
        |
        +---> compose_config                   (Pydantic validation)
        |         |
        +---> preprocess_emiss (101_*.ipynb)   (per emission category)
        +---> preprocess_gdp   (102_*.ipynb)
        +---> preprocess_population (103_*.ipynb)
        +---> preprocess_gini  (105_*.ipynb)
        +---> preprocess_lulucf (107_*.ipynb)  (NGHGI LULUCF, bunkers, metadata)
        +---> preprocess_scenarios (104/106_*.ipynb)  (if pathway/rcb-pathways)
        |         |
        +---> [derive_non_co2]                 (if decomposition)
        |         |
        +---> master_preprocess (100_*.ipynb)  (combines all, produces final CSVs)

Key Preprocessing Modules¶

Module	Location	Purpose
`DataPreprocessor`	`pipeline/preprocessing.py`	Common preprocessing: load, validate, filter to analysis countries, add ROW
`run_rcb_preprocessing()`	`pipeline/preprocessing.py`	RCB-specific: NGHGI corrections, RCB processing
`run_pathway_preprocessing()`	`pipeline/preprocessing.py`	Pathway-specific: scenario loading and processing
`run_composite_preprocessing()`	`pipeline/preprocessing.py`	Composite: 2-pass decomposition for all-GHG
`run_non_co2_preprocessing()`	`pipeline/preprocessing.py`	Derive non-CO2 by subtraction, then pathway-process

Output Directory Structure¶

Text Only

output/<source_id>/
  config.yaml                          (validated configuration)
  notebooks/                           (executed preprocessing notebooks)
  intermediate/
    emissions/                         (per-category emission CSVs)
      world_co2-lulucf_timeseries.csv  (NGHGI world LULUCF for RCB corrections)
      bunker_timeseries.csv            (international bunker emissions)
      lulucf_metadata.yaml             (NGHGI start year, splice year)
    gdp/                               (GDP timeseries)
    population/                        (population timeseries)
    gini/                              (Gini coefficients)
    scenarios/                         (scenario timeseries, pathway mode)
    processed/                         (final analysis-ready CSVs)
      country_emissions_*.csv
      country_gdp_timeseries.csv
      country_population_timeseries.csv
      country_gini_stationary.csv
      rcbs_*.csv                       (budget mode only)
      world_emissions_*.csv            (budget mode only)
      world_scenarios_*_complete.csv   (pathway mode only)
  allocations/
    <folder_name>/
      allocations_relative.parquet     (dimensionless shares)
      allocations_absolute.parquet     (MtCO2/MtCO2eq)
      allocations_wide.csv             (Excel-friendly wide format)
      param_manifest.csv               (all parameter combinations)
      README_*.txt                     (auto-generated docs)

"Where Do I Change X?" Quick Reference¶

I want to...	File(s) to edit	Key function/class
Add a new allocation approach	`allocations/budgets/.py` or `allocations/pathways/.py`, then `allocations/manager.py`	Write the math function, add to `get_allocation_functions()` dict in `manager.py`
Change how parameters are expanded	`allocations/manager.py`	`_expand_parameters()`, `run_parameter_grid()`
Add a new data source	`conf/data_sources/data_sources_unified.yaml`, new `notebooks/10x_*.py`, update `Snakefile`	Add YAML config, write preprocessing notebook, add Snakefile rule
Change validation rules	`src/.../validation/`	`validate_allocation_parameters()`, `validate_target_source_compatibility()`
Modify output parquet schema	`allocations/results/metadata.py`	`DATA_CONTEXT_COLUMNS`, `ALLOCATION_PARAMETER_COLUMNS`
Add a new emission category	`utils/data/config.py`	`get_final_categories()`, `get_emission_preprocessing_categories()`
Change how RCBs are processed	`preprocessing/rcbs.py`, `utils/data/rcb.py`	`load_and_process_rcbs()`, `calculate_budget_from_rcb()`
Add/modify NGHGI corrections	`utils/data/nghgi.py`, `config/models.py` (`AdjustmentsConfig`)	`build_nghgi_world_co2_timeseries()`
Change the notebook helpers	`notebook_helpers.py`	`load_allocation_data()`, `run_all_allocations()`, `run_and_save_category_allocations()`
Change Snakemake pipeline	`Snakefile`, `utils/data/setup.py`	Rules in Snakefile, `setup_data()`
Add a visualization	`visualization/`	`plot_allocation_comparison()`, `plot_decomposition_summary()`
Change how composite categories decompose	`utils/data/config.py`, `pipeline/preprocessing.py`	`needs_decomposition()`, `run_composite_preprocessing()`
Change result serialization	`allocations/results/serializers.py`	`save_allocation_result()`
Modify the budget-to-pathway derivation	`allocations/manager.py`	`_BUDGET_TO_PATHWAY` dict, `_PARAM_RENAMES` dict, `derive_pathway_allocations()`

Key Concepts¶

Relative Shares vs Absolute Emissions¶

The system separates allocation (who gets what fraction) from quantification (how large the pie is). Math functions return relative shares (dimensionless, sum to 1.0). The helpers layer multiplies these by a concrete global budget or pathway to produce absolute emissions.

This separation means the same equity-based allocation can be applied to different RCB estimates or scenario pathways without re-running the math.

Budget vs Pathway Approaches¶

Budget approaches (names ending in -budget) produce a single column of shares for one allocation year. They answer: "What fraction of the remaining cumulative budget does each country get?"

Pathway approaches produce a column per year. They answer: "What fraction of each year's global emissions does each country get?"

The manager (manager.py) classifies approaches by checking whether the name ends in -budget. This convention is load-bearing.

The Parameter Grid¶

Users specify parameters as lists in the config:

Python

{"allocation_year": [2015, 2020], "capability_weight": [0.25, 0.5]}

_expand_parameters() (manager.py) uses itertools.product to create all combinations (here: 4). run_parameter_grid() iterates over them and calls run_allocation() for each.

Composite Category Decomposition¶

When emission_category="all-ghg" and target="rcbs", the system cannot allocate all-GHG directly because RCBs only constrain CO2. The solution:

Decompose into CO2 (budget allocation via RCBs) + non-CO2 (pathway allocation via e.g. AR6 scenarios).
Derive non-CO2 data by subtraction: all-ghg-ex-co2-lulucf - co2-ffi.
Auto-derive pathway approach configs from budget configs (the user only specifies budget approaches).

The final outputs for each sub-category can be recombined downstream.

Data dependency: The non-CO2 leg requires scenario pathway data (e.g. AR6) that covers the same climate assessments as the RCBs. Adding new RCBs without matching scenario pathways in the active data source configuration will cause a ConfigurationError at validation time. Auto-derivation of pathway approaches also only works when this pathway data is available. See Other Operations: Decomposition.

Kebab-Case vs Snake-Case Convention¶

Config keys and approach names use kebab-case (allocation-year, equal-per-capita-budget). Python identifiers use snake_case (allocation_year, equal_per_capita_budget). The manager converts between them:

Python

params = {k.replace("-", "_"): v for k, v in params.items()}

Year Columns as Strings¶

All DataFrames use string year columns ("2020", not 2020). Call ensure_string_year_columns(df) after loading any CSV. This convention prevents pandas from treating years as integer indices, which causes subtle alignment bugs.

The Rest-of-World (ROW) Pattern¶

During preprocessing, the DataPreprocessor (pipeline/preprocessing.py) filters datasets to countries with complete data across all sources, then adds a "ROW" (Rest of World) row as the residual between the world total and the sum of included countries. This ensures allocations always cover 100% of global emissions.

Architecture Walkthrough¶

Layer Map¶

Worked Example 1: Budget Allocation for CO2-FFI under RCBs¶

Step 1: Data Pipeline (notebook 301, lines 146-170)¶

Step 2: Load Data (notebook 301, lines 240-245)¶

Step 3: Run Allocations (notebook 301, lines 262-270)¶

Step 4: Category-Level Budget Runner (notebook_helpers.py)¶

Step 5: Parameter Grid Expansion (manager.py)¶

Step 6: Single Allocation Dispatch (manager.py)¶

Step 7: The Math (budgets/per_capita.py)¶

Step 8: Result Serialization¶

Worked Example 2: Pathway Allocation for all-GHG (Decomposition)¶

Why decomposition?¶

Step 1: Data Pipeline (Snakefile decomposition)¶

Step 2: Auto-Derive Pathway Approaches¶

Step 3: Category Loop¶

Step 4: Pathway Runner¶

Step 5: Pathway Math¶

Step 6: Scenario Labels¶

Data Preprocessing Pipeline¶

Pipeline Architecture¶

Key Preprocessing Modules¶

Output Directory Structure¶

"Where Do I Change X?" Quick Reference¶

Key Concepts¶

Relative Shares vs Absolute Emissions¶

Budget vs Pathway Approaches¶

The Parameter Grid¶

Composite Category Decomposition¶

Kebab-Case vs Snake-Case Convention¶

Year Columns as Strings¶

The Rest-of-World (ROW) Pattern¶

See Also¶