Tutorial: Idealised isosceles experiment ======================================== This walkthrough follows the canonical idealised CSA experiment in ``runs/idealised_isosceles.py``. It generates a synthetic terrain whose spectrum is known a priori, then runs four ways of recovering that spectrum from the (masked) topography: * a **pure least-squares Fourier fit** (pLSFF), unregularised — kept for comparison with the JAMES 2024 baseline; * a **regularised** LSFF (regLSFF); * the **optimal** Constrained Spectral Approximation (optCSA) with the known mode count; * a **sub-optimal** CSA (subCSA) that's told to recover fewer modes than the terrain actually has. Running the script as-is reproduces all four spectra plus the reference in a single call. After ``pip install -e .`` it's one command:: pycsa-idealised The script is deterministic (``np.random.seed(777)``) so the numerical output is the same on every run, modulo cross-LAPACK drift in pLSFF. Step 1 — Generate a synthetic terrain ------------------------------------- The terrain is built from a sum of cosines and sines on a regular ``(lon, lat)`` mesh inside an isosceles triangle cell. The seeded generator picks 22 unique ``(k, l)`` mode coordinates and per-mode amplitudes ``Ak``: .. code-block:: python pts, nk, nl, Ak, Al, sck, _scl = _generate_terrain(seed=777) # pts: array of 22 unique (k, l) wavenumber pairs # Ak: per-mode amplitudes in [0, 100) # sck: 0 → cosine basis, 1 → sine basis These determine the reference spectrum ``freqs_ref`` directly — the spectrum the four estimators below are trying to recover. Step 2 — Build the cell + paint topography ------------------------------------------ :func:`pycsa.core.utils.isosceles` returns the vertex indices of an isosceles triangle in the underlying meshgrid; the cell is masked to that triangle: .. code-block:: python grid = var.grid() cell = var.topo_cell() vid = utils.isosceles(grid, cell) lat_v, lon_v = grid.clat_vertices[vid], grid.clon_vertices[vid] cell.gen_mgrids() # paint the sum of basis functions cell.topo = sum(Ak[i] * basis(nk[i], nl[i], ...) for i in range(22)) triangle = utils.gen_triangle(lon_v, lat_v) cell.get_masked(triangle=triangle) See :class:`pycsa.data.cell.topo_cell` for the cell container and :func:`pycsa.core.utils.gen_triangle` for the triangle mask. Step 3 — Pure LSFF (broken, but informative) -------------------------------------------- The unregularised least-squares fit uses every wavenumber in the spectrum and over-fits dramatically: .. code-block:: python pure_lsff = interface.get_pmf(nhi=12, nhj=12, U=1.0, V=1.0) freqs_pLSFF, _, _ = pure_lsff.sappx(cell, lmbda=0.0, iter_solve=False) L2 error against the reference is ``≈164,000`` — pLSFF's amplitudes blow up because the design matrix is rank-deficient on the triangle support. This is the failure mode that motivates regularisation and mode selection. .. note:: pLSFF is the only experiment whose absolute amplitudes drift across LAPACK builds (the reproducibility suite uses ``rtol=1e-2`` for the four pLSFF-affected variables for this reason). The regularised methods below are bit-identical across platforms. Step 4 — Regularised LSFF ------------------------- A small Tikhonov term ``λ`` tames the over-fit: .. code-block:: python reg_lsff = interface.get_pmf(nhi=12, nhj=12, U=1.0, V=1.0) freqs_regLSFF, _, _ = reg_lsff.sappx(cell, lmbda=8e-5, iter_solve=False) L2 error drops to ``≈115``. The spectrum is now a plausible approximation of the reference, but it still has energy spread across many wavenumbers. Step 5 — Constrained spectral approximation ------------------------------------------- CSA is a two-step procedure: 1. A first-guess regularised LSFF identifies the top-N wavenumbers by amplitude. 2. A second LSFF is run *constrained* to those wavenumbers only. .. code-block:: python # First guess: regularised LSFF first_guess = interface.get_pmf(nhi=12, nhj=12, U=1.0, V=1.0) freqs_fg, _, _ = first_guess.sappx(cell, lmbda=1e-1, iter_solve=False) # Pick top N modes k_idxs, l_idxs = top_n_indices(freqs_fg, n=22) # Constrained second LSFF second_guess = interface.get_pmf(nhi=12, nhj=12, U=1.0, V=1.0) second_guess.fobj.set_kls(k_idxs, l_idxs, recompute_nhij=False) freqs_optCSA, _, _ = second_guess.sappx( cell, lmbda=1e-6, updt_analysis=True, iter_solve=False ) With ``N = 22`` (matching the terrain's known mode count), L2 error falls to ``≈86``. With ``N = 14`` (fewer modes than the terrain has), error rises to ``≈111`` — the algorithm gracefully degrades when the mode budget is insufficient. See :class:`pycsa.wrappers.interface.get_pmf` for the ``sappx`` / ``set_kls`` API. Step 6 — Compare ---------------- The script renders a figure showing the original masked topography, the reconstructed topography from regLSFF and subCSA, and the reference / regLSFF / subCSA spectra: .. figure:: _static/idealised_tutorial.png :alt: Idealised isosceles spectra + reconstructions :align: center :width: 100% Top row: original masked topography (reference), regLSFF reconstruction, sub-optimal CSA reconstruction (N=14). Bottom row: the corresponding spectra. The reference column shows the 22 known modes; regLSFF spreads energy across many wavenumbers; subCSA recovers a sharp subset. The L2 errors against the reference for the default seed:: reference : 0.00 pLSFF : 163750.14 (broken — see Step 3) regLSFF : 115.71 optCSA : 85.68 subCSA : 111.37 pLSFF_quad : 163750.14 Reproducibility: the four CSA-family rows are bit-identical across platforms. pLSFF drifts sub-1% across LAPACK builds (so its exact value here will differ slightly on your machine) and is pinned at ``rtol=1e-2`` in the reproducibility suite. Running it yourself ------------------- Three equivalent invocations after ``pip install -e .``:: pycsa-idealised # the console script python -m runs.idealised_isosceles # module-runner python ./runs/idealised_isosceles.py # direct All three accept ``--seed`` and ``--n-modes`` flags. With the defaults (``seed=777``, ``n_modes=14``) the numerical summary above is what you get on stdout. Where to go next ---------------- * **Real data, laptop-sized:** ``examples/icon_regional_minimal.py`` runs the full pipeline on a real ICON cell (Aleutians, ~52°N) using a bundled MERIT slice. ~10 s end-to-end, no manual data setup. * **Real data with ocean-aware masking:** the :doc:`showcase` runs the pipeline on a central-Andes cell using bundled **ETOPO** data and the production diagnostic plot — the same data path and masking as the global HPC run, on one cell. * **Production HPC runs:** see :doc:`hpc_reproducibility` for the global ICON+ETOPO pipeline (``runs/icon_etopo_global.py``) — SLURM submission, memory batching, restart story, and tile-cache lifecycle. * **Gating numerical refactors:** the reproducibility suite at ``tests/reproducibility/`` pins the output of three canonical cases (idealised, regional MERIT on the Aleutians, ETOPO single-cell at the south pole) and gates every refactor PR.