.. _dataset_processing_flow:

Dataset Processing Flow
=======================

This page documents the request decision tree that remains after the director
cleanup. There is no longer a ``Director`` class; the name now refers to the
small wrapper around request planning and execution used by WPS processes.

The diagram is written as a blueprint for a future reimplementation: first
decide what kind of input was received, then decide whether catalog metadata is
needed, then choose between returning existing files and running an operation.

Director Decision Tree
----------------------

.. mermaid::

   flowchart TD
       Start(["Request arrives from a WPS process
       or workflow stage"])

       subgraph Input["1. Understand the input"]
           Start --> HasFiles{"Do we already have concrete files
           from an earlier workflow step?"}
           HasFiles -- yes --> DirectFiles["Normalize those files
           and run the requested operation"]
           HasFiles -- no --> Project["Read the project from
           the requested collection id"]
       end

       subgraph Catalog["2. Resolve catalog data when needed"]
           Project --> UsesCatalog{"Does this project use
           a Rook catalog?"}
           UsesCatalog -- no --> PlainOperation["Keep the original collection
           and run the operation"]
           UsesCatalog -- yes --> Search["Search the catalog
           for collection and time"]
           Search --> Found{"Did the catalog find
           every requested collection?"}
           Found -- no --> Reject["Reject the request
           as an invalid collection"]
       end

       subgraph Choice["3. Choose response type"]
           Found -- yes --> WantsOriginal{"Can the request return
           existing catalog files?"}
           WantsOriginal -- "yes: original_files
           or atlas shortcut" --> CatalogOriginal["Return catalog download URLs"]
           WantsOriginal -- no --> ChangesData{"Does the operation need
           new data to be written?"}
           ChangesData -- "yes: average, regrid,
           or dimension change" --> CatalogOperation["Resolve catalog files
           for processing"]
           ChangesData -- no --> Aligned{"Does the subset match
           whole source files?"}
           Aligned -- yes --> AlignedOriginal["Return only the matching
           download URLs"]
           Aligned -- no --> CatalogOperation
       end

       subgraph Run["4. Execute or adapt the result"]
           DirectFiles --> BuildSources["Build dataset sources"]
           PlainOperation --> BuildSources
           CatalogOperation --> BuildSources
           BuildSources --> Open["Detect data format and transport
           NetCDF, Zarr, Kerchunk, file, HTTP, S3"]
           Open --> Fixes["Apply internal dataset fixes
           when a dataset id is known"]
           Fixes --> Operation["Run subset, average,
           regrid, concat, or weighted average"]
           CatalogOriginal --> OriginalResponse["Return original-file response"]
           AlignedOriginal --> OriginalResponse
           Operation --> OutputResponse["Return operation output files"]
       end

Decision Ownership
------------------

``rook.operations.execution.Operator.call`` decides whether a request is
already a file list from a previous workflow step. Those requests bypass
catalog planning and run the operation runner directly with a ``FileMapper``.

``rook.director.planning.plan_request`` handles catalog-backed requests. It
resolves the project, validates catalog search results, and chooses between an
original-file response and operation execution.

``rook.director.execution.execute_plan`` adapts the plan into output URIs. It
collects original file URLs when processing is skipped, otherwise it prepares
operation inputs and calls the operation runner.

``rook.operations.consolidate`` converts operation collections into
``DatasetSource`` values. It keeps direct Zarr, Kerchunk, and S3 inputs out of
catalog lookup, resolves catalog-backed NetCDF datasets to files, and preserves
dataset IDs where they are needed for dataset fixes.

``rook.io.datasets`` owns format and transport detection, storage options, and
dataset opening. Catalog-specific fixes are applied only when a
``DatasetSource`` has a dataset ID.

Blueprint for Reimplementation
------------------------------

The future director should be a planner, not an operation runner. It should
return one explicit decision value that describes what the caller must do next:

* reject the request with a known error;
* return original files;
* run an operation with the original collection;
* run an operation with catalog-resolved dataset sources.

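
The four outcomes listed above can be sketched as a small pure function that
walks the catalog branch of the decision tree in diagram order. This is only an
illustration under stated assumptions: ``Decision``, ``CatalogFacts``, and
``plan`` are hypothetical names that do not exist in ``rook``, and the boolean
fields stand in for the real project, catalog search, and alignment checks.

.. code-block:: python

   from dataclasses import dataclass
   from enum import Enum


   class Decision(Enum):
       """Hypothetical labels for the four planner outcomes."""

       REJECT = "reject"
       RETURN_ORIGINAL_FILES = "return_original_files"
       RUN_WITH_ORIGINAL_COLLECTION = "run_with_original_collection"
       RUN_WITH_RESOLVED_SOURCES = "run_with_resolved_sources"


   @dataclass(frozen=True)
   class CatalogFacts:
       """Hypothetical answers to the questions asked in the diagram."""

       uses_catalog: bool
       all_collections_found: bool = True
       wants_original_files: bool = False  # original_files or atlas shortcut
       writes_new_data: bool = False  # average, regrid, or dimension change
       subset_matches_whole_files: bool = False


   def plan(facts: CatalogFacts) -> Decision:
       """Return exactly one decision for a collection-based request."""
       if not facts.uses_catalog:
           return Decision.RUN_WITH_ORIGINAL_COLLECTION
       if not facts.all_collections_found:
           return Decision.REJECT
       if facts.wants_original_files:
           return Decision.RETURN_ORIGINAL_FILES
       if facts.writes_new_data:
           return Decision.RUN_WITH_RESOLVED_SOURCES
       if facts.subset_matches_whole_files:
           return Decision.RETURN_ORIGINAL_FILES
       return Decision.RUN_WITH_RESOLVED_SOURCES

Because the function only reads already-computed facts, each branch of the
diagram can be covered with a plain unit test and no catalog access. Requests
that arrive as workflow files are deliberately left out, since the diagram
routes them straight to the operation.
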

The planner should keep these responsibilities separate:

* input classification: workflow files versus collection IDs;
* project and catalog resolution;
* original-file eligibility;
* subset-to-file alignment;
* construction of operation sources;
* WPS response and exception adaptation.

The execution side should be boring on purpose. Given a plan, it should either
collect original-file URLs or prepare operation inputs and call the supplied
runner. It should not repeat catalog decisions.

A future type model could make the decision tree easier to read in code:

.. code-block:: python

   RequestDecision = (
       InvalidRequest
       | ReturnOriginalFiles
       | RunWithOriginalCollection
       | RunWithResolvedSources
   )

The important boundary is that catalog planning decides *what should happen*,
while operation execution decides *how to run the selected operation*.
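
One way to picture that boundary is an executor that only dispatches on the
decision it receives. The sketch below is an assumption-laden illustration, not
the ``rook`` implementation: the dataclass fields, the ``Runner`` alias, and
the ``execute`` function are all hypothetical, inputs are simplified to plain
strings, and the ``|`` union and ``match`` statement assume Python 3.10 or
newer.

.. code-block:: python

   from dataclasses import dataclass
   from typing import Callable, Sequence


   @dataclass(frozen=True)
   class InvalidRequest:
       reason: str


   @dataclass(frozen=True)
   class ReturnOriginalFiles:
       urls: Sequence[str]


   @dataclass(frozen=True)
   class RunWithOriginalCollection:
       collection: Sequence[str]


   @dataclass(frozen=True)
   class RunWithResolvedSources:
       sources: Sequence[str]


   RequestDecision = (
       InvalidRequest
       | ReturnOriginalFiles
       | RunWithOriginalCollection
       | RunWithResolvedSources
   )

   # Hypothetical runner signature: operation inputs in, output URIs out.
   Runner = Callable[[Sequence[str]], list[str]]


   def execute(decision: RequestDecision, runner: Runner) -> list[str]:
       """Adapt a decision into output URIs without revisiting the catalog."""
       match decision:
           case InvalidRequest(reason=reason):
               raise ValueError(reason)
           case ReturnOriginalFiles(urls=urls):
               return list(urls)  # processing is skipped entirely
           case RunWithOriginalCollection(collection=inputs):
               return runner(inputs)
           case RunWithResolvedSources(sources=inputs):
               return runner(inputs)
       raise TypeError(f"unsupported decision: {decision!r}")

The executor never asks whether a project uses a catalog or whether a subset
aligns with whole files; those answers are already encoded in the type of the
decision it was handed.
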