  1. JASMIN CIS
  2. JASCIS-248

Aggregation of individual dimensions in ungridded data

    Details

    • Type: Epic
    • Status: Open
    • Priority: Critical
    • Resolution: Unresolved
    • Affects Version/s: 1.3
    • Fix Version/s: 2.0
    • Component/s: None
    • Labels:
      None

      Description

      Aggregating over individual dimensions would be useful for a number of reasons, but the primary use case is the temporal aggregation of station or aircraft data. Aggregating onto a full grid means that both the memory usage and the output files can become extremely large, especially for long time periods.

        Issue Links

        There are no issues in this epic.

          Activity

          duncan.watson-parris@physics.ox.ac.uk Duncan Watson-Parris added a comment - - edited
          During a discussion with Nick and Zak we came up with a number of possible solutions:
          1. Force the user to specify a spatial grid when doing ungridded aggregation. Currently the user can leave this blank and CIS chooses one large cell to cover the area, but making the user explicitly enter a grid may make the behaviour less 'unexpected'.
          2. Just output the aggregation as a set of ungridded points rather than the whole grid. This should actually be really straightforward: I could just create an UngriddedData object from the Cube and output that if the user has requested it. This reduces the size of the output file, but not the memory usage.
          3. Another option which would remove the need to create a cube at all (and thus save memory) is to create a 'zero distance' ungridded/ungridded colocation routine which uses some kind of hash to only compare those points which have the exact same spatial coordinates.
          4. I could write a new algorithm which doesn't create a Cube, but creates bins algorithmically, and allocates the points to these bins (and creates the bins) as they are needed.
          5. A hybrid gridded/ungridded construct which can have a mixture of regular and irregular coordinates.

          Options 1 and 2 would be doable fairly quickly (a few days), but the others would take an increasing amount of time.
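
As a rough illustration of options 3 and 4 above, the sketch below aggregates in time only. It uses the exact (lat, lon) of each point as part of a dictionary key, so bins are created only when a point falls into them and no Cube is allocated. This is plain Python rather than CIS code: the function name `aggregate_time_sparse` and its arguments are hypothetical, and times are assumed to be plain numbers (e.g. hours since some reference).

```python
from collections import defaultdict


def aggregate_time_sparse(lats, lons, times, values, time_step):
    """Mean of `values` per (lat, lon, time bin).

    Hypothetical helper, not part of CIS. Bins are created lazily, so
    memory scales with the number of occupied bins, not the grid size.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for lat, lon, t, v in zip(lats, lons, times, values):
        # Exact spatial coordinates act as the hash key (option 3);
        # the time bin index is computed on demand (option 4).
        key = (lat, lon, int(t // time_step))
        sums[key] += v
        counts[key] += 1
    return {key: sums[key] / counts[key] for key in sums}


if __name__ == "__main__":
    # Two stations, times in hours, aggregated into 24-hour bins.
    means = aggregate_time_sparse(
        lats=[51.0, 51.0, 51.0, 10.0],
        lons=[0.0, 0.0, 0.0, 20.0],
        times=[1.0, 5.0, 30.0, 2.0],
        values=[1.0, 3.0, 5.0, 7.0],
        time_step=24,
    )
    for key, mean in sorted(means.items()):
        print(key, mean)
```

Exact floating-point equality is only a safe key for stations whose coordinates are repeated verbatim; moving platforms such as aircraft would need some rounding or spatial binning as well.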
          duncan.watson-parris@physics.ox.ac.uk Duncan Watson-Parris added a comment -
          The semi-gridded/ungridded concept would also reduce memory usage (potentially significantly) for many of the ungridded data types. For example, satellite products often have a 'gridded' component along one or more dimensions: images all have the same time, and lidar columns all have the same lat/lon. It's a fairly fundamental change though and would take a lot of work. I've put together a rough estimate below.

          Writing the fundamental changes to the ungridded data object and Hyperpoints: 3 weeks
          Changing the data products to take advantage: 2-3 weeks
          Changing the aggregation routines to take advantage: 2 weeks
          Changing the co-location routines: 2 weeks
          Changing the plotting routines: 2-3 weeks
          Changing the writing routines: 1 week
          Total: 12-14 weeks (3-3.5 months)

          Or, as a minimum first pass:
          Create a semi-gridded datatype: 1-2 weeks
          Change the aggregation routines to use it if they can: 2 weeks
          Change the writing routines: 1 week
          Total: 4-5 weeks
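
To give a feel for the memory argument, here is a small NumPy sketch comparing coordinate storage for lidar-like data held as flat points versus a semi-gridded layout in which lat/lon are stored once per column. The layout and the function name `coordinate_bytes` are assumptions made for illustration only; they do not describe an existing CIS data structure.

```python
import numpy as np


def coordinate_bytes(n_profiles, n_levels):
    """Return (flat_bytes, semi_gridded_bytes) for the coordinate arrays.

    Hypothetical layout: each lidar column has one lat/lon, while
    altitude varies per point within the column.
    """
    lat = np.zeros(n_profiles)               # one value per column
    lon = np.zeros(n_profiles)
    alt = np.zeros((n_profiles, n_levels))   # one value per point

    # Flat 'point' storage repeats lat/lon for every point in a column.
    flat_lat = np.repeat(lat, n_levels)
    flat_lon = np.repeat(lon, n_levels)
    flat_alt = alt.ravel()

    flat = flat_lat.nbytes + flat_lon.nbytes + flat_alt.nbytes
    semi = lat.nbytes + lon.nbytes + alt.nbytes
    return flat, semi


if __name__ == "__main__":
    flat, semi = coordinate_bytes(n_profiles=1000, n_levels=500)
    print(f"flat: {flat} bytes, semi-gridded: {semi} bytes")
```

Under these assumptions the coordinate storage tends towards a third of the flat size as the number of levels grows; the data values themselves take the same space either way.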
          duncan.watson-parris@physics.ox.ac.uk Duncan Watson-Parris added a comment -
          Could an unstructured grid provide an alternative approach: https://github.com/ugrid-conventions/ugrid-conventions/blob/v0.9.0/ugrid-conventions.md ?
          duncan.watson-parris@physics.ox.ac.uk Duncan Watson-Parris added a comment -
          Another way to implement option 5 above is to use Iris cubes with hybrid coordinates other than altitude, e.g. a hybrid latitude dimension. I think this could work for things like satellite data, but not necessarily for the aggregation of time series data...
          duncan.watson-parris@physics.ox.ac.uk Duncan Watson-Parris added a comment -
          Philip provided another potential use case for this: producing height-latitude profiles of aircraft data without having to specify altitude and latitude grids.
          duncan.watson-parris@physics.ox.ac.uk Duncan Watson-Parris added a comment -
          Having implemented as_pandas_dataframe, it would be pretty easy to use this for time series analysis. We could even use a DataFrame as the core data structure underlying UngriddedData, with the potential to use multi-indexes for structured data...
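
A minimal sketch of what such a time series aggregation could look like in pandas, grouping by station position and a daily time bin. The column names (`time`, `latitude`, `longitude`, `value`) and the function `daily_station_means` are assumptions for illustration; the actual layout returned by as_pandas_dataframe may differ.

```python
import pandas as pd


def daily_station_means(df):
    """Daily mean of `value` per (latitude, longitude).

    Hypothetical helper: assumes `df` has a datetime `time` column plus
    `latitude`, `longitude` and `value` columns.
    """
    keys = ["latitude", "longitude", pd.Grouper(key="time", freq="D")]
    # The result is a Series with a (latitude, longitude, time) MultiIndex.
    return df.groupby(keys)["value"].mean()


if __name__ == "__main__":
    df = pd.DataFrame(
        {
            "time": pd.to_datetime(
                [
                    "2010-01-01 00:00",
                    "2010-01-01 12:00",
                    "2010-01-02 06:00",
                    "2010-01-01 03:00",
                ]
            ),
            "latitude": [51.0, 51.0, 51.0, 10.0],
            "longitude": [0.0, 0.0, 0.0, 20.0],
            "value": [1.0, 3.0, 5.0, 7.0],
        }
    )
    print(daily_station_means(df))
```

The resulting MultiIndex is one way the "structured data" idea could be expressed: the spatial levels identify the station, and the time level carries the aggregated axis.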
          duncan.watson-parris@physics.ox.ac.uk Duncan Watson-Parris added a comment -
          I've recently come across a number of useful definitions in the CF-conventions: http://cfconventions.org/Data/cf-conventions/cf-conventions-1.7/build/aph.html

          They define: Point Data; Time Series Data; Profile Data; Trajectory Data; Time Series of Profiles; and Trajectory of Profiles. These seem to cover all of the bases.

          Perhaps we could implement subclasses of UngriddedData (which represents Point Data) for each of these other types. Any optimisations or constraints could then be built into the classes.
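
One way this subclassing idea might look is sketched below. The classes `PointData` and `TimeSeriesData` are hypothetical stand-ins and not the real CIS UngriddedData API; the `feature_type` strings follow the CF `featureType` names. The subclass enforces the time series constraint (a single fixed location) and stores the station position once.

```python
import numpy as np


class PointData:
    """Hypothetical stand-in for UngriddedData: fully independent points."""

    feature_type = "point"

    def __init__(self, lat, lon, time, data):
        self.lat = np.asarray(lat, dtype=float)
        self.lon = np.asarray(lon, dtype=float)
        self.time = np.asarray(time)
        self.data = np.asarray(data, dtype=float)
        sizes = {a.size for a in (self.lat, self.lon, self.time, self.data)}
        if len(sizes) != 1:
            raise ValueError("All coordinates and data must have the same length")


class TimeSeriesData(PointData):
    """Hypothetical subclass: all points share one lat/lon (a station)."""

    feature_type = "timeSeries"

    def __init__(self, lat, lon, time, data):
        super().__init__(lat, lon, time, data)
        # Constraint built into the class: a single fixed location.
        if np.unique(self.lat).size > 1 or np.unique(self.lon).size > 1:
            raise ValueError("A time series must have a single lat/lon")
        # Optimisation built into the class: keep the position as scalars.
        self.station_lat = float(self.lat[0])
        self.station_lon = float(self.lon[0])


if __name__ == "__main__":
    ts = TimeSeriesData([51.0, 51.0], [0.0, 0.0], [0, 1], [1.0, 2.0])
    print(ts.feature_type, ts.station_lat, ts.station_lon)
```

Profile, trajectory and the two combined types could follow the same pattern, each with its own invariant checked at construction.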
          duncan.watson-parris@physics.ox.ac.uk Duncan Watson-Parris added a comment -
          The only thing missing here is some sort of 'image' convention for e.g. MODIS data. I wonder if this can be handled by Iris though? It might be a problem if the coordinates are not continuous (for multiple tiles).
          duncan.watson-parris@physics.ox.ac.uk Duncan Watson-Parris added a comment -
          There is a proof of concept data structure in CIS_Pandas_dataframe.py in pywork.
