CoDICE data product organization refactor #826
base: dev
Conversation
…fining different types of data products
…processing into codice-data-product-organization-refactor
# Reshape to 4 dimensions to allow for epoch dimension
reshaped_variable_data = np.expand_dims(variable_data, axis=0)
This doesn't seem quite right. This is adding an extra empty dimension; shouldn't we be reshaping the data you have into the proper dimensions (similar to what you had before, I think)?
Also, will this only ever have one epoch variable, or could there be more than that?
variable_data.reshape(
    (
        len(dataset["epoch"]),
        self.num_energy_steps,
        self.num_positions,
        self.num_spin_sectors,
    )
)
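For context, a minimal sketch (with made-up shapes) of the difference the comment is pointing at: `np.expand_dims` only prepends a length-1 axis, while `reshape` actually organizes the flat values into the target dimensions.

```python
import numpy as np

# Made-up flat telemetry values: 15 energy steps x 4 positions x 1 spin sector
flat = np.arange(15 * 4 * 1)

# np.expand_dims only prepends a length-1 epoch axis; the rest stays flat
expanded = np.expand_dims(flat, axis=0)
assert expanded.shape == (1, 60)

# reshape organizes the values into the expected 4-D layout
shaped = flat.reshape((1, 15, 4, 1))
assert shaped.shape == (1, 15, 4, 1)
```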
@@ -288,33 +305,23 @@ def get_energy_table(self) -> None:

     # Get the appropriate values
     sweep_table = sweep_data[sweep_data["table_idx"] == sweep_table_id]
-    self.energy_table = sweep_table["esa_v"].values
+    energy_table: list[float] = sweep_table["esa_v"].values
Aren't these numpy arrays? I wonder if we need to bring in typechecking stubs from pandas and/or xarray to help with this... (For another time)
-energy_table: list[float] = sweep_table["esa_v"].values
+energy_table: NDArray[float] = sweep_table["esa_v"].values
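As an aside, a small sketch (with a hypothetical stand-in for the sweep table) showing that `.values` on a pandas column yields an `ndarray` rather than a list, and how `numpy.typing` can annotate it:

```python
import numpy as np
import numpy.typing as npt
import pandas as pd

# Hypothetical stand-in for the sweep table
sweep_table = pd.DataFrame({"esa_v": [0.5, 1.0, 2.0]})

# .values returns an ndarray, not a list, so NDArray is the truer annotation
energy_table: npt.NDArray[np.float64] = sweep_table["esa_v"].values
assert isinstance(energy_table, np.ndarray)
assert energy_table.dtype == np.float64
```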
config = constants.DATA_PRODUCT_CONFIGURATIONS.get(apid)  # type: ignore[call-overload]
self.coords_to_include = config["coords"]
Would it make sense to set a `self.config = config`, and then look up these values later, since they are all named identically? `self.config["dataset_name"]` rather than `self.dataset_name` doesn't seem too bad.
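A hypothetical sketch of this suggestion (the APID key, config entry, and class below are made up for illustration): keep the whole config dict on the instance and read keys on demand instead of copying each key to its own attribute.

```python
# Hypothetical config table keyed by APID, mirroring keys shown in this PR
DATA_PRODUCT_CONFIGURATIONS = {
    1136: {
        "dataset_name": "imap_codice_l1a_hi_counters_aggregated",
        "coords": ["epoch", "esa_step"],
    }
}

class DataProduct:
    def __init__(self, apid: int):
        # Store the whole config dict once...
        self.config = DATA_PRODUCT_CONFIGURATIONS[apid]

    @property
    def dataset_name(self) -> str:
        # ...and look up values by their identically-named keys
        return self.config["dataset_name"]

product = DataProduct(1136)
assert product.dataset_name == "imap_codice_l1a_hi_counters_aggregated"
```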
This was a ton of good work. I followed almost all of the changes except the last one. Complicated, but the logic flows well IMO.
    "esa_step",
    "energy_label",
],  # TODO: These will likely change
"dataset_name": "imap_codice_l1a_hi_counters_aggregated",
This seems to follow the filename convention. If so, can you change this and others to use `-` instead of `_` for the descriptor names?
-"dataset_name": "imap_codice_l1a_hi_counters_aggregated",
+"dataset_name": "imap_codice_l1a_hi_counters-aggregated",
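One hypothetical way to enforce this convention mechanically (the helper name is illustrative, not from the PR): keep the fixed prefix underscore-delimited and hyphenate only the descriptor portion.

```python
def format_dataset_name(prefix: str, descriptor: str) -> str:
    """Join a fixed underscore-delimited prefix with a hyphenated descriptor."""
    return f"{prefix}_{descriptor.replace('_', '-')}"

name = format_dataset_name("imap_codice_l1a_hi", "counters_aggregated")
assert name == "imap_codice_l1a_hi_counters-aggregated"
```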
"num_counters": 8,
"num_energy_steps": 15,  # TODO: Double check with Joey
"num_positions": 4,  # TODO: Double check with Joey
"num_spin_sectors": 1,
"support_variables": ["data_quality", "spin_period"],
Does this mean that there are two extra data variables stored in this CDF file, while the others only have science data stored?
Yes, exactly!
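A minimal sketch (hypothetical values, not CoDICE data) of what that looks like in practice: support variables ride along the epoch dimension next to the science data variables.

```python
import numpy as np
import xarray as xr

# Hypothetical single-epoch dataset with two support variables
dataset = xr.Dataset(coords={"epoch": [0]})
for name in ["data_quality", "spin_period"]:
    dataset[name] = xr.DataArray(np.array([0.0]), dims=["epoch"])

assert {"data_quality", "spin_period"} <= set(dataset.data_vars)
```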
@@ -509,9 +552,11 @@ def process_codice_l1a(file_path: Path, data_version: str) -> xr.Dataset:
I just noticed that you are only processing the first packet from `packet_dataset`. Is that right?
"""
for variable_name in self.support_variables:
    if variable_name == "energy_table":
        variable_data = self.get_energy_table()
Is this function getting the energy value in engineering units, or is it getting the energy step value, e.g. [0, 1, 2, ..., 127]?
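To illustrate the two readings of the question (both arrays below are invented, not actual CoDICE values):

```python
import numpy as np

# Reading 1: energy *step* values, i.e. indices like [0, 1, ..., 127]
esa_steps = np.arange(128)

# Reading 2: energies in engineering units, one per step (made-up sweep)
esa_voltages = np.linspace(0.5, 6000.0, 128)

assert esa_steps[-1] == 127
assert esa_voltages.shape == (128,)
```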
 for variable_data, variable_name in zip(self.data, self.variable_names):
-    # Data arrays are structured depending on the instrument
-    if self.instrument == "lo":
-        variable_data_arr = np.array(variable_data).reshape(
-            (
-                1,
-                self.num_positions,
-                self.num_spin_sectors,
-                self.num_energy_steps,
-            )
-        )
-        dims = ["epoch", "inst_az", "spin_sector", "esa_step"]
-    elif self.instrument == "hi":
-        variable_data_arr = np.array(variable_data).reshape(
-            (
-                1,
-                self.num_energy_steps,
-                self.num_positions,
-                self.num_spin_sectors,
-            )
-        )
-        dims = ["epoch", "esa_step", "inst_az", "spin_sector"]
+    # Reshape to 4 dimensions to allow for epoch dimension
+    reshaped_variable_data = np.expand_dims(variable_data, axis=0)

     # Get the CDF attributes
     cdf_attrs_key = (
         f"{self.dataset_name.split('imap_codice_l1a_')[-1]}-{variable_name}"
     )
-    attrs = cdf_attrs.get_variable_attributes(cdf_attrs_key)
+    attrs = self.cdf_attrs.get_variable_attributes(cdf_attrs_key)

     # Create the CDF data variable
     dataset[variable_name] = xr.DataArray(
-        variable_data_arr,
+        reshaped_variable_data,
         name=variable_name,
-        dims=dims,
+        dims=self.dims,
         attrs=attrs,
     )
I was following pretty well up until this point. I am not following how `self.data` is associated with `self.variable_names`.
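For what it's worth, a minimal sketch (hypothetical names and shapes) of the zip pattern in question: the assumption is that the two lists are parallel, so `self.data[i]` holds the unpacked values for `self.variable_names[i]`.

```python
import numpy as np
import xarray as xr

# Hypothetical parallel lists: one data array per variable name
variable_names = ["tcr", "ssd"]
data = [np.arange(8.0), np.arange(8.0) * 2]

dataset = xr.Dataset()
for variable_data, variable_name in zip(data, variable_names):
    # Prepend the epoch axis, as in the new code path
    reshaped = np.expand_dims(variable_data, axis=0)
    dataset[variable_name] = xr.DataArray(
        reshaped, name=variable_name, dims=["epoch", "esa_step"]
    )

assert dataset["tcr"].shape == (1, 8)
assert dataset["ssd"].shape == (1, 8)
```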
Apologies for the large PR here, I couldn't see a great way to split this up into smaller PRs. This PR is a somewhat large refactor of the CoDICE L1a processing pipeline in order to allow better flexibility in defining and processing the 18 different CoDICE L1a data products. Before, I was treating all CoDICE-lo products and all CoDICE-hi products as if they could be built the same way, but I have found that there are enough subtle differences between all of these products that they should be built in a more individualized way. As such, I have introduced a few new 'data product configuration' variables (namely CDF coordinates, CDF dimensions, and CDF data variables) that can be defined on the per-product level. In doing so, I have created a few more focused methods and renamed a few things where I think it made sense to.