CoDICE data product organization refactor #826
base: dev
Conversation
…fining different types of data products
…processing into codice-data-product-organization-refactor
# Reshape to 4 dimensions to allow for epoch dimension
reshaped_variable_data = np.expand_dims(variable_data, axis=0)
This doesn't seem quite right. This is adding an extra empty dimension; shouldn't we be reshaping the data you have into the proper dimensions (similar to what you had before, I think)?
Also, will this only ever have one epoch variable, or could there be more than that?
variable_data.reshape(
    (
        len(dataset["epoch"]),
        self.num_energy_steps,
        self.num_positions,
        self.num_spin_sectors,
    )
)
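For context, a minimal sketch (with made-up shapes) of the difference the comment is pointing at: `np.expand_dims` only prepends a length-1 axis, while `reshape` actually organizes the flat values into the target dimensions.

```python
import numpy as np

# Made-up flat telemetry values: 15 energy steps x 4 positions x 1 spin sector
flat = np.arange(15 * 4 * 1)

# np.expand_dims only prepends a length-1 epoch axis; the rest stays flat
expanded = np.expand_dims(flat, axis=0)
assert expanded.shape == (1, 60)

# reshape organizes the values into the expected 4-D layout
shaped = flat.reshape((1, 15, 4, 1))
assert shaped.shape == (1, 15, 4, 1)
```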
@@ -288,33 +305,23 @@ def get_energy_table(self) -> None:

     # Get the appropriate values
     sweep_table = sweep_data[sweep_data["table_idx"] == sweep_table_id]
-    self.energy_table = sweep_table["esa_v"].values
+    energy_table: list[float] = sweep_table["esa_v"].values
Aren't these numpy arrays? I wonder if we need to bring in typechecking stubs from pandas and/or xarray to help with this... (For another time)
-energy_table: list[float] = sweep_table["esa_v"].values
+energy_table: NDArray[float] = sweep_table["esa_v"].values
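As an aside, a small sketch (with a hypothetical stand-in for the sweep table) showing that `.values` on a pandas column yields an `ndarray` rather than a list, and how `numpy.typing` can annotate it:

```python
import numpy as np
import numpy.typing as npt
import pandas as pd

# Hypothetical stand-in for the sweep table
sweep_table = pd.DataFrame({"esa_v": [0.5, 1.0, 2.0]})

# .values returns an ndarray, not a list, so NDArray is the truer annotation
energy_table: npt.NDArray[np.float64] = sweep_table["esa_v"].values
assert isinstance(energy_table, np.ndarray)
assert energy_table.dtype == np.float64
```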
config = constants.DATA_PRODUCT_CONFIGURATIONS.get(apid)  # type: ignore[call-overload]
self.coords_to_include = config["coords"]
Would it make sense to set a `self.config = config`, and then look up these values later, since they are all named identically? `self.config["dataset_name"]` rather than `self.dataset_name` doesn't seem too bad.
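A hypothetical sketch of this suggestion (the APID key, config entry, and class below are made up for illustration): keep the whole config dict on the instance and read keys on demand instead of copying each key to its own attribute.

```python
# Hypothetical config table keyed by APID, mirroring keys shown in this PR
DATA_PRODUCT_CONFIGURATIONS = {
    1136: {
        "dataset_name": "imap_codice_l1a_hi_counters_aggregated",
        "coords": ["epoch", "esa_step"],
    }
}

class DataProduct:
    def __init__(self, apid: int):
        # Store the whole config dict once...
        self.config = DATA_PRODUCT_CONFIGURATIONS[apid]

    @property
    def dataset_name(self) -> str:
        # ...and look up values by their identically-named keys
        return self.config["dataset_name"]

product = DataProduct(1136)
assert product.dataset_name == "imap_codice_l1a_hi_counters_aggregated"
```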
This was a ton of good work. I followed almost all of the changes except the last one. Complicated, but the logic flows well IMO.
    "esa_step",
    "energy_label",
],  # TODO: These will likely change
"dataset_name": "imap_codice_l1a_hi_counters_aggregated",
This seems to follow the filename convention. If so, can you change this and others to use `-` instead of `_` for the descriptor names?
-"dataset_name": "imap_codice_l1a_hi_counters_aggregated",
+"dataset_name": "imap_codice_l1a_hi_counters-aggregated",
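One hypothetical way to enforce this convention mechanically (the helper name is illustrative, not from the PR): keep the fixed prefix underscore-delimited and hyphenate only the descriptor portion.

```python
def format_dataset_name(prefix: str, descriptor: str) -> str:
    """Join a fixed underscore-delimited prefix with a hyphenated descriptor."""
    return f"{prefix}_{descriptor.replace('_', '-')}"

name = format_dataset_name("imap_codice_l1a_hi", "counters_aggregated")
assert name == "imap_codice_l1a_hi_counters-aggregated"
```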
"num_counters": 8,
"num_energy_steps": 15,  # TODO: Double check with Joey
"num_positions": 4,  # TODO: Double check with Joey
"num_spin_sectors": 1,
"support_variables": ["data_quality", "spin_period"],
Does this mean that there are two extra data variables stored in this CDF file, while the others only have science data stored?
Yes, exactly!
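A minimal sketch (hypothetical values, not CoDICE data) of what that looks like in practice: support variables ride along the epoch dimension next to the science data variables.

```python
import numpy as np
import xarray as xr

# Hypothetical single-epoch dataset with two support variables
dataset = xr.Dataset(coords={"epoch": [0]})
for name in ["data_quality", "spin_period"]:
    dataset[name] = xr.DataArray(np.array([0.0]), dims=["epoch"])

assert {"data_quality", "spin_period"} <= set(dataset.data_vars)
```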
@@ -509,9 +552,11 @@ def process_codice_l1a(file_path: Path, data_version: str) -> xr.Dataset:
I just noticed that you are only processing the first packet from `packet_dataset`. Is that right?
"""
for variable_name in self.support_variables:
    if variable_name == "energy_table":
        variable_data = self.get_energy_table()
Is this function getting the energy value in engineering units, or is it getting the energy step value, e.g. [0, 1, 2, ..., 127]?
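To illustrate the two readings of the question (both arrays below are invented, not actual CoDICE values):

```python
import numpy as np

# Reading 1: energy *step* values, i.e. indices like [0, 1, ..., 127]
esa_steps = np.arange(128)

# Reading 2: energies in engineering units, one per step (made-up sweep)
esa_voltages = np.linspace(0.5, 6000.0, 128)

assert esa_steps[-1] == 127
assert esa_voltages.shape == (128,)
```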
 for variable_data, variable_name in zip(self.data, self.variable_names):
-    # Data arrays are structured depending on the instrument
-    if self.instrument == "lo":
-        variable_data_arr = np.array(variable_data).reshape(
-            (
-                1,
-                self.num_positions,
-                self.num_spin_sectors,
-                self.num_energy_steps,
-            )
-        )
-        dims = ["epoch", "inst_az", "spin_sector", "esa_step"]
-    elif self.instrument == "hi":
-        variable_data_arr = np.array(variable_data).reshape(
-            (
-                1,
-                self.num_energy_steps,
-                self.num_positions,
-                self.num_spin_sectors,
-            )
-        )
-        dims = ["epoch", "esa_step", "inst_az", "spin_sector"]
+    # Reshape to 4 dimensions to allow for epoch dimension
+    reshaped_variable_data = np.expand_dims(variable_data, axis=0)

     # Get the CDF attributes
     cdf_attrs_key = (
         f"{self.dataset_name.split('imap_codice_l1a_')[-1]}-{variable_name}"
     )
-    attrs = cdf_attrs.get_variable_attributes(cdf_attrs_key)
+    attrs = self.cdf_attrs.get_variable_attributes(cdf_attrs_key)

     # Create the CDF data variable
     dataset[variable_name] = xr.DataArray(
-        variable_data_arr,
+        reshaped_variable_data,
         name=variable_name,
-        dims=dims,
+        dims=self.dims,
         attrs=attrs,
     )
I was following pretty well up until this point. I am not following how `self.data` is associated with `self.variable_names`.
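For what it's worth, a minimal sketch (hypothetical names and shapes) of the zip pattern in question: the assumption is that the two lists are parallel, so `self.data[i]` holds the unpacked values for `self.variable_names[i]`.

```python
import numpy as np
import xarray as xr

# Hypothetical parallel lists: one data array per variable name
variable_names = ["tcr", "ssd"]
data = [np.arange(8.0), np.arange(8.0) * 2]

dataset = xr.Dataset()
for variable_data, variable_name in zip(data, variable_names):
    # Prepend the epoch axis, as in the new code path
    reshaped = np.expand_dims(variable_data, axis=0)
    dataset[variable_name] = xr.DataArray(
        reshaped, name=variable_name, dims=["epoch", "esa_step"]
    )

assert dataset["tcr"].shape == (1, 8)
assert dataset["ssd"].shape == (1, 8)
```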
Apologies for the large PR here, I couldn't see a great way to split this up into smaller PRs. This PR is a somewhat large refactor of the CoDICE L1a processing pipeline in order to allow better flexibility in defining and processing the 18 different CoDICE L1a data products. Before, I was treating all CoDICE-lo products and all CoDICE-hi products as if they could be built the same way, but I have found that there are enough subtle differences between all of these products that they should be built in a more individualized way. As such, I have introduced a few new 'data product configuration' variables (namely CDF coordinates, CDF dimensions, and CDF data variables) that can be defined on the per-product level. In doing so, I have created a few more focused methods and renamed a few things where I think it made sense to.