process_data: add comments, refactor code, and add tests #208
I set out to add code comments that explain the data preprocessing script, e.g., which data are removed and which are kept for the scoring algorithms. I ended up refactoring the code and adding tests for the process_data.py script. I will start with result validation, then explain the changes I made, and finally compare the logs.
Result validation
The preprocess_data function returns three dataframes: notes, ratings, and noteStatusHistory. These dataframes are the input to the scoring algorithms. I validated my results by writing the intermediate dataframes to disk and comparing them with the results from the latest release version (c7db275) using a diff. The intermediate dataframes are exactly the same.
Add code comments and logs to explain what notes/ratings are kept or removed
I think the current code comments and logs can be improved to help users understand the data preprocessing part. As per my understanding, in the _filter_misleading_notes function, the logic does the following:
I added logs to each of the three steps to show how the row counts change (see the new log output below). I also added test cases for step 3, because the numbers should add up. The tests are here.
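To illustrate the "numbers should add up" idea, here is a minimal sketch of a filtering step that logs row counts and can be checked for lost or double-counted rows. It is not the code from this PR: the helper name split_rows, the is_misleading field, and the toy records are all hypothetical, and plain Python lists stand in for the pandas dataframes the real script uses.

```python
import logging

logging.basicConfig(level=logging.INFO, format="%(message)s")
logger = logging.getLogger("preprocess_sketch")


def split_rows(rows, keep, step_name):
    """Partition rows with the `keep` predicate and log how the row count changes."""
    kept = [row for row in rows if keep(row)]
    removed = [row for row in rows if not keep(row)]
    logger.info(
        "%s: %d rows in, %d kept, %d removed",
        step_name, len(rows), len(kept), len(removed),
    )
    return kept, removed


# Hypothetical toy records; the real script operates on pandas dataframes.
toy_notes = [
    {"noteId": 1, "is_misleading": True},
    {"noteId": 2, "is_misleading": False},
    {"noteId": 3, "is_misleading": True},
]

kept, removed = split_rows(toy_notes, lambda row: row["is_misleading"], "toy filter")

# The check a test for such a step can make: no row is lost or counted twice.
assert len(kept) + len(removed) == len(toy_notes)
```

Computing the kept and removed sets independently, rather than deriving one from the other, is what makes the final assertion a meaningful check.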
Log comparison
The log comparison is based on the data released on 2024-02-07. The new log shows the row counts of the dataframes read from the provided tsv files, and a detailed history of how the dataframe row counts change through Steps 1-3 above.
Previous log output from the latest release version (c7db275)
New log output from my PR