i #284 Updated documentation and modified function for download_pipermail()
- Modified helix.yml to use [["mailing_list"]][["pipermail"]][["project_key_1"]]
- Added project_key_2 to helix.yml
- Created /vignettes/download_mail.Rmd to document information about pipermail downloader
- Made function calls explicit for external libraries
- ISSUE: Build -> Check is not passing. Seems to be having issues with utags_path, even though I changed the path to the one for universal-ctags in tools.yml
Showing 4 changed files with 126 additions and 22 deletions.
@@ -0,0 +1,81 @@
---
title: "Download Mod Mbox and Pipermail Mailing List Archives"
output:
  html_document:
    toc: true
    number_sections: true
vignette: >
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteIndexEntry{Download Mod Mbox and Pipermail Mailing List Archives}
  %\VignetteEncoding{UTF-8}
---

```{r}
rm(list = ls())
seed <- 1
set.seed(seed)
# Load libraries
library(kaiaulu)
library(data.table)
library(yaml)
library(stringi)
library(XML)
library(httr)
```

# Introduction

Mailing list data is stored in a variety of archives. See:

- Mod Mbox: [Apache Geronimo](https://geronimo.apache.org/mailing-lists.html).
- Pipermail: [OpenSSL](https://mta.openssl.org/mailman/listinfo/).

This notebook demonstrates how to download and refresh mailing list archives from Mod Mbox and Pipermail.

## Mailing List Organization

Mailing lists are typically organized by topic or purpose. For example, the [OpenSSL project](https://www.openssl.org/community/mailinglists.html) maintains several mailing lists, each serving a different group:

- **openssl-announce**: For important announcements.
- **openssl-commits**: For commit messages.
- **openssl-project**: For project discussions.
- **openssl-users**: For general user questions and discussions.

Each mailing list maintains archives of past messages, often organized by month and year. These archives can be accessed and downloaded for analysis.

# Project Configuration File

To start, we load the project configuration file, which contains parameters for downloading the mailing list archives.

```{r}
conf <- yaml::read_yaml("conf/helix.yml")
mailing_list <- conf[["mailing_list"]][["pipermail"]][["project_key_1"]][["mailing_list"]]
start_year_month <- conf[["mailing_list"]][["pipermail"]][["project_key_1"]][["start_year_month"]]
end_year_month <- conf[["mailing_list"]][["pipermail"]][["project_key_1"]][["end_year_month"]]
save_folder_path <- conf[["mailing_list"]][["pipermail"]][["project_key_1"]][["save_folder_path"]]
```

## Explanation of Configuration Parameters

- `mailing_list`: The URL of the mailing list archive index page (e.g., https://lists.openssl.org/pipermail/openssl-users/).
- `start_year_month`: The starting date for downloading archives (in YYYYMM format).
- `end_year_month`: The ending date for downloading archives (in YYYYMM format).
- `save_folder_path`: The local directory where the downloaded archives will be saved.

# Pipermail Downloader

```{r}
# Download archives
download_pipermail(
  mailing_list = mailing_list,
  start_year_month = start_year_month,
  end_year_month = end_year_month,
  save_folder_path = save_folder_path
)
```

After running this function, the `.mbox` files will be saved in the specified directory with filenames like `kaiaulu_202310.mbox`, `kaiaulu_202311.mbox`, etc.
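
As an illustration of how a `start_year_month` and `end_year_month` pair maps to one file per month, here is a minimal sketch. The helper `expand_year_months()` is hypothetical and is not part of kaiaulu; only the `kaiaulu_YYYYMM.mbox` naming comes from the text above, and the months used are example values.

```r
# Hypothetical helper (not a kaiaulu function): list every month between
# two YYYYMM values, inclusive, as YYYYMM strings.
expand_year_months <- function(start_year_month, end_year_month) {
  start_date <- as.Date(paste0(start_year_month, "01"), format = "%Y%m%d")
  end_date <- as.Date(paste0(end_year_month, "01"), format = "%Y%m%d")
  if (is.na(start_date) || is.na(end_date)) {
    stop("Both values must be in YYYYMM format.")
  }
  if (start_date > end_date) {
    stop("start_year_month must not be later than end_year_month.")
  }
  format(seq(start_date, end_date, by = "month"), "%Y%m")
}

# Example values only; the real range comes from the configuration file.
months <- expand_year_months("202310", "202401")
expected_files <- paste0("kaiaulu_", months, ".mbox")
print(expected_files)
```

Comparing a list like `expected_files` against the contents of `save_folder_path` is one way to spot months that were not downloaded.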
@daomcgill, to make my request more specific:
We want both options available, but the code should not execute both. The file you download to learn all the URLs for the pipermail mailing list should let you test the URL for both formats to see what is available. The default behavior should be the .txt. This is because, once you download the .txt and rename the file with a .mbox suffix, I believe that is all you need to use parse_mbox() pointed at the folder.
Now let's assume this mailing list only offers the .gz instead. In that case, you should be able to figure this out automatically from the file where you get the URLs. The download_mbox() function should then download the .gz when needed.
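
A minimal sketch of that selection step, assuming the links have already been extracted from the index page into a character vector and that archives are named like `2023-October.txt` or `2023-October.txt.gz`. The function name `select_archive_links()` and the example links are hypothetical.

```r
# Hypothetical helper: from the links found on a pipermail index page,
# keep one archive per month, preferring .txt and falling back to .txt.gz.
select_archive_links <- function(links) {
  is_archive <- grepl("\\.txt(\\.gz)?$", links)
  archive_links <- links[is_archive]
  # Put .txt links ahead of .gz links; ties keep their original order.
  txt_first <- order(grepl("\\.gz$", archive_links))
  archive_links <- archive_links[txt_first]
  # The month key is the link without its .txt or .txt.gz suffix.
  month_key <- sub("\\.txt(\\.gz)?$", "", archive_links)
  archive_links[!duplicated(month_key)]
}

# Example links only (assumed naming, not taken from a real index page).
example_links <- c("2023-October.txt.gz", "2023-October.txt",
                   "2023-November.txt.gz", "thread.html")
selected <- select_archive_links(example_links)
print(selected)
```

With this approach only one file per month is ever requested, so both formats stay supported without downloading both.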
This introduces a new issue: a user may end up with a folder where some files are .mbox and others are .gz. You should write a new function that, given the path of a folder that may contain .mbox or .gz files, opens the .gz files and renames their content to .mbox accordingly. In this process, so long as the file name is consistent, it should at worst overwrite an existing file, but never leave duplicated data (as mailing lists in pipermail only make one month available at a time). Please add this new function to your specification with parameters. Once unzipped, the function should remove the .gz files.
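
One possible shape for that function, using only base R. The name `convert_gz_to_mbox()` and its single `folder_path` parameter are assumptions for illustration, not an existing kaiaulu API.

```r
# Hypothetical function: decompress every .gz file in folder_path into a
# .mbox file with the same base name, then delete the .gz file.
convert_gz_to_mbox <- function(folder_path) {
  gz_paths <- list.files(folder_path, pattern = "\\.gz$", full.names = TRUE)
  mbox_paths <- character(0)
  for (gz_path in gz_paths) {
    # Strip .gz plus an optional .txt or .mbox suffix to get the base name.
    base_name <- sub("(\\.txt|\\.mbox)?\\.gz$", "", basename(gz_path))
    mbox_path <- file.path(folder_path, paste0(base_name, ".mbox"))
    gz_con <- gzfile(gz_path, open = "rb")
    # Opening with "wb" overwrites an existing .mbox of the same name,
    # so a month is never stored twice.
    out_con <- file(mbox_path, open = "wb")
    repeat {
      chunk <- readBin(gz_con, what = "raw", n = 1e6)
      if (length(chunk) == 0) break
      writeBin(chunk, out_con)
    }
    close(gz_con)
    close(out_con)
    file.remove(gz_path)
    mbox_paths <- c(mbox_paths, mbox_path)
  }
  mbox_paths
}
```

Calling `convert_gz_to_mbox(save_folder_path)` on a folder with no .gz files returns an empty character vector and changes nothing, which makes it safe to run after every download.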
Lastly, there is the refresh_ function, which is what a user would normally run against the folder to start downloading. The refresh function calls your download function. Now it would call your download function and then the new function, which checks whether any .gz files are present and, if so, converts them to .mbox.
You want the behavior above because the user then only needs to call refresh() on a folder path at any time to download the current or new mailing list data, and always ends up with a folder of one .mbox file per month, regardless of whether a .gz or a .txt was downloaded.
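
A rough sketch of that refresh flow, with the downloader and the .gz converter passed in as functions so the example stays self-contained. All names here (`refresh_pipermail_sketch()`, `download_fn`, `convert_fn`, `current_year_month`) are hypothetical, and re-downloading the latest month already on disk is an assumption, made because that month may have been incomplete when it was last fetched.

```r
# Hypothetical refresh: find the latest month already in the folder,
# download from that month onward, then convert any .gz files to .mbox.
refresh_pipermail_sketch <- function(save_folder_path, start_year_month,
                                     download_fn, convert_fn,
                                     current_year_month = format(Sys.Date(), "%Y%m")) {
  existing <- list.files(save_folder_path, pattern = "_[0-9]{6}\\.mbox$")
  downloaded_months <- sub(".*_([0-9]{6})\\.mbox$", "\\1", existing)
  # YYYYMM strings sort chronologically, so max() gives the latest month.
  first_month <- if (length(downloaded_months) > 0) {
    max(downloaded_months)
  } else {
    as.character(start_year_month)
  }
  download_fn(first_month, current_year_month, save_folder_path)
  convert_fn(save_folder_path)
  list.files(save_folder_path, pattern = "\\.mbox$")
}
```

In a real implementation, `download_fn` would wrap the pipermail downloader and `convert_fn` would be the .gz-to-.mbox step, so the folder holds only .mbox files whenever the call returns.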
So long as this state is preserved, parse_mbox() will always work when pointed at the refresh folder. Of course, this whole specification assumes my understanding holds: that either file type may be available, or may stop being available. Let me know if my premise is false.