Refresher Capability for MBOX Downloader (Milestone 2) #284

Open
6 tasks done
ian-lastname opened this issue Mar 9, 2024 · 28 comments

@ian-lastname
Collaborator

ian-lastname commented Mar 9, 2024

1. Purpose

The purpose of this issue is to add refresh capability for the mod mbox downloader and pipermail downloader. I'll have to create a refresh function for both downloaders, as well as a parser function that parses the latest downloaded mail file. There are two mod mbox downloader functions: download_mod_mbox and download_mod_mbox_per_month. Since pagination is required for a refresh function, I will only be focusing on the download_mod_mbox_per_month function.

2. Process

I will base my changes and new code on the existing code for the mbox downloader and parser. For the refresh capability, I will look through Sean's Jira downloader refresher to get a good idea of how I should make it. From what I already know about it, I will most definitely be making a new function that takes a date of some sort.

3. Endpoints

From the meeting, apparently I only have year and month to work with when it comes to endpoints. I'll do a bit more checking around just to make sure.

4. Task List

  • Double check the Jira download refresher to get a better understanding on how a refresh function would work.
  • Most likely change the naming convention of the downloaded files for the purpose of getting the latest date for the refresh function. (mailinglist_archive_yearmonth.mbox)
  • Create the download refresher (called "mbox_download_refresher")
  • Find a way to delete the duplicates that may be downloaded by the downloader refresher.
  • Make a "get latest date" function similar to the "get latest date" function that I made for Jira. Most likely called "mbox_latest_date" (parse_mbox_latest_date(mbox_save_path))
  • Edit download_pipermail so that it downloads mail as mbox files instead of just txt files, also so that it makes the downloaded files' names adhere to the naming convention (mailinglist_archive_yearmonth.mbox)

Refresher (Endpoint)

I'll be using year for the endpoint. For the refresher function, I'll make the upper bound endpoint the current year, getting it from a built-in function that returns the current year.

Refresher Function: refresh_mod_mbox(archive_url, mailing_list, archive_type, from_year, save_folder_path, verbose=FALSE)

  • Checks if the save_folder_path is empty or not. If empty, then it calls download_mod_mbox_per_month, downloading mbox files starting from the from_year parameter, to the current real-life year.
  • If save_folder_path is not empty, then it deletes the latest year & month mbox file currently downloaded, then redownloads that deleted file along with all files after it up to the current real-life year

Refresher Function for pipermail: refresh_pipermail(archive_url, mailing_list, archive_type, save_folder_path,verbose=FALSE)

  • Checks if the save_folder_path is empty or not. If empty, then it calls download_pipermail, which downloads all mail from a selected mailing list in a selected archive.
  • If save_folder_path is not empty, then it deletes the latest year & month mbox file currently downloaded, then redownloads that deleted file along with all files after it up to the current real-life year. Very similar to refresh_mod_mbox; the only reason why I had to make a separate function for pipermail is that the pipermail downloader works differently from the mod mbox downloader.

New Parser: parse_mbox_latest_date(mbox_path)

  • Finds the latest downloaded mail file in a selected mbox save folder. It does so based on the yearmonth part of the file name. This function returns the name of the latest downloaded mail file for use by the refresher function.
  • Can be used for both mail downloaded via download_pipermail and download_mod_mbox_per_month since both downloader functions download the mail files as mbox files (as in, both downloaders save mail with ".mbox" as the extension)
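
A minimal sketch of the lookup described above, in base R. The helper name `latest_mbox_file` is hypothetical and this is not the actual `parse_mbox_latest_date` implementation; it only assumes that each file name ends in a six-digit yearmonth followed by `.mbox`.

```r
# Hypothetical helper: return the name of the newest mbox file in a folder,
# judged by the six-digit yearmonth (YYYYMM) right before ".mbox".
latest_mbox_file <- function(mbox_path) {
  files <- list.files(mbox_path, pattern = "[0-9]{6}\\.mbox$")
  if (length(files) == 0) {
    # Nothing downloaded yet, so there is no latest file to report.
    return(NA_character_)
  }
  # Extract the YYYYMM digits that precede the extension.
  year_month <- sub(".*([0-9]{6})\\.mbox$", "\\1", files)
  files[which.max(as.integer(year_month))]
}
```

Comparing the digits as integers keeps the ordering independent of whatever prefix the mailing list and archive names contribute to the file name.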

Incorporating Month as an Endpoint Along With Year

Currently, the endpoint parameters for the downloader/refresher functions that take them only take a year (i.e. 2004). Due to this, the downloaders will always start at the beginning of the year when downloaded at a certain "from" year. It is 100% possible to make it so that the downloader can start at a specified month as well as a year. The logic in order to do so is as follows:

  • The "from_year" and "to_year" parameters in download_mod_mbox_per_month can take in a date in a yearmonth format (i.e. January 2017 -> 201701)
  • Extract the year and month values from the parameters using as.numeric(substr([variable], 1, 4)) and as.numeric(substr([variable], 5, 6)) respectively
  • Currently, the mbox downloader loops through each month to download files for the current iterated month and year. When iterating through the months of the year extracted from the "from_year" parameter, just make it so that it starts on the extracted month. Likewise, when iterating through the months of the "to_year" parameter, just make it so that it ends on the extracted month for that parameter.
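
The three steps above can be sketched in base R as follows. The helper name `year_month_range` is hypothetical; it uses the same `substr` extraction and returns every yearmonth the downloader would need to visit.

```r
# Hypothetical helper: list every YYYYMM value from `from` to `to`, inclusive.
year_month_range <- function(from, to) {
  from <- as.character(from)
  to <- as.character(to)
  from_year <- as.numeric(substr(from, 1, 4))
  from_month <- as.numeric(substr(from, 5, 6))
  to_year <- as.numeric(substr(to, 1, 4))
  to_month <- as.numeric(substr(to, 5, 6))

  # Count months on a single axis so the year rollover needs no special case.
  first <- from_year * 12 + (from_month - 1)
  last <- to_year * 12 + (to_month - 1)
  if (first > last) {
    return(character(0))
  }
  months <- first:last
  sprintf("%04d%02d", months %/% 12L, months %% 12L + 1L)
}

# Example call: November 2017 through February 2018.
year_month_range(201711, 201802)
```

Flattening year and month into one month counter avoids separate loops for the first year, the last year, and the years in between.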

Pipermail: Manually Prompting Pipermail Refresher to Start After a Certain Year and Month

Pipermail archives store their mail in txt or txt.gz format. Here is an example of a pipermail archive:
[Screenshot: piper1]
In this picture, you can see that each month's mail can be downloaded through a link to the txt file. Clicking on the link takes you to this page:
[Screenshot: piper2]
As you can see, this is a raw file of all the mail messages in April 2018. Notice the naming convention of the downloadable file, which is underlined in red: the file is named on a year-month basis. Download the file for the date you want to start from, and put it in the save folder on which you will run the pipermail refresher.

Next, you will want to rename your downloaded file to the correct naming format (i.e. openssl_mta_201804.mbox as per the second picture). With that, the refresher should start from the month and year that your downloaded file is from.

Chances are, you might not even need the full naming format; as long as the name has the yearmonth part and the correct extension (i.e. 201804.mbox should be enough to start from April 2018), it should work. You might not even need to manually download the file from the mail archive to begin with; a blank file with the correct naming convention (or at the very least yearmonth.mbox) should be sufficient, as the refresher will just delete that file and replace it with the actual mail file for that year and month.
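
As a small illustration of the blank-file approach, the following base R sketch creates an empty placeholder for April 2018. The folder name is hypothetical, and whether the refresher accepts the placeholder depends on the behavior described above.

```r
# Hypothetical save folder; a temporary directory keeps the sketch self-contained.
save_folder_path <- file.path(tempdir(), "pipermail_seed_demo")
dir.create(save_folder_path, showWarnings = FALSE)

# An empty placeholder named for April 2018. Under the behavior described
# above, the refresher would delete it and download from 201804 onwards.
placeholder <- file.path(save_folder_path, "openssl_mta_201804.mbox")
file.create(placeholder)
```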

@ian-lastname ian-lastname self-assigned this Mar 9, 2024
@ian-lastname ian-lastname changed the title Mbox Downloader Refresher Mbox Downloader Refresher (Milestone 2) Mar 11, 2024
@carlosparadis carlosparadis changed the title Mbox Downloader Refresher (Milestone 2) Refresher Capability for MBOX Downloader (Milestone 2) Mar 18, 2024
@ian-lastname
Collaborator Author

  • Explain logic behind implementing month in the parameter for from_year and to_year
  • Post hyperlink to pipermail openssl-dev archive all files and the most recent file

@carlosparadis
Member

@ian-lastname Please add here the notes requested during the last meeting Friday:

  • Screenshots / urls / examples of how the pipermail .txt file can be obtained to manually prompt your refresher to start after a given year and month

There was another item, what was it?

@ian-lastname
Collaborator Author

> @ian-lastname Please add here the notes requested during the last meeting Friday:
>
>   • Screenshots / urls / examples of how the pipermail .txt file can be obtained to manually prompt your refresher to start after a given year and month
>
> There was another item, what was it?

I remember the other item; it was to link to the part of the code in the pipermail refresher that is supposed to print a warning message when a file-not-found error occurs at a certain URL. Turns out, I had removed the code that actually printed the warning message when the error is encountered.

@carlosparadis
Member

@ian-lastname If the code already exists, could you make a commit to just place it back? I have not started reviewing your code yet.

@carlosparadis
Member

The pipermail mbox refresher has a main IF and ELSE. In the case the IF enters, it will default the entire code logic to download_pipermail.

download_pipermail downloads the main page of the mailing list archive (e.g. https://mta.openssl.org/pipermail/openssl-users/). This page contains the list of all URLs of the mbox files as either .txt or .gz. Both are mbox in disguise; we only need to rename the file extensions.

download_pipermail will get the urls, download the appropriate files and rename. download_pipermail relies on this file to know if .gz or .txt will be available and what dates. Without said file, it is impossible to know which will be the case.

The Else portion of pipermail refresher will not rely on the file. Therefore, it will not know the year to end, other than system time, and will also not know whether txt, gz or both are available. In addition, the code logic for current year and last year was split into two functions. Combined with the txt or gz functions, this results in 4 functions being fired every year/month all the way to current year/month from system time. This generates a number of empty files saved, which are subsequently deleted as they are downloaded all the way to current year.

The rework of the else function should rely on the download_pipermail function, and re-obtain the list of all files, use the last file year_month, and then download only the files of either .txt or .gz according to the URLs extracted from said file. This will reduce the number of function calls to only 1 per year month, and also prevent firing for years and months that are not available (perhaps because the archive stopped storing data way before the current year date).

@carlosparadis
Member

download_mod_mbox was not tested on a project whose data does not extend to the current date, as most Apache projects had data up to the present. I suspect there will be a problem where empty files will be saved (edit this comment later to refer to the issue Lihan or I posted about that).

@daomcgill
Collaborator

daomcgill commented Sep 12, 2024


Purpose

Rework mbox and pipermail download functions. Add refresh capability for both.

Process

Start by working on pipermail download and refresh functions. Update config files and relevant notebook. Move on to mbox download and refresh.

Task List

  • mail.R/download_pipermail: Create this function.
        - Use SSL Archive (just one of the lists, do not need all) for pipermail mailing list. Should be able to convert by changing .txt to .mbox extension. Fix so it does this.
        - Look at kaiaulu mailing list downloaders for url examples (these do not refresh).
        - Use kaiaulu Jira downloader for example, except uses URL (not API).
  • mail.R/refresh_pipermail: Edit.
        - Conform to refresher cheatsheet.
  • mail.R/convert_pipermail_to_mbox: Remove this function.
  • mail.R/download_mod_mbox_per_month: Remove this function.
  • mail.R/download_mod_mbox: Edit.
  • mail.R/refresh_mod_mbox: Edit.
  • mail.R/parse_mbox(perceval_path, mbox_path): Decide what to do with this.
  • Create notebook explaining how to use functions.

Functions

Pipermail Downloader

download_pipermail(archive_url, mailing_list, start_year_month, end_year_month, archive_type, save_folder_path):

  • Gets the year_month of all mail from the table found in archive_url. Example archive URL.
  • If year_month is within start_year_month and end_year_month parameters, download the file from URL into save_folder_path. Save file as 'kaiaulu_YYYYMM.mbox'.

Pipermail Refresher

refresh_pipermail(archive_url, mailing_list, archive_type, save_folder_path):

  • If save_folder_path is empty, download all links in mailing_list.
  • Else, find the most recent year_month from the files in save_folder_path, using the filenames. Delete this file and set as most_recent_year_month.
  • Call download_pipermail(start_year_month = most_recent_year_month, end_year_month = current_year_month) to download months starting with most recent.
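
A minimal sketch of the planning half of this refresher, in base R. The helper name `plan_refresh` and its return structure are hypothetical; it only works out which file to delete and which range to request, and leaves the deletion and the `download_pipermail` call to the caller. It assumes the 'kaiaulu_YYYYMM.mbox' naming convention.

```r
# Hypothetical helper: decide what a refresh should delete and re-download.
plan_refresh <- function(save_folder_path, today = Sys.Date()) {
  current_year_month <- format(today, "%Y%m")
  files <- list.files(save_folder_path, pattern = "^kaiaulu_[0-9]{6}\\.mbox$")
  if (length(files) == 0) {
    # Empty folder: nothing to delete, and the caller downloads everything.
    return(list(delete = character(0),
                start_year_month = NA_character_,
                end_year_month = current_year_month))
  }
  year_month <- sub("^kaiaulu_([0-9]{6})\\.mbox$", "\\1", files)
  newest <- which.max(as.integer(year_month))
  # The newest month may be incomplete, so it is deleted and requested again.
  list(delete = file.path(save_folder_path, files[newest]),
       start_year_month = year_month[newest],
       end_year_month = current_year_month)
}
```

Passing `today` as a parameter rather than reading the clock inside the function makes the plan reproducible when testing.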

Mbox Downloader

download_mod_mbox(base_url, mailing_list, start_year_month, end_year_month, save_folder_path):

  • Downloads mod mbox within specified time range from mailing_list. Saves files as 'kaiaulu_datetime.mbox'. Example base_url.

Mbox Refresher

refresh_mod_mbox(archive_url, mailing_list, archive_type, start_year_month, save_folder_path):

  • If save_folder_path is empty, download all links in mailing_list.
  • If start_year_month = 'most_recent', find the most recent datetime from the filenames in save_folder_path. Delete the most recent one.
  • Get the datetimes from the mailing_list and download from the deleted one onwards.

Parser

parse_mbox_latest_date(mbox_path):

Libraries

  • httr
  • stringi

@daomcgill
Collaborator

daomcgill commented Sep 12, 2024

Question

I tried using the mail.R/download_mod_mbox_per_month function. When the from_year parameter for download_mod_mbox_per_month is set to 201801 and to_year is current_year (to_year is set within the function, not a user parameter), it starts downloading from 201801 and works backwards. Is this expected behavior? My assumption was that it would download files starting from 201801 and move forwards towards more recent years, ending in the current year. The resulting saved mbox file has a size of 0 bytes.
Here is what I did:

conf <- yaml::read_yaml("conf/helix.yml")
save_path_mbox <- conf[["mailing_list"]][["mod_mbox"]][["mail_key_1"]][["mbox"]]
mod_mbox_url <- conf[["mailing_list"]][["mod_mbox"]][["mail_key_1"]][["archive_url"]]
mailing_list <- conf[["mailing_list"]][["mod_mbox"]][["mail_key_1"]][["mailing_list"]]
archive_url <- conf[["mailing_list"]][["mod_mbox"]][["mail_key_1"]][["archive_url"]]
archive_type <- conf[["mailing_list"]][["mod_mbox"]][["mail_key_1"]][["archive_type"]]
from_year <- 201801
save_folder_path <- "save_folder_mail"
refresh_mod_mbox(
    archive_url = archive_url,
    mailing_list = mailing_list,
    archive_type = archive_type,
    from_year = from_year,
    save_folder_path = save_folder_path,
    verbose = TRUE
)

Here it is still downloading, now having reached 2002:
[Screenshot 2024-09-12 at 12 45 01 PM]

@daomcgill
Collaborator

I found this working link of the openssl-project Archives.

@carlosparadis
Member

@daomcgill

You can use that or any of the ones here: https://mta.openssl.org/mailman/listinfo/

The behavior of going backwards is not intended. Neither is making 4 calls to download the same file:

kaiaulu/R/mail.R

Lines 544 to 577 in d2ce222

    download_txt_files_latest_downloaded_year(archive_url=archive_url,
                                              mailing_list=mailing_list,
                                              archive_type=archive_type,
                                              latest_downloaded_year=latest_downloaded_year,
                                              latest_downloaded_month=latest_downloaded_month,
                                              current_year = current_year,
                                              current_month = current_month,
                                              save_folder_path=save_folder_path)
    download_txt_gz_files_latest_downloaded_year(archive_url=archive_url,
                                                 mailing_list=mailing_list,
                                                 archive_type=archive_type,
                                                 latest_downloaded_year=latest_downloaded_year,
                                                 latest_downloaded_month=latest_downloaded_month,
                                                 current_year = current_year,
                                                 current_month = current_month,
                                                 save_folder_path=save_folder_path)
    download_txt_files_current_year(archive_url=archive_url,
                                    mailing_list=mailing_list,
                                    archive_type=archive_type,
                                    latest_downloaded_year=latest_downloaded_year,
                                    current_year=current_year,
                                    current_month = current_month,
                                    save_folder_path=save_folder_path)
    download_txt_gz_files_current_year(archive_url=archive_url,
                                       mailing_list=mailing_list,
                                       archive_type=archive_type,
                                       latest_downloaded_year=latest_downloaded_year,
                                       current_year = current_year,
                                       current_month = current_month,
                                       save_folder_path=save_folder_path)
  }

I would also like to make sure your specification reflects the "refresher" concept, which this mail function has to abide by. To implement this, you will want to look at:

this comment #284 (comment) section:

Pipermail: Manually Prompting Pipermail Refresher to Start After a Certain Year and Month

Looking at the refresher cheatsheet is likely needed to understand the concept that applies to Kaiaulu downloaders, which this one should also implement: https://github.com/sailuh/kaiaulu_cheatsheet/blob/main/cheatsheets/refresher-cheatsheet.pdf

@carlosparadis
Member

@daomcgill, on the closing week of this PR, I went over what had to be fixed for this to be merged. The summary of that can be found in this comment at a logic flow level:

#284 (comment)

@carlosparadis
Member

@daomcgill

If you get a chance, would you mind checking my updated specifications? I want to make sure this part is right before I do anything else.

Ian's specification should still be the target interface we want (copying and pasting from the first message in this issue the part that is relevant to you):


Refresher (Endpoint)

I'll be using year for the end point. For the refresher function, I'll make the upper bound endpoint the current year, getting it by some built-in function that returns the current year.

Refresher Function: refresh_mod_mbox(archive_url, mailing_list, archive_type, from_year, save_folder_path, verbose=FALSE)

  • Checks if the save_folder_path is empty or not. If empty, then it calls download_mod_mbox_per_month, downloading mbox files starting from the from_year parameter, to the current real-life year.
  • If save_folder_path is not empty, then it deletes the latest year & month mbox file currently downloaded, then redownloads that deleted file along with all files after it up to the current real-life year

Refresher Function for pipermail: refresh_pipermail(archive_url, mailing_list, archive_type, save_folder_path,verbose=FALSE)

  • Checks if the save_folder_path is empty or not. If empty, then it calls download_pipermail, which downloads all mail from a selected mailing list in a selected archive.
  • If save_folder_path is not empty, then it deletes the latest year & month mbox file currently downloaded, then redownloads that deleted file along with all files after it up to the current real-life year. Very similar to refresh_mod_mbox; the only reason why I had to make a separate function for pipermail is that the pipermail downloader works differently from the mod mbox downloader.

New Parser: parse_mbox_latest_date(mbox_path)

  • Finds the latest downloaded mail file in a selected mbox save folder. It does so based on the yearmonth part of the file name. This function returns the name of the latest downloaded mail file for use by the refresher function.
  • Can be used for both mail downloaded via download_pipermail and download_mod_mbox_per_month since both downloader functions download the mail files as mbox files (as in, both downloaders save mail with ".mbox" as the extension)

Note the defined set of functions above offer you the logic to implement "refresh". You need a file name convention (as shown on the cheatsheet), and a function that assumes said convention to find out what is the latest year and month on the system (that of course assumes the user did not introduce gaps manually).

The set of 3 functions above relies on the implementation of download_mod_mbox() and download_pipermail(). Ian did not specify that in his specification, but you should (I'd also appreciate it if you format this so the header is not as big as his; this is hard to read).

Maybe you can reuse these 3 functions from him; you will need to check. download_mod_mbox() should be able to take a start_year_month and end_year_month parameter, and so should download_pipermail(). As we discussed, the download_pipermail() logic needs a redo. I believe the download_mod_mbox() outside this PR needs to incorporate the month, and the ability to select a time range so it does not download the entire year.

Try taking another pass on the specification with this, and then post a comment here pinging me (it is easier for me than e-mail). We want this exchange documented here too so it is easy to find and reference in the future.

At some point in Spring there was a joint effort to put all the signatures together: #292. However, the issue specification I am pasting in this comment is the most current one.

@carlosparadis
Member

Ian's format of specification is also generally what you want: The function signature and a few bullets giving me some idea of your logic under said function. Try to do that for the download_pipermail() and download_mod_mbox(), and also add the parameters to the signature.

@daomcgill
Collaborator

@carlosparadis could you please review updated specifications.

@carlosparadis
Member

convert_pipermail_to_mbox(filelist):

I don't believe you need this function. Just try to save the files as .mbox instead of .txt when naming them and see if parse_mbox() recognizes it.

If save_folder_path is empty, throw an error to first call download_pipermail with a specified start_year_month and end_year_month.

I am not sure you should throw any errors. If the folder is empty, it means you need to start from scratch. In pipermail you can just use the file you download with all links to infer the start date. You may need to give some thought on what your options are on mod mbox.

You need this function:

parse_mbox_latest_date(mbox_path)

The refresh should erase the most recent file and re-download it, because the mbox files are available monthly. This means the current month is always incomplete and needs to be re-downloaded.

In your notes, it mentions a script to keep running: don't worry about this. This is done via a cron job, which lies outside R. You want a function I can point to a folder that will run on an empty folder, and if I delete one of the recent files, it will just download new files. In short, the function works for the empty case, and for the case where files are in there.

After these corrections, I think it should be fine to start coding. Just make sure the logic and purpose of every function is clear. Thanks!

daomcgill added a commit that referenced this issue Sep 15, 2024
- Remove archive_url and archive_type parameters from download_pipermail().
- Add start_year_month and end_year_month parameters for date filtering.
- Remove convert_pipermail_to_mbox() function, as download_pipermail() now handles file conversion automatically.
- Change file naming convention to 'kaiaulu_YYYYMM.mbox'.
- Attempt to download and decompress files directly without saving .gz to disk, but could not establish a valid connection.

Signed-off-by: Dao McGill <[email protected]>
@carlosparadis
Member

@daomcgill

Thank you for the update! I believe you are using the specification from Anthony:

#286 (comment)

mailing_list:
  mod_mbox: 
    mail_key_1:
      archive_url: http://mail-archives.apache.org/mod_mbox/geronimo-dev
      mbox: ../../rawdata/geronimo/mod_mbox/geronimo-dev/
    mail_key_2:
      archive_url: http://mail-archives.apache.org/mod_mbox/geronimo-user
      mbox: ../../rawdata/geronimo/mod_mbox/geronimo-user/
  pipermail:
    mail_key_1:
      archive_url: http://some/pipermail/url
      mbox: ../../rawdata/geronimo/pipermail/geronimo-dev/
  1. Is that correct? If so this is fine, except we should change from mail_key_1 to just project_key_1 etc for consistency with the other downloaders (you will want to defer this to the pair working on the project config, but just to make sure we are all on the same page). Pipermail should also have multiple project keys, similar to mod_mbox.

  2. You will also want to update this Notebook: http://itm0.shidler.hawaii.edu/kaiaulu/articles/download_mod_mbox.html and create sections that explain how to use your downloaders, how mailing lists are organized (remember the openssl example I gave you pointing to their page, where there are multiple mailing lists, and then multiple archives? we should add to the text, so it is self contained. Even if it does not relate to using the function per se, it helps a newcomer on understanding what the data is before downloading). You should also introduce the idea of refresh to the users.

Sean did a great job on this one: https://github.com/sailuh/kaiaulu/blob/master/vignettes/download_jira_issues.Rmd (note this is out of sync with the docs, so you will need to read the local text).

  3. Another thing I am noticing is that the error on GitHub Actions seems to be caused by the XML package not being found. Could you press the "Check" button on RStudio to run the checks locally, and see if they pass there?

  4. Make the function call explicit for external libraries (the data.table and stringi packages are fine) (httr:: or XML::)

I may notice others as you proceed, but this is an iterative process hereafter. Thank you for getting things going.

@carlosparadis
Member

To be clear on item 2: You will want to name that Notebook file download_mail.Rmd, and create separate sections for mod_mbox and pipermail showcasing their use. This, in turn, will make it easier for you to test that everything is working as intended.

I forgot to mention an item 5): We do not need to download both zip and .txt. If given the option, (which you can infer from the first file downloaded with everything), you can just download the .txt. If only the zip is available, then you download the zip. But there is no reason to download both. Therefore, the total number of requests per file from the website should just be 1, after you download the file containing everything.
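
A minimal sketch of that selection rule in base R. The helper name `pick_one_url_per_month` is hypothetical, and it assumes the links taken from the index page end in `.txt` or `.txt.gz`; the example.org URLs in the usage are placeholders, not real archive links.

```r
# Hypothetical helper: keep exactly one URL per month from the links found
# on the archive index, preferring ".txt" and falling back to ".txt.gz".
pick_one_url_per_month <- function(urls) {
  # Strip the extension to get a per-month key such as ".../2018-April".
  month_key <- sub("\\.txt(\\.gz)?$", "", urls)
  vapply(unique(month_key), function(key) {
    candidates <- urls[month_key == key]
    txt <- candidates[grepl("\\.txt$", candidates)]
    if (length(txt) > 0) txt[1] else candidates[1]
  }, character(1), USE.NAMES = FALSE)
}

# Placeholder URLs for illustration only.
urls <- c("https://example.org/pipermail/some-list/2018-March.txt.gz",
          "https://example.org/pipermail/some-list/2018-April.txt",
          "https://example.org/pipermail/some-list/2018-April.txt.gz")
pick_one_url_per_month(urls)
```

Because the choice is made from the index before any file is fetched, each month costs a single download request.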

In your function and in your notebook you should also mention that for pipermail, users should expect an extra file to be downloaded. We need to think where to put it, or at least update parse_mbox to ensure it does not try to read it in thinking it is an .mbox file.

Did you already fix the logic that downloads files in reverse chronological order, or once for the current year and once for the years prior?

daomcgill added a commit that referenced this issue Sep 17, 2024
…mail()

- Modified helix.yml to use [["mailing_list"]][["pipermail"]][["project_key_1"]]
- Added project_key_2 to helix.yml
- Created /vignettes/download_mail.Rmd to document information about pipermail downloader
- Made function calls explicit for external libraries
- ISSUE: Build -> Check is not passing. Seems to be having issues with utags_path, even though I changed the path to the one for universal-ctags in tools.yml
@daomcgill
Collaborator

@carlosparadis Updated the function according to your comments. It currently downloads just the gz and then unzips it locally, before deleting the compressed one. Does that work? The files download in the correct order.
I am having issues with Build -> Check, as can be seen in my most recent commit message.

daomcgill added a commit that referenced this issue Sep 17, 2024
…process_gz_to_mbox_in_folder()

- download_pipermail: Attempts to download .txt file first. If unavailable fallback to .gz. If using .gz file, unzips and writes output in .mbox
- Added log messages
- download_pipermail: Added timeout parameter to deal with case that server takes too long to respond
- Added refresh_pipermail function
- Updated vignettes/download_mail.Rmd to include refresh_pipermail
- Added process_gz_to_mbox_in_folder function
@daomcgill
Collaborator

@carlosparadis Made changes according to your in-line comment. Please let me know if this seems sound to you. Here are my notes for the proposed changes:

Edited download_pipermail(mailing_list, start_year_month, end_year_month, save_folder_path)

  1. Create Directory: The function first ensures that the save_folder_path directory exists. If it doesn't, create the directory.
  2. Ensure Correct Mailing List URL: The mailing_list URL is verified to end with a /, which is important when constructing the links for downloading files.
  3. Download and Parse the Mailing List: The function sends a GET request to the mailing list’s URL to retrieve content. The content is parsed to extract the rows of data from the table that contains the file links.
  4. Extract Date and Links from Rows: The function loops through the table rows (skipping the header) to extract the dates and links from each row. It converts the date to YYYYMM format and checks if the date falls within the specified start_year_month and end_year_month. If a link exists for that date, it is stored for later download.
  5. File Download Process: The function tries to download the .txt version of the file first. If the .txt file is unavailable, it attempts to download the .gz version of the file. If both attempts fail, the function skips the link and logs a message.
  6. Handling .gz Files: If the .gz file is downloaded, the function unzips it and writes its contents to an .mbox file. After unzipping, the .gz file is deleted to avoid having multiple versions of the same data.
  7. File Writing: If the .txt file is available, it is downloaded directly and saved with a .mbox extension (skips step 6). The final list of downloaded .mbox files is returned.
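
Step 4 can be sketched in base R as follows. Both helper names are hypothetical, and the sketch assumes the index table labels each row with an English month name and a year, such as "April 2018:"; the real page layout may differ.

```r
# Hypothetical helper: turn a row label such as "April 2018:" into "201804".
# Relies on English month names (base R's `month.name`).
month_label_to_year_month <- function(label) {
  cleaned <- trimws(sub(":$", "", trimws(label)))
  parts <- strsplit(cleaned, "\\s+")[[1]]
  month_number <- match(parts[1], month.name)
  if (length(parts) != 2 || is.na(month_number)) {
    return(NA_character_)
  }
  sprintf("%s%02d", parts[2], month_number)
}

# Hypothetical helper: TRUE when a YYYYMM value lies inside the inclusive range.
in_year_month_range <- function(year_month, start_year_month, end_year_month) {
  ym <- as.integer(year_month)
  ym >= as.integer(start_year_month) & ym <= as.integer(end_year_month)
}

in_year_month_range(month_label_to_year_month("April 2018:"), 201801, 201812)
```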

Added refresh_pipermail(mailing_list, start_year_month, save_folder_path)

  1. Create Directory: The function first checks whether the directory save_folder_path exists. If not, it creates the directory recursively.
  2. Check if Folder is Empty: If the folder is empty, it calls the download_pipermail function from start_year_month to the current month (end_year_month), which is found using Sys.Date().
  3. Find the Most Recent Month: If the folder is not empty, the function looks for files in the folder matching the pattern kaiaulu_YYYYMM.mbox. It extracts the YYYYMM parts from the filenames and finds the most recent month using max().
  4. Delete the Most Recent File: The function deletes the most recent file (assuming it's the last one downloaded). This is because we want to redownload that month to ensure it's up to date.
  5. Redownload the Most Recent to Current Month: After deleting the most recent file, the function calls download_pipermail again, starting from the most recent month up to the current month.

Added process_gz_to_mbox_in_folder(folder_path)

As per your request, I added a process_gz_to_mbox_in_folder(folder_path) function. My understanding was that you want to be able to receive a folder that may contain .gz or .mbox files. Any .gz files are then unzipped and renamed to .mbox. If any .mbox file with that name already exists, it will be overwritten. Question: is this necessary? Assuming the user already has this folder containing both types of files, I could see why this would be useful. If, however, they are using the download_pipermail function, this function should never be necessary as .gz files are already processed during the download.
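
For reference, the idea can be sketched in base R as below. This is a hypothetical re-creation under the description above, not the committed process_gz_to_mbox_in_folder code; the helper name `gz_to_mbox_in_folder` is an assumption.

```r
# Hypothetical sketch: decompress every .gz file in a folder into an .mbox
# file of the same base name, then delete the compressed copy.
gz_to_mbox_in_folder <- function(folder_path) {
  gz_files <- list.files(folder_path, pattern = "\\.gz$", full.names = TRUE)
  converted <- character(0)
  for (gz_file in gz_files) {
    # e.g. "2018-April.txt.gz" becomes "2018-April.mbox"
    mbox_file <- sub("(\\.txt)?\\.gz$", ".mbox", gz_file)
    con <- gzfile(gz_file, open = "rt")
    lines <- readLines(con)
    close(con)
    # writeLines overwrites an existing .mbox file of the same name.
    writeLines(lines, mbox_file)
    file.remove(gz_file)
    converted <- c(converted, mbox_file)
  }
  converted
}
```

The function returns the paths it wrote, so a caller can log what was converted; a folder with no .gz files yields an empty vector and nothing is changed.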

Note: I have not yet started working on the parser or mod mbox functions, so those are yet unchanged.
Next step: Start on the parsers?

@carlosparadis
Member

@daomcgill I have a request: the level of detail in your post is great, and it would be even better if you moved it, exactly as stated, into the code right around where it is implemented. It may appear excessive to be in code, but since Kaiaulu is code that benefits greatly from ICS 496 students, I am perfectly fine with us being excessive in explaining the code (R does not come naturally to everyone).

I would also like to save you time on having to post them here for me, so I can review directly in code as you document it.

I am not sure if there is anything you need to do for the parser. At the end of the day, both pipermail and mod_mbox will give you a folder of .mbox files, and that is what parse_mbox() expects. Is there a reason you wanted to edit it? Or was that to ensure it only reads *.mbox files?

I would say try to run parse_mbox() and then proceed to download_mod_mbox() changes.

p.s.: Let's agree to continue the specification discussion here, since part of it is on the PR and part is now on the issue. For more specific inline code comments, we can use the PR, since GitHub will auto-post them there.

p.s.2: If you ever feel you are spending too much time in anything going in circles because it is not clear in text, we can set an additional call to go over it too.

Thank you for your hard work, I am really impressed!

daomcgill added a commit that referenced this issue Sep 19, 2024
…il refresher.

- Replaced paste0 with stringi::stri_c
- Removed create directory if does not exist
- Added more verbose descriptions/comments
- Added dividers within functions
- Added verbose parameter
- Added else block for refresher
- Added call to process_gz_to_mbox_in_folder at end of refresher
- parse_mbox: stri_replace_last was not working, changed it to stringi::stri_replace_last_regex
- Tested parse_mbox. Perceval was not returning any output. I will look further into why this is happening.

Signed-off-by: Dao McGill <[email protected]>
@daomcgill
Collaborator

@carlosparadis I added more comments/descriptions, although I am not sure if it is now an excessive amount. parse_mbox does produce output when called, so I am assuming it is the expected output. I will move on to the mod mbox functions, unless you have any further changes.

@carlosparadis
Member

@daomcgill let me look when I have more time over the weekend on your code before you spend more time on pipermail. If you followed on my inline comments, this is plenty for now! I suggest you just move on to mod_mbox for the time being.

daomcgill added a commit that referenced this issue Sep 21, 2024
Updated parameters for download_mod_mbox to use Apache Pony Mail links as Apache lists now redirect there
- Modified downloads to use YYYYMM  instead of YYYY
- Removed the option for downloading by year for clearer functionality.
- Updated vignette/download_mail.Rmd

Signed-off-by: Dao McGill <[email protected]>
@daomcgill
Collaborator

daomcgill commented Sep 21, 2024

@carlosparadis I added the download_mod_mbox function, although I am not sure if this method of getting urls covers all mailing lists that would be used. The previous mailing_list (redirects) and the links in https://lists.apache.org/ lead to the Apache Pony Mail archive browser. I found that the mbox download link can be found by editing the url from 'lists.apache.org/list.html?' to the format 'lists.apache.org/api/mbox.lua?list=&date=YYYY-MM' which is how I am getting the links for download_mod_mbox. Will this method be enough to cover all use cases, or will there also be mod mbox mail from other sources?
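
A rough sketch of how such a link could be assembled is shown below. The helper name `pony_mail_mbox_url` is hypothetical, the `https://` scheme is an assumption, and the value expected after `list=` is not shown in the comment above, so `list_id` is only a placeholder here. This is not necessarily how download_mod_mbox builds its links.

```r
# Hypothetical helper: builds a download URL in the
# 'lists.apache.org/api/mbox.lua?list=&date=YYYY-MM' format quoted above.
pony_mail_mbox_url <- function(list_id, year_month) {
  # `year_month` is expected as "YYYYMM"; the quoted format uses "YYYY-MM".
  date_param <- paste0(substr(year_month, 1, 4), "-", substr(year_month, 5, 6))
  # `list_id` is a placeholder: the source does not show what follows list=.
  paste0("https://lists.apache.org/api/mbox.lua?list=", list_id,
         "&date=", date_param)
}
```

Keeping the month as `YYYYMM` on the caller's side matches the file naming used for the refresher, with the hyphenated form produced only when the URL is built.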

@carlosparadis
Member

There may be other mod mbox beyond apache software foundation, however I expect the interface to be the same. Just the URL would change. It makes sense Apache would use it, since they also created it: https://httpd.apache.org/mod_mbox/install.html

I will take a closer look today. Have you created a refresh function for it yet, or organized the file format it saves to so that it is consistent with what the pipermail one does?

Additionally, you may reach M2 while waiting for the subsequent reviews. If you want to continue progressing your milestones, you're welcome to create a new issue for M2 specification. I believe in the original milestone e-mail I linked two reference issues but I can elaborate further on the new issue. Clarifying the specification is faster for me too.

daomcgill added a commit that referenced this issue Sep 22, 2024
- Created `refresh_mod_mbox` function to automatically refresh mailing list archives downloaded using Mod Mbox.
- The function checks for the latest downloaded file, deletes it, and redownloads the archive from that month to the current date.
- Added documentation for `refresh_mod_mbox` to the notebook.

Signed-off-by: Dao McGill <[email protected]>
@daomcgill
Collaborator

@carlosparadis I just added the refresh function for mod mbox. It should now be consistent with what the pipermail one does. I will start working on M2 specification while you review this one.

daomcgill added a commit that referenced this issue Oct 2, 2024
- Updated vignettes/download_mail.Rmd to working version
- Fixed errors in helix.yml
- Minor edits in mail.R

Signed-off-by: Dao McGill <[email protected]>
daomcgill added a commit that referenced this issue Oct 2, 2024
- Check works locally
- Commit all changed files
@daomcgill
Collaborator

@carlosparadis Here is my full output for devtools::check(). It looks like there is some issue with the download_pipermail Roxygen, although I could not figure out how that function signature differs from the other functions that work. I do have the utags specified to the correct path in my tools.yml. I also noticed an error for XML in GitHub Actions. I do have the correct version of XML installed locally. I also noted that checks and Actions pass on my new PR (which I branched off this one, not main).

  • using log directory ‘/Users/dao/Desktop/kaiaulu.Rcheck’
  • using R version 4.4.0 (2024-04-24)
  • using platform: x86_64-apple-darwin20
  • R was compiled by
    Apple clang version 14.0.0 (clang-1400.0.29.202)
    GNU Fortran (GCC) 12.2.0
  • running under: macOS Sonoma 14.5
  • using session charset: UTF-8
  • using options ‘--no-manual --as-cran’
  • checking for file ‘kaiaulu/DESCRIPTION’ ... OK
  • checking extension type ... Package
  • this is package ‘kaiaulu’ version ‘0.0.0.9700’
  • package encoding: UTF-8
  • checking package namespace information ... OK
  • checking package dependencies ... NOTE
    Package suggested but not available for checking: ‘markdown’
  • checking if this is a source package ... OK
  • checking if there is a namespace ... OK
  • checking for executable files ... OK
  • checking for hidden files and directories ... NOTE
    Found the following hidden files and directories:
    .github
    These were most likely included in error. See section ‘Package
    structure’ in the ‘Writing R Extensions’ manual.
  • checking for portable file names ... OK
  • checking for sufficient/correct file permissions ... OK
  • checking whether package ‘kaiaulu’ can be installed ... WARNING
    Found the following significant warnings:
    Warning: /Users/dao/Desktop/kaiaulu.Rcheck/00_pkg_src/kaiaulu/man/download_pipermail.Rd:22: unknown macro '\item'
    Warning: /Users/dao/Desktop/kaiaulu.Rcheck/00_pkg_src/kaiaulu/man/download_pipermail.Rd:24: unknown macro '\item'
    Warning: /Users/dao/Desktop/kaiaulu.Rcheck/00_pkg_src/kaiaulu/man/download_pipermail.Rd:26: unexpected section header '\value'
    Warning: /Users/dao/Desktop/kaiaulu.Rcheck/00_pkg_src/kaiaulu/man/download_pipermail.Rd:29: unexpected section header '\description'
    Warning: /Users/dao/Desktop/kaiaulu.Rcheck/00_pkg_src/kaiaulu/man/download_pipermail.Rd:40: unexpected END_OF_INPUT '
    See ‘/Users/dao/Desktop/kaiaulu.Rcheck/00install.out’ for details.
  • checking installed package size ... OK
  • checking package directory ... OK
  • checking for future file timestamps ... OK
  • checking DESCRIPTION meta-information ... OK
  • checking top-level files ... NOTE
    Non-standard files/directories found at top level:
    ‘CONTRIBUTING.md’ ‘conf’ ‘tools.yml’
  • checking for left-over files ... OK
  • checking index information ... OK
  • checking package subdirectories ... OK
  • checking code files for non-ASCII characters ... OK
  • checking R files for syntax errors ... OK
  • checking whether the package can be loaded ... OK
  • checking whether the package can be loaded with stated dependencies ... OK
  • checking whether the package can be unloaded cleanly ... OK
  • checking whether the namespace can be loaded with stated dependencies ... OK
  • checking whether the namespace can be unloaded cleanly ... OK
  • checking dependencies in R code ... NOTE
    Namespaces in Imports field not imported from:
    ‘cli’ ‘curl’ ‘docopt’
    All declared Imports should be used.
  • checking S3 generic/method consistency ... OK
  • checking replacement functions ... OK
  • checking foreign function calls ... OK
  • checking R code for possible problems ... [12s/12s] NOTE
    bipartite_graph_projection : get_combinations: no visible global
    function definition for ‘combn’
    bipartite_graph_projection: no visible binding for global variable
    ‘type’
    bipartite_graph_projection: no visible binding for global variable
    ‘.SD’
    bipartite_graph_projection: no visible global function definition for
    ‘complete.cases’
    commit_message_id_coverage: no visible binding for global variable
    ‘commit_hash’
    commit_message_id_coverage: no visible binding for global variable
    ‘commit_message’
    community_oslom: no visible global function definition for ‘fwrite’
    community_oslom: no visible global function definition for ‘stri_match’
    community_oslom: no visible global function definition for ‘stri_split’
    dependencies_to_sdsmj: no visible binding for global variable
    ‘project_name’
    filter_by_commit_size: no visible binding for global variable
    ‘frequency’
    filter_by_last_files_change: no visible global function definition for
    ‘first’
    filter_by_last_files_change: no visible binding for global variable
    ‘commit_hash’
    filter_by_last_files_change: no visible binding for global variable
    ‘author_datetimetz’
    filter_by_last_files_change: no visible binding for global variable
    ‘.SD’
    filter_by_last_files_change: no visible binding for global variable
    ‘file_pathname’
    get_date_from_commit_hash: no visible binding for global variable
    ‘commit_hash’
    github_api_iterate_pages: no visible global function definition for
    ‘write_json’
    github_api_iterate_pages: no visible binding for global variable
    ‘owner’
    github_api_iterate_pages: no visible binding for global variable ‘repo’
    gitlog_to_hdsmj: no visible binding for global variable ‘project_name’
    graph_to_dsmj: no visible binding for global variable ‘weight’
    graph_to_dsmj: no visible global function definition for ‘dcast’
    graph_to_dsmj: no visible binding for global variable ‘.SD’
    graph_to_dsmj: no visible binding for global variable ‘from’
    graph_to_dsmj: no visible binding for global variable ‘to’
    identity_match: no visible binding for global variable ‘raw_name’
    identity_match: no visible binding for global variable ‘identity_id’
    metric_churn_per_commit_per_file: no visible binding for global
    variable ‘lines_added’
    metric_churn_per_commit_per_file: no visible binding for global
    variable ‘lines_removed’
    metric_file_bug_churn: no visible binding for global variable
    ‘issue_status’
    metric_file_bug_churn: no visible binding for global variable
    ‘issue_type’
    metric_file_bug_churn: no visible binding for global variable
    ‘issue_key’
    metric_file_bug_churn: no visible binding for global variable
    ‘file_pathname’
    metric_file_bug_churn: no visible binding for global variable ‘churn’
    metric_file_bug_churn: no visible binding for global variable
    ‘commit_message_id’
    metric_file_bug_frequency: no visible binding for global variable
    ‘issue_status’
    metric_file_bug_frequency: no visible binding for global variable
    ‘issue_type’
    metric_file_bug_frequency: no visible binding for global variable
    ‘issue_key’
    metric_file_bug_frequency: no visible binding for global variable
    ‘file_pathname’
    metric_file_bug_frequency: no visible binding for global variable
    ‘commit_message_id’
    metric_file_churn: no visible binding for global variable ‘churn’
    metric_file_churn: no visible binding for global variable
    ‘file_pathname’
    metric_file_churn: no visible binding for global variable ‘file_churn’
    metric_file_non_bug_churn: no visible binding for global variable
    ‘issue_status’
    metric_file_non_bug_churn: no visible binding for global variable
    ‘issue_type’
    metric_file_non_bug_churn: no visible binding for global variable
    ‘issue_key’
    metric_file_non_bug_churn: no visible binding for global variable
    ‘file_pathname’
    metric_file_non_bug_churn: no visible binding for global variable
    ‘churn’
    metric_file_non_bug_churn: no visible binding for global variable
    ‘commit_message_id’
    metric_file_non_bug_frequency: no visible binding for global variable
    ‘issue_status’
    metric_file_non_bug_frequency: no visible binding for global variable
    ‘issue_type’
    metric_file_non_bug_frequency: no visible binding for global variable
    ‘issue_key’
    metric_file_non_bug_frequency: no visible binding for global variable
    ‘file_pathname’
    metric_file_non_bug_frequency: no visible binding for global variable
    ‘commit_message_id’
    parse_bugzilla_perceval_rest_issue_comments : bugzilla_parse_comment:
    no visible binding for global variable ‘bug_id’
    parse_bugzilla_perceval_rest_issue_comments: no visible global function
    definition for ‘merge.data.table’
    parse_bugzilla_perceval_traditional_issue_comments: no visible global
    function definition for ‘merge.data.table’
    parse_bugzilla_rest_comments: no visible binding for global variable
    ‘..expected_columns’
    parse_bugzilla_rest_issues: no visible binding for global variable
    ‘issue_type’
    parse_bugzilla_rest_issues: no visible binding for global variable
    ‘..expected_columns’
    parse_bugzilla_rest_issues_comments: no visible binding for global
    variable ‘..expected_comments_columns’
    parse_dv8_architectural_flaws : generate_file_paths: no visible binding
    for global variable ‘architecture_issue_type’
    parse_dv8_architectural_flaws : generate_file_paths: no visible binding
    for global variable ‘architecture_issue_id’
    parse_dv8_architectural_flaws: no visible global function definition
    for ‘txtProgressBar’
    parse_dv8_architectural_flaws: no visible global function definition
    for ‘setTxtProgressBar’
    parse_dv8_metrics_decoupling_level: no visible global function
    definition for ‘setDT’
    parse_git_blame: no visible binding for global variable
    ‘is_commit_line’
    parse_git_blame: no visible binding for global variable ‘raw_line’
    parse_git_blame: no visible binding for global variable
    ‘commit_hash_id’
    parse_git_blame: no visible binding for global variable ‘author_name’
    parse_git_blame: no visible binding for global variable ‘author_email’
    parse_git_blame: no visible binding for global variable
    ‘author_timestamp’
    parse_git_blame: no visible binding for global variable ‘author_tz’
    parse_git_blame: no visible binding for global variable
    ‘committer_name’
    parse_git_blame: no visible binding for global variable
    ‘committer_email’
    parse_git_blame: no visible binding for global variable
    ‘committer_timestamp’
    parse_git_blame: no visible binding for global variable ‘committer_tz’
    parse_git_blame: no visible binding for global variable
    ‘committer_summary’
    parse_git_blame: no visible binding for global variable
    ‘line_n_original_file’
    parse_git_blame: no visible binding for global variable
    ‘line_n_final_file’
    parse_git_blame: no visible binding for global variable
    ‘previous_commit_hash’
    parse_git_blame: no visible binding for global variable ‘content’
    parse_github_replies: no visible binding for global variable ‘issue_id’
    parse_github_replies: no visible binding for global variable
    ‘created_at’
    parse_github_replies: no visible binding for global variable
    ‘issue_user_login’
    parse_github_replies: no visible binding for global variable
    ‘issue_number’
    parse_github_replies: no visible binding for global variable ‘pr_id’
    parse_github_replies: no visible binding for global variable
    ‘pr_user_login’
    parse_github_replies: no visible binding for global variable
    ‘pr_number’
    parse_github_replies: no visible binding for global variable
    ‘comment_id’
    parse_github_replies: no visible binding for global variable
    ‘comment_user_login’
    parse_github_replies: no visible binding for global variable
    ‘issue_url’
    parse_github_replies: no visible binding for global variable
    ‘author_login’
    parse_github_replies: no visible binding for global variable
    ‘commit_author_name’
    parse_github_replies: no visible binding for global variable
    ‘commit_author_email’
    parse_github_replies: no visible binding for global variable
    ‘committer_login’
    parse_github_replies: no visible binding for global variable
    ‘commit_committer_name’
    parse_github_replies: no visible binding for global variable
    ‘commit_committer_email’
    parse_github_replies: no visible binding for global variable
    ‘name_email’
    parse_gitlog : add_new_files_to_table: no visible binding for global
    variable ‘action’
    parse_gitlog : add_new_files_to_table: no visible binding for global
    variable ‘added’
    parse_gitlog : add_new_files_to_table: no visible binding for global
    variable ‘indexes’
    parse_gitlog : add_new_files_to_table: no visible binding for global
    variable ‘modes’
    parse_gitlog : add_new_files_to_table: no visible binding for global
    variable ‘newfile’
    parse_gitlog : add_new_files_to_table: no visible binding for global
    variable ‘removed’
    parse_gitlog: no visible binding for global variable ‘data.files’
    parse_gitlog: no visible binding for global variable ‘data.Author’
    parse_gitlog: no visible binding for global variable ‘data.AuthorDate’
    parse_gitlog: no visible binding for global variable ‘data.commit’
    parse_gitlog: no visible binding for global variable ‘data.Commit’
    parse_gitlog: no visible binding for global variable ‘data.CommitDate’
    parse_gitlog: no visible binding for global variable ‘data.message’
    parse_gitlog: no visible binding for global variable
    ‘file_pathname_renamed’
    parse_gitlog_entity : blamed_git_log: no visible binding for global
    variable ‘commit_hash’
    parse_gitlog_entity : blamed_git_log: no visible binding for global
    variable ‘line_n_final_file’
    parse_gitlog_entity : blamed_git_log: no visible global function
    definition for ‘complete.cases’
    parse_gitlog_entity : blamed_git_log: no visible binding for global
    variable ‘line_start’
    parse_gitlog_entity : blamed_git_log: no visible binding for global
    variable ‘line_end’
    parse_gitlog_entity : blamed_git_log: no visible binding for global
    variable ‘x.commit_hash’
    parse_gitlog_entity : blamed_git_log: no visible binding for global
    variable ‘i.entity_name’
    parse_gitlog_entity : blamed_git_log: no visible binding for global
    variable ‘i.entity_type’
    parse_gitlog_entity : blamed_git_log: no visible binding for global
    variable ‘x.line_n_final_file’
    parse_gitlog_entity : blamed_git_log: no visible binding for global
    variable ‘i.line_start’
    parse_gitlog_entity : blamed_git_log: no visible binding for global
    variable ‘i.line_end’
    parse_gitlog_entity : blamed_git_log: no visible binding for global
    variable ‘author_name’
    parse_gitlog_entity : blamed_git_log: no visible binding for global
    variable ‘author_email’
    parse_gitlog_entity : blamed_git_log: no visible binding for global
    variable ‘author_timestamp’
    parse_gitlog_entity : blamed_git_log: no visible binding for global
    variable ‘author_tz’
    parse_gitlog_entity : blamed_git_log: no visible binding for global
    variable ‘committer_name’
    parse_gitlog_entity : blamed_git_log: no visible binding for global
    variable ‘committer_email’
    parse_gitlog_entity : blamed_git_log: no visible binding for global
    variable ‘committer_timestamp’
    parse_gitlog_entity : blamed_git_log: no visible binding for global
    variable ‘committer_tz’
    parse_gitlog_entity : blamed_git_log: no visible binding for global
    variable ‘committer_summary’
    parse_gitlog_entity : blamed_git_log: no visible binding for global
    variable ‘changed_line_number’
    parse_gitlog_entity : blamed_git_log: no visible binding for global
    variable ‘entity_definition_name’
    parse_gitlog_entity : blamed_git_log: no visible binding for global
    variable ‘n_lines_changed’
    parse_gitlog_entity: no visible binding for global variable ‘row_id’
    parse_gitlog_entity: no visible global function definition for
    ‘txtProgressBar’
    parse_gitlog_entity: no visible global function definition for
    ‘setTxtProgressBar’
    parse_gitlog_entity: no visible binding for global variable ‘.GRP’
    parse_gitlog_entity: no visible binding for global variable
    ‘commit_hash’
    parse_gitlog_entity: no visible binding for global variable
    ‘file_pathname’
    parse_gof_patterns : parse_instance: no visible binding for global
    variable ‘instance_id’
    parse_gof_patterns : parse_instance: no visible binding for '<<-'
    assignment to ‘instance_id’
    parse_gof_patterns : parse_pattern: no visible binding for '<<-'
    assignment to ‘instance_id’
    parse_gof_patterns: no visible binding for global variable
    ‘pattern_name’
    parse_gof_patterns: no visible binding for global variable
    ‘instance_id’
    parse_gof_patterns: no visible binding for global variable ‘role_name’
    parse_gof_patterns: no visible binding for global variable ‘element’
    parse_jira_replies: no visible binding for global variable ‘issue_key’
    parse_jira_replies: no visible binding for global variable
    ‘issue_created_datetimetz’
    parse_jira_replies: no visible binding for global variable
    ‘issue_creator_name’
    parse_jira_replies: no visible binding for global variable
    ‘issue_description’
    parse_jira_replies: no visible binding for global variable ‘comment_id’
    parse_jira_replies: no visible binding for global variable
    ‘comment_created_datetimetz’
    parse_jira_replies: no visible binding for global variable
    ‘comment_author_name’
    parse_jira_replies: no visible binding for global variable
    ‘comment_body’
    parse_line_metrics: no visible global function definition for ‘fread’
    parse_mbox: no visible binding for global variable
    ‘..columns_of_interest’
    parse_r_dependencies : <anonymous>: no visible binding for global
    variable ‘filepath’
    parse_r_dependencies : filter_by_ownership: no visible binding for
    global variable ‘src_functions_call_name’
    parse_r_dependencies : filter_by_ownership: no visible binding for
    global variable ‘src_functions_caller_name’
    parse_r_function_dependencies: no visible binding for global variable
    ‘file_path’
    parse_r_function_dependencies: no visible global function definition
    for ‘complete.cases’
    recolor_network_by_community: no visible binding for global variable
    ‘name’
    recolor_network_by_community: no visible binding for global variable
    ‘color_community’
    recolor_network_by_community: no visible binding for global variable
    ‘type’
    recolor_network_by_community: no visible binding for global variable
    ‘color’
    smell_missing_links: no visible binding for global variable ‘from’
    smell_missing_links: no visible binding for global variable ‘to’
    smell_organizational_silo: no visible binding for global variable
    ‘from’
    smell_organizational_silo: no visible binding for global variable ‘to’
    smell_radio_silence: no visible binding for global variable
    ‘cluster_id’
    smell_radio_silence: no visible binding for global variable ‘from’
    smell_radio_silence: no visible binding for global variable ‘to’
    smell_radio_silence: no visible binding for global variable ‘node_id’
    split_name_email: no visible global function definition for
    ‘stri_match_first’
    subset_gof_class: no visible binding for global variable ‘role_name’
    temporal_graph_projection : one_lag_combinations: no visible binding
    for global variable ‘datetimetz’
    temporal_graph_projection : one_lag_combinations: no visible binding
    for global variable ‘weight’
    temporal_graph_projection : one_lag_combinations: no visible binding
    for global variable ‘from_projection’
    temporal_graph_projection : one_lag_combinations: no visible binding
    for global variable ‘from_weight’
    temporal_graph_projection : one_lag_combinations: no visible binding
    for global variable ‘to_projection’
    temporal_graph_projection : one_lag_combinations: no visible binding
    for global variable ‘to_weight’
    temporal_graph_projection : all_lag_combinations: no visible binding
    for global variable ‘datetimetz’
    temporal_graph_projection : all_lag_combinations: no visible binding
    for global variable ‘weight’
    temporal_graph_projection : all_lag_combinations: no visible global
    function definition for ‘combn’
    temporal_graph_projection : all_lag_combinations: no visible binding
    for global variable ‘from_projection’
    temporal_graph_projection : all_lag_combinations: no visible binding
    for global variable ‘from_weight’
    temporal_graph_projection : all_lag_combinations: no visible binding
    for global variable ‘from_datetimetz’
    temporal_graph_projection : all_lag_combinations: no visible binding
    for global variable ‘to_projection’
    temporal_graph_projection : all_lag_combinations: no visible binding
    for global variable ‘to_weight’
    temporal_graph_projection : all_lag_combinations: no visible binding
    for global variable ‘to_datetimetz’
    temporal_graph_projection: no visible binding for global variable
    ‘type’
    temporal_graph_projection: no visible binding for global variable ‘.SD’
    temporal_graph_projection: no visible global function definition for
    ‘complete.cases’
    transform_commit_message_id_to_network: no visible binding for global
    variable ‘file_pathname’
    transform_cve_cwe_file_to_network: no visible binding for global
    variable ‘from’
    transform_dependencies_to_network: no visible binding for global
    variable ‘src_filepath’
    transform_dependencies_to_network: no visible binding for global
    variable ‘dest_filepath’
    transform_dependencies_to_sdsmj: no visible global function definition
    for ‘melt’
    transform_dependencies_to_sdsmj: no visible global function definition
    for ‘setcolorder’
    transform_gitlog_to_bipartite_network: no visible binding for global
    variable ‘file_pathname’
    transform_gitlog_to_bipartite_network: no visible binding for global
    variable ‘author’
    transform_gitlog_to_bipartite_network: no visible binding for global
    variable ‘committer’
    transform_gitlog_to_entity_bipartite_network: no visible binding for
    global variable ‘entity’
    transform_gitlog_to_entity_bipartite_network: no visible binding for
    global variable ‘weight’
    transform_gitlog_to_entity_bipartite_network: no visible binding for
    global variable ‘author’
    transform_gitlog_to_entity_bipartite_network: no visible binding for
    global variable ‘committer’
    transform_gitlog_to_entity_temporal_network: no visible binding for
    global variable ‘from’
    transform_gitlog_to_entity_temporal_network: no visible binding for
    global variable ‘to’
    transform_gitlog_to_entity_temporal_network: no visible binding for
    global variable ‘weight’
    transform_gitlog_to_entity_temporal_network: no visible binding for
    global variable ‘datetimetz’
    transform_r_dependencies_to_network: no visible binding for global
    variable ‘src_functions_call_name’
    transform_r_dependencies_to_network: no visible binding for global
    variable ‘src_functions_caller_name’
    transform_r_dependencies_to_network: no visible binding for global
    variable ‘src_functions_call_filename’
    transform_r_dependencies_to_network: no visible binding for global
    variable ‘src_functions_caller_filename’
    transform_reply_to_bipartite_network: no visible binding for global
    variable ‘reply_from’
    transform_reply_to_bipartite_network: no visible binding for global
    variable ‘reply_subject’
    weight_scheme_count_deleted_nodes: no visible binding for global
    variable ‘eliminated_node’
    weight_scheme_cum_temporal : sum_original_contributions: no visible
    binding for global variable ‘from_weight’
    weight_scheme_cum_temporal : sum_original_contributions: no visible
    binding for global variable ‘from_datetimetz’
    weight_scheme_cum_temporal : sum_original_contributions: no visible
    binding for global variable ‘to_weight’
    weight_scheme_cum_temporal : sum_original_contributions: no visible
    binding for global variable ‘to_datetimetz’
    weight_scheme_cum_temporal: no visible binding for global variable
    ‘.SD’
    weight_scheme_pairwise_cum_temporal: no visible binding for global
    variable ‘from_weight’
    weight_scheme_pairwise_cum_temporal: no visible binding for global
    variable ‘to_weight’
    weight_scheme_sum_edges: no visible binding for global variable
    ‘weight’
    Undefined global functions or variables:
    ..columns_of_interest ..expected_columns ..expected_comments_columns
    .GRP .SD action added architecture_issue_id architecture_issue_type
    author author_datetimetz author_email author_login author_name
    author_timestamp author_tz bug_id changed_line_number churn
    cluster_id color color_community combn comment_author_name
    comment_body comment_created_datetimetz comment_id comment_user_login
    commit_author_email commit_author_name commit_committer_email
    commit_committer_name commit_hash commit_hash_id commit_message
    commit_message_id committer committer_email committer_login
    committer_name committer_summary committer_timestamp committer_tz
    complete.cases content created_at data.Author data.AuthorDate
    data.Commit data.CommitDate data.commit data.files data.message
    datetimetz dcast dest_filepath element eliminated_node entity
    entity_definition_name file_churn file_path file_pathname
    file_pathname_renamed filepath first fread frequency from
    from_datetimetz from_projection from_weight fwrite i.entity_name
    i.entity_type i.line_end i.line_start identity_id indexes instance_id
    is_commit_line issue_created_datetimetz issue_creator_name
    issue_description issue_id issue_key issue_number issue_status
    issue_type issue_url issue_user_login line_end line_n_final_file
    line_n_original_file line_start lines_added lines_removed melt
    merge.data.table modes n_lines_changed name name_email newfile
    node_id owner pattern_name pr_id pr_number pr_user_login
    previous_commit_hash project_name raw_line raw_name removed
    reply_from reply_subject repo role_name row_id setDT
    setTxtProgressBar setcolorder src_filepath
    src_functions_call_filename src_functions_call_name
    src_functions_caller_filename src_functions_caller_name stri_match
    stri_match_first stri_split to to_datetimetz to_projection to_weight
    txtProgressBar type weight write_json x.commit_hash
    x.line_n_final_file
    Consider adding
    importFrom("stats", "complete.cases", "frequency")
    importFrom("utils", "combn", "setTxtProgressBar", "txtProgressBar")
    to your NAMESPACE file.
  • checking Rd files ... WARNING
    checkRd: (-1) commit_message_id_coverage.Rd:24: Lost braces; missing escapes or markup?
    24 | Other {metrics}:
    | ^
    prepare_Rd: ./man/download_pipermail.Rd:22: unknown macro '\item'
    prepare_Rd: ./man/download_pipermail.Rd:24: unknown macro '\item'
    prepare_Rd: ./man/download_pipermail.Rd:26: unexpected section header '\value'
    prepare_Rd: ./man/download_pipermail.Rd:29: unexpected section header '\description'
    prepare_Rd: ./man/download_pipermail.Rd:40: unexpected END_OF_INPUT '
    '
    checkRd: (-1) download_pipermail.Rd:22: Lost braces
    22 | \item{save_folder_path}{The folder path in which all the downloaded pipermail files will be stored}
    | ^
    checkRd: (-1) download_pipermail.Rd:22: Lost braces
    22 | \item{save_folder_path}{The folder path in which all the downloaded pipermail files will be stored}
    | ^
    checkRd: (-1) download_pipermail.Rd:24: Lost braces
    24 | \item{verbose}{if TRUE, prints diagnostic messages during the download process}
    | ^
    checkRd: (-1) download_pipermail.Rd:24: Lost braces
    24 | \item{verbose}{if TRUE, prints diagnostic messages during the download process}
    | ^
    checkRd: (-1) download_pipermail.Rd:26-28: Lost braces
    26 | \value{
    | ^
    checkRd: (-1) download_pipermail.Rd:29-39: Lost braces
    29 | \description{
    | ^
    checkRd: (-1) example_test_example_src_repo.Rd:23: Escaped LaTeX specials: _ _
    checkRd: (-1) git_create_sample_log.Rd:26: Lost braces; missing escapes or markup?
    26 | Other {unittest}:
    | ^
    checkRd: (-1) git_delete_sample_log.Rd:23: Lost braces; missing escapes or markup?
    23 | Other {unittest}:
    | ^
    checkRd: (-1) github_api_project_commits.Rd:17: Lost braces; missing escapes or markup?
    17 | Download Commits from "GET /repos/{owner}/{repo}/commits" endpoint.
    | ^
    checkRd: (-1) github_api_project_commits.Rd:17: Lost braces; missing escapes or markup?
    17 | Download Commits from "GET /repos/{owner}/{repo}/commits" endpoint.
    | ^
    checkRd: (-1) github_api_project_contributors.Rd:17: Lost braces; missing escapes or markup?
    17 | Download project contributors from GET /repos/{owner}/{repo}/contributors" endpoint.
    | ^
    checkRd: (-1) github_api_project_contributors.Rd:17: Lost braces; missing escapes or markup?
    17 | Download project contributors from GET /repos/{owner}/{repo}/contributors" endpoint.
    | ^
    checkRd: (-1) github_api_project_issue.Rd:17: Lost braces; missing escapes or markup?
    17 | Download Issues from "GET /repos/{owner}/{repo}/issues" endpoint.
    | ^
    checkRd: (-1) github_api_project_issue.Rd:17: Lost braces; missing escapes or markup?
    17 | Download Issues from "GET /repos/{owner}/{repo}/issues" endpoint.
    | ^
    checkRd: (-1) github_api_project_issue_events.Rd:17: Lost braces; missing escapes or markup?
    17 | Download Issues from "GET /repos/{owner}/{repo}/issues/events" endpoint.
    | ^
    checkRd: (-1) github_api_project_issue_events.Rd:17: Lost braces; missing escapes or markup?
    17 | Download Issues from "GET /repos/{owner}/{repo}/issues/events" endpoint.
    | ^
    checkRd: (-1) github_api_project_issue_or_pr_comments.Rd:17: Lost braces; missing escapes or markup?
    17 | Download Issues' or Pull Request's Comments from "GET /repos/{owner}/{repo}/issues/comments" endpoint.
    | ^
    checkRd: (-1) github_api_project_issue_or_pr_comments.Rd:17: Lost braces; missing escapes or markup?
    17 | Download Issues' or Pull Request's Comments from "GET /repos/{owner}/{repo}/issues/comments" endpoint.
    | ^
    checkRd: (-1) github_api_project_pull_request.Rd:17: Lost braces; missing escapes or markup?
    17 | Download Pull Requests from "GET /repos/{owner}/{repo}/pulls" endpoint.
    | ^
    checkRd: (-1) github_api_project_pull_request.Rd:17: Lost braces; missing escapes or markup?
    17 | Download Pull Requests from "GET /repos/{owner}/{repo}/pulls" endpoint.
    | ^
    checkRd: (-1) make_jira_issue.Rd:69: Lost braces; missing escapes or markup?
    69 | Other {unittest}:
    | ^
    checkRd: (-1) metric_churn.Rd:24: Lost braces; missing escapes or markup?
    24 | Other {metrics}:
    | ^
    checkRd: (-1) metric_churn_per_commit_interval.Rd:21: Lost braces; missing escapes or markup?
    21 | Other {metrics}:
    | ^
    checkRd: (-1) metric_churn_per_commit_per_file.Rd:21: Lost braces; missing escapes or markup?
    21 | Other {metrics}:
    | ^
    checkRd: (-1) metric_file_bug_churn.Rd:21: Lost braces; missing escapes or markup?
    21 | Other {metrics}:
    | ^
    checkRd: (-1) metric_file_bug_frequency.Rd:21: Lost braces; missing escapes or markup?
    21 | Other {metrics}:
    | ^
    checkRd: (-1) metric_file_churn.Rd:19: Lost braces; missing escapes or markup?
    19 | Other {metrics}:
    | ^
    checkRd: (-1) metric_file_non_bug_churn.Rd:21: Lost braces; missing escapes or markup?
    21 | Other {metrics}:
    | ^
    checkRd: (-1) metric_file_non_bug_frequency.Rd:21: Lost braces; missing escapes or markup?
    21 | Other {metrics}:
    | ^
  • checking Rd metadata ... OK
  • checking Rd line widths ... OK
  • checking Rd cross-references ... WARNING
    Missing link or links in Rd file 'parse_mbox_latest_date.Rd':
    ‘download_mod_mbox_per_month’

See section 'Cross-references' in the 'Writing R Extensions' manual.

  • checking for missing documentation entries ... OK
  • checking for code/documentation mismatches ... OK
  • checking Rd \usage sections ... WARNING
    Documented arguments not in \usage in Rd file 'create_parent.Rd':
    ‘parent_issue_key’

Undocumented arguments in Rd file 'download_pipermail.Rd'
‘save_folder_path’ ‘verbose’

Functions with \usage entries need to have the appropriate \alias
entries, and all their arguments documented.
The \usage entries must correspond to syntactically valid R code.
See chapter ‘Writing R documentation files’ in the ‘Writing R
Extensions’ manual.

  • checking Rd contents ... NOTE
    Rd files without \description:
    ‘download_pipermail.Rd’

  • checking for unstated dependencies in examples ... OK

  • checking examples ... OK

  • checking for unstated dependencies in ‘tests’ ... OK

  • checking tests ... [13s/13s] ERROR
    Running ‘testthat.R’ [13s/13s]
    Running the tests in ‘tests/testthat.R’ failed.
    Last 13 lines of output:
    collapse = "")), stri_c("--kinds-", language, "=", stri_c(file_kinds,
    collapse = ""), collapse = ""), "-f", "-", filepath), stdout = TRUE,
    stderr = FALSE)`: error in running command
    Backtrace:

    1. ├─kaiaulu::parse_gitlog_entity(...) at test-graph.R:242:3
    2. │ ├─...[]
    3. │ └─data.table:::[.data.table(...)
    4. └─kaiaulu (local) blamed_git_log(...)
    5. └─kaiaulu::parse_line_type_file(utags_path, file_path, kinds)
    6. └─base::system2(...)
      

    [ FAIL 4 | WARN 12 | SKIP 2 | PASS 81 ]
    Error: Test failures
    Execution halted

  • checking for non-standard things in the check directory ... OK

  • checking for detritus in the temp directory ... OK

  • DONE
    Status: 1 ERROR, 4 WARNINGs, 6 NOTEs

daomcgill added a commit that referenced this issue Oct 2, 2024
- Renamed to match the convention set by issue #230

Signed-off-by: Dao McGill <[email protected]>
@carlosparadis
Member

@daomcgill this would likely be easier to diagnose on a call. However, if you search the check output for:

Undocumented arguments

This is the main thing you want to check in your code. It means some of your functions have undocumented parameters. Are you re-compiling the documentation with Cmd + Shift + D?

Found the following significant warnings:
Warning: /Users/dao/Desktop/kaiaulu.Rcheck/00_pkg_src/kaiaulu/man/download_pipermail.Rd:22: unknown macro '\item'
Warning: /Users/dao/Desktop/kaiaulu.Rcheck/00_pkg_src/kaiaulu/man/download_pipermail.Rd:24: unknown macro '\item'
Warning: /Users/dao/Desktop/kaiaulu.Rcheck/00_pkg_src/kaiaulu/man/download_pipermail.Rd:26: unexpected section header '\value'
Warning: /Users/dao/Desktop/kaiaulu.Rcheck/00_pkg_src/kaiaulu/man/download_pipermail.Rd:29: unexpected section header '\description'
Warning: /Users/dao/Desktop/kaiaulu.Rcheck/00_pkg_src/kaiaulu/man/download_pipermail.Rd:40: unexpected END_OF_INPUT '
See ‘/Users/dao/Desktop/kaiaulu.Rcheck/00install.out’ for details.

Something is also off in the download_pipermail function docs, at the lines flagged in these warnings.
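As a rough illustration of what the check expects, here is a minimal roxygen sketch that documents both flagged arguments. The function name `download_pipermail_stub`, its body, and the `verbose = FALSE` default are hypothetical; the real `download_pipermail()` signature may differ and likely has more arguments. The parameter descriptions are taken from the Rd lines quoted in the check output. An "unknown macro '\item'" warning often means an earlier tag has an unbalanced brace that closes the arguments section too soon, so that is worth checking in the roxygen block as well.

```r
# Illustrative stub only, not the actual Kaiaulu function.
# After editing roxygen comments, regenerate man/*.Rd with
# devtools::document() (what Cmd + Shift + D runs in RStudio).

#' Download pipermail archives (illustrative stub)
#'
#' Every argument in the function signature needs a matching @param tag,
#' otherwise R CMD check reports "Undocumented arguments".
#'
#' @param save_folder_path The folder path in which all the downloaded
#'   pipermail files will be stored
#' @param verbose If TRUE, prints diagnostic messages during the download
#'   process
#' @return The save folder path, invisibly (placeholder return value)
#' @export
download_pipermail_stub <- function(save_folder_path, verbose = FALSE) {
  if (verbose) {
    message("Saving pipermail files to: ", save_folder_path)
  }
  invisible(save_folder_path)
}
```

Keeping braces balanced inside each `@param` description matters because roxygen copies the text into `\item{...}{...}` entries in the generated Rd file.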


Also, please check CONTRIBUTING.md in Kaiaulu. You should update NEWS.md and the contributors in DESCRIPTION to include your name.

daomcgill added a commit that referenced this issue Oct 3, 2024
- Reverted name change of save_folder_mail
- Removed previous documentation file for mail (download_mod_mbox.Rmd)
- Updates to download_mail.Rmd
daomcgill added a commit that referenced this issue Oct 3, 2024
daomcgill added a commit that referenced this issue Oct 3, 2024
- parse_mbox_latest_date() now uses new naming convention for files
- Added to download_mail.Rmd
- Fixed documentation for download_pipermail()

Signed-off-by: Dao McGill <[email protected]>
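
For readers following the refresh logic, the idea of reading the latest date from file names can be sketched roughly as below. This is not the actual `parse_mbox_latest_date()` implementation. The helper name `latest_yearmonth` is hypothetical, and it assumes file names end in `_YYYYMM.mbox`, based on the `mailinglist_archive_yearmonth.mbox` convention proposed in the task list; the convention adopted in the commit may differ.

```r
# Hypothetical helper, not part of Kaiaulu.
# Assumes each downloaded file name ends in "_YYYYMM.mbox".
latest_yearmonth <- function(file_names) {
  pattern <- "_([0-9]{6})\\.mbox$"
  # Keep only the names that follow the assumed convention
  matching <- grep(pattern, file_names, value = TRUE)
  if (length(matching) == 0) {
    return(NA_character_)
  }
  # Extract the six-digit YYYYMM part of each name
  yearmonths <- sub(paste0("^.*", pattern), "\\1", matching)
  # Fixed-width digit strings sort chronologically as text
  max(yearmonths)
}

files <- c("list_archive_202312.mbox", "list_archive_202403.mbox", "notes.txt")
latest <- latest_yearmonth(files)
```

A refresh function could then start downloading from the month stored in `latest`, since that file may be incomplete.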
daomcgill added a commit that referenced this issue Oct 3, 2024
- added parse_mbox_latest_date