Feature/fix url and add better ordering with numbers #3

GatorQue · 2019-11-09T23:52:47Z

Thank you for creating the General Conference Downloader tool. I fixed a few issues and added a new feature. Please accept this pull request or comment on what you would like me to change.

GatorQue · 2019-11-10T00:10:27Z

While testing the full range I ran into a problem trying to download MP3 files from 2016. I'm going to investigate and try to find a fix for this.

GatorQue · 2019-11-10T00:27:01Z

OK, it should be fixed now. I'm waiting to see how it does on older talks.

jdshaeffer · 2020-01-16T02:51:13Z

any update on this?

GatorQue · 2020-01-16T03:18:16Z

I haven't heard anything from the original author but I have been told that if you use this branch it works great.

clarkshaeffer · 2020-01-20T19:17:47Z

Hi, I'm experiencing a problem with your branch:

Problem with http request (https://www.churchofjesuschrist.org/languages: <urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1108)>
The given language (eng) is not available. Please choose one of the following:

I'm running Python 3.8.0 on macOS 10.14.6.
Anything helps! I need more of President Nelson in my life!!

GatorQue · 2020-01-21T03:19:36Z

Thank you for reaching out. I did a quick Google search on your error and came up with the following:
https://stackoverflow.com/questions/22027418/openssl-python-requests-error-certificate-verify-failed
It suggests that you run this command:
pip install certifi
You might also look for the Install Certificates program described here:
https://stackoverflow.com/questions/52805115/certificate-verify-failed-unable-to-get-local-issuer-certificate
For Python 3.8 on a Mac, it should be at:
/Applications/Python\ 3.8/Install\ Certificates.command

The issue is that some of the "root" certificates are missing from your computer, so Python is unable to validate the SSL connection to the church's website. Installing these "root" certificates will enable you to run the program.
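
To check where a given Python install looks for root certificates, something like the following may help. It is only a diagnostic sketch built on the standard library ssl module; the helper name ca_report is made up for illustration, and it does not install or change anything.

```python
import os
import ssl


def ca_report():
    """Summarize where this Python's OpenSSL looks for root certificates."""
    paths = ssl.get_default_verify_paths()
    # The default context loads whatever CA file the interpreter can find.
    # Certificates stored in a capath directory are loaded on demand and
    # are not included in this count.
    context = ssl.create_default_context()
    return {
        "cafile": paths.cafile,
        "capath": paths.capath,
        "openssl_cafile": paths.openssl_cafile,
        "cafile_exists": bool(paths.cafile) and os.path.exists(paths.cafile),
        "loaded_ca_certs": context.cert_store_stats()["x509_ca"],
    }


if __name__ == "__main__":
    for key, value in ca_report().items():
        print(f"{key}: {value}")
```

If no CA file exists and the loaded count is 0, that would be consistent with the CERTIFICATE_VERIFY_FAILED error above, although other causes are possible.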

clarkshaeffer · 2020-01-21T18:38:36Z

Worked like a charm! Thanks!

rafaelmx · 2020-01-30T20:20:18Z

Hello,
I'm a bit embarrassed that I have to ask this but... How do I run this script?
I'm totally new to this but I'm super excited to get this working.
So far I've done the following:

1. Installed Python 3.8 on Windows 10.
2. Downloaded the original files and extracted them into a folder I created (C:\Users\rsanc\Documents\Rafa\LDS-GC).
3. Modified the three files GatorQue edited (I didn't know how to download them with the changes, so I made the changes one by one).
And this is what I think I'm not doing correctly:
4. I opened a terminal and navigated to the folder with the extracted script files.
5. I typed "python pip install -r requirements.txt" in the terminal window and didn't get any message. (Previously I didn't use "python" at the beginning but got an error message; with no message I assumed I was doing it right.)
6. I typed "python gen_conf_downloader.py" in the terminal window. Nothing. No message. I also tried "python gen_conf_downloader.py -s 2018 -d C:\GC" with no success.

Steps 4 through 6 were made both before and after modifying the files with the same results. Any help?

GatorQue · 2020-01-31T02:52:14Z

Hello!
Welcome! I'm not 100% sure, but I suspect that your terminal window doesn't know how to find Python. Do you remember if you checked the "Add Python 3.8 to PATH" checkbox on the first screen of the installer? If not, can you try uninstalling Python 3.8, reinstalling it, and making sure this checkbox is checked?
I think once you do this, the command "pip install -r requirements.txt" should work as expected (you will probably see a bunch of things downloaded and installed) and "python gen_conf_downloader.py -s 2018" should work.
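
One way to confirm whether the terminal can find Python and pip is a small check like the one below. It is a sketch that assumes some interpreter can be started at all (on Windows, possibly through the py launcher); the function name locate_tools is hypothetical.

```python
import shutil
import sys


def locate_tools(names=("python", "pip")):
    """Return the full path of each command found on PATH, or None if missing."""
    return {name: shutil.which(name) for name in names}


if __name__ == "__main__":
    # sys.executable is the interpreter running this script, which may
    # differ from whatever "python" resolves to on PATH.
    print("Running interpreter:", sys.executable)
    for name, path in locate_tools().items():
        print(f"{name}: {path or 'not found on PATH'}")
```

If pip is not found but python is, running "python -m pip install -r requirements.txt" invokes pip through the interpreter instead.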

rafaelmx · 2020-01-31T17:06:30Z

Wow... it worked! Just as you imagined, I didn't check the "Add Python 3.8 to PATH" option, so I uninstalled it and installed it again. It worked perfectly. Thank you very much for your help. I'm impressed with the result.

One question: what will happen to the current files the next time I download new audio? For instance, I downloaded just 2018 and 2019. What if I want to download 2016 as well? Will this script skip the files already downloaded?

GatorQue · 2020-02-01T00:50:55Z

During the download, the Python script usually makes a cache of all the HTML pages it downloads. This enables it to avoid downloading those pages again. As long as you don't remove the cache directory, I think it should work as you expect. I believe it does recreate the playlist files though, since those are usually affected by newly downloaded talks. There are playlist files created by topic, speaker, and session, if I recall correctly.
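
The general idea of a page cache can be sketched as follows. This is not the script's actual implementation; the names cached_get and fetch_url, the hashed filename scheme, and the .html suffix are assumptions chosen for illustration.

```python
import hashlib
import urllib.request
from pathlib import Path


def fetch_url(url):
    """Download a page and return its raw bytes."""
    with urllib.request.urlopen(url) as response:
        return response.read()


def cached_get(url, cache_dir, fetch=fetch_url):
    """Return the body for url, downloading it only when it is not cached."""
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    # A hash of the URL gives a filename that is valid on any filesystem.
    name = hashlib.sha256(url.encode("utf-8")).hexdigest() + ".html"
    cache_file = cache_dir / name
    if cache_file.exists():
        return cache_file.read_bytes()
    body = fetch(url)
    cache_file.write_bytes(body)
    return body
```

With this shape, deleting the cache directory simply forces every page to be fetched again on the next run.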

Jacobobber1087 · 2024-02-29T19:01:28Z

There have been a few changes to the church website. Is there any chance this gets an update?

GatorQue · 2024-03-01T14:18:45Z

@Jacobobber1087, I have been keeping this tool updated under my GitHub fork of this project. Have you given that a try?
https://github.com/GatorQue/LDSGeneralConferenceDownloader/releases
I use it for myself after every conference. If my version isn't working, I will be happy to look into it.

Jacobobber1087 · 2024-03-01T16:48:25Z

@GatorQue Oh ok, thank you! It seems to work, but the destination folder is empty after it completes. Do you know what could cause this?

GatorQue · 2024-03-02T01:14:04Z

@Jacobobber1087, I see the same results. Let me look into what is causing this and post a new version. Something must have changed in the format of the HTML to prevent the program from working right.

GatorQue · 2024-03-02T21:14:25Z

@Jacobobber1087 - It seems that the church has hidden the MP3 download link behind the "Options" side panel, which only seems to load when you click the "Options" button (3 dots) and then click the Download arrow. JavaScript code loads the Options side panel, and clicking the Download arrow somehow loads the link. I haven't found a good way to do that with my current approach. To fix this, I will need to find a Python-based web browser that can run the JavaScript needed to make the MP3 media link appear. I will keep looking into this, but it isn't going to be the easy fix I was hoping for.

Jacobobber1087 · 2024-03-02T21:19:04Z

@GatorQue Yeah, I was very curious how you were getting around the JavaScript in previous versions of this haha... I ended up writing an automation in Microsoft Power Automate Desktop that uses Firefox to iterate through the pages and click through to the download link. It technically worked, but it took forever and was super clunky. Is there any way to interact with JavaScript through a script that you know of?

GatorQue · 2024-03-03T03:59:31Z

@Jacobobber1087,
Great question. In the recent past, the MP3 URL could be found in the giant BASE64 content in the initial HTML download. This changed sometime after October 2023.
From my research today, I have found that if the following JavaScript lines can be executed after the page loads, they should produce a DOM that includes an element (the last one mentioned) whose href value is what we want for the MP3 file:
document.querySelector('[title="Options"]').click()
document.querySelector('button[data-testid="download-menu-button"]').click()
document.querySelector('a[data-testid="download-link-0"]').href

As far as tools are concerned, I have initially looked at Splash, a Docker image with Qt5 WebKit and an HTTP API for performing queries (usually paired with the Python scrapy-splash package). I have also discovered requests-html, which uses a headless Chromium install downloaded by the pyppeteer Python package (but since that package has been abandoned, the download fails). There is also the Python package Selenium, which can also drive a headless Chromium to perform web scrapes, but I haven't done anything with it yet. I think if we can somehow combine the above JavaScript lines with a headless install of Chromium, we might be able to retrieve the information we need.
Another approach would be to identify what the JavaScript downloads and how it modifies the DOM to create the "This Page (MP3)" download reference element when we click on the Options and Download arrows. Yet another approach might be to "predict" the media URL by guessing the filename from the information in the initial HTML, but I suspect that would be a less stable (though certainly faster) approach.
Thoughts?

Jacobobber1087 · 2024-03-03T18:00:55Z

@GatorQue Ok cool. I hadn't heard of a headless browser before, that seems like a really good solution. Would the browser need to be in the foreground? I assume not if you're sending requests through JS?
Predicting the URL would be tricky, they use titles for some General Authorities (but not all) and you would have to know the mp3 bitrate. If this information is in the HTML that could work really well.
How did you get the list of the links to each conference? I had to do that manually because of how the church groups the conferences on their website.
I wonder if there is any way to access /assets/general-conference/ on the media2.ldscdn.org site? It doesn't allow a direct visit, maybe wget?

GatorQue · 2024-03-05T04:48:29Z

@Jacobobber1087,
A headless browser is one that doesn't provide a GUI/window. This means requests must be sent some other way, usually through a REST API or some other technique. Splash uses a custom REST API which allows for injecting additional JavaScript commands to be run after the page loads (which I haven't gotten to work fully yet).
As far as getting the list of conferences, I perform an HTTP GET request for /study/general-conference and parse the HTML using several regular expressions to extract each conference, session, and talk into Python tuple objects. Feel free to look at the gen_conf_downloader.py file in my repository for more details.
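
To give a feel for that style of parsing, here is a minimal sketch. It is not taken from gen_conf_downloader.py: the sample markup, the link pattern, and the tuple layout are all invented for illustration, and the real page structure differs.

```python
import re

# Hypothetical markup for illustration only; the real page differs.
SAMPLE_HTML = """
<a class="talk" href="/study/general-conference/2019/10/11example">First Example Talk</a>
<a class="talk" href="/study/general-conference/2019/10/12example">Second Example Talk</a>
<a href="/study/scriptures">Not a talk</a>
"""

# Assumed link shape: /study/general-conference/<year>/<month>/<slug>
TALK_LINK = re.compile(
    r'<a[^>]*href="(/study/general-conference/(\d{4})/(\d{2})/([^"]+))"[^>]*>([^<]+)</a>'
)


def extract_talks(html):
    """Return (year, month, slug, title, path) tuples, one per talk link."""
    return [
        (int(year), int(month), slug, title.strip(), path)
        for path, year, month, slug, title in TALK_LINK.findall(html)
    ]


if __name__ == "__main__":
    for talk in extract_talks(SAMPLE_HTML):
        print(talk)
```

Regular expressions like this are quick to write but tend to break when the site's HTML changes, which matches the kind of fixes listed in this pull request.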
I will need to re-review the talk and conference HTML to see if enough information could be extracted to predict the media2 URL to use to get the MP3 file. As far as mp3 bitrate, we could just have it try a few different bitrates until it finds one that works.
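
A rough sketch of the "try a few bitrates" idea is below. The URL template, the bitrate labels, and the function names are placeholders, not the real media URL scheme; the only assumption about the server is that it answers HEAD requests.

```python
import urllib.request


def url_exists(url, timeout=10):
    """Return True if a HEAD request for url succeeds."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout):
            return True
    except (OSError, ValueError):
        # URLError and HTTPError are OSError subclasses; ValueError covers
        # malformed URLs.
        return False


def first_working_url(template, bitrates=("128k", "64k"), exists=url_exists):
    """Try each bitrate in order; return the first URL that exists, else None."""
    for bitrate in bitrates:
        candidate = template.format(bitrate=bitrate)
        if exists(candidate):
            return candidate
    return None
```

A call might look like first_working_url("https://example.org/media/talk-{bitrate}.mp3"), with the template built from whatever can be extracted from the talk's HTML.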
Unfortunately, there is no way to "browse" for a list of all files on the media2.ldscdn.org site that I have found. Perhaps there is a hidden index file that would give the complete list, but I haven't seen evidence of this yet. The wget program wouldn't likely yield any different results against the media2 website. I ran wget against a talk page and it practically started downloading all conference years and talks, since they are all interlinked, so I gave up because I want people to be able to limit the conferences they wish to download. I didn't let it run long enough to see if the MP3 files could be discovered, but I suspect it wouldn't find them because of the JavaScript menu factor.

GatorQue · 2024-03-08T04:48:12Z

@Jacobobber1087,
I am happy to report that using Selenium was successful in obtaining the media URL. The results are cached to a file, which is enabled by default now, such that future downloads will be faster. I am doing some more testing but should have an updated release posted soon.
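
For readers curious what a Selenium-based approach could look like, here is a minimal sketch. It is not the code from the release; it simply runs the three querySelector steps quoted earlier in this thread through Selenium's execute_script, polling until each element shows up. The helper names (wait_for_js, find_mp3_url, main), the timeouts, and the choice of headless Chrome are assumptions.

```python
import time

CLICK_OPTIONS = """
const button = document.querySelector('[title="Options"]');
if (button) { button.click(); return true; }
return false;
"""
CLICK_DOWNLOAD = """
const button = document.querySelector('button[data-testid="download-menu-button"]');
if (button) { button.click(); return true; }
return false;
"""
READ_LINK = """
const link = document.querySelector('a[data-testid="download-link-0"]');
return link ? link.href : null;
"""


def wait_for_js(driver, script, timeout=10.0, poll=0.25):
    """Run script repeatedly until it returns a truthy value or time runs out."""
    deadline = time.monotonic() + timeout
    while True:
        result = driver.execute_script(script)
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("page never produced the expected element")
        time.sleep(poll)


def find_mp3_url(driver, timeout=10.0):
    """Open the Options panel, click Download, and return the MP3 link's href."""
    wait_for_js(driver, CLICK_OPTIONS, timeout)
    wait_for_js(driver, CLICK_DOWNLOAD, timeout)
    return wait_for_js(driver, READ_LINK, timeout)


def main(talk_url):
    # Imported here so the helpers above can be used without Selenium installed.
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    options.add_argument("--headless=new")  # no visible browser window
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(talk_url)
        print(find_mp3_url(driver))
    finally:
        driver.quit()
```

Because starting a browser for every talk is slow, storing each discovered URL in a cache file, as described above, avoids repeating the work on later runs.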

Jacobobber1087 · 2024-05-28T01:44:07Z

@GatorQue Sorry for the late reply. I am currently serving as a missionary for the church so I do not have reliable access to a computer. I will look forward to the next release.

GatorQue added 5 commits November 9, 2019 16:44

BUGFIX: Change to use new URL churchofjesuschrist.org

1f85d99

BUGFIX: Address 'Problem with http request'

a181655

BUGFIX: Fix MP3 path retrieval issue due to changes in HTML by church

effe0cb

BUGFIX: Fix title characters causing invalid MP3 filenames

5669369

FEATURE: Add session and talk numbering to improve play order when not using playlists

0d166dc

BUGFIX: Better regular expression for Talk MP3 link retrieval

fd5f380

GatorQue added 3 commits November 9, 2019 17:49

BUGFIX: Change remaining lds.org references to churchofjesuschrist.org

30f12d6

FEATURE: Allow for disabling new session and talk numbering system (enabled by default)

d7d43fb

FEATURE: Document new -nonumbers command line argument

e1aaa3b

This was referenced Nov 14, 2019

Problem with http request #1

Open

Line 36 syntax error. #2

Open

Additional fixes for recent website changes in 2020

0732128
