Turn any website into a minimal Orgmode buffer or .org file.

rtrppl/website2org

website2org.el

website2org.el downloads a website, transforms it into minimalist Orgmode, and presents the results as either a temporary Orgmode buffer or creates an .org file in a specified directory.

I now have three primary use cases for this. 1) Local storage/read it later: I often store websites locally so that I can link them (and specific paragraphs) to my Zettelkasten in orgrr. (They are downloaded to the directory “findings”, an orgrr container, and the tag “orgrr-project” is added automatically. See orgrr for details.) 2) Quickly viewing the contents of a website from within mastodon.el. 3) Loading full articles in Elfeed (see below).

This package is still at an early stage. I use it to replace orgrr-save-website, which draws on org-web-tools--eww-readable and org-web-tools--html-to-org-with-pandoc but has become more fragile by the day. orgrr-save-website also often struggles to produce the kind of Orgmode I want: one with as few HTML fragments as possible.

website2org requires wget but does not use Pandoc. The package parses HTML via RegEx to produce rather minimal-looking Orgmode files, which are much smaller in size (the downloaded version of this readme from GitHub is 311 KB, the website2org version just 9 KB).
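To give a rough idea of what regex-based conversion looks like, here is a minimal sketch in Emacs Lisp. It is not the code website2org uses: the function name my/html-inline-to-org and the regular expressions are assumptions made up for illustration, and they only handle well-formed tags that sit on a single line.

```elisp
;; Illustrative sketch only; `my/html-inline-to-org' is a hypothetical
;; helper and not part of website2org.
(defun my/html-inline-to-org (html)
  "Convert a few inline tags in the string HTML to Org markup via regexps."
  (let ((case-fold-search t)            ; match <STRONG> as well as <strong>
        (out html))
    ;; <strong>text</strong> becomes *text*
    (setq out (replace-regexp-in-string
               "<strong\\b[^>]*>\\(.*?\\)</strong>" "*\\1*" out t))
    ;; <em>text</em> becomes /text/
    (setq out (replace-regexp-in-string
               "<em\\b[^>]*>\\(.*?\\)</em>" "/\\1/" out t))
    ;; <a href="url">text</a> becomes [[url][text]]
    (setq out (replace-regexp-in-string
               "<a\\b[^>]*href=\"\\([^\"]*\\)\"[^>]*>\\(.*?\\)</a>"
               "[[\\1][\\2]]" out t))
    ;; Any tag that is still left is simply dropped.
    (replace-regexp-in-string "<[^>]+>" "" out t)))
```

Evaluating (my/html-inline-to-org "<p>It is <strong>fast</strong> and <em>small</em>.</p>") returns "It is *fast* and /small/.". Nested or multi-line tags would need more care, which is exactly where the regex approach gets fragile.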

You can automatically forward the URL of each downloaded website to an archiving service via (setq website2org-archive t). The default service is archive.today, which can be changed by modifying website2org-archive-url.

Minimal Orgmode

website2org ignores all information before the first <h1> headline and everything after the <footer>. All data about source, <div>, and similar types of tags is also ignored. It respects all paragraphs, headlines, lists (ordered and unordered), inline code, block quotes, <pre>, links (including local links), <strong>, and <em>. Tabs and multiple spaces are reduced to one space. A new line cannot start with a space (or consist of “- ” followed by nothing).
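The whitespace rules at the end of that paragraph can be sketched in a few lines of Emacs Lisp. This is an illustration under assumptions, not the package's own code: my/squeeze-whitespace is a hypothetical name, and emptying a bare “- ” line (rather than deleting the line entirely) is a guess at the intended behavior.

```elisp
;; Illustrative sketch only; `my/squeeze-whitespace' is a hypothetical
;; helper and not part of website2org.
(defun my/squeeze-whitespace (text)
  "Return TEXT with blanks collapsed and line beginnings cleaned up."
  (let* (;; Tabs and runs of spaces are reduced to a single space.
         (text (replace-regexp-in-string "[ \t]+" " " text))
         ;; No line may start with a space.
         (text (replace-regexp-in-string "^ +" "" text))
         ;; A list bullet followed by nothing is emptied (assumption).
         (text (replace-regexp-in-string "^- ?$" "" text)))
    text))
```

The order matters: collapsing runs first means the later patterns only ever have to deal with a single space.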

Known issues

Parsing HTML with RegEx comes with lots of issues. Most experienced coders strongly advise against doing so, for good reason. There are numerous tools to parse HTML, and there is even one built in (libxml-parse-html-region, which might eventually be the basis of a proper rewrite of this package). I also considered tidy-html5, hxclean of html-xml-utils fame, and htmlq. All of these worked to some degree but stopped doing so when leaving the UTF-8 world. In other words, not a single one of them produced acceptable results for Chinese websites. Given the quality of the current solution, I don’t see a pressing need to add such HTML parsing. website2org will work for most sites: the more they stick to common standards and behavior, the better the chances. Right now we may be at 85-95% of websites working, with a 5% chance of some small issue (please report the obvious ones).
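For comparison, this is roughly what the built-in parser route could look like. It is only a sketch: website2org does not work this way today, my/html-paragraph-texts is a hypothetical name, and the code assumes an Emacs (27 or later) built with libxml2 support.

```elisp
;; Illustrative sketch only; `my/html-paragraph-texts' is a hypothetical
;; helper and not part of website2org.
(require 'dom)

(defun my/html-paragraph-texts (html)
  "Return the text of every <p> element in the string HTML, in document order."
  (unless (and (fboundp 'libxml-available-p) (libxml-available-p))
    (error "This Emacs was built without libxml2 support"))
  (with-temp-buffer
    (insert html)
    ;; Parse the whole buffer into a DOM tree of nested lists.
    (let ((dom (libxml-parse-html-region (point-min) (point-max))))
      ;; Collect each paragraph's text, joining child nodes without a separator.
      (mapcar (lambda (node) (dom-texts node ""))
              (dom-by-tag dom 'p)))))
```

A rewrite along these lines would walk the tree and emit Org markup per node type instead of matching tags textually. Whether it copes better with non-UTF-8 pages than the external tools listed above would still have to be tested.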

Still, there are some known issues even with otherwise working websites. Orgmode does not deal well with source blocks within quote blocks. These will look weird.

Change log

0.2.9

  • Ensured that there is at least one space between a word and a link

0.2.8

  • Added option to press the spacebar to scroll in website2org-temp + added option to call website2org-url-to-org from Elfeed (see below)

0.2.7

  • Added support for elfeed-show-mode (and other non-Orgmode URLs in documents) + added minor mode for website2org-temp (press “q” to exit)

See also the changelog.

Installation

Clone the repository:

git clone https://github.com/rtrppl/website2org

To run website2org, load the package by adding this line to your .emacs or init.el:

(load "/path/to/website2org/website2org.el")

You should also bind website2org and website2org-temp to keys:

(global-set-key (kbd "C-M-s-<down>") 'website2org) ;; this is what I use on a Mac
(global-set-key (kbd "C-M-s-<up>") 'website2org-temp)

Or, if you use straight:

(use-package website2org
  :straight (:host github :repo "rtrppl/website2org")
  :config
  (setq website2org-directory "/path/to/where/websites/should/be/stored/") ;; if needed, see below
  :bind
  (:map global-map
        ("C-M-s-<down>" . website2org)
        ("C-M-s-<up>" . website2org-temp)))

Additionally you can set these values:

;; If wget should be called with a different command.
(setq website2org-wget-cmd "wget -q ") 
;; Change the name of the local cache file.
(setq website2org-cache-filename "~/website2org-cache.html") 
;; Set website2org-additional-meta to nil if not applicable. This is for
;; use in orgrr (https://github.com/rtrppl/orgrr).
(setq website2org-additional-meta "#+roam_tags: website orgrr-project") 
;; By default all websites will be stored in the org-directory.
;; Set website2org-directory, if you prefer a different directory.
;; directories must end with /
(setq website2org-directory "/path/to/where/websites/should/be/stored/") 
(setq website2org-filename-time-format "%Y%m%d%H%M%S")
(setq website2org-archive nil) ;; If this is set to t, the URL called will be sent to the archiving URL below
(setq website2org-archive-url "https://archive.today/") 

Functions

These are the primary functions of website2org.el:

  • website2org downloads the website at point (or from a provided URL) and saves it as an Orgmode file.

  • website2org-temp downloads the website at point (or from a provided URL) and presents it as a temporary Orgmode buffer (press “q” to exit the screen; press the spacebar to scroll).
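Saving can also be scripted. The following is a small sketch for an init file, not a tested recipe: it assumes website2org.el is already loaded and that website2org-url-to-org accepts a URL string, as it does in the Elfeed integration further down. The variable my/reading-list, the command my/website2org-save-reading-list, and the example.org URLs are hypothetical.

```elisp
;; Illustrative sketch only. Assumes website2org.el is loaded and that
;; `website2org-url-to-org' takes a URL string (as in the Elfeed
;; integration in this readme). All `my/...' names are hypothetical.
(defvar my/reading-list
  '("https://example.org/first-article"
    "https://example.org/second-article")
  "Hypothetical list of URLs to save as Orgmode files.")

(defun my/website2org-save-reading-list ()
  "Save every URL in `my/reading-list' via `website2org-url-to-org'."
  (interactive)
  (dolist (url my/reading-list)
    (website2org-url-to-org url)))
```

Each call goes through wget, so a long list will take a while and depends on every site being reachable.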

Elfeed

I wrote a small integration for Elfeed (based on elfeed-show-visit), which may be of interest to some:

(defun elfeed-show-visit-website2org (&optional use-generic-p)
  "Visit the current entry in a website2org temporary buffer.
With a prefix argument (C-u), call `website2org-url-to-org' instead
to create an Orgmode document."
  (interactive "P")
  (let ((link (elfeed-entry-link elfeed-show-entry)))
    (when link
      (message "Sent to website2org: %s" link)
      (if use-generic-p
          (website2org-url-to-org link)
        (website2org-to-buffer link)))))

By adding a keybinding you are able to quickly open the current entry in a temporary website2org buffer.

My Elfeed setup basically looks like this:

(use-package elfeed
	:defer t
	:bind
	(:map global-map
	      ("C-x w" . elfeed))
	(:map elfeed-show-mode-map
	      ("w" . elfeed-show-visit-website2org)))
