Use Addressable to heuristically parse invalid URLs and normalize them
chvp committed Jan 27, 2025
1 parent edbc4db commit b269d73
Showing 4 changed files with 18 additions and 3 deletions.
1 change: 1 addition & 0 deletions Gemfile
@@ -7,6 +7,7 @@ ruby '~> 3.3.1'

 gem 'rails', '~> 8.0.1'

+gem 'addressable' # More standards-compliant URI parser
 gem 'bcrypt' # Use Active Model has_secure_password
 gem 'bootsnap', require: false # Reduces boot times through caching; required in config/boot.rb
 gem 'feedjira' # Parse RSS feeds
1 change: 1 addition & 0 deletions Gemfile.lock
@@ -372,6 +372,7 @@ PLATFORMS
   ruby

 DEPENDENCIES
+  addressable
   annotaterb
   bcrypt
   bootsnap
5 changes: 3 additions & 2 deletions app/models/entry.rb
@@ -27,6 +27,8 @@
 #
 # fk_rails_... (subscription_id => subscriptions.id)
 #
+require 'addressable/uri'
+
 class Entry < ApplicationRecord
   FEEDJIRA_KEYS_MAP = {
     author: :author,
@@ -72,8 +74,7 @@ def read?
   end

   def normalize_url(input)
-    # Some urls might contain spaces, so we replace these
-    uri = URI(input.gsub(' ', '%20'))
+    uri = Addressable::URI.heuristic_parse(input).normalize
     # Some entries might contain absolute/relative path to the page they were on
     uri = URI(url).merge(uri) if url.present?
     uri.to_s
14 changes: 13 additions & 1 deletion test/models/entry_test.rb
@@ -71,10 +71,22 @@ class EntryTest < ActiveSupport::TestCase
   end

   # Methods
-  test 'should be able to normalize urls found in post' do
+  test 'should be able to normalize urls found in post when containing spaces' do
     entry = build(:entry, url: 'https://example.com/posts/first.html')

     assert_equal 'https://example.com/image%201.jpg', entry.normalize_url('https://example.com/image 1.jpg')
+  end
+
+  test 'should be able to normalize urls found in post when containing unicode' do
+    entry = build(:entry, url: 'https://example.com/posts/first.html')
+
+    assert_equal 'https://example.com/image%F0%9F%96%A41.jpg', entry.normalize_url('https://example.com/image🖤1.jpg')
+    assert_equal 'https://example.com/image%E2%80%941.jpg', entry.normalize_url('https://example.com/image—1.jpg')
+  end
+
+  test 'should be able to normalize urls found in post when missing host' do
+    entry = build(:entry, url: 'https://example.com/posts/first.html')
+
     assert_equal 'https://example.com/image.jpg', entry.normalize_url('/image.jpg')
   end
 end
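
Note on the motivation: the lines removed from app/models/entry.rb only escaped spaces before handing the string to Ruby's standard `URI()`, which rejects non-ASCII input. The sketch below reproduces that pre-commit behaviour with the standard library alone, which suggests why the unicode test added in this commit would need a more lenient parser. `legacy_normalize_url` and `legacy_error_for` are hypothetical names introduced for illustration, and the explicit `base` argument stands in for the entry's own `url` attribute; none of this is code from the repository.

```ruby
require 'uri'

# Hypothetical stand-in for the pre-commit Entry#normalize_url.
# `base` plays the role of the entry's `url` attribute (an assumption).
def legacy_normalize_url(input, base = nil)
  # Old approach: only spaces are escaped before parsing
  uri = URI(input.gsub(' ', '%20'))
  # Resolve relative paths against the page the entry came from
  uri = URI(base).merge(uri) if base && !base.empty?
  uri.to_s
end

# Hypothetical helper: returns the error class raised for `input`,
# or nil when the legacy approach parses it without complaint.
def legacy_error_for(input, base = nil)
  legacy_normalize_url(input, base)
  nil
rescue URI::InvalidURIError => e
  e.class
end

page = 'https://example.com/posts/first.html'

# Spaces and host-less paths were already handled before this commit
puts legacy_normalize_url('https://example.com/image 1.jpg', page)
puts legacy_normalize_url('/image.jpg', page)

# Non-ASCII characters make the standard parser raise
puts legacy_error_for('https://example.com/image🖤1.jpg', page).inspect
```

The first two calls correspond to the "containing spaces" and "missing host" tests, which the old code could already satisfy. The last call corresponds to the new "containing unicode" test, where the standard-library parser raises instead of percent-encoding the character.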
