JamieMartin/URL-identifier

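This Python function returns a list of every URL found in free-form text. It matches `http`, `https`, and `ftp` URLs, and skips IP-address hosts in private, loopback, link-local, and reserved ranges.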

    import re

    # Find all URLs in a block of text.
    def grabUrls(text):
        # Raw strings keep the backslashes intact for the regex engine.
        url_re = re.compile(
            # protocol identifier
            r"(?:(?:https?|ftp)://)"
            # user:pass authentication
            r"(?:\S+(?::\S*)?@)?"
            r"(?:"
            # IP address exclusion
            # private & local networks
            r"(?!(?:10|127)(?:\.\d{1,3}){3})"
            r"(?!(?:169\.254|192\.168)(?:\.\d{1,3}){2})"
            r"(?!172\.(?:1[6-9]|2\d|3[0-1])(?:\.\d{1,3}){2})"
            # IP address dotted notation octets
            # excludes addresses whose first octet is 0
            # excludes reserved space >= 224.0.0.0
            # excludes network & broadcast addresses
            # (first & last IP address of each class)
            r"(?:[1-9]\d?|1\d\d|2[01]\d|22[0-3])"
            r"(?:\.(?:1?\d{1,2}|2[0-4]\d|25[0-5])){2}"
            r"(?:\.(?:[1-9]\d?|1\d\d|2[0-4]\d|25[0-4]))"
            r"|"
            # host name
            r"(?:(?:[a-z\u00a1-\uffff0-9]-*)*[a-z\u00a1-\uffff0-9]+)"
            # domain name
            r"(?:\.(?:[a-z\u00a1-\uffff0-9]-*)*[a-z\u00a1-\uffff0-9]+)*"
            # TLD identifier
            r"(?:\.(?:[a-z\u00a1-\uffff]{2,}))"
            r")"
            # port number
            r"(?::\d{2,5})?"
            # resource path
            r"(?:/\S*)?",
            re.UNICODE)
        # Every group is non-capturing, so findall() returns whole matches.
        output = url_re.findall(text)
        return output
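
The function depends on `findall` returning whole matches, which it does only because every group in the pattern is written as non-capturing (`(?:...)`). Here is a minimal sketch of that behavior. It uses a deliberately simplified stand-in pattern rather than the repository's full expression, and the sample text and variable names are assumptions made for illustration:

```python
import re

# Hypothetical sample text, used only for this illustration.
text = "Docs at https://example.com/docs and ftp://files.example.org/pub"

# Simplified stand-in patterns (NOT the full expression used by grabUrls).
non_capturing = re.compile(r"(?:https?|ftp)://\S+")
capturing = re.compile(r"(https?|ftp)://\S+")

# With no capturing groups, findall() returns each full match.
whole_matches = non_capturing.findall(text)

# With one capturing group, findall() returns only that group's text.
scheme_only = capturing.findall(text)

print(whole_matches)
print(scheme_only)
```

If any group in the full pattern were changed from `(?:...)` to `(...)`, the function would return the captured fragments instead of complete URLs.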
