Block subdomains of known bad domains, while ignoring free services which allow you to register a subdomain on them, as we cannot afford a blanket ban on those.
One possible way to parse a huge array and find common domains with different subdomains is the Python library tldextract: https://stackoverflow.com/questions/1066933/how-to-extract-top-level-domain-name-tld-from-url
This library can extract the top-level domain (TLD) and the second-level domain (SLD) from a URL while ignoring the subdomain; free services like pages.dev or fleek.co are treated as suffixes, so sites hosted on them are not lumped together.
For example, using tldextract on gifts9586.giftforyou.top will return giftforyou as the SLD and top as the TLD.
Sample code that demonstrates how to use tldextract on a list of URLs and store the unique domains in a set:
import tldextract

# Your array of URLs
urls = [
    "gifts9586.giftforyou.top",
    "gifts3807.giftforyou.top",
    "gifts5344.giftforyou.top",
    "gifts3803.giftforyou.top",
    "gifts2487.giftforyou.top",
    "gifts1423.giftforyou.top",
    "gifts6549.giftforyou.top",
    "gifts6999.giftforyou.top",
    "blog.example.com",
    "news.example.com",
    "shop.example.com",
    "foo.bar.baz.com",
    "hello.world.com",
    "test.netlify.app",
    "demo.github.io",
]

# A set to store the unique domains
domains = set()

# Loop through the URLs and extract the domains
for url in urls:
    # Use tldextract to get the SLD and TLD
    ext = tldextract.extract(url)
    sld = ext.domain
    tld = ext.suffix
    # Combine the SLD and TLD with a dot
    domain = sld + "." + tld
    # Add the domain to the set
    domains.add(domain)

# Print the set of unique domains
print(domains)
The subdomains and the free services are ignored, and only the unique domains are kept in the set.
This dumb PoC could be used to blanket-block fraudulent sites more efficiently, with less clutter in the blocklist array and not much manual effort.
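A minimal sketch of that idea: turn the unique registered domains produced by the PoC above into wildcard blocklist entries. The `*.` prefix syntax is an assumption about what the blocklist format accepts.

```python
# Registered domains produced by the tldextract PoC above (sample values).
domains = {"giftforyou.top", "example.com"}

# One wildcard entry per registered domain; the "*." prefix syntax is
# an assumption about the blocklist format, not something from the PoC.
blocklist = sorted("*." + d for d in domains)
print(blocklist)  # ['*.example.com', '*.giftforyou.top']
```

Each wildcard entry then covers every subdomain of the bad domain, no matter how many new ones the operator registers.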
Ignore those free services. For instance, com.cn is always bad (a fake gTLD). We don't need 11 explicit entries; a blanket ban on
*.com.cn
is good enough.

Example 1:
Start with:
End with:
From 253 characters, we end up with 9 (96% reduction)
Example 2:
Start with:
End with:
From 81 characters, we end up with 19 (76% reduction)
Example 3:
Start with:
End with:
From 77 characters, we end up with 16 (79% reduction)
Example 4:
Start with:
End with:
From 230 characters, we end up with 17 (92% reduction)
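The character-count arithmetic in the examples above can be reproduced with a short script. To stay self-contained, this sketch uses a naive "last two labels" heuristic instead of tldextract, so it would misjudge multi-label suffixes like com.cn or pages.dev; `registered_domain` and `reduction` are hypothetical helpers written for this illustration, not part of any library.

```python
def registered_domain(host: str) -> str:
    # Naive "last two labels" heuristic, for illustration only; real code
    # should use tldextract, which knows multi-label suffixes like com.cn.
    return ".".join(host.split(".")[-2:])

def reduction(entries: list[str]) -> tuple[str, int, int]:
    # Collapse a group of subdomain entries into one wildcard entry and
    # report the character savings, as in the examples above.
    before = sum(len(e) for e in entries)
    collapsed = "*." + registered_domain(entries[0])
    pct = round(100 * (1 - len(collapsed) / before))
    return collapsed, before, pct

entries = [f"gifts{n}.giftforyou.top" for n in
           (9586, 3807, 5344, 3803, 2487, 1423, 6549, 6999)]
collapsed, before, pct = reduction(entries)
print(f"From {before} characters, we end up with {len(collapsed)} ({pct}% reduction)")
# From 192 characters, we end up with 16 (92% reduction)
```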
similar to #13133, but different