[Enhancement request] - Reduce clutter by careful pruning of only some nested subdomains #24787

dubstard · 2024-02-26T06:46:17Z

Block all subdomains of known bad domains, while ignoring free services that let anyone register a subdomain on them, as we cannot afford a blanket ban on those.

Ignore these:

		"fleek.co",
		"amplifyapp.com",
		"ddns.net",
		"ddns.us",
		"duckdns.org",
		"firebaseapp.com",
		"github.io",
		"herokuapp.com",
		"hopto.org",
		"js.org",
		"netlify.app",
		"on.fleek.co",
		"pagekite.me",
		"pages.dev",
		"plesk.page",
		"servehttp.com",
		"square.site",
		"surge.sh",
		"sytes.net",
		"timeweb.ru",
		"vercel.app",
		"web.app",
		"webflow.io",
		"weebly.com",
		"wixsite.com",
		"zapto.org"
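
A minimal sketch of how the ignore list could be applied, assuming a plain suffix match on hostname labels. The function name `free_service_for` is hypothetical, and only a few entries from the list are repeated here:

```python
from typing import Optional

# Excerpt of the ignore list above; the full list would go here.
FREE_SERVICES = {
    "github.io",
    "netlify.app",
    "on.fleek.co",
    "pages.dev",
    "vercel.app",
    "web.app",
}


def free_service_for(host: str) -> Optional[str]:
    """Return the free-service entry that `host` sits under, or None."""
    labels = host.lower().rstrip(".").split(".")
    # Try every label-aligned suffix of the hostname, longest first.
    for i in range(len(labels)):
        candidate = ".".join(labels[i:])
        if candidate in FREE_SERVICES:
            return candidate
    return None


print(free_service_for("scam-airdrop.pages.dev"))  # pages.dev
print(free_service_for("giftforyou.top"))          # None
```

Matching on whole labels rather than raw string endings keeps a host like notpages.dev from being mistaken for a pages.dev subdomain.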

For instance, com.cn is always bad (fake gTLD).
We don't need 11 explicit entries; a blanket ban on *.com.cn is good enough.

Example 1:

Start with

"ethereumclassic.com.cn",
"balancer.com.cn",
"trust-wallet.com.cn",
"0pensea.com.cn",
"coinbasewallet.com.cn",
"coinbase-wallet.com.cn",
"metamaskwallet.com.cn",
"coinbase-eth.com.cn",
"coinbaseusdt.com.cn",
"imtoken.com.cn",
"tokenim.com.cn",

End with

"com.cn",

From 253 characters, we end up with 9 (96% shrink)

Example 2:
Start with

"build.arbitrum-arb.icu",
"rewards.arbitrum-arb.icu",
"claim.arbitrum-arb.icu",

End with

"arbitrum-arb.icu",

From 81 characters, we end up with 19 (76% shrink)

Example 3:
Start with:

"zetachain.eth-air20.com",
"linea.eth-air20.com",
"arbitrum.eth-air20.com",

End with

"eth-air20.com",

From 77 characters, we end up with 16 (79% reduction)

Example 4:
Start with:

"gifts9586.giftforyou.top",
"gifts3807.giftforyou.top",
"gifts5344.giftforyou.top",
"gifts3803.giftforyou.top",
"gifts2487.giftforyou.top",
"gifts1423.giftforyou.top",
"gifts6549.giftforyou.top",
"gifts6999.giftforyou.top",

End with:

"giftforyou.top",

From 230 characters, we end up with 17 (92% reduction)
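
The pruning step in the examples above can be sketched as follows. This assumes that whatever consumes the blocklist also matches subdomains of each entry, which is not confirmed here; `prune_covered` is a hypothetical helper name:

```python
def prune_covered(entries):
    """Drop entries that are subdomains of another entry in the same list."""
    present = {e.lower() for e in entries}
    kept = []
    for entry in entries:
        labels = entry.lower().split(".")
        # Proper parents only: start at 1 so the entry itself is not counted.
        parents = (".".join(labels[i:]) for i in range(1, len(labels)))
        if not any(parent in present for parent in parents):
            kept.append(entry)
    return kept


blocklist = [
    "build.arbitrum-arb.icu",
    "rewards.arbitrum-arb.icu",
    "claim.arbitrum-arb.icu",
    "arbitrum-arb.icu",
    "linea.eth-air20.com",
]
print(prune_covered(blocklist))
# ['arbitrum-arb.icu', 'linea.eth-air20.com']
```

Once the parent entry is in the list, the three subdomain entries become redundant and drop out, while unrelated entries are left untouched.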

One possible way to parse a huge array and find common domains with different subdomains is to use a Python library called tldextract:
https://stackoverflow.com/questions/1066933/how-to-extract-top-level-domain-name-tld-from-url
This library splits a hostname into its subdomain, its second-level domain (SLD), and its public suffix (TLD). By default it treats free services like pages.dev or fleek.co as ordinary registered domains, so those have to be skipped explicitly.
For example, using tldextract on gifts9586.giftforyou.top will return giftforyou as the SLD and top as the TLD.
Note that com.cn from Example 1 is itself a public suffix to tldextract, so that case would need its own rule.

Sample code that demonstrates how to use tldextract on a list of hostnames and store the unique domains in a set:

import tldextract

# Free services to leave alone (excerpt of the ignore list above)
FREE_SERVICES = {"netlify.app", "github.io"}

# Your array of hostnames
urls = [
    "gifts9586.giftforyou.top",
    "gifts3807.giftforyou.top",
    "gifts5344.giftforyou.top",
    "gifts3803.giftforyou.top",
    "gifts2487.giftforyou.top",
    "gifts1423.giftforyou.top",
    "gifts6549.giftforyou.top",
    "gifts6999.giftforyou.top",
    "blog.example.com",
    "news.example.com",
    "shop.example.com",
    "foo.bar.baz.com",
    "hello.world.com",
    "test.netlify.app",
    "demo.github.io",
]

# A set to store the unique domains
domains = set()

# Loop through the hostnames and extract the domains
for url in urls:
    # Use tldextract to get the SLD and TLD
    ext = tldextract.extract(url)

    # Combine the SLD and TLD with a dot
    domain = ext.domain + "." + ext.suffix

    # Skip free services, which must not be blanket-blocked
    if domain in FREE_SERVICES:
        continue

    # Add the domain to the set
    domains.add(domain)

# Print the unique domains in a stable order
print(sorted(domains))

The output of this code is:

['baz.com', 'example.com', 'giftforyou.top', 'world.com']

The subdomains are stripped, the free services are skipped, and only the unique domains are kept in the set.
This dumb PoC could be used to blanket block fraudulent sites more efficiently, with less clutter in the blocklist array and not much manual effort.
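
To keep the pruning "careful", one option is to propose a blanket entry only when several listed hosts share the same parent. The sketch below illustrates that idea under stated assumptions: `propose_collapses` and `naive_registered_domain` are hypothetical names, and the naive "last two labels" rule is only a stand-in for tldextract so the example runs without extra dependencies. It gives wrong answers for multi-label suffixes such as co.uk.

```python
from collections import defaultdict


def naive_registered_domain(host):
    """Hypothetical stand-in for tldextract: keep the last two labels."""
    return ".".join(host.lower().split(".")[-2:])


def propose_collapses(hosts, free_services, min_hosts=2,
                      registered_domain=naive_registered_domain):
    """Map each candidate blanket entry to the listed hosts it would replace."""
    groups = defaultdict(list)
    for host in hosts:
        groups[registered_domain(host)].append(host)

    proposals = {}
    for domain, members in groups.items():
        if domain in free_services:
            continue  # never blanket-block a free hosting service
        if len(members) >= min_hosts:
            proposals[domain] = members
    return proposals


hosts = [
    "zetachain.eth-air20.com",
    "linea.eth-air20.com",
    "arbitrum.eth-air20.com",
    "hello.world.com",
    "test.netlify.app",
    "demo.netlify.app",
]
print(propose_collapses(hosts, free_services={"netlify.app"}))
```

Here only eth-air20.com is proposed: world.com has a single listed host, and netlify.app is on the ignore list. The proposals could then be reviewed by a human before anything is merged into the blocklist.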

similar to #13133, but different


AlexHerman1 added the improvement label Feb 27, 2024

AlexHerman1 assigned samczsun Feb 27, 2024
