[Enhancement request] - Reduce clutter by careful pruning of only some nested subdomains #24787

Open
dubstard opened this issue Feb 26, 2024 · 0 comments
dubstard commented Feb 26, 2024

Block subdomains of known bad domains, while ignoring free services that let anyone register a subdomain on them, since we cannot afford a blanket ban on those.

Ignore these:

		"fleek.co",
		"amplifyapp.com",
		"ddns.net",
		"ddns.us",
		"duckdns.org",
		"firebaseapp.com",
		"github.io",
		"herokuapp.com",
		"hopto.org",
		"js.org",
		"netlify.app",
		"on.fleek.co",
		"pagekite.me",
		"pages.dev",
		"plesk.page",
		"servehttp.com",
		"square.site",
		"surge.sh",
		"sytes.net",
		"timeweb.ru",
		"vercel.app",
		"web.app",
		"webflow.io",
		"weebly.com",
		"wixsite.com",
		"zapto.org"

For instance, com.cn is always bad (fake gTLD).
We don't need 11 explicit entries; a blanket ban on *.com.cn is good enough.
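A blanket entry only works if whatever consumes the list treats each entry as a suffix, i.e. a host is blocked when it or any of its parent domains is listed. Here is a minimal sketch of that matching rule. It is an assumption about how the list is applied, not the project's actual matcher, and `parent_domains` / `is_blocked` are hypothetical names.

```python
def parent_domains(hostname):
    """Yield the hostname itself, then each parent: a.b.c -> a.b.c, b.c, c."""
    labels = hostname.lower().rstrip(".").split(".")
    for i in range(len(labels)):
        yield ".".join(labels[i:])


def is_blocked(hostname, blocklist):
    """True if the hostname or any of its parent domains is in the blocklist."""
    return any(candidate in blocklist for candidate in parent_domains(hostname))


# Hypothetical pruned blocklist with two blanket entries
blocklist = {"com.cn", "giftforyou.top"}

print(is_blocked("metamaskwallet.com.cn", blocklist))     # True
print(is_blocked("gifts9586.giftforyou.top", blocklist))  # True
print(is_blocked("example.com", blocklist))               # False
```

Under this rule a single "github.io" entry would block every site hosted there, which is why the free services above can never be used as blanket entries.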

Example 1:

Start with

"ethereumclassic.com.cn",
"balancer.com.cn",
"trust-wallet.com.cn",
"0pensea.com.cn",
"coinbasewallet.com.cn",
"coinbase-wallet.com.cn",
"metamaskwallet.com.cn",
"coinbase-eth.com.cn",
"coinbaseusdt.com.cn",
"imtoken.com.cn",
"tokenim.com.cn",

End with

"com.cn",

From 253 characters, we end up with 9 (96% shrink)

Example 2:
Start with

"build.arbitrum-arb.icu",
"rewards.arbitrum-arb.icu",
"claim.arbitrum-arb.icu",

End with

"arbitrum-arb.icu",

From 81 characters, we end up with 19 (76% shrink)

Example 3:
Start with:

"zetachain.eth-air20.com",
"linea.eth-air20.com",
"arbitrum.eth-air20.com",

End with

"eth-air20.com",

From 77 characters, we end up with 16 (79% reduction)

Example 4:
Start with:

"gifts9586.giftforyou.top",
"gifts3807.giftforyou.top",
"gifts5344.giftforyou.top",
"gifts3803.giftforyou.top",
"gifts2487.giftforyou.top",
"gifts1423.giftforyou.top",
"gifts6549.giftforyou.top",
"gifts6999.giftforyou.top",

End with:

"giftforyou.top",

From 230 characters, we end up with 17 (92% reduction)
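The four examples all follow the same pattern: several entries share a parent, so the parent replaces them. A naive sketch of that pruning step, using only the standard library, could look like the following. `collapse`, `listed_size`, the `min_siblings` threshold and the abbreviated `IGNORED` set are all assumptions for illustration. The size helper counts each entry as a quoted string plus a trailing comma and ignores whitespace, so its totals may differ slightly from the figures quoted above.

```python
from collections import defaultdict

# Abbreviated stand-in for the ignore list of free services
IGNORED = {"github.io", "netlify.app", "pages.dev"}


def collapse(entries, ignored=IGNORED, min_siblings=3):
    """Replace groups of sibling entries with their shared parent domain."""
    children = defaultdict(list)
    for entry in entries:
        # Drop the leftmost label: "claim.arbitrum-arb.icu" -> "arbitrum-arb.icu"
        _, _, parent = entry.partition(".")
        children[parent].append(entry)

    result = []
    for parent, group in children.items():
        collapsible = (
            parent.count(".") >= 1           # never collapse onto a bare TLD
            and parent not in ignored        # never collapse onto a free service
            and len(group) >= min_siblings   # only when enough siblings agree
        )
        result.extend([parent] if collapsible else group)
    return sorted(set(result))


def listed_size(entries):
    """Characters used by the entries when written as "entry", in the array."""
    return sum(len(f'"{entry}",') for entry in entries)


before = [
    "build.arbitrum-arb.icu",
    "rewards.arbitrum-arb.icu",
    "claim.arbitrum-arb.icu",
    "a.github.io",
    "b.github.io",
    "c.github.io",
]
after = collapse(before)
print(after)  # ['a.github.io', 'arbitrum-arb.icu', 'b.github.io', 'c.github.io']
print(listed_size(before), "->", listed_size(after))  # 119 -> 61
```

This is deliberately naive: without a public suffix check it would also collapse three unrelated `*.co.uk` entries into `co.uk`, so any output would still need a human review before it is merged.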


One possible way to parse a huge array and find common domains with different subdomains is to use a Python library called tldextract:
https://stackoverflow.com/questions/1066933/how-to-extract-top-level-domain-name-tld-from-url
This library uses the Public Suffix List to split a hostname into the subdomain, the registrable domain (loosely called the SLD here) and the suffix (TLD). With its default settings it does not skip free services like pages.dev or fleek.co by itself, so those still have to be filtered out with the ignore list.
For example, using tldextract on gifts9586.giftforyou.top will return giftforyou as the SLD and top as the TLD.
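To make the split concrete without depending on the library, here is a small stand-in that does a longest-suffix match against a tiny hard-coded suffix set. The `SUFFIXES` set and `split_host` are assumptions for illustration only; the real Public Suffix List has thousands of entries and tldextract handles more edge cases.

```python
# Tiny stand-in for the Public Suffix List (the real list is far larger)
SUFFIXES = {"top", "icu", "com", "cn", "com.cn"}


def split_host(hostname):
    """Return (subdomain, domain, suffix) using the longest matching suffix."""
    labels = hostname.lower().split(".")
    # i = 0 is the longest candidate, so the first hit is the longest suffix
    for i in range(len(labels)):
        suffix = ".".join(labels[i:])
        if suffix in SUFFIXES:
            domain = labels[i - 1] if i > 0 else ""
            subdomain = ".".join(labels[: i - 1]) if i > 1 else ""
            return subdomain, domain, suffix
    # No known suffix: return the hostname unsplit
    return "", hostname, ""


print(split_host("gifts9586.giftforyou.top"))  # ('gifts9586', 'giftforyou', 'top')
print(split_host("balancer.com.cn"))           # ('', 'balancer', 'com.cn')
```

The second call shows a catch: com.cn is itself listed as a public suffix, so a suffix-list-based split keeps `balancer.com.cn` as the registrable domain. Collapsing Example 1 down to `com.cn` would therefore need an explicit extra rule on top of the library.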

Sample code that demonstrates how to use tldextract on a list of URLs and store the unique domains in a set:

import tldextract

# Free services that must never become blanket entries
# (abbreviated here; use the full ignore list from above)
ignored = {"netlify.app", "github.io", "pages.dev", "fleek.co"}

# Your array of URLs
urls = [
    "gifts9586.giftforyou.top",
    "gifts3807.giftforyou.top",
    "gifts5344.giftforyou.top",
    "gifts3803.giftforyou.top",
    "gifts2487.giftforyou.top",
    "gifts1423.giftforyou.top",
    "gifts6549.giftforyou.top",
    "gifts6999.giftforyou.top",
    "blog.example.com",
    "news.example.com",
    "shop.example.com",
    "foo.bar.baz.com",
    "hello.world.com",
    "test.netlify.app",
    "demo.github.io",
]

# A set to store the unique domains
domains = set()

# Loop through the URLs and extract the domains
for url in urls:
    # Use tldextract to get the SLD and TLD
    ext = tldextract.extract(url)
    sld = ext.domain
    tld = ext.suffix

    # Combine the SLD and TLD with a dot
    domain = sld + "." + tld

    # Skip free services; tldextract does not filter them by default
    if domain in ignored:
        continue

    # Add the domain to the set
    domains.add(domain)

# Print the set of unique domains
print(domains)

The output of this code is (set ordering may vary):

{'giftforyou.top', 'example.com', 'baz.com', 'world.com'}

The subdomains and the free services are ignored, and only the unique domains are kept in the set.
This dumb PoC could help blanket-block fraudulent sites more efficiently, with less clutter in the blocklist array and little manual effort.

Similar to #13133, but different.

@AlexHerman1 AlexHerman1 added the improvement Issue or PR for features in the software of this repo label Feb 27, 2024