This repository has been archived by the owner on Mar 1, 2023. It is now read-only.

flag/syntax to return exact matches only #5

Open
gryphius opened this issue Mar 3, 2021 · 11 comments

Comments

@gryphius

gryphius commented Mar 3, 2021

Thanks for searchzone.ch, it is a useful tool

Is it possible to somehow disable similarity search and only return results which contain the search string exactly?
For example I tried to perform "passive dns" like searches to see which ch-domains are hosted in certain ip ranges, but the results contain many unrelated results which just start with similar octets.

@pesc
Owner

pesc commented Mar 3, 2021

Hi, thanks for the feedback!

Hmm, would something like this help? Or what's your exact use case?

curl --location --request POST 'https://api.searchzone.ch/api/as/v1/engines/domains-prod/search' \
--header 'authorization: Bearer search-fwyyo4i26hj5nruvauu3d372' \
--header 'Content-Type: application/json' \
--data-raw '{
    "search_fields": {
        "a_record": {}
    },
    "result_fields": {
        "domain": {
            "raw": {}
        }
    },
    "query": "151.101.1."
}'

@gryphius
Author

gryphius commented Mar 3, 2021

for example, if I wanted to search for domains which resolve to 2a02:168:2132::*:

curl --location --request POST 'https://api.searchzone.ch/api/as/v1/engines/domains-prod/search' \
--header 'authorization: Bearer search-fwyyo4i26hj5nruvauu3d372' \
--header 'Content-Type: application/json' \
--data-raw '{
    "search_fields": {
        "aaaa_record": {}
    },
    "result_fields": {
        "domain": {
            "raw": {}
        },
        "aaaa_record": {
            "raw": {}
        }
    },
    "query": "2a02:168:2132:"
}'

however, this currently also returns "similar" records, such as:

[...]
   {
      "domain": {
        "raw": "sayari.ch"
      },
      "aaaa_record": {
        "raw": [
          "2a02:168:be04::42"
        ]
      },
      "_meta": {
        "id": "sayari.ch",
        "engine": "domains-prod",
        "score": 5.4933805
      },
      "id": {
        "raw": "sayari.ch"
      }
    },
    {
      "domain": {
        "raw": "alainwolf.ch"
      },
      "aaaa_record": {
        "raw": [
          "2a02:168:f405::42"
        ]
      },
      "_meta": {
        "id": "alainwolf.ch",
        "engine": "domains-prod",
        "score": 5.4933805
      },
      "id": {
        "raw": "alainwolf.ch"
      }
    }

i.e. the aaaa record does not contain 2a02:168:2132

similarly, if I search for "picantepizza", I get tons of results which contain the word "pizza" but not necessarily "picantepizza", such as:

ristorantepizzerialafortuna.ch
ns1.hostserv.eu. info.computrade.ch. 2020101002 7200 120 2419200 10800
185.178.193.95
ns2.hostserv.eu.
ns1.hostserv.eu.
ns3.hostserv.eu.
mail.ristorantepizzerialafortuna.ch.

so, what I was hoping for is an option in the GUI/API to only return results which contain the full search string, and not perform any similarity searches.
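
Until such an option exists, one workaround is to filter the returned results on the client. The following is a minimal sketch, assuming the response shape shown above, where each result holds its values under `<field>.raw` as a string or a list of strings. `keep_containing` is a hypothetical helper, not part of the searchzone.ch API, and the second sample entry reuses a domain and address that appear later in this thread.

```python
from typing import Any


def keep_containing(results: list[dict[str, Any]], field: str, needle: str) -> list[dict[str, Any]]:
    """Keep only results whose `field` literally contains `needle`."""
    kept = []
    for result in results:
        raw = result.get(field, {}).get("raw")
        if raw is None:
            continue  # field missing from this result
        values = raw if isinstance(raw, list) else [raw]
        if any(needle in str(value) for value in values):
            kept.append(result)
    return kept


# Sample data in the shape of the App Search results quoted above.
sample = [
    {"domain": {"raw": "sayari.ch"}, "aaaa_record": {"raw": ["2a02:168:be04::42"]}},
    {"domain": {"raw": "opteamal.ch"}, "aaaa_record": {"raw": ["2a02:168:2132::2"]}},
]
matches = keep_containing(sample, "aaaa_record", "2a02:168:2132:")
print([m["domain"]["raw"] for m in matches])
```

This only narrows down what the API already returned, so exact matches that rank below the similar ones would still need to be fetched page by page first.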

@pesc
Owner

pesc commented Mar 3, 2021

Alright, let me take a look at it on the weekend or some evening. I guess it has to do with how Elasticsearch indexes this field...

@pesc pesc added the bug Something isn't working label Mar 5, 2021
@pesc
Owner

pesc commented Mar 5, 2021

I've checked it, and it seems to be a problem with how the data gets indexed in Elasticsearch. I have asked the Elastic team how to solve it with App Search, which I'm using under the hood. I'll update here if I get a solution from their side...

@pesc pesc self-assigned this Mar 7, 2021
@pesc
Owner

pesc commented Apr 15, 2021

Sorry for the long delay. I'm quite busy with school and work. Sadly there has been no progress from Elastic's side: https://discuss.elastic.co/t/precise-regex-search/266141/4

I'll try to fix and reindex the data on the weekend...

@gryphius
Author

no worries, thanks for the update!

@pesc
Owner

pesc commented May 18, 2021

Ok, it's a product limitation of App Search (it may be added in a future version).

Anyway, I planned to create a REST API that queries the Elasticsearch backend directly. With that implemented, exact matching will be possible.

For example:

{
    "_source": [
        "domain$string"
    ],
    "query": {
        "prefix": {
            "aaaa_record$string": {
                "value": "2a02:168:2132:"
            }
        }
    }
}

or

{
    "_source": [
        "domain$string"
    ],
    "query": {
        "wildcard": {
            "aaaa_record$string": "2a02:168:2132:*"
        }
    }
}

Both currently return 8 matches. Does that look plausible? 🤔

My semester ends soon, hopefully I'll find some time to continue with the project.
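
The two request bodies above differ only in the query clause, and a wildcard pattern with a single trailing `*` selects the same values as the corresponding prefix. A small sketch of building both bodies programmatically; the function names are hypothetical, and the field names are copied from the examples above.

```python
import json


def prefix_query(field: str, value: str, source: list[str]) -> dict:
    """Body for an Elasticsearch prefix query on `field`."""
    return {"_source": source, "query": {"prefix": {field: {"value": value}}}}


def wildcard_query(field: str, pattern: str, source: list[str]) -> dict:
    """Body for an Elasticsearch wildcard query (short form) on `field`."""
    return {"_source": source, "query": {"wildcard": {field: pattern}}}


body = prefix_query("aaaa_record$string", "2a02:168:2132:", ["domain$string"])
alternative = wildcard_query("aaaa_record$string", "2a02:168:2132:*", ["domain$string"])
print(json.dumps(body, indent=4))
print(json.dumps(alternative, indent=4))
```

When only a leading string has to match, the prefix form is the more direct choice; the wildcard form becomes useful once the pattern needs `*` or `?` in other positions.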

@pesc
Owner

pesc commented Aug 27, 2021

So, for testing purposes you can use this endpoint. The syntax is that of the Elasticsearch search API: https://www.elastic.co/guide/en/elasticsearch/reference/current/search-search.html

It isn't documented on my side yet, and I'm not sure whether I'll leave it like this (security, ...), but if you need help with the syntax or the fields, let me know.

curl --location --request GET 'https://dev.searchzone.ch/domains/_search?pretty&filter_path=hits.total.value,hits.hits._id,hits.hits._source.aaaa_record' \
--header 'Content-Type: application/json' \
--data-raw '{
  "query": {
    "prefix": {
      "aaaa_record.enum": {
        "value": "2a02:168:2132:"
      }
    }
  }
}'

Resulting in:

{
    "hits": {
        "total": {
            "value": 8
        },
        "hits": [
            {
                "_id": "opteamal.ch",
                "_source": {
                    "aaaa_record": [
                        "2a02:168:2132::2"
                    ]
                }
            },
            {
                "_id": "organicbodycare.ch",
                "_source": {
                    "aaaa_record": [
                        "2a02:168:2132::2"
                    ]
                }
            },
            {
                "_id": "organic-body-care.ch",
                "_source": {
                    "aaaa_record": [
                        "2a02:168:2132::2"
                    ]
                }
            },
            {
                "_id": "hadornag.ch",
                "_source": {
                    "aaaa_record": [
                        "2a02:168:2132::2"
                    ]
                }
            },
            {
                "_id": "host-bliss.ch",
                "_source": {
                    "aaaa_record": [
                        "2a02:168:2132::2"
                    ]
                }
            },
            {
                "_id": "chromos.ch",
                "_source": {
                    "aaaa_record": [
                        "2a02:168:2132::2"
                    ]
                }
            },
            {
                "_id": "onlineshophosting.ch",
                "_source": {
                    "aaaa_record": [
                        "2a02:168:2132::2"
                    ]
                }
            },
            {
                "_id": "websitedesign.ch",
                "_source": {
                    "aaaa_record": [
                        "2a02:168:2132::2"
                    ]
                }
            }
        ]
    }
}
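
A response trimmed with `filter_path` like the one above is easy to post-process. The sketch below maps each domain to its addresses; `addresses_by_domain` is a hypothetical helper, and it tolerates a missing `hits.hits` key on the assumption that the filtered response may omit it when nothing matches.

```python
def addresses_by_domain(response: dict, field: str = "aaaa_record") -> dict[str, list[str]]:
    """Map each hit's _id (the domain) to the values of `field` in its _source."""
    hits = response.get("hits", {}).get("hits", [])
    return {hit["_id"]: hit.get("_source", {}).get(field, []) for hit in hits}


# Excerpt: two of the eight hits from the response above.
sample_response = {
    "hits": {
        "total": {"value": 8},
        "hits": [
            {"_id": "opteamal.ch", "_source": {"aaaa_record": ["2a02:168:2132::2"]}},
            {"_id": "hadornag.ch", "_source": {"aaaa_record": ["2a02:168:2132::2"]}},
        ],
    }
}
for domain, addresses in addresses_by_domain(sample_response).items():
    print(domain, ", ".join(addresses))
```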

@gryphius
Author

Works very well, thanks! Apart from the "passive dns" use case this enables other interesting searches like "give me all domains with null MX" 👍

curl --location --request GET 'https://dev.searchzone.ch/domains/_search?pretty&filter_path=hits.total.value,hits.hits._id,hits.hits._source.mx_record' --header 'Content-Type: application/json' --data-raw '{
  "query": {
    "prefix": {
      "mx_record.enum": {
        "value": "."
      }
    }
  }
}'
{
  "hits" : {
    "total" : {
      "value" : 1845
    },
 [...]

No worries about the stable API - if you have to make changes/disable for security reasons that's obviously understandable.

@pesc
Owner

pesc commented Aug 27, 2021

nothing easier than this ;)

curl --location --request GET 'https://dev.searchzone.ch/domains/_search?pretty&filter_path=hits.total.value,hits.hits._id&size=10000' \
--header 'Content-Type: application/json' \
--data-raw '{
  "query": {
    "term": {
      "mx_valid.enum": false
    }
  }
}'

Keep in mind that Elasticsearch returns at most 10000 results per query. To retrieve more, see the scroll API: https://www.elastic.co/guide/en/elasticsearch/reference/current/scroll-api.html
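
For result sets beyond that limit, a scroll loop could look roughly like the sketch below. It follows the request and response shape of the Elasticsearch scroll API (`?scroll=` on the first search, then `/_search/scroll` with the returned `_scroll_id`). The `send` callable is a hypothetical transport that takes a path and a JSON body and returns the parsed response; whether the dev endpoint exposes the scroll route at all is an assumption. The fake transport and its domain names are placeholders for illustration only.

```python
from typing import Callable, Iterator

# Hypothetical transport: (path, JSON body) -> parsed JSON response.
Send = Callable[[str, dict], dict]


def scroll_ids(send: Send, index: str, query: dict,
               page_size: int = 1000, keep_alive: str = "1m") -> Iterator[str]:
    """Yield the _id of every hit, fetching page after page via the scroll API."""
    response = send(f"/{index}/_search?scroll={keep_alive}",
                    {"size": page_size, "query": query})
    while True:
        hits = response.get("hits", {}).get("hits", [])
        if not hits:
            break  # an empty page marks the end of the result set
        for hit in hits:
            yield hit["_id"]
        response = send("/_search/scroll",
                        {"scroll": keep_alive, "scroll_id": response["_scroll_id"]})


# Fake transport serving three pages of placeholder data, so the loop runs offline.
_PAGES = [["a.ch", "b.ch"], ["c.ch"], []]


def fake_send(path: str, body: dict) -> dict:
    page = int(body["scroll_id"]) if "scroll_id" in body else 0
    return {"_scroll_id": str(page + 1),
            "hits": {"hits": [{"_id": name} for name in _PAGES[page]]}}


all_ids = list(scroll_ids(fake_send, "domains", {"term": {"mx_valid.enum": False}}))
print(all_ids)
```

A real client would also release the scroll context once it is done, which this sketch leaves out.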

For each record type I have a [type]_record and a [type]_valid field (true = the record exists). My Elasticsearch mapping got a little messed up during the last upgrade; I have to review it later...

So currently I have these records:

  • a_record & a_valid
  • aaaa_record & aaaa_valid
  • caa_record & caa_valid
  • dmarc_record & dmarc_valid
  • dnskey_record & dnskey_valid
  • mx_record & mx_valid
  • soa_record & soa_valid
  • spf_record & spf_valid (Parsed from TXT record)
  • txt_record & txt_valid
  • ns_record & ns_valid
  • ds_record & ds_valid
  • domain & domain_valid (false = domain not renewed and is free)
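
Given this naming scheme, a query for "has" or "lacks" a given record type can be generated from the type name. A minimal sketch: `validity_query` is a hypothetical helper, and the `.enum` suffix is copied from the `mx_valid.enum` example above, so its presence on the other fields is an assumption.

```python
# Record types taken from the list above.
RECORD_TYPES = {"a", "aaaa", "caa", "dmarc", "dnskey", "mx",
                "soa", "spf", "txt", "ns", "ds", "domain"}


def validity_query(record_type: str, exists: bool) -> dict:
    """Build a term query on the [type]_valid flag for one record type."""
    if record_type not in RECORD_TYPES:
        raise ValueError(f"unknown record type: {record_type!r}")
    # Assumption: every *_valid field is queried via the same ".enum" sub-field.
    return {"query": {"term": {f"{record_type}_valid.enum": exists}}}


no_mx = validity_query("mx", exists=False)
print(no_mx)
```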

@pesc
Copy link
Owner

pesc commented Aug 27, 2021

curl --location --request GET 'https://dev.searchzone.ch/domains/_search?pretty&filter_path=hits.total.value,hits.hits._id,hits.hits._source.mx_record' \
--header 'Content-Type: application/json' \
--data-raw '{
  "query": {
    "prefix": {
      "mx_record.enum": {
        "value": "."
      }
    }
  }
}'

Oh, I may have misunderstood you (see https://datatracker.ietf.org/doc/html/rfc7505) 😁 but I hope my comment above still helps.
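
For reference, RFC 7505 defines a null MX as a single MX record with preference 0 and "." as the exchange, and a domain publishing it must not publish any other MX record. A prefix match on "." is therefore slightly broader than the definition. The sketch below checks a domain's MX list on the client; `is_null_mx` is a hypothetical helper, and since the thread does not show how `mx_record` values are stored, it accepts both an exchange-only form (".") and a "preference exchange" form ("0 .") as an assumption.

```python
def is_null_mx(mx_records: list[str]) -> bool:
    """Return True if the MX record list is exactly one null MX (RFC 7505)."""
    if len(mx_records) != 1:
        return False  # a null MX must be the only MX record
    parts = mx_records[0].split()
    if len(parts) == 1:
        # Assumed storage form: exchange only.
        return parts[0] == "."
    if len(parts) == 2:
        # Assumed storage form: "<preference> <exchange>".
        return parts[0] == "0" and parts[1] == "."
    return False


print(is_null_mx(["."]))
print(is_null_mx(["mail.ristorantepizzerialafortuna.ch."]))
```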
