Regex URL filtering for RSS and sitemap sources (#158)

battleoverflow · web-flow · commit 6d7d9c62267a · 2023-11-01T12:04:48.000-05:00
* Now accepts raw regex for sitemap and RSS exclusion
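
For reviewers, the two filter styles in this change can be approximated in isolation. This is a minimal sketch, not the plugin code: the helper names `is_excluded` and `matches_include` are hypothetical, it uses the standard-library `re` rather than the `regex` package imported in `rss.py`, and it assumes plain keywords for `include`. It mirrors the exclude check as written in the diff (strip every regex match from the link, then keep the item only if the remainder still contains `http`) and treats `include` as a substring test on `|`-separated keywords.

```python
import re

# Example pattern copied from config.example.yml in this commit.
EXCLUDE = r"https:\/.inquest\.net\/blog[\/]?inquest-[\/]?"


def is_excluded(link: str, exclude: str) -> bool:
    """Hypothetical helper mirroring the exclude branch in the diff.

    Every match of the raw regex is removed from the link. The item is
    kept only when the remainder is non-empty and still contains "http".
    """
    remainder = re.sub(exclude, "", link, flags=re.IGNORECASE)
    return not (remainder and "http" in remainder)


def matches_include(link: str, include: str) -> bool:
    """Hypothetical helper approximating the include branch.

    Assumes plain keywords: the option is split on '|' and each keyword
    is tested as a substring of the link.
    """
    return any(keyword in link for keyword in include.split("|"))


if __name__ == "__main__":
    for url in (
        "https://inquest.net/blog/inquest-release-notes",
        "https://inquest.net/blog/threat-hunting",
        "https://www.inquest.net/blog/inquest-release-notes",
    ):
        print(url, is_excluded(url, EXCLUDE), matches_include(url, "security|threat|research"))
```

Under this sketch the example pattern does not match links on the `www.inquest.net` host, because `https:\/.` consumes only one character before `inquest`; such links would not be excluded.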
diff --git a/config.example.yml b/config.example.yml
@@ -74,27 +74,41 @@ sources:
     credentials: github-auth
     search: CVE-2018-
 
-  # Without regex filter
+  # Without regex include
   - name: rss-inquest-blog
     module: rss
     url: https://inquest.net/blog/rss
     feed_type: messy
 
-  # With regex filter
+  # With regex include
   # Keywords are separated by '|'
   - name: rss-inquest-blog
     module: rss
     url: https://inquest.net/blog/rss
     feed_type: messy
-    filter: security|threat|research
+    include: security|threat|research
+
+  # With regex exclude
+  # The exclude option takes a raw regex matched against the item link
+  - name: rss-inquest-blog
+    module: rss
+    url: https://inquest.net/blog/rss
+    feed_type: messy
+    exclude: https:\/.inquest\.net\/blog[\/]?inquest-[\/]?
 
   # Sitemap examples
 
-  # Keywords are separated by '|' when using the filter option
+  # Keywords are separated by '|' when using the include option
+  - name: inquest-sitemap-articles
+    module: sitemap
+    url: https://www.inquest.net/sitemap.xml
+    include: security|threat|research
+
+  # The exclude option takes a raw regex matched against each sitemap URL
   - name: inquest-sitemap-articles
     module: sitemap
     url: https://www.inquest.net/sitemap.xml
-    filter: security|threat|research
+    exclude: https:\/.inquest\.net\/blog[\/]?inquest-[\/]?
 
   # Defaults to "blog" keyword
   - name: inquest-sitemap-blog
@@ -105,9 +119,9 @@ sources:
   - name: inquest-sitemap-blog-articles-security
     module: sitemap
     url: https://www.inquest.net/sitemap.xml
-    filter: articles|security
+    include: articles|security
 
-  # Specify directories in the filter
+  # Only ingest from specific directories
   - name: inquest-sitemap-blog-category
     module: sitemap
     url: https://www.inquest.net/sitemap.xml
@@ -119,7 +133,7 @@ sources:
     module: sitemap
     url: https://www.inquest.net/sitemap.xml
     path: /blog/category/
-    filter: release|solutions
+    include: release|solutions
 
   - name: vt-comments-inquest
     module: virustotal
diff --git a/docs/basicusage.rst b/docs/basicusage.rst
@@ -29,7 +29,7 @@ Configure ThreatIngestor to run continuously or manually. If you set ``daemon``
 
 Next, create the ``sources`` section, and add your sources. To configure the source, you should give it a unique name like ``inquest-rss``. Each source also uses a module like twitter, rss, or sqs. Choose the module for the expected format of the source data. For easy testing, we'll use an :ref:`RSS <rss-source>` source and a :ref:`CSV <csv-operator>` operator for this example.
 
-You can also include a ``filter`` to parse ingested artifacts using regex. The filter uses a pipe (|) character as the delimiter.
+You can also add an ``include`` option to filter ingested artifacts using regex. The ``include`` option uses a pipe (|) character as the delimiter.
 
 .. code-block:: yaml
 
@@ -43,7 +43,7 @@ You can also include a ``filter`` to parse ingested artifacts using regex. The f
         module: rss
         url: http://blog.inquest.net/atom.xml
         feed_type: messy
-        filter: security|threat
+        include: security|threat
 
 Note the dash before the ``name`` key, signifying this and the following keys are part of a single list element. We'll circle back to this distinction below in the "Standard Case" walkthrough. For this source, we assign a name ``inquest-rss``, tell it to use the ``rss`` module, and fill in the required options for the ``rss`` module, which are ``url`` and ``feed_type``.
 
@@ -83,7 +83,7 @@ Putting it all together, here's our completed ``config.yml`` file:
         module: rss
         url: http://blog.inquest.net/atom.xml
         feed_type: messy
-        filter: security|threat
+        include: security|threat
 
     operators:
       - name: csv
diff --git a/docs/sources/rss.rst b/docs/sources/rss.rst
@@ -21,7 +21,8 @@ Configuration Options
 * ``module`` (required): ``rss``
 * ``url`` (required): URL to the RSS or Atom feed.
 * ``feed_type`` (required): see above; if unsure, use ``messy``.
-* ``filter`` (optional): Regex filtering for RSS feed.
+* ``include`` (optional): Include filter using simplified regex; keywords are separated by ``|`` and matched against the item link.
+* ``exclude`` (optional): Exclude filter using raw regex, applied to the item link.
 
 Example Configuration
 ~~~~~~~~~~~~~~~~~~~~~
@@ -34,7 +35,8 @@ Inside the ``sources`` section of your configuration file:
       module: rss
       url: https://example.com/rss.xml
       feed_type: messy
-      filter: security|threat
+      include: security|threat
+      exclude: https:\/.inquest\.net\/blog[\/]?inquest-[\/]?
 
 .. _sqs-source:
 
diff --git a/docs/sources/sitemap.rst b/docs/sources/sitemap.rst
@@ -10,6 +10,8 @@ Configuration Options
 
 * ``module`` (required): ``sitemap``
 * ``url`` (required): URL of the website with the sitemap path.
+* ``include`` (optional): Include filter using simplified regex; keywords are separated by ``|`` and matched against each sitemap URL.
+* ``exclude`` (optional): Exclude filter using raw regex, applied to each sitemap URL.
 
 Example Configuration
 ~~~~~~~~~~~~~~~~~~~~~
@@ -21,3 +23,5 @@ Quick setup for sitemap parsing:
     - name: inquest-blog
       module: sitemap
       url: https://inquest.net/sitemap.xml
+      include: security|threat|research
+      exclude: https:\/.inquest\.net\/blog[\/]?inquest-[\/]?
diff --git a/threatingestor/sources/rss.py b/threatingestor/sources/rss.py
@@ -1,3 +1,4 @@
+import bs4
 import feedparser
 import regex as re
 
@@ -8,17 +9,16 @@
     # feedparser 6.x
     from feedparser.datetimes import _parse_date
 
-import bs4
-
 from threatingestor.sources import Source
 
 class Plugin(Source):
 
-    def __init__(self, name, url, feed_type, filter=None):
+    def __init__(self, name, url, feed_type, include=None, exclude=None):
         self.name = name
         self.url = url
         self.feed_type = feed_type
-        self.filter = filter
+        self.include = include
+        self.exclude = exclude
 
     def run(self, saved_state):
         feed = feedparser.parse(self.url)
@@ -48,34 +48,45 @@ def run(self, saved_state):
             [x.unwrap() for x in soup.find_all('i')]
             soup = bs4.BeautifulSoup(soup.decode(), 'html.parser')
 
-            text = ''
-
-            if self.filter is not None:
-
-                rss_query = re.compile(r"{0}".format(self.filter)).findall(str(self.filter.split('|')))
-
-                for r in rss_query:
-                    if self.feed_type == 'afterioc':
-                        text = soup.get_text(separator=' ').split('Indicators of Compromise')[-1]
-
-                        if r in text:
+            text = ""
+
+            if self.exclude is not None:
+                rss_exclude = re.sub(re.compile(fr"{self.exclude}", re.IGNORECASE), "", str(item.get('link')))
+
+                if rss_exclude:
+                    if "http" in rss_exclude:
+                        if self.feed_type == "afterioc":
+                            text = soup.get_text(separator=' ').split('Indicators of Compromise')[-1]
+                            artifacts += self.process_element(text, item.get('link'), include_nonobfuscated=True)
+                        elif self.feed_type == "clean":
+                            text = soup.get_text(separator=' ')
+                            artifacts += self.process_element(text, item.get('link'), include_nonobfuscated=True)
+                        else:
+                            # Default: self.feed_type == 'messy'.
+                            text = soup.get_text(separator=' ')
+                            artifacts += self.process_element(text, item.get('link'))
+
+            if self.include is not None:
+                rss_include = re.compile(r"{0}".format(self.include)).findall(str(self.include.split('|')))
+
+                for rss_f in rss_include:
+                    if rss_f in item.get('link'):
+                        if self.feed_type == "afterioc":
+                            text = soup.get_text(separator=' ').split('Indicators of Compromise')[-1]
                             artifacts += self.process_element(text, item.get('link') or self.url, include_nonobfuscated=True)
-                    elif self.feed_type == 'clean':
-                        text = soup.get_text(separator=' ')
-
-                        if r in text:
+                        elif self.feed_type == "clean":
+                            text = soup.get_text(separator=' ')
                             artifacts += self.process_element(text, item.get('link') or self.url, include_nonobfuscated=True)
-                    else:
-                        # Default: self.feed_type == 'messy'.
-                        text = soup.get_text(separator=' ')
-                        artifacts += self.process_element(text, item.get('link') or self.url)
-
-            else:
+                        else:
+                            # Default: self.feed_type == 'messy'.
+                            text = soup.get_text(separator=' ')
+                            artifacts += self.process_element(text, item.get('link') or self.url)
 
-                if self.feed_type == 'afterioc':
+            if self.include is None and self.exclude is None:
+                if self.feed_type == "afterioc":
                     text = soup.get_text(separator=' ').split('Indicators of Compromise')[-1]
                     artifacts += self.process_element(text, item.get('link') or self.url, include_nonobfuscated=True)
-                elif self.feed_type == 'clean':
+                elif self.feed_type == "clean":
                     text = soup.get_text(separator=' ')
                     artifacts += self.process_element(text, item.get('link') or self.url, include_nonobfuscated=True)
                 else:
diff --git a/threatingestor/sources/sitemap.py b/threatingestor/sources/sitemap.py
@@ -8,10 +8,11 @@
 
 class Plugin(Source):
 
-    def __init__(self, name, url, filter=None, path=None):
+    def __init__(self, name, url, include=None, exclude=None, path=None):
         self.name = name
         self.url = url
-        self.filter = filter
+        self.include = include
+        self.exclude = exclude
         self.path = path
 
     def run(self, saved_state):
@@ -47,13 +48,28 @@ def run(self, saved_state):
             [x.unwrap() for x in soup.find_all('i')]
             soup = BeautifulSoup(soup.decode(), 'html.parser')
 
-            if self.filter is not None:
+            if self.exclude is not None:
+                # Regex input via config.yml
+                xml_exclude = re.sub(re.compile(fr"{self.exclude}", re.IGNORECASE), "", str(loc))
+
+                if xml_exclude:
+                    if self.path is None and "http" in xml_exclude:
+                        text = soup.get_text(separator=' ').split('Indicators of Compromise')[-1]
+                        artifacts += self.process_element(content=text, reference_link=str(loc), include_nonobfuscated=True)
+
+                    # Uses a path instead of a keyword
+                    if self.path is not None:
+                        if self.path in xml_exclude:
+                            text = soup.get_text(separator=' ').split('Indicators of Compromise')[-1]
+                            artifacts += self.process_element(content=text, reference_link=str(loc), include_nonobfuscated=True)
+
+            if self.include is not None:
                 # Regex input via config.yml
                 # Example: security|threat|malware
-                xml_query = re.compile(r"{0}".format(self.filter)).findall(str(self.filter.split('|')))
+                xml_include = re.compile(r"{0}".format(self.include)).findall(str(self.include.split('|')))
 
                 # Iterates over the regex output to locate all provided keywords
-                for x in xml_query:
+                for xi in xml_include:
                     # Uses a path instead of a keyword
                     if self.path is not None:
                         if self.path in loc:
@@ -62,19 +78,19 @@ def run(self, saved_state):
 
                     # Only filters using a keyword
                     if self.path is None:
-                        if x in loc:
+                        if xi in loc:
                             text = soup.get_text(separator=' ').split('Indicators of Compromise')[-1]
                             artifacts += self.process_element(content=text, reference_link=str(loc), include_nonobfuscated=True)
 
-            elif self.filter is None and self.path is not None:
-                # Filters only by path in XML loc, no set filter
+            if self.include is None and self.exclude is None and self.path is not None:
+                # Filters only by path in XML loc; no include or exclude set
                 # Default: /path/name/*
 
                 if self.path in loc:
                     text = soup.get_text(separator=' ').split('Indicators of Compromise')[-1]
                     artifacts += self.process_element(content=text, reference_link=str(loc), include_nonobfuscated=True)
             
-            else:
+            if self.include is None and self.exclude is None and self.path is None:
                 # Locates all blog links within the sitemap
                 if "blog" in loc:
                     text = soup.get_text(separator=' ').split('Indicators of Compromise')[-1]