Commit 6d7d9c6

Regex URL filtering for RSS and sitemap sources (#158)
* Now accepts raw regex for sitemap and RSS exclusion
1 parent 4ffe115 commit 6d7d9c6

6 files changed

+96
-49
lines changed

config.example.yml

+22-8
@@ -74,27 +74,41 @@ sources:
7474
credentials: github-auth
7575
search: CVE-2018-
7676

77-
# Without regex filter
77+
# Without regex include
7878
- name: rss-inquest-blog
7979
module: rss
8080
url: https://inquest.net/blog/rss
8181
feed_type: messy
8282

83-
# With regex filter
83+
# With regex include
8484
# Keywords are separated by '|'
8585
- name: rss-inquest-blog
8686
module: rss
8787
url: https://inquest.net/blog/rss
8888
feed_type: messy
89-
filter: security|threat|research
89+
include: security|threat|research
90+
91+
# With regex exclude
92+
# Keywords are separated by '|'
93+
- name: rss-inquest-blog
94+
module: rss
95+
url: https://inquest.net/blog/rss
96+
feed_type: messy
97+
exclude: https:\/.inquest\.net\/blog[\/]?inquest-[\/]?
9098

9199
# Sitemap examples
92100

93-
# Keywords are separated by '|' when using the filter option
101+
# Keywords are separated by '|' when using the include option
102+
- name: inquest-sitemap-articles
103+
module: sitemap
104+
url: https://www.inquest.net/sitemap.xml
105+
include: security|threat|research
106+
107+
# Keywords are separated by '|' when using the exclude option
94108
- name: inquest-sitemap-articles
95109
module: sitemap
96110
url: https://www.inquest.net/sitemap.xml
97-
filter: security|threat|research
111+
exclude: https:\/.inquest\.net\/blog[\/]?inquest-[\/]?
98112

99113
# Defaults to "blog" keyword
100114
- name: inquest-sitemap-blog
@@ -105,9 +119,9 @@ sources:
105119
- name: inquest-sitemap-blog-articles-security
106120
module: sitemap
107121
url: https://www.inquest.net/sitemap.xml
108-
filter: articles|security
122+
include: articles|security
109123

110-
# Specify directories in the filter
124+
# Only ingest from specific directories
111125
- name: inquest-sitemap-blog-category
112126
module: sitemap
113127
url: https://www.inquest.net/sitemap.xml
@@ -119,7 +133,7 @@ sources:
119133
module: sitemap
120134
url: https://www.inquest.net/sitemap.xml
121135
path: /blog/category/
122-
filter: release|solutions
136+
include: release|solutions
123137

124138
- name: vt-comments-inquest
125139
module: virustotal
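
To illustrate what the `exclude` pattern in the example config does, here is a minimal sketch. It assumes the link handling shown in this commit's `rss.py` change: strip the regex match from the link, then keep the item only if the remainder still contains `http`. It uses the standard-library `re` module rather than the `regex` package the project imports, and the helper name `survives_exclude` and the sample URLs are hypothetical.

```python
import re

# Pattern copied from the example config in this commit.
EXCLUDE = r"https:\/.inquest\.net\/blog[\/]?inquest-[\/]?"


def survives_exclude(link, exclude=EXCLUDE):
    """Return True if the link would still be ingested (hypothetical helper)."""
    # Remove every match of the exclude regex from the link, ignoring case.
    remainder = re.sub(re.compile(exclude, re.IGNORECASE), "", str(link))
    # The item is kept only when something is left and it still contains "http".
    return bool(remainder) and "http" in remainder


if __name__ == "__main__":
    for link in (
        "https://inquest.net/blog/inquest-update",      # prefix matches, dropped
        "https://inquest.net/blog/threat-report",       # no match, kept
        "https://www.inquest.net/blog/inquest-update",  # "www." defeats the pattern, kept
    ):
        print(link, survives_exclude(link))
```

Under these assumptions the single `.` after `https:\/` consumes exactly one character, so the pattern matches `https://inquest.net/...` links but not `https://www.inquest.net/...` ones. A sitemap whose entries use the `www` host may need an adjusted pattern.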

docs/basicusage.rst

+3-3
@@ -29,7 +29,7 @@ Configure ThreatIngestor to run continuously or manually. If you set ``daemon``
2929

3030
Next, create the ``sources`` section, and add your sources. To configure the source, you should give it a unique name like ``inquest-rss``. Each source also uses a module like twitter, rss, or sqs. Choose the module for the expected format of the source data. For easy testing, we'll use an :ref:`RSS <rss-source>` source and a :ref:`CSV <csv-operator>` operator for this example.
3131

32-
You can also include a ``filter`` to parse ingested artifacts using regex. The filter uses a pipe (|) character as the delimiter.
32+
You can also include an ``include`` to parse ingested artifacts using regex. The include uses a pipe (|) character as the delimiter.
3333
3434
.. code-block:: yaml
3535
@@ -43,7 +43,7 @@ You can also include a ``filter`` to parse ingested artifacts using regex. The f
4343
module: rss
4444
url: http://blog.inquest.net/atom.xml
4545
feed_type: messy
46-
filter: security|threat
46+
include: security|threat
4747
4848
Note the dash before the ``name`` key, signifying this and the following keys are part of a single list element. We'll circle back to this distinction below in the "Standard Case" walkthrough. For this source, we assign a name ``inquest-rss``, tell it to use the ``rss`` module, and fill in the required options for the ``rss`` module, which are ``url`` and ``feed_type``.
4949

@@ -83,7 +83,7 @@ Putting it all together, here's our completed ``config.yml`` file:
8383
module: rss
8484
url: http://blog.inquest.net/atom.xml
8585
feed_type: messy
86-
filter: security|threat
86+
include: security|threat
8787
8888
operators:
8989
- name: csv
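
As a rough illustration of how a pipe-delimited `include` value is turned into keywords, the sketch below mirrors the extraction expression that appears in this commit's `rss.py` and `sitemap.py` changes, where the keywords are then compared against the item's link. It uses the standard-library `re` module instead of the project's `regex` import, and the helper names `include_keywords` and `link_is_included` are hypothetical.

```python
import re


def include_keywords(include):
    """Mirror the keyword extraction used by the sources in this commit."""
    # The pattern is run against the string form of its own '|'-split list,
    # which yields the individual keywords for plain alternations.
    return re.compile(include).findall(str(include.split("|")))


def link_is_included(link, include):
    """Hypothetical helper: True if any extracted keyword is a substring of link."""
    return any(keyword in link for keyword in include_keywords(include))


if __name__ == "__main__":
    print(include_keywords("security|threat"))
    print(link_is_included("http://blog.inquest.net/blog/threat-hunting", "security|threat"))
```

Because the final check is a plain substring test, a value such as `sec.*` does not behave as a wildcard in this sketch, which fits the docs' description of `include` as simplified regex.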

docs/sources/rss.rst

+4-2
@@ -21,7 +21,8 @@ Configuration Options
2121
* ``module`` (required): ``rss``
2222
* ``url`` (required): URL to the RSS or Atom feed.
2323
* ``feed_type`` (required): see above; if unsure, use ``messy``.
24-
* ``filter`` (optional): Regex filtering for RSS feed.
24+
* ``include`` (optional): Include filter using simplified regex.
25+
* ``exclude`` (optional): Exclude filter using raw regex.
2526

2627
Example Configuration
2728
~~~~~~~~~~~~~~~~~~~~~
@@ -34,7 +35,8 @@ Inside the ``sources`` section of your configuration file:
3435
module: rss
3536
url: https://example.com/rss.xml
3637
feed_type: messy
37-
filter: security|threat
38+
include: security|threat
39+
exclude: https:\/.inquest\.net\/blog[\/]?inquest-[\/]?
3840
3941
.. _sqs-source:
4042

docs/sources/sitemap.rst

+4
@@ -10,6 +10,8 @@ Configuration Options
1010

1111
* ``module`` (required): ``sitemap``
1212
* ``url`` (required): URL of the website with the sitemap path.
13+
* ``include`` (optional): Include filter using simplified regex.
14+
* ``exclude`` (optional): Exclude filter using raw regex.
1315

1416
Example Configuration
1517
~~~~~~~~~~~~~~~~~~~~~
@@ -21,3 +23,5 @@ Quick setup for sitemap parsing:
2123
- name: inquest-blog
2224
module: sitemap
2325
url: https://inquest.net/sitemap.xml
26+
include: security|threat|research
27+
exclude: https:\/.inquest\.net\/blog[\/]?inquest-[\/]?

threatingestor/sources/rss.py

+38-27
@@ -1,3 +1,4 @@
1+
import bs4
12
import feedparser
23
import regex as re
34

@@ -8,17 +9,16 @@
89
# feedparser 6.x
910
from feedparser.datetimes import _parse_date
1011

11-
import bs4
12-
1312
from threatingestor.sources import Source
1413

1514
class Plugin(Source):
1615

17-
def __init__(self, name, url, feed_type, filter=None):
16+
def __init__(self, name, url, feed_type, include=None, exclude=None):
1817
self.name = name
1918
self.url = url
2019
self.feed_type = feed_type
21-
self.filter = filter
20+
self.include = include
21+
self.exclude = exclude
2222

2323
def run(self, saved_state):
2424
feed = feedparser.parse(self.url)
@@ -48,34 +48,45 @@ def run(self, saved_state):
4848
[x.unwrap() for x in soup.find_all('i')]
4949
soup = bs4.BeautifulSoup(soup.decode(), 'html.parser')
5050

51-
text = ''
52-
53-
if self.filter is not None:
54-
55-
rss_query = re.compile(r"{0}".format(self.filter)).findall(str(self.filter.split('|')))
56-
57-
for r in rss_query:
58-
if self.feed_type == 'afterioc':
59-
text = soup.get_text(separator=' ').split('Indicators of Compromise')[-1]
60-
61-
if r in text:
51+
text = ""
52+
53+
if self.exclude is not None:
54+
rss_exclude = re.sub(re.compile(fr"{self.exclude}", re.IGNORECASE), "", str(item.get('link')))
55+
56+
if rss_exclude:
57+
if "http" in rss_exclude:
58+
if self.feed_type == "afterioc":
59+
text = soup.get_text(separator=' ').split('Indicators of Compromise')[-1]
60+
artifacts += self.process_element(text, item.get('link'), include_nonobfuscated=True)
61+
elif self.feed_type == "clean":
62+
text = soup.get_text(separator=' ')
63+
artifacts += self.process_element(text, item.get('link'), include_nonobfuscated=True)
64+
else:
65+
# Default: self.feed_type == 'messy'.
66+
text = soup.get_text(separator=' ')
67+
artifacts += self.process_element(text, item.get('link'))
68+
69+
if self.include is not None:
70+
rss_include = re.compile(r"{0}".format(self.include)).findall(str(self.include.split('|')))
71+
72+
for rss_f in rss_include:
73+
if rss_f in item.get('link'):
74+
if self.feed_type == "afterioc":
75+
text = soup.get_text(separator=' ').split('Indicators of Compromise')[-1]
6276
artifacts += self.process_element(text, item.get('link') or self.url, include_nonobfuscated=True)
63-
elif self.feed_type == 'clean':
64-
text = soup.get_text(separator=' ')
65-
66-
if r in text:
77+
elif self.feed_type == "clean":
78+
text = soup.get_text(separator=' ')
6779
artifacts += self.process_element(text, item.get('link') or self.url, include_nonobfuscated=True)
68-
else:
69-
# Default: self.feed_type == 'messy'.
70-
text = soup.get_text(separator=' ')
71-
artifacts += self.process_element(text, item.get('link') or self.url)
72-
73-
else:
80+
else:
81+
# Default: self.feed_type == 'messy'.
82+
text = soup.get_text(separator=' ')
83+
artifacts += self.process_element(text, item.get('link') or self.url)
7484

75-
if self.feed_type == 'afterioc':
85+
if self.include is None and self.exclude is None:
86+
if self.feed_type == "afterioc":
7687
text = soup.get_text(separator=' ').split('Indicators of Compromise')[-1]
7788
artifacts += self.process_element(text, item.get('link') or self.url, include_nonobfuscated=True)
78-
elif self.feed_type == 'clean':
89+
elif self.feed_type == "clean":
7990
text = soup.get_text(separator=' ')
8091
artifacts += self.process_element(text, item.get('link') or self.url, include_nonobfuscated=True)
8192
else:
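
As far as the flattened diff shows, the `exclude`, `include`, and no-filter branches in `run()` are separate `if` blocks, so an item with both options set can be evaluated by both the exclude and the include branch. The sketch below is one possible alternative that makes a single decision per link. It is not the logic used in the commit. The helper name `should_ingest` is hypothetical, it uses the standard-library `re` module, and it assumes two behaviours the commit does not have: an `exclude` match anywhere in the link rejects the item, and `exclude` takes precedence over `include`.

```python
import re


def should_ingest(link, include=None, exclude=None):
    """Hypothetical single decision point; not the logic used in the commit."""
    link = str(link or "")
    # Assumption: an exclude match anywhere in the link rejects the item.
    if exclude is not None and re.search(exclude, link, re.IGNORECASE):
        return False
    # Assumption: include keywords are '|'-separated plain substrings.
    if include is not None:
        return any(keyword in link for keyword in include.split("|") if keyword)
    # No filters configured: ingest everything.
    return True


if __name__ == "__main__":
    link = "https://inquest.net/blog/inquest-threat"
    print(should_ingest(link, include="threat"))
    print(should_ingest(link, include="threat", exclude=r"\/blog\/inquest-"))
```

With a helper like this, the per-`feed_type` text extraction would only need to appear once, guarded by the returned boolean.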

threatingestor/sources/sitemap.py

+25-9
@@ -8,10 +8,11 @@
88

99
class Plugin(Source):
1010

11-
def __init__(self, name, url, filter=None, path=None):
11+
def __init__(self, name, url, include=None, exclude=None, path=None):
1212
self.name = name
1313
self.url = url
14-
self.filter = filter
14+
self.include = include
15+
self.exclude = exclude
1516
self.path = path
1617

1718
def run(self, saved_state):
@@ -47,13 +48,28 @@ def run(self, saved_state):
4748
[x.unwrap() for x in soup.find_all('i')]
4849
soup = BeautifulSoup(soup.decode(), 'html.parser')
4950

50-
if self.filter is not None:
51+
if self.exclude is not None:
52+
# Regex input via config.yml
53+
xml_exclude = re.sub(re.compile(fr"{self.exclude}", re.IGNORECASE), "", str(loc))
54+
55+
if xml_exclude:
56+
if self.path is None and "http" in xml_exclude:
57+
text = soup.get_text(separator=' ').split('Indicators of Compromise')[-1]
58+
artifacts += self.process_element(content=text, reference_link=str(loc), include_nonobfuscated=True)
59+
60+
# Uses a path instead of a keyword
61+
if self.path is not None:
62+
if self.path in xml_exclude:
63+
text = soup.get_text(separator=' ').split('Indicators of Compromise')[-1]
64+
artifacts += self.process_element(content=text, reference_link=str(loc), include_nonobfuscated=True)
65+
66+
if self.include is not None:
5167
# Regex input via config.yml
5268
# Example: security|threat|malware
53-
xml_query = re.compile(r"{0}".format(self.filter)).findall(str(self.filter.split('|')))
69+
xml_include = re.compile(r"{0}".format(self.include)).findall(str(self.include.split('|')))
5470

5571
# Iterates over the regex output to locate all provided keywords
56-
for x in xml_query:
72+
for xi in xml_include:
5773
# Uses a path instead of a keyword
5874
if self.path is not None:
5975
if self.path in loc:
@@ -62,19 +78,19 @@ def run(self, saved_state):
6278

6379
# Only filters using a keyword
6480
if self.path is None:
65-
if x in loc:
81+
if xi in loc:
6682
text = soup.get_text(separator=' ').split('Indicators of Compromise')[-1]
6783
artifacts += self.process_element(content=text, reference_link=str(loc), include_nonobfuscated=True)
6884

69-
elif self.filter is None and self.path is not None:
70-
# Filters only by path in XML loc, no set filter
85+
if self.include is None and self.exclude is None and self.path is not None:
86+
# Filters only by path in XML loc, no set include
7187
# Default: /path/name/*
7288

7389
if self.path in loc:
7490
text = soup.get_text(separator=' ').split('Indicators of Compromise')[-1]
7591
artifacts += self.process_element(content=text, reference_link=str(loc), include_nonobfuscated=True)
7692

77-
else:
93+
if self.include is None and self.exclude is None and self.path is None:
7894
# Locates all blog links within the sitemap
7995
if "blog" in loc:
8096
text = soup.get_text(separator=' ').split('Indicators of Compromise')[-1]
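
The way `exclude` and `path` interact in the sitemap source can be modelled with a small function. This is a sketch of the exclude branch only, as read from the diff above. It uses the standard-library `re` module, and the helper name `sitemap_exclude_keeps` and the sample URL are hypothetical.

```python
import re


def sitemap_exclude_keeps(loc, exclude, path=None):
    """Model of the sitemap exclude branch in this diff (hypothetical helper name)."""
    # Strip every match of the exclude regex from the sitemap location.
    remainder = re.sub(re.compile(exclude, re.IGNORECASE), "", str(loc))
    if not remainder:
        return False
    if path is None:
        # Without a path, the entry is kept while "http" survives the substitution.
        return "http" in remainder
    # With a path, the entry is kept while the path survives the substitution.
    return path in remainder


if __name__ == "__main__":
    loc = "https://www.inquest.net/blog/category/release"
    print(sitemap_exclude_keeps(loc, r"release", path="/blog/category/"))
    print(sitemap_exclude_keeps(loc, r"\/blog\/category\/.*", path="/blog/category/"))
```

In this model an entry is dropped only when the regex removes the part that the later check looks for (`http`, or the configured `path`), so a pattern that matches only the tail of a URL does not exclude it.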
