
[Discussion] Project Direction / Relationship to parsedmarc #1

Open
nhairs opened this issue Aug 19, 2024 · 7 comments
Labels
discussion Discussion

Comments

@nhairs
Owner

nhairs commented Aug 19, 2024

Creating this issue as a way to open discussion without cluttering up the domainaware/parsedmarc issue tracker.

Although this fork started as an experiment, I believe it has enough potential that we should discuss its future direction.

It was born out of my own experience using parsedmarc, wanting to support receiving report emails via AWS SES, and my own experience writing similar applications.

About parsedmarc-fork

Parsedmarc-fork contains the same functionality as parsedmarc, but the main application has been completely rewritten to use a "streaming" model for handling reports. Since this required a full rewrite of the main application, a number of new internal interfaces have been added to support adding other data sources/destinations with minimal code changes to the main application.

We introduce the following interfaces:

  • Source — things that produce reports.
  • Sink — things that receive reports.
  • Job — a container for tracking a report's lifecycle through the application.
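These three interfaces can be pictured roughly as follows. This is a minimal sketch; the class names, method names, and status values here are illustrative, not the fork's actual API:

```python
from dataclasses import dataclass
from enum import Enum
from typing import Any, Optional, Protocol


class JobStatus(Enum):
    PENDING = "pending"
    SUCCESS = "success"
    ERROR = "error"
    CANCELLED = "cancelled"


@dataclass
class Job:
    """Container tracking a report's lifecycle through the application."""
    report: Any       # the parsed DMARC report payload
    source_name: str  # which source produced this job
    status: JobStatus = JobStatus.PENDING


class Source(Protocol):
    """Something that produces reports."""
    def get_job(self) -> Optional[Job]: ...
    def acknowledge(self, job: Job) -> None: ...


class Sink(Protocol):
    """Something that receives reports."""
    def process(self, job: Job) -> None: ...
```

Using `Protocol` (structural typing) rather than a base class means any object with the right methods counts as a source or sink, which fits the "load arbitrary classes by name" design described below.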

At a high level the application operates as follows:

  1. Initialise and read configuration
  2. Create and initialise all configured sources and sinks
  3. Create a SourceWorker thread for each source.
    • This worker will get new jobs from the source and send them to a central "inbound" queue.
  4. Create a SinkWorker thread for each sink.
    • This worker receives jobs on a queue and passes them to the sink.
    • The worker then passes the job and its completion status (success/error) to a central "outbound" queue.
  5. Create an InboundWorker thread
    • This worker takes jobs from the inbound queue and fans out each job to every SinkWorker.
  6. Create an OutboundWorker thread
    • This worker receives completed jobs and passes each one to its source for acknowledgement.
    • This allows sources to decide how to handle reports based on status, as well as to have "in-flight" jobs.
    • In particular, this opens the door to new types of sources such as those based on queues (e.g. RabbitMQ, AWS SQS).
    • It also allows us to delay things like archiving/deleting messages until we know they have been stored (for email or file based sources).
  7. The main thread does nothing but wait for a shutdown signal. Once received, it gracefully shuts down all the workers and sends unhandled jobs (i.e. status=cancelled) back to the sources to be handled.
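The steps above can be sketched in miniature. This collapses the worker threads into sequential loops and uses hypothetical in-memory source/sink classes, so it shows the data flow rather than the fork's real concurrency:

```python
import queue


class ListSource:
    """Hypothetical source yielding reports from an in-memory list."""
    def __init__(self, reports):
        self._reports = list(reports)
        self.acked = []  # (job, status) pairs acknowledged back to us

    def get_job(self):
        return self._reports.pop(0) if self._reports else None

    def acknowledge(self, job, status):
        self.acked.append((job, status))


class ListSink:
    """Hypothetical sink collecting reports in memory."""
    def __init__(self):
        self.received = []

    def process(self, job):
        self.received.append(job)


def run_pipeline(source, sinks, max_in_flight=10):
    """One pass of the source -> inbound -> sinks -> outbound flow."""
    # Bounded queue caps memory use. In the threaded application this
    # queue is drained concurrently; this sketch assumes the batch fits.
    inbound = queue.Queue(maxsize=max_in_flight)
    outbound = queue.Queue()

    # SourceWorker: pull jobs from the source onto the inbound queue.
    while (job := source.get_job()) is not None:
        inbound.put(job)

    # InboundWorker + SinkWorkers: fan each job out to every sink and
    # record its completion status on the outbound queue.
    while not inbound.empty():
        job = inbound.get()
        try:
            for sink in sinks:
                sink.process(job)
            outbound.put((job, "success"))
        except Exception:
            outbound.put((job, "error"))

    # OutboundWorker: return completed jobs to their source so it can
    # decide what to do (e.g. archive or delete the original email).
    while not outbound.empty():
        source.acknowledge(*outbound.get())
```

The essential property is that the source only sees a job again, via `acknowledge`, after every sink has handled it, which is what makes delayed archiving/deletion safe.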

A few other things to note with the application/interfaces are:

  • Through the use of queues, we can limit the number of in-flight jobs, ensuring a fairly predictable memory footprint even when dealing with a large number of messages.
  • The application loads sources/sinks based on an importable name. This means that sources/sinks can be added simply by creating the appropriate class - no further code required.
    • This also means that we can load external classes, allowing users of the application to load their own custom sources/sinks without forking the project and requiring them to be in the main project. This is already implemented and held behind a CLI flag.
  • The configuration file allows multiple sources/sinks to be configured using the same class. That is to say, you could configure many IMAP sources and many ElasticSearch sinks and have them all operate with their own config.
  • Configuration for sources/sinks is defined using pydantic and attached to classes using type annotations. This allows us to validate configuration before creating objects, and provides type checking/auto-completion during development.
    • This is also why no application-level changes are needed to add a new source/sink apart from creating the class.
  • Because the application is designed as a long lived process it can be managed through existing service/daemon tools such as systemd.
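For example, loading a class from an importable name like `.aws:SimpleEmailService` might look like this. This is a sketch: the default package name and the leading-dot shorthand are assumptions about the fork's convention, not its actual implementation:

```python
import importlib


def load_class(path: str, default_package: str = "parsedmarc.sinks"):
    """Load a class from an importable name such as '.aws:SimpleEmailService'.

    A leading dot is resolved relative to default_package (an assumed
    convention); a fully qualified name like 'mypkg.custom:MySink' lets
    users load their own external sources/sinks without forking the
    project.
    """
    module_name, _, class_name = path.partition(":")
    if module_name.startswith("."):
        module_name = default_package + module_name
    module = importlib.import_module(module_name)
    return getattr(module, class_name)
```

With stdlib modules standing in for real sink modules, `load_class("json:JSONDecoder")` returns the `JSONDecoder` class; any installed module works the same way.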

Example

Code is already available for browsing.

Sample config:

parser:
  nameservers:
    - 1.1.1.1
    - 8.8.8.8

sources:
  dev:
    class: .aws:SimpleEmailService
    session:
      profile_name: def-profile
    queue_name: dev-dmarc-receiving
  prod:
    class: .aws:SimpleEmailService
    session:
      profile_name: prod
    queue_name: prod-dmarc-receiving

sinks:
  #console:  # disable as currently can only have 1 sink enabled at a time.
  #  class: .util:Console
  #  pretty: true
  elasticsearch:
    class: .elasticsearch:Elasticsearch
    client:
      hosts: localhost:9200
      use_ssl: true
      ssl_cert_path: /var/tmp/es_ca.crt
      username: elastic
      password: secret

Sample output:
[screenshot of sample output omitted]

Related Issues

Here are some related issues from parsedmarc that could/would be resolved with this fork:

Potential Options

I see a few options available; they are detailed below.

To be clear I am open to any approach including new ideas.

A. Merge with parsedmarc

This would be the best in terms of not fracturing the community / having competing projects that essentially serve the same purpose.

However, I don't think this is possible, as I believe the maintainer of parsedmarc and I have incompatible approaches to:

  • What versions of python are supported.
  • The actual software architecture / structure of the code.
    • I'm not sure how true this is - this is mostly just from my experience of modifying the existing code base.

There would also be questions about who the maintainers would be and how maintainership would work.

B. Continue separately using a name derived from parsedmarc.

The reason for doing this would be to show the relationship to parsedmarc and that it is an almost drop-in replacement.

Having a good working relationship with parsedmarc could be good for cross-linking / sharing work and trying to minimise the impact on the community.

I would only do this with the permission of the parsedmarc maintainers (there might also be something involving company IP, as the project is under domainaware and not the original author's GitHub).

C. Continue separately using a new name.

This risks fragmenting the community (and thus where efforts are focused) and at worst will lead to an adversarial relationship with the original project (I would like to avoid this).

It does however let this project do its own thing.

@nhairs
Owner Author

nhairs commented Aug 19, 2024

@seanthegeek as the original author / maintainer of domainaware/parsedmarc I'd appreciate hearing your thoughts on what options would be acceptable to you. I'm also happy to talk privately if you'd prefer.

@seanthegeek
Contributor

seanthegeek commented Aug 22, 2024

@nhairs Sorry for my delay in responding. I would also prefer option A. I don't want to see the community fractured.

I am willing to drop support now that older versions of CentOS/Rocky have been out of support for a while.

Honestly, I haven't had the time to maintain parsedmarc and checkdmarc. I would welcome other maintainers.

@nhairs
Owner Author

nhairs commented Aug 23, 2024

All good on the delay :)

> I would also prefer option A. I don't want to see the community fractured.

Alright, I'll work on getting this fork to the point that it can be merged - since I started there's been a spurt of activity on parsedmarc, so feature-wise I'm a bit behind.

You might notice me making a bit of noise on a few issues in order to get some support in getting the fork tested / ready.

> Honestly, I haven't had the time to maintain parsedmarc and checkdmarc. I would welcome other maintainers.

I'd be happy to help maintain it once merged - pending time/life I'd be happy to take over being the primary maintainer if you'd like to take a back seat for a while - but maybe we discuss that closer to the time.

> I am willing to drop support now that older versions of CentOS/Rocky have been out of support for a while.

I'll target 3.7 for now even though its download usage is negligible (see below). I'll investigate the main Linux distributions and their default Python versions later.

[chart of downloads by Python version omitted]

@seanthegeek
Contributor

> Alright, I'll work on getting this fork to the point that it can be merged - since I started there's been a spurt of activity on parsedmarc, so feature-wise I'm a bit behind.

I feel that. Part of my struggle in maintaining the project is that accepting one PR often breaks several others. It's the main reason I've been reluctant to accept domainaware/parsedmarc#509. It makes changes everywhere.

Honestly, I've never had a design for the code in mind. Things just happened organically as PRs were merged. I don't like the way variable names are passed to so many different functions. It's hard to keep track of everything to see if it flows correctly.

One bit of feedback about your design so far: I think calling something an output instead of a sink is a little clearer.

Currently parsedmarc only uses my mailsuite package to parse emails, but I'd love it if I or someone else could:

  • Backport the IMAP client fixes from parsedmarc to mailsuite
  • Add Gmail, Microsoft 365, and Amazon SES send and receive support to mailsuite, turning it into a full suite of email functions as I'd always intended
  • Replace email receiving functions in parsedmarc with calls to the mailsuite package APIs

@nhairs
Owner Author

nhairs commented Aug 24, 2024

Note: there's lots of use of the royal we in here.

> Part of my struggle in maintaining the project is that accepting one PR often breaks several others. It's the main reason I've been reluctant to accept domainaware/parsedmarc#509. It makes changes everywhere.

> Honestly, I've never had a design for the code in mind. Things just happened organically as PRs were merged. I don't like the way variable names are passed to so many different functions. It's hard to keep track of everything to see if it flows correctly.

Yeah I'm hoping this new design means that changes are better contained which should also ease the maintenance burden.

> One bit of feedback about your design so far: I think calling something an output instead of a sink is a little clearer.

Yeah I spent a bit of time thinking and talking about this with some of my programmer friends. In the end I settled on Sink because in the future we may have sinks that don't actually output the data. Not a hill I'm willing to die on though - Output was a top contender.

Do keep the feedback coming 👌

> Currently parsedmarc only uses my mailsuite package to parse emails, but I'd love it if I or someone else could:

Yeah I do remember looking at that a while ago. From memory I was looking for mailing libraries and knew of yours, but I also seem to recall spotting other more mature packages doing the same thing. Either way I don't think that's something I'd want to look at until after we get the fork back in.

Speaking of how to get the fork back in, I've been doing some thinking on how we'd do that. First though, let me lay out some of my plans (flexible, just what I was working on).

Overall I think trying to keep both the new application and the old CLI is going to lead to a maintenance nightmare. My intention was to keep *a* version of compatibility (approximately where we are now) before dropping it.

However, in order to drop it we might need a compatible "one-shot" mode or something for those who just want to run the tool over a single folder of reports. Maybe it gets implemented as a general way for sources to indicate that they will not produce any more jobs; if all sources enter that state, the application automatically shuts down. On the flip side, the service can easily and safely be stopped, so it could also be a case of monitoring the output and shutting the service down once you're satisfied with the work it has completed.
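That exhaustion-based shutdown could be sketched like this (hypothetical names, just illustrating the idea that once every source says it has no more jobs, the application can exit on its own):

```python
class OneShotDirectorySource:
    """Hypothetical source that yields a fixed batch of reports once,
    then declares itself exhausted - mimicking a 'one-shot' run over a
    single folder of reports."""

    def __init__(self, reports):
        self._reports = list(reports)
        self.exhausted = False

    def get_job(self):
        if self._reports:
            return self._reports.pop(0)
        self.exhausted = True  # signal: no more jobs will ever be produced
        return None


def should_shut_down(sources):
    """The main loop can exit automatically once every configured
    source has signalled exhaustion."""
    return all(source.exhausted for source in sources)
```

A long-lived source (e.g. an IMAP or SQS poller) would simply never set `exhausted`, so a mixed deployment keeps running while a pure one-shot deployment exits when the folder is drained.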

Dropping the CLI support would allow us to properly move the existing sources/sinks into the new structure.

So in terms of a plan for merging, I don't actually think that a merge is going to be possible. Pretty much every line in every file has been touched.
image

Instead, I think what we aim to do once the fork gets closer to feature parity is: add the branch to the main repository, put a feature freeze on the main branch, and direct people to the "fork" branch. Once we are happy with the fork branch (maybe after releasing some alpha/beta versions) we then rename "fork" to main, and main to "legacy".

I'll note that if we do this, then as we port features across (which could include a lot of copy-pasted code) we'd want to ensure we keep proper attribution of the original authors.

...

Thinking aloud, I'm wondering if it's worth moving the fork branch to the main repo sooner rather than later as we're pretty much at the point where we could release development versions (maybe as 10.0.0.devN) and there might be others who are willing to help with the porting work.

This would probably involve adding me to the parsedmarc GitHub/PyPI if you'd like me to run with it (to reduce the time burden on you). I would probably end up adding some more issue tags / a project to manage the feature porting.

I should add that I probably wouldn't want to focus on porting features to be compatible with the old CLI; instead I'd put them straight into whatever form suits the application best, since the plan above is to drop the old CLI.

...

I should add somewhere in here that I intend to move to mkdocs for generating the docs, since most of them will need rewriting anyway and in my opinion it is the better tool moving forward. (I've just noticed that since I recreated the repo I've neglected to re-upload the GitHub Pages - I'll get on that shortly.)

@seanthegeek
Contributor

FYI, I just merged a couple of the last PRs and released 8.13.0.

@rodpayne
Contributor

@nhairs and @seanthegeek ,
I am fine with Option A.

Regarding PR #509, during the transition to the merged code that is being discussed here, it would be a good idea to ignore the PR and probably close it for now. Anything useful could be looked at after the merge has happened.

I have been finding that most of the performance improvement that I am enjoying comes from the previous cache changes. Multithreading may be adding more complexity than it is worth for the incremental improvement in performance. I will take a closer look to find anything that I may have improved outside of the multithreading change.

I haven't had the time to examine the parsedmarc-fork code yet, but I am interested in looking at the "streaming" and "queues". It may be that they either make multithreading easier or unnecessary. It would also be great to not lose input and results when Elasticsearch runs low on disk space.

@nhairs nhairs pinned this issue Sep 2, 2024
@nhairs nhairs added the discussion Discussion label Sep 2, 2024