
A vision for the next generation of Shynet #258

Open
milesmcc opened this issue Feb 16, 2023 · 6 comments

Comments

@milesmcc
Owner

Hey everyone—

The past 3+ years of Shynet have been inspiring. Together, we've built a really great product — and we have some great adoption to show for it. Lots of people and organizations use Shynet, from privacy-minded individuals to some of the largest companies on Earth. (I wish I could give you exact figures on Shynet's adoption, but alas, we do not currently collect any metrics. 🙂)

Today, Shynet is an extremely simple Django app. Hits and sessions are stored in a Postgres database. There's no support for custom events, and our pruning/rollup strategy for old sessions... does not exist. The dashboard slows down dramatically for high-traffic sites. And while Shynet is all-things-considered very privacy friendly, there are additional steps we could take to dramatically improve the privacy assurances we can make (e.g., by using differential privacy, on-device aggregation, and so on).

So with all this in mind, I want to probe the community on what you'd like to see from Shynet moving forward. My engineering capacity is pretty maxed-out right now, so no major changes are imminent. But I think we have an opportunity to leapfrog some of the other analytics tools on the market if we're willing to make some major changes.

Here's one path we could take:

  • Use an OLAP database like ClickHouse. Right now, Shynet buckles if you feed it 1M+ hits per day. It just wasn't architected for that kind of scale, and Postgres isn't optimized for doing the sort of large-scale analytical queries that Shynet makes. We'd be much better served by storing event information in an OLAP database like ClickHouse. The downside of using something like ClickHouse is that it makes deploying Shynet harder/more complex. (Since you'd need ClickHouse for events, and Postgres for user accounts/site configurations/etc.)
  • Do aggregations on-device. Right now, almost all our processing happens on the server. That's fine, but we could provide some pretty interesting privacy assurances if we did more work on the client. For example, perhaps the client could keep track of which pages it has visited, roughly how long it spent on those pages, what its referrers were, and periodically send aggregated versions of that information up to the server.
  • Stop collecting IPs. We allow people to collect IPs right now. Ideally we would not do that.
  • GDPR/CCPA/etc compliance no matter what. We shouldn't let people configure themselves out of compliance. (Ideally we would also loop in a lawyer in the development process to properly advise us on what would/wouldn't be compliant.)
  • Full test coverage. Our testing strategy right now is lackluster.
  • Support custom events. Lots of people want Shynet to support custom events. Right now, that requires some significant changes to our architecture.
  • ...more novel assurances? For example, could we incorporate differential privacy to provide something genuinely new that other privacy-focused analytics tools don't?
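To make the on-device aggregation idea above concrete, here is a minimal Python sketch. It is illustrative only: Shynet's actual tracking client is JavaScript, and this payload shape is entirely hypothetical, not anything Shynet implements today.

```python
from collections import Counter

class ClientAggregator:
    """Hypothetical sketch of on-device aggregation: the tracking
    client batches visits locally and periodically flushes a single
    aggregated payload instead of sending one hit per page view."""

    def __init__(self):
        self.page_views = Counter()     # page path -> view count
        self.page_seconds = Counter()   # page path -> total dwell time
        self.referrers = Counter()      # referrer domain -> count

    def record(self, path, seconds, referrer=None):
        self.page_views[path] += 1
        self.page_seconds[path] += seconds
        if referrer:
            self.referrers[referrer] += 1

    def flush(self):
        """Build the aggregate payload and reset local state.
        The server never sees individual page views."""
        payload = {
            "pages": {p: {"views": self.page_views[p],
                          "seconds": self.page_seconds[p]}
                      for p in self.page_views},
            "referrers": dict(self.referrers),
        }
        self.__init__()
        return payload
```

The privacy win is that the server only ever receives coarse per-page totals, never a timestamped stream of individual events.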

Shynet technically hasn't hit 1.0, so nothing is really out of the question in terms of how we achieve these ends. For example, while of course I'd like to maintain perfect backwards compatibility, I think we should also consider making breaking changes and simply providing folks a migration tool if they'd like to use "Shynet v2".

P.S.: I have been in love with Elixir + Phoenix recently (used it to build https://atlos.org). That might be a more appropriate tech stack for a tool like Shynet (Plausible, for example, uses Elixir + Phoenix), but I also recognize that some in the community might appreciate the familiarity of Python.

@haaavk
Contributor

haaavk commented Feb 19, 2023

First things first. Thanks a lot for building Shynet. It helped me escape from terrible Google Analytics.
Some thoughts about Shynet, in no particular order:

  • I understand your worries about performance. Python and Postgres aren't a good choice for huge traffic.
    There is definitely a place for an "Enterprise Shynet" which uses appropriate technology.
    But I really love Shynet's simplicity and I benefit from it a lot; I'm afraid I will never have 1M+ requests daily.
    I think there is room for both a "Simple Shynet" and an "Enterprise Shynet".
  • ClickHouse is a good idea for "Enterprise Shynet".
  • Aggregation on-device is an interesting idea, but I would start with aggregation on the server.
    I think archiving old data via aggregation may be a good path.
  • I don't like collecting any user data, so I'm 100% for removing the option to track IPs.
  • I'm not a big fan of full test coverage, because tests need maintenance too.
    No tests at all is obviously a no-go. I may try to add some tests in my spare time.
  • I want custom events too. I'm planning to work on them after adding a full API.
  • If the problem is engineering capacity, I'm happy to become a maintainer and help when I can.
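The server-side aggregation mentioned above could start as simply as a rollup job that collapses old raw hits into daily counts before the raw rows are pruned. A rough Python sketch, with field names that are illustrative rather than Shynet's actual schema:

```python
from collections import defaultdict
from datetime import datetime

def rollup_daily(hits):
    """Collapse raw hit rows (dicts with an ISO timestamp and a path)
    into per-day, per-page counts. After the rollup is stored, the raw
    rows it summarizes can be deleted, bounding database growth."""
    daily = defaultdict(int)
    for hit in hits:
        day = datetime.fromisoformat(hit["timestamp"]).date().isoformat()
        daily[(day, hit["path"])] += 1
    return [{"day": day, "path": path, "hits": count}
            for (day, path), count in sorted(daily.items())]
```

A periodic task (e.g., a cron job or Celery beat entry) could run this over hits older than some retention window.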

@c4lliope

c4lliope commented May 3, 2023

Hello, I've been running Shynet for two days now and I'm really happy with how simple it's been to deploy using docker-compose.

So long as ClickHouse has a Docker image that can be packed easily into a docker-compose.yml file, I see no reason to hold back from adopting it. https://hub.docker.com/r/clickhouse/clickhouse-server/#

In my highly localized application, I rely on IPs to see which states people are logging in from. As a mainly-USA application, I care less than many people do about GDPR, so I'd make a proposal here: if you could add a small engine inside the application for plugins or bespoke code, then end users could define the logic on a per-application basis. In my case, this could be:

  • Do IP geolocation using https://ipinfo.io
  • If in the US, record the state-or-county level identifier
  • If in the EU, record the country-level identifier
  • In other places, decide based on local laws
  • Discard or blur the IP address
  • (possibly) map the accrued locations inside the app
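A sketch of what such a per-deployment hook might look like in Python. Everything here is hypothetical: `geolocate` is a stand-in for a lookup service such as ipinfo.io, and the returned dict shape ("country"/"region" keys) is an assumption of this sketch, not ipinfo's real client API.

```python
# Abbreviated EU country list, for illustration only.
EU_COUNTRIES = {"AT", "BE", "DE", "ES", "FR", "IE", "IT", "NL", "PL", "SE"}

def coarse_location(ip, geolocate):
    """Resolve an IP to a coarse region label, then discard the IP.

    Only the returned label is ever persisted; the raw IP address is
    dropped as soon as this function returns. `geolocate` is assumed
    to return a dict with "country" and "region" keys (hypothetical).
    """
    info = geolocate(ip)
    country = info.get("country")
    if country == "US":
        # State-or-region level granularity inside the US.
        return f"US/{info.get('region', 'unknown')}"
    if country in EU_COUNTRIES:
        # Country level only inside the EU.
        return country
    # Elsewhere: fall back to country, deferring finer detail to local law.
    return country or "unknown"
```

The point of the plugin design is exactly that this decision table lives in deployment-specific code rather than in Shynet core.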

I like and encourage your decision on Elixir and Phoenix, this seems like a prime use case for both.

@sergioisidoro
Contributor

sergioisidoro commented Jul 17, 2023

My 2 cents on ClickHouse:

I've used Plausible, which runs on ClickHouse, and for small projects it uses an unreasonable amount of space: even at a few hundred events per day, it starts to hog the disk (maybe I'm doing something wrong). I'm also much more familiar with Postgres backup and restore procedures, so it was a bit of a pain to learn and set them up for ClickHouse.

So I keep coming back to Shynet as the alternative for small projects. What if this is Shynet's niche?

I do miss custom events though... I could give #168 another shot if you want.

@c4lliope

c4lliope commented Jul 17, 2023 via email

@sergioisidoro
Contributor

sergioisidoro commented Jul 18, 2023

Ok, maybe I need to contextualise "unreasonable" because it greatly depends on the use case.

I had a small project running on a very simple VM on DigitalOcean. We had <100 visitors per day and very few events. I deployed Plausible for that project on a Docker swarm. In such a small stack (Postgres, Django, Redis, a worker, plus Ghost and MariaDB for a blog, plus Plausible), ClickHouse hogged the disk space even though there were not that many events (~50 GB, if I recall).

Self-hosted projects are sometimes small, and a small footprint (in memory and disk) is what makes tools like Shynet super nice for that use case. If all projects start chasing scale and adopting dependencies with a larger starting footprint (ClickHouse, Elastic, etc.), the requirements for self-hosting an entire small stack (e.g., service + blog + analytics) start to go up.

There is nothing wrong with having a large footprint when there is scale. All I'm arguing here is that when there is no scale, the footprint to self-host should be minimal :)

Caveat: bear in mind that I might have done something wrong in deploying ClickHouse, since I was mostly using the defaults from the official image.

@rallisf1

I know I'm late to the party and I've barely used Shynet, but I'd like to share my 2 cents:

  • I believe TimescaleDB can greatly improve Postgres's performance without many code changes. It is an open-source extension, has a plug-and-play Docker image (which includes Postgres), and data can be easily migrated.
  • Django can be a bottleneck, especially when handling high traffic. Decoupling the data-collection endpoint and reimplementing it in pretty much anything else (e.g., Bun) would solve the performance issue without the need to change any other part of Shynet.
  • I'm leaning against client-side aggregation. With all the performance gains from the steps above, I don't think the added complexity would bring much benefit. You could optimize data transfer by replacing REST with gRPC streams, but that would possibly add even more complexity, and you'd also need a memory buffer (Redis?) before aggregating all that data and writing it to the main DB. On the other hand, you could do heatmaps or advanced fingerprinting with that type of data.
  • Not saving IPs: YES, but improve geolocation.
  • IMHO a community-driven open-source project can't really guarantee compliance with anything: you can't take the blame for how everyone uses your code, at least not yet. If there were some sort of company or organization behind the project, it would be different. Also keep in mind that Shynet could be fully compliant by itself while the websites being tracked are not (e.g., because of a missing ToS/privacy policy or tracking consent).
  • You can never test everything. Don't sweat over full test coverage.
  • Custom events, yes please!
  • Differential privacy? Sure, that would make it a commercial-analytics killer. You could run a Laplace function straight in Postgres using a cron job.
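The Laplace mechanism alluded to above is simple enough to sketch. This is shown in Python rather than in-database SQL, purely to illustrate the idea; a real deployment would also need to account for the privacy budget consumed across repeated queries.

```python
import math
import random

def dp_count(true_count, epsilon, rng):
    """Laplace mechanism for a counting query (sensitivity 1):
    add Laplace(0, 1/epsilon) noise, sampled here via the
    inverse-CDF transform of a uniform draw on (-0.5, 0.5)."""
    u = rng.random() - 0.5
    scale = 1.0 / epsilon
    noise = -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))
    return true_count + noise

# Smaller epsilon -> stronger privacy guarantee -> more noise.
rng = random.Random(42)
noisy = dp_count(1000, epsilon=0.5, rng=rng)
```

Reporting only noised counts like this would let the dashboard show accurate-enough totals while making any individual hit's presence statistically deniable.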
