-
Notifications
You must be signed in to change notification settings - Fork 61
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Initial Proof of Concept for Targeting with embeddings #818
Conversation
This was setting folks back after we re-enabled paids ads. I'm not sure this is the cleanest way to do this, but seems reasonable.
This is very much a draft, but shows what creating and storing embeddings in Postgres looks like. Doesn't implement any querying, but can be done with something like:: # Load example data into DB import yaml from yaml import Loader data = yaml.load(yam, Loader) for dat in data: url = dat['url'] analyze_url(url, publisher_slug='ethicaladsio', force=True) # Run initial test query from pgvector.django import L2Distance from adserver.analyzer.tasks import analyze_url url = "https://observablehq.com/" analyze_url(url, publisher_slug='ethicaladsio', force=True) aurl = AnalyzedUrl.objects.get(url=url) for url in AnalyzedUrl.objects.order_by(L2Distance('embedding', aurl.embedding))[1:6]: print(url)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I couldn't get the migrations to run correctly even after building a new docker image. Am I missing something?
Running migrations:
Applying adserver_analyzer.0003_add_embeddings...Traceback (most recent call last):
...
django.db.utils.ProgrammingError: type "vector" does not exist
LINE 1: ...rver_analyzer_analyzedurl" ADD COLUMN "embedding" vector(3) ...
adserver/tasks.py
Outdated
for publisher in Publisher.objects.filter( | ||
allow_paid_campaigns=True, created__lt=threshold | ||
allow_paid_campaigns=True, created__lt=threshold, modified__lt=threshold |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this will not currently work as intended although I do think we want this. We calculate publisher CTRs nightly and this updates the modified time:
ethical-ad-server/adserver/tasks.py
Lines 702 to 717 in b9ac3b5
@app.task() | |
def calculate_publisher_ctrs(days=7): | |
"""Calculate average CTRs for paid ads on a publisher for the last X days.""" | |
sample_cutoff = get_ad_day() - datetime.timedelta(days=days) | |
for publisher in Publisher.objects.all(): | |
queryset = AdImpression.objects.filter( | |
date__gte=sample_cutoff, | |
publisher=publisher, | |
advertisement__flight__campaign__campaign_type=PAID_CAMPAIGN, | |
) | |
report = PublisherReport(queryset) | |
report.generate() | |
publisher.sampled_ctr = report.total["ctr"] | |
publisher.save() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could change the publisher CTR calculations to only run on those where paid ads are approved. That way most publishers won't be updated nightly. Or we could make the save query into an update so the mod time isn't updated (and a historical record isn't created)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah yea.. this must have snuck in from a PR I branched off... definitely didn't do it as part of this PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will roll this back, since it's a different change.
@davidfischer ah, yea. You need to enable pgvector in the DB. I forgot to note that: https://github.com/pgvector/pgvector?tab=readme-ov-file#getting-started |
I think there's a few problems here.
|
@@ -55,6 +56,8 @@ class AnalyzedUrl(TimeStampedModel): | |||
), | |||
) | |||
|
|||
embedding = VectorField(dimensions=384, default=None, null=True, blank=True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suspect we will need some sort of approximate index here, but that can come later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Aye, that's definitely a next step once we start querying it.
Definitely interesting:
You can add an index to use approximate nearest neighbor search, which trades some recall for speed. Unlike typical indexes, you will see different results for queries after adding an approximate index.
https://github.com/pgvector/pgvector?tab=readme-ov-file#indexing
- Create vector extension in the migration - Ensure psql on the django docker image - Use our maintenance scripts in the pg image while still using pgvector
requirements/analyzer.txt
Outdated
|
||
|
||
sentence-transformers | ||
pgvector |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because we import this directly in models.py
, this probably has to go in the base requirements.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually this is probably just going to be a nightmare for testing. Having a field be a postgres specific field may require our testing setup to change since our tests are run with an in-mem sqlite setup.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FWIW, sqlite has a similar extension: https://github.com/asg017/sqlite-vss -- but might be worth just running tests in postgres? 🤷
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like there isn't an easy way to use the sqlite package in Django. Another idea I had that probably makes sense:
Break the embeddings out into their own model, with a FK or OneToOne to the AnalyzedURL? That way we could keep this all self-contained.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's fine to have the embeddings on AnalyzedUrl
. I think we just have to change the test skip logic (this) for the analyzer. We could ensure that adserver.analyzer
is excluded from testing entirely and when it is tested that it uses Postgres. Thoughts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yea, that likely makes sense as well, if we don't have tests for the code currently that we'd be skipping.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks great and I think disabling the analyzer is a sensible default (especially for OSS users of our project). It might be nice to find a way to run tests on the analyzer but not a blocker for merging this.
This is very much a draft,
but shows what creating and storing embeddings in Postgres looks like.
Doesn't implement any querying, but can be done with something like::