Initial Proof of Concept for Targeting with embeddings #818

Merged
ericholscher merged 21 commits into main from embedding-poc
Feb 19, 2024

@ericholscher ericholscher commented Feb 2, 2024

This is very much a draft,
but shows what creating and storing embeddings in Postgres looks like.

It doesn't implement any querying yet, but a query can be run with something like::

# Load shell

ADSERVER_ANALYZER_BACKEND=adserver.analyzer.backends.SentenceTransformerAnalyzerBackend ./manage.py shell_plus

# Load example data into DB

import yaml
from yaml import Loader
from adserver.analyzer.tasks import analyze_url

# Close the file once the YAML has been parsed
with open("/model/assets/categorized-data.yml") as fd:
    data = yaml.load(fd, Loader)

for dat in data:
    analyze_url(dat["url"], publisher_slug="ethicaladsio", force=True)


# Run initial test query (shell_plus auto-imports models such as AnalyzedUrl)

from pgvector.django import L2Distance

url = "https://observablehq.com/"
analyze_url(url, publisher_slug="ethicaladsio", force=True)

aurl = AnalyzedUrl.objects.get(url=url)

# Ten closest URLs by L2 distance, excluding the query URL itself
neighbors = AnalyzedUrl.objects.exclude(url=aurl.url).order_by(
    L2Distance("embedding", aurl.embedding)
)[:10]
for neighbor in neighbors:
    print(neighbor)
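
For readers unfamiliar with pgvector, the ordering used in the query above can be sketched in plain Python. This only illustrates what sorting by L2 (Euclidean) distance means; it is not how pgvector computes it. The `nearest` helper, the example URLs, and the three-dimensional vectors are made up for the illustration (the field in this PR uses 384 dimensions).

```python
import math


def nearest(query, candidates, limit=10):
    """Return (name, distance) pairs for the `limit` vectors closest to `query`."""
    scored = [(name, math.dist(query, vec)) for name, vec in candidates.items()]
    return sorted(scored, key=lambda pair: pair[1])[:limit]


# Hypothetical 3-dimensional embeddings keyed by URL
embeddings = {
    "https://example.com/python": [0.9, 0.1, 0.0],
    "https://example.com/django": [0.8, 0.2, 0.1],
    "https://example.com/cooking": [0.0, 0.1, 0.9],
}

query = [1.0, 0.0, 0.0]
for name, distance in nearest(query, embeddings, limit=2):
    print(f"{distance:.3f} {name}")
```

A smaller distance means the two pages are closer in embedding space, which is why the query sorts ascending and takes the first rows.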

This was setting folks back after we re-enabled paid ads.
I'm not sure this is the cleanest way to do this,
but it seems reasonable.
@ericholscher ericholscher requested a review from a team as a code owner February 2, 2024 22:28
@ericholscher ericholscher changed the title embedding poc Initial Proof of Concept for Targeting with embeddings Feb 2, 2024
@davidfischer davidfischer left a comment

I couldn't get the migrations to run correctly even after building a new docker image. Am I missing something?

Running migrations:
  Applying adserver_analyzer.0003_add_embeddings...Traceback (most recent call last):
...
django.db.utils.ProgrammingError: type "vector" does not exist
LINE 1: ...rver_analyzer_analyzedurl" ADD COLUMN "embedding" vector(3) ...

 for publisher in Publisher.objects.filter(
-    allow_paid_campaigns=True, created__lt=threshold
+    allow_paid_campaigns=True, created__lt=threshold, modified__lt=threshold
Collaborator

I think this will not currently work as intended although I do think we want this. We calculate publisher CTRs nightly and this updates the modified time:

@app.task()
def calculate_publisher_ctrs(days=7):
    """Calculate average CTRs for paid ads on a publisher for the last X days."""
    sample_cutoff = get_ad_day() - datetime.timedelta(days=days)
    for publisher in Publisher.objects.all():
        queryset = AdImpression.objects.filter(
            date__gte=sample_cutoff,
            publisher=publisher,
            advertisement__flight__campaign__campaign_type=PAID_CAMPAIGN,
        )
        report = PublisherReport(queryset)
        report.generate()
        publisher.sampled_ctr = report.total["ctr"]
        publisher.save()

@davidfischer davidfischer Feb 8, 2024

We could change the publisher CTR calculations to only run on those where paid ads are approved. That way most publishers won't be updated nightly. Or we could make the save query into an update so the mod time isn't updated (and a historical record isn't created)
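
A minimal sketch of the second option, assuming `Publisher` is a regular Django model. `QuerySet.update()` issues a single SQL UPDATE and does not call `Model.save()` or send the save signals, so a `modified` timestamp maintained on save stays untouched. The helper name `store_sampled_ctr` is hypothetical, and the model class is passed in as an argument so the snippet does not depend on the project's imports.

```python
def store_sampled_ctr(publisher_model, publisher_id, ctr):
    """Persist sampled_ctr with one UPDATE statement instead of Model.save().

    `publisher_model` is expected to be a Django model class such as Publisher.
    Returns the number of rows matched by the update.
    """
    return publisher_model.objects.filter(pk=publisher_id).update(sampled_ctr=ctr)
```

Inside the nightly task this would stand in for the `publisher.sampled_ctr = ...` and `publisher.save()` pair, for example `store_sampled_ctr(Publisher, publisher.pk, report.total["ctr"])`.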

Member Author

Ah yea.. this must have snuck in from a PR I branched off... definitely didn't do it as part of this PR.

Member Author

Will roll this back, since it's a different change.

@ericholscher

@davidfischer ah, yea. You need to enable pgvector in the DB. I forgot to note that: https://github.com/pgvector/pgvector?tab=readme-ov-file#getting-started

davidfischer commented Feb 8, 2024

I think there are a few problems here.

  • This branch hasn't taken any updates from main since ~August, before we upgraded Postgres for Django 4.2. I think we need to recreate or rebase the PR, since a bunch of things will be off and there will be conflicts.
  • Rather than using a third-party Docker image with an unknown version of Postgres and an unknown version of pgvector, let's stick with the pinned version of PG we are using (15.2) and build the extension in the Dockerfile. Hopefully it's as easy as adding a few steps to the Dockerfile.
  • It seems fairly easy to add migrations.RunSQL('CREATE EXTENSION IF NOT EXISTS vector;') to the migration (can we collapse the two migrations into one?).

@@ -55,6 +56,8 @@ class AnalyzedUrl(TimeStampedModel):
),
)

+    embedding = VectorField(dimensions=384, default=None, null=True, blank=True)
Collaborator

I suspect we will need some sort of approximate index here, but that can come later.

@ericholscher ericholscher Feb 9, 2024

Aye, that's definitely a next step once we start querying it.

Definitely interesting:

You can add an index to use approximate nearest neighbor search, which trades some recall for speed. Unlike typical indexes, you will see different results for queries after adding an approximate index.

https://github.com/pgvector/pgvector?tab=readme-ov-file#indexing
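
To make the recall trade-off in the quote concrete, here is a toy sketch in plain Python. It is not pgvector's implementation: it only mimics the general idea of an inverted-file style index, where vectors are grouped around centroids and a query scans just the nearest group. The function names, centroids, and points are invented for the example.

```python
import math


def exact_nearest(query, points, k):
    """Brute force: compare the query against every point."""
    return sorted(points, key=lambda p: math.dist(query, p))[:k]


def approximate_nearest(query, points, centroids, k):
    """Toy approximate search: only scan the bucket of the closest centroid."""

    def closest_centroid(point):
        return min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))

    buckets = {}
    for point in points:
        buckets.setdefault(closest_centroid(point), []).append(point)

    candidates = buckets.get(closest_centroid(query), [])
    return sorted(candidates, key=lambda p: math.dist(query, p))[:k]


centroids = [(0.0, 0.0), (10.0, 0.0)]
points = [(4.9, 0.0), (6.0, 0.0), (9.0, 0.0), (1.0, 0.0)]
query = (5.1, 0.0)

print(exact_nearest(query, points, k=1))
print(approximate_nearest(query, points, centroids, k=1))
```

The true nearest point sits just across the bucket boundary, so the approximate search misses it while scanning fewer points. That is the sense in which results can change once an approximate index is added.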

ericholscher and others added 4 commits February 8, 2024 12:12
- Create vector extension in the migration
- Ensure psql on the django docker image
- Use our maintenance scripts in the pg image
  while still using pgvector


sentence-transformers
pgvector
Collaborator

Because we import this directly in models.py, this probably has to go in the base requirements.

Collaborator

Actually this is probably just going to be a nightmare for testing. Having a field be a postgres specific field may require our testing setup to change since our tests are run with an in-mem sqlite setup.

Member Author

FWIW, sqlite has a similar extension: https://github.com/asg017/sqlite-vss -- but might be worth just running tests in postgres? 🤷

Member Author

Looks like there isn't an easy way to use the sqlite package in Django. Another idea I had that probably makes sense:

Break the embeddings out into their own model, with a FK or OneToOne to the AnalyzedURL? That way we could keep this all self-contained.

Collaborator

I think it's fine to have the embeddings on AnalyzedUrl. I think we just have to change the test skip logic (this) for the analyzer. We could ensure that adserver.analyzer is excluded from testing entirely and when it is tested that it uses Postgres. Thoughts?

Member Author

Yea, that likely makes sense as well, if we don't have tests for the code currently that we'd be skipping.
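
One way to keep `adserver.analyzer` out of the default (SQLite) test run, sketched under assumptions: the app is added to the installed apps only when `ADSERVER_ANALYZER_BACKEND` is set. The `installed_apps` helper and the base app list are hypothetical and are not taken from the project's settings.

```python
import os


def installed_apps(base_apps, environ=os.environ):
    """Return the app list, adding the analyzer only when a backend is configured."""
    apps = list(base_apps)
    if environ.get("ADSERVER_ANALYZER_BACKEND"):
        # The analyzer models use a Postgres-only vector field,
        # so this branch should only be taken when running on Postgres.
        apps.append("adserver.analyzer")
    return apps
```

With this shape, a plain test run never loads the Postgres-specific field, and a separate Postgres-backed run can opt in by exporting the environment variable.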

@davidfischer davidfischer left a comment

This looks great and I think disabling the analyzer is a sensible default (especially for OSS users of our project). It might be nice to find a way to run tests on the analyzer but not a blocker for merging this.

@ericholscher ericholscher merged commit 4db23c9 into main Feb 19, 2024
1 check passed
@ericholscher ericholscher deleted the embedding-poc branch February 19, 2024 16:58