
Early design review for the Topics API #726

Closed
jkarlin opened this issue Mar 25, 2022 · 43 comments
Labels:
  • privacy-tracker (Group bringing to attention of Privacy, or tracked by the Privacy Group but not needing response)
  • Provenance: Privacy Sandbox
  • Resolution: unsatisfied (The TAG does not feel the design meets required quality standards)
  • Review type: CG early review (An early review of general direction from a Community Group)
  • Topic: privacy

Comments

@jkarlin

jkarlin commented Mar 25, 2022

Braw mornin' TAG!1

I'm requesting a TAG review of the Topics API.

The intent of the Topics API is to provide callers (including third-party ad-tech or advertising providers on the page that run script) with coarse-grained advertising topics that the page visitor might currently be interested in. These topics will supplement the contextual signals from the current page and can be combined to help find an appropriate advertisement for the visitor.
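For concreteness, here is a minimal sketch of how a caller on a page might use the API, assuming the document.browsingTopics() entry point discussed later in this thread; the shape of the returned objects is illustrative, not normative.

```js
// Minimal sketch (assumed entry point and result shape; not normative).
async function getTopicsForAdRequest() {
  if (!('browsingTopics' in document)) return []; // not implemented or disabled
  try {
    // Resolves with at most one topic per recent epoch for this caller,
    // e.g. [{ topic: 123, taxonomyVersion: '1', modelVersion: '2' }].
    return await document.browsingTopics();
  } catch {
    return []; // e.g. the call was blocked by permissions policy
  }
}
```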

  • Explainer¹ (minimally containing user needs and example code): https://github.com/jkarlin/topics
  • User research: [url to public summary/results of research]
  • Security and Privacy self-review²: See below
  • GitHub repo (if you prefer feedback filed there): https://github.com/jkarlin/topics
  • Primary contacts (and their relationship to the specification):
    • Josh Karlin, jkarlin@, Google
    • Yao Xiao, xyaoinum@, Google
  • Organization/project driving the design: Chrome Privacy Sandbox
  • External status/issue trackers for this feature (publicly visible, e.g. Chrome Status): https://chromestatus.com/feature/5680923054964736

Further details:

  • [x] I have reviewed the TAG's Web Platform Design Principles
  • The group where the incubation/design work on this is being done (or is intended to be done in the future): Either WICG or PATCG
  • The group where standardization of this work is intended to be done ("unknown" if not known): unknown
  • Existing major pieces of multi-stakeholder review or discussion of this design: Lots of discussion on https://github.com/jkarlin/topics/issues/, and a white paper on fingerprintability analysis: https://github.com/jkarlin/topics/blob/main/topics_analysis.pdf
  • Major unresolved issues with or opposition to this design: We believe that the proposed API leans heavily towards user privacy in the privacy/utility tradeoff, as it should. But, the API’s utility isn’t yet clear. Until we try the API in an experiment, we can’t know for sure how the API will perform. Some changes are likely going to be needed. Knobs we may tweak include, but are not limited to, topics in the taxonomy, weights of the topics in the taxonomy, how a site might suggest topics for itself, and how we might get topic data from more places than just the domain (e.g., from the url if there is some signal that the url is privacy safe to parse).
  • This work is being funded by: Chrome

You should also know that...

This API was developed in response to feedback that we (Chrome) received on our first interest-based advertising proposal, FLoC. That feedback came from the TAG, other browsers, advertisers, and our users. We appreciate this feedback and look forward to your thoughts on this API.

At the bottom of this issue are both the security survey responses and responses to questions from the TAG about FLoC, answered in terms of Topics.

We'd prefer the TAG provide feedback as (please delete all but the desired option):

☂️ open a single issue in our GitHub repo for the entire review

Self Review Questionnaire: Security & Privacy

2.1. What information might this feature expose to Web sites or other parties, and for what purposes is that exposure necessary?

  • It exposes one of the user’s top-5 topics from the previous week to the caller if the calling context’s site also called the Topics API for the user on a page about that topic in the past three weeks. This is information that could instead have been obtained using third-party cookies. The part that might not have been obtainable with third-party cookies is that this is a top topic for the user, which is more global knowledge than a single third party may have been able to ascertain on its own.
  • 5% of the time the topic is uniformly random.
  • The topic comes from a taxonomy. The initial proposed taxonomy is here: https://github.com/jkarlin/topics/blob/main/taxonomy_v1.md
  • The topic returned (if one of the top 5 and not the random topic) is random among the top 5, and is set per calling top-frame site. So if any frame on a.com calls the API, it might get the topic with index 3, while b.com callers might get the topic at index 1 for the week. This reduces cross-site correlation/fingerprintability (see the sketch after this list).
  • Topics are derived only from sites the user visited that called the API.
  • Topics are derived only from the domain of the site, not the URL or content of the site, though this may change depending on utility results.
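A rough, illustrative sketch of the per-site selection described above; the function names, hashing, and epoch seed here are assumptions made for illustration, not the browser's real logic.

```js
// Rough sketch only: names, hashing, and the epoch seed are assumptions made
// for illustration; the real selection logic is internal to the browser.
const TAXONOMY_SIZE = 349; // size of the proposed initial taxonomy

function hashToInt(s) {
  // Toy stand-in for the browser's per-epoch pseudo-random choice.
  let h = 0;
  for (const c of s) h = (h * 31 + c.charCodeAt(0)) >>> 0;
  return h;
}

function topicForCaller(topFiveTopics, callerTopFrameSite, epochSeed) {
  const h = hashToInt(callerTopFrameSite + epochSeed);
  if (h % 100 < 5) {
    // 5% of the time: a uniformly random topic from the whole taxonomy.
    return 1 + hashToInt('noise' + callerTopFrameSite + epochSeed) % TAXONOMY_SIZE;
  }
  // Otherwise: one of the user's top 5 topics for the epoch. The index is
  // fixed per calling top-frame site, so a.com and b.com can see different
  // topics for the same user in the same week.
  return topFiveTopics[h % topFiveTopics.length];
}
```

Keying the choice on the caller's top-frame site is what gives different sites different topics for the same user in the same week, which is the cross-site correlation reduction mentioned above.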

2.2 Do features in your specification expose the minimum amount of information necessary to enable their intended uses?

Yes. The entire design of the API is to minimize the amount of information about the user that is exposed in order to provide for the use case. We have also provided a theoretical (and applied) analysis of the cross-site fingerprinting information that is revealed: https://github.com/jkarlin/topics/blob/main/topics_analysis.pdf

2.3. How do the features in your specification deal with personal information, personally-identifiable information (PII), or information derived from them?

The API intentionally provides some information about the user to the calling context. We’ve reduced the ability to use this information as a global identifier (cross site fingerprinting surface) as much as possible.

2.4. How do the features in your specification deal with sensitive information?

Sensitive information is reduced by only allowing topics in the taxonomy that Chrome and the IAB have deemed not sensitive (the topics in the proposed initial taxonomy are derived from the two organizations’ respective advertising taxonomies).

This does not mean that topics in the taxonomy, or groups of topics learned about the user over time, cannot be correlated with sensitive topics. This may be possible.

2.5. Do the features in your specification introduce new state for an origin that persists across browsing sessions?

The API provides some information about the user’s browsing history, and this is stored in the browser. The filtering mechanism used to provide a topic to a calling context if and only if that context has observed the user on a page about that topic in the past also stores data. This could be used to learn if the user has visited a specific site in the past (which third-party cookies can do quite easily today) and we’d like to make that hard. There may be interventions that the browser can take to detect and prevent such abuses.

2.6. Do the features in your specification expose information about the underlying platform to origins?

No.

2.7. Does this specification allow an origin to send data to the underlying platform?

The top-frame site’s domain is read to determine a topic for the site.

2.8. Do features in this specification enable access to device sensors?

No.

2.9. Do features in this specification enable new script execution/loading mechanisms?

No.

2.10. Do features in this specification allow an origin to access other devices?

No.

2.11. Do features in this specification allow an origin some measure of control over a user agent’s native UI?

No.

2.12. What temporary identifiers do the features in this specification create or expose to the web?

The topics that are returned by the API. They are per-epoch (week), per-user, and per-site, and they are cleared when the user clears state.

2.13. How does this specification distinguish between behavior in first-party and third-party contexts?

The topic is only returned to the caller if the calling context’s site has also called the API on a domain about that topic with that same user in the past three weeks. So whether the API returns anything or not depends on the calling context’s domain.
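A small sketch of that filtering check, assuming a hypothetical per-caller observation record; the data structure and window handling are illustrative, and the browser keeps the real history internally.

```js
// Illustrative only: the observation-record structure is an assumption; the
// browser keeps the real per-caller history internally.
// observations: Map<callerSite, Map<topicId, lastEpochObserved>>
function callerMayReceiveTopic(topicId, callerSite, currentEpoch, observations) {
  const seen = observations.get(callerSite);
  if (!seen || !seen.has(topicId)) return false;
  // The caller must have observed the user on a page about this topic
  // within the past three epochs (weeks).
  return currentEpoch - seen.get(topicId) <= 3;
}
```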

2.14. How do the features in this specification work in the context of a browser’s Private Browsing or Incognito mode?

The API returns an empty list in incognito mode. We feel that this is safe because there are many reasons that an empty list might be returned. e.g., because the user is new, because the user is in incognito, because the site has not seen this user on relevant sites with the associated topics in the past three weeks, because the user has disabled the API via UX controls.

This is effectively the same behavior as the user being new, so this is basically the API working the same within incognito mode as in regular mode. We could have instead returned random topics in incognito (and for new users) but this has the deleterious effect of significantly polluting the API with noise. Plus, we don’t want to confuse users/developers by having the API return values when they expect it not to (e.g., after disabling the API).
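From the caller's side this means an empty result is the normal, expected case. A sketch of defensive handling might look like the following; selectTopicsAd and selectContextualAd are hypothetical placeholders, not part of any API.

```js
// Sketch of caller-side handling; selectTopicsAd and selectContextualAd are
// hypothetical placeholders, not part of any API. An empty result may mean
// incognito, a new user, an opted-out user, or simply no eligible topics.
async function chooseAd() {
  let topics = [];
  if ('browsingTopics' in document) {
    topics = await document.browsingTopics().catch(() => []);
  }
  return topics.length > 0 ? selectTopicsAd(topics) : selectContextualAd();
}
```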

2.15. Does this specification have both "Security Considerations" and "Privacy Considerations" sections?

There is no formal specification yet, but the explainer goes into detail on the privacy considerations. The primary security consideration is that the API reveals information beyond third-party cookies in that learning a topic means that the topic is one of the user’s top topics for the week.

2.16. Do features in your specification enable origins to downgrade default security protections?

No.

2.17. How does your feature handle non-"fully active" documents?

No special considerations.

Responses to questions from the FLoC TAG review, as they apply to Topics

Sensitive categories

The documentation of "sensitive categories" visible so far are on google ad policy pages. Categories that are considered "sensitive" are, as stated, not likely to be universal, and are also likely to change over time. I'd like to see:

  • an in-depth treatment of how sensitive categories will be determined (by a diverse set of stakeholders, so that the definition of "sensitive" is not biased by the backgrounds of implementors alone);
  • discussion of if it is possible - and desirable (it might not be) - for sensitive categories to differ based on external factors (eg. geographic region);
  • a persistent and authoritative means of documenting what they are that is not tied to a single implementor or company;
  • how such documentation can be updated and maintained in the long run;
  • and what the spec can do to ensure implementers actually abide by restrictions around sensitive categories.
    Language about erring on the side of user privacy and safety when the "sensitivity" of a category is unknown might be appropriate.

A key difference between Topics and Cohorts is that the Topics taxonomy is human curated, whereas cohorts were the result of a clustering algorithm and had no obvious meaning. The advantage of a topics-based approach is that we can help to clarify which topics are exposed. For instance, the initial taxonomy we intend to use includes topics that are in both the IAB’s content taxonomy and Google’s advertising taxonomy. This ensures that at least two separate entities have reviewed the topics for sensitive categories. Assuming that the API is successful, we would be happy to consider a third-party maintainer of the taxonomy that incorporates both relevant advertising interests as well as up-to-date sensitivities.

Browser support

I imagine not all browsers will actually want to implement this API. Is the result of this, from an advertiser's point of view, that serving personalised ads is not possible in certain browsers? Does this create a risk of platform segmentation in that some websites could detect non-implementation of the API and refuse to serve content altogether (which would severely limit user choice and increase concentration of a smaller set of browsers)? A mitigation for this could be to specify explicitly 'not-implemented' return values for the API calls that are indistinguishable from a full implementation.

The description of the experimentation phase mentions refreshing cohort data every 7 days; is timing something that will be specified, or is that left to implementations? Is there anything about cohort data "expiry" if a browser is not used (or only used to browse opted-out sites) for a certain period?

As always, it is up to each browser to determine which use cases and APIs it wishes to support. Returning empty lists is completely reasonable, though a caller could still use the UA string to determine whether the API is really supported. I’m not sure that there is a good solution here.

In regards to the duration of a topic, I think that is likely to be per-UA.

In the Topics API, we ensure that each topic has a minimum number of users, by returning responses uniformly at random 5% of the time.
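As a back-of-the-envelope illustration of that floor (the caller population below is invented; the real analysis is in the linked white paper):

```js
// Hypothetical numbers purely for illustration; the real analysis is in the
// linked white paper.
const usersSeenByCaller = 50_000_000; // assumed weekly users observed by a large caller
const taxonomySize = 349;
const noiseRate = 0.05;               // fraction of responses that are uniformly random
// In expectation, the noise alone puts roughly this many users behind each
// topic for that caller each week:
console.log(Math.round(usersSeenByCaller * noiseRate / taxonomySize)); // ≈ 7163
```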

Opting out

I note that "Whether the browser sends a real FLoC or a random one is user controllable" which is good. I would hope to see some further work on guaranteeing that the "random" FLoCs sent in this situation do not become a de-facto "user who has disabled FLoC" cohort.
It's worth further thought about how sending a random "real" FLoC affects personalised advertising the user sees - when it is essentially personalised to someone who isn't them. It might be better for disabling FLoC to behave the same as incognito mode, where a "null" value is sent, indicating to the advertiser that personalised advertising is not possible in this case.
I note that sites can opt out of being included in the input set. Good! I would be more comfortable if sites had to explicitly opt in though.
Have you also thought about more granular controls for the end user which would allow them to see the list of sites included from their browsing history (and which features of the sites are used) and selectively exclude/include them?
If I am reading this correctly, sites that opt out of being included in the cohort input data cannot access the cohort information from the API themselves. Sites may have very legitimate reasons for opting out (eg. they serve sensitive content and wish to protect their visitors from any kind of tracking) yet be supported by ad revenue themselves. It is important to better explore the implications of this.

The current plan is for the Topics API to return an empty list in incognito mode.

Sites opt in by using the API. If the API is not used, the site will not be included. Sites can also prevent third parties from calling the API on their site via permissions policy.
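As a sketch of that last point, assuming the "browsing-topics" permissions-policy feature name used in the explainer and Chrome documentation (an assumption here, not verified against the eventual spec), a site could do something like:

```html
<!-- Disable the API for the whole page and all embedded frames via a response
     header on the top-level document (feature name assumed):
     Permissions-Policy: browsing-topics=() -->

<!-- Or deny it to a particular third-party frame: -->
<iframe src="https://ads.example/slot.html" allow="browsing-topics 'none'"></iframe>
```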

In regards to granular controls, we feel that this is possible with Topics (less so with FLoC) and expect to expose via UX the topics that are being returned, allowing users to opt out of the API completely or to disable individual topics.

The API is designed to facilitate ecosystem participation, as calling the API is both the way to contribute to and to receive value from the API. We do not want sites to be able to get topics without also supporting the ecosystem.

Centralisation of ad targeting

Centralisation is a big concern here. This proposal makes it the responsibility of browser vendors (a small group) to determine what categories of user are of interest to advertisers for targeting. This may make it difficult for smaller organisations to compete or innovate in this space. What mitigations can we expect to see for this?
How transparent / auditable are the algorithms used to generate the cohorts going to be? When some browser vendors are also advertising companies, how to separate concerns and ensure the privacy needs of users are always put first?

The Topics API helps to address broad, granular topics-based advertising. For more niche topics, we suggest the use of alternative sandbox APIs like FLEDGE.
In terms of transparency, the API is written in plain view in open-source code, the design is occurring on GitHub with an active community, and the ML model used to classify topics will be available for anyone to evaluate.

Accessing cohort information

I can't see any information about how cohorts are described to advertisers, other than their "short cohort name". How does an advertiser know what ads to serve to a cohort given the value "43A7"? Are the cohort descriptions/metadata served out of band to advertisers? I would like an idea of what this looks like.

With Topics, the topic's name in the taxonomy is its semantic meaning.

Security & privacy concerns

I would like to challenge the assertion that there are no security impacts.

  • A large set of potentially very sensitive personal data is being collected by the browser to enable cohort generation. The impact of a security vulnerability causing this data to be leaked could be great.

In Chrome, the renderer is only aware of the topic for the given site. The browser stores information about which callers were on each top-level site, and whether the API was called. This is significantly better than the data stored for third-party cookies.

  • The explainer acknowledges that sites that already know PII about the user can record their cohort - potentially gathering more data about the user than they could ever possibly have access to without explicit input from the user - but dismisses this risk by comparing it to the status quo, and does not mention this risk in the Security & Privacy self-check.

The Topics API, unlike FLoC, only allows a caller to learn topics if that caller has observed the user on a site about that topic. So it is no longer easy for a site to learn more about the user than it could have without explicit input from the user.

  • Sites which log cohort data for their visitors (with or without supplementary PII) will be able to log changes in this data over time, which may turn into a fingerprinting vector or allow them to infer other information about the user.

Topics is more difficult to use as a cross-site fingerprinting vector due to the fact that different sites receive different topics during the same week. We have a white paper studying the impact of this: https://github.com/jkarlin/topics/blob/main/topics_analysis.pdf
Logging data over time does still increase knowledge about the user however. We’ve limited this as much as we think is possible.

  • We have seen over past years the tendency for sites to gather and hoard data that they don't actually need for anything specific, just because they can. The temptation to track cohort data alongside any other user data they have with such a straightforward API may be great. This in turn increases the risk to users when data breaches inevitably occur, and correlations can be made between known PII and cohorts.

The filtering mentioned above (a topic is only returned if the calling context has observed the user on a site about that topic) significantly cuts down on this hoarding. It’s no longer possible for any arbitrary caller on a page to learn the user’s browsing topics.

  • How many cohorts can one user be in? When a user is in multiple cohorts, what are the correlation risks related to the intersection of multiple cohorts? "Thousands" of users per cohort is not really that many. Membership to a hundred cohorts could quickly become identifying.

There are only 349 topics in the proposed Topics API, and 5% of the time a uniformly random topic is returned. We expect there to be significantly more users per topic than there were in FLoC.

@jkarlin added the Progress: untriaged and Review type: CG early review labels Mar 25, 2022
@torgo added the Topic: privacy, privacy-tracker, and Provenance: Privacy Sandbox labels and removed the Progress: untriaged label Apr 13, 2022
@torgo added this to the 2022-04-18-week milestone Apr 13, 2022
@lknik
Member

lknik commented May 23, 2022

Is it possible to conduct a more formal leak-analysis?

We’ve reduced the ability to use this information as a global identifier (cross site fingerprinting surface) as much as possible.

@jkarlin
Author

jkarlin commented May 23, 2022

Please see https://github.com/patcg-individual-drafts/topics/blob/main/topics_analysis.pdf for a more formal analysis.

@jkarlin
Author

jkarlin commented May 25, 2022

Also, I'd appreciate your thoughts on if this API belongs in document, navigator, or somewhere else. We chose document.browsingTopics() because the topics are filtered by calling context. But perhaps it should be in navigator since it's more about the state of the user's browsing history?

@hadleybeeman
Member

Hello! We discussed this at our W3C TAG breakout.

We are adding this to our agenda for our upcoming face-to-face in London, and we'll come back to this in more detail then.

@jkarlin
Author

jkarlin commented Jul 12, 2022

Great, thanks for the update. Would it be useful for me to be present/available during that time?

@cynthia
Member

cynthia commented Jun 30, 2023

As API-surface feedback was also promised on "document, navigator, or somewhere else", adding that to the review comment above. We briefly discussed this, and the current thoughts on where the API belongs are somewhat inconclusive.

While navigator might sound logical given that it will be exposing a lossy representation of the browsing history, this also implies it is global to the user agent - I'm not sure how that would hold in the long term. If there is a necessity to change the behavior so that the API is contextual (e.g. different topics based on the caller's origin), it would definitely be out of place. Also, there are a lot of things somewhat unnecessarily hanging off of navigator, so bloat would be another reason.

This leaves document as the natural location for access via the browsing context. One question on the API surface would be whether there would be a reason to access topics from a worker (e.g. for background/off-thread/SW-based bidding), in which case you would probably want to expose it to WorkerGlobalScope as well. We don't know if it would be a critical use case, but if the ad tax in the main thread can go down as a side effect of this, it would be worth considering.

@hadleybeeman
Member

Hi all. We've looked at this during our W3C TAG f2f. We are still hoping for replies to our previous two comments from @plinss and @cynthia. Any thoughts?

@siliconvoodoo

Let's sum this up in very layman's terms:
Topics = Google money.
It's not in users' interest, nor should it be on the agenda of a moral society.
We, the people, want an integrally anonymized internet. If your business model can't survive because you can't monetize the data of your visitors, go do something more useful for society.
Stochastic plausible deniability is whitewashing of an otherwise dystopian behavior.
Pretending that "studies" demonstrated a desire from users to have targeted ads is just done on the back of respondents uneducated about the risks of identifiability, and about freedom of the web in general.
And "an improvement over cookies" is just sophistry, as explained by Brave's devs on their blog. I quote:

Google claims that these systems, [...], improve privacy because they’re designed to replace third-party cookies. The plain truth is that privacy-respecting browsers [...] have been protecting users against third-party tracking (via cookies or otherwise) for years now.

Google’s proposals are privacy-improving only from the cynical, self-serving baseline of “better than Google today.” Chrome is still the most privacy-harming popular browser on the market, and Google is trying to solve a problem they introduced by taking minor steps meant to consolidate their dominance on the ad tech landscape. Topics does not solve the core problem of Google broadcasting user data to sites, including potentially sensitive information.

@jkarlin
Author

jkarlin commented Aug 2, 2023

Thanks for the feedback. I’ve added responses to both plinss and cynthia below:

Do you have a response to the points raised in Webkit's review?

They are similar in nature to what has already been brought up by TAG and discussed in this thread. If there are particular questions I’d be happy to respond.

Do you have any analysis or response to the papers that Martin pointed to?

Yes, please see my previous comment. To add to that, I think it’s important to understand that all of the papers are using different data sets with different modeling assumptions on evolution of user interests, number of users present etc. Our own research utilized real user data, while the others understandably had to generate synthetic web traces and interests, which Jha et al. notes may not be representative of the general population. Nonetheless, they all found that it took a large number of epochs to reidentify the majority of users across sites.

Please could you elaborate if it is in fact the case that all sites browsed by a user are included by default as input data for generating a user's topics list? If this is the case, what recourse is there for sites which are misclassified?

This is not the case. Only sites that call the API are included as input to generating the user’s topics list.

Can you clarify the situation with regard to definition of user preference / opt out?

Users can opt out of the API wholesale within Chrome's privacy preferences. They can also disable topics that have been selected. In the future, they will be able to preemptively remove topics.

Sites can choose not to use the API, in which case user visits to their site will not be included in topics calculation. Sites can further ensure that nobody on their site calls the API via permission policy.

Have you considered dropping the part where topics are calculated from browsing history, and instead entirely configured by the user in their browser settings? This would be much closer to people being able to meaningfully opt in to targeted advertising, and would make several of the other concerns raised moot.

It’s been raised in our public meetings. Folks have raised multiple issues with such an approach. One is that user interests are dynamic, whereas settings are generally quite static. A second is that it seems like many users might not bother to configure this, even if doing so would improve their ads and the revenue of the sites they visit.

This leaves document as the natural location for access via the browsing context. One question on the API surface would be whether there would be a reason to access topics from a worker (e.g. for background/off-thread/SW-based bidding), in which case you would probably want to expose it to WorkerGlobalScope as well. We don't know if it would be a critical use case, but if the ad tax in the main thread can go down as a side effect of this, it would be worth considering.

Excellent, thanks for that guidance. It seems reasonable to expose the API to WorkerGlobalScope, but I don’t think it would alleviate any main-thread costs, as the browsingTopics call itself is asynchronous and efficient. If developers start to ask for it, then we can consider adding it more seriously.

@siliconvoodoo

What happens when one visits the Chinese embassy website, they decide they don't like your topics, and they make obtaining a visa difficult or impossible? Or the USA, for that matter; it regularly happens: https://techcrunch.com/2019/09/02/denied-entry-united-states-whatsapp/.

@jkarlin
Author

jkarlin commented Aug 3, 2023

@siliconvoodoo your hypothetical doesn't make sense. If the authorities were looking at your browser, surely they would be far more interested in your actual browsing history (readily available in the browser) than your topics? And if you cleared your history, then your topics would be cleared too.

Edit: Ah, I was looking at the article you linked to about phones being scanned and missed the first part about the website. In the website case: a) said website would have to have a third party on it that observed you on such a site and is willing to share that information; b) that topic could very well be noise; c) the taxonomy is coarse-grained with highly sensitive topics removed; and finally, compared to third-party cookies (which Chrome is trying to deprecate), Topics conveys a tiny amount of information.

@dmarti

dmarti commented Aug 3, 2023

@jkarlin Governments have a limited number of secret police hours to work with. Not all citizens and visitors can be fully observed at all times. Governments will be able to use a lightweight remote screening system like Topics API to identify people for further, more resource-consuming attention like a full device search. Clearing Topics API data or using a browser without Topics API turned on could also be a factor in selection. And the set of possible callers is big enough that we don't know in advance which callers will be owned by, or have a data sharing agreement with, which governments.

The Topics API taxonomy is free of obvious sensitive topics, but it can still encode sensitive information (such as people who like music A and food B in country C).

@siliconvoodoo

@jkarlin Your argument is trying to justify gas burning because coal is worse, when I'm telling you to go nuclear. It's a sort of tu quoque fallacy. The problem is systemic; don't compartmentalize it into pieces to find ad-hoc ways to give incompatible whataboutisms in each case. Surely you must understand that an authority directly looking at your device is one situation, which must be fought, à la the Apple versus FBI case. But that is not the one I'm worried about with Topics; that would be remote mass profiling. The surface of attack against individuals just keeps being magnified; third-party cookies are not a standard of reference, as the Brave blog explained.
There are enough NGOs alerting us to our predicaments: Big Brother Watch, La Quadrature du Net, Snowden; fictions: Black Mirror, Brave New World... I can't understand how you can willingly participate in implementing pathways that enable dystopias, instead of pushing for a society with more safety nets against what's coming. Why are you not aiming at Tor-like anonymity for all? No cookies, no Topics, fingerprinting jamming, IP spoofing...
Surely you've noticed alt-right horrors becoming mainstream; you must be able to picture what fascist powerhouses à la 1984 are becoming enabled to do with all the technology that we provide them. Immigration officers are not motivated to take those jobs because they have nothing else to do; it's because they love the power to be nationalist right-wingers and deny brown-skinned people entry on fake excuses. In Russia it will be because you have gay topics. In China, because you visited Uyghur activists' sites. In Iran, because you have feminist interests... They don't have to access any device; they will have your profile in a database, gathered and refined any time you visit an endpoint controlled by the agencies. The more instruments you provide, the more fascist the society veers, and the more you expose us to citizen scores, unjust incarceration, visa denials, lynching, executions or worse.

@jyasskin
Contributor

jyasskin commented Aug 4, 2023

There are two sides of the Topics API: the interface it exposes to pages to tell them what topics a user is probably interested in, and the interface it exposes to users to figure out or guess what topics they're actually interested in. The interface with pages is the traditional realm of web standards and involves a bunch of tradeoffs around the rate that pages can identify users, which @martinthomson has focused on above.

On the other hand, the interface with users is not generally something that we standardize or specify, instead giving user agents wide freedom to do what's best for their users, even if that's very different from what other UAs do. There are some limits here—if pages need to adapt to particular UI, it may be worth constraining the variation—but I don't think Topics falls into that category, and I suspect that the Topics spec actually has too much normative text specifying the user-facing part of its behavior.

Unfortunately, a large fraction of the TAG's review that @plinss recounted focuses on the particular UI that Chrome plans to ship, rather than the question of whether UAs have the freedom to do the right thing for their users. The TAG suggests that many users would appreciate if their interests were "entirely configured by the user in their browser settings", and I agree. As far as I can see, this UI is completely supported by the Topics API and would require no changes to the page-facing API or page behavior. Whether or not Chrome initially ships that UI, other browsers can do so, and Chrome could switch to it in the future. If I'm wrong, and that UI would require changes to the page-facing API, that would be a really good thing to point out soon, so that Chrome can ship a more-compatible API instead.

@chrisvls

chrisvls commented Aug 4, 2023

There are a few places where some of the assurances described in the beginning of this TAG discussion (quite a while ago now!), and even some more recently, don't quite track what is in the spec.

  • The discussion here states the taxonomy is coarse-grained, but the spec does not limit the depth of the taxonomy. From this discussion, it may be intentional that the spec would allow a taxonomy of a billion items. A bit more on this in a separate section below.

  • The discussion here states the “taxonomy name is its semantic meaning”, but the spec does not require that a topic have more than an integer ID. There is no requirement for a human-readable taxonomy name, nor for a utility for localizing that name.

  • The discussion here states that the taxonomy will exclude sensitive topics and hews to certain existing taxonomies, but the spec does not provide for any assurance or process for this.

  • The discussion here states that the spec has done as much as it can to allow for user consent, as this is generally left to UX implementation, but it is not clear that the permissions framework wouldn’t offer other options, such as treating each topic as a powerful feature or requiring powerful feature treatment of Topics.

  • The discussion here implies that security concerns are minimized because topics calculation will be done on the domain or url and occur locally in the browser, but the spec would allow the implementer to analyze the entire document in the context of the implementer’s choosing, including a server. While today’s specs allow server-based browser implementation, it is rare, and the marketing for the Privacy Sandbox features on-device processing pretty prominently.

To return for a moment to the assertion that a billion topics would mean no privacy loss because only five may be eligible for reporting out to sites:

  • A billion topics would invalidate all of the applicable analyses of the cross-identification probability. For example, the theoretical limit for leakage (log2(C(N, k)), where N is the taxonomy size and k is the number of topics tracked) would go from ~6 bits to ~29 bits under the spec's current limitation that the TopicID be an integer.

  • Users can’t proactively review and opt-out of a billion topics.

  • The five-percent random results would not ensure that all topics had users if there are a billion topics.

  • It is not clear that the security and privacy concerns could be addressed by relying on the argument that topics effectively expose less data than cookies.

  • The rewards for site collusion to game the system would be much higher. These may not have been explored in sufficient detail for a coarse-grained taxonomy, or even for trying to game multiple taxonomies; they certainly haven't been for a super-fine-grained one.

Finally, one question one might ask: why comment on a spec when there seems so small a chance of cross-browser implementation?

As an enterprise customer of Google Workspace and Chrome, I am already subjected to small, creeping changes to the interpretation of the terms of service – and updates to those terms that are difficult to opt out of. So, even if Chrome is the only full implementer, I would rather see the critical privacy promises in a draft spec so that they stick for longer.

Also, it is really important that implementations match their marketing. There are big implications for the web as a whole if the most popular browser can market a feature as "only local calculation of coarse-grained topics" when we decide to opt in, but then, since they don't think it is a big deal, change that over time.

@shivanigithub

FYI, Chrome plans to start gating Topics API invocation behind the enrollment and attestation mechanism (explainer, spec PR).

@plinss
Member

plinss commented Feb 27, 2024

To summarize and close this review, we note that there are some disagreements about goals here that underpin the disconnect.

The goals you have set out in the explainer are:

  • It must be difficult to reidentify significant numbers of users across sites using just the API.
  • The API should provide a subset of the capabilities of third-party cookies.
  • The topics revealed by the API should be less personally sensitive about a user than what could be derived using today’s tracking methods.
  • Users should be able to understand the API, recognize what is being communicated about them, and have clear controls. This is largely a UX responsibility but it does require that the API be designed in a way such that the UX is feasible.

The set of goals also implicitly compares the privacy characteristics of this API to the web with third-party cookies (and tracking). In the spirit of "leaving the web better than you found it," we would like to see the design goals achieved whilst also preserving the privacy characteristics of the web without third-party cookies.

We do acknowledge that you have arguably achieved the 4th goal, with an API that does not actively prevent the user from understanding and recognizing what is being communicated about them. However, the implicit privacy labour that would be required to manage this set of topics on an ongoing basis remains a key question.

Finally, we challenge the assertion that reidentification in the absence of other information is the right benchmark to apply. As we previously noted, the potential for this to affect privacy unevenly across different web users is a risk that is not adequately mitigated.

@plinss closed this as completed Feb 27, 2024
@plinss added the Resolution: unsatisfied label and removed the Progress: pending external feedback label Feb 27, 2024