Early design review for the Topics API #726
Comments
Is it possible to conduct a more formal leak-analysis?
|
Please see https://github.com/patcg-individual-drafts/topics/blob/main/topics_analysis.pdf for a more formal analysis. |
Also, I'd appreciate your thoughts on if this API belongs in |
Hello! We discussed this at our W3C TAG breakout. We are adding this to our agenda for our upcoming face-to-face in London, and we'll come back to this in more detail then. |
Great, thanks for the update. Would it be useful for me to be present/available during that time? |
As API-surface feedback was also promised on "document, navigator, or somewhere else", adding that to the review comment above. We briefly discussed this, and the current thoughts on where the API belongs are somewhat inconclusive. While This leaves |
Let's sum this up in plain layman's terms:
|
Thanks for the feedback. I’ve added responses to both plinss and cynthia below:
They are similar in nature to what has already been brought up by TAG and discussed in this thread. If there are particular questions I’d be happy to respond.
Yes, please see my previous comment. To add to that, I think it’s important to understand that all of the papers use different data sets with different modeling assumptions about the evolution of user interests, the number of users present, and so on. Our own research used real user data, while the others understandably had to generate synthetic web traces and interests, which Jha et al. note may not be representative of the general population. Nonetheless, they all found that it took a large number of epochs to reidentify the majority of users across sites.
This is not the case. Only sites that call the API are included as input to generating the user’s topics list.
Users can opt out of the API wholesale within Chrome's privacy preferences. They can also disable topics that have been selected. In the future, they will be able to preemptively remove topics. Sites can choose not to use the API, in which case user visits to their site will not be included in topics calculation. Sites can further ensure that nobody on their site calls the API via permission policy.
It’s been raised in our public meetings. Folks have raised multiple issues with such an approach. One is that user interests are dynamic, whereas settings are generally quite static. A second is that it seems like many users might not bother to configure this, even if doing so would improve their ads and the revenue of the sites they visit.
Excellent, thanks for that guidance. It seems reasonable to expose the API to |
What happens when one visits the Chinese embassy website, and they decide they don't like your topics much and make obtaining your visa difficult or impossible? Or the USA's, for that matter; it regularly happens: https://techcrunch.com/2019/09/02/denied-entry-united-states-whatsapp/ |
@siliconvoodoo your hypothetical doesn't make sense. If the authorities were looking at your browser, surely they would be far more interested in your actual browsing history (readily available in the browser) than your topics? And if you cleared your history, then your topics would be cleared too. Edit: Ah, I was looking at the article you linked to about phones being scanned and missed the first part about the website. In the website case, said website would a) have to have a third-party on it that observed you on such a site and is willing to share that information, b) that topic could very well be noise, c) the taxonomy is coarse grained with highly sensitive topics removed, and finally, compared to third-party cookies (what Chrome is trying to deprecate), topics conveys tiny amounts of information. |
@jkarlin Governments have a limited number of secret police hours to work with. Not all citizens and visitors can be fully observed at all times. Governments will be able to use a lightweight remote screening system like Topics API to identify people for further, more resource-consuming attention like a full device search. Clearing Topics API data or using a browser without Topics API turned on could also be a factor in selection. And the set of possible callers is big enough that we don't know in advance which callers will be owned by, or have a data sharing agreement with, which governments. The Topics API taxonomy is free of obvious sensitive topics, but can still encode sensitive information (such as people who like music A and food B in country C) |
@jkarlin Your argument is trying to justify burning gas because coal is worse, when I'm telling you to go nuclear. It's a sort of tu quoque fallacy. The problem is systemic; don't split it into pieces to find ad hoc, mutually incompatible whataboutisms for each case. Surely you must understand that authorities directly looking at your device is one situation, which must be fought, à la the Apple versus FBI case. But that is not the one I'm worried about with Topics; that would be remote mass profiling. The attack surface against individuals just keeps being magnified, and third-party cookies are not a standard of reference, as the Brave blog explained. |
There are two sides of the Topics API: the interface it exposes to pages to tell them what topics a user is probably interested in, and the interface it exposes to users to figure out or guess what topics they're actually interested in. The interface with pages is the traditional realm of web standards and involves a bunch of tradeoffs around the rate that pages can identify users, which @martinthomson has focused on above. On the other hand, the interface with users is not generally something that we standardize or specify, instead giving user agents wide freedom to do what's best for their users, even if that's very different from what other UAs do. There are some limits here—if pages need to adapt to particular UI, it may be worth constraining the variation—but I don't think Topics falls into that category, and I suspect that the Topics spec actually has too much normative text specifying the user-facing part of its behavior. Unfortunately, a large fraction of the TAG's review that @plinss recounted focuses on the particular UI that Chrome plans to ship, rather than the question of whether UAs have the freedom to do the right thing for their users. The TAG suggests that many users would appreciate if their interests were "entirely configured by the user in their browser settings", and I agree. As far as I can see, this UI is completely supported by the Topics API and would require no changes to the page-facing API or page behavior. Whether or not Chrome initially ships that UI, other browsers can do so, and Chrome could switch to it in the future. If I'm wrong, and that UI would require changes to the page-facing API, that would be a really good thing to point out soon, so that Chrome can ship a more-compatible API instead. |
There are a few places where some of the assurances described in the beginning of this TAG discussion (quite a while ago now!), and even some more recently, don't quite track what is in the spec.
To return for a moment to the assertion that a billion topics would mean no privacy loss because only five may be eligible for reporting out to sites.
Finally one question one might ask: why comment on a spec when there seems so small a chance of cross-browser implementation? As an enterprise customer of Google Workspace and Chrome, I am already subjected to small, creeping changes to the interpretation of the terms of service – and updates to those terms that are difficult to opt out of. So, even if Chrome is the only full implementer, I would rather see the critical privacy promises in a draft spec so that they stick for longer. Also, it is really important that implementations match their marketing. There are big implications for the web as a whole if the most popular browser can market a feature as "only local calculation of coarse-grained topics" when we decide to opt in, but then, since they don't think it is a big deal, change that over time. |
To summarize and close this review, we note that there are some disagreements about goals here that underpin the disconnect. The goals you have set out in the explainer are:
The set of goals also implicitly compares the privacy characteristics of this API to the web with third-party cookies (and tracking). In the spirit of "leaving the web better than you found it," we would like to see the design goals achieved whilst also preserving the privacy characteristics of the web without third-party cookies. We do acknowledge that you have arguably achieved the 4th goal, with an API that does not actively prevent the user from understanding and recognizing what is being communicated about them. However, the implicit privacy labour that would be required to manage this set of topics on an ongoing basis remains a key question. Finally, we challenge the assertion that reidentification in the absence of other information is the right benchmark to apply. As we previously noted, the potential for this to affect privacy unevenly across different web users is a risk that is not adequately mitigated. |
Braw mornin' TAG!
I'm requesting a TAG review of the Topics API.
The intent of the Topics API is to provide callers (including third-party ad-tech or advertising providers on the page that run script) with coarse-grained advertising topics that the page visitor might currently be interested in. These topics will supplement the contextual signals from the current page and can be combined to help find an appropriate advertisement for the visitor.
Further details:
You should also know that...
This API was developed in response to feedback that we (Chrome) received on our first interest-based advertising proposal, FLoC. That feedback came from TAG, other browsers, advertisers, and our users. We appreciate this feedback, and look forward to your thoughts on this API.
At the bottom of this issue are both the security survey responses and responses to questions from TAG about FLoC, answered in terms of Topics.
We'd prefer the TAG provide feedback as (please delete all but the desired option):
☂️ open a single issue in our GitHub repo for the entire review
Self Review Questionnaire: Security & Privacy
2.1. What information might this feature expose to Web sites or other parties, and for what purposes is that exposure necessary?
2.2 Do features in your specification expose the minimum amount of information necessary to enable their intended uses?
Yes. The entire design of the API is to minimize the amount of information about the user that is exposed in order to provide for the use case. We have also provided a theoretical (and applied) analysis of the cross-site fingerprinting information that is revealed: https://github.com/jkarlin/topics/blob/main/topics_analysis.pdf
2.3. How do the features in your specification deal with personal information, personally-identifiable information (PII), or information derived from them?
The API intentionally provides some information about the user to the calling context. We’ve reduced the ability to use this information as a global identifier (cross site fingerprinting surface) as much as possible.
2.4. How do the features in your specification deal with sensitive information?
Sensitive information is reduced by only allowing topics in the taxonomy that Chrome and the IAB have deemed not sensitive (the topics in the proposed initial taxonomy are derived from the two organizations’ respective advertising taxonomies).
This does not mean that topics in the taxonomy, or groups of topics learned about the user over time, cannot be correlated with sensitive topics. This may be possible.
2.5. Do the features in your specification introduce new state for an origin that persists across browsing sessions?
The API provides some information about the user’s browsing history, and this is stored in the browser. The filtering mechanism used to provide a topic to a calling context if and only if that context has observed the user on a page about that topic in the past also stores data. This could be used to learn if the user has visited a specific site in the past (which third-party cookies can do quite easily today) and we’d like to make that hard. There may be interventions that the browser can take to detect and prevent such abuses.
2.6. Do the features in your specification expose information about the underlying platform to origins?
No.
2.7. Does this specification allow an origin to send data to the underlying platform?
The top-frame site’s domain is read to determine a topic for the site.
2.8. Do features in this specification enable access to device sensors?
No.
2.9. Do features in this specification enable new script execution/loading mechanisms?
No.
2.10. Do features in this specification allow an origin to access other devices?
No.
2.11. Do features in this specification allow an origin some measure of control over a user agent’s native UI?
No.
2.12. What temporary identifiers do the features in this specification create or expose to the web?
The topics that are returned by the API. They are per-epoch (week), per-user, and per-site. They are cleared when the user clears state.
2.13. How does this specification distinguish between behavior in first-party and third-party contexts?
The topic is only returned to the caller if the calling context’s site has also called the API on a domain about that topic with that same user in the past three weeks. So whether the API returns anything or not depends on the calling context’s domain.
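The filtering rule described in this answer can be illustrated with a minimal sketch. This is not Chrome's implementation: the `ObservationLog` class, the `topics_for_caller` helper, the example site names, and the choice to count the three epochs before the current one are all assumptions made for illustration.

```python
from dataclasses import dataclass, field

# Assumption: "past three weeks" means the three epochs before the current one,
# with one epoch equal to one week.
EPOCHS_LOOKBACK = 3


@dataclass
class ObservationLog:
    """Hypothetical store of which caller observed which topic in which epoch."""
    seen: dict = field(default_factory=dict)  # (caller_site, epoch) -> set of topics

    def record(self, caller_site, epoch, topic):
        self.seen.setdefault((caller_site, epoch), set()).add(topic)

    def observed_recently(self, caller_site, current_epoch, topic):
        # True if the caller saw this topic in any of the last three epochs.
        return any(
            topic in self.seen.get((caller_site, epoch), set())
            for epoch in range(current_epoch - EPOCHS_LOOKBACK, current_epoch)
        )


def topics_for_caller(log, caller_site, current_epoch, candidate_topics):
    """Return only the candidate topics this caller has itself observed recently."""
    return [
        topic for topic in candidate_topics
        if log.observed_recently(caller_site, current_epoch, topic)
    ]
```

In this sketch a caller that never observed the user on a page about a topic receives nothing for that topic, which is why the result depends on the calling context's domain.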
2.14. How do the features in this specification work in the context of a browser’s Private Browsing or Incognito mode?
The API returns an empty list in incognito mode. We feel that this is safe because there are many reasons that an empty list might be returned: the user is new, the user is in incognito, the site has not seen this user on relevant sites with the associated topics in the past three weeks, or the user has disabled the API via UX controls.
This is effectively the same behavior as the user being new, so this is basically the API working the same within incognito mode as in regular mode. We could have instead returned random topics in incognito (and for new users) but this has the deleterious effect of significantly polluting the API with noise. Plus, we don’t want to confuse users/developers by having the API return values when they expect it not to (e.g., after disabling the API).
2.15. Does this specification have both "Security Considerations" and "Privacy Considerations" sections?
There is no formal specification yet, but the explainer goes into detail on the privacy considerations. The primary security consideration is that the API reveals information beyond third-party cookies in that learning a topic means that the topic is one of the user’s top topics for the week.
2.16. Do features in your specification enable origins to downgrade default security protections?
No.
2.17. How does your feature handle non-"fully active" documents?
No special considerations.
Responses to questions from the FLoC TAG review, as they apply to Topics
Sensitive categories
A key difference between Topics and Cohorts is that the Topics taxonomy is human curated, whereas cohorts were the result of a clustering algorithm and had no obvious meaning. The advantage of a topics-based approach is that we can help to clarify which topics are exposed. For instance, the initial taxonomy we intend to use includes topics that are in both the IAB’s content taxonomy and Google’s advertising taxonomy. This ensures that at least two separate entities have reviewed the topics for sensitive categories. Assuming that the API is successful, we would be happy to consider a third-party maintainer of the taxonomy that incorporates both relevant advertising interests and up-to-date sensitivities.
Browser support
As always, it is up to each browser to determine which use cases and APIs it wishes to support. Returning empty lists is completely reasonable, though a caller could still use the UA to determine whether the API is really supported. I’m not sure that there is a good solution here.
In regards to the duration of a topic, I think that is likely to be per-UA.
In the Topics API, we ensure that each topic has a minimum number of users, by returning responses uniformly at random 5% of the time.
Opting out
The current plan is for the Topics API to return an empty list in incognito mode.
Sites opt in via using the API. If the API is not used, the site will not be included. Sites can also prevent third parties from calling the API on their site via permission policy.
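As a rough illustration of the permission policy opt-out mentioned here, the sketch below adds a policy header to a response's headers. The `with_topics_disabled` helper is hypothetical, and the `browsing-topics` directive name is an assumption based on Chrome's documentation rather than something stated in this thread.

```python
# Assumption: the policy-controlled feature is named "browsing-topics";
# an empty allowlist "()" blocks it for the page and all embedded frames.
TOPICS_DIRECTIVE = "browsing-topics=()"


def with_topics_disabled(headers):
    """Return a copy of the response headers with the Topics API disallowed."""
    updated = dict(headers)
    existing = updated.get("Permissions-Policy")
    # Append to an existing policy rather than overwriting other directives.
    updated["Permissions-Policy"] = (
        f"{existing}, {TOPICS_DIRECTIVE}" if existing else TOPICS_DIRECTIVE
    )
    return updated
```

A site would apply this to every HTML response it serves, so that no third party embedded on its pages can call the API there.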
In regards to granular controls, we feel that this is possible with Topics (less so with FLoC) and expect to expose via UX the topics that are being returned, and allowing users to opt out of the API completely or disable individual topics.
The API is designed to facilitate ecosystem participation - as calling the API is both the way to contribute and receive value from the API. We do not want sites to be able to get topics without also supporting the ecosystem.
Centralisation of ad targeting
The Topics API helps to address broad, granular topics based advertising. For more niche topics, we suggest the usage of alternative sandbox APIs like FLEDGE.
In terms of transparency, the API is written plainly in open source code, the design is occurring on github with an active community, and the ML model used to classify topics will be available for anyone to evaluate.
Accessing cohort information
With Topics, the Taxonomy name is its semantic meaning.
Security & privacy concerns
In Chrome, the renderer is only aware of the topic for the given site. The browser stores information about which callers were on each top-level site, and whether the API was called. This is significantly better than the data stored for third-party cookies.
The Topics API, unlike FLoC, only allows a site to learn topics if the caller has observed the user on a site about that topic. So it is no longer easy to learn more about the user than they could have without explicit input from the user.
Topics is more difficult to use as a cross-site fingerprinting vector due to the fact that different sites receive different topics during the same week. We have a white paper studying the impact of this: https://github.com/jkarlin/topics/blob/main/topics_analysis.pdf
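One way to picture "different sites receive different topics during the same week" is a keyed hash over the epoch and the site. This is only a sketch under stated assumptions: the `topic_for_site` function, the per-user secret, and the use of HMAC-SHA256 are illustrative choices, not the mechanism Chrome specifies.

```python
import hashlib
import hmac


def topic_for_site(user_secret: bytes, epoch: int, site: str, top_topics: list) -> str:
    """Pick one of the user's top topics, stable per (epoch, site) pair.

    Hypothetical: a per-user secret keys the hash so that two sites cannot
    predict each other's pick, while one site sees a stable answer all week.
    """
    message = f"{epoch}|{site}".encode()
    digest = hmac.new(user_secret, message, hashlib.sha256).digest()
    index = int.from_bytes(digest[:8], "big") % len(top_topics)
    return top_topics[index]
```

Because the pick depends on the site, two colluding sites comparing what they received for the same user in the same week will often hold different topics, which weakens the topic as a cross-site join key.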
Logging data over time does still increase knowledge about the user, however. We’ve limited this as much as we think is possible.
The filtering mentioned above (not returning a topic unless the calling context has observed that user on a site about that topic) significantly cuts down on this hoarding. It’s no longer possible for any arbitrary caller on a page to learn the user’s browsing topics.
There are only 349 topics in the proposed Topics API, and 5% of the time a uniformly random topic is returned. We expect there to be significantly more users per topic than there were in FLoC.
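The noise mechanism can be sketched as follows, using the two figures given above (349 topics, 5% random responses). The function names and the use of integer indices for topics are assumptions for illustration only.

```python
import random

TAXONOMY_SIZE = 349       # number of topics in the proposed taxonomy
NOISE_PROBABILITY = 0.05  # share of responses replaced by a uniformly random topic


def pick_topic(real_topic: int, rng: random.Random) -> int:
    """Return the real topic index, or a uniformly random one 5% of the time."""
    if rng.random() < NOISE_PROBABILITY:
        return rng.randrange(TAXONOMY_SIZE)
    return real_topic


def noise_floor_per_topic() -> float:
    """Fraction of all responses that name a given topic purely due to noise."""
    return NOISE_PROBABILITY / TAXONOMY_SIZE
```

Under these assumptions every topic, however rare, is reported for roughly 0.05 / 349 of all responses, so observing a topic is never proof that the user actually has that interest.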