Web Discovery Project is a methodology and system developed by Brave but heavily inspired by Cliqz's Human Web. We recommend reading the blog post as additional material, even though there are, and will be, significant departures as WDP evolves.
Web Discovery Project is a methodology and system developed by Brave to collect data generated by their users while protecting their privacy and anonymity.
Brave needs data to power its privacy search. This data, provided by Brave users, is collected in a very different way than typical data collection. We want to depart from the current standard model, where users must trust that the company collecting the data will not misuse it, ever, in any circumstance. We do not want users to have no other choice but to trust us. There are many ways a trust model can fail. Hackers can steal data. Governments can issue subpoenas, or get direct access to the data. Unethical employees can dig into the data for personal interests. Companies can go bankrupt and have their data auctioned to the highest bidder. Finally, companies can unilaterally decide to change their privacy policies.
In the trust model, which is the industry standard, the user has very little control and protection comes only from privacy policy and enforcing bodies. We believe we must do better, if only for selfish reasons because we use our own products, and consequently, our own data is collected. We are not comfortable with only a promise based on Terms of Service and Privacy Policy agreements. It is not enough for us, and should not be enough for our users either. As someone once said, if you do not like reality, feel free to change it. The Web Discovery Project is our proposal for a more responsible and less invasive data collection.
The fundamental idea of the Web Discovery Project data collection is simple: to actively prevent Record Linkage.
Record linkage is the ability to know that multiple data elements, e.g. messages or records, come from the same user. This linkage leads to sessions, and these sessions are very dangerous with regard to privacy. For instance, Google Analytics data can be used to build sessions that can sometimes be de-anonymized by anyone who has access to them. Was it intentional? Most likely not. Will Google Analytics try to de-anonymize the data? I bet not. But still, the session is there, stored somewhere, and trust that it is not going to be misused is the only protection we have.
The Web Discovery Project is a methodology and system designed to collect data which cannot be turned into sessions once it reaches Brave. How? By strictly forbidding any user-identifier that could be used to link records as belonging to the same person, considering not only explicit UIDs but also implicit ones. Consequently, aggregation of users' data on the server-side (on Brave premises) is not technically feasible, as we have no means to know who is the original owner of the data.
This is a strong departure from the industry standard of data collections. Let us illustrate it with an example (a real one),
Since Brave Search is a search engine we need to know for which queries our results are not good enough. A very legitimate use-case, let's call it bad-queries. How do we achieve this?
It is easy to do if the users help us with their data. Simply observe the event in which a user does a query q in Brave and then, within one hour, does the same query on a different search engine. That would be a good signal that Brave's results for query q need to be improved. There are several approaches to collect the data needed for quality assessment. We want to show you why the industry standard approach has privacy risks.
Let's first start with the typical way to collect data: the server-side aggregation,
In this model, we would collect the URLs of search engine result pages; the query and the search engine can be extracted from the URL. We would also need to keep a timestamp and a UID so that we know which queries were done by the same person. With this data, it is then straightforward to implement a script that finds the bad-queries we are looking for.
The data that we would collect with the server-side aggregation approach would look like the following:
...
SERP=search.brave.com/q=brave hq address, UID=X, TIMESTAMP=2016...
SERP=google.com/q=brave hq address, UID=X, TIMESTAMP=2016...
SERP=google.com/q=facebook cristina grillo, UID=X, TIMESTAMP=2016...
SERP=google.com/q=file papers for divorce bad, UID=Y, TIMESTAMP=2016...
...
A simple script would traverse the file(s) checking for repetitions of the tuple UID and query within one hour intervals. By doing so, in the example, we would find that the query "brave hq address" seems to be problematic. Problem solved? Yes.
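To make the server-side approach concrete, here is a minimal sketch of such a script. The record shape (`engine`, `query`, `uid`, `ts` in milliseconds) and the function name are assumptions made for illustration; they are not taken from any Brave code.

```javascript
// Hypothetical record shape: { engine, query, uid, ts } with ts in milliseconds.
const ONE_HOUR_MS = 60 * 60 * 1000;

function findBadQueriesServerSide(records) {
  // Index Brave searches by "uid|query" so that repetitions are easy to find.
  const braveSearches = new Map();
  for (const r of records) {
    if (r.engine !== 'brave') continue;
    const key = `${r.uid}|${r.query}`;
    if (!braveSearches.has(key)) braveSearches.set(key, []);
    braveSearches.get(key).push(r.ts);
  }
  // A query is "bad" if the same UID repeats it elsewhere within one hour.
  const bad = new Set();
  for (const r of records) {
    if (r.engine === 'brave') continue;
    const times = braveSearches.get(`${r.uid}|${r.query}`) || [];
    if (times.some((t) => r.ts >= t && r.ts - t <= ONE_HOUR_MS)) {
      bad.add(r.query);
    }
  }
  return [...bad];
}
```

Note that the script only works because every record carries a UID, which is exactly the field that makes the records linkable.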
This data can in fact be used to solve many other use-cases. The problem is that some of these additional use-cases are extremely privacy sensitive.
With the same data, we could build a session for a user. Let's take a user with the anonymous UID X,
user=X, queries={'brave hq address','facebook cristina grillo'}
Suddenly we have the full history of that person's search queries! On top of that, perhaps one of the queries contains personally identifiable information (PII) that puts a real name to the user X. That was never the intention of whoever collected the data. But now the data exists, and the user can only trust that her search history is not going to be misused by the company that collected it.
This is what happens when you collect data that can be aggregated by UID on the server-side. It can be used to build sessions. And the scope of the session is virtually unbounded, for the good, solving many use-cases, and for the bad, compromising the user's privacy.
We do not want to aggregate the user's data on the server due to privacy implications. Yet, at some point, all the queries of the user in a certain timeframe must be accessible somewhere, otherwise we cannot resolve the use-case. But that place does not need to be on the server-side; it can be on the client, in the browser. We call it client-side aggregation.
What we do is to move the script that detects bad-queries to the browser, run it against the queries that the user does in real-time and then, when all conditions are met, send the following data back to our servers,
...
type=bad_query, query=brave hq address, target=google
...
This is exactly what we were looking for, examples of bad queries. Nothing more, nothing less.
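A minimal sketch of what such client-side detection could look like follows. The class name, the `onSearch` hook and the `emit` callback are hypothetical names chosen for illustration; the actual logic shipped in the browser may differ.

```javascript
// Hypothetical client-side detector: all state lives in the browser and only
// the final, UID-free record is handed to `emit`.
class BadQueryDetector {
  constructor(emit, windowMs = 60 * 60 * 1000) {
    this.emit = emit;
    this.windowMs = windowMs;
    this.recentBraveQueries = new Map(); // query -> timestamp (ms)
  }

  onSearch(engine, query, now = Date.now()) {
    // Forget Brave queries that are older than the time window.
    for (const [q, ts] of this.recentBraveQueries) {
      if (now - ts > this.windowMs) this.recentBraveQueries.delete(q);
    }
    if (engine === 'brave') {
      this.recentBraveQueries.set(query, now);
      return;
    }
    if (this.recentBraveQueries.has(query)) {
      // Same query repeated on another engine: report it, without any UID.
      this.emit({ type: 'bad_query', query, target: engine });
      this.recentBraveQueries.delete(query);
    }
  }
}
```

The other queries of the user never leave the device; they are only kept in memory for as long as the time window requires.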
The aggregation of user's data can always be done on the client-side, i.e. the user's device and therefore under the full control of the user. That is the place to do it. As a matter of fact, this is the only place where it should be allowed.
The snippet above satisfies the bad-queries use-case and most likely will not be reusable for other use-cases. On top of that, it comes without any privacy implication or side-effect.
The query itself could contain sensitive information, of course, but even if we were able to associate that record to a real person, that would be the only information that would be learned. Think what happens in the server-side aggregation model. The complete session of that user would be compromised, all the queries in her history. Or only a fraction of it, if the company collecting that data was sensible enough not to use permanent UIDs. Still, unnecessary. And sadly, server-side aggregation is the norm, not the exception.
Client-side aggregation has some drawbacks, namely:
- It requires a change of mindset by the developers.
- Processing and mining data implies code to be deployed and running on the client-side.
- The data collected might not be suitable to satisfy other use-cases. Because data collected has been aggregated by users, it might not be reusable.
- Aggregating past data might not be possible as the data to be aggregated may no longer be available on the client.
However, these drawbacks are a very small price to pay in return for the peace of mind of knowing that the data being collected cannot be transformed into sessions with uncontrollable privacy side-effects.
The goal of Web Discovery Project is not so much to anonymize data; for that purpose there are good methods like differential privacy, l-diversity, etc. Rather than trying to preserve the privacy of a data-set that contains sensitive information, the aim of Web Discovery Project is to prevent such data-sets from being collected in the first place.
We hope that we convinced you that there are alternatives to the standard server-side aggregation model. We can get rid of UIDs and the sessions they generate by changing the approach of data collection to client-side. Such an approach is general and can satisfy a wide range of use-cases. As a matter of fact, we have yet to find a use-case that cannot be satisfied by client-side aggregation alone.
Client-side aggregation at Brave is done at the browser level. However, it is perfectly possible to do the same using only standard JavaScript and HTML5; check out the Data Collection without Privacy Side-Effects short paper for more information.
Client-side aggregation is the approach that removes explicit UIDs, that is, the UIDs that are added to make the data linkable on the server-side. However, even if you remove all explicit UIDs the job is not done. There are more UIDs than the explicit ones…
Data needs to be transported from the user's device to the data collection servers. This communication, if direct, can be used to establish record-linkage via network level information such as the IP address and other network level data, doubling as UIDs.
Anonymous communication is a well-studied problem that has off the shelf solutions like Tor. This is not enough for our use-case, though, as we also need to account for message replays; a malicious actor could try to send multiple messages unlawfully inflating the popularity of pages of their choice, and consequently, affecting the ranking of our search engine. To achieve replay protection and anonymous communication we had to devise an additional sub-system called HPN (HumanWeb Proxy Network).
For instance, we want to collect the audience of a certain domain. When a user visits a web page whose domain has not been visited in the last day, the following message will be emitted,
{url-visited: 'http://josepmpujol.net/', timestamp: '2016-10-10'}
If all users are normative we can assume that if the above message is received 100 times, it means that 100 different users visited that domain on October 10th 2016. However, there is a non-zero chance that not all users are "normative".
A malicious actor can exploit this setup to artificially inflate the popularity of a site. He only needs to replay the message as much as he wants. Given that we have absolutely no information about the user sending the data, how can we know if 100 messages are from 100 different users and not from a single malicious one?
HPN solves this issue by filtering out this kind of attack through heavy use of crypto, which allows us to filter out repeated messages from the same user without ever knowing anything about the user.
Please check the paper Preventing Attacks on Anonymous Data Collection. The source code is always available:
- extension code
- anonymous-credentials (the crypto part, implementing the paper)
We have seen that we get rid of explicit UIDs by using client-side aggregation and communication UIDs by using the HPN. However, there is still another big group of user identifiers: the implicit UIDs.
Even in the case of anonymous communication, the way and time in which the data arrives can still be used to achieve certain record linkage, a weak one, but still a session. For instance,
- Spatial correlations. Messages need to be atomic. If messages are grouped or batched on the same network request for efficiency, the receiver will be able to tag them as coming from the same user.
- Temporal correlations. Even if messages are sent atomically in different requests, an attacker could still use the time messages arrive to probabilistically link multiple messages to the same user. Messages should be sent at random intervals to remove such correlations.
The Web Discovery Project already takes care of those two kinds of implicit UIDs. Whenever a message is sent via WebDiscoveryProject.sendMessage it will be placed in a queue that is emptied at random intervals. Naturally, messages are not grouped or pipelined; each message (encrypted) will use a brand-new HTTP request. Keys used for encryption are always one time only, to prevent the key from becoming a UID.
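The queueing behaviour can be sketched as follows. This is an illustration under stated assumptions, not the actual sendMessage implementation: the class name, the delay bounds and the injected `sendOne`, `random` and `schedule` functions are all hypothetical, and encryption with one-time keys is left out.

```javascript
// Hypothetical queue: one message per timer tick, each tick after a random
// delay, so that messages are never batched and arrival times are decorrelated.
class RandomIntervalQueue {
  constructor(sendOne, options = {}) {
    const {
      minDelayMs = 1000, // assumed bounds, not the real values
      maxDelayMs = 60000,
      random = Math.random,
      schedule = (fn, ms) => setTimeout(fn, ms),
    } = options;
    this.sendOne = sendOne; // expected to issue one fresh request per message
    this.minDelayMs = minDelayMs;
    this.maxDelayMs = maxDelayMs;
    this.random = random;
    this.schedule = schedule;
    this.pending = [];
    this.timerArmed = false;
  }

  push(message) {
    this.pending.push(message);
    if (!this.timerArmed) this.armTimer();
  }

  nextDelay() {
    return this.minDelayMs + this.random() * (this.maxDelayMs - this.minDelayMs);
  }

  armTimer() {
    this.timerArmed = true;
    this.schedule(() => this.flushOne(), this.nextDelay());
  }

  flushOne() {
    this.timerArmed = false;
    const message = this.pending.shift();
    if (message !== undefined) this.sendOne(message);
    if (this.pending.length > 0) this.armTimer();
  }
}
```

Injecting `schedule` and `random` keeps the sketch testable; in a browser the defaults would simply rely on setTimeout and Math.random.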
The content dependent implicit UIDs are, as the name suggests, specific to the content of the message, thus application dependent. For that reason, it is not possible to offer a general solution since it varies from message to message, or in other words, it varies from use-case to use-case.
We can, however, provide some examples of good practices and elaborate how we make sure that implicit UIDs, or other private information, never reach Brave's servers for some of our more complex messages.
We will cover 2 different types of messages of user data collected by Brave putting special emphasis on how we prevent content dependent implicit UIDs.
In this section, we will refer to the code deployed to our users at the time of writing of this document.
The latest version of the Web Discovery Project is always available in our open-source repository. Brave has a policy to open-source any code delivered to our users.
The first kind of message generated by the user is query,
{
"type": "wdp",
"action": "query",
"payload": {
"r": {
"0": {
"t": "Ranked: Best Browsers for Privacy in 2021 | ExpressVPN Blog",
"u": "https://www.expressvpn.com/blog/best-browsers-for-privacy/",
"age": null
},
"1": {
"t": "Ranked: Best Browsers for Privacy in 2021 | ExpressVPN Blog",
"u": "https://www.expressvpn.com/blog/best-browsers-for-privacy/",
"age": null
},
"2": {
"t": "Best browser for privacy 2021: Secure web browsing | ZDNet",
"u": "https://www.zdnet.com/article/best-browser-for-privacy/",
"age": null
},
"3": {
"t": "Stop Trackers Dead: The Best Private Browsers for 2021",
"u": "https://www.pcmag.com/picks/stop-trackers-dead-the-best-private-browsers",
"age": null
},
"4": {
"t": "Best Browsers for Privacy and Security 2021 [Top 12] | NordVPN",
"u": "https://nordvpn.com/blog/best-privacy-browser/",
"age": null
},
"5": {
"t": "Most secure browser for your privacy in 2021 - ProtonMail Blog",
"u": "https://protonmail.com/blog/best-browser-for-privacy/",
"age": null
},
"6": {
"t": "The Best Browsers for Privacy in 2021 | Digital Trends",
"u": "https://www.digitaltrends.com/computing/best-browsers-for-privacy/",
"age": null
},
"7": {
"t": "9 Most Secure Web Browsers That Protect Your Privacy In 2021",
"u": "https://www.bitcatcha.com/blog/most-secure-browser/",
"age": null
},
"8": {
"t": "The Best Browsers for Security and Privacy in 2021 - AVG",
"u": "https://www.avg.com/en/signal/best-browsers-most-security-privacy",
"age": null
},
"9": {
"t": "Best anonymous browsers of 2021 | TechRadar",
"u": "https://www.techradar.com/best/anonymous-browsing",
"age": null
},
"10": {
"t": "Secure Browsers That Protect Your Privacy - RestorePrivacy",
"u": "https://restoreprivacy.com/browser/secure/",
"age": null
}
},
"q": "best private browser",
"qurl": "https://www.google.com/search?q=best+private+browser",
"ctry": "--"
},
"ver": "1.0",
"channel": "brave",
"ts": "20210817",
"anti-duplicates": 1576633,
"sender": "hpnv2"
}
This message type is generated every time a user visits a search engine result page (SERP). Each URL in the address bar is evaluated against a set of dynamically loaded regular expression patterns, which determine if the user is on a SERP of either Brave/Google/Bing/Yahoo/Linkedin.
Please allow us to emphasize that the message above is real, and that the data above is the only information we will receive. This message arrives at Brave data collection servers through the HPN anonymization layer.
The message does not contain anything that could be related to an individual person; it does not contain any sort of UID that could be used to build a session at the Brave backend.
The only thing that we, as Brave, could learn is that someone who is a Brave user queried for best private browser on Google on the day 20210817, and the results that Google yielded. Sending any sort of explicit UID on that message, or not removing implicit UIDs, could lead to a session for the user that would contain all her queries, or at least a fraction of them. If any of the queries in the session contained personally identifiable information (PII), for instance, the whole session could be de-anonymized. Even though using UIDs is the industry standard, we are not willing to take such a risk. We need the data, which is used to train our ranking algorithms, so it is crucial for us. But we only want to know about the query, nothing else. Having the same message with a UID would be convenient, as the data could be repurposed for other use-cases, but the risks to the user's privacy are not acceptable.
The message of type query is a tricky message. It contains URLs and a fragment of user input, so we must ensure that this data carries no implicit UID introduced by external actors, in this case Google and the user.
Let's go over how the message is built,
The first and most important rule: send only what you need, not more.
A typical SERP URL (it can be found in your browser's address bar) looks like this,
https://www.google.com/search?q=best+private+browser&gl=us&hl=en&ei=Q80bYZrpK5GRlwSk76eQDA&oq=best+private+browser&gs_lcp=Cgdnd3Mtd2l6EAxKBAhBGABQAFgAYK65A2gAcAJ4AIABfYgBfZIBAzAuMZgBAMABAQ&sclient=gws-wiz&ved=0ahUKEwjaltrGp7jyAhWRyIUKHaT3CcIQ4dUDCA4
One would be tempted to send this URL as is, and then extract the query on the server side.
But that is dangerous as we cannot be certain that no UID is embedded in the query string. What is the purpose of &ei=Q80bYZrpK5GRlwSk76eQDA&oq=best+private+browser&gs_lcp=Cgdnd3Mtd2l6EAxKBAhBGABQAFgAYK65A2gAcAJ4AIABfYgBfZIBAzAuMZgBAMABAQ&sclient=gws-wiz&ved=0ahUKEwjaltrGp7jyAhWRyIUKHaT3CcIQ4dUDCA4? Is any of that data specific to the user so that it could be used as a UID?
It is always safer to sanitize any URL. Instead of the raw SERP URL we would send this:
{
"q": "best private browser",
"qurl": "https://www.google.com/search?q=best+private+browser"
}
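As a rough illustration of this kind of sanitization, the sketch below keeps only the query parameter and drops everything else from the URL. `sanitizeSerpUrl` is a hypothetical helper written for this document, not Brave's actual implementation, and it assumes the query lives in a single parameter (`q` by default).

```javascript
// Hypothetical helper: rebuild the SERP URL from scratch with only the query
// parameter, instead of trying to blacklist suspicious parameters.
function sanitizeSerpUrl(rawUrl, queryParam = 'q') {
  const url = new URL(rawUrl);
  const query = url.searchParams.get(queryParam);
  if (query === null) return null; // not a SERP URL we understand
  const clean = new URL(url.origin + url.pathname);
  clean.searchParams.set(queryParam, query);
  return { q: query, qurl: clean.toString() };
}
```

Rebuilding the URL from an allow-list of known parameters is safer than removing known-bad ones, since any new tracking parameter is dropped by default.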
The URL has been sanitized through WebDiscoveryProject.maskURL. The query q itself is also subjected to some sanitization, always required when dealing with user input. In the case of a query, we apply some heuristics to evaluate the risk of the query (WebDiscoveryProject.isSuspiciousQuery). If the query is suspicious the full message will be discarded and nothing will be sent. The query heuristics cover things like,
- query too long (>50 characters)
- too many tokens (>7)
- contains a number longer than 7 digits, fuzzy, e.g. (090)90-2, 5555 3235
- contains an email, fuzzy
- contains a URL with HTTP username or password
- contains a string longer than 12 characters that is classified as Hash (Markov Chain classifier defined at WebDiscoveryProject.probHashLogM)
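Most of the rules in the list above can be sketched in a few lines. The function below is a simplified, hypothetical stand-in for isSuspiciousQuery: the regular expressions are our own approximations of the "fuzzy" checks, and the Markov Chain hash classifier is deliberately left out.

```javascript
// Hypothetical approximation of the query heuristics; returns true when the
// query should be discarded.
function looksSuspiciousQuery(query) {
  const q = query.trim();
  if (q.length > 50) return true; // query too long
  if (q.split(/\s+/).length > 7) return true; // too many tokens

  // Fuzzy long number: count digits inside runs of digits and separators,
  // so that "(090)90-2, 5555 3235" is treated as one long number.
  const numericRuns = q.match(/[\d()+\-.,\/\s]+/g) || [];
  if (numericRuns.some((run) => run.replace(/\D/g, '').length > 7)) return true;

  // Fuzzy email: anything shaped like local@domain.tld.
  if (/[^\s@]+@[^\s@]+\.[^\s@]+/.test(q)) return true;

  // URL carrying HTTP credentials, e.g. scheme://user:password@host.
  if (/[a-z][a-z0-9+.\-]*:\/\/[^\s\/@]+@/i.test(q)) return true;

  return false;
}
```

The checks err on the side of dropping: several short numbers in a row are treated as one long number, which rejects some harmless queries.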
WebDiscoveryProject.isSuspiciousQuery does not guarantee that no personal information will ever be received. And that is also true for other heuristics that will be introduced later. That said, one must consider that even if some PII escaped sanitization, either because of a bug or because of lack of coverage, the only thing compromised would be the PII and the record itself. Because sessions in Web Discovery Project are not allowed, the damage of a PII failure is contained.
The last part of the message is the field r, which contains the results returned by Google. One could think that the data is safe, but that might not always be the case.
We cannot rule out the possibility that the user was logged in and that the content of the page (in this case Google's SERP) was customized/personalized. If that was the case, the content could contain elements that could be used as PII or UIDs.
It is very dangerous to send any content extracted from a page that is rendered to the user. The only information that can be sent is information that is public, period.
To deal with this problem we rely on what we call double fetch, which is an out-of-band (a.k.a. anonymous) HTTP request to the same URL (or a canonicalized version of the URL) without session. We rely on the fetch API without credentials.
By doing this the content is not user-specific, as the site has no idea who is issuing the request: the anonymous request does not allow cookies or any other network session. If the site requires authentication, the response of the site will be a redirect to a login page. (There are some caveats to this that will be covered in the next section, where we dig further into double fetch.)
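A minimal sketch of such an anonymous request with the standard fetch API is shown below. The helper names are hypothetical, and the options beyond `credentials: 'omit'` (no cache, no referrer) are our own assumptions about what a careful implementation might add.

```javascript
// Hypothetical helpers illustrating a credential-less request.
function anonymousFetchOptions() {
  return {
    method: 'GET',
    credentials: 'omit', // never send cookies or HTTP auth
    cache: 'no-store', // assumption: avoid responses cached for the user
    redirect: 'follow',
    referrerPolicy: 'no-referrer', // assumption: do not leak the referrer
  };
}

// `fetchImpl` is injectable so the function can be exercised without network.
async function doubleFetch(url, fetchImpl = fetch) {
  const response = await fetchImpl(url, anonymousFetchOptions());
  return {
    status: response.status,
    finalUrl: response.url, // differs from `url` after a login redirect
    redirected: response.redirected,
    html: await response.text(),
  };
}
```

A page that needs a login will typically show up here as a redirect or as content that looks very different from what the user saw.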
The results in field r are the ones scraped from the content of the double fetch rather than from the original content presented to the user.
Another example of a message sent by Web Discovery Project is the message page,
{
"type": "wdp",
"action": "page",
"payload": {
"url": "https://protonmail.com/blog/best-browser-for-privacy/",
"a": 14,
"x": {
"lh": 118423,
"lt": 33993,
"t": "4 web browsers that really care about your privacy [2021] - ProtonMail Blog",
"nl": 228,
"ni": 6,
"ninh": 4,
"nip": 0,
"nf": 1,
"pagel": "en-US",
"ctry": "--",
"iall": true,
"canonical_url": "https://protonmail.com/blog/best-browser-for-privacy/",
"nfsh": 0,
"nifsh": 0,
"nifshmatch": true,
"nfshmatch": true,
"nifshbf": 0,
"nfshbf": 0
},
"e": { "cp": 0, "mm": 3, "kp": 0, "sc": 3, "md": 0 },
"st": 200,
"c": null,
"ref": "https://www.google.com/ (PROTECTED)",
"red": null,
"qr": { "q": "best private browser", "t": "go", "d": 1 },
"dur": 25904
},
"ver": "1.0",
"channel": "brave",
"ts": "20210817",
"anti-duplicates": 779777,
"sender": "hpnv2"
}
We use this message to learn that someone has visited the URL https://protonmail.com/blog/best-browser-for-privacy/. We also want to learn how users interact with the pages as a proxy to infer the page quality. The page messages are heavily aggregated, for instance,
- field payload.a tells us the amount of time the user was engaged with the page,
- field payload.e.mm tells us the number of mouse movements,
- field payload.e.sc tells us the number of scrolling events.
The aggregated information about the page is very useful for us. Web Discovery Project crawling is a collaborative effort of Brave users, and to gather data on how users interact with the page helps us to figure out the quality and relevance of it.
All this information is aggregated as the user interacts with the page. Once the user closes the page or the page becomes inactive for more than 20 minutes, aggregation stops and the process to decide whether or not the message can be sent starts (the proper lifecycle is better described in the section below).
Besides engagement, the page message above also contains the field payload.qr, which is only present when the page was visited after a query on a search engine,
{
"qr": {
"q": "best private browser",
"t": "go",
"d": 1
}
}
in this case, the page was loaded after a request to Google for the query best private browser. The field t stands for the search engine and d is the recursive depth. qr is only sent if depth is 1, otherwise it could be used to build sessions. We will discuss how payload.qr and payload.c.ref (referral) can affect record linkage in a follow-up section.
At this point it should be evident why we are interested in the data contained in page messages. However, sending URLs (web pages) that users are visiting is very tricky and needs proper handling:
- a URL can contain information that can identify the user on the path or query-string. For instance, https://analytics.twitter.com/user/solso/home is only available if you can log in as solso on analytics.twitter.com, which means that only the user solso can access that page, a clear PII leak.
- a URL can give access to a resource that was not meant to be public and that can contain personal information or private content like personal pics, etc. Google Docs, thank-you-for-your-purchase pages, invoices, etc. are typical capability URLs that should never be collected.
Sending all pages that a user visits is not possible without major privacy side-effects.
Even though we get rid of all explicit and communication UIDs, the URL itself or the title of the page can contain plenty of implicit UIDs. That would allow record-linkage and sessions on the server-side. It is true that such sessions would be small or limited to certain domains, but they are sessions nonetheless. To make things worse, URLs can lead to resources with highly sensitive information. In the next section we describe how we prevent such pages (URLs) from ever being sent to Brave.
Let us describe the page lifecycle in more detail.
We detect a user visiting a web page by monitoring the chrome.webNavigation.onCommitted and chrome.webNavigation.onHistoryStateUpdated events.
At this point we check if the user is in private mode. If so we stop. If not, we get the URL on the current tab and check if it is acceptable using WebDiscoveryProject.isSuspiciousURL, which checks for:
- not on odd ports, only 80 and 443 allowed
- no HTTP auth URL
- no domain as IP
- protocol must be http or https
- no localhost
- no hash # in the URL unless it is a known SERP URL or the text after # is smaller than 10 characters.
If the URL passes the first check we continue, otherwise we stop the process and ignore that page.
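The checks listed above map quite directly onto the standard URL API. The function below is a hypothetical stand-in written for illustration, not the actual isSuspiciousURL code; in particular, the `isKnownSerp` flag is assumed to be computed elsewhere.

```javascript
// Hypothetical sketch of the first URL check; returns true when the page
// should be ignored.
function looksSuspiciousUrl(rawUrl, isKnownSerp = false) {
  let url;
  try {
    url = new URL(rawUrl);
  } catch (err) {
    return true; // unparsable URLs are dropped
  }
  // Protocol must be http or https.
  if (url.protocol !== 'http:' && url.protocol !== 'https:') return true;
  // URL.port is '' when the default port of the scheme is used.
  if (url.port !== '' && url.port !== '80' && url.port !== '443') return true;
  // No HTTP auth in the URL.
  if (url.username !== '' || url.password !== '') return true;
  // No IP address as domain (IPv6 hostnames are wrapped in brackets).
  const host = url.hostname;
  if (/^\d{1,3}(\.\d{1,3}){3}$/.test(host) || host.startsWith('[')) return true;
  // No localhost.
  if (host === 'localhost') return true;
  // Long fragments are only tolerated on known SERP URLs.
  const afterHash = url.hash.slice(1);
  if (afterHash.length >= 10 && !isKnownSerp) return true;
  return false;
}
```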
The next step is to determine if the URL is a known SERP URL (WebDiscoveryProject.checkSearchURL). If it is, then a message of type query is generated (described in the previous section). If it is not a SERP URL we continue our way to generate a page type message.
The URL under analysis is kept in memory at WebDiscoveryProject.state['v']. Additional data is aggregated to the object in memory as the user interacts with the page. For instance, number of key presses, mouse movement, referral, whether the page comes out of a query (field payload.qr), etc. We can also aggregate information that will be discarded later on. For instance, we still aggregate the links that the user follows when on the page.
Another important piece of data that will be kept is the page signature extracted from the content document after a timeout of 2 seconds. The signature looks like the field payload.x on the message above. However, the first signature will not be sent since it is generated from a page rendered to the user. The first page signature is a required input for the double fetch process that will be explained in a bit.
The page will stay in memory until,
- the page is inactive for more than 20 minutes,
- the user closes the tab containing the page,
- the tab loads another page or,
- the user closes the browser (or window).
When the page is unloaded from memory it will be persisted to disk (WebDiscoveryProject.addURLtoDB) and it will be considered as a pending page.
Up to this point, nothing has been sent yet. The information is either in memory or in local storage.
Every minute, an out-of-band process on the main thread (see pacemaker) will check for URLs that are no longer active and try to finalize the analysis.
The selection of the URLs to be processed is determined by WebDiscoveryProject.processUnchecks, a queue that picks pending pages from storage. The number of URLs on the queue depends on the browsing activity of the user: the incoming rate is the rate of pages visited (URL on address bar, no 3rd parties or frames), and the outgoing rate is 1 per minute while the browser is open.
We will evaluate isPrivate on the analyzed URL, which tells us whether the page has been seen before and flagged as private, or whether it is unknown. If the URL was already marked as private in the past the process stops and no message is generated.
There are multiple ways a page can be classified as private:
- because the double fetch process fails, either it cannot be completed or the signature of the pages after double fetch does not match (more on that later)
- because the URL comes out of a referrer that was private
- because the referral chain is too long (>10), typically suspicious pages with odd behaviors that we want to ignore
If the page is not private we will continue for the double fetch to assess whether it is public or not.
On a double fetch the URL being analyzed will be fetched with an anonymous HTTP(S) request, using the fetch API without credentials.
The content will be anonymously fetched, parsed/rendered on a hidden window and finally we will obtain the signature of the page:
"x": {
"lh": 118423,
"lt": 33993,
"t": "4 web browsers that really care about your privacy [2021] - ProtonMail Blog",
"nl": 228,
"ni": 6,
"ninh": 4,
"nip": 0,
"nf": 1,
"pagel": "en-US",
"ctry": "--",
"iall": true,
"canonical_url": "https://protonmail.com/blog/best-browser-for-privacy/",
"nfsh": 0,
"nifsh": 0,
"nifshmatch": true,
"nfshmatch": true,
"nifshbf": 0,
"nfshbf": 0
}
The signature of the page contains the canonical URL (if it exists), the title of the page, as well as some structural information of the page such as number of input fields (ni), number of input fields of type password (nip), number of forms (nf), number of iframes (nif), length of text without html (lt), etc.
At this point we have two page signatures for the URL: a) with the content rendered with the session of the user, x_before, and b) with the content rendered without the session of the user, x_after, i.e. the content rendered as if some random person had visited the same page.
If the signatures do not match it means that the content of the page is user-specific and that the page should be treated as private. It will be flagged as private and ignored forever. Of course, no message will be sent.
Matching the signatures is defined at WebDiscoveryProject.validDoubleFetch. Note that the match is fuzzy since signatures can differ a bit even in the case of public pages. Furthermore, there are many pages that have co-existing private and public versions, e.g. https://github.com/solso, https://twitter.com/solso will have different x_before and x_after signatures depending on whether the user solso was logged in or not. However, in both cases the public version of the page is indeed public. Long story short, the function validDoubleFetch controls whether the URL should be considered public or private depending on how much the signatures of the pages differ.
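To give an idea of what a fuzzy signature comparison can look like, here is a deliberately simple sketch. It is not the validDoubleFetch algorithm: the chosen fields, the title overlap measure and every threshold (`tolerance`, `slack`, `minTitleOverlap`) are assumptions made for illustration only.

```javascript
// Structural fields compared in this sketch (a subset of the signature).
const COUNT_FIELDS = ['lt', 'nl', 'ni', 'nf', 'nip'];

// Jaccard overlap between the word sets of two titles, in [0, 1].
function titleOverlap(a, b) {
  const tokens = (s) =>
    new Set(String(s || '').toLowerCase().split(/\W+/).filter(Boolean));
  const ta = tokens(a);
  const tb = tokens(b);
  if (ta.size === 0 && tb.size === 0) return 1;
  let shared = 0;
  for (const t of ta) if (tb.has(t)) shared += 1;
  return shared / (ta.size + tb.size - shared);
}

// Hypothetical fuzzy match between x_before and x_after.
function signaturesRoughlyMatch(before, after, options = {}) {
  const { tolerance = 0.2, slack = 2, minTitleOverlap = 0.5 } = options;
  if (!before || !after) return false; // double fetch could not be completed
  if (titleOverlap(before.t, after.t) < minTitleOverlap) return false;
  return COUNT_FIELDS.every((field) => {
    const a = Number(before[field] || 0);
    const b = Number(after[field] || 0);
    // Allow a relative difference, with a small absolute slack for tiny counts.
    return Math.abs(a - b) <= Math.max(slack, tolerance * Math.max(a, b));
  });
}
```

With such a comparison, a page whose anonymous version is a login form fails on the title alone, while small layout differences between two renderings of a public page are tolerated.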
Any page that requires a login will fail on the double fetch validation for multiple different reasons: the URL requested anonymously will be unreachable due to redirect towards a login or error page, titles will not match, the structure of the page (e.g. number of passwords or forms) will be too different and so on. If validation fails the page will be flagged as private and never processed again. Furthermore, any URL whose referral is marked as private will also be considered as such.
There is one particular case in which the double fetch method is not effective: when the authorization is based on the fact that the user is on a private network. This setup is often encountered at home and in office environments: access to routers, company wikis, pages whose authorization relies on access through a VPN. In such cases the anonymous request of double fetch has no effect, because the double fetch is done on the same network. To detect such cases we rely on DNS resolution (using the onCompleted event of the webRequest API) to detect if the domain resolves to a private IP range (WebDiscoveryProject.isLocalURL). If so, the URL is marked as private.
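The private range test itself is straightforward once the resolved address is known. The sketch below assumes the address is available as an IPv4 string and covers the RFC 1918 ranges plus loopback and link-local; `isPrivateIPv4` is a hypothetical helper, not the isLocalURL implementation, and IPv6 is not handled.

```javascript
// Hypothetical helper: true if the IPv4 address is not publicly routable.
function isPrivateIPv4(ip) {
  const octets = String(ip).split('.');
  if (octets.length !== 4 || !octets.every((o) => /^\d{1,3}$/.test(o))) {
    return false; // not a dotted-quad IPv4 address
  }
  const [a, b, c, d] = octets.map(Number);
  if ([a, b, c, d].some((n) => n > 255)) return false;
  return (
    a === 10 || // 10.0.0.0/8
    (a === 172 && b >= 16 && b <= 31) || // 172.16.0.0/12
    (a === 192 && b === 168) || // 192.168.0.0/16
    a === 127 || // loopback
    (a === 169 && b === 254) // link-local
  );
}
```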
Another aspect to consider are capability URLs, some of which are not protected by any authorization process and simply rely on obfuscation. Google Docs, Github gists, dropbox links, thank you pages on e-commerce sites, etc. Some of these capability URLs are meant to be shared but others are not.
Some providers like dropbox.com or github.com are careful enough to mark pages that are meant to be private as noindex pages,
<meta content="noindex">
We flag as private any page that has been declared not indexable by the site owner. However, not all site owners are so careful. Google Docs, for instance, does not use this tag, and its documents can contain a fair amount of privacy-sensitive and/or PII data.
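As an illustration of the noindex check, here is a rough sketch that scans raw HTML for a meta tag whose content attribute includes noindex. The function name hasNoindexMeta and the regex-based parsing are assumptions made for this example; the actual implementation may inspect the DOM or other signals instead.

```javascript
// Hypothetical helper: returns true if the HTML contains a <meta> tag
// whose content attribute includes the token "noindex".
function hasNoindexMeta(html) {
  const metaTags = html.match(/<meta\b[^>]*>/gi) || [];
  return metaTags.some((tag) => {
    const m = tag.match(/\bcontent\s*=\s*["']([^"']*)["']/i);
    return m !== null && /\bnoindex\b/i.test(m[1]);
  });
}
```

A page can also be declared non-indexable through HTTP response headers, which this sketch does not cover.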
To detect such URLs we rely on WebDiscoveryProject.dropLongURL, which heuristically determines whether the URL looks like a potential capability URL (the name dropLongURL is somewhat of a legacy from the first version of Web Discovery Project, where the length of the URL was the only heuristic rule). Nowadays there are many more rules to increase coverage, for instance:
- query string (or post # data) is too long (>30)
- segments of the path or the query string are too long (>18)
- segments of the path or the query string are classified as hashes (Markov Chain classifier defined at WebDiscoveryProject.probHashLogM)
- the URL contains a long number in the path or query string,
- the URL contains an email in the path or query string
- query string or path contain certain keywords like: admin, share, Weblogic, token, logout, edit, uid, email, pwd, password, ref, track, login, session, etc.
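The rules above can be sketched roughly as follows. The thresholds 30 and 18 come from the list; the function name looksLikeCapabilityURL, the "long number" length of 8 digits, the email pattern, and the substring keyword matching are assumptions for this example. The Markov Chain hash classifier is left out.

```javascript
// Keyword list taken from the rules above (lowercased, illustrative only).
const KEYWORDS = ["admin", "share", "weblogic", "token", "logout", "edit", "uid",
  "email", "pwd", "password", "ref", "track", "login", "session"];

// Hypothetical sketch of heuristics in the spirit of dropLongURL.
function looksLikeCapabilityURL(rawUrl) {
  const url = new URL(rawUrl);
  const query = url.search.replace(/^\?/, "");
  const fragment = url.hash.replace(/^#/, "");
  // Rule: query string (or data after #) is too long.
  if (query.length > 30 || fragment.length > 30) return true;

  // Rule: any segment of the path or the query string is too long.
  const segments = url.pathname.split("/")
    .concat(query.split(/[&=]/))
    .filter((s) => s.length > 0);
  if (segments.some((s) => s.length > 18)) return true;

  const pathAndQuery = (url.pathname + " " + query).toLowerCase();
  // Rule: long number (8+ digits is an assumed threshold).
  if (/\d{8,}/.test(pathAndQuery)) return true;
  // Rule: something that looks like an email address.
  if (/[^\s/?&=]+@[^\s/?&=]+\.[a-z]{2,}/.test(pathAndQuery)) return true;
  // Rule: suspicious keywords (substring match, deliberately conservative).
  return KEYWORDS.some((k) => pathAndQuery.includes(k));
}
```

Substring matching on keywords produces many false positives (for example, "edit" matches inside unrelated words), which is in line with the conservative stance of the heuristics.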
The heuristics in WebDiscoveryProject.dropLongURL are quite conservative, meaning that a lot of pages get incorrectly classified as suspected capability URLs and are consequently flagged as private. False positives, however, are not such a big deal for our use case.
However, there is no guarantee that a false negative does not slip through the cracks of the heuristics. We routinely check the URLs that reach us and, from time to time, though not often, we still find URLs that we would rather not receive. We are talking about single-digit figures in millions of records.
It is worth noting that both WebDiscoveryProject.validDoubleFetch and WebDiscoveryProject.dropLongURL have a variable strictness level, controlled by WebDiscoveryProject.calculateStrictness. Both functions are a bit less restrictive if the page has a canonical URL and the referral page contains the URL as a public link in the content loaded on an anonymous request.
To reduce the probability of collecting capability URLs, we have devised a state-less quorum system based on the STAR protocol (open source code of the library to be released soon) by which a page message will only be decrypted by the server-side aggregator if more than k people have already seen that same URL; if this quorum is not reached, it is impossible for the server to retrieve the content of the page message. Here is a brief explanation of how this works.
Whenever WDP wants to collect a message of type page (corresponding to a given url), it will:
- Derive an encryption key, a tag and a share from url (using the STAR create_share function).
  - key is a symmetric (AES) encryption key used to encrypt the page message (it is exclusively used by the client and will never be shared with the server),
  - tag is, roughly, a hash of url which allows the server to group shares belonging to the same page message (without being able to learn anything about the page or the URL itself),
  - share is a piece of data derived from the encryption key. The server will only be able to recover key if at least k people have sent a share corresponding to the same page (the scheme is based on Shamir's Secret Sharing).
- The page message is then encrypted using key (on the client, before sending).
- WDP then sends a triple (tag, share, encrypted_page) to the server.
Whenever a new message is received by the server, we are only able to retrieve page messages for which at least k shares have been collected by:
- Grouping all share(s) together to recover key (using the STAR group_shares function).
- Using key to decrypt encrypted_page and access the original page message.
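To give an intuition for the threshold property, here is a toy illustration of Shamir's Secret Sharing over a prime field. It is not the STAR library: the names makeShare and recoverSecret, the field modulus, and the way coefficients are supplied are all assumptions made for this sketch. In a scheme like the one described, every client would have to derive the same polynomial from url so that shares from different clients are compatible; here the coefficients are simply passed in.

```javascript
// Toy Shamir's Secret Sharing; NOT the STAR library (names are hypothetical).
const P = 2n ** 61n - 1n; // a Mersenne prime, used as the field modulus

// Reduce into the range [0, P), also for negative inputs.
const mod = (a) => ((a % P) + P) % P;

// Modular exponentiation by repeated squaring.
function modPow(base, exp) {
  let result = 1n;
  let b = mod(base);
  let e = exp;
  while (e > 0n) {
    if (e & 1n) result = mod(result * b);
    b = mod(b * b);
    e >>= 1n;
  }
  return result;
}

// Modular inverse via Fermat's little theorem (P is prime).
const modInv = (a) => modPow(a, P - 2n);

// coeffs[0] is the secret; a polynomial of degree k-1 requires k shares.
function makeShare(coeffs, x) {
  let y = 0n;
  for (let i = coeffs.length - 1; i >= 0; i--) {
    y = mod(y * x + coeffs[i]); // Horner evaluation
  }
  return { x, y };
}

// Lagrange interpolation at x = 0 recovers coeffs[0] from k distinct shares.
function recoverSecret(shares) {
  let secret = 0n;
  for (const { x: xi, y: yi } of shares) {
    let num = 1n;
    let den = 1n;
    for (const { x: xj } of shares) {
      if (xj === xi) continue;
      num = mod(num * xj);
      den = mod(den * (xj - xi));
    }
    secret = mod(secret + yi * mod(num * modInv(den)));
  }
  return secret;
}
```

With a polynomial of degree 2 (k = 3), any three shares with distinct x values recover the secret, while interpolating with only two shares yields an unrelated value.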
Additionally, to make the protection even stronger, WDP sends an extra parameter oc, which is the last octet of the IPv4 address of the client. The server will only decrypt page messages which have been sent by enough people with different values of oc. This is because capability URLs are often shared in a certain context; when this context is the workplace or home network, it is possible that all the computers have the same public IPv4 address (or a limited range of public IP addresses). An extra check on oc avoids issues in these cases.
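The extra check on oc can be pictured as server-side bookkeeping along these lines. This is a sketch under assumptions: the function name tagsReachingQuorum and the message shape are hypothetical, and it reuses the same threshold k for the number of distinct oc values, which the text does not specify.

```javascript
// Hypothetical server-side bookkeeping: group incoming messages by tag and
// return the tags that were reported from at least k distinct `oc` values.
function tagsReachingQuorum(messages, k) {
  const octetsByTag = new Map(); // tag -> Set of distinct oc values
  for (const { tag, oc } of messages) {
    if (!octetsByTag.has(tag)) octetsByTag.set(tag, new Set());
    octetsByTag.get(tag).add(oc);
  }
  return [...octetsByTag]
    .filter(([, octets]) => octets.size >= k)
    .map(([tag]) => tag);
}
```

Three messages for the same tag that all carry the same oc count only once, so a URL shared within a single office network would not reach the quorum on its own.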
We only apply quorum to URLs of message type page that do not have a qr field and that have either a path longer than / or a query string, e.g. https://github.com/ would not be subjected to quorum whereas https://github.com/solso would be.
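The rule above can be expressed as a small predicate. The flattened message shape (type, url, qr) and the function name isSubjectToQuorum are assumptions for this sketch; the real message layout may differ.

```javascript
// Hypothetical predicate: does this message need to pass the quorum check?
// Assumes a flattened message of the form { type, url, qr }.
function isSubjectToQuorum(msg) {
  if (msg.type !== "page" || msg.qr !== undefined) return false;
  const url = new URL(msg.url);
  // pathname is "/" for a bare domain; search is "" when there is no query.
  return url.pathname.length > 1 || url.search.length > 1;
}
```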
The URL is declared not-private when: 1) double fetch validation passes, 2) URL is standard enough so that it does not look like a capability URL, and 3) the quorum check also passes.
At this point we are almost done with the analysis but there are some additional steps.
We do one last additional double fetch (let's call it triple fetch) on a forcefully cleaned-up version of the URL, in which we remove either the query string (if it exists) or the last segment of the URL path. For example, a URL in the address bar could look like this,
http://high-tech-gruenderfonds.de/en/cognex-acquires-3d-vision-company-enshape/?utm_source=CleverReach&utm_medium=email&utm_campaign=COGNEX+ACQUIRES+3D+VISION+COMPANY+EnShape&utm_content=Mailing_10747492
The query string might contain some redundant data that could be used as an implicit UID, or perhaps the data in the query string is needed. There is no easy way to tell, but we can test it.
If we remove the query string utm_source=CleverReach&utm_medium=email&utm_campaign=COGNEX+ACQUIRES+3D+VISION+COMPANY+EnShape&utm_content=Mailing_10747492 the signature of the page does not change, so it is safer to send the cleaned-up version of the URL,
http://high-tech-gruenderfonds.de/en/cognex-acquires-3d-vision-company-enshape/
Incidentally, it is also the canonical URL,
<link rel="canonical" href="http://high-tech-gruenderfonds.de/en/cognex-acquires-3d-vision-company-enshape/" />
We always prefer to send the minimal version of the URL to minimize the risk of sending data that could be exploited as implicit UIDs.
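The clean-up step that produces the candidate URL for the triple fetch can be sketched as follows. The function name forcefullyCleanURL is hypothetical; the sketch only builds the candidate, and the decision to use it still depends on comparing page signatures as described above.

```javascript
// Hypothetical sketch: drop the query string if there is one, otherwise
// drop the last segment of the path.
function forcefullyCleanURL(rawUrl) {
  const url = new URL(rawUrl);
  if (url.search !== "") {
    url.search = "";
  } else {
    const segments = url.pathname.split("/").filter((s) => s.length > 0);
    segments.pop();
    url.pathname = "/" + segments.join("/");
  }
  return url.href;
}
```

Applied to the example URL above, the query string is removed and the result coincides with the canonical URL.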
We would like to emphasize that the signature of the page is not the signature from the original render to the user but the signature of the content of the double or triple fetch.
The triple fetch was introduced later than the quorum check, but it is executed before it.
At this point the message of type page is ready to be sent. We apply one last check to sanitize the message (WebDiscoveryProject.msgSanitize takes care of these last steps), such as:
- remove the referral if it does not pass WebDiscoveryProject.isSuspiciousURL, mask it using WebDiscoveryProject.maskURL otherwise,
- remove any continuation (payload.c) since we do not really want to send this information,
- check title with WebDiscoveryProject.isSuspiciousTitle, if it contains long numbers, emails, etc. the whole message will be dropped,
- make sure the payload.url has been set to the cleanest URL available (the original, canonical, the forcefully cleaned URL on the triple fetch),
- etc.
Most of the checks in msgSanitize are redundant, as they are already taken care of in other parts of the code. However, having one last centralized place to do sanity checks is highly advisable.
The lifecycle of this message type involves persistent storage, so we must be careful not to create a parallel history of the user's browsing.
The local storage of the extension (with key usafe) stores a record for each visited URL that is not already flagged as private, together with the signature of the page. The record is created when the user visits the web page and removed once it has been processed by double fetch, on average after about 20 minutes. If double fetch fails for network reasons, it is retried 3 times; if all 3 attempts fail, the URL is considered private and the record is removed, so that the URL of the web page is not left orphaned in the database.
URLs flagged as private need to be kept forever, otherwise we would do unnecessary double fetch processing. To maintain the privacy of private URLs we do not store them as plain text; instead, we store the truncated MD5 in a bloom filter,
var hash = (md5(url)).substring(0,16);
WebDiscoveryProject.bloomFilter.testSingle(hash);
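The bloom filter gives the user plausible deniability against anyone who wants to prove that the user visited a certain URL: since bloom filters admit false positives, a match does not prove that the URL was actually visited.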
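For readers unfamiliar with the data structure, here is a minimal bloom filter sketch. It is not the WDP implementation: the class, its parameters, and the use of FNV-1a (chosen only to keep the example dependency-free) are assumptions, whereas WDP stores a truncated MD5 as shown above.

```javascript
// Seeded 32-bit FNV-1a hash; an assumption of this sketch, not WDP's hash.
function fnv1a(str, seed) {
  let h = (0x811c9dc5 ^ seed) >>> 0;
  for (let i = 0; i < str.length; i++) {
    h ^= str.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0;
  }
  return h;
}

// Minimal bloom filter: membership tests can return false positives,
// but never false negatives.
class BloomFilter {
  constructor(numBits, numHashes) {
    this.numBits = numBits;
    this.numHashes = numHashes;
    this.bits = new Uint8Array(Math.ceil(numBits / 8));
  }

  // Bit positions for an item, one per seeded hash.
  positions(item) {
    const out = [];
    for (let i = 0; i < this.numHashes; i++) {
      out.push(fnv1a(item, i) % this.numBits);
    }
    return out;
  }

  add(item) {
    for (const p of this.positions(item)) this.bits[p >> 3] |= 1 << (p & 7);
  }

  test(item) {
    return this.positions(item).every(
      (p) => (this.bits[p >> 3] & (1 << (p & 7))) !== 0
    );
  }
}
```

Only bit positions are stored, so the original URLs cannot be listed from the filter; one can only ask whether a given candidate might have been added.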
The same bloom filter technique is used in the quorum validation check to know whether the URL has been visited in the last 30 days. In this case, an array of bloom filters (WebDiscoveryProject.quorumBloomFilters) is needed, one filter per calendar day, so that we can discard filters older than 30 days.
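The per-day arrangement can be sketched as a rolling window. In this illustration a plain Set stands in for each daily bloom filter, and the class name RollingDailyFilters and its methods are hypothetical; it is only meant to show how old days are discarded.

```javascript
// Rolling window of per-day filters. A Set stands in for each bloom filter.
class RollingDailyFilters {
  constructor(maxDays = 30) {
    this.maxDays = maxDays;
    this.byDay = new Map(); // day number -> Set of hashed URLs
  }

  // Whole days elapsed since the Unix epoch (UTC).
  static dayNumber(date) {
    return Math.floor(date.getTime() / 86400000);
  }

  add(hash, date) {
    const day = RollingDailyFilters.dayNumber(date);
    if (!this.byDay.has(day)) this.byDay.set(day, new Set());
    this.byDay.get(day).add(hash);
    this.prune(day);
  }

  // Drop every daily filter that is maxDays or more older than `today`.
  prune(today) {
    for (const day of this.byDay.keys()) {
      if (today - day >= this.maxDays) this.byDay.delete(day);
    }
  }

  seenRecently(hash, date) {
    const today = RollingDailyFilters.dayNumber(date);
    for (const [day, filter] of this.byDay) {
      const age = today - day;
      if (age >= 0 && age < this.maxDays && filter.has(hash)) return true;
    }
    return false;
  }
}
```

Keeping one filter per day makes expiry cheap: an entire day is dropped at once, which matters because individual items cannot be removed from a bloom filter.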
Messages of type page are susceptible to very limited record linkage due to the fields qr and ref.
For instance, the field ref can be used to probabilistically link two or more messages of type page, but those sessions are bound to be small since ref is forced to pass WebDiscoveryProject.isSuspiciousURL and, if not suspicious, it will also be masked by WebDiscoveryProject.maskURL. So, in practice, referrals are kept at a very general level. For instance,
WebDiscoveryProject.maskURL('http://high-tech-gruenderfonds.de/en/cognex-acquires-3d-vision-company-enshape/?utm_source=CleverReach&utm_medium=email&utm_campaign=COGNEX+ACQUIRES+3D+VISION+COMPANY+EnShape&utm_content=Mailing_10747492')
would yield,
"http://high-tech-gruenderfonds.de/ (PROTECTED)"
which will be the URL finally used as ref. That still allows for some probabilistic record linkage, but the resulting session would be extremely small.
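The masking in the example above can be approximated with the sketch below. It assumes the simplest behavior consistent with that example, namely always reducing the URL to its scheme and host; the function name maskToOrigin is hypothetical and the real WebDiscoveryProject.maskURL may apply additional rules.

```javascript
// Hypothetical approximation of the masking shown in the example:
// keep only scheme and host, and append a "(PROTECTED)" marker.
function maskToOrigin(rawUrl) {
  const url = new URL(rawUrl);
  return `${url.protocol}//${url.hostname}/ (PROTECTED)`;
}
```

Because the path and query string are discarded, every referral from the same site collapses to the same value, which is what keeps the linkable session small.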
A similar argument goes for the field qr. The query in qr.q can be used to link the message of type query to the page type message that should follow. This type of two-record session is in fact harmless, since it does not provide any additional information that was not already contained in one of the messages.
Web Discovery Project is not a closed system; it is constantly evolving to offer the maximum privacy guarantees to the users whose data is collected.
We firmly believe that this methodology is a major step forward from the typical server-side aggregation used by the industry, which leaves the user only the promise of privacy (the trust model). With the Web Discovery Project approach, we mitigate the risk of gathering information that we would rather not have. The risks of privacy leaks are close to zero, although there is no formal proof of privacy. We would never be able to know things like the list of queries a particular person has done in the last year. This is not because our policy on security and privacy prevents us from doing so, but because it cannot be done: it is not technically possible, even if we were forced to do so. In our opinion, the Web Discovery Project is a major shift in the way data is collected.