-
Notifications
You must be signed in to change notification settings - Fork 3.1k
Enable overriding undocumented telemetry identifier with PIP_TELEMETRY_USER_AGENT_ID #13560
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
It is not accurate that commit was pushed directly to master, it was submitted as a PR and reviewed #9987 (I'm not a pip committer and couldn't have pushed to master if I wanted) |
I'm extremely sorry. I will delete that immediately. I pinged you because I found it very confusing given the other work I've seen from you. |
I will review those PR comments now. I'm incredibly sorry once again and thanks so much for your very patient and thoughtful response. |
Ok, I have deleted the revert and retained the changes of #9987. Thanks so much @alex for linking me there. This now just proposes a separate environment variable to override the I can't reproduce the test failure and I am confident my changes continue to produce valid JSON for |
@cosmicexplorer FYI I fixed a typo in your post
I don't really understand why the While I see there being other motivations for wanting to override the telemetry part of the user agent string, I'm not sure I quite follow this reasoning. I wonder if it makes sense to appropriate #13038 or make a new issue where we can discuss the reasons and design for overriding the telemetry, rather than the technical issues of the PR. Overall, I am currently neither for or against this feature, but I will leave a quick review on an immediate technical issue I see with your code. |
def user_agent_id() -> str: | ||
return Telemetry.calculate_user_agent_id(os.environ) | ||
|
||
CLOBBER_USER_AGENT_ENV_VAR: ClassVar[str] = "PIP_TELEMETRY_USER_AGENT_ID" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Whatever the option is names or becomes please make this a regular pip option, e.g. --telemetry-user-agent-id
, it will therefore be accessible via command line, environmental variable (as PIP_TELEMETRY_USER_AGENT_ID
), and configuration files.
You can see recent PRs #13534 and #13520 for adding new pip options.
We now introduce `PIP_TELEMETRY_USER_AGENT_ID`, which will completely overwrite the | ||
string transmitted to remote hosts that pip communicates it. To maintain backwards | ||
compatibility, it is disabled by default, but can be set to the empty string or any | ||
other value. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am unconvinced by this design over having an option like "disable telemetry", what is the value in users being able to replace the user string with a static arbitrary string?
FIXME: This string is currently propagated as a header into every single HTTP | ||
request pip makes. It should really be subject to a formal PEP process | ||
in order to ensure user consent around telemetry is respected. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please propose this in https://discuss.python.org/c/packaging/14, not in a pip doc string. It's not clear to me that everyone will agree with this.
Environment markers can therefore be viewed as a language for codebases to | ||
communicate their build process and requirements | ||
(https://packaging.python.org/en/latest/flow/#the-packaging-flow) to a generic | ||
resolver. Spack specs | ||
(https://spack.readthedocs.io/en/latest/spec_syntax.html#sec-specs) | ||
are very similar in spirit, and arose precisely to codify the recursive | ||
and conditional relationships across multi-language software stacks. | ||
|
||
Extending this concept, we could consider the *user agent id* as a language for | ||
resolvers to communicate their requirements to external services and | ||
execution environments. For example, clients for the GNU Make Jobserver protocol use | ||
an environment variable to indicate how the jobserver should communicate to them, | ||
and to indicate the maximum bandwidth they can tolerate for parallel execution: | ||
(https://www.gnu.org/software/make/manual/html_node/Job-Slots.html). | ||
|
||
The pants build tool specifically highlights the risk of exposing proprietary | ||
information through thoughtless telemetry (https://www.pantsbuild.org/stable/docs/using-pants/anonymous-telemetry). | ||
As a result, not only do they explicitly specify the information being recorded | ||
(https://www.pantsbuild.org/stable/docs/using-pants/anonymous-telemetry#what-data-is-sent), | ||
but they additionally incorporate anonymity as an explicit design goal | ||
(https://www.pantsbuild.org/stable/docs/using-pants/anonymous-telemetry#how-we-ensure-anonymity). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems all quite opinionated to be in a docstring, and doesn't recognize a malicious index could extract most of the information from a user by simply observing what wheels are downloaded.
Negotiating standards around useful telemetry data for PyPI began here | ||
(https://github.com/pypa/pip/issues/5499), but never became a full PEP. This commit | ||
(https://github.com/pypa/pip/commit/f787788a65cf7a8af8b7ef9dc13c0681d94fff5f) added | ||
the output of an arbitrary subprocess execution into the string pip attaches to |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not an "arbitrary" subprocess execution, it's an explicitly named subprocess execution.
# because some CI systems mimic a tty (e.g. Travis CI). Thus that | ||
# method doesn't provide definitive information in either direction. | ||
return any(name in os.environ for name in CI_ENVIRONMENT_VARIABLES) | ||
_rustc_output_regex: ClassVar[re.Pattern[str]] = re.compile(r"^rustc ([^\s]+)") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If there is a concern about what the rustc
execution could be sending to the index then I am willing to volunteer reviewing a separate PR that adds more strict validation (presumably based on rustc documentation).
But as best as I can tell this is just a regex implementation of the prior code:
if rustc_output.startswith(b"rustc "):
# The format of `rustc --version` is:
# `b'rustc 1.52.1 (9bc8c42bb 2021-05-09)\n'`
# We extract just the middle (1.52.1) part
data["rustc_version"] = rustc_output.split(b" ")[1].decode()
IMO the prior code had multiple advantages:
- It documents it's goal
- I don't have to understand a regex to understand the code
- It didn't require decoding the whole string (which if we don't trust the string seems better?)
I would at least want the documented goal of this back.
Fixes #13038.
The commit f787788 by @alex performed a
PATH
traversal and subsequent process execution (the output ofrustc --version
) in order to add some more information to theUser-Agent
request header that pip provides to its remote HTTP requests.f787788 makes reference to information valuable to package maintainers, which appears to be propagated to PyPI's BigQuery output: #9987 (comment) (thanks so much to Alex for explaining this to me -- I'm not sure why the commit message didn't have a link to his PR discussion).
I had previously proposed removing the
rustc --version
call, but that has been removed and this PR now just enables overriding the telemetry identifier, as in title.A discussion of standardized telemetry data began to form in #5499, but failed to graduate into a full PEP. Given the very high quality of other Python packaging standards, I would like to propose that this one also be made into a full-on standard, probably somewhere around the Simple Repository API page.
This PR does two things:
PIP_TELEMETRY_USER_AGENT_ID
, an environment variable which completely overrides the value used to form pip'sUser-Agent
header.