# Setting up a development environment
I have to start with one of the hardest chapters of the book, how to set up a
development environment for Python.
If you are already using Python, you already have an environment set up and
might be familiar with `pyenv` or other similar tools. I would still suggest you
read this chapter and see if you agree with how I approach this issue. You may
want to adapt your current workflow to it, or keep on doing what you’ve been
doing up until now; it is up to you. If you’re completely new to Python, then you
definitely need to read this chapter, but also, I need to remind you that this
is not a book about Python per se. So I won’t be teaching you any Python (I
wouldn’t really be competent to do so either) and you might want to complement
reading this book with another that focuses on actually teaching Python.
Remember, this book is about building reproducible analytical pipelines!
## Why is installing Python such a hard problem?
If you google "how to install Python" you will find a surprising number of
articles explaining how to do it. I say "surprising number" because one might
expect to install Python like any other piece of software. If you’re already
familiar with R, you could think that installing Python would be done the same
way: download the installer for your operating system, and then install it. And,
actually, you can do just that for Python as well. So why are there hundreds of
articles online explaining how to install Python, and why aren’t all of these
articles simply telling you to download the installer to install Python? Why
did I write this chapter on installing Python?
Well, there are several things that we need to deal with if we want to install
and use Python the "right way". First of all, Python is pre-installed on Linux
distributions and older versions of macOS. So if you’re using one of these
operating systems, you could use the built-in Python interpreter, but this is
not recommended. The reason being that these bundled versions are generally
older, and that you don’t control their upgrade process, as these get updated
alongside the operating system. On Windows and newer versions of macOS, Python
is, as far as I know, never bundled, so you’d need to install it anyways.
Another thing that you need to consider is that newer Python versions can
introduce breaking changes, making code written for an earlier version of Python
not run on a newer version of Python. This is not a Python-specific issue: it
happens with any programming language. So this means that ideally you would want
to bundle a Python version with your project’s code.
The same holds true for individual packages: newer versions of packages might
not even work with older releases of Python. So, to avoid any issues, an analysis
should get bundled with a specific Python release and specific versions of the
Python packages it uses. This bundle is what
I call a development environment, and in order to build such a development environment,
specific tools have to be used. And there are a lot of these tools in the Python
ecosystem... so much so that when you’re first starting, you might get lost.
So here are the two tools that I use for this, and that I think work quite well: `pyenv`
and `pipenv`.
## First step: installing Python
In this section, I will assume that you have no version of Python already
installed on your computer. If you do have Python, you may want to skip this
section: but know that if you’re not using `pyenv` to download and install
Python versions, you will need to keep installing them manually.
`pipenv`, however, can install the Python version a project requires
automatically if `pyenv` is available, so you might want to switch over to `pyenv`.
So first of all, let’s install `pyenv`. `pyenv` is a tool that allows you to install
as many versions of Python as you need, and that does not depend on Python
itself, so there’s no bootstrap problem. But it is only available for Linux
and macOS; Windows users should install `pyenv-win` instead. The following subsection
deals with installing `pyenv` on Linux and macOS, the next one with installing
`pyenv-win` for Windows.
### `pyenv` for Linux or macOS
I recommend you read and follow the [official
instructions](https://github.com/pyenv/pyenv?tab=readme-ov-file#installation)^[https://github.com/pyenv/pyenv?tab=readme-ov-file#installation],
but I provide here the main steps. In case something doesn’t work as expected,
refer to the official documentation linked previously. First, install the build
dependencies by following the [instructions
here](https://github.com/pyenv/pyenv/wiki#suggested-build-environment)^[https://github.com/pyenv/pyenv/wiki#suggested-build-environment].
On a Linux distribution like Ubuntu, this means installing a bunch of
development libraries from your usual package manager. On macOS, this means
installing Xcode and also a bunch of packages with Homebrew. Also, make sure you
have installed Git. We will be using Git later in the book too, but it
is also required to install `pyenv`.
Then, on macOS, you can use Homebrew to install `pyenv`:
```bash
brew update
brew install pyenv
```
For Linux distributions, run the automatic installer by opening a terminal and
executing this line here:
```bash
curl https://pyenv.run | bash
```
Now comes the more complex part of the installation process. You need to edit
some files for your terminal, and these files are different depending on which
shell you use. For most Linux distributions, if not all, the default terminal
will use bash, and on macOS the default shell is zsh. The detailed
instructions, which I recommend you read, are
[here](https://github.com/pyenv/pyenv?tab=readme-ov-file#set-up-your-shell-environment-for-pyenv)^[https://github.com/pyenv/pyenv?tab=readme-ov-file#set-up-your-shell-environment-for-pyenv],
but here are my recommendations: if you’re using a Linux distribution, you
are likely using bash. To find out, run `echo $0` in a terminal. If your
terminal is using bash, "bash" will get printed; if it is using zsh, "zsh" will
get printed instead. If your terminal is using bash, run the following in a terminal:
```bash
echo 'export PYENV_ROOT="$HOME/.pyenv"' >> ~/.bashrc
echo 'command -v pyenv >/dev/null || export PATH="$PYENV_ROOT/bin:$PATH"' >> ~/.bashrc
echo 'eval "$(pyenv init -)"' >> ~/.bashrc
```
and then, you need to check whether you have a `.profile`, `.bash_profile` or a
`.bash_login` file in your home directory. These are files used at startup by
bash to set some variables and options. They’re similar to the `.bashrc` file,
but the `.bashrc` is executed when starting a terminal when you’re already
logged into the system, while `.bash_profile`, `.bash_login` and `.profile` get
executed by login shells, for example when logging in through ssh into the
system. If you have a `.profile` file in your home directory, this should print
its path:
```bash
ls ~/.profile
```
If you get a "No such file or directory" error, you might have a `.bash_profile` or
`.bash_login` file instead. Try with:
```bash
ls ~/.bash_profile
```
or
```bash
ls ~/.bash_login
```
If you have more than one, only add the lines to the `.bash_profile`:
```bash
echo 'export PYENV_ROOT="$HOME/.pyenv"' >> ~/.bash_profile
echo 'command -v pyenv >/dev/null || export PATH="$PYENV_ROOT/bin:$PATH"' >> ~/.bash_profile
echo 'eval "$(pyenv init -)"' >> ~/.bash_profile
```
but if you only have a `.bash_login` file, run this:
```bash
echo 'export PYENV_ROOT="$HOME/.pyenv"' >> ~/.bash_login
echo 'command -v pyenv >/dev/null || export PATH="$PYENV_ROOT/bin:$PATH"' >> ~/.bash_login
echo 'eval "$(pyenv init -)"' >> ~/.bash_login
```
or if you only have a `.profile` file, or none of these three files at all, run these lines:
```bash
echo 'export PYENV_ROOT="$HOME/.pyenv"' >> ~/.profile
echo 'command -v pyenv >/dev/null || export PATH="$PYENV_ROOT/bin:$PATH"' >> ~/.profile
echo 'eval "$(pyenv init -)"' >> ~/.profile
```
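The rule for choosing which login file to edit is easy to get wrong, so here is a minimal Python sketch of it. It only encodes the rule as I stated it above (more than one file, or only `.bash_profile`: use `.bash_profile`; only `.bash_login`: use `.bash_login`; otherwise `.profile`). The function name `pick_bash_login_file` is made up for this illustration, and the function does not edit anything, it only tells you which file the lines would go into.

```python
from pathlib import Path

# The three login files discussed above.
LOGIN_FILES = (".bash_profile", ".bash_login", ".profile")


def pick_bash_login_file(home: Path) -> Path:
    """Return the login file the pyenv lines should be appended to.

    Hypothetical helper: it mirrors the rule described in the text,
    it is not part of pyenv.
    """
    existing = [name for name in LOGIN_FILES if (home / name).is_file()]
    if len(existing) > 1 or existing == [".bash_profile"]:
        return home / ".bash_profile"
    if existing == [".bash_login"]:
        return home / ".bash_login"
    # Only `.profile` exists, or none of the three files exists.
    return home / ".profile"
```

Calling `pick_bash_login_file(Path.home())` returns the path to use on your own machine.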
If you’re running zsh (which would be the case on macOS, or if you changed from
bash to zsh on Linux), then you need to add these lines to the `.zshrc` file:
```bash
echo 'export PYENV_ROOT="$HOME/.pyenv"' >> ~/.zshrc
echo '[[ -d $PYENV_ROOT/bin ]] && export PATH="$PYENV_ROOT/bin:$PATH"' >> ~/.zshrc
echo 'eval "$(pyenv init -)"' >> ~/.zshrc
```
and also to `.zprofile` or `.zlogin`, depending on which you have (if you have none,
put the lines into `.zprofile`):
```bash
echo 'export PYENV_ROOT="$HOME/.pyenv"' >> ~/.zprofile
echo '[[ -d $PYENV_ROOT/bin ]] && export PATH="$PYENV_ROOT/bin:$PATH"' >> ~/.zprofile
echo 'eval "$(pyenv init -)"' >> ~/.zprofile
```
If you’re using another shell, such as fish, you can check out the instructions
on `pyenv`’s GitHub repository, or simply add these lines in the equivalent
startup files.
Finally, also add this line to your `.bashrc` or `.zshrc`:
```bash
eval "$(pyenv virtualenv-init -)"
```
to load `pyenv-virtualenv` automatically (as stated by the installer).
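Once the startup files are edited and a new terminal is open, it can be useful to confirm that `pyenv` is actually found on the `PATH`. The following is a small sketch using `shutil.which` from the Python standard library, run with any Python you already have. The helper names `find_tool` and `report` are mine, not something `pyenv` provides.

```python
import shutil
from typing import Optional


def find_tool(name: str, path: Optional[str] = None) -> Optional[str]:
    """Return the full path of `name` if it is found on PATH, else None."""
    # shutil.which searches `path`, or the PATH environment variable when
    # `path` is None, for an executable called `name`.
    return shutil.which(name, path=path)


def report(name: str = "pyenv") -> str:
    """Build a one-line message saying whether `name` was found."""
    location = find_tool(name)
    if location is None:
        return f"{name} not found: check the lines added to your startup files"
    return f"{name} found at {location}"


print(report())
```

If the message says that `pyenv` was not found, the most likely cause is that the lines were added to a file your shell does not read, or that the terminal was not restarted.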
### `pyenv-win` for Windows
If you’re using Windows, you should install `pyenv-win` instead, which you can
find
[here](https://github.com/pyenv-win/pyenv-win)^[https://github.com/pyenv-win/pyenv-win].
Installation on Windows is similar to how you install `pyenv` on Linux, if you
choose the PowerShell method. It’s simply running this line (see
[here](https://github.com/pyenv-win/pyenv-win/blob/master/docs/installation.md#powershell)^[https://github.com/pyenv-win/pyenv-win/blob/master/docs/installation.md#powershell])
in PowerShell:
```powershell
Invoke-WebRequest -UseBasicParsing -Uri "https://raw.githubusercontent.com/pyenv-win/pyenv-win/master/pyenv-win/install-pyenv-win.ps1" -OutFile "./install-pyenv-win.ps1"; &"./install-pyenv-win.ps1"
```
Open a new PowerShell prompt and run `pyenv`. If you see a "command not found"
error, you need to manually add `pyenv` to the PATH. On Windows, this is done
graphically. Click on the start menu, and look for "Edit environment variables
for your account".
```{r, echo = F, out.width="300px"}
#| fig-cap: "Look for 'Edit environment variables for your account'"
knitr::include_graphics("images/edit_user_envvars.png")
```
Then, click on the "New..." button, and add the `PYENV`, `PYENV_HOME` and
`PYENV_ROOT` variables. Make sure to adjust the paths to the ones on your machine.
```{r, echo = F, out.width="300px"}
#| fig-cap: "Make sure your environment variables are correctly set"
knitr::include_graphics("images/pyenv-win-envvars.png")
```
Also check the
[FAQ](https://github.com/pyenv-win/pyenv-win/wiki)^[https://github.com/pyenv-win/pyenv-win/wiki]
if you need further details.
### Using pyenv
Now that you have `pyenv` installed, we can use it to install as many different
Python versions as we need, typically one per project. First, I’ll install a
version of Python that I want to use. Let’s go with the latest available. Let’s
see what is available:
```bash
pyenv install -l
```
As of writing, the latest version that is not in beta is `3.10.5`, so let’s
install this one (if you’re reading this from the future and a new version of
Python is available, install `3.10.5` nonetheless):
```bash
pyenv install 3.10.5
```
This will install Python `3.10.5`. But how do we use it now? Let’s suppose that
we want to use it for a project. Let’s create a folder that will contain the
different scripts that we are going to write in the next chapter and call it
`housing`.
Open a terminal inside that folder, or open a terminal and `cd` into that
folder:
```bash
cd ~/Documents/housing
```
(depending on your operating system, you could first open the folder, then right
click somewhere empty and then click on *Open in terminal*, which would open a
terminal in that folder). We would like to use Python `3.10.5` for this project,
and we would like to do so each time we interact with the files in that folder.
We need to set version `3.10.5` as the *local* Python version, meaning, the
Python version that will be used for this project only:
```bash
pyenv local 3.10.5
```
This will generate a `.python-version` file with the following line in it:
```bash
3.10.5
```
This file will only be used by `pyenv`, not `pipenv`, but it’s nice to keep
it as a reminder of which Python version you actually need for this project.
`pipenv` specifies the Python version in its own file, which I will discuss
in the next section.
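Since `.python-version` is just a text file, a script can read it and compare it to the interpreter that is actually running. Here is a minimal sketch of that idea. The helper names are hypothetical, and I assume the file contains a single version such as `3.10.5` on its first line.

```python
import platform
from pathlib import Path
from typing import Optional


def read_pinned_version(project_dir: Path) -> str:
    """Return the version string stored in `.python-version`."""
    text = (project_dir / ".python-version").read_text(encoding="utf-8")
    # Assumption: one version per line; this sketch keeps only the first.
    return text.split()[0]


def running_pinned_python(project_dir: Path, running: Optional[str] = None) -> bool:
    """True when the running interpreter is exactly the pinned version."""
    if running is None:
        # platform.python_version() returns a string such as "3.10.5".
        running = platform.python_version()
    return read_pinned_version(project_dir) == running
```

Inside the `housing` folder, `running_pinned_python(Path("."))` should be `True` when the interpreter started by `python` is the one `pyenv` selected for the project.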
## Second step: pipenv
We can now use our local Python interpreter to install `pipenv`:
```bash
pip install --user pipenv
```
Right after installing `pipenv`, your terminal might print this message:
```bash
WARNING: The script virtualenv is installed in '/home/USER/.local/bin'
which is not on PATH.
Consider adding this directory to PATH or, if you prefer to suppress this warning,
use --no-warn-script-location.
WARNING: The scripts pipenv and pipenv-resolver are installed in '/home/cbrunos/.local/bin'
which is not on PATH.
Consider adding this directory to PATH or, if you prefer to suppress this warning,
use --no-warn-script-location.
```
This is important, because if you don't add that folder to your `PATH`, `pipenv`
won't work. Open your `.bashrc` or `.zshrc` in your favorite editor and add
this line:
```bash
export PATH="$HOME/.local/bin:$PATH"
```
Close the terminal (and every other open terminal if necessary), and open a new one.
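To double check that the change took effect in the new terminal, you can inspect the `PATH` variable itself. The sketch below splits a `PATH`-style string into its entries and looks for the directory from the warning. `is_on_path` is a made-up helper name, and `~/.local/bin` is the location from the warning above, which may differ on your system.

```python
import os


def is_on_path(directory: str, path_value: str) -> bool:
    """True if `directory` is one of the entries of a PATH-style string."""
    target = os.path.normpath(os.path.expanduser(directory))
    # PATH entries are separated by os.pathsep (":" on Linux and macOS).
    entries = [entry for entry in path_value.split(os.pathsep) if entry]
    return any(os.path.normpath(entry) == target for entry in entries)


user_bin = os.path.expanduser("~/.local/bin")
print(user_bin, "on PATH:", is_on_path(user_bin, os.environ.get("PATH", "")))
```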
We can now use `pipenv` to install the packages for our project. But why don’t
we just keep using `pip` instead of `pipenv`? Why introduce yet another tool? In
my opinion, `pipenv` has one absolutely crucial feature for reproducibility:
`pipenv` enables deterministic builds, which means that when using `pipenv`, we
will *always* get exactly the same packages installed.
"But isn’t that exactly what using a `requirements.txt` file does?" you wonder.
You are not entirely wrong. After all, if you make the effort to specify the
packages you need, and their versions, wouldn’t running `pip install -r
requirements.txt` also install exactly the same packages?
Well, not quite. Imagine for example that you need a package called `hello`
and you put it into your `requirements.txt` like so:
```bash
hello==1.0.1
```
Suppose that `hello` depends on another package called `ciao`. If you run `pip
install -r requirements.txt` today, you’ll get `hello` at version `1.0.1` and
`ciao`, say, at version `0.3.2`. But if you run `pip install -r
requirements.txt` in 6 months, you would still get `hello` at version `1.0.1`
but you might get a newer version of `ciao`. This is because `ciao` itself is
not specified in the `requirements.txt`, unless you made sure to add it (and
then also add its dependencies, and their dependencies...). `pipenv` takes care
of this for you, and adds further security checks by comparing sha256 hashes
from the lock file to the ones from the downloaded packages, making sure that
you are actually installing what you believe you are.
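To make the `hello` and `ciao` example concrete, here is a toy simulation in Python. It is not how `pip` or `pipenv` resolve dependencies. The "package index" is a plain dictionary, `resolve` is a hypothetical function, and version `0.4.0` of `ciao` is invented to stand for "a newer version in 6 months". The point is only to show that pinning the top-level package is not enough, while recording the full resolution (what a lock file does) is.

```python
# Toy package index: name -> {version: names of its dependencies}.
# Versions are tuples so that max() picks the newest one.
INDEX_TODAY = {
    "hello": {(1, 0, 1): ["ciao"]},
    "ciao": {(0, 3, 2): []},
}
INDEX_IN_SIX_MONTHS = {
    "hello": {(1, 0, 1): ["ciao"]},
    "ciao": {(0, 3, 2): [], (0, 4, 0): []},  # 0.4.0 is invented
}


def resolve(index, pins):
    """Pick one version per package: the pinned one if any, else the newest."""
    resolved = {}
    todo = list(pins)
    while todo:
        name = todo.pop()
        if name in resolved:
            continue
        version = pins.get(name) or max(index[name])
        resolved[name] = version
        # Dependencies of the chosen version must be resolved as well.
        todo.extend(index[name][version])
    return resolved


requirements = {"hello": (1, 0, 1)}  # what the requirements.txt pins
today = resolve(INDEX_TODAY, requirements)
later = resolve(INDEX_IN_SIX_MONTHS, requirements)

lock = dict(today)  # a lock file records every resolved version
later_with_lock = resolve(INDEX_IN_SIX_MONTHS, lock)

print(today["ciao"], later["ciao"], later_with_lock["ciao"])
```

With only `hello` pinned, the version of `ciao` changes between the two runs. With the full resolution recorded, the later run reproduces today's environment.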
Now that `pipenv` is installed, let’s start using it to install the packages we
need for our project:
```bash
pipenv --python 3.10.5 install beautifulsoup4==4.12.2 pandas==2.1.4 plotnine==0.12.4 skimpy==0.0.11
```
(Little sidenote: in Chapter 2, we will be working together on a project using
real data with `polars` instead of `pandas`. But for now, let’s get `pandas`
installed: I’ll do a short comparison of the two, and then we’ll keep on using
`polars` for the remainder of the book.)
As you can see, I chose to install specific versions of packages. This is
because I want you to follow along with the same versions as in the book. You
could remove the `==x.y.z` strings from the command above to install the latest
versions of the packages available if you prefer, but then there would be no
guarantee that you would find the same results as I do in the remainder of the
book. Also, if you don’t specify versions, the `Pipfile` file won’t specify them
either, only the `Pipfile.lock` will, which in certain cases could lead to
undesired behaviour. I won’t go into specifics, because I highly recommend you
always take the time to specify package versions when installing. Check package
versions on PyPI. For example,
[here](https://pypi.org/project/pandas/)^[https://pypi.org/project/pandas/] is
the page to check out versions of the pandas package.
You should now see two new files in the `housing` folder, `Pipfile` and
`Pipfile.lock`. Start by opening `Pipfile`; it should look like this:
```
[[source]]
url = "https://pypi.org/simple"
verify_ssl = true
name = "pypi"
[packages]
beautifulsoup4 = "==4.12.2"
pandas = "==2.1.4"
plotnine = "==0.12.4"
skimpy = "==0.0.11"
[dev-packages]
[requires]
python_version = "3.10"
python_full_version = "3.10.5"
```
I think that this file is pretty self-explanatory: the packages being used for this
project are listed alongside their versions. The exact Python version is also
listed. Sometimes, depending on how you set up the project, it could happen that
the Python version listed is not the same, or that only the field
`python_version=` is listed. In this case, I highly recommend you change it to
match the same version as in the `.python-version` file that was generated by
`pyenv`, and add the `python_full_version=` field if necessary, in order to
avoid any possible differences in how the code runs. If you edited the `Pipfile`
then you need to run `pipenv lock` to regenerate the `Pipfile.lock` file as well:
```bash
pipenv lock
```
This will make sure to also set the required/correct Python version in there.
If you open the `Pipfile.lock` in a text editor, you will see that it is a JSON
file and that it lists not only the dependencies of your project, but also the
dependencies’ dependencies. You will also notice several fields called `hashes`.
These are there for security reasons: whenever someone (or you in the future)
regenerates this environment, the packages will get downloaded and their
hashes will get compared to the ones listed in the `Pipfile.lock`. If they don’t
match, then something very wrong is happening and packages won’t get installed.
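The idea behind the hash comparison can be sketched in a few lines of Python with `hashlib`. This is only an illustration of the principle, not the code that `pipenv` or `pip` run. The archive contents are made up, the helper names are mine, and the tiny `lock` dictionary only imitates the layout I see in my `Pipfile.lock` (a `default` section, with a `hashes` list of `sha256:...` strings per package).

```python
import hashlib


def sha256_entry(data: bytes) -> str:
    """Format a digest the way the lock file lists it: 'sha256:<hex>'."""
    return "sha256:" + hashlib.sha256(data).hexdigest()


def is_allowed(data: bytes, package: str, lock: dict) -> bool:
    """True if the downloaded bytes match one of the locked hashes."""
    return sha256_entry(data) in lock["default"][package]["hashes"]


# Made-up archive contents and a minimal lock built around them.
archive = b"pretend this is the archive of hello 1.0.1"
lock = {
    "default": {
        "hello": {"hashes": [sha256_entry(archive)], "version": "==1.0.1"},
    }
}

print(is_allowed(archive, "hello", lock))
print(is_allowed(archive + b" tampered", "hello", lock))
```

Changing a single byte of the archive changes its sha256 digest, so the second check fails, which is the situation in which packages would not get installed.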
To check whether everything installed correctly, drop into the development shell using:
```bash
pipenv shell
```
and start the Python interpreter:
```bash
python
```
Then check that the correct versions of the packages were installed:
```{python, eval=F, python.reticulate=F}
import pandas as pd
pd.__version__
```
You should see `2.1.4` as the listed version.
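Checking packages one by one gets tedious, so here is a sketch that checks all four at once using `importlib.metadata` from the standard library (available since Python 3.8). The dictionary `EXPECTED` simply repeats the versions installed above, and the helper names are hypothetical. Run it from inside the `pipenv shell`, otherwise it will look at the packages of another Python installation.

```python
from importlib import metadata

# The versions requested with `pipenv install` earlier in this chapter.
EXPECTED = {
    "beautifulsoup4": "4.12.2",
    "pandas": "2.1.4",
    "plotnine": "0.12.4",
    "skimpy": "0.0.11",
}


def installed_version(name, lookup=metadata.version):
    """Return the installed version of `name`, or None if it is missing."""
    try:
        return lookup(name)
    except metadata.PackageNotFoundError:
        return None


def mismatches(expected, lookup=metadata.version):
    """Map each package whose version differs to (wanted, found)."""
    result = {}
    for name, wanted in expected.items():
        found = installed_version(name, lookup)
        if found != wanted:
            result[name] = (wanted, found)
    return result


print(mismatches(EXPECTED) or "all versions match")
```

The `lookup` parameter only exists to make the functions easy to test with a fake lookup function; in normal use you can ignore it.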
## Debugging
It can sometimes happen that `pipenv` does not use the right version of Python.
When that happens, running `pipenv shell` and then starting the Python interpreter
starts a different version of Python than the one you expect/need.
In that case, it might be best to simply start over. Start by completely
removing the virtual environment and packages by running:
```bash
pipenv --rm
```
Make sure that the correct version of Python (or rather, the version you need)
is specified in the `Pipfile`. For example, in my case:
```
[requires]
python_full_version = "3.10.5"
```
(you might instead see `python_version = "3.10"`, or any other version; in that
case I recommend you specify `python_full_version` as I did above. Check the
`.python-version` file that was generated by `pyenv` if you don’t remember the
right version).
Now, restore the environment and packages by running:
```bash
pipenv install
```
`pipenv install` will regenerate the `Pipfile.lock` file to list the required Python
version, and then install the packages. It essentially runs `pipenv lock` to generate
the lock file, and then `pipenv sync` to install the packages from the lock file.
Check whether `pipenv` now uses the correct Python version:
```bash
pipenv run python --version
```
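The same check can be done from Python by reading the required version from the `Pipfile.lock` and comparing it to the running interpreter. This is a sketch under one assumption that you should verify against your own lock file: that the `[requires]` section of the `Pipfile` is copied under `_meta` and then `requires` in the lock file. The function names are mine.

```python
import json
import platform


def required_python(lock_text: str) -> str:
    """Return the Python version recorded in a lock file, or '' if absent."""
    # Assumed layout: {"_meta": {"requires": {"python_full_version": ...}}}
    requires = json.loads(lock_text).get("_meta", {}).get("requires", {})
    # Prefer the exact version; fall back to the major.minor one.
    return requires.get("python_full_version") or requires.get("python_version", "")


def satisfies(running: str, required: str) -> bool:
    """'3.10.5' satisfies '3.10.5' and '3.10', but not '3.1' or '3.11'."""
    if not required:
        # Nothing recorded, so the version cannot be confirmed.
        return False
    return running == required or running.startswith(required + ".")


def check(lock_text: str) -> bool:
    """Compare the running interpreter to the version in the lock file."""
    return satisfies(platform.python_version(), required_python(lock_text))
```

Saved in a script that reads `Pipfile.lock` and calls `check` on its contents, this can be run with `pipenv run python` so that the interpreter being tested is the one `pipenv` selected.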
## A high-level description of how to set up a project
Ok, so to summarise, we installed `pyenv`, which makes it easy to install any
version of Python that we require, and then we installed `pipenv`, which we use
to install packages. The advantage of using `pipenv` is that we get
deterministic builds, and `pipenv` works well with `pyenv` to build
project-specific environments.
The way I would suggest you use these tools now, is that for each project, you
install the latest available version of Python, define it as the local version
using `pyenv local`, and then install packages by specifying their versions,
like so:
```bash
pipenv --python 3.10.5 install beautifulsoup4==4.12.2 pandas==2.1.4 plotnine==0.12.4 skimpy==0.0.11
```
(Replace the versions of Python and packages by the latest, or those you need.)
This will ensure that your project uses the correct software stack, and that collaborators
or future you will be able to regenerate this environment by calling `pipenv sync`.
`pipenv sync` differs from `pipenv install` because it will not touch the `Pipfile.lock`
file, and will just download the packages as listed in the lock file. This is also
the command that we will use later, in the chapter on CI/CD.