Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PTV-1882 upload fixes and improvements #189

Open
wants to merge 33 commits into
base: develop
Choose a base branch
from

Conversation

eapearson
Copy link

@eapearson eapearson commented Jun 14, 2023

PR Description

Main bug fixes

The primary motivation for this set of changes was initially to ensure that the service could handle upload of files greater than 5GB, for there were reports that it could not.

Crash reading file for md5 calculation

Indeed it was found that the code that generates the metadata for a given file would read the entire file into memory in order to compute the md5. It would crash the interpreter at a file size of between 4.0GB and 4.4GB when reading the file into memory. The solution was to read the file in "chunks", i.e. a buffer, and compute the md5 cumulatively.

(This code was also rather cursed in that it uses synchronous read() and therefore would block the entire server until completion.)

Crash counting lines

After fixing this, a secondary issue was discovered with code which computes the total number of "lines" in a text file. It would attempt to loop through the file, reading one line at a time, counting them. The bug is that in, some admitedly rare cases, but nonetheless encountered with test data, a file may appear to be "text" yet contain no line endings. Such a file would suffere the same issue as the md5 bug - it would crash the server.

The solution here was to include the line counting and "is it a text or binary file" test within the same loop over the file that calculates the md5. Instead of reading "lines", the count is of line endings encountered in each chunk of file within the loop.

So those were the two big bugs that would crash the server with large files.

Crash race condition in globus code

Another doozy was found in the globus processing code. With each upload, a check is made to see if the a special globus file has been created for a user with a KBase globus link. The issue was revealed with concurrency testing, and resulted in the server crashing.

The cause was in globus.py, a race condition in which there it first checks if the user directory exists, and if not, creates it. Another request must have created the directory between the two requests. This would have been revealed when running parallel requests, not just concurrent, as the code itself is synchronous. The solution was simple, don't check for existence first, just attempt to make the directory using exists_ok option.

Although I am generally suspicious of any usage of "check for existence and then take an action that will fail if that assumption becomes invalid", in this case finding it was pure luck, as I decided to run manual tests of concurrent parallel handling by uploading a large directory with the development upload front end while also removing the data directory before each run.

Data corruption potential

Another issue that was not encountered but was apparent through code analysis was the potential for data corruption if more than one file is uploaded with the same name at the same time but different file contents. This was not hard to trigger, under the right conditions. It may seem like an unnatural case, and perhaps would be rare, but one can imagine a user with many files of the same name in different directories which are uploaded at the same time without the subdirectory being recreated on the server.

The solution was to first save to a temporary file, and only to move the file to the final destination after the file upload was complete.

Other fixes

There was quite a bit of rearrangement and repair of code within the /upload handling code in app.py, as well as in metadata.py. There was a lot of duplicated code, which made it difficult to isolate the behavior. For instance, several locations separately handled the logic of generating file metadata if it was found missing.

The primary focus was to consolidate code.

At the same time it is very difficult in Python code to even inspect code if there are violations of coding standards (PEP8 and others). So each file that was touched was subjected to black formatting, pylint errors addressed, pylint exceptions added where necessary.

Secondary - concurrency

A secondary issue was concurrency. As I worked on the code and excercised the runtime with a test harness front end, I could observe two problematic aspects of the service: concurrency and performance.

The service was running on a single threaded aiohttp server, with no front end. This meant that any CPU-bound process, like a blocking read operation, would block the entire server for the duration. In practice this would lead to uploads, downloads, or any other api call stalling for the duration, which could be many 10s of seconds, up to a minute. With the prospect, as well, of the server crashing (even with automatic container restarting in our deployments), requests to the server could completely fail.

So I added gunicorn as a front end. Gunicorn will start a number of workers, each one an aiohttp server. In observations, this does increase concurrency, performance, reliability, and adds parallelism. While uploading concurrent files, I could observe that a blocking operation, like copying a file, would normally not block other uploads. Occasionally it would, and I presume that is because the blocking operation was in the same aiohttp server as the affected uploads. Effective paralellism is enabled because multiple workers could handle uploads at the same time, whereas aiohttp's concurrency is is cooperative.

Secondary - performance

As mentioned above, performance was another concern, as the service seemed to process uploads much more slowly than it should. When I switched the upload file system destination to the container's filesystem, I found that the rate went up from about 20MB/s to 70MB/s, and after adding in the necessary file copy, 65MB/s. Of course, this could be an artifact of docker running on macOS ... there are a number of possible factors.

I also increased the read buffer size from the standard 8K to 1M. This also increased performance, but not much above the high water mark.

Shane suggested that I add in an environment variable switch to allow either strategy.

After making it easy to evaluate both strategies, I found that, due to the buffer size increase, there is very little difference between the two. I think this is due to the fact that the number of writes is reduced by 125x, and I have suspected that each write to a non-local directory carries a significant overhead.

Secondary - code quality fixes

In response to local linting errors, and to code quality, formatting, and security issues reported at GitHub and by a local sonarlint vsc plugin, there were many small fixes made. This was partly exacerbated by some cross-file fixes, which triggered those files to be caught. One such change was the the ubuquitous "Path" class, which conflicted with the standard library's Path class. In the code the Python Path was aliased to PyPath or PathPy, but only in files where there was a conflict, and our own Path class was allowed to use that name. That is very confusing, and opposite good practice, so was fixed by allowing Python path to have it's normal name, and our Path renamed to StagingPath, which also happens to reflect it's scope, as it is not a general purpose Path class. Even though a small change, this forced those files to be reformatted and linting errors fixed. Extra work for the PR, but good in the end.

Testing

Before testing could begin, there were several broken tests. Two tests were broken initially, and another broke with my changes. After fixing the tests, I found that coverage did not go above 50%, and after adding the html report for coverage, found that coverage was broken. E.g. the reports did not correctly highlight missing lines of code.

This required an update to the dependencies, which were indeed quite out of date. After this, test coverage jumped to over 80%, and the html coverage report worked perfectly.

Tests were updated to include newly covered code, although at this point, I'm sure a few more tests are needed to cover changes made since then, a few days ago.

Although some tests do use the aiohttp test server library to simulate server endpoints, and an http client for calling them, I think actual integration tests against a running server would be very useful in the future, especially for exercising scenarios under load, concurrency conditions, etc. within the production Docker image and container.

Developer tools

Finally, I added a directory developer containing tools I've used via a Python container. I wanted to run the tests under conditions closer to the server - so in an equivalent container. I chose to use a different Dockerfile as the testing container, also used to run other Python scripts, does not need as much support and setup - just the python dependencies - and uses a different entrypoint mechanism.

eapearson added 27 commits June 14, 2023 11:24
…t of test fixes), gunicorn (part of server runtime improvements) [PTV-1882]
…ary code fixes for clarity. [PTV-1882]

All changed files are subject to black formatting, for uniformity and readability, and to remove linting errors.
The Path class in utils was renamed to StagingPath as it conflicts with Python's path, and forced a renaming of that to PyPath or PathPy.
Many changes in app.py, the home of all endpoints, and where half of the fixes occur.
Refactored much of the config-related code into a new file config.py
Hoisted many hard-coded values into constant-like variables at the module level.
Moved app startup logic into main (from __main__) as it is now the entrypoint for gunicorn
Applied stream-reading, md5-generation, text/binary recognition fixes to metadata.py.
Look for PR comment to explain the changes in more detail.
use python 3.11 since this is an opportunity
simplify os package installation because it was unnecessarily long; more on line RUN line fewer layers.
only COPY necessary directories - everything was copied before which was excessive, potentially insecure
…-1882]

The production entrypoint should not serve a dual purpose - local dev and prod runtime. The local dev support needs work anyway, as it relies on Visual Studio Code, which is out of date.
Gunicorn is used now as it helps ameliorate stuck aiohttp servers and relaunching crashed aiothttp servers. It's configuration uses env variables, all with defaults (which may need to thought about more deeply and/or tested.)
Add tests to cover new code (need to do more)
Generate coverage report to terminal and html
Test runner also now supports running specific tests.
Fix async tests when update dependencies
Improve auth client mocking (replace hand-crafted monkeypatch with pytest method mock)
fix tests and code with obvious typos
Replace silly aliasing of Optional to O with Optional
Moved test_utils to test_helpers as test utils is a module in pytest!
Fixed most warnings reported (mostly deprecations)
Matches the dockerfile changes
Isolated development tools, to aid in development and testing within a container.
Although the host-based server and testing works, it requires installation of the precise same compiler version, which is not really possible cross-platform (though pretty close).
Also dependencies in linux (and in the specific version in deployment), do affect Python dependencies, so this reduces that variable greatly. (Of course, the way the service is run in a container in production will be different.)
This adds a small markdown file to document testing procedures, also covering the just-added development tools for running w/in a docker container.
More docs should be added to cover, e.g., configuration, service architecture, etc.
I'll stop when the # of reported issues is low enough.
Doesn't need to be perfect, just enough so that the # of issues to ignore is tractable to review.
@eapearson
Copy link
Author

Lots of fixes for code quality tools.
The remaining issues (code smells) are all related to test data.
The one remaining security issue relates to the image being left to run as root.
@bio-boris opinion? It runs as non-root for me in tests, upload, download, delete, list work fine, but there is a lot more in the service.
I really will work on the PR description next.

@bio-boris
Copy link
Collaborator

Its possible that there will be some permissions issues if it is run as non root as NFS is mounted in as a specific UID/GID

@bio-boris
Copy link
Collaborator

It's way harder to read this PR when there are so many linting changes. Could you leave comments near where there are actual changes or do the linting in a separate PR?

@sonarqubecloud
Copy link

Kudos, SonarCloud Quality Gate passed!    Quality Gate passed

Bug A 0 Bugs
Vulnerability A 0 Vulnerabilities
Security Hotspot A 0 Security Hotspots
Code Smell A 13 Code Smells

No Coverage information No Coverage information
0.3% 0.3% Duplication

@bio-boris
Copy link
Collaborator

Does this PR need a rebase?

@@ -0,0 +1,13 @@
# Scripts

## `generate-data-files.py`
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these files also appear in the other PR -- should one of the PRs be merged first and then the other one or is there duplicate content?

@@ -0,0 +1,29 @@
FROM python:3.11.4-slim-buster
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a comment at the top of the file to explain what this dockerfile is for?

@@ -2,14 +2,42 @@
This class is in charge of determining possible importers by determining the suffix of the filepath pulled in,
and by looking up the appropriate mappings in the supported_apps_w_extensions.json file
"""
from typing import Optional, Tuple, Dict
from typing import Any, Dict, Optional, Tuple
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

python3.9 onwards allows you to use dict and tuple instead of having to import them from typing. You can also use the shorthand | for Optional - e.g. str | None

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Just a comment, no need to do anything about it)

except KeyError as no_query:
show_hidden = False

show_hidden = request.query.get("showHidden", "false").lower() == "true"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice

@@ -138,42 +106,70 @@ def remove_dir(self, path):
def test_path_cases(username_first, username_rest):
username = username_first + username_rest
assert (
username + "/foo/bar" == utils.Path.validate_path(username, "foo/bar").user_path
username + "/foo/bar"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

at some point these should be parametrised... not this PR, though, it is big enough already! 😄

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants