Recursive exploration of remote datasets #7912

Merged · 27 commits · Jul 22, 2024

Commits:
a54c5b3
WIP: Add recursive exporation of remote s3 layer
MichaelBuessemeyer Jun 21, 2024
f4e973a
WIP: finish first version of recursive exploration of remote s3 layer
MichaelBuessemeyer Jun 24, 2024
2ad6765
WIP: add gcs support
MichaelBuessemeyer Jun 24, 2024
d5fba44
WIP: add gcs support
MichaelBuessemeyer Jun 25, 2024
7b06e4e
WIP: run explorers in parallel on same subdirectory
MichaelBuessemeyer Jun 25, 2024
b7e7096
Code clean up (mainly extracted methods)
MichaelBuessemeyer Jun 25, 2024
20cf6b5
Merge branch 'master' of github.com:scalableminds/webknossos into rec…
MichaelBuessemeyer Jul 2, 2024
1e42c45
add local file system exploration
MichaelBuessemeyer Jul 2, 2024
c3f0ec3
Merge branch 'master' of github.com:scalableminds/webknossos into rec…
MichaelBuessemeyer Jul 3, 2024
2798b1d
do not include mutableReport in requests regarding the local file system
MichaelBuessemeyer Jul 3, 2024
3ea72a1
add missing override of listDirectory of MockDataVault
MichaelBuessemeyer Jul 3, 2024
abc7b2e
some cleanup
MichaelBuessemeyer Jul 3, 2024
10d5c3e
add command to build backend parts like in CI to be ablte to detect e…
MichaelBuessemeyer Jul 3, 2024
4db966f
clean up code
MichaelBuessemeyer Jul 3, 2024
bc6e93e
format backend code
MichaelBuessemeyer Jul 3, 2024
45a86ff
update docs to mention recursive exploration
MichaelBuessemeyer Jul 3, 2024
ca418b6
add changelog entry
MichaelBuessemeyer Jul 3, 2024
c8e35bf
apply some feedback
MichaelBuessemeyer Jul 5, 2024
aeb7569
Merge branch 'master' into recursive-exploration
MichaelBuessemeyer Jul 15, 2024
d6b5719
apply some feedback; Mainly extract methods in ExploreRemoteLayerServ…
MichaelBuessemeyer Jul 15, 2024
b8cdb7e
Merge branch 'recursive-exploration' of github.com:scalableminds/webk…
MichaelBuessemeyer Jul 15, 2024
e8d1da6
Merge branch 'master' of github.com:scalableminds/webknossos into recu…
MichaelBuessemeyer Jul 16, 2024
afea2df
Only let explorers of simple dataset formats explore for additional l…
MichaelBuessemeyer Jul 16, 2024
ef69593
apply pr feedback
MichaelBuessemeyer Jul 18, 2024
dafcc7e
Merge branch 'master' of github.com:scalableminds/webknossos into rec…
MichaelBuessemeyer Jul 18, 2024
84fea6d
restore accidentally deleted changelog entry
MichaelBuessemeyer Jul 22, 2024
d4b1465
Merge branch 'master' into recursive-exploration
MichaelBuessemeyer Jul 22, 2024
3 changes: 2 additions & 1 deletion CHANGELOG.unreleased.md
@@ -11,9 +11,10 @@ For upgrade instructions, please check the [migration guide](MIGRATIONS.released
[Commits](https://github.com/scalableminds/webknossos/compare/24.07.0...HEAD)

### Added
- WEBKNOSSOS now automatically searches in subfolder / sub-collection identifiers for valid datasets in case a provided link to a remote dataset does not directly point to a dataset. [#7912](https://github.com/scalableminds/webknossos/pull/7912)
- Added the option to move a bounding box via dragging while pressing ctrl / meta. [#7892](https://github.com/scalableminds/webknossos/pull/7892)
- Added route `/import?url=<url_to_datasource>` to automatically import and view remote datasets. [#7844](https://github.com/scalableminds/webknossos/pull/7844)
- The context menu that is opened upon right-clicking a segment in the dataview port now contains the segment's name. [#7920](https://github.com/scalableminds/webknossos/pull/7920)
- The context menu that is opened upon right-clicking a segment in the dataview port now contains the segment's name. [#7920](https://github.com/scalableminds/webknossos/pull/7920)
- Upgraded backend dependencies for improved performance and stability. [#7922](https://github.com/scalableminds/webknossos/pull/7922)

### Changed
2 changes: 1 addition & 1 deletion docs/datasets.md
@@ -47,7 +47,7 @@ In particular, the following file formats are supported for uploading (and conve
Once the data is uploaded (and potentially converted), you can further configure a dataset's [Settings](#configuring-datasets) and double-check layer properties, fine tune access rights & permissions, or set default values for rendering.

### Streaming from remote servers and the cloud
WEBKNOSSOS supports loading and remotely streaming [Zarr](https://zarr.dev), [Neuroglancer precomputed format](https://github.com/google/neuroglancer/tree/master/src/neuroglancer/datasource/precomputed) and [N5](https://github.com/saalfeldlab/n5) datasets from a remote source, e.g. Cloud storage (S3) or HTTP server.
WEBKNOSSOS supports loading and remotely streaming [Zarr](https://zarr.dev), [Neuroglancer precomputed format](https://github.com/google/neuroglancer/tree/master/src/neuroglancer/datasource/precomputed) and [N5](https://github.com/saalfeldlab/n5) datasets from a remote source, e.g. Cloud storage (S3 / GCS) or HTTP server.
WEBKNOSSOS supports loading Zarr datasets according to the [OME NGFF v0.4 spec](https://ngff.openmicroscopy.org/latest/).

WEBKNOSSOS can load several remote sources and assemble them into a WEBKNOSSOS dataset with several layers, e.g. one Zarr file/source for the `color` layer and one Zarr file/source for a `segmentation` layer.
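For illustration, remote source URIs of the supported kinds might look like the following (hypothetical bucket and path names):

```
https://example.org/data/my_dataset.zarr
s3://my-bucket/datasets/my_dataset.n5
gs://my-bucket/datasets/my_dataset
file:///mnt/datasets/my_dataset.zarr   (datastore-local paths only)
```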
@@ -461,7 +461,7 @@ function AddRemoteLayer({
const [showCredentialsFields, setShowCredentialsFields] = useState<boolean>(false);
const [usernameOrAccessKey, setUsernameOrAccessKey] = useState<string>("");
const [passwordOrSecretKey, setPasswordOrSecretKey] = useState<string>("");
const [selectedProtocol, setSelectedProtocol] = useState<"s3" | "https" | "gs">("https");
const [selectedProtocol, setSelectedProtocol] = useState<"s3" | "https" | "gs" | "file">("https");
const [fileList, setFileList] = useState<FileList>([]);

useEffect(() => {
@@ -489,9 +489,11 @@
setSelectedProtocol("s3");
} else if (userInput.startsWith("gs://")) {
setSelectedProtocol("gs");
} else if (userInput.startsWith("file://")) {
setSelectedProtocol("file"); // Unused
} else {
throw new Error(
"Dataset URL must employ one of the following protocols: https://, http://, s3:// or gs://",
"Dataset URL must employ one of the following protocols: https://, http://, s3://, gs:// or file://",
);
}
}
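The protocol detection shown in the hunk above can be sketched as a small standalone function. This is a Scala port for illustration only — the real implementation is the TypeScript in `AddRemoteLayer`, and the `https`/`http` branch sits above the visible hunk, so its exact placement here is an assumption:

```scala
object ProtocolDetection {
  // Illustrative Scala port of the TypeScript protocol-detection logic above.
  // Returns the protocol tag that the UI would select for a given dataset URL.
  def detectProtocol(userInput: String): String =
    if (userInput.startsWith("s3://")) "s3"
    else if (userInput.startsWith("gs://")) "gs"
    else if (userInput.startsWith("file://")) "file"
    else if (userInput.startsWith("https://") || userInput.startsWith("http://")) "https"
    else
      throw new IllegalArgumentException(
        "Dataset URL must employ one of the following protocols: https://, http://, s3://, gs:// or file://")
}
```

Note that the `file` branch is marked `// Unused` in the diff: the selection is stored for completeness, but the frontend does not expose local paths beyond forwarding the URI.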
5 changes: 5 additions & 0 deletions package.json
@@ -76,6 +76,11 @@
"scripts": {
"start": "node tools/proxy/proxy.js",
"build": "node --max-old-space-size=4096 node_modules/.bin/webpack --env production",
"@comment build-backend": "Only check for errors in the backend code, as done in CI. This command is not needed to run WEBKNOSSOS",
"build-backend": "yarn build-wk-backend && yarn build-wk-datastore && yarn build-wk-tracingstore",
"build-wk-backend": "sbt -no-colors -DfailOnWarning compile stage",
"build-wk-datastore": "sbt -no-colors -DfailOnWarning \"project webknossosDatastore\" copyMessages compile stage",
"build-wk-tracingstore": "sbt -no-colors -DfailOnWarning \"project webknossosTracingstore\" copyMessages compile stage",
"build-dev": "node_modules/.bin/webpack",
"build-watch": "node_modules/.bin/webpack -w",
"listening": "lsof -i:5005,7155,9000,9001,9002",
3 changes: 3 additions & 0 deletions test/backend/DataVaultTestSuite.scala
@@ -134,6 +134,9 @@ class DataVaultTestSuite extends PlaySpec {
class MockDataVault extends DataVault {
override def readBytesAndEncoding(path: VaultPath, range: RangeSpecifier)(
implicit ec: ExecutionContext): Fox[(Array[Byte], Encoding.Value)] = ???

override def listDirectory(path: VaultPath,
maxItems: Int)(implicit ec: ExecutionContext): Fox[List[VaultPath]] = ???
}

"Uri has no trailing slash" should {
@@ -34,10 +34,11 @@ import play.api.libs.json.Json
import play.api.mvc.{Action, AnyContent, MultipartFormData, PlayBodyParsers}

import java.io.File
import com.scalableminds.webknossos.datastore.storage.AgglomerateFileKey
import com.scalableminds.webknossos.datastore.storage.{AgglomerateFileKey, DataVaultService}
import net.liftweb.common.{Box, Empty, Failure, Full}
import play.api.libs.Files

import java.net.URI
import scala.collection.mutable.ListBuffer
import scala.concurrent.{ExecutionContext, Future}
import scala.concurrent.duration._
@@ -721,10 +722,14 @@
Action.async(validateJson[ExploreRemoteDatasetRequest]) { implicit request =>
accessTokenService.validateAccess(UserAccessRequest.administrateDataSources(request.body.organizationName), token) {
val reportMutable = ListBuffer[String]()
val hasLocalFilesystemRequest = request.body.layerParameters.exists(param =>
new URI(param.remoteUri).getScheme == DataVaultService.schemeFile)
for {
dataSourceBox: Box[GenericDataSource[DataLayer]] <- exploreRemoteLayerService
.exploreRemoteDatasource(request.body.layerParameters, reportMutable)
.futureBox
// Remove report of recursive exploration in case of exploring the local file system to avoid information exposure.
_ <- Fox.runIf(hasLocalFilesystemRequest)(Fox.successful(reportMutable.clear()))
dataSourceOpt = dataSourceBox match {
case Full(dataSource) if dataSource.dataLayers.nonEmpty =>
reportMutable += s"Resulted in dataSource with ${dataSource.dataLayers.length} layers."
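The report-clearing guard in `DataSourceController` above can be isolated into a small sketch. The names below (`redactIfLocal`, `schemeFile`) are simplified stand-ins for illustration, not the real controller API:

```scala
import java.net.URI
import scala.collection.mutable.ListBuffer

object ReportRedaction {
  val schemeFile = "file"

  // Sketch of the redaction guard above: if any requested layer URI uses the
  // file:// scheme, clear the exploration report so that local filesystem
  // structure is not exposed to the client.
  def redactIfLocal(remoteUris: List[String], report: ListBuffer[String]): Unit = {
    val hasLocalFilesystemRequest =
      remoteUris.exists(u => new URI(u).getScheme == schemeFile)
    if (hasLocalFilesystemRequest) report.clear()
  }
}
```

The design choice here is conservative: the whole report is dropped as soon as a single layer parameter targets the local filesystem, rather than filtering individual report lines.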
@@ -7,4 +7,6 @@ import scala.concurrent.ExecutionContext
trait DataVault {
def readBytesAndEncoding(path: VaultPath, range: RangeSpecifier)(
implicit ec: ExecutionContext): Fox[(Array[Byte], Encoding.Value)]

def listDirectory(path: VaultPath, maxItems: Int)(implicit ec: ExecutionContext): Fox[List[VaultPath]]
}
@@ -8,21 +8,18 @@ import org.apache.commons.lang3.builder.HashCodeBuilder

import java.nio.ByteBuffer
import java.nio.file.{Files, Path, Paths}
import java.util.stream.Collectors
import scala.concurrent.ExecutionContext
import scala.jdk.CollectionConverters._

class FileSystemDataVault extends DataVault {

override def readBytesAndEncoding(path: VaultPath, range: RangeSpecifier)(
implicit ec: ExecutionContext): Fox[(Array[Byte], Encoding.Value)] = {
val uri = path.toUri
implicit ec: ExecutionContext): Fox[(Array[Byte], Encoding.Value)] =
for {
_ <- bool2Fox(uri.getScheme == DataVaultService.schemeFile) ?~> "trying to read from FileSystemDataVault, but uri scheme is not file"
_ <- bool2Fox(uri.getHost == null || uri.getHost.isEmpty) ?~> s"trying to read from FileSystemDataVault, but hostname ${uri.getHost} is non-empty"
localPath = Paths.get(uri.getPath)
_ <- bool2Fox(localPath.isAbsolute) ?~> "trying to read from FileSystemDataVault, but hostname is non-empty"
localPath <- vaultPathToLocalPath(path)
bytes <- readBytesLocal(localPath, range)
} yield (bytes, Encoding.identity)
}

private def readBytesLocal(localPath: Path, range: RangeSpecifier)(implicit ec: ExecutionContext): Fox[Array[Byte]] =
if (Files.exists(localPath)) {
@@ -53,6 +50,30 @@
}
} else Fox.empty

override def listDirectory(path: VaultPath, maxItems: Int)(implicit ec: ExecutionContext): Fox[List[VaultPath]] =
vaultPathToLocalPath(path).map(
localPath =>
if (Files.isDirectory(localPath))
Files
.list(localPath)
.filter(file => Files.isDirectory(file))
.collect(Collectors.toList())
.asScala
.toList
.map(dir => new VaultPath(dir.toUri, this))
.take(maxItems)
else List.empty)

private def vaultPathToLocalPath(path: VaultPath)(implicit ec: ExecutionContext): Fox[Path] = {
val uri = path.toUri
for {
_ <- bool2Fox(uri.getScheme == DataVaultService.schemeFile) ?~> "trying to read from FileSystemDataVault, but uri scheme is not file"
_ <- bool2Fox(uri.getHost == null || uri.getHost.isEmpty) ?~> s"trying to read from FileSystemDataVault, but hostname ${uri.getHost} is non-empty"
localPath = Paths.get(uri.getPath)
      _ <- bool2Fox(localPath.isAbsolute) ?~> "trying to read from FileSystemDataVault, but path is not absolute"
} yield localPath
}

override def hashCode(): Int =
new HashCodeBuilder(19, 31).toHashCode

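Stripped of the `Fox`/`VaultPath` wrappers, the directory-listing logic of `FileSystemDataVault.listDirectory` boils down to listing immediate subdirectories with `java.nio`, capped at `maxItems`. A minimal standalone sketch (which, unlike the diff, closes the `Files.list` stream explicitly):

```scala
import java.nio.file.{Files, Path}
import scala.jdk.CollectionConverters._

object ListSubdirs {
  // List the immediate subdirectories of localPath, capped at maxItems.
  // Returns an empty list if localPath is not a directory.
  def listSubdirectories(localPath: Path, maxItems: Int): List[Path] =
    if (Files.isDirectory(localPath)) {
      val stream = Files.list(localPath)
      try stream
        .iterator()
        .asScala
        .filter(p => Files.isDirectory(p))
        .take(maxItems)
        .toList
      finally stream.close()
    } else List.empty
}
```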
@@ -11,6 +11,7 @@ import java.io.ByteArrayInputStream
import java.net.URI
import java.nio.ByteBuffer
import scala.concurrent.ExecutionContext
import scala.jdk.CollectionConverters.IterableHasAsScala

class GoogleCloudDataVault(uri: URI, credential: Option[GoogleServiceAccountCredential]) extends DataVault {

@@ -72,6 +73,17 @@
} yield (bytes, encoding)
}

override def listDirectory(path: VaultPath, maxItems: Int)(implicit ec: ExecutionContext): Fox[List[VaultPath]] =
tryo({
val objName = path.toUri.getPath.tail
val blobs =
storage.list(bucket, Storage.BlobListOption.prefix(objName), Storage.BlobListOption.currentDirectory())
val subDirectories = blobs.getValues.asScala.toList.filter(_.isDirectory).take(maxItems)
val paths = subDirectories.map(dirBlob =>
new VaultPath(new URI(s"${uri.getScheme}://$bucket/${dirBlob.getBlobId.getName}"), this))
paths
})

private def getUri = uri
private def getCredential = credential

@@ -44,6 +44,10 @@ class HttpsDataVault(credential: Option[DataVaultCredential], ws: WSClient) exte

}

override def listDirectory(path: VaultPath, maxItems: Int)(implicit ec: ExecutionContext): Fox[List[VaultPath]] =
// HTTP file listing is currently not supported.
Fox.successful(List.empty)

private val headerInfoCache: AlfuCache[URI, (Boolean, Long)] = AlfuCache()

private def getHeaderInformation(uri: URI)(implicit ec: ExecutionContext): Fox[(Boolean, Long)] =
@@ -11,21 +11,24 @@ import com.amazonaws.auth.{
import com.amazonaws.client.builder.AwsClientBuilder.EndpointConfiguration
import com.amazonaws.regions.Regions
import com.amazonaws.services.s3.{AmazonS3, AmazonS3ClientBuilder}
import com.amazonaws.services.s3.model.{GetObjectRequest, S3Object}
import com.amazonaws.services.s3.model.{GetObjectRequest, ListObjectsV2Request, S3Object}
import com.amazonaws.util.AwsHostNameUtils
import com.scalableminds.util.tools.Fox
import com.scalableminds.util.tools.Fox.box2Fox
import com.scalableminds.webknossos.datastore.storage.{
LegacyDataVaultCredential,
RemoteSourceDescriptor,
S3AccessKeyCredential
}
import net.liftweb.common.Box.tryo
import net.liftweb.common.{Box, Failure, Full}
import org.apache.commons.io.IOUtils
import org.apache.commons.lang3.builder.HashCodeBuilder

import java.net.URI
import scala.collection.immutable.NumericRange
import scala.concurrent.ExecutionContext
import scala.jdk.CollectionConverters._

class S3DataVault(s3AccessKeyCredential: Option[S3AccessKeyCredential], uri: URI) extends DataVault {
private lazy val bucketName = S3DataVault.hostBucketFromUri(uri) match {
@@ -50,7 +53,8 @@

private def getRequest(bucketName: String, key: String): GetObjectRequest = new GetObjectRequest(bucketName, key)

private def performRequest(request: GetObjectRequest)(implicit ec: ExecutionContext): Fox[(Array[Byte], String)] = {
private def performGetObjectRequest(request: GetObjectRequest)(
implicit ec: ExecutionContext): Fox[(Array[Byte], String)] = {
var s3objectRef: Option[S3Object] = None // Used for cleanup later (possession of a S3Object requires closing it)
try {
val s3object = client.getObject(request)
@@ -82,10 +86,38 @@
case SuffixLength(l) => getSuffixRangeRequest(bucketName, objectKey, l)
case Complete() => getRequest(bucketName, objectKey)
}
(bytes, encodingString) <- performRequest(request)
(bytes, encodingString) <- performGetObjectRequest(request)
encoding <- Encoding.fromRfc7231String(encodingString)
} yield (bytes, encoding)

override def listDirectory(path: VaultPath, maxItems: Int)(implicit ec: ExecutionContext): Fox[List[VaultPath]] =
for {
prefixKey <- Fox.box2Fox(S3DataVault.objectKeyFromUri(path.toUri))
s3SubPrefixKeys <- getObjectSummaries(bucketName, prefixKey, maxItems)
vaultPaths <- tryo(
s3SubPrefixKeys.map(key => new VaultPath(new URI(s"${uri.getScheme}://$bucketName/$key"), this))).toFox
} yield vaultPaths

private def getObjectSummaries(bucketName: String, keyPrefix: String, maxItems: Int)(
implicit ec: ExecutionContext): Fox[List[String]] =
try {
val listObjectsRequest = new ListObjectsV2Request
listObjectsRequest.setBucketName(bucketName)
listObjectsRequest.setPrefix(keyPrefix)
listObjectsRequest.setDelimiter("/")
listObjectsRequest.setMaxKeys(maxItems)
val objectListing = client.listObjectsV2(listObjectsRequest)
val s3SubPrefixes = objectListing.getCommonPrefixes.asScala.toList
Fox.successful(s3SubPrefixes)
} catch {
case e: AmazonServiceException =>
e.getStatusCode match {
case 404 => Fox.empty
case _ => Fox.failure(e.getMessage)
}
case e: Exception => Fox.failure(e.getMessage)
}

Review comment on lines +103 to +120:

MichaelBuessemeyer (Contributor, Author): I have a question @fm3. Should I use tryo here or this try-and-catch construct? Which one is better in which case?

Reply (Member): tryo is a shortcut, it will always produce a Box.Failure in case of any exception. In this code, however, we want to create different results based on a parameter of the exception (Fox.empty for status code 404), so we need the full try/catch to implement that custom logic.
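The distinction discussed in this thread can be demonstrated with simplified stand-ins for liftweb's `Box` hierarchy. The types below are illustrative re-definitions, not the real `net.liftweb.common` ones:

```scala
import scala.util.control.NonFatal

// Simplified stand-ins for liftweb's Box (Full / Empty / Failure).
sealed trait Box[+A]
case class Full[A](value: A) extends Box[A]
case class Failure(msg: String) extends Box[Nothing]
case object Empty extends Box[Nothing]

object BoxDemo {
  // tryo-style shortcut: every exception collapses into a Failure.
  def tryo[A](f: => A): Box[A] =
    try Full(f)
    catch { case NonFatal(e) => Failure(e.getMessage) }

  // Full try/catch: custom mapping, e.g. a "404" becomes Empty instead of Failure
  // (mirroring how the S3 code maps a 404 status to Fox.empty).
  def withCustomHandling[A](f: => A): Box[A] =
    try Full(f)
    catch {
      case e: RuntimeException if e.getMessage == "404" => Empty
      case NonFatal(e)                                  => Failure(e.getMessage)
    }
}
```

With `tryo`, a 404 is indistinguishable from any other error; the explicit try/catch lets the caller treat "not found" as an empty result.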

private def getUri = uri
private def getCredential = s3AccessKeyCredential

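The S3 listing above relies on `ListObjectsV2` with a `/` delimiter, whose "common prefixes" act as the immediate sub-"directories" under a key prefix. A pure-Scala model of that behavior (a hypothetical helper for illustration, not the AWS SDK):

```scala
object CommonPrefixes {
  // Model of S3 ListObjectsV2 "common prefixes": for each key under `prefix`,
  // keep only the next path segment (up to and including the delimiter),
  // deduplicated — i.e. the immediate subdirectories.
  def commonPrefixes(keys: List[String], prefix: String, delimiter: String = "/"): List[String] =
    keys
      .filter(_.startsWith(prefix))
      .flatMap { key =>
        val rest = key.drop(prefix.length)
        val idx  = rest.indexOf(delimiter)
        if (idx >= 0) Some(prefix + rest.take(idx + 1)) else None
      }
      .distinct
}
```

Objects that sit directly under the prefix (no further delimiter) yield no common prefix, which is why the exploration only descends into "folders".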
@@ -38,6 +38,9 @@ class VaultPath(uri: URI, dataVault: DataVault) extends LazyLogging {
}
}

def listDirectory(maxItems: Int)(implicit ec: ExecutionContext): Fox[List[VaultPath]] =
dataVault.listDirectory(this, maxItems)

private def decodeBrotli(bytes: Array[Byte]) = {
Brotli4jLoader.ensureAvailability()
val brotliInputStream = new BrotliInputStream(new ByteArrayInputStream(bytes))