Store the resources in S3 buckets (data file and revisions offset files) #582

JohannesLichtenberger · 2023-02-23T21:54:57Z

As an alternative backend or a combination of a local store and an S3 bucket store we could add a new storage type using JClouds blob store for instance (as in Treetank, the former project from which this project was forked).

sudip-unb · 2023-03-07T17:22:53Z

Hi @JohannesLichtenberger I would like to work on this issue.

JohannesLichtenberger · 2023-03-22T10:08:11Z

@sudip-unb did you make any advances or do you need help?

sudip-unb · 2023-03-22T10:20:12Z

Hi @JohannesLichtenberger, please give me a little bit of time. I am currently busy with my exams. I have already explored the link you provided. I can start development this weekend.

JohannesLichtenberger · 2023-03-23T08:49:57Z

Sure, I'm just pinging assigned people because, in my experience, they often turn out not to have time at all ;-) but I'm glad that you'll start working on it afterward. Take your time -- as I said, I just wanted to make sure that you're still up for the task :-)

JohannesLichtenberger · 2023-04-12T12:52:10Z

@sudip-unb any news? :)

Yashendr · 2023-05-02T13:34:02Z

Hi, can I help out with this issue?

JohannesLichtenberger · 2023-05-03T19:54:15Z

@Yashendr you can have a look at Treetank (https://github.com/sebastiangraf/treetank) where Sebastian already implemented such a backend (also a combined storage)...

JohannesLichtenberger · 2023-05-03T19:56:05Z

https://github.com/sebastiangraf/treetank/tree/master/coremodules/core/src/main/java/org/treetank/io/jclouds

sband · 2023-05-04T11:49:11Z

Hi @JohannesLichtenberger ,

I went through the org.sirix.io package to understand the storage types that SirixDB currently has.
From what I understand, FileStorage is already supported; however, this issue talks about combining local file storage with an S3 bucket store. Could you please elaborate on this? Or is it just that support for S3 storage is needed?
Once I understand this, I'd be happy to contribute my bit to this project.
For starters, here are two approaches I could take:

1. Use the FileWriter to write the data and revision files to the filesystem and copy them to S3. (I may be completely wrong about this, given that I have not looked at the logic of how the page references are written.)
2. Extend the current logic to write to the S3 object store directly:
   a. Externalize the configuration so that the AWS keys and S3 bucket details can be provided.
   b. Create a connection to the S3 bucket using the configuration from point "a".
   c. Using the AWS SDK APIs, reuse the current logic to write the page references to S3 objects directly.

In approach 2 I could refer to or use the jclouds package you mentioned.

Let me know what you think about these approaches.
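
As an illustration of the "externalize the configuration" step in the second approach, here is a minimal sketch. The record name `RemoteStoreConfig` and the `remote.*` property keys are hypothetical and do not exist in SirixDB; the sketch only shows loading and validating settings, not connecting to S3.

```java
import java.util.Properties;

/** Hypothetical settings holder for a remote store; not part of SirixDB. */
record RemoteStoreConfig(String platform, String bucket, String region,
                         String accessKeyId, String secretAccessKey) {

  /** Builds the config from properties and fails fast on a missing key. */
  static RemoteStoreConfig from(Properties props) {
    return new RemoteStoreConfig(
        props.getProperty("remote.platform", "aws"), // assumed default
        require(props, "remote.bucket"),
        require(props, "remote.region"),
        require(props, "remote.accessKeyId"),
        require(props, "remote.secretAccessKey"));
  }

  private static String require(Properties props, String key) {
    String value = props.getProperty(key);
    if (value == null || value.isBlank()) {
      throw new IllegalArgumentException("Missing required property: " + key);
    }
    return value.trim();
  }

  /** Keeps the secret out of logs and exception messages. */
  @Override
  public String toString() {
    return "RemoteStoreConfig[platform=" + platform + ", bucket=" + bucket
        + ", region=" + region + ", accessKeyId=" + accessKeyId + ", secretAccessKey=***]";
  }
}
```

In practice the credentials would more likely come from the environment or the SDK's own credential chain than from a plain properties file; the properties-based form is used here only because it is easy to run.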

JohannesLichtenberger · 2023-05-04T12:25:31Z

I'd probably see it as a kind of automatic backup. A local file based store and an async store via JClouds. IIRC the pure S3 storage was way too slow. So, in short I prefer your second option. To combine the storage approaches we can implement something as simple as this combined storage: https://github.com/sebastiangraf/treetank/tree/master/coremodules/core/src/main/java/org/treetank/io/combined

BTW: If you dig a bit deeper into the storage mechanism (simply store word aligned page fragments instead of same sized full pages), it would also be interesting to find out, why the iouring based backend currently is slower than the simple file channel based solution and the memory mapped backend (I think somehow because of the event loop)...

sband · 2023-05-04T13:08:54Z

OK, looking at the combined storage linked above, here is my understanding of the requirement.
We should write a storage type that is a combination of two storages:

1. Local storage: any type that is currently supported in SirixDB, for example file channel, file, iouring or memory mapped.
2. Remote storage: written to asynchronously. For now this is S3; in the future it could be Azure Blob store or GCP blob store.

So here is the implementation in plain English:

1. Write a class to facilitate remote storage. This class:
   a. reads the cloud platform type,
   b. based on the cloud platform, reads the appropriate properties from a config file to create a connection,
   c. contains methods to read from and write to the remote storage.
2. Write a CombinedStorage class (similar to the one at the Treetank URL above) that encapsulates the SirixDB storage type used for local storage together with the remote storage class above. It writes asynchronously to the remote storage using the Executors framework and leverages the read/write methods of the local storage it wraps. It could literally be [this](https://github.com/sebastiangraf/treetank/blob/master/coremodules/core/src/main/java/org/treetank/io/combined/CombinedStorage.java) class with some tweaks that are suitable for SirixDB.
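
The combined storage described above could be sketched roughly as follows. All names here (`PageStore`, `InMemoryPageStore`, `CombinedPageStore`) are hypothetical; SirixDB's real reader/writer interfaces and Treetank's `CombinedStorage` look different, and the in-memory store merely stands in for a file-channel store or an S3 bucket.

```java
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;

/** Hypothetical minimal store interface, keyed by file offset. */
interface PageStore {
  void write(long offset, byte[] fragment);

  /** Returns null when nothing is stored at the offset. */
  byte[] read(long offset);
}

/** In-memory stand-in for either side (local file store or remote bucket). */
final class InMemoryPageStore implements PageStore {
  private final Map<Long, byte[]> pages = new ConcurrentHashMap<>();

  @Override
  public void write(long offset, byte[] fragment) {
    pages.put(offset, fragment.clone());
  }

  @Override
  public byte[] read(long offset) {
    byte[] bytes = pages.get(offset);
    return bytes == null ? null : bytes.clone();
  }
}

/** Writes locally on the caller's thread and mirrors to the remote store in the background. */
final class CombinedPageStore implements PageStore {
  private final PageStore local;
  private final PageStore remote;
  private final ExecutorService uploader;

  CombinedPageStore(PageStore local, PageStore remote, ExecutorService uploader) {
    this.local = local;
    this.remote = remote;
    this.uploader = uploader;
  }

  @Override
  public void write(long offset, byte[] fragment) {
    writeAsync(offset, fragment); // fire and forget
  }

  /** Same as write, but lets the caller observe completion or failure of the upload. */
  CompletableFuture<Void> writeAsync(long offset, byte[] fragment) {
    local.write(offset, fragment);   // synchronous: the local store is the source of truth
    byte[] copy = fragment.clone();  // the caller may reuse its buffer after returning
    return CompletableFuture.runAsync(() -> remote.write(offset, copy), uploader);
  }

  @Override
  public byte[] read(long offset) {
    return local.read(offset);       // reads never wait on the remote store
  }
}
```

A single-threaded executor keeps the uploads in write order; a real implementation would also need a policy for failed uploads, which this sketch leaves to whoever holds the returned future.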

sband · 2023-05-04T13:09:33Z

> BTW: If you dig a bit deeper into the storage mechanism (simply store word aligned page fragments instead of same sized full pages), it would also be interesting to find out, why the iouring based backend currently is slower than the simple file channel based solution and the memory mapped backend (I think somehow because of the event loop)...

I could take this up as a separate task, maybe?

JohannesLichtenberger · 2023-05-04T13:22:40Z

Thanks for working on this :-) and probably the upcoming task

Yashendr · 2023-05-04T15:42:43Z

> Ok looking at the above for combined, here is my understanding of the requirement: We should write a Storage type that would be a combination of two storages -
>
> 1. Local - Could be any type that is currently supported in SirixDB - for example file channel, file, iouring OR memory mapped
> 2. remote storage - this asynchronously writes to remote storage - For now S3, in future this could be Azure Blob store OR GCP blob store
>
> So here is plain English implementation detail:
>
> 1. Write a class to facilitate remote storage
> 2. This class reads the cloud platform type
> 3. Based on the cloud platform, reads appropriate properties from a config file to create a connection
> 4. Contains methods to read and write from remote storage.
>
> A CombinedStorage class (similar to the treetank url above) will encapsulate the SirixDB storage type for local storage and the above remote storage class with methods to write asynchronously to the remote storage using Executors framework and leveraging the read/write methods of the respective local storage used in this class. It could literally be [this](https://github.com/sebastiangraf/treetank/blob/master/coremodules/core/src/main/java/org/treetank/io/combined/CombinedStorage.java) class with some tweaks that are suitable for SirixDB

@sband Hey, do you need help with this? Is there anything in particular you would like me to do?

sband · 2023-05-05T05:49:16Z

@Yashendr sure, I will let you know if I need any help with this.

sband · 2023-05-09T04:58:49Z

Hi @JohannesLichtenberger ,

This is in progress. I am hoping to complete it by this coming Friday...

sband · 2023-05-10T11:46:44Z

hi @JohannesLichtenberger

quick question:

For coding the reader that is used as part of the cloud storage (AWS, for instance), my approach is to get the object from S3 (the remote storage in this case), which gives me the data as bytes. Should I write this byte data into a local file using the File Storage and then use the FileReader? Or should I just leverage the same reader that the user would use in the CombinedStorage class I am writing? I would prefer the latter to avoid any confusion, but please let me know.
The same question applies to the writer.
For now I am writing code to support AWS only; is that OK?

JohannesLichtenberger · 2023-05-10T12:48:44Z

Hi @sband, I'd simply read-write the page(-fragments) into S3 buckets. If we want a local cache and/or use S3 as a backup more or less I'd use the CombinedStorage.
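
To make the "local cache in front of S3" idea concrete, here is a minimal read-through sketch. `FragmentBlobs`, `MapBlobs` and `ReadThroughReader` are hypothetical names, not the jclouds, AWS SDK or SirixDB API, and the map-backed store only stands in for a real bucket or a local file.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

/** Hypothetical blob abstraction: one blob per page(-fragment), keyed by offset. */
interface FragmentBlobs {
  Optional<byte[]> get(long offset);

  void put(long offset, byte[] fragment);
}

/** Map-backed stand-in for a bucket or a local store. */
final class MapBlobs implements FragmentBlobs {
  private final Map<Long, byte[]> blobs = new HashMap<>();
  int gets = 0; // counts lookups so the caching effect is observable

  @Override
  public Optional<byte[]> get(long offset) {
    gets++;
    return Optional.ofNullable(blobs.get(offset));
  }

  @Override
  public void put(long offset, byte[] fragment) {
    blobs.put(offset, fragment);
  }
}

/** Serves reads from the local cache and falls back to the bucket on a miss. */
final class ReadThroughReader {
  private final FragmentBlobs localCache;
  private final FragmentBlobs bucket;

  ReadThroughReader(FragmentBlobs localCache, FragmentBlobs bucket) {
    this.localCache = localCache;
    this.bucket = bucket;
  }

  Optional<byte[]> read(long offset) {
    Optional<byte[]> cached = localCache.get(offset);
    if (cached.isPresent()) {
      return cached;
    }
    Optional<byte[]> fetched = bucket.get(offset);
    // Remember the fragment locally so the next read skips the remote round trip.
    fetched.ifPresent(bytes -> localCache.put(offset, bytes));
    return fetched;
  }
}
```

With this shape the reader deals only in bytes, so there is no need to materialize a fetched object as a local file before handing it to an existing file-based reader.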

AWS is okay :-) in the future, we could also support for instance writing to/reading from Kafka or Pulsar/BookKeeper...

However, what I'm even more interested in is making the local storage as fast as possible first, before even looking at horizontal scaling/sharding... so I'd rather like to know why the IO-uring storage is currently slower than the FileChannel based approach, at least on my notebook.

JohannesLichtenberger · 2023-05-10T12:51:55Z

Furthermore, it's kind of sad that Intel Optane Non-Volatile Memory isn't produced anymore, as the page(-fragments) are not aligned to a predefined size. Due to the sliding snapshot algorithm, if only a few nodes are changed, mainly just these nodes are written to a new location instead of the full page (thus generating a page-fragment). However, for iouring, I guess it would be great to have classes of page sizes and to use predefined buffers (as in Umbra from Thomas Neumann...).
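
One way to picture "classes of page sizes with predefined buffers" is the sketch below. It is only an assumption about what such a scheme could look like: the class name `SizeClassBufferPool`, the power-of-two classes and the 512-byte and 1 MiB bounds are all made up for illustration and are taken neither from SirixDB nor from Umbra.

```java
import java.nio.ByteBuffer;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical pool of reusable buffers grouped into power-of-two size classes. */
final class SizeClassBufferPool {
  private static final int MIN_CLASS_BYTES = 512;     // assumed smallest class
  private static final int MAX_CLASS_BYTES = 1 << 20; // assumed largest class (1 MiB)

  // Free buffers per size class; not thread-safe, which is enough for a sketch.
  private final Map<Integer, Deque<ByteBuffer>> free = new HashMap<>();

  /** Rounds a fragment size up to the smallest size class that can hold it. */
  static int sizeClassFor(int fragmentBytes) {
    if (fragmentBytes <= 0 || fragmentBytes > MAX_CLASS_BYTES) {
      throw new IllegalArgumentException("Unsupported fragment size: " + fragmentBytes);
    }
    if (fragmentBytes <= MIN_CLASS_BYTES) {
      return MIN_CLASS_BYTES;
    }
    // For n > 1, highestOneBit(n - 1) << 1 is the smallest power of two >= n.
    return Integer.highestOneBit(fragmentBytes - 1) << 1;
  }

  /** Hands out a buffer of the matching class, limited to the requested size. */
  ByteBuffer acquire(int fragmentBytes) {
    int sizeClass = sizeClassFor(fragmentBytes);
    ByteBuffer buffer = free.computeIfAbsent(sizeClass, k -> new ArrayDeque<>()).pollFirst();
    if (buffer == null) {
      buffer = ByteBuffer.allocateDirect(sizeClass); // allocated once, reused afterwards
    }
    buffer.clear();
    buffer.limit(fragmentBytes);
    return buffer;
  }

  /** Returns a buffer to the free list of its size class. */
  void release(ByteBuffer buffer) {
    free.computeIfAbsent(buffer.capacity(), k -> new ArrayDeque<>()).addFirst(buffer);
  }
}
```

The trade-off is internal fragmentation: a 700-byte fragment occupies a 1024-byte buffer, in exchange for a small, fixed set of buffer sizes that can be allocated up front.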

JohannesLichtenberger · 2023-05-10T12:54:26Z

Thanks for working on SirixDB, BTW :-) really looking forward to your PR (and maybe future contributions?)

sband · 2023-05-12T12:22:57Z

hi @JohannesLichtenberger ,

I have created a DRAFT pull request for this: #611
I could not complete the implementation as promised. However, please feel free to advise if you see that I am going in the wrong direction in implementing this for the required use case.

ighosh98 · 2023-09-30T19:04:09Z

Hi,
I see no activity on this for some time. If it's not being actively worked on, can I work on this issue?

JohannesLichtenberger · 2023-10-04T20:10:48Z

@ighosh98 do you intend to work on this?

ighosh98 · 2023-10-04T23:23:52Z

Hi @JohannesLichtenberger , yes I will be working on this issue.

JohannesLichtenberger · 2023-11-14T16:57:41Z

@ighosh98 ping

ighosh98 · 2023-11-14T17:07:40Z

Hi. I've been tied up with some work. It would take me some time to raise the PR. If someone else can develop it faster, they can take over.

JohannesLichtenberger added enhancement good first issue help wanted labels Feb 23, 2023

JohannesLichtenberger assigned sudip-unb Mar 9, 2023

JohannesLichtenberger unassigned sudip-unb Apr 24, 2023

sband added a commit to sband/sirix that referenced this issue May 12, 2023

fix: Store the resources in S3 buckets (data file and revisions offset files) sirixdb#582

3ba8287

sband mentioned this issue May 12, 2023

fix: Store the resources in S3 buckets #611

Open

sband added a commit to sband/sirix that referenced this issue Jun 8, 2023

fix sirixdb#582: Replace FileReader with FileChannelReader

d7de235

sband added a commit to sband/sirix that referenced this issue Jun 11, 2023

fix sirixdb#582: Rectify failing test'

1e92222

JohannesLichtenberger assigned ighosh98 Oct 2, 2023
