[Data Export] shorten the oss uploading time in the beginning of every month #1306

Open
tyn1998 opened this issue Jun 2, 2023 · 5 comments
Labels
waiting for author (need issue author's feedback)

Comments

@tyn1998
Member

tyn1998 commented Jun 2, 2023

Description

Hi community,

Is it possible to shorten the OSS uploading time at the beginning of every month? Or could you choose a fixed day of the month and announce it as a due date by which all data export tasks are completed?

This is very important for downstream apps that consume OpenDigger's valuable data.

[two screenshots attached]
@frank-zsy
Contributor

@tyn1998 I think we had this discussion before, and the solution was to put the update time into the metadata of each repo, e.g. in the file https://oss.x-lab.info/open_digger/github/X-lab2017/open-digger/meta.json. There is a field called updatedAt, a timestamp that indicates when the data was last updated; you can use it to find out whether the data has been updated for the current month.
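For example, a downstream app could poll that field, roughly like this (a minimal sketch, not the exact consumer code, and it assumes updatedAt is a Unix timestamp in milliseconds):

```ts
// Minimal sketch: check whether OpenDigger data has been refreshed this month.
// Assumption: `updatedAt` in meta.json is a Unix timestamp in milliseconds.
const META_URL =
  'https://oss.x-lab.info/open_digger/github/X-lab2017/open-digger/meta.json';

async function isUpdatedThisMonth(): Promise<boolean> {
  const res = await fetch(META_URL);
  if (!res.ok) throw new Error(`Failed to fetch meta.json: ${res.status}`);
  const meta = await res.json() as { updatedAt: number };
  const updated = new Date(meta.updatedAt);
  const now = new Date();
  return updated.getUTCFullYear() === now.getUTCFullYear()
    && updated.getUTCMonth() === now.getUTCMonth();
}

isUpdatedThisMonth().then(ok => console.log(ok ? 'fresh data' : 'still last month'));
```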

@tyn1998
Member Author

tyn1998 commented Jun 2, 2023

Hi @frank-zsy, thanks for your reply.

I know about the meta.json files. What I mean in this issue is whether techniques such as parallel computing and parallel uploading could be adopted to speed up the data export and upload processes, so that hopefully all export tasks can be completed within 24 hours, or even within a few hours.

I noticed that writeFileSync (the synchronous version of writeFile) is used in the cron tasks to write JSON files to the file system of your machine. Would it be faster to use fs.writeFile instead, so that subsequent computing tasks do not have to wait for the file writes?
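Something along these lines, for example (just a rough sketch of the idea, not the actual cron code):

```ts
import { writeFile } from 'fs/promises';

// Sketch: queue each write instead of blocking on writeFileSync,
// then await all pending writes once before the upload step.
const pendingWrites: Promise<void>[] = [];

function exportMetric(path: string, data: unknown): void {
  pendingWrites.push(writeFile(path, JSON.stringify(data)));
}

async function flushWrites(): Promise<void> {
  // Make sure every file has actually hit the disk before uploading to OSS.
  await Promise.all(pendingWrites);
}
```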

I also assume that after the cron tasks are executed, another set of scripts (not included in this repository) is run to upload the exported files to Aliyun OSS. Could those upload scripts be improved to shorten the uploading time?

What is the bottleneck now? Computing or uploading?

@frank-zsy
Contributor

Understood, so I will elaborate on the tasks here. There are several steps in the data update process.

  • First, we need to wait for all of last month's data to be imported into the ClickHouse instance. Since we are in UTC+8, that is around 10 or 11 a.m. on the first day of the month.
  • Then, before the metrics export task, we need to calculate OpenRank for all repos and users for last month.
    • We calculate the activity for the whole GitHub collaboration network and import it into the Neo4j database.
    • We then calculate the OpenRank of all repos and users.
  • After that, we export the OpenRank values from Neo4j back into the ClickHouse instance, so the metrics export task can read OpenRank values directly from ClickHouse, which is much faster than reading from the graph database.
  • Then we can run the monthly export task to generate the metrics data, which normally takes hours to finish. I agree that writeFile instead of writeFileSync may improve performance, but most of the time is consumed by ClickHouse computation, so the improvement may be limited.
  • We also need to export the networks for all the repos and users to be exported. This is a CPU-intensive task on the Neo4j database that may take even more time than the metrics export, although the two tasks can run at the same time.
  • The final step is to upload all the files to OSS with a shell script that uses ossutil. This step also takes 5-6 hours for about 23 million files.

So if we start all the tasks at 11 a.m. on the first day of the month, the OpenRank data import, calculation and export may take about 2 hours, the metrics computation and network export may take about 5 hours, and the data upload may take another 5-6 hours to complete.

So if we can make the whole process parallel and automated, it may take about 12-13 hours to complete, finishing around midnight on the first day of the month.
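Roughly, the target orchestration would look like this (a hypothetical sketch with stub functions standing in for the real steps, not the actual cron code):

```ts
// Hypothetical sketch of the monthly pipeline; the stubs below only
// stand in for the real steps described above.
const waitForClickHouseImport = async () => { /* last month's data lands ~10-11 a.m. on day 1 */ };
const calcAndExportOpenRank   = async () => { /* Neo4j activity + OpenRank, back into ClickHouse, ~2h */ };
const exportMetricsData       = async () => { /* ClickHouse-driven metrics export, ~5h */ };
const exportNetworks          = async () => { /* Neo4j network export, CPU intensive */ };
const uploadToOss             = async () => { /* shell out to ossutil sync, ~5-6h */ };

async function monthlyUpdate(): Promise<void> {
  await waitForClickHouseImport();
  await calcAndExportOpenRank();
  // Metrics export (ClickHouse) and network export (Neo4j) load different systems,
  // so they can run concurrently.
  await Promise.all([exportMetricsData(), exportNetworks()]);
  await uploadToOss();
}

monthlyUpdate().catch(console.error);
```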

But right now the process is not fully automated, so the data may only be updated around the 2nd day of the month. For 2023.5, for example, the data was updated this morning.

@tyn1998
Member Author

tyn1998 commented Jun 2, 2023

@frank-zsy Thanks for the detailed elaboration! This is the first time I have seen the complete steps for exporting the monthly data, and I am convinced that the tasks are indeed time-consuming.

I recommend writing the steps mentioned above into src/cron/README.md so that more interested people can learn how OpenDigger exports its data every month :D

@frank-zsy
Contributor

frank-zsy commented Jun 2, 2023

Agreed, I will add the information to the README file. As for improving the performance, I think several things can be done:

  • Use asynchronous functions instead of synchronous functions in the data export task.
  • I have already tried multiple ways to upload the data to OSS, and currently I think the time consumption is hard to reduce.
    • I tried compressing all the metrics data, uploading the archive to OSS, and then using a serverless decompression service to extract the files. But compressing more than 20 million files into a single archive is very challenging and time-consuming, and furthermore, the serverless service that decompresses files on OSS does not provide enough resources (only 2c4g), so it will definitely crash during decompression.
    • Iterating over more than 20 million files under a single folder is also very challenging, and ossutil does a great job in this situation. Right now, the command I use is:
ossutilmac64 sync ~/github_data/open_digger/github oss://xlab-open-source/open_digger/github --force --job=1000 --meta "Expires:2023-07-01T22:00:00+08:00" --config-file=~/.ossutilconfig-xlab

The command uploads files with 1000 parallel jobs and sets object metadata, which makes the process a little longer than a plain upload. A larger --job parameter, or running the task in the same VPC as OSS, might reduce the time, but I don't think it would help much, because the network throughput is currently not very high, probably because iterating over the files is itself time-consuming.
