27 changes: 20 additions & 7 deletions ckanext/xloader/command.py
@@ -46,14 +46,27 @@ def _submit_all(self, sync=False, queue=None):
# submit every package
# for each package in the package list,
# submit each resource w/ _submit_package
package_list = tk.get_action('package_search')(
{'ignore_auth': True}, {'include_private': True, 'rows': 1000})
package_list = [pkg['id'] for pkg in package_list['results']]
arguments = {'include_private': True, 'start': 0, 'rows': 10}

response = tk.get_action('package_search')({'ignore_auth': True}, {'include_private': True})
num_pages = (response['count'] + 999) // 1000
arguments['rows'] = 1000
package_list = []

for page in range(0, num_pages):
paged_response = tk.get_action('package_search')({'ignore_auth': True}, arguments)
package_list.extend([pkg['id'] for pkg in paged_response['results']])
Collaborator


Suggestion: Since the whole purpose of pagination is to handle arbitrarily large numbers of datasets, perhaps it would be better to process each batch of 1000 before retrieving the next, rather than assembling a package ID list of arbitrarily large size?
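
A minimal sketch of that batch-at-a-time idea, not the PR's actual code. The `search` and `submit` callables are hypothetical stand-ins introduced for illustration: in the extension they would presumably wrap `tk.get_action('package_search')` and `self._submit_package`. Only one page of IDs is held in memory at a time.

```python
def iter_package_id_batches(search, batch_size=1000):
    """Yield one list of package IDs per page instead of one big list.

    `search` is a hypothetical callable that takes a package_search-style
    data dict and returns a dict with a 'results' list.
    """
    start = 0
    while True:
        response = search(
            {'include_private': True, 'start': start, 'rows': batch_size})
        results = response['results']
        if not results:
            # An empty page means we have walked past the last dataset.
            break
        yield [pkg['id'] for pkg in results]
        start += len(results)


def submit_all_in_batches(search, submit, batch_size=1000):
    """Submit each page of packages before fetching the next page.

    `submit` is a hypothetical callable taking a single package ID.
    Returns the number of packages submitted.
    """
    total = 0
    for batch in iter_package_id_batches(search, batch_size):
        for pkg_id in batch:
            submit(pkg_id)
        total += len(batch)
    return total
```

Stopping on an empty page, rather than precomputing a page count, keeps the loop correct even if the total changes while the command runs.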

Collaborator


Building on @ThrawnCA's input: you could first fetch only the total number of packages in the system (without all the data), then ask the confirmation question.

If the answer is yes, paginate and submit the jobs page by page, so the memory for each page can be released before the next is fetched. This becomes essential when you have 100,000+ datasets and are working in small containers.
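
The count-first flow described here might look roughly like the sketch below. It assumes that a `package_search` call with `rows` set to 0 returns the total in `count` with an empty `results` list; treat that as an assumption to verify against your CKAN version. The `search` and `ask` parameters are hypothetical, added so the logic can be exercised without a running CKAN instance.

```python
def count_packages(search):
    """Return the total number of datasets without fetching any of them.

    Assumption: rows=0 yields {'count': N, 'results': []}.
    `search` is a hypothetical wrapper around the package_search action.
    """
    response = search({'include_private': True, 'rows': 0})
    return response['count']


def confirm_submit(total, ask=input):
    """Ask the operator whether to proceed; anything but 'y' means no."""
    answer = ask(
        'About to submit %d datasets; this could take a few minutes.\n'
        'Do you want to start the process? y/N\n' % total)
    return answer.strip().lower() == 'y'
```

With these two helpers, the command would only start paging through datasets after `confirm_submit(count_packages(search))` returns `True`.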

arguments['start'] += 1000

print('Processing %d datasets' % len(package_list))
user = tk.get_action('get_site_user')(
{'ignore_auth': True}, {})
for p_id in package_list:
self._submit_package(p_id, user, indent=2, sync=sync, queue=queue)
check_start = input('This action could take a few minuts depending on the number of DataSets:\nDid you want to start the process? y/N\n')
Collaborator


minuts/minutes


if check_start == 'y':
user = tk.get_action('get_site_user')({'ignore_auth': True}, {})
for p_id in package_list:
self._submit_package(p_id, user, indent=2, sync=sync, queue=queue)
else:
print('Submit all process stoped')
Collaborator


stoped/stopped


def _submit_package(self, pkg_id, user=None, indent=0, sync=False, queue=None):
indentation = ' ' * indent