[EN] Collect the images of the Armenian monuments #15

Open
ansakoy opened this issue Jun 14, 2023 · 0 comments
Labels
extraction: Tasks that require data extraction (scraping) skills · parsing: Tasks that require data parsing · topic-culture: Tasks dedicated to Armenian culture, language and history

ansakoy commented Jun 14, 2023

Goal

Collect the images of the Armenian monuments accompanied by metadata.

Tasks

The task has two components. First, it is necessary to collect the metadata of every image on every page and store it in a machine-readable format. By metadata, we mean:

  • the attributes that form the description of each individual image;
  • the URL of each image;
  • the relative path to the file in a folder.
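One possible shape for such a record is sketched below. The field names (`description_html`, `image_url`, `file_path`) and the choice of JSON Lines as the output format are assumptions made for illustration, not something this task prescribes.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class ImageRecord:
    """One image and its metadata (field names are illustrative)."""
    description_html: str  # description exactly as found on the page
    image_url: str         # absolute URL of the image
    file_path: str         # path of the downloaded file, relative to the dataset root


def to_json_lines(records):
    """Serialize records as JSON Lines: one JSON object per line."""
    return "\n".join(json.dumps(asdict(r), ensure_ascii=False) for r in records)


if __name__ == "__main__":
    # Placeholder values, not taken from the real website.
    example = ImageRecord(
        description_html="<p>Example description</p>",
        image_url="http://example.org/photos/example.jpg",
        file_path="images/example.jpg",
    )
    print(to_json_lines([example]))
```

JSON Lines keeps each record on its own line, so a partially completed crawl still yields a readable file; CSV or a single JSON array would work as well.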

The tricky part with the metadata is that the descriptions have no predictable structure. For instance, some may contain a year, a location, etc., while others only have the name of a person. One possible approach to solving this is as follows:

  • Analyze a number of image descriptions;
  • Make a parser to grab the values you managed to single out;
  • Store these values in separate fields in your output data;
  • Store the whole unparsed description text as HTML in a separate field (just in case your parser misses something important).
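The steps above could be sketched roughly as follows, using only the Python standard library. The year pattern and the output keys are assumptions; the real descriptions need to be inspected before deciding which fields are worth extracting. Whatever the parser fails to recognize stays available in the raw HTML field.

```python
import re
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects the text content of an HTML fragment, ignoring tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def html_to_text(fragment):
    """Return the fragment's text with tags removed and whitespace collapsed."""
    extractor = _TextExtractor()
    extractor.feed(fragment)
    extractor.close()
    return " ".join("".join(extractor.parts).split())


# Assumption: a year appears as a standalone four-digit number (1000-2099).
YEAR_RE = re.compile(r"\b(1[0-9]{3}|20[0-9]{2})\b")


def parse_description(description_html):
    """Extract what can be recognized and keep the raw HTML as a fallback."""
    text = html_to_text(description_html)
    match = YEAR_RE.search(text)
    return {
        "year": int(match.group(1)) if match else None,
        "text": text,
        "description_html": description_html,  # unparsed original
    }
```

For a made-up description such as `<p>Church, <b>1905</b></p>`, `parse_description` would fill `year` with `1905`; for a description with no recognizable year, `year` stays `None` and nothing is lost, because the original HTML is stored alongside.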

Or you can simply grab the description (as HTML) as a single value and the image url as another, without any extra parsing.

Second, download all the pictures and store them in a folder. This folder should be temporarily published as a zipped archive on your server or on a sharing platform, such as Google Drive. Once this is done, please let us know and we will copy the files to our own storage, so that they do not occupy your disk space.
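A minimal sketch of the download and archiving step is shown below. The function names and the `images` folder are hypothetical, and deriving the file name from the last segment of the URL is an assumption that may cause collisions if different images on the site share a file name.

```python
import posixpath
import shutil
import urllib.request
from pathlib import Path
from urllib.parse import unquote, urlsplit


def relative_path_for(image_url, folder="images"):
    """Derive the relative file path for an image from its URL."""
    name = posixpath.basename(unquote(urlsplit(image_url).path))
    if not name:
        raise ValueError(f"no file name in URL: {image_url}")
    return f"{folder}/{name}"


def download_image(image_url, root="."):
    """Download one image under root and return its relative path."""
    rel = relative_path_for(image_url)
    target = Path(root) / rel
    target.parent.mkdir(parents=True, exist_ok=True)
    with urllib.request.urlopen(image_url, timeout=30) as response, \
            open(target, "wb") as out:
        shutil.copyfileobj(response, out)
    return rel


def zip_folder(folder, archive_base):
    """Pack the folder's contents into archive_base + '.zip'; return its path."""
    return shutil.make_archive(str(archive_base), "zip", root_dir=str(folder))
```

The value returned by `download_image` can be stored directly as the relative path in the metadata, so that each record points at its file inside the archive. A real crawl would likely also want pauses between requests and a retry on failure.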

Preferably, both components should be completed. Alternatively, if you have a clear idea of how to collect the metadata but have no room to store the images, the first component alone would be sufficient.

Context

The website presents an impressive collection of historical images of various kinds, from pictures of monuments to old photos of people. The problem is that they are all presented as a web gallery only. There is no option to download these images and their metadata in bulk. In other words, if this website disappears, its huge collection may be lost to the public forever. It would be good to have a backup of such a project, as well as a convenient way to use these data programmatically.

Requirements

A public GitHub repository should be created to store and publish the code and the data under one of the free and open licenses, such as Creative Commons or MIT.

Wishes

It would be best if your code is reusable, that is, it can be run again by anyone who might want to update the dataset at a later point. For the same reason, we encourage you to comment your code, supplement it with at least a very brief README description, and specify the requirements and dependencies necessary to use the code.

Resources

http://www.armenianmonumentsimages.com/

Scraping the site may require a library that imitates human interaction with the website, such as a browser automation tool.

Prepared by

The Open Data Armenia team prepared this task.

ansakoy added the parsing, extraction, and topic-culture labels on Jun 14, 2023