[EN] Collect the images of the Armenian monuments #15
Labels
extraction
Tasks that require data extraction (scraping) skills
parsing
Tasks that require data parsing
topic-culture
Tasks dedicated to Armenian culture, language, and history
Goal
Collect the images of the Armenian monuments accompanied by metadata.
Tasks
The task has two components. First, collect the metadata of every image on every page and store it in a machine-readable format. By the metadata, we mean:
The tricky part with the metadata is that the descriptions have no predictable structure. For instance, some contain a year, a location, etc., while others give only the name of a person. One possible approach to solving this may be as follows:
Or you can simply grab the description (as HTML) as a single value and the image URL as another, without any extra parsing.
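The simpler option (raw description HTML plus image URL) could be sketched roughly as follows, using only the Python standard library. The markup in `SAMPLE` is invented for illustration: it assumes every `<img>` is followed by a `<div class="description">`, which has not been checked against the real site, and the names `GalleryParser` and `extract_records` are hypothetical. Selectors would need adjusting to whatever the gallery pages actually contain.

```python
import html
import json
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not change nesting depth.
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link", "wbr"}


class GalleryParser(HTMLParser):
    """Collect image URLs paired with their raw description HTML.

    Assumed (hypothetical) markup: every <img> is followed by a
    <div class="description"> holding its free-form description.
    """

    def __init__(self):
        super().__init__()
        self.records = []   # one dict per image
        self._depth = 0     # > 0 while inside a description <div>
        self._buffer = []   # raw HTML pieces of the current description

    def _open(self, tag, attrs, self_closing):
        attrs = dict(attrs)
        if self._depth:
            # Inside a description: keep the tag exactly as written.
            self._buffer.append(self.get_starttag_text())
            if not self_closing:
                self._depth += 1
        elif tag == "img" and attrs.get("src"):
            self.records.append({"image_url": attrs["src"], "description_html": ""})
        elif tag == "div" and attrs.get("class") == "description" and self.records:
            self._depth = 1

    def handle_starttag(self, tag, attrs):
        self._open(tag, attrs, self_closing=tag in VOID_TAGS)

    def handle_startendtag(self, tag, attrs):
        self._open(tag, attrs, self_closing=True)

    def handle_endtag(self, tag):
        if not self._depth or tag in VOID_TAGS:
            return
        self._depth -= 1
        if self._depth == 0:
            # The description <div> just closed: store its inner HTML.
            self.records[-1]["description_html"] = "".join(self._buffer).strip()
            self._buffer = []
        else:
            self._buffer.append(f"</{tag}>")

    def handle_data(self, data):
        if self._depth:
            # Re-escape text so the stored fragment stays valid HTML.
            self._buffer.append(html.escape(data, quote=False))


def extract_records(page_html):
    """Return a list of {"image_url", "description_html"} dicts."""
    parser = GalleryParser()
    parser.feed(page_html)
    parser.close()
    return parser.records


# Made-up sample page, not copied from the real website.
SAMPLE = """
<div class="item">
  <img src="/images/0001.jpg">
  <div class="description"><b>Church</b>, Ani<br>1910</div>
</div>
<div class="item">
  <img src="/images/0002.jpg">
  <div class="description">Portrait of a priest</div>
</div>
"""

records = extract_records(SAMPLE)
print(json.dumps(records, ensure_ascii=False, indent=2))
```

Keeping the description as an unparsed HTML fragment means nothing is lost at collection time; year, location, or person fields can still be extracted from the stored JSON later.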
Second, download all the pictures and store them in a folder. This folder should be temporarily published as a zipped archive on your server or on a sharing platform, such as Google Drive. After this is done, please let us know and we shall copy the files to our own storage, so that they do not occupy your disk space.
Preferably, both components should be completed. Alternatively, if you have a clear idea of how to collect the metadata but have no room to store the images, the first component alone would be sufficient.
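A minimal sketch of the download-and-zip step, again with the standard library only. The function names (`fetch_bytes`, `download_images`, `zip_folder`) are hypothetical, and the sketch assumes that image URLs end in a usable, unique file name; if the site reuses file names across folders, the naming scheme would need to change.

```python
import shutil
import urllib.request
from pathlib import Path
from urllib.parse import urljoin, urlparse

BASE_URL = "http://www.armenianmonumentsimages.com/"


def fetch_bytes(url):
    """Download one URL and return its body as bytes."""
    with urllib.request.urlopen(url, timeout=30) as response:
        return response.read()


def download_images(image_urls, folder, fetch=fetch_bytes):
    """Save each image into `folder`, skipping files that already exist.

    `fetch` is injectable so the function can be tried without network access.
    Assumption: the last path segment of each URL is a unique file name.
    """
    folder = Path(folder)
    folder.mkdir(parents=True, exist_ok=True)
    saved = []
    for url in image_urls:
        full_url = urljoin(BASE_URL, url)  # resolves relative URLs
        target = folder / Path(urlparse(full_url).path).name
        if not target.exists():
            target.write_bytes(fetch(full_url))
        saved.append(target)
    return saved


def zip_folder(folder):
    """Pack the contents of `folder` into `<folder>.zip` and return its path."""
    return Path(shutil.make_archive(str(folder), "zip", root_dir=str(folder)))
```

A run might look like `download_images(urls, "images")` followed by `zip_folder("images")`, where `urls` is the list of image URLs gathered in the metadata step. Because existing files are skipped, an interrupted run can be restarted without downloading everything again.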
Context
The website presents an impressive collection of historical images of various kinds, from pictures of monuments to old photos of people. The problem is that they are all presented only as a web gallery. There is no option to download these images and their metadata in bulk. In other words, if this website disappears, its huge collection may be lost to the public forever. It would be nice to have a backup for such a project, as well as a convenient way of using these data automatically.
Requirements
A public GitHub repository should be created to store and publish the code and the data under one of the free and open licenses, such as Creative Commons or MIT.
Wishes
It would be best if your code is reusable, that is, if it can be run again by anyone who might want to update the dataset at a later point. For the same reason, we encourage you to comment your code, supplement it with at least a very brief README description, and specify the requirements and dependencies necessary to use the code.
Resources
http://www.armenianmonumentsimages.com/
Parsing may require a library that imitates human interaction with the website.
Prepared by
The Open Data Armenia team prepared this task.