II-20 is a multimedia analytics system for intelligent analytic categorization of image collections. II-20 loads your dataset and allows you to define your categories of relevance - called buckets - to which you add images you deem relevant. II-20's AI model learns to understand the buckets, providing you with instant suggestions of relevant items. You can add, delete, redefine, or update buckets at any time, making II-20's analytics truly flexible.
There are two modes in which you can conduct your analytics - a classic grid interface, and a playful "Tetris" interface in which images flow from the top into the buckets at the bottom (useful if you want to focus on individual images, or just want a change of pace). The process is fully interactive, and the system stays responsive even on large datasets (hundreds of thousands to millions of images).
Feedback welcome. If you use II-20 and have any comments or suggestions (especially if you have used II-20 on your own data - see how below), please do not hesitate to send them to me at jan@zahalka.net.
If you are using II-20 or its parts in your scientific work, please cite the II-20 paper:
J. Zahálka, M. Worring, and Jarke J. van Wijk. II-20: Intelligent and pragmatic analytic categorization of image collections. IEEE Transactions on Visualization and Computer Graphics, 27 (2), pp. 422 - 431, February 2021.
(https://arxiv.org/abs/2005.02149)
II-20 is licensed under the terms of the MIT License (see LICENSE.txt).
II-20 is implemented as a Django web app using scientific and deep learning Python libraries in the backend, with the front end realized in React.js. The software was tested on Ubuntu and Mac OS. I am not aware of any specific reason it shouldn't run on Windows, but I have not tested that. In this section, we describe how to get started with analytics on demo data (a random 10K subset of YFCC100M).
- Clone this repository. In further text, `$II20_ROOT` denotes the root directory of the repository. `cd` to it.
- Download the demo dataset from here: https://data.ciirc.cvut.cz/public/groups/ml/zahalka/yfcc10k.zip. Store it wherever convenient for you, unzip it, and note the absolute path to the dataset (the `yfcc10k` directory). Open the `$II20_ROOT/ii20/data/datasets/yfcc10k.json` file in a text editor, change the `root_dir` entry to the absolute path to the dataset, save and close the JSON file.
- Install the prerequisites: `sudo apt-get install virtualenv mysql-server libmysqlclient-dev` (on Ubuntu; if using a different distro or Mac OS, install the equivalents).
- Create the virtual environment: `virtualenv -p python3 env_ii20`.
- Activate the virtual environment: `source env_ii20/bin/activate`.
- Install the required Python packages: `pip install -r requirements.txt`.
- Generate the app's secret key:

```
cd scripts
python generate_secret_key.py
```

- Create the app's DB user on II-20's side: `python db_superuser.py` (and note the DB user info, further denoted by `<db_username>` and `<db_password>`).
- Create the MySQL database used by the system and the DB user on the DB side:

```
sudo mysql -u root
CREATE DATABASE ii20;
CREATE USER '<db_username>'@'localhost' IDENTIFIED BY '<db_password>';
GRANT ALL PRIVILEGES ON ii20.* TO '<db_username>'@'localhost';
exit
```

- Run the database migrations:

```
cd ../ii20
python manage.py migrate
```

- Create the Django superuser: `python manage.py createsuperuser`, and note the username (further: `<django_admin_username>`) and password (further: `<django_admin_password>`).
- (Optional, but recommended) Set up a user account other than the Django superuser to log in to II-20 (if skipped, you can log in with the Django superuser credentials). First, `cd $II20_ROOT/ii20`. Then, run the server: `python manage.py runserver`. Open your web browser, go to `localhost:8000/admin`, log in to the admin interface with the Django superuser credentials, and create the new user account there.
After installing II-20, run the server with the following commands (don't forget to activate your virtual environment first):

```
cd $II20_ROOT/ii20
python manage.py runserver
```

Then open your web browser, go to `localhost:8000`, log in to the system, select your dataset, and start your analytic session.
In this section, we describe how you can use II-20 on your own image dataset. Let `$DATASET_ROOT` denote the absolute path to the root directory of your dataset.
- Download the ImageNetShuffle 13k deep net from here: http://isis-data.science.uva.nl/koelma/pthmodels/resnet101_rbps13k_scratch_b256_lr0.1_nep75_gpus1x4/model_best.pth. Store it in `$II20_ROOT/ii20/data/mlmodels` (create the directory if it doesn't exist).
- Create the dataset JSON config file for your dataset (this is the basic version; for all accepted configs, refer to the Dataset config section below):

```json
{
    "root_dir": "$DATASET_ROOT",
    "load": false
}
```

- Store the JSON config file at `$II20_ROOT/ii20/data/datasets/<dataset_name>.json`. The `<dataset_name>` is the name used for your dataset on the dataset selection screen.
- `cd $II20_ROOT/ii20`
- Process the dataset: `python manage.py processdataset <dataset_name>`. The dataprocessing script will first check your dataset config. Then, it will find all images in `$DATASET_ROOT` and its subdirectories. Non-image files are ignored, and note that every image is treated as unique and unrelated to the others; if you have multiple versions of the same images in subdirectories of `$DATASET_ROOT`, consider cleaning up before you start dataprocessing. Then, features are extracted from the images and compressed into an efficient interactive learning representation. Finally, the collection index is constructed.
- Open `$II20_ROOT/ii20/data/datasets/<dataset_name>.json` again, and set `load` to `true`.
- Your dataset should now be selectable in II-20 and you should be able to perform analytics on it.
The basic version of the dataset config should do the trick, but if you need to change where the feature files are stored or tweak other parameters, here is a full reference of the accepted config values:
- `root_dir` (required) --- The absolute path to the root directory of your dataset.
- `load` (required) --- A boolean flag (`true` or `false`) denoting whether the dataset should be loaded into II-20 on system startup. This should always be `false` before the dataset has been processed successfully (otherwise II-20 will crash on start-up due to missing feature/index files). Post processing, this can be used to switch datasets on and off (toggling between making analytics available on the dataset and saving memory and resources).
- `image_ordering` (optional) --- The path (relative to `root_dir`) to the list of images in the dataset that constitutes the "canonical ordering" (the images in the feature representations and the index are in the same order). The list itself is constructed automatically during dataprocessing. Default: `"image_ordering.json"`.
- `il_raw_features_path` (optional) --- The path (relative to `root_dir`) where the raw, uncompressed concept features are stored. Default: `"ii20model/il_raw_features.h5"`.
- `il_features_path` (optional) --- The path (relative to `root_dir`) where the compressed concept features used by II-20's interactive learning component are stored. Default: `"ii20model/il_features.h5"`.
- `il_n_processes` (optional) --- The number (positive integer) of CPU processes used to compress the interactive learning features. Default: 1.
- `il_n_feat_per_image` (optional) --- The number (positive integer) of top features by value to be preserved in the compressed interactive learning features. Default: 50.
- `index_features_path` (optional) --- The path (relative to `root_dir`) where the abstract features used to construct II-20's collection index are stored. Default: `"ii20model/index_features.h5"`.
- `index_dir_path` (optional) --- The path (relative to `root_dir`) to the directory where the index data structures are stored. Default: `"ii20model/index"`.
- `index_n_submat` (optional) --- The number of product-quantization submatrices (column-wise splits), and thus subquantizers, to be used. This must be a positive integer, and the number of features (columns) in the feature matrix at `index_features_path` must be divisible by it. The number of abstract features extracted by ImageNetShuffle 13k is 2048, so powers of 2 work here. Default: 32.
II-20 uses a fairly standard Django project structure. There are three Django apps in II-20:
- `ui` --- The frontend, which is chiefly in React.js (files in `$II20_ROOT/ii20/ui/src/components`, the entry class is `II20Main.js`). Hooked to Django through the `ui/templates/ui/analytics.html` template (which essentially provides a container for the UI and specifies `main.js`, a JavaScript file compiled from the React.js components, as the entry point).
- `aimodel` --- The "live" backend during the analytics sessions. This is, essentially, where the II-20 model resides. The entry class is `AnalyticSession`, which in turn is a UI-backend middleman wrapper relying on `AIModel`, which provides the actual intelligent functionality. The most notable class invoked by `AIModel` is `Bucket`, which encapsulates all of the intelligence of each bucket, i.e., analytic category.
- `data` --- Responsible both for dataprocessing (converting raw data to II-20 data structures) and for handling the data within the analytic session proper.
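The wrapper structure of the backend classes can be caricatured in a few lines. This is a toy sketch of the relationships only, not II-20's actual model code; all method names below are invented for illustration.

```python
class Bucket:
    """Toy stand-in for II-20's Bucket: one analytic category's model."""
    def __init__(self, name):
        self.name = name
        self.images = []

    def add(self, image_id):
        # In II-20, assigning an image would also update the bucket's model.
        self.images.append(image_id)

class AIModel:
    """Toy stand-in for AIModel: owns the buckets, provides the intelligence."""
    def __init__(self):
        self.buckets = {}

    def create_bucket(self, name):
        self.buckets[name] = Bucket(name)

class AnalyticSession:
    """Toy stand-in for AnalyticSession: UI-backend middleman over AIModel."""
    def __init__(self):
        self.model = AIModel()

    def create_bucket(self, name):
        # Forwards UI requests to the model, mirroring the wrapper role.
        self.model.create_bucket(name)

    def assign(self, name, image_id):
        self.model.buckets[name].add(image_id)
```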
The code is documented by inline comments, NumPy style docstrings, and my best attempt at good coding style (PEP 8 adherence, making sure the variable/method names are meaningful...).