HHobeck
diff --git a/README.md b/README.md
+31-29
diff --git a/config.yaml b/config.yaml
+6-5
diff --git a/docs/contribute.md b/docs/contribute.md
+9
diff --git a/docs/setup.md b/docs/setup.md
+10-16
diff --git a/mapping_information/sources.json b/mapping_information/sources.json
+16
diff --git a/run.py b/run.py
+1-2
@@ -13,6 +13,7 @@ Currently, the supported blockchains are:
 - Ethereum
 - Litecoin
 - Tezos
+
 We intend to add more ledgers to this list in the future.
 
 ## Installation
@@ -28,44 +29,45 @@ project:
 
     python -m pip install -r requirements.txt
 
+### System requirements
+
+Running the tool requires loading the raw input data into memory. To avoid running
+out of memory, we recommend RAM of at least double the size of the largest raw data file.
+
 ### Mapping information
 
 The mapping information for Cardano is too large for Github.
 To retrieve it do the following:
 - Download the file from
 [here](https://uoe-my.sharepoint.com/:u:/g/personal/dkarakos_ed_ac_uk/EXseoT-v1xBHn1TWG1IvqHIB2L3Pm35-UtKIcUKmk1IQZw?e=YgTfjR&download=1).
-- Move the file to the folder `mapping_information/addresses/`. Note that the file should be named `cardano.jsonl`.
+- Move the file to the folder `mapping_information/addresses/`. The file _should be named_ `cardano.jsonl`.
 
 ## Run the tool
 
-Place all raw data (which could be collected from [BigQuery](https://cloud.google.com/bigquery/) for example) in the `input` directory. 
-Each file named as `<project_name>_<snapshot_date>_raw_data.json` (e.g. `bitcoin_{2023-01-01}_raw_data.json`). By default, there
-is a (very small) sample input file for some supported projects. To use the
-samples, remove the prefix `sample_`. For more extended raw data and instructions on how to retrieve it, see
-[here](https://blockchain-technology-lab.github.io/tokenomics-decentralization/data/).
-
-Run `python run.py --ledgers <ledger_1> ... <ledger_n> --snapshots <date_1> <date_2>` to produce and analyze the database files.
-For each ledger and for each snapshot one SQLite file is created, which contains the address mapping and the balance information.
-Note that both arguments are optional, so it's possible to omit one or both of them (in which case the default values
-will be used). Specifically:
-
-- The `ledgers` argument accepts any number of supported ledgers (case-insensitive). 
-For example, `--ledgers bitcoin` runs the analysis for Bitcoin, `--ledgers Bitcoin Ethereum Cardano` runs the analysis 
-for Bitcoin, Ethereum and Cardano, etc. Ledgers with  more words should be defined with an underscore; for example 
-Bitcoin Cash should be set as `bitcoin_cash`.
-- The `snapshots` argument should be of the form `YYYY-MM-DD`. 
-For example, `--snapshots 2022-02-01` runs it for 1 February 2022.
-
-`run.py` prints on stdout the output of each implemented metric for the specified ledgers and snapshot.
-
-To mass produce and analyze data, omit one or both arguments. If some arguments
-are omitted, the default values from `config.yaml` will be used. If only the
-`ledgers` is given, all snapshots for which a raw data and/or database file exists will be
-analyzed. If only the timeframe is specified, all ledgers will be analyzed for
-the given timeframe (if the raw data and/or database files exist).
-
-A single file `output.csv` is also created in the `output` directory, containing the output data from the 
-last execution of `run.py`.
+Place all raw data (which could be collected from
+[BigQuery](https://cloud.google.com/bigquery/) for example) in the `input`
+directory. Each file should be named `<project_name>_<snapshot_date>_raw_data.json`
+(e.g. `bitcoin_2023-01-01_raw_data.json`). By default, there is a (very
+small) sample input file for some supported projects. To use the samples, remove
+the prefix `sample_`. For more extended raw data and instructions on how to
+retrieve it, see [here](https://blockchain-technology-lab.github.io/tokenomics-decentralization/data/).
+
+Edit the configuration file `config.yaml` to choose which ledgers to analyze,
+for which snapshots, with which metrics, etc. (see
+[here](https://blockchain-technology-lab.github.io/tokenomics-decentralization/setup/)
+for more information on each parameter).
+
+Run `python run.py` to perform the analysis and print on stdout the output of
+each implemented metric for the specified ledgers and snapshots.
+
+For each ledger and for the chosen combination of mapping sources, a SQLite file
+is created in `mapping_information/addresses`, which contains the address
+mapping information.
+
+A single file `output_{params}.csv` is also created in the `output` directory,
+containing the output data from the last execution of `run.py`. Here, "params"
+corresponds to analysis parameters like whether to apply clustering,
+thresholding, etc.
 
 ## Contributing
 
 
@@ -23,17 +23,18 @@ ledgers:
 # Execution flags
 execution_flags:
   force_map_addresses: false
-  force_map_balances: false
-  force_analyze: false
 
 # Analyze flags
 analyze_flags:
-  clustering: true
+  clustering_sources:
+    - "Explorers"
+    - "Staking Keys"
+    - "Multi-input transactions"
   top_limit_type: "absolute"  # one of two types: "absolute" or "percentage"; if absolute then value should be integer; if percentage then value should be float in [0, 1]
   top_limit_value: 0
   exclude_contract_addresses: false
   exclude_below_fees: false
-  exclude_below_usd_cent: true
+  exclude_below_usd_cent: false
 
 # The snapshots for which an analysis should be performed.
 # Each snapshot is a string of the form YYYY-MM-DD.
@@ -57,7 +58,7 @@ output_directories:
 
 # Plot flags
 plot_parameters:
-  plot: true
+  plot: false
   ledgers:
     - bitcoin
     - bitcoin_cash
 
@@ -38,6 +38,15 @@ To contribute mapping information you can either update an existing file, by
 changing and/or adding some entries, or create a new file for a newly-supported
 ledger.
 
+Note: If you add an entry in `mapping_addresses` with a source that does not
+already exist, you should also add this source in the file
+`mapping_information/sources.json`. Specifically, if it comes from a
+publicly-available website you should add it under "Explorers", otherwise either
+use an existing appropriate keyword or create a new one. If you create a new
+one, make sure to also include it in the configuration file `config.yaml` and in
+the description of the [Setup
+page](https://blockchain-technology-lab.github.io/tokenomics-decentralization/setup/).
+
 ### Price information
 
 The directory `price_data/` contains information about the supported ledgers'
 
@@ -35,23 +35,19 @@ page](https://blockchain-technology-lab.github.io/tokenomics-decentralization/co
 ledgers are included here (to add support for a new ledger see the [contributions
 page](https://blockchain-technology-lab.github.io/tokenomics-decentralization/contribute/)).
 
-`execution_flags` defines various flags that control the data handling (all set to false by default):
+`execution_flags` defines flags that control the data handling (all set to false by default):
 
-* `force_map_addresses`: if set to true, the address helper data from the directory
+* `force_map_addresses`: if set to true, the address mapping data from the directory
   `mapping_information` is re-computed; you should set this flag to true if the
-  data has been updated since the last execution for the given ledger
-* `force_map_balances`: is set to true, the balance data of the ledger's addresses is
-  recomputed; you should set this flag to true if the data has been updated
-  since the last execution for the given ledger
-* `force_analyze`: if set to true, the computation of a metric is recomputed; you should set
-  this flag to true if any type of data has been updated since the last
-  execution for the given ledger
+  mapping data has been updated since the last execution for the given ledger
 
 `analyze_flags` defines various analysis-related flags:
 
-* `clustering`: a boolean that determines whether addresses will be clustered into entities
- (as defined in the mapping information). If set to false, no clustering takes
-  place and the addresses are treated as distinct entities.
+* `clustering_sources`: a list of sources that should be used to compute the
+  address mapping information. If empty, no clustering takes place and the
+  addresses are treated as distinct entities. The list should contain any
+  combination of the following options (_case sensitive_): "Explorers", "Staking
+  Keys", "Multi-input transactions".
 * `top_limit_type`: a string that can take one of two values (`absolute` or `percentage`) that
   enables applying a threshold on the addresses that will be considered
 * `top_limit_value`: the value of the top limit that should be applied; if 0,
@@ -84,9 +80,7 @@ define the source of data. `input_directories` defines the directories that
 contain raw address balance information, as obtained from BigQuery or a full
 node (for more information about this see the [data collection
 page](https://blockchain-technology-lab.github.io/tokenomics-decentralization/data/)).
-`output_directories` defines the directories to store the databases which
-contain the mapping information and analyzed data. The first entry in the output
-directories is also used to store the output files of the analysis and the
-plots.
+`output_directories` defines the directory to store the output files of the
+analysis and the plots.
 
 Finally, `plot_parameters` contains various parameters that control whether plots will be produced for the results and for which configurations.
@@ -0,0 +1,16 @@
+{
+    "Explorers": [
+        "https://bitinfocharts.com",
+        "https://dogecoinwhalealert.com",
+        "https://www.walletexplorer.com",
+        "https://api.tzkt.io/",
+        "https://etherscan.io"
+    ],
+    "Staking Keys": [
+        "staking key",
+        "payment key"
+    ],
+    "Multi-input transactions": [
+        "multi-input"
+    ]
+}
@@ -9,8 +9,7 @@
 
 def main(ledgers, snapshot_dates):
     for ledger in ledgers:
-        for snapshot in snapshot_dates:
-            apply_mapping(ledger, snapshot)
+        apply_mapping(ledger)
 
     analyze(ledgers, snapshot_dates)
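
The new `clustering_sources` option and the `sources.json` file suggest that mapping entries are kept or dropped according to the category of their source. The following is a minimal sketch of that idea, not the repository's actual implementation. The helper names `allowed_sources` and `filter_entries`, and the shape of a mapping entry (a dict with `address`, `name` and `source` keys), are assumptions made for illustration.

```python
import json

# Same categories as the sources.json file added in this change.
SOURCES_JSON = """
{
    "Explorers": [
        "https://bitinfocharts.com",
        "https://dogecoinwhalealert.com",
        "https://www.walletexplorer.com",
        "https://api.tzkt.io/",
        "https://etherscan.io"
    ],
    "Staking Keys": ["staking key", "payment key"],
    "Multi-input transactions": ["multi-input"]
}
"""


def allowed_sources(sources_by_category, clustering_sources):
    """Return the set of source identifiers for the chosen categories.

    Category names are matched case-sensitively, as the setup docs state.
    (Hypothetical helper, not part of the repository.)
    """
    unknown = [c for c in clustering_sources if c not in sources_by_category]
    if unknown:
        raise ValueError(f"Unknown clustering source(s): {unknown}")
    return {
        source
        for category in clustering_sources
        for source in sources_by_category[category]
    }


def filter_entries(entries, allowed):
    """Keep only mapping entries whose source is allowed.

    The entry format (dict with a "source" key) is an assumption.
    """
    return [entry for entry in entries if entry.get("source") in allowed]


sources_by_category = json.loads(SOURCES_JSON)
example_entries = [
    {"address": "addr_1", "name": "Entity A", "source": "https://etherscan.io"},
    {"address": "addr_2", "name": "Entity B", "source": "multi-input"},
]

allowed = allowed_sources(sources_by_category, ["Explorers"])
kept = filter_entries(example_entries, allowed)
print([entry["address"] for entry in kept])  # ['addr_1']
```

With an empty `clustering_sources` list the allowed set is empty, so no entry survives the filter and every address is treated as its own entity, which matches the behaviour described in `docs/setup.md`.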