Commit 0a4821d

* Add mapping sources file
* Create get active source info helper functions
* Move get_output_row, write_csv_output to helper
* Remove clustering flag and computed it based on config sources
* Description comment in get_circulation_from_entries
* Add helper function to compute entity clusters when using multiple sources
* Update README
* Update documentation
* Disable exclude_below_usd_cent config flag by default
* Disable plot config flag by default
* Refactor mapping process
* Refactor analyzing process
* Refactor db_helper
* Remove schema.py
* Remove old helper functions to get force map balances and analyze flag
* Refactor tests
* README typo
* Change tau computation to return only index
* Add helper function to get tau from param string
* Exclude contract addrs from entries if flag is set
* Change entries object to list of ints instead of tuples
* Add test for excluding contracts flag
* Add small testcase

1 parent d485706 commit 0a4821d

18 files changed, +925 -903 lines changed

Diff for: README.md

+31-29
Original file line number | Diff line number | Diff line change
@@ -13,6 +13,7 @@ Currently, the supported blockchains are:
1313
- Ethereum
1414
- Litecoin
1515
- Tezos
16+
1617
We intend to add more ledgers to this list in the future.
1718

1819
## Installation
@@ -28,44 +29,45 @@ project:
2829

2930
python -m pip install -r requirements.txt
3031

32+
### System requirements
33+
34+
Running the tool requires loading the raw input data on memory. To avoid running
35+
out of memory, we recommend RAM at least double the largest raw data file.
36+
3137
### Mapping information
3238

3339
The mapping information for Cardano is too large for Github.
3440
To retrieve it do the following:
3541
- Download the file from
3642
[here](https://uoe-my.sharepoint.com/:u:/g/personal/dkarakos_ed_ac_uk/EXseoT-v1xBHn1TWG1IvqHIB2L3Pm35-UtKIcUKmk1IQZw?e=YgTfjR&download=1).
37-
- Move the file to the folder `mapping_information/addresses/`. Note that the file should be named `cardano.jsonl`.
43+
- Move the file to the folder `mapping_information/addresses/`. The file _should be named_ `cardano.jsonl`.
3844

3945
## Run the tool
4046

41-
Place all raw data (which could be collected from [BigQuery](https://cloud.google.com/bigquery/) for example) in the `input` directory.
42-
Each file named as `<project_name>_<snapshot_date>_raw_data.json` (e.g. `bitcoin_{2023-01-01}_raw_data.json`). By default, there
43-
is a (very small) sample input file for some supported projects. To use the
44-
samples, remove the prefix `sample_`. For more extended raw data and instructions on how to retrieve it, see
45-
[here](https://blockchain-technology-lab.github.io/tokenomics-decentralization/data/).
46-
47-
Run `python run.py --ledgers <ledger_1> ... <ledger_n> --snapshots <date_1> <date_2>` to produce and analyze the database files.
48-
For each ledger and for each snapshot one SQLite file is created, which contains the address mapping and the balance information.
49-
Note that both arguments are optional, so it's possible to omit one or both of them (in which case the default values
50-
will be used). Specifically:
51-
52-
- The `ledgers` argument accepts any number of supported ledgers (case-insensitive).
53-
For example, `--ledgers bitcoin` runs the analysis for Bitcoin, `--ledgers Bitcoin Ethereum Cardano` runs the analysis
54-
for Bitcoin, Ethereum and Cardano, etc. Ledgers with more words should be defined with an underscore; for example
55-
Bitcoin Cash should be set as `bitcoin_cash`.
56-
- The `snapshots` argument should be of the form `YYYY-MM-DD`.
57-
For example, `--snapshots 2022-02-01` runs it for 1 February 2022.
58-
59-
`run.py` prints on stdout the output of each implemented metric for the specified ledgers and snapshot.
60-
61-
To mass produce and analyze data, omit one or both arguments. If some arguments
62-
are omitted, the default values from `config.yaml` will be used. If only the
63-
`ledgers` is given, all snapshots for which a raw data and/or database file exists will be
64-
analyzed. If only the timeframe is specified, all ledgers will be analyzed for
65-
the given timeframe (if the raw data and/or database files exist).
66-
67-
A single file `output.csv` is also created in the `output` directory, containing the output data from the
68-
last execution of `run.py`.
47+
Place all raw data (which could be collected from
48+
[BigQuery](https://cloud.google.com/bigquery/) for example) in the `input`
49+
directory. Each file named as `<project_name>_<snapshot_date>_raw_data.json`
50+
(e.g. `bitcoin_2023-01-01_raw_data.json`). By default, there is a (very
51+
small) sample input file for some supported projects. To use the samples, remove
52+
the prefix `sample_`. For more extended raw data and instructions on how to
53+
retrieve it, see [here](https://blockchain-technology-lab.github.io/tokenomics-decentralization/data/).
54+
55+
Edit the configuration file `config.yaml` to choose which ledgers to analyze,
56+
for which snapshots, with which metrics, etc (see
57+
[here](https://blockchain-technology-lab.github.io/tokenomics-decentralization/setup/)
58+
for more information on each parameter).
59+
60+
Run `python run.py` to perform the analysis and print on stdout the output of
61+
each implemented metric for the specified ledgers and snapshot.
62+
63+
For each ledger and for the chosen combination of mapping sources, a SQLite file
64+
is created in `mapping_information/addresses`, which contains the address
65+
mapping information.
66+
67+
A single file `output_{params}.csv` is also created in the `output` directory,
68+
containing the output data from the last execution of `run.py`. Here, "params"
69+
corresponds to analysis parameters like whether to apply clustering,
70+
thresholding, etc.
6971

7072
## Contributing
7173

Diff for: config.yaml

+6-5
@@ -23,17 +23,18 @@ ledgers:
2323
# Execution flags
2424
execution_flags:
2525
force_map_addresses: false
26-
force_map_balances: false
27-
force_analyze: false
2826

2927
# Analyze flags
3028
analyze_flags:
31-
clustering: true
29+
clustering_sources:
30+
- "Explorers"
31+
- "Staking Keys"
32+
- "Multi-input transactions"
3233
top_limit_type: "absolute" # one of two types: "absolute" or "percentage"; if absolute then value should be integer; if percentage then value should be float in [0, 1]
3334
top_limit_value: 0
3435
exclude_contract_addresses: false
3536
exclude_below_fees: false
36-
exclude_below_usd_cent: true
37+
exclude_below_usd_cent: false
3738

3839
# The snapshots for which an analysis should be performed.
3940
# Each snapshot is a string of the form YYYY-MM-DD.
@@ -57,7 +58,7 @@ output_directories:
5758

5859
# Plot flags
5960
plot_parameters:
60-
plot: true
61+
plot: false
6162
ledgers:
6263
- bitcoin
6364
- bitcoin_cash
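
The comment on `top_limit_type` above states that an `absolute` limit should be an integer and a `percentage` limit a float in [0, 1]. A minimal sketch of such a check in Python follows. The function name `validate_top_limit` is hypothetical and not part of this commit, and accepting the integers 0 and 1 for `percentage` is an assumption made because YAML parses `0` as an integer.

```python
def validate_top_limit(top_limit_type, top_limit_value):
    """Check a (type, value) pair against the rules in the config comment.

    Hypothetical helper for illustration; not taken from the repository.
    """
    # bool is a subclass of int in Python, so reject it explicitly.
    is_bool = isinstance(top_limit_value, bool)

    if top_limit_type == "absolute":
        if is_bool or not isinstance(top_limit_value, int):
            raise ValueError("absolute top limit should be an integer")
    elif top_limit_type == "percentage":
        # Assumption: the ints 0 and 1 are tolerated as boundary values.
        if is_bool or not isinstance(top_limit_value, (int, float)):
            raise ValueError("percentage top limit should be a number")
        if not 0 <= top_limit_value <= 1:
            raise ValueError("percentage top limit should be in [0, 1]")
    else:
        raise ValueError('top_limit_type should be "absolute" or "percentage"')

    return top_limit_type, top_limit_value
```

With the defaults shown in this diff, `validate_top_limit("absolute", 0)` passes, while a value such as `0.5` would only be accepted together with `"percentage"`.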

Diff for: docs/contribute.md

+9
@@ -38,6 +38,15 @@ To contribute mapping information you can either update an existing file, by
3838
changing and/or adding some entries, or create a new file for a newly-supported
3939
ledger.
4040

41+
Note: If you add an entry in `mapping_addresses` with a source that does not
42+
already exist, you should also add this source in the file
43+
`mapping_information/sources.json`. Specifically, if it comes from a
44+
publicly-available website you should add it under "Explorers", otherwise either
45+
use an existing appropriate keyword or create a new one. If you create a new
46+
one, make sure to also include it in the configuration file `config.yaml` and in
47+
the description of the [Setup
48+
page](https://blockchain-technology-lab.github.io/tokenomics-decentralization/setup/)).
49+
4150
### Price information
4251

4352
The directory `price_data/` contains information about the supported ledgers'

Diff for: docs/setup.md

+10-16
@@ -35,23 +35,19 @@ page](https://blockchain-technology-lab.github.io/tokenomics-decentralization/co
3535
ledgers are included here (to add support for a new ledger see the [conributions
3636
page](https://blockchain-technology-lab.github.io/tokenomics-decentralization/contribute/)).
3737

38-
`execution_flags` defines various flags that control the data handling (all set to false by default):
38+
`execution_flags` defines flags that control the data handling (all set to false by default):
3939

40-
* `force_map_addresses`: if set to true, the address helper data from the directory
40+
* `force_map_addresses`: if set to true, the address mapping data from the directory
4141
`mapping_information` is re-computed; you should set this flag to true if the
42-
data has been updated since the last execution for the given ledger
43-
* `force_map_balances`: is set to true, the balance data of the ledger's addresses is
44-
recomputed; you should set this flag to true if the data has been updated
45-
since the last execution for the given ledger
46-
* `force_analyze`: if set to true, the computation of a metric is recomputed; you should set
47-
this flag to true if any type of data has been updated since the last
48-
execution for the given ledger
42+
mapping data has been updated since the last execution for the given ledger
4943

5044
`analyze_flags` defines various analysis-related flags:
5145

52-
* `clustering`: a boolean that determines whether addresses will be clustered into entities
53-
(as defined in the mapping information). If set to false, no clustering takes
54-
place and the addresses are treated as distinct entities.
46+
* `clustering_sources`: a list of sources that should be used to compute the
47+
address mapping information. If empty, no clustering takes place and the
48+
addresses are treated as distinct entities. The list should contain any
49+
combination of the following options (_case sensitive_): "Explorers", "Staking
50+
Keys", "Multi-input transactions".
5551
* `top_limit_type`: a string that can take one of two values (`absolute` or `percentage`) that
5652
enables applying a threshold on the addresses that will be considered
5753
* `top_limit_value`: the value of the top limit that should be applied; if 0,
@@ -84,9 +80,7 @@ define the source of data. `input_directories` defines the directories that
8480
contain raw address balance information, as obtained from BigQuery or a full
8581
node (for more information about this see the [data collection
8682
page](https://blockchain-technology-lab.github.io/tokenomics-decentralization/data/)).
87-
`output_directories` defines the directories to store the databases which
88-
contain the mapping information and analyzed data. The first entry in the output
89-
directories is also used to store the output files of the analysis and the
90-
plots.
83+
`output_directories` defines the directory to store the output files of the
84+
analysis and the plots.
9185

9286
Finally, `plot_parameters` contains various parameters that control whether plots will be produced for the results and for which configurations.

Diff for: mapping_information/sources.json

+16
@@ -0,0 +1,16 @@
1+
{
2+
"Explorers": [
3+
"https://bitinfocharts.com",
4+
"https://dogecoinwhalealert.com",
5+
"https://www.walletexplorer.com",
6+
"https://api.tzkt.io/",
7+
"https://etherscan.io"
8+
],
9+
"Staking Keys": [
10+
"staking key",
11+
"payment key"
12+
],
13+
"Multi-input transactions": [
14+
"multi-input"
15+
]
16+
}
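
To illustrate how the keywords in `clustering_sources` could relate to the entries of `mapping_information/sources.json`, here is a minimal sketch in Python. The helper name `get_active_sources` and its error handling are assumptions for illustration only; the commit's actual helper functions are not shown in this excerpt. The JSON is embedded inline so the example is self-contained.

```python
import json

# Mirrors the contents of mapping_information/sources.json added in this commit.
SOURCES_JSON = """
{
  "Explorers": [
    "https://bitinfocharts.com",
    "https://dogecoinwhalealert.com",
    "https://www.walletexplorer.com",
    "https://api.tzkt.io/",
    "https://etherscan.io"
  ],
  "Staking Keys": ["staking key", "payment key"],
  "Multi-input transactions": ["multi-input"]
}
"""


def get_active_sources(clustering_sources, all_sources):
    """Flatten the source strings of the configured keywords.

    Hypothetical helper: keywords are matched case-sensitively, as the
    setup documentation describes, and unknown keywords raise an error.
    """
    unknown = [key for key in clustering_sources if key not in all_sources]
    if unknown:
        raise ValueError(f"Unknown clustering sources: {unknown}")

    active = []
    for keyword in clustering_sources:
        active.extend(all_sources[keyword])
    return active


all_sources = json.loads(SOURCES_JSON)
active = get_active_sources(["Staking Keys", "Multi-input transactions"], all_sources)
```

An empty `clustering_sources` list yields an empty result, which matches the documented behaviour that no clustering takes place in that case.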

Diff for: run.py

+1-2
@@ -9,8 +9,7 @@
99

1010
def main(ledgers, snapshot_dates):
1111
for ledger in ledgers:
12-
for snapshot in snapshot_dates:
13-
apply_mapping(ledger, snapshot)
12+
apply_mapping(ledger)
1413

1514
analyze(ledgers, snapshot_dates)
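
The effect of this change is that the mapping step runs once per ledger instead of once per ledger and snapshot. The sketch below reproduces the new control flow with stub functions. The bodies of `apply_mapping` and `analyze` are placeholders that only record calls, and the nested loop inside `analyze` is an assumption, since the real implementations are not part of this excerpt.

```python
calls = []


def apply_mapping(ledger):
    # Stub: record that the mapping ran for this ledger.
    calls.append(("map", ledger))


def analyze(ledgers, snapshot_dates):
    # Stub (assumed structure): one analysis per ledger and snapshot.
    for ledger in ledgers:
        for snapshot in snapshot_dates:
            calls.append(("analyze", ledger, snapshot))


def main(ledgers, snapshot_dates):
    # New flow from the diff: the mapping no longer depends on the snapshot.
    for ledger in ledgers:
        apply_mapping(ledger)

    analyze(ledgers, snapshot_dates)


main(["bitcoin", "ethereum"], ["2023-01-01", "2023-02-01"])
map_calls = [call for call in calls if call[0] == "map"]
analyze_calls = [call for call in calls if call[0] == "analyze"]
```

With two ledgers and two snapshots, the stubbed mapping is invoked twice, whereas the previous nested loop would have invoked it four times.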
1615
