d-portal database dump #393
-
Update:
-
@wakibi @akmiller01 what do you think?
-
Unpacking the file right now. It's about 38 GB fully extracted, so we might not want to run a sync daily. Coincidentally we're both on Postgres 12.5, so there should be no compatibility issues. We would probably need to edit the SQL before importing, as ownership is given to a user.
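A minimal sketch of the kind of SQL editing meant here, assuming the extracted dump is plain SQL named dstore.sql and that ownership is assigned via standard ALTER ... OWNER TO statements; the target database name analyst_ui is a placeholder:

```bash
# Strip ownership statements from the plain-SQL dump before importing,
# so restored objects end up owned by whoever runs the import.
# Assumes standard pg_dump output with "ALTER ... OWNER TO ...;" lines;
# "analyst_ui" is a placeholder database name.
sed -E '/^ALTER .+ OWNER TO .+;$/d' dstore.sql > dstore_noowner.sql
psql -d analyst_ui -f dstore_noowner.sql
```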
-
There are options to import the schema and then import the data table by table, @wakibi: https://thequantitative.medium.com/restoring-individual-tables-from-postgresql-pg-dump-using-pg-restore-options-ef3ce2b41ab6 Any other thoughts on how we could selectively import things into the proper schema/table names? Maybe we could even do string replacement in Bash, e.g. the sketch below.
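A minimal sketch of that Bash string-replacement idea, assuming (hypothetically) the dump qualifies its objects with public. and we want them to land in a dedicated d_portal schema; the schema name, database name, and file names are all placeholders:

```bash
# Crude Bash string replacement: point schema-qualified references in the
# dump at a dedicated d_portal schema instead of public. This will also
# touch any matching text inside data rows, so spot-check the output.
sed 's/public\./d_portal./g' dstore.sql > dstore_renamed.sql
psql -d analyst_ui -c 'CREATE SCHEMA IF NOT EXISTS d_portal;'
psql -d analyst_ui -f dstore_renamed.sql
```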
-
Tested an import over the weekend. It took 1750 minutes to fully load the .sql file. Unfortunately it's not a binary pg_dump, the file is just raw SQL, so we cannot use pg_restore to selectively import bits and pieces. With a full import taking about 30 hours (which would also significantly slow down all queries against the database during those 30 hours), I don't think this particular file is feasible. If we want a d-portal mirror, I would suggest we reach back out to Kriss and Shi and ask for a streamlined export of just what we need. Otherwise it would be much more efficient to extract budget data ourselves.
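For reference, a quick way to tell the two dump formats apart: custom-format pg_dump archives begin with the magic bytes PGDMP, while plain-SQL dumps are just text, and only the former works with pg_restore. The file name below is a placeholder.

```bash
# Custom-format pg_dump archives start with the magic bytes "PGDMP";
# plain-SQL dumps are just text. Only the custom format works with
# pg_restore's selective options.
if head -c 5 dump_file | grep -q 'PGDMP'; then
  pg_restore -l dump_file | head   # list the archive's table of contents
else
  echo "Plain SQL dump: import with psql -f instead of pg_restore"
fi
```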
-
With that performance it can’t be of much use to anyone. Worth a chat with Kriss?
-
If the main reason we want this is for budget data, I think it might be better to work with an IATI tool that we know has long-term support, e.g. the registry.
-
Hello, thanks for having a go at this and sharing your insights into the process. As well as the file we've mentioned, there are a couple more options, all updated nightly:
- a pg_dump custom-format archive, so pg_restore can be used with its various options
- a zip of all the raw cached XML
For us, a full restore of the database takes about 7 hours, or 2 hours if you use multithreading with pg_restore. Almost all of the time is spent rebuilding indexes, so it's not the data itself that takes the time. We create a lot of indexes because, as a public-facing site, we need the database queries to run as fast as possible. @akmiller01 I'm wondering if most of the 30-hour import you experienced was from slow disk access.
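A minimal sketch of the multithreaded restore described above, assuming a custom-format archive; the archive name, database name, and job count are placeholders:

```bash
# Parallel restore of a custom-format archive; -j runs several worker
# jobs at once, which mainly speeds up the index-rebuild phase that
# dominates the restore time.
createdb dportal_mirror
pg_restore -j 4 --no-owner -d dportal_mirror dstore.custom
```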
-
Is this worth importing so that we have access to all non-transaction data?
From Shi on Discord:
Would a database dump like this be useful to anyone? It's about 3GB in size and is the entire d-portal database which you have live access to via http://d-portal.org/dquery/. It can be imported locally and queried using the same SQL code that is used on the web interface, so you can run large queries on it without clogging up d-portal.
http://d-portal.org/db/dstore.sql.gz
This is currently a test file but could be updated nightly after each import. We've been getting lots of specific (and expensive) queries from people about getting data out of d-portal in many unique ways (like IATI/D-Portal#589) and thought this might be useful.
Let us know if it is and we can make it better o/
This is a PostgreSQL dump, though, and uses the json data type in PostgreSQL. Technically, you could spin up a PostgreSQL server and import it just as is and start querying.
@akmiller01 @dean-breed @wakibi @k8hughes
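A minimal sketch of importing the linked dump locally, assuming the file is plain SQL as established above; the local database name is arbitrary:

```bash
# Fetch the nightly dump and load it into a fresh local database.
# "dportal" is a placeholder database name; since the linked file is
# plain SQL, psql (not pg_restore) does the import.
curl -O http://d-portal.org/db/dstore.sql.gz
createdb dportal
gunzip -c dstore.sql.gz | psql -d dportal
```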