In this project we're analysing GDELT data with GraphQL. GDELT is a free, constantly-updating data source that publishes world-event data every 15 minutes.
The project was done by:
- Riccardo [https://github.com/riccardotommasini]
- Maxim [https://github.com/MaximSantalov]
- Karl-Gustav [https://github.com/KGKallasmaa]
The project was part of the Big Data Management course at the University of Tartu. You can read more about it on Medium.
Docker
Python 3.5
Node.js
All of the commands in this block should be run sequentially, each in its own terminal window.
- Starting the Kafka cluster and database
bash kafka_cluster.sh
- Starting the Kafka producer
bash producer.sh
- Starting the Kafka consumer
bash consumer.sh
- Starting the production-server
bash server.sh
nodemon src/server.jsx
Navigate to localhost:3000. There you can find the GraphQL GUI. It's advisable to study src/graphql/schema.jsx beforehand.
There are currently 10 queries:
- everything() -> returns every value in the database
- top_nr_source(n:Int) -> returns the top n values with the most sources
- get_results_between_time_periods(FractionDate_start:Float,FractionDate_end:Float) -> returns the results between two dates
- get_results_between_tones(min_tone:Float,max_tone:Float) -> returns the results between two tone values
- get_actions_month(month:String) -> returns the actions within a given month
- get_data_with_n_events_happend_in_dates(n:Int, start_SQLDATE:String, end_SQLDATE:String) -> returns the values that happened between two dates and that had at least n events in a month
- get_top_n_actors_with_most_mentions_per_day(n:Int,start_SQLDATE:String,end_SQLDATE:String) -> returns n actors per day between the two dates, sorted by the number of mentions
- get_top_n_negative_actors_near_location(n:Int, actor1Geo_Lat:Float,actor1Geo_Long:Float, start_SQLDATE:String,end_SQLDATE:String) -> returns the top n values for every day between two dates that happened within 100 km of the given location
- find_n_most_powerful_actor_events_using_pagerank_between_two_dates(n:Int,start_SQLDATE:String,end_SQLDATE:String) -> returns the n most powerful actors between two dates, as determined by the PageRank algorithm
- find_n_most_powerful_domains_between_two_dates(n:Int,start_SQLDATE:String,end_SQLDATE:String,Geo_Lat:Float,Geo_Long:Float) -> returns the n most powerful news sites within 1,000 km of the given location between the two dates
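One of the queries above ranks actors with PageRank. This README does not describe how the server builds its graph or runs the algorithm, so the following is only a minimal sketch of PageRank by power iteration. It assumes a hypothetical edge list of (source, target) actor pairs; the function names, the damping factor of 0.85 and the iteration count are illustrative choices, not taken from the project.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Power-iteration PageRank over a list of (source, target) pairs.

    The edge list is a hypothetical input; how the project derives
    its graph from GDELT events is not stated in the README.
    """
    nodes = sorted({node for edge in edges for node in edge})
    out_links = {node: [] for node in nodes}
    for src, dst in edges:
        out_links[src].append(dst)

    count = len(nodes)
    rank = {node: 1.0 / count for node in nodes}
    for _ in range(iterations):
        # Rank held by nodes without outgoing links is spread evenly.
        dangling = sum(rank[node] for node in nodes if not out_links[node])
        base = (1.0 - damping) / count + damping * dangling / count
        new_rank = {node: base for node in nodes}
        for src in nodes:
            for dst in out_links[src]:
                new_rank[dst] += damping * rank[src] / len(out_links[src])
        rank = new_rank
    return rank


def top_n(rank, n):
    """Return the n highest-ranked nodes, best first."""
    return sorted(rank, key=rank.get, reverse=True)[:n]


# Hypothetical toy graph: A and B both point at C, C points back at A.
EDGES = [("A", "C"), ("B", "C"), ("C", "A")]
RANKS = pagerank(EDGES)
```

Calling top_n(RANKS, 1) on this toy graph picks the node that receives the most incoming rank.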
Example 1
{
everything {
GLOBALEVENTID
}
}
{
"data": {
"everything": [
{
"GLOBALEVENTID": "932366174"
},
{
"GLOBALEVENTID": "932366175"
},
...
]
}
}
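The same query can also be sent from a script instead of the GUI. The sketch below uses only the Python standard library and follows the common GraphQL-over-HTTP convention of a JSON POST body with a "query" field. The endpoint path /graphql is an assumption (the README only says the GUI is on localhost:3000), and build_request and run_query are illustrative names.

```python
import json
import urllib.request

# Assumption: the endpoint path. Check src/server.jsx for the real one.
GRAPHQL_URL = "http://localhost:3000/graphql"

# The query from Example 1.
QUERY = "{ everything { GLOBALEVENTID } }"


def build_request(query, url=GRAPHQL_URL):
    """Wrap a GraphQL query string in a JSON POST request."""
    body = json.dumps({"query": query}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_query(query, url=GRAPHQL_URL):
    """Send the query and return the decoded JSON response.

    Requires the server from the steps above to be running.
    """
    with urllib.request.urlopen(build_request(query, url)) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the server running, run_query(QUERY) should return a dictionary shaped like the response shown above.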
Example 2
{
get_top_n_actors_with_most_mentions_per_day(n: 5, start_SQLDATE: "20200520", end_SQLDATE: "20200701") {
SQLDATE
events {
Actor1Name
}
}
}
{
"data": {
"get_top_n_actors_with_most_mentions_per_day": [
{
"SQLDATE": "20200531",
"events": [
{
"Actor1Name": "CROATIA"
},
{
"Actor1Name": "AMIT"
},
{
"Actor1Name": "LAWYER"
},
{
"Actor1Name": "AMIT"
},
{
"Actor1Name": "NEW SOUTH WALES"
}
]
},
...
]
}
}
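The SQLDATE arguments in this example are strings in YYYYMMDD form. As a small sketch, the query text could be generated from Python date objects like this; to_sqldate and top_actors_query are hypothetical helper names, not part of the project.

```python
from datetime import date


def to_sqldate(day):
    """Format a date as a YYYYMMDD string, as used in the examples."""
    return day.strftime("%Y%m%d")


def top_actors_query(n, start, end):
    """Build the Example 2 query text for the given n and date range."""
    return (
        "{{ get_top_n_actors_with_most_mentions_per_day("
        'n: {n}, start_SQLDATE: "{start}", end_SQLDATE: "{end}") '
        "{{ SQLDATE events {{ Actor1Name }} }} }}"
    ).format(n=n, start=to_sqldate(start), end=to_sqldate(end))


# Reproduces the arguments used in Example 2.
EXAMPLE_QUERY = top_actors_query(5, date(2020, 5, 20), date(2020, 7, 1))
```

The doubled braces are str.format escapes; the resulting string contains single braces, as GraphQL expects.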
Example 3
{
find_n_most_powerful_domains_between_two_dates(n: 5,
start_SQLDATE: "20200601", end_SQLDATE: "20200701",
Geo_Lat: 51.5074, Geo_Long: 0.1278)
}
{
"data": {
"find_n_most_powerful_domains_between_two_dates": [
"express.co.uk",
"famagusta-gazette.com",
"telegraph.co.uk",
"dw.com",
"sbs.com.au"
]
}
}
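This query keeps only results within 1,000 km of the given coordinates. The README does not say how the server measures that distance; one common approach is the haversine great-circle formula, sketched below. The Earth radius of 6371 km, the function names and the "lat"/"lon" dictionary keys are assumptions made for illustration.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius (assumed value)


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def within_radius(events, lat, lon, radius_km):
    """Keep events whose hypothetical "lat"/"lon" fields lie inside the radius."""
    return [
        event for event in events
        if haversine_km(lat, lon, event["lat"], event["lon"]) <= radius_km
    ]


# Hypothetical events; the coordinates are Paris and Sydney.
EVENTS = [
    {"name": "paris", "lat": 48.8566, "lon": 2.3522},
    {"name": "sydney", "lat": -33.8688, "lon": 151.2093},
]
# Same centre point as in Example 3.
NEARBY = within_radius(EVENTS, 51.5074, 0.1278, 1000)
```

Only the Paris entry lies within 1,000 km of the centre point, so NEARBY contains a single event.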
We're happy if you want to contribute to this project. GitHub Super-Linter analyses the code beforehand.
If docker-compose fails with 'ERROR: Version in "./docker-compose.yml" is unsupported', reinstall it:
(1) sudo apt-get remove docker-compose OR sudo rm /usr/local/bin/docker-compose
(2) sudo curl -L "https://github.com/docker/compose/releases/download/1.23.2/docker-compose-$(uname -s)-$(uname -m)" -o /usr/local/bin/docker-compose
(3) sudo chmod +x /usr/local/bin/docker-compose
(4) sudo ln -s /usr/local/bin/docker-compose /usr/bin/docker-compose