GMTMaps.jmd

# Visualizing Spatial Data with GMT

GMT is the "Generic Mapping Tools" a powerful toolset for geospatial visualizations. With great power comes great learning curves.
The goal here is to make it possible for someone who has a data analysis background but not much specific knowledge of geospatial software to
make maps that show their data in 2D maps, and simultaneously to learn something about the framework and what it is capable of, and how to
use it.

# Problem 1: a Choropleth map of US Census Data

One of the most common tasks in basic geospatial data analysis is to create a map in which regions of the world are colored based on
the value of some statistic pertaining to the region. We will start with a very simple map of population for each county in the US. 
The data for this can be found from the Census bureau at https://www2.census.gov/programs-surveys/popest/datasets/2010-2020/counties/totals/co-est2020-alldata.csv

```julia
using CSV,DataFrames,GMT,Printf,DataFramesMeta

countypopurl = "https://www2.census.gov/programs-surveys/popest/datasets/2010-2020/counties/totals/co-est2020-alldata.csv"

download(countypopurl,"data/countypops.csv")

cpopall = CSV.read("data/countypops.csv",DataFrame)

cpop = cpopall[:,[:STATE,:COUNTY,:STNAME,:CTYNAME,:POPESTIMATE2020]]

```

Now we have a large number of different estimates for each county in the US. For simplicity let's use the POPESTIMATE2020 variable. The COUNTY 
variable is the FIPS county code for the county.

In order to build a Choropleth map, we will need to have a file that defines the spatial boundaries of each county! GMT is a generic mapping tool
It will read various geospatial data formats which can be used to define the regions to be plotted. The Census bureau has a file which defines
the boundaries of the counties at 3 different resolutions. These are in "shapefile" format. The files and other related shapefiles are available
[at the Census website](https://www.census.gov/geographies/mapping-files/time-series/geo/carto-boundary-file.html). We will use the medium
resolution version of the county shapefiles (5 million meter resolution).

```julia{eval=false}
countyshapeurl = "https://www2.census.gov/geo/tiger/GENZ2018/shp/cb_2018_us_county_5m.zip"

download(countyshapeurl,"data/cb_2018_us_county_5m.zip")
cd("data")
run(`unzip cb_2018_us_county_5m.zip`)
cd("..")

```

Inside the zip file are a number of files, the one ending in ".shp" is in the shape file format. We can read this using GMT's function
gmtread

```julia
counties = gmtread("data/cb_2018_us_county_5m.shp")
```

Now, what kind of thing is "counties?" 

```julia
typeof(counties)
```

It's a Vector{GMTdataset}. Basically GMT treats the shape file as a collection of shapes, each shape gets its own GMTdataset, which is a structure

```{eval=false}
search: GMTdataset

  No documentation found.

  Summary
  ≡≡≡≡≡≡≡≡≡

  mutable struct GMTdataset{T<:Real, N}

  Fields
  ≡≡≡≡≡≡≡≡

  data     :: Array{T<:Real, N}
  ds_bbox  :: Vector{Float64}
  bbox     :: Vector{Float64}
  attrib   :: Dict{String, String}
  colnames :: Vector{String}
  text     :: Vector{String}cpop.CFIPS = format("%03d",cpop.COUNTY)

  geom     :: Int64

  Supertype Hierarchy
  ≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡

  GMTdataset{T<:Real, N} <: AbstractArray{T<:Real, N} <: Any
```

In addition to "data" which is lattitude and longitude values in this case, there are a variety of metadata fields. one of which is "header".

Let's look at the header of the first county:
```julia
counties[1].header
counties[1].attrib
```

This shows that the header is a comma separated values string. We want to "join" our data to these counties such that the county FIPS identifier 
is the same. The FIPS identifier is a unique numerical code assigned to each county (the names are not necessarily unique). 

In this case, we have the value 39 being the FIPS code for the state of Ohio, and 071 being the county FIPS code for Highland County.

So we'd like to join field 2 of the header to the COUNTY column of our population data. 
However our population data represents the COUNTY field as a number, not a 0 padded string. Let's fix that then do the join


```julia

cpop.COUNTY = Printf.format.(Ref(Printf.Format("%03d")),cpop.COUNTY)
cpop.STATE = Printf.format.(Ref(Printf.Format("%02d")),cpop.STATE)

cpop = cpop[cpop.COUNTY .!= "000",:] ## eliminate county "000" which is "the whole state" for each state

dfc = DataFrame(STATE = map(x->x.attrib["STATEFP"],counties),COUNTY=map(x -> x.attrib["COUNTYFP"],counties),ORDER=1:length(counties))


joineddata = @chain leftjoin(dfc,cpop,on= [:STATE,:COUNTY],makeunique=true) begin
@orderby(:ORDER)
end

cptvallog = makecpt(range=(log(1000),log(11e6)),C=:plasma)

#imshow(counties,level=joineddata.POPESTIMATE2020,cmap=cptval,close=true,fill="+z",pen=0.25,region=(-125,-65,24,50),proj=:guess,colorbar=true)

```

There are missing values of POPESTIMATE2020 so let's replace them with NaN so we at least still get the polygon drawn. GMT will draw NaN 
as a special color.

```julia
joineddata.POPESTIMATE2020 = replace(joineddata.POPESTIMATE2020,missing => NaN)


GMT.plot(counties,level=log.(1.0 .+ joineddata.POPESTIMATE2020),cmap=cptvallog,close=true,fill="+z",pen=0.25,region=(-125,-65,24,50),
    proj=:guess,colorbar=true,figname="choroplethlog.png",title="Continental US County log(Population)")
```
![Choropleth of log(population) at county level](choroplethlog.png)


Let's see what it looks like if we don't take the logarithm...


```julia
cptval = makecpt(range=(0,11_000_000),C=:plasma)

GMT.plot(counties,level=joineddata.POPESTIMATE2020,cmap=cptval,close=true,fill="+z",pen=0.25,region=(-125,-65,24,50),
    proj=:guess,colorbar=true,figname="choroplethnolog.png",title="Continental US County Population")
```

![Continental Population](choroplethnolog.png)

It may be interesting to work further with the polygons that represent the counties in some additional ways. For example we could calculate the population
Density by dividing the population by the area of the polygon. We can calculate the area using the function gmtspatial.

```julia

measures = gmtspatial(counties, area="k")[1] # to get area in km^2
print(typeof(measures))
length(measures)
print(measures)
```

But some counties will consiste of several polygons, like some along the coast may include some islands, etc, to get the total area, we 
need to group by county identifier, and sum all the areas.

```julia
countyareas = @chain dfc begin
  @transform(:area = measures[:,3])
  groupby([:STATE,:COUNTY])
  @combine(:totarea = sum(:area))
  @select(:STATE,:COUNTY,:totarea)
end


cpop = @chain leftjoin(cpop,countyareas,on = [:STATE, :COUNTY]) begin
  @transform(:density = :POPESTIMATE2020 ./ :totarea)
  @orderby(:STATE,:COUNTY)
end

print(cpop[1:10,:])

joineddata = @chain leftjoin(dfc,cpop,on = [:STATE,:COUNTY]) begin 
  @orderby(:ORDER)
end


```

```julia
denscpt = makecpt(range=(0,11e6/(100*100)),C=:plasma)


GMT.plot(counties,level=Vector{Float64}(replace(joineddata.density,missing=>NaN)),cmap=denscpt,close=true,fill="+z",pen=0.25,region=(-125,-65,24,50),
    proj=:guess,colorbar=true,figname="choroplethdens.png",title="Continental US County Pop Density (1/km^2)")

```

![Continental US Density](choroplethdens.png)


## Smallest Area Containing ~50% of the population

Our approach to this question will be to sort the counties by density in decreasing order, cumsum the populations... and then create an indicator 
variable for whether the cumsum is less than 50% of the total, then plot this indicator variable.

```julia

totpop = sum(cpop[cpop.COUNTY .!= "000",:POPESTIMATE2020]) ## ignore county 0, that's the "whole state"
print(totpop)

areads = @chain cpop begin
@subset(:COUNTY .!= "000")
@orderby(-:density)
@transform( :csum = cumsum(:POPESTIMATE2020))
@transform( :inset = :csum .< totpop/2.0,:in80set = :csum .< 0.8 * totpop)
end

joineddata = @chain leftjoin(dfc,areads,on=[:STATE,:COUNTY]) begin
  @orderby(:ORDER)
end

insetcmap = makecpt(range(0,1),C=:plasma)

GMT.plot(counties,level=Float64.(replace(joineddata.inset,missing => NaN)),cmap=insetcmap,close=true,fill="+z",pen=0.25,region=(-125,-65,24,50),
    proj=:guess,colorbar=true,figname="choroplethhpdset.png",title="Continental US smallest 50% set ")

```

![high probability density set 50%](choroplethhpdset.png)


## 80%tile smallest area

```julia
GMT.plot(counties,level=Float64.(replace(joineddata.in80set,missing => NaN)),cmap=insetcmap,close=true,fill="+z",pen=0.25,region=(-125,-65,24,50),
    proj=:guess,colorbar=true,figname="choroplethhpd80set.png",title="Continental US smallest 80% set ")

```

![high probability density set 50%](choroplethhpd80set.png)