-
Notifications
You must be signed in to change notification settings - Fork 0
/
Florida2016B.Rmd
1129 lines (794 loc) · 53.5 KB
/
Florida2016B.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
---
title: "Florida Campaign Contributions 2016"
output: html_document
---
========================================================
```{r global_options, include=FALSE}
knitr::opts_chunk$set(fig.width=6, fig.height=4, fig.path='Figs/',
echo=FALSE, warning=FALSE, message=FALSE)
```
```{r echo=FALSE, message=FALSE, warning=FALSE, packages}
# Load all of the packages that you end up using
# in your analysis in this code chunk.
# Notice that the parameter "echo" was set to FALSE for this code chunk.
# This prevents the code from displaying in the knitted HTML output.
# You should set echo=FALSE for all code chunks in your file.
library(ggplot2)
library(plyr)
library(dplyr)
library(vioplot)
library(gridExtra)
library(zoo)
library(GGally)
library(scales)
library(memisc)
```
# Introduction
Though tempted by the subject and predicted analytical brevity of the wine datasets, I was curious about data related to 2016 election, which most mainstream analyses got so wrong. (Though it could be argued, as Raza Irizarry <a href = "http://simplystatistics.org/2016/11/09/not-all-forecasters-got-it-wrong/">does here </a> that if you look at confidence intervals, Nate Silver actually did get it right.)
The FEC offers data on campaign contributions by state <a href = "http://fec.gov/disclosurep/pnational.do"> here </a>. I decided to look at Florida's because of its swing state status, interesting mix of demographics, and relatively large data set.
# Univariate Plots
```{r echo=FALSE, Load_the_Data}
# Load the Data
FL <- read.csv("P00000001-FL.csv", header=T, check.names = FALSE)
dim(FL)
names(FL)
str(FL)
```
```{r Reformatting dates and zips}
#Create new variable "date" with date in date format
FL$date <- as.Date(FL$contb_receipt_dt, format = "%d-%B-%y")
#Create new variable "zip" with only valid zipcodes in 5 digit format
FL$zip <- substr(FL$contbr_zip, 0, 5)
FL$zip[FL$zip < 32000 | FL$zip > 35000] <- NA
FL$zip[FL$zip == 3334 | FL$zip == 3429] <- NA
FL$zip <- as.integer(FL$zip)
```
There is a good explanation of the data <a href = "ftp://ftp.fec.gov/FEC/Presidential_Map/2016/DATA_DICTIONARIES/CONTRIBUTOR_FORMAT.txt"> here </a> The one major category I found missing was the political party of the candidate, so I went about adding that feature first.
```{r creating political parties based on candidates}
summary(FL$cand_nm)
unique(FL$cand_nm)
index <- c("Johnson, Gary",
"Stein, Jill",
"McMullin, Evan",
"Clinton, Hillary Rodham",
"Sanders, Bernard",
"Lessig, Lawrence",
"O'Malley, Martin Joseph",
"Webb, James Henry Jr.",
"Bush, Jeb",
"Carson, Benjamin S.",
"Christie, Christopher J.",
"Cruz, Rafael Edward 'Ted'",
"Fiorina, Carly",
"Gilmore, James S III",
"Graham, Lindsey O.",
"Huckabee, Mike",
"Jindal, Bobby",
"Kasich, John R." ,
"Pataki, George E." ,
"Paul, Rand",
"Perry, James R. (Rick)",
"Rubio, Marco",
"Santorum, Richard J.",
"Trump, Donald J.",
"Walker, Scott")
values <- c("libertarian",
"green",
"independent",
"democrat", "democrat",
"democrat", "democrat",
"democrat", "republican",
"republican", "republican",
"republican", "republican",
"republican", "republican",
"republican", "republican",
"republican", "republican",
"republican", "republican",
"republican", "republican",
"republican", "republican")
FL$party <- values[match(FL$cand_nm, index)]
FL$party <- factor(FL$party)
summary(FL$party)
```
```{r Number of contributions per party}
qplot(data = FL, x = party)
```
There were nearly twice as many contributions towards Democratic candidates as Republicans, of course this has nothing to say about amounts. Third party contributions trail far behind.
```{r Number of contributions per candidate}
qplot(data = FL, x = cand_nm, fill = party) +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
```
Because of the large disparity in contributions between the least and most popular candidates, a log transform of y-axis is appropriate.
```{r Number of contributions per candidate scaled}
party_color <- c('#0e50D8', '#009933', '#6600CC', '#FCD003', '#CC0000')
ggplot(FL, aes(x = cand_nm, fill = party)) +
geom_bar() +
theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5)) +
scale_y_log10() +
scale_fill_manual(values = party_color)
```
```{r Amount histogram}
names(FL)[names(FL) == "contb_receipt_amt"] <- "amt"
qplot(FL$amt, geom = "histogram", bins = 1000)
```
Presumably the x-axis has been plotted this way because there are large positive and negative amounts. Let's check those out more closely.
```{r Negative Donations}
head(subset(FL, amt < 0)$receipt_desc)
subset(FL, amt == 20000 | amt < -20000)
```
It appears that these negative numbers are in fact legitimiate and are either refunds or redesignation to the general funds. There's also one very interesting data point: a 20,000 dollar contribution and subsequent refund the following day from a self-employed physician in Jupiter to Hillary Clinton. Let's try the boxplots again just looking within the 1 - 99% of data.
```{r Contribution Amount Histogram}
qplot(FL$amt, geom = "histogram", bins = 50) +
xlim(1, quantile(FL$amt, 0.9)) +
scale_x_log10(breaks = c(1, 5, 10, 25, 50, 100, 250, 500, 1000, 2500))
```
When addng a log10 transformation to the x-axis there is a relatively normal distribution, though there are spikes at regular intervals like 10, 25, 50, 100, 250, 500, 1000, and 2500, because people are more like to contirbute 100 dollars than say, 88.
```{r Contributors}
names(FL)[names(FL)=="contbr_nm"] <- "name"
cont_table <- table(FL$name)
more_than_1 <- subset(FL, name %in% names(cont_table[cont_table > 1]))$name
length(unique(more_than_1))
top_conts <- subset(FL, name %in% names(cont_table[cont_table > 150]))
reorder_size <- function(x) {
factor(x, levels = names(sort(table(x))))
}
ggplot(top_conts,
aes(x = reorder_size(name))) +
geom_bar() +
theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5))
head(sort(cont_table, decreasing = T))
```
There were quite a few people (42,066) who contributed more than once, and even quite a few who contributed more than 150 times, though the maximum number is 261 times.
```{r Counts of contributions in cities}
names(FL)[names(FL)=="contbr_city"] <- "city"
city_table <- table(FL$city)
top_cities <- subset(FL, city %in% names(city_table[city_table > 3000]))
ggplot(top_cities,
aes(x = reorder_size(city))) +
geom_bar() +
theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5))
```
As expected, the city with the highest number of contributions was Miami, followed by other large urban centers such as Tampa, Orlando, and Naples. There were some unknown to me cities such as Boynton Beach, The Villages, and Vero Beach that also had a large number of contributions.
```{r Zip Bar, fig.height=10, fig.width=10}
zip_table <- table(FL$zip)
top_zips <- subset(FL, zip %in% names(zip_table[zip_table > 1500]))
p1 <- ggplot(top_zips,
aes(reorder_size(zip))) +
geom_bar() +
theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.4))
p2 <- ggplot(subset(top_zips,
party == 'democrat' | party == 'republican'),
aes(reorder_size(zip), fill = zip)) +
geom_bar() +
theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.4)) +
geom_hline(yintercept = 1000, linetype = 2, color = 'red') +
facet_wrap(~party)
grid.arrange(p1, p2, ncol = 1)
```
Though this is venturing into the bivariate, bear me with me. From the first plot, we were aware that there were more Democratic contributions than Republican. However, a few zipcodes stand out when we compare beyond this trend. The bars are ordered from least number of overall contributions to highest number of overall contributions. The following zip codes stand out because they have a particularly high number of Republican to Democratic contributions:
- 32802 - Ponte Vedra Beach -- Where the PGA tour takes places, an upper income resort area
- 32963 - Vero Beach
- 33480 - South Palm Beach
The following stand out because they have relatively equal ratios of R to D:
- 32162 - The Villages
- 33908 - Fort Myers
And these stand out because they have much higher numbers of D to R:
- 32605 - Gainesville -- Where the large state univeristy is
- 33040 - Key West
- 33193 - Miami
- 33437 - Boynton Beach
While there are already interesting trends revealing themselves here, the main purpose of this is to show that most of the highest contributing zip codes also have counterparts as cities. For that reason they are redundant and I will only analyze city from here on, since it is more easily understandable on first sight. If I were building a regression model, zip codes would come more in handy since they are numerical.
```{r Occupation bar chart, fig.height=4, fig.width=3}
names(FL)[names(FL)=="contbr_occupation"] <- "occ"
FL$occ[FL$occ == "INFORMATION REQUESTED PER BEST EFFORTS"] <- "INFORMATION REQUESTED"
FL$occ[FL$occ == ""] <- "INFORMATION REQUESTED"
FL$occ[FL$occ == "SELF EMPLOYED"] <- "SELF-EMPLOYED"
occ_table <- table(FL$occ)
top_occs <- subset(FL, occ %in% names(occ_table[occ_table > 2000]))
ggplot(top_occs,
aes(x = reorder_size(occ))) +
geom_bar() +
theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5))
```
While it will be very interesting to see how these breakdown across party or candidate, for now we know that the top occupations are retired, not-employed, and people who did not specify what their employment was.
```{r Employed information}
FL$contbr_employer[FL$contbr_employer == "INFORMATION REQUESTED PER BEST EFFORTS"] <- "INFORMATION REQUESTED"
FL$contbr_employer[FL$contbr_employer == ""] <- "INFORMATION REQUESTED"
FL$contbr_employer[FL$contbr_employer == "SELF EMPLOYED"] <- "SELF-EMPLOYED"
head(sort(table(FL$contbr_employer), decreasing = TRUE), n = 15)
```
Because the top contributors are retired, not employed or self-employed, using employer information is redundant and will not be further considered. It is interesting to note that the largest employers are universities.
```{r Election type bar}
ggplot(FL, aes(x = election_tp)) + geom_bar()
```
Clearly there were more contributions during the primaries than during the general election, though there was one contribution towards 2020, and several unmarked contributions. Perhaps this commentary on the electorate's enthutiasm about about the final candidates.
```{r echo=FALSE}
p1 <- ggplot(FL, aes(x = date)) +
geom_histogram(binwidth = 1) +
theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5)) +
scale_x_date(date_breaks = "1 month")
p2 <- ggplot(FL, aes(x = as.numeric(date))) +
geom_histogram(bins = 1000) +
scale_x_log10()
grid.arrange(p1, p2, ncol = 1)
```
As my final univariate plot, I look at a histogram of date. Though more contributions were made to primary races, contributions did not pick up until April of 2015, peaking in July and August of 2016. Using a log10 transform on a numeric version of the data did not appear to significatly alter the the highly left-skewed data.
# Univariate Analysis
### Dataset structure
This data set includes 359419 contributions (or contribution refunds) towards presidential campaigns in the 2016 election cycle in Florida. Each observation includes the following 18 features: committee id, candidate id, candidate name, contributor name, contributor state (in this case all = FL), contributor zip code, contributor city, contributor employer, contributor occupation, contribution receipt amount, contribution receipt date, receipt description, memo code, memo text, form type, file number, transaction id, and election type.
Of these, I only considered or will consider candidate name (synonymous with candidate id), contributor name, contributor city, contributor occupation, contribution receipt amount, receipt description, date and party.
There is a more detailed explanation of the data available <a href = "ftp://ftp.fec.gov/FEC/Presidential_Map/2016/DATA_DICTIONARIES/CONTRIBUTOR_FORMAT.txt"> here </a>.
Other observations:
- There are some negative values because of refunds or because a contribution was reattributed to a spouse or redesignated for the general election rather than the primaries. (This was discovered by reading the memo line of negative contributions.)
- Far more contrbutions went towards the primary: 265962 as compared to 92280 towards the general election. (And one towards the 2020 election.) This explains why there are contributions to 25 candidates rather than just a handful.
- Despite this, most contributions were made in 2016 rather than earlier on.
### What is/are the main feature(s) of interest in your dataset?
I am interested in the following questions:
- Which candidates received the highest number of contributions?
- What was the distribution of contribution amount for each candidate?
- How did location and occupation affect which candidate or party the contribution went to?
### What other features in the dataset do you think will help support your investigation into your feature(s) of interest?
Digging deeper I would like to look at the following:
- How did contributions change over time in regards to whom received the contributions? I would suspect that at first contributions were widely spread, but then came to a narrow focus after the convetions took place, when third party candidates got a boost.
- How did the dollar amount of contributions change over time?
- Were there people who contributed multiple times? How did their contributions change over time?
### Did you create any new variables from existing variables in the dataset?
So far I have created the political party variable. I plan on also finding the mean and median amount of contribution for each candidate and for each day of the election cycle.
### Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?
There were a couple of unusual distributions. First off, as previously stated, I was surprised to find negative contribution amounts. However, they were legitimate and I decided to keep them in order to more accurately calculate total and therefore mean contributions. Additionally I changed the receipt_dt into date format for better time series plots. Finally I standardized zip codes, as some where in the 9 digit format. Some were also not zipcodes as they had less than 5 digits or were out of state zip codes. According to <a href = "http://www.unitedstateszipcodes.org/fl/"> reference </a> all Florida zip codes are between [32000, 35000]. I renamed lengthy variable names, and consolidated common values( e.g. "self-employed" and "self employed").Also, when plotting histograms of the number of contributions or the contribution amount I used the log10 tranformation in order to better visualize the relative distribution.
# Bivariate Plots Section
```{r Creating a smaller dataframe}
cand_table <- table(FL$cand_nm)
small_fl <- subset(FL, select = c("cand_nm", "city","amt", "date", "party"))
small_fl <- subset(small_fl, cand_nm %in% names(cand_table[cand_table > 1000]))
small_fl <- subset(small_fl, city %in% names(city_table[city_table > 5000]))
#Recast as character than factor to jettison extraneous factors
small_fl$cand_nm <- as.character(small_fl$cand_nm)
small_fl$cand_nm <- as.factor(small_fl$cand_nm)
small_fl$city <- as.character(small_fl$city)
small_fl$city <- as.factor(small_fl$city)
```
```{r GGpairs, fig.height=12, fig.width=12}
set.seed(2016)
fl_samp <- small_fl[sample(1:length(small_fl$cand_nm), 10000), ]
ggpairs(fl_samp,
lower = list(continuous = wrap("points", shape = I('.'))),
upper = list(combo = wrap("box", outlier.shape = I('.'))))
```
Since most of the data is categorical, there i only one correlation coefficient (between amout and date) that is calculated. It is negative showing that amounts decreased over time, but very weak (-.198) I will explore much of the bivariate data using boxplots and barcharts.
```{r Contribution Amount Histograms faceted by party with log transforms1}
dem_rep <- subset(FL, party == 'democrat' | party == 'republican')
ggplot(dem_rep, aes(x = amt)) +
geom_histogram(bins = 30)+
scale_x_log10(breaks = c(1, 5, 10, 25, 50, 100, 250, 500, 1000, 2500)) +
theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5)) +
facet_wrap(~party)
ggplot(dem_rep, aes(x = amt)) +
geom_histogram(bins = 30)+
xlim(101, 3000) +
ylim(0,10000) +
facet_wrap(~party)
```
Democratic contributions, especially below 100 dollars, were much more common while larger contributions were more popular for Republican contributors.
```{r Party box plots}
ggplot(top_cities,
aes(x = reorder(party, amt, FUN = median), y = amt)) +
geom_boxplot() +
coord_cartesian(ylim = c(0,500)) +
theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5))
```
Because there were so many Democratic and Republican contributions, there were many outliers in those boxplots. The medians from lowest to highest were Democrat, Green, Republican and Libertarian, Independent. Democrats had the smallest interquartile range but the most outliers.
```{r Frequency Polygons by Party}
p1 <- ggplot(FL, aes(x = amt, color = party, bins = 1000)) +
geom_freqpoly() + scale_x_log10()
p2 <- p1 + scale_y_log10()
grid.arrange(p1, p2, ncol=1)
```
When comparing all parties' contribution amounts, transfomring both x and y axis by log 10 allows us to see the relatively normal distribution of all. However, it clear from the first plots that there were more Democratic contributions than other parties for amounts less than $100 dollars, but that after that point, the number of Republican contributions is equal to or greater than the amount for democrats. (As shown in the previous amount histograms.)
```{r Amount boxplots }
ggplot(FL, aes(x = cand_nm , y = amt)) +
geom_boxplot() +
ylim(0, quantile(FL$amt, 0.9)) +
theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5))
```
```{r Scaled Democratic boxplots}
ggplot(subset(FL, party == 'democrat'),
aes(x = reorder(cand_nm, amt, FUN = median), y = amt)) +
geom_boxplot() +
theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5)) +
ylim(quantile(FL$amt, 0.01), quantile(FL$amt, 0.99))
```
Adding a jitter to these boxplots may help us see how densely packed they are.
```{r Democratic Boxplots + Jitter}
ggplot(subset(FL, party == 'democrat'),
aes(x = reorder(cand_nm, amt, FUN = median), y = amt)) +
geom_boxplot(outlier.shape = NA, color = 'blue')+
theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5)) +
ylim(quantile(FL$amt, 0.01), quantile(FL$amt, 0.99)) +
geom_jitter(alpha = .1, position=position_jitter(width=.4, height=0))
summary(subset(FL, party == 'democrat')$amt)
```
From this graph it's clear that Clinton and Sanders received far more contributions than the other democratic candidates, and this helps to explain the number of outliers for these two candiates. It also appears they are the only ones with significant negative "contributions". O'Malley had a higher 3rd quartile and median than his other two lesser known competitors. Let's take a look at just Clinton's and Sander's boxplots.
```{r Sanders Clinton Boxplots + Jitter, fig.width=8}
p1 <- ggplot(subset(FL,
(cand_nm == 'Clinton, Hillary Rodham' |
cand_nm == 'Sanders, Bernard')),
aes(x = cand_nm , y = amt)) +
theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5)) +
ylim(quantile(FL$amt, 0.01), quantile(FL$amt, 0.99)) +
geom_jitter(alpha = .1, position=position_jitter(width=.3, height=0)) +
geom_boxplot(outlier.shape = NA, color = 'blue', alpha = 0.5)
p2 <- ggplot(subset(FL,
(cand_nm == 'Clinton, Hillary Rodham' |
cand_nm == 'Sanders, Bernard')),
aes(x = cand_nm , y = amt)) +
theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5)) +
coord_cartesian(ylim = c(0, 100)) +
geom_boxplot()
grid.arrange(p1, p2, ncol =2)
```
So it's clear that Clinton received more contributions with higher amounts than Sanders. As can be seen, many of the most popular values are regular intervals of 100 or 1000, espeically 1000 and also 1500, 1750, and 2000 for Clinton.
Both Clinton and Sanders had a median contribution amount of 25 dollars, though Clinton's 3rd quartile is 75 while Sanders is 50. Let's take a look at the Republican candidate's amount boxplots.
```{r Republican Boxplots + Jitter}
ggplot(subset(FL, party == 'republican'),
aes(x = reorder(cand_nm, amt, FUN = median), y = amt)) +
theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5)) +
ylim(quantile(FL$amt, 0.01), quantile(FL$amt, 0.99)) +
geom_jitter(alpha = .1, position=position_jitter(width=.4, height=0)) +
geom_boxplot(outlier.shape = NA, color = 'red', alpha = 0.5)
```
As in the democratic boxplots, there are regular intervals (such as 500 and 1000) that are more common. Interestingly, many candidates 3rd quantiles reach the 99% cut-off, including Bush, Christie, Jindal, and Pataki, meaning they had several large amount contributions.
The fact that the more popular candidates, Cruz, Carson, Rubio, and Trump do not have high 3rd quartiles does not mean that they did not receive such contributions, just that they have many more contributions that fall in a lower range and therefore large donation are deemed outliers. Let'stake a look at it without the jitter.
```{r Republican boxplots}
ggplot(subset(FL, party == 'republican'),
aes(x = reorder(cand_nm, amt, FUN = median), y = amt)) +
geom_boxplot()+
theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5)) +
ylim(quantile(FL$amt, 0.01), quantile(FL$amt, 0.99))
```
This plot actually shows things a bit more clearly than the boxplots with jitter. Let's focus in on the four republican contenders with the most number of contributions.
```{r Focused Repblican Boxplots}
ggplot(subset(FL,
(cand_nm == 'Carson, Benjamin S.' |
cand_nm == "Cruz, Rafael Edward 'Ted'" |
cand_nm == 'Rubio, Marco' |
cand_nm == 'Trump, Donald J.')),
aes(x = reorder(cand_nm, amt, FUN = median), y = amt)) +
geom_boxplot()+
theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5)) +
coord_cartesian(ylim = c(0, 300))
```
Interestingly, though Trump boasted about how most of his contributions were in small amounts, his median is higher than Cruz's and Carson's. Rubio has the highest median and largest IQR. How about the third party candidates?
```{r Third party boxplots, fig.width=8}
p1 <- ggplot(subset(FL, party != 'republican' & party != 'democrat'),
aes(x = reorder(cand_nm, amt, FUN = median), y = amt)) +
geom_boxplot()+
theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5)) +
ylim(quantile(FL$amt, 0.01), quantile(FL$amt, 0.99))
p2 <- ggplot(subset(FL, party != 'republican' & party != 'democrat'),
aes(x = reorder(cand_nm, amt, FUN = median), y = amt)) +
geom_boxplot()+
theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5)) +
coord_cartesian(ylim = c(0, 500))
grid.arrange(p1, p2, ncol = 2)
```
Without limiting the data we can see that Johnson had many more outliers beyond the third quartile than the other two, while Stein had a few and one distinct outlier. McMullin had only one, but had highest median of the three.
How about trying a violinplot to see the distribution for the major candidates?
```{r Violin plots of number of contributions}
# Violin Plots
hil <- FL$amt[FL$cand_nm == 'Clinton, Hillary Rodham']
bern <- FL$amt[FL$cand_nm == 'Sanders, Bernard']
trump <- FL$amt[FL$cand_nm == 'Trump, Donald J.']
vioplot(hil, bern, trump, names=c('Clinton, Hillary Rodham',
'Sanders, Bernard', 'Trump, Donald J.'),
col="gold")
title("Violin Plots of Contribution Amounts")
```
Nothing out of the ordinary here -- the distributions seem relatively normal for each, however that distribution is widest for Clinton (much in thanks to our self-employed physician from Jupiter) and narrowest for Trump.
```{r Number of contributions by party}
cand_table <- table(FL$cand_nm)
top_cands <- subset(FL, cand_nm %in% names(cand_table[cand_table > 1500]))
ggplot(top_cands, aes(x = party, fill = cand_nm)) +
theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5)) +
geom_bar()
```
It makes sense that Trump and Clinton got the largest number of contributions for their party since they ended up being the party nominations.
```{r Candidates in Cities }
ggplot(subset(top_cities, cand_nm %in% names(cand_table[cand_table > 1500])),
aes(reorder_size(city), fill = cand_nm)) +
geom_bar() +
theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.4)) +
coord_flip()
```
While the disparities here are interesting, it's a bit difficult to read on this stacked bar chart, so I've decided to facet it out.
```{r Candidates in Cities separated out, fig.height=10, fig.width=10}
city1 <- ggplot(subset(
subset(top_cities, cand_nm %in% names(cand_table[cand_table > 2500])),
party == 'republican'),
aes(reorder_size(city), fill = cand_nm)) +
geom_bar() +
theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.4)) +
geom_hline(yintercept = 500, linetype =2) +
facet_wrap(~cand_nm)
city2 <- ggplot(subset(
subset(top_cities, cand_nm %in% names(cand_table[cand_table > 2500])),
party == 'democrat'),
aes(reorder_size(city), fill = cand_nm)) +
geom_bar() +
theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.4)) +
facet_wrap(~cand_nm) +
geom_hline(yintercept = 2500, linetype =2)
grid.arrange(city1, city2, ncol = 1)
```
The interesting points here are Rubio's popularity in Miami as compared to his Republican contributors and that Sanders received more contributions in Gainesville than Clinton, though the opposite is true in every other city.
```{r City by party, fig.width=14, fig.height=8}
ggplot(subset(top_cities, party == 'democrat' | party == 'republican'),
aes(reorder_size(city))) +
geom_bar() +
theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.4)) +
geom_hline(yintercept = 5000, linetype = 2, color = 'red') +
geom_hline(yintercept = 2500, linetype = 2, color = 'red') +
facet_wrap(~party)
```
Naples stands out as the only city whith more Republican contributions than Democratic, but it's difficult to tell with the smaller bar. Let's dig into the ratios a bit more.
```{r, Dem/Rep Ratios in cities, fig.height = 8, fig.width = 8}
rep_cities = as.data.frame(table(subset
(top_cities, party == 'republican')$city))
dem_cities = as.data.frame(table(subset
(top_cities, party == 'democrat')$city))
names(rep_cities) <- c("city", "freq")
names(dem_cities) <- c("city", "freq")
city_ratio <- dem_cities
city_ratio$city <- rep_cities$city
city_ratio$freq = dem_cities$freq / rep_cities$freq
names(city_ratio) <- c("city", "ratio")
city_ratio <- subset(city_ratio, !is.na(ratio))
city_ratio <- city_ratio[order(city_ratio$ratio),]
top_cities <- merge(top_cities, city_ratio, by = "city")
rat <- ggplot(city_ratio,
aes(x = reorder(city, ratio), y = ratio)) +
geom_line(group = 1) +
theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5, size = 12)) +
scale_y_continuous(breaks = c(0, 1, 2, 3, 4, 5, 6, 7)) +
geom_hline(yintercept = 1, linetype = 2, color = 'red')
cont <- ggplot(top_cities,
aes(x = reorder(city, ratio))) +
geom_bar() +
theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5, size = 12))
grid.arrange(rat, cont, ncol =1)
```
We can see that there are relatively few cities were the ratio of Democratic contributions to Republican contributions is less than one. (I.e. there were more Republican contributions) These cities were Vero Beach, Naples, The Villages. On the other hand, Gaineseville had over 7x as many Democratic contributions and Miami Beach had over 5 times as many Democratic contributions. The degree to which a city is Democratic or Republican does not seem to correlate with the total number of contributions, which might somewhat approximate the size or population of the city.
```{r, Occupation by party, fig.height=8, fig.width=10}
p1 <- ggplot(subset(top_occs, party == 'democrat' | party == 'republican'),
aes(reorder_size(occ))) +
geom_bar() +
theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.4)) +
geom_hline(yintercept = 1000, linetype = 2, color = 'red') +
scale_y_sqrt() +
facet_wrap(~party)
p2 <- ggplot(subset(top_occs, (party != 'democrat' & party != 'republican')),
aes(reorder_size(occ))) +
geom_bar() +
theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.4)) +
scale_y_log10() +
facet_wrap(~party)
grid.arrange(p1, p2, ncol = 1)
```
Across parties, most contributions came from retired people. For Democrats, the occupations of the next top contributors were "not-employed", "information requested" and attorney. For Republicans, not-employed was the lowest category, and for the third parties this category did not exist. The next highest contributors among Republicans where "Information Requested", which was significantly higher than in Democrats and then attorney. Other notable Republican dearths when compared to Democrats were managers, teachers, and professors. Of the third party contributions, Libertarians had the widest spread of occupations followed by Greens and then Independents. However, there were zero professors for Libertarians, and it was one of the few occupations supporting Greens. Note that the y-axis has undergone a log transformation for the third parties but not the main parties so that it was easier to see which occupations had contributed.
```{r Occupation breakdown, fig.height=8, fig.width=12}
p1 <- ggplot(subset(top_occs, party == 'republican'),
aes(occ)) + geom_bar() +
theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.4)) +
scale_y_sqrt() +
ggtitle("All Republican")
p2 <- ggplot(subset(top_occs, cand_nm == "Trump, Donald J."),
aes(occ)) + geom_bar() +
theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.4)) +
scale_y_sqrt() +
ggtitle("Trump")
rep <- grid.arrange(p1, p2, ncol = 2)
p3 <- ggplot(subset(top_occs, party == 'democrat'),
aes(occ)) + geom_bar() +
theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.4)) +
scale_y_sqrt() +
ggtitle("All Democrat")
p4 <- ggplot(subset(top_occs, cand_nm == "Clinton, Hillary Rodham"),
aes(occ)) + geom_bar() +
theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.4)) +
scale_y_sqrt() +
ggtitle("Clinton")
p5 <- ggplot(subset(top_occs, cand_nm == "Sanders, Bernard"),
aes(occ)) + geom_bar() +
theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.4)) +
scale_y_sqrt() +
ggtitle("Sanders")
dem <- grid.arrange(p3, p4, p5, ncol = 3)
grid.arrange(rep, dem, ncol =1)
```
The only difference between Trump's distribution amoung occupations and the overall Republican distribution is that there were fewer contributions from homemakers and more from "information requested" for Trump. Far fewer self-employed and unemployed people contributed to Clinton as compared to Sanders.
```{r Occupation boxplots1}
ggplot(top_occs, aes(x = reorder(occ, amt, FUN=median), y = amt)) +
geom_boxplot() +
coord_cartesian(ylim = c(0, 750)) +
theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5))
```
Not surprisingly CEOs had the largest IQR and highest median while unmployed had the lowest median, followed by professor and teacher.
```{r contribution amt over time, fig.height=5, fig.width=10}
ggplot(FL, aes(x = date, y = amt)) +
geom_point(alpha = .5, size = 0.75, position = 'jitter') +
theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5)) +
scale_x_date(date_breaks = "1 month") +
ylim(0, quantile(FL$amt, 0.90))
```
Though there were some contributions so early in the campaign, I'm going to zoom in on when most of the contributions came, namely starting April 2015.
```{r echo=FALSE, fig.height=5, fig.width=10}
ggplot(subset(FL, date > '2015-04-01'),
aes(x = date, y = amt)) +
geom_point(alpha = .1, size = 0.75, position = 'jitter') +
theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5)) +
scale_x_date(date_breaks = "1 month") +
coord_cartesian(ylim = c(0, 300)) +
geom_line(stat = 'summary', fun.y = quantile, fun.args = list(probs = .1),
linetype = 2, color = 'blue') +
geom_line(stat = 'summary', fun.y = quantile, fun.args = list(probs = .5),
color = '#FCD003') +
geom_line(stat = 'summary', fun.y = quantile, fun.args = list(probs = .9),
linetype = 2, color = 'red')
```
This paints a very interesting picture indeed, though it is still too noisy. Though there are peaks in the dollar amount of contributions 2016, the daily medians are much higher earlier on, before the conventions. This needs to be smoothed out as there is currently one bin per day.
```{r grouping by month and finding means and medians}
FL$month <- substr(FL$date, 0, 7)
month_groups <- group_by(FL, month) #need to pass in a data frame and grouping
FL_by_month <-summarise(month_groups, #need to save as a new variable
amt_mean = mean(amt),
amt_median = median(amt),
n = n())
FL_by_month$year <- substr(FL_by_month$month, 3, 4)
FL_by_month$monthyear <- substr(FL_by_month$month, 6,7)
FL_by_month$monthyear <- paste (FL_by_month$year, FL_by_month$monthyear, sep = "")
FL_by_month$monthyear <- as.yearmon(as.character(FL_by_month$monthyear), "%y%m")
```
```{r plotting by month}
month_medians <- ggplot(subset(FL_by_month, monthyear > "April 2015"),
aes(x = month, y = amt_median)) +
theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5)) +
geom_line(group = 1)
month_mean <- ggplot(subset(FL_by_month, monthyear > "April 2015"),
aes(x = month, y = amt_mean)) +
theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5)) +
geom_line(group = 1)
grid.arrange(month_medians, month_mean, ncol =1)
```
According to this to this, mean and median contribution dollar amounts decreased from April 2015 on, though the early univariate histogram showed that the number of contributions increased in the final few months of the election. Though the graphs have similar shapes, it is worth it to point out that the scales on the y-axes are quite different.
# Bivariate Analysis
### Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset? Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?
Bivariate plots supported truisms regarding party and candidate affiliation. Contributions for Democratic candidates were likely to be smaller in amount and come from the unemployed, teachers, and professors. This was particularly true for Sanders as Clinton received more contributions of higher amounts. Contributions for Republicans tended to be of higher amount and were more popular among self-employed or business owners.
Interesting information about candidate and party support was shown across different cities, with Gainesville, a large university city showing the highest ratio of democrat to republican contributions, particularly for Sanders. Rubio had a relatively large following in Miami which makes sense since that is where he is from.
Though the univariate analyses showed that the number of contributions spiked in the summer of 2016, the final plot of the bivariate analysis showed that the dollar amount of the contributions generally declined from April 2015 on.
### What was the strongest relationship you found?
Democrats receive higher numbers of contributions but of a lower dollar amount.
# Multivariate Plots Section
```{r Contribution Amount Histograms faceted by party with log transforms, fig.height= 5, fig.width=10}
ggplot(subset(FL, election_tp == 'P2016' | election_tp == 'G2016'),
aes(x = amt)) +
geom_histogram(bins = 50, aes(fill = party))+
scale_x_log10(breaks = c(1, 5, 10, 25, 50, 100, 250, 500, 1000, 2500)) +
scale_y_sqrt() +
theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5)) +
scale_fill_manual(values = party_color) +
facet_wrap(~election_tp)
```
While the number of contributions was much higher in the primaries, it is interesting to watch see how Republican contributions became left skewed towards higher dollar amounts in the general election.
```{r City boxplots with ratios}
ggplot(top_cities,
aes(x = reorder(city, amt, FUN = median), y = amt, fill = ratio)) +
geom_boxplot() +
theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5)) +
ylim(quantile(FL$amt, 0.1), quantile(FL$amt, 0.9) )
```
Arranging these boxplots by medians shows there is no discernible relationship between median contribution amount in a city and its democratic to republican ratio.
```{r Occupation boxplots, fig.height=5, fig.width=12}
ggplot(subset(top_occs, party == 'republican' | party == 'democrat'),
aes(x = reorder(occ, amt, FUN=median), y = amt, fill = party)) +
geom_boxplot() +
coord_cartesian(ylim = c(0, 500)) +
theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5))
```
In all occupations except for "Information Requested" Republican contributions have a higher dollar median. This is especially true of homemakers, attorneys, physicians, and CEOs.
```{r Multiple Contributor box plots3, fig.height= 6, fig.width = 14}
ggplot(top_conts,
aes(x = reorder(name, amt, FUN = median), y = amt, fill = cand_nm)) +
geom_boxplot() +
coord_cartesian(ylim =c(-50, 100)) +
theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5)) +
facet_wrap(~party)
```
People who contribute multiple times are far more likely to be contributing to Democratic campaigns and Republican multiple contributors had higher median amounts. Some of the multiple contributors have several negative values showing that they refund or reassign their contributions.
```{r Multiple Contributor box plots1, fig.height= 6, fig.width = 14}
ggplot(subset(top_conts, party == 'democrat'),
aes(x = reorder(name, amt, FUN = median),
y = amt, fill = cand_nm)) +
geom_boxplot() +
coord_cartesian(ylim = c(0, 50)) +
theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5))
```
The median amounts for multiple Democratic contributors were all below 50 dollars, but other than that there was not much to unite them. Sanders contributors did not switch to Hilary after the primary.
```{r Multiple Contributor box plots2, fig.height= 3, fig.width = 12}
ggplot(subset(top_conts, party == 'republican'),
aes(x = reorder(name, amt, FUN = median),
y = amt, fill = cand_nm)) +
geom_boxplot() +
coord_flip() +
theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5))
```
Edwin Gray contributed to three separate candiates though he strong preferred Cruz. James Coffman only contributed to Cruz.
```{r Faceted by Candidate, fig.height=7, fig.width=10}
p1 <- ggplot(subset(top_cands, date > '2015-04-01' & party == 'republican'),
aes(x = date, y = amt, color = cand_nm)) +
geom_point(alpha = .5, position = 'jitter') +
theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5)) +
scale_x_date(date_breaks = "1 month") +
ylim(0, quantile(FL$amt, 0.90))
p2 <- ggplot(subset(top_cands, date > '2015-04-01' & party == 'democrat'),
aes(x = date, y = amt, color = cand_nm)) +
geom_point(alpha = .5, position = 'jitter') +
theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5)) +
scale_x_date(date_breaks = "1 month") +
ylim(0, quantile(FL$amt, 0.90))
grid.arrange(p1, p2, ncol = 1)
```
The abrupt shift towards the party candidate is apparent starting in June, otherwise there is too much noise to discern patterns.
```{r Contribution amount over time, faceted by party1, echo=FALSE, fig.height=8, fig.width=10}
two <- ggplot(subset(subset(FL, date > '2015-04-01'),
(party == 'democrat' | party == 'republican')),
aes(x = date, y = amt, color = party)) +
geom_point(alpha = .5, size = 0.75, position = 'jitter') +
theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5)) +
scale_x_date(date_breaks = "1 month") +
scale_color_manual(values = c("#0e50D8", "#CC0000")) +
ylim(0, quantile(FL$amt, 0.90)) +
geom_smooth(color = 'black') +
facet_wrap (~party)
three <- ggplot(subset(subset(FL, date > '2015-04-01'),
(party != 'democrat' & party != 'republican')),
aes(x = date, y = amt, color = party)) +
geom_point(position = 'jitter') +
theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5)) +
scale_x_date(date_breaks = "1 month") +
scale_color_manual(values = c('#009933', '#6600CC', '#FCD003')) +
ylim(0, quantile(FL$amt, 0.90)) +
facet_wrap (~party)
grid.arrange(two, three, ncol = 1)
```
As suspected, the majority of third party contributions came towards the end of the campaign. Of the third party contributions, Green had the earliest start. Interestingly while Democratic contributions stayed relatively steady in the last few months, with a small upswing in number and amt in August 2016, there were gaps in Republican contributions of lower amounts (<100) in June 2016 and after August 2016. The smoother shows the uptick in the Republican contribution amount towards the end.
Both general and primary contributions show a normal distribution with spikes at regular intervals, however there were many more contributions to the primaries, as previously shown. While there were more Democratic contributions overall, Republicans contributions were more left-skewed in the general election.
```{r Republican Candidate Amounts, fig.height=5, fig.width=10}
ggplot(subset(top_cands, date > '2015-04-01'),
aes(x = date, y = amt, color = cand_nm)) +
theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5)) +
scale_x_date(date_breaks = "1 month") +
ylim(0, quantile(FL$amt, 0.90)) +
geom_smooth() +
facet_wrap(~party)
```
Though Clinton received the party nomination, the dollar amount of her contributions did not receive the jump that Trump's did once he did. Republican contributions had a far higher variability than Democrat's contributions.
```{r, fig.height = 10, fig.width = 10}
less <- ggplot(subset(top_cands,
(date > "2015-04-01" & amt < 500)),
aes(x = date, y = cand_nm, color = party)) +
geom_point(alpha = 0.25, size = 0.5, position = 'jitter') +
theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5)) +
scale_x_date(date_breaks = "1 month")
more <- ggplot(subset(top_cands,
(date > "2015-04-01" & amt >= 500 & amt < 2500)),
aes(x = date, y = cand_nm, color = party)) +
geom_point(alpha = 0.25, size = 0.5, position = 'jitter') +
theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5)) +
scale_x_date(date_breaks = "1 month")
even_more <- ggplot(subset(top_cands,
(date > "2015-04-01" & amt >= 2500)),
aes(x = date, y = cand_nm, color = party)) +
geom_point(alpha = 0.25, size = 0.5, position = 'jitter') +
theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5)) +
scale_x_date(date_breaks = "1 month")
grid.arrange(less, more, even_more, ncol = 1)
```
This plot breaks down contributions towards candates across time, and faceted into amounts of less than 500, between 500 and 2500 and above 2500.
```{r date-candidate-amt scatterplot, fig.height=3, fig.width=8}
top_cands$scaled <- scale(top_cands$amt, center = 1)
top_cand_samp <- top_cands[sample(1:length(top_cands$tran_id), 1000), ]
ggplot(subset(top_cand_samp, date > '2015-04-01' & scaled > 0 & scaled < 5),
aes(x = date, y = cand_nm, color = party, size = scaled)) +
geom_point(alpha = .5) +
theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5)) +
scale_x_date(date_breaks = "1 month")
```
Here is a similar plot which uses only a sample of the data since with all of the data it is impossible to notice the different sizes of contributions.