<?xml version="1.0" encoding="utf-8" standalone="yes" ?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
<channel>
<title>David Childers on David Childers</title>
<link>/</link>
<description>Recent content in David Childers on David Childers</description>
<generator>Hugo -- gohugo.io</generator>
<language>en-us</language>
<copyright>&copy; 2018</copyright>
<lastBuildDate>Sun, 15 Oct 2017 00:00:00 -0400</lastBuildDate>
<atom:link href="/" rel="self" type="application/rss+xml" />
<item>
<title>Online Data Collection for Efficient Semiparametric Inference</title>
<link>/publication/onlinemomentselectionsemiparametric/</link>
<pubDate>Sat, 07 Sep 2024 00:00:00 -0400</pubDate>
<guid>/publication/onlinemomentselectionsemiparametric/</guid>
<description></description>
</item>
<item>
<title>Timing as an Action: Learning When to Observe and Act</title>
<link>/publication/timingasanaction/</link>
<pubDate>Sat, 09 Mar 2024 00:00:00 -0500</pubDate>
<guid>/publication/timingasanaction/</guid>
<description></description>
</item>
<item>
<title>Papers I Liked 2023</title>
<link>/post/papers-2023/</link>
<pubDate>Sun, 31 Dec 2023 00:00:00 +0000</pubDate>
<guid>/post/papers-2023/</guid>
<description><script src="/rmarkdown-libs/header-attrs/header-attrs.js"></script>
<p>I felt like I barely did any serious reading this year, and maybe that’s even true, but my read folder contains 168 papers for 2023, so even subtracting the ones that are in there by mistake, that’s enough to pick a few highlights. As usual, I hesitate to call these favorites, but I learned something from them. They are in no particular order except chronological by when I read them. Themes are kind of all over the place because this year has been one of topical whiplash for me. Broadly, early in the year I was reading a lot more economics, and later in the year more Machine Learning. Computational econ was a focus because I taught <a href="https://github.com/donskerclass/ComputationalMethodsClass">that class</a> again after a 2 year hiatus and added Python. Learning Python was a bigger focus: I can say that I am now quite middling at it, which was an uphill battle. I spent the middle of the year trying to catch up with the whole language modeling thing that is apparently hot right now. A lot of the learning on each of these topics was books and classes, so I will add a section on those too.</p>
<div id="classes-and-books" class="section level2">
<h2>Classes and Books</h2>
<ul>
<li>Python, introductory
<ul>
<li>I quite liked the <a href="https://quantecon.org/lectures/">QuantEcon</a> materials for the basics, though that’s idiosyncratic to it being targeted to numerical methods in economics and to having already used the Julia materials.</li>
</ul></li>
<li>Python, advanced
<ul>
<li>Please help me, I’m dying. Send recs. Part of it is that I still need a deeper foundation in the <a href="https://missing.csail.mit.edu/">basics of computation</a> (like, command line utils, not CS theory). Part of it is that the one good thing about Python, its huge community and rich library ecosystem, is also the terrible thing about it, the whole thing being a huge and ever shifting set of incompatible hacks and patches fixing basic flaws in older patches fixing basic flaws in, etc ad infinitum.</li>
</ul></li>
<li>General Deep learning
<ul>
<li><a href="https://dell-research-harvard.github.io/blog.html">Melissa Dell’s Harvard class</a> is the only one I’m aware of that’s aimed at economists that will explain modern practical deep learning, including contemporary vision, text, and generative architectures, with a focus on transformers. Use this if you want to do research with text, images, documents. Taught by an economic historian, but orders of magnitude more up to date than anything by an econometrician or computational economist, including what gets published in top econ journals (which are great, but not for ML).</li>
</ul></li>
<li>Natural Language Processing
<ul>
<li>Jurafsky and Martin, <a href="https://web.stanford.edu/~jurafsky/slp3/">Speech and Language Processing, 3rd ed</a>: Learn the history of NLP, up to the modern era. A lot of the old jargon remains, the methods mostly don’t. But this will explain the tasks and how we got to modern methods.</li>
<li><a href="https://huggingface.co/learn/nlp-course">HuggingFace Transformers</a> is the library people actually use for text processing. This is mostly a software how-to, but then again modern NLP is pretty much nothing but software, so you may as well get it directly (a minimal usage sketch appears after this list).</li>
<li>Grimmer, Roberts, and Stewart, <a href="https://press.princeton.edu/books/hardcover/9780691207544/text-as-data">Text as Data</a>: Fantastic on research methods, and how to learn systematically from document corpora. Technical methods are from the Latent Dirichlet Allocation era, now charmingly dated, though their <a href="https://www.structuraltopicmodel.com/">stm software</a> will get you quite far very quickly in the exploratory phase of a project.</li>
</ul></li>
</ul>
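<p>As a concrete illustration of the kind of interface the HuggingFace course teaches, here is a minimal sketch using the <code>pipeline</code> API; the task and example text are my own arbitrary choices, and the model checkpoint is whatever default the library selects if none is specified.</p>
<pre class="python"><code># Minimal HuggingFace Transformers usage: a ready-made pipeline for a standard
# NLP task. Requires the transformers library plus a backend such as PyTorch;
# the default checkpoint for the task is downloaded on first use.
from transformers import pipeline

classifier = pipeline("sentiment-analysis")
print(classifier(["Modern NLP is pretty much nothing but software."]))
# Returns a list of dicts with a predicted label and a confidence score.</code></pre>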
</div>
<div id="papers-i-liked" class="section level2">
<h2>Papers I liked</h2>
<ul>
<li>Russo and van Roy (2013): “<a href="https://web.stanford.edu/~bvr/pubs/Eluder.pdf">Eluder Dimension and the Sample Complexity of Optimistic Exploration</a>”
<ul>
<li>Recommended to me as “well-written”. Foundational for interesting <a href="https://arxiv.org/abs/2312.16730">modern work</a> in bandits and RL.</li>
</ul></li>
<li>García-Trillos, Hosseini, Sanz-Alonso “<a href="https://arxiv.org/abs/2302.11449">From Optimization to Sampling Through Gradient Flows</a>”
<ul>
<li>A quick and readable explanation of how Langevin-based sampling algorithms are just gradient descent in the right space (a short sketch of the basic Langevin iteration appears after this list): over the past two years I’ve caved in to the optimal transport bandwagon. For a comprehensive overview, see the monograph by <a href="https://chewisinho.github.io/#book">Sinho Chewi</a> or the <a href="https://simons.berkeley.edu/programs/geometric-methods-optimization-sampling">Simons Program</a>, especially the bootcamp lectures by <a href="https://simons.berkeley.edu/talks/sampling-crash-course-0">Eberle</a>.</li>
</ul></li>
<li>Bouscasse, Nakamura, Steinsson “<a href="https://eml.berkeley.edu/~jsteinsson/papers/malthus.pdf">When Did Growth Begin?
New Estimates of Productivity Growth in England from 1250 to 1870</a>”
<ul>
<li>Structural Bayesian estimation of a neo-Malthusian model of English population and wage history. Modeling here allows both transparent interpretation of the data and expression of many sources of uncertainty in historical series that often go unacknowledged. On these issues, as my favorite paper title of the year put it, “<a href="https://www.cambridge.org/core/journals/journal-of-economic-history/article/we-do-not-know-the-population-of-every-country-in-the-world-for-the-past-two-thousand-years/D747DDC6E499C799B0471DBE33FEB0BB">We Do Not Know the Population of Every Country in the World for the Past Two Thousand Years</a>”</li>
</ul></li>
<li>Kovachki, Li, Liu, Azizzadenesheli, Bhattacharya, Stuart, Anandkumar, JMLR (2023) “<a href="https://www.jmlr.org/papers/volume24/21-1524/21-1524.pdf">Neural Operator: Learning Maps Between Function Spaces With Applications to PDEs</a>”
<ul>
<li>Learning of nonlinear operators (maps with functions as input and output), as opposed to linear ones, has been a weak spot of functional data analysis. Neural operator architectures are part of a class of methods that are usable in this setting. Applications include speeding up massive scientific models, <a href="https://arxiv.org/abs/2302.07400">generative models of functions</a>, etc.</li>
</ul></li>
<li>Mikhail Belkin, Acta Numerica (2021) “<a href="https://www.cambridge.org/core/journals/acta-numerica/article/abs/fit-without-fear-remarkable-mathematical-phenomena-of-deep-learning-through-the-prism-of-interpolation/DBAC769EB7F4DBA5C4720932C2826014">Fit without fear: remarkable mathematical phenomena of deep learning through the prism of interpolation</a>”
<ul>
<li>Since <a href="https://proceedings.neurips.cc/paper_files/paper/1996/file/fb2fcd534b0ff3bbed73cc51df620323-Paper.pdf">Bartlett (1997)</a>, and as re-emphasized by <a href="https://arxiv.org/abs/1611.03530">Zhang et al (2016)</a>, we’ve known that classical learning theory doesn’t quite work for neural networks in the modern regime. They are overparameterized, interpolate (“overfit”) the training data, do not converge uniformly, and bounds based on theories like VC or Rademacher complexity are typically vacuous. But they seem to generalize fine. We’re still assembling the story here, and I don’t think it’s completely stitched up, but this gives a good overview of the problems and elements of the solutions (data-dependent bounds, selecting good global minima among the many that exist by some aspect of the training dynamics), and some precise results in the NTK regime.</li>
<li>See also work on PAC-Bayes bounds from Andrew Gordon Wilson’s lab, with a different and more promising data-dependent approach: see, e.g., “<a href="https://arxiv.org/abs/2211.13609">PAC-Bayes Compression Bounds So Tight That They Can Explain Generalization</a>” or “<a href="https://arxiv.org/abs/2304.05366">The No Free Lunch Theorem, Kolmogorov Complexity, and the Role of Inductive Biases in Machine Learning</a>”</li>
</ul></li>
<li>Hu and Laurière “<a href="https://arxiv.org/abs/2303.10257">Recent Developments in Machine Learning Methods for Stochastic Control and Games</a>”
<ul>
<li>Survey on the Neural PDEs literature for optimal control and mean field games. The applications where neural networks improve upon classical numerical methods are currently being scoped out, but they seem useful in certain high dimensional situations that have eluded traditional techniques (specifically, inequality with portfolio choice, aggregate risk, and aging).</li>
</ul></li>
<li>Egami, Hinck, Stewart, Wei “<a href="https://arxiv.org/abs/2306.04746">Using Large Language Model Annotations for Valid Downstream Statistical Inference in Social Science: Design-Based Semi-Supervised Learning</a>”
<ul>
<li>You can and should use classical semiparametric techniques with sample-splitting to get confidence intervals when using large language models. The methods are old and well established, but LLM users need to hear it. See also <a href="https://arxiv.org/abs/2309.16598">Zrnic and Candès</a> and <a href="https://arxiv.org/abs/2309.13666">Mozer and Miratrix</a>, who also suggested exactly the same estimator (literally the same formula in all 3 papers; a stylized version is sketched after this list), but who cares, any good idea should be published multiple times.</li>
</ul></li>
<li>Lew, Tan, Grand, Mansinghka: “<a href="https://arxiv.org/abs/2306.03081">Sequential Monte Carlo Steering of Large Language Models using Probabilistic Programs</a>”
<ul>
<li>Language models like LLaMA are autoregressive probability models of sequences. You should be able to run all kinds of sampling algorithms on that sequence model, not just the typical beam search with some penalties. Full Bayesian inference by filtering is just one example: see also work like “<a href="https://arxiv.org/abs/2310.09139">The Consensus Game: Language Model Generation via Equilibrium Search</a>” which computes a Nash equilibrium over language output. All of this is greatly facilitated by having the actual probabilities output by the model and requires many samples, so own a lot of GPUs or use a small model, but it is a promising sign that future inference will look very different from current practice.</li>
</ul></li>
<li>David Donoho “<a href="https://arxiv.org/abs/2310.00865">Data Science at the Singularity</a>”
<ul>
<li>Old man yells at cloud computing. Kind of an opinion piece: one of the top scientists of the previous generation of ML on how the real secret to modern ML success is nothing about theory or methods but a research paradigm of “frictionless reproducibility” and ceaseless competition. See also Ben Recht’s <a href="https://www.argmin.net/p/patterns-predictions-and-actions">running commentary on his ML class</a> from a related perspective.</li>
</ul></li>
<li>Bengs, Busa-Fekete, El Mesaoudi-Paul, Hüllermeier JMLR (2021) “<a href="https://jmlr.org/papers/v22/18-546.html">Preference-based Online Learning with Dueling Bandits: A Survey</a>”
<ul>
<li>Learning from comparisons, rather than numerical values, leads to a field that combines bandits, sorting algorithms, voting theory, and preference estimation, with a dazzling array of algorithms based on each of these perspectives. This work touches on the issues that arise when trying to figure out, for example, what it is that “Reinforcement Learning with Human Feedback” is optimizing language model output for.</li>
</ul></li>
</ul>
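<p>To make the “sampling is gradient descent in the right space” point from the García-Trillos, Hosseini, and Sanz-Alonso entry concrete, here is a minimal sketch of the unadjusted Langevin algorithm for a toy Gaussian target; this is my own illustration rather than code from the survey, and the step size and target are arbitrary choices.</p>
<pre class="python"><code># Unadjusted Langevin algorithm (ULA): gradient descent on the potential
# U(x) = -log pi(x), plus injected Gaussian noise. For the toy target
# pi = N(0, 1), U(x) = x^2 / 2 and grad U(x) = x. The iteration is the Euler
# discretization of the Wasserstein gradient flow of KL(. || pi), so its
# stationary distribution approximates pi (with a bias that shrinks with the
# step size).
import numpy as np

rng = np.random.default_rng(0)
step = 0.01                       # step size
n_iters = 5000
x = rng.normal(size=1000)         # 1000 chains run in parallel

def grad_U(x):
    return x                      # gradient of the potential for N(0, 1)

for _ in range(n_iters):
    x = x - step * grad_U(x) + np.sqrt(2.0 * step) * rng.normal(size=x.shape)

print(x.mean(), x.var())          # roughly 0 and 1</code></pre>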
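<p>For the Egami, Hinck, Stewart, and Wei entry above, here is a stylized sketch of the kind of estimator all three papers arrive at, written for the simple case of a population mean; the variable names and the plug-in standard error are my own simplifications (in particular, the papers use sample-splitting/cross-fitting, which is omitted here), so treat this as intuition rather than any paper’s exact procedure.</p>
<pre class="python"><code># Debiased mean estimation with LLM annotations: use cheap model annotations
# yhat on the full corpus, then correct their bias using a small random
# subsample that also has expert labels y.
import numpy as np

def debiased_mean(yhat_all, y_labeled, yhat_labeled):
    correction = np.mean(y_labeled - yhat_labeled)   # average annotation error on the labeled subsample
    estimate = np.mean(yhat_all) + correction
    # Rough standard error treating the two averages as independent
    se = np.sqrt(np.var(yhat_all) / len(yhat_all)
                 + np.var(y_labeled - yhat_labeled) / len(y_labeled))
    return estimate, se

# Toy illustration: upward-biased annotations of a binary label with true mean 0.4
rng = np.random.default_rng(0)
y = rng.binomial(1, 0.4, size=5000)                            # true labels (mostly unobserved)
yhat = np.clip(y + rng.binomial(1, 0.1, size=5000), 0, 1)      # noisy, upward-biased annotations
labeled = rng.choice(5000, size=300, replace=False)            # small expert-labeled subsample
print(debiased_mean(yhat, y[labeled], yhat[labeled]))          # close to 0.4 despite the bias</code></pre>
<p>The point is that the correction term restores (asymptotically) valid inference even when the annotations are systematically off, at the cost of the extra variance contributed by the small labeled sample.</p>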
</div>
</description>
</item>
<item>
<title>Local Causal Discovery for Estimating Causal Effects</title>
<link>/publication/localcausaldiscovery/</link>
<pubDate>Fri, 17 Feb 2023 00:00:00 -0500</pubDate>
<guid>/publication/localcausaldiscovery/</guid>
<description></description>
</item>
<item>
<title>Differentiable State Space Models and Hamiltonian Monte Carlo Estimation</title>
<link>/publication/differentiablestatespace/</link>
<pubDate>Thu, 06 Oct 2022 00:00:00 -0400</pubDate>
<guid>/publication/differentiablestatespace/</guid>
<description></description>
</item>
<item>
<title>Automated Solution of Heterogeneous Agent Models</title>
<link>/publication/automatedsolution/</link>
<pubDate>Wed, 13 Jul 2022 00:00:00 -0400</pubDate>
<guid>/publication/automatedsolution/</guid>
<description></description>
</item>
<item>
<title>SED 2022 - Notes on contemporary macro</title>
<link>/post/sed-2022/</link>
<pubDate>Sat, 02 Jul 2022 00:00:00 +0000</pubDate>
<guid>/post/sed-2022/</guid>
<description><script src="/rmarkdown-libs/header-attrs/header-attrs.js"></script>
<p>I just got back from an excellent <a href="https://www.economicdynamics.org/sedam_2022/">meeting of the Society for Economic Dynamics</a>, a top conference for work in dynamic economics, principally but not exclusively in macroeconomics. As one of the first in-person conferences I’ve been to since 2020 (last year they were hybrid and I presented from home), it was a chance to catch up not just with colleagues and friends but also with the state of modern academic macro, after some time focusing more on other things. While the conference is fresh in my mind, I thought I’d jot down a few bigger picture notes to start thinking about where the field is and where it might be headed.</p>
<p>Of course, the conference has so many parallel sessions that I’m sure no two people had the same experience, aside from the plenary talks, and my particular focus, mostly on computational and econometric methods, is a specialized niche within the whole. But since it’s particularly valuable for methodologists to have a sense of what applied problems people currently work on and how they’re going about it, I did try to explore a little more. That said, even within what I saw, these are just first impressions and themes.</p>
<p><strong>HANK models have matured</strong></p>
<p>Research literatures in macroeconomics seem to have a life cycle that goes in stages.</p>
<p>First, some creative thinkers come up with a concept and implement an early version showing it can be done. For the heterogeneous agent New Keynesian (HANK) literature, around the early to mid 2010s, the idea was to merge our <a href="https://donskerclass.github.io/post/an-empirical-heterogeneous-agents-models-reading-list/">benchmark incomplete markets models</a> of inequality and individual spending and saving behavior with our <a href="https://press.princeton.edu/books/hardcover/9780691164786/monetary-policy-inflation-and-the-business-cycle">benchmark New Keynesian models</a> of monetary policy, inflation, and business cycles to start to answer questions about how they interact.</p>
<p>Second, the research community enters a stage of twiddling with all the knobs, investigating all the features of a type of model and understanding which features are important for which outcomes and why, and how they interact. Some of the choices in the initial model may have been just placeholder first guesses that, after a period of trial and error over different specifications, get swapped out for something more robust or tractable until the literature settles down on a small set of benchmarks. HANK had been actively in this stage in the late 2010s, with many competing variations working out questions about specification in terms of fiscal transfers, portfolio choices, preferences and so on, along with the role of methods (discrete vs continuous time, MIT shocks vs Krusell-Smith vs perturbation, sequence versus state space, etc). There’s still some of this settling of basic questions going on, but there seemed to be more of it in the previous two or three SED meetings.</p>
<p>Third, after enough knob twiddling, people understand the framework well enough to put the model to work, as a tool for basic measurement and for policy analysis. HANK now seems to be entering this mature stage, what Kuhnians would call “normal science”, with lots of applications to understanding the effects of particular policy proposals or shocks, measuring and quantifying different sources of inequality, and as a baseline for incorporating new proposed deviations or frictions. I went to a lot of talks where the format was “here’s a policy issue or fact to explain. We motivate with some simple empirics and maybe a toy model with just the one force, then embed it in a quantitative HANK model to measure that it explains x percent of this trend, or implies that this policy is x percent more/less effective” and fewer that were trying to resolve basic issues of the form “what happens if we take a HANK model and swap out sticky prices vs wages, or real vs nominal debt, etc”.</p>
<p>I suppose beyond those stages, other literatures in macroeconomics that have trodden the path before (representative agent DSGE?) maintain a long lifetime of continued routine use, often fading a bit from the academic spotlight but continuing to be useful to policy-makers, both for day-to-day measurement and as a mature and reliable way to get a first pass at the pressing issues of the day. Beyond that, either continued probing of points of empirical dissatisfaction or merging with ideas from some previously disjoint strand leads to new strands of research ideas. For HANK, I suspect much of the original interest was from the desire to reconcile the profession’s incompatible benchmark models of individual behavior and of business cycle aggregates, with the most notable empirical problem being dissatisfaction with the MPC implications of the typical representative agent <a href="http://noahpinionblog.blogspot.com/2014/01/the-equation-at-core-of-modern-macro.html">Euler equation</a>. I think there are many years of both basic exploration and normal science left to do with HANK, though pattern matching suggests that something will eventually come along to form the next generation. It seems too early to specify what that might be. Despite widespread unease with other aspects of the New Keynesian paradigm for monetary economics and many proposals for modifications or replacements, it has proved remarkably persistent by serving as a baseline framework for encompassing divergent views on mechanisms and policies. While disagreements remain, the days when “freshwater” and “saltwater” macroeconomists expressed disagreement mostly through <a href="https://en.wikiquote.org/wiki/Robert_Lucas_Jr.#Quotes:~:text=The%20main%20development,submitted%20any%20more.">laughter</a> have largely been replaced by conversations about parameter values in shared model classes between researchers who cannot be clustered nearly so easily into ideological camps.</p>
<p><strong>Plenaries</strong></p>
<p>The plenary talks were a good chance to get bigger picture overviews of different subfields, at a range of stages of maturity.</p>
<ul>
<li><p>Giuseppe Moscarini gave a talk on his work over the past decade in the area of cross-sectional wage dynamics, which has developed since the seminal work of <a href="https://www.jstor.org/stable/2527292">Burdett and Mortensen</a> into a mature area providing a foundation for studies of employer-firm matched data, wage inequality, monopsony, career progression, and so on. He started with an overview of his work with Postel-Vinay on the role of job-to-job switching in wage growth. Then, continuing the theme of the New-Keynesianization of everything: while the more tractable Diamond-Mortensen-Pissarides model of <em>aggregate</em> labor market flows had already been merged into monetary models, he presented new work merging the disaggregated “job-ladder” style models into an NK framework, suggesting that <a href="https://campuspress.yale.edu/moscarini/data/">aggregate job-to-job recruiting</a>, as opposed to just unemployment, is an important and cyclically distinct determinant of aggregate wage inflation.</p></li>
<li><p>Esteban Rossi-Hansberg presented work that strikes me as very much in the new paradigm stage, on integrating regional heterogeneity into integrated assessment climate models. The question is compelling: while the carbon cycle is global, impacts and adaptation efforts are highly diverse across places, and figuring out how locations which may face very different flooding, extreme weather, temperature changes, and so on will adapt economically is important for measuring global costs and coordinating mitigation efforts. With recent progress in <a href="https://rossihansberg.economics.uchicago.edu/QSE.pdf">quantitative spatial economics</a> and computational methods applicable to high resolution heterogeneity, models can now incorporate detailed spatial economic data along with high resolution climate data and simulations, and Rossi-Hansberg and collaborators have provided some noteworthy examples. But as he emphasized in the talk, there’s still a lot to learn about how the basic economic mechanisms work, given their current difficulty, and I suspect there is a lot of “knob-twiddling” work to do just to figure out which aspects are important to put into such a model and how to specify and solve them, before these reach the normal science stage where we can just focus on arguing over a few crucial parameters, as climate macroeconomists working with aggregate models have been doing for years now. This talk inspired me, though I currently don’t do any work in climate, to attend some of the climate sessions later on in the conference, where young researchers are working hard to figure it out.</p></li>
<li><p>On the last day, IMF chief economist Gita Gopinath gave a talk on how open economy macroeconomic research informs the current work of the Fund and its policy framework. The talk was surprisingly academic in style, with a discussion of models and empirics in a way that you don’t usually get out of public-facing speeches by policy-makers, but directed at informing working economists about the role the work plays in the policy process. This involves aggregating a long history of work on available policy tools to synthesize policy recommendations, not any single model or study but a systematic review of many, with some modeling work done mainly to quantify and reconcile models of competing effects each described individually. The resulting framework reflects a very gradual evolution of the Fund’s views, from its 1990s Mundell-Fleming inspired recommendations for exchange rate flexibility as a stabilizing buffer, to incorporating decades of work since the Asian Financial Crisis of 1997-98 on models of borrowing constraints, sudden stops, and financial frictions suggesting that in some contingent cases capital controls may be a desirable measure. This stance broadly was already conventional wisdom by the time I graduated with an International Economics degree in 2009, but putting it in an official IMF policy document collecting a large number of careful studies of pros and cons represents a long process. As a bonus, she also gave a brief overview of one of her pre-IMF era research contributions, on dominant currency pricing in open economy New Keynesian models. This is a clear example of valuable knob-twiddling research, showing that the symmetric pricing assumption used in early models largely out of convenience was not only implausible but also consequential, with likely implications for global trade volumes during the current Fed tightening cycle.</p></li>
</ul>
<p><strong>Miscellaneous thoughts</strong></p>
<ul>
<li><p>On the econometrics side, after watching a presentation by Ashesh Rambachan on IRF interpretation (<a href="https://asheshrambachan.github.io/assets/files/arns_commontimeseries_causal.pdf">paper</a>, <a href="https://donskerclass.github.io/CausalEconometrics/TimeSeries.html">my notes</a>), I saw the implications all over other talks. Micro people have been reckoning with the need to precisely define the counterfactual path of a shock in dynamic models, as measured IRFs can be a mixture of other things. Some talks with IRFs gave this serious thought, formally or not, others not so much; thankfully, audiences seemed willing to provide helpful feedback in those cases.</p></li>
<li><p><a href="https://scholar.harvard.edu/straub/publications/using-sequence-space-solve-and-estimate-heterogeneous-agent-models">Sequence space methods</a> for heterogeneous agents models are seeing really fast adoption, from a fairly technical 2021 Econometrica paper to a relatively common approach. In addition to speed, I think this in part reflects interpretability, since it lets economists derive equilibrium conditions which can be informative even before fully solving numerically.</p></li>
<li><p>Wisconsin cheese curds are better than I expected.</p></li>
</ul>
</description>
</item>
<item>
<title>Papers I Liked 2021</title>
<link>/post/papers-i-liked-2021/</link>
<pubDate>Fri, 31 Dec 2021 00:00:00 +0000</pubDate>
<guid>/post/papers-i-liked-2021/</guid>
<description><script src="/rmarkdown-libs/header-attrs/header-attrs.js"></script>
<p>A list of 10 papers I read and liked in 2021. As in previous years, this is by date read rather than released or published, and selection is in no particular order. Overall, my list reflects my interests this year, prompted by research and teaching, in online learning, micro-founded macro, and causal inference, and, to the extent possible, intersections of these areas. As usual, I’m likely to have missed a lot of great work even in areas on which I focus, so absence likely indicates that I didn’t see it, or it’s on my ever-expanding to-read list, so ping me with your recommendations!</p>
<ul>
<li>Block, Dagan, and Rakhlin <a href="http://arxiv.org/abs/2102.01729">Majorizing Measures, Sequential Complexities, and Online Learning</a>
<ul>
<li>Sequential versions of the metric-entropy-type conditions that are the bread and butter of iid data analysis, extended to the setting of online estimation.</li>
<li>See also: This builds on Rakhlin, Sridharan, and Tewari (2015)’s <a href="https://link.springer.com/content/pdf/10.1007/s00440-013-0545-5.pdf">essential earlier work</a> on uniform martingale laws of large numbers via sequential versions of Rademacher complexity. More generally, there were many advances in inference and estimation for online-collected data this year: see the papers at the NeurIPS <a href="https://sites.google.com/view/causal-sequential-decisions/home">Causal Inference Challenges in Sequential Decision Making</a> workshop for a few.</li>
</ul></li>
<li>Klus, Schuster, and Muandet <a href="https://arxiv.org/abs/1712.01572">Eigendecompositions of Transfer Operators in Reproducing Kernel Hilbert Spaces</a>
<ul>
<li>A Koopman operator <span class="math inline">\(\mathcal{K}_{\tau}\)</span>, mapping <span class="math inline">\(f\to E[f(X_{t+\tau})|X_t=x]\)</span>, is a way of summarizing a possibly nonlinear and high dimensional dynamical system using a linear operator, which allows summarization and computation using linear algebra tools. Since this is effectively an evaluation operator, it pairs nicely with kernel mean embeddings and RKHS theory, which give precisely the properties needed to make these objects well behaved, allowing analysis in arbitrarily high dimension at no extra cost (a finite-dictionary sketch of the estimation step appears after this list).</li>
<li>See also: Budišić, Mohr, and Mezić (2012) <a href="https://doi.org/10.1063/1.4772195">Applied Koopmanism</a> for an intro to Koopman analysis of dynamical systems.</li>
</ul></li>
<li>Antolín-Díaz, Drechsel, and Petrella <a href="http://econweb.umd.edu/~drechsel/papers/advances.pdf">Advances in Nowcasting Economic Activity: Secular Trends, Large Shocks and New Data</a>
<ul>
<li>Classic linear time series models used in forecasting, causal, and structural macroeconomics have taken a beating in the past two years with the huge fluctuations due to the pandemic. But a dirty secret known to forecasters is that black box ML models designed to be much more flexible have, if anything, an even more dismal track record. This work adds carefully specified and empirically validated mechanisms for shifts, outliers, mean and volatility changes, and so on to the kind of dynamic factor models that have substantially outperformed, offering a chance to improve fit and the handling of big shifts while retaining that performance. This attention to distributional properties of macro data is surprisingly rare, and should encourage more work on understanding the sources of these features.</li>
</ul></li>
<li>Karadi, Schoenle, and Wursten <a href="https://sites.google.com/site/pkaradi696/KaradiSchoenleWursten.pdf">Measuring Price Selection in Microdata: It’s Not There</a>
<ul>
<li>A venerable result in sticky price models, going back to Golosov and Lucas, is that “menu costs” of price changes ought to result in a very limited real response of output to monetary impulses: even though costs keep prices fixed most of the time, any product that is seriously mispriced will be selected to have its price changed, so real effects should be minimal. This paper tests that theory directly using price microdata and shows that in response to identified monetary shocks, the prices that change do not appear to be those which are out of line, suggesting a much smaller selection effect than in baseline menu cost models. I liked this paper, beyond the importance of its empirical results, as a model for combining micro and macro data: to claim a microeconomic mechanism responds to an aggregate shock, your results are much more credible if you actually measure variation in that shock and the micro response to it, rather than using only macro or only micro variation.</li>
<li>See also: Wolf <a href="http://economics.mit.edu/files/22576">The Missing Intercept: A Demand Equivalence Approach</a> describing how <em>both</em> causal variation at the micro (cross-sectional) level and at the macro (time series) level are necessary to identify aggregate responses. This kind of hybrid approach is a welcome change which takes into account both the value of “<a href="https://doi.org/10.1257/jep.32.3.59">identified moments</a>” using microeconomic causal inference tools in macro and the reality that if you want to credibly measure aggregate causal effects, you need random variation at the aggregate level also.</li>
</ul></li>
<li>Hall, Payne, Sargent, and Szöke <a href="https://people.brandeis.edu/~ghall/papers/Yield_Curve_May_10_2021.pdf">Hicks-Arrow Prices for US Federal Debt 1791-1930</a>
<ul>
<li>A time series of risk and term structure adjusted U.S. interest rates going way back, estimated using a Bayesian hierarchical term structure model, which allows handling the variety of bond issuance terms and missingness that make comparing over time using models for modern yield curves quite difficult.</li>
<li>See also: <a href="https://turing.ml/">Turing</a>, the probabilistic programming language used for these results, which combines modern MCMC sampling algorithms with the full power of Julia’s Automatic Differentiation stack to allow fitting even complicated structural models with elements not standard in more statistics-specialized programming languages and benefitting from the ability of Bayes to handle inference with complicated missingness and dependence structures that become extremely challenging without it, even for simulation-based estimators.</li>
</ul></li>
<li>Callaway and Sant’Anna <a href="https://doi.org/10.1016/j.jeconom.2020.12.001">Difference-in-Differences with multiple time periods</a>
<ul>
<li>The diff-in-differdämmerung struck hard this year, with methods for handling DiD (particularly but not only with variation in treatment timing) up in the air and new papers coming out at an increasing pace. In trying to summarize at least <a href="https://donskerclass.github.io/CausalEconometrics/DifferenceinDifferences.html">a bit of this literature</a> for <a href="https://donskerclass.github.io/CausalEconometrics.html">a new class</a>, I found this paper, and others by Pedro Sant’Anna and collaborators, crystal clear about the sources of the issues and how to resolve them, with the bonus of well-documented <a href="https://bcallaway11.github.io/did/">software</a> and extensive examples.</li>
<li>See also: too many papers on DiD to list.</li>
</ul></li>
<li>Farrell, Liang, and Misra <a href="https://arxiv.org/abs/2010.14694">Deep Learning for Individual Heterogeneity: An Automatic Inference Framework</a>
<ul>
<li>Derives influence functions and doubly robust estimators for conditional loss-based estimation allowing, e.g., nonparametric dependence of coefficients on high-dimensional inputs in Generalized Linear Models. Results are flexible enough to be widely applicable, and simple enough to be easy to implement and interpret.</li>
<li>See also: Hines, Dukes, Diaz-Ordaz, Vansteelandt <a href="https://arxiv.org/abs/2107.00681">Demystifying statistical learning based on efficient influence functions</a> for an overview of this increasingly essential but always-confusing topic</li>
</ul></li>
<li>Foster and Syrgkanis <a href="https://arxiv.org/abs/1901.09036">Orthogonal Statistical Learning</a>
<ul>
<li>A very general theory extending the “Double Machine Learning” approach to loss-function-based estimation where, instead of a root-n estimable regular parameter, you may have a more complex object like a function (e.g., a conditional treatment effect, a policy, etc.) which you want to make robust to high dimensional nuisance parameters.</li>
<li>See also: I went back and reread the published version of the <a href="https://doi.org/10.1111/ectj.12097">original “Double ML” paper</a> to write up teaching notes, which was helpful for really thinking through the results (a minimal sketch of the basic cross-fitted recipe appears after this list).</li>
</ul></li>
<li>Rambachan and Shephard <a href="https://asheshrambachan.github.io/assets/files/arns_commontimeseries_causal.pdf">When do common time series estimands have nonparametric causal meaning?</a>
<ul>
<li>Potential outcomes for time series are a lot harder than you would think at first, because repeated intervention necessarily vastly expands the space of possible relationships between treatments, and between treatments and outcomes. This paper lays out the issues and proposes some solutions.</li>
<li>See also: I based my <a href="https://donskerclass.github.io/CausalEconometrics/TimeSeries.html">time series causal inference teaching notes</a> mostly on this paper.</li>
</ul></li>
<li>Breza, Kaur, and Shamdasani <a href="https://drive.google.com/file/d/1RiMgkKu7DJqfnqU3vIqQszD1TEREY9jf/view?usp=sharing">Labor Rationing</a>
<ul>
<li>How do economies respond to labor supply shocks? Breza et al. just go out there and run the experiment, setting up a bunch of factories and hiring away a quarter of eligible workers in half of 60 villages in Odisha. In peak season wages rise, as in textbook theory, but in lean season wages do nothing, as it appears most workers are effectively unemployed.</li>
<li>See also: The authors’ other work in the same setting testing theories of wage rigidity. For example, they find strong <a href="https://drive.google.com/file/d/1Z2ZsrFZ71-Upq7dvN7MXy225HT0HgkqY/view?usp=sharing">experimental support</a> for <a href="https://delong.typepad.com/files/bewley-wages.pdf">Bewley’s</a> morale theory for why employers don’t just cut wages. By running experiments at the market level, they have been able to provide a lot of compelling evidence on issues that have previously been relegated to much more theoretical debate.</li>
</ul></li>
</ul>
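<p>For the Klus, Schuster, and Muandet entry above, here is a minimal finite-dictionary sketch (extended dynamic mode decomposition) of how a Koopman operator gets approximated from trajectory data; this is not the kernel construction of the paper, and the example system and monomial dictionary are arbitrary choices of mine.</p>
<pre class="python"><code># EDMD: approximate the Koopman operator on the span of a finite dictionary of
# observables psi, using snapshot pairs (x_t, x_{t+1}) from a trajectory.
import numpy as np

rng = np.random.default_rng(0)

# Simulate a weakly nonlinear, stable scalar system
T = 2000
x = np.empty(T)
x[0] = 0.3
for t in range(T - 1):
    x[t + 1] = 0.8 * x[t] - 0.1 * x[t]**3 + 0.05 * rng.normal()

def psi(v):
    """Monomial dictionary of observables up to degree 4."""
    return np.column_stack([np.ones_like(v), v, v**2, v**3, v**4])

Psi_now, Psi_next = psi(x[:-1]), psi(x[1:])

# Least-squares Koopman matrix K: rows satisfy psi(x_{t+1}) ~ psi(x_t) @ K,
# so an observable with coefficient vector c is advanced by c -&gt; K @ c.
K, *_ = np.linalg.lstsq(Psi_now, Psi_next, rcond=None)

# Eigenvalues of K approximate Koopman eigenvalues (here, roughly powers of 0.8);
# eigenvectors give the eigenfunctions as combinations of dictionary elements.
eigvals, eigvecs = np.linalg.eig(K)
print(np.sort(np.abs(eigvals))[::-1])</code></pre>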
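<p>And for the “Double ML” entry above, a minimal sketch of the basic cross-fitted partialling-out recipe in the partially linear model, on synthetic data; the learners and data-generating process are arbitrary choices of mine, meant only to show the structure of the procedure.</p>
<pre class="python"><code># Double/debiased ML for the partially linear model Y = theta*D + g(X) + e:
# learn E[Y|X] and E[D|X] with flexible learners using out-of-fold predictions
# (cross-fitting), then regress the Y-residual on the D-residual.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(0)
n, p, theta = 2000, 10, 1.0
X = rng.normal(size=(n, p))
g = np.sin(X[:, 0]) + X[:, 1]**2          # nonlinear confounding
D = g + rng.normal(size=n)                # treatment depends on X
Y = theta * D + g + rng.normal(size=n)    # outcome depends on D and X

# Out-of-fold nuisance predictions (the cross-fitting step)
m_hat = cross_val_predict(RandomForestRegressor(n_estimators=200, random_state=0), X, D, cv=5)
l_hat = cross_val_predict(RandomForestRegressor(n_estimators=200, random_state=0), X, Y, cv=5)

# Residual-on-residual regression gives the Neyman-orthogonal estimate of theta
D_res, Y_res = D - m_hat, Y - l_hat
theta_hat = np.sum(D_res * Y_res) / np.sum(D_res**2)
psi = D_res * (Y_res - theta_hat * D_res)
se = np.sqrt(np.mean(psi**2) / n) / np.mean(D_res**2)
print(theta_hat, se)                      # theta_hat close to 1.0</code></pre>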
</description>
</item>
<item>
<title>Efficient Online Estimation of Causal Effects by Deciding What to Observe</title>
<link>/publication/onlinemomentselection/</link>
<pubDate>Mon, 23 Aug 2021 00:00:00 -0400</pubDate>
<guid>/publication/onlinemomentselection/</guid>
<description></description>
</item>
<item>
<title>Estimating Treatment Effects with Observed Confounders and Mediators</title>
<link>/publication/confoundersmediators/</link>
<pubDate>Mon, 14 Jun 2021 00:00:00 -0400</pubDate>
<guid>/publication/confoundersmediators/</guid>
<description></description>
</item>
<item>
<title>Top Papers 2020</title>
<link>/post/top-papers-2020/</link>
<pubDate>Wed, 30 Dec 2020 00:00:00 +0000</pubDate>
<guid>/post/top-papers-2020/</guid>
<description><p>The following is a look back at my reading for 2020, identifying a totally subjective set of the top 10 papers I read this year. My reading patterns, as usual, have not been so systematic, so if your brilliant work is missing it either slipped past my attention or is living in an ever-expanding set of folders and browser tabs on my to-read list. I’ll exclude papers I refereed, for privacy purposes (a fair amount if you include conferences, which also cuts out a lot of the macroeconomics from my list). Themes I focused on were Bayesian computation, the optimal policy estimation/dynamic treatment regime/offline reinforcement learning space, and survival/point process models, all more-or-less project-related and in all of which I’m sure I’m missing some foundational understanding. I spent a brief time in March mostly reading about basic epidemiology, which I am led to believe many others did as well, but didn’t take it anywhere.</p>
<p>Papers, in alphabetical order</p>
<ul>
<li>Adusumilli, Geiecke, Schilter. <a href="https://arxiv.org/abs/1904.01047">Dynamically optimal treatment allocation using reinforcement learning</a>
<ul>
<li>Approximation methods for estimating viscosity solutions of HJB equations and their resulting optimal policies from data. These methods will form a key step in taking continuous time dynamic macro models (see <a href="https://benjaminmoll.com/lectures/">Moll lecture notes</a>) to data.</li>
</ul></li>
<li>Andrews &amp; Mikusheva <a href="https://scholar.harvard.edu/iandrews/publications/optimal-decision-rules-weak-gmm">Optimal Decision Rules for Weak GMM</a>
<ul>
<li>The Generalized Method of Moments defines a semiparametric estimator implicitly, making it often unclear what the form of the nuisance parameter being ignored actually is, especially in cases of irregular identification. This paper takes a middle ground between the fully Bayesian semiparametric approach which puts a (usually Dirichlet Process) prior over the infinite dimensional nuisance space and the regular frequentist approach which ignores it entirely, showing weak convergence to a Gaussian Process, which is tractable enough to characterize and apply to obtain approximate Bayesian tests and decision rules without strong identification.</li>
</ul></li>
<li>Cevid, Michel, Bühlmann, &amp; Meinshausen <a href="https://arxiv.org/abs/2005.14458">Distributional Random Forests : Heterogeneity Adjustment and Multivariate Distributional Regression</a>
<ul>
<li>Conditional density estimation by random forests with splits by (approximate) kernel MMD distribution tests. Produces a set of conditional weights that can be used to represent and visualize possibly multivariate conditional distributions. An <a href="https://github.com/lorismichel/drf">R package</a> is available and this really quickly became one of my go-to data exploration tools.</li>
<li>See also: Lee and Pospisil have a <a href="https://github.com/tpospisi/rfcde">related method</a> splitting by sieve <span class="math inline">\(L^2\)</span> distance tests which is more or less similar, though more tailored to low dimensional outputs.</li>
</ul></li>
<li>Gelman, Vehtari, Simpson, Margossian, Carpenter, Yao, Kennedy, Gabry, Bürkner, Modrák. <a href="http://arxiv.org/abs/2011.01808">Bayesian Workflow</a>
<ul>
<li>A comprehensive overview of what Bayesian statisticians actually do when analyzing data, as opposed to the mythology in our intro textbooks (roughly, the likelihood is given to you by God, you think real hard and come up with a prior, then you apply Bayes rule and are done). It includes all the bits of sequential model expansion and checking and computational diagnostics and compromise between simplicity, convention, and domain expertise you actually go through to build a Bayesian model from scratch. The contrarian in me would love to see more frequentist analysis of this paradigm. A lot of the checks are there to make sure you’re not fooling yourself; how well do they work in practice?</li>
<li>See also Michael Betancourt’s <a href="https://betanalpha.github.io/writing/">notebooks</a> for worked examples of this process.</li>
</ul></li>
<li>Giannone, Lenza, Primiceri <a href="https://www.tandfonline.com/doi/abs/10.1080/01621459.2018.1483826">Priors for the Long Run</a>
<ul>
<li>Exact rank constraints for cointegration are often uncertain, making pure VECM modeling a bit fraught, but standard priors on the VAR form are not strongly constraining of long run relationships, and improper treatment of initial conditions can lead to spurious inference on trends. This proposes a simple class of priors which allow “soft” constraints.</li>
</ul></li>
<li>Kallus and Uehara <a href="http://arxiv.org/abs/1908.08526">Double Reinforcement Learning for Efficient Off-Policy Evaluation in Markov Decision Processes</a>
<ul>
<li>Characterizes the semiparametric efficiency bound for the value of a dynamic policy and provides a doubly robust estimator combining the appropriate variants of a regression statistic and a (sequential) probability weighting statistic, allowing use of nonparametric and (with sample splitting) machine learning estimates in reinforcement learning while retaining parametric convergence rates (the one-period special case of such a doubly robust estimator is sketched after this list).</li>
<li>See also companion papers on estimating the <a href="https://arxiv.org/abs/2002.04014">policy and policy gradient</a> and extending to the case of <a href="http://arxiv.org/abs/2006.03900">deterministic policies</a> (which require smoothing) among others, or watch <a href="https://www.youtube.com/watch?v=n5ZoxT_WmHo">the talk</a> for an overview.</li>
</ul></li>
<li>Sawhney &amp; Crane <a href="https://dl.acm.org/doi/abs/10.1145/3386569.3392374">Monte Carlo Geometry Processing: A Grid-Free Approach to PDE-Based Methods on Volumetric Domains</a>
<ul>
<li>I don’t usually read papers in computer graphics, but I do care a lot about computing <a href="https://donskerclass.github.io/post/why-laplacians/">Laplacians</a>, and this paper offers a clever new Monte Carlo based method that allows computation on much more complicated domains (a sketch of the walk-on-spheres recursion it builds on appears after this list). It’s not yet obvious to me whether the method generalizes to the PDE classes I and other macroeconomists <a href="https://benjaminmoll.com/wp-content/uploads/2019/07/PDE_macro_translated.pdf">tend to work with</a>, but even if not it should still be handy for many applications.</li>
</ul></li>
<li>Schmelzing <a href="https://www.bankofengland.co.uk/working-paper/2020/eight-centuries-of-global-real-interest-rates-r-g-and-the-suprasecular-decline-1311-2018">Eight Centuries of Global Real Interest Rates, R-G, and the ‘Suprasecular’ Decline, 1311–2018</a>
<ul>
<li>An enormous data collection process and public good which will be informing research on interest rates for years to come. As with any such effort at turning messy historical data into aggregate series, many contestable choices go into data selection, standardization, and normalization, and I don’t think the author’s simple trend estimates of a several hundred year decline will be the last word on the statistical properties or future implications of this data, but now that it’s out there we have a basis for discussion and testing.</li>
<li>See also: lots of useful historical macro data collection (going not quite so far back) by the folks at the Bonn <a href="http://www.macrohistory.net/">Macrohistory Lab</a>.</li>
</ul></li>
<li>Wolf <a href="https://www.aeaweb.org/articles?id=10.1257/mac.20180328">SVAR (Mis-)Identification and the Real Effects of Monetary Policy</a>
<ul>
<li>A nice practical application of Bayesian model checking, applying SVAR methods to simulated macro data when the (usually a bit suspect) identifying restrictions need not hold exactly. It finds that early sign-restricted BVARs with uniform (Haar) priors tend to be biased toward finding monetary neutrality, and do not in fact provide noteworthy evidence contradicting the implied shock responses of typical central bank monetary DSGEs. Of course, such models have many other problems and not being contradicted by one test is not dispositive, but macro debates would be elevated if people would check to make sure that their contradictory evidence is in fact contradictory (respectively, supportive).</li>
</ul></li>
<li>Wang and Blei <a href="http://arxiv.org/abs/1905.10859">Variational Bayes under Model Misspecification</a>
<ul>
<li>Describes what (mean field) variational Bayes ends up targeting, at least in cases where a Bernstein von Mises approximation works well enough. Also covers the much more nontrivial case with latent variables.</li>
<li>I will judiciously refrain from comment on other recent works by this pair (discussion <a href="https://casualinfer.libsyn.com/fairness-in-machine-learning-with-sherri-rose-episode-03">1</a>, <a href="https://casualinfer.libsyn.com/episode-15-dr-betsy-ogburn">2</a>) except to say that dimensionality reduction in causal inference deserves more study and this <a href="https://drive.google.com/file/d/1aN1cK_UEffkBT34a2aNtrZKIJFw_xibX/view?usp=sharing">manifold learning approach</a> to create a nonparametric version of interactive fixed effects estimation looks like a useful supplement to the standard panel data toolbox.</li>
</ul></li>
</ul>
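<p>For the Kallus and Uehara entry above, here is the one-period (contextual bandit) special case of a doubly robust off-policy value estimator, just to show the structure of combining a regression model with probability weighting; the sequential, MDP version the paper actually studies recurses this construction over time, and the synthetic data and names below are my own illustration.</p>
<pre class="python"><code># Doubly robust off-policy evaluation, one-period case: estimate the value of a
# target policy pi from logged (context, action, reward) data collected under a
# different behavior policy, combining an outcome model q_hat with inverse
# propensity weighting. The estimate is consistent if either piece is correct.
import numpy as np

rng = np.random.default_rng(0)
n = 10000
x = rng.normal(size=n)                          # contexts
p_behavior = np.full(n, 0.5)                    # behavior policy: uniform over actions {0, 1}
a = rng.binomial(1, p_behavior)                 # logged actions
r = x * a + rng.normal(size=n)                  # reward: action 1 pays x, action 0 pays 0

pi = lambda x: (x > 0).astype(int)              # target policy: act when x is positive
q_hat = lambda x, a: 0.9 * x * a                # (slightly misspecified) outcome model

direct = q_hat(x, pi(x))                                          # model-based piece
correction = (a == pi(x)) / p_behavior * (r - q_hat(x, a))        # weighted residual piece
v_dr = np.mean(direct + correction)
print(v_dr)   # roughly 0.40, the expected value of max(x, 0) for standard normal x</code></pre>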
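<p>And for the Sawhney and Crane entry, a minimal sketch of the classical walk-on-spheres recursion that the paper builds on, for the Laplace equation on the unit disk; the domain, boundary data, and tolerances are toy choices of mine, not anything from the paper.</p>
<pre class="python"><code># Walk on spheres: estimate the harmonic function u at a point x0, where u
# solves the Laplace equation on the unit disk with boundary data g, by
# averaging g at the exit points of random walks that repeatedly jump to a
# uniform point on the largest circle centered at the current point and
# contained in the domain. For g(x, y) = x the exact solution is u(x, y) = x.
import numpy as np

rng = np.random.default_rng(0)

def g(p):
    return p[0]                                  # boundary data on the unit circle

def walk_on_spheres(x0, eps=1e-3, n_walks=20000):
    total = 0.0
    for _ in range(n_walks):
        p = np.array(x0, dtype=float)
        dist = 1.0 - np.linalg.norm(p)           # distance to the boundary circle
        while dist > eps:                        # walk until within eps of the boundary
            theta = rng.uniform(0.0, 2.0 * np.pi)
            p = p + dist * np.array([np.cos(theta), np.sin(theta)])
            dist = 1.0 - np.linalg.norm(p)
        total += g(p / np.linalg.norm(p))        # evaluate g at the nearest boundary point
    return total / n_walks

print(walk_on_spheres([0.3, 0.2]))               # roughly 0.3</code></pre>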
</description>
</item>
<item>
<title>Some issues with Bayesian epistemology</title>
<link>/post/some-issues-with-bayesian-epistemology/</link>
<pubDate>Sat, 05 Sep 2020 00:00:00 +0000</pubDate>
<guid>/post/some-issues-with-bayesian-epistemology/</guid>
<description><p>In this post, I’d like to lay out a few questions and concerns I have about Bayesianism and Bayesian decision theory as a <em>normative</em> theory of inductive inference. As a positive theory, of what people do, psychology is full of demonstrations of cases where people do not use Bayesian reasoning (the entire “heuristics and biases” area), which is interesting but not my target. There are no new ideas here, just a summary of some old concerns which merit more consideration, and not even necessarily the most important ones, which are better covered elsewhere.</p>
<p>My main concerns are, effectively, computational. As I understand computer science, the processing of information <em>requires real resources</em>, (mostly time, but also energy, space, etc) and so any theory of reasoning which <em>mandates</em> isomorphism between statements for which computation is required to demonstrate equivalence is effectively ignoring real costs that are unavoidable and so must have some impact on decisions. Further, as I understand it, there is no way to get around this by simply adding this cost as a component of the decision problem.<a href="#fn1" class="footnote-ref" id="fnref1"><sup>1</sup></a> The problem here is that determination of this cost and reasoning over it is also computationally nontrivial, and so the costs of this determination must be taken into account. But determining these is also costly, ad infinitum. It may be the case that there is some way around this infinite regress problem via means of some kind of fixed point argument, though it is not clear that the limit of such an argument would retain the properties of Bayesian reasoning.</p>
<p>The question of these processing costs becomes more interesting to the extent that they are quantitatively nontrivial. As somebody who spends hours running and debugging MCMC samplers and does a lot of reading about Bayesian computation, my takeaway from this literature is that the limits are fundamental. In particular, there are classes of distributions such that the Bayesian update step is hard, for a variety of hardness classes. This includes many distributions where the update step is NP complete, so that our best understanding of P vs NP suggests that the time to perform the update can be exponential in the size of the problem (sampling from spin glass models is an archetypal example, though really any unrestricted distribution over long strings of discrete bits will do). I suppose a kind of trivial example of this is the case with prior mass 1, in which case the hardness reduces to the hardness of the deterministic computation problem, and so encompasses every standard problem in computer science. More than just exponential time (which can mean use of time longer than the length of the known history of the universe for problems of sizes faced practically by human beings every day, like drawing inferences from the state of a high resolution image), some integration problems may even be uncomputable in the Turing sense, and so not just wildly impractical but impossible to implement on any physical substrate (at least if the Church-Turing hypothesis is correct). Amusingly, this extends to the problem above of determining the costs of practical transformations, as determining whether a problem is computable in finite time is itself the classic example of a problem which is not computable.</p>
<p>So, exact Bayesianism for all conceivable problems is physically impossible, which makes it slightly less compelling as a normative goal. What about approximations? This will obviously depend on what counts as a reasonable approximation; if one accepts the topology in which all decisions are equivalent, then sure, “approximate” Bayesianism is possible. If one gets to stronger senses of approximation, such as requiring computation to within a constant factor, for cases where this makes sense, there are inapproximability results suggesting that for many problems, one still has exponential costs. Alternately, one could think about approximation in the limit of infinite time or information; this then gets into the literature on Bayesian asymptotics, though I guess with the goal exactly reversed. Rather than attempt to show Bayes converges to a fixed truth in the limit, one would try to show that a feasible decision procedure converges to Bayes in the limit. For the former goal, impossibility results are available in the general case, with positive results, like Schwartz’s theorem and its quantitative extensions ( <a href="http://www.math.leidenuniv.nl/~avdvaart/BNP/">notes</a> and <a href="https://www.cambridge.org/core/books/fundamentals-of-nonparametric-bayesian-inference/C96325101025D308C9F31F4470DEA2E8">monograph</a>) relying on compactness conditions that are more or less unsurprising given what is known on minimax lower bounds from information theory on what cannot be learned in a frequentist sense. For the other direction (whatever that might mean), I don’t know what results have been shown, though I expect, given the computational limitations in worst case prior-likelihood settings, that no universally applicable procedure is available.</p>
<p>How about if we restrict our demands from Bayesianism, for any prior and likelihood to Bayesian methods for some reasonable prior or class of priors? In small enough restricted cases, this seems obviously feasible: we can all do Beta-Bernoulli updating at minimal cost, which is great if the only information we ever receive is a single yes no bit. If we want Bayesianism to be a general theory of logical decision making, it probably has to go beyond that. Some people like the idea of <a href="https://en.wikipedia.org/wiki/Solomonoff%27s_theory_of_inductive_inference">Solomonoff induction</a>, which proposes Bayesian inference with a “universal prior” over all <em>computable</em> distributions, avoiding the noncomputability problem in some sense. This proposes a prior mass on all programs exponentially decreasing in their length expressed in bits in the Kolmogorov complexity sense. Aside from the problem that it runs into computational hardness results for determining Kolmogorov complexity and so is not itself computable, running into the above issues again, there are some additional questions.</p>
<p>This exponentially decreasing tail condition seems to embed the space of all programs into a hyper-rectangle obeying summability conditions sufficient to satisfy Schwartz’s theorem for frequentist consistency of Bayesian inference. Hyperrectangle priors are fairly well studied in Bayesian nonparametrics: lower bounds are provided by the <a href="https://www.stat.berkeley.edu/~binyu/summer08/yu.assoua.pdf">Assouad’s lemma</a> construction and upper bounds are known and in fact reasonably small: by Brown and Low’s <a href="https://projecteuclid.org/euclid.aos/1032181159">equivalence results</a>, they are equivalent to estimation of a Hölder-smooth function, for which an appropriately integrated Brownian motion prior provides near-minimax rates. This seems to be saying that universal induction as a frequentist problem is slightly easier than one of the easier single problems in Bayesian nonparametrics. This seems… a little strange, maybe. One way to look at this is to accept, and say that the infinities contemplated by nonparametric inference are the absurd thing, or to marvel that a simple Gaussian Process regression is at least as hard as understanding all laws deriving the behavior of all activity in the universe and be grateful that it only takes cubic time. The other alternative is to suggest that this implies that the prior, while covering the entire space in principle, is satisfying a tightness condition that is so restrictive that it effectively restricts you to a topologically meager set of programs, ruling out in some sense the vast majority of the entire space (this sense is that of <a href="https://en.wikipedia.org/wiki/Baire_category_theorem">Baire category</a>) in the same way that any two Gaussian process priors with distinct covariance functions are mutually singular. In this sense, it is an incredibly strong restriction and hard to justify ex ante, certainly at least as contestable as justifying an ex-ante fixed smoothing parameter for your GP prior. (If you’ve ever run one of these, you know that’s a dig: people make so many ugly blurry maps.)</p>
<p>Alternatives might arise in fixed tractable inference procedures, or the combination of tractable procedures and models, though all of these have quite a few limitations. MCMC has the same hardness problems as above if you ask for “eventual” convergence, and fairly odd properties if run up to a fixed duration (including nondeterministic outcomes and a high probability that those outcomes exhibit various results often called logical fallacies or biases, which I suppose is not surprising since common definitions of biases or fallacies appear to essentially require Bayesian reasoning to begin with). Variational inference likewise has these issues: even with the variational approximation, optimizing the modified objective may still be costly or even arbitrarily hard in some cases. Various neuroscientists seem to have taken up <a href="https://en.wikipedia.org/wiki/Free_energy_principle">some forms of variational inference</a> as a descriptive model of brain activity. Without expertise in neuroscience, I will leave well enough alone and say it seems like something that merits further empirical exploration. But to somebody who runs variational inference in practice, with mixed and sometimes surprising results and computational improvements that don’t always suggest the issue of cost is resolved, it also doesn’t seem like a full solution. I’m happy my model takes an hour rather than two days to run; I’m not sure this makes the method a compelling basis for decision-making.</p>
<p>I was going to extend this to say something about Bayes Nash equilibrium, but my problems with that concept are largely distinct, coming from the “equilibrium” rather than the “Bayes”<a href="#fn2" class="footnote-ref" id="fnref2"><sup>2</sup></a>, but I think I’ve conveyed my basic concerns. I don’t know that I have a compelling alternative, except to say that while an acceptable and actually feasible theory of decision making may have internal states of some kind, I see no reason that one has to have “beliefs” of any kind, at least as objects which reduce in some way to classical truth values over statements. One can simply have actions, which may on occasion correspond to binary decisions over sets that could in principle be assigned a truth value, though usually they won’t. This seems to have connections to the idea of lazy evaluation in nonparametric Bayes, which permits computations consistent with Bayes rule over a high-dimensional space to be retrieved via query without maintaining the full set of possible responses to such queries in memory. But this is only possible in a tractable way, while still having the results follow Bayesian inference, for fairly limited classes of problems involving conjugacy. More generally, a theory which fully incorporates computational costs will likely have to await further developments in characterizing these costs, a still not fully solved problem in computer science.</p>
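<p>To make the lazy evaluation point slightly more concrete, here is a rough Julia sketch of the conjugate Gaussian process case (the kernel, noise level, and query points are arbitrary illustrative choices): nothing about the infinite-dimensional posterior is ever stored; the Gaussian conditional is computed on demand only at the points actually queried, at cubic cost in the number of observations.</p>
<pre><code class="language-julia">using LinearAlgebra

# Squared-exponential kernel; the lengthscale is an arbitrary illustrative choice.
k(x, y; ℓ = 1.0) = exp(-(x - y)^2 / (2 * ℓ^2))

# "Lazy" posterior: nothing about the function is stored beyond the data;
# the Gaussian conditional is computed only at the queried points, on demand.
function gp_query(xquery, xobs, yobs; σ2 = 0.1)
    Koo = [k(a, b) for a in xobs, b in xobs] + σ2 * I
    Kqo = [k(a, b) for a in xquery, b in xobs]
    Kqq = [k(a, b) for a in xquery, b in xquery]
    μ = Kqo * (Koo \ yobs)               # posterior mean at the queries
    Σ = Kqq - Kqo * (Koo \ Kqo')         # posterior covariance at the queries
    return μ, Σ
end

xobs = collect(0.0:0.5:5.0)
yobs = sin.(xobs) .+ 0.1 .* randn(length(xobs))
μ, Σ = gp_query([1.25, 3.75], xobs, yobs)   # answer only the questions actually asked
</code></pre>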
<div class="footnotes">
<hr />
<ol>
<li id="fn1"><p>This is something like what theories of “rational inattention” do. However, information processing costs in these theories are taken over a channel between information which still has the representation as a random variable on both sides. The agent is assumed to be on one side of this channel and so is effectively still dealing with information in a fully probabilistic form over which the optimization criterion still requires reasoning to be Bayesian. That is to say, rational inattention is a theory of the information available to an agent, not a theory of the reasoning and decision-making process given available information.<a href="#fnref1" class="footnote-back">↩</a></p></li>
<li id="fn2"><p>Roughly, even a computationally unlimited Bayesian agent could not reason itself to Bayes Nash equilibrium, unless it had the “right” priors. Where these priors come from is left unspoken (except that in the model they are “true”), which is a practical problem that drives a lot of differences between applied computational mechanism design, which is forced to answer this question, and the theory we teach our grad students.<a href="#fnref2" class="footnote-back">↩</a></p></li>
</ol>
</div>
</description>
</item>
<item>
<title>Posterior Samplers for Turing.jl</title>
<link>/post/posterior-samplers-for-turing-jl/</link>
<pubDate>Sun, 28 Jun 2020 00:00:00 +0000</pubDate>
<guid>/post/posterior-samplers-for-turing-jl/</guid>
<description><p>Prompted by a question on the slack for <a href="https://turing.ml/" target="_blank">Turing.jl</a> about when to use which Bayesian sampling algorithms for which kinds of problems, I compiled a quick off-the-cuff summary of my opinions on specific samplers and how and when to use them. Take these with a grain of salt, as I have more experience with some than with others, and in any case the nice thing about a framework like Turing is that you can switch out samplers easily and test for yourself which is best for your application.</p>
<p>To get a good visual sense of how different samplers explore a parameter space, the animations <a href="https://chi-feng.github.io/mcmc-demo/" target="_blank">page by Chi Feng</a> is a great resource.</p>
<p>The following list covers mainly the samplers included by default in Turing. There&rsquo;s a lot of work in Bayesian computation on other algorithms, and on other implementations of these algorithms, which could lead to different conclusions.</p>
<ol>
<li><p>Metropolis-Hastings (MH): Explores the space randomly. Extremely simple, extremely slow, but can &ldquo;work&rdquo; in most models. Mainly worth a try if everything else fails.</p></li>
<li><p>HMC/NUTS: Gradient-based exploration, meaning the log posterior density needs to be differentiable in the parameters (so the parameters must be continuous). It&rsquo;s fast if that&rsquo;s true, and so almost always the right choice if you can make your model differentiable (and sometimes so much better that it&rsquo;s worth making your model differentiable even if your initial model isn&rsquo;t, e.g. by marginalizing out discrete parameters). There are relatively minor differences between NUTS and the default HMC algorithm; NUTS mostly automates tuning of the step size and trajectory length.</p></li>
<li><p>Gibbs sampling: A &ldquo;coordinate-ascent&rdquo;-like sampler which samples in blocks from conditional distributions. It used to be popular for factorizable models where conditional distributions could be updated in closed form due to conjugacy. It&rsquo;s still useful if you can do this, but slow for general models. Its main use now is for combining samplers, for example HMC for the differentiable parameters and something else for the nondifferentiable parameters.</p></li>
<li><p>SMC/&ldquo;Particle Filtering&rdquo;: A method based on importance sampling: draws from an initial proposal are reweighted and repeatedly updated. It is designed to work well if the proposal distribution and updates are close to the targets, and the number of particles should be large for reasonable accuracy. Turing&rsquo;s implementation does this parameter by parameter, starting at the prior and updating, which is close to what you want for the most common intended use, state space models with sequential structure, which is the main case where I would even consider this. That said, tuning the proposals is really important, and more customizable SMC methods are useful in many cases where one has a computationally tractable approximate posterior that one wants to update to be closer to the exact posterior. This tends to be model-specific and not a good use case for generic PPLs, though.</p></li>
<li><p>Particle Gibbs (PG), or &ldquo;Conditional SMC&rdquo;: like SMC, but modified to be compatible with Metropolis sampling steps. Its main use I can see is as a step in a Gibbs sampler, where it can handle a discrete parameter, with HMC for the other parts (see the sketch after this list). The number of particles doesn&rsquo;t have to be overwhelmingly large, due to sampling, but too small a number can cause problems.</p></li>
<li><p>Stochastic Gradient methods (SGLD/SGHMC/SGNHT): Gradient-based methods that subsample the data to get less costly but noisier gradients, approximating the corresponding full-gradient methods (SGHMC approximates HMC; SGLD approximates Langevin dynamics, which also uses gradients but is simpler and usually slightly worse than HMC). These are designed for large data applications where going through a huge data set each iteration may be infeasible. They are popular for Bayesian neural network applications, where optimization methods also rely on data subsampling.</p></li>
<li><p>Variational Inference: Not a sampler per se. It posits a parametric family for the posterior shape and then optimizes the fit to the posterior according to a computationally feasible criterion (i.e., one which doesn&rsquo;t require computing the normalizing constant in Bayes rule), allowing you to optimize instead of sample. In general, this has no guarantee of reaching the true posterior, no matter how long you run it, but if you want a slightly wrong answer very fast it can be a good choice. It&rsquo;s also popular for Bayesian neural networks, and other &ldquo;big&rdquo; models like high dimensional topic models.</p></li>
</ol>
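<p>To give a sense of how easy it is to switch samplers, here&rsquo;s a rough sketch. The toy model and tuning constants are invented for illustration, and the calls reflect my recollection of the Turing API around the time of writing, so check the current docs. The same model can be handed to MH, or, since it contains a discrete parameter, to a Gibbs combination with PG for the discrete part and HMC for the continuous part.</p>
<pre><code class="language-julia">using Turing

# Toy model with one discrete parameter (k) and two continuous ones (m1, m2),
# just to illustrate combining samplers; not a serious mixture model.
@model function toy(x)
    m1 ~ Normal(0, 10)
    m2 ~ Normal(0, 10)
    k ~ Categorical([0.5, 0.5])
    μ = (m1, m2)
    for i in eachindex(x)
        x[i] ~ Normal(μ[k], 1.0)
    end
end

x = randn(100) .+ 2.0

# Swapping samplers is just a change of argument to `sample`:
chain_mh = sample(toy(x), MH(), 2_000)   # simple but slow random-walk baseline

# PG handles the discrete k, HMC the continuous m1, m2 (the step size and number
# of leapfrog steps here are arbitrary and would need tuning in practice).
chain_mix = sample(toy(x), Gibbs(PG(20, :k), HMC(0.05, 10, :m1, :m2)), 2_000)
</code></pre>
<p>Comparing the resulting chains on mixing, effective sample size, and runtime for your own model is usually more informative than any general advice.</p>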
</description>
</item>
<item>
<title>Perturbation Methods for Incomplete Markets Economies</title>
<link>/publication/perturbationincomplete/</link>
<pubDate>Thu, 05 Dec 2019 00:00:00 -0500</pubDate>
<guid>/publication/perturbationincomplete/</guid>
<description></description>
</item>
<item>
<title>On Online Learning for Economic Forecasts</title>
<link>/post/on-online-learning-for-economic-forecasts/</link>
<pubDate>Tue, 29 Oct 2019 00:00:00 +0000</pubDate>
<guid>/post/on-online-learning-for-economic-forecasts/</guid>
<description><p>Jérémy Fouliard, Michael Howell, and Hélène Rey have just released <a href="http://conference.nber.org/conf_papers/f130922.pdf">an update of their working paper</a> applying methods from the field of Online Learning to forecasting of financial crises, demonstrating impressive performance in a difficult forecasting domain using some techniques that appear to be unappreciated in econometrics. Francis Diebold provides <a href="https://fxdiebold.blogspot.com/2019/10/machine-learning-for-financial-crises.html">discussion</a> and <a href="https://fxdiebold.blogspot.com/2019/10/online-learning-vs-tvp-forecast.html">perspectives</a>. This work is interesting to me as I spent much of the earlier part of this year designing and running a <a href="/Forecasting.html">course on economic forecasting</a> which attempted to offer a variety of perspectives beyond the traditional econometric approach, including that of Online Learning.<a href="#fn1" class="footnote-ref" id="fnref1"><sup>1</sup></a> This perspective has been widely applied by machine learning practitioners and businesses that employ them, particularly major web companies like <a href="https://ai.google/research/pubs/pub41159">Google</a>, <a href="https://vowpalwabbit.org/">Yahoo, and Microsoft</a>, but has not seen widespread take-up by more traditional economic forecasting consumers and practitioners like central banks and government institutions.</p>
<p>The essence of the online learning approach has less to do with particular algorithms (though there are many) than with reconsidering the choice of <a href="Forecasting/Evaluation.html">evaluation framework</a>. Consider a quantity to be forecast <span class="math inline">\(y_{t+h}\)</span>, like an indicator equal to 1 in the presence of a financial crisis. A forecasting rule <span class="math inline">\(f(.)\)</span> depending on currently available data <span class="math inline">\(\mathcal{Y}_t\)</span> produces a forecast <span class="math inline">\(\widehat{y}_{t+h}=f(\mathcal{Y}_t)\)</span> which can be evaluated ex post according to a loss function <span class="math inline">\(\ell(y_{t+h},\widehat{y}_{t+h})\)</span> which measures how close the forecast was to being correct. Since we don’t know the true value <span class="math inline">\(y_{t+h}\)</span> until it is observed, to choose a forecasting rule we must instead come up with a criterion which compares rules. Traditional econometric forecasting looks at measures of statistical <em>risk</em>,</p>
<p><span class="math display">\[E[\ell(y_{t+h},\widehat{y}_{t+h})]\]</span></p>
<p>where expectation is taken with respect to a (usually unknown) true probability distribution. Online learning, in contrast, aims to provide estimators which have low <em>regret</em> over sequences of outcomes <span class="math inline">\(\{y_{t+h}\}_{t=1}^{T}\)</span> relative to a comparison class <span class="math inline">\(\mathcal{F}\)</span> of possible rules,
<span class="math display">\[\text{Regret}(\{\widehat{y}_{t+h}\}_{t=1}^{T})=\sum_{t=1}^{T} \ell(y_{t+h},\widehat{y}_{t+h})-\underset{f\in\mathcal{F}}{\inf}\sum_{t=1}^{T}\ell(y_{t+h},f(\mathcal{Y}_{t}))\]</span></p>
<p>This criterion looks a little odd from the perspective of traditional forecasting rules: <a href="https://fxdiebold.blogspot.com/2017/02/predictive-loss-vs-predictive-regret.html">Diebold has expressed skepticism</a>. First, regret is a relative rather than absolute standard; to even be defined you need to take a stand on rules you might compare to, which can look somewhat arbitrary. If you choose a class of rules that predict poorly, a low regret procedure will not do well in an absolute sense. Second, there is no expectation or probability, just a sequence of outcomes. Diebold frames this as ex ante vs ex post, as the regret cannot be computed until <em>after</em> the data is observed, while risk can be computed without seeing the data. But this does not accord with how regret is applied in the theoretical literature. Risk can be computed only with respect to a probability measure, which has to come from somewhere. One can build a model and ask that this be the “true” probability measure describing the sequence generating the data, but this is also unknown. To get ex ante results for a procedure, one needs to take a stand on a model or class of models. Then one can evaluate results either <em>uniformly</em> over models in the class (this is the classic <a href="Forecasting/StatisticalApproach.html">statistical approach</a>, used implicitly in much of the forecasting literature, like Diebold’s <a href="https://www.sas.upenn.edu/~fdiebold/Teaching221/econ221Penn.html">textbook</a>) or <em>on average</em> over models, where the distribution over which one averages is called a <em>prior distribution</em> and leads to <a href="Forecasting/Bayes.html">Bayesian forecasting</a>. In the online learning context, in contrast, one usually seeks guarantees which apply for <em>any</em> sequence of outcomes, as opposed to over a distribution. So the results are still ex ante, with the difference being whether one needs to take a stance on a model class or a comparison class. There are reasons why one might prefer either approach. For one, <a href="https://itzhakgilboa.weebly.com/uploads/8/3/6/3/8363317/gilboa_notes_for_introduction_to_decision_theory.pdf">standard decision theory</a> requires use of probability in “rational” decision making. But the probabilistic framework is often extremely restrictive in terms of the guarantees it provides on the type of situations in which a procedure will perform well. In general, one must have a model which is correctly specified, or at least not too badly misspecified.</p>
<p>Especially for areas where the economics is still in dispute, one’s confidence that the models available to us encompass all likely outcomes maybe shouldn’t be so high. This is the content of the Queen’s question to which the title of the FHR paper refers: many or most economists before the financial crisis were using models which did not foresee a particularly high probability of such an event. For that reason, a procedure which allows us to perform reasonably over <em>any</em> sequence of events, not just those likely with respect to a particular model class, may be particularly desirable; a procedure with a low regret guarantee will do so, and be known to do so <em>ex ante</em>, as long as there is some comparator which performed well <em>ex post</em>. Ideally, we would like to remove that latter caveat, but as economists like to say, there is <a href="https://en.wikipedia.org/wiki/No_free_lunch_theorem">no free lunch</a>. One can instead do analysis based on risk without assuming one has an approximately correct model; this is the content of <a href="https://books.google.com/books?hl=en&amp;lr=&amp;id=EqgACAAAQBAJ&amp;oi=fnd&amp;pg=PR7&amp;dq=Vapnik+statistical+learning+theory&amp;ots=g3K-htaV29&amp;sig=5p6V7MW49xnzKUQGoAf7gRJZow0#v=onepage&amp;q=Vapnik%20statistical%20learning%20theory&amp;f=false">statistical learning theory</a>. But this usually involves introducing a comparison class of models <span class="math inline">\(\mathcal{F}\)</span> and studying a relative criterion, the <em>oracle risk</em> <span class="math inline">\(E[\ell(y_{t+h},\widehat{y}_{t+h})]-\underset{f\in\mathcal{F}}{\inf}E[\ell(y_{t+h},f(\mathcal{Y}_t))]\)</span>, or variants thereof.<a href="#fn2" class="footnote-ref" id="fnref2"><sup>2</sup></a> This requires both a comparison class and some restrictions on distributions to get uniformity; Vapnik considered the i.i.d. case, which is unsuitable for most time series forecasting applications; extensions need some version of stationarity and/or <a href="https://papers.nips.cc/paper/3489-rademacher-complexity-bounds-for-non-iid-processes">weak</a> <a href="https://projecteuclid.org/download/pdf_1/euclid.aop/1176988849">dependence</a>, or strong conditions on the class of nonstationary processes allowed, which can be problematic when one does not know what kind of distribution shifts are likely to occur.</p>
<p>This brings us to the content of the forecast procedures used: FHR apply classic Prediction with Expert Advice algorithms, like versions of Exponential Weights (closely related to the “Hedge” algorithm of <a href="http://rob.schapire.net/papers/FreundSc95.pdf">Freund and Schapire 1997</a>) and Online Gradient Descent (<a href="https://www.aaai.org/Papers/ICML/2003/ICML03-120.pdf">Zinkevich 2003</a>), which take a set of forecasts and form a convex combination of them with weights that update each round of predictions. Diebold <a href="https://fxdiebold.blogspot.com/2019/10/online-learning-vs-tvp-forecast.html">notes</a> that these are essentially versions of <a href="Forecasting/ModelCombination.html">model averaging procedures</a> which allow for time-varying weights, which are frequently studied by econometricians, complaining that “ML types seem to think they invented everything”. To this I have two responses. First, on a credit attribution level, the online learning perspective originates in studies of sequential decision theory and game theory from people like Wald and Blackwell, squarely in the economics tradition, and the techniques became ubiquitous in the ML field through <a href="http://www.ii.uni.wroc.pl/~lukstafi/pmwiki/uploads/AGT/Prediction_Learning_and_Games.pdf">“Prediction, Learning, and Games”</a>, by Cesa-Bianchi and Lugosi, the latter of whom is in an Economics department. So if one wants to claim credit for these ideas for economics, go right ahead. Second, there are noteworthy distinctions between these ideas and statistical approaches to forecast combination. In particular, the uniformity over sequences of the regret criterion ensures not only that changes over time are permitted, but that these changes can be completely arbitrary and do not have to accord with a particular model of the way in which the shift occurs. So while the approaches can be analyzed in terms of statistical properties, and may correspond to known algorithms for a particular model, the reason they are used is to ensure guarantees for arbitrary sequences, a property which is not shared by general statistical approaches. In fact, a classic result in online model combination (cf. Section 2.2 in <a href="https://www.cs.huji.ac.il/~shais/papers/OLsurvey.pdf">Shalev-Shwartz</a>) shows that some approaches with reasonable risk properties, like picking the forecast with the best performance up to the current period, can give unbounded regret for particularly bad sequences. The fact that a combination procedure adapts to these “poorly behaved” sequences is more important than the fact that it gives time-varying weights per se.</p>
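<p>For concreteness, here is a bare-bones Julia sketch of the exponential weights idea (the learning rate, squared loss, and toy data are my own illustrative choices, not anything from the FHR paper): each expert’s weight shrinks exponentially in its cumulative loss, and the realized regret against the best single expert in hindsight can be computed directly from the definition above.</p>
<pre><code class="language-julia"># Exponential weights ("Hedge"-style) forecast combination, plus realized regret
# against the best single expert in hindsight. Illustrative sketch only.
function exponential_weights(forecasts::Matrix{Float64}, y::Vector{Float64}; η = 1.0)
    T, K = size(forecasts)            # T rounds, K experts
    w = fill(1 / K, K)                # start from uniform weights
    combined = zeros(T)
    for t in 1:T
        combined[t] = sum(w .* forecasts[t, :])        # convex combination of expert forecasts
        losses = (forecasts[t, :] .- y[t]) .^ 2        # e.g. squared loss
        w .*= exp.(-η .* losses)                       # downweight experts that did badly
        w ./= sum(w)                                   # renormalize
    end
    cumloss = [sum((forecasts[:, j] .- y) .^ 2) for j in 1:K]
    regret = sum((combined .- y) .^ 2) - minimum(cumloss)   # regret vs best expert in hindsight
    return combined, regret
end

# Example: two experts, one mostly good and one mostly bad, on a short sequence.
y = [1.0, 0.0, 1.0, 1.0, 0.0]
forecasts = [0.9 0.1; 0.2 0.9; 0.8 0.3; 0.7 0.4; 0.1 0.8]
combined, regret = exponential_weights(forecasts, y; η = 2.0)
</code></pre>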
<p>For these reasons, I think Online Learning approaches at minimum deserve more use in economic forecasting and I am pleased to see the promising results of FHR, as well as the growing application of minimax regret criteria in other areas of economics like <a href="https://arxiv.org/abs/1909.06853">inference and policy</a> under partial identification and areas like <a href="http://yingniguo.com/wp-content/uploads/2019/09/slides-regulation.pdf">mechanism design</a> where providing a well-specified distribution over outcomes can be challenging.</p>
<p>There are still many issues that need more exploration, and there are important limitations. One thing existing online methods do not handle well is fully unbounded data; the worst case over all sequences leads to completely uninformative bounds, even for regret. For this reason, forecasting indicators is a good place to start. Whether it is even possible to extend these methods to data with unknown trends is still an open question, which may limit their suitability for many types of economic data. Tuning parameter selection is a topic of active research, with plenty of work on adapting these to the interval length and data features. Methods which perform well by regret criteria but also adapt to the case in which one does have stationary data and so could do well with a model-based algorithm are also a potentially promising direction. If one has real confidence in a model, it makes sense to rely on it, in which case statistical approaches are fine. But for many applications where the science is less settled and one might plausibly see data that doesn’t look like any model we have written down, we should keep this in our toolbox.</p>
<div class="footnotes">
<hr />
<ol>
<li id="fn1"><p>For a better overview of the field than I can provide, see the survey of <a href="https://www.cs.huji.ac.il/~shais/papers/OLsurvey.pdf">Shalev-Shwartz</a>, the monograph of <a href="https://ocobook.cs.princeton.edu/OCObook.pdf">Hazan</a>, or courses by <a href="http://www.mit.edu/~rakhlin/6.883/">Rakhlin</a> or <a href="http://www.mit.edu/~rakhlin/courses/stat928/">Rakhlin and Sridharan</a>.<a href="#fnref1" class="footnote-back">↩</a></p></li>
<li id="fn2"><p>Econometricians are used to thinking of this from the perspective of misspecification a la <a href="https://www.jstor.org/stable/1912526">Hal White</a> in which one compares to risk under the “pseudo-true” parameter value of the best predictor in a class. An alternative, popular in machine learning, is to use a data dependent comparator, the <em>empirical risk</em>, and prove bounds on the generalization gap. Here again we are effectively using the performance of a model or algorithm class for a relative measure.<a href="#fnref2" class="footnote-back">↩</a></p></li>
</ol>
</div>
</description>
</item>
</channel>
</rss>