\input texinfo @c -*-texinfo-*-
@c %**start of header
@setfilename polipo.info
@settitle The Polipo Manual
@afourpaper
@c %**end of header
@dircategory Network Applications
@direntry
* Polipo: (polipo). The Polipo caching web proxy.
@end direntry
@copying
Copyright @copyright{} 2003 -- 2006 by Juliusz Chroboczek.
@end copying
@titlepage
@title The Polipo Manual
@author Juliusz Chroboczek
@page
@vskip 0pt plus 1fill
Polipo is a caching web proxy designed to be used as a personal
cache or a cache shared among a few users.
@vskip 0pt plus 1fill
@insertcopying
@end titlepage
@contents
@ifnottex
@node Top, Background, (dir), (dir)
@top Polipo
Polipo is a caching web proxy designed to be used as a personal
cache or a cache shared among a few users.
@ifhtml
The latest version of Polipo can be found on
@uref{http://www.pps.jussieu.fr/~jch/software/polipo/,the Polipo web page}.
@end ifhtml
This manual was written by
@uref{http://www.pps.jussieu.fr/~jch/,,Juliusz Chroboczek}.
@end ifnottex
@menu
* Background:: Background information.
* Running:: Running Polipo.
* Network:: Polipo and the network.
* Caching:: Caching.
* Memory usage:: Limiting Polipo's memory usage.
* Copying:: Your rights and mine.
* Variable index:: Variable index.
* Concept index:: Concept index.
@end menu
@node Background, Running, Top, Top
@chapter Background
@menu
* The web:: The web and HTTP.
* Proxies and caches:: Proxies and caches.
* Latency and throughput:: Optimise latency, not throughput.
* Network traffic:: Be nice to the net.
* Partial instances:: Don't discard data.
* POST and PUT:: Other requests.
* Other HTTP proxies:: Why did I write Polipo from scratch?
@end menu
@node The web, Proxies and caches, Background, Background
@section The web and HTTP
@cindex URL
@cindex resource
@cindex instance
@cindex entity
@cindex HTTP
The web is a wide-scale decentralised distributed hypertext system,
something that's obviously impossible to achieve reliably.
The web is a collection of @dfn{resources} which are identified by
@dfn{URLs}, strings starting with @code{http://}. At any point in
time, a resource has a certain value, which is called an
@dfn{instance} of the resource.
The fundamental protocol of the web is HTTP, a simple request/response
protocol. With HTTP, a client can make a request for a resource to a
server, and the server replies with an @dfn{entity}, which is an
on-the-wire representation of an instance or of a fragment thereof.
@node Proxies and caches, Latency and throughput, The web, Background
@section Proxies and caches
@cindex proxy
@cindex caching
A proxy is a program that acts as both a client and a server. It
listens for client requests and forwards them to servers, and forwards
the servers' replies to clients.
An HTTP proxy can optimise web traffic away by @dfn{caching} server
replies, storing them in memory in case they are needed again. If a
reply has been cached, a later client request may, under some
conditions, be satisfied without going to the source again.
In addition to taking the shortcuts made possible by caching, proxies
can improve performance by generating better network traffic than the
client applications would do.
Proxies are also useful in ways unrelated to raw performance. A proxy
can be used to contact servers that are not visible to the browser,
for example because there is a firewall in the way (@pxref{Parent
proxies}), or because the client and the server use different lower
layer protocols (for example IPv4 and IPv6). Another common
application of proxies is to modify the data sent to servers and
returned to clients, for example by censoring headers that expose too
much about the client's identity (@pxref{Censoring headers}) or
removing advertisements from the data returned by the server
(@pxref{Forbidden}).
Polipo is a caching HTTP proxy that was originally designed as
a @dfn{personal} proxy, i.e.@: a proxy that is used by a single user
or a small group of users.
@node Latency and throughput, Network traffic, Proxies and caches, Background
@section Latency and throughput
@cindex throughput
@cindex latency
Most network benchmarks consider @dfn{throughput}, or the average
amount of data being pushed around per unit of time. While important
for batch applications (for example benchmarks), average throughput is
mostly irrelevant when it comes to interactive web usage. What is more
important is a transaction's median @dfn{latency}, or whether the data
starts to trickle down before the user gets annoyed.
Typical web caches optimise for throughput --- for example, by
consulting sibling caches before accessing a remote resource. By
doing so, they significantly add to the median latency, and therefore
to the average user frustration.
Polipo was designed to minimise latency.
@node Network traffic, Partial instances, Latency and throughput, Background
@section Network traffic
The web was developed by people who were interested in text processing
rather than in networking and, unsurprisingly enough, the first
versions of the HTTP protocol did not make very good use of network
resources. The main problem in HTTP/0.9 and early versions of
HTTP/1.0 was that a separate TCP connection (``virtual circuit'' for
them telecom people) was created for every entity transferred.
Opening multiple TCP connections has significant performance
implications. Obviously, connection setup and teardown require
additional packet exchanges which increase network usage and, more
importantly, latency.
Less obviously, TCP is not optimised for that sort of usage. TCP aims
to avoid network @dfn{congestion}, a situation in which the network
becomes unusable due to overly aggressive traffic patterns. A correct
TCP implementation will very carefully probe the network at the
beginning of every connection, which means that a TCP connection is
very slow during the first couple of kilobytes transferred, and only
gets up to speed later. Because most HTTP entities are small (in the
1 to 10 kilobytes range), HTTP/0.9 uses TCP where it is most inefficient.
@menu
* Persistent connections:: Don't shut connections down.
* Pipelining:: Send a bunch of requests at once.
* Poor Mans Multiplexing:: Split requests.
@end menu
@node Persistent connections, Pipelining, Network traffic, Network traffic
@subsection Persistent connections
@cindex persistent connection
@cindex keep-alive connection
Later HTTP versions allow the transfer of multiple entities on a
single connection. A connection that carries multiple entities is
said to be @dfn{persistent} (or sometimes @dfn{keep-alive}).
Unfortunately, persistent connections are an optional feature of HTTP,
even in version 1.1.
Polipo will attempt to use persistent connections on the server side,
and will honour persistent connection requests from clients.
@node Pipelining, Poor Mans Multiplexing, Persistent connections, Network traffic
@subsection Pipelining
@cindex Pipelining
With persistent connections it becomes possible to @dfn{pipeline} or
@dfn{stream} requests, i.e.@: to send multiple requests on a single
connection without waiting for the replies to come back. Because this
technique gets the requests to the server faster, it reduces latency.
Additionally, because multiple requests can often be sent in a single
packet, pipelining reduces network traffic.
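
As a rough illustration (the host name and paths are placeholders, not
taken from a real exchange), a client that pipelines two requests
writes both of them to the connection before reading either reply:

@example
GET /index.html HTTP/1.1
Host: www.example.com

GET /logo.png HTTP/1.1
Host: www.example.com

@end example

The server then sends its two replies back to back, in the same order
as the requests.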
Pipelining is a fairly common technique@footnote{The X11 protocol
fundamentally relies on pipelining. NNTP does support pipelining.
SMTP doesn't, while ESMTP makes it an option. FTP does support
pipelining on the control connection.}, but it is not supported by
HTTP/1.0. HTTP/1.1 makes pipelining support compulsory in every
server implementation that can use persistent connections, but there
are a number of buggy servers that claim to implement HTTP/1.1 but
don't support pipelining.
Polipo carefully probes for pipelining support in a server and uses
pipelining if it believes that it is reliable. Polipo also deeply
enjoys being pipelined at by a client@footnote{Other client-side
implementations of HTTP that make use of pipelining include
@uref{http://www.opera.com/,,Opera}, recent versions of
@uref{http://www.mozilla.org,,Mozilla}, APT (the package downloader
used by @uref{http://www.debian.org,,Debian} GNU/Linux) and LFTP.}.
@node Poor Mans Multiplexing, , Pipelining, Network traffic
@subsection Poor Man's Multiplexing
@cindex Poor Man's Multiplexing
@cindex multiplexing
A major weakness of the HTTP protocol is its inability to share a
single connection between multiple simultaneous transactions --- to
@dfn{multiplex} a number of transactions over a single connection. In
HTTP, a client can either request all instances sequentially, which
significantly increases latency, or else open multiple concurrent
connections, with all the problems that this implies
(@pxref{Persistent connections}).
Poor Man's Multiplexing (PMM) is a technique that simulates
multiplexing by requesting an instance in multiple segments; because
the segments are fetched in independent transactions, they can be
interleaved with requests for other resources.
Obviously, PMM only makes sense in the presence of persistent
connections; additionally, it is only effective in the presence of
pipelining (@pxref{Pipelining}).
PMM poses a number of reliability issues. If the resource being
fetched is dynamic, it is quite possible that it will change between
segments; thus, an implementation making use of PMM needs to be able
to switch to full-resource retrieval when it detects a dynamic
resource.
Polipo supports PMM, but it is disabled by default (@pxref{PMM}).
@node Partial instances, POST and PUT, Network traffic, Background
@section Caching partial instances
@cindex partial instance
@cindex range request
A partial instance is an instance that is being cached but only part
of which is available in the local cache. There are three ways in
which partial instances can arise: client applications requesting only
part of an instance (Adobe's Acrobat Reader plugin is famous for
that), a server dropping a connection mid-transfer (because it is
short on resources, or, surprisingly often, because it is buggy), or a
client dropping a connection (usually because the user pressed
@emph{stop}).
When an instance is requested that is only partially cached, it is
possible to request just the missing data by using a feature of HTTP
known as a @dfn{range} request. While support for range requests is
optional, most servers honour them in case of static data (data that
are stored on disk, rather than being generated on the fly, e.g.@: by a
CGI script).
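
The following sketch shows the general HTTP mechanism rather than the
exact headers that Polipo sends; the host, path and sizes are made up.
A cache that already holds the first 5000 bytes of a 20000-byte
instance might ask for the remainder like this:

@example
GET /manual.pdf HTTP/1.1
Host: www.example.com
Range: bytes=5000-
@end example

A server that honours the range replies with a partial entity:

@example
HTTP/1.1 206 Partial Content
Content-Range: bytes 5000-19999/20000
Content-Length: 15000
@end example

A server that does not support range requests ignores the
@samp{Range} header and replies with status 200 and the whole
instance.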
Caching partial instances has a number of positive effects. Obviously,
it reduces the amount of data transmitted as the available data
needn't be fetched again. Because it prevents partial data from being
discarded, it makes it reasonable for a proxy to unconditionally abort
a download when requested by the user, and therefore reduces network
traffic.
Polipo caches arbitrary partial instances in its in-memory cache. It
will only store the initial segment of a partial instance (from its
beginning up to its first hole) in its on-disk cache, though. In
either case, it will attempt to use range requests to fetch the
missing data.
@node POST and PUT, Other HTTP proxies, Partial instances, Background
@section Other requests
@cindex GET request
@cindex HEAD request
@cindex PUT request
@cindex POST request
@cindex OPTIONS request
@cindex PROPFIND request
The previous sections pretend that there is only one kind of request
in HTTP --- the @samp{GET} request. In fact, there are some others.
The @samp{HEAD} request method retrieves data about a resource. Polipo
does not normally use @samp{HEAD}, but will fall back to using it for
validation if it finds that a given server fails to cooperate with its
standard validation methods (@pxref{Cache transparency}). Polipo will
correctly reply to a client's @samp{HEAD} request.
The @samp{POST} method is used to request that the server should do
something rather than merely sending an entity; it is usually used
with HTML forms that have an effect@footnote{HTML forms should use the
@samp{GET} method when the form has no side-effect as this makes the
results cacheable.}. The @samp{PUT} method is used to replace a
resource with a different instance; it is typically used by web
publishing applications.
@samp{POST} and @samp{PUT} requests are handled by Polipo pretty much
like @samp{GET} and @samp{HEAD}; however, for various reasons, some
precautions must be taken. In particular, any cached data for the
resource they refer to must be discarded, and they can never be
pipelined.
Finally, HTTP/1.1 includes a convenient backdoor with the
@samp{CONNECT} method. For more information, please see
@ref{Tunnelling connections}.
Polipo does not currently handle the more exotic methods such as
@samp{OPTIONS} and @samp{PROPFIND}.
@node Other HTTP proxies, , POST and PUT, Background
@section Other HTTP proxies
@cindex proxy
I started writing Polipo because the weather was bad. But also
because I wanted to implement some features that other web proxies
don't have.
@menu
* Harvest and Squid:: Historic proxies.
* Apache:: The web server has a proxy.
* WWWOFFLE:: A personal proxy.
* Junkbuster:: Get rid of ads.
* Privoxy:: Junkbuster on speed.
* Oops:: A multithreaded cache.
@end menu
@node Harvest and Squid, Apache, Other HTTP proxies, Other HTTP proxies
@subsection Harvest and Squid
@cindex Harvest
@cindex Squid
Harvest, the grandfather of all web caches, has since evolved into
@uref{http://www.squid-cache.org/,,Squid}.
Squid sports an elegant single-threaded non-blocking architecture and
multiplexes multiple clients in a single process. It also features
almost complete support for HTTP/1.1, although for some reason it
doesn't currently advertise it.
Squid is designed as a large-scale shared proxy running on a dedicated
machine, and therefore carries certain design decisions which make it
difficult to use as a personal proxy. Because Squid keeps all
resource meta-data in memory, it requires a fair amount of RAM in
order to manipulate a reasonably sized cache.
Squid doesn't cache partial instances, and has trouble with instances
larger than available memory@footnote{Recent versions of Squid support
instances larger than available memory by using a hack that the
authors call a ``sliding window algorithm''.}. If a client connection
is interrupted, Squid has to decide whether to continue fetching the
resource (and possibly waste bandwidth) or discard what it already has
(and possibly waste bandwidth).
Some versions of Squid would, under some circumstances, pipeline up to
two outgoing requests on a single connection. At the time of writing,
this feature appears to have been disabled in the latest version.
Squid's developers have decided to re-write it in C++.
@node Apache, WWWOFFLE, Harvest and Squid, Other HTTP proxies
@subsection The Apache proxy
@cindex Apache
The @uref{http://www.apache.org/,,Apache web server} includes a
complete HTTP/1.1 proxy.
The Apache web server was designed to maximise ease of programming ---
a decision which makes Apache immensely popular for deploying
web-based applications. Of course, this ease of programming comes at
a cost, and Apache is not the most lightweight proxy available.
As cheaper caching proxies are available, Apache is not useful as a
standalone proxy. The main application of Apache's proxy is to join
multiple web servers' trees into a single hierarchy.
The Apache proxy doesn't cache partial instances and doesn't pipeline
multiple outgoing requests.
@node WWWOFFLE, Junkbuster, Apache, Other HTTP proxies
@subsection WWWOFFLE
@cindex WWWOFFLE
@uref{http://www.gedanken.demon.co.uk/wwwoffle/,,WWWOFFLE}, an elegant
personal proxy, is the primary model for Polipo.
WWWOFFLE has more features than can be described here. It will censor
banner ads, clean your HTML, decorate it with random colours, and
schedule fetches for off-peak hours.
Unfortunately, the HTTP traffic that WWWOFFLE generates is disgusting.
It will open a connection for every fetch, and forces the client to do
the same.
WWWOFFLE only caches complete instances.
I used WWWOFFLE for many years, and frustration with WWWOFFLE's
limitations is the main reason why I started Polipo in the first
place.
@node Junkbuster, Privoxy, WWWOFFLE, Other HTTP proxies
@subsection Junkbuster
@cindex Junkbuster
@uref{http://internet.junkbuster.com/,,Junkbuster} is a simple
non-caching web proxy designed to remove banner ads and cookies. It
was the main model for WWWOFFLE's (and therefore Polipo's) header and
ad-removing features.
Junkbuster's HTTP support is very simple (some would say broken): it
doesn't do persistent connections, and it breaks horribly if the
client tries pipelining. Junkbuster is no longer being maintained,
and has evolved into Privoxy.
@node Privoxy, Oops, Junkbuster, Other HTTP proxies
@subsection Privoxy
@cindex Privoxy
@uref{http://www.privoxy.org/,,Privoxy} is the current incarnation of
Junkbuster. Privoxy has the ability to randomly modify web pages
before sending them to the browser --- for example, remove
@samp{<blink>} or @samp{<img>} tags.
Just like its parent, Privoxy cannot do persistent connections. Under
some circumstances, it will also buffer whole pages before sending
them to the client, which significantly adds to its latency. However,
this is difficult to avoid given the kinds of rewriting it attempts to
perform.
@node Oops, , Privoxy, Other HTTP proxies
@subsection Oops
@cindex Oops
@uref{http://zipper.paco.net/~igor/oops.eng/,,Oops} is a caching web
proxy that uses one thread (lightweight process) for every connection.
This technique does cost additional memory, but allows good
concurrency of requests while avoiding the need for complex
non-blocking programming. Oops was apparently designed as a
wide-scale shared proxy.
Although Oops' programming model makes it easy to implement persistent
connections, Oops insists on opening a separate connection to the
server for every single resource fetch, which disqualifies it from
production usage.
@node Running, Network, Background, Top
@chapter Running Polipo
@menu
* Polipo Invocation:: Starting Polipo.
* Browser configuration:: Configuring your browser.
* Stopping:: Stopping and refreshing Polipo.
* Local server:: The local web server and web interface.
@end menu
@node Polipo Invocation, Browser configuration, Running, Running
@section Starting Polipo
@cindex invocation
By default, Polipo runs as a normal foreground job in a terminal in
which it can log random ``How do you do?'' messages. With the right
configuration options, Polipo can run as a daemon.
Polipo is run with the following command line:
@example
$ polipo [ -h ] [ -v ] [ -x ] [ -c @var{config} ] [ @var{var}=@var{val}... ]
@end example
All flags are optional. The flag @option{-h} causes Polipo to print a
short help message and to quit. The flag @option{-v} causes Polipo to
list all of its configuration variables and quit. The flag
@option{-x} causes Polipo to purge its on-disk cache and then quit
(@pxref{Purging}). The flag @option{-c} specifies the configuration
file to use (by default @file{~/.polipo} or
@file{/etc/polipo/config}). Finally, Polipo's configuration can be
changed on the command line by assigning values to given configuration
variables.
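
As a sketch, assuming a configuration file at the hypothetical path
@file{/etc/polipo/config.local}, the first command below starts Polipo
with that file and overrides one variable on the command line; the
second lists all configuration variables and quits:

@example
$ polipo -c /etc/polipo/config.local daemonise=false
$ polipo -v
@end example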
@menu
* Configuring Polipo:: Plenty of options.
* Daemon:: Running in the background.
* Logging:: Funnelling status messages.
@end menu
@node Configuring Polipo, Daemon, Polipo Invocation, Polipo Invocation
@subsection Configuration
@cindex runtime configuration
@cindex variable
@cindex configuration variable
@cindex configuration file
There are a number of variables that you can tweak in order to
configure Polipo, and they should all be described in this manual
(@pxref{Variable index}). You can display the complete, most
up-to-date list of configuration variables by using the @option{-v}
command line flag or by accessing the ``current configuration'' page
of Polipo's web interface (@pxref{Web interface}). Configuration
variables can be set either on the command line or else in the
configuration file given by the @option{-c} command-line flag.
Configuration variables are typed, and @option{-v} will display their
types. The type can be one of the following:
@itemize @bullet
@item
@samp{integer} or @samp{float}: a numeric value;
@item
@samp{boolean}: a truth value, one of @samp{true} or @samp{false};
@item
@samp{tristate}: one of @samp{false}, @samp{maybe} or @samp{true};
@item
@samp{4-state}: one of @samp{false}, @samp{reluctantly},
@samp{happily} or @samp{true};
@item
@samp{5-state}: one of @samp{false}, @samp{reluctantly}, @samp{maybe},
@samp{happily} or @samp{true};
@item
@samp{atom}: a string written within double quotes (@samp{"});
@item
@samp{list}: a comma-separated list of strings;
@item
@samp{intlist}: a comma-separated list of integers and ranges of
integers (of the form `@var{n}--@var{m}').
@end itemize
The configuration file has a very simple syntax. All blank lines are
ignored, as are lines starting with a hash sign @samp{#}. Other lines
must be of the form
@example
@var{var} = @var{val}
@end example
where @var{var} is a variable to set and @var{val} is the value to set
it to.
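
For instance, a small configuration file might look as follows. The
values are illustrative only, and the log file path is a made-up
example:

@example
# Lines starting with a hash sign are ignored.
daemonise = false
logFile = "/tmp/polipo.log"
logFilePermissions = 0640
@end example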
It is possible to change the configuration of a running Polipo by
using the local configuration interface (@pxref{Web interface}).
@node Daemon, Logging, Configuring Polipo, Polipo Invocation
@subsection Running as a daemon
@cindex daemon
@cindex terminal
@cindex pid
@vindex daemonise
@vindex pidFile
If the configuration variable @code{daemonise} is set to true, Polipo
will run as a daemon: it will fork and detach from its controlling
terminal (if any). The variable @code{daemonise} defaults to false.
When Polipo is run as a daemon, it can be useful to get it to
atomically write its @emph{pid} to a file. If the variable
@code{pidFile} is defined, it should be the name of a file where
Polipo will write its @emph{pid}. If the file already exists when it
is started, Polipo will refuse to run.
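
A minimal sketch of a daemon setup, assuming that the hypothetical
path @file{/var/run/polipo.pid} is writable by the user running
Polipo:

@example
daemonise = true
pidFile = "/var/run/polipo.pid"
@end example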
@node Logging, , Daemon, Polipo Invocation
@subsection Logging
@cindex logging
@vindex logLevel
@vindex logFile
@vindex logFilePermissions
@vindex logSyslog
@vindex logFacility
@vindex scrubLogs
When it encounters a difficulty, Polipo will print a friendly message.
The location where these messages go is controlled by the
configuration variables @code{logFile} and @code{logSyslog}.
If @code{logSyslog} is @code{true}, error messages go to the system log
facility given by @code{logFacility}. If @code{logFile} is set, it is
the name of a file where all output will accumulate. If @code{logSyslog}
is @code{false} and @code{logFile} is empty, messages go to the error
output of the process (normally the terminal).
The variable @code{logFile} defaults to empty if @code{daemonise} is
false, and to @samp{/var/log/polipo} otherwise. The variable
@code{logSyslog} defaults to @code{false}, and @code{logFacility}
defaults to @samp{user}.
If @code{logFile} is set, then the variable @code{logFilePermissions}
controls the Unix permissions with which the log file will be created if
it doesn't exist. It defaults to 0640.
The amount of logging is controlled by the variable @code{logLevel}.
Please see the file @samp{log.h} in the Polipo sources for the
possible values of @code{logLevel}.
Keeping extensive logs on your users' browsing habits is probably
a severe violation of their privacy. If the variable @code{scrubLogs}
is set, then Polipo will scrub most, if not all, private information
from its logs.
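
As an illustration, the following fragment sends messages to the
system log instead of a file and scrubs private information. It
assumes that @code{scrubLogs} takes a boolean value, and it spells out
the default facility for clarity:

@example
logSyslog = true
logFacility = "user"
scrubLogs = true
@end example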
@node Browser configuration, Stopping, Polipo Invocation, Running
@section Configuring your browser
@cindex browser configuration
@cindex user-agent configuration
Telling your user-agent (web browser) to use Polipo is an operation
that depends on the browser. Many user-agents will transparently use
Polipo if the environment variable @samp{http_proxy} points at it;
e.g.@:
@example
$ export http_proxy=http://localhost:8123/
@end example
Netscape Navigator, Mozilla, Mozilla Firefox, KDE's Konqueror and
probably other browsers require that you configure them manually
through their @emph{Preferences} or @emph{Configure} menu.
If your user-agent sports such options, tell it to use persistent
connections when speaking to proxies, to speak HTTP/1.1 and to use
HTTP/1.1 pipelining.
@node Stopping, Local server, Browser configuration, Running
@section Stopping Polipo and getting it to reload
@cindex signals
@cindex shutting down
@cindex stopping
Polipo will shut down cleanly if it receives @code{SIGHUP},
@code{SIGTERM} or @code{SIGINT} signals; this will normally happen
when a Polipo in the foreground receives a @code{^C} key press, when
your system shuts down, or when you use the @code{kill} command with
no flags. Polipo will then write out all its in-memory data to disk
and quit.
If Polipo receives the @code{SIGUSR1} signal, it will write out all
the in-memory data to disk (but won't discard them), reopen the log
file, and then reload the forbidden URLs file (@pxref{Forbidden}).
Finally, if Polipo receives the @code{SIGUSR2} signal, it will write
out all the in-memory data to disk and discard as much of the memory
cache as possible. It will then reopen the log file and reload the
forbidden URLs file.
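
For example, assuming that Polipo was started with @code{pidFile} set
to the hypothetical path @file{/var/run/polipo.pid}, the three
behaviours can be triggered from a shell. The first command writes out
the in-memory data and reloads, the second also discards as much of
the memory cache as possible, and the third sends @code{SIGTERM} to
shut Polipo down cleanly:

@example
$ kill -USR1 $(cat /var/run/polipo.pid)
$ kill -USR2 $(cat /var/run/polipo.pid)
$ kill $(cat /var/run/polipo.pid)
@end example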
@node Local server, , Stopping, Running
@section The local web server
@vindex localDocumentRoot
@vindex disableProxy
@cindex web server
@cindex local server
Polipo includes a local web server, which is accessible on the same
port as the one the proxy listens to. Therefore, by default you can
access Polipo's local web server as @samp{http://localhost:8123/}.
The data for the local web server can be configured by setting
@code{localDocumentRoot}, which defaults to
@file{/usr/share/polipo/www/}. Setting this variable to @samp{""}
will disable the local server.
Polipo assumes that the local web tree doesn't change behind its back.
If you change any of the local files, you will need to notify Polipo
by sending it a @code{SIGUSR2} signal (@pxref{Stopping}).
If you use Polipo as a publicly accessible web server, you might want
to set the variable @code{disableProxy}, which will prevent it from
acting as a web proxy. (You will also want to set
@code{disableLocalInterface} (@pxref{Web interface}), and perhaps run
Polipo in a @emph{chroot} jail.)
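
A configuration along those lines might look like the sketch below.
The document root is a made-up path, and the fragment assumes that
both @code{disable} variables take boolean values:

@example
localDocumentRoot = "/var/www/"
disableProxy = true
disableLocalInterface = true
@end example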
@menu
* Web interface:: The web interface.
@end menu
@node Web interface, , Local server, Local server
@subsection The web interface
@cindex runtime configuration
@cindex web interface
@vindex disableLocalInterface
@vindex disableConfiguration
@vindex disableServersList
The subtree of the local web space rooted at
@samp{http://localhost:8123/polipo/} is treated specially: URLs under
this root do not correspond to on-disk files, but are generated by
Polipo on-the-fly. We call this subtree Polipo's @dfn{local web
interface}.
The page @samp{http://localhost:8123/polipo/config?} contains the
values of all configuration variables, and allows setting most of them.
The page @samp{http://localhost:8123/polipo/status?} provides a summary
status report about the running Polipo, and allows performing a number
of actions on the proxy, notably flushing the in-memory cache.
The page @samp{http://localhost:8123/polipo/servers?} contains the list
of known servers, and the statistics maintained about them
(@pxref{Server statistics}).
The pages starting with @samp{http://localhost:8123/polipo/index?}
contain indices of the disk cache. For example, the following page
contains the index of the cached pages from the server of some random
company:
@example
http://localhost:8123/polipo/index?http://www.microsoft.com/
@end example
The pages starting with
@samp{http://localhost:8123/polipo/recursive-index?} contain recursive
indices of various servers. This functionality is disabled by
default, and can be enabled by setting the variable
@code{disableIndexing} to @code{false}.
If you have multiple users, you will probably want to disable the
local interface by setting the variable @code{disableLocalInterface}.
You may also selectively control setting of variables, indexing and
listing known servers by setting the variables
@code{disableConfiguration}, @code{disableIndexing} and
@code{disableServersList}.
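For instance, a shared installation that keeps the status page but
forbids runtime reconfiguration, indexing and the servers list might
use something like the following. This is only a sketch; which
features to leave enabled is a matter of local policy.
@example
disableConfiguration = true
disableIndexing = true
disableServersList = true
@end example
To switch off the whole interface instead, set
@code{disableLocalInterface} to @code{true}.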
@node Network, Caching, Running, Top
@chapter Polipo and the network
@menu
* Client connections:: Speaking to clients.
* Contacting servers:: Contacting servers.
* HTTP tuning:: Tuning at the HTTP level.
* Offline browsing:: Browsing with poor connectivity.
* Server statistics:: Polipo keeps statistics about servers.
* Server-side behaviour:: Tuning the server-side behaviour.
* PMM:: Poor Man's Multiplexing.
* Forbidden:: You can forbid some URLs.
* DNS:: How Polipo finds hosts.
* Parent proxies:: Fetching data from other proxies.
* Tuning POST and PUT:: Tuning POST and PUT requests.
* Tunnelling connections:: Tunnelling foreign protocols and https.
@end menu
@node Client connections, Contacting servers, Network, Network
@section Client connections
@vindex proxyAddress
@vindex proxyPort
@vindex proxyName
@vindex displayName
@cindex address
@cindex port
@cindex IPv6
@cindex proxy loop
@cindex loop
@cindex proxy name
@cindex via
@cindex loopback address
@cindex security
There are three fundamental values that control how Polipo speaks to
clients. The variable @code{proxyAddress} defines the IP address on
which Polipo will listen; by default, its value is the @dfn{loopback
address} @code{"127.0.0.1"}, meaning that Polipo will listen on the
IPv4 loopback interface (the local host) only. By setting this
variable to a global IP address or to one of the special values
@code{"::"} or @code{"0.0.0.0"}, it is possible to allow Polipo to
serve remote clients. This is likely to be a security hole unless you
set @code{allowedClients} to a reasonable value (@pxref{Access control}).
Note that the type of address that you specify for @code{proxyAddress}
will determine whether Polipo listens to IPv4 or IPv6. Currently, the
only way to have Polipo listen to both protocols is to specify the
IPv6 unspecified address (@code{"::"}) for @code{proxyAddress}.
The variable @code{proxyPort}, by default 8123, defines the TCP port
on which Polipo will listen.
The variable @code{proxyName}, which defaults to the host name of the
machine on which Polipo is running, defines the @dfn{name} of the
proxy. This can be an arbitrary string that should be unique among
all instances of Polipo that you are running. Polipo uses it in error
messages and optionally for detecting proxy loops (by using the
@samp{Via} HTTP header, @pxref{Censoring headers}). Finally, the
@code{displayName} variable specifies the name used in user-visible
error messages (default ``Polipo'').
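As an illustration, a configuration that listens on both IPv4 and
IPv6 and serves a local network might look like the sketch below. The
network prefix and the proxy name are made-up values; substitute your
own.
@example
# Listen on all interfaces, IPv4 and IPv6.
proxyAddress = "::"
# Restrict who may connect (hypothetical local network).
allowedClients = 127.0.0.1, ::1, 192.168.1.0/24
proxyPort = 8123
proxyName = "proxy.example.org"
@end example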
@menu
* Access control:: Deciding who can connect.
@end menu
@node Access control, , Client connections, Client connections
@subsection Access control
@vindex proxyAddress
@vindex authCredentials
@vindex authRealm
@vindex allowedClients
@cindex access control
@cindex authentication
@cindex loopback address
@cindex security
@cindex username
@cindex password
By making it possible to have Polipo listen on a non-routable address
(for example the loopback address @samp{127.0.0.1}), the variable
@code{proxyAddress} provides a very crude form of @dfn{access
control}: the ability to decide which hosts are allowed to connect.
A finer form of access control can be implemented by specifying
explicitly a number of client addresses or ranges of addresses
(networks) that a client is allowed to connect from. This is done
by setting the variable @code{allowedClients}.
Every entry in @code{allowedClients} can be an IP address, for example
@samp{134.157.168.57} or @samp{::1}. It can also be a network
address, i.e.@: an IP address and the number of bits in the network
prefix, for example @samp{134.157.168.0/24} or
@samp{2001:660:116::/48}. Typical uses of the @code{allowedClients}
variable include
@example
allowedClients = 127.0.0.1, ::1, 134.157.168.0/24, 2001:660:116::/48
@end example
or, for an IPv4-only version of Polipo,
@example
allowedClients = 127.0.0.1, 134.157.168.0/24
@end example
A different form of access control can be implemented by requiring
each client to @dfn{authenticate}, i.e.@: to prove its identity before
connecting. Polipo currently only implements the most insecure form
of authentication, @dfn{HTTP basic authentication}, which sends
usernames and passwords in the clear over the network. HTTP basic
authentication is required when the variable @code{authCredentials} is
not null; its value should be of the form @samp{username:password}.
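A minimal sketch of such a setting, with a placeholder username and
password that you should of course replace with your own:
@example
authCredentials = "username:password"
@end example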
Note that both IP-based authentication and HTTP basic authentication
are insecure: the former is vulnerable to IP address spoofing, the
latter to replay attacks. If you need to access Polipo over the
public Internet, the only secure option is to have it listen over the
loopback interface only and use an ssh tunnel (@pxref{Parent
proxies})@footnote{It is not quite clear to me whether HTTP digest
authentication is worth implementing. On the one hand, if implemented
correctly, it appears to provide secure authentication; on the other
hand, and unlike ssh or SSL, it doesn't make any attempt at ensuring
privacy, and its optional integrity guarantees are impossible to
implement without significantly impairing latency.}.
@node Contacting servers, HTTP tuning, Client connections, Network
@section Contacting servers
@cindex multiple addresses
@cindex IPv6
@vindex useTemporarySourceAddress
A server can have multiple addresses, for example if it is
@dfn{multihomed} (connected to multiple networks) or if it can speak
both IPv4 and IPv6. Polipo will try all of a host's addresses in turn;
once it has found one that works, it will stick to that address until
it fails again.
When connecting over IPv6, Polipo can use temporary source addresses
to increase privacy (RFC@tie{}3041). The variable
@code{useTemporarySourceAddress} controls the use of temporary
addresses for outgoing connections: if set to @code{true}, temporary
addresses are preferred; if set to @code{false}, static addresses are
used; and if set to @code{maybe} (the default), the operating system
default is in effect. This setting is not available on all operating
systems.
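For example, to prefer temporary addresses on a system that supports
them, the configuration file might contain the following line (a
sketch; on systems without this feature the setting may have no
effect):
@example
useTemporarySourceAddress = true
@end example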
@menu
* Allowed ports:: Where the proxy is allowed to connect.
@end menu
@node Allowed ports, , Contacting servers, Contacting servers
@subsection Allowed ports
@cindex Allowed ports
@cindex Forbidden ports
@cindex ports
@vindex allowedPorts
A TCP service is identified not only by the IP address of the machine
it is running on, but also by a small integer, the TCP @dfn{port} it
is @dfn{listening} on. Normally, web servers listen on port 80, but
it is not uncommon to have them listen on different ports; Polipo's
internal web server, for example, listens on port 8123 by default.
The variable @code{allowedPorts} contains the list of ports that
Polipo is willing to connect to on behalf of clients; it defaults to
@samp{80-100, 1024-65535}. Set this variable to @samp{1-65535} if your
clients (and the web pages they consult!) are fully trusted. (The
variable @code{allowedPorts} is not considered for tunnelled
connections; @pxref{Tunnelling connections}).
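As a sketch, a more restrictive setting that only permits the
standard web port and one alternative port (the port numbers here are
examples only) could be written as
@example
allowedPorts = 80, 8080
@end example
while a proxy serving only fully trusted clients could use
@example
allowedPorts = 1-65535
@end example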
@node HTTP tuning, Offline browsing, Contacting servers, Network
@section Tuning at the HTTP level
@cindex HTTP
@cindex headers
@menu
* Tuning the HTTP parser:: Tuning parsing of HTTP headers.
* Censoring headers:: Censoring HTTP headers.
* Intermediate proxies:: Adjusting intermediate proxy behaviour.
@end menu
@node Tuning the HTTP parser, Censoring headers, HTTP tuning, HTTP tuning
@subsection Tuning the HTTP parser
@vindex laxHttpParser
@vindex bigBufferSize
As a number of HTTP servers and CGI scripts serve incorrect HTTP
headers, Polipo uses a @emph{lax} parser, meaning that incorrect HTTP
headers will be ignored (a warning will be logged by default). If the
variable @code{laxHttpParser} is not set (it is set by default),
Polipo will use a @emph{strict} parser, and refuse to serve an
instance unless it could parse all the headers.
When the amount of headers exceeds one chunk's worth (@pxref{Chunk
memory}), Polipo will allocate a @dfn{big buffer} in order to store
the headers. The size of big buffers, and therefore the maximum
amount of headers Polipo can parse, is specified by the variable
@code{bigBufferSize} (32@dmn{kB} by default).
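The following sketch switches to the strict parser and doubles the
size of big buffers. It assumes that @code{bigBufferSize} is given as
a plain number of bytes; check the value reported by the local web
interface if in doubt.
@example
# Refuse instances whose headers cannot all be parsed.
laxHttpParser = false
# Allow up to 64 kB of headers (assumed to be in bytes).
bigBufferSize = 65536
@end example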
@node Censoring headers, Intermediate proxies, Tuning the HTTP parser, HTTP tuning
@subsection Censoring headers
@cindex privacy
@cindex anonymity
@cindex Referer
@cindex cookies
@vindex censorReferer
@vindex censoredHeaders
@vindex proxyName
@vindex disableVia
Polipo offers the option to censor given HTTP headers in both client
requests and server replies. The main application of this feature is
to very slightly improve the user's privacy by eliminating cookies and
some content-negotiation headers.
It is important to understand that these features merely make it
slightly more difficult to gather statistics about the user's
behaviour. While they do not actually prevent such statistics from
being collected, they might make it less cost-effective to do so.
The general mechanism is controlled by the variable
@code{censoredHeaders}, the value of which is a case-insensitive list
of headers to unconditionally censor. By default, it is empty, but
I recommend that you set it to @samp{From, Accept-Language}. Adding
headers such as @samp{Set-Cookie}, @samp{Set-Cookie2}, @samp{Cookie},
@samp{Cookie2} or @samp{User-Agent} to this list will probably break
many web sites.
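The recommended setting above would be written in the configuration
file roughly as follows (header names are case-insensitive, so the
capitalisation used here is a matter of taste):
@example
censoredHeaders = from, accept-language
@end example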
The case of the @samp{Referer}@footnote{HTTP contains many mistakes
and even one spelling error.} header is treated specially because many
sites will refuse to serve pages when it is not provided. If
@code{censorReferer} is @code{false} (the default), @samp{Referer}
headers are passed unchanged to the server. If @code{censorReferer}
is @code{maybe}, @samp{Referer} headers are passed to the server only
when they refer to the same host as the resource being fetched. If
@code{censorReferer} is @code{true}, all @samp{Referer} headers are
censored. I recommend setting @code{censorReferer} to @code{maybe}.
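As a sketch of the recommended setting:
@example
censorReferer = maybe
@end example
With this value, and taking two hypothetical hosts as an example, a
request for @samp{http://www.example.org/page.html} carrying
@samp{Referer: http://www.example.org/index.html} keeps its
@samp{Referer} header, since both URLs are on the same host, while the
same request carrying @samp{Referer: http://other.example.com/} has the
header removed.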
Another header that can have privacy implications is the @samp{Via}
header, which is used to specify the chain of proxies through which
a given request has passed. Polipo will generate @samp{Via} headers
if the variable @code{disableVia} is @code{false} (it is @code{true} by
default). If you choose to generate @samp{Via} headers, you may want
to set the @code{proxyName} variable to some innocuous string
(@pxref{Client connections}).
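For instance, to generate @samp{Via} headers without revealing the
host name of the machine, one might use something like the following
sketch, in which the proxy name is an arbitrary placeholder:
@example
disableVia = false
proxyName = "proxy"
@end example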
@menu
* Censor Accept-Language:: Why Accept-Language is evil.
@end menu
@node Censor Accept-Language, , Censoring headers, Censoring headers
@subsubsection Why censor Accept-Language
@cindex negotiation
@cindex content negotiation
@cindex Accept-Language
Recent versions of HTTP include a mechanism known as @dfn{content
negotiation} which allows a user-agent and a server to negotiate the
best representation (instance) for a given resource. For example, a
server that provides both PNG and GIF versions of an image will serve
the PNG version to user-agents that support PNG, and the GIF version
to Internet Explorer.
Content negotiation requires that a client should send with every
single request a number of headers specifying the user's cultural and
technical preferences. Most of these headers do not expose sensitive
information (who cares whether your browser supports PNG?). The
@samp{Accept-Language} header, however, is meant to convey the user's
linguistic preferences. In some cases, this information is sufficient
to pinpoint with great precision the user's origins and even his
political or religious opinions; think, for example, of the
implications of sending @samp{Accept-Language: yi} or @samp{ar_PS}.
At any rate, @samp{Accept-Language} is not useful. Its design is
based on the assumption that language is merely another representation
for the same information, and @samp{Accept-Language} simply carries a