Manipulating Data on Linux
==========================
Harry Mangalam <[email protected]>
v1.25, Nov 15, 2014
:icons:
// export fileroot="/home/hjm/nacs/ManipulatingDataOnLinux-2"; asciidoc -a icons -a toc2 -b html5 -a numbered ${fileroot}.txt; scp ${fileroot}.[ht]* moo:~/public_html
// update svn from BDUC
// scp ${fileroot}.[ht]* hmangala@claw1:~/bduc/trunk/sge; ssh hmangala@bduc-login 'cd ~/bduc/trunk/sge; svn update; svn commit -m "new mods to ManipulatingData.."'
// and push it to Wordpress:
// blogpost.py update -c HowTos ${fileroot}.txt
// don't forget that the HTML equiv of '~' = '%7e'
// asciidoc cheatsheet: http://powerman.name/doc/asciidoc
// asciidoc user guide: http://www.methods.co.nz/asciidoc/userguide.html
Introduction
------------
If you're coming from Windows, the world of the Linux command line can be perplexing - you have to know what you want before you can do
anything - there's nothing to click, no wizards, few hints. So let me supply a few...
I assume you've been forced to the Linux shell prompt somewhat against your will and you have no
burning desire to learn the cryptic and agonizing commands that form the basis of
http://xkcd.org[xkcd] and other insider jokes. You want to get your work done and you want to get it
done fast.
However, there are some very good reasons for using the commandline for doing your data processing.
With more instruments providing digital output and new technologies providing terabytes of digital
data, trying to handle this data with Excel is *just not going to work*. And before long, 'Comma Separated
Value' (CSV) files are probably not going to be much help either. But we can deal with all
of these on Linux using some fairly simple, free utilities (the overwhelming majority of Linux tools
are free, another reason to use it).
There are also some good proprietary tools that have been ported to Linux ('MATLAB, Mathematica,
SAS', etc), but I'm going to mostly ignore these for now. This is also not to say that you can't do
much of this on Windows, using native utilities and applications, but it's second nature on Linux
and it works better as well. The additional point is that it will ALWAYS work like this on Linux.
No learning a new interface every 6 months because 'Microsoft' or 'Apple' need to bump their profit
by releasing yet another pointless upgrade that you have to pay for in time and money.
Note that 'most' of the applications noted exist for MacOSX and 'many' exist for Windows. See link:#FlossOnMacWin[below].
There is a great, free, self-paced tutorial called http://swc.scipy.org[Software Carpentry] that
examines much of what I'll be zooming thru in better detail. The title refers to the general
approach of Unix and Linux: use of simple, well-designed tools that tend to do one job very well
(think 'saw' or 'hammer'). Unlike the physical tools, tho, the output of one can be piped into the
input of another to form a surprisingly effective (if simple) workflow for many needs.
http://www.showmedo.com[Showmedo] is another very useful website, sort of like *YouTube for Computer
tools*. It has tutorial videos covering Linux, Python, Perl, Ruby, the 'bash' shell, Web tools,
etc. And especially for beginners, it has a section related specifically to the above-referenced
http://www.showmedo.com/videos/series?name=pQZLHo5Df[Software Carpentry series].
OK, let's start.
Getting your data to and from the Linux host.
---------------------------------------------
This has been covered in many such tutorials, and the short version is to use
http://www.winscp.com[WinSCP] on Windows and http://cyberduck.ch/[Cyberduck] on the Mac. I've
written more on this, so go and take a look if you want the
http://moo.nac.uci.edu/%7ehjm/BDUC_USER_HOWTO.html#filestoandfrom[short and sweet version] or the
http://moo.nac.uci.edu/%7ehjm/HOWTO_move_data.html[longer, more sophisticated version]. Also, see
the MacOSX note above - you can do all of this on the Mac as well, altho there are some
Linux-specific details that I'll try to mention.
'sshfs': The other very convenient approach, if the server kernel allows it, is to use
http://en.wikipedia.org/wiki/SSHFS[sshfs] to semi-permanently connect your Desktop file system to
your '$HOME' directory on the server. The process is
http://moo.nac.uci.edu/~hjm/BDUC_USER_HOWTO.html#_sshfs[described here.]
OK, you've got your data to the Linux host... Now what..?
Well, first, check the note about link:#linuxfilenames[file names on Linux] and link:#tabcompletion[tab completion (very useful)].
Simple data files - examination and slicing
-------------------------------------------
[NOTE]
.man pages
=================================================================
To figure out how to use a command beyond the simple examples below, try the link:#manpages[man pages].
=================================================================
[[file]]
What kind of file is it?
~~~~~~~~~~~~~~~~~~~~~~~~
First, even tho you might have generated the data in your lab, you might not know what kind of data
it is. While it's not foolproof, a tool that may help is called *file*. It tries to answer the
question: "What kind of file is it?" Unlike the Windows approach that maps file name endings to a
particular type (filename.typ), 'file' actually peeks inside the file to see if there are any
diagnostic characteristics, so especially if the file has been renamed or name-mangled in
translation, it can be very helpful.
For example:
--------------------------------------------------------------------------
bash > file /home/hjm/z/netcdf.tar.gz
/home/hjm/z/netcdf.tar.gz: gzip compressed data, from Unix, last modified: Thu Feb 17 13:37:35 2005
# now I'll copy that file to one called 'anonymous'
bash > cp /home/hjm/z/netcdf.tar.gz anonymous
bash > file anonymous
anonymous: gzip compressed data, from Unix, last modified: Thu Feb 17 13:37:35 2005
# see - it still works.
--------------------------------------------------------------------------
[NOTE]
.Assumptions
=====================================================================
I'm assuming that you're logged into a bash shell on a Linux system
with most of the usual Linux utilities installed as well as R. You
should create a directory for this exercise - name it anything you
want, but I'll refer to it as $DDIR for DataDir. You can do the same by assigning
the real name to the shell variable DDIR:
--------------------------------------------------------------------
export DDIR=/the/name/you/gave/it
--------------------------------------------------------------------
Shell commands are prefixed by *bash >* and can be moused into your own shell
to test, including the embedded comments (prefixed by '#'; they will be ignored).
Do not, of course, include the *bash >* prefix.
Also, all the utilities described here will be available on the interactive
http://moo.nac.uci.edu/%7ehjm/BDUC_USER_HOWTO.html[BDUC cluster] nodes at UC Irvine. Unless
otherwise stated, they are also freely available for any distribution of Linux.
=====================================================================
How big is the file?
~~~~~~~~~~~~~~~~~~~~
We're going to use a 25MB tab-delimited data file called
http://moo.nac.uci.edu/%7ehjm/red+blue_all.txt.gz[red+blue_all.txt.gz]. Download it (in Firefox) by
right-clicking on the link and selecting 'Save Link As..'. Save it to the $DDIR directory you've
created for this exercise, then decompress it with 'gunzip red+blue_all.txt.gz'.
[[ls]]
We can get the total bytes with 'ls'
-----------------------------------------------------------------
bash > mkdir -p ~/where/you/want/the/DDIR # make the DDIR (-p also creates any missing parent dirs)
bash > export DDIR=~/where/you/want/the/DDIR # create a shell variable to store it
bash > cd $DDIR # cd into that data dir
bash > ls -l red+blue_all.txt
-rw-r--r-- 1 hjm hjm 26213442 2008-09-10 16:49 red+blue_all.txt
^^^^^^^^
-----------------------------------------------------------------
or in 'human-readable form' with:
-----------------------------------------------------------------
bash > ls -lh red+blue_all.txt
# ^
-rw-r--r-- 1 hjm hjm 25M 2008-09-10 16:49 red+blue_all.txt
# ^^^ (25 Megabytes)
-----------------------------------------------------------------
[[wc]]
We can get a little more information using 'wc' (wordcount)
-----------------------------------------------------------------
bash > wc red+blue_all.txt
385239 1926195 26213442 red+blue_all.txt
-----------------------------------------------------------------
'wc' shows that it's 385,239 lines, 1,926,195 words, and 26,213,442 bytes (which equals the number of characters for plain ASCII text).
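
Here's a minimal sketch of the same counts on a tiny, made-up file (the name 'tiny.tsv' and its contents are my own illustration, not part of the tutorial data). The '-l', '-w', and '-c' options ask 'wc' for just the lines, words, or bytes, which is handy when you want a single number inside a script.

```shell
# Work in a throwaway directory so nothing in $DDIR is touched.
workdir=$(mktemp -d)

# Create a 3-line, tab-delimited file (hypothetical example data).
printf 'id\tred\tblue\n' >  "$workdir/tiny.tsv"
printf 'a1\t10\t20\n'    >> "$workdir/tiny.tsv"
printf 'a2\t30\t40\n'    >> "$workdir/tiny.tsv"

# Reading from STDIN (<) keeps the file name out of the output.
nlines=$(wc -l < "$workdir/tiny.tsv")   # lines
nwords=$(wc -w < "$workdir/tiny.tsv")   # whitespace-separated words
nbytes=$(wc -c < "$workdir/tiny.tsv")   # bytes

echo "lines=$nlines words=$nwords bytes=$nbytes"

rm -rf "$workdir"
```

Because the file is fed in via '<', 'wc' prints only the number, so the result can be stored in a shell variable and used directly.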
Native Spreadsheet programs for Linux
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
In some cases, you'll want to use a spreadsheet application to review a spreadsheet. (For the few
of you who don't know how to use spreadsheets for data analysis (as opposed to just looking at
columns of data), there is a http://swc.scipy.org/lec/spreadsheets.html[Linux-oriented tutorial] at
the oft-referenced Software Carpentry site.)
While Excel is the acknowledged leader in this application area, there are some very good native
free spreadsheets available on Linux that behave very similarly to Excel. There's a good exposition
on spreadsheet history as well as links to free and commercial spreadsheets for Linux
http://www.cbbrowne.com/info/spreadsheets.html[here] and a
http://en.wikipedia.org/wiki/Comparison_of_spreadsheets[good comparison of various spreadsheets].
For normal use, I'd suggest one of:
[[openoffice]]
* http://en.wikipedia.org/wiki/OpenOffice.org_Calc[OpenOffice calc] (type 'oocalc' to start)
[[libreoffice]]
* http://www.libreoffice.org/features/calc/[LibreOffice calc] (type 'libreoffice' to start)
[[gnumeric]]
* http://en.wikipedia.org/wiki/Gnumeric[Gnumeric] (type 'gnumeric' to start).
NB: 'LibreOffice' is the latest fork of 'OpenOffice'. Due to IP entanglements, development of
'OpenOffice' has stopped but development of 'LibreOffice' continues actively. If both are available,
'LibreOffice' is a better choice.
The links will give you more information on them, and the spreadsheet modules in each will let you
view and edit most Excel spreadsheets. In addition, there are 2 Mac-native versions of OpenOffice,
the first of which is a port from the OpenOffice group:
[[neooffice]]
* http://porting.openoffice.org/mac/download/aqua.html[OpenOffice Aqua] the X11 port of OpenOffice
* http://www.neooffice.org[NeoOffice] - a native fork of OpenOffice.
A NeoOffice-supplied http://neowiki.neooffice.org/index.php/NeoOffice_Feature_Comparison[comparison chart] may be worth reviewing.
Extracting data from MS Excel and MS Word files files
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
While both OpenOffice and MS Office have bulked up considerably in their officially
http://en.wikipedia.org/wiki/OpenOffice.org_Calc[stated capacity], often spreadsheets are not
appropriate to the kind of data processing we want to do. So how to extract the data?
The first way is the most direct and may in the end be the easiest - open the files (*oowriter* for
Word files, *oocalc* for Excel files) and export the data as plain text.
For Word files:
-----------------------------------------------------------------------------------------
File Menu > Save as... > set Filter to text (.txt), specify directory and name > [OK]
-----------------------------------------------------------------------------------------
For Excel files:
-----------------------------------------------------------------------------------------
File Menu > Save As... > set Filter: to text CSV (.csv), specify directory and name > [OK] >
set 'Field delimiter' and 'Text delimiter' > [OK]
-----------------------------------------------------------------------------------------
Or use the much faster method below.
The above method of extracting data from a binary MS file requires a fair amount of clicking and
mousing. The 'Linux Way' would be to use a commandline utility to do it in one line.
[[antiword]]
Converting a Word file to text
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
For MS Word documents, there is a utility, appropriately called... *antiword*
*antiword* does what the name implies; it takes a Word document and inverts it - turns it into a
plain text document:
--------------------------------------------------------------------------
bash > time antiword some_MS_Word.doc > some_MS_Word.txt
real 0m0.004s
user 0m0.004s
sys 0m0.000s
# only took 0.004s!
--------------------------------------------------------------------------
(The '>' operator in the example above redirects the 'STDOUT' output from 'antiword' to a file
instead of spewing it to the console. Try it without the '> some_MS_Word.txt'. See the
link:#STDINOUTERR[note on STDIN, STDOUT, & STDERR].)
[NOTE]
.Timing your programs
=====================================================================
The 'time' prefix in the above example returns the amount of time that the command that followed it
used. There are actually 2 'time' commands usually available on Linux; the one demo'ed above is the
internal 'bash' timer. If you want more info, you can use the 'system' timer which is usually
'/usr/bin/time' and has to be explicitly called:
--------------------------------------------------------------------------
bash > /usr/bin/time antiword some_MS_Word.doc > some_MS_Word.txt
0.00user 0.00system 0:00.00elapsed 80%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+16outputs (0major+262minor)pagefaults 0swaps
# or even more verbosely:
bash > /usr/bin/time -v antiword some_MS_Word.doc > some_MS_Word.txt
Command being timed: "antiword some_MS_Word.doc > some_MS_Word.txt"
User time (seconds): 0.00
System time (seconds): 0.00
Percent of CPU this job got: 0%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.00
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 0
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 263
Voluntary context switches: 1
Involuntary context switches: 0
Swaps: 0
File system inputs: 0
File system outputs: 16
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
--------------------------------------------------------------------------
=====================================================================
[[py_xls2csv]]
Extracting an Excel spreadsheet
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
There's an excellent Excel extractor called 'py_xls2csv', part of the
free 'python-excelerator' package (on Ubuntu). It works similarly to 'antiword':
-----------------------------------------------------------------------------
bash > py_xls2csv BodaciouslyHugeSpreadsheet.xls > BodaciouslyHugeSpreadsheet.csv
-----------------------------------------------------------------------------
'py_xls2csv' takes no options and saves output with commas as the only separator, which is generally
what you want.
If you want to do further data mangling in Python, the http://www.python-excel.org/[xlrd and xlwt
modules] are very good modules for reading and writing Excel files, but they're libs, not a
standalone utility (tho there are https://secure.simplistix.co.uk/svn/xlwt/trunk/xlwt/examples/[lots
of examples]).
If you are Perl-oriented and have many Excel spreadsheets to manipulate, extract, spindle and
mutilate, the Perl Excel-handling modules are also very good.
http://www.ibm.com/developerworks/linux/library/l-pexcel/[Here is a good description] of how to do
this.
Viewing and Manipulating the data
---------------------------------
Whether the file has been saved to a text format, or whether it's still in a binary format, you will
often want to examine it to determine the columnar layout if you haven't already. You can do this
either with a commandline tool or via the native application, often a spreadsheet. If the latter,
the OpenOffice spreadsheet app *oocalc* is the most popular and arguably the most capable and
compatible Linux spreadsheet application. It will allow you to view Excel data in native format so
you can determine which cols are relevant to continued analysis.
If the file is in text format or has been converted to it, it may be easier to use a text-mode
utility to view the columns. There are a few text-mode spreadsheet programs available, but they are
overkill for simply viewing the layout. Instead, consider using either a simple editor or the text
mode utilities that can be used to view it.
[[editors]]
Editors for Data
~~~~~~~~~~~~~~~~
Common, free GUI editors that are a good choice for
viewing such tabular data are http://www.nedit.org/[nedit], http://www.jedit.org[jedit], and
http://kate-editor.org[kate]. All have unlimited horizontal scroll and rectangular copy and paste,
which makes them useful for copying rectangular chunks of data. 'nedit' and 'jedit' are also easily
scriptable and can record keystrokes to replay for repeated actions. 'nedit' has a 'backlighting'
feature which can visually distinguish tabs and spaces, sometimes helpful in debugging a data
file. 'jedit' is written in Java so it's portable across platforms and has tremendous support in
the form of various http://plugins.jedit.org/[plugins].
All these editors run on the server and export their windows to your workstation if it can
http://moo.nac.uci.edu/~hjm/bduc/BDUC_USER_HOWTO.html#graphics[display X11 graphics].
Also, while it is despised and adored in equal measure, the http://www.xemacs.org/[xemacs] editor
can also do just about anything you want to do, if you learn enough about it. There is an optional
(free, of course) add-in for statistics called http://ess.r-project.org/[ESS] (for the SPlus/R
languages, as well as SAS and Stata). Emacs in all its forms is as much a lifestyle choice as an
editor.
Text-mode data manipulation utilities
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
[[STDINOUTERR]]
[NOTE]
.STDIN, STDOUT, STDERR
=====================================================================
Automatically available to your programs in Linux and all Unix work-alikes (including MacOSX) are the
3 channels noted above: Standard IN (STDIN, normally the keyboard), Standard OUT (STDOUT, normally
the terminal screen), and Standard Error (STDERR, also normally the terminal screen). These channels
can be intercepted, redirected, and piped in a variety of ways to further process, separate,
aggregate, or terminate the processes that use them. This is a whole topic by itself and is covered
well in http://swc.scipy.org/lec/shell02.html[this Software Carpentry tutorial].
=====================================================================
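
To make the three channels a bit more concrete, here's a small sketch of the most common redirections. All the file names are made up for the illustration, and everything happens in a temporary directory.

```shell
workdir=$(mktemp -d)

# '>' sends STDOUT to a file (overwriting it); '>>' appends instead.
echo "first line"  >  "$workdir/out.txt"
echo "second line" >> "$workdir/out.txt"

# '2>' sends STDERR to a file. Listing a file that does not exist
# produces only an error message, so the STDOUT file stays empty.
ls "$workdir/no_such_file" > "$workdir/ls.out" 2> "$workdir/ls.err"

# '|' pipes the STDOUT of one command into the STDIN of the next.
nlines=$(cat "$workdir/out.txt" | wc -l)

# '<' feeds a file to a command's STDIN.
upper=$(tr 'a-z' 'A-Z' < "$workdir/out.txt" | head -n 1)

out_size=$(wc -c < "$workdir/ls.out")
err_size=$(wc -c < "$workdir/ls.err")
echo "out.txt has $nlines lines; first line uppercased: $upper"
echo "ls wrote $out_size bytes to STDOUT and $err_size bytes to STDERR"

rm -rf "$workdir"
```

Keeping STDOUT and STDERR in separate files is often worth the extra typing: your data stays clean and the complaints end up somewhere you can read them later.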
[[grep]]
The grep family
^^^^^^^^^^^^^^^
Possibly the most used utilities in the Unix/Linux world, these elegant utilities are used to
search files for patterns of text called http://en.wikipedia.org/wiki/Regular_expression[regular
expressions] (aka regex) and can select or omit a line based on the matching of the regex. The most
popular of these is the basic grep and it, along with some of its brethren, is
http://en.wikipedia.org/wiki/Grep[described well in Wikipedia]. Another variant which behaves
similarly but with one big difference is http://en.wikipedia.org/wiki/Agrep[agrep], or *approximate*
grep, which can search for patterns with variable numbers of errors, such as might be expected in a
file resulting from an optical scan or typos. Baeza-Yates and Navarro's
http://www.dcc.uchile.cl/~gnavarro/software/[nrgrep] is even faster, if not as flexible (or
'differently' flexible), as agrep.
A grep variant would be used to extract all the lines from a file that had a particular phrase or
pattern embedded in it.
For example:
--------------------------------------------------------------------------
bash > wc /home/hjm/FF/CyberT_C+E_DataSet
600 9000 52673 /home/hjm/FF/CyberT_C+E_DataSet
# so the file has 600 lines. If we wanted only the lines that had the
# identifier 'mur', followed by anything, we could extract it:
# (passed thru 'scut' (see below) to trim the extraneous cols.)
bash > grep mur /home/hjm/FF/CyberT_C+E_DataSet | scut --c1='0 1 2 3 4'
b0085 murE 6.3.2.13 0.000193129 0.000204041
b0086 murF+mra 6.3.2.15 0.000154382 0.000168569
b0087 mraY+murX 2.7.8.13 1.41E-05 1.89E-05
b0088 murD 6.3.2.9 0.000117098 0.000113005
b0090 murG 2.4.1.- 0.000239323 0.000247582
b0091 murC 6.3.2.8 0.000245371 0.00024733
# if we wanted only the id's murD and murG (quote the pattern so the shell can't glob-expand it):
bash > grep 'mur[DG]' /home/hjm/FF/CyberT_C+E_DataSet | scut --c1='0 1 2 3 4'
b0088 murD 6.3.2.9 0.000117098 0.000113005
b0090 murG 2.4.1.- 0.000239323 0.000247582
--------------------------------------------------------------------------
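
A few more 'grep' options come up constantly. The sketch below uses a cut-down, hypothetical data file in the same spirit as the listing above (it is not the real CyberT data set) and only standard 'grep' options.

```shell
workdir=$(mktemp -d)
data="$workdir/genes.txt"

# Hypothetical, cut-down data modeled on the listing above.
cat > "$data" <<'EOF'
b0085 murE 6.3.2.13
b0086 murF 6.3.2.15
b0088 murD 6.3.2.9
b0090 murG 2.4.1.-
b0095 ftsZ 3.4.24.-
EOF

# Quote the pattern so the shell cannot expand [DG] as a filename glob.
only_DG=$(grep 'mur[DG]' "$data")

# -c counts matching lines instead of printing them.
n_mur=$(grep -c 'mur' "$data")

# -v inverts the match: print the lines that do NOT contain 'mur'.
not_mur=$(grep -v 'mur' "$data")

# -i ignores case; -E enables extended regexes such as alternation.
n_either=$(grep -c -i -E 'MURE|ftsz' "$data")

echo "$only_DG"
echo "mur lines: $n_mur; non-mur: $not_mur; murE-or-ftsZ: $n_either"

rm -rf "$workdir"
```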
[[cat]]
cat
^^^
*cat* (short for 'concatenate') is one of the simplest Linux text utilities. It simply dumps the
contents of the named file (or files) to STDOUT, normally the terminal screen. However, because it
'dumps' the file(s) to STDOUT, it can also be used to concatenate multiple files into one.
--------------------------------------------------------------------------
bash > cat greens.txt
turquoise
putting
sea
leafy
emerald
bottle
bash > cat blues.txt
cerulean
sky
murky
democrat
delta
# now concatenate the files
bash > cat greens.txt blues.txt >greensNblues.txt
# and dump the concatenated file
bash > cat greensNblues.txt
turquoise
putting
sea
leafy
emerald
bottle
cerulean
sky
murky
democrat
delta
--------------------------------------------------------------------------
[[moreless]]
more & less
^^^^^^^^^^^^
These critters are called 'pagers' - utilities that allow you to page thru text files in the
terminal window. 'less is more than more' in my view, but your mileage may vary. These pagers
allow you to queue up a series of files to view, can scroll sideways, allow search by
http://en.wikipedia.org/wiki/Regular_expression[regular expression], show progression thru a file,
spawn editors, and many more things.
http://www.showmedo.com/videos/video?name=940030&fromSeriesID=94[Video example]
[[headtail]]
head & tail
^^^^^^^^^^^
These two utilities perform similar functions - they allow you to view the beginning ('head') or end
('tail') of a file. Both can be used to select contiguous ends of a file and pipe it to another
file or pager. 'tail -f' can also be used to view the end of a file continuously (as when you have
a program continuously generating output to a file and you want to watch the progress). These are
also described in more detail near the end of the
http://www-128.ibm.com/developerworks/linux/library/l-textutils.html#12[IBM DeveloperWorks tutorial].
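
As a rough sketch of how 'head' and 'tail' combine, the following builds a made-up 10-line file and slices it a few ways. The file name is hypothetical; the options are the standard '-n' forms.

```shell
workdir=$(mktemp -d)

# Ten numbered lines of made-up data.
i=1
while [ "$i" -le 10 ]; do
    echo "line $i"
    i=$((i + 1))
done > "$workdir/ten.txt"

first3=$(head -n 3 "$workdir/ten.txt")     # lines 1-3
last2=$(tail -n 2 "$workdir/ten.txt")      # lines 9-10

# Chain them to pull a slice out of the middle: lines 4-6.
middle=$(head -n 6 "$workdir/ten.txt" | tail -n 3)

# 'tail -n +N' starts AT line N, which is handy for skipping a header line.
no_header=$(tail -n +2 "$workdir/ten.txt" | head -n 1)

echo "$middle"

rm -rf "$workdir"
```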
[[scut]]
cut & scut
^^^^^^^^^^
These are columnar slicing utilities, which allow you to slice vertical columns of characters or
fields out of a file, based on character offset or column delimiters.
http://lowfatlinux.com/linux-columns-cut.html[cut] is on every Linux system and works very quickly,
but is fairly primitive in its ability to select and separate data.
http://forums.nacs.uci.edu/BioBB/viewtopic.php?f=10&t=7[scut] is a Perl utility which trades some
speed for much more flexibility, allowing you to select data not only by character column and single
character delimiters, but also by data fields identified by any delimiter that a
http://en.wikipedia.org/wiki/Regular_expression[regular expression] (aka regex) can define. It can
also re-order columns and sync fields to those of another file, much like the *join* utility
(link:#join[see below]). See http://moo.nac.uci.edu/~hjm/scut_cols_HOWTO.html[this link] for more
info.
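
Since 'scut' may not be installed everywhere, here's a minimal sketch using only plain 'cut' on made-up data. It also shows one of the limits mentioned above: 'cut' always emits fields in file order, so it can't re-order columns.

```shell
workdir=$(mktemp -d)

# Made-up tab-delimited table (TAB is cut's default field delimiter).
printf 'b0085\tmurE\t6.3.2.13\n' >  "$workdir/t.tsv"
printf 'b0088\tmurD\t6.3.2.9\n'  >> "$workdir/t.tsv"

# -f selects fields (numbered from 1).
names=$(cut -f 2 "$workdir/t.tsv")

# Fields always come out in file order; 'cut -f 3,1' gives the same result.
id_ec=$(cut -f 1,3 "$workdir/t.tsv")

# -d sets a different single-character delimiter.
user=$(echo 'hjm:x:1000:1000' | cut -d ':' -f 1,3)

# -c selects by character position instead of by field.
prefix=$(cut -c 1-3 "$workdir/t.tsv" | head -n 1)

echo "$names"

rm -rf "$workdir"
```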
[[cols]]
cols
^^^^
'cols' is a very simple, but arguably useful utility that allows you to view the data of a file
aligned according to fields. Especially in conjunction with 'less', it's useful if you're
manipulating a file that has 10s of columns, especially if those columns are of disparate widths.
http://moo.nac.uci.edu/~hjm/scut_cols_HOWTO.html[cols is explained fairly well here] and the
http://moo.nac.uci.edu/%7ehjm/cols[cols code is available here].
[[paste]]
paste
^^^^^
'paste' can join 2 files 'side by side' to provide a horizontal concatenation. For example:
-----------------------------------------------------------------------------
bash > cat file_1
aaa bbb ccc ddd
eee fff ggg hhh
iii jjj kkk lll
bash > cat file_2
111 222 333
444 555 666
777 888 999
bash > paste file_1 file_2
aaa bbb ccc ddd 111 222 333
eee fff ggg hhh 444 555 666
iii jjj kkk lll 777 888 999
-----------------------------------------------------------------------------
Note that 'paste' inserted a TAB character between the 2 files, each of which used spaces between
each field. See also the http://www-128.ibm.com/developerworks/linux/library/l-textutils.html#8[IBM
DeveloperWorks tutorial].
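
If a TAB isn't the separator you want, 'paste' will take another one. A small sketch, with made-up file contents and only the standard '-d' and '-s' options:

```shell
workdir=$(mktemp -d)
printf 'aaa bbb\neee fff\n' > "$workdir/file_1"
printf '111\n444\n'         > "$workdir/file_2"

# Default: the pieces are joined with a TAB.
tabbed=$(paste "$workdir/file_1" "$workdir/file_2")

# -d chooses a different delimiter, here a comma.
commas=$(paste -d ',' "$workdir/file_1" "$workdir/file_2")

# -s pastes the lines of ONE file serially, turning a column into a row.
row=$(paste -s -d ',' "$workdir/file_2")

echo "$commas"
echo "$row"

rm -rf "$workdir"
```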
[[join]]
join
^^^^
'join' is a more powerful variant of 'paste' that acts as a simple relational 'join' based on common
fields. 'join' needs identical field values to finish the join. Note that 'scut'
link:#scut[described above] can also do this type of operation. For much more powerful relational
operations, http://www.sqlite.org[SQLite] is a fully featured relational database that can do this
reasonably easily (link:#sqlite[see below]).
http://www-128.ibm.com/developerworks/linux/library/l-textutils.html#9[A good example of 'join' is
here].
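
A minimal sketch of 'join' on two made-up files that share an ID in the first field. One thing that trips people up: 'join' expects both inputs to be sorted on the join field, so the sketch sorts them first.

```shell
workdir=$(mktemp -d)

# Two hypothetical files keyed on the ID in field 1, sorted before joining.
printf 'b0090 murG\nb0085 murE\nb0088 murD\n' | sort > "$workdir/names.txt"
printf 'b0088 6.3.2.9\nb0085 6.3.2.13\nb0099 1.1.1.1\n' | sort > "$workdir/ec.txt"

# Default: join on field 1 of each file, printing only IDs found in both.
joined=$(join "$workdir/names.txt" "$workdir/ec.txt")

# -a 1 also keeps lines from file 1 that have no partner in file 2.
left=$(join -a 1 "$workdir/names.txt" "$workdir/ec.txt")

echo "$joined"

rm -rf "$workdir"
```

Here 'b0090' has no entry in the second file and 'b0099' has none in the first, so neither appears in the default output; with '-a 1' the unpaired 'b0090' line is kept.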
pr
^^
'pr' is actually a printing utility that is mentioned here because for some tasks, especially those related
to presentation, it can join files together in formats that are hard to produce with the other
utilities described here. For example if you want the width of the printing to expand to a nonstandard width or want
to columnize the output in a particular way or modify the width of tab spacing, 'pr' may be able to
do what you need.
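
As one small illustration of the columnizing, this sketch lays six made-up values out in three columns. The options shown are standard 'pr' options; the width of 30 is an arbitrary choice for the example.

```shell
# Six short made-up values, one per line.
values=$(printf '%s\n' 1 2 3 4 5 6)

# -3 : lay the input out in 3 columns
# -a : fill the columns across each row rather than down each column
# -t : omit the page header and trailer that pr normally adds
# -w : total output width in characters
across=$(printf '%s\n' "$values" | pr -3 -a -t -w 30)

echo "$across"
```

The result is two rows of three values each. Drop the '-a' and 'pr' fills down each column instead.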
[[LAFF]]
Large ASCII Flat Files
----------------------
These are typically files containing self-generated data or (increasingly)
http://en.wikipedia.org/wiki/High_throughput_sequencing[High Throughput Sequencing] (HTS) data in
http://en.wikipedia.org/wiki/FASTA[FASTA] or http://en.wikipedia.org/wiki/FASTQ_format[FASTQ]
format. In the latter case, these sequence files should be kept compressed. Almost all released
tools for analyzing them can deal with the compressed form.
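
The standard text utilities can also work on compressed files without unpacking them to disk first. Here's a rough sketch on a tiny, made-up FASTQ file; it assumes the common layout of exactly 4 lines per record (the format permits other layouts, so treat the read count as an estimate on real data).

```shell
workdir=$(mktemp -d)

# Two made-up FASTQ records (4 lines each), stored compressed.
cat > "$workdir/reads.fastq" <<'EOF'
@read1
ACGTACGT
+
IIIIIIII
@read2
TTGGCCAA
+
IIIIIIII
EOF
gzip "$workdir/reads.fastq"          # leaves only reads.fastq.gz

# 'gzip -dc' decompresses to STDOUT, so the data can be piped onward
# without ever writing an uncompressed copy to disk.
nlines=$(gzip -dc "$workdir/reads.fastq.gz" | wc -l)
nreads=$((nlines / 4))               # assumes the usual 4-line records

# Peek at the first record only.
first=$(gzip -dc "$workdir/reads.fastq.gz" | head -n 4)

echo "reads: $nreads"
echo "$first"

rm -rf "$workdir"
```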
In the former case, take some care to design your output so that it's easily parseable and
searchable. If the data is entirely numeric, consider writing it as an HDF5 file or if it's fairly
simple, in binary format (the float '23878.349875935' takes 15-30 bytes to represent in a text
format; 8 bytes in double precision floating point). Integers take even less.
In the worst case, if you MUST use ASCII text, write your output in delimited tables rather than line
by line:
------------------------------------------------------------------------
# example of line by line (666 bytes of data)
AGENT 723, t = 0
len(agentpool)=996
totaledge_G = 100348
totaledge_H = 100348
reconnectamt = 4
AGENT 917, t = 0
len(agentpool)=995
totaledge_G = 100348
totaledge_H = 100348
reconnectamt = 5
AGENT 775, t = 0
len(agentpool)=994
totaledge_G = 100348
totaledge_H = 100348
reconnectamt = 6
AGENT 675, t = 0
len(agentpool)=993
totaledge_G = 100348
totaledge_H = 100348
reconnectamt = 7
AGENT 546, t = 0
len(agentpool)=992
totaledge_G = 100348
totaledge_H = 100348
reconnectamt = 8
AGENT 971, t = 0
len(agentpool)=991
totaledge_G = 100348
totaledge_H = 100348
reconnectamt = 9
AGENT 496, t = 0
len(agentpool)=990
totaledge_G = 100348
totaledge_H = 100348
reconnectamt = 10
------------------------------------------------------------------------
The same data in (poorly designed but) delimited format:
------------------------------------------------------------------------
#agent|t|len|totaledge_G|totaledge_H|reconnectamt (182 bytes: 27% of the above)
723|0|996|100348|100348|4
917|0|995|100348|100348|5
775|0|994|100348|100348|6
675|0|993|100348|100348|7
546|0|992|100348|100348|8
971|0|991|100348|100348|9
496|0|990|100348|100348|10
------------------------------------------------------------------------
If you need to view this data in aligned or titled format for verification, use the utilities http://moo.nac.uci.edu/~hjm/scut_cols_HOWTO.html[scut and cols].
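
If you're already stuck with line-by-line output, it can usually be converted after the fact. The following is a sketch, not a general parser: it assumes every record has exactly the five lines shown in the example above, in that order, and uses 'awk' to rewrite two of those records in the delimited layout.

```shell
workdir=$(mktemp -d)

# Two records copied from the line-by-line listing above.
cat > "$workdir/agents.log" <<'EOF'
AGENT 723, t = 0
len(agentpool)=996
totaledge_G = 100348
totaledge_H = 100348
reconnectamt = 4
AGENT 917, t = 0
len(agentpool)=995
totaledge_G = 100348
totaledge_H = 100348
reconnectamt = 5
EOF

# Split each line on runs of spaces, commas, and '='; the value is then
# always the last field. 'reconnectamt' closes a record, so print there.
table=$(awk -F '[ ,=]+' '
    BEGIN           { print "#agent|t|len|totaledge_G|totaledge_H|reconnectamt" }
    $1 == "AGENT"   { agent = $2; t = $NF }
    /^len/          { pool = $NF }
    /^totaledge_G/  { g = $NF }
    /^totaledge_H/  { h = $NF }
    /^reconnectamt/ { print agent "|" t "|" pool "|" g "|" h "|" $NF }
' "$workdir/agents.log")

echo "$table"

rm -rf "$workdir"
```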
[[hdf5]]
Complex Binary Data storage and tools
-------------------------------------
While much data is available in (or can be converted to) text format, some data is so large
(typically, >1 GB) that it demands special handling. Data sets from the following domains are
typically packaged in specialized large-data formats:
* Global Climate Modeling
* Stock exchange transactions
* Confocal images
* Satellite and other terrestrial scans
* Microarray and other genomic data
There are a number of specialized large-data formats, but I'll discuss a popular large-scale data
format called http://en.wikipedia.org/wiki/Hierarchical_Data_Format[HDF5], which is now also the
underlying storage format of http://en.wikipedia.org/wiki/NetCDF[netCDF] version 4. These can be thought of as
numeric databases, tho they have significant functional overlap with relational databases. One
advantage is that they have no requirement for a database server, an advantage they share with
link:#sqlite[SQLite, below]. As the wiki pages describe in more detail, these Hierarchical Data
Formats are self-describing, somewhat like XML files, which enables applications to determine the
data structures without external references. HDF5 and netCDF provide sophisticated, compact, and
especially hierarchical data storage, allowing an internal structure much like a modern filesystem.
A single file can provide character data in various encodings (ASCII, UTF, etc), numeric data in
various length integer, floating point, and complex representation, geographic coordinates, encoded
data such as base+offsets for efficiency, etc. These files, and the protocols for reading them,
ensure that they can be read and written using common functions without regard to platform or
network protocols, in a number of languages.
These files are also useful for very large data as they have parallel Input/Output (I/O)
interfaces. Using HDF5 or netCDF4, you can read and write these formats in parallel, increasing I/O
throughput dramatically on parallel filesystems such as
http://en.wikipedia.org/wiki/Lustre_(file_system)[Lustre], http://www.pvfs.org[PVFS2], and
http://en.wikipedia.org/wiki/Global_File_System[GFS]. Many analytical programs already have
interfaces to the HDF5 and netCDF4 format, among them http://tinyurl.com/dyclf6[MATLAB],
http://reference.wolfram.com/mathematica/ref/format/HDF5.html[Mathematica],
http://cran.r-project.org/web/packages/hdf5/index.html[R],
https://wci.llnl.gov/codes/visit/FAQ.html#28[VISIT], and http://www.hdfgroup.org/tools.html[others].
Tools for HDF and netCDF
~~~~~~~~~~~~~~~~~~~~~~~~
As noted above, some applications can open such files directly, obviating the need to use external
tools to extract or extend a data set in this format. However, for those times when you have to
subset, extend, or otherwise modify such a dataset, there are a couple of tools that can make that
job much easier. Since this document is aimed at the beginner rather than the expert, I'll keep
this section brief, but realize that these extremely powerful tools are available (and free!).
Some valuable tools for dealing with these formats:
[[nco]]
* http://nco.sf.net[nco], a suite of tools written in C/C++ mainly by UCI's own
http://www.ess.uci.edu/~zender[Charlie Zender]. They were originally written to manipulate netCDF
3.x files but have been updated to support netCDF4 (which uses HDF5 format). They are
http://nco.sourceforge.net/#Definition[described in detail here, with examples]. They are portable
to all current Unix implementations and of course Linux. They are extremely well-debugged and their
development is ongoing.
[[pytables]]
* http://www.pytables.org/moin[PyTables] is another utility that is used to create, inspect, modify,
and/or query HDF5 tables to extract data into other HDF or text files. This project also has very
good documentation and even has a couple of video introductions on http://www.showmedo.com[ShowMeDo].
They can be reached http://www.pytables.org/moin/HowToUse#Videos[from here]. In addition, PyTables
also has a companion graphical data browser called http://www.vitables.org[ViTables].
Databases
---------
Databases (DBs) are data structures and the supporting code that allow you to store data in a
particular way. Some DBs are used to store only numbers (and if this is the case, it might be
quicker and more compact to store that data in an link:#hdf5[HDF5 file]). If you have lots of different
kinds of data and you want to query that data on complex criteria (give me the names of all the
people who lived in the 92655 ZIP code and have spent more than $500 on toothpicks in the last 10
years), using a DB can be extremely useful and in fact using a DB may be the only way to extract the
data you need in a timely manner.
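To make the toothpick query concrete, here is a minimal sketch using Python's built-in 'sqlite3' module and an in-memory DB. The two tables, their column names, and the sample rows are all invented for illustration; a real DB would have its own schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical schema and sample rows, invented for this illustration.
conn.executescript("""
CREATE TABLE people (id INTEGER PRIMARY KEY, name TEXT, zip TEXT);
CREATE TABLE purchases (
    person_id INTEGER REFERENCES people(id),
    item      TEXT,
    amount    REAL,
    bought    TEXT   -- ISO date string, e.g. 2009-03-01
);
INSERT INTO people VALUES (1, 'Ann', '92655'), (2, 'Bob', '92655'), (3, 'Cy', '92697');
INSERT INTO purchases VALUES
    (1, 'toothpicks', 300.0, date('now', '-1 year')),
    (1, 'toothpicks', 250.0, date('now', '-2 years')),
    (2, 'toothpicks', 900.0, date('now', '-15 years')),
    (3, 'toothpicks', 800.0, date('now', '-1 year'));
""")

# Names of people in a given code area whose purchases of an item
# within the time window add up to more than a threshold.
query = """
SELECT p.name
FROM people AS p
JOIN purchases AS b ON b.person_id = p.id
WHERE p.zip = ? AND b.item = ? AND b.bought >= date('now', ?)
GROUP BY p.id, p.name
HAVING SUM(b.amount) > ?
"""
names = [row[0] for row in conn.execute(query, ("92655", "toothpicks", "-10 years", 500))]
print(names)
```

Here Bob is excluded because his purchase is too old and Cy because he lives elsewhere. Expressing the same filter over a directory of flat files would take considerably more code.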
As implied above, there are a number of different types of DBs, delineated not only by the way they
store and retrieve data, but by the way the user interacts with the DB. The software can be
separated into Desktop DBs (which are meant to serve mostly local queries by one person - for
example, a research DB that a scientist might use to store his own data) and DB Servers, which are
meant to answer queries from a variety of clients via network socket.
The latter are typically much more complex and require much more configuration to protect their
contents from network mischief. Desktop DBs typically have much less security and are less
concerned about answering queries from other users on other computers, tho they can typically do
so. Microsoft Access and SQLite are examples of Desktop DBs.
Especially if you are using the names of flat files as the index of their contents (such as
's=2.0_tr=48_N=1000.out'), the use of a DB is strongly suggested. In this way, you can append to a
common DB such that you can rapidly query the results over millions of entries (see
http://moo.nac.uci.edu/~hjm/sb/index.html#toc34[here] for an example where this approach was used).
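As a minimal sketch of that idea (not the approach used in the linked example), the parameters encoded in such file names can be parsed once and stored as columns in a SQLite table. The function 'parse_run_name', the table 'runs', and the second and third file names are hypothetical; the sketch assumes every name follows the 'key=value_key=value.out' pattern with numeric values.

```python
import sqlite3


def parse_run_name(filename):
    """Turn 's=2.0_tr=48_N=1000.out' into {'s': 2.0, 'tr': 48.0, 'N': 1000.0}."""
    stem = filename.rsplit(".", 1)[0]          # drop the '.out' extension
    params = {}
    for field in stem.split("_"):
        key, value = field.split("=", 1)
        params[key] = float(value)
    return params


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE runs (s REAL, tr REAL, n REAL, outfile TEXT)")

# Only the first name comes from the text; the others are made up.
filenames = [
    "s=2.0_tr=48_N=1000.out",
    "s=2.5_tr=48_N=1000.out",
    "s=2.0_tr=96_N=500.out",
]
for name in filenames:
    p = parse_run_name(name)
    conn.execute("INSERT INTO runs VALUES (?, ?, ?, ?)", (p["s"], p["tr"], p["N"], name))

# Query on the parameters instead of globbing on file names.
matches = [
    row[0]
    for row in conn.execute(
        "SELECT outfile FROM runs WHERE s = ? AND n >= ? ORDER BY outfile", (2.0, 1000)
    )
]
print(matches)
```

In practice you would also store the results themselves in the table, not just the file name, so that the flat files are no longer needed for queries.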
Relational databases
~~~~~~~~~~~~~~~~~~~~
Relational DBs are those that can 'relate' data from one table (or data structure) to another. A
relational DB is composed of tables which are internally related and can be joined or related to
other tables with more distant information. For example, a woodworker's DB might be composed of DB
tables that describe PowerTools, Handtools, Plans, Glues, Fasteners, Finishes, Woods, and Injuries,
with interconnecting relationships among the tables. The 'Woods' table definition might look like
this:
----------------------------------------------------------------------
CREATE TABLE Woods (
  id                 INTEGER PRIMARY KEY, -- the entry index
  origin             VARCHAR(20),         -- area of native origin
  now_grown          VARCHAR(100),        -- areas where now grown
  local              BOOLEAN,             -- locally available?
  cost               FLOAT,               -- cost per board foot
  density            FLOAT,               -- density in lbs/board ft.
  hardness           INT,                 -- relative hardness scaled 1-10
  sanding_ease       INT,                 -- relative ease to sand to finish
  color              VARCHAR(20),         -- string descriptor of color range
  allergy_propensity INT,                 -- relative propensity to cause allergic reaction 1-10
  toxicity           INT,                 -- relative toxicity scaled 1-10
  strength           INT,                 -- relative strength scaled 1 (balsa) to 10 (hickory)
  appro_glue         VARCHAR(20),         -- string descriptor of best glue
  warnings           VARCHAR(200),        -- any other warnings
  notes              VARCHAR(1000)        -- notes about preparation, finishing, cutting
  -- <etc>
);
----------------------------------------------------------------------
See http://en.wikipedia.org/wiki/Relational_database[the Wikipedia entry on Relational Databases] for more.
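To show what 'relating' two tables looks like in practice, here is a minimal sketch using Python's built-in 'sqlite3' module. It uses a cut-down version of the Woods table and a hypothetical Glues table; the 'cure_hrs' column and all of the sample rows are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Cut-down Woods table plus a hypothetical Glues table; all rows are made up.
conn.executescript("""
CREATE TABLE Glues (
    name     VARCHAR(20) PRIMARY KEY,  -- glue name, referenced by Woods.appro_glue
    cure_hrs INT                       -- hypothetical column: hours to cure
);
CREATE TABLE Woods (
    id         INTEGER PRIMARY KEY,
    color      VARCHAR(20),
    hardness   INT,
    appro_glue VARCHAR(20) REFERENCES Glues(name)
);
INSERT INTO Glues VALUES ('PVA', 24), ('epoxy', 12);
INSERT INTO Woods (color, hardness, appro_glue) VALUES
    ('pale', 1, 'PVA'),
    ('reddish brown', 7, 'epoxy'),
    ('cream', 9, 'PVA');
""")

# Join the two tables: for each hard wood, look up details of its best glue.
rows = conn.execute("""
SELECT w.color, w.hardness, g.name, g.cure_hrs
FROM Woods AS w
JOIN Glues AS g ON g.name = w.appro_glue
WHERE w.hardness >= 7
ORDER BY w.hardness
""").fetchall()

for row in rows:
    print(row)
```

The glue details live in one place only; every wood that uses a glue refers to it by name, which is the essence of the relational approach.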
Desktop
^^^^^^^
[[sqlite]]
* SQLite (http://www.sqlite.org[website], http://en.wikipedia.org/wiki/Sqlite[wikipedia]) is an
amazingly powerful, http://en.wikipedia.org/wiki/ACID[ACID-compliant], astonishingly tiny DB engine
(could fit on a floppy disk) that can do much of what much larger DB engines can do. It is public
domain, has a huge user base, has good documentation (including a
http://www.amazon.com/Definitive-Guide-SQLite-Mike-Owens/dp/1590596730[book]) and is well suited to
the transition from flat files to a relational database. It has support for almost every
computer language, and several utilities (such as graphical browsers like
http://sqlitebrowser.sourceforge.net/screenshots.html[sqlitebrowser] and
http://www.knoda.org/[knoda]) to ease its use. The *sqlite3* program that provides the native
commandline interface to the DB system is fairly flexible, allowing generic import of TAB-delimited
data into SQLite tables. Ditto the graphical *sqlitebrowser* programs. Here is
http://moo.nac.uci.edu/%7ehjm/recursive.filestats.sqlite_skel.pl[a well-documented example of how to
use SQLite in a Perl script] (about 300 lines including comments).
[[oobase]]
* http://www.openoffice.org[OpenOffice] comes with its http://hsqldb.org/[own DB] and
http://dba.openoffice.org/[DB interaction tools], which are extremely powerful, tho they have not
been well-documented in the past. The OpenOffice DB tools can be used not only with its own DB, but
with many others including MySQL, PostgreSQL, SQLite, MS Access, and any DB that provides an
http://en.wikipedia.org/wiki/Open_Database_Connectivity[ODBC interface].
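As a companion to the SQLite entry above, here is a minimal sketch of loading TAB-delimited data into a SQLite table from Python rather than from the *sqlite3* commandline program. The table name 'agents' and the sample data are hypothetical, and the sketch assumes the first line holds column names and that every column is an integer.

```python
import csv
import io
import sqlite3

# Stand-in for a TAB-delimited file on disk; the contents are made up.
TSV = "agent\tt\tlen\n723\t0\t996\n917\t0\t995\n"

reader = csv.reader(io.StringIO(TSV), delimiter="\t")
header = next(reader)                      # first line: column names

conn = sqlite3.connect(":memory:")

# Build the table from the header, quoting each column name.
columns = ", ".join(f'"{name}" INTEGER' for name in header)
conn.execute(f"CREATE TABLE agents ({columns})")

# Insert the remaining lines; values are passed as parameters, not pasted into SQL.
placeholders = ", ".join("?" for _ in header)
conn.executemany(f"INSERT INTO agents VALUES ({placeholders})", reader)

total = conn.execute('SELECT COUNT(*), MIN("len") FROM agents').fetchone()
print(total)
```

For a real file, replace the 'io.StringIO' object with an open file handle and use a DB file name instead of ':memory:'.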
Server-based
^^^^^^^^^^^^
[[mysql]]
* http://en.wikipedia.org/wiki/MySQL[MySQL] is a hugely popular, very fast DB server that provides
much of the DB needs of the WWW, both commercial and non-profit. Some of the largest, most popular
web sites in the world use MySQL to keep track of their web interactions, as well as their data. The
http://genome.ucsc.edu/index.html[UC Santa Cruz Genome DB] uses MySQL with a schema of >300 tables
to keep one of the most popular Biology web sites in the world running.
[[postgresql]]
* http://en.wikipedia.org/wiki/PostgreSQL[PostgreSQL] is similarly popular and has a reputation for
being even more robust and full-featured. It also is formally an
http://en.wikipedia.org/wiki/Object-relational_database_management_system[Object-Relational]
database, so its internal storage can be used in a more object oriented way. If your needs require
storing Geographic Information, PostgreSQL has a parallel development called PostGIS, which is
optimized for storing Geographical Information and has become a de facto standard for GIS DBs. It
supports the popular http://mapserver.osgeo.org/[Mapserver] software.
[[firebird]]
* http://en.wikipedia.org/wiki/Firebird_(database_server)[Firebird] is the Open Source version of a
previously commercial DB called Interbase from Borland. Since its release as OSS, it has undergone
a dramatic upswing in popularity and support.
* Others. There are a huge number of very good relational DBs, many of them free or OSS.
http://en.wikipedia.org/wiki/Comparison_of_relational_database_management_systems[Wikipedia has a
page] that names a number of them and briefly describes some differences, altho unless you are bound
by some external constraint, you would be foolish not to choose one of SQLite, MySQL, or PostgreSQL
due to the vast user-generated support and utilities.
[[MathModel]]
Mathematical Modeling Systems
-----------------------------
There are a wide variety of mathematical packages available for Linux, ranging from the well-known
proprietary ones such as MATLAB and Mathematica to a number of very well-implemented OSS ones. For
those looking for an OSS package with MATLAB-like language compatibility, Octave is a very good fit.
For a symbolic package like Mathematica, well... there's Mathematica (which is available for
Linux). Note that there is much cross-over among the packages mentioned in this section and those
mentioned in the Visualization section. Many of the OSS Visualization packages are used to display
the results of the Modeling packages and many of the Visualization packages actually have
significant analytical muscle as well, especially link:#visit[VISIT] and link:#rprogram[R].
However, there are a number of other OSS packages that can help you with manipulating your data. Here are a few of the best ones.
[[Octave]]
Octave
~~~~~~
http://www.gnu.org/software/octave/[Octave] is a
http://wiki.octave.org/FAQ#How_is_Octave_different_from_Matlab.3F[near-clone] of MATLAB in terms of
functionality and is widely used and well-supported. It does not have a native GUI, altho there are
externally developed GUIs that provide some functionality to get you started. While it is largely
compatible with MATLAB code, it is also considerably slower than MATLAB, especially for iterative
code (loops, tests, etc). If you can vectorize your code, it approaches MATLAB speed.
[[Scilab]]
SciLab
~~~~~~
http://www.scilab.org[SciLab] is a large project founded by INRIA, supported by a consortium of ~18
academic and commercial companies. It is similar to MATLAB, but is not language-compatible. It
does, however, come with a MATLAB to SciLab converter. It supports Mac/Win/Lin in 32/64bits, has a
full GUI, and like MATLAB, has a
http://www.scilab.org/contrib/index_contrib.php?page=listdyn.php&order=date[very large number of
toolboxes]. The older versions had a tcl/tk-based interface; the most recent version (5) now has a
more modern GTK-based GUI with some output based on http://www.scicos.org/[scicos] and
http://www.jgraph.com/jgraph.html[JgraphX].
For MATLAB users wanting to try or convert to SciLab, the current (5.3) version of SciLab has a
built-in MATLAB to SciLab conversion tool. And there is an extensive PDF document that describes
http://wiki.scilab.org/Tutorials?action=AttachFile&do=get&target=Scilab4Matlab.pdf[using SciLab from
a MATLAB user's perspective].
.Python-based Mathematics systems
[NOTE]
========================================================================
There are a surprising number of good systems developed using Python. SAGE and Pylab are two of the
better known. Both are commandline-only, but have good-to-excellent graphic output, including
interactive plots. Altho you don't type Python code directly in either, both systems are extensible
using the underlying Python. Both have good documentation and even introductory videos showing how
to use the systems. Both can be run on Mac/Win/Linux.
========================================================================
[[Pylab]]
PyLab
~~~~~
http://www.scipy.org/PyLab[Pylab] uses the SciPy packages (http://www.scipy.org/[SciPy],
http://numpy.scipy.org/[NumPy] (Numerical Python), & http://matplotlib.sourceforge.net/[matplotlib])
to create a very flexible programming system for Mathematics. NumPy is one of the more heavily
developed math libraries in the OSS world and PyLab benefits strongly from that development.
[[SAGE]]
SAGE
~~~~
http://www.sagemath.org/[SAGE] is a more formal system and is fairly well-funded to continue
development. It too uses SciPy and NumPy for some of the underlying Math plumbing. It can be used
either as a self-contained commandline-driven system or as a client, connecting to a webserver,
where your notebook is hosted. This arrangement allows you to test-drive the system using
http://www.sagenb.org/[SAGE's own web service]. Another non-trivial feature is that it is
distributed as a complete system; you can install it on a thumb drive and take the entire system,
including your work, with you as you travel from place to place.
[[DataVisualization]]
Data visualization
------------------
You almost always want to visualize your data. It's one thing to page thru acres of spreadsheet or
numeric data, but nothing can give you a better picture of your data than ... a picture. This is an
area where Linux shines, tho it's not without scrapes and scratches. I'm going to break this into 2
arbitrary sections. The first is 'Simple', the second 'Complex'. 'Simple' means that both the
process and data are relatively simple; 'Complex' implies that both the visualization process and
data are more complex.
Simple Data Visualization
~~~~~~~~~~~~~~~~~~~~~~~~~
[[qtiplot]]
qtiplot
^^^^^^^
http://soft.proindependent.com/qtiplot.html[qtiplot] is "a fully fledged plotting software similar
to the OriginLab http://www.originlab.com[Origin] software". It also is multiplatform, so it can
run on Windows as well as MacOSX. This is probably what researchers coming from the Windows
environment would expect to use when they want a quick look at their data. It has a fully
interactive GUI and while it takes some getting used to, it is fairly intuitive and has a number of
useful descriptive statistics features as well. Highly recommended.
[[gretl]]
gretl
^^^^^
http://gretl.sourceforge.net/[gretl] is not a plotting package 'per se' but a fully graphical
statistical package that incorporates gnuplot and as such is worth knowing about. It is quite
intuitive and the statistical functions are an added bonus for when you want to start applying them
to your data. It was originally developed with its own statistics engine but can start and use R as
well. Highly recommended.
[[quickplot]]
quickplot
^^^^^^^^^
http://quickplot.sourceforge.net/[quickplot] is a more primitive graphical plotting program but with
a very large capacity if you want to plot large numbers of points (say 10s or 100s of thousands of
points.) It can also read data from a pipe so you can place it at the end of a data pipeline to
show the result.
[[gnuplot]]
gnuplot
^^^^^^^
http://gnuplot.info/[gnuplot] is one of the most popular plotting applications for Linux. If
you spend a few minutes with it, you'll wonder why. If you persist and spend a few hours with it,
you'll understand. It's not a GUI program, altho *qgfe* provides a primitive GUI (but if you want a
GUI, try *qtiplot* above). gnuplot is really a scripting language for automatically plotting (and
replotting) complex data sets. To see this in action, you may first have to download the demo
scripts and then have gnuplot execute all of them with *gnuplot /where/the/demo/data/is/all.dem*.
Pretty impressive (and useful, as all the demo examples ARE the gnuplot scripts). It's an extremely
powerful plotting package, but it's not for dilettantes. Highly recommended (if you're willing to
spend the time).
[[pyxplot]]
pyxplot
^^^^^^^
http://www.pyxplot.org.uk[pyxplot] is a graphing package sort of like gnuplot in that it relies on
an input script, but instead of the fairly crude output of gnuplot, the output is
http://www.pyxplot.org.uk/examples/[truly publication quality] and the greater the complexity of the
plot, the more useful it is. Recommended if you need the quality or typesetting functionality.
[[matplotlib]]
Matplotlib
^^^^^^^^^^
http://matplotlib.sourceforge.net/[Matplotlib] is another Python Library that can be used to
generate very Matlab-like plots (see http://matplotlib.sourceforge.net/gallery.html[the gallery of
plots] to get an idea of the kind of versatility and quality that's available with this toolset).
Each plot comes with the Python code that generates it.
[[rplot]]
R's plot packages
^^^^^^^^^^^^^^^^^
http://en.wikipedia.org/wiki/R_(programming_language)[The R statistical language] has some
http://www.statmethods.net/advgraphs/index.html[very good plotting facilities] for both low and high
dimensional data. Like the gretl program described above, R combines impressive statistical
capabilities with graphics, although R is an interpreted language that uses commands to create the
graphs which are then generated in graphics windows. There are some R packages that are completely
graphical and there are moves afoot to put more effort into making a completely graphical version of
R, but for the most part, you'll be typing, not clicking. That said, it's an amazingly powerful
language and http://www.nytimes.com/2009/01/07/technology/business-computing/07program.html[one of
the best statistical environments available], commercial or free.
Notable is the http://had.co.nz/ggplot2/[ggplot2] package which produces elegant plots fairly simply
by providing excellent defaults - highly recommended. While the software is free, and the internal
documentation is fairly good, the author has written http://tinyurl.com/ggplot2-book[an entire book]
on it. While the entire book is not free, the introductory chapter
http://had.co.nz/ggplot2/book/qplot.pdf[Getting started with qplot] is, along with
http://had.co.nz/ggplot2/book/[all the code] mentioned in it.
Visualization of Multivariate Data
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
[[ggobi]]
ggobi
^^^^^
http://www.ggobi.org[ggobi] is on the border of "simple" and "complex", but it can be started
and used relatively easily and has some very compelling abilities. I have to admit to being a ggobi
fan for many years - it has some features that I haven't seen anywhere else which, if your data is
multivariate, really help you to understand it. It can be run by itself, but its real power is
obvious when you use it inside of R. That interface also obviates the grotacious requirement to