- The examples below use the files with the `.pdb` extension in the `shell-lesson-data/exercise-data/proteins` directory on `molly`.
- SSH to `molly`.
- Run the following commands to prepare the environment.
~~~
clear
cd
pwd
wget --no-check-certificate https://www.cs.wcupa.edu/lngo/data/shell-lesson-data.zip
unzip shell-lesson-data.zip
cd ~/shell-lesson-data/exercise-data/proteins
ls -l *.pdb
~~~
:::{image} fig/09-scripting-linux/pdb-list.png
:alt: List files in current directory
:class: bg-primary mb-1
:height: 200px
:align: center
:::
- To get counts of characters, words, and lines in a file, we use `wc`.
~~~
man wc
wc *.pdb
wc -l *.pdb
~~~
:::{image} fig/09-scripting-linux/wc-cli.png
:alt: Running wc command
:class: bg-primary mb-1
:height: 300px
:align: center
:::
- We can use `>` to redirect a command's output to a file:
- `>` writes output to a file, creating the file if it does not exist and overwriting it if it does.
- `>>` appends output to a file, creating the file if it does not exist.
~~~
ls
wc -l *.pdb > lengths.txt
ls
cat lengths.txt
wc -l *.pdb >> lengths.txt
cat lengths.txt
wc -l *.pdb > lengths.txt
cat lengths.txt
~~~
:::{image} fig/09-scripting-linux/wc-redirect.png
:alt: Redirect outputs to a file
:class: bg-primary mb-1
:height: 600px
:align: center
:::
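The difference between `>` and `>>` can also be checked in isolation. The following is a minimal sketch, not part of the lesson data: it works in a scratch directory created with `mktemp -d`, and the file name `log.txt` is made up for the example.

```shell
# Work in a throwaway directory so that no real files are touched.
scratch=$(mktemp -d)
cd "$scratch" || exit 1

echo "first"  > log.txt     # creates log.txt with one line
echo "second" >> log.txt    # appends, so log.txt now has two lines
appended_lines=$(wc -l < log.txt)

echo "third"  > log.txt     # overwrites, so log.txt is back to one line
overwritten_lines=$(wc -l < log.txt)

echo "after >>: $appended_lines line(s); after >: $overwritten_lines line(s)"
```

Reading the file with `wc -l < log.txt` keeps the file name out of the count, which makes the number easy to compare.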
- We can sort the contents of `lengths.txt` using `sort`
~~~
man sort
~~~
::::{admonition} Challenge: what does sort -n do?
:class: note
- Work out what `-n` does by observing the following two commands:
~~~
sort ~/shell-lesson-data/exercise-data/numbers.txt
10
19
2
22
6
~~~
~~~
sort -n ~/shell-lesson-data/exercise-data/numbers.txt
2
6
10
19
22
~~~
:::{dropdown} Solution
The `-n` option specifies a numerical rather than an alphanumerical sort.
:::
::::
- Let's look at `lengths.txt`:
~~~
sort -n lengths.txt
sort -n lengths.txt > sorted-lengths.txt
cat sorted-lengths.txt
~~~
:::{image} fig/09-scripting-linux/sort-cli.png
:alt: Redirect sorted outputs to a file
:class: bg-primary mb-1
:height: 200px
:align: center
:::
- We can use the `head` command to get the first line
~~~
head -n 1 sorted-lengths.txt
~~~
:::{image} fig/09-scripting-linux/head-cli.png
:alt: Run head to get the first line
:class: bg-primary mb-1
:height: 50px
:align: center
:::
- We used intermediate files to store output. We can use a pipe (`|`) to
combine them together.
~~~
sort -n lengths.txt | head -n 1
~~~
- We can combine multiple commands
~~~
wc -l *.pdb | sort -n | head -n 1
~~~
:::{image} fig/09-scripting-linux/pipe-multiple.png
:alt: Multiple commands connection via pipes
:class: bg-primary mb-1
:height: 100px
:align: center
:::
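To see that the pipeline and the intermediate-file approach give the same answer, the two can be run side by side. This is a sketch under stated assumptions: the three `.pdb` files are small stand-ins created in a scratch directory, not the lesson's protein files.

```shell
# Build three small sample files in a scratch directory (made-up content).
scratch=$(mktemp -d)
cd "$scratch" || exit 1
printf 'a\nb\nc\n' > three.pdb
printf 'a\n'       > one.pdb
printf 'a\nb\n'    > two.pdb

# Version with intermediate files.
wc -l *.pdb > lengths.txt
sort -n lengths.txt > sorted-lengths.txt
with_files=$(head -n 1 sorted-lengths.txt)

# Same result with a single pipeline and no intermediate files.
with_pipe=$(wc -l *.pdb | sort -n | head -n 1)

echo "$with_files"
echo "$with_pipe"
```

Both variables should hold the line for `one.pdb`, the shortest file; the pipeline simply avoids leaving `lengths.txt` and `sorted-lengths.txt` behind.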
::::{admonition} Challenge: piping commands together
:class: note
- In our current directory, we want to find the 3 files which have the fewest
lines. Which command listed below would work?
1. `wc -l * > sort -n > head -n 3`
2. `wc -l * | sort -n | head -n 1-3`
3. `wc -l * | head -n 3 | sort -n`
4. `wc -l * | sort -n | head -n 3`
:::{dropdown} Solution
- Option 4 is the solution. The pipe character `|` is used to connect the output
from one command to the input of another. `>` is used to redirect standard output
to a file. Try it in the `shell-lesson-data/exercise-data/proteins` directory!
:::
::::
::::{admonition} Challenge: pipe reading comprehension
:class: note
- A file called animals.csv (in the `shell-lesson-data/exercise-data/animal-counts` folder)
contains the following data:
~~~
cat ~/shell-lesson-data/exercise-data/animal-counts/animals.csv
2012-11-05,deer,5
2012-11-05,rabbit,22
2012-11-05,raccoon,7
2012-11-06,rabbit,19
2012-11-06,deer,2
2012-11-06,fox,4
2012-11-07,rabbit,16
2012-11-07,bear,1
~~~
- What text passes through each of the pipes and the final redirect in the
pipeline below? Note, the `sort -r` command sorts in reverse order.
~~~
cat animals.csv | head -n 5 | tail -n 3 | sort -r > final.txt
~~~
:::{dropdown} Solution
~~~
2012-11-06,rabbit,19
2012-11-06,deer,2
2012-11-05,raccoon,7
~~~
:::
::::
::::{admonition} Challenge: pipe construction
- For the file `animals.csv` from the previous exercise, consider the following command:
~~~
man cut
cut -d , -f 2 animals.csv
~~~
:::{image} fig/09-scripting-linux/cut-pipe.png
:alt: Piping the cut command
:class: bg-primary mb-1
:height: 100px
:align: center
:::
- The `uniq` command filters out adjacent matching lines in a file. How could you
extend this pipeline (using uniq and another command) to find out what animals the
file contains (without any duplicates in their names)?
:::{dropdown} Solution
~~~
cut -d , -f 2 animals.csv | sort | uniq
~~~
:::
::::
::::{admonition} Challenge: which pipe?
- The file `animals.csv` contains 8 lines of data formatted as follows:
~~~
2012-11-05,deer,5
2012-11-05,rabbit,22
2012-11-05,raccoon,7
2012-11-06,rabbit,19
...
~~~
The `uniq` command has a `-c` option which gives a count of the number of
times a line occurs in its input. Assuming your current directory is
`shell-lesson-data/exercise-data/animal-counts`, what command would you
use to produce a table that shows the total count of each type of animal in the file?
1. `sort animals.csv | uniq -c`
2. `sort -t, -k2,2 animals.csv | uniq -c`
3. `cut -d, -f 2 animals.csv | uniq -c`
4. `cut -d, -f 2 animals.csv | sort | uniq -c`
5. `cut -d, -f 2 animals.csv | sort | uniq -c | wc -l`
:::{dropdown} Solution
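Option 4 is the correct answer. `uniq -c` only counts adjacent matching lines, so the animal names must be sorted before they are counted.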
:::
::::
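The reason `sort` has to come before `uniq` in these pipelines is that `uniq` only collapses lines that are next to each other. A small sketch with a made-up list of animal names (not read from `animals.csv`) shows the difference:

```shell
# Made-up input: "deer" and "rabbit" each appear more than once,
# but only the last two "rabbit" lines are adjacent.
animals='deer
rabbit
deer
rabbit
rabbit'

# Without sort, only the adjacent duplicate is removed.
without_sort=$(printf '%s\n' "$animals" | uniq | wc -l)

# With sort, identical names become adjacent first, so each appears once.
with_sort=$(printf '%s\n' "$animals" | sort | uniq | wc -l)

echo "without sort: $without_sort line(s); with sort: $with_sort line(s)"
```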
- The sample data files are in the `north-pacific-gyre` directory described earlier. Let's check the integrity of this data:
~~~
cd ~/shell-lesson-data/north-pacific-gyre
ls -l
~~~
~~~
wc -l *.txt | sort -n | head -n 5
~~~
- Next, list the files that have `Z` in their names:
~~~
ls *Z.txt
~~~
Suppose we have several hundred genome data files with names such as `basilisk.dat`, `minotaur.dat`, and
`unicorn.dat`. For this example, we'll use the `exercise-data/creatures` directory which only
has three example files, but the principles can be applied to many more files at once.
The structure of these files is the same:
- The common name, classification, and updated date are
presented on the first three lines.
- The DNA sequences are given on the following lines.
Let's look at the files:
~~~
cd ~/shell-lesson-data/exercise-data/creatures/
head -n 5 basilisk.dat minotaur.dat unicorn.dat
~~~
:::{image} fig/09-scripting-linux/loop-creatures.png
:alt: Viewing DNA contents of mystical creatures
:class: bg-primary mb-1
:height: 300px
:align: center
:::
- We would like to print out the classification for each species, which is given on
the second line of each file.
- For each file, we would need to execute the command `head -n 2` and pipe this to `tail -n 1`.
- We’ll use a loop to solve this problem, but first let’s look at the general form of a loop:
~~~
for thing in list_of_things
do
operation_using $thing # Indentation within the loop is not required, but aids legibility
done
~~~
and we can apply this to our example like this:
~~~
for filename in basilisk.dat minotaur.dat unicorn.dat
> do
> head -n 2 $filename | tail -n 1
> done
~~~
:::{image} fig/09-scripting-linux/loop-classifications.png
:alt: Looping through the dat files to view creature classifications
:class: bg-primary mb-1
:height: 100px
:align: center
:::
::::{admonition} Follow the prompt
:class: note
The shell prompt changes from `$` to `>` and back again as we were
typing in our loop. The second prompt, `>`, is different to remind
us that we haven't finished typing a complete command yet. A semicolon, `;`,
can be used to separate two commands written on a single line.
::::
- When the shell sees the keyword `for`, it knows to repeat a command (or group of commands) once
for each item in a list.
- Inside the loop, we call for the variable's value by putting `$` in front of it. The `$` tells
the shell interpreter to treat the variable as a variable name and substitute its value in its place,
rather than treat it as text or an external command.
- In this example, the list is three filenames: `basilisk.dat`, `minotaur.dat`, and `unicorn.dat`.
Each time the loop iterates, it will assign a file name to the variable `filename` and run
the `head` command.
- The first time through the loop, `$filename` is `basilisk.dat`. The interpreter runs the
command `head` on `basilisk.dat` and pipes the first two lines to the `tail` command,
which then prints the second line of `basilisk.dat`.
- For the second iteration, `$filename` becomes `minotaur.dat`. This time, the shell runs
`head` on `minotaur.dat` and pipes the first two lines to the `tail` command, which then
prints the second line of `minotaur.dat`.
- For the third iteration, `$filename` becomes `unicorn.dat`, so the shell runs the `head`
command on that file, and `tail` on the output of that.
- Since the list was only three items, the shell exits the `for` loop.
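The iteration described above can be reproduced on stand-in data. The sketch below assumes two made-up files, `a.dat` and `b.dat`, whose second line holds the value of interest, mirroring the layout of the creature files; the names and contents are placeholders, not the lesson data.

```shell
# Create two tiny stand-in files in a scratch directory.
scratch=$(mktemp -d)
cd "$scratch" || exit 1
printf 'name-a\nclass-a\ndate-a\n' > a.dat
printf 'name-b\nclass-b\ndate-b\n' > b.dat

# On each pass the shell substitutes the current file name for $filename,
# then head keeps the first two lines and tail keeps the last of those.
classes=$(
    for filename in a.dat b.dat
    do
        head -n 2 "$filename" | tail -n 1
    done
)

echo "$classes"
```

Wrapping the loop in `$( ... )` is only there to capture its output in a variable; typed at the prompt, the loop prints the same two lines directly.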
::::{admonition} Same symbols, different meanings
:class: note
- Here we see `>` being used as a shell prompt, whereas `>` is also used to redirect output.
- Similarly, `$` is used as a shell prompt, but, as we saw earlier, it is also used to ask the shell to get the value of a variable.
- If the shell prints `>` or `$` then it expects you to type something, and the symbol is a prompt.
- If you type `>` or `$` yourself, it is an instruction from you that the shell should redirect output or get the value of a variable.
- Variable names can also be wrapped in curly braces to delimit them clearly: `$filename` is equivalent to `${filename}`, but is different from `${file}name`. You may find this notation in other people's programs.
::::
::::{admonition} Challenge: write your own loop
:class: note
- Write a loop that echoes all 10 numbers from 0 to 9.
:::{dropdown} Solution
~~~
for loop_variable in 0 1 2 3 4 5 6 7 8 9
> do
> echo $loop_variable
> done
~~~
:::
::::
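::::{admonition} Challenge: variables in loops
:class: note
- This exercise refers to the `shell-lesson-data/exercise-data/proteins` directory.
~~~
cd ~/shell-lesson-data/exercise-data/proteins/
ls *.pdb
~~~
- What is the output of the following loop?
~~~
for datafile in *.pdb
> do
> ls *.pdb
> done
~~~
- What is the output of this second loop, and why do the two loops give different outputs?
~~~
for datafile in *.pdb
> do
> ls $datafile
> done
~~~
:::{dropdown} Solution
- The first loop gives the same output on each iteration. Bash expands the wildcard `*.pdb` within the loop body (as well as before the loop starts) to match all files ending in `.pdb` and then lists them using `ls`.
- The second loop lists a different file on each iteration. The value of the `datafile` variable is evaluated using `$datafile`, and then listed using `ls`.
:::
::::
::::{admonition} Challenge: limiting sets of files
:class: note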
- What would be the output of running the following loop in the `shell-lesson-data/exercise-data/proteins` directory?
~~~
cd ~/shell-lesson-data/exercise-data/proteins/
for filename in c*
> do
> ls $filename
> done
~~~
- Only `cubane.pdb`, `octane.pdb` and `pentane.pdb` are listed.
- Only `cubane.pdb` is listed.

- How would the output differ if we used this loop instead?
~~~
cd ~/shell-lesson-data/exercise-data/proteins/
for filename in *c*
> do
> ls $filename
> done
~~~
- The files `cubane.pdb` and `octane.pdb` will be listed.
- Only the file `octane.pdb` will be listed.
:::{dropdown} Solution
- First loop: only `cubane.pdb` is listed. `*` matches zero or more characters, so any file name starting with the letter c, followed by zero or more other characters will be matched.
- Second loop: `cubane.pdb` and `octane.pdb` are listed. `*` matches zero or more characters, so a file name with zero or more characters before a letter c and zero or more characters after the letter c will be matched.
:::
::::
::::{admonition} Challenge: saving to a file in a Loop
:class: note
- In the `shell-lesson-data/exercise-data/proteins` directory, what is the effect of this loop?
~~~
cd ~/shell-lesson-data/exercise-data/proteins/
for alkanes in *.pdb
> do
> echo $alkanes
> cat $alkanes > alkanes.pdb
> done
~~~
1. Prints `cubane.pdb`, `ethane.pdb`, `methane.pdb`, `octane.pdb`, `pentane.pdb` and `propane.pdb`, and the text from `propane.pdb` will be saved to a file called `alkanes.pdb`.
2. Prints `cubane.pdb`, `ethane.pdb`, and `methane.pdb`, and the text from all three files would be concatenated and saved to a file called `alkanes.pdb`.
3. Prints `cubane.pdb`, `ethane.pdb`, `methane.pdb`, `octane.pdb`, and `pentane.pdb`, and the text from `propane.pdb` will be saved to a file called `alkanes.pdb`.

- Also in the `shell-lesson-data/exercise-data/proteins` directory, what would be the output of the following loop?
~~~
cd ~/shell-lesson-data/exercise-data/proteins/
for datafile in *.pdb
> do
> cat $datafile >> all.pdb
> done
~~~
1. All of the text from `cubane.pdb`, `ethane.pdb`, `methane.pdb`, `octane.pdb`, and `pentane.pdb` would be concatenated and saved to a file called `all.pdb`.
2. The text from `ethane.pdb` will be saved to a file called `all.pdb`.
3. All of the text from `cubane.pdb`, `ethane.pdb`, `methane.pdb`, `octane.pdb`, `pentane.pdb` and `propane.pdb` would be concatenated and saved to a file called `all.pdb`.
4. All of the text from `cubane.pdb`, `ethane.pdb`, `methane.pdb`, `octane.pdb`, `pentane.pdb` and `propane.pdb` would be printed to the screen and saved to a file called `all.pdb`.
:::{dropdown} Solution
- First loop: option 1. The text from each file in turn gets written to the `alkanes.pdb` file. However, the file gets overwritten on each loop iteration, so the final content of `alkanes.pdb` is the text from the `propane.pdb` file.
- Second loop: option 3. `>>` appends to a file, rather than overwriting it with the redirected output from a command. Given the output from the `cat` command has been redirected, nothing is printed to the screen.
:::
::::
- Run the following loop
- The shell starts by expanding `*.dat` to create the list of files it will process.
- The **loop body** then executes two commands for each of those files.
- The first command, `echo`, prints its command-line arguments to standard output.
In this case, since the shell expands `$filename` to be the name of a file,
`echo $filename` prints the name of the file.
- Finally, the `head` and `tail` combination selects lines 81-100
from whatever file is being processed (assuming the file has at least 100 lines).
~~~
cd ~/shell-lesson-data/exercise-data/creatures
for filename in *.dat
> do
> echo $filename
> head -n 100 $filename | tail -n 20
> done
~~~
- We would like to modify each of the files in `shell-lesson-data/exercise-data/creatures`,
but also save a version of the original files, naming the copies `original-basilisk.dat`,
`original-minotaur.dat`, and `original-unicorn.dat`.
- We can't use:
~~~
cp *.dat original-*.dat
~~~
because that would expand to:
~~~
cp basilisk.dat minotaur.dat unicorn.dat original-*.dat
~~~
This wouldn't back up our files; instead, we get an error:
~~~
cp: target `original-*.dat' is not a directory
~~~
- This problem arises when `cp` receives more than two inputs. When this happens, it
expects the last input to be a directory where it can copy all the files it was passed.
Since there is no directory named `original-*.dat` in the `creatures` directory we get an
error.
- Instead, we can use a loop:
~~~
for filename in *.dat
> do
> cp $filename original-$filename
> done
~~~
- Since the `cp` command does not normally produce any output, it's hard to check
that the loop is doing the correct thing. However, we learned earlier how to print strings
using `echo`, and we can modify the loop to use `echo` to print our commands without
actually executing them. As such we can check what commands *would be* run in the unmodified loop.
The following diagram
shows what happens when the modified loop is executed, and demonstrates how the
judicious use of `echo` is a good debugging technique.
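The `echo` technique can be tried safely on empty stand-in files. The sketch below is an illustration under assumptions: it recreates the three file names in a scratch directory with `touch`, runs the loop once with `echo` in front of `cp`, and then runs it for real.

```shell
# Empty stand-ins for the creature files, in a scratch directory.
scratch=$(mktemp -d)
cd "$scratch" || exit 1
touch basilisk.dat minotaur.dat unicorn.dat

# Dry run: print each cp command instead of executing it.
dry_run=$(
    for filename in *.dat
    do
        echo cp "$filename" "original-$filename"
    done
)
echo "$dry_run"
copies_after_dry_run=$(ls | grep -c '^original-')

# Real run: the same loop without echo.
for filename in *.dat
do
    cp "$filename" "original-$filename"
done
copies_after_real_run=$(ls | grep -c '^original-')

echo "copies after dry run: $copies_after_dry_run; after real run: $copies_after_real_run"
```

The dry run prints three `cp` commands and creates nothing; only the second loop produces the `original-` copies.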
Nelle is now ready to process her data files using `goostats.sh`, a shell script written by her supervisor. This calculates some statistics from a protein sample file, and takes two arguments: an input file containing the raw data and an output file to store the calculated statistics.

Since she's still learning how to use the shell, she decides to build up the required commands in stages. Her first step is to make sure that she can select the right input files: remember, these are ones whose names end in 'A' or 'B', rather than 'Z'. Starting from her home directory, Nelle types:
~~~
cd ~/shell-lesson-data/north-pacific-gyre
for datafile in NENE*A.txt NENE*B.txt
> do
> echo $datafile
> done
~~~
Her next step is to decide what to call the files that the `goostats.sh` analysis program will create. Prefixing each input file's name with 'stats' seems simple, so she modifies her loop to do that:
~~~
for datafile in NENE*A.txt NENE*B.txt
> do
> echo $datafile stats-$datafile
> done
~~~
She hasn't actually run `goostats.sh` yet, but now she's sure she can select the right files and generate the right output filenames.

Typing in commands over and over again is becoming tedious, though, and Nelle is worried about making mistakes, so instead of re-entering her loop, she presses ↑. In response, the shell redisplays the whole loop on one line (using semicolons to separate the pieces):
~~~
for datafile in NENE*A.txt NENE*B.txt; do echo $datafile stats-$datafile; done
~~~
Using the left arrow key, Nelle backs up and changes the command `echo` to `bash goostats.sh`:
~~~
for datafile in NENE*A.txt NENE*B.txt; do bash goostats.sh $datafile stats-$datafile; done
~~~
When she presses Enter, the shell runs the modified command. However, nothing appears to happen: there is no output. After a moment, Nelle realizes that since her script doesn't print anything to the screen any longer, she has no idea whether it is running, much less how quickly. She kills the running command by typing Ctrl+C, uses ↑ to repeat the command, and edits it to read:
~~~
for datafile in NENE*A.txt NENE*B.txt; do echo $datafile;
bash goostats.sh $datafile stats-$datafile; done
~~~
::::{admonition} Beginning and End
:class: note
We can move to the beginning of a line in the shell by typing Ctrl+A and to the end using Ctrl+E.
::::
When she runs her program now, it produces one line of output every five seconds or so. 1518 times 5 seconds, divided by 60, tells her that her script will take about two hours to run. As a final check, she opens another terminal window, goes into `north-pacific-gyre`, and uses `cat stats-NENE01729B.txt` to examine one of the output files. It looks good, so she decides to get some coffee and catch up on her reading.
::::{admonition} Those Who Know History Can Choose to Repeat It
:class: note
Another way to repeat previous work is to use the `history` command to get a list of the last few hundred commands that have been executed, and then to use `!123` (where '123' is replaced by the command number) to repeat one of those commands. For example, if Nelle types this:
~~~
history | tail -n 5
456 ls -l NENE0*.txt
457 rm stats-NENE01729B.txt.txt
458 bash goostats.sh NENE01729B.txt stats-NENE01729B.txt
459 ls -l NENE0*.txt
460 history
~~~
then she can re-run `goostats.sh` on `NENE01729B.txt` simply by typing `!458`.
::::
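::::{admonition} Challenge: doing a dry run
:class: note
- One way to check what a loop would do is to `echo` the commands it would run instead of actually running them.
- Suppose we want to preview the commands the following loop will execute without actually running them:
~~~
for datafile in *.pdb
> do
> cat $datafile >> all.pdb
> done
~~~
- What is the difference between the two loops below, and which one would we want to run?
~~~
# Version 1
for datafile in *.pdb
> do
> echo cat $datafile >> all.pdb
> done
~~~
~~~
# Version 2
for datafile in *.pdb
> do
> echo "cat $datafile >> all.pdb"
> done
~~~
:::{dropdown} Solution
- Version 2 is the one we want to run. It prints everything inside the quote marks to the screen, expanding the loop variable, and it neither modifies nor creates the file `all.pdb`, as the `>>` is treated literally as part of a string rather than as a redirection instruction.
- Version 1 appends the output from the command `echo cat $datafile` to the file `all.pdb`. This file will just contain the list: `cat cubane.pdb`, `cat ethane.pdb`, `cat methane.pdb`, etc.
- Try both versions for yourself, and open the `all.pdb` file to view its contents.
:::
::::
::::{admonition} Challenge: nested loops
:class: note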
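- What would be the result of the following nested loop?
~~~
for species in cubane ethane methane
> do
> for temperature in 25 30 37 40
> do
> mkdir $species-$temperature
> done
> done
~~~
:::{dropdown} Solution
- One loop is nested inside the other: for each species in the outer loop, the inner loop iterates over the list of temperatures and creates a new directory for each combination, giving 12 directories named from `cubane-25` to `methane-40`.
:::
::::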
- Let's start by going back to `~/shell-lesson-data/exercise-data/proteins` and creating a new file,
`middle.sh`, which will become our shell script:
~~~
cd ~/shell-lesson-data/exercise-data/proteins
nano middle.sh
cat middle.sh
~~~
- Add the following line to `middle.sh` and save:
- `head -n 15 octane.pdb | tail -n 5`
- Once we have saved the file, we can ask the shell to execute the commands it contains.
Our shell is called `bash`, so we run the following command:
~~~
bash middle.sh
~~~
:::{image} fig/09-scripting-linux/script-middle.png
:alt: First simple script
:class: bg-primary mb-1
:height: 200px
:align: center
:::
::::{admonition} Text vs. Whatever
:class: note
We usually call programs like Microsoft Word or LibreOffice Writer *text
editors*, but we need to be a bit more careful when it comes to
programming. By default, Microsoft Word uses `.docx` files to store not
only text, but also formatting information about fonts, headings, and so
on. This extra information isn't stored as characters and doesn't mean
anything to tools like `head`: they expect input files to contain
nothing but the letters, digits, and punctuation on a standard computer
keyboard. When editing programs, therefore, you must either use a plain
text editor, or be careful to save files as plain text.
::::
- What if we want to select lines from an arbitrary file? We could edit
`middle.sh` each time to change the filename, but that would probably
take longer than typing the command out again in the shell and
executing it with a new file name. Instead, let's edit `middle.sh`
and make it more versatile:
- Edit `middle.sh` and replace the text `octane.pdb` with the special variable called `$1`.
- Wrap `$1` inside double quotes: `"$1"`.
- `$1` means 'the first filename (or other argument) on the command line'.
~~~
nano middle.sh
cat middle.sh
bash middle.sh octane.pdb
bash middle.sh pentane.pdb
~~~
:::{image} fig/09-scripting-linux/script-arguments.png
:alt: script with command line arguments
:class: bg-primary mb-1
:height: 300px
:align: center
:::
- Currently, we need to edit `middle.sh` each time we want to adjust the range of
lines that is returned. Let's fix that by configuring our script to use three
command-line arguments instead.
- Each additional argument that we provide after the first is accessible via its own
special variable: `$1`, `$2`, and `$3` refer to the first, second, and third
command-line arguments, respectively.
- Edit `middle.sh` and replace `15` with `"$2"` and `5` with `"$3"`.
~~~
nano middle.sh
cat middle.sh
bash middle.sh pentane.pdb 15 5
~~~
- By changing the arguments to our command we can change our script's
behaviour:
~~~
bash middle.sh pentane.pdb 20 5
~~~
- This works, but it may take the next person who reads `middle.sh` a moment to
figure out what it does. We can improve our script by adding some **comments** at the top:
- A comment starts with a `#` character and runs to the end of the line.
- Add the following comments to `middle.sh` at the top:
- `# Select lines from the middle of a file.`
- `# Usage: bash middle.sh filename end_line num_lines`
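Putting the edits together, the finished `middle.sh` might look like the sketch below. It is written to a scratch directory with a here-document rather than with `nano`, and it is exercised on a made-up 30-line file, `sample.txt`, instead of a `.pdb` file; both choices are assumptions made so the example can run anywhere.

```shell
scratch=$(mktemp -d)
cd "$scratch" || exit 1

# Write the script; quoting 'EOF' keeps $1, $2 and $3 literal in the file.
cat > middle.sh <<'EOF'
# Select lines from the middle of a file.
# Usage: bash middle.sh filename end_line num_lines
head -n "$2" "$1" | tail -n "$3"
EOF

# A 30-line stand-in input file whose lines are the numbers 1 to 30.
seq 1 30 > sample.txt

# Take the first 15 lines, then keep the last 5 of those.
middle=$(bash middle.sh sample.txt 15 5)
echo "$middle"
```

With `end_line` set to 15 and `num_lines` set to 5, the script selects lines 11 through 15 of the input.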
- What if we want to process many files in a single pipeline?
For example, if we want to sort our `.pdb` files by length, we would type
the following command because `wc -l` lists the number of lines in the files
and `sort -n` sorts things numerically.
~~~
wc -l *.pdb | sort -n
~~~
- We could put this in a file, but then it would only ever sort a list of `.pdb` files
in the current directory. If we want to be able to get a sorted list of other kinds of files,
we need a way to get all those names into the script.
- We can't use `$1`, `$2`, and so on because we don't know how many files there are.
- Instead, we use the special variable `$@`, which means,
'All of the command-line arguments to the shell script'.
- We also should put `$@` inside double-quotes to handle the case of arguments
containing spaces (`"$@"` is special syntax and is equivalent to `"$1"` `"$2"` ...).
- Create a file called `sorted.sh` inside `shell-lesson-data/exercise-data/proteins` with
the following contents:
~~~
# Sort files by their length.
# Usage: bash sorted.sh one_or_more_filenames
wc -l "$@" | sort -n
~~~
- Observe the following commands:
~~~
cd ~/shell-lesson-data/exercise-data/proteins
nano sorted.sh
cat sorted.sh
bash sorted.sh *.pdb ../creatures/*.dat
~~~
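The effect of quoting `$@` is easiest to see by counting arguments. The sketch below uses small shell functions as stand-ins for a script body (positional parameters behave the same way inside a function); the function names and the file name `two lines.txt` are made up for illustration.

```shell
# count_args reports how many arguments it received.
count_args() { echo "$#"; }

# Two ways of forwarding the arguments, with and without double quotes.
forward_quoted()   { count_args "$@"; }
forward_unquoted() { count_args $@; }

# One of the two names contains a space.
quoted=$(forward_quoted 'two lines.txt' one.txt)
unquoted=$(forward_unquoted 'two lines.txt' one.txt)

echo "quoted: $quoted argument(s); unquoted: $unquoted argument(s)"
```

With the quotes, the two names arrive intact; without them, the shell splits `two lines.txt` at the space and the receiving command sees three arguments.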
- To turn your script into an executable file (one that runs without the `bash` command), the
following line must be at the top of your script:
~~~
#!/bin/bash
~~~
- and your script file must have executable permission:
~~~
chmod 755 sorted.sh
./sorted.sh *.pdb
~~~
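The two requirements, the `#!/bin/bash` line and the executable permission, can be checked together on a throwaway copy of the script. This sketch assumes `/bin/bash` exists on the system and uses two made-up input files, `long.txt` and `short.txt`.

```shell
scratch=$(mktemp -d)
cd "$scratch" || exit 1

# Recreate sorted.sh with the interpreter line at the top.
cat > sorted.sh <<'EOF'
#!/bin/bash
# Sort files by their length.
# Usage: ./sorted.sh one_or_more_filenames
wc -l "$@" | sort -n
EOF

printf 'a\nb\n' > long.txt
printf 'a\n'    > short.txt

# Grant execute permission, then run the script directly.
chmod 755 sorted.sh
shortest=$(./sorted.sh long.txt short.txt | head -n 1)
echo "$shortest"
```

The leading `./` tells the shell to look for `sorted.sh` in the current directory rather than on the `PATH`.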
::::{admonition} Challenge: list unique species
:class: note
- Leah has several hundred data files, each of which is formatted like this:
~~~
2013-11-05,deer,5
2013-11-05,rabbit,22
2013-11-05,raccoon,7
2013-11-06,rabbit,19
2013-11-06,deer,2
2013-11-06,fox,1
2013-11-07,rabbit,18
2013-11-07,bear,1
~~~
- An example of this type of file is given in
`shell-lesson-data/exercise-data/animal-counts/animals.csv`.
- We can use the command `cut -d , -f 2 animals.csv | sort | uniq` to produce
the unique species in `animals.csv`.
- In order to avoid having to type out this series of commands every time,
a scientist may choose to write a shell script instead.
- Write a shell script called `species.sh` that takes any number of
filenames as command-line arguments, and uses a variation of the above command
to print a list of the unique species appearing in each of those files separately.
:::{dropdown} Solution
~~~
#!/bin/bash
# Script to find unique species in csv files where species is the second data field
# This script accepts any number of file names as command line arguments
# Loop over all files
for file in "$@"
do
echo "Unique species in $file:"
# Extract species names
cut -d , -f 2 "$file" | sort | uniq
done
done
~~~
:::
::::
- Suppose we have just run a series of commands that did something useful --- for example,
that created a graph we'd like to use in a paper. We'd like to be able to re-create the
graph later if we need to, so we want to save the commands in a file.
- Instead of typing them in again (and potentially getting them wrong) we can do this:
~~~
history | tail -n 5 > redo-figure-3.sh
~~~
The file `redo-figure-3.sh` could now contain:
~~~
297 bash goostats.sh NENE01729B.txt stats-NENE01729B.txt
298 bash goodiff.sh stats-NENE01729B.txt /data/validated/01729.txt > 01729-differences.txt
299 cut -d ',' -f 2-3 01729-differences.txt > 01729-time-series.txt
300 ygraph --format scatter --color bw --borders none 01729-time-series.txt figure-3.png
301 history | tail -n 5 > redo-figure-3.sh
~~~
- After a moment's work in an editor to remove the serial numbers on the commands,
and to remove the final line where we called the `history` command,
we have a completely accurate record of how we created that figure.
- In practice, most people develop shell scripts by running commands at the shell prompt a few
times to make sure they're doing the right thing, then saving them in a file for re-use.
- This style of work allows people to recycle what they discover about their data and their
workflow with one call to `history` and a bit of editing to clean up the output
and save it as a shell script.
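The clean-up step does not have to be done by hand in an editor. As one possible approach (an assumption, not something the lesson prescribes), `sed` can drop the final `history` line and strip the leading serial numbers. The sketch below runs on a made-up three-line sample instead of real `history` output.

```shell
# Stand-in for the output of `history | tail -n 3` (made-up commands).
saved='  297  wc -l *.pdb
  298  sort -n lengths.txt
  299  history | tail -n 3 > redo.sh'

# Delete the last line (the history call itself), then remove the
# leading spaces, serial number, and following spaces from each line.
cleaned=$(printf '%s\n' "$saved" | sed '$d' | sed -E 's/^ *[0-9]+ +//')

echo "$cleaned"
```

The result still deserves a read-through before it is saved as a script, since `history` records mistakes as faithfully as the commands that worked.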
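Nelle's supervisor insisted that all her analytics must be reproducible. The easiest way to capture all the steps is in a script.

First we return to Nelle's project directory:
~~~
cd ../../north-pacific-gyre/
~~~
She creates a file using `nano`:
~~~
nano do-stats.sh
~~~
which contains the following:
~~~
#!/bin/bash
# Calculate stats for data files.
for datafile in "$@"
do
echo $datafile
bash goostats.sh $datafile stats-$datafile
done
~~~
She saves this in a file called `do-stats.sh` and sets executable mode so that she can now re-do the first stage of her analysis by typing:
~~~
./do-stats.sh NENE*A.txt NENE*B.txt
~~~
She can also pipe the output to `wc -l`, so that the output is just the number of files processed rather than their names:
~~~
./do-stats.sh NENE*A.txt NENE*B.txt | wc -l
~~~
Nelle's script lets the person running it decide which files to process. She could instead have written the file patterns into the script itself:
~~~
#!/bin/bash
# Calculate stats for Site A and Site B data files.
for datafile in NENE*A.txt NENE*B.txt
do
echo $datafile
bash goostats.sh $datafile stats-$datafile
done
~~~
This version always selects the same files, so it cannot be run on other files without editing the script. A more adventurous version could check for command-line arguments and use `NENE*A.txt NENE*B.txt` if none were provided. Of course, this introduces another tradeoff between flexibility and complexity.
::::{admonition} Challenge: variables in shell scripts
:class: note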
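- In the `proteins` directory, imagine you have a shell script called `script.sh` containing the following commands:
~~~
#!/bin/bash
head -n $2 $1
tail -n $3 $1
~~~
- While you are in the `proteins` directory, you type the following command:
~~~
./script.sh '*.pdb' 1 1
~~~
- Which of the following outputs would you expect to see?
1. All of the lines between the first and the last lines of each file ending in `.pdb` in the `proteins` directory
2. The first and the last line of each file ending in `.pdb` in the `proteins` directory
3. The first and the last line of each file in the `proteins` directory
4. An error because of the quotes around `*.pdb`
:::{dropdown} Solution
- The correct answer is option 2. The special variables `$1`, `$2` and `$3` represent the command-line arguments given to the script, such that the commands run are:
~~~
head -n 1 cubane.pdb ethane.pdb methane.pdb octane.pdb pentane.pdb propane.pdb
tail -n 1 cubane.pdb ethane.pdb methane.pdb octane.pdb pentane.pdb propane.pdb
~~~
- The shell does not expand `'*.pdb'` on the command line because it is enclosed by quote marks. As such, the first argument to the script is `'*.pdb'`, which the shell running the script expands when it reaches the unquoted `$1` in the `head` and `tail` commands.
:::
::::
::::{admonition} Challenge: find the longest file with a given extension
:class: note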
- Write a shell script called `longest.sh` that takes the name of a directory and a filename extension as its arguments, and prints out the name of the file with the most lines in that directory with that extension. For example:
~~~
./longest.sh shell-lesson-data/exercise-data/proteins pdb
~~~
would print the name of the `.pdb` file in `shell-lesson-data/exercise-data/proteins` that has the most lines.
- Feel free to test your script on another directory, e.g.
~~~
bash longest.sh shell-lesson-data/writing/data txt
~~~
:::{dropdown} Solution
~~~
#!/bin/bash
# Shell script which takes two arguments:
# 1. a directory name
# 2. a file extension
# and prints the name of the file in that directory
# with the most lines which matches the file extension.
wc -l $1/*.$2 | sort -n | tail -n 2 | head -n 1
~~~
- The first part of the pipeline, `wc -l $1/*.$2 | sort -n`, counts the lines in each file and sorts them numerically (largest last). When there's more than one file, `wc` also outputs a final summary line, giving the total number of lines across all files. We use `tail -n 2 | head -n 1` to throw away this last line.
- With `wc -l $1/*.$2 | sort -n | tail -n 1` we'll see the final summary line: we can build our pipeline up in pieces to be sure we understand the output.
:::
::::
::::{admonition} Challenge: script reading comprehension
:class: note
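- For this question, consider the `shell-lesson-data/exercise-data/proteins` directory once again. This contains a number of `.pdb` files in addition to any other files you may have created.
- Explain what each of the following three scripts would do when run as `bash script1.sh *.pdb`, `bash script2.sh *.pdb`, and `bash script3.sh *.pdb` respectively.
~~~
# Script 1
echo *.*
~~~
~~~
# Script 2
for filename in $1 $2 $3
do
cat $filename
done
~~~
~~~
# Script 3
echo $@.pdb
~~~
:::{dropdown} Solution
In each case, the shell expands the wildcard in `*.pdb` before passing the resulting list of file names as arguments to the script.
- Script 1 would print a list of all files containing a dot in their name. The arguments passed to the script are not used anywhere in the script.
- Script 2 would print the contents of the first 3 files with a `.pdb` file extension. `$1`, `$2`, and `$3` refer to the first, second, and third argument respectively.
- Script 3 would print all the arguments to the script (i.e. all the `.pdb` files), followed by `.pdb`. `$@` refers to all the arguments given to a shell script.
~~~
cubane.pdb ethane.pdb methane.pdb octane.pdb pentane.pdb propane.pdb.pdb
~~~
:::
::::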
::::{admonition} Challenge: debugging scripts
:class: note
- Suppose you have saved the following script in a file called `do-errors.sh` in Nelle's `north-pacific-gyre/scripts` directory:
~~~
# Calculate stats for data files.
for datafile in "$@"
do
echo $datfile
bash goostats.sh $datafile stats-$datafile
done
~~~
- When you run it from the `north-pacific-gyre` directory, the output is blank.
~~~
bash do-errors.sh NENE*A.txt NENE*B.txt
~~~
- To figure out why, re-run the script using the `-x` option:
~~~
bash -x do-errors.sh NENE*A.txt NENE*B.txt
~~~
:::{dropdown} Solution
- The `-x` option causes `bash` to run in debug mode, printing each command as it is run, which helps to locate errors.
- In this example, we can see that `echo` isn't printing anything. We have made a typo in the loop variable name, and the variable `datfile` doesn't exist, hence returning an empty string.
:::
::::