Suppose we have several hundred genome data files with names like basilisk.dat, minotaur.dat, and unicorn.dat. For this example, we’ll use the exercise-data/creatures directory, which has only three example files, but the same principles apply to many more files at once.


cd ~/shell-lesson-data/exercise-data/creatures/
head -n 5 basilisk.dat minotaur.dat unicorn.dat


for thing in list_of_things
do
    operation_using $thing    # Indentation within the loop is not required, but aids legibility
done


for filename in basilisk.dat minotaur.dat unicorn.dat
> do
>   head -n 2 $filename | tail -n 1
> done
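
To see what this loop does without the lesson data at hand, here is a minimal sketch that builds three small stand-in files in a temporary directory and prints the second line of each. The file contents are made up for illustration, and the variable is written as `"$filename"` with quotes, which only matters when a name contains spaces.

~~~bash
# Work in a throwaway directory so nothing real is touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Stand-in files; the real .dat files hold genome data.
for name in basilisk minotaur unicorn
do
    printf 'COMMON NAME: %s\nline two of %s\nline three\n' "$name" "$name" > "$name.dat"
done

# head keeps the first two lines, tail keeps the last of those: line 2.
second_lines=$(
    for filename in basilisk.dat minotaur.dat unicorn.dat
    do
        head -n 2 "$filename" | tail -n 1
    done
)
echo "$second_lines"
~~~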

More complicated loop


cd ~/shell-lesson-data/exercise-data/creatures
for filename in *.dat
> do
>   echo $filename
>   head -n 100 $filename | tail -n 20
> done


cp *.dat original-*.dat


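The shell expands *.dat before cp runs, so with the three files in this directory the command above is equivalent to:

cp basilisk.dat minotaur.dat unicorn.dat original-*.dat

No file matches original-*.dat, so that pattern is passed to cp unchanged. Given more than two arguments, cp expects the last one to be a directory, and it reports an error:

cp: target `original-*.dat' is not a directory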


for filename in *.dat
> do
>   cp $filename original-$filename
> done

The following diagram shows what happens when the modified loop is executed, and demonstrates how the judicious use of echo is a good debugging technique.
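
As a sketch of that debugging technique, the loop below puts `echo` in front of the `cp` command, so each command is printed instead of run. It uses empty stand-in files in a temporary directory, which is an assumption made for illustration; in the lesson the files already exist.

~~~bash
workdir=$(mktemp -d)
cd "$workdir" || exit 1
touch basilisk.dat minotaur.dat unicorn.dat   # empty stand-in files

# Dry run: print each cp command rather than executing it.
preview=$(
    for filename in *.dat
    do
        echo cp "$filename" "original-$filename"
    done
)
echo "$preview"

# Nothing has been copied yet, so no original-* files are listed.
ls
~~~

Once the printed commands look right, removing `echo` runs them for real.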

Nelle’s Pipeline: Processing Files

Nelle is now ready to process her data files using goostats.sh, a shell script written by her supervisor. It calculates some statistics from a protein sample file and takes two arguments: the input file to read and the name of the output file to write.

Since she’s still learning how to use the shell, she decides to build up the required commands in stages. Her first step is to make sure that she can select the right input files — remember, these are ones whose names end in ‘A’ or ‘B’, rather than ‘Z’. Starting from her home directory, Nelle types:


cd ~/shell-lesson-data/north-pacific-gyre
for datafile in NENE*A.txt NENE*B.txt
> do
>     echo $datafile
> done

Her next step is to decide what to call the files that the goostats.sh analysis program will create. Prefixing each input file’s name with ‘stats’ seems simple, so she modifies her loop to do that:


for datafile in NENE*A.txt NENE*B.txt
> do
>     echo $datafile stats-$datafile
> done

She hasn’t actually run goostats.sh yet, but now she’s sure she can select the right files and generate the right output filenames.

Typing in commands over and over again is becoming tedious, though, and Nelle is worried about making mistakes, so instead of re-entering her loop, she presses ↑. In response, the shell redisplays the whole loop on one line (using semi-colons to separate the pieces):


for datafile in NENE*A.txt NENE*B.txt; do echo $datafile stats-$datafile; done

Using the left arrow key, Nelle backs up and changes the command echo to bash goostats.sh:


for datafile in NENE*A.txt NENE*B.txt; do bash goostats.sh $datafile stats-$datafile; done

When she presses Enter, the shell runs the modified command. However, nothing appears to happen — there is no output. After a moment, Nelle realizes that since her script doesn’t print anything to the screen any longer, she has no idea whether it is running, much less how quickly. She kills the running command by typing Ctrl+C, uses ↑ to repeat the command, and edits it to read:


for datafile in NENE*A.txt NENE*B.txt; do echo $datafile;
bash goostats.sh $datafile stats-$datafile; done

<details class="details details--default" data-variant="default"><summary>Beginning and End</summary>
<ul>
  <li>We can move to the beginning of a line in the shell by typing 
<kbd>Ctrl</kbd>+<kbd>A</kbd> and to the end using <kbd>Ctrl</kbd>+<kbd>E</kbd>.</li>
</ul>

</details>
When she runs her program now, it produces one line of output every five seconds or so.
Multiplying 1518 by 5 seconds and dividing by 60 gives a little over 126 minutes, so her script will take about two hours to run.
As a final check, she opens another terminal window, goes into `north-pacific-gyre`,
and uses `cat stats-NENE01729B.txt` to examine one of the output files.
It looks good, so she decides to get some coffee and catch up on her reading.
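
The two-hour estimate can be checked with the shell's integer arithmetic. This is only a sketch of the calculation in the text; the five-second figure is Nelle's rough observation, and the variable names are made up for the example.

~~~bash
runs=1518             # count given in the text (presumably one run per data file)
seconds_per_run=5     # approximate time per run, from the text

total_seconds=$(( runs * seconds_per_run ))
total_minutes=$(( total_seconds / 60 ))   # integer division drops the remainder

echo "$total_seconds seconds is about $total_minutes minutes"
~~~

Shell arithmetic works on whole numbers only, so the half minute left over is discarded.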

<details class="details details--default" data-variant="default"><summary>Those Who Know History Can Choose to Repeat It</summary>
<p>Another way to repeat previous work is to use the <code class="language-plaintext highlighter-rouge">history</code> command to 
get a list of the last few hundred commands that have been executed, and 
then to use <code class="language-plaintext highlighter-rouge">!123</code> (where ‘123’ is replaced by the command number) to 
repeat one of those commands. For example, if Nelle types this:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-code"><pre><span class="nb">history</span> | <span class="nb">tail</span> <span class="nt">-n</span> 5
  456  <span class="nb">ls</span> <span class="nt">-l</span> NENE0<span class="k">*</span>.txt
  457  <span class="nb">rm </span>stats-NENE01729B.txt.txt
  458  bash goostats.sh NENE01729B.txt stats-NENE01729B.txt
  459  <span class="nb">ls</span> <span class="nt">-l</span> NENE0<span class="k">*</span>.txt
  460  <span class="nb">history</span>
</pre></td></tr></tbody></table></code></pre></div></div>

<p>then she can re-run <code class="language-plaintext highlighter-rouge">goostats.sh</code> on <code class="language-plaintext highlighter-rouge">NENE01729B.txt</code> simply by typing
<code class="language-plaintext highlighter-rouge">!458</code>.</p>

</details>
<details class="details details--default" data-variant="default"><summary>Challenge: doing a dry run</summary>
<ul>
  <li>A loop is a way to do many things at once — or to make many mistakes at
once if it does the wrong thing. One way to check what a loop <em>would</em> do
is to <code class="language-plaintext highlighter-rouge">echo</code> the commands it would run instead of actually running them.</li>
  <li>Suppose we want to preview the commands the following loop will execute
without actually running those commands:</li>
</ul>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-code"><pre><span class="k">for </span>datafile <span class="k">in</span> <span class="k">*</span>.pdb
<span class="o">&gt;</span> <span class="k">do</span>
<span class="o">&gt;</span>   <span class="nb">cat</span> <span class="nv">$datafile</span> <span class="o">&gt;&gt;</span> all.pdb
<span class="o">&gt;</span> <span class="k">done</span>
</pre></td></tr></tbody></table></code></pre></div></div>

<ul>
  <li>What is the difference between the two loops below, and which one would we
want to run?</li>
</ul>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-code"><pre><span class="c"># Version 1</span>
<span class="k">for </span>datafile <span class="k">in</span> <span class="k">*</span>.pdb
<span class="o">&gt;</span> <span class="k">do</span>
<span class="o">&gt;</span>   <span class="nb">echo cat</span> <span class="nv">$datafile</span> <span class="o">&gt;&gt;</span> all.pdb
<span class="o">&gt;</span> <span class="k">done</span>
</pre></td></tr></tbody></table></code></pre></div></div>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Version 2
for datafile in *.pdb
&gt; do
&gt;   echo "cat $datafile &gt;&gt; all.pdb"
&gt; done
</code></pre></div></div>

<details class="details details--note" data-variant="note"><summary>Solution</summary>
<ul>
  <li>The second version is the one we want to run.
It prints to the screen everything enclosed in the quote marks, expanding the 
loop variable name because we have prefixed it with a dollar sign. 
It also neither modifies nor creates the file <code class="language-plaintext highlighter-rouge">all.pdb</code>, because the <code class="language-plaintext highlighter-rouge">&gt;&gt;</code> 
is treated literally as part of a string rather than as a 
redirection instruction.</li>
  <li>The first version appends the output of the command <code class="language-plaintext highlighter-rouge">echo cat $datafile</code> 
to the file <code class="language-plaintext highlighter-rouge">all.pdb</code>. That file will contain only the list 
<code class="language-plaintext highlighter-rouge">cat cubane.pdb</code>, <code class="language-plaintext highlighter-rouge">cat ethane.pdb</code>, <code class="language-plaintext highlighter-rouge">cat methane.pdb</code>, etc.</li>
  <li>Try both versions for yourself to see the output! Be sure to change to the 
proper directory and open the <code class="language-plaintext highlighter-rouge">all.pdb</code> file to view its contents.</li>
</ul>

</details>
</details>
<details class="details details--default" data-variant="default"><summary>Challenge: nested loops</summary>
<ul>
  <li>Suppose we want to set up a directory structure to organize 
some experiments measuring reaction rate constants with different compounds 
<em>and</em> different temperatures.  What would be the result of the following code:</li>
</ul>

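<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>for species in cubane ethane methane
&gt; do
&gt;   for temperature in 25 30 37 40
&gt;   do
&gt;     mkdir $species-$temperature
&gt;   done
&gt; done
</code></pre></div></div>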

<details class="details details--note" data-variant="note"><summary>Solution</summary>
<ul>
  <li>We have a nested loop, i.e. a loop contained within another loop. For each species
in the outer loop, the inner loop iterates over the list of
temperatures and creates a new directory for each combination.</li>
  <li>Try running the code for yourself to see which directories are created!</li>
</ul>

</details>
</details>
---

## Shell scripting

    - Let's start by going back to `~/shell-lesson-data/exercise-data/proteins` and creating a new file, 
    `middle.sh`, which will become our shell script:

    ~~~bash
    cd ~/shell-lesson-data/exercise-data/proteins
    nano middle.sh
    cat middle.sh
    ~~~

    - Add the following line to `middle.sh` and save:
      - `head -n 15 octane.pdb | tail -n 5`
    - Once we have saved the file, we can ask the shell to execute the commands it contains.
    Our shell is called `bash`, so we run the following command:

    ~~~bash
    bash middle.sh
    ~~~

    



<figure
  
>
  <picture>
    <!-- Auto scaling with imagemagick -->
    <!--
      See https://www.debugbear.com/blog/responsive-images#w-descriptors-and-the-sizes-attribute and
      https://developer.mozilla.org/en-US/docs/Learn/HTML/Multimedia_and_embedding/Responsive_images for info on defining 'sizes' for responsive images
    -->
    
      
        <source
          class="responsive-img-srcset"
          
            srcset="/assets/img/courses/csc586/09-scripting-linux/script-middle-480.webp 480w,/assets/img/courses/csc586/09-scripting-linux/script-middle-800.webp 800w,/assets/img/courses/csc586/09-scripting-linux/script-middle-1400.webp 1400w,"
            type="image/webp"
          
          
            sizes="95vw"
          
        >
      
    
    <img
      src="/assets/img/courses/csc586/09-scripting-linux/script-middle.png"
      
      
        width="50%"
      
      
        height="auto"
      
      
      
      
      
        data-zoomable
      
      
        loading="lazy"
      
      onerror="this.onerror=null; $('.responsive-img-srcset').remove();"
    >
  </picture>

  
</figure>

    

<details class="details details--default" data-variant="default"><summary>Text vs. Whatever</summary>
<p>We usually call programs like Microsoft Word or LibreOffice Writer <em>text 
editors</em>, but we need to be a bit more careful when it comes to 
programming. By default, Microsoft Word uses <code class="language-plaintext highlighter-rouge">.docx</code> files to store not 
only text, but also formatting information about fonts, headings, and so 
on. This extra information isn’t stored as characters and doesn’t mean 
anything to tools like <code class="language-plaintext highlighter-rouge">head</code>: they expect input files to contain 
nothing but the letters, digits, and punctuation on a standard computer 
keyboard. When editing programs, therefore, you must either use a plain 
text editor, or be careful to save files as plain text.</p>

</details>
- What if we want to select lines from an arbitrary file? We could edit 
`middle.sh` each time to change the filename, but that would probably 
take longer than typing the command out again in the shell and 
executing it with a new file name. Instead, let's edit `middle.sh` 
and make it more versatile:
  - Edit `middle.sh` and replace the text `octane.pdb` with the special variable called `$1`. 
    - Wrap `$1` inside double quotes, `"$1"`, so that a filename containing spaces is still passed as a single argument. 
  - `$1` means 'the first filename (or other argument) on the command line'.
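
After that edit, `middle.sh` would contain a single command along the lines of the sketch below. Here the script is written with a here-document and run against a stand-in 20-line file whose name contains a space. Both are assumptions made so the example is self-contained; in the lesson the script is edited in `nano` and run on the real `.pdb` files.

~~~bash
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Stand-in input with a space in its name, to show why "$1" is quoted.
seq 1 20 > "my molecule.pdb"

# The quoted 'EOF' keeps $1 from being expanded while the file is written.
cat > middle.sh <<'EOF'
# Select lines 11-15 of the file named by the first argument.
head -n 15 "$1" | tail -n 5
EOF

selected=$(bash middle.sh "my molecule.pdb")
echo "$selected"
~~~

Without the quotes around `$1`, the script would see `my` and `molecule.pdb` as two separate filenames and `head` would fail to find either.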

~~~bash
nano middle.sh
cat middle.sh
bash middle.sh octane.pdb
bash middle.sh pentane.pdb
~~~

nano middle.sh
cat middle.sh
bash middle.sh pentane.pdb 15 5


 bash middle.sh pentane.pdb 20 5
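
The two calls above pass extra numbers, which suggests that `middle.sh` has been changed to read them as `$2` and `$3`. The edited script is not shown here, so the version below is an assumption: `$2` is taken as the number of lines for `head` and `$3` as the number of lines for `tail`. The file `sample.pdb` is a stand-in for `pentane.pdb`.

~~~bash
workdir=$(mktemp -d)
cd "$workdir" || exit 1
seq 1 30 > sample.pdb    # stand-in for pentane.pdb

cat > middle.sh <<'EOF'
# Select lines from the middle of a file.
# Usage: bash middle.sh filename end_line num_lines
head -n "$2" "$1" | tail -n "$3"
EOF

first_call=$(bash middle.sh sample.pdb 15 5)    # lines 11-15
second_call=$(bash middle.sh sample.pdb 20 5)   # lines 16-20
echo "$first_call"
echo "$second_call"
~~~

Changing only the arguments selects a different range, with no need to edit the script again.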


wc -l *.pdb | sort -n


# Sort files by their length.
# Usage: bash sorted.sh one_or_more_filenames
wc -l "$@" | sort -n


cd ~/shell-lesson-data/exercise-data/proteins
nano sorted.sh
cat sorted.sh
bash sorted.sh *.pdb ../creatures/*.dat
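
Why `"$@"` is written with quotes can be seen in a small experiment. The functions `count_args`, `forward_quoted`, and `forward_unquoted` are hypothetical and exist only for this sketch; the first reports how many arguments it received, and the other two pass their arguments on with and without quotes.

~~~bash
# Hypothetical helper: print the number of arguments received.
count_args() {
    echo "$#"
}

# Forwards its arguments with quotes, as sorted.sh does.
forward_quoted() {
    count_args "$@"
}

# Forwards them without quotes, so names are split at spaces.
forward_unquoted() {
    count_args $@
}

quoted=$(forward_quoted "red panda.dat" unicorn.dat)
unquoted=$(forward_unquoted "red panda.dat" unicorn.dat)
echo "quoted: $quoted, unquoted: $unquoted"
~~~

The quoted form keeps `red panda.dat` as one argument, while the unquoted form splits it into two, which would make `wc` look for files that do not exist.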


chmod 755 sorted.sh
./sorted.sh
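
Note that `./sorted.sh` is run here with no filenames. In that case `wc -l` receives no arguments and reads from the keyboard, so the script appears to hang until Ctrl+C or Ctrl+D is pressed. One possible guard, shown as a sketch and not part of the lesson's script, is to check the argument count `$#` first. The function name `sorted_guarded` is made up for this example.

~~~bash
# Hypothetical variant of sorted.sh, wrapped in a function so it can be tried out.
sorted_guarded() {
    if [ "$#" -eq 0 ]
    then
        echo "Usage: sorted.sh one_or_more_filenames" >&2
        return 1
    fi
    wc -l "$@" | sort -n
}

# With no arguments the guard reports an error instead of waiting for input.
if sorted_guarded 2>/dev/null
then
    status="ran"
else
    status="refused"
fi
echo "$status"
~~~

Inside a script file, the same check would use `exit 1` in place of `return 1`.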

<details class="details details--default" data-variant="default"><summary>Challenge: list unique species</summary>
<ul>
  <li>Leah has several hundred data files, each of which is formatted like this:</li>
</ul>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-code"><pre>2013-11-05,deer,5
2013-11-05,rabbit,22
2013-11-05,raccoon,7
2013-11-06,rabbit,19
2013-11-06,deer,2
2013-11-06,fox,1
2013-11-07,rabbit,18
2013-11-07,bear,1
</pre></td></tr></tbody></table></code></pre></div></div>

<ul>
  <li>An example of this type of file is given in 
<code class="language-plaintext highlighter-rouge">shell-lesson-data/exercise-data/animal-counts/animals.csv</code>.</li>
  <li>We can use the command <code class="language-plaintext highlighter-rouge">cut -d , -f 2 animals.csv | sort | uniq</code> to produce 
the unique species in <code class="language-plaintext highlighter-rouge">animals.csv</code>.</li>
  <li>In order to avoid having to type out this series of commands every time, 
a scientist may choose to write a shell script instead.</li>
  <li>Write a shell script called <code class="language-plaintext highlighter-rouge">species.sh</code> that takes any number of 
filenames as command-line arguments, and uses a variation of the above command 
to print a list of the unique species appearing in each of those files separately.</li>
</ul>

<details class="details details--note" data-variant="note"><summary>Solution</summary>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-code"><pre><span class="c">#!/bin/bash</span>
<span class="c"># Script to find unique species in csv files where species is the second data field</span>
<span class="c"># This script accepts any number of file names as command line arguments</span>
<span class="c"># Loop over all files</span>
<span class="k">for </span>file <span class="k">in</span> <span class="s2">"</span><span class="nv">$@</span><span class="s2">"</span>
<span class="k">do
  </span><span class="nb">echo</span> <span class="s2">"Unique species in </span><span class="nv">$file</span><span class="s2">:"</span>
  <span class="c"># Extract species names</span>
  <span class="nb">cut</span> <span class="nt">-d</span> , <span class="nt">-f</span> 2 <span class="s2">"</span><span class="nv">$file</span><span class="s2">"</span> | <span class="nb">sort</span> | <span class="nb">uniq
</span><span class="k">done</span>
</pre></td></tr></tbody></table></code></pre></div></div>

</details>
</details>
- Suppose we have just run a series of commands that did something useful --- for example,
that created a graph we'd like to use in a paper. We'd like to be able to re-create the 
graph later if we need to, so we want to save the commands in a file. 
- Instead of typing them in again (and potentially getting them wrong) we can do this:


history | tail -n 5 > redo-figure-3.sh

The file `redo-figure-3.sh` could now contain:

297 bash goostats.sh NENE01729B.txt stats-NENE01729B.txt
298 bash goodiff.sh stats-NENE01729B.txt /data/validated/01729.txt > 01729-differences.txt
299 cut -d ',' -f 2-3 01729-differences.txt > 01729-time-series.txt
300 ygraph --format scatter --color bw --borders none 01729-time-series.txt figure-3.png
301 history | tail -n 5 > redo-figure-3.sh
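
The saved lines still begin with history numbers, and the last line is the `history` command itself, so the file cannot be run as it stands. One way to clean it up, sketched here with `sed` on a shortened copy of the listing (an editor works just as well), is to strip the leading numbers and drop the final line.

~~~bash
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# A shortened stand-in for what history | tail would have saved.
cat > redo-figure-3.sh <<'EOF'
  297  bash goostats.sh NENE01729B.txt stats-NENE01729B.txt
  299  cut -d ',' -f 2-3 01729-differences.txt > 01729-time-series.txt
  301  history | tail -n 5 > redo-figure-3.sh
EOF

# '$d' deletes the last line; the substitution removes "spaces, digits, spaces".
cleaned=$(sed -e '$d' -e 's/^ *[0-9][0-9]* *//' redo-figure-3.sh)
echo "$cleaned"
~~~

The result can be saved back to the script file and checked by eye before it is run.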
