It is possible to combine multiple Linux commands into one
Settings:
Data are stored in shell-lesson-data/exercise-data/proteins on a Linux server (either sulu or a CloudLab experiment).
Data files have .pdb extension.
Question:
Which of these files contains the fewest lines?
Capturing output from commands
SSH to your Linux server.
Run the following commands to prepare the environment.
clear
cd
pwd
wget --no-check-certificate https://www.cs.wcupa.edu/lngo/data/shell-lesson-data.zip
unzip shell-lesson-data.zip
cd ~/shell-lesson-data/exercise-data/proteins
ls -l *.pdb
To get counts of characters, words, and lines in a file, we use wc.
man wc
wc *.pdb
wc -l *.pdb
ls
wc -l *.pdb > lengths.txt
ls
cat lengths.txt
wc -l *.pdb >> lengths.txt
cat lengths.txt
wc -l *.pdb > lengths.txt
cat lengths.txt
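The difference between > and >> in the commands above can be checked in isolation. The following is a minimal sketch that works in a temporary directory; demo.pdb is a made-up three-line file, not part of the lesson data.

```bash
# Work in a throwaway directory so no real data is touched.
workdir=$(mktemp -d)
cd "$workdir"
printf 'a\nb\nc\n' > demo.pdb        # hypothetical 3-line stand-in for a .pdb file

wc -l demo.pdb > lengths.txt         # creates (or overwrites) lengths.txt
wc -l demo.pdb >> lengths.txt        # appends a second copy of the output
appended=$(wc -l < lengths.txt)      # number of lines after appending

wc -l demo.pdb > lengths.txt         # overwrites the file again
overwritten=$(wc -l < lengths.txt)   # number of lines after overwriting

echo "after >>: $appended line(s); after >: $overwritten line(s)"
```

After the append, lengths.txt holds two lines; after the final overwrite it holds one again.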
Filtering output
We can sort the contents of lengths.txt using sort.
man sort
Challenge: what does sort -n do?
Explain what -n does by observing the following two commands.
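One way to see the effect of -n, as a sketch: sort the same small file with and without the option. The file numbers.txt and its contents are made up for this illustration.

```bash
workdir=$(mktemp -d)
printf '10\n9\n100\n2\n' > "$workdir/numbers.txt"   # hypothetical sample data

# Without -n, lines are compared character by character, so "100" sorts before "2".
plain=$(sort "$workdir/numbers.txt" | tr '\n' ' ')
# With -n, lines are compared by numeric value.
numeric=$(sort -n "$workdir/numbers.txt" | tr '\n' ' ')

echo "sort:    $plain"
echo "sort -n: $numeric"
```

The plain sort yields 10 100 2 9, while sort -n yields 2 9 10 100.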
We used intermediate files to store output. We can use a pipe (|) to combine them together.
sort -n lengths.txt | head -n 1
We can combine multiple commands:
wc -l *.pdb | sort -n | head -n 1
Challenge: piping commands together
In our current directory, we want to find the 3 files which have the fewest lines. Which command listed below would work?
1. wc -l * > sort -n > head -n 3
2. wc -l * | sort -n | head -n 1-3
3. wc -l * | head -n 3 | sort -n
4. wc -l * | sort -n | head -n 3
Solution
Option 4 is the solution. The pipe character | is used to connect the output from one command to the input of another. > is used to redirect standard output to a file. Try it in the shell-lesson-data/exercise-data/proteins directory!
Challenge: pipe reading comprehension
A file called animals.csv (in the shell-lesson-data/exercise-data/animal-counts folder) contains the following data:
For the file animals.csv from the previous exercise, consider the following command:
man cut
cut -d , -f 2 animals.csv
The uniq command filters out adjacent matching lines in a file. How could you extend this pipeline (using uniq and another command) to find out what animals the file contains (without any duplicates in their names)?
Solution
cut -d , -f 2 animals.csv | sort | uniq
Challenge: which pipe?
The file animals.csv contains 8 lines of data formatted as follows:
The uniq command has a -c option which gives a count of the number of times a line occurs in its input. Assuming your current directory is shell-lesson-data/exercise-data/animal-counts, what command would you use to produce a table that shows the total count of each type of animal in the file?
1. sort animals.csv | uniq -c
2. sort -t, -k2,2 animals.csv | uniq -c
3. cut -d, -f 2 animals.csv | uniq -c
4. cut -d, -f 2 animals.csv | sort | uniq -c
5. cut -d, -f 2 animals.csv | sort | uniq -c | wc -l
Solution
Option 4 is the correct answer.
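To see why the sort step matters, here is a small sketch that compares the pipeline with and without it. It writes its own copy of animals.csv into a temporary directory, using the eight sample rows listed elsewhere in this lesson.

```bash
workdir=$(mktemp -d)
cat > "$workdir/animals.csv" <<'EOF'
2013-11-05,deer,5
2013-11-05,rabbit,22
2013-11-05,raccoon,7
2013-11-06,rabbit,19
2013-11-06,deer,2
2013-11-06,fox,1
2013-11-07,rabbit,18
2013-11-07,bear,1
EOF

# uniq only merges adjacent duplicates, so unsorted input is not collapsed.
unsorted_rows=$(cut -d, -f 2 "$workdir/animals.csv" | uniq -c | wc -l)
# Sorting first puts equal names next to each other.
sorted_rows=$(cut -d, -f 2 "$workdir/animals.csv" | sort | uniq -c | wc -l)

echo "rows without sort: $unsorted_rows, rows with sort: $sorted_rows"
cut -d, -f 2 "$workdir/animals.csv" | sort | uniq -c
```

Without sort, no two equal names are adjacent in this file, so uniq -c reports eight rows of count 1. With sort, the eight rows collapse to five species.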
Nelle’s Pipeline: Checking Files
Nelle has run her samples through the assay machines and created 17 files in the north-pacific-gyre directory described earlier. Let’s check the integrity of this data:
cd ~/shell-lesson-data/north-pacific-gyre
ls -l
How do we check for data integrity? Imagine you have thousands of files.
wc -l *.txt | sort -n | head -n 5
This is possible by looking at metadata (line counts, word counts, etc.).
There are also files containing Z in their names:
ls *Z.txt
It is important to be careful when using wildcards if we don’t want to include these strange files in our calculations.
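As a sketch of a more selective wildcard, a bracket expression can restrict the match to names ending in A or B. The file names below are hypothetical stand-ins created in a temporary directory.

```bash
workdir=$(mktemp -d)
cd "$workdir"
# Hypothetical stand-ins for the data files, including one 'Z' file.
touch NENE01729A.txt NENE01729B.txt NENE01751A.txt NENE01971Z.txt

all_count=$(ls *.txt | wc -l)        # matches every .txt file, Z included
ab_count=$(ls *[AB].txt | wc -l)     # matches only names ending in A.txt or B.txt

echo "all .txt files: $all_count; A/B files only: $ab_count"
```

Here *.txt matches four files, while *[AB].txt skips the Z file and matches three.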
Loop
Suppose we have several hundred genome data files named basilisk.dat, minotaur.dat, and unicorn.dat. For this example, we’ll use the exercise-data/creatures directory which only has three example files, but the principles can be applied to many many more files at once.
The structure of these files is the same:
The common name, classification, and updated date are presented on the first three lines
The DNA sequences on the following lines.
Let’s look at the files:
cd ~/shell-lesson-data/exercise-data/creatures/
head -n 5 basilisk.dat minotaur.dat unicorn.dat
We would like to print out the classification for each species, which is given on the second line of each file.
For each file, we would need to execute the command head -n 2 and pipe this to tail -n 1.
We’ll use a loop to solve this problem, but first let’s look at the general form of a loop:
for thing in list_of_things
do
    operation_using $thing    # Indentation within the loop is not required, but aids legibility
done
and we can apply this to our example like this:
for filename in basilisk.dat minotaur.dat unicorn.dat
> do
> head -n 2 $filename | tail -n 1
> done
to match all files ending in .pdb and then lists them using ls.
The second code block lists a different file on each loop iteration. The value of the datafile variable is evaluated using $datafile, and then listed using ls.
Challenge: limiting sets of files
What would be the output of running the following loop in the shell-lesson-data/exercise-data/proteins directory?
cd ~/shell-lesson-data/exercise-data/proteins/
for filename in c*
> do
>     ls $filename
> done
1. No files are listed.
2. All files are listed.
3. Only cubane.pdb, octane.pdb and pentane.pdb are listed.
4. Only cubane.pdb is listed.
How would the output differ from using this command instead?
cd ~/shell-lesson-data/exercise-data/proteins/
for filename in *c*
> do
>     ls $filename
> done
5. The same files would be listed.
6. All the files are listed this time.
7. No files are listed this time.
8. The files cubane.pdb and octane.pdb will be listed.
9. Only the file octane.pdb will be listed.
Solution
4 is the correct answer. * matches zero or more characters, so any file name starting with the letter c, followed by zero or more other characters will be matched.
8 is the correct answer. * matches zero or more characters, so a file name with zero or more characters before a letter c and zero or more characters after the letter c will be matched.
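Both patterns can be tried safely with echo, which prints whatever the shell expands a wildcard to. This sketch recreates the six file names as empty files in a temporary directory.

```bash
workdir=$(mktemp -d)
cd "$workdir"
touch cubane.pdb ethane.pdb methane.pdb octane.pdb pentane.pdb propane.pdb

starts_with_c=$(echo c*)     # names that begin with the letter c
contains_c=$(echo *c*)       # names with a c anywhere in them

echo "c*  matches: $starts_with_c"
echo "*c* matches: $contains_c"
```

The first pattern matches only cubane.pdb; the second matches cubane.pdb and octane.pdb.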
Challenge: saving to a file in a Loop
In the shell-lesson-data/exercise-data/proteins directory, what is the effect of this loop?
cd ~/shell-lesson-data/exercise-data/proteins/
for alkanes in *.pdb
> do
>     echo $alkanes
>     cat $alkanes > alkanes.pdb
> done
1. Prints cubane.pdb, ethane.pdb, methane.pdb, octane.pdb, pentane.pdb and propane.pdb, and the text from propane.pdb will be saved to a file called alkanes.pdb.
2. Prints cubane.pdb, ethane.pdb, and methane.pdb, and the text from all three files would be concatenated and saved to a file called alkanes.pdb.
3. Prints cubane.pdb, ethane.pdb, methane.pdb, octane.pdb, and pentane.pdb, and the text from propane.pdb will be saved to a file called alkanes.pdb.
4. None of the above.
Also in the shell-lesson-data/exercise-data/proteins directory, what would be the output of the following loop?
cd ~/shell-lesson-data/exercise-data/proteins/
for datafile in *.pdb
> do
>     cat $datafile >> all.pdb
> done
5. All of the text from cubane.pdb, ethane.pdb, methane.pdb, octane.pdb, and pentane.pdb would be concatenated and saved to a file called all.pdb.
6. The text from ethane.pdb will be saved to a file called all.pdb.
7. All of the text from cubane.pdb, ethane.pdb, methane.pdb, octane.pdb, pentane.pdb and propane.pdb would be concatenated and saved to a file called all.pdb.
8. All of the text from cubane.pdb, ethane.pdb, methane.pdb, octane.pdb, pentane.pdb and propane.pdb would be printed to the screen and saved to a file called all.pdb.
Solution
1 is the correct answer. The text from each file in turn gets written to the alkanes.pdb file. However, the file gets overwritten on each loop iteration, so the final content of alkanes.pdb is the text from the propane.pdb file.
7 is the correct answer. >> appends to a file, rather than overwriting it with the redirected output from a command. Given the output from the cat command has been redirected, nothing is printed to the screen.
More complicated loop
Run the following loop
The shell starts by expanding *.dat to create the list of files it will process.
The loop body then executes two commands for each of those files.
The first command, echo, prints its command-line arguments to standard output. In this case, since the shell expands $filename to be the name of a file, echo $filename prints the name of the file.
Finally, the head and tail combination selects lines 81-100 from whatever file is being processed (assuming the file has at least 100 lines).
cd ~/shell-lesson-data/exercise-data/creatures
for filename in *.dat
> do
>     echo $filename
>     head -n 100 $filename | tail -n 20
> done
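To check which lines the head and tail combination really selects, it helps to run it on a file whose contents are its own line numbers. This is a sketch using seq to build two hypothetical test files.

```bash
workdir=$(mktemp -d)
seq 1 200 > "$workdir/long.txt"    # hypothetical file: line N contains the number N
seq 1 50  > "$workdir/short.txt"   # a file with fewer than 100 lines

first=$(head -n 100 "$workdir/long.txt" | tail -n 20 | head -n 1)
last=$(head -n 100 "$workdir/long.txt" | tail -n 20 | tail -n 1)
echo "long file: lines $first to $last"

# With fewer than 100 lines, head passes the whole file on,
# so tail returns its last 20 lines instead.
short_first=$(head -n 100 "$workdir/short.txt" | tail -n 20 | head -n 1)
echo "short file: selection starts at line $short_first"
```

For the 200-line file the selection runs from line 81 to line 100. For the 50-line file it starts at line 31, which is why the "at least 100 lines" caveat matters.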
We would like to modify each of the files in shell-lesson-data/exercise-data/creatures, but also save a version of the original files, naming the copies original-basilisk.dat and original-unicorn.dat.
Running cp *.dat original-*.dat wouldn’t back up our files; instead we get an error:
cp: target `original-*.dat' is not a directory
This problem arises when cp receives more than two inputs. When this happens, it expects the last input to be a directory where it can copy all the files it was passed. Since there is no directory named original-*.dat in the creatures directory we get an error.
Instead, we can use a loop:
for filename in *.dat
> do
>     cp $filename original-$filename
> done
Since the cp command does not normally produce any output, it’s hard to check that the loop is doing the correct thing. However, we learned earlier how to print strings using echo, and we can modify the loop to use echo to print our commands without actually executing them. As such we can check what commands would be run in the unmodified loop.
The following diagram shows what happens when the modified loop is executed, and demonstrates how the judicious use of echo is a good debugging technique.
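A sketch of that echo-based dry run is shown here. It uses three empty stand-in files in a temporary directory and prints the cp commands without running them.

```bash
workdir=$(mktemp -d)
cd "$workdir"
touch basilisk.dat minotaur.dat unicorn.dat   # empty stand-ins for the creature files

# Prefixing the command with echo prints it instead of executing it.
preview=$(for filename in *.dat
do
    echo cp "$filename" "original-$filename"
done)

echo "$preview"
```

Each output line is a cp command the real loop would run, for example cp basilisk.dat original-basilisk.dat, and no original-* files are created. Removing the echo turns the preview into the real loop.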
Nelle’s Pipeline: Processing Files
Nelle is now ready to process her data files using goostats.sh — a shell script written by her supervisor. This calculates some statistics from a protein sample file, and takes two arguments:
an input file (containing the raw data)
an output file (to store the calculated statistics)
Since she’s still learning how to use the shell, she decides to build up the required commands in stages. Her first step is to make sure that she can select the right input files — remember, these are ones whose names end in ‘A’ or ‘B’, rather than ‘Z’. Starting from her home directory, Nelle types:
cd ~/shell-lesson-data/north-pacific-gyre
for datafile in NENE*A.txt NENE*B.txt
> do
>     echo $datafile
> done
Her next step is to decide what to call the files that the goostats.sh analysis program will create. Prefixing each input file’s name with ‘stats’ seems simple, so she modifies her loop to do that:
for datafile in NENE*A.txt NENE*B.txt
> do
>     echo $datafile stats-$datafile
> done
She hasn’t actually run goostats.sh yet, but now she’s sure she can select the right files and generate the right output filenames.
Typing in commands over and over again is becoming tedious, though, and Nelle is worried about making mistakes, so instead of re-entering her loop, she presses ↑. In response, the shell redisplays the whole loop on one line (using semi-colons to separate the pieces):
for datafile in NENE*A.txt NENE*B.txt; do echo $datafile stats-$datafile; done
Using the left arrow key, Nelle backs up and changes the command echo to bash goostats.sh:
for datafile in NENE*A.txt NENE*B.txt; do bash goostats.sh $datafile stats-$datafile; done
When she presses Enter, the shell runs the modified command. However, nothing appears to happen — there is no output. After a moment, Nelle realizes that since her script doesn’t print anything to the screen any longer, she has no idea whether it is running, much less how quickly. She kills the running command by typing Ctrl+C, uses ↑ to repeat the command, and edits it to read:
for datafile in NENE*A.txt NENE*B.txt; do echo $datafile; bash goostats.sh $datafile stats-$datafile; done
<details class="details details--default" data-variant="default"><summary>Beginning and End</summary>
<ul>
<li>We can move to the beginning of a line in the shell by typing
<kbd>Ctrl</kbd>+<kbd>A</kbd> and to the end using <kbd>Ctrl</kbd>+<kbd>E</kbd>.</li>
</ul>
</details>
When she runs her program now, it produces one line of output every five seconds or so.
1518 times 5 seconds, divided by 60, tells her that her script will take about two hours to run.
As a final check, she opens another terminal window, goes into `north-pacific-gyre`,
and uses `cat stats-NENE01729B.txt` to examine one of the output files.
It looks good, so she decides to get some coffee and catch up on her reading.
<details class="details details--default" data-variant="default"><summary>Those Who Know History Can Choose to Repeat It</summary>
<p>Another way to repeat previous work is to use the <code class="language-plaintext highlighter-rouge">history</code> command to
get a list of the last few hundred commands that have been executed, and
then to use <code class="language-plaintext highlighter-rouge">!123</code> (where ‘123’ is replaced by the command number) to
repeat one of those commands. For example, if Nelle types this:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">history</span> | <span class="nb">tail</span> <span class="nt">-n</span> 5
456 <span class="nb">ls</span> <span class="nt">-l</span> NENE0<span class="k">*</span>.txt
457 <span class="nb">rm </span>stats-NENE01729B.txt.txt
458 bash goostats.sh NENE01729B.txt stats-NENE01729B.txt
459 <span class="nb">ls</span> <span class="nt">-l</span> NENE0<span class="k">*</span>.txt
460 <span class="nb">history</span>
</code></pre></div></div>
<p>then she can re-run <code class="language-plaintext highlighter-rouge">goostats.sh</code> on <code class="language-plaintext highlighter-rouge">NENE01729B.txt</code> simply by typing
<code class="language-plaintext highlighter-rouge">!458</code>.</p>
</details>
<details class="details details--default" data-variant="default"><summary>Challenge: doing a dry run</summary>
<ul>
<li>A loop is a way to do many things at once — or to make many mistakes at
once if it does the wrong thing. One way to check what a loop <em>would</em> do
is to <code class="language-plaintext highlighter-rouge">echo</code> the commands it would run instead of actually running them.</li>
<li>Suppose we want to preview the commands the following loop will execute
without actually running those commands:</li>
</ul>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">for </span>datafile <span class="k">in</span> <span class="k">*</span>.pdb
<span class="o">></span> <span class="k">do</span>
<span class="o">></span>   <span class="nb">cat</span> <span class="nv">$datafile</span> <span class="o">>></span> all.pdb
<span class="o">></span> <span class="k">done</span>
</code></pre></div></div>
<ul>
<li>What is the difference between the two loops below, and which one would we
want to run?</li>
</ul>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># Version 1</span>
<span class="k">for </span>datafile <span class="k">in</span> <span class="k">*</span>.pdb
<span class="o">></span> <span class="k">do</span>
<span class="o">></span>   <span class="nb">echo cat</span> <span class="nv">$datafile</span> <span class="o">>></span> all.pdb
<span class="o">></span> <span class="k">done</span>
</code></pre></div></div>
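<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># Version 2</span>
<span class="k">for </span>datafile <span class="k">in</span> <span class="k">*</span>.pdb
<span class="o">></span> <span class="k">do</span>
<span class="o">></span>   <span class="nb">echo</span> <span class="s2">"cat </span><span class="nv">$datafile</span><span class="s2"> >> all.pdb"</span>
<span class="o">></span> <span class="k">done</span>
</code></pre></div></div>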
<details class="details details--note" data-variant="note"><summary>Solution</summary>
<ul>
<li>The second version is the one we want to run.
This prints to screen everything enclosed in the quote marks, expanding the
loop variable name because we have prefixed it with a dollar sign.
It also <em>does not</em> modify nor create the file <code class="language-plaintext highlighter-rouge">all.pdb</code>, as the <code class="language-plaintext highlighter-rouge">>></code>
is treated literally as part of a string rather than as a
redirection instruction.</li>
<li>The first version appends the output from the command <code class="language-plaintext highlighter-rouge">echo cat $datafile</code>
to the file, <code class="language-plaintext highlighter-rouge">all.pdb</code>. This file will just contain the list;
<code class="language-plaintext highlighter-rouge">cat cubane.pdb</code>, <code class="language-plaintext highlighter-rouge">cat ethane.pdb</code>, <code class="language-plaintext highlighter-rouge">cat methane.pdb</code> etc.</li>
<li>Try both versions for yourself to see the output! Be sure to change to the
proper directory and open <code class="language-plaintext highlighter-rouge">all.pdb</code> file to view its contents.</li>
</ul>
</details>
</details>
<details class="details details--default" data-variant="default"><summary>Challenge: nested loops</summary>
<ul>
<li>Suppose we want to set up a directory structure to organize
some experiments measuring reaction rate constants with different compounds
<em>and</em> different temperatures. What would be the result of the following code:</li>
</ul>
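<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">for </span>species <span class="k">in</span> cubane ethane methane
<span class="o">></span> <span class="k">do</span>
<span class="o">></span>     <span class="k">for </span>temperature <span class="k">in</span> 25 30 37 40
<span class="o">></span>     <span class="k">do</span>
<span class="o">></span>         <span class="nb">mkdir</span> <span class="nv">$species</span>-<span class="nv">$temperature</span>
<span class="o">></span>     <span class="k">done</span>
<span class="o">></span> <span class="k">done</span>
</code></pre></div></div>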
<details class="details details--note" data-variant="note"><summary>Solution</summary>
<ul>
<li>We have a nested loop, i.e. contained within another loop, so for each species
in the outer loop, the inner loop (the nested loop) iterates over the list of
temperatures, and creates a new directory for each combination.</li>
<li>Try running the code for yourself to see which directories are created!</li>
</ul>
</details>
</details>
---
## Shell scripting
- Let's start by going back to `~/shell-lesson-data/exercise-data/proteins` and creating a new file,
`middle.sh`, which will become our shell script:
~~~bash
cd ~/shell-lesson-data/exercise-data/proteins
nano middle.sh
cat middle.sh
~~~
- Add the following line to `middle.sh` and save:
- `head -n 15 octane.pdb | tail -n 5`
- Once we have saved the file, we can ask the shell to execute the commands it contains.
Our shell is called `bash`, so we run the following command:
~~~bash
bash middle.sh
~~~
<figure
>
<picture>
<!-- Auto scaling with imagemagick -->
<!--
See https://www.debugbear.com/blog/responsive-images#w-descriptors-and-the-sizes-attribute and
https://developer.mozilla.org/en-US/docs/Learn/HTML/Multimedia_and_embedding/Responsive_images for info on defining 'sizes' for responsive images
-->
<source
class="responsive-img-srcset"
srcset="/assets/img/courses/csc586/09-scripting-linux/script-middle-480.webp 480w,/assets/img/courses/csc586/09-scripting-linux/script-middle-800.webp 800w,/assets/img/courses/csc586/09-scripting-linux/script-middle-1400.webp 1400w,"
type="image/webp"
sizes="95vw"
>
<img
src="/assets/img/courses/csc586/09-scripting-linux/script-middle.png"
width="50%"
height="auto"
data-zoomable
loading="lazy"
onerror="this.onerror=null; $('.responsive-img-srcset').remove();"
>
</picture>
</figure>
<details class="details details--default" data-variant="default"><summary>Text vs. Whatever</summary>
<p>We usually call programs like Microsoft Word or LibreOffice Writer <em>text
editors</em>, but we need to be a bit more careful when it comes to
programming. By default, Microsoft Word uses <code class="language-plaintext highlighter-rouge">.docx</code> files to store not
only text, but also formatting information about fonts, headings, and so
on. This extra information isn’t stored as characters and doesn’t mean
anything to tools like <code class="language-plaintext highlighter-rouge">head</code>: they expect input files to contain
nothing but the letters, digits, and punctuation on a standard computer
keyboard. When editing programs, therefore, you must either use a plain
text editor, or be careful to save files as plain text.</p>
</details>
- What if we want to select lines from an arbitrary file? We could edit
`middle.sh` each time to change the filename, but that would probably
take longer than typing the command out again in the shell and
executing it with a new file name. Instead, let's edit `middle.sh`
and make it more versatile:
- Edit `middle.sh` and replace the text `octane.pdb` with the special variable called `$1`.
- Wrap `$1` inside double quotes: `"$1"`.
- `$1` means 'the first filename (or other argument) on the command line'.
~~~bash
nano middle.sh
cat middle.sh
bash middle.sh octane.pdb
bash middle.sh pentane.pdb
~~~
Currently, we need to edit middle.sh each time we want to adjust the range of lines that is returned. Let’s fix that by configuring our script to instead use three command-line arguments.
After the first command-line argument ($1), each additional argument that we provide will be accessible via the special variables $1, $2, $3, which refer to the first, second, third command-line arguments, respectively.
Edit middle.sh and replace 15 with "$2" and 5 with "$3".
By changing the arguments to our command we can change our script’s behaviour:
bash middle.sh pentane.pdb 20 5
This works, but it may take the next person who reads middle.sh a moment to figure out what it does. We can improve our script by adding some comments at the top:
A comment starts with a # character and runs to the end of the line.
Add the following comments to middle.sh at the top:
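As a sketch, a commented version of the script could look like the one written out below. The comment wording is an assumption rather than the lesson's exact text, and sample.txt is a made-up 30-line test file.

```bash
workdir=$(mktemp -d)
cd "$workdir"

# Write a commented middle.sh (comment wording is illustrative).
cat > middle.sh <<'EOF'
# Select lines from the middle of a file.
# Usage: bash middle.sh filename end_line num_lines
head -n "$2" "$1" | tail -n "$3"
EOF

seq 1 30 > sample.txt                       # hypothetical file: line N contains N
result=$(bash middle.sh sample.txt 20 5 | tr '\n' ' ')
echo "$result"
```

With the arguments 20 and 5, the script prints the five lines ending at line 20, that is lines 16 to 20 of sample.txt. The comments are ignored by bash but tell the next reader what the three arguments mean.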
What if we want to process many files in a single pipeline? For example, if we want to sort our .pdb files by length, we would type the following command because wc -l lists the number of lines in the files and sort -n sorts things numerically.
wc -l *.pdb | sort -n
We could put this in a file, but then it would only ever sort a list of .pdb files in the current directory. If we want to be able to get a sorted list of other kinds of files, we need a way to get all those names into the script.
We can’t use $1, $2, and so on because we don’t know how many files there are.
Instead, we use the special variable $@, which means, ‘All of the command-line arguments to the shell script’.
We should also put $@ inside double quotes to handle the case of arguments containing spaces ("$@" is special syntax and is equivalent to "$1" "$2" …).
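The effect of the quotes can be observed by counting how many words each form expands to. This is a minimal sketch; count-args.sh is a hypothetical helper script, and "my file.txt" is a made-up argument containing a space.

```bash
workdir=$(mktemp -d)
cd "$workdir"

cat > count-args.sh <<'EOF'
# Hypothetical helper: count the words produced by $@ with and without quotes.
unquoted=0
for arg in $@;   do unquoted=$((unquoted + 1)); done
quoted=0
for arg in "$@"; do quoted=$((quoted + 1)); done
echo "unquoted=$unquoted quoted=$quoted"
EOF

# Two arguments are passed; the first contains a space.
result=$(bash count-args.sh "my file.txt" other.txt)
echo "$result"
```

Unquoted, the two arguments are split into three words (my, file.txt and other.txt). Quoted, they stay as the two arguments that were passed.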
Create a file called sorted.sh inside shell-lesson-data/exercise-data/proteins with the following contents:
# Sort files by their length.
# Usage: bash sorted.sh one_or_more_filenames
wc -l "$@" | sort -n
Observe the following commands:
cd ~/shell-lesson-data/exercise-data/proteins
nano sorted.sh
cat sorted.sh
bash sorted.sh *.pdb ../creatures/*.dat
To turn your script into an executable file (one that runs without the bash command), the following line must be at the top of your script:
#!/bin/bash
and your script file must have executable permission:
chmod 755 sorted.sh
./sorted.sh *.pdb ../creatures/*.dat
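Put together, the steps might look like the following sketch. It runs in a temporary directory; short.txt and long.txt are made-up input files, and the script body is the sorted.sh described above with the #!/bin/bash line added.

```bash
workdir=$(mktemp -d)
cd "$workdir"
printf 'a\nb\n' > short.txt          # hypothetical 2-line file
printf 'a\nb\nc\nd\n' > long.txt     # hypothetical 4-line file

cat > sorted.sh <<'EOF'
#!/bin/bash
# Sort files by their length.
# Usage: ./sorted.sh one_or_more_filenames
wc -l "$@" | sort -n
EOF

chmod 755 sorted.sh                  # owner may read/write/execute; others read/execute

# The script now runs directly, without typing "bash" in front of it.
shortest=$(./sorted.sh long.txt short.txt | head -n 1)
echo "$shortest"
```

The first line of the sorted output names short.txt, the file with the fewest lines. Note that running ./sorted.sh with no file names would leave wc waiting for input from the keyboard.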
<details class="details details--default" data-variant="default"><summary>Challenge: list unique species</summary>
<ul>
<li>Leah has several hundred data files, each of which is formatted like this:</li>
</ul>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>2013-11-05,deer,5
2013-11-05,rabbit,22
2013-11-05,raccoon,7
2013-11-06,rabbit,19
2013-11-06,deer,2
2013-11-06,fox,1
2013-11-07,rabbit,18
2013-11-07,bear,1
</code></pre></div></div>
<ul>
<li>An example of this type of file is given in
<code class="language-plaintext highlighter-rouge">shell-lesson-data/exercise-data/animal-counts/animals.csv</code>.</li>
<li>We can use the command <code class="language-plaintext highlighter-rouge">cut -d , -f 2 animals.csv | sort | uniq</code> to produce
the unique species in <code class="language-plaintext highlighter-rouge">animals.csv</code>.</li>
<li>In order to avoid having to type out this series of commands every time,
a scientist may choose to write a shell script instead.</li>
<li>Write a shell script called <code class="language-plaintext highlighter-rouge">species.sh</code> that takes any number of
filenames as command-line arguments, and uses a variation of the above command
to print a list of the unique species appearing in each of those files separately.</li>
</ul>
<details class="details details--note" data-variant="note"><summary>Solution</summary>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">#!/bin/bash</span>
<span class="c"># Script to find unique species in csv files where species is the second data field</span>
<span class="c"># This script accepts any number of file names as command line arguments</span>
<span class="c"># Loop over all files; quoting "$@" keeps file names containing spaces intact</span>
<span class="k">for </span>file <span class="k">in</span> <span class="s2">"</span><span class="nv">$@</span><span class="s2">"</span>
<span class="k">do</span>
    <span class="nb">echo</span> <span class="s2">"Unique species in </span><span class="nv">$file</span><span class="s2">:"</span>
    <span class="c"># Extract species names</span>
    <span class="nb">cut</span> <span class="nt">-d</span> , <span class="nt">-f</span> 2 <span class="s2">"</span><span class="nv">$file</span><span class="s2">"</span> | <span class="nb">sort</span> | <span class="nb">uniq</span>
<span class="k">done</span>
</code></pre></div></div>
</details>
</details>
- Suppose we have just run a series of commands that did something useful (for example,
commands that created a graph we'd like to use in a paper). We'd like to be able to re-create the
graph later if we need to, so we want to save the commands in a file.
- Instead of typing them in again (and potentially getting them wrong) we can do this:
- After a moment's work in an editor to remove the serial numbers on the commands,
and to remove the final line where we called the `history` command,
we have a completely accurate record of how we created that figure.
- In practice, most people develop shell scripts by running commands at the shell prompt a few
times to make sure they're doing the right thing, then saving them in a file for re-use.
- This style of work allows people to recycle what they discover about their data and their
workflow with one call to `history` and a bit of editing to clean up the output
and save it as a shell script.
---
## Nelle's Pipeline: Creating a Script
- Nelle's supervisor insisted that all her analytics must be reproducible.
The easiest way to capture all the steps is in a script.
- First we return to Nelle's project directory:
~~~bash
cd ../../north-pacific-gyre/
~~~
She then creates a file using nano …
nano do-stats.sh
…which contains the following:
#!/bin/bash
# Calculate stats for data files.
for datafile in "$@"
do
    echo $datafile
    bash goostats.sh $datafile stats-$datafile
done
She saves this in a file called do-stats.sh and makes it executable, so that she can now redo the first stage of her analysis by typing:
./do-stats.sh NENE*A.txt NENE*B.txt
She can also do the following so that the output is just the number of files processed rather than the names of the files that were processed.
./do-stats.sh NENE*A.txt NENE*B.txt | wc -l
One thing to note about Nelle’s script is that it lets the person running it decide what files to process. She could have written it as:
#!/bin/bash
# Calculate stats for Site A and Site B data files.
for datafile in NENE*A.txt NENE*B.txt
do
    echo $datafile
    bash goostats.sh $datafile stats-$datafile
done
The advantage is that this always selects the right files:
she doesn’t have to remember to exclude the ‘Z’ files.
The disadvantage is that it always selects just those files — she can’t run it on all files (including the ‘Z’ files), or on the ‘G’ or ‘H’ files her colleagues in Antarctica are producing, without editing the script.
She could modify her script to check for command-line arguments, and use NENE*A.txt NENE*B.txt if none were provided. Of course, this introduces another tradeoff between flexibility and complexity.
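One way such a check could look is sketched below. The script name do-stats-default.sh and the data file names are hypothetical, and echo stands in for the real goostats.sh call so the sketch runs without that script.

```bash
workdir=$(mktemp -d)
cd "$workdir"
touch NENE01729A.txt NENE01729B.txt NENE01971Z.txt   # hypothetical stand-in data files

cat > do-stats-default.sh <<'EOF'
#!/bin/bash
# Sketch: fall back to the Site A and Site B files when no arguments are given.
if [ "$#" -eq 0 ]
then
    set -- NENE*A.txt NENE*B.txt    # replace the (empty) argument list
fi
for datafile in "$@"
do
    # echo stands in for the real call, so nothing is actually processed
    echo bash goostats.sh "$datafile" "stats-$datafile"
done
EOF

no_args=$(bash do-stats-default.sh | wc -l)
one_arg=$(bash do-stats-default.sh NENE01971Z.txt | wc -l)
echo "no arguments: $no_args file(s); explicit argument: $one_arg file(s)"
```

With no arguments the script selects the two A and B files; with an explicit argument it processes exactly what it was given, including a Z file. The extra if block is the added complexity that the tradeoff refers to.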
Challenge: variables in shell scripts
In the proteins directory, imagine you have a shell script called script.sh containing the following commands:
#!/bin/bash
head -n $2 $1
tail -n $3 $1
While you are in the proteins directory, you type the following command:
./script.sh '*.pdb' 1 1
Which of the following outputs would you expect to see?
1. All of the lines between the first and the last lines of each file ending in .pdb in the proteins directory
2. The first and the last line of each file ending in .pdb in the proteins directory
3. The first and the last line of each file in the proteins directory
4. An error because of the quotes around *.pdb
The correct answer is 2.
The special variables $1, $2 and $3 represent the command line arguments given to the script, such that the commands run are head -n 1 *.pdb and tail -n 1 *.pdb.
The shell does not expand '*.pdb' because it is enclosed by quote marks.
As such, the first argument to the script is the literal string *.pdb, which is expanded inside the script when the unquoted $1 is passed to head and tail.
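The two stages of expansion can be observed separately. In this sketch, show-args.sh is a hypothetical helper that reports how many arguments it received, and the two .pdb files are empty stand-ins.

```bash
workdir=$(mktemp -d)
cd "$workdir"
touch cubane.pdb ethane.pdb          # empty stand-ins for two data files

cat > show-args.sh <<'EOF'
# Hypothetical helper: print the number of arguments received.
echo "$#"
EOF

unquoted=$(bash show-args.sh *.pdb)      # the calling shell expands the wildcard first
quoted=$(bash show-args.sh '*.pdb')      # the literal string *.pdb arrives as one argument

# Inside a script, an unquoted $1 holding *.pdb is expanded at the point of use.
inside=$(bash -c 'echo $1' _ '*.pdb')

echo "unquoted: $unquoted argument(s); quoted: $quoted argument(s)"
echo "expanded inside: $inside"
```

The unquoted call passes two arguments and the quoted call passes one. The last command shows that the single *.pdb argument still ends up matching both files once it is used unquoted inside the script.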
Challenge: find the longest file with a given extension
Write a shell script called longest.sh that takes the name of a directory and a filename extension as its arguments, and prints out the name of the file with the most lines in that directory with that extension. For example:
./longest.sh shell-lesson-data/data/pdb pdb
would print the name of the .pdb file in shell-lesson-data/data/pdb that has the most lines.
Feel free to test your script on another directory, e.g.
bash longest.sh shell-lesson-data/writing/data txt
Solution
#!/bin/bash
# Shell script which takes two arguments:
#    1. a directory name
#    2. a file extension
# and prints the name of the file in that directory
# with the most lines which matches the file extension.
wc -l $1/*.$2 | sort -n | tail -n 2 | head -n 1
The first part of the pipeline, wc -l $1/*.$2 | sort -n, counts the lines in each file and sorts them numerically (largest last). When there’s more than one file, wc also outputs a final summary line, giving the total number of lines across all files. We use tail -n 2 | head -n 1 to throw away this last line.
With wc -l $1/*.$2 | sort -n | tail -n 1 we’ll see the final summary line: we can build our pipeline up in pieces to be sure we understand the output.
Challenge: script reading comprehension
For this question, consider the shell-lesson-data/exercise-data/proteins directory once again. This contains a number of .pdb files in addition to any other files you may have created.
Explain what each of the following three scripts would do when run as bash script1.sh *.pdb, bash script2.sh *.pdb, and bash script3.sh *.pdb respectively.
In each case, the shell expands the wildcard in *.pdb before passing the resulting list of file names as arguments to the script.
Script 1 would print out a list of all files containing a dot in their name. The arguments passed to the script are not actually used anywhere in the script.
Script 2 would print the contents of the first 3 files with a .pdb file extension. $1, $2, and $3 refer to the first, second, and third argument respectively.
Script 3 would print all the arguments to the script (i.e. all the .pdb files), followed by .pdb. $@ refers to all the arguments given to a shell script.
Suppose you have saved the following script in a file called do-errors.sh in Nelle’s north-pacific-gyre/scripts directory:
# Calculate stats for data files.
for datafile in "$@"
do
    echo $datfile
    bash goostats.sh $datafile stats-$datafile
done
When you run it from the north-pacific-gyre directory, the output is blank.
bash do-errors.sh NENE*A.txt NENE*B.txt
To figure out why, re-run the script using the -x option:
bash -x do-errors.sh NENE*A.txt NENE*B.txt
What is the output showing you?
Which line is responsible for the error?
Solution
The -x option causes bash to run in debug mode.
This prints out each command as it is run, which will help you to locate errors.
In this example, we can see that echo isn’t printing anything. We have made a typo in the loop variable name, and the variable datfile doesn’t exist, hence returning an empty string.
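The trace can be reproduced with a cut-down script. In this sketch, do-errors-demo.sh is a hypothetical file that contains only the loop and the same typo, with the goostats.sh call left out, and NENE01729A.txt is just a name passed as an argument.

```bash
workdir=$(mktemp -d)
cd "$workdir"

cat > do-errors-demo.sh <<'EOF'
# Hypothetical cut-down script with the same typo: datfile instead of datafile.
for datafile in "$@"
do
    echo $datfile
done
EOF

# bash -x writes each command, prefixed with +, to standard error before running it.
trace=$(bash -x do-errors-demo.sh NENE01729A.txt 2>&1)
echo "$trace"
```

The trace contains a line reading + echo with nothing after it, which shows that $datfile expanded to an empty string before echo ran.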