Chapter 2
Content of This Book
This book is aimed at the complete beginner. However, if you know something about
computers but not about programming, the book will still be useful for you. After
introducing the basics of how to work in the Linux environment, some powerful tools
will be presented. Among these are the stream editor sed, the scripting
languages awk and perl, the data analysis and visualization tool R,
and the relational database system MySQL. These utilities are extremely helpful
when it comes to formatting and analyzing data files. After you have worked through
all the chapters, you can use this book as a reference. The learning approach is
strictly practical. Thus, you are invited to run all examples, printed in
so-called Terminals, on your own! If you face any problems: contact me! Of course,
I cannot help you if your computer, driven by a non-Unix-like operating system,
crashes continuously. However, if things connected to this book confuse you, or you
even find errors, please let me know (Email: rw@biowasserstoff.de). Further
information about this book, including lists with internet links and known errors,
can be found at my homepage (http://www.hs-mittweida.de/wuenschi). You are very
welcome to supply me with good ideas for examples!
2.1 The Main Chapters
This book contains several chapters that each focus on one particular concept, e.g. a
programming language. There is a progression, though: chapters usually depend
on previous chapters. However, if you are already a command line geek, you might
skip the early chapters.
2.1.1 Linux
Linux is a multi-user, multitasking operating system, originally inspired by Minix,
which is an operating system similar to Unix. Linux was initially developed by Linus
Torvalds in 1991. It is an open source operating system. This means that everybody
who has programming knowledge can modify and improve the system; it also means that
everybody can download and install it. This is a main reason to choose Linux: you
need to invest no money except for the book itself. Still, the content of this book
is valid for any Unix-like operating system such as Mac OS X or Cygwin, the free Unix
emulator for Windows.
R. Wünschiers, Computational Biology, DOI: 10.1007/978-3-642-34749-8_2,
© Springer-Verlag Berlin Heidelberg 2013
2.1.2 Shell Programming
The shell, also called terminal or console, will be our playground. Everything we do
in this book is done in the shell. The shell can be seen as a command interpreter:
we enter a command and the shell takes care of its execution; but we can also
combine a number of commands and programs, including programming structures
like decisions, in order to generate new functionality. Typical shell programs handle
files and directories rather than file contents. A common task would be to convert all
file extensions from .txt to .seq, make specific files executable or archive all recently
changed files. Shell programming resembles DOS’s programming language for batch
files.
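To make the first of these tasks concrete, here is a minimal sketch of a shell loop that changes file extensions from .txt to .seq. The scratch directory and the file names gene1.txt and gene2.txt are assumptions made up for this illustration; nothing here is taken from the later chapters.

```shell
#!/bin/bash
# Sketch only: the directory and file names are made up for illustration.
workdir=$(mktemp -d)            # scratch directory, so no real files are touched
cd "$workdir" || exit 1
touch gene1.txt gene2.txt       # two empty example files

# Change the extension of every .txt file to .seq
for f in *.txt; do
    mv -- "$f" "${f%.txt}.seq"  # ${f%.txt} strips the trailing .txt
done

ls                              # lists gene1.seq and gene2.seq
```

The loop works on file names only and never looks inside the files, which is typical for shell programs.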
2.1.3 Sed
No, this is not about the Socialist Unity Party of Germany (German: Sozialistische
Einheitspartei Deutschlands, SED), which governed the German Democratic
Republic from October 1949 until March 1990. sed (stream editor) is used
to perform basic text editing on an input text file (or data stream) and was written by
Lee E. McMahon in 1973. sed is not interactive: it applies the editing commands you
give it without asking for further input.
Figure 2.1 sketches an example of what sed can do. Here, the stop codon
of a DNA sequence is replaced by the text “!STOP!”. sed is well suited to
small formatting tasks such as converting RNA to DNA, commas to points, tabs to
semicolons and the like.
Fig. 2.1 Stream EDitor. What does Sed do?
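Since the figure itself is not reproduced here, the following sketch shows what such sed calls could look like. The sequences and the helper names mark_stop, rna_to_dna and comma_to_point are assumptions chosen for illustration. The first command simply matches one of the three stop codons at the end of a line and does not check the reading frame.

```shell
# Hypothetical helper names; each one wraps a single sed command.
mark_stop()      { sed -E 's/(TAA|TAG|TGA)$/!STOP!/'; }  # stop codon at line end
rna_to_dna()     { sed 'y/U/T/'; }                       # transliterate U to T
comma_to_point() { sed 's/,/./g'; }                      # every comma becomes a point

echo "ATGGCTAAGTAA" | mark_stop       # prints ATGGCTAAG!STOP!
echo "AUGGCUAAGUAA" | rna_to_dna      # prints ATGGCTAAGTAA
echo "0,25"         | comma_to_point  # prints 0.25
```

Each call reads a data stream, applies one editing command and writes the result, without any interaction.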
2.1.4 AWK
AWK is a programming language for handling common data manipulation tasks
with only a few lines of code. It was initially developed by Alfred V. Aho, Peter J.
Weinberger and Brian W. Kernighan in 1977; the name is formed from the initials of
their surnames. AWK is a great tool when it comes to analyzing the content of data
files. With AWK you can perform calculations, make decisions, and read and write
multiple files. Best of all, AWK can be extended with functions you design yourself. A
typical task would be to fuse the content of files that share one common field. Another
typical task would be to extract data matching certain criteria. AWK forms the core
of this book. After you have finished the chapter on AWK you should be able to (a)
program basically anything you need and (b) learn any other programming language.
The example in Fig. 2.2 shows one basic function of AWK. All enzyme names
in the file enzymes.file are printed if the corresponding Km value (do you remember
enzyme kinetics and Michaelis-Menten?) is smaller than 0.4.
Fig. 2.2 AWK. What does AWK do?
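As the figure is not reproduced here, the following is a minimal sketch of such an AWK call. It assumes that enzymes.file holds one enzyme per line, with the name in the first and the Km value in the second tab-separated column; the real file may be laid out differently. The enzyme names and Km values are placeholders, not measured data.

```shell
# Placeholder data: enzyme name <TAB> Km value (made-up numbers).
kmdir=$(mktemp -d)
printf '%s\t%s\n' enzymeA 0.25 enzymeB 1.30 enzymeC 0.05 > "$kmdir/enzymes.file"

# Print the name (field 1) of every line whose Km value (field 2) is below 0.4
awk -F'\t' '$2 < 0.4 { print $1 }' "$kmdir/enzymes.file"
```

With the placeholder data this prints enzymeA and enzymeC. The part before the braces is the decision, the part inside the braces is the action taken for matching lines.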
2.1.5 Perl
Perl (Practical Extraction and Report Language) arose from a project started in 1987
by Larry Wall. It combines some of the best features of the programming language
C, sed, AWK and shell programming and was optimized for scanning arbitrary text
files, extracting information from those text files and printing reports based on that
information. Perl is intended to be practical rather than beautiful. It
offers you everything a programming language can offer. Perl is often used to program
web-based applications (known as CGI scripts), it provides database connectivity, and
there is even a Bioperl project, which provides many tools biologists need. This book
can introduce you to only the very basics of Perl. Still, you will get to the point where
you learn how to write your own modules and generate small database files.
2.1.6 MySQL
If you happen to have several, possibly cross-linked, Excel tables, you
should consider storing them in a MySQL database. Imagine you have one table
(or tab-delimited file) containing gene expression data. Another table contains gene
annotations (Fig. 2.3).
Thus, there is a relationship between the data in these tables. The power of MySQL
(and other relational database systems) lies in querying across such relations.
Furthermore, MySQL is a good storage place for such data because everything is
available in one database. Many programming languages and software packages, e.g. R,
can connect to MySQL.
Table: annotation
gene      function                         metabolism
alr2938   iron superoxide dismutase        Detoxification
alr4392   nitrogen-responsive regulator    Nitrogen assimilation
alr4851   preprotein translocase subunit   Protein and peptide secretion
alr3395   adenylosuccinate lyase           Purine biosynthesis
alr1207   uridylate kinase                 Pyrimidine biosynthesis
alr5000   CTP synthetase                   Pyrimidine biosynthesis
all3556   succinate-dehydrogenase          TCA cycle

Table: expression
gene      expr_level
alr1207   8303
alr2938   10323
alr3395   1432
all3556   8043
alr4392   729
alr4851   633
alr5000   5732
Fig. 2.3 MySQL. The query shown at the bottom right displays gene names and gene functions
for all genes that are involved in either purine or pyrimidine metabolism and that have an
expression value of less than 5000
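The query from the figure is not reproduced in this text. To illustrate what querying across the relation means, the following sketch performs the same selection with AWK on two tab-delimited files instead of MySQL. It is a stand-in for the idea, not the query used in the figure. The file names annotation.tab and expression.tab are assumptions, and the pairing of genes with expression levels assumes that both columns of the expression table are listed in the same order.

```shell
reldir=$(mktemp -d)

# Table "annotation": gene <TAB> function <TAB> metabolism (no header line)
printf '%s\t%s\t%s\n' \
    alr2938 'iron superoxide dismutase'      'Detoxification' \
    alr4392 'nitrogen-responsive regulator'  'Nitrogen assimilation' \
    alr4851 'preprotein translocase subunit' 'Protein and peptide secretion' \
    alr3395 'adenylosuccinate lyase'         'Purine biosynthesis' \
    alr1207 'uridylate kinase'               'Pyrimidine biosynthesis' \
    alr5000 'CTP synthetase'                 'Pyrimidine biosynthesis' \
    all3556 'succinate-dehydrogenase'        'TCA cycle' > "$reldir/annotation.tab"

# Table "expression": gene <TAB> expr_level (no header line)
printf '%s\t%s\n' \
    alr1207 8303 alr2938 10323 alr3395 1432 all3556 8043 \
    alr4392 729  alr4851 633   alr5000 5732 > "$reldir/expression.tab"

# First file: remember the expression level of each gene.
# Second file: print gene and function if the gene is known, its level is
# below 5000 and the metabolism starts with "Purine" or "Pyrimidine".
awk -F'\t' '
    NR == FNR { level[$1] = $2 + 0; next }
    ($1 in level) && level[$1] < 5000 && $3 ~ /^(Purine|Pyrimidine) / {
        print $1 "\t" $2
    }
' "$reldir/expression.tab" "$reldir/annotation.tab"
```

With the data from Fig. 2.3 only alr3395 (adenylosuccinate lyase) fulfils both conditions. A relational database expresses the same link between the two tables declaratively and keeps all data in one place.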
2.1.7 R
There is a broad range of software tools available to perform data analysis and
visualization. R is completely free software that implements the S language, as does
the commercial package S-Plus. S in turn was initially developed at AT&T’s Bell
Laboratories to provide a software tool for professional statisticians who wanted to
combine state-of-the-art graphics with powerful model-fitting capability.
R is a very well established platform for scientists in general and computational
biologists in particular. Besides being a programming environment for statistical
computing, R is also a data visualization tool. Several subject-specific packages
are available to analyze and visualize experimental data, e.g. for evolutionary biology,
the evaluation of biochemical assays, nucleotide and amino acid sequence analysis,
microarray data interpretation and more. Bioconductor is the name of an international
project that brings developments from various research teams together and
that provides tools for the analysis and comprehension of high-throughput genomic
data.
The introduction given in this book should set you on the right track to develop
increasingly sophisticated skills in using and combining available packages and
applying them to your own scientific problems.
2.1.8 Worked Examples
This might be the most important part of this book. In Part VI on p. 343 I present
four chapters that can be used for teaching and training. All are based on real-world
problems and come with detailed instructions on how to solve the problem. All
exercises include data processing and visualization of some sort.
Most programmers would agree that one learns a programming language by using
and adapting the programs of others. This is learning by mimicking. You should go the
same way with the worked examples. Start by following the instructions, then leave
the path and start playing with both the data and the data processing. Maybe you
develop nice modifications to the examples and would like to send them to me; I would
be very happy. Last but not least, the worked examples should give you a glimpse of
where you could employ command line data processing in your own projects.
2.2 Prerequisites
In order to perform the exercises shown in this book (and there are a lot) you need
to have access to a computer running either Linux, Unix or Mac OS X (the newest
Apple operating system) or the free Windows Unix emulator Cygwin (see below).
As you will learn soon, these systems are very similar. Thus, all the things we are
going to learn will work on all Linux, Unix and Mac OS X computers. On a normal
installation all required programs should be installed. Otherwise, contact the system
administrator.
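Whether the required programs are present can be checked from the shell itself. The following is a minimal sketch; the function name check_tools is a hypothetical helper, and the list of command names (sed, awk, perl, R, mysql) is an assumption derived from the tools named in this chapter.

```shell
# check_tools is a hypothetical helper name, not a standard command.
# It reports for each given command name whether the shell can find it.
check_tools() {
    missing=0
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "found:   $tool"
        else
            echo "missing: $tool"
            missing=1
        fi
    done
    return "$missing"           # 0 if everything was found, 1 otherwise
}

check_tools sed awk perl R mysql || echo "Some tools are missing; ask your system administrator."
```

The check only tells you that a command exists on the search path, not which version is installed.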
My personal recommendation is to install Linux as a virtual machine. This approach,
which is independent of your operating system, is explained in Sect. 4.1.2 on p. 40
(see also Sect. 3.8.1 on p. 30).
Alternatively, you can install the free Cygwin Unix emulator on a computer running
the Microsoft Windows operating system (from Win95 upwards, excluding WinCE)
(see Sect. 3.8.2 on p. 30) or start with Knoppix Linux (see Sect. 3.9 on p. 31).
Knoppix runs from a CD-ROM and requires no installation on the hard disk drive.
Neither your operating system nor the data on your computer will be touched.
2.3 Conventions
What you see on your computer screen is written in typewriter style and boxed.
I will refer to this as the Terminal. The data you have to enter are given after
the $ character. Key labels are written in a box. For example, the key labelled
“Enter” would be written as [Enter]. Commands that appear in the text are written in
typewriter style, too. When necessary, space characters are symbolized by “␣” in the
text. Thus, “␣␣␣” means that you have to type three consecutive spaces. In the following
example you would type date as input and get the current date as output.
Thereafter, in lines 3-5, I indicate how I mark commands or output that span several
lines.
Terminal 1: The Command date
1 $ date
2 Thu Feb 13 18:53:26 CET 2003
3 $ # this is a very long comment line that does nothing but indicating how I +
4 highlight commands or output that spans two or more lines; NOTE the plus signs +
5 at the end of the lines
6 $
In most cases you will find some text after the terminal which describes the terminal
content: in Terminal 1, line 1, we check the current date and time.
Boxes labelled “Program” contain script files or programs. These have to be saved
in a file as indicated in the first or second program line: # save as hello.sh.
You will find the program under the same name on the accompanying website. Like
terminals, programs are numbered.
Program 1: hello.sh - Our First Shell Script
1 #!/bin/bash
2 # save as hello.sh
3 # This is a comment
4 echo "Hello World"
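How such a program gets from the page into a runnable file can be sketched as follows. This is only one possible way, assuming that bash is available as /bin/bash; the scratch directory is made up for the illustration, and you would normally type the program into a text editor instead of using a here-document.

```shell
progdir=$(mktemp -d)            # scratch directory, so no real files are touched
cd "$progdir" || exit 1

# Save the four program lines (without the line numbers) as hello.sh
cat > hello.sh <<'EOF'
#!/bin/bash
# save as hello.sh
# This is a comment
echo "Hello World"
EOF

chmod +x hello.sh               # make the file executable
./hello.sh                      # run it; prints Hello World
```

The line numbers printed in the Program box are only for reference in the text and must not be typed into the file.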
Finally, there are “Files”, which usually contain text that is processed. At the
end of most chapters you will find exercises. These are numbered, too. The solutions
can be found in Sect. A.7 on p. 431.
2.4 Additional Resources
This book is accompanied by a website
(http://www.staff.hs-mittweida.de/~wuenschi/doku.php?id=rwbook2)
and a blog (Fig. 2.4). Suggestions and comments are highly welcome.
Fig. 2.4 Blog about AWK. Take a look at, or participate in, my blog at http://www.awkologist.com
http://www.springer.com/978-3-642-34748-1
