Red Hat Enterprise Linux Essentials
     Unit 7: Text Processing Tools
Objectives
Upon completion of this unit, you should be able to:

 Use tools for extracting, analyzing and manipulating
  text data
Tools for Extracting Text

 File Contents: less and cat

 File Excerpts: head and tail

 Extract by Column: cut

 Extract by Keyword: grep
Viewing File Contents
                             less and cat

 cat: dump one or more files to STDOUT

    Multiple files are concatenated together


 less: view file or STDIN one page at a time

    Useful commands while viewing:

       • /text searches for text

       • n/N jumps to the next/previous match

       • v opens the file in a text editor


 less is the pager used by man
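
A minimal sketch of cat concatenating files, assuming two made-up files (a.txt and b.txt) in a scratch directory; the names and contents are illustrative only:

```shell
# Hypothetical sample files in a scratch directory
tmp=$(mktemp -d)
printf 'line one\n' > "$tmp/a.txt"
printf 'line two\n' > "$tmp/b.txt"

# cat writes the files to STDOUT in the order given
combined=$(cat "$tmp/a.txt" "$tmp/b.txt")
printf '%s\n' "$combined"

# With no file argument cat copies STDIN, so it also works in a pipe
piped=$(printf 'from a pipe\n' | cat)
printf '%s\n' "$piped"

rm -r "$tmp"
```

The same files could be paged instead of dumped with less "$tmp/a.txt", which is interactive and therefore left out of the sketch.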
Viewing File Excerpts
                               head and tail

 head: Display the first 10 lines of a file

    Use -n to change number of lines displayed


 tail: Display the last 10 lines of a file

    Use -n to change number of lines displayed


    Use -f to "follow" subsequent additions to the file
       • Very useful for monitoring log files!
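
A short sketch of head and tail, assuming a made-up 20-line file called numbers.txt. tail -f is left out because it keeps running until interrupted:

```shell
# Hypothetical sample file: the numbers 1 to 20, one per line
tmp=$(mktemp -d)
seq 1 20 > "$tmp/numbers.txt"

# Without -n, head prints the first 10 lines
default_count=$(head "$tmp/numbers.txt" | wc -l | tr -d ' ')

# -n changes how many lines are shown
first3=$(head -n 3 "$tmp/numbers.txt")
last2=$(tail -n 2 "$tmp/numbers.txt")

printf '%s\n' "$default_count" "$first3" "$last2"
rm -r "$tmp"
```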
Extracting Text by Keyword
                             grep
 Prints lines of files or STDIN where a pattern is matched
       $ grep 'john' /etc/passwd

       $ date --help | grep year

 Use -i to search case-insensitively

 Use -n to print line numbers of matches

 Use -v to print lines not containing pattern

 Use -AX to include the X lines after each match

 Use -BX to include the X lines before each match
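
The options above can be tried on a small file. This is only a sketch: users.txt and its contents are made up for illustration.

```shell
# Hypothetical sample data
tmp=$(mktemp -d)
cat > "$tmp/users.txt" <<'EOF'
alice:admin
Bob:staff
carol:staff
dave:guest
EOF

ci=$(grep -i 'bob' "$tmp/users.txt")          # case-insensitive match
numbered=$(grep -n 'staff' "$tmp/users.txt")  # prefix matches with line numbers
inverted=$(grep -v 'staff' "$tmp/users.txt")  # lines NOT containing the pattern
after=$(grep -A1 'admin' "$tmp/users.txt")    # match plus 1 line after it
before=$(grep -B1 'guest' "$tmp/users.txt")   # match plus 1 line before it

printf '%s\n' "$ci" "$numbered" "$inverted" "$after" "$before"
rm -r "$tmp"
```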
Extracting Text by Column
                                 cut

 Display specific columns of file or STDIN data

  $ cut -d: -f1 /etc/passwd

  $ grep root /etc/passwd | cut -d: -f7


 Use -d to specify the column delimiter (default is TAB)

 Use -f to specify the column to print

 Use -c to cut by characters

  $ cut -c2-5 /usr/share/dict/words
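
A sketch of the same cut commands run against a made-up passwd-style file (passwd.sample is an assumption, used so the result does not depend on the local /etc/passwd):

```shell
# Hypothetical passwd-style sample
tmp=$(mktemp -d)
cat > "$tmp/passwd.sample" <<'EOF'
root:x:0:0:root:/root:/bin/bash
john:x:1000:1000:John:/home/john:/bin/sh
EOF

# Field 1 of each colon-delimited line: the login names
logins=$(cut -d: -f1 "$tmp/passwd.sample")

# Field 7 of the root entry: its login shell
root_shell=$(grep '^root:' "$tmp/passwd.sample" | cut -d: -f7)

# Characters 2 through 5 of each line, ignoring delimiters
chars=$(cut -c2-5 "$tmp/passwd.sample")

printf '%s\n' "$logins" "$root_shell" "$chars"
rm -r "$tmp"
```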
Tools for Analyzing Text

 Text Stats: wc

 Sorting Text: sort

 Comparing Files: diff and patch

 Spell Check: aspell
Gathering Text Statistics
                       wc (word count)
 Counts words, lines, bytes and characters

 Can act upon a file or STDIN

       $ wc story.txt

       39   237   1901 story.txt

 Use -l for only line count

 Use -w for only word count

 Use -c for only byte count

 Use -m for character count (not displayed by default)
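
A small sketch of the individual counts, assuming a made-up two-line story.txt. Reading the file from STDIN keeps the file name out of the output, and tr strips the padding some wc versions add:

```shell
# Hypothetical sample file: 2 lines, 3 words, 14 bytes
tmp=$(mktemp -d)
printf 'one two\nthree\n' > "$tmp/story.txt"

# Default output: lines, words, bytes, then the file name
wc "$tmp/story.txt"

lines=$(wc -l < "$tmp/story.txt" | tr -d ' ')
words=$(wc -w < "$tmp/story.txt" | tr -d ' ')
bytes=$(wc -c < "$tmp/story.txt" | tr -d ' ')

printf '%s\n' "$lines" "$words" "$bytes"
rm -r "$tmp"
```

For plain ASCII text such as this, -m and -c report the same number; they differ only when the text contains multibyte characters and the locale recognizes them.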
Sorting Text
                              sort

 Sorts text to STDOUT - original file unchanged

       $ sort [options] file(s)
 Common options
    -r performs a reverse (descending) sort

    -n performs a numeric sort

    -f ignores (folds) case of characters in strings

    -u (unique) removes duplicate lines in output

    -t c uses c as a field separator

    -k X sorts by c-delimited field X
       • Can be used multiple times
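
A sketch of how -t, -k, -n and -r combine, assuming a made-up scores.txt with name:score lines. It also shows why -n matters when the sort field holds numbers:

```shell
# Hypothetical sample data: name:score
tmp=$(mktemp -d)
printf 'carol:30\nalice:5\nbob:200\n' > "$tmp/scores.txt"

# Default: sort whole lines as text
by_name=$(sort "$tmp/scores.txt")

# Field 2 compared as text: "200" sorts before "30" and "5"
lexical=$(sort -t: -k2 "$tmp/scores.txt")

# Field 2 compared as numbers
numeric=$(sort -t: -k2 -n "$tmp/scores.txt")

# Same, in descending order
reverse=$(sort -t: -k2 -nr "$tmp/scores.txt")

printf '%s\n' "$by_name" "$lexical" "$numeric" "$reverse"
rm -r "$tmp"
```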
Eliminating Duplicate Lines
                        sort and uniq
 sort -u: removes duplicate lines from input

 uniq: removes duplicate adjacent lines from input
    Use -c to count number of occurrences

    Use with sort for best effect:

      $ sort userlist.txt | uniq -c
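
A sketch of why uniq is normally paired with sort, assuming a made-up userlist.txt in which no duplicate names are adjacent. The final awk only normalizes the padding that uniq -c puts in front of each count:

```shell
# Hypothetical sample data: repeated names, never on adjacent lines
tmp=$(mktemp -d)
printf 'bob\nalice\nbob\ncarol\nalice\nbob\n' > "$tmp/userlist.txt"

# uniq alone removes nothing here, because no duplicates are adjacent
unsorted=$(uniq "$tmp/userlist.txt" | wc -l | tr -d ' ')

# sort -u removes duplicates in one step
distinct=$(sort -u "$tmp/userlist.txt")

# sort first, then count occurrences of each name
counts=$(sort "$tmp/userlist.txt" | uniq -c | awk '{ print $1, $2 }')

printf '%s\n' "$unsorted" "$distinct" "$counts"
rm -r "$tmp"
```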
Comparing Files
                              diff
 Compares two files for differences
      $ diff foo.conf-broken foo.conf-works
      5c5
      < use_widgets = no
      ---
      > use_widgets = yes
    Denotes a difference (change) on line 5

 Use gvimdiff for graphical diff
    Provided by vim-X11 package
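
A sketch that reproduces this kind of output with two made-up two-line config files (so the change shows up on line 2 rather than line 5). Note that diff exits with status 1 when the files differ, which matters in scripts:

```shell
# Hypothetical config files that differ on line 2
tmp=$(mktemp -d)
printf 'name = foo\nuse_widgets = no\n'  > "$tmp/foo.conf-broken"
printf 'name = foo\nuse_widgets = yes\n' > "$tmp/foo.conf-works"

# Capture the output and the exit status (0 = identical, 1 = different)
status=0
changes=$(diff "$tmp/foo.conf-broken" "$tmp/foo.conf-works") || status=$?

printf '%s\n' "$changes"
printf 'exit status: %s\n' "$status"
rm -r "$tmp"
```

In the output, 2c2 means line 2 of the first file was changed into line 2 of the second; lines starting with < come from the first file and lines starting with > from the second.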
Duplicating File Changes
                               patch
 diff output stored in a file is called a "patchfile"
    Use -u for "unified" diff, best in patchfiles

 patch duplicates changes in other files (use with care!)

 • Use -b to automatically back up changed files

  $ diff -u foo.conf-broken foo.conf-works > foo.patch

  $ patch -b foo.conf-broken foo.patch
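
A sketch of the full round trip, assuming two made-up config files and that the patch utility is installed. The backup name foo.conf-broken.orig relies on patch's default backup suffix:

```shell
# Hypothetical config files that differ on line 2
tmp=$(mktemp -d)
printf 'name = foo\nuse_widgets = no\n'  > "$tmp/foo.conf-broken"
printf 'name = foo\nuse_widgets = yes\n' > "$tmp/foo.conf-works"

# diff exits with status 1 when the files differ; "|| true" keeps a
# script running under "set -e" from stopping here
diff -u "$tmp/foo.conf-broken" "$tmp/foo.conf-works" > "$tmp/foo.patch" || true

# Apply the changes; -b saves the original as foo.conf-broken.orig
patch -b "$tmp/foo.conf-broken" "$tmp/foo.patch"

patched=$(cat "$tmp/foo.conf-broken")
backup=$(cat "$tmp/foo.conf-broken.orig")

printf '%s\n' "$patched" "$backup"
rm -r "$tmp"
```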
Spell Checking with aspell

 Interactively spell-check files:
       $ aspell check letter.txt

 Non-interactively list misspelled words in STDIN

       $ aspell list < letter.txt

       $ aspell list < letter.txt | wc -l
Tools for Manipulating Text
                           tr and sed
 Alter (translate) Characters: tr
    Converts characters in one set to corresponding characters in another
     set
    Only reads data from STDIN

       $ tr 'a-z' 'A-Z' < lowercase.txt

 Alter Strings: sed
    stream editor

    Performs search/replace operations on a stream of text

    Normally does not alter source file

    Use -i.bak to back up and alter source file
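
A sketch of tr and of sed with and without -i.bak, assuming two made-up files (lowercase.txt and pets):

```shell
# Hypothetical sample files
tmp=$(mktemp -d)
printf 'hello world\n' > "$tmp/lowercase.txt"
printf 'my dog likes your dog\n' > "$tmp/pets"

# tr reads only STDIN, so the file is redirected in
upper=$(tr 'a-z' 'A-Z' < "$tmp/lowercase.txt")

# Without -i, sed prints the result and leaves the file alone
preview=$(sed 's/dog/cat/g' "$tmp/pets")
untouched=$(cat "$tmp/pets")

# With -i.bak, sed rewrites the file and keeps the original as pets.bak
sed -i.bak 's/dog/cat/g' "$tmp/pets"
edited=$(cat "$tmp/pets")
saved=$(cat "$tmp/pets.bak")

printf '%s\n' "$upper" "$preview" "$untouched" "$edited" "$saved"
rm -r "$tmp"
```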
sed
                              Examples
 Quote search and replace instructions!

 sed addresses
    sed 's/dog/cat/g' pets

    sed '1,50s/dog/cat/g' pets

    sed '/digby/,/duncan/s/dog/cat/g' pets

 Multiple sed instructions
    sed -e 's/dog/cat/' -e 's/hi/lo/' pets

    sed -f myedits pets
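
A sketch of line-number addresses, pattern addresses and a script file, assuming a made-up four-line pets file and a myedits file created on the spot:

```shell
# Hypothetical sample data
tmp=$(mktemp -d)
cat > "$tmp/pets" <<'EOF'
digby the dog
rex the dog
duncan the dog
spot the dog
EOF

# Line-number address: only lines 1 and 2 are edited
by_line=$(sed '1,2s/dog/cat/g' "$tmp/pets")

# Pattern address: from the first /digby/ line through the next /duncan/ line
by_pattern=$(sed '/digby/,/duncan/s/dog/cat/g' "$tmp/pets")

# The same two instructions given inline and read from a script file
inline=$(sed -e 's/dog/cat/' -e 's/the/a/' "$tmp/pets")
printf 's/dog/cat/\ns/the/a/\n' > "$tmp/myedits"
from_file=$(sed -f "$tmp/myedits" "$tmp/pets")

printf '%s\n' "$by_line" "$by_pattern" "$from_file"
rm -r "$tmp"
```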
Introduction to awk

   Field/Column processor
   Supports egrep-compatible (POSIX) RegExes
   Can return full lines like grep
   Awk runs 3 steps:
     BEGIN - optional
     Body, where the main action(s) take place
     END - optional
 Multiple body actions can be executed by separating them with
  semicolons, e.g. '{ print $1; print $2 }'
 awk automatically loops through the input stream, regardless of its
  source (STDIN, pipe or file)
 Usage:
       awk '/optional_match/ { action }' file_name | Pipe
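
The three steps can be sketched in a single program. The input lines (a name and an amount) and the variable name total are made up for illustration:

```shell
# BEGIN runs once before any input, the body runs for each line that
# matches its pattern, and END runs once after the last line
report=$(printf 'alice 10\nbob 20\ncarol 30\n' |
  awk 'BEGIN    { print "name amount" }
       $2 >= 20 { print $1, $2; total += $2 }
       END      { print "total", total }')

printf '%s\n' "$report"
```

The body here uses a comparison as its pattern instead of a /regex/; both forms select which input lines the action applies to.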
Example awk

 Print a text file
    awk '{print }' /etc/passwd

    awk '{print $0}' /etc/passwd

 Print specific field
    awk -F':' '{print $1}' /etc/passwd

 Pattern matching
    awk '$9 == 500 { print $0}' /var/log/httpd/access.log

 Print lines containing vmintam, student or khanh
    awk '/vmintam|student|khanh/' /etc/passwd
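
A sketch of field selection and both kinds of pattern on a made-up passwd-style file (passwd.sample and its entries are assumptions, so the result does not depend on the local system):

```shell
# Hypothetical passwd-style sample
tmp=$(mktemp -d)
cat > "$tmp/passwd.sample" <<'EOF'
root:x:0:0:root:/root:/bin/bash
student:x:500:500::/home/student:/bin/bash
khanh:x:501:501::/home/khanh:/bin/sh
EOF

# -F sets the field separator; $1 is the login name
names=$(awk -F: '{ print $1 }' "$tmp/passwd.sample")

# Regex pattern with alternation: lines containing either word
matched=$(awk -F: '/student|khanh/ { print $1 }' "$tmp/passwd.sample")

# Comparison pattern on a single field: UID equal to 500
uid500=$(awk -F: '$3 == 500 { print $1 }' "$tmp/passwd.sample")

printf '%s\n' "$names" "$matched" "$uid500"
rm -r "$tmp"
```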
Example awk (cont'd)

 Print the first line of a file
   awk 'NR==1 {print; exit}' /etc/resolv.conf

 Simple arithmetic
   awk '{total += $1} END {print total}' earnings.txt

 bash arithmetic is integer-only, but awk can calculate with floating-point numbers:
   awk 'BEGIN {printf "%.3f\n", 2005.50 / 3}'

 List the most frequently used commands
   history | awk '{print $2}' | sort | uniq -c | sort -rn | head
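
A sketch of the arithmetic and counting examples with inline data, since earnings.txt and the shell history are not available in a script. The numbers and command names are made up:

```shell
# Sum a column of numbers, including fractions
total=$(printf '10.5\n20.25\n3\n' | awk '{ total += $1 } END { print total }')

# Floating-point division formatted to three decimal places
third=$(awk 'BEGIN { printf "%.3f\n", 2005.50 / 3 }')

# Frequency count in the style of the history pipeline; the last awk
# only strips the padding uniq -c puts in front of each count
top=$(printf 'ls\ncd\nls\ngrep\nls\ncd\n' |
  sort | uniq -c | sort -rn | head -n 2 | awk '{ print $1, $2 }')

printf '%s\n' "$total" "$third" "$top"
```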
Special Characters for Complex Searches
                 Regular Expressions
 ^ represents beginning of line

 $ represents end of line

 Character classes as in bash:
    [abc], [^abc]

    [[:upper:]], [^[:upper:]]

 Used by:
    grep, sed, less, others
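
A sketch of the anchors and character classes with grep, assuming a made-up words.txt:

```shell
# Hypothetical sample data
tmp=$(mktemp -d)
cat > "$tmp/words.txt" <<'EOF'
Apple pie
apple tart
banana
Cherry
cab
EOF

starts_a=$(grep '^a' "$tmp/words.txt")               # line begins with a
ends_a=$(grep 'a$' "$tmp/words.txt")                 # line ends with a
starts_abc=$(grep '^[abc]' "$tmp/words.txt")         # begins with a, b or c
starts_upper=$(grep '^[[:upper:]]' "$tmp/words.txt") # begins with an uppercase letter
not_upper=$(grep '^[^[:upper:]]' "$tmp/words.txt")   # begins with anything else

printf '%s\n' "$starts_a" "$ends_a" "$starts_abc" "$starts_upper" "$not_upper"
rm -r "$tmp"
```

Inside brackets a leading ^ negates the class, while outside brackets it anchors the match to the start of the line, which is why both meanings appear in the last pattern.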