Tuesday, September 17, 2013

Text Processing In Linux

Text processing is one of the most common tasks when working in a Linux environment. It can feel complex at first, because you need to know which tools and commands are available. Word processors do exist for Linux, but the command-line tools are worth learning, since many Linux systems run without a GUI.

This article covers the tools and commands that Linux provides for text processing.

Concat Text
cat is a Linux command for concatenating text. It can also display a file or append one file to another.
[root@vx111a ~]# cat text1 # Displays the contents of text1
[root@vx111a ~]# cat text1 text2 # Concatenates the contents of text1 and text2
[root@vx111a ~]# cat text1 >> text2 # Appends the contents of text1 to text2
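The cat forms above can be tried safely in a scratch directory. This is a minimal sketch; the file names and contents are made up for illustration. Note that > overwrites its target, while >> appends to it.

```shell
# Work in a scratch directory so no real files are touched.
workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT
cd "$workdir" || exit 1

printf 'apple\npear\n' > text1
printf 'banana\n' > text2

cat text1 text2 > combined   # '>' creates or overwrites combined
cat text1 >> text2           # '>>' appends text1 to the end of text2
cat -n combined              # -n numbers every output line
```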

Head and Tail
The head and tail commands let you preview the first or last bits of a text file you’re working with, letting you narrow down the file that you need without opening it in a text editor.

The head command lets you view the first part of a text file (the first 10 lines by default). The syntax is
head text1

The tail command lets you view the last part of a text file (the last 10 lines by default). The syntax is
tail text1

Controlling the lines
We can control the number of lines displayed by head or tail with the -n option:
head -n 30 text1 # displays the first 30 lines
tail -n 30 text1 # displays the last 30 lines
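head and tail can also be combined to pull a range of lines out of the middle of a file. The sketch below builds a hypothetical 10-line sample file to show this; tail -n +N means "start at line N" rather than "the last N lines".

```shell
# Build a 10-line sample file: "line 1" .. "line 10".
sample=$(mktemp)
trap 'rm -f "$sample"' EXIT
i=1
while [ "$i" -le 10 ]; do
    echo "line $i"
    i=$((i + 1))
done > "$sample"

head -n 3 "$sample"                # first 3 lines
tail -n 2 "$sample"                # last 2 lines
tail -n +8 "$sample"               # from line 8 to the end
head -n 6 "$sample" | tail -n 2    # lines 5 and 6 only
```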

tr
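The tr command translates specified characters into other characters, or deletes them. In the example below, h becomes k and l becomes f; the -s option then squeezes each run of repeated output characters into one, which is why "ll" ends up as a single "f".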

[root@vx111a ~]# echo hello world | tr -s "hl" "kf"
kefo worfd
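To see what each tr option contributes, it helps to run them one at a time. The following is a small sketch using only standard tr options; the input strings are arbitrary.

```shell
echo "hello world" | tr 'a-z' 'A-Z'     # HELLO WORLD  (translate a range)
echo "hello world" | tr -d 'l'          # heo word     (delete characters)
echo "hello    world" | tr -s ' '       # hello world  (squeeze repeated spaces)
echo "hello world" | tr 'hl' 'kf'       # keffo worfd  (no -s, so "ff" is kept)
```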

pr
The pr command formats files for printing. The default header includes the file name, the file's last modification date and time, and a page number; each page also ends with a trailer of blank lines.

[root@vx111a ~]# pr text1 | head
2011-12-19 21:49 text1 Page 1
1 apple
2 pear
3 banana

nl
The nl command numbers the lines of a file:

[root@vx111a ~]# nl text1
1 1 apple
2 2 pear
3 3 banana

look
The look command displays lines that begin with a given string. It can therefore be used to check the spelling of a word by giving the word's prefix.

look works like grep, but does a lookup on a "dictionary," a sorted word list. By default, look searches /usr/share/dict/words on most Linux distributions (/usr/dict/words on some older systems), but a different dictionary file may be specified.
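A prefix lookup against a custom word list might look like the sketch below. The word list is hypothetical, and because look is not installed on every system, the call is guarded.

```shell
# Hypothetical word list; look expects the file to be sorted.
words=$(mktemp)
trap 'rm -f "$words"' EXIT
printf 'apple\napplication\napply\nbanana\n' > "$words"

# Print every line that starts with "appl".
if command -v look >/dev/null 2>&1; then
    look appl "$words"
else
    echo "look is not installed on this system"
fi
```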

sort
The sort command sorts the input using the collating sequence for the locale (LC_COLLATE) of the system. The sort command can also merge already sorted files and check whether a file is sorted or not.

[root@vx111a ~]# sort text2
10 apple
3 banana
9 plum
[root@vx111a ~]# tsort text1 # tsort is a separate command that performs a topological sort
1
2
3
4
pear
banana
apple
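The plain sort above puts "10 apple" before "3 banana" because it compares text, not numbers. The sketch below shows numeric and key-based sorting, plus the check and merge modes mentioned earlier. The second file and its contents are made up for illustration.

```shell
fruits=$(mktemp)
trap 'rm -f "$fruits" "$fruits.sorted" "$fruits.more"' EXIT
printf '10 apple\n3 banana\n9 plum\n' > "$fruits"

sort -n "$fruits"      # numeric order: 3 banana, 9 plum, 10 apple
sort -k2 "$fruits"     # by the second field: apple, banana, plum
sort -rn "$fruits"     # numeric order, largest first

# -c checks the order without printing sorted output; exit status 0 means sorted.
if sort -c -n "$fruits" 2>/dev/null; then
    echo "already sorted numerically"
else
    echo "not sorted numerically"
fi

# -m merges files that are already sorted, without re-sorting them.
sort -n "$fruits" > "$fruits.sorted"
printf '1 kiwi\n12 fig\n' > "$fruits.more"
sort -m -n "$fruits.sorted" "$fruits.more"
```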

uniq
uniq can be used to display, count, or delete adjacent duplicate lines from a file or standard input (stdin). If duplicate lines in a file are not adjacent to one another, uniq will not treat them as duplicates:

[root@vx111a ~]# cat samp
unix commands
shell script
command prompt
unix commands
unix system administration
shell script
unix commands

[root@vx111a ~]# sort samp | uniq
command prompt
shell script
unix commands
unix system administration
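The counting and filtering modes of uniq can be sketched with the same sample data. The last command shows why sorting first matters: without it, none of the duplicates are adjacent, so uniq removes nothing.

```shell
samp=$(mktemp)
trap 'rm -f "$samp"' EXIT
printf '%s\n' "unix commands" "shell script" "command prompt" \
    "unix commands" "unix system administration" "shell script" \
    "unix commands" > "$samp"

sort "$samp" | uniq -c    # prefix each distinct line with its count
sort "$samp" | uniq -d    # only lines that occur more than once
sort "$samp" | uniq -u    # only lines that occur exactly once
uniq "$samp" | wc -l      # still 7 lines: no duplicates are adjacent
```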

diff
diff is a Linux command that reports the differences between two files:

[root@vx111a ~]# echo "this is jagadesh" >> s1
[root@vx111a ~]# echo "jagadesh is this" >> s2

[root@vx111a ~]# diff s1 s2
1c1
< this is jagadesh
---
> jagadesh is this

The diff command can also recursively compare directories (for the filenames present).

[root@vx111a ~]# diff -r ~/notes1 ~/notes2
Only in /home/bozo/notes1: file02
Only in /home/bozo/notes1: file03
Only in /home/bozo/notes2: file04
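In scripts, the exit status of diff is often more useful than its output: 0 means the files match and 1 means they differ. A minimal sketch, using scratch copies of the two sample files:

```shell
dir=$(mktemp -d)
trap 'rm -rf "$dir"' EXIT
echo "this is jagadesh" > "$dir/s1"
echo "jagadesh is this" > "$dir/s2"

# Unified format, as used in patches. Exit status 1 only means "files differ".
diff -u "$dir/s1" "$dir/s2" || true

# -q reports only whether the files differ, not how.
if diff -q "$dir/s1" "$dir/s2" >/dev/null; then
    echo "identical"
else
    echo "different"
fi
```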

cut
You can use the cut command to extract portions of each line of a file, selected either by character position or by delimited field.

$ cat test.txt
cat command for file oriented operations.
cp command for copy files or directories.
ls command to list out files and directories with its attributes.
[root@vx111a ~]# cut -c2 test.txt # second character of each line
a
p
s

[root@vx111a ~]# cut -c1-3 test.txt # characters 1 to 3 of each line
cat
cp
ls

[root@vx111a ~]# cut -d':' -f1 /etc/passwd # first colon-separated field of each line
root
daemon
bin
sys
sync
games
bala
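cut can also select several fields or an open-ended range. The sketch below uses a made-up passwd-style file rather than the real /etc/passwd, so the output is predictable.

```shell
# Hypothetical passwd-style records (not read from /etc/passwd).
records=$(mktemp)
trap 'rm -f "$records"' EXIT
printf 'root:x:0:0:root:/root:/bin/bash\n' > "$records"
printf 'bala:x:1000:1000:Bala:/home/bala:/bin/sh\n' >> "$records"

cut -d':' -f1,7 "$records"    # fields 1 and 7: user name and login shell
cut -d':' -f3- "$records"     # field 3 through the end of the line
cut -c1-4 "$records"          # first four characters of each line
```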

join
join is a Linux command that joins two files on a field common to both. Both files must be sorted on that field:
[root@vx111a ~]# cat s1
100 Shoes
200 Laces
300 Socks

[root@vx111a ~]# cat s2
100 $40.00
200 $1.00
300 $2.00

[root@vx111a ~]# join s1 s2
100 Shoes $40.00
200 Laces $1.00
300 Socks $2.00
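By default, join silently drops lines that have no partner in the other file. The sketch below uses hypothetical comma-separated files to show the -t option for a custom delimiter and -a for keeping unpaired lines.

```shell
dir=$(mktemp -d)
trap 'rm -rf "$dir"' EXIT
printf '100,Shoes\n200,Laces\n300,Socks\n' > "$dir/items"
printf '100,$40.00\n300,$2.00\n' > "$dir/prices"    # no price for item 200

join -t',' "$dir/items" "$dir/prices"         # only items 100 and 300 are paired
join -t',' -a 1 "$dir/items" "$dir/prices"    # -a 1 also keeps unpaired lines from the first file
```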

fold
fold is a filter that wraps lines of input to a specified width. It is especially useful with the -s option, which breaks lines at spaces rather than in the middle of a word. In effect, it is a command-line word wrap for text files.
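The difference between a hard wrap and a word wrap is easiest to see side by side. A minimal sketch with an arbitrary sentence and a width of 12 columns:

```shell
text="The quick brown fox jumps over the lazy dog"

echo "$text" | fold -w 12       # hard break every 12 columns, splitting words
echo "$text" | fold -s -w 12    # break at the last space within each 12 columns
```

Without -s the first output line is "The quick br"; with -s the break moves back to the space after "quick".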

file
file is a Linux command that reports the type of a file:

[root@vx111a ~]# file perl
perl: directory

[root@vx111a ~]# file td.log
td.log: ASCII text

rev
rev is a Linux command that reverses the characters of each line in a file:

[root@vx111a ~]# echo "hai hello" > none
[root@vx111a ~]# rev none
olleh iah

source
The source command is a shell builtin that executes the commands in a file within the current shell. This is useful for loading functions or variables stored in another file.

For example, after adding new environment variables to ~/.bashrc or ~/.bash_profile, we can apply them to the current session without logging in again:
source ~/.bashrc
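The difference between running a file and sourcing it can be sketched as follows. The settings file, the APP_ENV variable and the greet function are hypothetical. The sketch uses the POSIX spelling ". file" so that it also runs under plain sh; in bash, "source file" does the same thing.

```shell
# Hypothetical settings file with one variable and one function.
settings=$(mktemp)
trap 'rm -f "$settings"' EXIT
cat > "$settings" <<'EOF'
APP_ENV=staging
greet() { echo "hello from $APP_ENV"; }
EOF

# Running the file as a child process does not affect this shell.
sh "$settings"
echo "after sh: ${APP_ENV:-unset}"

# Sourcing runs it in the current shell, so the names persist.
. "$settings"        # in bash: source "$settings"
echo "after source: $APP_ENV"
greet
```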

strings
The strings command prints the sequences of printable characters found in files.

strings filename | more

This displays the strings contained in the binary file called filename. strings can, for example, be a useful first step in examining an unknown executable.

cmp
The cmp command is a simpler version of diff, above. Whereas diff reports the differences between two files, cmp merely shows at what point they differ.

[root@vx111a ~]# echo "hello" > s1
[root@vx111a ~]# echo "mello" > sw
[root@vx111a ~]# cmp s1 sw
s1 sw differ: byte 1, line 1
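In scripts, cmp is usually run with -s, which prints nothing and reports the result through its exit status alone. A minimal sketch; the extra "copy" file is added for illustration:

```shell
dir=$(mktemp -d)
trap 'rm -rf "$dir"' EXIT
echo "hello" > "$dir/s1"
echo "mello" > "$dir/sw"
cp "$dir/s1" "$dir/copy"

# Exit status 0 means identical, 1 means different.
if cmp -s "$dir/s1" "$dir/copy"; then
    echo "s1 and copy are identical"
fi
if ! cmp -s "$dir/s1" "$dir/sw"; then
    echo "s1 and sw differ"
fi
```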

paste
paste is a Linux command that merges the corresponding lines of files, separated by tabs:

[root@vx111a ~]# echo "hello" > s1
[root@vx111a ~]# echo "mello" > sw
[root@vx111a ~]# paste s1 sw
hello mello
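paste can also use a different delimiter, or join all the lines of a single file into one. The sketch below uses two hypothetical three-line files.

```shell
dir=$(mktemp -d)
trap 'rm -rf "$dir"' EXIT
printf 'Shoes\nLaces\nSocks\n' > "$dir/names"
printf '40.00\n1.00\n2.00\n' > "$dir/prices"

paste "$dir/names" "$dir/prices"          # columns separated by a tab
paste -d',' "$dir/names" "$dir/prices"    # Shoes,40.00 and so on
paste -s -d',' "$dir/names"               # Shoes,Laces,Socks
```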

expand
The expand command converts tabs to spaces, using tab stops every 8 columns by default.

unexpand
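The unexpand command does the reverse and converts spaces to tabs. By default it converts only the leading blanks of each line; with -a it also converts runs of two or more blanks that end at a tab stop. Unfortunately, this means you cannot use unexpand to replace the single spaces in text1 with tabs.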
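Since tabs and spaces look the same on screen, the sketch below pipes the output through a small helper (show, a name made up here) that prints tabs as ">" and spaces as ".".

```shell
# Show tabs as '>' and spaces as '.' so the whitespace is visible.
show() { tr '\t ' '>.'; }

printf 'a\tb\n' | expand -t 4 | show        # a...b  (tab stops every 4 columns)
printf 'a%7sb\n' '' | unexpand -a | show    # a>b    (seven spaces end at a tab stop)
printf 'a b\n' | unexpand -a | show         # a.b    (a single space is left alone)
printf '%8sx\n' '' | unexpand | show        # >x     (leading blanks, no -a needed)
```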


More to come. Happy learning :-)