
Wednesday, February 17, 2010

Awk


Today we will work with awk, the Unix command-line tool. This article gives you a basic understanding of how awk works and of some of its internal structure.

Awk is a simple and elegant pattern scanning and processing language. It was created in the late 1970s. The name is composed of the initial letters of the surnames of its three original authors: Alfred V. Aho, Peter J. Weinberger, and Brian W. Kernighan. It is commonly used as a command-line filter in pipes to reformat the output of other commands.

Some other features of awk are:
Its ability to view a text file as a textual database made up of records and fields.
Its use of variables to manipulate the database.
Its use of arithmetic and string operators.
Its use of common programming constructs such as loops and conditionals.
Its ability to generate formatted reports.

awk takes two inputs: a data file and a program file (the commands).
The program file can be absent, in which case the necessary commands are passed as a command-line argument.

An awk invocation has the following command-line syntax:

awk 'pattern { action }' filename

The pattern represents what awk is looking for in the data, and the action is a series of commands executed when a match for the pattern is found. Curly braces enclose the action; they can be left out when a rule consists of a pattern alone, and they group a series of instructions that belong to a specific pattern.

Understanding Fields: a common use of awk is to process files by formatting and then displaying the necessary data from them. awk separates each input file into records. A record is a single line of input, and each record contains multiple fields. By default the field separator is any run of spaces or tabs, and it can be changed.
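The remark that the field separator can be changed is easiest to see with a small experiment. The sketch below is only an illustration: the comma-separated sample, the temporary file and the variable name csv_out are made up for the example, and -F is the standard awk option for setting the separator on the command line.

```shell
# Hypothetical comma-separated sample, written to a temporary file.
tmp=$(mktemp)
printf '%s\n' 'jagadesh,10,2010' 'kiran,30,1997' > "$tmp"

# -F, tells awk to split each record on commas instead of whitespace.
# $1 is the first field and $3 the third field of each record.
csv_out=$(awk -F, '{ print $1 ": " $3 }' "$tmp")
echo "$csv_out"

rm -f "$tmp"
```

Without -F, each of these lines would be a single field, because the lines contain no spaces or tabs.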

Let’s see a simple text file to illustrate awk. Create a file student with the following data
jagadesh  10    2010  h
kiran        30    1997  k
pavan      123  10000 n
pavan      345   2009 j
jagan       345  400   p
madan    345   2007  k
naren     1009  1200 l
gagan    234   100    m

Now we will try a simple awk command: awk '{print $1 " " $2}' student
This command prints the first and second fields of every record. We passed awk a program (here just an action, with no pattern) and a file name as arguments.

Try this  : awk '{print "first Name is MR." $1 " " $2}' student

Working With Patterns: an awk rule can contain a pattern and a procedure,

pattern { procedure }

Both are optional. If the pattern is missing, { procedure } is applied to all lines; if { procedure } is missing, the matched line is printed.

A pattern can take the following forms:
/regular expression/
Relational expression
Pattern-matching expression
BEGIN and END

Regular Expression: let's search for a string in the file above. To use a regular expression, we write it as /string to search/. Let's write a simple command to search for "jagadesh":
awk '/jagadesh/' student : this command searches for the string jagadesh and prints every record that contains it.

Relational expression: we can use a relational expression to retrieve results. Let's see a simple one: awk '$2==10' student

We can also print only the fields that we are interested in: awk '/kiran/ {print $2}' student
This prints the second field of every record that matches the pattern.

Multiple commands can be run on the same set of data by putting a ; between them, like
 awk '/10/ {print $1 ;print $2}' student : this prints the first field and then the second field of every record in the student data file that matches the pattern.

We can also insert separators such as a newline '\n' or a tab '\t' between fields to display the data in an appropriate way.

Searching data with multiple patterns is also possible with awk. This can be done by including a '|' (alternation) in the regular expression, as in awk '/jagadesh|2010/' student : here I am searching for records that contain either jagadesh or 2010.

Now let's try a more advanced example, searching for k in the file:
awk '/k/' student : this returns all the records with 'k' anywhere in them. From the data file above we get kiran and madan, which have a k somewhere in their records. But I want only the records whose first field contains 'k'. This is where pattern-matching expressions come in.

Pattern-Matching Expressions: as said above, if we need to check a particular field against our pattern, we use a pattern-matching expression such as

awk '$1 ~ /k/' student : the "~" (tilde) operator makes sure that k is searched for in the first field only. This gives us only 1 record, [kiran].

Similarly, we can search for a 4 in the third field:
awk '$3 ~ /4/' student

The opposite of the tilde operator is the negation operator '!~', which returns all the records that do not match, like
awk '$1 !~ /k/' student : displays all the records that don't have 'k' in their first field.
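To try the ~ and !~ operators without typing the data by hand, a small self-contained script can recreate the student file first. This is just a sketch: the temporary file and the variable names first_has_k and fourth_not_k are made up for the illustration.

```shell
# Recreate the student data from the article in a temporary file.
student=$(mktemp)
cat > "$student" <<'EOF'
jagadesh 10 2010 h
kiran 30 1997 k
pavan 123 10000 n
pavan 345 2009 j
jagan 345 400 p
madan 345 2007 k
naren 1009 1200 l
gagan 234 100 m
EOF

# ~ : print the first field of records whose FIRST field contains a k.
first_has_k=$(awk '$1 ~ /k/ { print $1 }' "$student")
echo "first field contains k: $first_has_k"

# !~ : count the records whose FOURTH field does not contain a k.
fourth_not_k=$(awk '$4 !~ /k/ { n++ } END { print n }' "$student")
echo "records without k in field 4: $fourth_not_k"

rm -f "$student"
```

Compare this with awk '/k/' student, which matches k anywhere in the record.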

Before going on to the other pattern forms, 'BEGIN' and 'END', we will have a look at the built-in variables and operators that awk supports.

Built-in Variables: awk provides some built-in variables which can be used while performing a search on a data file. These are the built-in variables available in awk:

FS: field separator
NF: number of fields in the current record
NR: number of the current record
OFMT: output format for numbers ("%.6g" by default)
OFS: output field separator
ORS: output record separator
RS: record separator
$0: entire input record
$n: nth field in current record

We can use the awk built-in variables to get better results. If we need to select data depending on the number of fields, we can write awk 'NF==4' student, which gets all the records that have 4 fields.
awk 'NF==4 && /jagadesh/' student : retrieves all the records that have 4 fields and contain 'jagadesh'.
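The built-in variables are easier to remember after seeing them in action. The following sketch prints NR, NF and the first field for the first two records, and then the last field of the first record via $NF. The temporary file and the variable names nr_nf and last_field are assumptions made for the example; -v is the standard awk option for assigning a variable before the program starts.

```shell
# Recreate the student data from the article in a temporary file.
student=$(mktemp)
cat > "$student" <<'EOF'
jagadesh 10 2010 h
kiran 30 1997 k
pavan 123 10000 n
pavan 345 2009 j
jagan 345 400 p
madan 345 2007 k
naren 1009 1200 l
gagan 234 100 m
EOF

# NR is the record number, NF the number of fields in that record.
# OFS is placed between the comma-separated arguments of print.
nr_nf=$(awk -v OFS=':' 'NR <= 2 { print NR, NF, $1 }' "$student")
echo "$nr_nf"

# $NF is the field whose number is NF, that is, the last field.
last_field=$(awk 'NR == 1 { print $NF }' "$student")
echo "$last_field"

rm -f "$student"
```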

Operators: the following operators are available in awk, listed from lowest to highest precedence:
= += -= *= /= %= ^= **=         : Assignment
||                                              : Logical OR (short-circuit)
&&                                            : Logical AND (short-circuit)
~ !~                                          : Match regular expression and negation
< <= > >= != ==                      : Relational operators
(blank)                                       : Concatenation
+ -                                           : Addition, subtraction
* / %                                        : Multiplication, division, and modulus (remainder)
+ - !                                         : Unary plus and minus, and logical negation
^ **                                         : Exponentiation
++ --                                        : Increment and decrement, either prefix or postfix
$                                              : Field reference

BEGIN and END: BEGIN and END pattern rules can be applied to get better results. A BEGIN rule is executed before the first record is read, and an END rule is executed after all records are read. Normally, awk executes each block of your script's code once for each input line. However, there are many programming situations where you may need to execute initialization code before awk begins processing the text from the input file. For such situations, awk allows you to define a BEGIN block. Because the BEGIN block is evaluated before awk starts processing the input file, it's an excellent place to initialize the FS (field separator) variable, print a heading, or initialize other global variables that you'll reference later in the program.

Awk also provides another special block, called the END block. Awk executes this block after all lines in the input file have been processed. Typically, the END block is used to perform final calculations or print summaries that should appear at the end of the output stream.

Let's see a simple example:

awk '
BEGIN { print "jagadesh" }
/jagadesh/
END { print "done" }' student

Here I am searching for the pattern /jagadesh/. Before the first record is read, "jagadesh" is printed; then each record is searched and the matching records are printed. After all records are read, "done" is printed.

An awk program may have multiple BEGIN and/or END rules. They are executed in the order in which they appear: all the BEGIN rules at startup and all the END rules at termination. BEGIN and END rules may be intermixed with other rules. Multiple BEGIN and END rules are useful for writing library functions, because each library file can have its own BEGIN and/or END rule to do its own initialization and/or cleanup. The order in which library functions are named on the command line controls the order in which their BEGIN and END rules are executed.
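A typical use of BEGIN and END together is a small report with a heading and a total. The sketch below assumes the student data from this article; the temporary file, the heading text and the variable name report are made up for the illustration.

```shell
# Recreate the student data from the article in a temporary file.
student=$(mktemp)
cat > "$student" <<'EOF'
jagadesh 10 2010 h
kiran 30 1997 k
pavan 123 10000 n
pavan 345 2009 j
jagan 345 400 p
madan 345 2007 k
naren 1009 1200 l
gagan 234 100 m
EOF

report=$(awk '
BEGIN { print "name value" }        # runs once, before the first record
      { sum += $2; print $1, $2 }   # runs for every record
END   { print "sum", sum }          # runs once, after the last record
' "$student")
echo "$report"

rm -f "$student"
```

The variable sum needs no initialization: an unset awk variable counts as 0 in arithmetic.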

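Some examples:

Print the first names in the data file:

awk '
BEGIN { print "First Names" }
{ print $1 }' student

Display the sum of columns 2, 3 and 4 for each record (a non-numeric field, such as the letter in column 4, counts as 0):

awk '
BEGIN { print "Sums" }
BEGIN { print "------" }
{ print $2+$3+$4 }
END { } ' student

Count the records that contain jagadesh:

awk '/jagadesh/ {++x} END {print x}' student

Add 2 to a running total for every record and print the total at the end:

awk '{total +=2 } END {print total }' student

Many more examples can be built along these lines.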

Empty Pattern: an empty pattern is considered a match for every record in the file.
awk '{ print $0 }' student

Variables: variables in awk are assigned with the "=" operator, like
FS=","

Arrays: arrays in awk are associative arrays, that is, each element consists of an index and a value associated with that index.

     Element 3     Value 30
     Element 1     Value "foo"
     Element 0     Value 8
     Element 2     Value ""

The pairs are shown in jumbled order because their order is irrelevant. One advantage of associative arrays is that the elements can be added at any time.

An array can be created by assigning to a single element,

arr[0]="jagadesh"

or by filling it in a loop:

for(i=0;i<5;i++)
    arr[i]=i

Iterating over arrays: awk has a handy mechanism for iterating over arrays, a special form of the for construct:
  for(x in myarray)
    print myarray[x]

Elements in the array can be deleted with the delete statement:
  delete myarray[1]
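Putting the array pieces together, here is a small sketch that counts how often each first name occurs in the student file, prints the names that occur more than once, and then deletes one element. The array name seen, the temporary file and the shell variable counts are hypothetical names chosen for the example.

```shell
# Recreate the student data from the article in a temporary file.
student=$(mktemp)
cat > "$student" <<'EOF'
jagadesh 10 2010 h
kiran 30 1997 k
pavan 123 10000 n
pavan 345 2009 j
jagan 345 400 p
madan 345 2007 k
naren 1009 1200 l
gagan 234 100 m
EOF

counts=$(awk '
{ seen[$1]++ }                      # the index is the name, the value its count
END {
    for (name in seen)              # visits every index, in no particular order
        if (seen[name] > 1)
            print name, seen[name]
    delete seen["pavan"]            # remove one element
    for (name in seen)
        left++
    print "left", left              # number of elements remaining
}' "$student")
echo "$counts"

rm -f "$student"
```

Because the index is a string here, no array size has to be declared up front; elements appear the first time they are referenced.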

Escape Sequences:
Within string and regular expression constants, the following escape sequences may be used. Note: The \x escape sequence is a common extension.

Sequence   Meaning
\a         Alert (bell)
\v         Vertical tab
\b         Backspace
\\         Literal backslash
\f         Form feed
\nnn       Octal value nnn
\n         Newline
\xnn       Hexadecimal value nn
\r         Carriage return
\"         Literal double quote (in strings)
\t         Tab
\/         Literal slash (in regular expressions)


Functions: let's move on to the more advanced concept of using functions and writing our own.
There are 2 types of functions available:
Built-in
User-defined

Built-in functions fall into 3 groups: I/O, string and math. To call one of awk's built-in functions, write the name of the function followed by its arguments in parentheses. A simple example is

awk '{ print sqrt(16) }' student

awk provides functions that work on numbers, like sin(x), cos(x) and sqrt(x), and string functions for getting the length of a string, splitting a string, etc. Implementations such as gawk additionally offer I18n functions and functions on time and date.
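As a rough sketch of both kinds of function, the script below calls the built-ins split, sqrt, length, toupper and substr, and defines one user-defined function. The function name initial, the temporary file and the variable name func_out are made up for the illustration.

```shell
# Recreate the first records of the student data in a temporary file.
student=$(mktemp)
cat > "$student" <<'EOF'
jagadesh 10 2010 h
kiran 30 1997 k
pavan 123 10000 n
EOF

func_out=$(awk '
# User-defined function: returns the first letter of s in upper case.
function initial(s) { return toupper(substr(s, 1, 1)) }

BEGIN {
    n = split("a:b:c", parts, ":")  # split returns the number of pieces
    print n, parts[2], sqrt(16)
}
NR <= 2 { print initial($1), length($1) }
' "$student")
echo "$func_out"

rm -f "$student"
```

A user-defined function is declared with the function keyword outside of any rule and is then called exactly like a built-in one.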

These are the basics of awk.

More detailed information can be found at
http://www.gnu.org/manual/gawk/html_node/index.html

Friday, February 5, 2010

Java Performance Tips

Today's tip is about performance in Java. Here are a few things to watch out for. It is the little things that matter in the long run, so please pay attention:

1. Loops. Avoid nested loops wherever you can. Nested loops are VERY expensive.
2. For a constant number of items, use arrays instead of ArrayList. Only use ArrayList if you have a variable number of elements.
3. If you are using ArrayList, and you know you will have a lot of items in it, use the constructor that takes an initial capacity. If you don't supply one, Java starts the ArrayList with room for 10 items; when the list grows beyond its current capacity, it allocates a larger backing array and copies all the elements over from the old array to the new one. This is a very expensive operation. Try to start an ArrayList with a capacity corresponding to the elements you already have, or start with a higher number. Just make sure you don't start with a number TOO HIGH, because there is a penalty for that. If you start with a number like 20 and only have 2 elements, then 18 more slots have already been allocated in memory, which wastes space. So, if you get 25 objects back from Hibernate, and you want to put those elements in an ArrayList, then start with new ArrayList(25). DO NOT USE THIS TECHNIQUE WITH HASHMAPS. You should let them create themselves with their own size limitations and never enforce a size.
4. As much as you can, create local variables inside methods or blocks. Every time you create a variable, it lives until the closing }. So, if you want to create a variable in a class CalendarEntry and you are not sure whether to create it at the class level (static or instance variable) or inside the method that will be using it, choose the local variable inside the method. Only create variables at the class level if you really need to. The reason is that a local variable does not take up space until the method is called, and as soon as the method exits, the local variable dies, saving us space in memory.
5. Creating objects using the "new" keyword is very expensive in Java. So, try to reuse objects as much as you can, especially in a loop. If you are declaring a temporary object such as Date or Calendar inside a loop just to do some calculations, then you may want to take the declaration outside the loop and reuse the same object over and over again. This avoids one allocation per iteration.
6. Use StringBuilder if you are concatenating to a String (adding more strings to it using +) instead of String or StringBuffer.
7. Use Arrays class for sorting, searching, etc. as much as you can on an array of objects. And use Java's algorithms as much as you can because they are optimized for speed.
8. If a class needs to do heavy initialization of stuff, create static blocks just inside the class definition so those initializations get done at the class load level:

public class Test {

static{
//initialize some class level properties, etc.
}
...
}

9. Use paging and streaming as much as you can rather than loading ALL objects in an ArrayList or a Map. If you load all objects, they are all in memory taking too much space. If you are streaming to the client 10 objects at a time, or by using paging 10 at a time, then we are going to Hibernate to get the next set of objects as we need them. If the user does not click on the next page, then he only needed to see 10 not all 300 objects for example. Let's never load all objects from Hibernate in memory, and instead use Hibernate to get the next set of objects and consume them and only come back and get the next set when the user clicks next, etc.
10. If you have a bunch of boolean conditions in an if statement, put the condition that is most likely to be false in front. With AND conditions (&&), the JVM then evaluates only the first condition and skips the rest of the expression if it is false. The JVM evaluates the next condition ONLY if the previous condition is true. For example:

if (condition1 && condition2 && condition3)

In the statement above, if the first condition (condition1) is false, the JVM stops evaluating the expression: since we are using AND conditions and the first one is false, it doesn't matter what conditions 2 and 3 are, so there is no need to evaluate the rest. So, put the condition that is most likely to be false at the beginning, to avoid having the JVM evaluate condition1, then condition2, only to find out at the end that condition3 is false. Do the same for OR conditions (||), except put the condition that is most likely to be true at the beginning.

11. Static variables load faster than instance variables. So, if you have a variable that depends on the class and not individual instances, then make sure you declare it as static.
12. Use System.arraycopy() to copy arrays. In general, anything you want to do, or any algorithm, first look it up to see if Java does it, before you try to do it yourself because Java will do it the best.
13. For values that will hold numbers that are small use byte or short instead of int or long as much as possible. Only use int if the values can be more than what a short or byte can handle. Think in the future. So, if we have a constant number of value 5 or something like that then there is no need to create an int. Use a smaller value variable. The same for float and double. Check how much each of those variables can hold and make a proper analysis before using the correct type.
14. If you have a loop from n to m, it can be faster to loop from m down to n, decrementing the counter, rather than going from n to m and incrementing it. The loop condition is checked on every cycle, and the saving comes from the test itself: a comparison against a small constant, ideally 0, is cheaper than a comparison against a variable upper bound (see tip 32).
15. Make sure all transactions are closed at the end, and all connections are closed. Leaving connections open may cause problems, such as running out of memory later.
16. Use lazy loading as much as you can. This means that unless the client needs it, don't load additional objects. So, if the user is viewing in daily view on the calendar, we don't need to send appointments for a different day than what we are viewing.
17. Use generics as much as you can.
18. Before being done with a class, always press "Ctrl + Shift + O" in Eclipse (Organize Imports); this removes all import statements that are not needed. Unused imports are resolved at compile time and do not load classes at runtime, but removing them keeps the class clean and avoids misleading dependencies.
19. Declare methods that will not be overridden "final". This helps the compiler optimize code.
20. Avoid processing blocks of code if you don't need to. Sometimes you may need to move code inside a conditional statement to avoid having to execute it for cases that are not needed. For example, a staff user does not need to see all practices on the web site, so there is no point in having code load all practices on the server side and then not send them to the client. Instead, we should NOT LOAD those practice objects in memory if the current user is staff, since he will not see them. So, be smart about when to call Hibernate to load objects in memory. The fastest thing you can do is not load what is not needed, rather than loading objects and not sending them to the client.
21. Use static final when creating constants.
22. Remove all System.out.println from production ready code. For testing it is fine, but not when it is ready to go to production.
23. Use annotations as much as you can because in Java 1.6 they are optimized for speed.
24. Replace strings and other objects with integer constants. Compare these integers by identity.
25. Avoid initializing instance variables more than once.
26. Use short-circuit boolean operators instead of the normal boolean operators. [i += 1 is faster than i = i + 1]
27. Use temporary local variables to manipulate data fields (instance/class variables).
28. String.equals() is expensive if you are only testing for an empty string. It is quicker to test if the length of the string is 0.
29. Avoid character processing using methods (e.g. charAt(), setCharAt()) inside a loop.
31. Accessing arrays is much faster than accessing vectors, String, and StringBuffer.
32. Perform the loop backwards (this actually performs slightly faster than forward loops do).
[Actually it is converting the test to compare against 0 that makes the difference].
33. Use only local variables inside a loop; assign class fields to local variables before the loop.
34. Shifting is faster than multiplying by powers of two.
35. Multiplication is faster than exponentiation.
36. int increments are faster than byte or short increments.
37. Floating point increments are much slower than any integral increment.
38. Use -Xms and -Xmx to tune the JVM heap. Bigger means more space but GC takes longer. Use the GC statistics to determine the optimal setting, i.e. the setting which provides the minimum average overhead from GC.
39. StringBuffer default capacity is 16 chars. Set this to the maximum expected string length.
40. Initialize expensive arrays in class static initializers, and create a per-instance copy of this array initialized with System.arraycopy().
41. Use charAt() instead of startsWith() when you are checking for a single character at the start of a String.
42. Use the print() method rather than the println() method.
43. volatile fields can be slower than non-volatile fields, because the system is forced to store to memory rather than use registers. But they may be useful to avoid concurrency problems.
44. One way to avoid creating objects simply for information is to provide finer-grained methods which return information as primitives. This swaps object creation for increased method calls.
45. A second technique to avoid creating objects is to provide methods which accept dummy information objects that have their state overwritten to pass the information.
46. A third technique to avoid creating objects is to provide immutable classes with mutable subclasses, by having state defined as protected in the superclass, but with no public updators. The subclass provides public updators, hence making it mutable.

A few Hibernate performance tips:

1 - Composite keys must always be mapped only by identifiers, not associations, because when you map an association to a primary key, every time you execute a session.find on this object, all the associations defined in the composite key will be eagerly loaded.
2 - Always provide a reasonable BatchSize.
3 - Always use Hibernate.initialize(property) before trying to access the lazy properties.
4 - Always prefer the session.merge mechanism over session.save, session.update or session.saveOrUpdate, to prevent abnormal use of the session (like the infamous NonUniqueObjectException).
5 - Provide access to the Hibernate API in the Data Access Layer, to provide extensibility at the points where Hibernate has poor or no available resources (like providing named queries, native queries, stored procedure executions, updating lock modes on the fly and using the evict and initialize features).
6 - Always prefer to use named queries instead of string parsed queries, for cache reasons, and to prevent HQL injection.
7 - Always prefer to use HibernateCallback in conjunction with the hibernateTemplate.
8 - Use the right Collection for the association needed (Set if you have to provide a collection without duplicates, List if duplicates are allowed, Map if you have to provide key/value pairs).
9 - Use HQL correctly to load only what is needed, excluding all unused data, to be performant.
10 - Use @Cache on class level and on association level.

Happy Coding...

Wednesday, February 3, 2010

Java Tips

1. We have the option of instance initializers, which look like this:
@SuppressWarnings("serial")
ArrayList<String> simpleList = new ArrayList<String>() {{
add("jagadesh");
add("proKarma");
add("software");
}};
This lets you populate the ArrayList while constructing it.
2. The instanceof keyword has a handy property: it is implemented in such a way that there is no need to check for null.
Consider
if (someObject != null && someObject instanceof String) { }
This is equivalent to
if (someObject instanceof String) { } which also covers the null check.
3. It is possible in Java to call private methods and access private fields by using reflection. This snippet shows the syntax.
public class Foo {

private String myName;
public String yourName;

public String getMyName() {
return myName;
}

private void setMyName(String myName) {
this.myName = myName;
}
public String getYourName() {
return yourName;
}
public void setYourName(String yourName) {
this.yourName = yourName;
}

@Override
public String toString() {
return myName;
}

}

And the class with main:
import java.lang.reflect.InvocationTargetException;
import java.lang.reflect.Method;

public class SampleClass {

public static void main(String[] args) throws SecurityException, NoSuchMethodException, IllegalArgumentException, IllegalAccessException, InvocationTargetException {

Foo obtainedFoo = new Foo();
// Look up the private setter by name and parameter type.
Method obtainedMethod = Foo.class.getDeclaredMethod("setMyName", String.class);
// Switch off the access check so that the private method can be invoked.
obtainedMethod.setAccessible(true);
obtainedMethod.invoke(obtainedFoo, "jagadesh");

System.out.println(obtainedFoo.getMyName());
}
}

4. Shutdown hooks:
We can register a thread that is created immediately but started only when the JVM shuts down. It looks like this:
Runtime.getRuntime().addShutdownHook(new Thread(){
public void run() {
System.out.println("Jvm Exit");
}
});

Will be back with some more tips , Happy Coding