The input program structure of lex contains three parts namely
- Declaration section (or definition section starts with "%{" and ends with "%}" )
- Rules section ( starts with "%%" and ends with "%%" )
- Subroutine section (or c code section)
a lex file would have the extension ".l"
the ".l " file is given as input to the lex it will generate a file with extension ".c" named as "lex.yy.c"
to put it in a more precise way Lex source is a table of regular expressions and corresponding program fragments. The table is translated to a program which reads an input stream, copying it to an output stream and partitioning the input into strings which match the given expressions. As each such string is recognized the corresponding program fragment is executed. The recognition of the expressions is performed by a deterministic finite automaton generated by Lex. The program fragments written by the user are executed in the order in which the corresponding regular expressions occur in the input stream.
- A simple pattern: letter(letter|digit)*
- Regular expressions are translated by lex to a computer program that mimics an FSA.(Finite State Automata)
- This pattern matches a string of characters that begins with a single letter followed by zero or more letters or digits.
now let us understand how the lex program behaves by considering an program
let us consider a ,
program to count no of vowels or consonants in the input string
1.%{ #include<stdio.h>
2. int vow=0,cons=0;
3.%}
4.%%
5. \n {return;}
6. [ \t]+ {;}
7. [aeiouAEIOU] {vow++ ;}
8. [a-zA-Z] {cons++;}
9. . ;
10.%%
11. main()
12.{ printf(“enter a string \n“);
13. yylex();
14. printf(“no of vowels=%d \n”,vow);
15. printf(“no of consonants=%d”,cons);
16.}
- lines 1-3 in the above program indicate "declaration section"
- the symbol "%{" in line 1 indicate starting of declaration section
- the symbol "%}" in line 3 indicate ending of declaration section
- in declaration section we declare the preprocessor statements to be used and data types of the variables which we will be using in the program. here we have used "standard I/O header" (a preprocessor statement) and we will be using variables "vow,cons " which are of type " int"
- lines 4-10 indicate "rules section" of the program
- the symbol "%%" in line 4 indicates starting of rules section
- the symbol "%%" in the line 5 indicate ending of rules section
- it's in the rules section we make use of regular expressions which are translated to C functions or program by DFA
- in the above program we hv made use of 5 regular expressions their usage in the program is as explained below
- the expression in line 5 means whenever "new line" is encountered return the control
- the expression in line 6 means whenever "blank space" or "tab" is encountered ignore them
- the expression in line 7 indicates whenever "either 'a' or 'e' or 'i' or 'o' or 'u' or 'A' or 'E' or 'I' or 'O' or 'U' is encountered increment the vowel count(vow) by one
- the expression in line 8 indicates whenever any alphabet between either "a to z" or "A to Z" is encountered increment consonants count by one
- some of us might get a doubt when we say whenever any alphabet between "a to z" or "A to Z" is encountered if we increment consonant count ."even vowels are alphabets between "a to z" or "A to Z" then whenever it encounters vowels also it will increment consonants by one " .But this assumption is wrong here whenever a expression in "line 7" is executed the expression in "line 8" wont execute i.e . whenever a preceeding regular expression is satisified the expression which is next to(following) it wont be executed .
- the expression in line 9 indicates any other character other than newline ignore it
- lines 11-16 indicate "subroutine section" .this section conatins legal " C" code and inbuilt function called "yylex()" is called in this section
compilation and working of lex program
There are two steps in compiling a Lex source program. in the first step following actions take place
- First, the Lex source must be turned into a generated program in the host general purpose language. this is achieved by using "lex" command the general syantax of this command is as shown below SYNTAX: lex program name (with extension .l) The lex command passes C code, unchanged, to the lexical analyzer in the following circumstances:
- Lines beginning with a blank or tab in the definitions section, or at the start of the rules section before the first rule, are copied into the lexical analyzer. If the entry is in the definitions section, it is copied to the external declaration area of the lex.yy.c file. If the entry is at the start of the rules section, the entry is copied to the local declaration area of the yylex subroutine in the lex.yy.c file.
- Lines that lie between delimiter lines containing only %{ (percent sign, left brace) and %} (percent sign, right brace) either in the definitions section or at the start of the rules section are copied into the lexical analyzer in the same way as lines beginning with a blank or tab.
- Any lines occurring after the second %% (percent sign, percent sign) delimiter are copied to the lexical analyzer without format restrictions.
- when above actions are completed all the expressions specified in rules section are converted to "C" statements and are placed in funtion "yylex()" .
the second step in complilation results in following actions
- then this program must be compiled and loaded, usually with a library of Lex subroutines. The generated program is on a file named lex.yy.c. The I/O library is defined in terms of the C standard library
- The library is accessed by the loader flag -ll. So an appropriate set of commands to lex source is "cc lex.yy.c -ll" The resulting program is placed on the usual file a.out for later execution
in the next post i will post about how the other inbuilt functions in lex work. i hope the the above article would be helpful in understanding lex