Lexical analysis identifies the tokens of a language. In this handout, we're going to describe how to create a lexer (Lexical Analyzer), a computer program to do lexical analysis. We'll describe the primary tasks and we'll even give Pascal code fragments for some tasks. However, the bulk of the programming will be left to you.
1 The Grammar
Before we start, we need to know the language that we'll be analyzing. We'll use the language described by the following grammar, based on the grammar introduced in your textbook:
Grammar EXAMPLE
Rule 1a: <statement>
->
<identifier> := <operand>
Rule 1b: <statement> -> <identifier> := (<operand> +
<operand>)
Rule 1c: <statement> -> <identifier> := (<operand> *
<operand>)
Rule 1d: <statement> -> READ <identifier>
Rule 1e: <statement> -> WRITE <identifier>
Rule 2a: <operand> -> <identifier>
Rule 2b: <operand> -> <integer>
Rule 3: <identifier> -> a sequence of letters and/or digits
beginning with a letter
Rule 4: <integer> -> a sequence of one or more decimal
digits
When parsing a statement of this language, we always start with the start symbol, <statement>. Then we can parse statements like "A := (B + C)" and "WRITE A". The tokens of the language are the symbols like "(" and ":=" which are not enclosed in angle brackets "<>". Tokens are analogous to words in English. Rules 3 and 4 define infinite sets of tokens - these are called identifiers or integers (without the "< >").
1. List all of the tokens, except for the identifiers and integers, used in Grammar EXAMPLE. We won't ask you to list all the identifiers and integers!
2. The statements of language EXAMPLE represent addition and multiplication operations as well as simple input and output. Suppose you wanted to modify Grammar EXAMPLE to allow subtraction and division expressions as well. What new grammar rules would you introduce? What new tokens would these rules introduce?
3. Which of the following can be parsed using Grammar EXAMPLE? Show how to parse those that can be parsed and explain why the others can't be parsed.
2 Lexical Analysis
The job of a lexer is to read an input string and separate it into tokens from the input language. That's all it does - it doesn't attempt to do any grammatical analysis. For example, the string "Name := (" is ungrammatical, yet a lexer for Grammar EXAMPLE would return the tokens "Name", ":=" and "(". In Pascal, the lexer could return the tokens in an array of strings. Here's a Pascal code fragment showing a procedure header for a lexer:
TYPE
TokenArray = ARRAY [1..MaxTokens] OF STRING;
PROCEDURE Lexer (Inp: STRING; VAR Tokens: TokenArray; VAR Count: Integer);
In the header, Inp is the input string (the current line of code being analyzed), Tokens is the array used to return the tokens, and Count will hold the number of tokens. Knowing the specifications for Lexer allows us to write code that we can use to test our lexer as we build it.
4. Write a program that includes the constant and type definitions, the procedure header and the code shown above. Write the procedure PrintTokens. For now, the body of the Lexer procedure can consist of nothing more than BEGIN END. You will use this program as a driver to test the pieces of the Lexer procedure as you write them.
Now, what does Lexer do? It examines the beginning of Inp to see if there's a token there. For Grammar EXAMPLE , Lexer can identify a token by checking the first one or two characters. After it finds a token, Lexer removes it from the beginning of Inp. For example, here's a code fragment to handle the token "(":
END;{Lexer}
The lexer continues to find and remove tokens until Inp is empty. Then it returns the token array.
5. All fixed length tokens can be handled by code similar to the above example. In fact, the above fragment can easily be modified to handle all one-character tokens. (Hint: Only the second argument to the call to POS needs to change). Modify the above fragment so that it takes care of all one-character tokens.
6. Both in Grammar EXAMPLE and in Pascal itself, all fixed-length tokens are either one or two characters long. In Grammar EXAMPLE, in fact, there is only one two-character token (the assignment operator ':='). Write a similar Pascal code fragment to handle this two-character token.
7. For this exercise, disregard identifier and integer tokens and assume that Grammar EXAMPLE has only one- or two-character tokens. Using your code fragments from exercises 5 and 6, write a loop for Lexer that finds all fixed-length tokens in Grammar EXAMPLE and places them in the Tokens array. Be sure your loop exits when all tokens have been found. When you type a line of input to test this code, be sure to type only fixed-length tokens.
Other tokens are not quite as simple to check. Identifiers, for example, don't have to start with any particular character. Instead, they can start with any letter. To check for an identifier, Lexer checks to see if the first character of Inp is a letter. This is a classic example of the use of the POS function.
BEGIN
{...}
FirstChar := COPY (Inp, 1, 1);
IF POS (FirstChar, Letters) > 0 THEN
{the first token in Inp is an
identifier}
{...}
END; {Lexer}
Most of the time, computer languages are designed so that very limited checks (of the first few characters) can uniquely identify the current token.
8. In many computer languages, as in Grammar EXAMPLE, integer tokens are defined to be any sequence of one or more decimal digits (this disregards a possible sign preceding the digits). Identifying an integer token is very much like identifying an identifier token. Write a Pascal code fragment similar to the preceding fragment that identifies integer tokens.
Once it's known that a variable-length token (an identifier or an integer) is next, Lexer extracts all the token's characters from the beginning of Inp. For an identifier, the token characters are letters and digits. Using computerese, we say that the lexer spans the letters and digits at the front of Inp. The first step in spanning is to determine the length of the token. We can do this by writing a loop like the following:
{to find the length of an identifier,
SpanSet should be set to AlphaNum}
L := 0;
WHILE (L < LENGTH (Inp)) AND (POS (COPY (Inp, L+1, 1),
SpanSet) > 0) DO
L := L + 1;
When the loop finishes, L is the length of the identifier. After determining the length, the lexer can extract it in the same way it extracted the fixed-length tokens.
9. Why does Lexer check for an identifier using Letters, yet span using AlphaNum, which contains letters and digits?
10. Suppose that the next token is an integer. What will need to be changed in the preceding code fragment to span the characters that form an integer?
The action of spanning a sequence of characters is so common that it's useful to define a procedure, Span, just for that. Here's Span's procedure header:
In the header, Inp is of course the current line of input, SpanSet is the string containing the valid characters for the current identifier type (e.g., alphanumeric characters if we are using Span to extract an identifier token, or digits if we want to extract an integer). Token will be the extracted token and Inp will be modified such that the characters that form Token have been removed. Since we will want the new token to be copied to the token array, we will generally expect to see Span called with Token [Count] as its second argument.
Once Span has determined the length of the current token, it starts at the beginning of Inp, spans all the consecutive characters that are contained in SpanSet and places these spanned characters in Token. The remaining characters are copied as the new value of Inp. We've seen a code fragment that implements the first part of the Span procedure (determining the length of the token), now here's the fragment that makes the assignments to Token and Inp.:
Notice that either Token or Inp or both can be empty ('') when Span is finished.
11. What can cause Span to return an empty string in Token? Inp? Both?
12. Put the code fragments for Span together to make a complete Pascal Span procedure.
13. Since the lexer is designed to read input that is entered by a human, we have to allow for human foibles. For example, humans often like to make their typed text more readable by putting in blank spaces. Spaces aren't part of any token in language EXAMPLE, so they shouldn't be returned in the token array. Spaces still need to be identified, spanned and extracted from Inp, however. Write a code fragment that uses your Span procedure to handle blank spaces.
14. Suppose that the characters at the beginning of Inp aren't spaces or token characters. That means that the user entered an invalid character. In this event, the lexer should print an error message and extract the illegal character from Inp. Write a code fragment to do this.
15. Combine the code fragments in this handout with your answers to the previous exercises to build a working Pascal Lexer procedure for the language EXAMPLE.