Lexical analysis

From Compiler writing

Jump to: navigation, search

Lexical analysis takes as input a sequence of characters and maps them to a sequence of tokens.

In many real life situations it is not guaranteed that any of the sequence of characters will map to a token defined by the target language.



Generating a lexer using one of the available tools rather than manually creating one saves a few hours work. However, it is very difficult to provide good error recovery using these tools, compared to a hand written lexer.

Unless you can guarantee that your input will not contain lexical errors I would recommend writing a lexer by hand.

Lexical grammars are usually straight forward, there is little to be saved in implementing a subset and plenty of advantages for a complete implementation (e.g., no weird errors caused by processing source containing an unsupported tokens, you will probably spend just as much time telling people why you subsetted as you would have implementing the missing functionality)

Minimal to do list

  • Locate the specification of the lexical grammar for your chosen language.
  • Create a file that contains at least one instance of every token supported by your language.
  • Write a function that takes characters from somewhere (e.g., a file, a parameter) to form a single token and return an integer value representing that token.
  • Write a program that takes an input file (e.g., stdin or a filename specified on a command line) and prints out the tokens it contains (in numeric form).
  • You have successfully completed this phase when:
    • Feeding the file containing every token instance to the lexer program generates the correct output.
    • Feeding the executable file of your lexer as input to itself does not crash or go into an infinite loop.

Useful to do list

  • Provide a debug option that supports the printing of the last X characters read.
  • Provide a debug option that prints the characters contained within each token.

Extra to do list


  • Flex tool for generating lexers from a specification.

Test cases


Reading list

Most compiler books give a very brief discussion, if any, of lexical analysis.

  • flex manual
  • lex & yacc by Doug Brown, John Levine, Tony Mason, 2 edition (Oct 1992). Published by O'Reilly Media, Inc. ISBN-10: 1565920007.
  • flex & bison by John Levine is not due until October this year.

Advanced reading list

Personal tools