Lexical Analysis

In this lecture, we start by using regular expressions to specify the tokens of a language. To recognize strings matching these regular expressions, we construct the corresponding finite automata.

Regular Expressions

  • Union (r|s): matches either r or s.

  • Concatenation (rs): matches r followed by s.

  • Kleene Closure (r*): matches zero or more repetitions of r.

  • Positive Closure (r+): matches one or more repetitions of r.

  • Range Closure (r{m,n}): matches between m and n repetitions of r.

Precedence: Closure > Concatenation > Union.
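To make the operators and the precedence rule concrete, the short sketch below checks a few cases with Python's re module (the specific patterns are only illustrative):

```python
import re

# Union, concatenation, and the closures in standard regex syntax.
assert re.fullmatch(r"a|b", "b")          # union: matches "a" or "b"
assert re.fullmatch(r"ab", "ab")          # concatenation: "a" followed by "b"
assert re.fullmatch(r"a*", "")            # Kleene closure: zero or more "a"
assert re.fullmatch(r"a+", "aaa")         # positive closure: one or more "a"
assert re.fullmatch(r"a{2,4}", "aaa")     # range closure: between 2 and 4 "a"

# Precedence: closure binds tighter than concatenation, which binds tighter
# than union, so "a|bc*" is grouped as "a | (b(c*))".
assert re.fullmatch(r"a|bc*", "bccc")
assert not re.fullmatch(r"a|bc*", "ac")
```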

Lexical Specification

Two rules resolve conflicts when more than one token could be recognized from the same input:

  • Keywords vs. identifiers: keywords have higher priority than identifiers. For example, the string "if" should be recognized as <IF>, not as <ID (if)>.

  • Longest match (maximal munch): when multiple patterns match, choose the longest one. For example, "<=" should be recognized as the single token <LTE>, not as the two tokens <LT><EQ>, and "ifabc" should be recognized as <ID (ifabc)>, not as <IF> followed by <ID (abc)>. A small sketch combining both rules follows.
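The following is a minimal lexer sketch illustrating both rules, assuming a hypothetical token set (IF, ID, LTE, LT, EQ). It is not a full scanner, only an illustration of keyword priority and maximal munch:

```python
import re

# Patterns are listed so that, for equally long matches, earlier entries win:
# <IF> comes before <ID>, which gives keywords priority over identifiers.
TOKEN_SPEC = [
    ("IF",  r"if"),
    ("ID",  r"[A-Za-z][A-Za-z0-9]*"),
    ("LTE", r"<="),
    ("LT",  r"<"),
    ("EQ",  r"="),
    ("WS",  r"\s+"),
]

def tokenize(text):
    tokens = []
    pos = 0
    while pos < len(text):
        # Maximal munch: try every pattern at `pos` and keep the longest match;
        # ties are broken by the order of TOKEN_SPEC (keyword before identifier).
        best = None
        for name, pattern in TOKEN_SPEC:
            m = re.match(pattern, text[pos:])
            if m and (best is None or m.end() > best[1]):
                best = (name, m.end())
        if best is None:
            raise SyntaxError(f"unexpected character {text[pos]!r}")
        name, length = best
        if name != "WS":
            tokens.append((name, text[pos:pos + length]))
        pos += length
    return tokens

# "ifabc" becomes one identifier (longest match), "if" alone is the keyword,
# and "<=" is a single LTE token rather than LT followed by EQ.
print(tokenize("if ifabc <="))
# [('IF', 'if'), ('ID', 'ifabc'), ('LTE', '<=')]
```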

Lexical Parsing

Motivation: to recognize tokens mechanically, we need to convert regular expressions into finite automata.

Thompson's Construction

Below is an example of converting the regular expression for UNUM into an NFA:
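Independent of that example, the sketch below shows Thompson's rules for a single symbol, union, concatenation, and Kleene closure over a simple transition-table representation of the NFA (the state and fragment encoding here is only one possible choice):

```python
import itertools

EPS = None  # label for epsilon transitions
_new_state = itertools.count()

def new_states():
    """Return a fresh (start, accept) pair of state ids for a fragment."""
    return next(_new_state), next(_new_state)

def symbol(transitions, a):
    """NFA fragment accepting the single symbol `a`."""
    s, f = new_states()
    transitions.setdefault(s, []).append((a, f))
    return s, f

def union(transitions, frag1, frag2):
    """Thompson rule for r|s: new start/accept states linked by epsilon moves."""
    s, f = new_states()
    for (s_i, f_i) in (frag1, frag2):
        transitions.setdefault(s, []).append((EPS, s_i))
        transitions.setdefault(f_i, []).append((EPS, f))
    return s, f

def concat(transitions, frag1, frag2):
    """Thompson rule for rs: epsilon edge from the accept of r to the start of s."""
    transitions.setdefault(frag1[1], []).append((EPS, frag2[0]))
    return frag1[0], frag2[1]

def star(transitions, frag):
    """Thompson rule for r*: loop back over r and allow skipping it entirely."""
    s, f = new_states()
    transitions.setdefault(s, []).extend([(EPS, frag[0]), (EPS, f)])
    transitions.setdefault(frag[1], []).extend([(EPS, frag[0]), (EPS, f)])
    return s, f

# Example: build an NFA for (a|b)*. Each operator adds only epsilon edges and
# at most two new states, so the NFA stays linear in the size of the regex.
trans = {}
nfa = star(trans, union(trans, symbol(trans, "a"), symbol(trans, "b")))
print("start:", nfa[0], "accept:", nfa[1])
```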

Gap: using an NFA directly for lexical analysis is impractical, because on each input symbol many transition paths (including ε-moves) may need to be explored.

Converting NFA to DFA

Given an input string, we can simulate an NFA with a DFA by letting each DFA state represent the set of NFA states reachable at that point, including states reachable through ε-transitions.

Powerset Construction
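A minimal sketch of the powerset (subset) construction, assuming the same transition-table NFA representation as in the Thompson sketch above, with EPS marking ε-edges:

```python
from collections import deque

EPS = None  # epsilon label, matching the NFA sketch above

def epsilon_closure(transitions, states):
    """All NFA states reachable from `states` using only epsilon moves."""
    closure, stack = set(states), list(states)
    while stack:
        s = stack.pop()
        for label, t in transitions.get(s, []):
            if label is EPS and t not in closure:
                closure.add(t)
                stack.append(t)
    return frozenset(closure)

def move(transitions, states, a):
    """NFA states reachable from `states` on one occurrence of symbol `a`."""
    return {t for s in states for label, t in transitions.get(s, []) if label == a}

def subset_construction(transitions, start, alphabet):
    """Each DFA state is a set of NFA states; a DFA state is accepting iff it
    contains an accepting NFA state."""
    dfa_start = epsilon_closure(transitions, {start})
    dfa_trans, worklist, seen = {}, deque([dfa_start]), {dfa_start}
    while worklist:
        S = worklist.popleft()
        for a in alphabet:
            T = epsilon_closure(transitions, move(transitions, S, a))
            if not T:
                continue
            dfa_trans[(S, a)] = T
            if T not in seen:
                seen.add(T)
                worklist.append(T)
    return dfa_start, dfa_trans
```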

DFA Optimization: Merging Equivalent States

Hopcroft Partitioning Algorithm:

  • The algorithm starts from the initial partition {accepting states, non-accepting states} and keeps splitting blocks until no further splits can be made; a simplified sketch follows.
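Below is a simplified partition-refinement sketch. It implements the straightforward iterative refinement (Moore-style) rather than Hopcroft's worklist optimization, but it illustrates the same splitting idea; each block of the final partition becomes one state of the minimized DFA:

```python
def minimize(states, alphabet, delta, accepting):
    """Split {accepting, non-accepting} until no block can be split further.

    `delta` maps (state, symbol) -> state. This is the simple quadratic
    refinement; Hopcroft's worklist version refines the same idea to
    O(n log n).
    """
    partition = [set(accepting), set(states) - set(accepting)]
    partition = [block for block in partition if block]
    changed = True
    while changed:
        changed = False
        new_partition = []
        for block in partition:
            # Two states stay together only if, for every symbol, they move
            # into the same block of the current partition.
            groups = {}
            for s in block:
                key = tuple(
                    next((i for i, b in enumerate(partition)
                          if delta.get((s, a)) in b), -1)
                    for a in alphabet
                )
                groups.setdefault(key, set()).add(s)
            new_partition.extend(groups.values())
            if len(groups) > 1:
                changed = True
        partition = new_partition
    return partition
```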

Regular Language

Two regular expressions are considered equivalent if, and only if, they generate the same regular set (language). For example, (a|b)* and (a*b*)* denote the same language.

Pumping Lemma:

  • If a language consists of a finite set of lexemes, it is guaranteed to be a regular language.

  • For languages with an infinite set of lexemes, the Pumping Lemma can be used to show that a language is not regular; its standard statement and a classic application are given below.
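For reference, the standard statement of the Pumping Lemma is roughly:

$$
L \text{ is regular} \;\Longrightarrow\; \exists\, p \ge 1 \text{ such that every } w \in L \text{ with } |w| \ge p
\text{ can be written as } w = xyz \text{ with } |xy| \le p,\ |y| \ge 1,\ \text{and } x y^i z \in L \text{ for all } i \ge 0.
$$

The classic application takes L = { aⁿbⁿ : n ≥ 0 } and w = aᵖbᵖ: any decomposition with |xy| ≤ p pumps only a's, so xy²z contains more a's than b's and is not in L, contradicting the lemma; hence L is not regular.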
