Lexical Analysis

In this lecture, we start from using regular expressions to specify the language. In order to parse the regular expressions, we need to construct the finite automaton corresponding to the regular expressions.

Regular Expressions

Regular Expression
Meaning

aa

x=ax = a

[ab][ab]

x{a,b}x \in \{a, b\}

[az][a-z]

x{a,,z}x \in \{a, \dots, z\}

[azAZ][a-zA-Z]

x{a,,z,A,,Z}x \in \{a, \dots, z, A, \dots, Z\}

..

xΣx \in \Sigma

a^\hat{a}

xΣ{a}x \in \Sigma \setminus \{a\}

ε\varepsilon

xx \in \emptyset

a?a?

x=a or x=εx = a \text{ or } x = \varepsilon

Construction Method
Regular Expression
Meaning

Union

STS \mid T

xSTx \in S \cup T

Concatenation

STST

x{stsS,tT}x \in \{st \mid s \in S, t \in T\}

Kleene Closure

SS^*

x{s0sn0in,siS,0n<}x \in \{s_0 \dots s_n \mid \forall 0 \leq i \leq n, s_i \in S, 0 \leq n < \infty\}

Positive Closure

S+S^+

x{s0sn0in,siS,1n<}x \in \{s_0 \dots s_n \mid \forall 0 \leq i \leq n, s_i \in S, 1 \leq n < \infty\}

Range Closure

S{min,max}S\{ \text{min,max} \}

x{s0sn0in,siS,minn<max}x \in \{s_0 \dots s_n \mid \forall 0 \leq i \leq n, s_i \in S, \text{min} \leq n < \text{max} \}

Precedence: Closure > Concatenation > Union.

Lexical Specification

Handling conflicts in recognition.

  • Keywords vs Identifiers: Keywords have a higher priority than identifiers. For example, the string "if" should be recognized as <IF>, not as <ID (if)>.

  • When multiple matching patterns exist, choose the longest match. For example, "<=" should be recognized as <LTE>, and not as two separate tokens <LT><EQ>. The string "ifabc" should be recognized as <ID (ifabc)>, and not as <IF> and <ID (abc)>.

Lexical Parsing

Motivation: We need to convert regular expressions to finite automatons.

Thompson's Construction

Screenshot 2024-09-16 at 21.26.53

Below is an example of converting the regular expression of UNUM to NFA:

Screenshot 2024-09-16 at 21.31.05

Gap: Using an NFA for lexical analysis is impractical due to the numerous possible transition paths.

Converting NFA to DFA

Given an input string, we can simulate an NFA with a DFA by defining the states in the DFA as the reachable states in the NFA.

For a single state sis_i in the NFA, the epsilon-closure of sis_i refers to the set of states reachable from sis_i by epsilon (ϵ\epsilon) transitions:

Clϵ(si)={sj(si,ϵ)(sj,ϵ)}Cl^\epsilon(s_i) = \{ s_j \mid (s_i, \epsilon) \to^* (s_j, \epsilon) \}

This means the set of states sjs_j that can be reached from sis_i by taking zero or more epsilon transitions.

For a set of states SS in the NFA, the epsilon-closure of SS refers to the set of all states that can be reached from any state in SS by epsilon transitions:

Clϵ(S)={qqS,(q,ϵ)(q,ϵ)}Cl^\epsilon(S) = \{ q' \mid \forall q \in S, (q, \epsilon) \to^* (q', \epsilon) \}

This means the set of all states qq' that can be reached from any state qq in SS through zero or more epsilon transitions.

For a set of states SS in the NFA, the transition for an input symbol α\alpha refers to the epsilon-closure of the set of all states that can be reached after reading the symbol α\alpha:

Δ(S,α)=Clϵ({qqS,(q,α)q})\Delta(S, \alpha) = Cl^\epsilon(\{ q' \mid \forall q \in S, (q, \alpha) \to q' \})

This means that the transition on symbol α\alpha for a set of states SS is the epsilon-closure of the set of states that can be reached by taking the transition labeled with α\alpha from any state qq in SS.

Powerset Construction

The start state in the DFA corresponds to the start state of the NFA, plus all states reachable via ϵ\epsilon-transitions.

If a state qq in the DFA corresponds to a set of states SS in the NFA, then the transition from state qq on a character aa is found as follows:

  • Let SS' be the set of states in the NFA that can be reached by following a transition labeled a from any of the states in SS. (This set may be empty.)

  • Let SS'' be the set of states in the NFA reachable from some state in SS' by following zero or more epsilon transitions.

  • The state qq in the DFA transitions on a to a DFA state corresponding to the set of states SS''.

DFA Optimization: Merging Equivalent States

For two nodes of the same type, did_i and djd_j, the condition for merging them is:

cΣ,δ(di,c)=δ(dj,c)\forall c \in \Sigma, \delta(d_i, c) = \delta(d_j, c)

Hopcroft Partitioning Algorithm:

Screenshot 2024-09-28 at 17.55.43
  • The Hopcroft partitioning algorithm iteratively divides the state set DD of a DFA (Deterministic Finite Automaton) into subsets to minimize the DFA.

  • Initially, the state set DD is divided into two subsets: accepting states DacD_{ac} and non-accepting states DDacD \setminus D_{ac}.

  • The algorithm continues to partition the states until no further splits can be made.

  • The Split(s) function checks if a set of states ss can be split based on input characters from the alphabet Σ\Sigma. If a split occurs, it returns two new subsets; otherwise, it returns the original set.

Regular Language

Two regular expressions are considered equivalent if, and only if, they generate the same regular set (language).

Pumping Lemma:

  • If a language consists of a finite set of lexemes, it is guaranteed to be a regular language.

  • For languages with an infinite set of lexemes, the Pumping Lemma can be used to demonstrate that the language is not regular.

Last updated