Languages and Machines

Languages and Machines#

Strings#

Alphabet and Symbol

An alphabet is a non-empty, finite set and the elements of an alphabet are called the symbols of the alphabet.

We use capital Greek letters to denote alphabets.

String

A string (or word) over an alphabet is a finite sequence of symbols from that alphabet.

For example, \(01001\) is a string over the alphabet \(\Sigma = \{ 0, 1 \}\) and \(abracadabra\) is a word over the English alphabet \(\Sigma = \{ a, \dots, z \}\).

Sets of strings

The set of all strings of length \(n\) over the alphabet \(\Sigma\) is denoted \(\Sigma^n\).

The set of all strings over the alphabet \(\Sigma\) including the empty string \(\varepsilon\) is denoted \(\Sigma^* := \bigcup_{i \in \mathbb{N}} \Sigma^i\).

String Length

If \(w\) is a string over \(\Sigma\) then \(|w|\) is called the length of \(w\) and is defined as the number of symbols that it contains.

The string of length zero is called the empty string and is written \(\varepsilon\).

If \(w\) has length \(n\) we can write \(w = w_1 w_2 \dots w_n\) where each \(w_i \in \Sigma\).

String Reversal

The reverse of \(w\) is the string obtained by writing \(w\) in the opposite order \(w_n w_{n - 1} \dots w_1\) and is denoted \(w^\mathcal{R}\).

Substring

String \(z\) is a substring of \(w\) if \(z\) appears consecutively within \(w\). For example, \(cad\) is a substring of \(abracadabra\).

String Concatenation

Given string \(x\) of length \(m\) and string \(y\) of length \(n\) the concatenation of \(x\) and \(y\) is denoted \(xy\) and is defined as the string obtained by appending \(y\) to the end of \(x\) as in \(x_1 \dots x_m y_1 \dots y_n\).

A string concatenated with itself \(k\) times is denoted \(x^k = \underbrace{x \dots x}_{k}\)

Lexicographic Order

The lexicographic order of strings is just dictionary order.

For example, the lexicographic order of all strings over the alphabet \(\{ 0, 1 \}\) is \(\{ \varepsilon, 0, 00, 000, \dots, 001, 01, 010, \dots, 011, 1, 10, 100, \dots, 101, 11, 110, \dots, 111, \dots \}\).

Shortlex Order

The shortlex order (or string order) of strings is a modified lexicographic order in which shorter strings precede longer strings.

For example, the shortlex order of all strings over the alphabet \(\{ 0, 1 \}\) is \(\{ \varepsilon, 0, 1, 00, 01, 10, 11, 000, 001, 010, 011, 100, 101, 110, 111, \dots \}\).

Prefix and Proper Prefix

String \(x\) is a prefix of string \(y\) if there is a string \(z\) s.t. \(xz = y\).

String \(x\) is a proper prefix of string \(y\) if there is a string \(z\) s.t. \(xz = y\) and \(x \ne y\) (i.e., \(z \ne \varepsilon\)).

Languages#

Language

A language \(L\) is a set of strings over an alphabet \(\Sigma\).

Note that \(L \subseteq \Sigma^*\).

The language with no words is called the empty language and is denoted \(\varnothing\).

A language is prefix free if no element is a proper prefix of another element.

Regular Languages#

A language is called regular if some finite automaton regonizes it.

Regular Operations#

Regular operations are operations defined on regular languages.

Union#

Let \(A\) and \(B\) be languages. The regular operation union is defined as

\[A \cup B = \{ x : x \in A \,\,\text{or}\,\, x \in B \}\]

Concatenation#

Let \(A\) and \(B\) be languages. The regular operation concatenation is defined as

\[A \circ B = \{ xy : x \in A \,\,\text{and}\,\, y \in B \}\]

Star#

Let \(A\) be a language. The regular operation star is defined as

\[A^* = \{ x_1 x_2 \dots x_k : k \ge 0, x_i \in A \}\]

[Example]

\( \begin{aligned} A &= \{ 0, 1 \} \\ B &= \{ 0, 1 \} \\ A \cup B &= \{ 0, 1 \} \\ A \circ B &= \{ 00, 01, 10, 11 \} \\ A^* = B^* &= \{ \varepsilon, 0, 1, 00, 01, 10, 11, 000, 001, 010, 011, \dots \} \\ \end {aligned} \)

[Example]

\( \begin{aligned} A &= \{ \text{bat}, \text{cat} \} \\ B &= \{ \text{nat}, \text{rat} \} \\ A \cup B &= \{ \text{bat}, \text{cat}, \text{nat}, \text{rat} \} \\ A \circ B &= \{ \text{batnat}, \text{batrat}, \text{catnat}, \text{catrat} \} \\ A^* &= \{ \varepsilon, \text{bat}, \text{cat}, \text{batbat}, \text{batcat}, \text{catbat}, \text{catcat}, \text{batbatbat}, \text{batbatcat}, \text{batcatbat}, \text{batcatcat}, \dots \} \\ \end {aligned} \)

Properties of Regular Languages#

Closure under Regular Union#

THEOREM [Sipser pp. 45, 1.45 pp. 59]

The class of regular languages is closed under the union operation.

In other words, if \(A\) and \(B\) are regular languages then \(A \cup B\) is a regular language.

Proof Idea

Since \(A\) and \(B\) are regular there are finite automata \(M_1\) and \(M_2\) that recognize them. To show that \(A \cup B\) is regular is to show that some finite automaton \(M\) constructed from \(M_1\) and \(M_2\) recognizes it.

\(M\) is constructed from \(M_1\) and \(M_2\) in the sense that it simulates them.

\(M\) accepts its input whenever \(M_1\) or \(M_2\) accepts their input.

If \(M_1\) has \(k_1\) states and \(M_2\) has \(k_2\) states then the number of states in \(M\) is \(k_1 \times k_2\).

The transitions of \(M\) go from pair to pair and update the current state for both \(M_1\) and \(M_2\).

The accept states of \(M\) are those pairs wherein either \(M_1\) or \(M_2\) is in an accept state.

\(Proof \,\, by \,\, construction\) via deterministic simulation

Let \(M_1 = (Q_1, \Sigma_1, \delta_1, q_1, F_1)\) recognize \(A_1\) and let \(M_2 = (Q_2, \Sigma_2, \delta_2, q_2, F_2)\) recognize \(A_2\).
Construct \(M = (Q, \Sigma, \delta, q_0, F)\) to recognize \(A_1 \cup A_2\).

\[\begin{split} \begin{align*} Q &= \{ (r_1, r_2) : r_1 \in Q_1 \,\,\text{and}\,\, r_2 \in Q_2 \} \\ &= Q_1 \times Q_2 \\ \end {align*} \end{split}\]

\(\Sigma = \Sigma_1 \cup \Sigma_2\)
For each \((r_1, r_2) \in Q\) and each \(a \in \Sigma\) let \(\delta((r_1, r_2), a) = (\delta_1 (r_1, a), \delta_2 (r_2, a))\)
\(q_0 = (q_1, q_2)\)
\(F\) is the set of pairs in which either member is an accept state of \(M_1\) or \(M_2\).

\[\begin{split} \begin{align*} F &= \{ (r_1, r_2) : r_1 \in F_1 \,\,\text{or}\,\, r_2 \in F_2 \} \\ &= (F_1 \times Q_2) \cup (Q_1 \times F_2) \\ \end {align*} \end{split}\]

\(\blacksquare\)

Proof Idea

We have two regular languages \(A_1, A_2\) and we want to show that \(A_1 \cup A_2\) is a regular language.

The idea is to take two NFAs \(N_1, N_2\) for regular languages \(A_1, A_2\) and combine them into a new NFA \(N\) to recognize \(A_1 \cup A_2\).

This is achieved in the following way.

\(N\) accepts its input if either \(N_1\) or \(N_2\) accepts the input. \(N\) has a start state that branches to the start states of the old machines with \(\varepsilon\) arrows. \(N\) non-deterministically guesses which of the two machines accepts the input: if one of them accepts the input then \(N\) accepts it too.

\(Proof\) via non-deterministic simulation

Let NFA \(N_1 = (Q_1, \Sigma, \delta_1, q_1, F_1)\) recognize \(A_1\) and let NFA \(N_2 = (Q_2, \Sigma, \delta_2, q_2, F_2)\) recognize \(A_2\).

Construct NFA \(N = (Q, \Sigma, \delta, q_0, F)\) to recognize \(A_1 \cup A_2\).

\(1\quad\) \(Q = \{ q_0 \} \cup Q_1 \cup Q_2\) \(\quad\) The states of \(N\) are all the states of \(N_1\) and \(N_2\) with the addition of a new start state \(q_0\).
\(2\quad\) The state \(q_0\) is the start state of \(N\).
\(3\quad\) The set of accept states of \(N\) are all the accept states of \(N_1\) and \(N_2\). That way, \(N\) accepts if either \(N_1\) accepts or \(N_2\) accepts.
\(4\quad\) Define \(\delta\) so that for any \(q \in Q\) and any \(a \in \Sigma_\varepsilon\)

\[\begin{split} \delta(q, a) = \begin{cases} \delta_1 (q, a) & q \in Q_1 \\ \delta_2 (q, a) & q \in Q_2 \\ \{ q_1, q_2 \} & q = q_0 \,\,\text{and}\,\, a = \varepsilon \\ \varepsilon & q = q_0 \,\,\text{and}\,\, a \ne \varepsilon \\ \end {cases} \end{split}\]

\(\blacksquare\)

Closure under Regular Intersection#

\(Theorem\)

If \(A_1\) and \(A_2\) are regular languages then \(A_1 \cap A_2\) is a regular language.

\(Proof \,\, by \,\, construction\)

\(F\) is the set of pairs in which both members are accept states of \(M_1\) and \(M_2\) respectively.

\[\begin{split} \begin{align*} F &= \{ (r_1, r_2) : r_1 \in F_1 \,\,\text{and}\,\, r_2 \in F_2 \} \\ &= F_1 \times F_2 \\ \end {align*} \end{split}\]

\(\blacksquare\)

Closure under Regular Concatenation#

THEOREM [Sipser 1.47 pp. 60]

The class of regular languages is closed under the concatenation operation.

In other words, if \(A_1\) and \(A_2\) are regular languages then \(A_1 \circ A_2\) is a regular language.

Proof Idea

We have two regular languages \(A_1, A_2\) and we want to show that \(A_1 \circ A_2\) is a regular language.

The idea is to take two NFAs \(N_1, N_2\) for regular languages \(A_1, A_2\) and combine them into a new NFA \(N\) to recognize \(A_1 \circ A_2\).

This is achieved in the following way.

Assign \(N\)’s start state to be the start state of \(N_1\). The accept states of \(N_1\) have additional \(\varepsilon\) arrows that non-deterministically allow branching to \(N_2\) whenever \(N_1\) is in an accept state, signifying that it has found an initial piece of the input that constitutes a string in \(A_1\). The accept states of \(N\) are the accept states of \(N_2\) only. Therefore, it accepts when the input can be split into two parts, the first accepted by \(N_1\) and the second accepted by \(N_2\). \(N\) non-deterministically guesses where to make the split.

\(Proof\) via non-deterministic simulation

Let NFA \(N_1 = (Q_1, \Sigma, \delta_1, q_1, F_1)\) recognize \(A_1\) and let NFA \(N_2 = (Q_2, \Sigma, \delta_2, q_2, F_2)\) recognize \(A_2\).

Construct NFA \(N = (Q, \Sigma, \delta, q_1, F_2)\) to recognize \(A_1 \circ A_2\).

\(1\quad\) \(Q = Q_1 \cup Q_2\) \(\quad\) The states of \(N\) are all the states of \(N_1\) and \(N_2\).
\(2\quad\) The state \(q_1\) is the same as the start state of \(N_1\).
\(3\quad\) The accept states \(F_2\) are the same as the accept states of \(N_2\).
\(4\quad\) Define \(\delta\) so that for any \(q \in Q\) and any \(a \in \Sigma_\varepsilon\)

\[\begin{split} \delta(q, a) = \begin{cases} \delta_1 (q, a) & q \in Q_1 \,\,\text{and}\,\, q \not\in F_1 \\ \delta_1 (q, a) & q \in F_1 \,\,\text{and}\,\, a \ne \varepsilon \\ \delta_1 (q, a) \cup \{ q_2 \} & q \in F_1 \,\,\text{and}\,\, a = \varepsilon \\ \delta_2 (q, a) & q \in Q_2 \\ \end {cases} \end{split}\]

\(\blacksquare\)

Closure under Regular Star#

THEOREM [Sipser 1.49 pp. 62]

The class of regular languages is closed under the star operation.

In other words, if \(A_1\) is a regular language then \(A_1^*\) is a regular language.

Proof Idea

We have a regular language \(A_1\) and we want to show that \(A_1^*\) is a regular language.

The idea is to take an NFA \(N_1\) for regular language \(A_1\) and modify it into a new NFA \(N\) to recognize \(A_1^*\).

This is achieved in the following way.

\(N\) accepts its input whenever it can be broken into several pieces and \(N_1\) accepts each piece. We can construct \(N\) like \(N_1\) with additional \(\varepsilon\) arrows returning to the start state from the accept states. This way, when processing gets to the end of a piece that \(N_1\) accepts, the machine \(N\) has the option of jumping back to the start state to try to read another piece that \(N_1\) accepts. In addition, we must modify \(N\) so that it accepts \(\varepsilon\) which is always a member of \(A_1^*\).

A bad approach is to add the start state to the set of accept states. This has the effect of adding \(\varepsilon\) to the recognized language, but it may also add other undesirable strings.

Instead, add a new start state, which is also an accept state and which has an \(\varepsilon\) arrow to the old start state. This solution has the desired effect of adding \(\varepsilon\) to the language without adding anything else.

\(Proof\) via non-deterministic simulation

Let NFA \(N_1 = (Q_1, \Sigma, \delta_1, q_1, F_1)\) recognize \(A_1\).

Construct NFA \(N = (Q, \Sigma, \delta, q_0, F)\) to recognize \(A_1^*\).

\(1\quad\) \(Q = \{ q_0 \} \cup Q_1\) \(\quad\) The states of \(N\) are the states of \(N_1\) and a new start state \(q_0\).
\(2\quad\) The state \(q_0\) is the new start state of \(N\).
\(3\quad\) \(F = \{ q_0 \} \cup F_1\) \(\quad\) The accept states are the old accept states and the new start state.
\(4\quad\) Define \(\delta\) so that for any \(q \in Q\) and any \(a \in \Sigma_\varepsilon\)

\[\begin{split} \delta(q, a) = \begin{cases} \delta_1 (q, a) & q \in Q_1 \,\,\text{and}\,\, q \not\in F_1 \\ \delta_1 (q, a) & q \in F_1 \,\,\text{and}\,\, a \ne \varepsilon \\ \delta_1 (q, a) \cup \{ q_1 \} & q \in F_1 \,\,\text{and}\,\, a = \varepsilon \\ \{ q_1 \} & q = q_0 \,\,\text{and}\,\, a = \varepsilon \\ \varnothing & q = q_0 \,\,\text{and}\,\, a \ne \varepsilon \\ \end {cases} \end{split}\]

\(\blacksquare\)

Precedence of the regular operations#

\(1\quad\) star
\(2\quad\) concatenation
\(3\quad\) union

Regular Expressions#

We use arithmetical operators over numbers to construct arithmetic expressions which evaluate to numbers. And we use regular operators over symbols to construct regular expressions which evaluate to languages.

The following definition follows from the definition of the regular operations.

DEFINITION [Sipser 1.51 pp. 64]

Let \(\Sigma\) be any alphabet.
The regular expression \(\Sigma\) describes the language consisting of all strings of length \(1\) over this alphabet.
The regular expression \(\Sigma^*\) describes the language consisting of all strings over this alphabet.

INDUCTIVE DEFINITION Regular Expression [Sipser 1.52 pp. 64]

\(R\) is a regular expression if \(R\) is
\(1\quad\) \(a\) for some \(a\) in the alphabet \(\Sigma\) \(\quad\) The regular expression \(a\) represents the language \(\{ a \}\)
\(2\quad\) \(\varepsilon\) \(\quad\) The regular expression \(\varepsilon\) represents the language \(\{ \varepsilon \}\)
\(3\quad\) \(\varnothing\) \(\quad\) The regular expression \(\varnothing\) represents the empty language
\(4\quad\) \((R_1 \cup R_2)\) where \(R_1\) and \(R_2\) are regular expressions \(\quad\) The regular expression \((R_1 \cup R_2)\) represents the language produced by the union of \(R_1\) and \(R_2\)
\(5\quad\) \((R_1 \circ R_2)\) where \(R_1\) and \(R_2\) are regular expressions \(\quad\) The regular expression \((R_1 \circ R_2)\) represents the language produced by the concatenation of \(R_1\) and \(R_2\)
\(6\quad\) \((R_1^*)\) where \(R_1\) is a regular expression \(\quad\) The regular expression \((R_1^*)\) represents the language produced by the star of \(R_1\)

\( \begin{aligned} R^+ :=&& RR^* && \text{one or more} \\ R^* =&& R^+ \cup \{ \varepsilon \} && \text{zero or more} \\ R^k :=&& \underbrace{R \dots R}_{k} && k \,\,\text{times} \\ \end {aligned} \)

To distinguish between a regular expression and the language that it describes we write \(L(R)\) to be the language of \(R\).

Examples#

Example

\(0 \cup 1\)

s.h. for \(\{ 0 \} \cup \{ 1 \} = \{ 0, 1 \}\)

\(L(0 \cup 1) = \{ w : w \,\,\text{is either}\,\, 0 \,\,\text{or}\,\, 1 \}\)

Example

\(0^*\)

s.h. for \(\{ 0 \}^* = \{ \varepsilon, 0, 00, 000, \dots \}\)

\(L(0^*) = \{ w : w \,\,\text{consists of zero or more}\,\, 0 \,\,\text{s} \}\)

Example

\(\Sigma 0^* = (0 \cup 1) 0^* = 0^+ \cup 10^*\) where \(\Sigma = \{ 0, 1 \}\)

s.h. for \( (\{ 0 \} \cup \{ 1 \}) \circ \{ 0 \}^* = \{ 0, 1 \} \circ \{ 0 \}^* = \{ 0 \} \circ \{ 0 \}^* \cup \{ 1 \} \circ \{ 0 \}^* = 00^* \cup 10^* = 0^+ \cup 10^* \)

\(L(\Sigma 0^*) = \{ w : w \,\,\text{consists of a}\,\, 0 \,\,\text{or a}\,\, 1 \,\,\text{followed by zero or more}\,\, 0 \,\,\text{s}\}\)

Example [Sipser 1.51 pp.64]

\((0 \cup 1)^* = \Sigma^*\)

s.h. for \((\{ 0 \} \cup \{ 1 \})^* = \{ 0, 1 \}^* = \{ \varepsilon, 0, 1, 00, 01, 10, 11, 000, 001, 010, 011, 100, 101, 110, 111, \dots \}\)

The language consisting of all possible strings of \(0\) s and \(1\) s.

Example

\((0 \cup 1)^* 1 = \Sigma^* 1\)

s.h. for \((\{ 0 \} \cup \{ 1 \})^* \circ \{ 1 \} = \{ 0, 1 \}^* \circ \{ 1 \} = \{ \varepsilon, \textcolor{red}{\cancel{0}}, 1, \textcolor{red}{\cancel{00}}, 01, \textcolor{red}{\cancel{10}}, 11, \textcolor{red}{\cancel{000}}, 001, \textcolor{red}{\cancel{010}}, 011, \textcolor{red}{\cancel{100}}, 101, \textcolor{red}{\cancel{110}}, 111, \dots \}\)

The language consisting of all strings that end in a \(1\).

Example

\((0 (0 \cup 1)^*) \cup ((0 \cup 1)^* 1) = (0 \Sigma^*) \cup (\Sigma^* 1)\)

s.h. for \( (\{ 0 \} \circ (\{ 0 \} \cup \{ 1 \})^*) \cup ((\{ 0 \} \cup \{ 1 \})^* \circ \{ 1 \}) = (\{ 0 \} \circ \{ 0, 1 \}^*) \cup (\{ 0, 1 \}^* \circ \{ 1 \}) = \{ \varepsilon, 0, 1, 00, 01, \textcolor{red}{\cancel{10}}, 11, 000, 001, 010, 011, \textcolor{red}{\cancel{100}}, 101, \textcolor{red}{\cancel{110}}, 111, \dots \} \)

The language consisting of all strings that start with a \(0\) or end with a \(1\).