Conversational AI Design and Practice

©2023 Emory University - All rights reserved

3.4. Regular Expression

How to use regular expressions for matching in Natex.


Last updated 2 years ago


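Regular expressions provide a powerful way to match and extract patterns in strings. Useful references:

  • Chapter 2.1: Regular Expressions, Speech and Language Processing (3rd ed.), Jurafsky and Martin.

  • Regular Expression HOWTO, Python Documentation

  • Regular Expressions 101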

Syntax

Grouping

Syntax    Description
[ ]       A set of characters
( )       A capturing group
(?: )     A non-capturing group
|         Alternation ("or"): matches the expression on either side
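
As a minimal sketch of the four grouping constructs above, the following uses made-up sample strings; the re functions it calls are introduced in the Functions section of this page.

```python
import re

# [ ]: a set of characters; [ae] matches either 'a' or 'e'.
set_matches = re.findall(r'gr[ae]y', 'gray grey groy')

# ( ): each capturing group is returned by groups().
captured = re.match(r'(gr)([ae])y', 'grey').groups()

# (?: ): a non-capturing group still groups, but is left out of groups().
non_captured = re.match(r'(?:gr)([ae])y', 'grey').groups()

# |: alternation matches the expression on either side.
either = re.findall(r'cat|dog', 'cat dog bird')

print(set_matches)   # ['gray', 'grey']
print(captured)      # ('gr', 'e')
print(non_captured)  # ('e',)
print(either)        # ['cat', 'dog']
```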

Repetitions

Syntax    Description                       Non-greedy
.         Any character except a newline
*         0 or more repetitions             *?
+         1 or more repetitions             +?
?         0 or 1 repetitions                ??
{m}       Exactly m repetitions
{m,n}     From m to n repetitions           {m,n}?
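
The difference between greedy and non-greedy repetition is easiest to see on a small example. The strings below are made up for illustration.

```python
import re

text = '<b>bold</b>'

# Greedy: .* consumes as much as possible, up to the last '>'.
greedy = re.match(r'<.*>', text).group()

# Non-greedy: .*? stops at the first '>' that lets the match succeed.
lazy = re.match(r'<.*?>', text).group()

# ? makes the preceding 'u' optional.
optional = re.findall(r'colou?r', 'color colour')

# {m,n} is greedy too: it takes up to 3 a's when they are available.
bounded = re.findall(r'a{2,3}', 'a aa aaa aaaa')

print(greedy)    # <b>bold</b>
print(lazy)      # <b>
print(optional)  # ['color', 'colour']
print(bounded)   # ['aa', 'aaa', 'aaa']
```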

Special Characters

Syntax    Description
^         The start of the string
$         The end of the string
\num      The contents of the group of the same number (a backreference)
\d        Any decimal digit
\D        Any non-decimal-digit character
\s        Any whitespace character
\S        Any non-whitespace character
\w        Any alphanumeric character or the underscore
\W        Any character that is neither alphanumeric nor the underscore
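
A short sketch of a few of these special characters in use, again with made-up input strings:

```python
import re

# ^ anchors the match at the start of the string; \d+ is a run of digits.
leading_number = re.search(r'^\d+', '42 apples').group()

# $ anchors the match at the end of the string; \w+ is a run of word characters.
last_word = re.search(r'\w+$', 'I like apples').group()

# \S+ matches runs of non-whitespace characters, i.e. whitespace-separated tokens.
tokens = re.findall(r'\S+', 'a  b\tc')

# \1 refers back to whatever the first group matched (a repeated word here).
repeated = re.search(r'(\w+) \1', 'it is is here').group()

print(leading_number)  # 42
print(last_word)       # apples
print(tokens)          # ['a', 'b', 'c']
print(repeated)        # is is
```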

Functions

Python's re module provides several functions for matching regular expressions.

match()

Let us create a regular expression that matches "Mr." and "Ms.":

import re

RE_MR = re.compile(r'M[rs]\.')
m = RE_MR.match('Dr. Wayne')
print(m)

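  • #1: imports the regular expression library.

  • #3: compiles the regular expression into the regex object RE_MR.

  • #4: matches the string "Dr. Wayne" with RE_MR and saves the match object to m.

A regular expression is written as r'expression', a raw string literal: the r prefix tells Python not to treat backslashes as escape characters, so a pattern such as \. reaches the regular expression engine unchanged.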

The above code prints None, indicating that the value of m is None, because the regular expression does not match the string.

m = RE_MR.match('Mr. Wayne')
print(m)
if m:
    print(m.group(), m.start(), m.end())
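  • #1: since RE_MR matches the string, m is a match object.

  • #3: true since m is a match object.

  • #4: prints the matched substring, and the start (inclusive) and end (exclusive) indices of the substring with respect to the original string in #1.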

<re.Match object; span=(0, 3), match='Mr.'>
Mr. 0 3
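
Currently, no groups are specified in RE_MR:

print(m.groups())
  • #1: returns an empty tuple ().

What are the differences between a list and a tuple in Python?

It is possible to group specific patterns using parentheses: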

RE_MR = re.compile(r'(M[rs])(\.)')
m = RE_MR.match('Ms. Wayne')
print(m.groups())
print(m.group())
print(m.group(0))
print(m.group(1))
print(m.group(2))
  • #1: there are two groups in this regular expression, (M[rs]) and (\.).

  • #3: returns a tuple of matched substrings ('Ms', '.') for the two groups in #1.

  • #4,5: return the entire match "Ms.".

  • #6: returns "Ms" matched by the first group (M[rs]).

  • #7: returns "." matched by the second group (\.).

('Ms', '.')
Ms.
Ms.
Ms
.

The above RE_MR matches "Mr." and "Ms." but not "Mrs." Modify it to match all of them (Hint: use a non-capturing group and |).

RE_MR = re.compile(r'(M(?:[rs]|rs))(\.)')

The non-capturing group (?:[rs]|rs) matches "r", "s", or "rs" such that the first group matches "Mr", "Ms", and "Mrs", respectively.

Since we use the non-capturing group, the following code still prints a tuple of two strings:

print(RE_MR.match('Mrs. Wayne').groups())
--> ('Mrs', '.')

What if we use a capturing group instead?

RE_MR = re.compile(r'(M([rs]|rs))(\.)')

Now, the nested group ([rs]|rs) is considered the second group such that the match returns a tuple of three strings as follows:

print(RE_MR.match('Mrs. Wayne').groups())
--> ('Mrs', 'rs', '.')
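
To see how the two versions differ on all three titles, here is a small check. RE_NC and RE_C are names made up for this sketch so that both patterns can be kept side by side.

```python
import re

RE_NC = re.compile(r'(M(?:[rs]|rs))(\.)')  # inner group is non-capturing
RE_C = re.compile(r'(M([rs]|rs))(\.)')     # inner group is capturing

for title in ['Mr. Wayne', 'Ms. Wayne', 'Mrs. Wayne']:
    print(RE_NC.match(title).groups(), RE_C.match(title).groups())

# ('Mr', '.') ('Mr', 'r', '.')
# ('Ms', '.') ('Ms', 's', '.')
# ('Mrs', '.') ('Mrs', 'rs', '.')
```

Groups are numbered by the position of their opening parenthesis, which is why the nested group becomes group 2 and (\.) moves to group 3.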

search()

Let us match the following strings with the version of RE_MR that uses the non-capturing group:

s1 = 'Mr. and Ms. Wayne are here'
s2 = 'Here are Mr. and Mrs. Wayne'

print(RE_MR.match(s1))
print(RE_MR.match(s2))
  • #4: matches "Mr." but not "Ms."

  • #5: matches neither "Mr." nor "Mrs."

<re.Match object; span=(0, 3), match='Mr.'>
None

For s1, only "Mr." is matched because match() returns at most one match, found at the beginning of the string. For s2, on the other hand, not even "Mr." is matched because match() requires the pattern to be at the beginning of the string.

To match a pattern anywhere in the string, we need to search for the pattern instead:

print(RE_MR.search(s1))
print(RE_MR.search(s2))
  • search() returns a match object as match() does.

<re.Match object; span=(0, 3), match='Mr.'>
<re.Match object; span=(9, 12), match='Mr.'>

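findall()

search() still does not return the second substrings, "Ms." and "Mrs.". The following shows how to find all substrings that match the pattern: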

print(RE_MR.findall(s1))
print(RE_MR.findall(s2))
  • findall() returns a list of tuples where each tuple holds the substrings matched by the groups for one match.

[('Mr', '.'), ('Ms', '.')]
[('Mr', '.'), ('Mrs', '.')]
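
The shape of what findall() returns depends on how many capturing groups the pattern has. The patterns below are variations on RE_MR written for this sketch.

```python
import re

s1 = 'Mr. and Ms. Wayne are here'

# No group: a list of the matched substrings.
no_group = re.findall(r'M[rs]\.', s1)

# One group: a list of the substrings matched by that group.
one_group = re.findall(r'(M[rs])\.', s1)

# Two or more groups: a list of tuples, one tuple per match.
two_groups = re.findall(r'(M[rs])(\.)', s1)

print(no_group)    # ['Mr.', 'Ms.']
print(one_group)   # ['Mr', 'Ms']
print(two_groups)  # [('Mr', '.'), ('Ms', '.')]
```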

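finditer()

Since findall() returns a list of tuples instead of match objects, there is no definite way of locating the matched results in the original string. To return match objects instead, we need to iteratively find the pattern: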

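for m in RE_MR.finditer(s1):
    print(m)
  • #1: finditer() returns an iterator that keeps matching the pattern until it finds no more matches.

<re.Match object; span=(0, 3), match='Mr.'>
<re.Match object; span=(8, 11), match='Ms.'>

for m in RE_MR.finditer(s2):
    print(m)

<re.Match object; span=(9, 12), match='Mr.'>
<re.Match object; span=(17, 21), match='Mrs.'>

You can use a list comprehension to store the match objects as a list:

ms = [m for m in RE_MR.finditer(s1)]
print(ms)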
  • #1: returns a list of all m (in order) matched by finditer().

[<re.Match object; span=(0, 3), match='Mr.'>, <re.Match object; span=(8, 11), match='Ms.'>]

How is the code above different from the one below?

ms = []
for m in RE_MR.finditer(s1):
    ms.append(m)

What are the advantages of using a list comprehension over a for-loop other than it makes the code shorter?
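
Returning to the point that match objects, unlike the tuples from findall(), record where each match was found: the sketch below uses span() on each match object from finditer() to slice the matched text back out of the original string.

```python
import re

RE_MR = re.compile(r'(M(?:[rs]|rs))(\.)')
s2 = 'Here are Mr. and Mrs. Wayne'

located = []
for m in RE_MR.finditer(s2):
    start, end = m.span()
    # Slicing the original string with the span gives back the matched text.
    located.append((start, end, s2[start:end], m.groups()))

print(located)
# [(9, 12, 'Mr.', ('Mr', '.')), (17, 21, 'Mrs.', ('Mrs', '.'))]
```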

Write regular expressions to match the following cases:

  • Abbreviation: Dr., U.S.A.

  • Apostrophe: '80, '90s, 'cause

  • Concatenation: don't, gonna, cannot

  • Hyperlink: https://github.com/emory-courses/cs329/

  • Number: 1/2, 123-456-7890, 1,000,000

  • Unit: $10, #20, 5kg

RE_TOK = re.compile(r'([",.]|n\'t|\s+)')
RE_ABBR = re.compile(r'((?:Mr|Mrs|Ms|Dr)\.)|((?:[A-Z]\.){2,})')
RE_APOS = re.compile(r'\'(\d\ds?|cause)')
RE_CONC = re.compile(r'([A-Za-z]+)(n\'t)|(gon)(na)|(can)(not)')
RE_HYPE = re.compile(r'(https?://\S+)')
RE_NUMB = re.compile(r'(\d+/\d+)|(\d{3}-\d{3}-\d{4})|(\d{1,3}(?:,\d{3})+)')
RE_UNIT = re.compile(r'([$#])?(\d+)([km]g)?')
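
One way to sanity-check expressions like these is to run each against the examples from the exercise with fullmatch(), which succeeds only if the whole string matches. This is a sketch rather than a full test suite. RE_NUMB is written here with \d{1,3} before the first comma so that numbers such as 12,000 also match in full, and CASES and results are names made up for this check.

```python
import re

RE_ABBR = re.compile(r'((?:Mr|Mrs|Ms|Dr)\.)|((?:[A-Z]\.){2,})')
RE_APOS = re.compile(r'\'(\d\ds?|cause)')
RE_CONC = re.compile(r'([A-Za-z]+)(n\'t)|(gon)(na)|(can)(not)')
RE_HYPE = re.compile(r'(https?://\S+)')
RE_NUMB = re.compile(r'(\d+/\d+)|(\d{3}-\d{3}-\d{4})|(\d{1,3}(?:,\d{3})+)')
RE_UNIT = re.compile(r'([$#])?(\d+)([km]g)?')

# Each pattern paired with the exercise examples it is meant to cover.
CASES = [
    (RE_ABBR, ['Dr.', 'U.S.A.']),
    (RE_APOS, ["'80", "'90s", "'cause"]),
    (RE_CONC, ["don't", 'gonna', 'cannot']),
    (RE_HYPE, ['https://github.com/emory-courses/cs329/']),
    (RE_NUMB, ['1/2', '123-456-7890', '1,000,000', '12,000']),
    (RE_UNIT, ['$10', '#20', '5kg']),
]

# Map every example to whether its pattern matches it in full.
results = {
    example: pattern.fullmatch(example) is not None
    for pattern, examples in CASES
    for example in examples
}

for example, ok in results.items():
    print(f'{example!r}: {ok}')
```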

Natex Integration

The nesting example in Section 3.1 has a condition as follows (#4):

'{[{so, very} good], fantastic}'

Write a regular expression that matches the above condition.

r'((?:so|very) good|fantastic)'

It is possible to use regular expressions for matching in Natex. A regular expression is enclosed in forward slashes (/../):

transitions = {
    'state': 'start',
    '`Hello. How are you?`': {
        '/((?:so|very) good|fantastic)/': {
            '`Things are just getting better for you!`': 'end'
        },
        'error': {
            '`Sorry, I didn\'t understand you.`': 'end'
        }
    }
}
  • #4: true if the entire input matches the regular expression.

S: Hello. How are you?
U: So good!!!
S: Things are just getting better for you!

S: Hello. How are you?
U: Fantastic :)
S: Things are just getting better for you!

S: Hello. How are you?
U: It's fantastic
S: Sorry, I didn't understand you.

You can put the expression in a sequence to allow a partial match:

transitions = {
    'state': 'start',
    '`Hello. How are you?`': {
        '[/((?:so|very) good|fantastic)/]': {
            '`Things are just getting better for you!`': 'end'
        },
        'error': {
            '`Sorry, I didn\'t understand you.`': 'end'
        }
    }
}
  • #4: the regular expression is put in a sequence [].

S: Hello. How are you?
U: It's fantastic!!
S: Things are just getting better for you!

S: Hello. How are you?
U: I'm so good, thank you!
S: Things are just getting better for you!

When used in Natex, all literals in the regular expression (e.g., "so", "good" in #4) must be lowercase because Natex matches everything in lowercase. This design choice was made because users tend not to follow typical capitalization in a chat interface, whether it is text- or audio-based.
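
The difference between the entire-input match and the partial match above can be approximated in plain Python with fullmatch() and search(). This is only a rough emulation and not how Natex is implemented: normalize() is a hypothetical helper, and its preprocessing (lowercasing and dropping punctuation) is an assumption chosen to reproduce the dialogues shown above.

```python
import re

RE_GOOD = re.compile(r'((?:so|very) good|fantastic)')


def normalize(utterance: str) -> str:
    """Hypothetical preprocessing: lowercase, then drop everything
    except lowercase letters, digits, apostrophes, and spaces."""
    return re.sub(r"[^a-z0-9' ]", '', utterance.lower()).strip()


def matches_entirely(utterance: str) -> bool:
    # Analogous to using /../ on its own: the whole input must match.
    return RE_GOOD.fullmatch(normalize(utterance)) is not None


def matches_partially(utterance: str) -> bool:
    # Analogous to putting /../ in a sequence: a match anywhere is enough.
    return RE_GOOD.search(normalize(utterance)) is not None


for u in ['So good!!!', 'Fantastic :)', "It's fantastic", "I'm so good, thank you!"]:
    print(u, matches_entirely(u), matches_partially(u))
```

One caveat of this approximation: search() also matches inside longer words (for example, "so good" inside "also good"), so it should not be read as a description of how Natex handles word boundaries.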

Variable

It is possible to store the matched results of a regular expression in variables. A variable in a regular expression is written as a name in angle brackets at the start of a capturing group, as in (?<NAME>..).

The following transitions take the user name and respond with the stored first and last name:

transitions = {
    'state': 'start',
    '`Hello. What should I call you?`': {
        '[/(?<FIRSTNAME>[a-z]+) (?<LASTNAME>[a-z]+)/]': {
            '`It\'s nice to meet you,` $FIRSTNAME `. I know several people with the last name,` $LASTNAME': 'end'
        },
        'error': {
            '`Sorry, I didn\'t understand you.`': 'end'
        }
    }
}
  • #4: matches the first name and the last name in order and stores them in the variables FIRSTNAME and LASTNAME.

  • #5: uses FIRSTNAME and LASTNAME in the response.

S: Hello. What should I call you?
U: Jinho Choi
S: It's nice to meet you, jinho . I know several people with the last name, choi
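
For comparison outside of Natex, Python's re module writes named groups as (?P<NAME>..) rather than (?<NAME>..). The sketch below shows the analogous behavior in plain Python; it is not a description of how Natex stores its variables, and RE_NAME is a name made up for this example.

```python
import re

# Python's own syntax for a named group is (?P<NAME>...).
RE_NAME = re.compile(r'(?P<FIRSTNAME>[a-z]+) (?P<LASTNAME>[a-z]+)')

# The input is lowercased first, mirroring the lowercase matching in Natex.
m = RE_NAME.search('Jinho Choi'.lower())
variables = m.groupdict()

print(variables)             # {'FIRSTNAME': 'jinho', 'LASTNAME': 'choi'}
print(m.group('FIRSTNAME'))  # jinho
```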
