3. Contextual Understanding

This chapter explains how to improve contextual understanding using Natex.

3.1. Natex

Several matching strategies built into Natex.

Emora STDM supports several ways of interpreting the contexts of user inputs through Natex (Natural Language Expression), some of which you already experienced in Matching Strategy.

Literal

A literal is what you intend the system to say. A literal is represented by reversed primes (`..`):

transitions = {
    'state': 'start',
    '`Hello. How are you?`': 'end'  # literal
}

#3: the system prompts the literal and ends the dialogue.

S: Hello. How are you?

Matching

Natex supports several ways of matching the input with key terms.

Term

The condition is true if the input exactly matches the term. A term is represented as a string and can have more than one token:

transitions = {
    'state': 'start',
    '`Hello. How are you?`': {         # literal
        'could be better': {           # term
            '`I hope your day gets better soon :(`': 'end'
        },
        'error': {
            '`Sorry, I didn\'t understand you.`': 'end'
        }
    }
}

#4: matches the input with 'could be better'.
#7: error is a reserved term indicating the default condition of this conditional branching, similar to the wildcard condition (_) in a match statement.

S: Hello. How are you?
U: Could be better..
S: I hope your day gets better soon :(

S: Hello. How are you?
U: It could be better
S: Sorry, I didn't understand you.

Set

The condition is true if the input exactly matches any term in the set. A set is represented by curly brackets ({}):

transitions = {
    'state': 'start',
    '`Hello. How are you?`': {         # literal
        'could be better': {           # term
            '`I hope your day gets better soon :(`': 'end'
        },
        '{good, not bad}': {           # set
            '`Glad to hear that you are doing well :)`': 'end'
        },
        'error': {
            '`Sorry, I didn\'t understand you.`': 'end'
        }
    }
}

#7: matches the input with either 'good' or 'not bad'.

S: Hello. How are you?
U: Good!!
S: Glad to hear that you are doing well :)

S: Hello. How are you?
U: Not bad..
S: Glad to hear that you are doing well :)

S: Hello. How are you?
U: I'm good
S: Sorry, I didn't understand you.

S: Hello. How are you?
U: Not so bad
S: Sorry, I didn't understand you.

Unordered List

The condition is true if some terms in the input match all terms in the unordered list, regardless of the order. An unordered list is represented by angle brackets (<>):

transitions = {
    'state': 'start',
    '`Hello. How are you?`': {         # literal
        'could be better': {           # term
            '`I hope your day gets better soon :(`': 'end'
        },
        '{good, not bad}': {           # set
            '`Glad to hear that you are doing well :)`': 'end'
        },
        '<very, good>': {              # unordered list
            '`So glad that you are having a great day!`': 'end'
        },
        'error': {
            '`Sorry, I didn\'t understand you.`': 'end'
        }
    }
}

#10: matches the input with both 'very' and 'good' in any order.

S: Hello. How are you?
U: Very good!
S: So glad that you are having a great day!

S: Hello. How are you?
U: I'm very well and good
S: So glad that you are having a great day!

S: Hello. How are you?
U: Good, things are going very well!
S: So glad that you are having a great day!

S: Hello. How are you?
U: Good
S: Glad to hear that you are doing well :)

Ordered List

The condition is true if some terms in the input match all terms in the ordered list, a.k.a. sequence, in the same order. An ordered list is represented by square brackets ([]):

transitions = {
    'state': 'start',
    '`Hello. How are you?`': {         # literal
        'could be better': {           # term
            '`I hope your day gets better soon :(`': 'end'
        },
        '{good, not bad}': {           # set
            '`Glad to hear that you are doing well :)`': 'end'
        },
        '<very, good>': {              # unordered list
            '`So glad that you are having a great day!`': 'end'
        },
        '[so, good]': {                # ordered list (sequence)
            '`Things are just getting better for you!`': 'end'
        },
        'error': {
            '`Sorry, I didn\'t understand you.`': 'end'
        }
    }
}

#13: matches the input with both 'so' and 'good' in that order.

S: Hello. How are you?
U: So good!
S: Things are just getting better for you!

S: Hello. How are you?
U: It's so wonderfully good!
S: Things are just getting better for you!

S: Hello. How are you?
U: It's good
S: Sorry, I didn't understand you.

S: Hello. How are you?
U: It's good so far
S: Sorry, I didn't understand you.
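
The four matching strategies introduced so far can be emulated in a few lines of plain Python. The sketch below is not how Natex is implemented; it only mirrors the behavior described in this section, assuming that inputs are lowercased and punctuation is ignored. All helper names (tokenize, find, match_term, match_set, match_unordered, match_ordered) are hypothetical:

```python
import re
from typing import Iterable, List


def tokenize(text: str) -> List[str]:
    """Lowercase the text and keep word-like tokens, dropping punctuation."""
    return re.findall(r"[a-z0-9']+", text.lower())


def find(tokens: List[str], term: List[str], start: int = 0) -> int:
    """Return the first index >= start where term occurs contiguously, or -1."""
    for i in range(start, len(tokens) - len(term) + 1):
        if tokens[i:i + len(term)] == term:
            return i
    return -1


def match_term(text: str, term: str) -> bool:
    """Term: the whole input must be exactly the term."""
    return tokenize(text) == tokenize(term)


def match_set(text: str, terms: Iterable[str]) -> bool:
    """Set: the whole input must be exactly one of the terms."""
    return any(match_term(text, t) for t in terms)


def match_unordered(text: str, terms: Iterable[str]) -> bool:
    """Unordered list: every term appears somewhere in the input."""
    tokens = tokenize(text)
    return all(find(tokens, tokenize(t)) >= 0 for t in terms)


def match_ordered(text: str, terms: Iterable[str]) -> bool:
    """Ordered list: every term appears in the input, in the given order."""
    tokens, start = tokenize(text), 0
    for t in terms:
        term = tokenize(t)
        i = find(tokens, term, start)
        if i < 0:
            return False
        start = i + len(term)
    return True


print(match_term('It could be better', 'could be better'))        # False
print(match_set('Not bad..', ['good', 'not bad']))                # True
print(match_unordered("I'm very well and good", ['very', 'good']))  # True
print(match_ordered("It's good so far", ['so', 'good']))          # False
```

The printed values agree with the sample dialogues above; real Natex conditions are compiled and matched by STDM itself and may differ in details such as tokenization.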

Currently, it matches the input "could be better" with the condition in #4, but does not match "it could be better" or "could be better for sure", where there are terms other than the ones indicated in the condition.

Update the condition such that it matches all three inputs.
How about matching inputs such as "could be much better" or "could be really better"?

'[could be better]'
'[could be, better]'

Rigid Sequence

The condition is true if all terms in the input exactly match all terms in the rigid sequence in the same order. A rigid sequence is represented by square brackets ([ ]), where the left bracket is followed by an exclamation mark (!):

transitions = {
    'state': 'start',
    '`Hello. How are you?`': {         # literal
        'could be better': {           # term
            '`I hope your day gets better soon :(`': 'end'
        },
        '{good, not bad}': {           # set
            '`Glad to hear that you are doing well :)`': 'end'
        },
        '<very, good>': {              # unordered list
            '`So glad that you are having a great day!`': 'end'
        },
        '[so, good]': {                # ordered list (sequence)
            '`Things are just getting better for you!`': 'end'
        },
        '[!hello, world]': {           # rigid sequence
            '`You\'re a programmer!`': 'end'
        },
        'error': {
            '`Sorry, I didn\'t understand you.`': 'end'
        }
    }
}

#16: matches the input only if it consists of exactly 'hello' followed by 'world', with no other terms.

S: Hello. How are you?
U: Hello World
S: You're a programmer!

S: Hello. How are you?
U: hello world to you
S: Sorry, I didn't understand you.

S: Hello. How are you?
U: It's hello world
S: Sorry, I didn't understand you.

There is no difference between matching a term (e.g., 'hello world') and matching a rigid sequence (e.g., '[!hello, world]'). The rigid sequence is designed specifically for negation, which will be deprecated in the next version.

Negation

The condition is true if all terms in the input exactly match all terms in the rigid sequence except for ones that are negated. A negation is represented by a hyphen (-):

transitions = {
    'state': 'start',
    '`Hello. How are you?`': {         # literal
        'could be better': {           # term
            '`I hope your day gets better soon :(`': 'end'
        },
        '{good, not bad}': {           # set
            '`Glad to hear that you are doing well :)`': 'end'
        },
        '<very, good>': {              # unordered list
            '`So glad that you are having a great day!`': 'end'
        },
        '[so, good]': {                # ordered list (sequence)
            '`Things are just getting better for you!`': 'end'
        },
        '[!hello, world]': {           # rigid sequence
            '`You\'re a programmer!`': 'end'
        },
        '[!-not, awful]': {            # negation
            '`Sorry to hear that :(`': 'end'
        },
        'error': {
            '`Sorry, I didn\'t understand you.`': 'end'
        }
    }
}

#19: matches the input with 'awful' and zero to many terms prior to it that are not 'not'.

S: Hello. How are you?
U: Awful!
S: Sorry to hear that :(

S: Hello. How are you?
U: It's so awful..
S: Sorry to hear that :(

S: Hello. How are you?
U: Not awful
S: Sorry, I didn't understand you.

S: Hello. How are you?
U: Not so awful
S: Sorry to hear that :(

S: Hello. How are you?
U: Awful and terrible
S: Sorry, I didn't understand you.

Nesting

It is possible to nest conditions for more advanced matching. Let us create a term condition that matches both "so good" and "very good" using a nested set:

transitions = {
    'state': 'start',
    '`Hello. How are you?`': {
        '{so, very} good': {
            '`Things are just getting better for you!`': 'end'
        },
        'error': {
            '`Sorry, I didn\'t understand you.`': 'end'
        }
    }
}

#4: uses a set inside a term.

Does this condition match "good"?

No, because the outer condition uses term matching that requires the whole input to be the same as the condition.

However, it does not match when other terms are included in the input (e.g., "It's so good to be here"). To broaden the matching scope, you can put the condition inside a sequence:

transitions = {
    'state': 'start',
    '`Hello. How are you?`': {
        '[{so, very} good]': {
            '`Things are just getting better for you!`': 'end'
        },
        'error': {
            '`Sorry, I didn\'t understand you.`': 'end'
        }
    }
}

#4: the term condition is inside the sequence.

What if we want the condition to match the above inputs as well as "fantastic"? You can put the condition under a set and add fantastic as another term:

transitions = {
    'state': 'start',
    '`Hello. How are you?`': {
        '{[{so, very} good], fantastic}': {
            '`Things are just getting better for you!`': 'end'
        },
        'error': {
            '`Sorry, I didn\'t understand you.`': 'end'
        }
    }
}

#4: the sequence condition and the new term fantastic are inside the set.

S: Hello. How are you?
U: I'm very good, thank you!
S: Things are just getting better for you!

S: Hello. How are you?
U: It's so good to be here :)
S: Things are just getting better for you!

S: Hello. How are you?
U: Fantastic!!!
S: Things are just getting better for you!

S: Hello. How are you?
U: Good
S: Sorry, I didn't understand you.

S: Hello. How are you?
U: It's fantastic
S: Sorry, I didn't understand you.

The above transitions match "Fantastic" but not "It's fantastic". Update the condition such that it can match both inputs.

Put fantastic under a sequence such that '{[{so, very} good], [fantastic]}'.

Variable

Saving user content can be useful in many ways. Let us consider the following transitions:

transitions = {
    'state': 'start',
    '`What is your favorite animal?`': {
        '[{dogs, cats, hamsters}]': {
            '`I like them too!`': 'end'
        },
        'error': {
            '`I\'ve never heard of that animal.`': 'end'
        }
    }
}

S: What is your favorite animal?
U: I like dogs
S: I like them too!

Users may feel more engaged if the system says, "I like dogs too" instead of "them". Natex allows you to create a variable to store the matched term. A variable is represented by a string preceded (without spaces) by a dollar sign $:

transitions = {
    'state': 'start',
    '`What is your favorite animal?`': {
        '[$FAVORITE_ANIMAL={dogs, cats, hamsters}]': {
            '`I like` $FAVORITE_ANIMAL `too!`': 'end'
        },
        'error': {
            '`I\'ve never heard of that animal.`': 'end'
        }
    }
}

#4: creates a variable FAVORITE_ANIMAL storing the matched term from the user content.
#5: uses the value of the variable to generate the follow-up system utterance.

In #5, two literals, `I like` and `too!` surround the variable $FAVORITE_ANIMAL. If a variable were indicated inside a literal, STDM would throw an error.

S: What is your favorite animal?
U: I like dogs!!
S: I like dogs too!

S: What is your favorite animal?
U: Hamsters are my favorite!
S: I like hamsters too!
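
The idea of storing a matched term and reusing it in the response can also be tried outside STDM. The following is a minimal sketch in plain Python, not the Natex implementation; the names ANIMALS, RE_ANIMAL, and respond are hypothetical, and a named group in Python's re module plays the role of the variable FAVORITE_ANIMAL:

```python
import re

ANIMALS = ('dogs', 'cats', 'hamsters')

# The named group stores whichever animal is found, like $FAVORITE_ANIMAL.
RE_ANIMAL = re.compile(r'\b(?P<FAVORITE_ANIMAL>' + '|'.join(ANIMALS) + r')\b')


def respond(user_input: str) -> str:
    """Return the system response for the user input."""
    m = RE_ANIMAL.search(user_input.lower())
    if m is None:
        return "I've never heard of that animal."
    variables = m.groupdict()  # e.g., {'FAVORITE_ANIMAL': 'dogs'}
    return f"I like {variables['FAVORITE_ANIMAL']} too!"


print(respond('I like dogs!!'))              # I like dogs too!
print(respond('Hamsters are my favorite!'))  # I like hamsters too!
print(respond('I like dragons'))             # I've never heard of that animal.
```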

3.2. Ontology

How to use ontologies for matching in Natex.

Ontology

Let us create a dialogue flow to talk about animals:

#2: the key ontology is paired with a dictionary as a value.
#3: the key animal represents the category, and its subcategories are indicated in the list.
#4-6: each subcategory, mammal, reptile, and amphibian, has its own subcategory.
#7: the ontology hierarchy: animal -> mammal -> dog.

Given the ontology, the above transitions can be rewritten as follows:

#4: matches the key "mammal" as well as its subcategories: "dog", "ape", and "rat".
#5: matches the key "reptile" as well as its subcategories: "snake" and "lizard".
#6: matches the key "amphibian" as well as its subcategories: "frog" and "salamander".

Unlike set matching, ontology matching handles plurals (e.g., "frogs").

Although there is no condition specified for the category dog that includes "golden retriever", there is a condition for its supercategory mammal (#4), to which it backs off.

Currently, ontology matching does not handle plurals for compound nouns (e.g., "golden retrievers"), which will be fixed in the following version.
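
To see why a condition on a supercategory also covers its subcategories, the hierarchy can be written as a Python dictionary and traversed. This is only a sketch under assumptions: the dictionary follows the structure described in the notes above (the entry for dog is assumed to contain only "golden retriever"), and the helper descendants is hypothetical, not part of STDM:

```python
import json
from typing import Dict, List, Set

# Assumed content, shaped like the ontology described above.
ONTOLOGY = {
    'ontology': {
        'animal': ['mammal', 'reptile', 'amphibian'],
        'mammal': ['dog', 'ape', 'rat'],
        'reptile': ['snake', 'lizard'],
        'amphibian': ['frog', 'salamander'],
        'dog': ['golden retriever'],
    }
}


def descendants(category: str, ontology: Dict[str, List[str]]) -> Set[str]:
    """Return the category itself plus everything below it in the hierarchy."""
    found = {category}
    for child in ontology.get(category, []):
        found |= descendants(child, ontology)
    return found


mammals = descendants('mammal', ONTOLOGY['ontology'])
print('golden retriever' in mammals)  # True: animal -> mammal -> dog -> golden retriever
print('frog' in mammals)              # False: frog is an amphibian

# The same dictionary serialized in JSON form.
print(json.dumps(ONTOLOGY, indent=2))
```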

Expression

It is possible that a category is mentioned in a non-canonical way; the above conditions do not match "puppy" because it is not introduced as a category in the ontology. In this case, we can specify the aliases as "expressions":

#10: the key expressions is paired with a dictionary as a value.
#4: allows matching "canine" and "puppy" for the dog category.

Once you load the updated JSON file, it now understands "puppy" as an expression of "dog":

It is possible to match "puppy" by adding the term as a category of "dog" (#7). However, it would not be a good practice as "puppy" should not be considered a subcategory of "dog".

Variable

Values matched by the ontology can also be stored in variables:

#4,7,10: the matched term gets stored in the variable FAVORITE_ANIMAL.
#5,8,11: the system uses the value of FAVORITE_ANIMAL to generate the response.

Loading

The custom ontology must be loaded to the knowledge base of the dialogue flow before it runs:

#1: loads the ontology in ontology_animal.json to the knowledge base of df.

Code Snippet

3.4. Regular Expression

How to use regular expressions for matching in Natex.

Regular expressions provide powerful ways to match strings and beyond:

Chapter 2.1, Speech and Language Processing (3rd ed.), Jurafsky and Martin.
Python Documentation

Syntax

Grouping

Repetitions

Non-greedy

Special Characters
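
A few constructs from each of the categories above can be tried directly with Python's re module. The patterns and input strings below are illustrative choices, not taken from the course material:

```python
import re

# Grouping: [..] is a character set, (..) a capturing group, | an alternation.
grouped = re.fullmatch(r'(gr[ae]y) (cat|dog)', 'grey dog').groups()
print(grouped)   # ('grey', 'dog')

# Repetitions: ? (zero or one), * (zero or more), + (one or more), {m,n} (m to n).
repeated = [bool(re.fullmatch(r'go{2,3}d?', s)) for s in ('good', 'goood', 'god', 'goo')]
print(repeated)  # [True, True, False, True]

# Non-greedy: a ? after a repetition makes it match as little as possible.
greedy = re.search(r'<.+>', '<b>bold</b>').group()
lazy = re.search(r'<.+?>', '<b>bold</b>').group()
print(greedy)    # <b>bold</b>
print(lazy)      # <b>

# Special characters: \d matches a digit (\w a word character, \s a whitespace).
numbers = re.findall(r'\d+', 'room 12, floor 3')
print(numbers)   # ['12', '3']
```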

Functions

Several functions are provided in Python to match regular expressions.

match()

Let us create a regular expression that matches "Mr." and "Ms.":

A regular expression is represented by r'expression', where the expression is a string preceded by the prefix r. The prefix makes it a raw string so that backslashes in the expression are not treated as escape characters.

The above code prints None, indicating that the value of m is None, because the regular expression does not match the string.

#1: since RE_MR matches the string, m is a match object.
#3: true since m is a match object.

What are the differences between a list and a tuple in Python?

#1: there are two groups in this regular expression, (M[rs]) and (\.).
#4,5: return the entire match "Ms.".
#6: returns "Ms" matched by the first group (M[rs]).
#7: returns "." matched by the second group (\.).
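
The group accessors described above can be checked with the short sketch below. The regular expression follows the two groups named in the notes; the input string 'Ms. Wayne' is an assumption chosen for illustration:

```python
import re

RE_MR = re.compile(r'(M[rs])(\.)')
m = RE_MR.match('Ms. Wayne')  # hypothetical input

print(m.group())   # Ms.  (the entire match)
print(m.group(0))  # Ms.  (same as group())
print(m.group(1))  # Ms   (first group)
print(m.group(2))  # .    (second group)
print(m.groups())  # ('Ms', '.')
```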

The above RE_MR matches "Mr." and "Ms." but not "Mrs." Modify it to match all of them (Hint: use a non-capturing group and |).

The non-capturing group (?:[rs]|rs) matches "r", "s", or "rs" such that the first group matches "Mr", "Ms", and "Mrs", respectively.

Since we use the non-capturing group, the following code still prints a tuple of two strings:

What if we use a capturing group instead?

Now, the nested group ([rs]|rs) is considered the second group such that the match returns a tuple of three strings as follows:
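
The difference between the two versions can be seen side by side in the following sketch. The names RE_NONCAPTURING and RE_CAPTURING and the input 'Mrs. Wayne' are assumptions made for illustration:

```python
import re

# Non-capturing inner group: only two groups are reported.
RE_NONCAPTURING = re.compile(r'(M(?:[rs]|rs))(\.)')
# Capturing inner group: the nested group becomes the second group.
RE_CAPTURING = re.compile(r'(M([rs]|rs))(\.)')

text = 'Mrs. Wayne'  # hypothetical input
two = RE_NONCAPTURING.match(text).groups()
three = RE_CAPTURING.match(text).groups()

print(two)    # ('Mrs', '.')
print(three)  # ('Mrs', 'rs', '.')
```

Groups are numbered by the position of their opening parenthesis, which is why the nested group comes before the group matching the period.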

search()

Let us match the following strings with RE_MR:

#4: matches "Mr." but not "Ms."
#5: matches neither "Mr." nor "Mrs."

For s1, only "Mr." is matched because match() stops matching after finding the first pattern. For s2 on the other hand, even "Mr." is not matched because match() requires the pattern to be at the beginning of the string.

search() returns a match object as match() does.
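
The contrast between match() and search() can be sketched as follows. The strings s1 and s2 are hypothetical stand-ins chosen to be consistent with the notes above, and RE_MR is assumed to be the version that also matches "Mrs.":

```python
import re

RE_MR = re.compile(r'(M(?:[rs]|rs))(\.)')

s1 = 'Mr. and Ms. Wayne are here'   # hypothetical input
s2 = 'Here are Mr. and Mrs. Wayne'  # hypothetical input

m1 = RE_MR.match(s1)   # matches only at the beginning of the string
m2 = RE_MR.match(s2)   # no match: s2 does not begin with the pattern
m3 = RE_MR.search(s2)  # scans the string and returns the first match

print(m1.group())            # Mr.
print(m2)                    # None
print(m3.group(), m3.span()) # Mr. (9, 12)
```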

findall()

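findall() returns a list of all matches in the string. When the regular expression has more than one group, each match is represented by a tuple holding the strings captured by those groups.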
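
A minimal sketch of findall(), assuming the version of RE_MR that also matches "Mrs." and a hypothetical input string s:

```python
import re

RE_MR = re.compile(r'(M(?:[rs]|rs))(\.)')
s = 'Here are Mr. and Mrs. Wayne'  # hypothetical input

# Two capturing groups: each match becomes a tuple of the captured strings.
pairs = RE_MR.findall(s)
print(pairs)   # [('Mr', '.'), ('Mrs', '.')]

# No capturing group: each match becomes the matched string itself.
whole = re.findall(r'M(?:[rs]|rs)\.', s)
print(whole)   # ['Mr.', 'Mrs.']
```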

finditer()

#1: returns a list of all m (in order) matched by finditer().
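
Unlike findall(), finditer() yields match objects, so the positions of the matches remain available. A minimal sketch, again with an assumed RE_MR and a hypothetical input string s:

```python
import re

RE_MR = re.compile(r'(M(?:[rs]|rs))(\.)')
s = 'Here are Mr. and Mrs. Wayne'  # hypothetical input

# Collect every match object in order of appearance.
matches = [m for m in RE_MR.finditer(s)]

# Each match object keeps the matched text and its span in s.
spans = [(m.group(), m.start(), m.end()) for m in matches]
print(spans)  # [('Mr.', 9, 12), ('Mrs.', 17, 21)]
```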

How is the code above different from the one below?

What are the advantages of using a list comprehension over a for-loop other than it makes the code shorter?

Write regular expressions to match the following cases:

Abbreviation: Dr., U.S.A.
Apostrophe: '80, '90s, 'cause
Concatenation: don't, gonna, cannot
Hyperlink: https://github.com/emory-courses/cs329/
Number: 1/2, 123-456-7890, 1,000,000
Unit: $10, #20, 5kg

Natex Integration

Write a regular expression that matches the above condition.

It is possible to use regular expressions for matching in Natex. A regular expression is represented by forward slashes (/../):

#4: true if the entire input matches the regular expression.

You can put the expression in a sequence to allow it a partial match:

#4: the regular expression is put in a sequence [].

When used in Natex, all literals in the regular expression (e.g., "so", "good" in #4) must be lowercase because Natex matches everything in lowercase. The design choice is made because users tend not to follow typical capitalization in a chat interface, whether it is text- or audio-based.

Variable

It is possible to store the matched results of a regular expression to variables. A variable in a regular expression is represented by angle brackets (<..>) inside a capturing group ((?..)).

The following transitions take the user name and respond with the stored first and last name:

#4: matches the first name and the last name in order and stores them in the variables FIRSTNAME and LASTNAME.
#5: uses FIRSTNAME and LASTNAME in the response.
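
Python's own re module offers a comparable mechanism, where a named group is written as (?P<NAME>..). The sketch below shows this plain-Python counterpart rather than the Natex syntax; the pattern, the input, and the response are illustrative assumptions:

```python
import re

# Named groups in Python's re module use the (?P<NAME>..) form.
RE_NAME = re.compile(r'my name is (?P<FIRSTNAME>[a-z]+) (?P<LASTNAME>[a-z]+)')

user_input = 'Hi, my name is Bruce Wayne'  # hypothetical input
m = RE_NAME.search(user_input.lower())     # lowercase first, as Natex does

variables = m.groupdict()
print(variables)  # {'FIRSTNAME': 'bruce', 'LASTNAME': 'wayne'}

response = f"Nice to meet you, {variables['FIRSTNAME'].capitalize()} {variables['LASTNAME'].capitalize()}."
print(response)   # Nice to meet you, Bruce Wayne.
```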

3.5. Macro

How to use macro functions for matching in Natex.

The most powerful aspect of Natex is its ability to integrate pattern matching with arbitrary code. This allows you to integrate regular expressions, NLP models, or custom algorithms into Natex.

Creation

A macro can be defined by creating a class that inherits Macro from STDM and overrides the run method:

#1: imports Macro from STDM.
#2: imports type hints from the typing package in Python.
#4: creates the MacroGetName class inheriting Macro.
#5: overrides the run method declared in Macro.

Currently, the run method returns True no matter what the input is.
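
A skeleton consistent with the description above might look as follows. It is a sketch, not the course's exact code: the import of Macro and Ngrams from emora_stdm and the parameter types are assumptions, and the fallback stand-ins exist only so that the snippet also runs where STDM is not installed:

```python
from typing import Any, Dict, List

try:
    from emora_stdm import Macro, Ngrams  # assumed import path
except ImportError:
    # Stand-ins (assumptions) so that the sketch runs without STDM installed.
    class Macro:
        def run(self, ngrams, vars, args):
            raise NotImplementedError

    Ngrams = set


class MacroGetName(Macro):
    def run(self, ngrams: Ngrams, vars: Dict[str, Any], args: List[Any]):
        # Accepts every input for now; the matching logic comes later.
        return True
```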

Integration

Let us create transitions using this macro. A macro is represented by an alias preceded by the pound sign (#):

#4: calls the macro #GET_NAME that is an alias of MacroGetName.
#13: creates a dictionary defining aliases for macros.
#14: creates an object of MacroGetName and saves it to the alias GET_NAME.

To call the macro, we need to add the alias dictionary macros to the dialogue flow:

#3: adds all macros defined in macros to the dialogue flow df.

Parameters

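The run method has three parameters:

ngrams: is a set of strings representing all n-grams of the input matched by the Natex; it also provides the text() and raw_text() methods.
vars: is the variable dictionary, maintained by a DialogueFlow object, where the keys and values are variable names and objects corresponding to their values.
args: is a list of strings representing arguments specified in the macro call.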

Let us modify the run method to see what ngrams and vars give:

#2: prints the original string of the matched input span before preprocessing.
#3: prints the input span, preprocessed by STDM and matched by the Natex.
#4: prints a set of n-grams.

When you interact with the dialogue flow by running it (df.run()), it prints the following:

The raw_text method returns the original input:

The text method returns the preprocessed input used to match the Natex:

The ngrams gives a set of all possible n-grams in text():
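
To make "all possible n-grams" concrete, the following sketch enumerates them for a short input. The function all_ngrams is hypothetical and splits on whitespace only; the tokenization used by STDM may differ:

```python
from typing import Set


def all_ngrams(text: str) -> Set[str]:
    """Return every contiguous token sequence (n-gram) in the text."""
    tokens = text.split()
    return {
        ' '.join(tokens[i:j])
        for i in range(len(tokens))
        for j in range(i + 1, len(tokens) + 1)
    }


print(sorted(all_ngrams('dr bruce wayne')))
# ['bruce', 'bruce wayne', 'dr', 'dr bruce', 'dr bruce wayne', 'wayne']
```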

Finally, the vars gives a dictionary consisting of both system-level and user-custom variables (no user-custom variables are saved at the moment):

Implementation

Let us update the run method that matches the title, first name, and last name in the input and saves them to the variables $TITLE, $FIRSTNAME, and $LASTNAME, respectively:

#2: creates a regular expression to match the title, first name and last name.
#3: searches for the span to match.
#4: returns False if no match is found.
#6-18 -> exercise.
#20-22: saves the recognized title, first name, and last name to the corresponding variables.
#24: returns True as the regular expression matches the input span.
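
The overall flow of such a run method (compile, match, return False on failure, save the variables, return True) can be sketched as a standalone function. This is one possible simplification, not the intended solution to the exercise: the function get_name, the regular expression RE_NAME, and the use of fullmatch() on a name-only input are all assumptions:

```python
import re
from typing import Any, Dict

# Optional title, required first name, optional last name (assumed pattern).
RE_NAME = re.compile(r"(?:(mr|mrs|ms|dr)\.?\s+)?([a-z']+)(?:\s+([a-z']+))?")


def get_name(text: str, vars: Dict[str, Any]) -> bool:
    """Save the title, first name, and last name found in text to vars."""
    m = RE_NAME.fullmatch(text.strip().lower())
    if m is None:
        return False

    title, firstname, lastname = m.groups()
    vars['TITLE'] = title          # None if no title is given
    vars['FIRSTNAME'] = firstname
    vars['LASTNAME'] = lastname    # None if no last name is given
    return True


variables: Dict[str, Any] = {}
print(get_name('Dr. Bruce Wayne', variables))  # True
print(variables)  # {'TITLE': 'dr', 'FIRSTNAME': 'bruce', 'LASTNAME': 'wayne'}
```

Inside a macro, the same logic would read the input from ngrams.text() and write to the vars dictionary passed to run.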

Given the updated macro, the above transitions can be modified as follows:

#5: uses the variables $FIRSTNAME and $LASTNAME retrieved by the macro to generate the output.

The following shows the outputs:

Can macros be mixed with other Natex expressions?

3.6. Quiz

Quiz 3: Contextual Understanding

Overview

Your goal is to create a chatbot that talks about movies. Here is a sample dialogue:

Your chatbot aims to collect user information by asking the following:

The latest movie that the user watched (#3-4).
A contextualized question regarding the latest movie (#5-6).
A question regarding the genre of the latest movie (#7-10).

Your chatbot should give an appropriate response to every user response. For this assignment, you must use all of the following:

Task 1

Update them to design a dialogue flow for the chatbot.

Task 2

Create a PDF file quiz3.pdf that describes the following:

Sample dialogues that your chatbot can conduct.
Explanations of how the ontology, macro(s), and regular expression(s) are used for contextual understanding in your chatbot.

Submission

Commit and push quiz3.py to your GitHub repository.
Submit quiz3.pdf to Canvas.

3.1. Natex

Several matching strategies built in Natex.

Emora STDM supports several ways for interpreting the contexts of user inputs through Natex (Natural Langauge Expression), some of which you already experienced in Matching Strategy.

Literal

A literal is what you intend the system to say. A literal is represented by reversed primes (`..`):

transitions = {
    'state': 'start',
    '`Hello. How are you?`': 'end'  # literal
}

#3: the system prompts the literal and ends the dialogue.

S: Hello. How are you?

Matching

Natex supports several ways of matching the input with key terms.

Term

The condition is true if the input exactly matches the term. A term is represented as a string and can have more than one token:

transitions = {
    'state': 'start',
    '`Hello. How are you?`': {         # literal
        'could be better': {           # term
            '`I hope your day gets better soon :(`': 'end'
        },
        'error': {
            '`Sorry, I didn\'t understand you.`': 'end'
        }
    }
}

#4: matches the input with 'could be better'.
#7: error is a reserved term indicating the default condition of this conditional branching, similar to the wildcard condition (_) in a match statement.

S: Hello. How are you?
U: Could be better..
S: I hope your day gets better soon :(

S: Hello. How are you?
U: It could be better
S: Sorry, I didn't understand you.

Set

The condition is true if the input exactly matches any term in the set. A set is represented by curly brackets ({}):

transitions = {
    'state': 'start',
    '`Hello. How are you?`': {         # literal
        'could be better': {           # term
            '`I hope your day gets better soon :(`': 'end'
        },
        '{good, not bad}': {           # set
            '`Glad to hear that you are doing well :)`': 'end'
        },
        'error': {
            '`Sorry, I didn\'t understand you.`': 'end'
        }
    }
}

#7: matches the input with either 'good' or 'not bad'.

S: Hello. How are you?
U: Good!!
S: Glad to hear that you are doing well :)

S: Hello. How are you?
U: Not bad..
S: Glad to hear that you are doing well :)

S: Hello. How are you?
U: I'm good
S: Sorry, I didn't understand you.

S: Hello. How are you?
U: Not so bad
S: Sorry, I didn't understand you.

Unordered List

The condition is true if some terms in the input match all terms in the unordered list, regardless of the order. An unordered list is represented by angle brackets (<>):

transitions = {
    'state': 'start',
    '`Hello. How are you?`': {         # literal
        'could be better': {           # term
            '`I hope your day gets better soon :(`': 'end'
        },
        '{good, not bad}': {           # set
            '`Glad to hear that you are doing well :)`': 'end'
        },
        '<very, good>': {              # unordered list
            '`So glad that you are having a great day!`': 'end'
        },
        'error': {
            '`Sorry, I didn\'t understand you.`': 'end'
        }
    }
}

#10: matches the input with both 'very' and 'good' in any order.

S: Hello. How are you?
U: Very good!
S: So glad that you are having a great day!

S: Hello. How are you?
U: I'm very well and good
S: So glad that you are having a great day!

S: Hello. How are you?
U: Good, things are going very well!
S: So glad that you are having a great day!

S: Hello. How are you?
U: Good
S: Glad to hear that you are doing well :)

Ordered List

The condition is true if some terms in the input match all terms in the ordered list, a.k.a. sequence, in the same order. An ordered list is represented by square brackets ([]):

transitions = {
    'state': 'start',
    '`Hello. How are you?`': {         # literal
        'could be better': {           # term
            '`I hope your day gets better soon :(`': 'end'
        },
        '{good, not bad}': {           # set
            '`Glad to hear that you are doing well :)`': 'end'
        },
        '<very, good>': {              # unordered list
            '`So glad that you are having a great day!`': 'end'
        },
        '[so, good]': {                # ordered list (sequence)
            '`Things are just getting better for you!`': 'end'
        },
        'error': {
            '`Sorry, I didn\'t understand you.`': 'end'
    }
}

#13: matches the input with both 'so' and 'good' in that order.

S: Hello. How are you?
U: So good!
S: Things are just getting better for you!

S: Hello. How are you?
U: It's so wonderfully good!
S: Things are just getting better for you!

S: Hello. How are you?
U: It's good
S: Sorry, I didn't understand you.

S: Hello. How are you?
U: It's good so far
S: Sorry, I didn't understand you.

Update the condition such that it matches all three inputs.
How about matching inputs such as "could be much better" or "could be really better"?

'[could be better]'
'[could be, better]'

Rigid Sequence

transitions = {
    'state': 'start',
    '`Hello. How are you?`': {         # literal
        'could be better': {           # term
            '`I hope your day gets better soon :(`': 'end'
        },
        '{good, not bad}': {           # set
            '`Glad to hear that you are doing well :)`': 'end'
        },
        '<very, good>': {              # unordered list
            '`So glad that you are having a great day!`': 'end'
        },
        '[so, good]': {                # ordered list (sequence)
            '`Things are just getting better for you!`': 'end'
        },
        '[!hello, world]': {           # rigid sequence
            '`You\'re a programmer!`': 'end'
        },
        'error': {
            '`Sorry, I didn\'t understand you.`': 'end'
        }
    }
    }
}

#16: matches the input with both 'hello' and 'world' in that order.

S: Hello. How are you?
U: Hello World
S: You're a programmer!

S: Hello. How are you?
U: hello world to you
S: Sorry, I didn't understand you.

S: Hello. How are you?
U: It's hello world
S: Sorry, I didn't understand you.

Negation

The condition is true if all terms in the input exactly match all terms in the rigid sequence except for ones that are negated. A negation is represented by a hyphen (-):

transitions = {
    'state': 'start',
    '`Hello. How are you?`': {         # literal
        'could be better': {           # term
            '`I hope your day gets better soon :(`': 'end'
        },
        '{good, not bad}': {           # set
            '`Glad to hear that you are doing well :)`': 'end'
        },
        '<very, good>': {              # unordered list
            '`So glad that you are having a great day!`': 'end'
        },
        '[so, good]': {                # ordered list (sequence)
            '`Things are just getting better for you!`': 'end'
        },
        '[!hello, world]': {           # rigid sequence
            '`You\'re a programmer!`': 'end'
        },
        '[!-not, aweful]': {           # negation
            '`Sorry to hear that :(`': 'end'
        },
        'error': {
            '`Sorry, I didn\'t understand you.`': 'end'
        }
    }
}

#19: matches the input with 'aweful' and zero to many terms prior to it that are not 'not'.

S: Hello. How are you?
U: Aweful!
S: Sorry to hear that :(

S: Hello. How are you?
U: It's so aweful..
S: Sorry to hear that :(

S: Hello. How are you?
U: Not aweful
S: Sorry, I didn't understand you.

S: Hello. How are you?
U: Not so aweful
S: Sorry to hear that :(

S: Hello. How are you?
U: Aweful and terrible
S: Sorry, I didn't understand you.

Nesting

It is possible to nest conditions for more advanced matching. Let us create a term condition that matches both "so good" and "very good" using a nested set:

transitions = {
    'state': 'start',
    '`Hello. How are you?`': {
        '{so, very} good': {
                '`Things are just getting better for you!`': 'end'
            },
        'error': {
            '`Sorry, I didn\'t understand you.`': 'end'
        }
    }
}

#4: uses a set inside a term.

Does this condition match "good"?

No, because the outer condition uses term matching that requires the whole input to be the same as the condition.

However, it does not match when other terms are included in the input (e.g., "It's so good to be here"). To broaden the matching scope, you can put the condition inside a sequence:

transitions = {
    'state': 'start',
    '`Hello. How are you?`': {
        '[{so, very} good]': {
                '`Things are just getting better for you!`': 'end'
            },
        'error': {
            '`Sorry, I didn\'t understand you.`': 'end'
        }
    }
}

#4: the term condition is inside the sequence.

What if we want the condition to match the above inputs as well as "fantastic"? You can put the condition under a set and add fantastic as another term:

transitions = {
    'state': 'start',
    '`Hello. How are you?`': {
        '{[{so, very} good], fantastic}': {
                '`Things are just getting better for you!`': 'end'
            },
        'error': {
            '`Sorry, I didn\'t understand you.`': 'end'
        }
    }
}

#4: the sequence condition and the new term fantastic is inside the set.

S: Hello. How are you?
U: I'm very good, thank you!
S: Things are just getting better for you!

S: Hello. How are you?
U: It's so good to be here :)
S: Things are just getting better for you!

S: Hello. How are you?
U: Fantastic!!!
S: Things are just getting better for you!

S: Hello. How are you?
U: Good
S: Sorry, I didn't understand you.

S: Hello. How are you?
U: It's fantastic
S: Sorry, I didn't understand you.

The above transitions match "Fantastic" but not "It's fantastic". Update the condition such that it can match both inputs.

Put fantastic under a sequence such that '{[{so, very} good], [fantastic]}'.

Variable

Saving user content can be useful in many ways. Let us consider the following transitions:

transitions = {
    'state': 'start',
    '`What is your favorite animal?`': {
        '[{dogs, cats, hamsters}]': {
            '`I like them too!`': 'end'
        },
        'error': {
            '`I\'ve never heard of that animal.`': 'end'
        }
    }
}

S: What is your favorite animal?
U: I like dogs
S: I like them too!

Storing the matched term in a variable lets the system reuse it in its response:

transitions = {
    'state': 'start',
    '`What is your favorite animal?`': {
        '[$FAVORITE_ANIMAL={dogs, cats, hamsters}]': {
            '`I like` $FAVORITE_ANIMAL `too!`': 'end'
        },
        'error': {
            '`I\'ve never heard of that animal.`': 'end'
        }
    }
}

#4: creates a variable FAVORITE_ANIMAL storing the matched term from the user content.
#5: uses the value of the variable to generate the follow-up system utterance.

In #5, two literals, `I like` and `too!` surround the variable $FAVORITE_ANIMAL. If a variable were indicated inside a literal, STDM would throw an error.

S: What is your favorite animal?
U: I like dogs!!
S: I like dogs too!

S: What is your favorite animal?
U: Hamsters are my favorite!
S: I like hamsters too!

3.4. Regular Expression

How to use regular expressions for matching in Natex.

Regular expressions provide powerful ways to match strings and beyond:

Chapter 2.1 (Regular Expressions), Speech and Language Processing (3rd ed.), Jurafsky and Martin.
re (regular expression operations), Python Documentation

Syntax

Grouping

Repetitions

Non-greedy

Special Characters
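
The syntax categories above can be illustrated with Python's built-in re module. The patterns and sample strings in this sketch are illustrative choices rather than entries from the original syntax tables:

```python
import re

# Grouping: [] is a character set, () a capturing group, | an alternation.
assert re.fullmatch(r'[abc]', 'b')
assert re.fullmatch(r'(ab|cd)e', 'cde').group(1) == 'cd'

# Repetitions: ? (0 or 1), * (0 or more), + (1 or more), {m,n} (m to n times).
assert re.fullmatch(r'colou?r', 'color')
assert re.fullmatch(r'a*b+', 'aabbb')
assert re.fullmatch(r'\d{2,4}', '2023')

# Non-greedy: appending ? to a quantifier makes it match as little as possible.
text = '<a><b>'
greedy = re.match(r'<.+>', text).group()       # '<a><b>'
non_greedy = re.match(r'<.+?>', text).group()  # '<a>'

# Special characters: ^ and $ (start and end of the string),
# \d (digit), \s (whitespace), \w (word character).
assert re.search(r'^\w+\s\d+$', 'room 101')

print(greedy, non_greedy)
```

The greedy pattern consumes both tags, whereas the non-greedy one stops at the first closing bracket.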

Functions

Several functions are provided in Python to match regular expressions.

match()

Let us create a regular expression that matches "Mr." and "Ms.":

import re

RE_MR = re.compile(r'M[rs]\.')
m = RE_MR.match('Dr. Wayne')
print(m)

#1: imports the regular expression library.
#3: compiles the regular expression into the regular expression object RE_MR.
#4: matches the string "Dr. Wayne" with RE_MR and saves the match object to m.

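A regular expression is written as r'expression', a raw string in which the prefix r keeps Python from interpreting backslashes as escape characters, so that sequences such as \. reach the regular expression engine unchanged.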
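
The effect of the r prefix can be seen with a pattern that contains \b, which Python reads as a backspace character in an ordinary string but which the regular expression engine reads as a word boundary. The strings below are illustrative:

```python
import re

# Without the r prefix, Python itself turns '\b' into a backspace character.
plain = '\bMr\b'
raw = r'\bMr\b'

print(len(plain), len(raw))  # 4 6

# Only the raw string reaches the regex engine as the word-boundary token \b.
found_raw = re.search(raw, 'Hello Mr Wayne') is not None
found_plain = re.search(plain, 'Hello Mr Wayne') is not None
print(found_raw, found_plain)  # True False
```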

The above code prints None, indicating that the value of m is None, because the regular expression does not match the string.

m = RE_MR.match('Mr. Wayne')
print(m)
if m:
    print(m.group(), m.start(), m.end())

#1: since RE_MR matches the string, m is a match object.
#3: true since m is a match object.
#4: prints the matched substring, and the start (inclusive) and end (exclusive) indices of the substring with respect to the original string in #1.

<re.Match object; span=(0, 3), match='Mr.'>
Mr. 0 3

Currently, no groups are specified in RE_MR:

print(m.groups())

#1: returns an empty tuple ().

What are the differences between a list and a tuple in Python?

It is possible to group specific patterns using parentheses:

RE_MR = re.compile(r'(M[rs])(\.)')
m = RE_MR.match('Ms. Wayne')
print(m.groups())
print(m.group())
print(m.group(0))
print(m.group(1))
print(m.group(2))

#1: there are two groups in this regular expression, (M[rs]) and (\.).
#3: returns a tuple of matched substrings ('Ms', '.') for the two groups in #1.
#4,5: return the entire match "Ms.".
#6: returns "Ms" matched by the first group (M[rs]).
#7: returns "." matched by the second group (\.).

('Ms', '.')
Ms.
Ms.
Ms
.
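
Group numbers are also accepted by start(), end(), and span() on a match object, which makes it possible to locate each group in the original string. A short sketch using the same pattern and input:

```python
import re

RE_MR = re.compile(r'(M[rs])(\.)')
m = RE_MR.match('Ms. Wayne')

# span(n) returns the (start, end) indices of group n; group 0 is the entire match.
print(m.span(0))  # (0, 3): the entire match "Ms."
print(m.span(1))  # (0, 2): the first group "Ms"
print(m.span(2))  # (2, 3): the second group "."
print(m.start(2), m.end(2))  # 2 3
```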

The above RE_MR matches "Mr." and "Ms." but not "Mrs." Modify it to match all of them (Hint: use a non-capturing group and |).

RE_MR = re.compile(r'(M(?:[rs]|rs))(\.)')

The non-capturing group (?:[rs]|rs) matches "r", "s", or "rs" such that the first group matches "Mr", "Ms", and "Mrs", respectively.

Since we use the non-capturing group, the following code still prints a tuple of two strings:

print(RE_MR.match('Mrs. Wayne').groups())
--> ('Mrs', '.')

What if we use a capturing group instead?

RE_MR = re.compile(r'(M([rs]|rs))(\.)')

Now, the nested group ([rs]|rs) is considered the second group such that the match returns a tuple of three strings as follows:

print(RE_MR.match('Mrs. Wayne').groups())
--> ('Mrs', 'rs', '.')

search()

Let us match the following strings with RE_MR (the version with the non-capturing group):

s1 = 'Mr. and Ms. Wayne are here'
s2 = 'Here are Mr. and Mrs. Wayne'

print(RE_MR.match(s1))
print(RE_MR.match(s2))

#4: matches "Mr." but not "Ms."
#5: matches neither "Mr." nor "Mrs."

<re.Match object; span=(0, 3), match='Mr.'>
None

To match a pattern anywhere in the string, we need to search for the pattern instead:

print(RE_MR.search(s1))
print(RE_MR.search(s2))

search() returns a match object as match() does.

<re.Match object; span=(0, 3), match='Mr.'>
<re.Match object; span=(9, 12), match='Mr.'>

findall()

search() still does not return the second substrings, "Ms." and "Mrs.". The following shows how to find all substrings that match the pattern:

print(RE_MR.findall(s1))
print(RE_MR.findall(s2))

findall() returns a list of tuples, one per match, where each tuple holds the substrings captured by the groups in the pattern.

[('Mr', '.'), ('Ms', '.')]
[('Mr', '.'), ('Mrs', '.')]
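
The shape of the list returned by findall() depends on how many capturing groups the pattern has. The sketch below runs three variants of the pattern (illustrative, not part of the original exercise) on the same string:

```python
import re

s1 = 'Mr. and Ms. Wayne are here'

# No groups: a list of whole-match strings.
no_groups = re.findall(r'M[rs]\.', s1)
# One group: a list of strings holding only that group for each match.
one_group = re.findall(r'(M[rs])\.', s1)
# Two or more groups: a list of tuples, one tuple per match.
two_groups = re.findall(r'(M[rs])(\.)', s1)

print(no_groups)   # ['Mr.', 'Ms.']
print(one_group)   # ['Mr', 'Ms']
print(two_groups)  # [('Mr', '.'), ('Ms', '.')]
```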

finditer()

Since findall() returns a list of tuples instead of match objects, there is no definite way of locating the matched results in the original string. To return match objects instead, we need to iteratively find the pattern:

for m in RE_MR.finditer(s1):
    print(m)

#1: finditer() returns an iterator that keeps matching the pattern until no more matches are found.

<re.Match object; span=(0, 3), match='Mr.'>
<re.Match object; span=(8, 11), match='Ms.'>

for m in RE_MR.finditer(s2):
    print(m)

<re.Match object; span=(9, 12), match='Mr.'>
<re.Match object; span=(17, 21), match='Mrs.'>

You can use a list comprehension to store the match objects as a list:

ms = [m for m in RE_MR.finditer(s1)]
print(ms)

#1: returns a list of all m (in order) matched by finditer().

[<re.Match object; span=(0, 3), match='Mr.'>, <re.Match object; span=(8, 11), match='Ms.'>]

How is the code above different from the one below?

ms = []
for m in RE_MR.finditer(s1):
    ms.append(m)

What are the advantages of using a list comprehension over a for-loop, other than making the code shorter?

Write regular expressions to match the following cases:

Abbreviation: Dr., U.S.A.
Apostrophe: '80, '90s, 'cause
Concatenation: don't, gonna, cannot
Hyperlink: https://github.com/emory-courses/cs329/
Number: 1/2, 123-456-7890, 1,000,000
Unit: $10, #20, 5kg

RE_TOK = re.compile(r'([",.]|n\'t|\s+)')
RE_ABBR = re.compile(r'((?:Mr|Mrs|Ms|Dr)\.)|((?:[A-Z]\.){2,})')
RE_APOS = re.compile(r'\'(\d\ds?|cause)')
RE_CONC = re.compile(r'([A-Za-z]+)(n\'t)|(gon)(na)|(can)(not)')
RE_HYPE = re.compile(r'(https?://\S+)')
RE_NUMB = re.compile(r'(\d+/\d+)|(\d{3}-\d{3}-\d{4})|(\d{1,3}(?:,\d{3})+)')
RE_UNIT = re.compile(r'([$#])?(\d+)([km]g)?')
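
One way to check such patterns is to run each of them against the examples it is supposed to cover. The harness below is a minimal sketch: the CASES list and the use of fullmatch() are choices made here, and RE_NUMB uses \d{1,3} before the first comma so that numbers such as 12,000 are also covered in full:

```python
import re

RE_ABBR = re.compile(r'((?:Mr|Mrs|Ms|Dr)\.)|((?:[A-Z]\.){2,})')
RE_APOS = re.compile(r'\'(\d\ds?|cause)')
RE_CONC = re.compile(r'([A-Za-z]+)(n\'t)|(gon)(na)|(can)(not)')
RE_HYPE = re.compile(r'(https?://\S+)')
RE_NUMB = re.compile(r'(\d+/\d+)|(\d{3}-\d{3}-\d{4})|(\d{1,3}(?:,\d{3})+)')
RE_UNIT = re.compile(r'([$#])?(\d+)([km]g)?')

# Each pattern is paired with the example strings from the exercise.
CASES = [
    (RE_ABBR, ['Dr.', 'U.S.A.']),
    (RE_APOS, ["'80", "'90s", "'cause"]),
    (RE_CONC, ["don't", 'gonna', 'cannot']),
    (RE_HYPE, ['https://github.com/emory-courses/cs329/']),
    (RE_NUMB, ['1/2', '123-456-7890', '1,000,000']),
    (RE_UNIT, ['$10', '#20', '5kg']),
]

# fullmatch() succeeds only if the pattern covers the whole example string.
failures = [
    (regex.pattern, example)
    for regex, examples in CASES
    for example in examples
    if regex.fullmatch(example) is None
]
print(failures)  # []
```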

Natex Integration

The nesting example shown earlier has the following condition (#4):

'{[{so, very} good], fantastic}'

Write a regular expression that matches the above condition.

r'((?:so|very) good|fantastic)'
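
The expression can be checked in plain Python before it is used anywhere else. The sketch below contrasts fullmatch(), which requires the whole string to match, with search(), which accepts a match anywhere in the string. The inputs are written in lowercase without punctuation, which is an assumption made here for simplicity:

```python
import re

RE_GOOD = re.compile(r'((?:so|very) good|fantastic)')

results = {}
for s in ['so good', 'very good', 'fantastic', 'good', 'its fantastic']:
    whole = RE_GOOD.fullmatch(s) is not None  # the entire string must match
    part = RE_GOOD.search(s) is not None      # a match anywhere is enough
    results[s] = (whole, part)
    print(f'{s!r}: fullmatch={whole}, search={part}')
```

Only "its fantastic" differs between the two functions: it fails fullmatch() but passes search(), while "good" fails both.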

It is possible to use regular expressions for matching in Natex. A regular expression is represented by forward slashes (/../):

transitions = {
    'state': 'start',
    '`Hello. How are you?`': {
        '/((?:so|very) good|fantastic)/': {
            '`Things are just getting better for you!`': 'end'
        },
        'error': {
            '`Sorry, I didn\'t understand you.`': 'end'
        }
    }
}

#4: true if the entire input matches the regular expression.

S: Hello. How are you?
U: So good!!!
S: Things are just getting better for you!

S: Hello. How are you?
U: Fantastic :)
S: Things are just getting better for you!

S: Hello. How are you?
U: It's fantastic
S: Sorry, I didn't understand you.

You can put the expression in a sequence to allow a partial match:

transitions = {
    'state': 'start',
    '`Hello. How are you?`': {
        '[/((?:so|very) good|fantastic)/]': {
            '`Things are just getting better for you!`': 'end'
        },
        'error': {
            '`Sorry, I didn\'t understand you.`': 'end'
        }
    }
}

#4: the regular expression is put in a sequence [].

S: Hello. How are you?
U: It's fantastic!!
S: Things are just getting better for you!

S: Hello. How are you?
U: I'm so good, thank you!
S: Things are just getting better for you!

Variable

It is possible to store the matched results of a regular expression in variables. A variable in a regular expression is declared by a question mark followed by the variable name in angle brackets at the start of a capturing group ((?<NAME>..)).

The following transitions take the user name and respond with the stored first and last name:

transitions = {
    'state': 'start',
    '`Hello. What should I call you?`': {
        '[/(?<FIRSTNAME>[a-z]+) (?<LASTNAME>[a-z]+)/]': {
            '`It\'s nice to meet you,` $FIRSTNAME `. I know several people with the last name,` $LASTNAME': 'end'
        },
        'error': {
            '`Sorry, I didn\'t understand you.`': 'end'
        }
    }
}

#4: matches the first name and the last name in order and stores them in the variables FIRSTNAME and LASTNAME.
#5: uses FIRSTNAME and LASTNAME in the response.

S: Hello. What should I call you?
U: Jinho Choi
S: It's nice to meet you, jinho . I know several people with the last name, choi
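
For comparison, Python's built-in re module spells named groups as (?P<NAME>..) rather than (?<NAME>..). The sketch below applies the same pattern outside of Natex; lowercasing the input by hand is an assumption standing in for the preprocessing that STDM performs:

```python
import re

# Named groups in Python's re module use the (?P<NAME>..) syntax.
RE_NAME = re.compile(r'(?P<FIRSTNAME>[a-z]+) (?P<LASTNAME>[a-z]+)')

m = RE_NAME.search('Jinho Choi'.lower())
if m:
    print(m.group('FIRSTNAME'))  # jinho
    print(m.group('LASTNAME'))   # choi
    print(m.groupdict())         # {'FIRSTNAME': 'jinho', 'LASTNAME': 'choi'}
```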

3.5. Macro

How to use macro functions for matching in Natex.

The most powerful aspect of Natex is its ability to integrate pattern matching with arbitrary code. This allows you to integrate regular expressions, NLP models, or custom algorithms into Natex.

Creation

A macro can be defined by creating a class that inherits Macro in STDM and overrides the run method:

#1: imports Macro from STDM.
#2: imports type hints from the typing package in Python.
#4: creates the MacroGetName class inheriting Macro.
#5: overrides the run method declared in Macro.

Currently, the run method returns True no matter what the input is.
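
The class described by the notes above can be sketched as follows. The import path emora_stdm is an assumption about the package name, and the fallback stand-ins are hypothetical, present only so that the sketch runs without the library installed; the run signature is the one used later in this section:

```python
from typing import Any, Dict, List

try:
    from emora_stdm import Macro, Ngrams  # assumed import path
except ImportError:
    # Hypothetical stand-ins so that the sketch runs without the library.
    class Macro:
        def run(self, ngrams, vars, args):
            raise NotImplementedError

    Ngrams = set


class MacroGetName(Macro):
    def run(self, ngrams: Ngrams, vars: Dict[str, Any], args: List[Any]):
        # Accepts every input for now; the matching logic comes later.
        return True


print(MacroGetName().run(set(), {}, []))  # True
```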

Integration

Let us create transitions using this macro. A macro is represented by an alias preceded by the pound sign (#):

transitions = {
    'state': 'start',
    '`Hello. What should I call you?`': {
        '#GET_NAME': {
            '`It\'s nice to meet you.`': 'end'
        },
        'error': {
            '`Sorry, I didn\'t understand you.`': 'end'
        }
    }
}

macros = {
    'GET_NAME': MacroGetName()
}

#4: calls the macro #GET_NAME that is an alias of MacroGetName.
#13: creates a dictionary defining aliases for macros.
#14: creates an object of MacroGetName and saves it to the alias GET_NAME.

To call the macro, we need to add the alias dictionary macros to the dialogue flow:

df = DialogueFlow('start', end_state='end')
df.load_transitions(transitions)
df.add_macros(macros)

#3: adds all macros defined in macros to the dialogue flow df.

Parameters

The run method has three parameters:

ngrams: is a set of strings representing every n-gram of the input matched by the Natex.
vars: is the variable dictionary, maintained by a DialogueFlow object, where the keys and values are variable names and objects corresponding to their values.
args: is a list of strings representing arguments specified in the macro call.

Let us modify the run method to see what ngrams and vars give:

def run(self, ngrams: Ngrams, vars: Dict[str, Any], args: List[Any]):
    print(ngrams.raw_text())
    print(ngrams.text())
    print(ngrams)
    print(vars)

#2: prints the original string of the matched input span before preprocessing.
#3: prints the input span, preprocessed by STDM and matched by the Natex.
#4: prints a set of n-grams.
#5: prints the variable dictionary.

When you interact with the dialogue flow by running it (df.run()), it prints the following:

S: Hello. What should I call you?
U: Dr. Jinho Choi
S: It's nice to meet you.

The raw_text method returns the original input:

Dr. Jinho Choi

The text method returns the preprocessed input used to match the Natex:

dr jinho choi

The ngrams gives a set of all possible n-grams in text():

{
    'dr',
    'jinho',
    'choi',
    'dr jinho',
    'jinho choi',
    'dr jinho choi'
}
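
The set above can be reproduced in plain Python by taking every contiguous span of tokens. This is a minimal sketch of the idea, with a hypothetical helper all_ngrams; it is not how STDM builds its Ngrams object:

```python
from typing import Set


def all_ngrams(text: str) -> Set[str]:
    """Returns every contiguous token span of the text as a string."""
    tokens = text.split()
    return {
        ' '.join(tokens[i:j])
        for i in range(len(tokens))
        for j in range(i + 1, len(tokens) + 1)
    }


ngrams = all_ngrams('dr jinho choi')
print(len(ngrams))  # 6
print('jinho choi' in ngrams)  # True
```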

Finally, the vars gives a dictionary consisting of both system-level and user-custom variables (no user-custom variables are saved at the moment):

{
    '__state__': '0',
    '__system_state__': 'start',
    '__stack__': [],
    '__user_utterance__': 'dr jinho choi',
    '__goal_return_state__': 'None',
    '__selected_response__': 'Hello. What should I call you?',
    '__raw_user_utterance__': 'Dr. Jinho Choi',
    '__converged__': 'True'
}

Implementation

Let us update the run method so that it matches the title, first name, and last name in the input and saves them to the variables $TITLE, $FIRSTNAME, and $LASTNAME, respectively:

def run(self, ngrams: Ngrams, vars: Dict[str, Any], args: List[Any]):
    r = re.compile(r"(mr|mrs|ms|dr)?(?:^|\s)([a-z']+)(?:\s([a-z']+))?")
    m = r.search(ngrams.text())
    if m is None: return False

    title, firstname, lastname = None, None, None
    
    if m.group(1):
        title = m.group(1)
        if m.group(3):
            firstname = m.group(2)
            lastname = m.group(3)
        else:
            firstname = m.group()
            lastname = m.group(2)
    else:
        firstname = m.group(2)
        lastname = m.group(3)

    vars['TITLE'] = title
    vars['FIRSTNAME'] = firstname
    vars['LASTNAME'] = lastname

    return True

#2: creates a regular expression to match the title, first name and last name.
#3: searches for the span to match.
#4: returns False if no match is found.
#6-18: left as an exercise: explain how each branch assigns the title, first name, and last name.
#20-22: saves the recognized title, first name, and last name to the corresponding variables.
#24: returns True as the regular expression matches the input span.
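
To see what the branches in #6-18 have to work with, it helps to inspect the groups that the regular expression produces for different inputs. The sketch below runs the same pattern outside of the macro on already preprocessed (lowercased, punctuation-free) strings, which are illustrative inputs:

```python
import re

RE_NAME = re.compile(r"(mr|mrs|ms|dr)?(?:^|\s)([a-z']+)(?:\s([a-z']+))?")

results = []
for text in ['dr jinho choi', 'jinho choi', 'dr choi', 'jinho']:
    m = RE_NAME.search(text)
    # group() is the entire match; groups() holds groups 1 to 3,
    # with None for any group that did not take part in the match.
    results.append((m.group(), m.groups()))
    print(repr(m.group()), m.groups())
```

With a title but no third group ("dr choi"), the name ends up in group 2, and with neither ("jinho"), only group 2 is filled.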

Given the updated macro, the above transitions can be modified as follows:

transitions = {
    'state': 'start',
    '`Hello. What should I call you?`': {
        '#GET_NAME': {
            '`It\'s nice to meet you,` $FIRSTNAME `.` $LASTNAME `is my favorite name.`': 'end'
        },
        'error': {
            '`Sorry, I didn\'t understand you.`': 'end'
        }
    }
}

#5: uses the variables $FIRSTNAME and $LASTNAME retrieved by the macro to generate the output.

The following shows the outputs:

S: Hello. What should I call you?
U: Dr. Jinho Choi
S: It's nice to meet you, jinho . choi is my favorite name.

S: Hello. What should I call you?
U: Jinho Choi
S: It's nice to meet you, jinho . choi is my favorite name.

S: Hello. What should I call you?
U: Dr. Choi
S: It's nice to meet you, dr choi . choi is my favorite name.

S: Hello. What should I call you?
U: Jinho
S: It's nice to meet you, jinho .  is my favorite name.

Although the last name is not recognized, leaving a blank in the output, the input is still considered "matched" because run() returns True for this case. Such output can be handled better by using other capabilities of Natex.

Can macros be mixed with other Natex expressions?

3.2. Ontology

How to use ontologies for matching in Natex.

Ontology

Let us create a dialogue flow to talk about animals:

S: What is your favorite animal?
U: I love frog
S: Amphibians can be cute :)

S: What is your favorite animal?
U: Cat
S: I've never heard of that animal.

S: What is your favorite animal?
U: Dogs
S: I've never heard of that animal.

For each type of animal, however, the list can be indefinitely long (e.g., there are over 5,400 mammal species). In this case, it is better to use an ontology.

Let us create a JSON file, ontology_animal.json, containing an ontology of animals:

{
    "ontology": {
        "animal": ["mammal", "fish", "bird", "reptile", "amphibian"],
        "mammal": ["dog", "ape", "rat"],
        "reptile": ["snake", "lizard"],
        "amphibian": ["frog", "salamander"],
        "dog": ["golden retriever", "poodle"]
    }
}

#2: the key ontology is paired with a dictionary as a value.
#3: the key animal represents the category, and its subcategories are indicated in the list.
#4-6: each subcategory, mammal, reptile, and amphibian, has its own subcategories.
#7: the ontology hierarchy: animal -> mammal -> dog.

Given the ontology, the above transitions can be rewritten as follows:

transitions = {
    'state': 'start',
    '`What is your favorite animal?`': {
        '[#ONT(mammal)]': {
            '`I love mammals!`': 'end'
        },
        '[#ONT(reptile)]': {
            '`Reptiles are slick, haha`': 'end'
        },
        '[#ONT(amphibian)]': {
            '`Amphibians can be cute :)`': 'end'
        },
        'error': {
            '`I\'ve never heard of that animal.`': 'end'
        }
    }
}

#4: matches the key "mammal" as well as its subcategories: "dog", "ape", and "rat".
#7: matches the key "reptile" as well as its subcategories: "snake" and "lizard".
#10: matches the key "amphibian" as well as its subcategories: "frog" and "salamander".

S: What is your favorite animal?
U: I love frogs
S: Amphibians can be cute :)

Unlike set matching, ontology matching handles plurals (e.g., "frogs").

S: What is your favorite animal?
U: I love my golden retriever
S: I love mammals!

Although there is no condition specified for the category dog that includes "golden retriever", there is a condition for its supercategory mammal (#4), to which it backs off.
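
The back-off can be pictured as collecting every term reachable from a category at any depth. The following is a minimal sketch of that traversal over the same JSON content, using a hypothetical helper terms_under; it does not reflect how STDM implements #ONT internally:

```python
import json

ONTOLOGY_JSON = '''
{
    "ontology": {
        "animal": ["mammal", "fish", "bird", "reptile", "amphibian"],
        "mammal": ["dog", "ape", "rat"],
        "reptile": ["snake", "lizard"],
        "amphibian": ["frog", "salamander"],
        "dog": ["golden retriever", "poodle"]
    }
}
'''


def terms_under(ontology: dict, category: str) -> set:
    """Collects the category itself and all of its descendants."""
    terms = {category}
    for child in ontology.get(category, []):
        terms |= terms_under(ontology, child)
    return terms


ontology = json.loads(ONTOLOGY_JSON)['ontology']
mammals = terms_under(ontology, 'mammal')
print(sorted(mammals))
# ['ape', 'dog', 'golden retriever', 'mammal', 'poodle', 'rat']
print('golden retriever' in mammals)  # True
```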

S: What is your favorite animal?
U: I cannot live without my puppy!
S: I've never heard of that animal.

Currently, ontology matching does not handle plurals for compound nouns (e.g., "golden retrievers"), which will be fixed in a future version.

Expression

{
    "ontology": {
        "animal": ["mammal", "fish", "bird", "reptile", "amphibian"],
        "mammal": ["dog", "ape", "rat"],
        "reptile": ["snake", "lizard"],
        "amphibian": ["frog", "salamander"],
        "dog": ["golden retriever", "poodle"]
    },

    "expressions": {
        "dog": ["canine", "puppy"]
    }
}

#10: the key expressions is paired with a dictionary as a value.
#11: allows matching "canine" and "puppy" for the dog category.

Once you load the updated JSON file, it now understands "puppy" as an expression of "dog":

S: What is your favorite animal?
U: I cannot live without my puppy!
S: I love mammals!

It is also possible to match "puppy" by adding the term as a subcategory of "dog" (#7). However, this would not be good practice, as "puppy" is another way of saying "dog" rather than a kind of dog.
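
The distinction between subcategories and expressions can be sketched with two separate dictionaries: expressions widen what a category matches without changing the hierarchy. The helper terms_under below is hypothetical and only mirrors the JSON structure; it is not STDM's implementation:

```python
# A trimmed copy of the ontology and expressions from the JSON file.
ontology = {
    'mammal': ['dog', 'ape', 'rat'],
    'dog': ['golden retriever', 'poodle'],
}
expressions = {'dog': ['canine', 'puppy']}


def terms_under(category: str) -> set:
    """Collects the category, its expressions, and all of its descendants."""
    terms = {category, *expressions.get(category, [])}
    for child in ontology.get(category, []):
        terms |= terms_under(child)
    return terms


matched = 'puppy' in terms_under('mammal')
is_subcategory = 'puppy' in ontology['dog']
print(matched)         # True: "puppy" is matched through the dog category
print(is_subcategory)  # False: the hierarchy itself is unchanged
```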

Variable

Values matched by the ontology can also be stored in variables:

transitions = {
    'state': 'start',
    '`What is your favorite animal?`': {
        '[$FAVORITE_ANIMAL=#ONT(mammal)]': {
            '`I love` $FAVORITE_ANIMAL `!`': 'end'
        },
        '[$FAVORITE_ANIMAL=#ONT(reptile)]': {
            '$FAVORITE_ANIMAL `are slick, haha`': 'end'
        },
        '[$FAVORITE_ANIMAL=#ONT(amphibian)]': {
            '$FAVORITE_ANIMAL `can be cute :)`': 'end'
        },
        'error': {
            '`I\'ve never heard of that animal.`': 'end'
        }
    }
}

#4,7,10: the matched term gets stored in the variable FAVORITE_ANIMAL.
#5,8,11: the system uses the value of FAVORITE_ANIMAL to generate the response.

S: What is your favorite animal?
U: I love frogs
S: frogs can be cute :)

S: What is your favorite animal?
U: I can't live without my puppy!
S: I love puppy !

Loading

The custom ontology must be loaded into the knowledge base of the dialogue flow before it runs:

df = DialogueFlow('start', end_state='end')
df.knowledge_base().load_json_file('resources/ontology_animal.json')
df.load_transitions(transitions)

#2: loads the ontology in ontology_animal.json to the knowledge base of df.

Code Snippet

from emora_stdm import DialogueFlow


def natex_ontology() -> DialogueFlow:
    transitions = {
        'state': 'start',
        '`What is your favorite animal?`': {
            '[$FAVORITE_ANIMAL=#ONT(mammal)]': {
                '`I love` $FAVORITE_ANIMAL `!`': 'end'
            },
            '[$FAVORITE_ANIMAL=#ONT(reptile)]': {
                '$FAVORITE_ANIMAL `are slick, haha`': 'end'
            },
            '[$FAVORITE_ANIMAL=#ONT(amphibian)]': {
                '$FAVORITE_ANIMAL `can be cute :)`': 'end'
            },
            'error': {
                '`I\'ve never heard of that animal.`': 'end'
            }
        }
    }

    df = DialogueFlow('start', end_state='end')
    df.knowledge_base().load_json_file('resources/ontology_animal.json')
    df.load_transitions(transitions)
    return df
    
if __name__ == '__main__':
    natex_ontology().run()