Several matching strategies are built into Natex.
Emora STDM supports several ways of interpreting the contexts of user inputs through Natex (Natural Language Expression), some of which you already experienced in Matching Strategy.
A literal is what you intend the system to say. A literal is represented by reversed primes (`..`):
#3: the system prompts the literal and ends the dialogue.
Natex supports several ways of matching the input with key terms.
The condition is true if the input exactly matches the term. A term is represented as a string and can have more than one token:
#4: matches the input with 'could be better'.
#7: error is a reserved term indicating the default condition of this conditional branching, similar to the wildcard condition (_) in a match statement.
The condition is true if the input exactly matches any term in the set. A set is represented by curly brackets ({}):
#7: matches the input with either 'good' or 'not bad'.
The condition is true if some terms in the input match all terms in the unordered list, regardless of the order. An unordered list is represented by angle brackets (<>):
#10: matches the input with both 'very' and 'good' in any order.
The condition is true if some terms in the input match all terms in the ordered list, a.k.a. sequence, in the same order. An ordered list is represented by square brackets ([]):
#13: matches the input with both 'so' and 'good' in that order.
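The three strategies above can be imitated in plain Python to make their semantics concrete. This is only a rough sketch under simplifying assumptions: the helper names (tokens, match_set, match_unordered, match_sequence) are hypothetical, terms in the unordered list and the sequence are single tokens, and it is not how Natex is implemented.

```python
import re

def tokens(text):
    """Lowercase the input and split it into word tokens (assumed tokenization)."""
    return re.findall(r"[a-z']+", text.lower())

def match_set(text, terms):
    """Set {..}: the whole input is exactly one of the terms."""
    return " ".join(tokens(text)) in terms

def match_unordered(text, terms):
    """Unordered list <..>: every term occurs somewhere in the input."""
    toks = tokens(text)
    return all(term in toks for term in terms)

def match_sequence(text, terms):
    """Sequence [..]: every term occurs in the input, in the given order."""
    remaining = iter(tokens(text))
    # `term in remaining` consumes the iterator up to the first hit,
    # so later terms can only be found after earlier ones.
    return all(term in remaining for term in terms)

print(match_set("Not bad", {"good", "not bad"}))           # True
print(match_unordered("Good, very", ["very", "good"]))     # True
print(match_sequence("It is so very good", ["so", "good"]))  # True
print(match_sequence("Good, so", ["so", "good"]))          # False
```

The sequence check accepts extra tokens before, between, and after the terms, which is what distinguishes it from matching a plain term.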
Currently, it matches the input "could be better" with the condition in #4, but does not match "it could be better" or "could be better for sure", where there are terms other than the ones indicated in the condition.
Update the condition such that it matches all three inputs.
How about matching inputs such as "could be much better" or "could be really better"?
'[could be better]'
'[could be, better]'
The condition is true if all terms in the input exactly match all terms in the rigid sequence in the same order. A rigid sequence is represented by square brackets ([ ]), where the left bracket is followed by an exclamation mark (!):
#16: matches the input with both 'hello' and 'world' in that order.
There is no difference between matching a term (e.g., 'hello world') and matching a rigid sequence (e.g., '[!hello, world]'). The rigid sequence is designed specifically for negation, which will be deprecated in the next version.
The condition is true if all terms in the input exactly match all terms in the rigid sequence except for ones that are negated. A negation is represented by a hyphen (-):
#19: matches the input with 'aweful' and zero or more terms before it that are not 'not'.
It is possible to nest conditions for more advanced matching. Let us create a term condition that matches both "so good" and "very good" using a nested set:
#4: uses a set inside a term.
Does this condition match "good"?
No, because the outer condition uses term matching that requires the whole input to be the same as the condition.
However, it does not match when other terms are included in the input (e.g., "It's so good to be here"). To broaden the matching scope, you can put the condition inside a sequence:
#4: the term condition is inside the sequence.
What if we want the condition to match the above inputs as well as "fantastic"? You can put the condition under a set and add fantastic as another term:
#4: the sequence condition and the new term fantastic are inside the set.
The above transitions match "Fantastic" but not "It's fantastic". Update the condition such that it can match both inputs.
Put fantastic in a sequence as well: '{[{so, very} good], [fantastic]}'.
Saving user content can be useful in many ways. Let us consider the following transitions:
Users may feel more engaged if the system says, "I like dogs too" instead of "them". Natex allows you to create a variable to store the matched term. A variable is represented by a string preceded (without spaces) by a dollar sign ($):
#4: creates a variable FAVORITE_ANIMAL storing the matched term from the user content.
#5: uses the value of the variable to generate the follow-up system utterance.
In #5, two literals, `I like` and `too!`, surround the variable $FAVORITE_ANIMAL. If a variable were indicated inside a literal, STDM would throw an error.
How to use regular expressions for matching in Natex.
Regular expressions provide powerful ways to match strings and beyond:
Chapter 2.1, Speech and Language Processing (3rd ed.), Jurafsky and Martin.
Python Documentation.
Several functions are provided in Python to match regular expressions.
Let us create a regular expression that matches "Mr." and "Ms.":
A regular expression is written as r'expression', a raw string in which the prefix r keeps backslashes from being interpreted as Python escape characters.
The above code prints None, indicating that the value of m is None, because the regular expression does not match the string.
#1: since RE_MR matches the string, m is a match object.
#3: true since m is a match object.
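Both outcomes of match() can be seen side by side in a minimal sketch. The pattern M[rs]\. and the input "Mr. Choi" are assumptions chosen for illustration; only "Dr. Choi" and the name RE_MR come from the text.

```python
import re

# M[rs]\. matches "Mr." or "Ms."; the dot is escaped to match a literal period.
RE_MR = re.compile(r'M[rs]\.')

m = RE_MR.match('Dr. Choi')
print(m)  # None: the string does not start with "Mr." or "Ms."

m = RE_MR.match('Mr. Choi')
if m:  # a match object is truthy, whereas None is falsy
    # group() is the matched substring; start() is inclusive, end() is exclusive
    print(m.group(), m.start(), m.end())  # Mr. 0 3
```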
What are the differences between a list and a tuple in Python?
#1: there are two groups in this regular expression, (M[rs]) and (\.).
#4,5: return the entire match "Ms.".
#6: returns "Ms" matched by the first group (M[rs]).
#7: returns "." matched by the second group (\.).
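The group accessors described above can be tried directly with Python's re module. The input string "Ms. Wayne" is an assumption for illustration, and the line numbers of this sketch do not correspond to the #-references in the text.

```python
import re

RE_MR = re.compile(r'(M[rs])(\.)')   # two capturing groups
m = RE_MR.match('Ms. Wayne')

print(m.group())    # Ms.  (the entire match)
print(m.group(0))   # Ms.  (group 0 is also the entire match)
print(m.group(1))   # Ms   (first group)
print(m.group(2))   # .    (second group)
print(m.groups())   # ('Ms', '.')
```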
The above RE_MR matches "Mr." and "Ms." but not "Mrs." Modify it to match all of them (Hint: use a non-capturing group and |).
The non-capturing group (?:[rs]|rs) matches "r", "s", or "rs" such that the first group matches "Mr", "Ms", and "Mrs", respectively.
Since we use the non-capturing group, the following code still prints a tuple of two strings:
What if we use a capturing group instead?
Now, the nested group ([rs]|rs) is considered the second group such that the match returns a tuple of three strings as follows:
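A short comparison makes the effect of the two kinds of groups concrete. The names RE_NON_CAPTURING and RE_CAPTURING are hypothetical labels for the two patterns discussed above.

```python
import re

RE_NON_CAPTURING = re.compile(r'(M(?:[rs]|rs))(\.)')  # inner group is not counted
RE_CAPTURING = re.compile(r'(M([rs]|rs))(\.)')        # inner group becomes group 2

for s in ('Mr.', 'Ms.', 'Mrs.'):
    print(RE_NON_CAPTURING.match(s).groups(), RE_CAPTURING.match(s).groups())
# ('Mr', '.') ('Mr', 'r', '.')
# ('Ms', '.') ('Ms', 's', '.')
# ('Mrs', '.') ('Mrs', 'rs', '.')
```

For "Mrs.", the alternative [rs] is tried first and consumes only "r"; the following \. then fails on "s", so the engine backtracks to the alternative rs.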
Let us match the following strings with RE_MR:
#4: matches "Mr." but not "Ms."
#5: matches neither "Mr." nor "Mrs."
For s1, only "Mr." is matched because match() stops matching after finding the first pattern. For s2, on the other hand, even "Mr." is not matched because match() requires the pattern to be at the beginning of the string.
search() returns a match object as match() does.
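The difference between match() and search() can be checked with a small sketch. The contents of s1 and s2 below are assumptions that merely satisfy the description above (s1 begins with "Mr.", whereas s2 does not).

```python
import re

RE_MR = re.compile(r'(M(?:[rs]|rs))(\.)')

s1 = 'Mr. and Ms. Wayne are here'    # assumed example string
s2 = 'Here are Mr. and Mrs. Wayne'   # assumed example string

print(RE_MR.match(s1).group())   # Mr.  (pattern found at the beginning)
print(RE_MR.match(s2))           # None (s2 does not begin with the pattern)

m = RE_MR.search(s2)             # search() scans the whole string
print(m.group(), m.start(), m.end())  # Mr. 9 12
```

Like match(), search() stops at the first occurrence, so "Mrs." in s2 is still not returned.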
findall() returns a list of tuples, where each tuple contains the substrings matched by the groups for one occurrence of the pattern.
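A minimal sketch of findall() on a pattern with two groups follows; the strings s1 and s2 are assumed examples.

```python
import re

RE_MR = re.compile(r'(M(?:[rs]|rs))(\.)')

s1 = 'Mr. and Ms. Wayne are here'    # assumed example string
s2 = 'Here are Mr. and Mrs. Wayne'   # assumed example string

# One tuple per occurrence, holding the two capturing groups.
print(RE_MR.findall(s1))   # [('Mr', '.'), ('Ms', '.')]
print(RE_MR.findall(s2))   # [('Mr', '.'), ('Mrs', '.')]
```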
#1: returns a list of all match objects m (in order) found by finditer().
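Because finditer() yields match objects, the position of every occurrence can be recovered. The sketch below collects them with a list comprehension; the string s2 is an assumed example.

```python
import re

RE_MR = re.compile(r'(M(?:[rs]|rs))(\.)')
s2 = 'Here are Mr. and Mrs. Wayne'   # assumed example string

# Store every match object produced by the iterator.
ms = [m for m in RE_MR.finditer(s2)]

# Each match object knows its substring and its span in s2.
spans = [(m.group(), m.start(), m.end()) for m in ms]
print(spans)   # [('Mr.', 9, 12), ('Mrs.', 17, 21)]
```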
How is the code above different from the one below?
What are the advantages of using a list comprehension over a for-loop other than it makes the code shorter?
Write regular expressions to match the following cases:
Abbreviation: Dr., U.S.A.
Apostrophe: '80, '90s, 'cause
Concatenation: don't, gonna, cannot
Hyperlink: https://github.com/emory-courses/cs329/
Number: 1/2, 123-456-7890, 1,000,000
Unit: $10, #20, 5kg
Write a regular expression that matches the above condition.
It is possible to use regular expressions for matching in Natex. A regular expression is represented by forward slashes (/../):
#4: true if the entire input matches the regular expression.
You can put the expression in a sequence to allow a partial match:
#4: the regular expression is put in a sequence [].
When used in Natex, all literals in the regular expression (e.g., "so", "good" in #4) must be lowercase because Natex matches everything in lowercase. This design choice was made because users tend not to follow typical capitalization in a chat interface, whether it is text- or audio-based.
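The distinction between matching the entire input and allowing a partial match, together with the lowercasing, has a close analogue in plain Python. The sketch below is only an analogy and not Natex itself; the pattern and the helper names matches_entire and matches_partial are assumptions.

```python
import re

RE_GOOD = re.compile(r'(so|very) good')   # hypothetical lowercase pattern

def matches_entire(text):
    """True only if the whole (lowercased) input matches the pattern."""
    return RE_GOOD.fullmatch(text.lower()) is not None

def matches_partial(text):
    """True if the pattern occurs anywhere in the (lowercased) input."""
    return RE_GOOD.search(text.lower()) is not None

print(matches_entire("So good"))                    # True
print(matches_entire("It's so good to be here"))    # False
print(matches_partial("It's so good to be here"))   # True
```

Without the call to lower(), "So good" would not match because the pattern contains only lowercase literals.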
It is possible to store the matched results of a regular expression to variables. A variable in a regular expression is represented by angle brackets (<..>) inside a capturing group ((?..)).
The following transitions take the user name and respond with the stored first and last name:
#4: matches the first name and the last name in order and stores them in the variables FIRSTNAME and LASTNAME.
#5: uses FIRSTNAME and LASTNAME in the response.
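In plain Python, the closest construct is a named group, written (?P<NAME>...) with an extra P compared to the Natex notation above. The sketch below illustrates the idea outside of Natex; the pattern, the input "jane doe", and the response text are assumptions.

```python
import re

# Two named groups separated by a single space (lowercase input assumed).
RE_NAME = re.compile(r'(?P<FIRSTNAME>[a-z]+) (?P<LASTNAME>[a-z]+)')

m = RE_NAME.fullmatch('jane doe')
variables = m.groupdict()   # maps each group name to its matched substring
print(variables)            # {'FIRSTNAME': 'jane', 'LASTNAME': 'doe'}

response = "It's nice to meet you, {FIRSTNAME} {LASTNAME}.".format(**variables)
print(response)
```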
How to use macro functions for matching in Natex.
The most powerful aspect of Natex is its ability to integrate pattern matching with arbitrary code. This allows you to integrate regular expressions, NLP models, or custom algorithms into Natex.
A macro can be defined by creating a class that inherits Macro in STDM and overrides the run method:
#1: imports Macro from STDM.
#2: imports type hints from the typing package in Python.
#4: creates the MacroGetName class inheriting Macro.
#5: overrides the run method declared in Macro.
Currently, the run method returns True no matter what the input is.
Let us create transitions using this macro. A macro is represented by an alias preceded by the pound sign (#):
#4: calls the macro #GET_NAME that is an alias of MacroGetName.
#13: creates a dictionary defining aliases for macros.
#14: creates an object of MacroGetName and saves it to the alias GET_NAME.
To call the macro, we need to add the alias dictionary macros to the dialogue flow:
#3: adds all macros defined in macros to the dialogue flow df.
The run method has three parameters:
vars: is the variable dictionary, maintained by a DialogueFlow object, where the keys and values are variable names and objects corresponding to their values.
args: is a list of strings representing arguments specified in the macro call.
Let us modify the run method to see what ngrams and vars give:
#2: prints the original string of the matched input span before preprocessing.
#3: prints the input span, preprocessed by STDM and matched by the Natex.
#4: prints a set of n-grams.
When you interact with the dialogue flow by running it (df.run()), it prints the following:
The raw_text method returns the original input:
The text method returns the preprocessed input used to match the Natex:
The ngrams parameter gives a set of all possible n-grams in text():
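To make "all possible n-grams" concrete, the following sketch enumerates every contiguous token span of a string. The helper all_ngrams is hypothetical and splits on whitespace only; the tokenization used by STDM may differ.

```python
def all_ngrams(text):
    """Return the set of all contiguous token spans (n-grams) in text."""
    tokens = text.split()
    return {
        ' '.join(tokens[i:j])
        for i in range(len(tokens))
        for j in range(i + 1, len(tokens) + 1)
    }

print(sorted(all_ngrams('dr jane doe')))
# ['doe', 'dr', 'dr jane', 'dr jane doe', 'jane', 'jane doe']
```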
Finally, vars gives a dictionary consisting of both system-level and user-custom variables (no user-custom variables are saved at the moment):
Let us update the run method so that it matches the title, first name, and last name in the input and saves them to the variables $TITLE, $FIRSTNAME, and $LASTNAME, respectively:
#2: creates a regular expression to match the title, first name, and last name.
#3: searches for the span to match.
#4: returns False if no match is found.
#6-18: left as an exercise.
#20-22: saves the recognized title, first name, and last name to the corresponding variables.
#24: returns True as the regular expression matches the input span.
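The overall shape of such a method (search, return False on failure, save to variables, return True) can be sketched without STDM as a standalone function. Everything here is an assumption for illustration: the function get_name, the pattern RE_NAME, and the plain dictionary standing in for vars. It assumes the input consists only of a name, and it is not the macro from the text.

```python
import re
from typing import Any, Dict

# Optional title (with or without a period), a first name, an optional last name.
RE_NAME = re.compile(r"(?:(mr|mrs|ms|dr)\.?\s+)?([a-z']+)(?:\s+([a-z']+))?")

def get_name(text: str, variables: Dict[str, Any]) -> bool:
    """Store TITLE, FIRSTNAME, and LASTNAME in variables; return False if nothing matches."""
    m = RE_NAME.search(text.lower())
    if m is None:
        return False
    title, firstname, lastname = m.groups()   # unmatched groups are None
    variables['TITLE'] = title
    variables['FIRSTNAME'] = firstname
    variables['LASTNAME'] = lastname
    return True

v = {}
print(get_name('Dr. Jane Doe', v), v)
# True {'TITLE': 'dr', 'FIRSTNAME': 'jane', 'LASTNAME': 'doe'}
```

With this simplified pattern, a single name after a title (e.g., "Mrs. Doe") is stored as the first name; deciding whether it is a first or a last name requires additional logic.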
Given the updated macro, the above transitions can be modified as follows:
#5: uses the variables $FIRSTNAME and $LASTNAME retrieved by the macro to generate the output.
The following shows the outputs:
Can macros be mixed with other Natex expressions?
How to use ontologies for matching in Natex.
Let us create a dialogue flow to talk about animals:
#2: the key ontology is paired with a dictionary as a value.
#3: the key animal represents the category, and its subcategories are indicated in the list.
#4-6: each subcategory, mammal, reptile, and amphibian, has its own subcategories.
#7: the ontology hierarchy: animal -> mammal -> dog.
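The hierarchy described above can be pictured as a dictionary from each category to its subcategories. The sketch below builds such a structure from the categories named in this section and collects everything under a given category. The helper descendants is hypothetical and only illustrates why a condition on mammal can also cover "dog" or "golden retriever"; it is not how STDM performs ontology matching, and the dog entry lists only the one subcategory mentioned in the text.

```python
import json

ONTOLOGY_JSON = '''
{
  "ontology": {
    "animal": ["mammal", "reptile", "amphibian"],
    "mammal": ["dog", "ape", "rat"],
    "reptile": ["snake", "lizard"],
    "amphibian": ["frog", "salamander"],
    "dog": ["golden retriever"]
  }
}
'''

def descendants(ontology, category):
    """Collect all direct and indirect subcategories of category."""
    found = set()
    stack = [category]
    while stack:
        for child in ontology.get(stack.pop(), []):
            if child not in found:
                found.add(child)
                stack.append(child)
    return found

ontology = json.loads(ONTOLOGY_JSON)['ontology']
print(sorted(descendants(ontology, 'mammal')))
# ['ape', 'dog', 'golden retriever', 'rat']
```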
Given the ontology, the above transitions can be rewritten as follows:
#4: matches the key "mammal" as well as its subcategories: "dog", "ape", and "rat".
#5: matches the key "reptile" as well as its subcategories: "snake" and "lizard".
#6: matches the key "amphibian" as well as its subcategories: "frog" and "salamander".
Unlike set matching, ontology matching handles plurals (e.g., "frogs").
Although there is no condition specified for the category dog that includes "golden retriever", there is a condition for its supercategory mammal (#4), to which it backs off.
Currently, ontology matching does not handle plurals for compound nouns (e.g., "golden retrievers"), which will be fixed in the following version.
It is possible that a category is mentioned in a non-canonical way; the above conditions do not match "puppy" because it is not introduced as a category in the ontology. In this case, we can specify the aliases as "expressions":
#10: the key expressions is paired with a dictionary as a value.
#4: allows matching "canine" and "puppy" for the dog category.
Once you load the updated JSON file, it now understands "puppy" as an expression of "dog":
It is possible to match "puppy" by adding the term as a category of "dog" (#7). However, it would not be good practice as "puppy" should not be considered a subcategory of "dog".
Values matched by the ontology can also be stored in variables:
#4,7,10: the matched term gets stored in the variable FAVORITE_ANIMAL.
#5,8,11: the system uses the value of FAVORITE_ANIMAL to generate the response.
The custom ontology must be loaded to the knowledge base of the dialogue flow before it runs:
#1: loads the ontology in ontology_animal.json to the knowledge base of df.
Quiz 3: Contextual Understanding
Your goal is to create a chatbot that talks about movies. Here is a sample dialogue:
Your chatbot aims to collect user information by asking the following:
The latest movie that the user watched (#3-4).
A contextualized question regarding the latest movie (#5-6).
A question regarding the genre of the latest movie (#7-10).
Your chatbot should give an appropriate response to every user response. For this assignment, you must use all of the following:
Update them to design a dialogue flow for the chatbot.
Create a PDF file quiz3.pdf that describes the following:
Sample dialogues that your chatbot can conduct.
Explanations of how the ontology, macro(s), and regular expression(s) are used for contextual understanding in your chatbot.
Commit and push quiz3.py to your GitHub repository.
Submit quiz3.pdf to Canvas.
#1: imports the regular expression library.
#3: compiles the regular expression into the pattern object RE_MR.
#4: matches the string "Dr. Choi" with RE_MR and saves the result to m.
#4: prints the matched substring, and the start (inclusive) and end (exclusive) indices of the substring with respect to the original string in #1.
Currently, no groups are specified in RE_MR:
#1: returns an empty tuple ().
It is possible to group specific patterns using parentheses:
#3: returns a tuple of matched substrings ('Ms', '.') for the two groups in #1.
To match a pattern anywhere in the string, we need to search for the pattern instead:
search() still does not return the second substrings, "Ms." and "Mrs.". The following shows how to find all substrings that match the pattern:
Since findall() returns a list of tuples instead of match objects, there is no definite way of locating the matched results in the original string. To return match objects instead, we need to iterate over the matches of the pattern:
#1: finditer() returns an iterator that keeps matching the pattern until it no longer finds a match.
You can use a list comprehension to store the match objects as a list:
The nesting example has a condition as follows (#4):
ngrams: is a set of strings representing every n-gram of the input matched by the Natex.
Although the last name is not recognized, and thus leaves a blank in the output, it is still considered "matched" because run() returns True for this case. Such output can be handled better by using the capability in Natex.
For each type of animal, however, the list can be indefinitely long (e.g., there are over 5,400 mammal species). In this case, it is better to use an ontology.
Let us create a JSON file, ontology_animal.json, containing an ontology of animals:
An ontology covering common movie genres and a branch of movies that you target,
At least one macro,
At least one regular expression (can be used inside a macro).
Create a Python file under the package.
Create a JSON file under the directory.
Syntax      Description
^           The start of the string
$           The end of the string
\num        The contents of the group of the same number
\d          Any decimal digit
\D          Any non-decimal-digit character
\s          Any whitespace character
\S          Any non-whitespace character
\w          Any alphanumeric character and the underscore
\W          Any non-alphanumeric character
[ ]         A set of characters
( )         A capturing group
(?: )       A non-capturing group
|           or
.           Any character except a newline

Syntax      Description                  Non-greedy
*           0 or more repetitions        *?
+           1 or more repetitions        +?
?           0 or 1 repetitions           ??
{m}         Exactly m repetitions
{m,n}       From m to n repetitions      {m,n}?