3.4. Regular Expression
How to use regular expressions for matching in Natex.
Regular expressions provide powerful ways to match strings and beyond:
Chapter 2.1, Speech and Language Processing (3rd ed.), Jurafsky and Martin.
Python Documentation
Syntax    Description                                     Non-greedy
[ ]       A set of characters
( )       A capturing group
(?: )     A non-capturing group
|         or
.         Any character except a newline
*         0 or more repetitions                           *?
+         1 or more repetitions                           +?
?         0 or 1 repetitions                              ??
{m}       Exactly m repetitions
{m,n}     From m to n repetitions                         {m,n}?
^         The start of the string
$         The end of the string
\num      The contents of the group of the same number
\d        Any decimal digit
\D        Any non-decimal-digit character
\s        Any whitespace character
\S        Any non-whitespace character
\w        Any alphanumeric character and the underscore
\W        Any non-alphanumeric character
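A few of the entries above are easier to see in action. The following is a minimal sketch using Python's re module; the input strings are made up for illustration and are not part of the original material.

```python
import re

text = '<a><b>'  # hypothetical input

# Greedy: .+ consumes as much as possible before the final '>'.
greedy = re.findall(r'<.+>', text)
# Non-greedy: .+? stops at the first '>' it can.
lazy = re.findall(r'<.+?>', text)

# \1 refers back to whatever group 1 captured (a doubled letter here).
doubled = re.search(r'(\w)\1', 'hello').group()

# {m,n} bounds the number of repetitions.
runs = re.findall(r'\d{2,3}', '1 22 333 4444')

print(greedy)   # ['<a><b>']
print(lazy)     # ['<a>', '<b>']
print(doubled)  # ll
print(runs)     # ['22', '333', '444']
```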
Several functions are provided in Python to match regular expressions.
Let us create a regular expression that matches "Mr." and "Ms.":
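A minimal sketch of such an expression, assuming the pattern M[rs]\. (the ungrouped form of the pattern discussed later in this section) and the input "Dr. Choi" mentioned in the notes. Lines 1, 3, and 4 are laid out to line up with the numbered notes.

```python
import re                        # regular expression library

RE_MR = re.compile(r'M[rs]\.')   # assumed pattern: "M", then "r" or "s", then "."
m = RE_MR.match('Dr. Choi')      # match() only looks at the start of the string
print(m)                         # None
```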
The above code prints None, indicating that the value of m is None because the regular expression does not match the string.
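For contrast, here is a sketch of a successful match. The input "Mr. Choi" is a hypothetical string chosen for illustration, and the pattern is the same assumed M[rs]\. as before; comments mark the lines that the numbered notes appear to describe.

```python
import re

RE_MR = re.compile(r'M[rs]\.')   # assumed pattern

m = RE_MR.match('Mr. Choi')      # note #1: m is a match object (hypothetical input)
print(m)
if m:                            # note #3: a match object is truthy
    print(m.group(), m.start(), m.end())  # note #4: Mr. 0 3
```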
#1: since RE_MR matches the string, m is a match object.
#3: true since m is a match object.
What are the differences between a list and a tuple in Python?
#1: there are two groups in this regular expression, (M[rs]) and (\.).
#4,5: return the entire match "Ms.".
#6: returns "Ms" matched by the first group (M[rs]).
#7: returns "." matched by the second group (\.).
The above RE_MR matches "Mr." and "Ms." but not "Mrs." Modify it to match all of them (Hint: use a non-capturing group and |).
Let us match the following strings with RE_MR:
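A sketch of this step, assuming the grouped pattern (M[rs])(\.) and two made-up strings: s1 starts with "Mr." and also contains "Ms.", while s2 contains "Mr." and "Mrs." but starts with neither. Comments mark the lines that the numbered notes appear to describe.

```python
import re

RE_MR = re.compile(r'(M[rs])(\.)')    # assumed grouped pattern

s1 = 'Mr. and Ms. Choi are here'      # hypothetical input
s2 = 'Here are Mr. and Mrs. Choi'     # hypothetical input

m1 = RE_MR.match(s1)   # note #4: matches the leading "Mr." only
m2 = RE_MR.match(s2)   # note #5: None, nothing matches at the start
print(m1)
print(m2)
```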
#4: matches "Mr." but not "Ms."
#5: matches neither "Mr." nor "Mrs."
For s1, only "Mr." is matched because match() returns as soon as it finds the first match at the beginning of the string. For s2, on the other hand, even "Mr." is not matched because match() requires the pattern to be at the beginning of the string.
search() returns a match object as match() does.
When the pattern contains more than one group, findall() returns a list of tuples where each tuple holds the substrings captured by the groups in one match.
#1: returns a list of all m (in order) matched by finditer().
How is the code above different from the one below?
What are the advantages of using a list comprehension over a for-loop other than it makes the code shorter?
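To make the comparison concrete, here is a sketch that builds the same list both ways. The for-loop form is an assumption about what the question compares against; the pattern and input string are hypothetical as well.

```python
import re

RE_MR = re.compile(r'(M[rs])(\.)')   # assumed grouped pattern
s1 = 'Mr. and Ms. Choi are here'     # hypothetical input

# List comprehension: builds the list in a single expression.
ms = [m for m in RE_MR.finditer(s1)]

# For-loop (assumed counterpart): same result, built step by step.
ms_loop = []
for m in RE_MR.finditer(s1):
    ms_loop.append(m)

same = [m.span() for m in ms] == [m.span() for m in ms_loop]
print(same)  # True
```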
Write regular expressions to match the following cases:
Abbreviation: Dr., U.S.A.
Apostrophe: '80, '90s, 'cause
Concatenation: don't, gonna, cannot
Hyperlink: https://github.com/emory-courses/cs329/
Number: 1/2, 123-456-7890, 1,000,000
Unit: $10, #20, 5kg
Write a regular expression that matches the above condition.
It is possible to use regular expressions for matching in Natex. A regular expression is represented by forward slashes (/../):
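A sketch of what such a transition could look like, assuming an emora_stdm-style transition dictionary. The prompts, the 'error' fallback, and the pattern m[rs] [a-z]+ are all hypothetical. The regular expression sits on line 4 so that it lines up with the note.

```python
transitions = {
    'state': 'start',
    '`Hello. What should I call you?`': {
        '/m[rs] [a-z]+/': {  # hypothetical pattern, e.g. "mr smith"
            '`It\'s nice to meet you.`': 'end'
        },
        'error': {  # assumed fallback key for unmatched input
            '`Sorry, I didn\'t understand you.`': 'end'
        }
    }
}
```

With emora_stdm installed, a dictionary like this would presumably be loaded into a DialogueFlow with load_transitions(); treat that as an assumption about the library rather than something stated in this section.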
#4: true if the entire input matches the regular expression.
You can put the expression in a sequence to allow a partial match:
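A sketch of the sequence form, again assuming an emora_stdm-style transition dictionary with hypothetical prompts and a hypothetical pattern. The only change from a full-match condition is the pair of square brackets around the expression on line 4.

```python
transitions = {
    'state': 'start',
    '`Hello. What should I call you?`': {
        '[/m[rs] [a-z]+/]': {  # hypothetical pattern wrapped in a sequence
            '`It\'s nice to meet you.`': 'end'
        },
        'error': {  # assumed fallback key for unmatched input
            '`Sorry, I didn\'t understand you.`': 'end'
        }
    }
}
```

Under these assumptions, an input such as "call me mr smith" would be accepted because the expression only needs to match part of it.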
#4: the regular expression is put in a sequence [].
It is possible to store the matched results of a regular expression in variables. A variable in a regular expression is represented by angle brackets (<..>) inside a capturing group ((?..)).
The following transitions take the user name and respond with the stored first and last name:
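A sketch of such transitions, assuming an emora_stdm-style transition dictionary. The prompts, the 'error' fallback, and the $VARIABLE reference syntax in the response are assumptions; the variable names FIRSTNAME and LASTNAME come from the notes. The regular expression is on line 4 and the response on line 5.

```python
transitions = {
    'state': 'start',
    '`Hello. What should I call you?`': {
        '[/(?<FIRSTNAME>[a-z]+) (?<LASTNAME>[a-z]+)/]': {
            '`Nice to meet you,` $FIRSTNAME `. Is` $LASTNAME `your family name?`': 'end'
        },
        'error': {  # assumed fallback key for unmatched input
            '`Sorry, I didn\'t understand you.`': 'end'
        }
    }
}
```

Note that Python's own re module writes named groups as (?P<NAME>..), so the pattern above is Natex syntax and would not compile unchanged with re.compile().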
#4: matches the first name and the last name in order and stores them in the variables FIRSTNAME and LASTNAME.
#5: uses FIRSTNAME and LASTNAME in the response.
#1: imports the regular expression library.
#3: compiles the regular expression into the RE_MR object.
#4: matches the string "Dr. Choi" with RE_MR and saves the result to m.
#4: prints the matched substring, and the start (inclusive) and end (exclusive) indices of the substring with respect to the original string in #1.
Currently, no groups are specified in RE_MR:
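A sketch of this check, assuming the ungrouped pattern M[rs]\. and a hypothetical input string. In Python, groups() on a match object returns a tuple with one entry per capturing group, so a pattern without parentheses yields an empty tuple.

```python
import re

RE_MR = re.compile(r'M[rs]\.')   # assumed pattern without any groups
m = RE_MR.match('Mr. Choi')      # hypothetical input

print(m.groups())                # ()
```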
#1: returns an empty tuple ().
It is possible to group specific patterns using parentheses:
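A sketch of the grouped pattern, using the two groups (M[rs]) and (\.) named in the notes and a hypothetical input string. Comments mark the lines that the numbered notes appear to describe.

```python
import re

RE_MR = re.compile(r'(M[rs])(\.)')   # note #1: two capturing groups
m = RE_MR.match('Ms. Choi')          # hypothetical input

print(m.groups())    # note #3: ('Ms', '.')
print(m.group())     # note #4: Ms.
print(m.group(0))    # note #5: Ms.
print(m.group(1))    # note #6: Ms
print(m.group(2))    # note #7: .
```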
#3: returns a tuple of matched substrings ('Ms', '.') for the two groups in #1.
To match a pattern anywhere in the string, we need to search for the pattern instead:
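A sketch using search(), assuming the grouped pattern (M[rs])(\.) and two hypothetical strings. Unlike match(), search() scans forward until it finds the first position where the pattern matches.

```python
import re

RE_MR = re.compile(r'(M[rs])(\.)')    # assumed grouped pattern

s1 = 'Mr. and Ms. Choi are here'      # hypothetical input
s2 = 'Here are Mr. and Mrs. Choi'     # hypothetical input

m1 = RE_MR.search(s1)
m2 = RE_MR.search(s2)
print(m1.group(), m1.span())   # Mr. (0, 3)
print(m2.group(), m2.span())   # Mr. (9, 12)
```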
search() still does not return the second substrings, "Ms." and "Mrs.". The following shows how to find all substrings that match the pattern:
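A sketch using findall(), assuming the grouped pattern (M[rs])(\.) and two hypothetical strings. Note that with this particular pattern "Mrs." is still not found, since "Mr" must be followed directly by a period.

```python
import re

RE_MR = re.compile(r'(M[rs])(\.)')    # assumed grouped pattern

s1 = 'Mr. and Ms. Choi are here'      # hypothetical input
s2 = 'Here are Mr. and Mrs. Choi'     # hypothetical input

r1 = RE_MR.findall(s1)
r2 = RE_MR.findall(s2)
print(r1)   # [('Mr', '.'), ('Ms', '.')]
print(r2)   # [('Mr', '.')]
```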
Since findall() returns a list of tuples instead of match objects, there is no definite way of locating the matched results in the original string. To return match objects instead, we need to iterate over the matches of the pattern:
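A sketch using finditer(), assuming the grouped pattern (M[rs])(\.) and a hypothetical string. Each item produced by the iterator is a match object, so its position in the original string is available through span().

```python
import re

RE_MR = re.compile(r'(M[rs])(\.)')    # assumed grouped pattern
s1 = 'Mr. and Ms. Choi are here'      # hypothetical input

spans = []
for m in RE_MR.finditer(s1):          # note #1: iterates over match objects
    print(m.group(), m.span())
    spans.append(m.span())
# Prints:
# Mr. (0, 3)
# Ms. (8, 11)
```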
#1: finditer() returns an iterator that keeps matching the pattern until it no longer finds a match.
You can use a list comprehension to store the match objects as a list:
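A sketch of this idiom, assuming the grouped pattern (M[rs])(\.) and a hypothetical string. The summary line at the end is only there to show what the stored match objects contain.

```python
import re

RE_MR = re.compile(r'(M[rs])(\.)')    # assumed grouped pattern
s1 = 'Mr. and Ms. Choi are here'      # hypothetical input

ms = [m for m in RE_MR.finditer(s1)]  # note #1: list of match objects, in order

summary = [(m.group(), m.start(), m.end()) for m in ms]
print(summary)   # [('Mr.', 0, 3), ('Ms.', 8, 11)]
```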
The nesting example has a condition as follows (#4):