1 of 5

5. Tries

This chapter discusses tries, also known as prefix trees.

Concept
Implementation
Quiz

References

Trie

5.1. Concept

This section gives an overview of tries.

Overview

A trie is a tree where each node has:

0-to-many children,
A key whose type is character,
A value that can be any type of object, and
An end-state that indicates if the node and its ancestors together represent the end of a certain word.

Let us consider the following list of strings:

["she", "shell", "see", "shore", "selling", "sell"]

Given the strings as keys, a binary search tree can be constructed as follows using a balancing algorithm:

How many character comparisons does it need to make to search a string in a balanced BST?

With the same list of strings, a trie can be constructed as follows:

1st cell: key (a character).
2nd cell: value (the index of the string that the node represents).
3rd cell: end-state (T for true, F for false).

What is the worst-case complexity of searching a string in a trie?

5.2. Implementation

This section describes how to implement a trie.

Trie Node

Let us define a class, TrieNode:

public class TrieNode<T> {
    private final Map<Character, TrieNode<T>> children;
    private TrieNode<T> parent;
    private boolean end_state;
    private char key;
    private T value;

    public TrieNode(TrieNode<T> parent, char key) {
        children = new HashMap<>();
        setEndState(false);
        setParent(parent);
        setKey(key);
        setValue(null);
    }
}

L1: the generic type T represents the type of value in L6.
L2: Map is used to store the children.
L4: true if this node represents the last character of a certain word; otherwise, false.

What other types of maps are implemented in Java?

Let us define getter methods:

public TrieNode<T> getParent() { return parent; }

public char getKey() { return key; }

public T getValue() { return value; }

public TrieNode<T> getChild(char key) { return children.get(key); }

/** @return the map whose keys and values are children's characters and nodes. */
public Map<Character, TrieNode<T>> getChildrenMap() {
    return children;
}

L10: returns the map consisting of children's characters as keys and their nodes as values.

Is there any risk of returning this.children in L11?

Let us then define setter methods:

public void setParent(TrieNode<T> node) { parent = node; }

public void setKey(char key) { this.key = key; }

public void setEndState(boolean endState) { end_state = endState; }

public T setValue(T value) {
    T tmp = this.value;
    this.value = value;
    return tmp;
}

public TrieNode<T> addChild(char key) {
    TrieNode<T> child = getChild(key);

    if (child == null) {
        child = new TrieNode<>(this, key);
        children.put(key, child);
    }

    return child;
}

public TrieNode<T> removeChild(char key) {
    return children.remove(key);
}

L7: returns the previously assigned value of this node.
L13: returns the child with the specific key if exists; otherwise, a new child with the key.

What does the removeChild() method return if key does not exist in this node?

Finally, let us define checker methods:

public boolean isEndState() { return end_state; }

public boolean hasValue() { return value != null; }

public boolean hasChildren() { return !children.isEmpty(); }

L1: returns true if this node represents the last character of a certain word; false.

Trie

Let us create the class, Trie:

public class Trie<T> {
    private final TrieNode<T> root;

    public Trie() {
        root = new TrieNode<>(null, (char)0);
    }

    public TrieNode<T> getRoot() { return root; }
}

L1: defines a generic type T that can be any type.
L5: initializes the root with the null character.

Let us define the find() method that takes a specific key in string and returns the node corresponding the last character of the key:

/** @return the node with the specific key if exists; otherwise, {@code null}. */
public TrieNode<T> find(String key) {
    TrieNode<T> node = root;

    for (char c : key.toCharArray()) {
        node = node.getChild(c);
        if (node == null) return null;
    }

    return node;
}

L5: iterates through every character in the string.
L6: finds the node whose key is the current character.
L7: the child does not exist, implying that key is not introduced in this trie.

Given the find() method, let us define the get() method that takes a string key and returns the value of the node corresponding to the key in this trie:

public T get(String key) {
    TrieNode<T> node = find(key);
    return (node != null && node.isEndState()) ? node.getValue() : null;
}

public boolean contains(String key) {
    return get(key) != null;
}

L3: checks if both the key exists and the key is a word.

We then define the put() method that inserts a key-value pair to the trie:

public T put(String key, T value) {
    TrieNode<T> node = root;

    for (char c : key.toCharArray())
        node = node.addChild(c);

    node.setEndState(true);
    return node.setValue(value);
}

L4-5: iterates through all characters in the key and adds children as necessary.
L7: sets the end state of the node corresponding the last character to true.
L8: returns the value of the previously key if exists; otherwise, null.

What does node.addChild(c) return if the child with the key c already exists?

The following demonstrates how the put() method works:

Finally, let us define the remove() method that removes all appropriate nodes corresponding to the key:

public boolean remove(String key) {
    TrieNode<T> node = find(key);

    if (node == null || !node.isEndState())
        return false;

    if (node.hasChildren()) {
        node.setEndState(false);
        return true;
    }

    TrieNode<T> parent = node.getParent();

    while (parent != null) {
        parent.removeChild(node.getKey());

        if (parent.hasChildren() || parent.isEndState())
            break;
        else {
            node = parent;
            parent = parent.getParent();
        }
    }

    return true;
}

L2: retrieves the node corresponding to the last character in the key.
L4: returns false the key does not exist or the key is not a word.
L7-10: if the retrieved node has children, sets its end-state to false and returns true.
L12-23: removes ancestors who have descendants only related to this key.

The following demonstrates how the remove() method works:

5.3. Quiz

This section provides exercises for better understanding in tries.

Let L be a list of country names in String where spaces are allowed (e.g., "United States", "South Korea"). Consider a trie where all country names in L and their unique IDs are put as key-value pairs as follows:

Implementation

Report

Write a report that briefly describe your approach and explains the worst-case complexity of your approach, and save it as quiz5.pdf.

Notes

Substring matching of country names is not required. If your dictionary has "United States" as a country name and the input string contains "United Nation", the "United" should not be matched as a country name.
Substring matching within input words are expected. If your dictionary has "America" as a country name and the input contains "American", the first 7 characters, "America" should be recognized as a country name.
The worst-case complexity needs to be explained in terms of the number of characters n in the input string.

5.4. Homework

This section explains the homework assignment regarding Autocomplete.

Auto-Complete

Most virtual keyboards provide the option of auto-complete, which gives a list of candidate words that you wish to type given a prefix you enter. For instance, when you type "sh", it gives a list of candidate words such as "she", "shell", "ship", etc.

"sh" -> ["she", "ship", "shell, ...]

If you select a word from the candidates, it should learn your selection as the top candidate for that prefix. For instance, if you select "ship" from the example above, the next time you enter the same prefix "sh", it should give a list of candidates where the first item is "ship" instead of "she".

"sh" -> ["ship", "she", "shell, ...]

Your task is to write a program that gives a list of candidate words for any prefix and learns the selected candidates.

Tasks

Create a class called AutocompleteHW under the autocomplete package:
- This class extends the abstract class Autocomplete, which extends Trie.
- The value type of the generic T can be a collection of strings (e.g., List<String>).
- You must create a constructor that takes two parameters, dict_file and max, and calls its super constructor, which reads all words in the dictionary file (e.g., dict.txt).
Override the getCandidates() method that takes a prefix and returns a list of candidate words matching the prefix:
- The max-number of candidates in the list must be the return value of the getMax() method.
- The most recently selected candidate should appear as the 1st item, the 2nd most recently selected candidate should appear as the 2nd item, and so on.
- The rest of the candidate list should be filled with the shortest words matching the prefix.
- If there are more than one candidate with the same lengths, they should be sorted alphabetically.
- Make sure the same candidate does not appear more than once.
Override the pickCandidate() method that takes a prefix and a selected candidate, and saves the information:
- This is the most recently selected candidate for that particular prefix. It must appear as the first item in the candidate list when the same prefix is entered next time.
Submit AutocompleteHW.java to Canvas.

Extra credit

Create a class called AutocompleteHWExtra under the autocomplete package.
Feel free to change the generic type, representing the value type of each TrieNode, to anything other than List<String> (List<something else>).
The getCandidates() method:
- Must show the most frequently picked candidate first, the 2nd-most frequently picked candidate next, and so on.
- If there are more than one candidate with the same frequency, sort them by recency.
- The rest of candidates should be filled as in the original task.
You must submit both AutocompleteHW and AutocompleteHWExtra to earn the extra credit.

Notes

Do not change the dictionary file. If you find anything peculiar about the dictionary file, please let me know so everyone works on the same copy of the dictionary file.
Please test your program yourself. We will evaluate your program using our unit tests and measure the performance (both speed and accuracy).
Take a look at Map if you are not familiar with methods in the standard library.
If you are having trouble with implementing getCandidates(), think about how to traverse the trie given any node.
If you are having trouble with implementing pickCandidate(), take a look at Trie#put.
You are not allowed to use any type of Map to store candidates for this homework.
Your program should be able to handle prefixes or candidates that do not exist in the dictionary.
All picked candidates should be introduced as real words to the trie.
Whitespaces are not allowed as input; thus, you should trim all input strings.

5.2. Implementation

This section describes how to implement a trie.

Trie Node

Let us define a class, TrieNode:

public class TrieNode<T> {
    private final Map<Character, TrieNode<T>> children;
    private TrieNode<T> parent;
    private boolean end_state;
    private char key;
    private T value;

    public TrieNode(TrieNode<T> parent, char key) {
        children = new HashMap<>();
        setEndState(false);
        setParent(parent);
        setKey(key);
        setValue(null);
    }
}

L1: the generic type T represents the type of value in L6.
L2: Map is used to store the children.
L4: true if this node represents the last character of a certain word; otherwise, false.
L9: HashMap gives the $O(1)$ complexity for searching a key.

What other types of maps are implemented in Java?

Let us define getter methods:

public TrieNode<T> getParent() { return parent; }

public char getKey() { return key; }

public T getValue() { return value; }

public TrieNode<T> getChild(char key) { return children.get(key); }

/** @return the map whose keys and values are children's characters and nodes. */
public Map<Character, TrieNode<T>> getChildrenMap() {
    return children;
}

L10: returns the map consisting of children's characters as keys and their nodes as values.

Is there any risk of returning this.children in L11?

Let us then define setter methods:

public void setParent(TrieNode<T> node) { parent = node; }

public void setKey(char key) { this.key = key; }

public void setEndState(boolean endState) { end_state = endState; }

public T setValue(T value) {
    T tmp = this.value;
    this.value = value;
    return tmp;
}

public TrieNode<T> addChild(char key) {
    TrieNode<T> child = getChild(key);

    if (child == null) {
        child = new TrieNode<>(this, key);
        children.put(key, child);
    }

    return child;
}

public TrieNode<T> removeChild(char key) {
    return children.remove(key);
}

L7: returns the previously assigned value of this node.
L13: returns the child with the specific key if exists; otherwise, a new child with the key.

What does the removeChild() method return if key does not exist in this node?

Finally, let us define checker methods:

public boolean isEndState() { return end_state; }

public boolean hasValue() { return value != null; }

public boolean hasChildren() { return !children.isEmpty(); }

L1: returns true if this node represents the last character of a certain word; false.

Trie

Let us create the class, Trie:

public class Trie<T> {
    private final TrieNode<T> root;

    public Trie() {
        root = new TrieNode<>(null, (char)0);
    }

    public TrieNode<T> getRoot() { return root; }
}

L1: defines a generic type T that can be any type.
L5: initializes the root with the null character.

Let us define the find() method that takes a specific key in string and returns the node corresponding the last character of the key:

/** @return the node with the specific key if exists; otherwise, {@code null}. */
public TrieNode<T> find(String key) {
    TrieNode<T> node = root;

    for (char c : key.toCharArray()) {
        node = node.getChild(c);
        if (node == null) return null;
    }

    return node;
}

L5: iterates through every character in the string.
L6: finds the node whose key is the current character.
L7: the child does not exist, implying that key is not introduced in this trie.

Given the find() method, let us define the get() method that takes a string key and returns the value of the node corresponding to the key in this trie:

public T get(String key) {
    TrieNode<T> node = find(key);
    return (node != null && node.isEndState()) ? node.getValue() : null;
}

public boolean contains(String key) {
    return get(key) != null;
}

L3: checks if both the key exists and the key is a word.

We then define the put() method that inserts a key-value pair to the trie:

public T put(String key, T value) {
    TrieNode<T> node = root;

    for (char c : key.toCharArray())
        node = node.addChild(c);

    node.setEndState(true);
    return node.setValue(value);
}

L4-5: iterates through all characters in the key and adds children as necessary.
L7: sets the end state of the node corresponding the last character to true.
L8: returns the value of the previously key if exists; otherwise, null.

What does node.addChild(c) return if the child with the key c already exists?

The following demonstrates how the put() method works:

Finally, let us define the remove() method that removes all appropriate nodes corresponding to the key:

public boolean remove(String key) {
    TrieNode<T> node = find(key);

    if (node == null || !node.isEndState())
        return false;

    if (node.hasChildren()) {
        node.setEndState(false);
        return true;
    }

    TrieNode<T> parent = node.getParent();

    while (parent != null) {
        parent.removeChild(node.getKey());

        if (parent.hasChildren() || parent.isEndState())
            break;
        else {
            node = parent;
            parent = parent.getParent();
        }
    }

    return true;
}

L2: retrieves the node corresponding to the last character in the key.
L4: returns false the key does not exist or the key is not a word.
L7-10: if the retrieved node has children, sets its end-state to false and returns true.
L12-23: removes ancestors who have descendants only related to this key.

The following demonstrates how the remove() method works:

5.4. Homework

This section explains the homework assignment regarding Autocomplete.

Auto-Complete

"sh" -> ["she", "ship", "shell, ...]

"sh" -> ["ship", "she", "shell, ...]

Your task is to write a program that gives a list of candidate words for any prefix and learns the selected candidates.

Tasks

Create a class called AutocompleteHW under the autocomplete package:
- This class extends the abstract class Autocomplete, which extends Trie.
- The value type of the generic T can be a collection of strings (e.g., List<String>).
- You must create a constructor that takes two parameters, dict_file and max, and calls its super constructor, which reads all words in the dictionary file (e.g., dict.txt).
Override the getCandidates() method that takes a prefix and returns a list of candidate words matching the prefix:
- The max-number of candidates in the list must be the return value of the getMax() method.
- The most recently selected candidate should appear as the 1st item, the 2nd most recently selected candidate should appear as the 2nd item, and so on.
- The rest of the candidate list should be filled with the shortest words matching the prefix.
- If there are more than one candidate with the same lengths, they should be sorted alphabetically.
- Make sure the same candidate does not appear more than once.
Override the pickCandidate() method that takes a prefix and a selected candidate, and saves the information:
- This is the most recently selected candidate for that particular prefix. It must appear as the first item in the candidate list when the same prefix is entered next time.
Submit AutocompleteHW.java to Canvas.

Extra credit

Create a class called AutocompleteHWExtra under the autocomplete package.
Feel free to change the generic type, representing the value type of each TrieNode, to anything other than List<String> (List<something else>).
The getCandidates() method:
- Must show the most frequently picked candidate first, the 2nd-most frequently picked candidate next, and so on.
- If there are more than one candidate with the same frequency, sort them by recency.
- The rest of candidates should be filled as in the original task.
You must submit both AutocompleteHW and AutocompleteHWExtra to earn the extra credit.

Notes

Do not change the dictionary file. If you find anything peculiar about the dictionary file, please let me know so everyone works on the same copy of the dictionary file.
Please test your program yourself. We will evaluate your program using our unit tests and measure the performance (both speed and accuracy).
Take a look at Map if you are not familiar with methods in the standard library.
If you are having trouble with implementing getCandidates(), think about how to traverse the trie given any node.
If you are having trouble with implementing pickCandidate(), take a look at Trie#put.
You are not allowed to use any type of Map to store candidates for this homework.
Your program should be able to handle prefixes or candidates that do not exist in the dictionary.
All picked candidates should be introduced as real words to the trie.
Whitespaces are not allowed as input; thus, you should trim all input strings.

5. Tries

Contents

References

5.1. Concept

Overview

5.2. Implementation

Trie Node

Trie

5.3. Quiz

Implementation

Report

Notes

5.4. Homework

Auto-Complete

Tasks

Extra credit

Notes

5.2. Implementation

Trie Node

Trie

5. Tries

Contents

References

5.3. Quiz

Implementation

Report

Notes

5.4. Homework

Auto-Complete

Tasks

Extra credit

Notes

5.1. Concept

Overview