N-gram Language Models, Jurafsky and Martin, Chapter 3 in Speech and Language Processing (3rd ed.), 2023.
Efficient Estimation of Word Representations in Vector Space, Mikolov et al., ICLR, 2013. <- Word2Vec
GloVe: Global Vectors for Word Representation, Pennington et al., EMNLP, 2014.
Deep Contextualized Word Representations, Peters et al., NAACL, 2018. <- ELMo
Attention is All You Need, Vaswani et al., NIPS, 2017. <- Transformer
Generating Wikipedia by Summarizing Long Sequences, Liu et al., ICLR, 2018.
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, Devlin et al., NAACL, 2019.
Neural Machine Translation of Rare Words with Subword Units, Sennrich et al., ACL, 2016. <- Byte-Pair Encoding (BPE)
Google’s Neural Machine Translation System: Bridging the Gap between Human and Machine Translation, Wu et al., arXiv, 2016. <- WordPiece
SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing, Kudo and Richardson, EMNLP, 2018.
Improving Language Understanding by Generative Pre-Training, Radford et al., OpenAI, 2018. <- GPT-1
Language Models are Unsupervised Multitask Learners, Radford et al., OpenAI, 2019. <- GPT-2
Language Models are Few-Shot Learners, Brown et al., NeurIPS, 2020. <- GPT-3
Consider that you want to extract someone's call name(s) during a dialogue in real time:
Design a prompt that extracts all call names provided by the user.
In "My friends call me Pete, my students call me Dr. Parker, and my parents call me Peter.", how does the speaker want to be called? Respond in the following JSON format: {"call_names": ["Mike", "Michael"]}
Let us write a function that takes the user input and returns the GPT output in the JSON format:
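The original listing is not reproduced here; the `#N` notes below refer to its line numbers, counted from the `def` line. A minimal sketch along those lines, assuming the pre-1.0 `openai` package (installed later on this page, with the API key already set) and the `gpt-3.5-turbo` model:

```python
import re
import openai

def gpt_completion(input: str, regex: re.Pattern = None) -> str:
    response = openai.ChatCompletion.create(           # lines 2-6: query the model and
        model='gpt-3.5-turbo',                         # take the message content from
        messages=[{'role': 'user', 'content': input}]  # the first choice
    )
    output = response['choices'][0]['message']['content'].strip()

    if regex is not None:                              # lines 8-10: if a pattern is given,
        m = regex.search(output)                       # keep only the matching span
        output = m.group().strip() if m else None

    return output
```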
#2-6: uses the model to retrieve the GPT output.
#8-10: uses the regular expression (if provided) to extract the output in the specific format.
Let us create a macro called MacroGPTJSON:
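A sketch consistent with the notes below (lines counted from the `class` line); the `Macro` base class comes from `emora_stdm`, and the loose JSON pattern on line 6 stands in for whatever check the original builds from the example output:

```python
import json, re
from typing import Any, Callable, Dict
from emora_stdm import Macro

class MacroGPTJSON(Macro):
    def __init__(self, request, full_ex, empty_ex=None, set_vars=None):
        self.request = request                                    # the task to request
        self.full_ex = json.dumps(full_ex)                        # example with all values filled
        self.empty_ex = json.dumps(empty_ex) if empty_ex else ''  # example with empty collections
        self.check = re.compile(r'\{.*\}', re.S)                  # loose pattern to locate a JSON object
        self.set_vars = set_vars                                  # custom variable setter (optional)
```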
#3: the task to be requested regarding the user input (e.g., How does the speaker want to be called?).
#4: the example output where all values are filled (e.g., {"call_names": ["Mike", "Michael"]}).
#5: the example output where all collections are empty (e.g., {"call_names": []}).
#7: a function that takes the STDM variable dictionary and the JSON output dictionary and sets the necessary variables.
Override the run method in MacroGPTJSON:
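A sketch of the method (lines counted from the `def` line), reusing `gpt_completion` and the fields set above; the exact prompt wording and `ngrams.text()` for the raw user input are assumptions:

```python
def run(self, ngrams, vars, args):
    examples = f'{self.full_ex} or {self.empty_ex} if unavailable' if self.empty_ex else self.full_ex
    prompt = f'{self.request} Respond in the JSON format: {examples}\nInput: {ngrams.text().strip()}'
    response = gpt_completion(prompt, self.check)
    if not response: return False

    try:
        d = json.loads(response)
    except json.JSONDecodeError:
        print(f'Invalid JSON: {response}')
        return False

    if self.set_vars:
        self.set_vars(vars, d)
    else:
        vars.update(d)
    return True
```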
#2-3: creates an input prompt for the GPT API.
#4-5: retrieves the GPT output using the prompt.
#7-11: checks if the output is in a proper JSON format.
#13-14: updates the variable table using the custom function.
#15-16: updates the variable table using the same keys as in the JSON output.
Let us create another macro called MacroNLG:
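A sketch (lines counted from the `class` line); the `run` override simply delegates to the stored function:

```python
from typing import Any, Callable, Dict
from emora_stdm import Macro

class MacroNLG(Macro):
    def __init__(self, generate: Callable[[Dict[str, Any]], str]):
        self.generate = generate

    def run(self, ngrams, vars, args):
        return self.generate(vars)
```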
#3: a function that takes a variable table and returns a string output.
Finally, we use the macros in a dialogue flow:
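A minimal sketch of such a flow; the states, prompts, and macro names are placeholders, and `get_call_name` is defined with the helper methods below:

```python
from emora_stdm import DialogueFlow

transitions = {
    'state': 'start',
    '`Hello. How should I call you?`': {
        '#SET_CALL_NAMES': {
            '`Nice to meet you,` #GET_CALL_NAME `!`': 'end'
        },
        'error': {
            '`Sorry, I did not catch your name.`': 'end'
        }
    }
}

macros = {
    'SET_CALL_NAMES': MacroGPTJSON(
        'How does the speaker want to be called?',
        {'call_names': ['Mike', 'Michael']},
        {'call_names': []}),
    'GET_CALL_NAME': MacroNLG(get_call_name)
}

df = DialogueFlow('start', end_state='end', macros=macros)
df.load_transitions(transitions)
df.run()
```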
The helper methods can be as follows:
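A sketch of hypothetical helpers; with this layout, line 6 (counted from the first `def`) holds the availability check referenced below:

```python
from typing import Any, Dict

def get_call_name(vars: Dict[str, Any]) -> str:
    ls = vars.get('call_names')
    return ls[0] if ls else None

def has_call_names(vars: Dict[str, Any]) -> bool:
    return bool(vars.get('call_names'))  # the condition to check the information
```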
#6: the condition to check whether the information is available.
Revisit your Quiz 2 and improve its language understanding capability using a large language model such as GPT.
Use ChatGPT to figure out the right prompts.
Use your trial credits from OpenAI to test the APIs.
Update the code to design a dialogue flow for the assigned dialogue system.
Create a PDF file quiz5.pdf that describes the approach (e.g., prompt engineering) and how the large language model improved over the limitations you described in Quiz 2.
Answer the following questions in quiz5.py:
What are the limitations of the Bag-of-Words representation?
Describe the Chain Rule and Markov Assumption and how they are used to estimate the probability of a word sequence.
Explain how the Word2Vec approach uses feed-forward neural networks to generate word embeddings. What are the advantages of the Word2Vec representation over the Bag-of-Words representation?
Explain what patterns are learned by the multi-head attention of a Transformer. What are the advantages of the Transformer embeddings over the Word2Vec embeddings?
Create an account for OpenAI and log in to your account.
Click your icon on the top-right corner and select "View API keys":
Click the "+ Create new secret key" button and copy the API key:
Make sure to save this key in a local file. If you close the dialog without saving, you cannot retrieve the key again and will have to create a new one.
Create a file openai_api.txt under the resources directory and paste the API key into the file so that it contains only one line showing the key.
Add openai_api.txt to the .gitignore file:
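For example, a single entry is enough (assuming the file sits under resources as described above):

```
resources/openai_api.txt
```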
Do not share this key with anyone or push it to any remote repository (including your private GitHub repository).
Open the terminal in PyCharm and install the OpenAI package:
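For example (the ChatCompletion examples below assume a pre-1.0 version of the package):

```bash
pip install "openai<1.0"
```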
Create a function called api_key() as follows:
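A minimal sketch, with line 4 (counted from the `import` line) holding the path; the path matches the resources/openai_api.txt file created above:

```python
import openai

def api_key():
    path = 'resources/openai_api.txt'  # the file containing the OpenAI API key
    with open(path) as fin:
        openai.api_key = fin.readline().strip()
```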
#4: specifies the path of the file containing the OpenAI API key.
Retrieve a response by creating a ChatCompletion request:
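A minimal sketch, assuming api_key() has been called; the model name and content are placeholders, and the lines match the notes below:

```python
model = 'gpt-3.5-turbo'
content = 'Say hello to the class.'
response = openai.ChatCompletion.create(
    model=model,
    messages=[{'role': 'user', 'content': content}]
)
```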
#1: the GPT model to use.
#2: the content to be sent to the GPT model.
#3: creates the chat completion request and retrieves the response.
#5: messages are stored in a list of dictionaries where each dictionary contains content from either the user or the system.
Print the type of the response and the response itself, which is in the JSON format:
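For example:

```python
print(type(response))  # 1: the response type
print(response)        # 2: the response in the JSON format
```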
#1: the response type.
#2: the response in the JSON format.
Print only the content from the output:
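With the pre-1.0 package, the content can be accessed by indexing into the first choice:

```python
print(response['choices'][0]['message']['content'])
```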