LLM API Request
Updated: 2025-11-25
Now that you have made your first LLM API call, let's look at how to structure requests, configure parameters, and handle responses.
Basic Request
Most LLM APIs follow a similar pattern using a chat completion interface. Here is a basic example using the OpenAI Python client:
import os
from dotenv import load_dotenv
from openai import OpenAI

# Load the API key from a .env file and create the client
load_dotenv()
client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

response = client.chat.completions.create(
    model="gpt-5-nano",
    messages=[
        {"role": "user", "content": "What is natural language processing?"}
    ]
)
print(response.choices[0].message.content)
Understanding the Messages Array
The messages parameter is the core of your API request. It represents the conversation history and follows a structured format with different roles:
Message Roles
system: Sets the behavior and context for the model
user: Represents messages from the user
assistant: Represents previous responses from the model
response = client.chat.completions.create(
model="gpt-4",
messages=[
{
"role": "system",
"content": "You are a helpful teaching assistant for an NLP course."
},
{
"role": "user",
"content": "Explain tokenization in simple terms."
}
]
)
Multi-Turn Conversations
To maintain context across multiple exchanges, include the full conversation history:
messages = [
{"role": "system", "content": "You are a helpful NLP expert."},
{"role": "user", "content": "What is a transformer?"},
{"role": "assistant", "content": "A transformer is a neural network architecture..."},
{"role": "user", "content": "How does self-attention work?"}
]
response = client.chat.completions.create(
model="gpt-4",
messages=messages
)
Q4: Why is it important to include the full conversation history in the messages array?
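Building on the example above, each new turn is handled by appending the assistant's reply and the next user message to the same list before calling the API again. The following is a minimal sketch of that pattern; the follow-up question is illustrative and not part of the original example.

# Continue the conversation by appending the assistant's reply
# and the next user question to the same messages list.
assistant_reply = response.choices[0].message.content
messages.append({"role": "assistant", "content": assistant_reply})
messages.append({"role": "user", "content": "Can you give a concrete example?"})

# The next request sees the entire history, so the model keeps the context
response = client.chat.completions.create(
    model="gpt-4",
    messages=messages
)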
Key Parameters
Let's explore the important parameters you can configure when making API requests:
model
Specifies which LLM to use. Different models have different capabilities, costs, and context windows.
model="gpt-4" # or "claude-sonnet-4-5-20250929", "gemini-pro", etc.max_tokens
Controls the maximum number of tokens (word pieces; roughly four characters of English text each) the model can generate in its response.
max_tokens=500  # Limit response to approximately 500 tokens
{% hint style="warning" %} Setting max_tokens too low may cause responses to be cut off mid-sentence. Setting it too high increases costs and latency. {% endhint %}
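If you suspect a response was truncated, you can check the finish_reason field on the returned choice. The sketch below assumes an existing response object from the OpenAI client; a value of "length" means generation stopped because the max_tokens limit was reached.

# Detect truncation: "stop" means the model finished naturally,
# "length" means it hit the max_tokens limit.
finish_reason = response.choices[0].finish_reason
if finish_reason == "length":
    print("Warning: the response was cut off; consider raising max_tokens.")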
temperature
Controls the randomness of the model's output. Range: 0.0 to 2.0
Low values (0.0 - 0.3): More deterministic and focused responses
Medium values (0.5 - 0.7): Balanced creativity and coherence
High values (0.8 - 2.0): More creative and diverse responses
# For factual tasks
response = client.chat.completions.create(
model="gpt-4",
messages=[{"role": "user", "content": "What is 2+2?"}],
temperature=0.0
)
# For creative tasks
response = client.chat.completions.create(
model="gpt-4",
messages=[{"role": "user", "content": "Write a creative story opening."}],
temperature=0.9
)
Q5: When would you use a temperature of 0.0 versus 0.9?
top_p (nucleus sampling)
An alternative to temperature that controls diversity by considering only the top tokens whose cumulative probability adds up to top_p. Range: 0.0 to 1.0
top_p=0.9  # Consider tokens that make up 90% of probability mass
{% hint style="info" %} It's generally recommended to alter either temperature OR top_p, but not both simultaneously. {% endhint %}
frequency_penalty
Reduces repetition by penalizing tokens based on how frequently they've appeared. Range: -2.0 to 2.0
frequency_penalty=0.5  # Moderate penalty for repeated tokens
presence_penalty
Encourages the model to talk about new topics by penalizing tokens that have appeared at all. Range: -2.0 to 2.0
presence_penalty=0.6  # Encourage discussion of new topics
stop
Specifies sequences where the API will stop generating further tokens.
stop=["END", "\n\n"] # Stop at "END" or double newlineComplete Example with Multiple Parameters
Here's a comprehensive example showing how these parameters work together:
from openai import OpenAI
def analyze_sentiment(text: str, verbose: bool = False) -> str:
"""
Analyzes the sentiment of given text using GPT-4.
Args:
text: The text to analyze
verbose: If True, includes confidence level in output
Returns:
Sentiment classification (Positive/Negative/Neutral)
"""
client = OpenAI(api_key="your-api-key")
system_message = """You are a sentiment analysis expert.
Classify the sentiment as Positive, Negative, or Neutral."""
if verbose:
system_message += " Include a confidence level (low/medium/high)."
response = client.chat.completions.create(
model="gpt-4",
messages=[
{"role": "system", "content": system_message},
{"role": "user", "content": f"Analyze this text: {text}"}
],
max_tokens=100,
temperature=0.2, # Low temperature for consistent classification
top_p=1.0,
frequency_penalty=0.0,
presence_penalty=0.0,
stop=None
)
return response.choices[0].message.content.strip()
# Test the function
review = "This product exceeded my expectations, though shipping was slow."
result = analyze_sentiment(review, verbose=True)
print(f"Sentiment: {result}")Handling API Responses
Understanding the response structure is crucial for extracting the information you need:
response = client.chat.completions.create(
model="gpt-4",
messages=[{"role": "user", "content": "Hello!"}]
)
# Access the generated text
content = response.choices[0].message.content
# Access usage information
prompt_tokens = response.usage.prompt_tokens
completion_tokens = response.usage.completion_tokens
total_tokens = response.usage.total_tokens
print(f"Generated text: {content}")
print(f"Tokens used - Prompt: {prompt_tokens}, Completion: {completion_tokens}")Response Object Structure
{
"id": "chatcmpl-123",
"object": "chat.completion",
"created": 1677652288,
"model": "gpt-4",
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": "Hello! How can I help you today?"
},
"finish_reason": "stop"
}
],
"usage": {
"prompt_tokens": 9,
"completion_tokens": 12,
"total_tokens": 21
}
}
Provider-Specific Differences
While the basic structure is similar across providers, there are some differences:
Anthropic (Claude)
from anthropic import Anthropic
client = Anthropic(api_key="your-api-key")
response = client.messages.create(
model="claude-sonnet-4-5-20250929",
max_tokens=1024, # Required for Anthropic
messages=[
{"role": "user", "content": "Hello, Claude!"}
]
)
print(response.content[0].text)
Key differences:
max_tokens is required (not optional)
No system role in the messages array; use a separate system parameter (see the sketch below)
Response structure: response.content[0].text instead of response.choices[0].message.content
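For example, a system prompt that would go in the messages array with OpenAI is passed as a top-level system argument with Anthropic. This is a minimal sketch assuming the same Anthropic client as above; the prompt text reuses the course-assistant wording from the earlier OpenAI example and is purely illustrative.

response = client.messages.create(
    model="claude-sonnet-4-5-20250929",
    max_tokens=1024,
    system="You are a helpful teaching assistant for an NLP course.",
    messages=[
        {"role": "user", "content": "Explain tokenization in simple terms."}
    ]
)
print(response.content[0].text)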
Google (Gemini)
import google.generativeai as genai
genai.configure(api_key="your-api-key")
model = genai.GenerativeModel('gemini-pro')
response = model.generate_content("Hello, Gemini!")
print(response.text)
Error Handling
Always implement proper error handling when working with APIs:
from openai import OpenAI, OpenAIError
import time
def make_api_call_with_retry(client, messages, max_retries=3):
"""Makes an API call with retry logic for handling rate limits."""
for attempt in range(max_retries):
try:
response = client.chat.completions.create(
model="gpt-4",
messages=messages,
max_tokens=500,
temperature=0.7
)
return response.choices[0].message.content
except OpenAIError as e:
if attempt < max_retries - 1:
wait_time = 2 ** attempt # Exponential backoff
print(f"Error occurred: {e}. Retrying in {wait_time} seconds...")
time.sleep(wait_time)
else:
print(f"Failed after {max_retries} attempts: {e}")
raise
except Exception as e:
print(f"Unexpected error: {e}")
raise
# Usage
client = OpenAI(api_key="your-api-key")
messages = [{"role": "user", "content": "Explain error handling."}]
result = make_api_call_with_retry(client, messages)
Best Practices
Start with lower temperature values (0.0-0.3) for factual tasks, increase for creative tasks
Set appropriate max_tokens to balance cost and completeness
Include system messages to set consistent behavior
Monitor token usage to manage costs effectively
Implement retry logic for production applications
Store API keys securely using environment variables
Cache responses when appropriate to reduce API calls (see the sketch below)
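As a rough illustration of the last point, identical prompts can be served from a local cache instead of triggering a new request. This is a minimal sketch assuming an existing OpenAI client; the in-memory dictionary and the helper name are made up for this example.

# Hypothetical helper: return a cached answer for repeated prompts,
# otherwise make one API call and store the result.
response_cache = {}

def cached_completion(client, prompt: str) -> str:
    if prompt in response_cache:
        return response_cache[prompt]
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}]
    )
    answer = response.choices[0].message.content
    response_cache[prompt] = answer
    return answer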
Cost Considerations
API usage is typically priced per token:
Input tokens (your prompts) are usually cheaper
Output tokens (model responses) are usually more expensive
Different models have different pricing tiers
# Estimate costs before making requests
def estimate_cost(prompt: str, max_tokens: int, model: str) -> float:
"""Rough cost estimation (prices vary by provider)."""
# Example rates (not actual current rates - check provider docs)
rates = {
"gpt-4": {"input": 0.03, "output": 0.06}, # per 1K tokens
"gpt-3.5-turbo": {"input": 0.001, "output": 0.002}
}
# Rough token estimation (1 token ≈ 4 characters)
input_tokens = len(prompt) / 4
rate = rates.get(model, rates["gpt-4"])
estimated_cost = (
(input_tokens / 1000) * rate["input"] +
(max_tokens / 1000) * rate["output"]
)
    return estimated_cost
Practical Exercise
Try experimenting with different parameter values to see their effects:
from openai import OpenAI
client = OpenAI(api_key="your-api-key")
prompt = "Write a creative story about a robot learning to paint."
# Try different configurations
configs = [
{"temp": 0.2, "desc": "Low temperature (focused)"},
{"temp": 0.7, "desc": "Medium temperature (balanced)"},
{"temp": 1.5, "desc": "High temperature (creative)"}
]
for config in configs:
print(f"\n{config['desc']}:")
print("-" * 50)
response = client.chat.completions.create(
model="gpt-4",
messages=[{"role": "user", "content": prompt}],
max_tokens=200,
temperature=config["temp"]
)
    print(response.choices[0].message.content)
Q6: Run the above experiment and describe how the outputs differ across temperature settings.
Summary
In this section, you learned:
How to structure API requests with the messages array
Key parameters: model, max_tokens, temperature, top_p, penalties, and stop sequences
How to handle multi-turn conversations
Response structure and accessing generated content
Provider-specific differences (OpenAI, Anthropic, Google)
Error handling and retry logic
Best practices for cost management
These skills form the foundation for all programmatic interactions with LLMs and will be essential for your course projects.
{% hint style="success" %} Practice making API calls with different parameters to develop an intuition for how they affect model behavior. This experimentation is key to becoming proficient in working with LLMs. {% endhint %}