LLM API Request

Updated: 2025-11-25

Now that you have made your first LLM API call, let's learn how to structure requests, configure parameters, and handle responses.

Basic Request

Most LLM APIs follow a similar pattern using a chat completion interface. Here is a basic example using the OpenAI Python client:

import os
from dotenv import load_dotenv
from openai import OpenAI

load_dotenv()  # Load variables from a local .env file
client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

response = client.chat.completions.create(
    model="gpt-5-nano",
    messages=[
        {"role": "user", "content": "What is natural language processing?"}
    ]
)

print(response.choices[0].message.content)

Understanding the Messages Array

The messages parameter is the core of your API request. It represents the conversation history and follows a structured format with different roles:

Message Roles

  1. system: Sets the behavior and context for the model

  2. user: Represents messages from the user

  3. assistant: Represents previous responses from the model

response = client.chat.completions.create(
    model="gpt-4",
    messages=[
        {
            "role": "system", 
            "content": "You are a helpful teaching assistant for an NLP course."
        },
        {
            "role": "user", 
            "content": "Explain tokenization in simple terms."
        }
    ]
)

Multi-Turn Conversations

To maintain context across multiple exchanges, include the full conversation history:

messages = [
    {"role": "system", "content": "You are a helpful NLP expert."},
    {"role": "user", "content": "What is a transformer?"},
    {"role": "assistant", "content": "A transformer is a neural network architecture..."},
    {"role": "user", "content": "How does self-attention work?"}
]

response = client.chat.completions.create(
    model="gpt-4",
    messages=messages
)
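
In practice, you usually append each model reply to the history before sending the next user turn. Here is a minimal sketch of that pattern (the follow-up question is just an illustration):

# Append the assistant's reply, then the next user message
reply = response.choices[0].message.content
messages.append({"role": "assistant", "content": reply})
messages.append({"role": "user", "content": "Can you give a concrete example?"})

follow_up = client.chat.completions.create(
    model="gpt-4",
    messages=messages
)
print(follow_up.choices[0].message.content)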

Q4: Why is it important to include the full conversation history in the messages array?

Key Parameters

Let's explore the important parameters you can configure when making API requests:

model

Specifies which LLM to use. Different models have different capabilities, costs, and context windows.

model="gpt-4"  # or "claude-sonnet-4-5-20250929", "gemini-pro", etc.

max_tokens

Controls the maximum number of tokens the model can generate in its response. A token is a word piece rather than a whole word; as a rough rule, one token is about three-quarters of an English word.

max_tokens=500  # Limit response to approximately 500 tokens

{% hint style="warning" %} Setting max_tokens too low may cause responses to be cut off mid-sentence. Setting it too high increases costs and latency. {% endhint %}

temperature

Controls the randomness of the model's output. Range: 0.0 to 2.0

  • Low values (0.0 - 0.3): More deterministic and focused responses

  • Medium values (0.5 - 0.7): Balanced creativity and coherence

  • High values (0.8 - 2.0): More creative and diverse responses

# For factual tasks
response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "What is 2+2?"}],
    temperature=0.0
)

# For creative tasks
response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "Write a creative story opening."}],
    temperature=0.9
)

Q5: When would you use a temperature of 0.0 versus 0.9?

top_p (nucleus sampling)

An alternative to temperature that controls diversity by considering only the top tokens whose cumulative probability adds up to top_p. Range: 0.0 to 1.0

top_p=0.9  # Consider tokens that make up 90% of probability mass

{% hint style="info" %} It's generally recommended to alter either temperature OR top_p, but not both simultaneously. {% endhint %}

frequency_penalty

Reduces repetition by penalizing tokens based on how frequently they've appeared. Range: -2.0 to 2.0

frequency_penalty=0.5  # Moderate penalty for repeated tokens

presence_penalty

Encourages the model to talk about new topics by penalizing tokens that have appeared at all. Range: -2.0 to 2.0

presence_penalty=0.6  # Encourage discussion of new topics

stop

Specifies sequences where the API will stop generating further tokens.

stop=["END", "\n\n"]  # Stop at "END" or double newline

Complete Example with Multiple Parameters

Here's a comprehensive example showing how these parameters work together:

from openai import OpenAI

def analyze_sentiment(text: str, verbose: bool = False) -> str:
    """
    Analyzes the sentiment of the given text using GPT-4.
    
    Args:
        text: The text to analyze
        verbose: If True, includes confidence level in output
    
    Returns:
        Sentiment classification (Positive/Negative/Neutral)
    """
    client = OpenAI(api_key="your-api-key")
    
    system_message = """You are a sentiment analysis expert. 
    Classify the sentiment as Positive, Negative, or Neutral."""
    
    if verbose:
        system_message += " Include a confidence level (low/medium/high)."
    
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": system_message},
            {"role": "user", "content": f"Analyze this text: {text}"}
        ],
        max_tokens=100,
        temperature=0.2,  # Low temperature for consistent classification
        top_p=1.0,
        frequency_penalty=0.0,
        presence_penalty=0.0,
        stop=None
    )
    
    return response.choices[0].message.content.strip()

# Test the function
review = "This product exceeded my expectations, though shipping was slow."
result = analyze_sentiment(review, verbose=True)
print(f"Sentiment: {result}")

Handling API Responses

Understanding the response structure is crucial for extracting the information you need:

response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "Hello!"}]
)

# Access the generated text
content = response.choices[0].message.content

# Access usage information
prompt_tokens = response.usage.prompt_tokens
completion_tokens = response.usage.completion_tokens
total_tokens = response.usage.total_tokens

print(f"Generated text: {content}")
print(f"Tokens used - Prompt: {prompt_tokens}, Completion: {completion_tokens}")

Response Object Structure

{
    "id": "chatcmpl-123",
    "object": "chat.completion",
    "created": 1677652288,
    "model": "gpt-4",
    "choices": [
        {
            "index": 0,
            "message": {
                "role": "assistant",
                "content": "Hello! How can I help you today?"
            },
            "finish_reason": "stop"
        }
    ],
    "usage": {
        "prompt_tokens": 9,
        "completion_tokens": 12,
        "total_tokens": 21
    }
}
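
The finish_reason field tells you why generation stopped: "stop" means the model finished on its own, while "length" means it hit the max_tokens limit (the truncation case mentioned in the warning above). A quick check like the following can catch cut-off responses:

choice = response.choices[0]

if choice.finish_reason == "length":
    print("Warning: response was truncated; consider raising max_tokens.")

print(choice.message.content)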

Provider-Specific Differences

While the basic structure is similar across providers, there are some differences:

Anthropic (Claude)

from anthropic import Anthropic

client = Anthropic(api_key="your-api-key")

response = client.messages.create(
    model="claude-sonnet-4-5-20250929",
    max_tokens=1024,  # Required for Anthropic
    messages=[
        {"role": "user", "content": "Hello, Claude!"}
    ]
)

print(response.content[0].text)

Key differences:

  • max_tokens is required (not optional)

  • No system role in the messages array; use a separate system parameter instead (see the sketch after this list)

  • Response structure: response.content[0].text instead of response.choices[0].message.content
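
For example, the teaching-assistant system prompt from earlier would be passed like this with the Anthropic client (a sketch illustrating the separate system parameter):

response = client.messages.create(
    model="claude-sonnet-4-5-20250929",
    max_tokens=1024,
    system="You are a helpful teaching assistant for an NLP course.",  # top-level parameter, not a message
    messages=[
        {"role": "user", "content": "Explain tokenization in simple terms."}
    ]
)

print(response.content[0].text)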

Google (Gemini)

import google.generativeai as genai

genai.configure(api_key="your-api-key")
model = genai.GenerativeModel('gemini-pro')

response = model.generate_content("Hello, Gemini!")
print(response.text)

Error Handling

Always implement proper error handling when working with APIs:

from openai import OpenAI, OpenAIError
import time

def make_api_call_with_retry(client, messages, max_retries=3):
    """Makes an API call with retry logic for handling rate limits."""
    for attempt in range(max_retries):
        try:
            response = client.chat.completions.create(
                model="gpt-4",
                messages=messages,
                max_tokens=500,
                temperature=0.7
            )
            return response.choices[0].message.content
        
        except OpenAIError as e:
            if attempt < max_retries - 1:
                wait_time = 2 ** attempt  # Exponential backoff
                print(f"Error occurred: {e}. Retrying in {wait_time} seconds...")
                time.sleep(wait_time)
            else:
                print(f"Failed after {max_retries} attempts: {e}")
                raise
        
        except Exception as e:
            print(f"Unexpected error: {e}")
            raise

# Usage
client = OpenAI(api_key="your-api-key")
messages = [{"role": "user", "content": "Explain error handling."}]
result = make_api_call_with_retry(client, messages)

Best Practices

  1. Start with lower temperature values (0.0-0.3) for factual tasks, increase for creative tasks

  2. Set appropriate max_tokens to balance cost and completeness

  3. Include system messages to set consistent behavior

  4. Monitor token usage to manage costs effectively

  5. Implement retry logic for production applications

  6. Store API keys securely using environment variables

  7. Cache responses when appropriate to reduce API calls (a minimal caching sketch follows below)
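
As item 7 suggests, caching avoids paying twice for identical prompts. Below is a minimal in-memory caching sketch (the cache dictionary and helper function are illustrative, not part of any SDK):

# Minimal in-memory cache keyed by the exact prompt text
_cache = {}

def cached_completion(client, prompt: str, model: str = "gpt-4") -> str:
    """Return a stored answer if this exact prompt was seen before."""
    if prompt in _cache:
        return _cache[prompt]
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=500,
        temperature=0.0  # deterministic settings make cached answers reusable
    )
    _cache[prompt] = response.choices[0].message.content
    return _cache[prompt]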

Cost Considerations

API usage is typically priced per token:

  • Input tokens (your prompts) are usually cheaper

  • Output tokens (model responses) are usually more expensive

  • Different models have different pricing tiers

# Estimate costs before making requests
def estimate_cost(prompt: str, max_tokens: int, model: str) -> float:
    """Rough cost estimation (prices vary by provider)."""
    # Example rates (not actual current rates - check provider docs)
    rates = {
        "gpt-4": {"input": 0.03, "output": 0.06},  # per 1K tokens
        "gpt-3.5-turbo": {"input": 0.001, "output": 0.002}
    }
    
    # Rough token estimation (1 token ≈ 4 characters)
    input_tokens = len(prompt) / 4
    
    rate = rates.get(model, rates["gpt-4"])
    estimated_cost = (
        (input_tokens / 1000) * rate["input"] +
        (max_tokens / 1000) * rate["output"]
    )
    
    return estimated_cost
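
For instance, you might call it before sending a long prompt (the rates in the function are placeholders, so treat the result as a rough order of magnitude):

prompt = "Summarize the history of natural language processing in detail."
print(f"Estimated cost: ${estimate_cost(prompt, max_tokens=500, model='gpt-4'):.4f}")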

Practical Exercise

Try experimenting with different parameter values to see their effects:

from openai import OpenAI

client = OpenAI(api_key="your-api-key")

prompt = "Write a creative story about a robot learning to paint."

# Try different configurations
configs = [
    {"temp": 0.2, "desc": "Low temperature (focused)"},
    {"temp": 0.7, "desc": "Medium temperature (balanced)"},
    {"temp": 1.5, "desc": "High temperature (creative)"}
]

for config in configs:
    print(f"\n{config['desc']}:")
    print("-" * 50)
    
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=200,
        temperature=config["temp"]
    )
    
    print(response.choices[0].message.content)

Q6: Run the above experiment and describe how the outputs differ across temperature settings.

Summary

In this section, you learned:

  • How to structure API requests with the messages array

  • Key parameters: model, max_tokens, temperature, top_p, penalties, and stop sequences

  • How to handle multi-turn conversations

  • Response structure and accessing generated content

  • Provider-specific differences (OpenAI, Anthropic, Google)

  • Error handling and retry logic

  • Best practices for cost management

These skills form the foundation for all programmatic interactions with LLMs and will be essential for your course projects.

{% hint style="success" %} Practice making API calls with different parameters to develop an intuition for how they affect model behavior. This experimentation is key to becoming proficient in working with LLMs. {% endhint %}
