LLM API Request
Updated: 2025-11-25
Now that you have made your first LLM API call, let's look at how to structure requests, configure parameters, and handle responses.
Basic Request
Most LLM APIs follow a similar pattern using a chat completion interface. Here is a basic example using the OpenAI Python client:
import os
from dotenv import load_dotenv
from openai import OpenAI

# Load the API key from a .env file and create the client
load_dotenv()
client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

response = client.chat.completions.create(
    model="gpt-5-nano",
    messages=[
        {"role": "user", "content": "What is natural language processing?"}
    ]
)
print(response.choices[0].message.content)
Understanding the Messages Array
The messages parameter is the core of your API request. It represents the conversation history and follows a structured format with different roles:
Message Roles
system: Sets the behavior and context for the model
user: Represents messages from the user
assistant: Represents previous responses from the model
response = client.chat.completions.create(
model="gpt-4",
messages=[
{
"role": "system",
"content": "You are a helpful teaching assistant for an NLP course."
},
{
"role": "user",
"content": "Explain tokenization in simple terms."
}
]
)
Multi-Turn Conversations
To maintain context across multiple exchanges, include the full conversation history:
messages = [
{"role": "system", "content": "You are a helpful NLP expert."},
{"role": "user", "content": "What is a transformer?"},
{"role": "assistant", "content": "A transformer is a neural network architecture..."},
{"role": "user", "content": "How does self-attention work?"}
]
response = client.chat.completions.create(
model="gpt-4",
messages=messages
)
Q4: Why is it important to include the full conversation history in the messages array?
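Building on the example above, each new turn is handled by appending the assistant's reply and the next user message to the same list before calling the API again. The following is a minimal sketch of that pattern; the follow-up question is illustrative and not part of the original example.

# Continue the conversation by appending the assistant's reply
# and the next user question to the same messages list.
assistant_reply = response.choices[0].message.content
messages.append({"role": "assistant", "content": assistant_reply})
messages.append({"role": "user", "content": "Can you give a concrete example?"})

# The next request sees the entire history, so the model keeps the context
response = client.chat.completions.create(
    model="gpt-4",
    messages=messages
)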
Key Parameters
Let's explore the important parameters you can configure when making API requests:
model
Specifies which LLM to use. Different models have different capabilities, costs, and context windows.
model="gpt-4" # or "claude-sonnet-4-5-20250929", "gemini-pro", etc.max_tokens
Controls the maximum number of tokens (word pieces; roughly four characters of English text each) the model can generate in its response.
max_tokens=500  # Limit response to approximately 500 tokens
{% hint style="warning" %} Setting max_tokens too low may cause responses to be cut off mid-sentence. Setting it too high increases costs and latency. {% endhint %}
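If you suspect a response was truncated, you can check the finish_reason field on the returned choice. The sketch below assumes an existing response object from the OpenAI client; a value of "length" means generation stopped because the max_tokens limit was reached.

# Detect truncation: "stop" means the model finished naturally,
# "length" means it hit the max_tokens limit.
finish_reason = response.choices[0].finish_reason
if finish_reason == "length":
    print("Warning: the response was cut off; consider raising max_tokens.")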
temperature
Controls the randomness of the model's output. Range: 0.0 to 2.0
Low values (0.0 - 0.3): More deterministic and focused responses
Medium values (0.5 - 0.7): Balanced creativity and coherence
High values (0.8 - 2.0): More creative and diverse responses
# For factual tasks
response = client.chat.completions.create(
model="gpt-4",
messages=[{"role": "user", "content": "What is 2+2?"}],
temperature=0.0
)
# For creative tasks
response = client.chat.completions.create(
model="gpt-4",
messages=[{"role": "user", "content": "Write a creative story opening."}],
temperature=0.9
)
Q5: When would you use a temperature of 0.0 versus 0.9?
top_p (nucleus sampling)
An alternative to temperature that controls diversity by considering only the top tokens whose cumulative probability adds up to top_p. Range: 0.0 to 1.0
top_p=0.9  # Consider tokens that make up 90% of probability mass
{% hint style="info" %} It's generally recommended to alter either temperature OR top_p, but not both simultaneously. {% endhint %}
frequency_penalty
Reduces repetition by penalizing tokens based on how frequently they've appeared. Range: -2.0 to 2.0
frequency_penalty=0.5  # Moderate penalty for repeated tokens
presence_penalty
Encourages the model to talk about new topics by penalizing tokens that have appeared at all. Range: -2.0 to 2.0
presence_penalty=0.6  # Encourage discussion of new topics
stop
Specifies sequences where the API will stop generating further tokens.
stop=["END", "\n\n"] # Stop at "END" or double newlineComplete Example with Multiple Parameters
Here's a comprehensive example showing how these parameters work together:
from openai import OpenAI
def analyze_sentiment(text: str, verbose: bool = False) -> str:
"""
Analyzes the sentiment of given text using GPT-4.
Args:
text: The text to analyze
verbose: If True, includes confidence level in output
Returns:
Sentiment classification (Positive/Negative/Neutral)
"""
client = OpenAI(api_key="your-api-key")
system_message = """You are a sentiment analysis expert.
Classify the sentiment as Positive, Negative, or Neutral."""
if verbose:
system_message += " Include a confidence level (low/medium/high)."
response = client.chat.completions.create(
model="gpt-4",
messages=[
{"role": "system", "content": system_message},
{"role": "user", "content": f"Analyze this text: {text}"}
],
max_tokens=100,
temperature=0.2, # Low temperature for consistent classification
top_p=1.0,
frequency_penalty=0.0,
presence_penalty=0.0,
stop=None
)
return response.choices[0].message.content.strip()
# Test the function
review = "This product exceeded my expectations, though shipping was slow."
result = analyze_sentiment(review, verbose=True)
print(f"Sentiment: {result}")Handling API Responses
Understanding the response structure is crucial for extracting the information you need:
response = client.chat.completions.create(
model="gpt-4",
messages=[{"role": "user", "content": "Hello!"}]
)
# Access the generated text
content = response.choices[0].message.content
# Access usage information
prompt_tokens = response.usage.prompt_tokens
completion_tokens = response.usage.completion_tokens
total_tokens = response.usage.total_tokens
print(f"Generated text: {content}")
print(f"Tokens used - Prompt: {prompt_tokens}, Completion: {completion_tokens}")Response Object Structure
{
"id": "chatcmpl-123",
"object": "chat.completion",
"created": 1677652288,
"model": "gpt-4",
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": "Hello! How can I help you today?"
},
"finish_reason": "stop"
}
],
"usage": {
"prompt_tokens": 9,
"completion_tokens": 12,
"total_tokens": 21
}
}
Provider-Specific Differences
While the basic structure is similar across providers, there are some differences:
Anthropic (Claude)
from anthropic import Anthropic
client = Anthropic(api_key="your-api-key")
response = client.messages.create(
model="claude-sonnet-4-5-20250929",
max_tokens=1024, # Required for Anthropic
messages=[
{"role": "user", "content": "Hello, Claude!"}
]
)
print(response.content[0].text)
Key differences:
max_tokens is required (not optional)
No system role in the messages array; use a separate system parameter (see the sketch below)
Response structure: response.content[0].text instead of response.choices[0].message.content
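For example, a system prompt that would go in the messages array with OpenAI is passed as a top-level system argument with Anthropic. This is a minimal sketch assuming the same Anthropic client as above; the prompt text reuses the course-assistant wording from the earlier OpenAI example and is purely illustrative.

response = client.messages.create(
    model="claude-sonnet-4-5-20250929",
    max_tokens=1024,
    system="You are a helpful teaching assistant for an NLP course.",
    messages=[
        {"role": "user", "content": "Explain tokenization in simple terms."}
    ]
)
print(response.content[0].text)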
Google (Gemini)
import google.generativeai as genai
genai.configure(api_key="your-api-key")
model = genai.GenerativeModel('gemini-pro')
response = model.generate_content("Hello, Gemini!")
print(response.text)
Error Handling
Always implement proper error handling when working with APIs:
from openai import OpenAI, OpenAIError
import time
def make_api_call_with_retry(client, messages, max_retries=3):
"""Makes an API call with retry logic for handling rate limits."""
for attempt in range(max_retries):
try:
response = client.chat.completions.create(
model="gpt-4",
messages=messages,
max_tokens=500,
temperature=0.7
)
return response.choices[0].message.content
except OpenAIError as e:
if attempt < max_retries - 1:
wait_time = 2 ** attempt # Exponential backoff
print(f"Error occurred: {e}. Retrying in {wait_time} seconds...")
time.sleep(wait_time)
else:
print(f"Failed after {max_retries} attempts: {e}")
raise
except Exception as e:
print(f"Unexpected error: {e}")
raise
# Usage
client = OpenAI(api_key="your-api-key")
messages = [{"role": "user", "content": "Explain error handling."}]
result = make_api_call_with_retry(client, messages)
Best Practices
Start with lower temperature values (0.0-0.3) for factual tasks, increase for creative tasks
Set appropriate max_tokens to balance cost and completeness
Include system messages to set consistent behavior
Monitor token usage to manage costs effectively
Implement retry logic for production applications
Store API keys securely using environment variables
Cache responses when appropriate to reduce API calls (see the sketch below)
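As a rough illustration of the last point, identical prompts can be served from a local cache instead of triggering a new request. This is a minimal sketch assuming an existing OpenAI client; the in-memory dictionary and the helper name are made up for this example.

# Hypothetical helper: return a cached answer for repeated prompts,
# otherwise make one API call and store the result.
response_cache = {}

def cached_completion(client, prompt: str) -> str:
    if prompt in response_cache:
        return response_cache[prompt]
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}]
    )
    answer = response.choices[0].message.content
    response_cache[prompt] = answer
    return answer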
Cost Considerations
API usage is typically priced per token:
Input tokens (your prompts) are usually cheaper
Output tokens (model responses) are usually more expensive
Different models have different pricing tiers
# Estimate costs before making requests
def estimate_cost(prompt: str, max_tokens: int, model: str) -> float:
"""Rough cost estimation (prices vary by provider)."""
# Example rates (not actual current rates - check provider docs)
rates = {
"gpt-4": {"input": 0.03, "output": 0.06}, # per 1K tokens
"gpt-3.5-turbo": {"input": 0.001, "output": 0.002}
}
# Rough token estimation (1 token ≈ 4 characters)
input_tokens = len(prompt) / 4
rate = rates.get(model, rates["gpt-4"])
estimated_cost = (
(input_tokens / 1000) * rate["input"] +
(max_tokens / 1000) * rate["output"]
)
    return estimated_cost
Practical Exercise
Try experimenting with different parameter values to see their effects:
from openai import OpenAI
client = OpenAI(api_key="your-api-key")
prompt = "Write a creative story about a robot learning to paint."
# Try different configurations
configs = [
{"temp": 0.2, "desc": "Low temperature (focused)"},
{"temp": 0.7, "desc": "Medium temperature (balanced)"},
{"temp": 1.5, "desc": "High temperature (creative)"}
]
for config in configs:
print(f"\n{config['desc']}:")
print("-" * 50)
response = client.chat.completions.create(
model="gpt-4",
messages=[{"role": "user", "content": prompt}],
max_tokens=200,
temperature=config["temp"]
)
    print(response.choices[0].message.content)
Q6: Run the above experiment and describe how the outputs differ across temperature settings.
Summary
In this section, you learned:
How to structure API requests with the messages array
Key parameters: model, max_tokens, temperature, top_p, penalties, and stop sequences
How to handle multi-turn conversations
Response structure and accessing generated content
Provider-specific differences (OpenAI, Anthropic, Google)
Error handling and retry logic
Best practices for cost management
These skills form the foundation for all programmatic interactions with LLMs and will be essential for your course projects.
{% hint style="success" %} Practice making API calls with different parameters to develop an intuition for how they affect model behavior. This experimentation is key to becoming proficient in working with LLMs. {% endhint %}