LLM API Request

Updated: 2025-11-25

Now that you have made your first LLM call, let's look at how to structure requests, configure parameters, and handle responses.

Messages

Single Interaction

Most LLM APIs follow a similar pattern using a chat completion interface. Here is a request example:

import os
from dotenv import load_dotenv
from openai import OpenAI
from openai.types.chat import ChatCompletion

# Load OPENAI_API_KEY from a local .env file into the environment.
load_dotenv()

def single_interaction(client: OpenAI) -> ChatCompletion:
    # A single-turn request: the messages array holds one user message.
    return client.chat.completions.create(
        model="gpt-5-nano",
        messages=[
            {"role": "user", "content": "Who are you?"}
        ]
    )

if __name__ == "__main__":
    c = OpenAI(api_key=os.environ.get("OPENAI_API_KEY"))
    r = single_interaction(c)
    print(r.choices[0].message.content)  # the generated text lives in choices[0].message

Note that the client parameter is annotated with the OpenAI type, so the function signature makes clear what it expects.

The messages array represents the conversation history and follows a structured format with different roles:

  • user: Represents messages from the user

  • system: Sets the behavior and context for the model

  • assistant: Represents previous responses from the model

If you are making a standalone API call (as above), the messages array contains a single dictionary whose role is set to user and whose content holds the input prompt.

Multi-Turn Interactions

To set up a multi-turn interaction with an LLM, you first specify its system role and then provide the user prompt. In this case, the messages array contains an additional dictionary where role is set to system and content describes the behavior the model should maintain for the rest of the conversation.

To maintain context across multiple interactions, include the entire conversation history in every request. Add the LLM's previous response to messages as an additional dictionary, where role is assistant and content is the model's output (for example, the "20" produced in the sketch below), then append the next user message before calling the API again.
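
Here is a minimal sketch of that pattern. The system message, the arithmetic question, and the follow-up prompt are illustrative choices, not part of an official example:

import os
from dotenv import load_dotenv
from openai import OpenAI

load_dotenv()
client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY"))

# The system message sets persistent behavior; the history grows with each turn.
messages = [
    {"role": "system", "content": "You are a concise math tutor."},
    {"role": "user", "content": "What is 4 * 5?"}
]

first = client.chat.completions.create(model="gpt-5-nano", messages=messages)
answer = first.choices[0].message.content  # e.g., "20"

# Append the assistant's reply and the next user message to keep context.
messages.append({"role": "assistant", "content": answer})
messages.append({"role": "user", "content": "Now double it."})

second = client.chat.completions.create(model="gpt-5-nano", messages=messages)
print(second.choices[0].message.content)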

Parameters

Let's explore the important parameters you can configure when making API requests:

model

Specifies which LLM to use. Different models have different capabilities, costs, and context windows.

max_tokens

Controls the maximum number of tokens (word pieces rather than whole words) the model can generate in its response.

{% hint style="warning" %} Setting max_tokens too low may cause responses to be cut off mid-sentence. Setting it too high increases costs and latency. {% endhint %}

temperature

Controls the randomness of the model's output. Range: 0.0 to 2.0

  • Low values (0.0 - 0.3): More deterministic and focused responses

  • Medium values (0.5 - 0.7): Balanced creativity and coherence

  • High values (0.8 - 2.0): More creative and diverse responses

Q5: When would you use a temperature of 0.0 versus 0.9?

top_p (nucleus sampling)

An alternative to temperature that controls diversity by considering only the top tokens whose cumulative probability adds up to top_p. Range: 0.0 to 1.0

{% hint style="info" %} It's generally recommended to alter either temperature OR top_p, but not both simultaneously. {% endhint %}

frequency_penalty

Reduces repetition by penalizing tokens based on how frequently they've appeared. Range: -2.0 to 2.0

presence_penalty

Encourages the model to talk about new topics by penalizing tokens that have appeared at all. Range: -2.0 to 2.0

stop

Specifies sequences where the API will stop generating further tokens.

Complete Example with Multiple Parameters

Here's a comprehensive example showing how these parameters work together:
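
The sketch below is illustrative: the prompt, the parameter values, and the model name (assumed here to be gpt-4o-mini, which accepts all of these parameters) are arbitrary choices, not recommendations.

import os
from openai import OpenAI

client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY"))

response = client.chat.completions.create(
    model="gpt-4o-mini",        # assumed model; substitute whichever model your course uses
    messages=[
        {"role": "system", "content": "You are a helpful writing assistant."},
        {"role": "user", "content": "Write a two-sentence description of a reusable water bottle."}
    ],
    max_tokens=150,             # cap the length of the reply
    temperature=0.7,            # moderate creativity
    # top_p is left at its default, following the hint above about not tuning both
    frequency_penalty=0.3,      # discourage repeated phrases
    presence_penalty=0.0,       # no extra push toward new topics
    stop=["\n\n"]               # stop if the model starts a new paragraph
)

print(response.choices[0].message.content)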

Handling API Responses

Understanding the response structure is crucial for extracting the information you need:

Response Object Structure
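
The exact object type depends on the SDK, but with the OpenAI Python library the fields below are the ones you will use most often. This is a sketch; consult the SDK reference for the full schema.

import os
from openai import OpenAI

client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY"))

response = client.chat.completions.create(
    model="gpt-5-nano",
    messages=[{"role": "user", "content": "Who are you?"}]
)

print(response.id)                          # unique identifier for this completion
print(response.model)                       # the model that served the request
print(response.choices[0].message.role)     # "assistant"
print(response.choices[0].message.content)  # the generated text
print(response.choices[0].finish_reason)    # "stop", "length", etc.
print(response.usage.prompt_tokens)         # tokens in your input
print(response.usage.completion_tokens)     # tokens in the model's output
print(response.usage.total_tokens)          # sum of the two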

Provider-Specific Differences

While the basic structure is similar across providers, there are some differences:

Anthropic (Claude)

Key differences (illustrated in the sketch after this list):

  • max_tokens is required (not optional)

  • No system role in messages array; use a separate system parameter

  • Response structure: response.content[0].text instead of response.choices[0].message.content
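
The sketch below shows these differences using the Anthropic Python SDK; the model name is a placeholder, so check Anthropic's documentation for a current identifier.

import os
from anthropic import Anthropic

client = Anthropic(api_key=os.environ.get("ANTHROPIC_API_KEY"))

message = client.messages.create(
    model="claude-sonnet-4-20250514",       # placeholder; use a current model name
    max_tokens=256,                         # required, unlike in the OpenAI API
    system="You are a concise assistant.",  # system prompt is a separate parameter
    messages=[
        {"role": "user", "content": "Who are you?"}
    ]
)

print(message.content[0].text)              # note: content[0].text, not choices[0].message.content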

Google (Gemini)
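
Key differences (based on the google-generativeai Python package; treat the specifics as a sketch and confirm against Google's current documentation):

  • Generated text is read from response.text rather than a choices array

  • Sampling settings such as temperature and max_output_tokens are grouped into a generation_config

  • The system prompt is passed as system_instruction when constructing the model

A minimal sketch:

import os
import google.generativeai as genai

genai.configure(api_key=os.environ.get("GOOGLE_API_KEY"))

model = genai.GenerativeModel(
    "gemini-1.5-flash",                                   # placeholder; use a current model name
    system_instruction="You are a concise assistant."
)

response = model.generate_content(
    "Who are you?",
    generation_config={"temperature": 0.7, "max_output_tokens": 256}
)

print(response.text)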

Error Handling

Always implement proper error handling when working with APIs:
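
A minimal sketch with the OpenAI Python SDK is shown below; the exception classes are the SDK's real ones, while the retry count and wait times are arbitrary choices.

import time
from openai import OpenAI, APIConnectionError, APIStatusError, RateLimitError

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def ask_with_retries(prompt: str, max_retries: int = 3) -> str:
    """Call the API, retrying transient failures with exponential backoff."""
    for attempt in range(max_retries):
        try:
            response = client.chat.completions.create(
                model="gpt-5-nano",
                messages=[{"role": "user", "content": prompt}]
            )
            return response.choices[0].message.content
        except (RateLimitError, APIConnectionError) as exc:
            wait = 2 ** attempt                  # back off: 1s, 2s, 4s, ...
            print(f"Transient error ({type(exc).__name__}); retrying in {wait}s")
            time.sleep(wait)
        except APIStatusError:
            raise                                # bad request, auth failure, etc.: do not retry
    raise RuntimeError("Request failed after all retries")

The SDK also performs a small number of automatic retries on its own (configurable via the client's max_retries option), so an explicit loop like this matters most when you need custom backoff or logging.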

Best Practices

  1. Start with lower temperature values (0.0-0.3) for factual tasks, increase for creative tasks

  2. Set appropriate max_tokens to balance cost and completeness

  3. Include system messages to set consistent behavior

  4. Monitor token usage to manage costs effectively

  5. Implement retry logic for production applications

  6. Store API keys securely using environment variables

  7. Cache responses when appropriate to reduce API calls

Cost Considerations

API usage is typically priced per token (a usage-tracking sketch follows this list):

  • Input tokens (your prompts) are usually cheaper

  • Output tokens (model responses) are usually more expensive

  • Different models have different pricing tiers
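
A simple way to keep costs visible is to read the token counts returned with every response. In the sketch below, the per-token prices are placeholder values for illustration only; look up your provider's current rates.

import os
from openai import OpenAI

client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY"))

# Placeholder prices (dollars per million tokens) for illustration only.
INPUT_PRICE_PER_M = 0.10
OUTPUT_PRICE_PER_M = 0.40

response = client.chat.completions.create(
    model="gpt-5-nano",
    messages=[{"role": "user", "content": "Who are you?"}]
)

usage = response.usage
cost = (usage.prompt_tokens * INPUT_PRICE_PER_M
        + usage.completion_tokens * OUTPUT_PRICE_PER_M) / 1_000_000
print(f"prompt={usage.prompt_tokens} tokens, completion={usage.completion_tokens} tokens, "
      f"estimated cost=${cost:.6f}")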

Practical Exercise

Try experimenting with different parameter values to see their effects:
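
One possible version of the experiment is sketched below; the prompt, the temperature values, and the model name (assumed to accept a temperature argument) are all suggestions you can change.

import os
from openai import OpenAI

client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY"))
prompt = "Write a one-sentence slogan for a coffee shop."

for temperature in (0.0, 0.7, 1.5):
    response = client.chat.completions.create(
        model="gpt-4o-mini",        # assumed model; substitute whichever model your course uses
        messages=[{"role": "user", "content": prompt}],
        temperature=temperature
    )
    print(f"temperature={temperature}: {response.choices[0].message.content}")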

Q6: Run the above experiment and describe how the outputs differ across temperature settings.

Summary

In this section, you learned:

  • How to structure API requests with the messages array

  • Key parameters: model, max_tokens, temperature, top_p, penalties, and stop sequences

  • How to handle multi-turn conversations

  • Response structure and accessing generated content

  • Provider-specific differences (OpenAI, Anthropic, Google)

  • Error handling and retry logic

  • Best practices for cost management

These skills form the foundation for all programmatic interactions with LLMs and will be essential for your course projects.

{% hint style="success" %} Practice making API calls with different parameters to develop an intuition for how they affect model behavior. This experimentation is key to becoming proficient in working with LLMs. {% endhint %}
