10 Aug 2023 · Software Engineering

    A Guide to OpenAI: How to Choose the Best Language Model For Your AI Application

    10 min read
    Contents

    When I began experimenting with the OpenAI API, I didn’t pause to consider the range of available models. Instead, I used what was generally regarded as the most potent model at the time, GPT-3.5. But with the announcement of GPT-4’s general availability, I decided it was time to delve further.

    Initially, I hadn’t planned to write this article, let alone publish it. It started with me exploring other models to understand their purpose and why they were created. Upon realizing that other developers had similar questions, I decided that this topic warranted a post. This is that post.

    My aim is to provide an overview of all the currently available text models and offer guidelines to assist you in choosing the best fit for your AI application.

    A cheatsheet showing the GPT models. From GPT-3 to GPT-4. Under GPT-3 we have the davinci, curie, babbage, and ada models, which are the only ones that can be fine-tuned. Also, under GPT-3 we find text-davinci-003, text-curie-001, text-babbage-001, text-ada-001 which are trained to follow instructions. Under GPT-3.5 models, which support 4096 tokens,we find gpt-3.5-turbo with its 0613, 16k, and 16k-0613 variants. Under GPT-4 we find gpt-4 and gpt-4-0613 models. The 0613 models support function calling, the 16k models support 16384 thousand tokens. The GPT-4 models support 8182
    Cheatsheet with all the models discussed in this tutorial. Feel free to download the full-page image.

    Before delving deeper, let’s review the available GPT generations.

    The Three GPT Generations

    OpenAI has introduced three generations of Large Language Models (LLMs). These models can execute all kinds of text-processing tasks, including question-answering, translation, creative content writing, summarizing, and role-following behavior.

    • GPT-3 is the first generation of LLMs from OpenAI (GPT-2 was never publicly released). While effective at text generation, it can get confused and make mistakes.
    • GPT-3.5 represents a minor update to GPT-3. It is more factual and less prone to harmful outputs than GPT-3. Also, it’s faster.
    • GPT-4 is the latest generation of LLMs from OpenAI. It significantly surpasses GPT-3 or GPT-3.5 in power, producing more coherent, accurate, and creative text. It’s also multi-modal, meaning it can process audio and images.

    Considerations When Selecting a Language Model

    When selecting the appropriate Large Language Models (LLMs) for our application, there are several factors to consider.

    Text Completion vs. Chat Models

    A completion model attempts to extend the user’s prompt with relevant text. For instance, if our prompt is “The horse is,” the model will likely respond with something like “a domesticated animal.” Completion models do not have conversational capabilities. We send a single request, receive a response, and that’s the end of the interaction.

    On the other hand, chat models are optimized for engaging in conversations. While they technically function as completion models, they can tell apart who said what, leading to a more natural conversation flow. We can pose questions, receive answers, and make follow-up requests. Anyone who has used ChatGPT would be familiar with this type of interaction.

    In the table below, you can find both completion and chat models. As illustrated, both types of models use different API endpoints.

    CompletionChat
    Endpoint/v1/completion/v1/chat/completion
    Modelsdavinci
    curie
    babbage
    ada
    text-davinci-003
    text-curie-001
    text-babbage-001
    text-ada-001
    All “gpt-4” models
    All “gpt-3.5-turbo” models
    Completion vs Chat models

    OpenAI has labeled all completion models as legacy. While they remain available and updated this decision suggests that OpenAI’s focus will shift toward chat models in the future.

    Instruction Following Capabilities

    Base completion models, such as “davinci” or “curie,” perform poorly when it comes to following instructions, as the example below demonstrates (note: you can try all the examples in this article for yourself in the OpenAI playground).

    Giving the model the instruction, “Say this is a test,” results in rather random text. While this might be entertaining, given our intended use, it’s not particularly helpful.

    Screenshot of OpenAI playground showing the completion of the text 'Say this is a test'. The completion that follows is: and you're here to give me a big break.' You won't make the mistake of saying that? And thus continues with a few more lines of ramblings
    Davinci base completion response. Note that it does not follow instructions.

    OpenAI offers several models fine-tuned to follow instructions. Any of these will do:

    • All the “gpt-3.5-turbo” models
    • All the “gpt-4” models
    • All models beginning with “text-*”.

    To illustrate the difference that fine-tuning makes, let’s apply the same prompt to “text-davinci-003”:

    Screenshot of the OpenAI playground showing the completion of "Say this a test". The completion reads: Yes, this is a test.
    Screenshot of the OpenAI playground showing the completion of “Say this a test”. The completion reads: Yes, this is a test.

    The full breakdown of model behavior follows below:


    Base completion models
    Instruction-following models
    davincitext-davinci-003
    curietext-davinci-002
    babbagetext-davinci-001
    adatext-curie-001
    text-babbage-001
    text-ada-001
    All gpt-3.5-turbo models
    All gpt-4 models
    Base completion vs Instruction-following models

    Fine-Tunable Models

    If you have been wondering about the relevance of models like “davinci”, “curie”, “babbage”, and “ada”, this is where they finally come into play. These base models are the only ones that can be fine-tuned.

    But what is fine-tuning? Fine-tuning is the process of further training a model to carry out a specific task by feeding the model with example questions and responses. A fine-tuned model can achieve superior results, diminished costs, and enhanced latency for a given task.

    The training data consists of example prompts, paired with their optimal completions, formatted as JSON Lines:

    {"prompt": "<prompt text>", "completion": "<ideal generated text>"}
    {"prompt": "<prompt text>", "completion": "<ideal generated text>"}
    ...
    

    For instance, if we wanted to train the model to be an expert in Barbie dolls, the first lines of the training data might look like these:

    { "prompt": "Can you tell me about the first Barbie doll ever released?", "completion": "The first Barbie doll was introduced by Mattel at the American International Toy Fair in New York on March 9, 1959. It was named Barbie after the creators' daughter, Barbara. The doll was a teenage fashion model, available in either a black and white striped swimsuit or a bandeau-style swimsuit with high-heeled shoes, sunglasses, and earrings." }
    { "prompt": "What is the most expensive Barbie ever sold?", "completion": "The most expensive Barbie doll ever sold at auction was the Stefani Canturi Barbie, which fetched $302,500 in 2010. This Barbie was designed by Australian jewelry designer Stefani Canturi. She was adorned with a necklace featuring a one-carat pink diamond." }
    { "prompt": "What are some popular Barbie playsets?", "completion": "There are many popular Barbie playsets that have been released over the years. Some of these include the Barbie Dreamhouse, the Barbie Dream Camper, the Barbie Care Clinic Vehicle, and the Barbie Malibu House. These playsets often come with a variety of accessories and can be used with any Barbie doll." }
    ...
    

    Once the examples are ready, we upload them using the openai cli to begin the training. This process has a cost, so check out the pricing page before starting as you may need to process thousands of examples. It may also take several hours to complete. However, once trained, the model should excel as a world-class Barbie expert.

    Token Capacity and Model Size

    A significant factor to consider is the size of the model’s “context window,” measured in tokens. A token can represent a word, a part of a word, punctuation marks, or symbols (for English, a token represents, on average, four characters).

    The larger the model’s token capacity, the more text it can process. However, the token capacity does not say anything about the model’s “intelligence.” For that, we need to consider the number of parameters.

    Large language models are artificial neural networks, and the parameters are the connections and weights between the neurons in the network. The number of parameters relates to how much information the model can “remember” or “learn” from the training data it has been exposed to. In other words, the more parameters a model has, the more knowledgeable it is. A large model can reason, understand complex concepts, and make more nuanced decisions.

    While OpenAI has not officially disclosed the size of its models, the Microsoft AI Reference page offers guidance. For the rest, we only have estimations.


    Model
    CapabilitiesToken windowParameters
    adaCapable of very simple tasks, usually the fastest model in the GPT-3 series, and lowest cost.2049350 Millions
    text-ada-001Same as “ada” but fine-tuned for instruction following.2049350 Millions
    babbageCapable of straightforward tasks, very fast, and lower cost.20493 Billions
    text-babbage-001Same as “babbage” but fine-tuned for instruction following.20493 Billions
    curieVery capable, but faster and lower cost than Davinci.204913 Billions
    text-curie-001Same as “curie” but fine-tuned for instruction following.204913 Billions
    davinciMost capable GPT-3 model. It can do any task the other models can, often with higher quality.2049175 Billions
    text-davinci-003Better quality, longer output, and consistent instruction-following. Supports the inserting text feature.4097175 Billions
    text-davinci-002Similar to text-davinci-003 but trained with supervised fine-tuning instead of reinforcement learning4097175 Billions
    gpt-3.5-turboMost capable GPT-3.5 model and optimized for chat.4096175 Billions
    gpt-3.5-turbo-16kSame capabilities as the standard gpt-3.5-turbo model but with 4 times the context.16,384175 Billions
    gpt-3.5-turbo-0613Same as “gpt-3.5-turbo” but with the function calling feature.4,096175 Billions
    gpt-3.5-turbo-16k-0613Same as “gpt-3.5-turbo-16k” but with the function calling feature.16,384175 Billions
    gpt-4More capable than any GPT-3.5 model, able to do more complex tasks, and optimized for chat.8,192Estimated at > 1 trillion
    gpt-4-0613Same as “gpt-4” but with the function calling feature.8,192Estimated at > 1 trillion
    Overview of available OpenAI text models.

    Pricing

    Whether it’s completion, chatting, or fine-tuning, every interaction with the API incurs a cost. Generally, the larger the model, the higher the price. The exception is the “gpt-3.5-turbo” series, which is the same size as “davinci” but costs a tenth of the price, making it one of the most cost-effective models for powering AI applications.

    Before building your application, carefully examine the OpenAI pricing page to estimate your system’s API costs.

    Other Considerations

    While I have covered the more general factors to consider when choosing a model, there are a few other minor aspects that might be significant in certain scenarios:

    • Function calling: The “gpt-3.5” and “gpt-4” models ending in “0613” have been trained to detect the need for function calling based on the user’s prompt, and they respond with a structured call request instead of regular text. This feature allows us to integrate a GPT model with the external world.
    • Static vs. updated models: All the models ending in a version number, such as “gpt-3.5-turbo-0613”, “gpt-3.5-turbo-16-0613”, or “gpt-4-0613”, are static — they do not receive updates. Using a static model is like pinning dependencies, you know they won’t change so it’s safer to build a product on top but sooner or later OpenAI will deprecate the model and you’ll have to switch up. As new versions of the models are released, the older ones are deprecated. The unversioned models, such as “gpt-3.5-turbo” and “gpt-4”, are periodically being updated by OpenAI.
    • Multi-modal capabilities: Only “gpt-4” supports multi-modal capabilities, meaning it can process images and text. Unfortunately, this feature is not available to the general public yet, but you can see a few examples of how it works.

    A Few Thoughts on Choosing a Model

    The “gpt-3.5-turbo” series is the most cost-effective of all the available models. Since it’s also the second-largest model, it’s a logical choice for basing a minimum viable product (MVP) on. Yet, the parameter size of a model only tells part of the story. The rest comes from experimentation.

    Avoid using ChatGPT, as this application injects its own prompts. Instead, use the OpenAI playground or access the API directly for better control. Experiment with the prompts your application will need to determine which model works best.

    Training your model might be worthwhile if you have lots of training data. It’s an investment that could pay off in the long term, as a trained model requires fewer prompts to do the same work (and fewer prompts equals fewer tokens, leading to lower costs). Before proceeding, however, estimate the cost of the fine-tuning process itself; everything is outlined on the pricing page.

    Conclusion

    As we’ve explored, OpenAI offers a range of options, each with unique capabilities, costs, and implications. The “gpt-3.5-turbo” series, being the most cost-effective, is a popular choice, but understanding your specific application needs, fine-tuning possibilities, and budget constraints will guide you toward the most appropriate option.

    Other articles that might interest you:

    References:

    One thought on “A Guide to OpenAI: How to Choose the Best Language Model For Your AI Application

    1. Thank you for the detailed article! I have a few additional questions:

      1. Do you have any recommendations for extracting data from a product sheet containing substances and returning it in JSON format?
      2. Is there a specific GPT model you would recommend for this task?
      3. Is fine-tuning necessary to improve the results?
      4. What is the minimum number of entries required for effective fine-tuning?
      5. What practices can be adopted to minimize model hallucinations?
      6. What do you recommend to ensure consistent responses between two identical prompts, as the same prompt sometimes generates different answers?

    Leave a Reply

    Your email address will not be published. Required fields are marked *

    mm
    Writen by:
    I picked up most of my skills during the years I worked at IBM. Was a DBA, developer, and cloud engineer for a time. After that, I went into freelancing, where I found the passion for writing. Now, I'm a full-time writer at Semaphore.
    Avatar for Ahmed
    Reviewed by:
    I picked up most of my soft/hardware troubleshooting skills in the US Army. A decade of Java development drove me to operations, scaling infrastructure to cope with the thundering herd. Engineering coach and CTO of Teleclinic.