Selecting the right AI Model: A Step-by-Step Breakdown for Enterprises


This post was also published on LinkedIn on May 26, 2024.

Over the past year, it has become increasingly clear that effective GenAI usage is a necessity for the modern enterprise. As a result of this paradigm shift, business leaders worldwide are grappling with the question: How do we make the best use of GenAI?

Among many other considerations, an important part of the answer centers on selecting the right AI model(s). At Fisent, we've had the privilege of navigating this question with various Fortune 1000 companies. Through these implementations, we've arrived at two significant realizations. First, selecting the right "model" or Large Language Model (LLM) for a particular use case is never a one-size-fits-all proposition. And second, when integrating an LLM within an organization, companies cannot rely solely on a 'set and forget' strategy.

Selecting The Right LLM:

When selecting the right LLM, application context is incredibly important: organizations must identify their organizational priorities and pick a model that aligns with them.

For example, take a retail organization using GenAI to interpret complex purchase orders in order to automate and accelerate order fulfillment. Due to the length of these purchase orders, the organization would likely need a model with a large context window. Additionally, the inherent cost of interpreting purchase orders manually and the need for a high degree of accuracy might drive this organization to prioritize accuracy over price. As a result, it would likely consider an expensive, high-accuracy model like Claude 3 Opus.
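As a minimal sketch of what that might look like in practice, the call below hands a long purchase order to Claude 3 Opus via the Anthropic Python SDK. The prompt, field names, and token limit are hypothetical illustrations, not a production implementation:

```python
# A minimal sketch, assuming the Anthropic Python SDK ("pip install anthropic")
# and an API key in the ANTHROPIC_API_KEY environment variable.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment


def extract_order_lines(purchase_order_text: str) -> str:
    """Ask a high-accuracy, large-context model to structure a long purchase order."""
    response = client.messages.create(
        model="claude-3-opus-20240229",  # 200k-token context window
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": (
                # Hypothetical prompt and field names, purely for illustration:
                "Extract every line item from this purchase order as JSON "
                "with the fields sku, quantity, and unit_price:\n\n"
                + purchase_order_text
            ),
        }],
    )
    return response.content[0].text
```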

On the other hand, take an energy provider planning to shorten customer wait times by using GenAI to accelerate the processing of a high volume of simple utility bills. In this example, the simplicity of these utility bills, the high-volume nature of the task, and the importance of speed would likely drive this organization toward a cheaper and quicker model like Llama 3.

These examples show how organizations can use published benchmarks and model statistics to evaluate the plethora of available models and identify the model(s) that best align with their needs. The table below shows some of the current industry-leading LLMs and some of the details an enterprise may consider during evaluation.

| | GPT-4o | Claude 3 Opus | Gemini 1.5 Pro | Llama 3 70B | Mistral Large |
|---|---|---|---|---|---|
| Context Window | 128k | 200k | 1M | 8k | 32k |
| Type | Proprietary | Proprietary | Proprietary | Open Source | Proprietary |
| Native Input Types | Audio, Image, Video, Text | Image, Text | Audio, Image, Video, Text | Text | Text |
| MMLU Benchmark | 88.7% | 86.8% | 85.9% | 82.0% | 81.2% |
| HumanEval Benchmark (Python) | 90.2% | 84.9% | 84.1% | 81.7% | 45.1% |
| Input Cost per 1M Tokens | $5 | $15 | $7 | $0.59 | $4 |
| Output Cost per 1M Tokens | $15 | $75 | $21 | $0.79 | $12 |
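To make the pricing rows concrete, a quick back-of-the-envelope calculation helps. The sketch below is illustrative only: the document volume and token counts are hypothetical assumptions, and the prices are the per-1M-token list prices from the table above:

```python
# Rough monthly cost comparison using the list prices from the table above.
# Volumes and token counts are hypothetical assumptions for illustration.
PRICES = {  # model: (input, output) in USD per 1M tokens
    "GPT-4o": (5.00, 15.00),
    "Claude 3 Opus": (15.00, 75.00),
    "Llama 3 70B": (0.59, 0.79),
}


def monthly_cost(model: str, docs: int, tokens_in: int, tokens_out: int) -> float:
    """Total monthly spend for `docs` documents of the given token sizes."""
    price_in, price_out = PRICES[model]
    return docs * (tokens_in * price_in + tokens_out * price_out) / 1_000_000


# e.g. 100,000 simple utility bills a month, ~1,500 tokens in, ~200 tokens out:
for model in PRICES:
    print(f"{model}: ${monthly_cost(model, 100_000, 1_500, 200):,.2f}/month")
```

At that hypothetical volume the gap is stark: roughly $104 per month on Llama 3 70B versus about $3,750 on Claude 3 Opus, which is why the utility-bill scenario above points toward the cheaper model.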


Yet, while benchmarks and model statistics serve as a great initial guide, it's important to note that GenAI models often behave in unexpected ways. These less tangible outcomes can be just as important for an organization on its quest for the right model.

For example, some models are quick to refuse prompts containing potentially 'sensitive' content. An organization that works with information about dangerous chemicals may therefore find that a model that statistically fits its requirements fails to process its content because of flagged prompts. Organizations still need to experiment with multiple models using realistic simulations before finalizing the ideal choice.

In short, a powerful framework enterprises should consider is as follows:

  1. Identify organizational model priorities 
  2. Short-list using model statistics
  3. Identify ideal model(s) through use-case-specific testing (a sketch of such a test follows below)
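To illustrate step 3, here is a minimal test-harness sketch. It assumes every short-listed model is served behind an OpenAI-compatible endpoint; the candidate names, test case, and string-match scoring are hypothetical stand-ins for an organization's own documents and evaluation criteria:

```python
# A minimal use-case-specific test harness (a sketch, not a full evaluation
# suite). Assumes an OpenAI-compatible endpoint serving every candidate model;
# model names, the test case, and the scoring rule are hypothetical.
from openai import OpenAI

client = OpenAI()

CANDIDATES = ["gpt-4o", "llama-3-70b"]  # short-listed in step 2

TEST_CASES = [  # realistic sample documents paired with expected outputs
    {
        "document": "Account 4411: amount due $82.50 by June 1. "
                    "What is the amount due? Answer with the number only.",
        "expected": "82.50",
    },
]


def accuracy(model: str) -> float:
    """Fraction of test cases where the model's reply contains the expected answer."""
    hits = 0
    for case in TEST_CASES:
        reply = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": case["document"]}],
        )
        hits += case["expected"] in reply.choices[0].message.content
    return hits / len(TEST_CASES)


for model in CANDIDATES:
    print(f"{model}: {accuracy(model):.0%}")
```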

Staying Ahead of Change:

In the fast-moving technical landscape of GenAI, change truly is the only constant. In other words, the best model today will not be the best tomorrow. Just this past week, OpenAI announced the release of GPT-4o, setting a new standard for model multimodality and accuracy. As we mentioned in our last blog post, models have been making material improvements at a rapid rate, a trend we can expect to continue. This, paired with the irregular release of new models from different providers, has resulted in what Groq CEO Jonathan Ross calls a "leapfrogging effect," where there isn't just one industry leader and the most effective models can change by the month.

What this all means is that organizations should pick their models purposefully but also remain agile enough to swap to the best models as needed. At Fisent, we believe in bringing simplicity and effectiveness to the lives of our clients. That means we help select the best models for our clients and, through BizAI's native optionality, ensure that they are always using the latest and greatest models for their needs.
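One common pattern for this kind of agility (a sketch of the general idea, not BizAI's implementation) is to hide each vendor behind a single interface, so that the active model is a configuration choice rather than a rewrite:

```python
# Provider "optionality" sketch: one interface over multiple vendors, so
# swapping models is a configuration change. Not BizAI's implementation;
# assumes the OpenAI and Anthropic Python SDKs with their API keys set.
import anthropic
from openai import OpenAI


def complete(provider: str, model: str, prompt: str) -> str:
    """Send `prompt` to the configured provider/model and return the reply text."""
    if provider == "openai":
        response = OpenAI().chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
        )
        return response.choices[0].message.content
    if provider == "anthropic":
        response = anthropic.Anthropic().messages.create(
            model=model,
            max_tokens=1024,
            messages=[{"role": "user", "content": prompt}],
        )
        return response.content[0].text
    raise ValueError(f"unknown provider: {provider}")


# When the leaderboard shifts, only the configuration changes:
# complete("anthropic", "claude-3-opus-20240229", "Summarize this order...")
# complete("openai", "gpt-4o", "Summarize this order...")
```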