This is part two of a three-part AI-Core Insights series. Click here for part one, “Foundation models: To open-source or not to open-source?”
In the first part of this three-part blog series, we discussed the practical approach to foundation models (FM), both open and closed source. From a deployment perspective, the proof of the pudding is in the eating: the right foundation model is the one that best solves your intended use case.
Let us now simplify the seemingly infinite infrastructure needed to realize a product out of compute-intensive foundation models. There are two heavily discussed problem statements:
- Your fine-tuning cost, which requires a large amount of data and GPUs with enough vRAM and memory to host large models. This applies especially if you’re building your moat around differentiated fine-tuning or prompt engineering.
- Your inference cost, which is fractional per call but compounds with the number of inference calls. This stays regardless of your approach.
Put simply, the return and investment should go hand in hand. In the beginning, however, this can require a huge sunk cost. So, what do you focus on?
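To make the return-versus-investment tension concrete, here is a back-of-the-envelope break-even sketch. The function and every number in it are illustrative assumptions, not figures from any provider; substitute your own quotes and pricing:

```python
# Hypothetical model of the two cost streams discussed above: a one-time
# fine-tuning spend vs. a per-call inference margin that compounds.

def months_to_break_even(
    fine_tuning_cost: float,   # one-time sunk cost (data + GPU hours)
    revenue_per_call: float,   # what a paying user brings in per call
    cost_per_call: float,      # serving cost per inference call
    calls_per_month: float,    # expected monthly inference volume
) -> float:
    """Months until cumulative inference margin repays the upfront spend."""
    margin_per_month = (revenue_per_call - cost_per_call) * calls_per_month
    if margin_per_month <= 0:
        raise ValueError("unit economics never recover the sunk cost")
    return fine_tuning_cost / margin_per_month

# Example (made-up numbers): $25k fine-tuning run, $0.50 margin per call,
# 10k calls per month.
print(months_to_break_even(25_000, 0.75, 0.25, 10_000))  # -> 5.0
```

The point of the sketch is the shape of the tradeoff: the sunk cost is fixed at the start, while the return accrues only as paid inference volume grows.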
The infrastructure dilemma for FM startups
If you have a fine-tuning pipeline, it looks something like this:
- Data preprocessing and labeling: You have a big pool of datasets. You’re preprocessing your data—cleaning it, sizing it, removing backgrounds, etc. You need small GPUs here—T4s, but potentially A10s, depending on availability. Then you label it, perhaps using small models and small GPUs.
- Fine-tuning: As you start fine-tuning your model, you need larger GPUs, famously A100s, which are expensive. You load your large model and fine-tune it over specialized data, hoping none of the hardware fails in the process. If it does fail, you hopefully have checkpoints (though checkpointing itself is time-consuming) and can recover as much of the fine-tuning run as possible. Even so, depending on how sparse the checkpoints are, you still lose a good few hours.
- Retrieval and inference: After this, you serve the models for inference. Since the model size is still huge, you host it on the cloud and rack up inference costs per query. If you need a super-optimal configuration, you debate between an A10 and an A100. If you configure your GPUs to spin completely up and down, you land in a cold-start problem. If you keep your GPUs running, you rack up huge GPU costs (aka investment) without paying users (aka return).
Note: if you do not have a fine-tuning pipeline, the preprocessing elements drop out, but you still have to think about serving infrastructure.
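The checkpointing tradeoff described above can be sketched in a few lines. This is a minimal illustration with a fake training step and pickle-based state saving; a real fine-tuning job would use its framework's own checkpoint utilities, and the interval would be tuned against the cost of writing each checkpoint:

```python
# Sketch: periodic checkpointing so a hardware failure costs at most one
# checkpoint interval of work. The "state" dict is a stand-in for real
# model/optimizer state.
import os
import pickle
import tempfile

CKPT = os.path.join(tempfile.gettempdir(), "ft_checkpoint.pkl")
CHECKPOINT_EVERY = 100  # steps; larger = cheaper writes, more work at risk

def load_checkpoint():
    if os.path.exists(CKPT):
        with open(CKPT, "rb") as f:
            return pickle.load(f)
    return {"step": 0, "loss": float("inf")}

def save_checkpoint(state):
    # Write to a temp file, then rename: the atomic swap avoids leaving
    # a half-written checkpoint if the machine dies mid-save.
    tmp = CKPT + ".tmp"
    with open(tmp, "wb") as f:
        pickle.dump(state, f)
    os.replace(tmp, CKPT)

def train(total_steps=500):
    state = load_checkpoint()  # resume wherever the last run died
    for step in range(state["step"], total_steps):
        state = {"step": step + 1, "loss": 1.0 / (step + 1)}  # fake update
        if (step + 1) % CHECKPOINT_EVERY == 0:
            save_checkpoint(state)
    return state

if os.path.exists(CKPT):
    os.remove(CKPT)  # start fresh for this demo
print(train()["step"])  # -> 500
```

If the process is killed mid-run, calling `train()` again picks up from the last saved step rather than step zero, which is exactly the "retrieve your fine-tuning as much as possible" behavior described above.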
The biggest decision that relates to our sunk-cost conversation is this: What constitutes your infrastructure? Do you A) outsource the infrastructure problem, borrowing it from providers while focusing on your core product, or do you B) build components in-house, investing time and money upfront and discovering and solving the challenges as you go? Do you A) consolidate locations, saving on ingress/egress and the many costs associated with regions and zones, or do you B) decentralize across various sources, diversifying your points of failure by spreading them across zones or regions, potentially creating a latency problem that needs its own solution?
The trend I see in growing startups is this: focus on your core product differentiation and commoditize the rest. Infrastructure can be a complicated overhead taking you away from the monetizable problem statement, or it can be a powerhouse whose components scale with a single click as you grow.
Beyond compute: The role of platform and inference acceleration
There is a saying I have heard in the startup community: “You cannot throw GPUs at every problem.” How I interpret it is this: optimization is a problem that can’t be completely solved by hardware (generally speaking). There are other factors at play, like model compression and quantization, not to mention the crucial role of platform and runtime software such as inference acceleration and checkpointing.
Thinking of the big picture, the role of optimization and acceleration rapidly becomes central. Runtime accelerators like ONNX Runtime can deliver 1.4x faster inference, while rapid checkpointing features like Nebula can help recover your training jobs from hardware failures, saving the most vital resource: time. Along with this, simple techniques like autoscaling with workload triggers let you spin down the GPUs sitting idle between bursts of inference requests to a minimum floor, then scale back up when the next burst arrives.
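A scale-up/scale-down policy of the kind just described reduces to a small decision function: keep a minimum replica floor to avoid full cold starts, add capacity as the request queue grows, and shrink back toward the floor when it drains. The thresholds here are hypothetical; managed autoscalers (Kubernetes HPA, cloud-native equivalents) expose the same knobs:

```python
# Toy autoscaling policy: clamp desired capacity between a warm floor
# (no cold start) and a budget ceiling. All thresholds are illustrative.

MIN_REPLICAS = 1           # keep one warm replica to avoid cold starts
MAX_REPLICAS = 8           # budget ceiling
REQUESTS_PER_REPLICA = 50  # target load per GPU-backed replica

def desired_replicas(queue_depth: int) -> int:
    target = -(-queue_depth // REQUESTS_PER_REPLICA)  # ceiling division
    return max(MIN_REPLICAS, min(MAX_REPLICAS, target))

print(desired_replicas(0))     # -> 1 (idle: spin down to the floor)
print(desired_replicas(120))   # -> 3 (burst: scale up)
print(desired_replicas(1000))  # -> 8 (capped by budget)
```

The floor is the lever that trades idle-GPU cost against cold-start latency: raising it burns money between bursts, while setting it to zero makes the first request after a quiet period pay the full model-load time.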
In the roundtables that we’ve hosted for startups, sometimes the most cash-burning questions are the simplest ones: To manage your growth, how do you balance serving your customers short-term with the most efficient hardware and scale vs. serving them long-term with efficient scale-ups and -downs?
As we think about productionizing foundation models, involving large-scale training and inference, we need to consider the role of platform and inference acceleration together with the role of infrastructure. Techniques such as ONNX Runtime or Nebula are only a couple of such considerations, and there are many more. Ultimately, startups face the challenge of efficiently serving customers in the short term while managing growth and scalability in the long term.
For more tips on leveraging AI for your startup and to start building on industry-leading AI infrastructure, sign up today for Microsoft for Startups Founders Hub.