The Weights & Biases (W&B) platform is a leading choice for AI developers such as OpenAI to build and deploy machine learning models faster on Microsoft Azure AI infrastructure. To help AI developers accelerate the development of LLM applications, the W&B Tokyo team is playing a leading role in supporting the AI developer community’s efforts to advance LLMs’ Japanese abilities by publishing the “Nejumi LLM Leaderboard.” Since its launch in July 2023, it has grown to become one of the largest and most notable LLM benchmarks on Japanese language understanding and generation capabilities.
Weights & Biases is a member of the Microsoft for Startups (MfS) Pegasus Program, which provides access to Azure credits, go-to-market (GTM) and technical support, and unique benefits such as Azure AI infrastructure reservations on a dedicated GPU cluster. In 2024, more than 60 Y Combinator and Pegasus startups, including W&B, have reserved dedicated cluster time to train or finetune the next generation of multimodal models. These models power applications ranging from text-to-video and text-to-music generation to real-time video speech translation, image captioning, molecular prediction, and de novo molecule generation for drug discovery.
To build on its success in enabling AI developers in Japan, the W&B Tokyo team recently used the MfS dedicated GPU cluster for a novel use case. They ran batch inferencing to evaluate leading LLMs on Korean language understanding and generation benchmarks to kick-start the “Horani LLM Leaderboard” benchmark. This post outlines how the W&B team is leveraging MfS programs to promote the development of the Japanese and Korean LLM application ecosystems through its LLM benchmarking efforts, which give AI developers a starting point for deciding whether to build or buy LLMs for their use cases.
W&B and Azure OpenAI help AI developers build production LLM applications
The core services of the Weights & Biases platform enable collaboration across AI development teams throughout the machine learning lifecycle from training and evaluation to deployment and monitoring. This is done by logging key metrics, versioning models and datasets, searching hyperparameters, and generating shareable evaluation tables and reports. For builders of LLM applications, W&B offers Weave developer tools, which provide detailed traces of application data flows and sliceable and drillable evaluation reports. This allows developers to debug and optimize application components such as prompts, models, document retrieval, function calls, and custom behaviors. Whether it is revolutionizing healthcare by accelerating drug discovery through protein analysis, optimizing recommendation engines for e-commerce and media, or enhancing autonomous systems for vehicles and drones, the W&B platform’s versatility facilitates the development of AI technologies across diverse sectors.
In fact, Yan-David Erlich, Chief Revenue Officer of Weights & Biases, believes that machine learning models are at their best when built together with like-minded peers. As the industry continues to learn from itself and to understand how best to optimize machine learning training, the key to the future lies in working together.
“I think that the best machine learning models are built collaboratively,” says Erlich. “And we think the best with machine learning models require an understanding of training in massive scale that the likes that you see over at OpenAI, for example, that’s training a lot of GPUs and a lot of parallel runs.”
Moreover, seamless integration with Azure OpenAI not only augments the user experience but also enables the efficient analysis of fine-tuning experiments.
“One of our unique integrations with Microsoft Azure is specifically with Azure OpenAI,” Erlich mentions. “What we have built is essentially called an automated logger. Anyone who is optimizing with Azure OpenAI can easily leverage the Weights & Biases platform to analyze their fine-tuning experiments and understand the performance of the model to make the decisions they need to move forward or not.”
W&B Japan LLM benchmarks inform AI developer Japanese LLM model choices
The W&B Tokyo team is at the forefront of efforts to accelerate AI development in Japan and Korea through the W&B platform by socializing AI development best practices and publishing LLM benchmarks that help AI developers transparently evaluate the performance of LLMs. Since July 2023, W&B Japan has been operating the “Nejumi LLM Leaderboard,” which ranks large language models (LLMs) by their evaluated Japanese language performance. More than 45 LLMs have been evaluated, making it one of the largest leaderboards for Japanese performance evaluation in Japan.
The W&B Tokyo team originally embarked on developing the Nejumi LLM Leaderboard because they found that much of the global LLM development and evaluation was conducted primarily in English. For example, Hugging Face, the world’s largest public repository of open-source models, publishes English-only rankings on its “Open LLM Leaderboard.” It evaluates the performance of various models across multiple evaluation datasets, such as ARC for multiple-choice questions and HellaSwag for sentence completion questions. The team also found that many of the models that were highly regarded globally often had low or unknown Japanese language understanding. Furthermore, many Japanese companies had developed Japanese-specific LLMs, and there was a great deal of interest from the AI developer community in seeing how well these models performed compared to those developed globally. As a result, the Nejumi LLM Leaderboard project took off, and it is now a leading reference for the AI development community in Japan. It is helping AI founders and enterprises build the next generation of Japanese language understanding and generation capabilities in LLMs.
To read more about the team’s learnings from operating the Nejumi LLM Leaderboard, see the post “2023 Year in Review from LLM Leaderboard Management | Weights & Biases Japan” (note: the article is in Japanese; use your browser’s translation feature to read it in English). For the live and interactive leaderboard, see the W&B report: “Nejumi LLM Leaderboard: Evaluating Japanese Language Proficiency | llm-leaderboard – Weights & Biases.”
Microsoft for Startups GPU cluster accelerates creation of Weights & Biases Korean LLM benchmark
Building off the success of the Nejumi leaderboard in Japan, the W&B Tokyo team created a Korean LLM benchmark, the “Horani LLM Leaderboard,” to assess the Korean language proficiency of LLMs. Their goal is to help the AI developer community drive improvements in Korean LLM language understanding and generation capabilities. In March 2024, the team leveraged eight Azure Machine Learning NDm A100 instances on the Microsoft for Startups GPU cluster for large batch evaluation of 20 LLMs on two benchmarks: the “llm-kr-eval” dataset, which assesses Korean comprehension in a Q&A format, and MT-Bench, which evaluates generative abilities through prompt dialogs.
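To make the idea of a two-part leaderboard concrete, here is a minimal sketch of how a comprehension score and a generation score could be combined into one ranking. It is not the Horani leaderboard's actual scoring method: the model names, the scores, the score ranges, and the equal weighting are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelResult:
    """Scores for one model (all values here are hypothetical)."""
    name: str
    qa_score: float        # Q&A comprehension score, assumed to lie in [0, 1]
    mt_bench_score: float  # MT-Bench style score, assumed to lie in [1, 10]


def combined_score(result: ModelResult) -> float:
    """Average the two scores after rescaling the generation score by 1/10.

    Equal weighting is an assumption, not a documented leaderboard rule.
    """
    return (result.qa_score + result.mt_bench_score / 10.0) / 2.0


def rank_models(results: list[ModelResult]) -> list[ModelResult]:
    """Return the models sorted from highest to lowest combined score."""
    return sorted(results, key=combined_score, reverse=True)


# Hypothetical results for three placeholder models.
results = [
    ModelResult("model-a", qa_score=0.62, mt_bench_score=7.1),
    ModelResult("model-b", qa_score=0.70, mt_bench_score=5.4),
    ModelResult("model-c", qa_score=0.48, mt_bench_score=8.3),
]

leaderboard = rank_models(results)
for position, result in enumerate(leaderboard, start=1):
    print(f"{position}. {result.name}: {combined_score(result):.3f}")
```

In this toy data, the model with the best Q&A score ranks last once generation quality is included, which is one reason a leaderboard benefits from reporting both dimensions.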
“Amid the difficulty of securing GPUs [in the market], the Azure Startup GPU Cluster Access Program has been extremely helpful,” explains W&B Success Machine Learning Engineer, Keisuke Kamata. “The ability to launch VS Code directly from the GUI after starting Compute instances was particularly convenient. It was also easy to set the GPUs to stop in case of non-activity for a certain period of time, so I was able to perform work without worrying about activation times. At this time, thanks to these features, I was able to diligently conduct experiments on LLM finetuning continuously.”
When starting a leaderboard, the W&B team couldn’t begin with just a single model. The usefulness of an LLM benchmark to AI founders and developers increases with the number of model results. To kickstart the Horani LLM Leaderboard, the Weights & Biases team was able to reserve dedicated GPU time on the MfS GPU cluster to conduct batch benchmarking experiments across a greater number of models without the traditional challenges of needing to access GPUs on-demand and wait for their activation. This enabled the team to efficiently benchmark over 20 LLMs on Korean language tasks for AI developers to evaluate.
As of this writing, benchmarking work on the MfS GPU cluster continues. The Horani LLM Leaderboard is expected to become a critical reference for the Korean AI developer and founder communities in build vs. buy LLM decisions, helping drive the Korean LLM-powered application ecosystem forward. For more details on the “Horani LLM Leaderboard” and updated rankings, see the live report here: Nejumi LLM Leaderboard: Evaluating Korean Language Proficiency | korean-llm-leaderboard – Weights & Biases.
W&B team advises AI founders to prioritize experimentation
Throughout the rapid expansion in LLM development and availability since OpenAI launched ChatGPT in November 2022, the Weights & Biases team and platform have played an active role in enabling AI developers across the world. Should AI developers incorporate top-performing proprietary models (e.g., GPT-4), finetune open-source models (e.g., Mistral-7B), or build LLMs from scratch? With more high-performance LLM choices in 2024, LLM benchmarks such as the W&B team’s “Nejumi LLM Leaderboard” and “Horani LLM Leaderboard” are increasingly critical starting points for AI developers making “build vs. buy” decisions. What does the W&B team advise for AI developers facing this dilemma? Prioritize experimentation.
“As a founder, it’s easy to get very laser-focused on what you’re currently dealing with today and what the business has been built upon, especially in the space of machine learning and A.I.,” Weights & Biases Chief Information Security Officer and co-founder, Chris Van Pelt, tells Microsoft for Startups. He emphasizes the power of curiosity, advising founders to create space for experimentation.
AI founders play a critical role in setting the initial bounds for their team’s successful experimentation by driving specificity for target customers and use cases their ML-powered solution solves for. Continuous experimentation is key for AI startups to innovate with rapid AI developments, and bringing specificity helps with measuring and understanding the outcomes of AI development trials. However, AI teams should not only experiment with which models they select from an LLM leaderboard to start developing with, but also how they align model evaluation with their business goals.
“We believe that there is no single good evaluation for everyone,” shares Akira Shibata, W&B country manager for Japan and Korea. As the capabilities of LLMs improve, a greater range of tests and evaluations is needed to benchmark LLM performance.
For AI founders looking to build or finetune models that align with domain-specific use cases, Akira recommends: “You would want to be more specific and possibly develop evaluation datasets of your own to evaluate your model. One of the things we realized that we could contribute to better understanding LLM performance is that we have this report feature [W&B Tables] that allows you to not just visualize these results, but also allows you to analyze the results interactively to help you understand the context of where these models are.”
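As an illustration of what a small, home-grown evaluation dataset might look like in practice, the sketch below scores a model against a handful of prompt and expected-answer pairs using exact-match accuracy. Everything in it is hypothetical: the dataset, the `toy_model` stand-in, and the choice of exact match as the metric are assumptions, and a real domain-specific evaluation would likely need a richer metric and far more examples.

```python
from typing import Callable


def normalize(text: str) -> str:
    """Lowercase the text and collapse whitespace so trivial differences are ignored."""
    return " ".join(text.strip().lower().split())


def exact_match_accuracy(
    dataset: list[dict[str, str]],
    predict: Callable[[str], str],
) -> float:
    """Return the fraction of examples whose prediction matches the expected answer."""
    if not dataset:
        raise ValueError("dataset must not be empty")
    correct = sum(
        normalize(predict(example["prompt"])) == normalize(example["expected"])
        for example in dataset
    )
    return correct / len(dataset)


# Hypothetical evaluation set; a real one would reflect your own domain.
dataset = [
    {"prompt": "Capital of Japan?", "expected": "Tokyo"},
    {"prompt": "Capital of South Korea?", "expected": "Seoul"},
    {"prompt": "Capital of France?", "expected": "Paris"},
    {"prompt": "Capital of Italy?", "expected": "Rome"},
]


def toy_model(prompt: str) -> str:
    """Stand-in for a real LLM call; returns canned answers for illustration only."""
    canned = {
        "Capital of Japan?": "  tokyo ",
        "Capital of South Korea?": "Seoul",
        "Capital of France?": "Lyon",
    }
    return canned.get(prompt, "")


accuracy = exact_match_accuracy(dataset, toy_model)
print(f"Exact-match accuracy: {accuracy:.2f}")
```

Keeping the per-example prompts, predictions, and expected answers alongside the aggregate score is what makes the kind of interactive, row-level analysis described in the quote possible.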
As the AI space progresses, founders should strongly consider building upon flexible platforms such as W&B to experiment efficiently and adapt their AI capabilities to embrace the excitement of what is coming next.
Are you a current or aspiring AI founder? Sign up today for Azure credits, partner benefits, and technical advisory to accelerate your startup here: Microsoft for Startups Founders Hub. You can get started with Weights & Biases on the Azure Marketplace here.