When you’re scaling your startup, spreadsheets and manual file tracking just aren’t efficient enough. Azure Machine Learning enabled Sciffer to implement MLOps, scale our model training, track our versions, and automate our processes to improve our product continuously and grow our company.
Sciffer, incorporated in 2016, is a technology company with expertise in artificial intelligence, machine learning, and deep learning. We build innovative solutions for media and entertainment companies, helping them automate mission-critical operations, increase efficiency, maximize viewership and revenues, and ultimately gain a competitive advantage.
Having started our journey with developing content optimizers and schedulers for television broadcasters, we’ve recently launched our flagship product, Reflexion.ai. We built this AI-powered content analytics and collaboration platform to automate metadata extraction and facilitate remote partnerships.
Reflexion.ai is a unique product that combines collaboration, AI-based tagging, and storyboarding in a single web-based Software-as-a-Service (SaaS) application. We help media companies connect everything from task management and review to collaboration, storyboarding, and content metadata extraction and compliance detection using a single easy-to-use tool.
Our tagging and smart search features help users dig into the video content based on search criteria. Media companies can use these search results to run targeted ads for their advertisers. This capability also helps them create a better sales pitch.
Reflexion.ai can also sort pictures and videos by the people who appear in them. A critical application is automatically detecting compliance issues in content, such as smoking, drinking, and violence.
We’ve integrated with various external applications to ensure a seamless user experience. Key integrations include cloud storage services such as Dropbox, Azure Blob Storage, and Amazon S3 for hassle-free media movement at various stages. We’ve also integrated with editing software such as After Effects, Premiere Pro, Photoshop, and Final Cut Pro so users can collect feedback, manage approvals, and handle files from anywhere. Reflexion.ai also works with Zapier, further facilitating integration with over 150 project management tools, including Jira and Slack.
Probably our most significant partnership is with Azure Machine Learning. Various Azure Machine Learning tools have helped us gain efficiency and scale our operations. We’ll discuss them in depth throughout this article.
Why We Use Azure Machine Learning
Building machine learning models and improving them is an iterative process. This method differs from traditional software development, where the outcome is primarily deterministic. Building and maintaining machine learning models has unique challenges involving standardization, model reproducibility, reusability, and versioning. Azure Machine Learning provides rich, consolidated capabilities for model training, versioning, and deployment.
Azure Machine Learning helps us boost our machine learning productivity with automated ML, operationalize at scale with MLOps for managing multiple model lifecycles, and build responsible machine learning solutions. It also gives us an open, flexible platform that works across development tools, languages, and frameworks while maintaining advanced security and governance.
When we started with Azure Machine Learning, we first turned to its documentation for guidance. The documentation provides an overview of Azure Machine Learning, explains how to set up resources and architecture, and got us started with Python. We also found it helpful to explore Azure Machine Learning’s capabilities and review the developer resources.
Building Reflexion.ai: Implementing MLOps
When we started developing Reflexion.ai, we had much to learn. It was our first time building an AI-enabled SaaS application. The journey helped us improve our practices, significantly increasing the efficiency and effectiveness of our work using MLOps.
MLOps builds on the principles and practices of DevOps that increase workflow efficiency, combining practices and philosophies for running applications with machine learning models in a production environment.
Azure Machine Learning provides several MLOps capabilities that enable us to follow the machine learning pipeline efficiently. For example, they help us create reproducible pipelines, develop reusable software environments for training our models, and deploy our models from anywhere—a key component of Reflexion.ai’s collaborative nature.
Using Azure Machine Learning and the MLOps Pipeline
As a startup, we had plenty of learning and growing to do in our early days — especially with machine learning. Let’s examine some events in our journey where we had to adapt and redefine our practices and explore how we leveraged the power of MLOps in Azure Machine Learning to improve the consistency and efficiency of our AI models.
Preparing and Managing Our Data
Today, we have more than 20 deep learning audio and vision models in Reflexion.ai. We tune these models on many datasets for best-in-industry performance. In the beginning, though, the team worked with only a handful of models, so we managed the datasets manually, organizing the data in folders and then using it for training. As the number of models grew, that original process became wildly inefficient. We knew we had to change our methods drastically and introduce automation to keep up.
Data Validation
Critical to any model’s success is the quality of its dataset. Instead of relying on ready-made data collections, we collect data manually for Reflexion.ai to better replicate real-world use cases. We also have firm guidelines in place to ensure that the collected data is usable.
This data requirement sparked the need for a validation exercise to weed out abnormalities early in the process. This exercise helps us avoid backtracking later when a model fails accuracy tests. Since manually validating every data point is tedious, we explored options to automate the process and realized Azure Machine Learning could help.
Azure Machine Learning helps us develop scripts that automate initial dataset checks. We also benefit from shared notebooks and datasets, making it easier for multiple developers to work in parallel. This workflow assures us that the dataset is always in good shape and frees our developers to focus on other parts of the operation whenever optimization is needed.
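As a rough illustration (not our exact production checks), a validation script might scan an incoming annotation manifest and flag records that would break training. The file layout, column names, and checks below are hypothetical.

```python
import pandas as pd

# Hypothetical annotation manifest: one row per labeled clip.
REQUIRED_COLUMNS = {"file_path", "label", "duration_seconds"}

def validate_manifest(csv_path: str) -> list[str]:
    """Return a list of human-readable problems found in the dataset manifest."""
    problems = []
    df = pd.read_csv(csv_path)

    missing = REQUIRED_COLUMNS - set(df.columns)
    if missing:
        problems.append(f"Missing columns: {sorted(missing)}")
        return problems  # The remaining checks need these columns.

    if df["file_path"].duplicated().any():
        problems.append("Duplicate file paths found.")
    if df["label"].isna().any():
        problems.append("Rows with empty labels found.")
    if (df["duration_seconds"] <= 0).any():
        problems.append("Clips with non-positive duration found.")

    return problems

if __name__ == "__main__":
    issues = validate_manifest("annotations.csv")
    if issues:
        raise SystemExit("Dataset validation failed:\n" + "\n".join(issues))
    print("Dataset validation passed.")
```

A script like this can run as a pipeline step before any training job is queued, so bad data never reaches the model.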
Dataset Versioning
Data collection is an inherently iterative process, and in turn, the datasets are constantly evolving. Especially for non-textual data, the format of data fed to the model can change the model’s performance. So, our team frequently experiments with different versions of training data to determine the best fit for the model. This experimentation requires a robust procedure for handling different data versions and tracking their performance.
We started this experimentation in a rudimentary fashion by creating multiple folders on a single server, an approach that was never going to survive the test of time. We’ve since stopped doing this and are migrating the whole operation to Azure Machine Learning.
Azure datastores help us create dataset versions on the fly, with proper descriptions attached so we can use them easily and document the results. It’s reduced our costs, too.
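As a simplified sketch of how dataset versioning works with the Azure Machine Learning Python SDK (v1): registering a dataset under an existing name with create_new_version=True adds a new version rather than overwriting the old one. The datastore path and dataset name below are hypothetical.

```python
from azureml.core import Workspace, Datastore, Dataset

ws = Workspace.from_config()  # Reads the config.json downloaded from the workspace.
datastore = Datastore.get(ws, "training_media")  # Hypothetical datastore name.

# Point a FileDataset at the latest batch of labeled clips.
dataset = Dataset.File.from_files(path=(datastore, "smoking-detection/v3/**"))

# Registering under an existing name with create_new_version=True
# adds version N+1 instead of overwriting previous versions.
dataset = dataset.register(
    workspace=ws,
    name="smoking-detection-clips",
    description="Adds 500 low-light clips collected in March.",
    create_new_version=True,
)

# Later, any specific version can be retrieved for a training run.
v2 = Dataset.get_by_name(ws, name="smoking-detection-clips", version=2)
```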
We initially lacked an interface for viewing the datasets quickly whenever needed. Now, with the MLflow integration in Azure Machine Learning, our developers can easily switch between dataset versions to study the differences. The user interface (UI) removes ambiguity from our operations while helping us compare performance changes.
Analyzing Data
During model development, we perform various analyses to evaluate which attributes to focus on. We’d have to repeat these analyses for every version of the dataset and every iteration of the process. To reduce the manual overhead, we use MLflow and Azure Databricks to automate the process.
Training Our Models
When we first started, refining the training process was crucial to improving our efficiency and handling our machine learning models better. Our first step was to create scoring files and dependency definitions to adequately document training and ensure all items in our MLOps pipeline would meet regulatory requirements. Combined with Azure Machine Learning’s automated processes, this transparency helped us quickly create accurate models and understand how each model was built.
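For context, Azure Machine Learning’s deployment workflow expects an entry (scoring) script that exposes init() and run() functions: the service calls init() once when the container starts and run() for each request. This is a minimal sketch; the model file name and input format are placeholders, and it assumes a model exported to ONNX.

```python
import json
import os

import numpy as np
import onnxruntime  # Assumes an ONNX export; swap for your framework of choice.

session = None

def init():
    # AZUREML_MODEL_DIR points at the registered model's files inside the container.
    global session
    model_path = os.path.join(os.environ["AZUREML_MODEL_DIR"], "model.onnx")
    session = onnxruntime.InferenceSession(model_path)

def run(raw_data):
    # Expects a JSON payload such as {"data": [[...frame features...]]}.
    data = np.array(json.loads(raw_data)["data"], dtype=np.float32)
    input_name = session.get_inputs()[0].name
    result = session.run(None, {input_name: data})
    return {"predictions": result[0].tolist()}
```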
Environment and Resource Management
We work with many Python libraries, including PyTorch, TensorFlow, OpenCV, fastai, PySceneDetect, librosa, and pydub. Working with all these libraries inevitably led to compatibility issues. Sorting these issues out required manual testing, installing different versions of the libraries by trial and error. This manual testing is time-consuming and, frankly, a waste of a developer’s time. Azure Resource Manager has helped us save time by automating the testing and selecting the best-fitting set of versions, which we then package into an environment.
Azure Machine Learning also helps by creating packages of the libraries and managing what to install on a server. Now, we can quickly create environments using presets. This approach reduces the setup time drastically and lets our developers concentrate on development instead of spending time setting up the environment every time. It also helps us scale up Reflexion.ai without worrying about optimizations.
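A minimal sketch of how a reusable environment can be defined with the Azure Machine Learning Python SDK (v1); the environment name and pinned package versions here are illustrative, not our actual presets.

```python
from azureml.core import Environment, Workspace
from azureml.core.conda_dependencies import CondaDependencies

ws = Workspace.from_config()

# Define the Python packages once; Azure ML builds and caches a matching
# image so every training run starts from an identical environment.
env = Environment(name="vision-training-env")  # Hypothetical environment name.
env.python.conda_dependencies = CondaDependencies.create(
    python_version="3.8",
    pip_packages=[
        "torch==1.12.1",
        "torchvision==0.13.1",
        "opencv-python-headless",
        "librosa",
    ],
)

# Register the environment so it can be reused (and versioned) across projects.
env.register(workspace=ws)
```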
Experiment Management
Three major factors determine the output of any model:
- The model code
- The training set
- Model hyperparameters
Tuning and finalizing a model requires running multiple experiments across different combinations of these factors’ values and versions.
Model Versioning
The best way to experiment with and improve a model is to use version control, which lets us roll back changes easily if performance drops. The simplest option is a Git version control system. However, Git is designed mainly for source code, and models are often large files, which makes them a poor fit for Git versioning. We also need to track the training hyperparameters alongside the code.
Integrating with Azure Machine Learning helps us significantly in managing model versions. We can work with large models and easily adjust the hyperparameters.
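As a hedged example of how this works in practice, registering a model under an existing name in the Azure Machine Learning model registry creates a new version automatically, and tags can carry the training hyperparameters alongside it. The model name, file path, and tag values below are hypothetical.

```python
from azureml.core import Workspace
from azureml.core.model import Model

ws = Workspace.from_config()

# Registering under the same name bumps the version (v1, v2, ...) automatically.
model = Model.register(
    workspace=ws,
    model_name="face-recognition",              # Hypothetical model name.
    model_path="outputs/face_recognition.pt",   # Local file or folder from a run.
    tags={"framework": "pytorch", "lr": "1e-4", "epochs": "40"},
    description="Fine-tuned on the v5 face dataset.",
)
print(model.name, model.version)

# Rolling back is a matter of fetching an earlier registered version.
previous = Model(ws, name="face-recognition", version=model.version - 1)
```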
Experiment Documentation and Versioning
Initially, we documented our experiments and their versions using elaborate Excel spreadsheets. But as we started scaling up, the challenges increased exponentially. Rerunning an experiment again and again while tweaking a few factors entailed plenty of setup and documentation. Doing it at scale was challenging, and doing it efficiently was impossible. Now, we perform these reruns seamlessly using Azure Machine Learning studio.
Experiment versioning lets us save each iteration of an experiment at any step of the model-building lifecycle, so rerunning the same configuration while tinkering with the code, parameters, or training set becomes effortless.
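A minimal sketch of how a configured run can be submitted, and then resubmitted with different settings, using the Azure Machine Learning SDK (v1); the script name, compute target, and arguments are placeholders.

```python
from azureml.core import Environment, Experiment, ScriptRunConfig, Workspace

ws = Workspace.from_config()
env = Environment.get(ws, name="vision-training-env")  # Environment registered earlier.

# Each submission is recorded as a run under the experiment, so rerunning the
# same script with different arguments keeps a full, comparable history.
config = ScriptRunConfig(
    source_directory="./training",       # Hypothetical folder containing train.py.
    script="train.py",
    arguments=["--dataset-version", "3", "--lr", "1e-4", "--epochs", "40"],
    compute_target="gpu-cluster",         # Hypothetical compute cluster name.
    environment=env,
)

experiment = Experiment(workspace=ws, name="smoking-detection")
run = experiment.submit(config)
run.wait_for_completion(show_output=True)
```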
Experiment Evaluation and Tracking
Evaluating experiments on the go helps us quickly plan further iterations. The MLflow interface helps our developers collaborate and track each experiment’s performance. It also enables us to compare every version and epoch side by side to determine the best-performing model from the experiments we run.
Our team can rerun the same model code and training data with various configurations to determine the optimal parameter values. The collaboration features help us track the changes we made, ensuring we don’t rerun the same experiments.
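A hedged sketch of how a training run can log its parameters and metrics to the workspace through MLflow so runs show up side by side for comparison; the experiment name, metric names, and the stubbed training function are illustrative.

```python
import mlflow
from azureml.core import Workspace

ws = Workspace.from_config()

# Point MLflow at the Azure Machine Learning workspace so runs appear in the studio.
mlflow.set_tracking_uri(ws.get_mlflow_tracking_uri())
mlflow.set_experiment("smoking-detection")

def train_one_epoch(epoch: int) -> float:
    """Placeholder for the real training loop; returns a fake validation accuracy."""
    return 0.5 + epoch * 0.01

with mlflow.start_run():
    # Log the factors that define this run: data version and hyperparameters.
    mlflow.log_param("dataset_version", 3)
    mlflow.log_param("learning_rate", 1e-4)
    mlflow.log_param("epochs", 40)

    for epoch in range(40):
        val_accuracy = train_one_epoch(epoch)
        mlflow.log_metric("val_accuracy", val_accuracy, step=epoch)
```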
Experiment Scheduling
MLflow helps us schedule and queue experiments and gives us options to distinguish various priority levels for each model. This approach has significantly automated our process, so our developers don’t have to spend time deciding which experiments to run and in which order. The automation helps our development teams work more efficiently and eliminates hiccups and miscommunications about model prioritization.
Validating and Deploying Our Models
CI/CD Testing Pipeline
When we first started with a small team and only a few models, it was easy to deploy them whenever they were production-ready. We were also able to modify the models when needed. As the number of models grew, we chose to automate our testing process using Azure’s continuous integration and continuous deployment (CI/CD) pipelines.
A CI/CD pipeline is a deployment pipeline integrated with automation tools for improved workflow. Run properly, these pipelines minimize manual errors and tighten the feedback loops throughout the software development lifecycle, enabling teams to deliver smaller releases more frequently.
Now, we can automatically source artifacts from the Azure Machine Learning model registry and link them with the pipeline, while ensuring that updates to production code receive appropriate approvals before release.
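As a hedged illustration of the artifact-sourcing step, a CI job can pull the latest registered model from the registry before packaging it for deployment; the model name and output directory here are hypothetical.

```python
from azureml.core import Workspace
from azureml.core.model import Model

ws = Workspace.from_config()

# Without an explicit version, Model() resolves to the latest registered version,
# which is typically what a CI job wants to package and test.
model = Model(ws, name="face-recognition")
print(f"Fetched {model.name} v{model.version}")

# Download the model files into the build workspace for containerization and tests.
model.download(target_dir="./artifacts", exist_ok=True)
```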
Using Triton Inferencing Server
When moving from experimentation to production, one of the most challenging tasks is serving deep learning models quickly and easily. Serving deep learning models involves taking a trained model and making it available to support prediction requests. When serving in production, we need to ensure our environment is reproducible, secure, and isolated.
One of the easiest ways to serve deep learning models is by using Nvidia’s Triton Inference Server. Triton Inference Server supports all major frameworks, including TensorFlow, TensorRT, PyTorch, ONNX Runtime, and even custom framework backends. It can perform real-time inference, batch inference to maximize GPU or CPU use, and streaming inference.
In Azure Machine Learning, Triton is available as a Docker container. Microsoft Azure Kubernetes Service (AKS) also supports Triton, making deployment easier.
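Once a Triton endpoint is running, clients can send inference requests over HTTP using NVIDIA’s tritonclient package. This is a minimal sketch; the endpoint URL, model name, and tensor names and shapes are hypothetical and depend on the deployed model’s configuration.

```python
import numpy as np
import tritonclient.http as httpclient

# Connect to a running Triton endpoint (URL is a placeholder).
client = httpclient.InferenceServerClient(url="my-triton-endpoint:8000")

# Prepare one 224x224 RGB frame as the model input.
frame = np.random.rand(1, 3, 224, 224).astype(np.float32)
inp = httpclient.InferInput("input__0", list(frame.shape), "FP32")
inp.set_data_from_numpy(frame)

# Request inference from a hypothetical compliance-detection model.
response = client.infer(model_name="compliance_detector", inputs=[inp])
scores = response.as_numpy("output__0")
print("Class scores:", scores)
```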
Autoscaling GPU Computation
Most of the tasks in our deep learning models are resource-intensive and require plenty of computational power. GPUs have been an effective way for us to accelerate deep learning and AI workloads such as computer vision, conversational AI, and natural language processing (NLP), so we use them to speed up our programs.
For operations like matrix multiplication that can be processed in parallel, GPUs are significantly faster than CPUs. But for operations dominated by irregular, index-based memory access, CPUs often outperform GPUs. Placing each block of code on the right hardware therefore helps improve the model’s performance.
Optimizing a deep learning model for the GPU has become easy for us now that GPU-optimized packages and libraries like PyTorch and TensorFlow are available. We can use these libraries to improve model performance.
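As a simple sketch of hardware placement in PyTorch (the model and input batch here are placeholders): move the model and tensors to the GPU when one is available, and bring results back to the CPU for index-heavy, sequential work.

```python
import torch
import torchvision

# Use the GPU when one is available; otherwise fall back to the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Placeholder model and input batch for illustration.
model = torchvision.models.resnet18(weights=None).to(device)
frames = torch.rand(8, 3, 224, 224, device=device)

with torch.no_grad():
    logits = model(frames)                 # Parallel-friendly math runs on the GPU.

top_classes = logits.argmax(dim=1).cpu()   # Move results back for CPU-side logic.
print(top_classes)
```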
Many GPU-optimized virtual machines are available on Azure. These specialized virtual machines come with single, multiple, or fractional GPUs. We use ND A100 VMs to train computer vision models for tasks like object detection, face recognition, and place identification. We use NCasT4 VMs to deploy and run inference for all the vision models, as they’re excellent for serving AI workloads like real-time inference.
Conclusion
Azure Machine Learning has several strong use cases for media and entertainment companies across the entire value chain of pre-production, production, and post-production, including OTT/digital platforms, TV broadcasters, advertisers, agencies, and even small-time photographers and videographers. Use cases include getting content reviewed and approved, receiving feedback on the content, smart searching to find relevant sections of the content based on AI tagging, creating a compelling sales pitch for advertisers and agencies, and even encouraging quick storyboard creation to prevent back and forth with the editing team.
There are already products out there for remote collaboration. Still, with the support of Azure Machine Learning and the MLOps pipeline, we’ve been able to push boundaries and gain confidence in our models’ performance. We’ve learned valuable lessons during our expansion, particularly about the importance of automation, consistency, and replicability. And while we’ve had to make many expensive changes to our architecture, Azure Machine Learning has helped us create a robust platform that’s simple to implement and easy to use as a media company working with content. And, with our machine learning models and AI continuously improving, we’re confident that Reflexion.ai will continue to expand.
While we were expanding Reflexion.ai’s capabilities, Azure Machine Learning enabled our growth through automation and efficiency. If you’re ready to scale your startup, consider joining Microsoft for Startups Founders Hub as we did to gain personalized technical resources and business tools to reach your next milestone.