MLOps in AWS

Behind the scenes of a continual-improvement AI model

Santiago M. Quintero
17 min read · Jul 30, 2021
Photo by M Poiss on Unsplash

For the last 3 years, I’ve worked in startups as a Software Engineer. My experience ranges from Full-Stack to DevOps to Tech Lead. In parallel, I’ve been studying Artificial Intelligence in all its flavors for the last decade.

So when I joined an AI startup, I was excited to finally merge my two passions. I was hoping to learn more about how AI models are built and trained. But what I was not expecting was also to learn more about Software Development.

The difference between DevOps and MLOps might be only 2 or 3 letters, but in practice it is big. The components are similar: CI and CD, plus Continuous Training. But the possibilities expand as you begin to enter the realm of randomness. And for software engineers:

Going from DevOps to MLOps is similar to what physicists experienced with the discovery of quantum mechanics

What is MLOps?

3 fundamental pillars underpin how MLOps is practiced in the enterprise. The first pillar is interdisciplinary collaboration. Just as DevOps aims to increase the delivery of value through automation and improved communication, MLOps adds two new personas to the mix of software and business people: data scientists and domain experts.

The second pillar is effective data management. While data management is also present in DevOps culture, the amount of work required to make sure the data is sufficient for ML models forces even the smallest startup to build and maintain fairly sophisticated data pipelines.

The third and final pillar is improving accuracy. DevOps aims to decrease errors, preferably through small, gradual improvements. In MLOps, this translates to automating the training and deployment of new models, making them agile and adaptable to diverse business conditions.

What does MLOps look like in AWS?

In a parallel to the Kardashev scale, I foresee 3 different levels of adoption of the MLOps paradigm in AWS. In Phase-I, teams use the cloud to train, deploy, and serve their AI models. Characteristics of this phase include using the UI to manage infrastructure, training models with open-source libraries and architectures, manual testing (if any), and a single instance serving the models without auto-scaling.

Phase-II automates the ML pipeline. Infrastructure is managed as code, with each stage serving a different ML model instance. Tests are automated, updates are deployed through Continuous Delivery pipelines, and monitoring is enabled in production. Additionally, some cloud services are used to improve the efficiency of data scientists, and data is replicated for training and testing purposes.

In Phase-III, the cloud is used to accelerate the training and deployment of ML models. A single pipeline manages the full cycle of training and serving a model, requiring little input from software engineers. The focus turns to creating beneficial interactions between domain experts, business developers, data scientists, and software engineers. Multiple models co-exist in production, A/B testing different hypotheses. Building and training new models is partially automated using efficient data pipelines, Continuous Integration is done through non-deterministic tests, and instances are self-healing in production.

Selective focus photography of succulent plant by Francisco Moreno on Unsplash

SageMaker: the all-in-one AWS MLOps solution

To experience the value that SageMaker brings to an ML team, I’ll walk through the different stages of an ML cycle. At each stage, AWS SageMaker offers distinct alternatives to improve the productivity of Data Scientists and minimize the reliance on Software Engineers.

A new ML project usually starts with conversations between data scientists, domain experts, and business developers. This leads to a feasibility analysis that includes finding and evaluating potential datasets.

1. Data Exploration

The process of feature engineering is composed of repetitive tasks that involve a certain amount of luck. While creativity and domain knowledge may boost a model’s accuracy, AWS offers a radically different alternative: Autopilot, a feature that creates hundreds of models with distinct algorithms and features for tabular data. The results are displayed in a leaderboard that provides valuable insights for building a definitive model.
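
To make this concrete, here is a minimal sketch of launching an Autopilot job through the SageMaker Python SDK; the bucket, file, and label column are placeholders, and the candidate cap is an arbitrary choice:

```python
import sagemaker
from sagemaker.automl.automl import AutoML

session = sagemaker.Session()
role = sagemaker.get_execution_role()  # assumes you run inside a SageMaker notebook

# Launch an Autopilot job against a tabular dataset in S3
automl = AutoML(
    role=role,
    target_attribute_name="churned",   # hypothetical label column
    max_candidates=50,                 # cap how many models Autopilot explores
    sagemaker_session=session,
)
automl.fit(inputs="s3://my-bucket/churn/train.csv")

# Inspect the top entry of the resulting leaderboard
print(automl.best_candidate()["CandidateName"])
```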

Sometimes the amount of data is not enough, and data augmentation can only take you so far. Amazon SageMaker Ground Truth automates data labeling for Computer Vision and Topic Classification problems. For Computer Vision, I’ve also found that Cord has an excellent tool for semantic segmentation and object tracking. The main benefit is that, by using micro-models, Cord can make label predictions from only a handful of labeled images.

Lastly, I want to talk about AWS QuickSight, a Business Intelligence tool that brings ML analysis capabilities. Its easy-to-use interface makes it a perfect tool for business developers and domain experts, and it is usually the place to center conversations between technical and non-technical stakeholders.

2. Model Development

Naturally, AWS has all kinds of setups to minimize the time spent preparing a workspace. But where it really makes a difference is in the ability to run training jobs in parallel. Because AWS offers its own implementations of open-source libraries and ML frameworks, it is possible to use one EC2 instance for development and a fleet of GPU instances to train the models. This can drastically reduce development time and costs, and it creates a smoother experience without dead time.
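
A minimal sketch of that workflow with the SageMaker Python SDK, assuming a training script `train.py` and an S3 dataset path (both placeholders; versions and instance types should be adapted to your setup):

```python
import sagemaker
from sagemaker.pytorch import PyTorch

role = sagemaker.get_execution_role()

# Develop on a cheap notebook instance, then fan the real training
# out to a fleet of GPU instances that SageMaker provisions on demand.
estimator = PyTorch(
    entry_point="train.py",          # your training script
    role=role,
    framework_version="1.8.1",
    py_version="py36",
    instance_count=4,                # distributed training across 4 machines
    instance_type="ml.p3.2xlarge",   # GPU instances, billed only while training
)
estimator.fit({"training": "s3://my-bucket/dataset/"})
```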

Also very impressive is the hyper-parameter tuning. Probably the most mysterious, time-consuming, and even dreadful activity is now partially automatable: using smart meta-heuristics, SageMaker improves the accuracy of a model by testing distinct hyper-parameter sets.
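
As an illustration, a hedged sketch of a tuning job built on the estimator above; the metric name, regex, and parameter ranges are assumptions you would adapt to your own training script:

```python
from sagemaker.tuner import (ContinuousParameter, HyperparameterTuner,
                             IntegerParameter)

tuner = HyperparameterTuner(
    estimator=estimator,                        # the estimator from the previous sketch
    objective_metric_name="validation:accuracy",
    metric_definitions=[{"Name": "validation:accuracy",
                         "Regex": "val_acc=([0-9\\.]+)"}],  # parsed from training logs
    hyperparameter_ranges={
        "lr": ContinuousParameter(1e-5, 1e-2),
        "batch-size": IntegerParameter(16, 256),
    },
    max_jobs=20,           # total training jobs in the search
    max_parallel_jobs=4,   # how many run at once
)
tuner.fit({"training": "s3://my-bucket/dataset/"})
```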

3. Model Evaluation

Once the model is trained, the next step is to decide if it is good enough for production. It is at this stage that the concept of Continuous Integration from DevOps becomes useful. We could set up a pipeline using GitHub Actions to test the performance of the new model. But there is a caveat: in software, changes to code are lightweight, while Deep Learning may require hundreds of gigabytes of data to train new models. Is Git prepared to handle those files? Probably not!

DVC is an open-source version control system for machine learning projects. Using commands similar to Git’s, DVC lets you create a mapping between experiments, the data used, and the results, allowing you to return to any given state of the model at any point in time. DVC integrates smoothly with AWS S3, where the data can be hosted and retrieved easily.
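
For instance, a small sketch using DVC’s Python API to pull the exact data version behind an experiment; the repository URL and tag are hypothetical:

```python
import dvc.api

# Read the exact version of the training data that produced a given
# experiment, straight from the S3 remote that DVC tracks.
with dvc.api.open(
    "data/train.csv",
    repo="https://github.com/acme/churn-model",  # hypothetical repo
    rev="v1.2.0",                                # git tag of the experiment
) as f:
    train_csv = f.read()

# Or just resolve where that artifact lives on the remote
url = dvc.api.get_url("data/train.csv",
                      repo="https://github.com/acme/churn-model",
                      rev="v1.2.0")
```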

Additionally, CML is a library that automates reporting the results of ML jobs to GitHub pull requests. The results can then be shared on Slack, keeping the whole team engaged and fulfilling one of the 3 pillars of MLOps: interdisciplinary collaboration.

4. Deploying to Production

Probably my favorite part is the single line of code that exposes a SageMaker endpoint to make predictions. Of course, there are alternatives, like serving a Docker image from a Lambda or hosting it on an ECS cluster. But empowering data scientists to control the deployment of their models frees the interactions between data scientists and engineers to focus on building and designing the required data pipelines.
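
That one-liner, plus the equally short invocation, looks roughly like this (reusing the `estimator` from the training sketch; `payload` stands in for whatever serialized input your model expects):

```python
# The "single line" that turns a trained model into a live HTTPS endpoint
predictor = estimator.deploy(initial_instance_count=1,
                             instance_type="ml.m5.large")

# Invoking it is just as short
result = predictor.predict(payload)  # payload: your model's input format
```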

“In MLOps, the database is like the heart of the body: it pumps information to the rest of the system.”

AWS also offers Elastic Inference to accelerate the speed of predictions, augmented-intelligence services for the most common AI use cases such as speech recognition and machine translation, and even a Marketplace for third-party models.

5. Performance Monitoring

With auto-scaling and provisioned infrastructure, monitoring efforts can focus on ML tasks: detecting data drift, online learning, and explainability.

Data drift: In theory, data is assumed to be part of a probabilistic distribution, and knowledge of this distribution is used to make predictions. Statistical analysis can be performed to measure whether the inputs given to the model come from the same distribution as the ones used during training. In practice, this can be done by setting up pipelines that test sub-samples of the predictions to ensure accuracy levels remain consistent.
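
A minimal sketch of such a check using a two-sample Kolmogorov-Smirnov test; the synthetic data stands in for your training and production samples:

```python
import numpy as np
from scipy.stats import ks_2samp

def drifted(train_feature, live_feature, alpha=0.01):
    """Two-sample KS test: are live inputs drawn from the same
    distribution as the training data for this feature?"""
    stat, p_value = ks_2samp(train_feature, live_feature)
    return p_value < alpha  # small p-value: the distributions differ

rng = np.random.default_rng(0)
train = rng.normal(loc=0.0, scale=1.0, size=10_000)  # training distribution
live = rng.normal(loc=0.4, scale=1.0, size=1_000)    # shifted production sample
print(drifted(train, live))  # True: the mean has drifted
```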

Online Learning: in life, there are good problems to have, and there are bad ones. The same happens in AI: some problems are nice to solve, and their characteristics include an auto-labeling nature. An example is recommendation systems: the effectiveness of a recommendation can be measured by click rates, bounce rate, time spent, and other relevant metrics after the interaction. This is opposed to problems in Machine Vision, where a human is required to label the image to test the accuracy of the predictions. I warn you, though, that these pipelines take real effort to build; I will talk about them in more detail in the next section on continuous training.

Explainability: ML models were traditionally seen as black boxes. We knew the results of the model, but not the why. The need to explain the main drivers that influence the results usually falls into one of two buckets: compliance and business. Compliance, part of Ethical AI, is especially relevant in industries such as medical diagnosis, self-driving cars, and robotics, where a significant level of control and predictability of deep learning methods is mandatory [1]. While a relatively new area, AWS does have support for SHAP, which uses game theory, and a few other methods through SageMaker Clarify.
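
For a feel of the SHAP side, here is a sketch with the open-source shap library (the same technique Clarify builds on), using an arbitrary toy dataset and model:

```python
import shap
import xgboost
from sklearn.datasets import load_breast_cancer

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
model = xgboost.XGBClassifier().fit(X, y)

# Shapley values attribute each prediction to the input features
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)

# Global view: which features drive the model's predictions overall?
shap.summary_plot(shap_values, X)
```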

6. Continuous Training

Whether you have a Software or Data Science background, this should be the most exciting part for you, because while the components may not be new (data pipelines), the possibilities are endless. With continuous training, one can truly start thinking of self-learning models. The idea is simple: collect the predictions performed by the model, label the data, and feed the new data to the ML training script. Once training is over, if the new model meets the Continuous Integration requirements, deploy it, preferably using the canary strategy. I understand that this might sound a bit abstract, so I will share a sample architecture for your benefit.

Scenario: a personalized newsletter. Xolotl is a startup that delivers the most relevant news based on your interests. Its continuous training pipeline makes it nimble enough to shift recommendations as your interests change, and its crowdsourced data surfaces the best stories for you to learn about new topics.

How it works: using word embeddings, Xolotl stores each of your interests in a 512-dimensional map. Each interest has a strength factor computed from the frequency, recency, and quality of your engagements (clicks and ratings). Every day, before sending a new issue, Xolotl’s bots scour the web for new stories, whose meaning is embedded in the same 512-dimensional map. Then, based on your interactions from the prior day, the weights of your interests are updated and the stories to be served are selected. Using the GPT-3 API, Xolotl summarizes the parts you find most interesting, and the output becomes the content of the newsletter email. Occasionally, Xolotl may ask you to rate some recommendations to get a second data point besides click rate.
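
Since Xolotl is fictional, here is a purely hypothetical sketch of the selection step: score each crawled story by its strength-weighted similarity to the user’s interest embeddings:

```python
import numpy as np

def rank_stories(interests, strengths, stories, top_k=5):
    """interests: (n_interests, 512) embeddings of the user's interests
    strengths: (n_interests,) weights from frequency/recency/quality
    stories:   (n_stories, 512) embeddings of today's crawled stories"""
    # Cosine similarity between every story and every interest
    interests = interests / np.linalg.norm(interests, axis=1, keepdims=True)
    stories = stories / np.linalg.norm(stories, axis=1, keepdims=True)
    sim = stories @ interests.T            # (n_stories, n_interests)
    # Score each story by its strength-weighted affinity to the interests
    scores = sim @ strengths
    return np.argsort(scores)[::-1][:top_k]  # indices of the best stories

rng = np.random.default_rng(1)
picks = rank_stories(rng.normal(size=(8, 512)),   # 8 interests
                     rng.random(8),               # their strengths
                     rng.normal(size=(100, 512))) # 100 candidate stories
```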

Photo by Andrew Johnson on Unsplash

ML Design Patterns for Software Engineers

In the next section, I will introduce a potential roadmap for introducing MLOps in your organization. But since the Software Engineers familiar with DevOps will most likely be the main drivers of change, I find it important to introduce, at a high level, the patterns Data Scientists use to do their job. These patterns are explained thoroughly in the book of the same name by Valliappa Lakshmanan. I highly recommend the book, whether you are new to the field or an expert looking for a reference guide.

Data Representation

There are two common problems with features: too many of them or too few. When dealing with few features, Data Scientists use Feature Crosses to uncover hidden relationships between the features. While theoretically the ML model could learn these relationships on its own, feeding them explicitly can accelerate training or compensate for insufficient data.
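
A tiny illustrative example of a feature cross in pandas (the columns and the demand example are made up):

```python
import pandas as pd

df = pd.DataFrame({
    "day_of_week": ["Mon", "Mon", "Sat"],
    "hour_bucket": ["morning", "evening", "evening"],
})

# A feature cross: the combination carries signal the parts alone miss
# (e.g., "Sat_evening" behaves unlike "Mon_evening" for taxi demand).
df["day_x_hour"] = df["day_of_week"] + "_" + df["hour_bucket"]
crossed = pd.get_dummies(df["day_x_hour"])  # one-hot encode the cross
```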

On the opposite end is too many features, as when considering every distinct word in a Natural Language Processing problem; the solution is to reduce the dimensionality using embeddings. Interestingly, embeddings are also used to visualize human language.
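
A minimal Keras sketch of the idea: an Embedding layer collapses a 10,000-word one-hot vocabulary into dense 16-dimensional vectors (sizes here are arbitrary):

```python
import tensorflow as tf

# Inputs are integer word indices; the Embedding layer maps each of the
# 10,000 vocabulary entries to a learned 16-dimensional vector.
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(input_dim=10_000, output_dim=16),
    tf.keras.layers.GlobalAveragePooling1D(),       # average word vectors
    tf.keras.layers.Dense(1, activation="sigmoid"),  # e.g., sentiment
])
model.compile(optimizer="adam", loss="binary_crossentropy")
```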

Problem Representation

When choosing an architecture for a model, there is a clear taxonomy depending on the nature of the problem. However, this may not cover every single case, and data scientists occasionally chain distinct models to get predictions. When the models are chained one after another, this is called the Cascade pattern; when the models are used simultaneously to arrive at a solution, it is called ensemble learning.
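
A short scikit-learn sketch of the ensemble side, with the cascade idea noted in a comment (the dataset and the three models are arbitrary choices):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Ensemble pattern: three models vote on every prediction simultaneously
ensemble = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("rf", RandomForestClassifier()),
        ("svc", SVC(probability=True)),
    ],
    voting="soft",  # average the predicted class probabilities
).fit(X, y)
print(ensemble.predict(X[:3]))

# A cascade, by contrast, chains models: e.g., a first classifier flags
# "hard" cases and routes them to a specialist model trained only on those.
```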

Model Training & Resilient Serving

In Computer Vision, it is fairly common to use pre-trained architectures to minimize development time. In classification problems, this is done by replacing the top layers with the new labels and retraining them with your data. The same happens with word embeddings in NLP and, most recently, in tabular data with TabNet.
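
A hedged Keras sketch of that recipe: freeze a pre-trained backbone and retrain only a new classification head (the 5-label head is a placeholder for your own classes):

```python
import tensorflow as tf

# Transfer learning: keep a pre-trained backbone, replace the top layers
base = tf.keras.applications.MobileNetV2(
    input_shape=(224, 224, 3), include_top=False, weights="imagenet")
base.trainable = False  # freeze the pre-trained weights

model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(5, activation="softmax"),  # your 5 new labels
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```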

Another pattern, keyed predictions, may be used both when training and when serving the model. The idea is that instead of using the network to make predictions, results are stored in a dictionary keyed by the inputs that lead to them. This is especially useful to augment data in time-series problems or to make predictions on low-compute devices.
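
A toy sketch of the lookup idea as described above; the “model” and its inputs are invented purely for illustration:

```python
# A toy "model" plus a lookup table of its outputs, keyed by input.
def model_predict(temperature_bucket: int) -> str:
    return "high_demand" if temperature_bucket >= 25 else "normal_demand"

# Offline: precompute predictions for every input we expect to see
prediction_cache = {t: model_predict(t) for t in range(-10, 46)}

# Online: serving is a dictionary lookup, no model inference required
def serve(temperature_bucket: int) -> str:
    return prediction_cache.get(temperature_bucket, "unknown")

print(serve(30))  # "high_demand"
```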

There is a second alternative for making predictions on IoT and mobile devices, called quantization: the weights of the network are quantized, reducing the number of bits used to make each prediction. GPT-3, for example, is offered in four different models, each trading accuracy against speed and cost.
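
A back-of-the-envelope sketch of int8 weight quantization in NumPy, just to show the mechanics:

```python
import numpy as np

def quantize_int8(weights: np.ndarray):
    """Map float32 weights onto 8-bit integers plus one scale factor."""
    scale = np.abs(weights).max() / 127.0
    q = np.round(weights / scale).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

w = np.random.randn(1000).astype(np.float32)
q, scale = quantize_int8(w)
# Small rounding error, 4x less memory than float32
print(np.abs(w - dequantize(q, scale)).max())
```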

Reproducibility

Given the probabilistic nature of ML, data scientists face diverse challenges in reproducing the training of their models. Training usually follows Gradient Descent, where the initial conditions can be random; to overcome this, a seed is used to fix the random state at the beginning of training. Another aspect to consider before training is using the same ordered set of data.
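
A common seed-pinning helper, sketched here for PyTorch-based training (adapt the calls to whatever framework you use):

```python
import random
import numpy as np
import torch

def set_seed(seed: int = 42):
    """Pin every source of randomness so training runs are repeatable."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)  # no-op when no GPU is present

set_seed(42)  # same seed + same ordered data => the same training run
```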

Data scientists also have to deal with changes in the structure of the data; various alternatives to handle this are exposed in the Bridge pattern. Finally, one has to consider reusing the code written during data preprocessing; otherwise, duplication of code between development and production can introduce errors. The latter is solved using a Feature Store connected to the various data pipelines. Data Engineers used to working with Airflow might want to check out Kubeflow, a dedicated toolkit for Machine Learning with components that integrate with EKS and SageMaker.

A final benefit of the Feature Store is that it is also accessible to Domain Experts, who can review how the inputs are used to make predictions.

Responsible AI

The book saves the best for last, covering 3 aspects of responsible AI: heuristics, explainability, and fairness. Heuristics involves creating simple representations of how the model makes predictions; this is especially important for decision-makers who may not be well-versed in Data Science terminology. When an ML model is meant to replace previously human-performed actions, the recommended approach is to find a baseline against which to compare the model’s effectiveness.

Concerning fairness, the problem is different. ML models are a representation of the data used to train them, but what happens if the model learns unfair biases against a particular gender or ethnicity? Then data scientists need to use what-if and other frameworks to ensure the model is not prejudicial against particular sectors of society.

Photo by Shreejan Shrestha on Unsplash

Implementing MLOps in your organization

Before you jump into implementing an MLOps culture in your organization, a final warning: this guide assumes your startup is already using AI. If that’s not the case, I recommend the magnificent playbook by Andrew Ng on how to introduce AI to transform an organization.

“Between now and 2030, AI will create an estimated $13 trillion of GDP growth.” [2]

The five-step plan to implement MLOps:

  1. Map the data that every department requires to run its operations. Understand why they use it and how they use it.
  2. Detect the main pain points of the current ML training cycle. Make fast improvements to it using familiar DevOps solutions.
  3. Prototype small experiments with SageMaker to understand the challenges the Data Science team will face in their adoption of MLOps.
  4. Engage the stakeholders to buy into the investment that the transition will require.
  5. Reshape the communication across the different teams to align the new, faster development with strategic business and commercial goals.

Beyond MLOps

Engaging in this exercise made me reflect on the implications of MLOps for the startup community. I intended to direct this article at Software Engineers, yet I will share some thoughts that arose during my research and that I did not pursue further in order to stay technical.

Tesla over Toyota: the fall of Agile?

Like every branch of Agile, DevOps originated in Toyota’s lean manufacturing system. But this poses an interesting question in 2021:

Why did the Toyota Prius fail where the Tesla Model 3 succeeded?

After all, the Toyota Prius was an efficient, fast, gradual improvement in technology that was supposed to bridge the adoption of electric cars. Tesla, on the contrary, was an aggressive gamble tied to a grandiose vision fully defined from the start. In short, Tesla is everything Agile is not supposed to be.

One characteristic commonly credited for Tesla’s skyrocketing growth is vertical integration. From manufacturing to its commercial operations, Tesla reimagined how things should be. Perhaps automakers had grown lazy by outsourcing critical operations. Could this have implications for open source?

A second interesting characteristic is how lithium batteries came to be used. It was thanks to the previous work of Tesla’s founders, Martin Eberhard and Marc Tarpenning, in consumer electronics that they were aware of the technology’s potential. Most of their early work involved making the technology safe for driving. My hunch is that the success of AI will rest more on domain experts than on anyone else. Hot startups will be led by 40-year-olds who have worked for a decade in boring industries, rather than by 20-year-old tech geniuses.

The third factor in Tesla’s rise is the superhero figure of Elon Musk. It’s funny: Agile depends on empowering ordinary workers to reduce production flaws. At its most extreme, Agile is practiced by self-organizing teams, where the leader is more of a servant. Elon is the opposite: a visible leader who commands his troops to fulfill his mission. Deeply technical and broadly knowledgeable, he can talk in depth about engineering or tweet virally. Yet Tesla’s rise has been filled with errors, delays, and criticism. Again, the opposite of Agile.

I remember Sam Altman sharing that it is easier to build something hard in Silicon Valley than something easy: otherwise, you will not get the attention, help, and resources to succeed. I write this as Tesla installs its first GigaPress and releases a subscription model for self-driving cars. I’m guessing there is no room for the conservative, practical, wise advice of Agile at Tesla’s HQ.

The implications for MLOps of a consequential fall of Agile, driven by Tesla’s gutsiness, remain to be seen. Remember that Lean Manufacturing was only discovered because Toyota shared its secrets with its suppliers to improve their productivity. But if Tesla is vertically integrated, who will leak the secrets of its success to the world and to the Software Industry?

As interdisciplinary collaboration grows, who will guide it?

In Graph Theory, there is a phenomenon that occurs in fully connected graphs: as the number of nodes grows, the number of edges grows even faster (a fully connected graph with n nodes has n(n−1)/2 edges, so edges grow quadratically).

As I start to look at the number of different people involved in an MLOps team, I see no leadership. Worse, I see no one who can step up, grasp the full picture, and make decisions. And that can lead to very unproductive teams.

Back in 1900, when the first MBA was granted, that person was trained and educated to lead a business. But as I look at the landscape of the technological revolution, I see the average MBA unqualified to make the technical decisions that can impair or advance a business.

Even the name feels antiquated. Masters? Definitely not! Business? Etymologically it derives from the word busy, and we all know that being busy is not the same as being helpful. How many managers do we see with schedules packed with meetings, yet adding zero value? And finally, Administration? We are no longer in a world of oil rigs and railways to manage. It is a long way from administering to innovating.

So who should lead an MLOps team? Well, many things about the MBA are still appealing: the exclusivity (selecting the best), its interdisciplinary nature, the two-year full immersion. I propose the LEG: Leaders in Enterprise Growth.

With a background in Software Engineering and work experience in Machine Learning, the best candidates, those with the top interpersonal skills, could learn about business, marketing, and finance, and be prepared to guide interdisciplinary teams at startups, at multinationals, or as consultants.

The implications of AI in society and economics

Working at an AI startup has been mesmerizing. It’s impressive to learn how many of our skills are outdated, close to obsolete. There is no way we can compete against the computing power of an AI model. Nor are we as creative as we like to believe. And with the amount of information a computer can retrieve, it’s just a matter of time before they start making better decisions than us.

And it led me to wonder: what will happen to society? Not from a dystopian view, but from a pragmatic, economic perspective. As I meditated on this, I remembered Marx and the fear that capital would enslave the proletariat. How hopeful we were, wishing for a revolution that would lead to a fair, just, equal society.

I see something similar happening: a social class, the technocrats, who will harness the power of AI for profit, and a second class that is left behind, losing important middle-class occupations such as physicians, lawyers, engineers, and more.

Naturally, there are only two alternatives in this ever-wider AI world. The first, utopian, embodying the principles Marx espoused, would create a universal income, redistributing wealth through taxes. As ideal as this may sound, it is probably unlikely; instead, we are shifting to a more stratified society where, at least in the short term, economic mobility will be scarce.

But I don’t want this to sound like a bad outcome. After all, obesity is a greater issue in developed countries than hunger. Many of the changes Marx predicted failed to materialize (perhaps partly because of him), and workers, the proletariat, were left with universal education, 40-hour workweeks, longer lives, and many conveniences that technological progress brought.

In this regard, it will be interesting to see what institutional changes AI brings. For example, schools and childcare were partly created to free workers’ time and produce an educated population. That standardization came at the expense of individualism, creativity, emotional development, and family bonds.

I foresee no shortage of demand for human labor, because it is easier to automate skills that depend on logic than those that depend on kinesthetic abilities, and the former usually sit at the higher end of the market. But how this will play out in the development of new institutions, human development, and economic progress, I will not share here.

Conclusion

As mentioned, this piece was written for you, my fellow Software Engineer. I wanted to share what MLOps is, how to do it in AWS, and how to implement it in your organization. I took extra time to encourage you to lead this transition, and explained why AI is ultimately the direction technology is going. I want to conclude with one last remark, based on the assumption that AI will radically shift how society operates: please take extra time to consider the impact that your work has on society as a whole. You have the liberty to choose projects that contribute positively to how you imagine our society should be shaped.

Bibliography

Additional consulted resources not linked during the article:

Thank you to the team at ProsperIA, who welcomed me and provided valuable insights during the writing of this essay. Every month I publish a piece about startups and AI; follow me if you would like to read more. And if you enjoyed the story, or it provided value to you, I will appreciate your claps.

Best,
Santiago M.
