Blueprint of a Machine Learning Project

Santiago M. Quintero
6 min read · Jan 27, 2020



Every happy ML project starts with good data; every unhappy project is unhappy in its own way.

You have a dataset, and you start to wonder whether it could serve to implement an ML feature or project. There are three questions I begin by answering:

  1. Is the data size sufficient? Personally, when a dataset reaches 100K records, I start to wonder whether it could serve an ampler purpose than the one it was originally created for. 10K items may be sufficient if there is a good enough motivation, and although something could be done with 1K items, I would argue the use case is not common enough to be worth it.
  2. Is the data growing? Static datasets may work great for academia, but in business applications you want a constant supply of data for continuous improvement, and you’ll also want to ensure the rate of growth is increasing. Otherwise, why invest in it?
  3. Is the data structured interestingly? Look at the data features (columns): how many are numeric? How many are categorical? Do any contain a date, and if so, how many? Then run descriptive queries on the data: find the mean, standard deviation, max-min & median values. Count the unique values and compute the correlation matrix. Make some plots: histograms, line charts & scatter plots (see the sketch after this list).
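
As a rough illustration, that first pass could look like the following pandas sketch (the file name dataset.csv and its columns are placeholders for your own data):

```python
# A minimal first pass over a tabular dataset with pandas.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("dataset.csv")  # placeholder file name

# Feature types: how many numeric, categorical & datetime columns?
print(df.dtypes.value_counts())

# Descriptive statistics: mean, standard deviation, max-min & median (50%).
print(df.describe())

# Unique values per column and the correlation matrix.
print(df.nunique())
print(df.corr(numeric_only=True))

# Quick plots: histograms for every column, plus one scatter plot.
df.hist(figsize=(10, 8))
numeric = df.select_dtypes("number").columns
if len(numeric) >= 2:
    df.plot.scatter(x=numeric[0], y=numeric[1])
plt.show()
```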

Good, you’ve determined the quality of the data. Now let’s assess the infrastructure. Important questions to answer are: Where is the data stored? What format is it in? How do I query and retrieve it? How is it collected? What size is each record? Is it replicated? What is its value? I like to answer these and related questions with a flow diagram; based on it, I can anticipate how expensive the Data Engineering will be.

Drafting Business Requirements

On a small subset of occasions, the feature to be implemented predates the data. It is more common, though, for it to be the other way around: different features are evaluated based on the data.

I like to divide ML features into two groups: features that generate growth & features that generate profit.

Growth features comprise everything your product cannot do yet, plus things it does that could be improved. Profit features share common characteristics: they reduce costs, increase productivity, and automate processes.

The process of generating possible features is best carried out through brainstorming. It’s a good idea to include other engineers and team members; technical stakeholders and top customers (evangelists) can also contribute greatly. I’ve found that informally involving friends and family in the process can produce amazing ideas as well: never underestimate an outsider’s perspective. At this stage it may not be a good idea to include non-technical people from the organization, as it is not unusual for expectations about AI and ML to inflate; I would, however, take into account any comments shared in previous discussions. Schedule at least a week for this. It can be done through a collaborative document, an email thread, or a Slack channel.


A good framework for evaluating the feasibility of ideas comes from Michael Seibel of Y Combinator: each idea is placed in a cell of a 3-by-3 matrix, with the x-axis classifying how easy the feature’s implementation is expected to be, and the y-axis measuring how much value the feature could potentially provide. The analysis of each feature is served by the intersection between the earlier assessment of the data and the expected output of the ML algorithm. After this analysis, three to five candidates should remain, with the expectation of implementing one or two of them.
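
As a toy illustration of the matrix, each idea can be reduced to two scores from 1 (hard/low) to 3 (easy/high); the ideas and scores below are invented for the sake of the example:

```python
# Hypothetical ideas scored on the two axes of the 3-by-3 matrix.
ideas = {
    "churn prediction":    {"ease": 3, "value": 3},
    "ticket auto-tagging": {"ease": 2, "value": 2},
    "demand forecasting":  {"ease": 1, "value": 3},
}

# Rank by potential value first, breaking ties by ease of implementation.
ranked = sorted(ideas.items(),
                key=lambda kv: (kv[1]["value"], kv[1]["ease"]),
                reverse=True)
for name, scores in ranked:
    print(f"{name}: ease={scores['ease']}, value={scores['value']}")
```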

Now it’s time to design the solution. In practice, most Machine Learning implementations are supervised. Thus, it is good practice to ask yourself: what can I expect to predict or classify from this dataset? This is purposefully an open-ended question, as you never know which path will ultimately take you to the solution. Next, start working on the mathematical representation of each output. This is a good point to start looking at the literature, and also a good time to start keeping your own notes (war stories).

When designing, another good question to ask yourself is: which of these features could I estimate using a linear classification algorithm? If there is any, your project has great potential for success.
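
A minimal way to test that, assuming a tabular binary-classification problem, is to check whether a plain linear model beats a majority-class baseline (here make_classification generates synthetic data in place of your own X and y):

```python
# Sanity check: does a linear classifier beat guessing the majority class?
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for your own features X and labels y.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

baseline = cross_val_score(DummyClassifier(strategy="most_frequent"), X, y, cv=5)
linear = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(f"majority class: {baseline.mean():.3f}")
print(f"linear model:   {linear.mean():.3f}")
```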

Backing your hypotheses with trivial implementations is a good idea; this will expand the insight from the correlation matrix computed earlier. At this point you already have a good idea of how the project will ultimately look, but it is best to complement the analysis with pseudo-code of how you expect the solution to behave. This is the second time I look through the available literature, to find the best algorithm for each case. In a first ML implementation for a dataset, I tend to focus on Linear Regression, Nearest Neighbors, Support Vector Machines & Decision Trees; anything beyond that may be too difficult to implement on a first try. With the design done, let’s move on to estimating and planning the work.
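
A trivial implementation can be as small as cross-validating the four model families named above; this sketch uses scikit-learn with synthetic regression data standing in for your dataset:

```python
# Compare Linear Regression, Nearest Neighbors, SVM & Decision Tree baselines.
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for your own dataset.
X, y = make_regression(n_samples=1000, n_features=10, noise=10.0, random_state=0)

models = {
    "linear regression": LinearRegression(),
    "nearest neighbors": KNeighborsRegressor(),
    "support vector machine": SVR(),
    "decision tree": DecisionTreeRegressor(random_state=0),
}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="r2")
    print(f"{name}: R^2 = {scores.mean():.3f} +/- {scores.std():.3f}")
```

Whichever family wins here is a reasonable starting point for the first real implementation.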

Implementation

Now you have a good idea of what you would like your brand-new feature to be, but before training the algorithm, estimate how long it will take. Machine Learning is not only about Data Science; Data Engineering plays a huge role in it as well. You will require at least three data pipelines. The first feeds an iPython notebook where the data exploration takes place; notebooks are also a good place to train your first algorithms, but it is better if the training and maintenance of the models happen on a dedicated computing instance, which is the second pipeline. The third and final pipeline is where prediction takes place: this can be done through an API, or in the client for fancier solutions. Based on the available infrastructure, evaluate whether you require additional tools for the project, such as extra databases, ETL software, computing instances, etc. After this, you will have a clear panorama of how complex the Data Engineering work will be, and of how much time you can allocate to training the model within the expected timeframe.
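
As a sketch of that third pipeline, the prediction API can be a few lines of Flask; the model file name, endpoint & payload shape below are assumptions made for illustration:

```python
# A minimal prediction endpoint, assuming a scikit-learn model
# was previously saved to "model.joblib" with joblib.
import joblib
from flask import Flask, jsonify, request

app = Flask(__name__)
model = joblib.load("model.joblib")  # placeholder file name

@app.route("/predict", methods=["POST"])
def predict():
    # Expects a JSON body like {"features": [0.1, 2.3, ...]}.
    features = request.get_json()["features"]
    prediction = model.predict([features])[0]
    return jsonify({"prediction": float(prediction)})

if __name__ == "__main__":
    app.run(port=5000)
```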


Personally, before implementing a solution I like to take a deep dive into the available literature. On average, I read about 10 blog posts, 2 academic papers, and 1 technical book. After this study I feel confident, and I jump into development by trying to replicate some of the results and ensembling ideas across the papers.

Despite all this care, I expect one of every two Machine Learning projects to fail.

Deployment of the solution is best done in stages when possible, and I usually try to keep expectations low when communicating the results. It is good practice to accompany that communication with hard facts tied to the goal of the project.

Congratulations, you now have a good template for handling upcoming Machine Learning projects. I like to think of them in eight stages: assessing the data, gathering requirements, feasibility analysis, design, tool selection, implementation, deployment & communication.

Thank you for reading, and do share the practices that work best for you.

Best,
Santiago M.
