From the course: Artificial Intelligence and Business Strategy

Building a robust data pipeline

- Data is to AI what fuel is to a fire: the cleaner, richer, and larger the dataset on which it is trained, the more accurate your AI model will be. Here are five of the most important steps toward building a robust dataset. First, strive for maximum consistency in the labels attached to each case in your dataset and in the format of each case. Lack of standardization is like ambient noise when you are trying to listen to someone in a room. Take VideaHealth, a startup that uses AI to help dentists read dental X-rays. As noted in the Harvard case, Videa obtained several million X-rays from dental service organizations and used a subset of these to train the AI model. They realized, however, that image formats and clinical labeling conventions varied widely across dental practices. Videa's developers built software to standardize image formats and labels. This consistency was essential for building a more accurate AI model. (The first sketch below illustrates this kind of standardization.)

Second, assess whether the dataset is rich enough in terms of features, that is, the variables associated with each case. A richer set of relevant features will yield a more accurate AI model. Say you want to build an AI model to recommend which jacket might go well with a pair of trousers that a female customer is considering on your online store. Some of the relevant features might include her purchase history, her age, her ethnicity, her profession, whether she lives in a large city or a small town, even her climatic region. If you do not have information on some of these features, your AI model will be less accurate than it could be.

Third, look for missing data. One solution is to manually collect the missing information. With large datasets, this can be time consuming and expensive. An alternative is to fill in the missing values via statistical interpolation from other cases. Yet another alternative is to train the AI model only on those features for which you have more complete data, or only on the features that matter most. (The second sketch below shows two simple ways to handle gaps.)

Fourth, assess the possibility of unacceptable biases embedded in the database. Say you are an HR manager and want to train an AI model to help screen job applicants. In this case, you need to evaluate whether ethnic or gender biases of your predecessors may have made the database a poor predictor of how future hiring decisions should be made. (The third sketch below shows one quick check.) Later in this course, we discuss some of the ways historical biases can be accounted for when training an AI model.

Fifth, assess whether your database is large enough. A few thousand cases may suffice when the connection between inputs and outputs is simple, for example, training an AI model to identify the makes and models of cars from rear-view images. However, if the number of relevant factors is larger and the connections between inputs and outputs are more complex, for example, estimating the value of a used car, you will need a much larger dataset.

Organizations with multiple units doing similar things in different locations often fail to centralize the collection of data. This is an avoidable missed opportunity to assemble a larger dataset. Another important action is to standardize, automate, and centralize the collection of data from every transaction. Most importantly, organizations need proactive operational protocols defining the what, why, how, and who of data collection. Now, consider two opportunities in your organization to train and deploy AI models. Analyze how you would construct a robust data pipeline in each of these two contexts.
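
First sketch: a minimal Python illustration of the kind of label and format standardization described in the first step. The label variants, canonical names, and image settings are illustrative assumptions, not VideaHealth's actual pipeline.

    from pathlib import Path

    from PIL import Image  # Pillow, used here for basic image conversion

    # Hypothetical mapping from practice-specific label spellings to canonical labels.
    LABEL_MAP = {
        "cavity": "caries",
        "carious lesion": "caries",
        "caries": "caries",
        "filling": "restoration",
        "restoration": "restoration",
    }

    def standardize_label(raw_label: str) -> str:
        """Normalize case and whitespace, then map to a canonical label."""
        return LABEL_MAP.get(raw_label.strip().lower(), "unknown")

    def standardize_image(src: Path, dst_dir: Path, size=(512, 512)) -> Path:
        """Convert any input image to a grayscale PNG at a fixed size."""
        img = Image.open(src).convert("L").resize(size)
        out = dst_dir / (src.stem + ".png")
        img.save(out)
        return out

Run across every practice's exports, simple normalization like this catches inconsistencies before they reach model training.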
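
Second sketch: a minimal pandas illustration of two of the missing-data tactics mentioned in the third step. The columns and the 40 percent threshold are assumptions, and median imputation stands in for the statistical interpolation mentioned above.

    import pandas as pd

    # Hypothetical customer table with gaps; column names are illustrative only.
    df = pd.DataFrame({
        "age": [34, None, 51, 29],
        "purchases_last_year": [12, 7, None, 3],
        "profession": ["nurse", None, "teacher", None],
    })

    # Option A: drop features that are too sparse to be useful (here, >40% missing).
    df = df[df.columns[df.isna().mean() <= 0.4]]

    # Option B: fill remaining numeric gaps statistically (simple median imputation).
    numeric_cols = df.select_dtypes(include="number").columns
    df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].median())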
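
Third sketch: a minimal illustration of the kind of quick bias check implied by the fourth step. The records and groups are made up, and a real assessment would go well beyond comparing rates.

    import pandas as pd

    # Hypothetical historical hiring records; in practice these would come from the HR system.
    hires = pd.DataFrame({
        "gender": ["F", "M", "F", "M", "M", "F", "M", "F"],
        "hired":  [0,   1,   0,   1,   0,   1,   1,   0],
    })

    # Compare historical selection rates across groups; a large gap is a warning sign
    # that the labels may encode past bias rather than genuine job fit.
    selection_rates = hires.groupby("gender")["hired"].mean()
    print(selection_rates)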
