From the course: Security Risks in AI and Machine Learning: Categorizing Attacks and Failure Modes

ML dataset hygiene

- [Instructor] If you've ever accidentally taught your autocorrect system that a typo is an actual word, and then had to spend months correcting that typo every time your system tried to insert it, you know how frustrating it can be when machines learn the wrong thing. If the data that ML and AI systems are trained on isn't good or accurate, the outcome from the model won't be good either. That's why it's incredibly important to vet datasets and implement dataset hygiene policies.

Biased data leads to biased classifications and predictions. But bias isn't always obvious or intentional. Consider an automated faucet that is programmed to turn on when computer vision recognizes human hands in front of the faucet. If the system is trained only on light-skinned adult hands, it may not turn on when darker-skinned or small hands are in front of the sensor. Systems need to be trained on datasets that represent the entire population of potential users. Existing datasets can be biased too. Traditionally, many technical fields, like computer programming, have been staffed predominantly by male workers. The statistics are changing as more women enter technical fields, but if an ML-based CV or resume analysis system is trained on the older, biased data, it could learn that male candidates are desirable. To get ahead of bias in systems, training sets and techniques should be carefully vetted by a diverse group of experts. Because we can rarely recognize our own bias, it helps to have different people with different skill sets and viewpoints assess the training data for bias. The more eyes and minds that vet the data for bias, the more likely that bias will be identified and corrected before the resource-intensive training phase.

Intentional poisoning of data is another way to cause ML and AI to fail. A great way to protect training data, or any sensitive data for that matter, is to apply the principle of least privilege. Using the principle in practice means that only those who need access to something are granted access. It is often used in conjunction with data classification programs, which help to identify an organization's most important and sensitive information. For organizations that have not yet started on the data protection journey, this is an opportunity to address overall data security while also defending and protecting AI and ML systems. Existing data classification and privilege management programs should be extended to include AI and ML training data.

Not all training data is created or developed in house. Many companies use existing training datasets from public sources or partners. When using a third-party training set, evaluate the source carefully. Is it a source you trust? How long have you had a relationship with the source? Are there outliers in the data that could skew the training process or the model predictions? Have a data scientist at your organization review the data before it's used for training.

The last factor to consider is the size of the training dataset. Make sure there's enough data to both train and test the system. Systems shouldn't be tested on the same data that was used for training, so you'll need enough data in the original set to support both activities. Small samples of data can lead to overfitting, where the model performs very well on past data but poorly on new data, as well as underfitting, because there simply isn't enough data for proper training.
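As a rough illustration of two of those checks, here is a minimal sketch in Python that screens a third-party dataset for outliers and then holds out a separate test set so the model is never evaluated on the same records it was trained on. It assumes pandas and scikit-learn are available; the file name and the "income" and "label" columns are hypothetical placeholders, and the z-score threshold is just one simple screening choice.

```python
# Minimal sketch of two dataset hygiene checks on a third-party training set.
# Assumes pandas and scikit-learn are installed; file and column names are
# hypothetical placeholders, not a specific organization's schema.
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("third_party_training_set.csv")

# 1. Flag obvious outliers in a numeric feature with a simple z-score screen
#    so a data scientist can review them before training begins.
feature = df["income"]
z_scores = (feature - feature.mean()) / feature.std()
outliers = df[z_scores.abs() > 3]
print(f"{len(outliers)} rows flagged for manual review")

# 2. Hold out a separate test set so evaluation never reuses training records.
#    Stratifying on the label keeps class proportions similar in both splits.
train_df, test_df = train_test_split(
    df, test_size=0.2, stratify=df["label"], random_state=42
)
print(f"{len(train_df)} training rows, {len(test_df)} test rows")
```

If the flagged rows or the split sizes look wrong, that is a signal to go back to the source and the data classification owners before spending resources on training.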
The quality of training data used to develop AI and ML solutions directly impacts the success and accuracy of the models. This is why training dataset hygiene is one of the most important things that can be done to ensure AI and ML systems are accurate and unbiased.
