From the course: Complete Guide to Generative AI for Data Analysis and Data Science
Measures of central tendency
- [Instructor] When we're analyzing data, one of the things we often want to do with a particular variable, or column, is get a sense of whether it tends toward a particular value, that is, whether it has a central tendency. There are different ways of measuring central tendency. The mean is probably the best known, and it's also called the arithmetic average. Now, one disadvantage of the mean is that it can be heavily influenced by extreme values. You may have almost all of your values clustered around a particular value, but then you have some outliers, either much smaller or much larger than that central value, and those outliers can pull the mean toward them.
Now, the median, on the other hand, is the middle value in an ordered set of data. So if you order all the values in a column and look at the halfway mark, or, if there's an even number of values, take the middle two and average them, you'll get the median. And what's nice is that the median is largely unaffected by extreme values. So if you want to understand central tendency without being influenced by extreme values, then the median is a good option. Now, the mode works well when you're dealing with countable data and you want to know the most frequently appearing value.
Now, we use the mean when we're working with symmetric distributions. Symmetric means that there is roughly the same number of values, spread in roughly the same way, below the central tendency value as above it. And we use it when the data doesn't have extreme outliers. The median works well when we're dealing with extreme values or skewed distributions, where the values may bunch up more toward the smaller side or more toward the larger side. And the mode is useful, of course, when working with categorical data.
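The three measures described above can be sketched in a few lines of Python. This is a minimal illustration, not the course's own code: the `values` and `colors` lists are made-up sample data, and `median_of` is a hypothetical helper written only to spell out the middle-value rule. Python's standard `statistics` module already provides `mean`, `median`, and `mode`.

```python
from statistics import mean, median, mode

def median_of(values):
    """Hypothetical helper: the middle-value rule, written out by hand."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd count: the single value at the halfway mark
        return ordered[mid]
    # Even count: average the two middle values
    return (ordered[mid - 1] + ordered[mid]) / 2

# Made-up sample data (an assumption, not the course dataset)
values = [98, 100, 101, 103, 104, 106]
colors = ["red", "blue", "red", "green", "red", "blue"]

values_mean = mean(values)         # sum of the values divided by their count
values_median = median_of(values)  # even count, so it averages 101 and 103
colors_mode = mode(colors)         # most frequently appearing value

print(values_mean, values_median, colors_mode)
print(values_median == median(values))  # the helper agrees with the library
```

In practice the library's `median` would be used directly. The hand-written version is there only to make the odd and even cases visible.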
So for example, here is some symmetric data without outliers. The central tendency is around 100 to 102, as we can see here, and the mean and the median are very close: the median is 102 and the mean is about 102.7. Now here we have the same symmetric data, but with outliers. You'll notice the smallest value, 12, is quite a bit smaller than the median, and the largest value, 395, is quite a bit larger. Our median, 102, is still a good indicator of central tendency, because we've only tweaked two of the numbers in our dataset. But tweaking those two numbers significantly changed the mean: what had been about 102.7 is now 135. So this shows you the effect that outliers can have on the mean, and why the median is often preferred when the data has outliers.
Now, when we're working with categorical variables, we might want to know the most common value in a particular variable or column. In the category column, we have electronics appearing three times, while clothing and furniture each appear twice. So the mode, the most frequently occurring category, is electronics. So those are some examples of measures of central tendency and when we want to use them.
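The effect of outliers can be reproduced with a small Python sketch. The numeric sample below is invented for illustration and is not the dataset shown in the video, so its means differ from the 102.7 and 135 quoted in the narration. Only the two extreme values, 12 and 395, and the category counts are taken from the narration.

```python
from statistics import mean, median, mode

# Invented symmetric sample (an assumption, not the course data)
base = [95, 98, 100, 101, 102, 103, 104, 106, 109]

# Same sample with the smallest and largest values replaced by outliers
with_outliers = [12] + base[1:-1] + [395]

base_mean, base_median = mean(base), median(base)
out_mean, out_median = mean(with_outliers), median(with_outliers)

print(f"no outliers:   mean={base_mean:.1f}, median={base_median}")
print(f"with outliers: mean={out_mean:.1f}, median={out_median}")

# Category counts from the narration: electronics x3, clothing x2, furniture x2
categories = ["electronics"] * 3 + ["clothing"] * 2 + ["furniture"] * 2
top_category = mode(categories)
print(f"mode: {top_category}")
```

With this sample, the median stays at 102 in both cases while the mean rises from 102 to roughly 124.6, which is the same pattern the narration describes.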