From the course: Learning Data Analytics Part 2: Extending and Applying Core Knowledge
Duplicate or reference data sets
- [Instructor] Two of my favorite commands in Power Query are the Duplicate and Reference commands. They allow me to use one source data set and create smaller, focused data sets to work with. Let's go ahead and open up our data in Power Query. I'll choose From Table/Range. Okay, I'll go expand my queries. And the very first thing I want to do is go ahead and name this WageSurvey. Perfect.
So a duplicate is purely independent of the other data set, meaning if I make changes in the original data set, like deleting columns or replacing values, they'll not impact the duplicated data. If you reference a data set, the reference will in fact carry over all the changes you make to the original. But this is easier to understand if you see it in action.
So I'll go ahead and right-click WageSurvey and choose Duplicate. And it gives me WageSurvey (2). I'm going to go ahead and rename this, keeping WageSurvey and adding Duplicate. Now I'll go ahead and choose my columns for this sample set. I'll uncheck Select All Columns. I'll do respondent ID. I'll choose what is your current age, your highest level of education, and how do you classify your employment. All right. I'll click OK. So I started from the original WageSurvey data and created a duplicate. Now if I scroll over, I see I have four columns. Okay, but when I go back to my WageSurvey, I still have all of my columns in place.
All right, let me show you what it looks like to reference. I'm going to right-click and reference the WageSurvey. And this one, I'm going to rename and add the word Reference. So now, if I make changes in the main WageSurvey, they'll carry over to the WageSurvey Reference. This can be valuable when you're cleaning data. If I make changes to the WageSurvey Reference, they don't write back to the original.
All right, so let's look at it this way first. I'm going to go to WageSurvey. And I know for what is your current age, I want to change that 65+ to be 65 or older. So I'll go ahead and right-click and choose Replace Values.
And anywhere it says 65+, I'm going to change it to read 65 or older. And when I click OK, I see 65 or older. When I check my duplicate, I still see 65+. But when I check my reference, I see that the change actually carried over.
Okay, so here, in the reference, I want to choose my columns. I'm going to unselect all. I'm going to choose these first few options, and I'll click OK. So notice that I have seven columns left. Let's scroll over, I see the seven columns for this reference. And when I go to WageSurvey, I still have my original set. So anything that I change in WageSurvey, in the columns that remain in WageSurvey Reference, will absolutely carry over. But I still get to work with the reference as its own data set.
Now I'll go ahead and choose Close & Load. And then I see each one of my new data sets. Okay. So I'm on sheet three, which is my reference. If I click on sheet two, it immediately highlights WageSurvey Duplicate on the right. If I go to sheet one, that shows me the original WageSurvey data. And then of course, I have my actual source data on the survey tab. I'll go back over to sheet three.
The choice you make with a decision like this is based solely on what you're trying to achieve. There are some time savers when you reference a data set, whereas with a duplicate you may have to reclean certain things. Which option you choose just depends on what you're trying to accomplish.
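For anyone who wants to see why the two commands behave differently, here is a rough sketch of what the three queries from this walkthrough might look like in the Advanced Editor. The table name `Table1` and the exact column names are assumptions made for illustration (the video never shows them), and the automatic Changed Type step is left out to keep things short.

The original WageSurvey query, with the Replace Values step added:

```
let
    // Assumed source table name; yours will match your workbook
    Source = Excel.CurrentWorkbook(){[Name="Table1"]}[Content],
    // Replace Values on the age column (column name is an assumption)
    #"Replaced Value" = Table.ReplaceValue(Source, "65+", "65 or older",
        Replacer.ReplaceText, {"What is your current age?"})
in
    #"Replaced Value"
```

WageSurvey Duplicate reads from the source table itself, so it never sees steps added to WageSurvey after it was duplicated:

```
let
    // Its own copy of the source step, independent of WageSurvey
    Source = Excel.CurrentWorkbook(){[Name="Table1"]}[Content],
    // Choose Columns keeps only the four selected columns (names assumed)
    #"Removed Other Columns" = Table.SelectColumns(Source, {
        "Respondent ID",
        "What is your current age?",
        "What is your highest level of education?",
        "How do you classify your employment?"})
in
    #"Removed Other Columns"
```

WageSurvey Reference starts from the output of WageSurvey, which is why the 65 or older change shows up there:

```
let
    // Points at the WageSurvey query, not at the workbook table
    Source = WageSurvey,
    // The video keeps seven columns; only two assumed names are shown here
    #"Removed Other Columns" = Table.SelectColumns(Source, {
        "Respondent ID",
        "What is your current age?"})
in
    #"Removed Other Columns"
```

The line to look at is `Source` in each query. The duplicate repeats the original's source step, while the reference uses the original query as its source, so it picks up whatever WageSurvey produces and nothing done in the reference flows back.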