From the course: Intro to Snowflake for Devs, Data Scientists, Data Engineers
Snowpark DataFrames: Part 1
- In this video, we're going to learn about Snowpark DataFrames. You've probably figured this out by now, but at Snowflake, we use lots of snow imagery. My hope is that as you move through this section and hear the word Snowpark, Snowpark, Snowpark again and again, you won't think of cold and ice and loneliness, unless you live in a crowded, tropical country, in which case maybe those sound desirable to you. Instead, I hope you'll think of spending the day hugged by a warm coat, enjoying the crisp air. I hope you'll think of spending the evening cozy and warm, sipping a hot drink with friends and family nearby, friends and family who are enjoying each other's company, laughing, laughing with you, not at you, they assure you. I hope you'll think of that.

In any case, before we talk about Snowpark DataFrames, I want to talk more about the term Snowpark, since we use it a bunch at Snowflake, and it's important to know how to interpret it when you hear it. The best way to think about Snowpark is as a set of non-SQL capabilities, Python and so on, that you can use in Snowflake. We actually covered or mentioned a few of these already in this course, Python UDFs and Python stored procedures, though I didn't call them out as falling under the Snowpark umbrella at the time. In my experience, when someone says Snowpark, you should think, "Okay, now I know this person's talking about non-SQL programming in Snowflake," but you'll need to wait for them to specify which aspect of Snowpark they're talking about before you can be 100% sure where the discussion is going. So in addition to Snowpark Python UDFs, Snowpark Python stored procedures, and Snowpark DataFrames, which we'll cover now, in future videos we'll talk about Snowpark ML and Snowpark Container Services. Those aren't everything in the Snowpark world, but they make up a good chunk of it.

Okay, now let's jump into Snowpark DataFrames. Snowpark DataFrames is a library that helps you transform data. To contextualize this, it's good to think about data engineering work as having three parts: ingestion, transformation, and delivery, or ITD. Ingestion refers to gathering data, transformation refers to cleaning, changing, and preparing that data, and delivery refers to handing over a finished data product, dataset, et cetera, to a customer or system like an analyst or an app. Snowpark DataFrames are useful in the middle step of that process, the transformation part.

Why is it worth learning about Snowpark DataFrames? I think there are a few answers to this. One, if you're used to doing transformations with DataFrames like pandas, it's good to be aware that Snowflake has an expressive DataFrame API that's efficient and scales well. Two, if you're partial to manipulating data in a language other than SQL, like Python, Java, or Scala, you might prefer Snowpark DataFrames. And three, if you learn about them and decide you're not a big fan, it's still helpful to be familiar with them, because you'll see them a bunch in the Snowflake ecosystem, in code examples, et cetera, and it's good to have some sense of what you're looking at when you run into them.

We're going to work with the Python flavor of Snowpark DataFrames, though, as I mentioned, you could work in Java or Scala as well. And to learn about Snowpark DataFrames, we're going to use something we haven't used thus far in the course: a Python worksheet. In our case, Python worksheets are convenient because they have the Snowpark library preinstalled. Enough talking, let's see an example.
(text whooshing) Okay, so we open up our Python worksheet, and immediately you can see the import statements at the top. If you're a Python user, this is always a little comforting, because you can be like, "Ah, I'm in Python land, I'm home." Take off your shoes, put your feet up by the fire. Here's a nice, warm drink of Snowpark DataFrames. Here, we're importing snowflake.snowpark.

One way that Python worksheets differ from SQL worksheets is that you can't run them piece by piece. Running one part runs the whole thing, so that's a constraint we'll have to get used to. In any case, let's set our context at the top to our FROSTBYTE_TASTY_BYTES database and the ANALYTICS schema, and then let's start having some fun with Snowpark DataFrames.

The first thing I want to do is create a DataFrame from one of our existing SQL tables. The syntax to do that is the name of the new DataFrame, then equals, then session.table, with the name of the table in parentheses. I included the database, schema, and table name, though the context is already set to this database, so we could have just used the schema and table name.

Now instantly, a ton of questions are likely popping up. We called session.table, but what is session? When you work outside Snowsight, the session object is really, really important, because it includes all of your connection details. You need to create your table from a session so you have all the right permissions to use that table. Here, because we're inside Snowsight, the session was created automatically when we defined main above, so it's a different situation. I wouldn't worry too much about it. We can't learn everything in this one course, and the main thing I want you to focus on here is the DataFrame syntax after you've already pulled in your table.

Another question you might have is, what is .show? Snowpark DataFrames differ from, say, pandas DataFrames in that Snowpark DataFrames execute lazily. You can add a filter statement to a DataFrame, we'll talk about filtering in a moment, or add an aggregate statement, we'll talk about that too, and if you run that command, Snowflake will record that logic, but it won't actually compute the results until you run something like .show or .collect. So it's lazy in that it waits for you to explicitly ask it to compute something. This could be annoying if you're used to eager execution, where you don't have to explicitly ask for the computation, but because lazy execution builds up chains of commands before you actually execute them, it can identify optimizations that aren't available in eager systems.

A couple of other quick comments. You might notice that we have df_table.show in there and also a return statement. When I first encountered this, I thought it sounded duplicative: df_table.show is going to basically print the table, so why do we need to also return the table? This has to do with Python worksheets. If you scroll up to the top, you'll see that the Python worksheet has a Settings dropdown. If we click on that, you'll see that the handler is main, and if we hover over that, we'll see that the handler is defined as the function that will be called when executing this worksheet. Okay, that makes sense, it felt like that's what was happening when we ran the worksheet last time, like it executed main, and now we know why. But right below that, you'll notice that there's a specified return type of Table. The other possible return types are String and Variant.
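To make the worksheet shape concrete, here's a minimal sketch of what a Python worksheet like this can look like. The FROSTBYTE_TASTY_BYTES database and ANALYTICS schema are from the video; the ORDERS_V table name is a placeholder I'm assuming for illustration, so substitute a table you actually have.

```python
import snowflake.snowpark as snowpark


def main(session: snowpark.Session):
    # Build a DataFrame from an existing table (ORDERS_V is a placeholder name).
    # Nothing is computed yet; Snowpark just records a reference to the table.
    df_table = session.table("FROSTBYTE_TASTY_BYTES.ANALYTICS.ORDERS_V")

    # .show() forces execution and prints a sample of rows to the Python Output tab.
    df_table.show()

    # Returning the DataFrame satisfies the worksheet's Table return type,
    # so the rows also appear in the Results pane.
    return df_table
```

Outside Snowsight, you wouldn't get the session for free; you'd typically build one yourself with Session.builder.configs({...}).create(), passing in your account and credential details.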
So Python worksheets require you to have a handler, the function that gets called when the worksheet is run, and they require you to have a return statement. To make this feel less abstract, let's comment out the return statement, then run this worksheet again. We get an error, "Handler did not return a Snowpark DataFrame," but when we look closely, we can see that there's a PY Output, Python Output, tab next to Results. If we click on that, we can see that df_table.show did something: it displayed an output that got recorded here, separate from the table that didn't get returned this time, because we commented out the return statement. Now let's switch that and instead comment out the df_table.show. Now we have a result, but when we click on PY Output, Python Output, we don't see anything. So it seems like executing main and returning a DataFrame can operate similarly to explicit execution commands like .show or .collect. (text whooshing) So I know I've just done a lot of talking, but don't worry, we're going to be much more hands-on again coming up. (upbeat music)
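If lazy execution still feels abstract, here's a small sketch of how a chain of transformations stays unexecuted until you explicitly ask for results. The table and column names (ORDERS_V, ORDER_TOTAL, CITY) are placeholders I'm assuming for illustration, not names from the video; the pattern is what matters.

```python
from snowflake.snowpark import Session
from snowflake.snowpark.functions import col, sum as sum_


def transform(session: Session):
    # Placeholder table and column names; substitute ones that exist in your schema.
    orders = session.table("FROSTBYTE_TASTY_BYTES.ANALYTICS.ORDERS_V")

    # These lines only record logic; no query has run yet.
    big_orders = orders.filter(col("ORDER_TOTAL") > 100)
    by_city = big_orders.group_by("CITY").agg(sum_("ORDER_TOTAL").alias("TOTAL_SALES"))

    # Execution happens only when you explicitly ask for results.
    by_city.show()             # prints a sample of rows
    rows = by_city.collect()   # runs the query and returns a list of Row objects
    return rows
```

Because the filter and the aggregation are just recorded until .show or .collect runs, Snowflake can push the whole chain down as a single optimized query rather than executing each step eagerly.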