From the course: Scala Essential Training for Data Science
Introduction to Spark Datasets
- [Instructor] In Spark, we have two options when working with collections of data. The first is Spark DataFrames. A DataFrame is an untyped collection of distributed data, so there are no compile-time type checks with DataFrames. When we manipulate data within DataFrames, we use basic column expressions. One of the key advantages of DataFrames is that they're really easy to create, and we've seen how quickly we can create one just by loading data from a CSV file or a JSON file. An alternative available to us in Spark with Scala is Spark Datasets. Spark Datasets are strongly typed, so they can provide compile-time data type checks. They support column expressions, as DataFrames do, but they also support more complex operators, like lambda functions if you're in a functional programming environment, or object-oriented expressions if you're in more of an OO environment. Now…
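To make the contrast concrete, here is a minimal sketch of the same filter written against a DataFrame and against a Dataset. The `Employee` case class, the `DatasetSketch` object, its two helper methods, and the sample rows are all hypothetical names introduced for illustration; the sketch also assumes Spark SQL is on the classpath and that a local master is acceptable.

```scala
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}
import org.apache.spark.sql.functions.col

// Hypothetical record type, introduced only for this illustration.
// It is declared at the top level so Spark can derive an encoder for it.
case class Employee(name: String, age: Int)

object DatasetSketch {
  // Untyped: the column name is a string, so a typo such as "agee"
  // would only be reported when the query is analyzed at run time.
  def olderThanUntyped(df: DataFrame, minAge: Int): DataFrame =
    df.filter(col("age") > minAge)

  // Typed: the lambda works on Employee objects, so a misspelled field
  // or a type mismatch is rejected by the Scala compiler.
  def olderThanTyped(ds: Dataset[Employee], minAge: Int): Dataset[Employee] =
    ds.filter(_.age > minAge)

  def main(args: Array[String]): Unit = {
    // Assumption: a local session is enough for this sketch.
    val spark = SparkSession.builder()
      .appName("DatasetSketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Made-up sample rows.
    val ds: Dataset[Employee] = Seq(Employee("Ana", 34), Employee("Ben", 27)).toDS()

    // A Dataset can be viewed as an untyped DataFrame...
    val df: DataFrame = ds.toDF()
    // ...and a DataFrame with matching columns can be converted back.
    val typedAgain: Dataset[Employee] = df.as[Employee]

    olderThanUntyped(df, 30).show()
    olderThanTyped(typedAgain, 30).show()

    // Datasets also accept column expressions, as DataFrames do.
    ds.filter(col("age") > 30).show()

    spark.stop()
  }
}
```

Both helpers select the same rows. The difference is when a mistake surfaces: with the column expression, at run time; with the lambda, at compile time.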