Apache Spark

From the course: Data Platforms: Spark to Snowflake

- [Dr. Berman] Apache Spark, which sits on top of Hadoop, is also a big data analytics engine or platform. It keeps as much of the data in memory as possible. This means that it's generally faster than Hadoop, especially for iterative work such as running machine learning algorithms. These are the algorithms for which it was originally designed. Unlike Hadoop, Spark does not come with its own cluster management system, but attaches to a number of pre-existing ones, including Hadoop's YARN system. Also, unlike Hadoop, Spark does not have its own distributed data store, but once again can attach to a number of existing data stores, including the one supplied in Hadoop.

Let's talk about some Spark concepts. In Spark, there's a driver, which sends jobs to various executors. The jobs are divided into stages, and the data is partitioned, with one task running per partition. The tasks do not always run in parallel, but they run in parallel whenever possible. At the heart of Spark is the resilient…
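The driver, stage, partition, and task vocabulary above can be sketched as a toy model in plain Python. This is not Spark's API and it glosses over almost everything Spark really does. The names `partition`, `run_stage`, and `driver` are hypothetical, a thread pool stands in for the executors, and the two-stage "square, then sum" job is an assumption chosen only for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(data, num_partitions):
    """Split a list into num_partitions slices (round-robin by index)."""
    return [data[i::num_partitions] for i in range(num_partitions)]


def run_stage(task, partitions, max_workers=4):
    """Run one task per partition, in parallel when workers are available.

    The thread pool is a stand-in for Spark executors (an assumption
    of this toy model, not how Spark schedules work).
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() preserves partition order in the returned results.
        return list(pool.map(task, partitions))


def driver(data, num_partitions=4):
    """Play the driver: split the job into stages and hand out tasks."""
    parts = partition(data, num_partitions)
    # Stage 1: each task squares the values in its own partition.
    squared = run_stage(lambda p: [x * x for x in p], parts)
    # Stage 2: each task reduces its partition to a partial sum.
    partial_sums = run_stage(sum, squared)
    # The driver combines the per-partition results.
    return sum(partial_sums)


total = driver(list(range(10)))
```

With fewer workers than partitions (for example `max_workers=1`), the same tasks still run, only one after another, which mirrors the "in parallel whenever possible" point.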

