From the course: Data Platforms: Spark to Snowflake
Apache Spark
- [Dr. Berman] Apache Spark, which sits on top of Hadoop, is also a big data analytics engine or platform. It keeps as much of the data in memory as possible. This means that it's generally faster than Hadoop, especially for iterative work such as running machine learning algorithms. These are the algorithms for which it was originally designed. Unlike Hadoop, Spark does not come with its own cluster management system, but attaches to a number of pre-existing ones, including Hadoop's YARN system. Also unlike Hadoop, Spark does not have its own distributed data store, but once again can attach to a number of existing data stores, including the one supplied in Hadoop. Let's talk about some Spark concepts. In Spark, there's a driver which sends jobs to various executors. The jobs are divided into stages, and the data is partitioned, with tasks running in parallel per partition. Not always in parallel, but in parallel whenever possible. At the heart of Spark is the resilient…
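The driver, executor, and partition description above can be sketched as a toy model in plain Python. This is not Spark's API: the names `partition`, `task`, and `run_stage` are hypothetical, and a thread pool stands in for the executors. It only illustrates the idea of one task per partition, run in parallel whenever a worker is free.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(data, num_partitions):
    """Split the data into num_partitions slices (hypothetical helper)."""
    return [data[i::num_partitions] for i in range(num_partitions)]


def task(part):
    """One task: the work applied to a single partition (sum of squares here)."""
    return sum(x * x for x in part)


def run_stage(data, num_partitions=4, num_executors=2):
    """Plays the 'driver' role: partitions the data and submits one task per partition.

    With more partitions than workers, some tasks wait their turn, so the
    work runs in parallel only as far as the pool allows.
    """
    parts = partition(data, num_partitions)
    with ThreadPoolExecutor(max_workers=num_executors) as pool:
        # pool.map returns results in partition order
        return list(pool.map(task, parts))


partials = run_stage(list(range(10)))  # one partial result per partition
total = sum(partials)                  # the driver combines the partial results
print(partials, total)
```

Here four partitions share two workers, so at most two tasks run at once. That mirrors the point that tasks run in parallel whenever possible, though not always.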