From the course: Apache Spark Essential Training: Big Data Engineering
Scaling extraction and loading operations
- [Presenter] When scaling a data engineering pipeline, every stage in the pipeline needs to scale in order for the entire pipeline to scale. Extracting data and loading processed data into destinations are time-consuming steps, as they usually involve disk reads and writes. How do we scale these steps when building pipelines with Apache Spark? Let's start with data extraction. Spark supports parallel extraction of data from various data sources. For example, Spark can read JDBC records in parallel across its executors. Similarly, it can divide Kafka partitions between executors and process them in parallel. For the data source being used, analyze the out-of-the-box options Spark provides for that source. Exploit these options for parallel reads of data while maintaining data consistency. If possible, choose a source technology and design the source schema in a way that suits parallel operations. This option is only available if the source systems are also being designed and built…