From the course: Apache Spark Essential Training: Big Data Engineering
Scaling extraction and loading operations
- [Presenter] When scaling a data engineering pipeline, every stage in the pipeline needs to scale in order for the entire pipeline to scale. Extracting data and loading processed data into destinations are time-consuming steps, as they usually involve disk reads and writes. How do we scale these steps when building pipelines with Apache Spark? Let's start with data extraction. Spark supports parallel extraction of data from various data sources. For example, Spark can read JDBC records in parallel across its executors. Similarly, it can divide Kafka partitions between executors and process them in parallel. For the data source being used, analyze the out-of-the-box options Spark provides for that source. Exploit these options for parallel reads of data while maintaining data consistency. If possible, choose a source technology and design the source schema in a way that suits parallel operations. This option is only available if the source systems are also being designed and built…