From the course: Data Pipeline Automation with GitHub Actions Using R and Python
Handling a large data request with R - GitHub Tutorial
- [Instructor] In the previous video, we saw the limitation of the EIA get function when pulling a large data set, due to the API's row limit per GET request. In this video, we will see how to handle a large data request from the API using the backfill function. The EIA backfill function splits a large request into a sequence of small requests, sends those requests to the API using the EIA get function, and appends the output into a single table. The function uses the same parameters as the EIA get function, with the distinction that the start and end arguments take POSIXct or POSIXlt inputs. Let's set the GET request to pull data from July 1st, 2018 to February 24th, 2024. Be sure to use the POSIXct function to set those inputs. Let's start with the start argument: the year is 2018, the month is July, and the day is the 1st. We will set the hour to eight o'clock in the morning, since that is the first data point in this series, and set the minutes and seconds to zero. Also, make sure to set the time zone to UTC. Similarly, we will set the end argument to February 24th, 2024: the year is 2024, the month is February, the day is 24, and the time is set to zero. The next argument is the offset. The offset argument enables us to control the size of the sequential requests the function sends to the API. For example, if you are pulling a series with 10,000 observations and you set the offset to 1,000, the function will generate 10 sequential requests, each with a size of 1,000 observations. While you can set it as high as 5,000 observations, it is recommended not to set the offset beyond 2,500 observations. Let's set it to 2,250, and now we can execute the code. So let's go ahead and send the GET request to the API. Notice that since this is a large request, close to 50,000 observations, it might take some time. You can see the execution on the right side, and this is the output.
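The request described above can be sketched as follows. This is a minimal sketch assuming the EIAapi R package's eia_backfill function and an API key stored in an EIA_API_KEY environment variable; the api_path and facets values are placeholders, not the exact series used in the video.

```r
library(EIAapi)

# Set the request window as POSIXct objects in the UTC time zone
start <- as.POSIXct("2018-07-01 08:00:00", tz = "UTC")
end   <- as.POSIXct("2024-02-24 00:00:00", tz = "UTC")

# Hypothetical series path and facets; replace with your own series
df <- eia_backfill(
  start    = start,
  end      = end,
  offset   = 2250,  # max is 5,000, but staying at or below 2,500 is recommended
  api_key  = Sys.getenv("EIA_API_KEY"),
  api_path = "electricity/rto/region-data/data/",
  facets   = list(respondent = "US48", type = "D")
)
```

With an offset of 2,250, a request spanning roughly 50,000 observations is split into about 23 sequential GET requests, whose results are appended into a single table.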
Let's just remove it, and now we can explore the output. Again, we are going to use the head and the str functions to see the structure of the data. So let's go ahead and look at the head of the output over here. As you can see, we are getting almost the exact same output as before with the EIA get function, with the distinction of the index. This time, the index is set as a time object, and in the structure you can see it is already set as a POSIXct object. We can now go ahead and plot the output and see the results. Oh, one more thing: as you can see over here, there are close to 50,000 observations, so we were able to pull more than 5,000 observations. Let's go ahead and plot the series using the same method as before. As you can see, the output looks pretty much normal. You might observe some gaps in the series; those missing values are data points that are not available on the API either.
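The exploration steps above can be sketched like this, assuming the table returned by the backfill call is stored in df; the column names time and value are assumptions about the returned table, not confirmed names from the video.

```r
# Inspect the combined table returned by the backfill function
head(df)  # first rows; the index column holds the timestamps
str(df)   # the index should appear as a POSIXct object

# Plot the full series; gaps in the line correspond to data points
# that are missing from the API itself
plot(df$time, df$value, type = "l",
     xlab = "Time", ylab = "Value")
```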