Why is my Fabric Data Pipeline CU usage so high?
Picture this: you have 100 small CSV files to load into your Fabric Warehouse, and since you want to do it quickly you set up a Data Factory pipeline. You give it a quick test run and see that it runs fast, with each copy taking around 10 seconds and the whole run finishing in around 40 seconds because the loop runs in parallel. Based on a quick read of the pipeline pricing model page you expect the total CU cost of this run to be
Estimated Data Movement CU Cost
100 activities x 10 seconds x 1.5 (Data Factory CU Rate) = 1,500 CU(s)
However, when you go to the Capacity Metrics App you see that the total cost for Data Movement is 36,000 CU(s), 24 times higher than expected!
The cause of this is twofold: one part documented, and one not (at least not in Microsoft Learn).
- Firstly, we missed that the intelligent optimization throughput factor is also a component of the cost calculation. This is a value between 4 and 256 that represents how parallel the copy activity is. As the CSV files being copied are small, it sits at the minimum value of four, changing our CU estimate to 6,000 CU(s), so we are still out by a factor of 6. This factor is often misunderstood and affects copy activities at all scales, so make sure to monitor it in your pipeline runs.
- The second and, for small copy activities, more significant factor is that every copy activity's duration is rounded up to at least one minute. This means that instead of being charged for 10 seconds of duration, each copy activity is charged for 60! Once we add that to our formula we get the same value as the Capacity Metrics App.
100 activities x 60 seconds x 4 intelligent optimization throughput factor x 1.5 (Data Factory CU Rate) = 36,000 CU(s)
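The two formulas differ only in the rounded-up duration and the throughput factor; a few lines of Python make the 24x gap explicit:

```python
# Recompute both estimates from the worked example above.
activities = 100
observed_seconds = 10   # duration actually seen per copy activity
cu_rate = 1.5           # Data Factory CU rate for data movement

# Naive estimate: observed duration, no throughput factor
naive = activities * observed_seconds * cu_rate
print(naive)            # 1500.0

# Billed estimate: duration rounded up to 60s, minimum throughput factor of 4
billed_seconds = max(observed_seconds, 60)
throughput_factor = 4
billed = activities * billed_seconds * throughput_factor * cu_rate
print(billed)           # 36000.0
print(billed / naive)   # 24.0
```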
From what I can find, this is not covered at all in the Fabric documentation and is only mentioned in Microsoft Q&A threads and on the Azure Data Factory pricing page.
This small-file problem applies to many scenarios where you might want to use Data Factory: loading small CSV files as shown here, but also REST API requests, web scraping, and so on.
Remedies
If you would rather your pipelines did not cost far more than the documentation suggests, there are some alternatives.
Within Pipelines
Staying within Data Factory pipelines, there are some options to avoid looping over copy activities; however, these mostly depend on the source system you are copying from.
If you are using a REST API or web scraping, look into the pagination rules. These can let you load multiple pages of data within one copy activity, meaning that even if your total runtime is less than one minute you are charged for one copy activity rather than one per API response page.
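As a sketch, the REST connector's pagination rules in a copy activity source can follow the next-page link the API returns. The `nextLink` field name here is an assumption; your API's response will use its own field:

```json
{
  "paginationRules": {
    "AbsoluteUrl": "$.nextLink"
  }
}
```

With this in place the copy activity keeps requesting pages until no next-page URL is returned, all within a single activity run.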
For other sources, such as CSV files, you may be able to use something like a wildcard file path, but this will depend on your destination configuration.
Alternative Fabric Engines
For the best results with small data ingestion you may need to look outside the Data Factory experience in Fabric. If your destination is a warehouse, look into using the COPY INTO command to load the data directly into the warehouse, avoiding the copy activity entirely.
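A minimal sketch of COPY INTO for this scenario, assuming the CSVs sit in a storage account the warehouse can reach (the table name, storage account, and container path are all hypothetical placeholders):

```sql
COPY INTO dbo.sales_staging
FROM 'https://<storage-account>.blob.core.windows.net/landing/*.csv'
WITH (
    FILE_TYPE = 'CSV',
    FIRSTROW = 2  -- skip the header row in each file
);
```

Because the wildcard pulls in all matching files, the whole batch loads in one statement rather than one copy activity per file.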
Python notebooks are also an excellent option for ingesting small data if you feel comfortable writing Python (or having Copilot write it). They work great with both lakehouses and warehouses via T-SQL cells. The CU usage of Python notebooks is also very low, with the base compute costing 1 CU per second, so if that size is enough for your data you can get six minutes of Python notebook runtime for the same cost as one copy activity. To learn more, there are two great blogs by Microsoft employees Mim Djouallah and Miles Cole that do a good job of demonstrating the extensive capability of Python notebooks, and also where they fall down.
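The notebook approach can be sketched in a few lines of pandas. Everything here is a local stand-in: in a Fabric Python notebook the files would live under a lakehouse Files path, and the `landing` folder and file names are hypothetical:

```python
import glob
import pathlib

import pandas as pd

# Create a few small CSV files to stand in for the 100 from the example.
landing = pathlib.Path("landing")
landing.mkdir(exist_ok=True)
for i in range(3):
    pd.DataFrame({"id": [i], "value": [i * 10]}).to_csv(
        landing / f"part_{i}.csv", index=False
    )

# One pass: read every CSV, combine, and write a single consolidated file.
frames = [pd.read_csv(p) for p in sorted(glob.glob("landing/part_*.csv"))]
combined = pd.concat(frames, ignore_index=True)
combined.to_csv(landing / "combined.csv", index=False)
print(len(combined))  # 3
```

From here the combined file can be loaded into a warehouse table (for example via COPY INTO or a T-SQL cell), and the whole run is billed at the notebook's per-second rate with no per-activity minimum.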