How to get the most out of Web.BrowserContents()
Back in my post about my submission to a London Power BI Challenge, I mentioned that I had issues with the page I was trying to web scrape. The links I wanted did not load immediately, only after a short delay. At the time I used a Python script to work around this. That worked for a one-off report, but because Power BI lacks support for refreshing Python code in the service, the solution could not be used for a regularly occurring report.
However, I was looking through the Power Query documentation for Web.BrowserContents() and noticed that its options record lets you set a delay before the HTML is scraped. Combining this with the Html.Table() example from Chris Webb achieves the same result as my script, and it is far better: it is faster, refreshes in the service, and does not require a gateway to refresh. I just wish I had known about this a few months ago when I needed it.
Hopefully you can learn from my mistake and avoid this. I will include some example code below as a starting point. The options record opens up a lot of possibilities, such as waiting on CSS selectors. If all goes well, there will be another blog post going into further detail soon.
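let
    // Load the page in a browser and wait 10 seconds so the delayed links have time to appear
    Source = Web.BrowserContents(
        "http://cycling.data.tfl.gov.uk/",
        [WaitFor = [Timeout = #duration(0, 0, 0, 10)]]
    ),
    // Return the href of every anchor whose link starts with "http"
    Links = Html.Table(Source, {{"Link", "a[href^=""http""]", each [Attributes][href]}})
in
    Links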
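As a rough sketch of the CSS selector possibility mentioned above, the WaitFor record can also take a Selector field, so the query waits until a matching element exists rather than always pausing for a fixed time. I am assuming here that the same selector used for the links, a[href^="http"], is a sensible thing to wait for on this page. I have not checked that against the site, so treat it as a starting point rather than a tested query.

```powerquery
let
    // Assumption: the delayed links match this selector once they have loaded
    LinkSelector = "a[href^=""http""]",
    // Wait until the selector appears; error if it has not appeared within 10 seconds
    Source = Web.BrowserContents(
        "http://cycling.data.tfl.gov.uk/",
        [WaitFor = [Selector = LinkSelector, Timeout = #duration(0, 0, 0, 10)]]
    ),
    // Same extraction as before: one row per link, holding its href attribute
    Links = Html.Table(Source, {{"Link", LinkSelector, each [Attributes][href]}})
in
    Links
```

With both fields set, the Timeout acts as an upper limit and not as a fixed pause, so the query can finish as soon as the links are present. According to the documentation, if the selector never appears before the timeout elapses the function raises an error, which at least makes a failed scrape obvious instead of silently returning an empty table.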