Web Scraper Firefox Extension



Firefox is my personal favorite browser, due in part to all of the great extensions available for it. When you try running Firefox with Selenium, however, you’ll probably find that Firefox is missing the extensions you have installed and normally use when browsing. Luckily, there’s a quick and easy way to install all your favorite Firefox extensions when using Selenium.

For example, let’s say we’d like to do a little light web scraping. To keep things simple, let’s just grab what’s trending off of Yahoo’s home page.

You can see the top 10 trending subjects off to the right, starting with Beaufort County.

Selenium without Firefox Extensions

Here’s how we’d normally scrape that info:
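(The sketch below assumes Selenium’s Python bindings with geckodriver on your PATH; the CSS selector for Yahoo’s trending module is an assumption and may need adjusting if the page layout changes.)

    from selenium import webdriver
    from selenium.webdriver.common.by import By

    # Start a plain Firefox instance (no extensions installed yet).
    driver = webdriver.Firefox()
    driver.get("https://www.yahoo.com")

    # NOTE: this selector is a guess at Yahoo's trending-now markup and
    # may need to be updated if the page layout changes.
    trending = driver.find_elements(By.CSS_SELECTOR, "ul.trending-list li a")
    for rank, item in enumerate(trending[:10], start=1):
        print("{}.{}".format(rank, item.text))

    driver.quit()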

Running the script prints the trending entries (partial output):

    2.Faith Hill
    4.Nicki Minaj
    6.Cox Cable
    8.Airbnb Vacation Rentals
    10.Ally Bank

Great, that seems to work. But let’s say we’d prefer Firefox to be running with a couple of my favorite extensions, namely:

  • HTTPS Everywhere: Automatically enables HTTPS encryption on sites that support it, making for more secure browsing.
  • uBlock Origin: An efficient blocker that can make bloated web pages load much faster.

How do we get these extensions installed on Selenium’s instance of Firefox?

Getting the Necessary Information

First, we’ll need to find where those extensions are stored locally. Note that this means you’ll need to already have them installed on your machine for regular use of Firefox.

To find them, open up Firefox and navigate to the main drop down menu. Go to “Help”, and then “Troubleshooting Information”. Alternatively, you can get to the same place by entering about:support in your Firefox navigation bar.

In the “Application Basics” section click the “Open Directory” button, and in the file browser that pops up open the “extensions” folder. These are the extension installation files we’ll need to reference in our script. There should be a different “.xpi” file for every Firefox extension you have installed, and the file path to this folder should look something like “C:\Users\Grayson\AppData\Roaming\Mozilla\Firefox\Profiles\3rqg4psi.default\extensions”.

It might be difficult to tell which files correspond to which extensions based on the file names, as the file names are sometimes unintelligible. To get around this, go back to your browser and on the same page as before scroll down to the “Extensions” section. Here you’ll find a table that pairs each extension name with its corresponding ID, and the ID should be almost the same as the installation file name, lacking just the “.xpi” suffix.
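If you’d rather check from a script, a quick sketch like this lists the .xpi files in that folder (the profile path is the example from above and will be different on your machine):

    import os

    # Example profile path from above -- yours will have a different
    # user name and profile ID.
    extensions_dir = (r"C:\Users\Grayson\AppData\Roaming\Mozilla\Firefox"
                      r"\Profiles\3rqg4psi.default\extensions")

    # Print every extension installation file in the folder.
    for name in os.listdir(extensions_dir):
        if name.endswith(".xpi"):
            print(name)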

In our case though, the extension and file names aren’t too hard to match:

  • HTTPS Everywhere: https-everywhere@eff.org.xpi
  • uBlock Origin: uBlock0@raymondhill.net.xpi

Selenium with Firefox Extensions

Now we just need to add a few lines of code to our original script to install these extensions. We’ll perform the installations right after we initialize the browser.
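One way to do that with Selenium’s Python bindings is install_addon(), which loads an .xpi file into the running Firefox instance. Here’s a sketch; the profile path and CSS selector are the same assumptions as before:

    import os
    from selenium import webdriver
    from selenium.webdriver.common.by import By

    # Example profile path from earlier -- adjust for your machine.
    extensions_dir = (r"C:\Users\Grayson\AppData\Roaming\Mozilla\Firefox"
                      r"\Profiles\3rqg4psi.default\extensions")

    driver = webdriver.Firefox()

    # Install the two extensions right after the browser starts.
    driver.install_addon(
        os.path.join(extensions_dir, "https-everywhere@eff.org.xpi"))
    driver.install_addon(
        os.path.join(extensions_dir, "uBlock0@raymondhill.net.xpi"))

    # Scrape the trending list exactly as before.
    driver.get("https://www.yahoo.com")
    trending = driver.find_elements(By.CSS_SELECTOR, "ul.trending-list li a")
    for rank, item in enumerate(trending[:10], start=1):
        print("{}.{}".format(rank, item.text))

    driver.quit()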

Run this script and see if you get the same results as last time. Look for the extension symbols near the top right of the browser. You should see the blue-and-white “S” symbol for HTTPS Everywhere and the reddish badge symbol for uBlock Origin.

The output should match the earlier run (partial output):

    2.Faith Hill
    4.Nicki Minaj
    6.Cox Cable
    8.Airbnb Vacation Rentals
    10.Ally Bank

So there you have it. We performed the same operation, but got to take our two favorite Firefox extensions along for the ride.

In addition to the peace of mind knowing that HTTPS security was used whenever possible, you may have noticed that our second script took significantly less time to load the page. This is because uBlock Origin blocked a number of unnecessary, resource-intensive requests, a great feature to have when you’re dealing with the slow, bloated web pages that are all too common nowadays.
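If you want to put a rough number on that, a quick sketch like this times the page load (network conditions vary, so treat the result as a ballpark):

    import time
    from selenium import webdriver

    driver = webdriver.Firefox()

    # Time how long the initial page load takes.
    start = time.time()
    driver.get("https://www.yahoo.com")
    print("Page loaded in {:.1f} seconds".format(time.time() - start))

    driver.quit()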

Anyways, I hope this gives you a few ideas as to how you can make your life a little more convenient. Let me know if you have any questions, and happy automating.

April 17, 2019

Sitemap.xml, Release

We are happy to announce that Web Scraper 0.4.0 has been released. This release contains a new selector, updates to other selectors, and an improved CSS selector generator. Starting from version 0.4.0, Web Scraper is also available in Firefox.

Sitemap.xml link selector

Many websites want to be crawled by scrapers. For example, news outlets want their articles to appear in search engine results. For this to happen, a search engine has to crawl the entire site. The site can make this easier by listing all of the relevant URLs in a sitemap.xml file, which makes the crawler's job more efficient and also helps ensure that everything within the site gets indexed.
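To make that concrete, here is a small sketch, separate from Web Scraper itself, of what reading a sitemap.xml boils down to (example.com is a placeholder):

    import urllib.request
    import xml.etree.ElementTree as ET

    # Sitemap files use this XML namespace for their elements.
    SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

    # example.com is a placeholder URL.
    with urllib.request.urlopen("https://example.com/sitemap.xml") as response:
        tree = ET.parse(response)

    # Each <loc> element holds one URL the site wants crawled.
    for loc in tree.iter(SITEMAP_NS + "loc"):
        print(loc.text)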

With the Sitemap.xml Link selector you can leverage this feature to access all of the relevant URLs on a site without having to build a path through it using Link selectors for navigation and pagination. With a single selector you can access every product page in an e-commerce site. It is always worth checking whether the site has sitemap.xml files before creating other selectors, as using this method can speed up the scraper configuration significantly.

When using the Sitemap.xml Link selector, use the Add from robots.txt button to automatically discover sitemap.xml links. If no links are discovered, you can manually check whether an example.com/sitemap.xml page exists. Add child selectors under the Sitemap.xml Link selector to extract data from the URLs that the sitemap.xml file leads to.

Element click selector

With this release it is now possible to add an Element click selector under another Element click selector. With this feature you can go through multiple product color/size variations within a single product page to get the SKU and the price for every variation.

You can also now use the Element click selector to click through options within a <select> element.

Element scroll down selector

The Element scroll down selector now scrolls down with a smooth animation. It will additionally try a few tricks to trigger the data load event within the website. Generally, the Element scroll down selector isn't as reliable as Link selectors, but with this update it should also work in some additional edge cases.

Firefox

I'll start by saying a big thanks to the Firefox team. They have done a lot of work to bring the WebExtensions API into their browser. The most painful part of this was probably having to remove their previous add-on API, along with all of the add-ons that developers had been building for years. Despite this, it was a good choice. The WebExtensions API is compatible with other browsers and removes the overhead of developing the same solution for different platforms.

You can download the Firefox version of Web Scraper here. If the Firefox version isn't behaving as expected, please let us know by posting a bug report in the Web Scraper Forum.

CSS Selector generator

When you select an element within a page, Web Scraper generates a CSS selector. In this release we made some improvements to the CSS Selector generator. When generating a selector, it will now also try to use element attributes and their values, and it will generate better selectors for description lists using the :contains() selector. We also made some tweaks to reduce the use of the order-based :nth-of-type() selector, which frequently doesn't work well across multiple pages.
