IMPORTXML Data Collection in Google Sheets: Tutorial

Collecting web data can be challenging for beginners. Discover how to pull data from any web page into Google Sheets with the IMPORTXML function using this comprehensive guide.

We've all been in a situation where we've had to pull data from a site at some point.

When working on a new project or campaign, you may simply not have the data or information you need to create ads.

Those who don't have web scraping tools, or the coding skills to use Python, may have found themselves copying and pasting hundreds or thousands of pieces of data manually.

In a recent project, I was asked to:

  • Go to the customer's site.
  • Locate over 150 new products spread across 15 different pages.
  • Copy and paste the product name and landing page URL for each product into a spreadsheet.

You can imagine how long the task would have taken if the work had been done manually.

Not only would this take time, but the risk of making a mistake would also be high: it would then require even more time to check the data and ensure that it is error-free.


Let me show you how to collect data with Google Sheets and its IMPORTXML function without needing any development skills.

READ MORE: How do you collect data with Python?

How to use IMPORTXML in Google Sheets to collect data

Do you want to collect data from the internet... quickly?

Maybe you want to copy a table from the site, or maybe you want to quickly retrieve SEO elements from a competitor.

IMPORTXML is here to help you automate your data collection operations in Google Sheets.

What is the IMPORTXML function?

The IMPORTXML function in Google Sheets imports data from various types of structured data, including XML, HTML, CSV, TSV, and RSS and Atom XML feeds.

It extracts information from any XML field, that is, any content enclosed in tags such as <tag> and </tag> (and therefore web pages).

You can use it to import information that is publicly available from the web.

The syntax of the IMPORTXML function is as follows:

Google Sheets documentation: IMPORTXML function
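The function takes the URL of the page and an XPath query (recent versions of the documentation also list an optional locale argument):

=IMPORTXML(url, xpath_query)

For example, the call below, adapted from Google's documentation, imports all the link targets found on a Wikipedia page:

=IMPORTXML("https://en.wikipedia.org/wiki/Moon_landing", "//a/@href")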

You are therefore not allowed to use it with websites that require you to log in to access their content. For more information, the article below should interest you.

READ MORE: Is web scraping legal?

Basics you need to know to use IMPORTXML

IMPORTXML function: imports data from various types of structured data (XML, HTML, CSV, TSV, RSS).


The basics of the URL argument

ImportXML URL
  • The url argument is the URL of the web page from which you want to import data.
  • Include the protocol (https:// or http://).
  • The URL value must either be enclosed in quotation marks or be a reference to a cell that contains the appropriate text (both forms are shown below).
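For instance, both of the following forms work (the URL here is purely illustrative):

=IMPORTXML("https://example.com/blog", "//title")
=IMPORTXML(A2, "//title")

In the second form, cell A2 simply contains the URL as text.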

The basics of the XPath query argument

  • The argument xpath_query is the XPath query to be executed on the data in the given URL.
  • Each result of the XPath query is placed in its own row in the spreadsheet.
  • XPath is a query language used to retrieve information elements from sites.
  • It has its own syntax that you can learn in detail from online resources.

There are a few basic rules for creating your own xpath_query argument (a combined example follows these rules):

The double slash:

  • The double slash "//" means selecting all the elements of the tag written after it.
  • For example, //h1 means select all h1 heading elements, //p means all p elements, and so on.

The class

  • [@class='...'] means that you want to select only the elements that meet the given criterion.
  • For example, //h1[@class='title'] only imports h1 elements whose class attribute is "title".

Elements

  • You can define these elements on multiple levels as they appear in the site source.
  • For example, //h1/span will look at the h1 elements and then the span elements that they contain.
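Putting these rules together, a formula such as the one below (the URL and class name are illustrative, not taken from a real page) would import every h1 element with the class "title":

=IMPORTXML("https://example.com", "//h1[@class='title']")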

XPath Basics


To use XPath, you need to understand the basics of HTML.

  • Website data is displayed with HTML, which has the same nested tag structure as XML.
  • XPath is the language used to query that structure.
  • You can customize these XPath arguments based on what you are looking for on the site.

XPaths examples for marketers

Check out these useful XPath extraction shortcuts to gather data from the web more quickly (example formulas follow the list):

  • All links on a page: //@href
  • Extract all internal links from a page: //a[contains(@href, 'example.com')]/@href
  • Extract all external links from a page: //a[not(contains(@href, 'example.com'))]/@href
  • Page title: //title
  • H1(s): //h1
  • Meta description: //meta[@name='description']/@content
  • Canonical: //link[@rel='canonical']/@href
  • Robots: //meta[@name='robots']/@content
  • Hreflang attributes: //link[@rel='alternate']/@hreflang
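For example, assuming the URL of the page you are auditing is stored in cell A2 (the cell reference is only an illustration), the following formulas return its title and meta description:

=IMPORTXML(A2, "//title")
=IMPORTXML(A2, "//meta[@name='description']/@content")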

The IMPORTXML function allows you to import site data directly into your spreadsheets.

All you need to do is consult the source code of the website from which you want to import data, find the HTML element that can be used, and retrieve it using XPath queries.

What are the steps to use IMPORTXML in Google Sheets?

1. Start by opening a new Google sheet

First, we open a new, blank Google Sheets document:

Google Sheets

For this example, you can use the Google Sheets sample sheet:

Google Sheets example sheet: Data collection with ImportXML

2. Add the content you need to extract

Add the URL of the page (or pages) whose information we want to extract from the web.

In our case, we are going to extract the page titles of all the articles, as well as their URL and meta descriptions.

So our starting URL will be as follows: https://www.sales-hacking.com/blog

Data collection with Google Sheets

3. Finding the XPath

Use the browser's Inspector

To find out what the XPath looks like and how to locate it easily, all you have to do is use a simple tool present in all modern browsers: the Inspector.

This XPath will allow us to retrieve exactly the information we want from the page, in one of two ways:

  • Relative: I want all the blog titles on the web page
  • Specific: I only want the first 10 or 30 article titles on the web page, or just one in particular (see the short illustration below)
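As a rough illustration (the tag name here is hypothetical, not taken from the actual page):

  • Relative: //h3 returns every article title on the page
  • Specific: (//h3)[1] should return only the first one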

In our example, let's start with the titles of the last 30 articles.

  1. Open the page in your preferred browser.
  2. Hover over the title of one of the articles and right-click.
  3. Select Inspect.
Use the browser inspector to find the XPath

Once clicked, the browser will show you the HTML code of the web page, as below:

HTML code of the page

Does that not mean much to you? Don't worry: as we'll see in the step below, the hardest thing you'll have to do is, once again, a right-click.

Copy/paste the XPath

The Inspector (generally available in the Developer Tools section of your browser) allows you to directly select the page element for which you want more information:

  1. Select the part you are interested in on the page with the Inspector, i.e. in this example: the title of one of the articles
  2. Make sure the article title is still selected and highlighted, then right-click again and choose Copy > Copy XPath.
Retrieving the XPath with the browser inspector

That's it, you've just selected the XPath that you're going to use right away in Google Sheets.

4. Extracting data into Google Sheets

To try it yourself in the Google Sheets example, you can use the example sheet below (Create a copy to use it):

Google Sheets example sheet: Data collection with ImportXML

Note: Data collection for the blog posts in this article has been deliberately limited to 10 items to avoid any restrictions by Google services.

Retrieving blog article titles

Back in your Google Sheets document, introduce the IMPORTXML function as follows:

ImportXML function example
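The exact formula is only visible in the screenshot above, but based on the specific XPath shown in the next step (which ends in /a/h3), a relative version of it would look something like this, with the blog URL stored in cell B1:

=IMPORTXML(B1, "//h3")

This returns every h3 element on the page, which on this blog corresponds to the article titles.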

Data can also be retrieved using specific XPaths in Google Sheets.

For example, in step 3, we selected the XPath for a specific article which is this:

=IMPORTXML(B1, "/html/body/div[3]/div[1]/div[2]/div[3]/div/div["&B3&"]/div/div/a/h3")
Retrieving the Title by Specific XPaths

First of all, in our formula, we replaced the page URL with the reference to the cell where the URL is stored.

Secondly, when you insert the copied XPath into the formula, it must always be enclosed in quotation marks.

Retrieving blog post URLs

This is what it looks like in the Google Sheets document:

Extracting URLs using IMPORTXML in Google Sheets

As you can see, the list returns all the articles and URLs that are on the page we just extracted.
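The formula itself only appears in the screenshot, but a typical way to retrieve these URLs, assuming the blog URL is in B1 and the article links all contain /blog/ in their href (an assumption, not something taken from the screenshot), would be:

=IMPORTXML(B1, "//a[contains(@href, '/blog/')]/@href")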

You can also apply this principle to extract any other piece of information needed to set up your projects, such as retrieving the meta descriptions of each page with Google Sheets (an example can be found in the file below):

Google Sheets example sheet: Data collection with ImportXML
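As a sketch (the cell reference is illustrative), retrieving the meta description of a page whose URL is stored in A2 looks like this:

=IMPORTXML(A2, "//meta[@name='description']/@content")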
Complete example sheet

How can I automate the data collection process with IMPORTXML in Google Sheets?

You can automate the data collection process using Google Apps Script or third-party programming tools like Python.

Scripts can be used to set a data collection schedule, add extracted data to a spreadsheet, and send update notifications.

Other Examples of IMPORTXML formulas

Let's say you want to import the paragraph titles from this Wikipedia article with IMPORTXML from Google Sheets.

After examining the source code, you can see that the titles are enclosed in <h2> tags.

So that's what you need to put in your IMPORTXML formula. In XPath syntax, you write it as "//h2".

Try the following function where cell A2 contains the article URL:

Example of the IMPORTXML function: =IMPORTXML(A2, "//h2")

First, you'll see a "Loading..." message while the data import is in progress. Importing from the site takes some time (but it's still faster than doing it manually).

After a few seconds, the function returns what it found on the site inside the <h2> tags.

This solution is not perfect, because the site adds an extra [edit] label next to each title, but it does return the paragraph titles.

This time, visible text has been imported from the site.

But with the IMPORTXML function, you can import all the data that the HTML source contains. For example, it is very useful to import meta information, link sources, or any other hidden HTML information that is not visible when browsing the site.

Let's get the links to YouTube videos from the Google privacy policy site.

In this case, the first argument to the IMPORTXML function is the URL of this site.

After looking at the site's source, you may realize that YouTube videos are enclosed in <iframe> tags.

So you need to add //iframe to the function's xpath_query argument.

Then, you need to go deeper into this <iframe> tag to find the required data that you want to display.

The source of the video link can be found in the src attribute.

Add a slash "/" after the first part of the XPath query, followed by the src attribute.

The "@" character indicates that it is an attribute. The whole xpath_query will be: //iframe/@src.

And the whole function for importing YouTube links is:

Example of using the IMPORTXML function: =IMPORTXML(A2, "//iframe/@src")

Sometimes it's not so obvious what differentiates HTML from the content you want to import, but XPath allows you to create more complex queries that extract only the data you want.

XPath has some functions that you can include in your queries, such as "contains" or "starts-with".
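For instance (this formula is an illustration, not taken from the article's example sheet), starts-with could be used to import only the links whose href begins with https:

=IMPORTXML(A2, "//a[starts-with(@href, 'https')]/@href")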

Let's see how to import all the references (citations) at the bottom of the Google Sheets article from Wikipedia.

After looking at the site source, you can see that the references are enclosed in <li> tags.

But that's not enough, because there are a lot of <li> tags on the site.

You need to find something more that differentiates these list items from other parts of the site.

Here, it is the id attribute of the references, which all contain the term "cite_note".

The XPath function contains can be used to import only the <li> items whose id contains the term 'cite_note'.

That's what the following IMPORTXML function in Google Sheets does:

Example of using the IMPORTXML function: =IMPORTXML(A2, "//li[contains(@id, 'cite_note')]")

And it returns all the references that were found on the site with that ID.

What type of data can I collect with IMPORTXML in Google Sheets?

You can collect a variety of data from websites using IMPORTXML, such as prices, product descriptions, contact information, business dates and hours, exchange rates, formatting info, and more.

When should IMPORTXML be used to collect data?

  • Extraction of structured data from websites or XML/Atom feeds
  • When you need page data but don't want to use tools
  • When you need page data but don't know how to code

Is the use of IMPORTXML to collect data legal and ethical?

In most cases, using IMPORTXML to collect data is legal as long as the data collected is not used for illegal purposes.

However, it is important to respect the terms of use of the sites whose data is collected. In some cases, websites may prohibit the extraction of data for commercial purposes or prohibit the use of automated extraction methods such as IMPORTXML.

Before using IMPORTXML to collect data, it is important to check the terms of use of the site.

READ MORE: List of tools for collecting data on the web

FAQs

What is IMPORTXML in Google Sheets?

IMPORTXML is a function built into Google Sheets that allows you to retrieve structured data from a website and display it in your spreadsheet.

How does IMPORTXML work in Google Sheets?

IMPORTXML works by using XPath expressions to extract data from a web page.

XPath expressions are instructions that tell IMPORTXML where to find the data on the page.

Is IMPORTXML easy for beginners to use?

Although using IMPORTXML may seem complex at first glance, with a bit of practice, the function can be used easily even by beginners.

There are also numerous online resources, such as video tutorials and community support forums, that can help you learn how to use IMPORTXML effectively.

Conclusion

As you can see, IMPORTXML from Google Sheets can be a very powerful feature in your arsenal.

This gives you a fully automated and error-free method for extracting data from (potentially) any web page, whether it's product content and descriptions or e-commerce data such as product price or shipping costs.

At a time when information and data may be the advantage needed to achieve better-than-average results, the ability to collect web page data and structured content in a simple and fast manner can be invaluable.

Additionally, as discussed above, IMPORTXML from Google Sheets can help reduce execution times and the risk of errors.

Furthermore, this Google Sheets feature is not a tool reserved exclusively for PPC tasks: it can be really useful in many different projects that require scraping the web, including SEO and content tasks.

Stephen MESNILDREY
Digital & MarTech Innovator
