What is the principle of collecting links from a web page with JavaScript?
Extracting and cleaning data from websites and documents is a daily task.
I enjoy learning how to systematically extract data from multiple web pages, and even multiple websites, using Python and/or Web Scraping tools.
But sometimes a project only requires a small amount of data from a single page of a website. Previously, when a case like this came up, I would always launch my development environment and write and execute a script to extract this information. That's like using a sledgehammer to crack a nut.
The good old JavaScript is powerful enough to extract information from a single web page.
This JavaScript code in question can be executed in the browser development console in a matter of seconds.
In this example, I extract all the links from a web page, because it's a task that many people like me regularly perform on web pages.
However, this code would work just as well to extract any other type of element from HTML documents, with a few minor changes.
When this code runs, it opens a new tab in the browser and creates a table containing the text of each hyperlink and the link itself.
Javascript code and how it works
Open your browser and go to the page you want to extract links from
Go to the target page
In our case, we will use the Blog page of this website as an example, as it is a page that contains a significant number of links.
All that's left for us to do is open the developer console and run the code.
Open the browser console
To open the developer console, you can right-click on the page and select “Inspect” or “Inspect Element”:
Once done, the HTML code of the page should appear in your browser (at the bottom or on the side). Don't worry at this point if you don't understand what's on the screen; that's okay.
Execute the code in the browser console
Now click on the Console tab to make the browser console appear, as its name suggests.
Once on the Console, it may not be empty, and there may already be a lot written in it. Once again, don't worry: you can clear the history to start with a nice “clean” section.
To erase the contents of the console, right-click inside it and select “Clear console history.”
At this point, you have almost reached the end of the method.
You're going to copy the code snippet below to your console.
Here it is in text format for easy copy/paste.
var x = document.querySelectorAll("a");
var myarray = [];
for (var i = 0; i < x.length; i++) {
    var nametext = x[i].textContent;
    var cleantext = nametext.replace(/\s+/g, ' ').trim();
    var cleanlink = x[i].href;
    myarray.push([cleantext, cleanlink]);
}
function make_table() {
    var table = '<table><thead><th>Name</th><th>Links</th></thead><tbody>';
    for (var i = 0; i < myarray.length; i++) {
        table += '<tr><td>' + myarray[i][0] + '</td><td>' + myarray[i][1] + '</td></tr>';
    }
    table += '</tbody></table>';
    var w = window.open("");
    w.document.write(table);
}
make_table();
All you need to do is paste this piece of code into the console.
Retrieve the result of the Javascript code
This last step is the easiest: all you have to do is press the Enter key.
This will open a new tab in your browser with a table containing all the link texts and hyperlinks for the web page you have chosen.
This table can then be copied and pasted into a spreadsheet or document to be used as you see fit.
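As a side note, if you would rather keep the data in the console than open a new tab, the same array can be passed to console.table, which browsers render as a readable grid. This is a minimal sketch with sample data standing in for the scraped array:

```javascript
// Sample [text, link] pairs standing in for the scraped "myarray".
var myarray = [
  ["Home", "https://example.com/"],
  ["Blog", "https://example.com/blog"]
];

// console.table renders an array of arrays as a grid in the DevTools console.
console.table(myarray);
```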
What does this Javascript code do step-by-step to collect data from the web?
Here's a breakdown of the code and what each aspect does.
Step 1: Declaring variables
Here, we find all the “a” elements on the page (“a” elements are links) and assign them to the variable x. Then we create an array variable, myarray, but leave it empty for now.
Step 2: Loop over all the links
We then loop over all of the “a” elements in x, and for each element, we extract its text content and its link.
For the text content, we collapse every run of whitespace into a single space and trim the text, as link text can contain large amounts of whitespace that would make our table unreadable.
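To make this cleanup step concrete, here is what the replace-and-trim combination does to a typical scraped link text (the sample string is illustrative):

```javascript
// Link text scraped from a page often contains newlines, tabs, and runs of spaces.
var nametext = "  Read\n   the   blog  ";

// \s+ matches any run of whitespace; replace each run with one space, then trim the ends.
var cleantext = nametext.replace(/\s+/g, " ").trim();

console.log(cleantext); // → "Read the blog"
```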
Step 3: Creating the Table
- We then create the table using the “make_table()” function.
- This function creates a variable, table, with the start of an HTML table and the table headers.
- We then use a “for” loop to add table rows containing the link text and hyperlinks.
- Finally, we open a new window using “window.open()” and write the HTML table into it using “document.write()”.
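The table-building loop described above can also be sketched with map and join, which avoids growing a string inside a loop. This is only an alternative sketch, with sample data standing in for the scraped array:

```javascript
// Sample [text, link] pairs standing in for the scraped "myarray".
var myarray = [
  ["Home", "https://example.com/"],
  ["Blog", "https://example.com/blog"]
];

// Turn each pair into a table row, then concatenate all rows at once.
var rows = myarray
  .map(function (pair) {
    return "<tr><td>" + pair[0] + "</td><td>" + pair[1] + "</td></tr>";
  })
  .join("");

var table =
  "<table><thead><tr><th>Name</th><th>Links</th></tr></thead><tbody>" +
  rows +
  "</tbody></table>";
```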
Disadvantages of this Javascript code for collecting data on the web
There is a downside to the current code: it will take ALL the links on a page. This means all the links in the menus, all the internal links that take you to other pages on the website, and so on.
You could be more specific and look for all the “a” elements in certain areas of the web page. For the current page, simply edit the first line of code so that querySelectorAll examines a targeted area of the page.
To do this, right-click on the area of the page whose links you want to extract and click on “Inspect.”
We can then simply edit the first line to find all the “a” elements inside the element with the class name “classname”:
var x = document.querySelectorAll(".classname a");
What to use on a daily basis for Web Scraping
For quick jobs like this one, JavaScript is very useful.
If you have development skills
However, I mainly use Python for the vast majority of my Web Scraping work. Still, it's useful to have a quick and easy way to extract information from a web page without needing to open any application other than the browser.
READ MORE: How do you collect data on the web with Python?
If you don't have development skills
1) Use web scraping tools
There are many tools that are very practical for extracting data from the web.
In addition, data can easily be exported in the form of spreadsheets (XLS, CSV, etc.) or an API.
Although Web Scraping can be done by hand (by copying and pasting), in most cases, these tools are less expensive, free of human errors, and allow for the collection of significant amounts of data.
2) Learn development
When I talk here about learning development, this does not necessarily mean becoming an expert, but being able to at least understand a piece of code, rework it, or write part of it if necessary.
To do this, training courses for all budgets are available on Udemy.
I take at least one course per month myself to stay informed and improve my skills in many areas.
Good training and happy web scraping to you!