How to Scrape the URLs of a WordPress Website?
You are about to publish a new website and need to create URL redirects, but how? Copy-pasting URLs by hand is a lot of work. There are some tools for scraping URLs, but they are pricey.
There is an easy and free way to collect the URLs of a WordPress website using Node.js and the command prompt / terminal.
Step 1 – Export
Log in to the WordPress admin panel and go to Tools > Export. Choose All content and click Download Export File.
Step 2 – Node.js
Download and install Node.js on your computer. You can download it from the official Node.js website (nodejs.org).
Step 3 – Project folder and script file
Create a folder for the project. Place the WordPress export file (XML format) inside the folder.
Create a JavaScript file with a code editor or Notepad. Copy-paste the following script and save it as, for example, scriptname.js
Remember to change the file paths to the correct ones. See the xmlFilePath and outputFilePath lines near the top of the script.
Script:
const fs = require('fs');
const { parseString } = require('xml2js');
const xmlFilePath = '/path/to/file.xml'; // Replace with the path to your WordPress export file
const outputFilePath = '/path/to/output-file.txt'; // Replace with the desired path for the output file
fs.readFile(xmlFilePath, 'utf-8', (error, data) => {
  if (error) {
    console.error('Error reading the XML file:', error);
    return;
  }
  parseString(data, (error, result) => {
    if (error) {
      console.error('Error parsing the XML:', error);
      return;
    }
    const urls = extractUrls(result);
    writeUrlsToFile(urls, outputFilePath);
  });
});
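function extractUrls(xmlData) {
  const urls = [];
  // Fall back to an empty list if the export contains no items
  const items = xmlData.rss.channel[0].item || [];
  items.forEach(item => {
    // Skip items that have no link element or an empty one
    const link = item.link && item.link[0];
    if (link) {
      urls.push(link);
    }
  });
  return urls;
}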
function writeUrlsToFile(urls, filePath) {
  const content = urls.join('\n');
  fs.writeFile(filePath, content, 'utf-8', (error) => {
    if (error) {
      console.error('Error writing the file:', error);
      return;
    }
    console.log('URLs written to file:', filePath);
  });
}
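An "All content" export usually also contains attachments, menu items and drafts, so the URL list may include entries you do not need to redirect. The sketch below is one possible variant of extractUrls that keeps only published posts and pages. It assumes that each item in the export carries wp:post_type and wp:status elements, and that xml2js is used with its default options, where every child element is parsed into an array. The function name extractPublishedUrls and the allowedTypes parameter are made up for this illustration.

```javascript
// Hypothetical variant of extractUrls: keeps only published items of the given types.
// Assumes xmlData is the object produced by xml2js parseString with default options.
function extractPublishedUrls(xmlData, allowedTypes = ['post', 'page']) {
  const items = xmlData.rss.channel[0].item || [];
  const urls = [];
  items.forEach(item => {
    // With default xml2js options each child element is an array of values
    const type = (item['wp:post_type'] || [])[0];
    const status = (item['wp:status'] || [])[0];
    const link = (item.link || [])[0];
    if (link && status === 'publish' && allowedTypes.includes(type)) {
      urls.push(link);
    }
  });
  return urls;
}
```

To try it, replace the call extractUrls(result) in the script with extractPublishedUrls(result).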
Step 4 – Open Terminal
Open Command Prompt (PC) or Terminal (Mac).
a) Go to the project folder by typing cd path/to/folder (on Mac it could be, for example: cd /Users/firstname.lastname/Documents/foldername)
b) Initialize a Node.js project by typing:
npm init -y
c) Install the xml2js library to parse the XML file:
npm install xml2js
d) Run the script:
node scriptname.js
Step 5 – The result
This creates a text file at the output path you set in the script. You can copy the URLs from the text file and paste them into Google Sheets / Excel. After assigning the redirect targets, you can export the sheet as a CSV file and import it into your WordPress redirect plugin of choice.
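If you would rather skip the copy-paste step, the text file can also be turned into a CSV starting point with a few more lines of Node.js. This is only a rough sketch: the helper name urlsToRedirectCsv, the column names source and target, and the file paths in the usage comment are placeholders, and the columns your redirect plugin expects may differ. It reduces each URL to its path and query string, removes duplicates, and leaves the target column empty for you to fill in.

```javascript
const fs = require('fs');

// Hypothetical helper: turns a list of absolute URLs into CSV text with a
// "source" column (old path) and an empty "target" column to fill in later.
function urlsToRedirectCsv(urls) {
  const seen = new Set();
  const rows = ['source,target'];
  urls.forEach(line => {
    const trimmed = line.trim();
    if (!trimmed) return; // skip blank lines
    let oldPath;
    try {
      const parsed = new URL(trimmed);
      oldPath = parsed.pathname + parsed.search;
    } catch (e) {
      return; // skip anything that is not a valid absolute URL
    }
    if (seen.has(oldPath)) return; // skip duplicates
    seen.add(oldPath);
    // Quote the value so commas or quotes in the path do not break the CSV
    rows.push('"' + oldPath.replace(/"/g, '""') + '",');
  });
  return rows.join('\n');
}

// Example usage (file paths are placeholders):
// const urls = fs.readFileSync('/path/to/output-file.txt', 'utf-8').split('\n');
// fs.writeFileSync('/path/to/redirects.csv', urlsToRedirectCsv(urls), 'utf-8');
```

Keeping only the path makes the list independent of the old domain, which helps when the new site lives at a different address. Check your plugin's import format before relying on the column layout.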