How do I grab all URLs from a Site?
For simple sites that are not too large, you can use a tool like this one.
Prerequisites:
- WSL2 or a Linux environment with grep (usually preinstalled)
- A website with a sitemap index
Step 1: Find the sitemap
Sitemaps list a site's pages so search engines can crawl them. You can usually find one at a common path such as /sitemap.xml, in the site's robots.txt file, or with an online sitemap-finder tool.
Here is a link to an article that can help you find a sitemap.
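If you prefer to check from the terminal, the two commands below will do it. The domain (example.com) is just a placeholder; swap in the site you care about.
# Look for a "Sitemap:" line in the site's robots.txt
curl -s https://www.example.com/robots.txt | grep -i sitemap
# Or test the most common location directly
curl -sI https://www.example.com/sitemap.xml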
Step 2: Get the sitemap files
Sites with thousands of pages or more typically use a sitemap index: a file that links to several gzipped (.gz) files, each containing an actual sitemap.
Ballotpedia is an example. Copy the links to the gzipped files (.gz), download them, open each one with a tool like 7-Zip, and extract the contents to a folder, preferably one dedicated to this project. You will have to do this manually, and yes, it is annoying, but it should not take long.
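If you would rather skip the manual clicking, the same WSL2/Linux terminal from Step 3 can download and decompress the files for you. This is only a sketch: the URLs below are placeholders for the .gz links you copied from the sitemap index, and it assumes wget and gunzip are available (they usually are).
mkdir sitemaps && cd sitemaps
# Download each gzipped sitemap listed in the index (placeholder URLs)
wget https://www.example.com/sitemap-1.xml.gz
wget https://www.example.com/sitemap-2.xml.gz
# Decompress everything in place, leaving plain .xml files
gunzip *.gz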
Step 3: Extract the links
Using WSL2 (or just a terminal if you are on Linux), navigate to the folder where you put your sitemap files and run the following command for each sitemap file you have. Replace sitemapname.xml with the name of the sitemap file you extracted, and outputfile.txt with whatever you want your output file to be named.
grep -Po '<loc>\K.*?(?=</loc>)' sitemapname.xml > outputfile.txt
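If you ended up with several sitemap files, a small bash loop saves you from retyping the command. This assumes every sitemap in the folder ends in .xml; each one gets a matching .txt file.
# Run the same extraction for every sitemap in the folder
for f in *.xml; do
  grep -Po '<loc>\K.*?(?=</loc>)' "$f" > "${f%.xml}.txt"
done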
That’s it!
You will now have text files that, together, contain links to every page on the website.
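If you would rather have a single list, you can merge the per-sitemap files and count how many URLs you collected (assuming the .txt files from the previous step are in the current folder):
# Merge every per-sitemap list into one file
# (named .list so a rerun does not pick it up as input again)
cat *.txt > all-links.list
# Count how many URLs you ended up with
wc -l all-links.list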