5 Python SEO Scripts to Automate Your Workflow

SEO is important for improving website ranking, but it can take a lot of time and effort. Many tasks, like checking keywords or analyzing data, are repetitive and boring. Doing them manually can be slow and lead to mistakes. That’s where Python comes in. Python is a powerful tool that can automate these tasks. It makes the work faster, easier, and more accurate. By using Python, you can save time and focus on planning and improving your SEO strategy.

In this article, we will share 5 Python scripts for SEO. These scripts will help with tasks like keyword research and competitor analysis. They are easy to use and will make your work faster. Let’s get started!

Prerequisites to Run Python

  • Install Python: You need Python installed on your system, or you can use Google Colab.
  • Code Editor or IDE: Use any editor like VS Code, PyCharm, or Jupyter Notebook.
  • Basic Python Knowledge: Understand basic Python concepts like variables and loops.
  • Library Installation: You need to know how to install Python libraries with pip (a quick check is shown below).
  • Internet Access: Needed for installing packages or working with online data.
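
If you are not sure whether your environment is ready, a minimal check like the sketch below (assuming Python 3 and pip are available on your system) prints your Python version and installs a library such as requests so you can confirm imports work.

import sys
import subprocess

# Print the Python version you are running
print("Python version:", sys.version)

# Install a library with pip from inside Python
# (equivalent to running `pip install requests` in a terminal)
subprocess.run([sys.executable, "-m", "pip", "install", "requests"], check=True)

import requests  # If this import succeeds, the library is ready to use
print("requests version:", requests.__version__)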

Unlock SEO Insights with Sitemap Scraping

Sitemap scraping is useful for SEO. It helps you find all the pages a website wants search engines to index, including hidden or missing pages. You can check your site for errors and ensure all important pages are included.

For competitors, it reveals their site structure and content strategy. This can give you new ideas and show gaps in your own SEO plan. You can also monitor changes, track broken links, and ensure your sitemap is organized for better crawling.

Scraping also helps analyze content types and find top-performing pages. Using sitemap data with analytics tools shows how your pages are performing. Overall, it’s a simple way to improve your site’s SEO and performance.

Essential Skills and Libraries for Sitemap Scraping

  • Install Required Libraries: Install beautifulsoup4 (BeautifulSoup), requests, and lxml (used by BeautifulSoup's XML parser) to fetch and process sitemap data using:
    pip install beautifulsoup4 requests lxml
  • Know Variables: Understand how to use variables to store and use data.
  • Know Loops and Functions: Learn how to use loops (like for) and functions to process many URLs.
  • Basic XML Knowledge: Understand how XML files work since most sitemaps are in XML (a short parsing sketch follows this list).
  • Handle Errors: Learn how to use try-except to manage errors like broken links or missing sitemaps.
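
To make the XML part concrete, here is a minimal sketch (using a made-up inline sitemap string) of how BeautifulSoup's xml parser pulls URLs out of <loc> tags. The full script below applies the same idea to real sitemaps fetched over HTTP.

from bs4 import BeautifulSoup

# A tiny, made-up sitemap used only for illustration
sample_sitemap = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/page1</loc></url>
  <url><loc>https://example.com/page2</loc></url>
</urlset>"""

soup = BeautifulSoup(sample_sitemap, "xml")  # the xml parser requires lxml

# Every <loc> tag holds one URL the site wants indexed
for loc in soup.find_all("loc"):
    print(loc.text)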

Python Script to Scrape Multi-Level Sitemaps and Extract URLs

import requests
from bs4 import BeautifulSoup


def fetch_sitemap(url, visited_sitemaps=None):
    # Track already-visited sitemaps across recursive calls to avoid infinite loops
    if visited_sitemaps is None:
        visited_sitemaps = set()
    try:
        if url in visited_sitemaps:
            return []

        print(f"Fetching sitemap: {url}")
        response = requests.get(url, timeout=10)
        if response.status_code != 200:
            print(f"Failed to fetch sitemap (HTTP Status: {response.status_code})")
            return []

        visited_sitemaps.add(url)

        # Parse the sitemap
        soup = BeautifulSoup(response.text, 'xml')
        urls = []

        # Check for nested sitemaps (a sitemap index file)
        sitemap_tags = soup.find_all('sitemap')
        if sitemap_tags:
            print(f"Found {len(sitemap_tags)} nested sitemaps.")
            for sitemap in sitemap_tags:
                loc = sitemap.find('loc')
                if loc:
                    urls.extend(fetch_sitemap(loc.text, visited_sitemaps))

        # Extract page URLs from the current sitemap (<url> entries only,
        # so nested-sitemap locations are not counted twice)
        url_tags = soup.find_all('url')
        if url_tags:
            print(f"Found {len(url_tags)} URLs in this sitemap.")
            for url_tag in url_tags:
                loc = url_tag.find('loc')
                if loc:
                    urls.append(loc.text)

        return urls
    except Exception as e:
        print(f"An error occurred while processing {url}: {e}")
        return []


# Main function to start scraping
def scrape_multilevel_sitemap():
    sitemap_url = input("Enter the main sitemap URL (e.g., https://example.com/sitemap.xml): ")
    all_urls = fetch_sitemap(sitemap_url)
    if all_urls:
        print(f"\nTotal URLs found: {len(all_urls)}")
        for url in all_urls:
            print(url)
    else:
        print("No URLs found.")


# Run the script
scrape_multilevel_sitemap()

Sample Input

Enter the main sitemap URL (e.g., https://example.com/sitemap.xml): https://example.com/sitemap.xml

You provide the main sitemap URL for the script to process. For example, if the website has a sitemap located at https://example.com/sitemap.xml, the script will start fetching URLs and nested sitemaps from there.

Sample Output

Fetching sitemap: https://example.com/sitemap.xml
Found 2 nested sitemaps.
Fetching sitemap: https://example.com/blog-sitemap.xml
Found 50 URLs in this sitemap.
Fetching sitemap: https://example.com/news-sitemap.xml
Found 30 URLs in this sitemap.

Total URLs found: 80
https://example.com/blog/post1
https://example.com/blog/post2
...
https://example.com/news/article1
https://example.com/news/article2
...

Explanation of Output:

  • The script fetches the main sitemap and identifies 2 nested sitemaps (blog-sitemap.xml and news-sitemap.xml).
  • It then recursively fetches URLs from each nested sitemap.
  • Finally, it displays the total count of URLs and lists them.
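
If you want to keep the extracted URLs instead of just printing them, a small addition like the sketch below (assuming a hypothetical sitemap_urls.csv file name) writes the list returned by fetch_sitemap() to a CSV file for later analysis.

import csv

def save_urls_to_csv(urls, filename="sitemap_urls.csv"):
    # Write one URL per row so the file opens cleanly in any spreadsheet tool
    with open(filename, mode="w", newline="", encoding="utf-8") as file:
        writer = csv.writer(file)
        writer.writerow(["url"])
        for url in urls:
            writer.writerow([url])
    print(f"Saved {len(urls)} URLs to {filename}")

# Example usage:
# save_urls_to_csv(fetch_sitemap("https://example.com/sitemap.xml"))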

Asynchronous Backlink Verification with CSV Updates

When you buy backlinks from someone, it’s important to check if the backlink is actually present on the website. Some dishonest webmasters may remove the backlink after some time, which can harm your SEO efforts. This script helps you verify if your backlinks are still live, ensuring you get what you paid for and maintain your SEO strategy effectively.


Backlink Verification Script

import pandas as pd
import aiohttp
import asyncio
from datetime import datetime

mysite = "thekitchn.com"  # Replace with your site URL
csv_file = 'mybacklink.csv'  # CSV file path

# Define the headers to avoid blocks
HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3', 
    'Accept-Language': 'en-US,en;q=0.9',
    'Referer': 'https://www.google.com/',  # Simulate coming from Google
    'Connection': 'keep-alive'
}

async def check_backlink_status(url, domain, session):
    try:
        async with session.get(url, headers=HEADERS, timeout=10) as response:
            if response.status == 200:
                content = await response.text()
                if domain in content:
                    return "Present"
                else:
                    return "Absent"
            else:
                return "Absent"
    except asyncio.TimeoutError:
        return "Timeout"
    except Exception:
        return "Unable to Fetch"

async def check_backlinks(df, domain):
    async with aiohttp.ClientSession() as session:
        tasks = []
        for url in df['backlink_url']:
            task = check_backlink_status(url, domain, session)
            tasks.append(task)
        
        return await asyncio.gather(*tasks)

def update_csv_with_results(csv_file, domain):
    df = pd.read_csv(csv_file)
    current_date = datetime.now().strftime('%Y-%m-%d')  # Get current date as column name
    
    results = asyncio.run(check_backlinks(df, domain))  # Use asyncio.run
    
    df[current_date] = results
    df.to_csv(csv_file, index=False)
    print(f"Backlink check complete. Results saved in '{csv_file}'.")

# Run the backlink check
update_csv_with_results(csv_file, mysite)

Steps to Configure and Run the Backlink Checker Script

  • Install Required Libraries: Install the necessary Python libraries, aiohttp and pandas, using the following command:
    pip install aiohttp pandas
  • Create the CSV File: Create a CSV file (e.g., mybacklink.csv) with a column named backlink_url. Each row should contain a backlink URL to be checked (a short sketch for generating this file follows this list). For example:

    backlink_url
    https://example.com
    https://anotherexample.com
    https://somedomain.com
  • Set Your Domain: Replace the mysite variable in the script with your domain name:
    mysite = "your-site-url.com"
  • Run the Script: Execute the script to start checking the backlinks in your CSV file.
  • Keep the CSV file in the same folder as the script.
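
If you do not already have a backlink list in CSV form, a quick way to create one is shown below (a minimal sketch, assuming you keep the mybacklink.csv name used in the script and substitute your own URLs).

import pandas as pd

# Hypothetical starting list of backlink URLs to monitor
backlinks = [
    "https://example.com",
    "https://anotherexample.com",
    "https://somedomain.com",
]

# Write them to the CSV file the checker script expects
pd.DataFrame({"backlink_url": backlinks}).to_csv("mybacklink.csv", index=False)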

Example Output

After running the script, the CSV file (mybacklink.csv) will be updated with a new column for the current date:

backlink_url,2024-10-04
https://example.com,Present
https://anotherexample.com,Absent
https://somedomain.com,Unable to Fetch
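
Because each run appends a new date column, you can load the file afterwards and list only the backlinks that came back "Absent" on the latest check. A minimal sketch, assuming the same mybacklink.csv file:

import pandas as pd

df = pd.read_csv("mybacklink.csv")
latest_column = df.columns[-1]  # The most recent date column added by the checker

# Show only backlinks that were not found on the latest check
missing = df[df[latest_column] == "Absent"]
print(f"{len(missing)} backlinks missing on {latest_column}:")
print(missing["backlink_url"].to_string(index=False))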

Broken URL Detector

The Broken Link Checker script extracts URLs from a sitemap, scrapes links from each page, and validates them by checking their HTTP status codes. It identifies broken or invalid links and generates a detailed CSV report with the invalid URLs, the pages they were found on, and their status codes. This tool helps ensure your website’s links are functional and improves SEO health.

Python Script for Broken Link Checking and CSV Reporting

import aiohttp
import asyncio
from bs4 import BeautifulSoup
from urllib.parse import urljoin
import csv

# Function to extract all URLs from the sitemap asynchronously
async def get_sitemap_urls(session, sitemap_url):
    """Extract all URLs from the sitemap, including nested sitemaps."""
    urls = []
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'
    }
    
    print(f"Fetching sitemap: {sitemap_url}")
    try:
        async with session.get(sitemap_url, headers=headers, timeout=10) as response:
            print(f"Response Code: {response.status}")
            if response.status != 200:
                print(f"Failed to access {sitemap_url} with status: {response.status}")
                return urls

            content = await response.text()
            soup = BeautifulSoup(content, 'xml')

            # Check for nested sitemaps
            for sitemap in soup.find_all('sitemap'):
                sitemap_loc = sitemap.find('loc').text
                print(f"Found nested sitemap: {sitemap_loc}")
                urls.extend(await get_sitemap_urls(session, sitemap_loc))

            # Get URLs from current sitemap
            for url in soup.find_all('url'):
                loc = url.find('loc').text
                urls.append(loc)
                print(f"Found URL: {loc}")
    
    except asyncio.TimeoutError:
        print(f"Timeout occurred while accessing {sitemap_url}")
    except aiohttp.ClientError as e:
        print(f"Client error occurred while fetching sitemap: {e}")
    except Exception as e:
        print(f"Error fetching sitemap: {e}")
    
    return urls

# Function to extract links from a page asynchronously
async def extract_links_from_page(session, page_url):
    """Scrape the page and return all URLs found in the body."""
    links = []
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'
    }
    
    print(f"Scraping page: {page_url}")
    try:
        async with session.get(page_url, headers=headers, timeout=10) as response:
            print(f"Page response code: {response.status}")
            if response.status != 200:
                print(f"Failed to load page: {page_url}")
                return links

            content = await response.text()
            soup = BeautifulSoup(content, 'html.parser')

            # Extract all anchor tags and convert relative links to absolute
            for a_tag in soup.find_all('a', href=True):
                href = a_tag['href']
                full_url = urljoin(page_url, href)  # Handle relative URLs
                links.append(full_url)
                print(f"Found link: {full_url}")
    
    except asyncio.TimeoutError:
        print(f"Timeout occurred while scraping page {page_url}")
    except aiohttp.ClientError as e:
        print(f"Client error occurred while fetching page: {e}")
    except Exception as e:
        print(f"Error fetching page content for {page_url}: {e}")
    
    return links

# Function to check URL status asynchronously
async def check_url_status(session, url):
    """Check if the URL is valid or not."""
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'
    }
    
    try:
        async with session.get(url, headers=headers, allow_redirects=True, timeout=10) as response:
            print(f"Checked URL: {url} - Status Code: {response.status}")
            return response.status
    except asyncio.TimeoutError:
        print(f"Timeout occurred while checking URL {url}")
        return None
    except aiohttp.ClientError as e:
        print(f"Client error occurred while checking URL {url}: {e}")
        return None
    except Exception as e:
        print(f"Error checking URL {url}: {e}")
        return None

# Function to generate a CSV report
def generate_report(invalid_urls, filename="invalid_links_report.csv"):
    """Generate a CSV report of invalid URLs."""
    try:
        with open(filename, mode='w', newline='', encoding='utf-8') as file:
            writer = csv.writer(file)
            writer.writerow(['Invalid URL', 'Page URL', 'Status Code'])
            for invalid_url, page_url, status_code in invalid_urls:
                writer.writerow([invalid_url, page_url, status_code])
        print(f"Report saved as {filename}")
    except Exception as e:
        print(f"Error saving report: {e}")

# Main async function to drive the process
async def main(sitemap_url):
    # Create a session to reuse across requests
    async with aiohttp.ClientSession() as session:
        # Step 1: Extract all URLs from the sitemap
        print("Starting sitemap extraction...")
        sitemap_urls = await get_sitemap_urls(session, sitemap_url)
        
        if not sitemap_urls:
            print("No URLs found in sitemap.")
            return
        
        invalid_urls = []

        # Step 2: Scrape each URL and find links in the body
        for page_url in sitemap_urls:
            print(f"\nProcessing page: {page_url}")
            page_links = await extract_links_from_page(session, page_url)
            
            # Step 3: Check each link's status asynchronously
            check_tasks = [check_url_status(session, link) for link in page_links]
            status_codes = await asyncio.gather(*check_tasks)

            # Collect invalid URLs
            for link, status_code in zip(page_links, status_codes):
                if status_code and status_code != 200:
                    invalid_urls.append((link, page_url, status_code))
                    print(f"Invalid link found: {link} on page {page_url} with status {status_code}")

        # Step 4: Generate report for invalid URLs
        if invalid_urls:
            generate_report(invalid_urls)
        else:
            print("No invalid URLs found.")

# Example usage
if __name__ == "__main__":
    sitemap_url = 'https://i-golf-pro.com/sitemap_index.xml'  # Replace with your sitemap URL
    
    # Run the main event loop
    asyncio.run(main(sitemap_url))

How to Run the Broken Link Checker Script

  • Install Dependencies: Install the required libraries using:
    pip install aiohttp beautifulsoup4 lxml
  • Update Sitemap URL: Replace the sitemap_url variable in the script with your sitemap URL (e.g., https://example.com/sitemap.xml).
  • Run the Script: Execute the script with python script_name.py.
  • View the Report: Check the invalid_links_report.csv file for broken links, their status codes, and the pages they were found on (a short summary sketch follows this list).
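
Once the report exists, it can help to see which status codes dominate before fixing links one by one. A minimal sketch (assuming the default invalid_links_report.csv file name) that counts broken links per status code:

import csv
from collections import Counter

status_counts = Counter()

# Tally how many broken links were found for each HTTP status code
with open("invalid_links_report.csv", newline="", encoding="utf-8") as file:
    reader = csv.DictReader(file)
    for row in reader:
        status_counts[row["Status Code"]] += 1

for status, count in status_counts.most_common():
    print(f"Status {status}: {count} broken links")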

Google Page Speed Checker Script

This script is important because website speed plays a critical role in user experience, SEO rankings, and conversion rates. A slow-loading website can lead to higher bounce rates and lower search engine rankings. By using the Google Page Speed Checker Script, you can analyze key performance metrics for your website, identify areas for improvement, and optimize your pages for faster loading. Regularly monitoring page speed ensures your site remains competitive and provides a better experience for visitors.

import requests

# Function to fetch PageSpeed Insights data
def fetch_pagespeed_data(api_key, url, strategy="desktop"):
    try:
        base_url = "https://www.googleapis.com/pagespeedonline/v5/runPagespeed"
        params = {
            "url": url,
            "key": api_key,
            "strategy": strategy,  # "desktop" or "mobile"
        }
        response = requests.get(base_url, params=params)
        if response.status_code == 200:
            data = response.json()
            score = data.get("lighthouseResult", {}).get("categories", {}).get("performance", {}).get("score", 0) * 100
            print(f"PageSpeed Score ({strategy.capitalize()}): {score}")
            return score
        else:
            print(f"Error fetching data for {url}: HTTP {response.status_code}")
            return None
    except Exception as e:
        print(f"Error: {e}")
        return None

# Replace with your Google PageSpeed API key
api_key = "YOUR_API_KEY"

# URL to check
url = input("Enter the URL to check (e.g., https://example.com): ")

# Run the PageSpeed Checker for desktop and mobile
print(f"Checking PageSpeed Insights for {url}...")
desktop_score = fetch_pagespeed_data(api_key, url, "desktop")
mobile_score = fetch_pagespeed_data(api_key, url, "mobile")

# Display Results
print("\nResults:")
if desktop_score is not None:
    print(f"Desktop Score: {desktop_score}")
if mobile_score is not None:
    print(f"Mobile Score: {mobile_score}")

How to Use

To use this script, you first need to obtain a Google PageSpeed Insights API key from the Google Cloud Console. Once you have the API key, replace the placeholder YOUR_API_KEY in the script with your actual key. Then, run the script, and when prompted, enter the URL of the page you want to analyze. The script will fetch and display the PageSpeed scores for both desktop and mobile versions of the page. This tool is ideal for quickly analyzing a single page's performance and identifying areas for improvement to enhance user experience and SEO.
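
If you want to check more than one page, the fetch_pagespeed_data() function defined above can be reused in a loop. A minimal sketch (assuming a hypothetical pagespeed_scores.csv output file and your own list of URLs):

import csv

# Hypothetical list of pages to audit
urls_to_check = [
    "https://example.com",
    "https://example.com/blog",
]

with open("pagespeed_scores.csv", mode="w", newline="", encoding="utf-8") as file:
    writer = csv.writer(file)
    writer.writerow(["url", "desktop_score", "mobile_score"])
    for page in urls_to_check:
        desktop = fetch_pagespeed_data(api_key, page, "desktop")
        mobile = fetch_pagespeed_data(api_key, page, "mobile")
        writer.writerow([page, desktop, mobile])

print("Scores saved to pagespeed_scores.csv")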

Website Age Detector: Check How Old a Website Is!

Website age is important for SEO because older domains are often seen as more credible and trustworthy by search engines. Knowing the age of a website helps you assess its authority, compare it with competitors, and develop effective SEO strategies. This script calculates the website's age in years by fetching its creation date using the WHOIS protocol, making it a valuable tool for SEO analysis and research.

import whois
from datetime import datetime

domain = input("Enter Domain name: ")

domain_info = whois.whois(domain)
creation_date = domain_info.creation_date

# WHOIS servers may return a single datetime or a list of datetimes
if isinstance(creation_date, list):
    creation_date = creation_date[0]

if creation_date is None:
    print("Creation date not available for this domain.")
else:
    age = (datetime.now() - creation_date).days // 365
    print(age, 'years')

How to Use:

  1. Install Dependencies: Ensure you have the whois library installed. If not, install it using:
    pip install python-whois
  2. Run the Script: Save the script to a .py file (e.g., website_age_detector.py) and run it with:
    python website_age_detector.py
  3. Enter Domain Name: When prompted, type the domain name (e.g., example.com) and press Enter.
  4. View the Result: The script will calculate and display the website's age in years.

This tool helps you quickly determine the age of any website for SEO or research purposes.
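
To compare several competitors at once, the same idea can be wrapped in a small loop. A minimal sketch (with a hypothetical list of domains) that prints the age of each one:

import whois
from datetime import datetime

# Hypothetical domains to compare
domains = ["example.com", "example.org"]

for domain in domains:
    try:
        creation_date = whois.whois(domain).creation_date
        # Handle both single datetimes and lists of datetimes
        if isinstance(creation_date, list):
            creation_date = creation_date[0]
        if creation_date is None:
            print(f"{domain}: creation date not available")
        else:
            age = (datetime.now() - creation_date).days // 365
            print(f"{domain}: {age} years old")
    except Exception as e:
        print(f"{domain}: lookup failed ({e})")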

These Python scripts simplify SEO tasks like backlink checks, sitemap scraping, and page speed analysis. They save time, improve accuracy, and help you focus on strategy. Automating SEO ensures consistent results and keeps you ahead in the competitive online space. You can also develop more scripts with Python to automate other SEO tasks and further boost your efficiency.
