Version: 3.4

Scraping with BeautifulSoup and HTTPX

In this guide, you'll learn how to scrape web pages with the BeautifulSoup and HTTPX libraries in your Apify Actors.

Introduction

BeautifulSoup is a Python library for extracting data from HTML and XML files. It provides simple methods and Pythonic idioms for navigating, searching, and modifying a website's element tree, enabling efficient data extraction.
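As a rough illustration of the navigation and search methods mentioned above, the following minimal sketch parses a small HTML string and pulls out the title, the headings, and the link targets. The HTML snippet and the variable names are made up for this example.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML snippet, used only for illustration.
HTML = """
<html>
  <head><title>Example page</title></head>
  <body>
    <h1>Main heading</h1>
    <a href="/docs">Docs</a>
    <a href="https://example.com/blog">Blog</a>
  </body>
</html>
"""

# 'html.parser' is the parser bundled with the Python standard library.
soup = BeautifulSoup(HTML, 'html.parser')

# Navigate to a single element, guarding against a missing <title>.
title = soup.title.string if soup.title else None

# Search for all matching elements and read their text.
headings = [h1.get_text(strip=True) for h1 in soup.find_all('h1')]

# Select only anchors that actually carry an href attribute.
hrefs = [a['href'] for a in soup.find_all('a', href=True)]

print(title)
print(headings)
print(hrefs)
```

Passing href=True to find_all is one way to skip anchors that have no link target.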

HTTPX is a modern, high-level HTTP client library for Python. It provides a simple interface for making HTTP requests and supports both synchronous and asynchronous requests.

To create an Actor that uses these libraries, start from the BeautifulSoup & Python Actor template. The template comes with BeautifulSoup and HTTPX preinstalled, so you can begin development immediately.

Example Actor

The following example is a simple Actor that recursively scrapes data from linked pages on the same site, up to a specified maximum depth, starting from URLs provided in the Actor input. It uses HTTPX for fetching pages through Apify Proxy and BeautifulSoup for parsing their content to extract the title, headings, and links to other pages.

import asyncio
from typing import Any
from urllib.parse import urljoin, urlsplit

import httpx
from bs4 import BeautifulSoup

from apify import Actor, Request
from apify.storages import RequestQueue


async def scrape_page(
    url: str,
    *,
    proxy_url: str | None = None,
) -> tuple[dict[str, Any], list[str]]:
    """Fetch a page with HTTPX and return its data and same-site links."""
    # A fresh client per call lets each request use a new proxy URL.
    async with httpx.AsyncClient(proxy=proxy_url) as client:
        response = await client.get(url, follow_redirects=True)
        # Treat 4xx/5xx responses as failures instead of scraping error pages.
        response.raise_for_status()

    soup = BeautifulSoup(response.content, 'html.parser')

    data = {
        'url': url,
        'title': soup.title.string if soup.title else None,
        'h1s': [h1.text for h1 in soup.find_all('h1')],
        'h2s': [h2.text for h2 in soup.find_all('h2')],
        'h3s': [h3.text for h3 in soup.find_all('h3')],
    }

    # Keep only absolute links on the same host.
    links: list[str] = []
    host = urlsplit(url).netloc
    # Skip anchors that have no href attribute.
    for link in soup.find_all('a', href=True):
        link_url = urljoin(url, link['href'])
        if not link_url.startswith(('http://', 'https://')):
            continue
        if urlsplit(link_url).netloc == host:
            links.append(link_url)

    return data, links


async def enqueue_links(
    request_queue: RequestQueue,
    links: list[str],
    *,
    depth: int,
    max_depth: int,
) -> None:
    """Enqueue the links one level deeper, unless max_depth was reached."""
    if depth >= max_depth:
        return

    for link_url in links:
        Actor.log.info(f'Enqueuing {link_url} ...')
        request = Request.from_url(link_url)
        request.crawl_depth = depth + 1
        await request_queue.add_request(request)


async def main() -> None:
    async with Actor:
        # Read the Actor input.
        actor_input = await Actor.get_input() or {}
        start_urls = actor_input.get('startUrls', [{'url': 'https://crawlee.dev'}])
        max_depth = actor_input.get('maxDepth', 1)

        if not start_urls:
            Actor.log.info('No start URLs specified in Actor input, exiting...')
            await Actor.exit()

        # Set up Apify Proxy and the request queue.
        proxy_configuration = await Actor.create_proxy_configuration()
        request_queue = await Actor.open_request_queue()

        # Enqueue the start URLs (crawl depth defaults to 0).
        for start_url in start_urls:
            url = start_url.get('url')
            Actor.log.info(f'Enqueuing start URL: {url}')
            await request_queue.add_request(Request.from_url(url))

        # Cap the crawl. Raise or remove the limit to follow more pages.
        max_requests = 50
        handled_requests = 0

        while handled_requests < max_requests and (
            request := await request_queue.fetch_next_request()
        ):
            handled_requests += 1
            url = request.url
            depth = request.crawl_depth
            Actor.log.info(f'Scraping {url} (depth={depth}) ...')

            try:
                # Fresh proxy URL per request (None if no proxy).
                proxy_url = None
                if proxy_configuration:
                    proxy_url = await proxy_configuration.new_url()

                data, links = await scrape_page(url, proxy_url=proxy_url)
                await Actor.push_data(data)
                Actor.log.info(
                    f'Stored data from {url} '
                    f'(title={data["title"]!r}, {len(links)} links found).'
                )
                await enqueue_links(
                    request_queue, links, depth=depth, max_depth=max_depth
                )

            except Exception:
                Actor.log.exception(f'Cannot extract data from {url}.')

            finally:
                await request_queue.mark_request_as_handled(request)


if __name__ == '__main__':
    asyncio.run(main())
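The link filtering inside scrape_page can be tried in isolation with the standard library alone. The following sketch mirrors that logic under assumed inputs: the same_host_links helper and the sample URLs are hypothetical and only show how relative links are resolved and how off-site or non-HTTP links are dropped.

```python
from urllib.parse import urljoin, urlsplit


def same_host_links(page_url: str, hrefs: list[str]) -> list[str]:
    """Resolve hrefs against page_url and keep those on the same host."""
    host = urlsplit(page_url).netloc
    links: list[str] = []
    for href in hrefs:
        # Turn relative references into absolute URLs.
        absolute = urljoin(page_url, href)
        # Drop non-HTTP schemes such as mailto: or javascript:.
        if not absolute.startswith(('http://', 'https://')):
            continue
        # Keep the link only if it points to the same host.
        if urlsplit(absolute).netloc == host:
            links.append(absolute)
    return links


# Hypothetical hrefs as they might appear on a page.
sample_hrefs = [
    '/docs',
    'guide/intro',
    'https://example.com/blog',
    'https://other.example.org/',
    'mailto:team@example.com',
]

result = same_host_links('https://example.com/start/', sample_hrefs)
print(result)
```

Comparing netloc values is a strict check: a subdomain such as www.example.com counts as a different host than example.com.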

Using Apify Proxy

Running on the Apify platform gives your scraper access to Apify Proxy, which rotates IP addresses to reduce rate limiting and blocking. The example creates a proxy configuration with Actor.create_proxy_configuration and requests a fresh proxy URL with new_url for every page, so consecutive requests can leave through different IP addresses. Because HTTPX configures the proxy on the client rather than on individual requests, the example creates a new AsyncClient for each page. To select specific proxy groups or a country, pass the groups or country_code arguments to Actor.create_proxy_configuration. For details, see Proxy management.

Conclusion

In this guide, you learned how to use BeautifulSoup with HTTPX in your Apify Actors. By combining these libraries, you can fetch pages and extract data from their HTML or XML content, which makes it straightforward to build web scrapers in Python. See the Actor templates to get started with your own scraping tasks. If you have questions or need assistance, feel free to reach out on our GitHub or join our Discord community. Happy scraping!
