Version: 3.4

Scraping with Parsel and Impit

In this guide, you'll learn how to scrape web pages with the Parsel and Impit libraries in your Apify Actors.

Introduction

Parsel is a Python library for extracting data from HTML and XML documents using CSS selectors and XPath expressions. It offers an intuitive API for navigating and extracting structured data, making it a popular choice for web scraping. Because it is built on top of lxml, it is also generally faster than BeautifulSoup.

Impit is Apify's high-performance HTTP client for Python. It supports both synchronous and asynchronous workflows and is built for large-scale web scraping, where making thousands of requests efficiently is essential. With built-in browser impersonation and anti-blocking features, it simplifies handling modern websites.

Example Actor

The following example shows a simple Actor that recursively scrapes data from linked pages on the same site, up to a user-defined maximum depth. It uses Impit to fetch pages through Apify Proxy and Parsel to extract the title, headings, and links.


import asyncio
from typing import Any
from urllib.parse import urljoin, urlsplit

import impit
import parsel

from apify import Actor, Request
from apify.storages import RequestQueue


async def scrape_page(
    url: str,
    *,
    proxy_url: str | None = None,
) -> tuple[dict[str, Any], list[str]]:
    """Fetch a page with Impit and return its data and same-site links."""
    # A fresh client per call lets each request use a new proxy URL.
    async with impit.AsyncClient(proxy=proxy_url) as client:
        response = await client.get(url)

    selector = parsel.Selector(text=response.text)

    data = {
        'url': url,
        'title': selector.css('title::text').get(),
        'h1s': selector.css('h1::text').getall(),
        'h2s': selector.css('h2::text').getall(),
        'h3s': selector.css('h3::text').getall(),
    }

    # Keep only absolute links on the same host.
    links: list[str] = []
    host = urlsplit(url).netloc
    for link_href in selector.css('a::attr(href)').getall():
        link_url = urljoin(url, link_href)
        if not link_url.startswith(('http://', 'https://')):
            continue
        if urlsplit(link_url).netloc == host:
            links.append(link_url)

    return data, links


async def enqueue_links(
    request_queue: RequestQueue,
    links: list[str],
    *,
    depth: int,
    max_depth: int,
) -> None:
    """Enqueue the links one level deeper, unless max_depth was reached."""
    if depth >= max_depth:
        return

    for link_url in links:
        Actor.log.info(f'Enqueuing {link_url} ...')
        request = Request.from_url(link_url)
        request.crawl_depth = depth + 1
        await request_queue.add_request(request)


async def main() -> None:
    async with Actor:
        # Read the Actor input.
        actor_input = await Actor.get_input() or {}
        start_urls = actor_input.get('startUrls', [{'url': 'https://crawlee.dev'}])
        max_depth = actor_input.get('maxDepth', 1)

        if not start_urls:
            Actor.log.info('No start URLs specified in Actor input, exiting...')
            await Actor.exit()

        # Set up Apify Proxy and the request queue.
        proxy_configuration = await Actor.create_proxy_configuration()
        request_queue = await Actor.open_request_queue()

        # Enqueue the start URLs (crawl depth defaults to 0).
        for start_url in start_urls:
            url = start_url.get('url')
            Actor.log.info(f'Enqueuing start URL: {url}')
            await request_queue.add_request(Request.from_url(url))

        # Cap the crawl. Raise or remove the limit to follow more pages.
        max_requests = 50
        handled_requests = 0

        while handled_requests < max_requests and (
            request := await request_queue.fetch_next_request()
        ):
            handled_requests += 1
            url = request.url
            depth = request.crawl_depth
            Actor.log.info(f'Scraping {url} (depth={depth}) ...')

            try:
                # Fresh proxy URL per request (None if no proxy).
                proxy_url = None
                if proxy_configuration:
                    proxy_url = await proxy_configuration.new_url()

                data, links = await scrape_page(url, proxy_url=proxy_url)
                await Actor.push_data(data)
                Actor.log.info(
                    f'Stored data from {url} '
                    f'(title={data["title"]!r}, {len(links)} links found).'
                )
                await enqueue_links(
                    request_queue, links, depth=depth, max_depth=max_depth
                )

            except Exception:
                Actor.log.exception(f'Cannot extract data from {url}.')

            finally:
                await request_queue.mark_request_as_handled(request)


if __name__ == '__main__':
    asyncio.run(main())
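The link filtering inside `scrape_page` relies only on the standard library, so it can be tried in isolation. The following sketch repeats that logic in a standalone helper; the function name `same_host_links` and the sample URLs are hypothetical.

```python
from urllib.parse import urljoin, urlsplit


def same_host_links(base_url: str, hrefs: list[str]) -> list[str]:
    """Resolve hrefs against base_url and keep http(s) links on the same host."""
    host = urlsplit(base_url).netloc
    links: list[str] = []
    for href in hrefs:
        # Turn relative links into absolute URLs.
        absolute = urljoin(base_url, href)
        if not absolute.startswith(('http://', 'https://')):
            continue  # Skip mailto:, javascript:, and similar schemes.
        if urlsplit(absolute).netloc == host:
            links.append(absolute)
    return links


sample_hrefs = [
    '/docs',
    'blog/post',
    'https://example.org/other',
    'mailto:hi@example.com',
]
print(same_host_links('https://example.com/start/', sample_hrefs))
# ['https://example.com/docs', 'https://example.com/start/blog/post']
```

Comparing `netloc` values means that subdomains count as different hosts, so a crawl that starts on `example.com` will not follow links to `blog.example.com`.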

Using Apify Proxy

Running on the Apify platform gives your scraper access to Apify Proxy, which rotates IP addresses to avoid rate limiting and blocking. The example creates a proxy configuration with Actor.create_proxy_configuration and requests a fresh proxy URL for every request, so consecutive pages are typically fetched from different IP addresses. Because the proxy is set when the client is constructed, the example creates a new Impit client for each request. To select specific proxy groups or a country, pass the relevant arguments to Actor.create_proxy_configuration. For details, see Proxy management.

Conclusion

In this guide, you learned how to use Parsel with Impit in your Apify Actors. Parsel provides CSS selector and XPath support for data extraction, while Impit offers a fast and simple HTTP client built by Apify. Together they make it easy to build scalable web scrapers in Python. See the Actor templates to get started with your own scraping tasks. If you have questions or need assistance, feel free to reach out on our GitHub or join our Discord community. Happy scraping!
