Version: 3.4

Browser automation with Playwright

In this guide, you'll learn how to use Playwright for browser automation and web scraping in your Apify Actors.

Introduction

Playwright is a tool for web automation and testing that can also be used for web scraping. It allows you to control a web browser programmatically and interact with web pages just as a human would.

Some of the key features of Playwright for web scraping include:

Cross-browser support - Playwright supports Chromium, Firefox, and WebKit with a single API, ensuring consistent behavior across all browsers.
Auto-waiting - Playwright automatically waits for elements to be ready before performing actions, reducing flaky scripts and eliminating the need for manual sleep calls.
Headless and headful modes - Playwright can run with or without a visible browser window, making it suitable for both local development and containerized environments.
Powerful selectors - Playwright provides CSS selectors, XPath, text matching, and its own resilient locator API for targeting elements on a page.
Network interception - Playwright can intercept and modify network requests, allowing you to block unnecessary resources or mock API responses during scraping.
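
As a rough illustration of the network interception feature listed above, the sketch below aborts requests for images, media, and fonts and lets everything else through. The BLOCKED_RESOURCE_TYPES set and both function names are assumptions made for this example and are not part of the Actor template. The Playwright import sits behind TYPE_CHECKING because it is only needed for the type hints.

```python
import asyncio
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    from playwright.async_api import Page, Route

# Assumption: these resource types are not needed for text scraping. Adjust per site.
BLOCKED_RESOURCE_TYPES = {'image', 'media', 'font'}


async def block_heavy_resources(route: 'Route') -> None:
    """Abort requests for blocked resource types and continue all others."""
    if route.request.resource_type in BLOCKED_RESOURCE_TYPES:
        await route.abort()
    else:
        await route.continue_()


async def enable_blocking(page: 'Page') -> None:
    """Register the handler for every URL the page requests."""
    await page.route('**/*', block_heavy_resources)
```

Calling await enable_blocking(page) before page.goto(...) should make the handler apply to the requests triggered by that navigation.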

To create Actors which use Playwright, start from the Playwright & Python Actor template.

On the Apify platform, the Actor will already have Playwright and the necessary browsers preinstalled in its Docker image, including the tools and setup necessary to run browsers in headful mode.

When running the Actor locally, you'll need to finish the Playwright setup yourself before you can run the Actor.

Linux / macOS:

source .venv/bin/activate
playwright install --with-deps

Windows:

.venv\Scripts\activate
playwright install --with-deps

Example Actor

This is a simple Actor that recursively scrapes data from linked pages on the same site, up to a maximum depth, starting from URLs in the Actor input.

It uses Playwright to open the pages in an automated Chromium browser, and to extract the title, headings, and links after the pages load.
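
To show how a maximum depth bounds a recursive crawl, here is a minimal in-memory sketch. It makes several assumptions: the LINK_GRAPH dictionary is a hypothetical stand-in for real pages, a deque replaces the Actor's request queue, and the seen set only approximates the queue's handling of repeated URLs.

```python
from collections import deque

# Hypothetical link graph standing in for real pages: URL -> links found on it.
LINK_GRAPH = {
    'https://example.com/': ['https://example.com/a', 'https://example.com/b'],
    'https://example.com/a': ['https://example.com/a/deep'],
    'https://example.com/b': [],
    'https://example.com/a/deep': ['https://example.com/a/deeper'],
}


def crawl(start_url: str, max_depth: int) -> list[tuple[str, int]]:
    """Visit pages breadth-first, following links only while depth < max_depth."""
    queue = deque([(start_url, 0)])
    seen = {start_url}  # avoid visiting the same URL twice
    visited = []
    while queue:
        url, depth = queue.popleft()
        visited.append((url, depth))
        if depth >= max_depth:
            continue  # pages at the maximum depth are scraped but not expanded
        for link in LINK_GRAPH.get(url, []):
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return visited


for url, depth in crawl('https://example.com/', max_depth=1):
    print(depth, url)
```

With max_depth=1, only the start URL and the pages it links to directly are visited. The page at /a/deep is never reached.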

import asyncio
from typing import Any
from urllib.parse import urljoin, urlsplit

from playwright.async_api import BrowserContext, async_playwright

from apify import Actor, Request
from apify.storages import RequestQueue

# To run locally, install the browsers first: `playwright install --with-deps`.
# On the Apify platform, browsers are already in the Actor's Docker image.


def to_playwright_proxy(proxy_url: str) -> dict[str, str]:
    """Split an Apify Proxy URL into Playwright's server/username/password."""
    parts = urlsplit(proxy_url)
    return {
        'server': f'{parts.scheme}://{parts.hostname}:{parts.port}',
        'username': parts.username or '',
        'password': parts.password or '',
    }


async def scrape_page(
    context: BrowserContext, url: str
) -> tuple[dict[str, Any], list[str]]:
    """Open the URL in a new page and return its data and same-site links."""
    page = await context.new_page()
    try:
        await page.goto(url)

        data = {
            'url': url,
            'title': await page.title(),
            'h1s': [await h1.text_content() for h1 in await page.locator('h1').all()],
            'h2s': [await h2.text_content() for h2 in await page.locator('h2').all()],
            'h3s': [await h3.text_content() for h3 in await page.locator('h3').all()],
        }

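        # Resolve links to absolute URLs and keep only http(s) links on the same host.
        links: list[str] = []
        host = urlsplit(url).netloc
        for link in await page.locator('a').all():
            link_href = await link.get_attribute('href')
            if not link_href:
                # Anchors without an href would otherwise resolve to the page itself.
                continue
            link_url = urljoin(url, link_href)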
            if not link_url.startswith(('http://', 'https://')):
                continue
            if urlsplit(link_url).netloc == host:
                links.append(link_url)

        return data, links

    finally:
        await page.close()


async def enqueue_links(
    request_queue: RequestQueue,
    links: list[str],
    *,
    depth: int,
    max_depth: int,
) -> None:
    """Enqueue the links one level deeper, unless max_depth was reached."""
    if depth >= max_depth:
        return

    for link_url in links:
        Actor.log.info(f'Enqueuing {link_url} ...')
        request = Request.from_url(link_url)
        request.crawl_depth = depth + 1
        await request_queue.add_request(request)


async def main() -> None:
    async with Actor:
        # Read the Actor input.
        actor_input = await Actor.get_input() or {}
        start_urls = actor_input.get('startUrls', [{'url': 'https://crawlee.dev'}])
        max_depth = actor_input.get('maxDepth', 1)

        if not start_urls:
            Actor.log.info('No start URLs specified in Actor input, exiting...')
            await Actor.exit()

        # Set up the proxy configuration. A fresh proxy URL is fetched per request below.
        proxy_configuration = await Actor.create_proxy_configuration()

        # Open the request queue and enqueue the start URLs (crawl depth 0).
        request_queue = await Actor.open_request_queue()
        for start_url in start_urls:
            url = start_url.get('url')
            Actor.log.info(f'Enqueuing start URL: {url}')
            await request_queue.add_request(Request.from_url(url))

        # Cap the crawl. Raise or remove the limit to follow more pages.
        max_requests = 50
        handled_requests = 0

        Actor.log.info('Launching Playwright...')

        async with async_playwright() as playwright:
            browser = await playwright.chromium.launch(
                headless=Actor.configuration.headless,
                args=['--no-sandbox', '--disable-dev-shm-usage', '--disable-gpu'],
            )

            while handled_requests < max_requests and (
                request := await request_queue.fetch_next_request()
            ):
                handled_requests += 1
                url = request.url
                depth = request.crawl_depth
                Actor.log.info(f'Scraping {url} (depth={depth}) ...')

                # A new context with a fresh proxy URL per request rotates the proxy IP.
                proxy_url = (
                    await proxy_configuration.new_url() if proxy_configuration else None
                )
                context = await browser.new_context(
                    proxy=to_playwright_proxy(proxy_url) if proxy_url else None,
                )

                try:
                    data, links = await scrape_page(context, url)
                    await Actor.push_data(data)
                    Actor.log.info(
                        f'Stored data from {url} '
                        f'(title={data["title"]!r}, {len(links)} links found).'
                    )
                    await enqueue_links(
                        request_queue, links, depth=depth, max_depth=max_depth
                    )

                except Exception:
                    Actor.log.exception(f'Cannot extract data from {url}.')

                finally:
                    await context.close()
                    await request_queue.mark_request_as_handled(request)


if __name__ == '__main__':
    asyncio.run(main())
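
The link filtering inside scrape_page can be tried without a browser. The sketch below applies the same urljoin and host comparison to a plain list of href values. The same_site_links function and the sample hrefs are assumptions made for this illustration.

```python
from typing import Optional
from urllib.parse import urljoin, urlsplit


def same_site_links(page_url: str, hrefs: list[Optional[str]]) -> list[str]:
    """Resolve hrefs against page_url and keep http(s) links on the same host."""
    host = urlsplit(page_url).netloc
    links = []
    for href in hrefs:
        if not href:
            continue  # anchor without an href attribute
        link_url = urljoin(page_url, href)
        if not link_url.startswith(('http://', 'https://')):
            continue  # for example mailto: or javascript: links
        if urlsplit(link_url).netloc == host:
            links.append(link_url)
    return links


# Sample hrefs chosen for illustration only.
links = same_site_links(
    'https://crawlee.dev/docs/',
    ['/python', 'guides/intro', 'https://apify.com', 'mailto:hi@example.com', None],
)
print(links)
```

Relative links are resolved against the page URL, while the external link, the mailto: link, and the missing href are dropped.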

Using Apify Proxy

Running on the Apify platform gives your scraper access to Apify Proxy, which rotates IP addresses to help avoid rate limiting and blocking. The example creates a proxy configuration with Actor.create_proxy_configuration and fetches a fresh proxy URL for every request. Playwright applies proxy settings per browser context, so each request runs in its own new context to rotate the IP address. The to_playwright_proxy helper splits the proxy URL into the server, username, and password fields that Playwright expects. To select specific proxy groups or a country, pass the relevant arguments to Actor.create_proxy_configuration. For details, see Proxy management.
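
To make the splitting step concrete, the sketch below runs the to_playwright_proxy helper from the example on a sample URL. The host name and credentials are hypothetical placeholders. In the Actor, the URL comes from proxy_configuration.new_url().

```python
from urllib.parse import urlsplit


def to_playwright_proxy(proxy_url: str) -> dict[str, str]:
    """Split a proxy URL into the server/username/password fields Playwright takes."""
    parts = urlsplit(proxy_url)
    return {
        'server': f'{parts.scheme}://{parts.hostname}:{parts.port}',
        'username': parts.username or '',
        'password': parts.password or '',
    }


# Hypothetical proxy URL used only for illustration.
proxy = to_playwright_proxy('http://auto:s3cret@proxy.example.com:8000')
print(proxy)
```

The resulting dictionary is what the example passes as the proxy argument of browser.new_context.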

Conclusion

In this guide, you learned how to create Actors that use Playwright to scrape websites. Playwright is a powerful tool for managing browser instances and scraping websites that require JavaScript execution. See the Actor templates to get started with your own scraping tasks. If you have questions or need assistance, feel free to reach out on our GitHub or join our Discord community. Happy scraping!
