Building crawlers with Crawlee

In this guide, you'll learn how to build web crawlers with the Crawlee library in your Apify Actors.

Introduction

Crawlee is a Python library for web scraping and browser automation. It integrates with the Apify platform and covers a range of scraping techniques, from parsing static HTML to handling JavaScript-rendered content. Crawlee provides HTTP-based crawlers (HttpCrawler, BeautifulSoupCrawler, and ParselCrawler) and browser-based crawlers such as PlaywrightCrawler, so you can pick the one that fits the target site.

The sections below show how to use BeautifulSoupCrawler, ParselCrawler, and PlaywrightCrawler to build Apify Actors for web scraping.

Actor with BeautifulSoupCrawler

The BeautifulSoupCrawler is suited to extracting data from static HTML pages. It parses responses with BeautifulSoup and fetches them with ImpitHttpClient, so no browser is launched and resource usage stays low. If you do not need to execute JavaScript on the page, BeautifulSoupCrawler is a good choice for your scraping tasks. The following example shows how to use it in an Apify Actor.

import asyncio

from crawlee.crawlers import BeautifulSoupCrawler, BeautifulSoupCrawlingContext
from crawlee.router import Router

from apify import Actor

# Define the router up front. The crawler is created later in `main`.
router = Router[BeautifulSoupCrawlingContext]()


# Handler called for every request.
@router.default_handler
async def request_handler(context: BeautifulSoupCrawlingContext) -> None:
    Actor.log.info(f'Scraping {context.request.url} ...')

    data = {
        'url': context.request.url,
        'title': context.soup.title.string if context.soup.title else None,
        'h1s': [h1.text for h1 in context.soup.find_all('h1')],
        'h2s': [h2.text for h2 in context.soup.find_all('h2')],
        'h3s': [h3.text for h3 in context.soup.find_all('h3')],
    }

    await context.push_data(data)
    Actor.log.info(f'Stored data from {context.request.url} (title={data["title"]!r}).')

    # Enqueue links found on the page.
    await context.enqueue_links(strategy='same-domain')


async def main() -> None:
    async with Actor:
        # Read the Actor input.
        actor_input = await Actor.get_input() or {}
        start_urls = [
            url.get('url')
            for url in actor_input.get('startUrls', [{'url': 'https://crawlee.dev'}])
        ]

        if not start_urls:
            Actor.log.info('No start URLs specified in Actor input, exiting...')
            await Actor.exit()

        # Crawlee rotates the proxy URL per request on its own.
        proxy_configuration = await Actor.create_proxy_configuration()
        if proxy_configuration is None:
            raise RuntimeError('Failed to create the proxy configuration.')

        crawler = BeautifulSoupCrawler(
            proxy_configuration=proxy_configuration,
            request_handler=router,
            # Cap the crawl. Remove or increase the limit to follow all links.
            max_requests_per_crawl=50,
        )

        await crawler.run(start_urls)


if __name__ == '__main__':
    asyncio.run(main())
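
The start URL handling in main assumes that every startUrls item is a dictionary with a url key; an item without one would put None into the list. A minimal sketch of a stricter version follows. The helper name extract_start_urls and the constant DEFAULT_START_URLS are hypothetical and not part of Crawlee or the Apify SDK.

```python
from typing import Any, Optional

# Fallback used when the Actor input has no `startUrls` field (same default as above).
DEFAULT_START_URLS = [{'url': 'https://crawlee.dev'}]


def extract_start_urls(actor_input: Optional[dict[str, Any]]) -> list[str]:
    """Return the non-empty `url` strings from the `startUrls` input field."""
    items = (actor_input or {}).get('startUrls', DEFAULT_START_URLS)
    urls: list[str] = []
    for item in items:
        # Skip entries that are not dictionaries or that lack a usable URL.
        url = item.get('url') if isinstance(item, dict) else None
        if isinstance(url, str) and url.strip():
            urls.append(url.strip())
    return urls
```

In main, the list comprehension could then be replaced with start_urls = extract_start_urls(await Actor.get_input()), leaving the empty-list check unchanged.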

Actor with ParselCrawler

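The ParselCrawler works in the same way as BeautifulSoupCrawler, but it uses the Parsel library for HTML parsing. Parsel supports both XPath and CSS selectors, which allows for more expressive data extraction, and it is generally expected to be faster than BeautifulSoupCrawler, although the difference depends on the pages being parsed. The example below uses XPath expressions such as //h1/text(), which return only the text nodes directly inside each heading, so text in nested elements is skipped. The following example shows how to use ParselCrawler in an Apify Actor.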

import asyncio

from crawlee.crawlers import ParselCrawler, ParselCrawlingContext
from crawlee.router import Router

from apify import Actor

# Define the router up front. The crawler is created later in `main`.
router = Router[ParselCrawlingContext]()


# Handler called for every request.
@router.default_handler
async def request_handler(context: ParselCrawlingContext) -> None:
    Actor.log.info(f'Scraping {context.request.url} ...')

    data = {
        'url': context.request.url,
        'title': context.selector.xpath('//title/text()').get(),
        'h1s': context.selector.xpath('//h1/text()').getall(),
        'h2s': context.selector.xpath('//h2/text()').getall(),
        'h3s': context.selector.xpath('//h3/text()').getall(),
    }

    await context.push_data(data)
    Actor.log.info(f'Stored data from {context.request.url} (title={data["title"]!r}).')

    # Enqueue links found on the page.
    await context.enqueue_links(strategy='same-domain')


async def main() -> None:
    async with Actor:
        # Read the Actor input.
        actor_input = await Actor.get_input() or {}
        start_urls = [
            url.get('url')
            for url in actor_input.get('startUrls', [{'url': 'https://crawlee.dev'}])
        ]

        if not start_urls:
            Actor.log.info('No start URLs specified in Actor input, exiting...')
            await Actor.exit()

        # Crawlee rotates the proxy URL per request on its own.
        proxy_configuration = await Actor.create_proxy_configuration()
        if proxy_configuration is None:
            raise RuntimeError('Failed to create the proxy configuration.')

        crawler = ParselCrawler(
            proxy_configuration=proxy_configuration,
            request_handler=router,
            # Cap the crawl. Remove or increase the limit to follow all links.
            max_requests_per_crawl=50,
        )

        await crawler.run(start_urls)


if __name__ == '__main__':
    asyncio.run(main())

Actor with PlaywrightCrawler

The PlaywrightCrawler handles dynamic web pages that rely on JavaScript to render their content. It uses the Playwright library to drive a browser, so the request handler can work with the page as rendered and interact with it. The following example shows how to use PlaywrightCrawler in an Apify Actor.

import asyncio

from crawlee.crawlers import PlaywrightCrawler, PlaywrightCrawlingContext
from crawlee.router import Router

from apify import Actor

# Define the router up front. The crawler is created later in `main`.
router = Router[PlaywrightCrawlingContext]()


# Handler called for every request.
@router.default_handler
async def request_handler(context: PlaywrightCrawlingContext) -> None:
    Actor.log.info(f'Scraping {context.request.url} ...')

    data = {
        'url': context.request.url,
        'title': await context.page.title(),
        'h1s': [await h1.text_content() for h1 in await context.page.locator('h1').all()],
        'h2s': [await h2.text_content() for h2 in await context.page.locator('h2').all()],
        'h3s': [await h3.text_content() for h3 in await context.page.locator('h3').all()],
    }

    await context.push_data(data)
    Actor.log.info(f'Stored data from {context.request.url} (title={data["title"]!r}).')

    # Enqueue links found on the page.
    await context.enqueue_links(strategy='same-domain')


async def main() -> None:
    async with Actor:
        # Read the Actor input.
        actor_input = await Actor.get_input() or {}
        start_urls = [
            url.get('url')
            for url in actor_input.get('startUrls', [{'url': 'https://crawlee.dev'}])
        ]

        if not start_urls:
            Actor.log.info('No start URLs specified in Actor input, exiting...')
            await Actor.exit()

        # Crawlee rotates the proxy URL per request on its own.
        proxy_configuration = await Actor.create_proxy_configuration()
        if proxy_configuration is None:
            raise RuntimeError('Failed to create the proxy configuration.')

        # Common Chrome flags for running the browser in a container.
        browser_args = ['--no-sandbox', '--disable-dev-shm-usage', '--disable-gpu']

        crawler = PlaywrightCrawler(
            proxy_configuration=proxy_configuration,
            request_handler=router,
            # Cap the crawl. Remove or increase the limit to follow all links.
            max_requests_per_crawl=50,
            headless=True,
            browser_launch_options={'args': browser_args},
        )

        await crawler.run(start_urls)


if __name__ == '__main__':
    asyncio.run(main())

Using Apify Proxy

All three crawlers above route their requests through Apify Proxy, which rotates IP addresses to reduce the risk of rate limiting and blocking. Actor.create_proxy_configuration returns a Crawlee-compatible proxy configuration, which is passed to the crawler as proxy_configuration, and Crawlee then rotates the proxy IP for every request on its own. If no configuration can be created, the examples raise a RuntimeError instead of crawling without a proxy. Because the configuration is only available inside the running Actor, the crawler is created in main, while the request handler is registered on a standalone Router up front. To select specific proxy groups or a country, pass the relevant arguments to Actor.create_proxy_configuration. For details, see Proxy management.
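
As a rough sketch of selecting proxy groups and a country, the helper below maps two input fields to the groups and country_code keyword arguments of Actor.create_proxy_configuration. The input field names proxyGroups and proxyCountryCode and the helper proxy_kwargs_from_input are assumptions made for illustration, and which groups are available depends on your Apify account.

```python
from typing import Any


def proxy_kwargs_from_input(actor_input: dict[str, Any]) -> dict[str, Any]:
    """Build keyword arguments for Actor.create_proxy_configuration."""
    kwargs: dict[str, Any] = {}
    groups = actor_input.get('proxyGroups')  # hypothetical input field
    if groups:
        kwargs['groups'] = list(groups)
    country = actor_input.get('proxyCountryCode')  # hypothetical input field
    if country:
        # Country codes are two-letter codes; normalize to upper case.
        kwargs['country_code'] = str(country).upper()
    return kwargs


async def main() -> None:
    # Imported here so the helper above can be used without the SDK installed.
    from apify import Actor

    async with Actor:
        actor_input = await Actor.get_input() or {}
        proxy_configuration = await Actor.create_proxy_configuration(
            **proxy_kwargs_from_input(actor_input)
        )
        if proxy_configuration is None:
            raise RuntimeError('Failed to create the proxy configuration.')
        # Pass `proxy_configuration` to the crawler as in the examples above.
```

With an empty input, the helper returns no arguments and the call behaves like the plain Actor.create_proxy_configuration() used in the examples above.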

Conclusion

In this guide, you learned how to use the Crawlee library in your Apify Actors. With BeautifulSoupCrawler and ParselCrawler you can scrape static pages, and with PlaywrightCrawler you can scrape pages that render their content with JavaScript. See the Actor templates to get started with your own scraping tasks. If you have questions or need assistance, feel free to reach out on our GitHub or join our Discord community. Happy scraping!
