Adaptive scraping with Scrapling
In this guide, you'll learn how to use the Scrapling library for adaptive web scraping in your Apify Actors.
Introduction
Scrapling is an adaptive web scraping library for Python that combines fetching and parsing behind a single, high-level API. It can fetch a page with fast HTTP requests or with a real browser, parse the result with familiar CSS selectors and XPath, and relocate your selectors automatically when a website's structure changes.
Scrapling is a great fit for Apify Actors:
- A single API exposes a fast HTTP client with browser TLS-fingerprint impersonation, as well as full browser automation for JavaScript-heavy or protected pages.
- Scrapling can remember the elements you scraped and find them again after a website redesign. Your scrapers keep working with fewer manual fixes.
- Built-in stealth features (browser impersonation, realistic headers, and automatic Cloudflare Turnstile solving with the browser fetchers) help you avoid being blocked.
- Elements are selected with CSS selectors (including the ::text and ::attr() pseudo-elements) or XPath, with a Scrapy/Parsel-like .get() and .getall() interface.
- Every fetcher has an asynchronous variant, which integrates naturally with the asyncio-based Apify SDK.
Scrapling's parser works on its own. The fetchers are an optional extra. To get the HTTP and browser fetchers, install Scrapling with the fetchers extra:
pip install "scrapling[fetchers]"
Choosing a fetcher
All of Scrapling's fetchers are importable from scrapling.fetchers. Pick the one that matches the website you're scraping:
- Fetcher / AsyncFetcher - Plain HTTP requests via .get(), .post(), .put(), and .delete(). Fast and lightweight, with optional browser TLS-fingerprint impersonation (impersonate) and realistic headers (stealthy_headers). This is the best choice for static pages and APIs, and it doesn't need browser binaries.
- DynamicFetcher / DynamicSession - Full browser automation based on Playwright, for pages that require JavaScript rendering or interaction. Fetch a page with .fetch() or its async variant .async_fetch().
- StealthyFetcher / StealthySession - A stealth-hardened browser fetcher that can automatically solve Cloudflare Turnstile challenges (solve_cloudflare=True). Use it for the most heavily protected websites.
The returned Response object is also a Scrapling selector, so you can call .css(), .xpath(), .find_all(), and the other parsing methods on it directly.
The HTTP fetchers work with just the scrapling[fetchers] extra. The browser-based fetchers (DynamicFetcher and StealthyFetcher) additionally need browser binaries, which you download with the scrapling install command. See Running browser-based fetchers.
The example Actor in this guide uses the HTTP AsyncFetcher, which is the simplest to deploy and pairs well with Apify Proxy.
Example Actor
The following Actor recursively scrapes data from linked pages on the same site, up to a user-defined maximum depth, starting from the URLs in the Actor input. It uses Scrapling's AsyncFetcher to fetch each page through Apify Proxy, and CSS selectors to extract the title, headings, and links.
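The crawl described above boils down to a breadth-first traversal with a depth cap and a request cap. The following is a minimal, network-free sketch of that logic, not the Actor itself. FAKE_SITE, crawl, and the example.com URLs are hypothetical stand-ins, a plain deque replaces the Apify request queue, and the seen set imitates the queue's deduplication of repeated URLs.

```python
from collections import deque

# Hypothetical in-memory "site": each URL maps to the links found on that page.
FAKE_SITE: dict[str, list[str]] = {
    'https://example.com/': ['https://example.com/a', 'https://example.com/b'],
    'https://example.com/a': ['https://example.com/c'],
    'https://example.com/b': [],
    'https://example.com/c': ['https://example.com/d'],
    'https://example.com/d': [],
}


def crawl(
    start_url: str,
    max_depth: int,
    max_requests: int = 50,
) -> list[tuple[str, int]]:
    """Visit pages breadth-first and return (url, depth) pairs in visit order."""
    queue: deque[tuple[str, int]] = deque([(start_url, 0)])
    seen = {start_url}  # Stand-in for the request queue's deduplication.
    visited: list[tuple[str, int]] = []

    while queue and len(visited) < max_requests:
        url, depth = queue.popleft()
        visited.append((url, depth))

        # Links are followed only while the current page is above max_depth.
        if depth >= max_depth:
            continue
        for link in FAKE_SITE.get(url, []):
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))

    return visited
```

With max_depth=1 the sketch visits the start page and its direct links only. Raising max_depth to 2 also reaches https://example.com/c, and lowering max_requests stops the loop early regardless of depth.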
The whole Actor fits in a single file. A scrape_page helper holds the Scrapling-specific fetching and parsing, while the main coroutine handles the Actor lifecycle, reads the input, sets up Apify Proxy and the request queue, and drives the crawl:
import asyncio
from typing import Any
from urllib.parse import urlsplit

from scrapling.fetchers import AsyncFetcher

from apify import Actor, Request
from apify.storages import RequestQueue


async def scrape_page(
    url: str,
    *,
    proxy_url: str | None = None,
) -> tuple[dict[str, Any], list[str]]:
    """Fetch a page with Scrapling's HTTP fetcher and return data and links."""
    # `impersonate` and `stealthy_headers` make the request look like Chrome.
    response = await AsyncFetcher.get(
        url,
        proxy=proxy_url,
        impersonate='chrome',
        stealthy_headers=True,
        timeout=60,
    )

    data = {
        'url': url,
        'title': response.css('title::text').get(),
        'h1s': response.css('h1::text').getall(),
        'h2s': response.css('h2::text').getall(),
        'h3s': response.css('h3::text').getall(),
    }

    # Keep only absolute links on the same host.
    links: list[str] = []
    host = urlsplit(url).netloc
    for href in response.css('a::attr(href)').getall():
        link_url = response.urljoin(href)
        if not link_url.startswith(('http://', 'https://')):
            continue
        if urlsplit(link_url).netloc == host:
            links.append(link_url)

    return data, links


async def enqueue_links(
    request_queue: RequestQueue,
    links: list[str],
    *,
    depth: int,
    max_depth: int,
) -> None:
    """Enqueue the links one level deeper, unless max_depth was reached."""
    if depth >= max_depth:
        return

    for link_url in links:
        Actor.log.info(f'Enqueuing {link_url} ...')
        request = Request.from_url(link_url)
        request.crawl_depth = depth + 1
        await request_queue.add_request(request)


async def main() -> None:
    async with Actor:
        # Read the Actor input.
        actor_input = await Actor.get_input() or {}
        start_urls = actor_input.get('startUrls', [{'url': 'https://crawlee.dev'}])
        max_depth = actor_input.get('maxDepth', 1)

        if not start_urls:
            Actor.log.info('No start URLs specified in Actor input, exiting...')
            await Actor.exit()

        # Set up Apify Proxy and the request queue.
        proxy_configuration = await Actor.create_proxy_configuration()
        request_queue = await Actor.open_request_queue()

        # Enqueue the start URLs (crawl depth defaults to 0).
        for start_url in start_urls:
            url = start_url.get('url')
            Actor.log.info(f'Enqueuing start URL: {url}')
            await request_queue.add_request(Request.from_url(url))

        # Cap the crawl. Raise or remove the limit to follow more pages.
        max_requests = 50
        handled_requests = 0

        while handled_requests < max_requests and (
            request := await request_queue.fetch_next_request()
        ):
            handled_requests += 1
            url = request.url
            depth = request.crawl_depth
            Actor.log.info(f'Scraping {url} (depth={depth}) ...')

            try:
                # Fresh proxy URL per request (None if no proxy).
                proxy_url = None
                if proxy_configuration:
                    proxy_url = await proxy_configuration.new_url()

                data, links = await scrape_page(url, proxy_url=proxy_url)
                await Actor.push_data(data)
                Actor.log.info(
                    f'Stored data from {url} '
                    f'(title={data["title"]!r}, {len(links)} links found).'
                )
                await enqueue_links(
                    request_queue, links, depth=depth, max_depth=max_depth
                )
            except Exception:
                Actor.log.exception(f'Cannot extract data from {url}.')
            finally:
                await request_queue.mark_request_as_handled(request)


if __name__ == '__main__':
    asyncio.run(main())
Note that:
- Keeping the fetching and parsing in scrape_page separates the Scrapling-specific code from the Actor's orchestration logic. The function returns the extracted data together with the discovered links, so main decides what to store and what to enqueue.
- The response of AsyncFetcher.get is a Scrapling selector, so response.css('title::text').get() reads the page title and response.css('a::attr(href)').getall() returns every link's href in one call.
- response.urljoin(href) resolves relative links against the page URL, so you can enqueue them directly.
- The impersonate='chrome' and stealthy_headers=True options make the request look like it comes from a real Chrome browser. Combined with Apify Proxy, they reduce the chance of being blocked.
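The link handling in scrape_page can be tried out without fetching anything. The sketch below is an approximation under stated assumptions: same_host_links is a hypothetical helper, and the standard library's urljoin stands in for response.urljoin, which may not behave identically in every edge case.

```python
from urllib.parse import urljoin, urlsplit


def same_host_links(page_url: str, hrefs: list[str]) -> list[str]:
    """Resolve hrefs against page_url and keep http(s) links on the same host."""
    host = urlsplit(page_url).netloc
    links: list[str] = []
    for href in hrefs:
        # Assumption: urljoin approximates what response.urljoin does.
        link_url = urljoin(page_url, href)
        if not link_url.startswith(('http://', 'https://')):
            continue  # Skips mailto:, javascript:, tel:, and similar schemes.
        if urlsplit(link_url).netloc == host:
            links.append(link_url)
    return links
```

For a page at https://example.com/docs/intro, the relative hrefs /about and guide resolve to https://example.com/about and https://example.com/docs/guide, while a mailto: link or a link to another host is dropped.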
Adaptive selectors
The example above uses plain CSS selectors. Scrapling can also track the elements you scrape and relocate them when a website changes its markup, so a redesign doesn't immediately break your scraper. This is most useful for scrapers that revisit the same pages over time, rather than one-off crawls.
- Enable adaptive matching once on the fetcher:
  AsyncFetcher.configure(adaptive=True)
- On the first run, pass auto_save=True when you select an element. Scrapling records a fingerprint of that element, keyed by the selector:
  title = response.css('h1.product-title::text', auto_save=True).get()
- On a later run, if the selector no longer matches because the page changed, pass adaptive=True with the same selector. Scrapling uses the saved fingerprint to find the element in its new location:
  title = response.css('h1.product-title::text', adaptive=True).get()
Scrapling keeps these fingerprints in a local SQLite database. On the Apify platform the Actor's filesystem doesn't persist between runs, so to keep them across runs, store that database in a key-value store and restore it on startup. For details, see Scrapling's adaptive parsing documentation.
Using Apify Proxy
Running on the Apify platform gives your scraper access to Apify Proxy, which rotates IP addresses to avoid rate limiting and blocking. In the example above, main creates a proxy configuration with Actor.create_proxy_configuration and passes a fresh proxy URL to scrape_page for every request, which forwards it to Scrapling's proxy argument.
Scrapling accepts the proxy as a URL string (for example http://user:pass@proxy.apify.com:8000), which is what ProxyConfiguration.new_url returns. To select specific proxy groups or a country, pass the relevant arguments to Actor.create_proxy_configuration. For details, see Proxy management. The browser-based fetchers accept the same proxy argument.
Running browser-based fetchers
DynamicFetcher and StealthyFetcher drive a real browser, so they need the browser binaries installed with the scrapling install command. Locally, run it once after installing the scrapling[fetchers] extra:
scrapling install
To switch the example from HTTP to a real browser, fetch each page through a browser session instead of AsyncFetcher. Opening a fresh browser for every page would be wasteful, so main enters an AsyncDynamicSession once and reuses it for the whole crawl, while scrape_page fetches with session.fetch. The parsing API is identical, so the extraction code stays the same:
import asyncio
from typing import Any
from urllib.parse import urlsplit

from scrapling.fetchers import AsyncDynamicSession

from apify import Actor, Request
from apify.storages import RequestQueue


async def scrape_page(
    session: AsyncDynamicSession,
    url: str,
    *,
    proxy_url: str | None = None,
) -> tuple[dict[str, Any], list[str]]:
    """Fetch a page through the shared browser session and return data and links."""
    # `network_idle` waits until the page stops making network requests.
    response = await session.fetch(url, proxy=proxy_url, network_idle=True)

    data = {
        'url': url,
        'title': response.css('title::text').get(),
        'h1s': response.css('h1::text').getall(),
        'h2s': response.css('h2::text').getall(),
        'h3s': response.css('h3::text').getall(),
    }

    # Keep only absolute links on the same host.
    links: list[str] = []
    host = urlsplit(url).netloc
    for href in response.css('a::attr(href)').getall():
        link_url = response.urljoin(href)
        if not link_url.startswith(('http://', 'https://')):
            continue
        if urlsplit(link_url).netloc == host:
            links.append(link_url)

    return data, links


async def enqueue_links(
    request_queue: RequestQueue,
    links: list[str],
    *,
    depth: int,
    max_depth: int,
) -> None:
    """Enqueue the links one level deeper, unless max_depth was reached."""
    if depth >= max_depth:
        return

    for link_url in links:
        Actor.log.info(f'Enqueuing {link_url} ...')
        request = Request.from_url(link_url)
        request.crawl_depth = depth + 1
        await request_queue.add_request(request)


async def main() -> None:
    async with Actor:
        # Read the Actor input.
        actor_input = await Actor.get_input() or {}
        start_urls = actor_input.get('startUrls', [{'url': 'https://crawlee.dev'}])
        max_depth = actor_input.get('maxDepth', 1)

        if not start_urls:
            Actor.log.info('No start URLs specified in Actor input, exiting...')
            await Actor.exit()

        # Set up Apify Proxy and the request queue.
        proxy_configuration = await Actor.create_proxy_configuration()
        request_queue = await Actor.open_request_queue()

        # Enqueue the start URLs (crawl depth defaults to 0).
        for start_url in start_urls:
            url = start_url.get('url')
            Actor.log.info(f'Enqueuing start URL: {url}')
            await request_queue.add_request(Request.from_url(url))

        # Cap the crawl. Raise or remove the limit to follow more pages.
        max_requests = 50
        handled_requests = 0

        # Open the browser once and reuse it for every page in the crawl.
        async with AsyncDynamicSession(headless=True) as session:
            while handled_requests < max_requests and (
                request := await request_queue.fetch_next_request()
            ):
                handled_requests += 1
                url = request.url
                depth = request.crawl_depth
                Actor.log.info(f'Scraping {url} (depth={depth}) ...')

                try:
                    # Fresh proxy URL per request (None if no proxy).
                    proxy_url = None
                    if proxy_configuration:
                        proxy_url = await proxy_configuration.new_url()

                    data, links = await scrape_page(session, url, proxy_url=proxy_url)
                    await Actor.push_data(data)
                    Actor.log.info(
                        f'Stored data from {url} '
                        f'(title={data["title"]!r}, {len(links)} links found).'
                    )
                    await enqueue_links(
                        request_queue, links, depth=depth, max_depth=max_depth
                    )
                except Exception:
                    Actor.log.exception(f'Cannot extract data from {url}.')
                finally:
                    await request_queue.mark_request_as_handled(request)


if __name__ == '__main__':
    asyncio.run(main())
Note that:
- AsyncDynamicSession launches one browser and keeps it open across session.fetch calls, so the crawl doesn't pay the browser-startup cost on every page.
- The proxy URL is passed per fetch, so each page can go through a fresh Apify Proxy IP while sharing the same browser.
To run this on the Apify platform, build on top of the Apify Playwright base image, which already ships a browser together with all of its system-level dependencies. Then run scrapling install during the Docker build to download the browser binaries that Scrapling expects:
FROM apify/actor-python-playwright:3.14
# Install the Actor's Python dependencies.
COPY requirements.txt ./
RUN pip install -r requirements.txt
# Download the browser binaries that Scrapling's browser fetchers need.
RUN scrapling install
# Copy in the source code and launch the Actor as a module.
COPY . ./
CMD ["python", "-m", "src"]
Conclusion
In this guide, you learned how to use Scrapling in your Apify Actors. You can now fetch pages with Scrapling's HTTP or browser-based fetchers, extract data with its CSS and XPath selectors, route requests through Apify Proxy, and run the whole thing on the Apify platform. To get started with your own scraping tasks, see the Actor templates. If you have questions or need assistance, feel free to reach out on our GitHub or join our Discord community. Happy scraping!