# Web scraper

Build a distributed web scraper, progressing from a local script to parallel remote execution with custom images.
## Step 1: Scrape links locally

Start with plain Python:
```python
import re
import urllib.request


def get_links(url):
    response = urllib.request.urlopen(url)
    html = response.read().decode("utf8")
    links = []
    for match in re.finditer('href="(.*?)"', html):
        links.append(match.group(1))
    return links


if __name__ == "__main__":
    print(get_links("http://example.com"))
```
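You can sanity-check the regex extraction against an inline HTML snippet before pointing it at a live URL. The snippet and the `extract_links` helper below are invented for illustration; the helper applies the same pattern as `get_links`, just to a string instead of a response body:

```python
import re


def extract_links(html):
    # Same href pattern as get_links, applied directly to a string
    return [m.group(1) for m in re.finditer('href="(.*?)"', html)]


sample = '<a href="/about">About</a> <a href="https://example.com/">Home</a>'
print(extract_links(sample))  # → ['/about', 'https://example.com/']
```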
## Step 2: Run it remotely

Add OpenModal; the only changes are the import, the decorator, and the entrypoint:
```python
import re
import urllib.request

import openmodal

app = openmodal.App(name="example-webscraper")


@app.function()
def get_links(url):
    response = urllib.request.urlopen(url)
    html = response.read().decode("utf8")
    links = []
    for match in re.finditer('href="(.*?)"', html):
        links.append(match.group(1))
    return links


@app.local_entrypoint()
def main(url: str = "http://example.com"):
    links = get_links.remote(url)
    print(links)
```
```shell
openmodal --local run examples/webscraper.py --url http://example.com
# or: openmodal run examples/webscraper.py --url http://example.com (GCP)
```
The function ran inside a container, not on your machine.
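Conceptually, a `.remote(...)` call serializes the arguments, runs the function in a container, and serializes the result back. A toy local analogue of that round trip, using only `pickle` (the `run_elsewhere` helper is invented here for illustration and is not part of OpenModal):

```python
import pickle


def run_elsewhere(fn, *args, **kwargs):
    # Simulate shipping a call to another machine: serialize the
    # arguments, invoke the function, then serialize the return value.
    payload = pickle.dumps((args, kwargs))
    args, kwargs = pickle.loads(payload)
    result = fn(*args, **kwargs)
    return pickle.loads(pickle.dumps(result))


print(run_elsewhere(sorted, [3, 1, 2]))  # → [1, 2, 3]
```

One practical consequence: arguments and return values must be serializable, which is why remote functions typically pass around plain data rather than open sockets or file handles.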
## Step 3: Add dependencies with a custom image

Use `requests` and `beautifulsoup4` for better HTML parsing:
```python
import openmodal

app = openmodal.App("example-webscraper-requests")

scraper_image = openmodal.Image.debian_slim().pip_install("requests", "beautifulsoup4")


@app.function(image=scraper_image)
async def get_links(url: str) -> list[str]:
    import asyncio

    import requests
    from bs4 import BeautifulSoup

    resp = await asyncio.to_thread(requests.get, url, timeout=10)
    soup = BeautifulSoup(resp.text, "html.parser")
    return [a["href"] for a in soup.find_all("a", href=True)]


@app.local_entrypoint()
def main():
    urls = ["http://example.com", "http://modal.com"]
    for links in get_links.map(urls):
        for link in links:
            print(link)
```
The first run builds the Docker image. Subsequent runs use the cached image.
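If you want to prototype the same extraction without third-party packages, the standard library's `html.parser` is a rough local analogue of what BeautifulSoup does here (a sketch, not part of the example above; the sample HTML is invented):

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect href attributes from <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


collector = LinkCollector()
collector.feed('<p><a href="/docs">Docs</a><a href="/pricing">Pricing</a></p>')
print(collector.links)  # → ['/docs', '/pricing']
```

BeautifulSoup is still the better choice for real pages, since it tolerates malformed markup that a hand-rolled parser would mishandle.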
## What this demonstrates
| Feature | How it's used |
|---|---|
| `f.remote(url)` | Run a single function call in a container |
| `f.map(urls)` | Run multiple calls in parallel |
| `Image.debian_slim()` | Base container image with Python |
| `.pip_install(...)` | Add Python packages to the image |
| `async def` | Async functions work transparently |
| CLI args (`--url`) | Entrypoint parameters become CLI flags |
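The parallel fan-out of `f.map(urls)` can be mimicked locally with a standard-library thread pool. This is only a conceptual analogue (OpenModal schedules containers, not threads), but it shows the same ordered map-over-inputs shape:

```python
from concurrent.futures import ThreadPoolExecutor


def shout(word: str) -> str:
    return word.upper()


words = ["alpha", "beta", "gamma"]
with ThreadPoolExecutor(max_workers=3) as pool:
    # Executor.map runs calls concurrently but yields results in input order
    results = list(pool.map(shout, words))

print(results)  # → ['ALPHA', 'BETA', 'GAMMA']
```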