Crawling is a product problem, not a script problem

April 9, 2026, 9:40 PM UTC · ~1 min · 136w

Most people picture a crawler as a loop that fetches URLs and parses HTML. That part is teachable. What is not teachable in a weekend is everything that happens when you run at volume against the real web: shifting layouts, rate limits, bot management, flaky networks, and the need for observability so you know what failed and why.

In practice the product is the pipeline: scheduling, retries, storage, delivery, and the human workflows that depend on the data showing up on time. The fetcher is one component. Treating the rest as an afterthought is how projects stall after the demo works.
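To make the "retries" piece concrete, here is a minimal sketch of a retry wrapper with exponential backoff and jitter. Everything named here (`FetchError`, `fetch_with_retries`, the default delays) is hypothetical and chosen for illustration; a real pipeline would also distinguish retryable from permanent failures and record each attempt somewhere observable.

```python
import random
import time
from typing import Callable, Optional


class FetchError(Exception):
    """Hypothetical error a fetcher raises for a retryable failure."""


def fetch_with_retries(
    fetch: Callable[[str], str],
    url: str,
    max_attempts: int = 4,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[str]:
    """Call fetch(url), retrying with exponential backoff plus jitter.

    Returns the body on success, or None once max_attempts is exhausted.
    """
    for attempt in range(max_attempts):
        try:
            return fetch(url)
        except FetchError:
            if attempt == max_attempts - 1:
                break  # out of attempts; the caller records the failure
            # Delay doubles each attempt: base, 2*base, 4*base, ...
            delay = base_delay * (2 ** attempt)
            # Jitter of up to half the delay spreads out retries.
            sleep(delay + random.uniform(0, delay / 2))
    return None
```

The `sleep` parameter is injectable so the wrapper can be tested without waiting, which is the kind of small design choice that separates a pipeline component from a one-off script.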

If you are hiring or scoping work, ask how success is measured beyond “we got the page once.” Consistency, cost per million requests, and time-to-recover after a target changes are where the engineering actually lives.
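As a rough illustration of what measuring those three things could look like, the sketch below computes them from a list of run records. The `CrawlRun` shape, the field names, and the 95% threshold are all assumptions made for this example, not a standard; the point is only that each metric is cheap to compute once runs are logged.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class CrawlRun:
    """One scheduled run against a target (hypothetical record shape)."""
    started_at: float  # Unix timestamp, in seconds
    requests: int      # requests sent during the run
    successes: int     # requests that yielded usable data
    cost_usd: float    # total spend attributed to the run


def success_rate(runs: List[CrawlRun]) -> float:
    """Consistency: share of all requests that produced usable data."""
    total = sum(r.requests for r in runs)
    return sum(r.successes for r in runs) / total if total else 0.0


def cost_per_million(runs: List[CrawlRun]) -> float:
    """Spend normalized to one million requests."""
    total = sum(r.requests for r in runs)
    return sum(r.cost_usd for r in runs) / total * 1_000_000 if total else 0.0


def time_to_recover(runs: List[CrawlRun], threshold: float = 0.95) -> Optional[float]:
    """Seconds from the first run below threshold to the next run at or above it.

    Returns None if no run degraded or if the target never recovered.
    """
    broke_at = None
    for r in sorted(runs, key=lambda r: r.started_at):
        rate = r.successes / r.requests if r.requests else 0.0
        if broke_at is None and rate < threshold:
            broke_at = r.started_at
        elif broke_at is not None and rate >= threshold:
            return r.started_at - broke_at
    return None
```

For example, given hourly runs where a layout change breaks extraction in the second run and a fix lands before the fourth, `time_to_recover` reports the two-hour gap between the first degraded run and the first healthy one.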