PHP isn't the first language people reach for when they think of scraping, but for collecting prices, news, ratings or reviews from sites that don't need heavy browser emulation, it's stable, simple and fast to develop in. For years the standard tool was Goutte. There's one thing you need to know before you start, though: Goutte is deprecated, and building new projects on it directly is a mistake.
What happened to Goutte
Goutte was a lightweight library built on Symfony's DomCrawler and BrowserKit that let you send HTTP requests, get HTML back, and extract data with CSS selectors or XPath. It was popular for good reason — easy to install via Composer, clean syntax, gentle learning curve.
The Goutte repository was archived in April 2023. By its final version it had already become a thin proxy around the HttpBrowser class from Symfony's BrowserKit component, so the official guidance is to drop the wrapper and use the Symfony components directly. Everything Goutte did, those components do, with active maintenance behind them. Tutorials still mention Goutte because the underlying experience is unchanged; the name is just legacy now.
The modern PHP scraping stack
Instead of fabpot/goutte, install the components it wrapped:
composer require symfony/browser-kit
composer require symfony/http-client
composer require symfony/dom-crawler
composer require symfony/css-selector
Each has a job:
- browser-kit simulates a browser — navigation, form submission, cookie handling. This is the core that Goutte exposed.
- http-client makes the actual requests, with redirects, timeouts and — importantly for scraping — proxy configuration.
- dom-crawler is the extraction engine: CSS-selector and XPath traversal of the HTML.
- css-selector translates CSS selectors into XPath under the hood.
The migration is mechanical: where old code used Goutte\Client, you use Symfony\Component\BrowserKit\HttpBrowser. The crawler object and the rest of the API stay the same.
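To make the swap concrete, here is a minimal sketch of the replacement for `new Goutte\Client()`. The helper names `makeBrowser` and `pageHeading`, the 30-second timeout and the example URL are illustrative assumptions, not part of either library.

```php
<?php
// Assumes the four Composer packages listed above are installed.
require __DIR__ . '/vendor/autoload.php';

use Symfony\Component\BrowserKit\HttpBrowser;
use Symfony\Component\HttpClient\HttpClient;
use Symfony\Contracts\HttpClient\HttpClientInterface;

/**
 * Hypothetical helper: builds the browser that stands in for Goutte\Client.
 * An HTTP client can be injected; otherwise one is created with a 30 s timeout.
 */
function makeBrowser(?HttpClientInterface $http = null): HttpBrowser
{
    return new HttpBrowser($http ?? HttpClient::create(['timeout' => 30]));
}

/**
 * Hypothetical helper: fetches a page and returns the text of its first <h1>.
 * request() hands back a DomCrawler Crawler, just as the Goutte client did.
 */
function pageHeading(HttpBrowser $browser, string $url): string
{
    $crawler = $browser->request('GET', $url);

    // text() throws if no <h1> matches; pass a default to text() to avoid that.
    return $crawler->filter('h1')->text();
}

// Example call (placeholder URL):
// echo pageHeading(makeBrowser(), 'https://example.com/'), PHP_EOL;
```

Accepting the HTTP client as a parameter is a design choice made for this sketch: it lets the same browser be built with different timeouts or proxies, or with a mock client in tests.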
How a PHP scraper works, step by step
The shape of the job hasn't changed in years:
- Set up the client: create an HttpBrowser instance that will send requests and hand back parsed pages. This is also where you wire in HTTP settings like timeouts and a proxy.
- Fetch the page: request a URL; you get back a Crawler object wrapping the page's HTML.
- Extract: use CSS selectors or XPath to pull out the elements you want, such as headlines, prices and review blocks.
- Pull attributes: text is only half of it. Extract href for links, src for media, plus id, class and data-* attributes, which are invaluable for building a catalogue, mapping internal links, or collecting image URLs.
- Follow links: have the scraper click through from a list to detail pages, the way a person would, only faster.
- Handle forms: for data behind a search, login or filter, BrowserKit can fill and submit forms as if a real user did.
- Handle pagination: chain requests across page-by-page listings until you've collected everything.
- Store: clean, structure and save to files, a database, or onward to another system.
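The extraction, attribute and pagination steps above can be sketched as follows. The markup, the class names (`.product`, `.name`, `.price`, `a.next`), the `data-sku` attribute and both function names are invented for illustration; a real site will need its own selectors.

```php
<?php
// Assumes the four Composer packages listed above are installed.
require __DIR__ . '/vendor/autoload.php';

use Symfony\Component\BrowserKit\HttpBrowser;
use Symfony\Component\DomCrawler\Crawler;

/**
 * Hypothetical extractor: turns each .product element into an array
 * holding its text (name, price) and attributes (href, data-sku).
 */
function extractProducts(Crawler $page): array
{
    return $page->filter('.product')->each(function (Crawler $node): array {
        $link = $node->filter('a.name');

        return [
            'name'  => $link->text(),
            'url'   => $link->attr('href'),
            'sku'   => $node->attr('data-sku'),
            'price' => (float) ltrim($node->filter('.price')->text(), '$'),
        ];
    });
}

/**
 * Hypothetical pagination loop: collects products from the start page,
 * then follows the "a.next" link until it disappears or $maxPages is hit.
 */
function scrapeAllPages(HttpBrowser $browser, string $startUrl, int $maxPages = 50): array
{
    $items = [];
    $crawler = $browser->request('GET', $startUrl);

    for ($page = 1; ; $page++) {
        $items = array_merge($items, extractProducts($crawler));

        $next = $crawler->filter('a.next');
        if ($page >= $maxPages || $next->count() === 0) {
            break;
        }
        // click() resolves the link against the current page URL and fetches it.
        $crawler = $browser->click($next->link());
    }

    return $items;
}

// Offline demonstration on a made-up HTML fragment; in a live scraper the
// Crawler would come from $browser->request('GET', $url) instead.
$html = <<<'HTML'
<ul>
  <li class="product" data-sku="A1"><a class="name" href="/p/a1">Kettle</a> <span class="price">$24.50</span></li>
  <li class="product" data-sku="B2"><a class="name" href="/p/b2">Toaster</a> <span class="price">$39.00</span></li>
</ul>
HTML;

print_r(extractProducts(new Crawler($html)));
```

Forms follow the same pattern: after an initial request, a call such as `$browser->submitForm('Search', ['q' => 'kettle'])` (button label and field name are hypothetical here) submits the form and returns the Crawler for the resulting page.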
The parts that decide whether it survives in production
Two things separate a demo scraper from one that runs reliably.
Error handling. On a test bench everything works; in the wild, pages vanish, servers time out, and HTML changes overnight. A solid scraper catches errors, skips a broken block, retries when sensible, and adds delays between requests so it doesn't hammer the server.
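A minimal sketch of that behaviour, under stated assumptions: the function names, the retry count, the linear back-off and the rule for which status codes are worth retrying are all choices made for this example, not library defaults. HttpBrowser does not throw on 4xx/5xx responses, so the status code is checked explicitly.

```php
<?php
// Assumes the four Composer packages listed above are installed.
require __DIR__ . '/vendor/autoload.php';

use Symfony\Component\BrowserKit\HttpBrowser;
use Symfony\Component\DomCrawler\Crawler;
use Symfony\Contracts\HttpClient\Exception\TransportExceptionInterface;

/**
 * Hypothetical helper: fetches a URL, retrying on network failures,
 * 5xx responses and 429. Returns null when the page cannot be fetched.
 */
function fetchWithRetry(
    HttpBrowser $browser,
    string $url,
    int $maxAttempts = 3,
    float $delaySeconds = 1.0
): ?Crawler {
    for ($attempt = 1; $attempt <= $maxAttempts; $attempt++) {
        try {
            $crawler = $browser->request('GET', $url);
            $status = $browser->getInternalResponse()->getStatusCode();

            if ($status < 400) {
                return $crawler;
            }
            if ($status < 500 && $status !== 429) {
                return null; // e.g. 404: the page is gone, retrying will not help
            }
        } catch (TransportExceptionInterface $e) {
            // Timeout or connection failure: fall through and retry.
        }

        if ($attempt < $maxAttempts) {
            // Wait longer after each failed attempt (1x, 2x, ... the base delay).
            usleep((int) ($delaySeconds * $attempt * 1000000));
        }
    }

    return null;
}

/**
 * Hypothetical helper: returns the text of the first match, or null when the
 * selector finds nothing, so one missing block does not abort the whole run.
 */
function textOrNull(Crawler $node, string $selector): ?string
{
    $match = $node->filter($selector);

    return $match->count() > 0 ? $match->text() : null;
}
```

The same `usleep()` call, placed between successive page requests, is one simple way to add the delay that keeps a scraper from hammering the server.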
Proxies. Any real scraping volume runs into rate limits, IP checks and geo-restrictions. Routing through proxies distributes the load, hides your real IP, and lets you reach region-specific data — and http-client accepts proxy configuration directly, so it slots straight into the stack. For steady, structured extraction the dependable choice is a clean, stable address rather than a shared, already-flagged one: a dedicated static IPv4 or ISP proxy gives a predictable origin with HTTP and SOCKS5 on one port.
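As a rough illustration of how the proxy slots in, the sketch below passes it through http-client's `proxy` option. The function name, the 15-second timeout, and the proxy host and credentials in the commented call are placeholders.

```php
<?php
// Assumes the four Composer packages listed above are installed.
require __DIR__ . '/vendor/autoload.php';

use Symfony\Component\BrowserKit\HttpBrowser;
use Symfony\Component\HttpClient\HttpClient;

/**
 * Hypothetical helper: builds a browser whose requests go through one proxy.
 * $proxyUrl has the form scheme://user:pass@host:port.
 */
function makeProxiedBrowser(string $proxyUrl): HttpBrowser
{
    $http = HttpClient::create([
        'proxy'   => $proxyUrl, // every request from this client uses the proxy
        'timeout' => 15,        // seconds; proxied requests can be slower
    ]);

    return new HttpBrowser($http);
}

// Example call (placeholder host and credentials):
// $browser = makeProxiedBrowser('http://user:pass@proxy.example.com:8080');
// $crawler = $browser->request('GET', 'https://example.com/');
```

Keeping the proxy URL as a parameter, rather than hard-coding it, makes it easy to load it from an environment variable or to switch addresses per region.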
Why PHP still makes sense
For sites that don't need full browser rendering, the Symfony stack hits a sweet spot — minimal setup, precise extraction via CSS/XPath, good speed, and easy integration with whatever stores or processes the data afterwards. The tooling is mature and maintained. Just skip Goutte the package and use the components directly: same approach, same syntax, none of the deprecation baggage.