How to find all links using DOM Crawler and PHP?

Learn to extract all webpage links using PHP and DOM Crawler. Discover practical examples with filter and filterXPath methods to enhance your scraping skills.

Tuan Nguyen

Using DOM Crawler alongside PHP opens up a streamlined avenue for extracting every hyperlink from a webpage. This capability is particularly handy for web scraping projects where gathering link data is essential. Both the filter and filterXPath methods stand at the ready to aid in your data extraction efforts, offering a straightforward approach to pinpoint and retrieve hyperlink details.

Read on for an in-depth explanation.
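If the packages aren't already part of your project, both can be pulled in with Composer (standard package names, shown here as a typical setup step):

```shell
composer require symfony/dom-crawler guzzlehttp/guzzle
```
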

Start by incorporating PHP's DOM Crawler and Guzzle to streamline the process:

<?php
use Symfony\Component\DomCrawler\Crawler;
use GuzzleHttp\Client;

// Set up the Guzzle client
$client = new Client();
$response = $client->get('https://serply.io');
$html = (string) $response->getBody();

// Initiate the DOM Crawler
$crawler = new Crawler($html);

// Zero in on all page links
$links = $crawler->filter('a');

// Iterate and display link addresses
foreach ($links as $link) {
    echo $link->getAttribute('href').PHP_EOL;
}
?>

This script fetches all `<a>` tags from "serply.io" and prints their URLs. It's a handy way to catalog every link on a page.
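Because `filter` returns a `Crawler` instance, you can also let DOM Crawler gather the attributes for you with its `extract` method instead of looping manually. The sketch below uses an inline HTML string so it runs without an HTTP request; the markup and URLs are made up for illustration:

```php
<?php
use Symfony\Component\DomCrawler\Crawler;

// A small inline document, so no network request is needed
$html = '<html><body>'
    . '<a href="/about">About</a>'
    . '<a href="https://serply.io/docs">Docs</a>'
    . '</body></html>';

$crawler = new Crawler($html);

// extract() collects the listed attributes from every matched node at once
$hrefs = $crawler->filter('a')->extract(['href']);

print_r($hrefs); // ['/about', 'https://serply.io/docs']
```

This is a convenient shortcut when all you need is a flat array of attribute values rather than the DOM nodes themselves.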

Extracting Links via the `filterXPath` Method

For those who prefer XPath for its detailed querying capabilities, here's how to apply it:

<?php
use Symfony\Component\DomCrawler\Crawler;
use GuzzleHttp\Client;

// Initialize the Guzzle client
$client = new Client();
$response = $client->get('https://www.serply.com');
$html = (string) $response->getBody();

// Load into DOM Crawler
$crawler = new Crawler($html);

// Fetch all links using XPath
$links = $crawler->filterXPath('//a');

// Loop through and print each link's URL
foreach ($links as $link) {
    echo $link->getAttribute('href') . PHP_EOL;
}
?>

This approach does much the same as the first but leverages the power of XPath for pinpoint precision in link selection.
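One practical wrinkle with either method: `href` values are frequently relative (`/pricing` rather than a full URL). DOM Crawler can resolve them for you if you pass the page's URL as the second constructor argument and read links through the `link()` helper. A minimal sketch, again using inline HTML and an assumed base URL:

```php
<?php
use Symfony\Component\DomCrawler\Crawler;

$html = '<html><body><a href="/pricing">Pricing</a></body></html>';

// Supplying the page URL lets the crawler resolve relative hrefs
$crawler = new Crawler($html, 'https://serply.io');

// link() returns a Link object whose getUri() is an absolute URL
$urls = $crawler->filter('a')->each(
    fn (Crawler $node) => $node->link()->getUri()
);

print_r($urls); // ['https://serply.io/pricing']
```

Resolving links this way spares you from hand-rolling URL joining logic when crawling beyond a single page.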

Wrapping It Up

To sum up, combining PHP's DOM Crawler with either the `filter` or `filterXPath` methods provides a solid foundation for extracting all links from any webpage. By integrating Guzzle, you can efficiently pull and process the content of web pages. Whether you're compiling a list of resources or analyzing web content structure, these tools and methods make web scraping a breeze, turning complex data extraction tasks into straightforward scripts. For further exploration and enhancement of your web scraping capabilities, consider the Google Crawl API, which offers advanced features tailored for scraping at scale.