PHP – Parse and Extract Image URL from HTML

When working with HTML content, it is often necessary to extract specific information, such as image URLs, from the markup. In PHP, there are multiple approaches to achieve this. In this article, we will explore two commonly used methods: preg_match_all and the PHP DOMDocument class.

Let's assume we have an HTML content snippet as follows:

$html = '<html>
    <body>
        <div class="example-class">
            <p><img src="https://terryl.in/1.jpg" class="img-responsive"></p>
            <p><img src="https://terryl.in/2.jpg" class="img-responsive"></p>
        </div>
    </body>
</html>';

Using preg_match_all

One way to parse HTML and extract image URLs is by using the preg_match_all function in PHP. This function allows us to search for patterns using regular expressions. In this case, we want to find all occurrences of the img tag and extract the src attribute.

To achieve this, we can use the following code:

preg_match_all('/<img.*?src=[\'"](.*?)[\'"].*?>/i', $html, $matches);
var_dump($matches);

Result:

array(2) {
  [0]=>
  array(2) {
    [0]=>
    string(58) "<img src="https://terryl.in/1.jpg" class="img-responsive">"
    [1]=>
    string(58) "<img src="https://terryl.in/2.jpg" class="img-responsive">"
  }
  [1]=>
  array(2) {
    [0]=>
    string(23) "https://terryl.in/1.jpg"
    [1]=>
    string(23) "https://terryl.in/2.jpg"
  }
}

When executed, this code will output an array containing two subarrays: $matches[0] and $matches[1]. $matches[0] contains the complete image tags, while $matches[1] contains only the extracted URLs.

To display the extracted URLs individually, we can use a loop:

$elements = $matches[1];

foreach($elements as $element) {
    echo $element . "\n";
}

By running this code, we obtain the following output:

https://terryl.in/1.jpg
https://terryl.in/2.jpg
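
The pattern above works for the sample markup, but it also matches attributes whose names merely end in src (such as data-src), and it accepts mismatched quotes. A slightly stricter variant is sketched below. The function name extractImageUrlsWithRegex and the sample string are assumptions made for illustration only.

```php
<?php
// Hypothetical helper, not part of the original code.
function extractImageUrlsWithRegex(string $html): array
{
    // \s before "src" skips attributes such as data-src;
    // the backreference \1 requires the closing quote to match the opening one.
    $pattern = '/<img\b[^>]*?\ssrc\s*=\s*([\'"])(.*?)\1[^>]*>/is';

    // preg_match_all() returns false on failure (for example, a backtracking limit).
    if (preg_match_all($pattern, $html, $matches) === false) {
        return [];
    }

    // Group 1 holds the quote character, group 2 holds the URL.
    return $matches[2];
}

// Assumed sample input: one tag with a data-src attribute, one with single quotes.
$sample = '<p><img data-src="lazy.jpg" src="https://terryl.in/1.jpg"></p>'
        . "<p><img class='x' src='https://terryl.in/2.jpg'></p>";

foreach (extractImageUrlsWithRegex($sample) as $url) {
    echo $url . "\n";
}
// Prints:
// https://terryl.in/1.jpg
// https://terryl.in/2.jpg
```

This is still a regular expression, so it remains fragile: unquoted attribute values or a > character inside an attribute value will defeat it.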

Using PHP DOMDocument

Another way to parse HTML and extract image URLs is the PHP DOMDocument class. It parses the markup into a DOM tree in which everything is a node, including the img elements and their src attributes, and it lets you fetch lists of nodes from that tree. For example:

$doc = new DOMDocument();
$doc->loadHTML($html);
$elements = $doc->getElementsByTagName('img');

foreach($elements as $element) {
    echo $element->getAttribute('src') . "\n";
}

When executed, this code will also output the same URLs as the previous method:

https://terryl.in/1.jpg
https://terryl.in/2.jpg

Here, getElementsByTagName('img') returns every img element in the document, and getAttribute('src') reads the src attribute of each one. Because the markup is parsed rather than pattern-matched, the order of attributes and the style of quoting do not matter.
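
Real-world markup is often less tidy than the sample, and loadHTML() emits PHP warnings when it meets tags or structures it does not recognize. The sketch below shows one possible way to deal with that: collect libxml errors internally, then select the src attribute nodes with DOMXPath. The function name extractImageUrlsWithDom and the sample string are assumptions made for illustration.

```php
<?php
// Hypothetical helper, introduced only for this sketch.
function extractImageUrlsWithDom(string $html): array
{
    // loadHTML() rejects an empty string, so return early.
    if ($html === '') {
        return [];
    }

    // Collect libxml parse errors internally instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);

    $doc = new DOMDocument();
    $doc->loadHTML($html);

    // Discard the collected errors and restore the previous setting.
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Select the src attribute nodes of all img elements directly.
    $xpath = new DOMXPath($doc);
    $urls = [];
    foreach ($xpath->query('//img/@src') as $attribute) {
        $urls[] = $attribute->nodeValue;
    }

    return $urls;
}

// Assumed sample input; <section> is a tag the underlying HTML parser may warn about.
$sample = '<div><img src="https://terryl.in/1.jpg">'
        . '<section><img class="x" src="https://terryl.in/2.jpg"></section></div>';

foreach (extractImageUrlsWithDom($sample) as $url) {
    echo $url . "\n";
}
```

The XPath query returns attribute nodes in document order, so images without a src attribute are skipped automatically.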

Conclusion

Now that we have explored both methods, the question arises: which one should you prefer?

The answer depends on your specific requirements and constraints. If you are working with smaller HTML snippets and do not want to depend on the DOM extension, preg_match_all can be a suitable choice. It is straightforward, and the PCRE functions are part of the PHP core, so no additional extension is needed.

On the other hand, if you are dealing with larger HTML documents and prefer a more structured, object-oriented approach, PHP DOMDocument is the way to go. It requires the DOM extension (which depends on libxml and is enabled by default in most PHP builds), but it offers better support for handling complex HTML structures.
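
If the code has to run on systems where DOM support may or may not be present, one possible arrangement is to check for the class at runtime and fall back to the regular expression. This is only a sketch: the function name extractImageUrls and the choice to prefer DOMDocument when it is available are assumptions, not a recommendation from the comparison above.

```php
<?php
// Hypothetical helper; the name and the fallback order are assumptions.
function extractImageUrls(string $html): array
{
    if ($html === '') {
        return [];
    }

    // DOMDocument exists only when PHP's DOM support is available.
    if (class_exists('DOMDocument')) {
        $previous = libxml_use_internal_errors(true);
        $doc = new DOMDocument();
        $doc->loadHTML($html);
        libxml_clear_errors();
        libxml_use_internal_errors($previous);

        $urls = [];
        foreach ($doc->getElementsByTagName('img') as $img) {
            if ($img->hasAttribute('src')) {
                $urls[] = $img->getAttribute('src');
            }
        }
        return $urls;
    }

    // Fallback: the same pattern used in the preg_match_all section.
    if (preg_match_all('/<img.*?src=[\'"](.*?)[\'"].*?>/i', $html, $matches) === false) {
        return [];
    }
    return $matches[1];
}

$html = '<div class="example-class">
    <p><img src="https://terryl.in/1.jpg" class="img-responsive"></p>
    <p><img src="https://terryl.in/2.jpg" class="img-responsive"></p>
</div>';

foreach (extractImageUrls($html) as $url) {
    echo $url . "\n";
}
```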

I prefer to use preg_match_all instead of PHP DOMDocument for two reasons. First, PHP DOMDocument requires the DOM extension (and, with it, libxml) to be available on the system. Second, preg_match_all tends to be faster when parsing large HTML content.
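
The speed difference is easy to measure on your own data. The sketch below times both approaches with hrtime() (available since PHP 7.3) on a synthetic document. The document size of 5000 images is an arbitrary assumption, and the numbers will vary with the PHP version, the machine, and the shape of the real markup, so treat it as a starting point rather than a benchmark result.

```php
<?php
// Build a large synthetic document; 5000 repetitions is an arbitrary choice.
$html = '<html><body>'
      . str_repeat('<p><img src="https://terryl.in/1.jpg" class="img-responsive"></p>', 5000)
      . '</body></html>';

// Time the regular-expression approach.
$start = hrtime(true);
preg_match_all('/<img.*?src=[\'"](.*?)[\'"].*?>/i', $html, $matches);
$regexNs = hrtime(true) - $start;

// Time the DOMDocument approach, including parsing.
$start = hrtime(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
$domUrls = [];
foreach ($doc->getElementsByTagName('img') as $img) {
    $domUrls[] = $img->getAttribute('src');
}
$domNs = hrtime(true) - $start;

// hrtime(true) returns nanoseconds; convert to milliseconds for display.
printf("preg_match_all: %.2f ms (%d URLs)\n", $regexNs / 1e6, count($matches[1]));
printf("DOMDocument:    %.2f ms (%d URLs)\n", $domNs / 1e6, count($domUrls));
```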

So, which one do you prefer?
