GitHub - duzun/hQuery.php: An extremely fast web scraper that parses megabytes of invalid HTML in a blink of an eye. PHP5.3+, no dependencies.
hQuery.php

An extremely fast and efficient web scraper that can parse megabytes of invalid HTML in a blink of an eye.

You can use the familiar jQuery/CSS selector syntax to easily find the data you need.

In my unit tests, I require it to be at least 10 times faster than Symfony's DomCrawler on a 3 MB HTML document. In practice, according to my own modest tests, it is two to three orders of magnitude faster than DomCrawler in some cases, especially when selecting thousands of elements, and on average it uses about half the RAM.

See tests/README.md.

API Documentation

Features

  • Very fast parsing and lookup
  • Parses broken HTML
  • jQuery-like style of DOM traversal
  • Low memory usage
  • Can handle big HTML documents (I have tested up to 20 MB, but the limit is the amount of RAM you have)
  • Doesn't require cURL to be installed and automatically handles redirects (see hQuery::fromUrl())
  • Caches response for multiple processing tasks
  • PSR-7 friendly (see hQuery::fromHTML($message))
  • PHP 5.3+
  • No dependencies

Install

Just add this folder to your project, add include_once 'hquery.php'; to your code, and you are ready to hQuery.

Alternatively, run composer require duzun/hquery

or, using npm, run npm install hquery.php and then require_once 'node_modules/hquery.php/hquery.php';.

Usage

Basic setup:

// Optionally use namespaces
use duzun\hQuery;

// Either use composer, or include this file:
include_once '/path/to/libs/hquery.php';

// Set the cache path - must be a writable folder
// If not set, hQuery::fromUrl() would make a new request on each call
hQuery::$cache_path = "/path/to/cache";

// Time to keep request data in cache, seconds
// A value of 0 disables cache
hQuery::$cache_expires = 3600; // default one hour

I would recommend using php-http/cache-plugin with a PSR-7 client for better flexibility.
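
Since the cache path must be a writable folder, one way to prepare it before assigning hQuery::$cache_path is sketched below. The helper ensureCacheDir() and the folder name hquery-cache are hypothetical and not part of hQuery. Only built-in PHP functions are used, and the final assignment is guarded so the snippet also runs where the library is not loaded.

```php
<?php
// Hypothetical helper (not part of hQuery): make sure a cache folder exists
// and is writable. Returns the path on success, or null on failure.
function ensureCacheDir($path)
{
    // Create the folder (and any missing parents) if it does not exist yet.
    if (!is_dir($path) && !@mkdir($path, 0775, true)) {
        return null;
    }
    return is_writable($path) ? $path : null;
}

// Assumed location: a subfolder of the system temp directory.
$cachePath = ensureCacheDir(sys_get_temp_dir() . DIRECTORY_SEPARATOR . 'hquery-cache');

// Only touch hQuery's static properties when the library is actually loaded.
if ($cachePath !== null && class_exists('duzun\\hQuery')) {
    \duzun\hQuery::$cache_path    = $cachePath;
    \duzun\hQuery::$cache_expires = 3600; // one hour, as in the setup above
}
```

If ensureCacheDir() returns null, the cache path is left unset, which according to the comments above means every call makes a new request.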

Load HTML from a file

hQuery::fromFile ( string $filename , boolean $use_include_path = false, resource $context = NULL )
// Local
$doc = hQuery::fromFile('/path/to/filesystem/doc.html');

// Remote
$doc = hQuery::fromFile('https://example.com/', false, $context);

Where $context is created with stream_context_create().

For an example of using $context to make an HTTP request through a proxy, see #26.
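
A minimal sketch of such a $context, built only with PHP's stream_context_create(). The proxy address, timeout and User-Agent value are placeholder assumptions, and this is not necessarily the same configuration as the one discussed in #26. The hQuery call is left commented out because it needs the library and network access.

```php
<?php
// Placeholder proxy address; replace it with a real proxy.
$proxy = 'tcp://127.0.0.1:8080';

$context = stream_context_create(array(
    'http' => array(
        'proxy'           => $proxy, // route the request through the proxy
        'request_fulluri' => true,   // send the full URI in the request line, which many proxies expect
        'timeout'         => 7,      // seconds to wait before giving up
        'header'          => "User-Agent: my-scraper/1.0\r\n", // hypothetical UA string
    ),
));

// With hQuery loaded, the context is passed as the third argument:
// $doc = \duzun\hQuery::fromFile('https://example.com/', false, $context);
```

The 'http' options group is also used by PHP's https:// wrapper, so the same context can be passed for either scheme.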

Load HTML from a string

hQuery::fromHTML ( string $html , string $url = NULL )
$doc = hQuery::fromHTML('<html><head><title>Sample HTML Doc</title><body>Contents...</body></html>');

// Set base_url, in case the document is loaded from local source.
// Note: The base_url property is used to retrieve absolute URLs from relative ones.
$doc->base_url = 'http://desired-host.net/path';

Load a remote HTML document

hQuery::fromUrl ( string $url , array $headers = NULL, array|string $body = NULL, array $options = NULL )
use duzun\hQuery;

// GET the document
$doc = hQuery::fromUrl('http://example.com/someDoc.html', ['Accept' => 'text/html,application/xhtml+xml;q=0.9,*/*;q=0.8']);

var_dump($doc->headers); // See response headers
var_dump(hQuery::$last_http_result); // See response details of last request

// with POST
$doc = hQuery::fromUrl(
    'http://example.com/someDoc.html', // url
    ['Accept' => 'text/html,application/xhtml+xml;q=0.9,*/*;q=0.8'], // headers
    ['username' => 'Me', 'fullname' => 'Just Me'], // request body - could be a string as well
    ['method' => 'POST', 'timeout' => 7, 'redirect' => 7, 'decode' => 'gzip'] // options
);

For building advanced requests (POST, parameters, etc.), see hQuery::http_wr(), though I recommend using a specialized (PSR-7?) library for making requests and hQuery::fromHTML($html, $url=NULL) for processing the results. See Guzzle, for example.
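
The POST example notes that the request body could be a string as well. Below is a sketch of preparing such a string with PHP's built-in http_build_query(), using the same fields. How hQuery encodes an array body internally is not stated here, so treat the two forms as equivalent only by assumption. The fromUrl() call is commented out because it needs the library and a reachable server.

```php
<?php
// Same fields as in the POST example.
$fields = array('username' => 'Me', 'fullname' => 'Just Me');

// URL-encode the fields into an application/x-www-form-urlencoded string.
// The separator is passed explicitly so the result does not depend on php.ini.
$body = http_build_query($fields, '', '&');

echo $body, PHP_EOL; // username=Me&fullname=Just+Me

// Request options copied from the POST example.
$options = array('method' => 'POST', 'timeout' => 7, 'redirect' => 7, 'decode' => 'gzip');

// With hQuery loaded:
// $doc = \duzun\hQuery::fromUrl('http://example.com/someDoc.html', null, $body, $options);
```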

PSR-7 example:

composer require php-http/message php-http/discovery php-http/curl-client

If you don't have the cURL PHP extension, just replace php-http/curl-client with php-http/socket-client in the above command.

use duzun\hQuery;

use Http\Discovery\HttpClientDiscovery;
use Http\Discovery\MessageFactoryDiscovery;

$client = HttpClientDiscovery::find();
$messageFactory = MessageFactoryDiscovery::find();

$request = $messageFactory->createRequest(
  'GET',
  'http://example.com/someDoc.html',
  ['Accept' => 'text/html,application/xhtml+xml;q=0.9,*/*;q=0.8']
);

$response = $client->sendRequest($request);

$doc = hQuery::fromHTML($response, $request->getUri());

Another option is to use stream_context_create() to create a $context, then call hQuery::fromFile($url, false, $context).

Processing the results

hQuery::find ( string $sel , array|string $attr = NULL, hQuery\Node $ctx = NULL )
// Find all banners (images inside anchors)
$banners = $doc->find('a[href] > img[src]:parent');

// Extract links and images
$links  = array();
$images = array();
$titles = array();

// If the result of find() is not empty
// $banners is a collection of elements (hQuery\Element)
if ( $banners ) {

    // Iterate over the result
    foreach($banners as $pos => $a) {
        $links[$pos] = $a->attr('href'); // get absolute URL from href property
        $titles[$pos] = trim($a->text()); // strip all HTML tags and leave just text

        // Filter the result
        if ( !$a->hasClass('logo') ) {
            // $a->style property is the parsed $a->attr('style')
            if ( strtolower($a->style['position']) == 'fixed' ) continue;

            $img = $a->find('img')[0]; // ArrayAccess
            if ( $img ) $images[$pos] = $img->src; // short for $img->attr('src')
        }
    }

    // If at least one element has the class .home
    if ( $banners->hasClass('home') ) {
        echo 'There is .home button!', PHP_EOL;

        // ArrayAccess for elements and properties.
        if ( $banners[0]['href'] == '/' ) {
            echo 'And it is the first one!';
        }
    }
}

// Read charset of the original document (internally it is converted to UTF-8)
$charset = $doc->charset;

// Get the size of the document ( strlen($html) )
$size = $doc->size;
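
The loop above fills $links, $titles and $images, all keyed by the element's position, and $images can lack entries for elements that were skipped. Below is a sketch of merging them into one record per banner, with sample data standing in for a parsed document. The helper name collectBanners() and the sample values are hypothetical and not part of hQuery.

```php
<?php
// Hypothetical helper: merge position-keyed arrays into one record per banner.
function collectBanners(array $links, array $titles, array $images)
{
    $records = array();
    foreach ($links as $pos => $href) {
        $records[$pos] = array(
            'href'  => $href,
            'title' => isset($titles[$pos]) ? $titles[$pos] : '',
            // Logos and fixed-position banners have no image entry.
            'image' => isset($images[$pos]) ? $images[$pos] : null,
        );
    }
    return $records;
}

// Sample data standing in for the arrays built in the loop.
$links  = array(0 => 'http://example.com/', 1 => 'http://example.com/promo');
$titles = array(0 => 'Home', 1 => 'Promo');
$images = array(1 => 'http://example.com/img/promo.png');

$records = collectBanners($links, $titles, $images);
```

Keeping the original position as the key preserves the link between each record and the element it came from in $banners.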

Live Demo

On DUzun.Me

A lot of people ask for the source of my Live Demo page. Here it is:

view-source:https://duzun.me/playground/hquery

Run the playground

You can easily run any of the examples/ on your local machine. All you need is PHP installed on your system. After you clone the repo with git clone https://github.com/duzun/hQuery.php.git, you have several options to start a web server.

Option 1:

cd hQuery.php/examples
php -S localhost:8000

# open browser http://localhost:8000/

Option 2 (browser-sync):

This option starts a live-reload server and is good for playing with the code.

npm install
gulp

# open browser http://localhost:8080/

Option 3 (VSCode):

If you are using VSCode, simply open the project and run the debugger (F5).

TODO

  • Unit test everything
  • Document everything
  • Cookie support (implemented in mem for redirects)
  • Improve selectors to be able to select by attributes
  • Add more selectors
  • Use HTTPlug internally

Support my projects

I love Open Source. Whenever possible I share cool things with the world (check out NPM and GitHub).

If you like what I'm doing and this project helps you reduce development time, please consider:

  • ★ Starring and sharing the projects you like (and use)
  • Buying me a cup of coffee - PayPal.me/duzuns (contact at duzun.me)
  • Sending me some Bitcoin at this address: bitcoin:3MVaNQocuyRUzUNsTbmzQC8rPUQMC9qafa