Arachnid Web Crawler

This library will crawl all unique internal links found on a given website up to a specified maximum page depth.

This library uses the symfony/panther and FriendsOfPHP/Goutte libraries to scrape site pages and extract the main SEO-related information, including: title, h1 elements, h2 elements, statusCode, contentType, meta description, meta keywords, and canonicalLink.

This library is based on the original blog post by Zeid Rashwani here:

http://zrashwani.com/simple-web-spider-php-goutte

Josh Lockhart adapted the original blog post's code (with permission) for Composer and Packagist and updated the syntax to conform with the PSR-2 coding standard.


How to Install

You can install this library with Composer. Drop this into your composer.json manifest file:

{
    "require": {
        "zrashwani/arachnid": "dev-master"
    }
}

Then run composer install.
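
Alternatively, you can require the package from the command line, which pulls in the latest tagged release instead of dev-master:

composer require zrashwani/arachnid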

Getting Started

Basic Usage:

Here's a quick demo to crawl a website:

    <?php

    require 'vendor/autoload.php';

    $url = 'http://www.example.com';
    $linkDepth = 3;
    // Initiate the crawl; by default it will use the HTTP client (GoutteClient)
    $crawler = new \Arachnid\Crawler($url, $linkDepth);
    $crawler->traverse();

    // Get link data
    $links = $crawler->getLinksArray(); // to get links as objects, use the getLinks() method
    print_r($links);
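
To see what the crawler collected for a single page without assuming specific field names, you can dump the first entry. The loop below is an illustrative inspection snippet, not part of the library's API:

    // Inspect the first collected entry; the keys of $info are defined by the library
    foreach ($links as $uri => $info) {
        echo $uri, PHP_EOL;
        print_r($info);
        break; // first entry only
    }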

Enabling Headless Browser mode:

Headless browser mode can be enabled so that the crawler uses the Chrome engine in the background, which is useful for getting the contents of JavaScript-based sites.

The enableHeadlessBrowserMode method sets the scraping adapter to PantherChromeAdapter, which is based on the Symfony Panther library:

    $crawler = new \Arachnid\Crawler($url, $linkDepth);
    $crawler->enableHeadlessBrowserMode()
            ->traverse()
            ->getLinksArray();

In order to use this, you need to have chromedriver installed on your machine; you can use dbrekelmans/browser-driver-installer to install it locally:

composer require --dev dbrekelmans/bdi
./vendor/bin/bdi driver:chromedriver drivers

Advanced Usage:

Set additional options on the underlying HTTP client by specifying an array of options in the constructor, or by creating an HTTP client scraper with the desired options:

    <?php

    use \Arachnid\Adapters\CrawlingFactory;

    // The third parameter is the options used to configure the HTTP client
    $clientOptions = ['auth_basic' => array('username', 'password')];
    $crawler = new \Arachnid\Crawler('http://github.com', 2, $clientOptions);

    // Or create and set the scrap client explicitly
    $options = array(
        'verify_host' => false,
        'verify_peer' => false,
        'timeout' => 30,
    );

    $scrapperClient = CrawlingFactory::create(CrawlingFactory::TYPE_HTTP_CLIENT, $options);
    $crawler->setScrapClient($scrapperClient);

You can inject a PSR-3 compliant logger object (like Monolog) to monitor crawler activity:

    <?php

    $crawler = new \Arachnid\Crawler($url, $linkDepth); // ... initialize crawler

    // Set a logger for crawler activity (compatible with PSR-3)
    $logger = new \Monolog\Logger('crawler logger');
    $logger->pushHandler(new \Monolog\Handler\StreamHandler(sys_get_temp_dir().'/crawler.log'));
    $crawler->setLogger($logger);

You can set the crawler to visit only pages matching specific criteria by passing a callback closure to the filterLinks method:

    <?php

    // Filter links according to a specific callback, given as a closure
    $links = $crawler->filterLinks(function ($link) {
                        // Crawl only links containing "/blog"
                        return (bool)preg_match('/.*\/blog.*$/u', $link);
                    })
                    ->traverse()
                    ->getLinks();
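
The closure receives the link URL as a string and must return a boolean, so any predicate over the URL works. As an illustrative variation (not taken from the library's documentation), this filter skips links that carry a query string:

    <?php

    // Illustrative filter: visit only links without a query string
    $links = $crawler->filterLinks(function ($link) {
                        return strpos($link, '?') === false;
                    })
                    ->traverse()
                    ->getLinks();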

You can use the LinksCollection class to get simple statistics about the links, as follows:

    <?php

    use \Arachnid\LinksCollection;

    $links = $crawler->traverse()
                     ->getLinks();
    $collection = new LinksCollection($links);

    // Getting broken links
    $brokenLinks = $collection->getBrokenLinks();

    // Getting links for a specific depth
    $depth2Links = $collection->getByDepth(2);

    // Getting external links found inside the site
    $externalLinks = $collection->getExternalLinks();

How to Contribute

  1. Fork this repository
  2. Create a new branch for each feature or improvement
  3. Apply your code changes along with corresponding unit test
  4. Send a pull request from each feature branch

It is very important to separate new features or improvements into separate feature branches, and to send a pull request for each branch. This allows me to review and pull in new features or improvements individually.

All pull requests must adhere to the PSR-2 standard.
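
As a sketch of a typical pre-pull-request check, assuming the repository's PHPUnit setup and that squizlabs/php_codesniffer is installed for the PSR-2 check:

./vendor/bin/phpunit
./vendor/bin/phpcs --standard=PSR2 src/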

System Requirements

  • PHP 7.2.0+

Authors

  • Zeid Rashwani
  • Josh Lockhart

License

MIT Public License
