Install PhantomJS Headless Browser And Scrape Website Data Using A Proxy And Random User-Agent

PhantomJS

In this lab, we will install PhantomJS, a headless browser, and use it in a simple example that scrapes data from a website. PhantomJS is a scriptable browser driven by JavaScript and is often used for automated web testing.

We will:

  • Install PhantomJS on our Linux system.
  • Set a random PhantomJS browser user-agent string for each request.
  • Configure proxy settings for the PhantomJS browser.
  • Scrape sample data from a web page.

Install PhantomJS

For a more stable workflow, installing the following packages is recommended.

For Ubuntu Linux

# sudo apt-get install build-essential g++ flex bison gperf ruby perl libsqlite3-dev libfontconfig1-dev libicu-dev libfreetype6 libssl-dev libpng-dev libjpeg-dev python libx11-dev libxext-dev ttf-mscorefonts-installer

For CentOS Linux

# sudo yum install gcc gcc-c++ make flex bison gperf ruby openssl-devel freetype-devel fontconfig-devel libicu-devel sqlite-devel libpng-devel libjpeg-devel

Then download and extract the latest 64-bit or 32-bit Linux binary PhantomJS package from the download page (Linux 64-bit: phantomjs-2.1.1-linux-x86_64.tar.bz2 | Linux 32-bit: phantomjs-2.1.1-linux-i686.tar.bz2).

# sudo wget https://bitbucket.org/ariya/phantomjs/downloads/phantomjs-2.1.1-linux-x86_64.tar.bz2
# sudo tar xvf phantomjs-2.1.1-linux-x86_64.tar.bz2

Change to the bin directory of the extracted package, phantomjs-X.XX.XXX-linux-x86_64/bin, which contains the PhantomJS binary. Then make the binary executable and copy it to a directory on the PATH.

# cd phantomjs-2.1.1-linux-x86_64/bin
# sudo chmod +x phantomjs
# sudo cp phantomjs /usr/local/bin

Now the phantomjs command is on the PATH and ready to run.

Set PhantomJS Browser User-Agent And Start Scraping

Now we can build the controller JavaScript file. We name it hello.js and add the code below to it.

The first step is to build an array of user-agent strings. Each time the script runs, it picks one at random and applies it before opening the web page.

File: hello.js

"use strict";
var page = require('webpage').create();
// print the default/current user-agent string to the standard output console log
console.log('The default and current user-agent string is: ' + page.settings.userAgent);
// create a user-agent string array
var uagents = [
    "Mozilla/5.0 (Linux; Android 7.0; SM-G892A Build/NRD90M; wv) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/60.0.3112.107 Mobile Safari/537.36",
    "Mozilla/5.0 (Linux; Android 7.0; SM-G930VC Build/NRD90M; wv) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/58.0.3029.83 Mobile Safari/537.36",
    "Mozilla/5.0 (Linux; Android 6.0.1; SM-G935S Build/MMB29K; wv) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/55.0.2883.91 Mobile Safari/537.36",
    "Mozilla/5.0 (Linux; Android 6.0.1; SM-G920V Build/MMB29K) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.98 Mobile Safari/537.36",
    "Mozilla/5.0 (Linux; Android 5.1.1; SM-G928X Build/LMY47X) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.83 Mobile Safari/537.36",
    "Mozilla/5.0 (Linux; Android 6.0.1; Nexus 6P Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.83 Mobile Safari/537.36",
    "Mozilla/5.0 (Linux; Android 7.1.1; G8231 Build/41.2.A.0.219; wv) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/59.0.3071.125 Mobile Safari/537.36",
    "Mozilla/5.0 (Linux; Android 6.0.1; E6653 Build/32.2.A.0.253) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.98 Mobile Safari/537.36",
    "Mozilla/5.0 (Linux; Android 6.0; HTC One M9 Build/MRA58K) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.98 Mobile Safari/537.3",
    "Mozilla/5.0 (Linux; Android 6.0; HTC One X10 Build/MRA58K; wv) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/61.0.3163.98 Mobile Safari/537.36",
    "Mozilla/5.0 (iPhone; CPU iPhone OS 6_0 like Mac OS X) AppleWebKit/536.26 (KHTML, like Gecko) Version/6.0 Mobile/10A5376e Safari/8536.25"
];  
// select random one in the variable randagent.
var randagent = uagents[Math.floor(Math.random() * uagents.length)];
// set the browser user-agent string
page.settings.userAgent = randagent;
// open the target website url...
page.open('http://www.httpuseragent.org', function (status) {
    if (status !== 'success') {
        console.log('Unable to access network');
    } else {
        // extract the value you want using the element ID
        var ua = page.evaluate(function () {
            return document.getElementById('qua').value;
        });
        console.log(ua);
    }
    phantom.exit();
});
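The random selection step can also be factored into a small helper, which makes it easier to reuse and to check in isolation. The sketch below is only an illustration: pickRandom, its optional rng parameter, and the shortened sampleAgents values are names assumed here and are not part of PhantomJS or the original script. It is written in plain ES5 so that it runs under PhantomJS as well as other JavaScript engines.

```javascript
"use strict";

// Hypothetical helper (not part of PhantomJS): return one element of `list`.
// `rng` is an optional function returning a number in [0, 1). It defaults to
// Math.random and exists only so the choice can be made deterministic.
function pickRandom(list, rng) {
    if (!list || list.length === 0) {
        throw new Error('pickRandom needs a non-empty array');
    }
    var random = rng || Math.random;
    return list[Math.floor(random() * list.length)];
}

// Shortened placeholder values; in hello.js this would be the uagents array.
var sampleAgents = ['agent-A', 'agent-B', 'agent-C'];

// Print one randomly chosen entry.
console.log(pickRandom(sampleAgents));

// In hello.js the assignment would then read:
//   page.settings.userAgent = pickRandom(uagents);
```

Guarding against an empty array avoids silently setting the user-agent to undefined if the list is ever cleared by mistake.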

Set PhantomJS To Use A Proxy

Just as we rotate the user-agent string, we may also need to route PhantomJS through a proxy to avoid being blocked.

We must add the following call to our hello.js controller file before calling the open method. Replace the placeholder values with your proxy's details.

phantom.setProxy('proxy-host-or-IP', 'proxy-port', 'manual', 'proxy-username', 'proxy-password');
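If the proxy details arrive as a single string, a small parser can split them into the separate values that the call above expects. This is a minimal sketch under stated assumptions: parseProxy is a hypothetical helper, the "user:pass@host:port" format and the example host and credentials are invented for illustration, and converting the port to a number is a choice made here rather than something the original script does.

```javascript
"use strict";

// Hypothetical helper (not part of PhantomJS): split a proxy string of the
// form "host:port" or "user:pass@host:port" into its parts.
// Limitation of this sketch: the password must not contain "@".
function parseProxy(spec) {
    var match = /^(?:([^:@]+):([^@]*)@)?([^:@]+):(\d+)$/.exec(spec);
    if (!match) {
        throw new Error('Unrecognized proxy string: ' + spec);
    }
    return {
        host: match[3],
        port: parseInt(match[4], 10),
        username: match[1] || '',
        password: match[2] || ''
    };
}

// Example values are placeholders, not a real proxy.
var proxy = parseProxy('alice:secret@proxy.example.com:8080');
console.log(proxy.host + ':' + proxy.port);

// In hello.js, before page.open(...), the parts could be passed on as:
//   phantom.setProxy(proxy.host, proxy.port, 'manual',
//                    proxy.username, proxy.password);
```

Failing loudly on a malformed string is deliberate, since a silently ignored proxy setting would send requests from your own address.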

Adjust PhantomJS Options

The basic PhantomJS command line looks like this:

# phantomjs /path/to/javascript/file/hello.js

But we need to adjust the web request options as follows.

1- Allow Browser Cache

Append the following options to the PhantomJS command line (the maximum cache size is given in KB):

--disk-cache=true --max-disk-cache-size=1000000 --disk-cache-path=/opt/phantomjs/cache

2- Allow Browser Cookies

Append the following option to the PhantomJS command line:

--cookies-file=/opt/phantomjs/cookies/cookies.txt

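3- Skip Browser SSL Verification Checks and Errors

The option --ignore-ssl-errors=true ignores SSL errors such as expired or self-signed certificates, and --ssl-protocol=any lets PhantomJS negotiate any supported protocol version. Note that --web-security=false is not an SSL setting: it turns off web security and permits cross-domain requests.

--web-security=false --ignore-ssl-errors=true --ssl-protocol=any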

4- Enable Debug Mode

--debug=true

The final command looks like this:

# phantomjs --debug=true --disk-cache=true --max-disk-cache-size=1000000 --disk-cache-path=/opt/phantomjs/cache --cookies-file=/opt/phantomjs/cookies/cookies.txt --web-security=false --ignore-ssl-errors=true --ssl-protocol=any /path/to/javascript/file/hello.js
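A command line this long is easy to mistype, so when PhantomJS is launched from another program it can help to assemble the arguments from a list of options. The sketch below shows one way to do that. buildPhantomArgs is a hypothetical helper, not a PhantomJS API, and only a few of the options above are included to keep the example short.

```javascript
"use strict";

// Hypothetical helper (not part of PhantomJS): turn an options object into
// "--name=value" arguments, followed by the path of the controller script.
// Options must come before the script path on the PhantomJS command line.
function buildPhantomArgs(options, scriptPath) {
    var args = [];
    Object.keys(options).forEach(function (name) {
        args.push('--' + name + '=' + String(options[name]));
    });
    args.push(scriptPath);
    return args;
}

// A subset of the options discussed above, with the same example paths.
var phantomArgs = buildPhantomArgs({
    'debug': true,
    'disk-cache': true,
    'disk-cache-path': '/opt/phantomjs/cache',
    'ignore-ssl-errors': true
}, '/path/to/javascript/file/hello.js');

// Print the command so it can be inspected or copied into a shell.
console.log('phantomjs ' + phantomArgs.join(' '));
```

Keeping the arguments as an array rather than one string also makes them easier to hand to a process-spawning function without extra shell quoting.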

The final JavaScript controller file, hello.js, looks like this:

"use strict";
var page = require('webpage').create();
// print the default/current user-agent string to the standard output console log
console.log('The default and current user-agent string is: ' + page.settings.userAgent);
// create a user-agent string array
var uagents = [
    "Mozilla/5.0 (Linux; Android 7.0; SM-G892A Build/NRD90M; wv) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/60.0.3112.107 Mobile Safari/537.36",
    "Mozilla/5.0 (Linux; Android 7.0; SM-G930VC Build/NRD90M; wv) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/58.0.3029.83 Mobile Safari/537.36",
    "Mozilla/5.0 (Linux; Android 6.0.1; SM-G935S Build/MMB29K; wv) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/55.0.2883.91 Mobile Safari/537.36",
    "Mozilla/5.0 (Linux; Android 6.0.1; SM-G920V Build/MMB29K) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.98 Mobile Safari/537.36",
    "Mozilla/5.0 (Linux; Android 5.1.1; SM-G928X Build/LMY47X) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.83 Mobile Safari/537.36",
    "Mozilla/5.0 (Linux; Android 6.0.1; Nexus 6P Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.83 Mobile Safari/537.36",
    "Mozilla/5.0 (Linux; Android 7.1.1; G8231 Build/41.2.A.0.219; wv) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/59.0.3071.125 Mobile Safari/537.36",
    "Mozilla/5.0 (Linux; Android 6.0.1; E6653 Build/32.2.A.0.253) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.98 Mobile Safari/537.36",
    "Mozilla/5.0 (Linux; Android 6.0; HTC One M9 Build/MRA58K) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.98 Mobile Safari/537.3",
    "Mozilla/5.0 (Linux; Android 6.0; HTC One X10 Build/MRA58K; wv) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/61.0.3163.98 Mobile Safari/537.36",
    "Mozilla/5.0 (iPhone; CPU iPhone OS 6_0 like Mac OS X) AppleWebKit/536.26 (KHTML, like Gecko) Version/6.0 Mobile/10A5376e Safari/8536.25"
];  
// select random one in the variable randagent.
var randagent = uagents[Math.floor(Math.random() * uagents.length)];
// set the browser user-agent string
page.settings.userAgent = randagent;
// setup the proxy settings for the browser
phantom.setProxy('proxy-host-or-IP', 'proxy-port', 'manual', 'proxy-username', 'proxy-password');
// open the target website url...
page.open('http://www.httpuseragent.org', function (status) {
    if (status !== 'success') {
        console.log('Unable to access network');
    } else {
        // extract the value you want using the element ID
        var ua = page.evaluate(function () {
            return document.getElementById('qua').value;
        });
        console.log(ua);
    }
    phantom.exit();
});

You can control PhantomJS using scripting languages like Perl, PHP, and Python. See: PHP Execute And Kill Process In Linux.

See also: examples of using the PhantomJS headless browser and integrating it with Jenkins.