
Using Python to Scrape Facebook Fan Page Users: A Simple Example


The most important thing when scraping Facebook content is to emulate human behavior and to target the simplest Facebook URL endpoints. In this example, we will use Python for a small scraping task.

We need to scrape the Facebook Fan Page users who shared a specific post.

To install Python on Windows, follow the A Simple Way To Installing And Run Python And PIP On Windows guide.

You will also need to install a couple of Python packages using PIP (the code below uses the Selenium 3 API, so pin the version):

> pip install xlsxwriter "selenium<4"

To emulate human actions, we will use Selenium, the web automation driver (you can download the browser driver here), and apply delays between actions such as logging in, moving to the internal post URL, and so on.
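As a small sketch of the delay idea, a randomized pause reads less like a bot than a fixed one. Note that `human_delay` is a hypothetical helper for illustration, not part of the script below, which uses fixed `sleep(3)` calls:

```python
import random
import time

def human_delay(min_s=2.0, max_s=5.0):
    # Pick a random pause length so action timing is not perfectly regular,
    # then sleep for that long and return the chosen duration.
    pause = random.uniform(min_s, max_s)
    time.sleep(pause)
    return pause
```

You would call `human_delay()` between each Selenium action instead of a constant `sleep(3)`.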

We will go to the Facebook login page, then target the post using the old mobile-browser version of the Facebook URL scheme.

In this example, we target users who shared a post, e.g., https://www.facebook.com/hahahahaaa.vn/posts/3144355232550870, so we will use the mobile-browser Facebook URL scheme for the shares list: https://m.facebook.com/browse/shares?id=3144355232550870.

Note: you only need to replace the post ID in the URL above with the ID of your targeted post.
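As a quick sketch, the shares URL can be derived from the post URL automatically. Here `shares_url` is a hypothetical helper shown for illustration; it assumes the post URL ends with `/posts/<numeric id>`:

```python
def shares_url(post_url):
    # Take the last path segment of the post URL as the post ID
    # (assumes a URL of the form .../posts/<numeric id>),
    # then plug it into the mobile "browse shares" endpoint.
    post_id = post_url.rstrip('/').rsplit('/', 1)[-1]
    return "https://m.facebook.com/browse/shares?id=" + post_id
```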

The Python Code

We will scrape the data using the XPath of the elements, so we may need to check whether the XPath expressions are still valid and update them if Facebook has changed its markup. The code then goes as follows.

#!/usr/bin/env python
# coding: utf-8

from time import sleep
import xlsxwriter
from selenium import webdriver

# Enter the credential information for your Facebook account.
# We assume the chromedriver.exe file is in the same path as the running Python code file.
url ="https://m.facebook.com/browse/shares?id=[your post id]"
usr ="user@email.com"
pwd ="password"
driver = webdriver.Chrome(executable_path=r'chromedriver.exe')
driver.get('https://m.facebook.com/login')
print("Trying Opening the Facebook Login Page")
# Make a human delay...
sleep(3)

# looking for login form using field id
username_box = driver.find_element_by_id('m_login_email')
username_box.send_keys(usr)
print("usr email is entered")
# Make a human delay...
sleep(3)

password_box = driver.find_element_by_id('m_login_password')
password_box.send_keys(pwd)
print("user password is entered")
# Make a human delay...
sleep(3)

# Click the login button.
login_box = driver.find_element_by_name('login')
login_box.click()
# Make a human delay...
sleep(3)

# Replace Mobile Url Web version with basic Url version.
url=url.replace('https://m.','https://mbasic.')
# Go to the Basic url of the targeted post..
driver.get(url)
print(url)
# Make a human delay...
sleep(3)

# Build WorkSheet Of The Output Data.
cellindex=1
workbook = xlsxwriter.Workbook("Output.xlsx")
worksheet=workbook.add_worksheet()
worksheet.write('A1','FB User Name')
worksheet.write('B1','FB User ID')
worksheet.write('C1','FB User URL')

# Build Scrape Index
index=0
cntnu=True
while cntnu:
    # try to collect data till finishing and no more load
    try:
        index=index+1
        user_info =  driver.find_elements_by_xpath("//*[contains(@class,'_4mn c')]/a")
        for li in user_info:
            cellindex=cellindex+1
            link=li.get_attribute('href')
            name=li.text
            name=name.replace('\nFollow','')
            # Restore Usual Web URL
            link=link.replace('https://m.','https://www.')
            # Strip the profile URL down to the user ID / username,
            # avoiding the built-in name 'id'.
            user_id=((((link.replace('https://www.facebook.com/profile.php?id=','')).replace('/?fref=pb','')).replace('?fref=pb','')).replace('https://www.facebook.com/','')).replace('&fref=pb','')
            worksheet.write('A'+str(cellindex),name)
            worksheet.write('B'+str(cellindex),user_id)
            worksheet.write('C'+str(cellindex),link)
            # Make human delay
            sleep(3)
        # Load more shares    
        load = driver.find_element_by_xpath("//*[contains(@id,'m_more_item')]/a")
        load.click()
        # Make human delay
        sleep(5)
        print(index)
        cntnu=True
    except:
        # no more data or no more loads
        cntnu=False
workbook.close()

What do you think?
