Published Date : 2020年9月9日8:07

How to handle Selenium, which can be used for web scraping and also for RPA

This blog has an English translation


Here's a little more about the 「【python】How to after scraping Job board with requests and bs4, data is processed with pandas and saved in Excel」 video I uploaded to YouTube.


Table of Contents

① 動画の説明
① Video Description


How to after scraping Job board with requests and bs4, data is processed with pandas and saved in Excel.


The way to scrape data from job sites is pretty much the same as before.

import requests
from bs4 import BeautifulSoup
import re
import time

URL = "https://xxxxxxx/xxxxxxxxx"

res = requests.get(URL)

soup = BeautifulSoup(res.content, 'lxml')


Use regular expressions and the replace and strip methods to replace or remove useless characters and list the retrieved data.

searchcountpage ="\d+,?\d+",'div#searchCount')[0].get_text().strip())


By the way, the regular expression [r"\d+,?\d+"] is a pattern of one or more digits followed by zero or one comma between them, followed by one or more digits.


Install [openpyxl] to export data to an Excel file.

pip install openpyxl


Convert the data retrieved with Pandas into a data frame and export the data frame to an Excel file.

values = [salary, jobtype, location[:-1], company, title]
rows = {}
for idx, col in enumerate(cols):
    rows[col] = pd.Series(values[idx])

dftest = pd.DataFrame(rows, columns=cols)

dftest.to_excel('test.xlsx', sheet_name='Java ' +
searchcountpage, index=False)


I think it will be easier to see if you separate the names of the sheets by keywords and the number of sheets.


It's important to note that if the UI of the site changes or if the site pops up, you won't be able to scrape the content of the site.


In that case, the previously described selenium works.


All you have to do is list the keywords you want to look up and repeat the process.

    salary = [s.get_text().replace("\n", "").replace("\u3000", "")
              for s in'div#SALARY ul li a')]
    jobtype = [j.get_text().replace("\n", "").replace("\u3000", "")
               for j in'div#JOB_TYPE ul li a')]
    location = [l.get_text().replace("\n", "").replace("\u3000", "")
                for l in'div#LOCATION ul li a')]
    company = [c.get_text().replace("\n", "").replace("\u3000", "")
               for c in'div#COMPANY ul li a')]
    title = [t.get_text().replace("\n", "").replace("\u3000", "")
             for t in'div#TITLE ul li a')]


With [pd.ExcelWriter], you can combine multiple data frames into an Excel file.

with pd.ExcelWriter('test.xlsx') as writer:
    for idx, pl in enumerate(pl_list):
        dfs[idx].to_excel(writer, sheet_name=pl +
                searchcountpages[idx], index=False)


That's all. Thank you for your hard work.