Published: October 18, 2019, 21:21
This is installment 3-3 of a project where I try out a fictitious job.
The premise: if I were a freelance programmer and a client handed me a plausible job, could I actually pull it off?
The project is fictitious, but it is modeled on the typical jobs, prices, and man-hour estimates that flood the Japanese crowdsourcing market.
This time I created the class file for a second site and modified crawler.py so it can run two collectors.
The file structure is as follows:
prototype002/
├── collectors/
│   ├── firstCollector.py
│   └── secondCollector.py
├── data/
│   ├── first_collection/
│   │   └── settings.csv
│   └── second_collection/
│       └── settings.csv
├── scripts/
│   └── collectors_utils.py
├── chromedriver.exe
└── crawler.py
First, the changes to crawler.py.
import argparse
import time
import sys
sys.path.append('.')

from collectors.firstCollector import FirstCollector
from collectors.secondCollector import SecondCollector


def collectors(num1, num2):
    if num1 is not None:
        if num1 != 0 and num1 != 1:
            print('Please make it 0 or 1 for now.')
            sys.exit()
        else:
            start_time = time.time()
            first_collector = FirstCollector('https://woderfulspam.spam', num1,
                                             'data/first_collection')
            first_collector.main()
            print(f'The first collector completed the collection in '
                  f'{round(time.time() - start_time)} sec')
    if num2 is not None:
        if num2 != 0 and num2 != 1:
            print('Please make it 0 or 1 for now.')
            sys.exit()
        else:
            start_time = time.time()
            second_collector = SecondCollector('https://spam.lovelyspam', num2,
                                               'data/second_collection')
            second_collector.main()
            print(f'The second collector completed the collection in '
                  f'{round(time.time() - start_time)} sec')


def crawler_py():
    parser = argparse.ArgumentParser()
    first_collector_usage = """
    This argument crawls the site for the specified https://woderfulspam.spam.
    [-c1, --c1 Number] Number: Choose between (round trip -> 0) and (one way -> 1).
    """
    second_collector_usage = """
    This argument crawls the site for the specified https://spam.lovelyspam.
    [-c2, --c2 Number] Number: Choose between (round trip -> 1) and (one way -> 0).
    """
    parser.add_argument('-c1', '--c1', type=int, help=first_collector_usage)
    parser.add_argument('-c2', '--c2', type=int, help=second_collector_usage)
    args = parser.parse_args()
    collectors(args.c1, args.c2)


if __name__ == '__main__':
    crawler_py()
The URL and save location are specified up front, and each collector is invoked by an optional numeric argument. I built it this way because I figured it might be handy later if I ever drive it from a bash script or batch file (not sure I will).
The collectors function repeats nearly identical logic for each collector, which bloats the line count, but this is just a quick prototype. Bear with me.
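As a hedged sketch of how that duplication could eventually be factored out: the validation and timing logic could be written once and driven by a table of per-collector settings. The function and table below are my own illustration, not code from the project; a real version would put the actual collector classes in the table.

```python
import sys
import time


def run_collectors(requests, table):
    """Run each requested collector with shared validation/timing logic.

    requests: list of way numbers (0, 1, or None), parallel to `table`.
    table: list of (label, factory, url, data_dir); factory(url, way, data_dir)
    must return an object with a .main() method, like the collector classes.
    """
    for num, (label, factory, url, data_dir) in zip(requests, table):
        if num is None:
            continue  # this collector was not requested on the command line
        if num not in (0, 1):
            print('Please make it 0 or 1 for now.')
            sys.exit()
        start_time = time.time()
        factory(url, num, data_dir).main()  # build the collector and run it
        print(f'The {label} collector completed the collection in '
              f'{round(time.time() - start_time)} sec')
```

Adding a third or fourth site would then only mean adding one row to the table.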
Next, let's look at secondCollector.py.
# wonderful spam!
from selenium.webdriver.support.ui import WebDriverWait, Select
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
import csv
import time
from datetime import datetime
import os, sys
sys.path.append('./scripts')
import collectors_utils

# if you want to use chromedriver_binary
# import chromedriver_binary
# chromedriver_binary.add_chromedriver_to_path()


class SecondCollector():

    def __init__(self, url, way, data_dir):
        self.base_url = url
        self.way = way
        self.data_dir = data_dir

    def switch_ways(self, driver, way_num):
        """
        way_num:0 = oneway
        way_num:1 = roundtrip
        ways = driver.find_elements_by_xpath('//ul[@class="clearfix"]/li')
        """
        try:
            WebDriverWait(driver, 3).until(
                EC.visibility_of_element_located((By.XPATH, '//input[@id="oneway"]')))
        except:
            print('element invisible')
        if way_num == 0:
            way = driver.find_element_by_xpath('//input[@id="oneway"]')
        elif way_num == 1:
            way = driver.find_element_by_xpath('//input[@id="roundtrip"]')
        else:
            print('element click intercepted')
            driver.quit()
            return
        driver.execute_script("arguments[0].click()", way)

    def page_feed(self, driver, cdate, end_date):
        try:
            WebDriverWait(driver, 9).until(
                EC.element_to_be_clickable((By.XPATH, '//button[@class="btn_weekly"]')))
        except:
            print('element click intercepted')
        current_btn = driver.find_element_by_xpath('//button[@class="btn_weekly today"]')
        cyear, cmonth, cday = cdate.split('(')[0].split('/')
        if f'{cyear}/{cmonth}/{cday}' == end_date:
            return True
        else:
            parent_li = current_btn.find_element_by_xpath('..')
            sibl_li = parent_li.find_element_by_xpath('following-sibling::li')
            sibl_li.find_element_by_xpath('button[@class="btn_weekly"]').click()
            return False

    def fetch_table_header(self, driver):
        header_element = driver.find_elements_by_xpath('//tr[@id="search_head_new"]/th')
        table_header = [e.text for e in header_element if e.text != '']
        return table_header

    def fetch_table_contents(self, driver, table_header):
        try:
            WebDriverWait(driver, 6).until(
                EC.visibility_of_element_located(
                    (By.XPATH, '//div[@id="fare_list_new"]/table[@id="fare_result_list_new"]')))
        except:
            print('empty contents')  # See if it's empty
        content_elements = driver.find_elements_by_xpath(
            '//div[@id="fare_list_new"]/table[@id="fare_result_list_new"]/tbody')
        if content_elements != []:
            table_contents = [{table_header[idx]: td.text.replace('\n', '->')
                               for idx, td in enumerate(c.find_elements_by_tag_name('td'))}
                              for c in content_elements]
            return table_contents
        else:
            return None

    def fetch_select_options(self, element):
        select_element = Select(element)
        select_options_dic = {option.text: option.get_attribute('value')
                              for option in select_element.options
                              if option.text != '出発地'}
        return select_element, select_options_dic

    def collect_price(self, driver, current_url, way_num):
        # Load a file with specified conditions.
        rows = collectors_utils.read_file(self.data_dir)
        for row in rows:
            try:
                WebDriverWait(driver, 30).until(
                    EC.visibility_of_element_located((By.XPATH, '//select[@id="departure_airport"]')))
            except:
                print('element invisible')
                driver.quit()
                return
            # Entering the data from the imported file.
            dep_box = driver.find_element_by_xpath('//select[@id="departure_airport"]')
            replace_btn = driver.find_element_by_xpath('//button[@id="set_reverse"]')
            des_box = driver.find_element_by_xpath('//select[@id="arrival_airport"]')
            dep = row['dep']
            des = row['des']
            # Departure locations
            dep_box, select_options_dic = self.fetch_select_options(dep_box)
            dep_box.select_by_value(select_options_dic[dep])
            # Destination locations
            des_box, select_options_dic = self.fetch_select_options(des_box)
            des_box.select_by_value(select_options_dic[des])
            # dep-date (needed for both one way and round trip)
            dep_date = driver.find_element_by_xpath(
                '//div[@class="search__ctrl__date"]/div[@id="date"]/input[@id="js-DP"]')
            if way_num != 0:
                # des-date (round trip only)
                des_date = driver.find_element_by_xpath(
                    '//div[@class="search__ctrl__date__arrival"]/div[@id="date2"]/input[@id="js-DP2"]')
            # set dates; '/' no longer needs escaping on this site
            start = row['start']
            end = row['end']
            if not collectors_utils.check_date(start):
                start = collectors_utils.get_date()[0]
            if not collectors_utils.check_date(end):
                end = collectors_utils.get_date()[1]
            # enter departure date
            driver.execute_script(f"arguments[0].value = '{start}'", dep_date)
            if way_num != 0:
                # enter destination date
                driver.execute_script(f"arguments[0].value = '{end}'", des_date)
            # click search button
            driver.find_element_by_xpath('//button[@id="js-btnSearchTicket"]').click()
            # fetch table header
            table_header = self.fetch_table_header(driver)
            date_obj = datetime.now()
            dir_date = datetime.strftime(date_obj, '%Y%m%d')
            if way_num == 0:
                dir_name = f'片道-{dep}-{des}-{dir_date}'
            else:
                dir_name = f'往復-{dep}-{des}-{dir_date}'
            if not os.path.exists(f'{self.data_dir}/{dir_name}'):
                os.mkdir(f'{self.data_dir}/{dir_name}')
            while True:
                try:
                    WebDriverWait(driver, 3).until(
                        EC.visibility_of_element_located((By.XPATH, '//p[@class="date"]')))
                except:
                    print('element invisible')
                current_date_text = driver.find_element_by_xpath('//p[@class="date"]').text
                current_date = (current_date_text.replace('\u3000|\u3000', '-')
                                .replace(' - ', '-').replace(' (', '(').replace('/', '-'))
                current_date = f"{current_date.split('-')[1]}-{current_date.split('-')[2]}"
                if way_num == 0:
                    file_name = f'{self.data_dir}/{dir_name}/{current_date}'
                else:
                    endsplit = end.split('/')
                    file_name = f"{self.data_dir}/{dir_name}/{current_date}_{endsplit[1]}-{endsplit[2]}"
                table_contents = self.fetch_table_contents(driver, table_header)
                if table_contents is not None:
                    if self.page_feed(driver, current_date_text, end):
                        if way_num == 1:
                            endsplit = end.split('/')
                            last_day = int(endsplit[-1]) + 1
                            end = f'{endsplit[1]}-{last_day}'
                            file_name = f"{self.data_dir}/{dir_name}/{current_date}_{end}"
                        collectors_utils.write_file(file_name, table_header, table_contents)
                        break
                    else:
                        collectors_utils.write_file(file_name, table_header, table_contents)
                else:
                    if self.page_feed(driver, current_date_text, end):
                        break
            driver.get(current_url)
        driver.quit()

    def main(self):
        driver = collectors_utils.set_web_driver(self.base_url)
        way_num = int(self.way)
        self.switch_ways(driver, way_num)
        current_url = driver.current_url
        self.collect_price(driver, current_url, way_num)
from selenium.webdriver.support.ui import Select
Select is a handy little beast that lets Selenium work with select tags. This site turned out to be one where you can't enter the destination and so on without manipulating the option list directly through the select tag.
Use it like this:
def fetch_select_options(self, element):
    select_element = Select(element)
    select_options_dic = {option.text: option.get_attribute('value')
                          for option in select_element.options
                          if option.text != '出発地'}
    return select_element, select_options_dic
# Departure locations
dep_box, select_options_dic = self.fetch_select_options(dep_box)
dep_box.select_by_value(select_options_dic[dep])
This builds a dictionary with the airport name as the key and its option value (a number) as the value, because settings.csv specifies airports by name. Later, though, I'll have to either exclude airport names that don't exist in the option list, or fix this to infer the right one with rule-based matching.
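That rule-based inference could, for instance, fall back to fuzzy matching when the name from settings.csv isn't an exact key. This is a sketch only; `resolve_airport` and the sample option dictionary are my own illustrations, not code from the project.

```python
import difflib


def resolve_airport(name, select_options_dic):
    """Return the option value for an airport name, tolerating near-misses.

    Exact keys win; otherwise fall back to the closest known name,
    and return None when nothing is plausibly close.
    """
    if name in select_options_dic:
        return select_options_dic[name]
    # Fuzzy fallback: pick the most similar known airport name, if any.
    candidates = difflib.get_close_matches(name, select_options_dic, n=1, cutoff=0.6)
    if candidates:
        return select_options_dic[candidates[0]]
    return None  # let the caller skip or report this row


# Hypothetical option dictionary, shaped like what fetch_select_options builds.
options = {'新千歳': '1', '羽田': '2', '那覇': '3'}
```

With this, a settings row that says 羽田空港 would still resolve to the 羽田 option, while a completely unknown name comes back as None instead of raising a KeyError mid-crawl.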
There are other minor fixes: the WebDriverWait() condition types changed, "/" no longer needs to be escaped, and of course the XPaths are different too.
Basically, though, it's the same as firstCollector, so I'd like to merge them or introduce inheritance soon, and make the XPaths configurable from outside. But the code base is still small, so it seems easier to do that after checking the behavior of the remaining four sites.
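One possible shape for that refactor is a base collector whose site-specific XPaths live in a per-site dictionary. This is a sketch under my own assumptions; the class and key names below are illustrative, not from the project.

```python
# Hypothetical base class: site-specific details live in a config dict,
# so each concrete collector only supplies data, not logic.
class BaseCollector:
    # Subclasses override this with their own site's XPaths.
    XPATHS = {}

    def __init__(self, url, way, data_dir):
        self.base_url = url
        self.way = way
        self.data_dir = data_dir

    def xpath(self, key):
        # Shared methods look up locators by name instead of hard-coding them.
        return self.XPATHS[key]


class SecondCollectorSketch(BaseCollector):
    XPATHS = {
        'oneway':    '//input[@id="oneway"]',
        'roundtrip': '//input[@id="roundtrip"]',
        'departure': '//select[@id="departure_airport"]',
        'arrival':   '//select[@id="arrival_airport"]',
    }
```

The XPATHS dict could just as easily be loaded from a JSON or CSV file, which would make the XPaths editable without touching the code at all.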
Once everything is ready, run crawler.py:
python crawler.py -c2 0
This runs only the second collector and specifies one way (0).
The contents of settings.csv are the same as last time:
dep,des,start,end
新千歳,羽田,,
羽田,新千歳,2019/11/21,2019/12/20
The collected data ends up looking like this.
As you may have noticed, for round-trip tickets I'm only capturing the outbound fares. The return flight's time and fare aren't displayed until you click an outbound flight. So if there are 10 outbound flights, you have to click 10 times, scrape 10 times, and go back 10 times.
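That click, scrape, go-back cycle would look roughly like this. It's a sketch only: I've abstracted the page behind a `fetch_return_fares` callback, since the real Selenium locators for the return-fare table aren't written yet; in the real collector that callback would click the outbound row, scrape the return table, and navigate back.

```python
def collect_round_trips(outbound_rows, fetch_return_fares):
    """Pair every outbound flight with its return fares.

    Mirrors the click -> scrape -> back cycle the site forces:
    outbound_rows: list of outbound flight records (dicts).
    fetch_return_fares: callable standing in for one click/scrape/back
    round trip; in the real collector this would drive Selenium.
    """
    results = []
    for row in outbound_rows:
        return_fares = fetch_return_fares(row)  # click -> scrape -> back
        for fare in return_fares:
            results.append({'outbound': row, 'return': fare})
    return results
```

So the total work is (number of outbound flights) x (one page round trip each), which is exactly why the crawl gets slow even though the code is simple.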
What I'm getting at is that this is easy to implement, but the crawl takes time. For now I'll finish the other sites first and think it over afterwards. Good night.