Published Date : September 24, 2020, 7:53

【python】住所を使ったデータ処理:scrapy shell編
【python】Data processing using address: Scrapy Shell

This blog has an English translation

YouTubeにアップした動画、「【python】住所を使ったデータ処理:scrapy shell編」の補足説明の記事です。

This article is a supplementary explanation of the 「【python】Data processing using address: Scrapy Shell」 video I uploaded to YouTube.


Table of Contents

① 動画の説明
① Video Description

最初は[街区レベル位置参照情報 国土交通省]のデータを使ってみようかと思いましたが、

At first, I thought about using the [Block-level Location Reference Information] data from the Ministry of Land, Infrastructure, Transport and Tourism,


but that data turned out to be of limited use, so I scraped and collected the addresses from another site instead.


※For example, in that data the latitude and longitude of 2 Rokuban-cho and 4 Rokuban-cho, Chiyoda Ward, are identical.


The site I use instead lists detailed addresses for all of Japan.

始めに、挙動を確かめる為、scrapy shellを立ち上げます。

First, start up scrapy shell to check how things behave.

scrapy shell "URL"


Inspect the structure of the site in advance with the browser's developer tools, and store any XPath you will use repeatedly in a variable.

input_text_field_xpath = '//input[@class="input_text_field"]'


Join the relative URL of the prefecture you want to check with the base URL, and access it with the fetch function.

fetch(base_url + prefecture_url)


If the response's status code is 200, the request succeeded. Store the status code in a variable for use in a conditional branch.

status_code = response.status
if status_code == 200:


Get the HTML elements that contain the addresses, and the links to the next block or street-address pages.

address_box = response.xpath(address_box_xpath)
next_page_link = response.xpath(next_page_link_xpath)


It then extracts the text information from the retrieved elements.

address_text = response.xpath(address_text_xpath).get()
address_list = address_box.getall()


Also, the list of links is passed through set() to remove duplicates, then converted back to a list.

link_list = list(set([next_page.get() for next_page in next_page_link]))


Use this list and the status code to branch into the operations you want to perform.

if status_code == 200 and link_list:


Import the time module and wait 3 seconds between requests, to avoid putting a burden on the server.

import time
time.sleep(3)  # wait between requests


Now, let's combine all the previous steps in a single for loop to see how it works.

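The combined loop might look like the following sketch. Here `fetch_page`, `base_url`, `link_list`, and the XPath string are hypothetical stand-ins for the objects prepared in the scrapy shell session above, so treat this as an illustration rather than the exact code from the video.

```python
import time

def collect_addresses(fetch_page, base_url, link_list,
                      address_text_xpath, delay=3):
    """Fetch each link and keep the address text of successful responses."""
    addresses = []
    for link in link_list:
        response = fetch_page(base_url + link)
        if response.status != 200:
            continue  # skip pages that did not load
        # extract the address text on this page
        addresses.extend(response.xpath(address_text_xpath).getall())
        # wait between requests to avoid burdening the server
        time.sleep(delay)
    return addresses
```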

This successfully obtains the addresses at the deepest level of the site's URL hierarchy.


Next, apply this process to every link so that the address information can be retrieved recursively.

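One way to make that recursive is sketched below. As before, `fetch_page`, `base_url`, and the XPath strings are hypothetical placeholders for the objects used in the shell session, not the exact code from the video; the `visited` set prevents fetching the same page twice.

```python
import time

def crawl(fetch_page, base_url, link, visited,
          address_text_xpath, next_page_link_xpath, delay=3):
    """Recursively follow links, collecting address text and
    skipping already-visited pages."""
    if link in visited:
        return []
    visited.add(link)
    response = fetch_page(base_url + link)
    if response.status != 200:
        return []
    addresses = response.xpath(address_text_xpath).getall()
    # wait between requests to avoid burdening the server
    time.sleep(delay)
    # deduplicate the links with set(), as above, then recurse into each
    for next_link in set(response.xpath(next_page_link_xpath).getall()):
        addresses += crawl(fetch_page, base_url, next_link, visited,
                           address_text_xpath, next_page_link_xpath, delay)
    return addresses
```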

That's all. Thank you for your hard work.