独学初心者プログラマーの日英ブログ

Simple Data Analysis Using Data Collected from Apartment Rental Sites: Data Preprocessing

This time, we will clean the data collected in the previous post.

The result looks like this.

First, import the required libraries.

from datetime import datetime
from decimal import Decimal, ROUND_DOWN
import re
import numpy as np
import pandas as pd

Load the CSV file saved in the previous post.

df = pd.read_csv(csv_path)

Let's clean the data while checking its contents one column at a time.

The apartment name itself isn't needed, so we extract only the floor number from it.

Use the compile function of the re module to define the regular expression pattern in advance.

pattern = re.compile(r"[0-9０-９]+(?=階|号室|号|f|F|Ｆ|[-－]|\s)")

Prepare a list to hold dictionaries.

Take the apartment names one row at a time and use the regular expression to extract only the floor number.

Rows where the floor number is unknown are set to 0.

Convert the result into a DataFrame.
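The code for these four steps is not shown here, so the following is only a minimal sketch of how they might fit together. The sample apartment names, the `floor` column name, and the `extract_floor` helper are assumptions made for illustration, not the original code.

```python
import re

import pandas as pd

# Same pattern as above: digits followed by a floor or room marker.
pattern = re.compile(r"[0-9０-９]+(?=階|号室|号|f|F|Ｆ|[-－]|\s)")


def extract_floor(name):
    """Return the first number matched by the pattern, or 0 if there is none.

    Hypothetical helper, not part of the original post.
    """
    match = pattern.search(name)
    if match is None:
        return 0  # unknown floor
    # int() also accepts full-width digits such as "２"
    return int(match.group())


# Hypothetical apartment names
names = pd.Series(["サンプルハイツ 3階", "テストマンション２Ｆ", "名称のみ"])

rows = []  # list that stores one dictionary per row
for name in names:
    rows.append({"floor": extract_floor(name)})

number_of_stories_df = pd.DataFrame(rows)
print(number_of_stories_df)
```

Collecting one dictionary per row and building the DataFrame once at the end is usually simpler than appending to a DataFrame inside the loop.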

The describe method displays summary statistics for a DataFrame.
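As a small illustration, here is describe applied to a made-up column of floor numbers. The `floors_df` name and its values are assumptions, not data from the post.

```python
import pandas as pd

# Hypothetical floor numbers
floors_df = pd.DataFrame({"floor": [1, 2, 2, 3, 0]})

# count, mean, std, min, quartiles, and max for each numeric column
summary = floors_df.describe()
print(summary)
```

The result is itself a DataFrame, so individual statistics can be read with `summary.loc["mean", "floor"]` and similar lookups.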

Next is the column with transportation and address information.

The split method of the str accessor splits each string at the specified separator.

The separator here is a newline character. Because split with no argument splits on any whitespace, including newlines, no separator needs to be specified.

The result is a Series of lists, but setting the keyword argument expand to True returns a DataFrame directly.

In this case only 2 rows have a bus stop, so we drop the bus stop column.

Pass the column label to the drop method. To drop a column rather than a row, specify axis=1.

transport_df = df['transport'].str.split(expand=True).drop(3, axis=1)
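To make the intermediate result visible, here is a minimal sketch on two made-up `transport` values, one of which has a fourth line for a bus stop. The sample strings are assumptions for illustration.

```python
import pandas as pd

# Hypothetical 'transport' values: line, station, address, and sometimes a bus stop
transport = pd.Series([
    "サンプル線\nサンプル駅\nサンプル市1丁目",
    "テスト線\nテスト駅\nテスト市2丁目\nバス停前",
])

# expand=True gives one column per piece: columns 0, 1, 2, 3
parts = transport.str.split(expand=True)

# Column 3 (bus stop) is missing for most rows, so drop it
without_bus = parts.drop(3, axis=1)
print(without_bus)
```

Note that splitting on any whitespace also splits on ordinary spaces, so a value containing a space would produce an extra column. Passing the separator explicitly, for example `str.split("\n", expand=True)`, avoids that.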

Rename the columns.

transport_df.columns = ['沿線', '駅名', '所在地']

The factorize function assigns an integer code to each unique value.

This is handy for turning strings into numbers that can be used in analysis.

address_id = pd.factorize(transport_df['所在地'])[0]
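A short sketch of what factorize returns, using made-up address values (an assumption for illustration):

```python
import pandas as pd

# Hypothetical address values
addresses = pd.Series(["A市", "B市", "A市", "C市", "B市"])

codes, uniques = pd.factorize(addresses)
# codes:   one integer per row, numbered in order of first appearance
# uniques: the distinct values, so uniques[codes] recovers the originals
print(codes)
print(uniques)
```

This is why the line above takes element `[0]`: it keeps only the codes and discards the unique values.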

Let's use factorize the same way for the other columns.

factorize returns a tuple of (codes, uniques); the codes, taken with [0], are an ndarray.

Stack the code arrays together into a single ndarray.

Then transpose that ndarray and convert it to a DataFrame.

transport_id_ndarray = np.array([line_id, station_id, address_id])

transport_id_df = pd.DataFrame(transport_id_ndarray.T)
transport_id_df.columns = ['沿線', '駅名', '所在地']
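The definitions of `line_id` and `station_id` are not shown, so the sketch below assumes they mirror `address_id`. The small `transport_df` is a made-up stand-in for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for transport_df
transport_df = pd.DataFrame({
    "沿線": ["X線", "Y線", "X線"],
    "駅名": ["P駅", "Q駅", "R駅"],
    "所在地": ["A市", "A市", "B市"],
})

# Assumed definitions, mirroring address_id
line_id = pd.factorize(transport_df["沿線"])[0]
station_id = pd.factorize(transport_df["駅名"])[0]
address_id = pd.factorize(transport_df["所在地"])[0]

# Stacking gives shape (3, n_rows); transposing gives (n_rows, 3)
transport_id_ndarray = np.array([line_id, station_id, address_id])
transport_id_df = pd.DataFrame(transport_id_ndarray.T,
                               columns=["沿線", "駅名", "所在地"])

# Alternative that produces the same codes: factorize every column with apply
alt_df = transport_df.apply(lambda col: pd.factorize(col)[0])
print(transport_id_df)
```

The transpose is needed because each factorized array becomes a row of the stacked ndarray, while the DataFrame needs one column per feature.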

Let's check that it worked.

Almost everything else uses these same basic operations, so the rest is just a matter of repeating the steps above.

With the apply method, you can run a lambda function or a custom function over every element of a Series, or over the rows or columns of a DataFrame, in a single call.

rent_s = price_df['賃料'].apply(lambda rent: float(re.sub('万円', '', rent)))
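Here is the same conversion run on a few made-up rent strings, together with a vectorized alternative using the str accessor. The sample `price_df` values are assumptions for illustration.

```python
import re

import pandas as pd

# Hypothetical rent strings
price_df = pd.DataFrame({"賃料": ["6.8万円", "12万円", "7.25万円"]})

# apply calls the lambda once per element
rent_s = price_df["賃料"].apply(lambda rent: float(re.sub("万円", "", rent)))

# Alternative without apply: remove the unit, then cast the whole column
rent_alt = price_df["賃料"].str.replace("万円", "", regex=False).astype(float)
print(rent_s)
```

Both give the same numbers. apply is the more general tool, since the function can contain any Python logic.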

The quantize method of a Decimal object rounds a value to a given number of decimal places using the rounding mode you choose, such as rounding up or rounding down.

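# For each row selected by deposit_mask_index, convert amounts given in 万円
# into months of rent
for idx in deposit_mask_index:
    replace_d = []
    for d in deposit[idx]:
        if re.search(r'万円', d):
            # Strip the unit, then divide by that row's rent
            d = float(re.sub(r'万円', '', d))
            # Truncate (ROUND_DOWN) to two decimal places
            per_month = Decimal(
                d/rent_s[idx]).quantize(Decimal('.01'), rounding=ROUND_DOWN)
            replace_d.append(f'{per_month}ヶ月')
        else:
            # Leave other values unchanged
            replace_d.append(d)
    deposit[idx] = replace_d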
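To isolate what quantize does, here is a minimal sketch with made-up numbers. The `months_of_rent` helper is hypothetical and only mirrors the calculation in the loop above.

```python
from decimal import Decimal, ROUND_DOWN, ROUND_UP


def months_of_rent(amount, rent):
    """Express an amount as months of rent, truncated to 2 decimal places.

    Hypothetical helper for illustration.
    """
    return Decimal(amount / rent).quantize(Decimal(".01"), rounding=ROUND_DOWN)


ratio = Decimal("1.239")
down = ratio.quantize(Decimal(".01"), rounding=ROUND_DOWN)  # truncates
up = ratio.quantize(Decimal(".01"), rounding=ROUND_UP)      # rounds away from zero

label = f"{months_of_rent(10.0, 3.0)}ヶ月"
print(down, up, label)
```

The first argument, `Decimal(".01")`, only supplies the number of decimal places to keep. The `rounding` argument decides what happens to the digits that are cut off.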

Subtract the year of each pandas Timestamp object from the year of the current datetime object to calculate the age of the building.

age = year_built.apply(lambda t: datetime.now().year - t.year)
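The same calculation can also be sketched with the `.dt` accessor, which avoids calling `datetime.now()` once per row. The sample dates are assumptions, and `year_built` is assumed to hold datetime values.

```python
from datetime import datetime

import pandas as pd

# Hypothetical construction dates
year_built = pd.to_datetime(pd.Series(["2000-03-01", "2015-11-01"]))

current_year = datetime.now().year  # read the clock once

# Same result as the apply version, computed for the whole column at once
age = current_year - year_built.dt.year
print(age)
```

Because only the years are compared, a building counts as one year older from January 1, regardless of the month it was built in.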

All that's left is to save the quantified DataFrame and a new DataFrame for comparison as CSV files.

quantified_df = pd.concat([number_of_stories_df, transport_id_df, minute_walk_df, new_price_df,
                room_id_df, new_extprice_df, remarks_id_df], axis=1)
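The save step itself is not shown, so here is a minimal sketch of concatenating and writing to CSV. The two small DataFrames and their column names are made-up stand-ins, and an in-memory buffer is used in place of a real file path.

```python
import io

import pandas as pd

# Hypothetical stand-ins for two of the cleaned DataFrames
number_of_stories_df = pd.DataFrame({"floor": [3, 2, 0]})
minute_walk_df = pd.DataFrame({"minutes": [5, 12, 8]})

# axis=1 places the DataFrames side by side, aligned on the index
quantified_df = pd.concat([number_of_stories_df, minute_walk_df], axis=1)

# A StringIO buffer stands in for a file path here
buffer = io.StringIO()
quantified_df.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```

Since concat with axis=1 aligns rows by index, all the pieces need to share the same index. If they don't, calling `reset_index(drop=True)` on each one first keeps the rows lined up.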
