Published Date: September 27, 2020, 2:49

【python】Calculation of the ratio of workers by wage class and age group

This blog has an English translation


Here is a little more detail about the video "【python】Calculation of the ratio of workers by wage class and age group" that I uploaded to YouTube.


Table of Contents

① Video Description


First, go to the website of [Overview of the 2018 Basic Survey on Wage Structure] published by the Ministry of Health, Labour and Welfare.


Download this PDF and install [pdfminer.six] so you can read it from Python.

pip install pdfminer.six


[pdfminer.six] reads PDFs even if they are written in Japanese.
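
pdfminer.six can also be called directly as a library. The following is a minimal sketch, assuming the extract_text helper in pdfminer.high_level; the function name extract_pdf_text is made up for illustration and is not part of the original program.

```python
from pathlib import Path


def extract_pdf_text(pdf_path):
    """Return the text of a PDF as one string (illustrative helper)."""
    path = Path(pdf_path)
    if not path.is_file():
        raise FileNotFoundError(f"No such PDF: {path}")
    # Imported here so the missing-file check works even before
    # pdfminer.six is installed.
    from pdfminer.high_level import extract_text
    return extract_text(str(path))
```

Calling extract_pdf_text on the downloaded PDF and printing the first few hundred characters is a quick way to check that the Japanese text comes through.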


Once installed, [pdfminer.six] is also available from the command line.

python samples/simple1.pdf


However, the longer the path, the harder the command is to read and type, so let's run [pdfminer.six] from a Python file instead.


Use the sys and pathlib modules to build the path to the pdfminer.six executable script.


Then use call from the subprocess module to run that script with command-line arguments.

import sys
from pathlib import Path
from subprocess import call

pdf_path = r"C:\Users\user\Downloads\13.pdf"

pdf2txt_py_path = Path(sys.exec_prefix) / "Scripts" / ""

# Pass "-o" and the output file name as separate list items
call(["py", str(pdf2txt_py_path), "-o", "extract-pdf.txt", pdf_path])
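
A slightly more defensive variant of the same idea is sketched below. It assumes the extraction script's path is passed in by the caller (the helper names build_command and run_pdf2txt are hypothetical), passes -o and the output file name as separate list items, and uses sys.executable so the script runs under the same interpreter as the calling file.

```python
import subprocess
import sys
from pathlib import Path


def build_command(script_path, pdf_path, out_path):
    """Build the argument list for running the extraction script."""
    return [
        sys.executable,        # the interpreter running this file
        str(script_path),      # path to the pdfminer.six script
        "-o", str(out_path),   # flag and value as separate items
        str(pdf_path),
    ]


def run_pdf2txt(script_path, pdf_path, out_path):
    """Run the script and raise CalledProcessError if it fails."""
    cmd = build_command(script_path, pdf_path, out_path)
    subprocess.run(cmd, check=True)
    return Path(out_path)
```

With check=True, a failed extraction stops the program with an error instead of silently leaving an empty output file.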


Save and run this Python file.



This extracts the text from the PDF cleanly.


The text of the tables comes through correctly, although a few parts look odd.


Next, take the table [Table 7 Ratios of Workers by Wage, Sex and Age Group (2-2)] from the text file that [pdfminer.six] extracted from the PDF.


Organize the information in this table and save it as a separate text file.


Then use Pandas to save that text file as a CSV file.

import pandas as pd

text_file_path = r"C:\Users\user\第7表2-2.txt"
df = pd.read_csv(text_file_path)
csv_file_path = r"C:\Users\user\第7表2-2.csv"
df.to_csv(csv_file_path, index=False)
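
If the cleaned-up text file separates columns with spaces rather than commas, it can be normalized before loading. This is only a sketch using the standard csv module; the sample rows and the helper name whitespace_to_csv are invented for illustration and are not values from the survey.

```python
import csv
import io


def whitespace_to_csv(text):
    """Convert whitespace-separated lines into CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for line in text.splitlines():
        fields = line.split()  # splits on any run of whitespace
        if fields:             # skip blank lines
            writer.writerow(fields)
    return buffer.getvalue()


# Made-up sample, not real survey data.
sample = "class  a  b\nx   1.0  2.0\n\ny   3.0  4.0\n"
csv_text = whitespace_to_csv(sample)
print(csv_text)
```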


That alone isn't very interesting, so I'll use this CSV file and Pandas to write a program that takes your age and annual income as input and shows what percentage of people in your age group earn the same salary as you.


We use Pandas as usual to process the data.

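import pandas as pd

# csv_file_path is the CSV file saved in the previous step
df = pd.read_csv(csv_file_path)

# Wage classes (first 24 rows): strip the unit and split into [low, high]
range_ai = df['賃金階級'][:24].apply(
    lambda x: x.replace('(千円)', '').split('~'))
# Convert the bounds to float; an open-ended bound stays as ''
range_ai = range_ai.apply(
    lambda x: [float(y) if y != '' else '' for y in x])

# Age groups (columns from the third onward): same treatment with int
range_age = pd.Series([age.replace('歳', '').split('~')
                        for age in df.columns[2:]])
range_age = range_age.apply(
    lambda x: [int(y) if y != '' else '' for y in x])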
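
To make the parsing step concrete, here is a small stand-alone sketch of the same idea. The helper parse_range and the sample labels are hypothetical (the real labels come from the CSV), and open-ended bounds are represented as None instead of an empty string.

```python
def parse_range(label, unit, cast=float):
    """Split a label such as '20~24歳' into (low, high) bounds."""
    # Remove the unit, then split on the range separator.
    low, _, high = label.replace(unit, "").partition("~")
    low, high = low.strip(), high.strip()
    return (cast(low) if low else None, cast(high) if high else None)


# Sample labels written in the style of the table (assumed, not copied).
print(parse_range("20~24歳", "歳", int))   # (20, 24)
print(parse_range("~19歳", "歳", int))     # (None, 19)
print(parse_range("70歳~", "歳", int))     # (70, None)
```

Using None for a missing bound avoids mixing strings and numbers in the same list, which makes later comparisons simpler.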


All that's left is to find where the user's age and annual income fall in the table and display the final result with the print function.

# Please enter your annual income (ten thousand yen)
annual_income = input('あなたの年収を入力してください(万円) : ')
# Please enter your age
age = input('あなたの年齢を入力してください : ')

# Find the row (wage class) that contains the entered income
ai_index = 0
for idx, l_ in enumerate(range_ai):
    if idx == 0:
        # The first class has no lower bound
        l = range(0, int(l_[1] + 0.1))
    elif idx == len(range_ai) - 1:
        # The last class has no upper bound; 30000 is an arbitrary cap
        l = range(int(l_[0]), 30000)
    else:
        l = range(int(l_[0]), int(l_[1] + 0.1))

    if int(annual_income) in l:
        ai_index = idx

# Find the column (age group) that contains the entered age
age_index = 0
for idx, l_ in enumerate(range_age):
    if idx == 0:
        l = [19]
    elif idx == len(range_age) - 1:
        # The last group has no upper bound; 120 is an arbitrary cap
        l = range(70, 120)
    else:
        l = range(int(l_[0]), int(l_[1]) + 1)

    if int(age) in l:
        age_index = idx

# The first two columns are not age groups, hence the +2 offset
answer = df.iat[ai_index, age_index + 2]
print(f'At age {age}, an annual income of {annual_income} (ten thousand yen) '
      f'corresponds to {answer}% of the {df.columns[age_index + 2]} age group.')
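
The same lookup can be done by comparing against the bounds directly instead of building a list of integers for each class. The sketch below assumes each range is a (low, high) pair with None for an open end; find_bin and the sample bounds are illustrative, not part of the original program.

```python
def find_bin(value, ranges):
    """Return the index of the first range containing value, else None."""
    for idx, (low, high) in enumerate(ranges):
        above_low = low is None or value >= low
        below_high = high is None or value <= high
        if above_low and below_high:
            return idx
    return None


# Illustrative age groups: up to 19, 20-24, 25-29, 30 and over.
age_ranges = [(None, 19), (20, 24), (25, 29), (30, None)]
print(find_bin(27, age_ranges))  # 2
print(find_bin(18, age_ranges))  # 0
print(find_bin(65, age_ranges))  # 3
```

Returning None when nothing matches makes an out-of-range input easy to detect, whereas defaulting to index 0 would quietly report the first class.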


It might be interesting to make this program available on the web using a web framework like Django.


That's all. Thank you for reading.