独学初心者プログラマーの日英ブログ

Published Date : 2019年7月21日19:11

MeCabのipadic辞書ビルド方法 -How to build an ipadic dictionary in MeCab-

This article has an English translation.

MeCabのipadic辞書ビルド方法
-How to build an ipadic dictionary in MeCab-

普段自分はMeCabのipadic辞書のインストールはbrewでインストールしていました。

その方法ではユーザー辞書の登録もすんなりいっていたのですが、

今回ソースコードから辞書をビルドしてから、システム辞書をビルドし直す司令が入った訳ですが、

そこで思いの他手間取ってしまったので、自分への戒め的なメモとして残しておくことにします。

I used to install MeCab's ipadic dictionary with "brew".

It was easy to register a custom dictionary,

but this time, someone asked me to build a dictionary from the source code and then rebuild the system dictionary,

which took me a lot of time, so I decided to write this article for someone like me.

一連の流れ -A series of flows-

取り敢えず、MeCab本体はbrewでインストール。その後の流れが以下の通り。

For now, I installed MeCab itself with "brew". The flow after that is as follows.

1 公式のIPA辞書をダウンロード、解凍、ビルドする。
Download, extract, and build the official IPA dictionary.
2 ※重要※ nkfを使用して辞書内の全てのcsvとdefをUTF-8に変換する。
* IMPORTANT * Use nkf to convert all csv and def files in the dictionary to UTF-8.
3 Makefile内の書き換え、dicrc内も書き換える。
Rewrite in Makefile and in dicrc.
4 ユーザー辞書を作り、反映させる。
Create and reflect a custom dictionary.
5 mecabrcのIPA辞書のパスを置き換える。
Replace the path to the IPA dictionary in mecabrc.
6 おまけ： Pythonから使えるようにする為の簡単なティップス。
Bonus: Simple tips for making it available in Python.

IPA辞書ビルド -Build the IPA dictionary-

MeCab自体はbrewでインストール。

Install MeCab with brew.

$ brew install mecab

Home Brewが無いならここにあるRubyのスクリプトをターミナルに打ち込み、インストール。

If you don't have Home Brew, type this Ruby script into Terminal and install it.

続いて、公式サイトからIPA辞書を落とす、解凍、移動、ビルドする。

It then drops, decompresses, moves, and builds the IPA dictionary from the official site.

ダウンロードしたフォルダを解凍する。（解凍先はどこでも良いが、自分が把握しやすい場所にするのが吉。）

Unzip the downloaded folder. (It doesn't matter where you unzip it, but it's best to make it easy for you to find it.)

$ tar zxfv mecab-ipadic-2.7.0-xxxxxxxx.tar.gz

解凍したフォルダへ移動。

Move to the unzipped folder.

$ cd mecab-ipadic-2.7.0-xxxxxxxx

nkfをインストールして、辞書内のcsvとdefをUTF-8に変換する。

Install nkf to convert csv and def in dictionary to UTF-8.

nkfとは指定したファイルの文字セットを変換してくれるツールです。

nkf is a tool that converts the character set of a specified file.

$ brew install nkf

$ nkf -w --overwrite *.csv

$ nkf -w --overwrite *.def

configureというスクリプトファイルをUTF-8のオプションをつけて実行します。

Run the configure script file with UTF -8 options.

$ ./configure --enable-utf8-only --with-charset=utf8

出来上がったMakefileの中身をUTF-8に対応させるために書き換えます。

Rewrite the contents of the resulting Makefile to match UTF-8.

$ nano Makefile

- (mecab_dict_index) -d . -o . -f EUC-JP -t utf8

+ (mecab_dict_index) -d . -o . -f utf8 -t utf8

makeコマンドでコンパイルします。

Compile with the make command.

$ make

コンパイルされたファイルのインストールをします。

Install the compiled files.

$ sudo make install

MeCabで使うデフォルトの辞書のパスを、今回ビルドした辞書のパスに書き換えます。

Replaces the path of the default dictionary used by MeCab with the path of the dictionary you built.

$ sudo nano /usr/local/etc/mecabrc

dicdir = /User/(your user name)/path/to/ipadic dictionary

MeCabがきちんと機能するか確かめる。

See if MeCab works.

$ mecab

デフォルトだと、名詞として解析されない、マニアックな漫才師の名前を入力してみます。

By default, I will try to enter the name of a manzai performer who is not parsed as a noun.

若井はんじ・けんじ

結果は予想通り。

The results were as expected.

この漫才師の名前を名詞、固有名詞としてユーザー辞書を作ってみます。

I will make a user dictionary with the name of this manzaishi as noun and proper noun.

$ nano user.csv

ユーザー辞書のCSVファイルの作り方はこちらの人の記事を参考にしました。

I referred to this person's article about how to make CSV file of user dictionary.

若井はんじ・けんじ,0,0,1,名詞,固有名詞,*,*,*,*,わかいはんじけんじ,わかいはんじけんじ,わかいはんじけんじ

ちなみに、Mac標準搭載のnanoエディターで保存する時は「Ctrl + O」。エディターを終了する時は「Ctrl + X」です。

By the way, when saving in the Mac's native nano editor "Ctrl + O". To exit the editor, use "Ctrl + X".

辞書を作るツールの設定ファイルを変更します。

Change the configuration file for the tool that creates the dictionary.

findコマンドで場所を見つけます。（各個人の環境によって違う可能性があるので。）

Find the location with the find command. (It may be different depending on each individual's environment.)

# find /usr -name "dicrc"

/usr/local/lib/mecab/dic/ipadic/dicrc

設定ファイルの場所を見つけたら、「config-charset」の部分を書き換えます。

If you find the location of the configuration file, replace "config-charset".

$ sudo nano /usr/local/lib/mecab/dic/ipadic/dicrc

- config-charset = EUC-JP

+ config-charset = UTF-8

続いて、辞書を作ってくれるツールの場所を探します。

Next, look for a tool to make a dictionary.

$ find /usr -name "mecab-dict-index"

/usr/local/libexec/mecab/mecab-dict-index

場所が見つかったら、「user.csv」があるフォルダ内で、（今回は直接システム辞書内部で作成したので、システム辞書の中で）以下のコマンドを打ちます。

Once the location is found, type the following command (This time, I created it directly in the system dictionary, so I created it in the system dictionary.) in the folder where "user.csv" is located.

$ /usr/local/libexec/mecab/mecab-dict-index -f utf-8 -t utf-8

$ sudo make install

きちんと登録したユーザー辞書が機能しているかどうかを試してみる。

Try to see if a properly registered custom dictionary works.

$ mecab

若井はんじ・けんじ

OK、大丈夫そう。

Okay. It's working.

Pythonで使う場合 -For use with Python-

それではPythonで使えるようにしていきます。とは言ってもすごく簡単ですが、意外な盲点だったのでメモしておきます。

I'll make it available in Python. That said, it's pretty easy, but it was an unexpected blind spot, so I'll write it down.

$ brew install swig

$ pip install mecab-python3

import MeCab

# dオプションでipadic辞書としてビルド、パスを通した場所を指定してあげる。（Neologd辞書にも使える）
# Use the d option to build it as an ipadic dictionary and specify its location along the path. (Can also be used in Neologd dictionaries)
tagger = MeCab.Tagger ('-d /path/to/your ipadic dictionary/')

# "Wakai-hanji-kenji to, wakai Hanji to Kenji."
# It means "Wakai-hanji-kenji and young Hanji and Kenji."
# "Hanji" in Japanese means a judge.
# "Kenji" means a prosecutor in Japanese.
# "wakai" means young.
# Although the pronunciation is the same, 
# "Wakai Hanji・Kenji" is the name of a manzaishi or a person.
# In short, it's like a wordplay.
text = '若井はんじ・けんじと、若いはんじとけんじ。'
tokens = tagger.parse(text)
print(tokens)

ポイントは単純に、形態素解析オブジェクトを作る際に、登録した辞書のパスをdオプションをつけて指定してあげることだけ。

The point is simply to specify the path of the registered dictionary with the d option when creating a morphological analysis object.

ただ、「はんじ」と「けんじ」が上手く解析できていないので、こういう時はわざわざユーザー辞書を作るよりNeologd辞書を使ったほうが良いと思う。

However, "Hanji" and "Kenji" are not analyzed well, so I think it's better to use Neologd dictionary instead of creating a custom dictionary.

終了 -Conclusion-

今回のメモした内容のような方法は、漫才師の名前だけ形態素してくれればいいとか、自分が使う環境で特別な条件がある場合等は有効だと思う。

今日は漫才師と漫才について解析する。明日は若者言葉について解析するとか使い分けをするために、ユーザー辞書を入れ替えて使用する時なんかも便利かもしれない。

I think the method like the one in the memo this time is effective if only the name of the manzaishi is morphed, or if there are special conditions in the environment you use.

For example, today we analyze manzaishi and manzai. Tomorrow, I'm going to use it to analyze young people's language. It can also be useful if you want to swap out a custom dictionary.

とりあえず、これにて終了。

That's it for now.

See You Next Page!