NLP 100 Exercises (言語処理100本ノック) with Python: Chapter 1

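Introduction

A professor in my lab recently started working through the NLP 100 Exercises (言語処理100本ノック), which are published on the web page of the Inui-Okazaki Laboratory at Tohoku University, so I decided to take them on as well to improve my own skills.

www.cl.ecei.tohoku.ac.jp

I have rarely shown my code to other people, so it may be messy. I would be grateful for any corrections or suggestions.

The code is published in a Git repository on github.com.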

Chapter 1: Warm-up

Keywords: strings, Unicode, list type, dictionary type, set type, iterators, slices, random numbers

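00. Reversing a string

Obtain the string in which the characters of "stressed" are arranged in reverse order (from the last character to the first).

String slicing alone is enough: a slice with a step of -1 walks the string from the end to the start.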

# Avoid naming the variable `list`, which would shadow the built-in type
text = "stressed"
print(text[::-1])  # desserts
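
As a cross-check on the slice, the same reversal can be sketched with the built-in reversed(). The helper name reverse_text is hypothetical and introduced only for this illustration.

```python
def reverse_text(text):
    # reversed() yields the characters from last to first; join() rebuilds a str
    return "".join(reversed(text))

print(reverse_text("stressed"))                       # desserts
print(reverse_text("stressed") == "stressed"[::-1])   # True
```

Both forms give the same result; the slice is shorter, while reversed() also works on iterables that do not support slicing.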

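01. "パタトクカシーー"

Take the 1st, 3rd, 5th, and 7th characters of the string "パタトクカシーー" and concatenate them into a new string.

As with reversing a string, slicing alone is enough; here the step is 2.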

# Avoid naming the variable `str`, which would shadow the built-in type
text = "パタトクカシーー"
print(text[::2])  # パトカー
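
To make the slice notation concrete, the sketch below spells out the full start:stop:step form and compares it with picking the characters by index. The names odd_chars and by_index are illustrative only.

```python
text = "パタトクカシーー"

# text[start:stop:step]; an omitted start or stop defaults to the ends of the string
odd_chars = text[0:len(text):2]

# The same characters picked explicitly: 1st, 3rd, 5th, 7th -> indices 0, 2, 4, 6
by_index = "".join(text[i] for i in (0, 2, 4, 6))

print(odd_chars)    # パトカー
print(by_index)     # パトカー
print(text[1::2])   # タクシー (the 2nd, 4th, 6th, and 8th characters)
```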

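02. "パトカー" + "タクシー" = "パタトクカシーー"

Concatenate the characters of "パトカー" and "タクシー" alternately, starting from the first character, to obtain the string "パタトクカシーー".

The zip() function takes one element at a time from each of its arguments, so it yields the characters of the two strings in pairs.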

str1 = "パトカー"
str2 = "タクシー"
word = ""
# zip() pairs up the i-th characters of both strings
for a, b in zip(str1, str2):
    word += a + b
print(word)  # パタトクカシーー
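
The same interleaving can also be written without an explicit loop. This is a minimal sketch, not a replacement for the loop above; interleave is a hypothetical helper name. Note that zip() stops at the shorter input, so extra characters in a longer string are dropped.

```python
def interleave(s1, s2):
    # zip() yields (s1[i], s2[i]) pairs and stops at the shorter string
    return "".join(a + b for a, b in zip(s1, s2))

print(interleave("パトカー", "タクシー"))  # パタトクカシーー
print(interleave("abc", "XY"))            # aXbY ("c" is dropped)
```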

03. Pi

Split the sentence "Now I need a drink, alcoholic of course, after the heavy lectures involving quantum mechanics." into words, and create a list of the number of (alphabetic) characters in each word, in order of appearance.

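The split() function splits the string into words. Only alphabetic characters are counted, so the comma and the period attached to some words do not inflate the lengths, and the result is collected in a list as the problem asks.

text = "Now I need a drink, alcoholic of course, after the heavy lectures involving quantum mechanics."
# Count only alphabetic characters so that "drink," counts as 5, not 6
lengths = [sum(c.isalpha() for c in word) for word in text.split()]
print(lengths)  # [3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5, 8, 9, 7, 9]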
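
The problem is titled "Pi" because the word lengths spell out the leading digits of pi. The sketch below checks this, assuming the only punctuation is a trailing "," or "." that can be removed with str.strip(); word_lengths is a hypothetical helper name.

```python
text = ("Now I need a drink, alcoholic of course, after the heavy "
        "lectures involving quantum mechanics.")

def word_lengths(sentence):
    # Strip a leading/trailing "," or "." so that only the letters are counted
    return [len(word.strip(",.")) for word in sentence.split()]

lengths = word_lengths(text)
print(lengths)
# Joining the lengths gives the leading digits of pi
print("".join(str(n) for n in lengths))  # 314159265358979
```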

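04. Element symbols

Split the sentence "Hi He Lied Because Boron Could Not Oxidize Fluorine. New Nations Might Also Sign Peace Security Clause. Arthur King Can." into words. Take the first character from the 1st, 5th, 6th, 7th, 8th, 9th, 15th, 16th, and 19th words and the first two characters from every other word, then build an associative array (dictionary or map type) from each extracted string to the position of its word (its 1-based index in the sentence).

The sentence is split with split(), and the choice between one and two characters is made by checking the word position against a list prepared in advance.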

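text = "Hi He Lied Because Boron Could Not Oxidize Fluorine. New Nations Might Also Sign Peace Security Clause. Arthur King Can."
words = text.split(" ")
# 1-based positions of the words that contribute only their first letter
one_wordlist = [1, 5, 6, 7, 8, 9, 15, 16, 19]
# Avoid naming the variable `dict`, which would shadow the built-in type
symbols = {}
# enumerate(..., 1) numbers the words starting from 1
for i, word in enumerate(words, 1):
    length = 1 if i in one_wordlist else 2
    # The task asks for a map from the extracted string to the word position
    symbols[word[:length]] = i
print(symbols)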
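
A compact variant can be sketched with a dict comprehension. It maps each extracted string to its word position, which is the direction the problem statement asks for. The names one_letter and symbol_to_position are hypothetical.

```python
text = ("Hi He Lied Because Boron Could Not Oxidize Fluorine. New Nations "
        "Might Also Sign Peace Security Clause. Arthur King Can.")

# A set gives constant-time membership tests for the one-letter positions
one_letter = {1, 5, 6, 7, 8, 9, 15, 16, 19}

symbol_to_position = {
    word[:(1 if i in one_letter else 2)]: i
    for i, word in enumerate(text.split(), start=1)
}

print(symbol_to_position["He"])  # 2
print(symbol_to_position["K"])   # 19
print(symbol_to_position["Mi"])  # 12 (the sentence yields "Mi" where the periodic table has "Mg")
```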

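05. n-gram

Write a function that creates n-grams from a given sequence (a string, a list, etc.). Use this function to obtain the word bi-grams and the character bi-grams of the sentence "I am an NLPer".

An n-gram is a contiguous run of n items (characters or words) taken from a text; counting n-grams shows how often each combination appears. Since bi-grams are requested, characters are cut out two at a time and words two at a time. Here I wrote a function whose mode argument selects character n-grams with 'char' and word n-grams with 'word'.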

def ngram(text, n, mode):
    result = []
    if mode == 'word':
        # Word n-grams: slide a window of n words over the split sentence
        words = text.split(' ')
        for i in range(len(words) - n + 1):
            result.append(words[i:i + n])
    elif mode == 'char':
        # Character n-grams: spaces are removed before slicing
        chars = text.replace(" ", "")
        for i in range(len(chars) - n + 1):
            result.append(chars[i:i + n])
    return result
if __name__ == '__main__':
    print(ngram('I am an NLPer', 2, 'word'))
    print(ngram('I am an NLPer', 2, 'char'))
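
The problem statement asks for a function that accepts any sequence (a string or a list). One possible sketch, which is not the approach taken in this post, builds the n-grams by slicing so that a single function serves both cases. The name ngrams is hypothetical, and unlike the function in this post it keeps spaces in character n-grams.

```python
def ngrams(seq, n):
    # Slide a window of length n over any sliceable sequence (str or list)
    return [seq[i:i + n] for i in range(len(seq) - n + 1)]

sentence = "I am an NLPer"
print(ngrams(sentence.split(), 2))  # [['I', 'am'], ['am', 'an'], ['an', 'NLPer']]
print(ngrams(sentence, 2)[:3])      # ['I ', ' a', 'am']
```

When the sequence is shorter than n, range() is empty and the function returns an empty list, so no explicit length check is needed.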

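06. Sets

Obtain the sets of character bi-grams contained in "paraparaparadise" and "paragraph" as X and Y, respectively, and find the union, intersection, and difference of X and Y. In addition, check whether the bi-gram 'se' is contained in X and in Y.

I reused the function written for the n-gram problem. The union, intersection, and differences can be obtained with Python's set operators (|, &, -).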

def ngram(text, n):
    # Character n-grams of text (spaces are removed first)
    result = []
    word = text.replace(" ", "")
    for i in range(len(word) - n + 1):
        result.append(word[i:i + n])
    return result
if __name__ == '__main__':
    X = set(ngram('paraparaparadise', 2))
    Y = set(ngram('paragraph', 2))
    print("X or Y=", X | Y)
    print("X and Y=", X & Y)
    print("X - Y=", X - Y)
    print("Y - X=", Y - X)
    # Check X and Y independently; with elif, Y was skipped whenever X matched
    print("'se' in X:", 'se' in X)
    print("'se' in Y:", 'se' in Y)
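
To show what the set operators return for these two words, here is a small self-contained sketch. The helper name char_bigrams is hypothetical; sorted() is used only because sets are unordered and would otherwise print in an arbitrary order.

```python
def char_bigrams(text):
    # Set comprehension: duplicates such as the repeated "pa" collapse automatically
    return {text[i:i + 2] for i in range(len(text) - 1)}

X = char_bigrams("paraparaparadise")
Y = char_bigrams("paragraph")

print(sorted(X & Y))          # ['ap', 'ar', 'pa', 'ra']
print(sorted(X - Y))          # ['ad', 'di', 'is', 'se']
print(sorted(Y - X))          # ['ag', 'gr', 'ph']
print("se" in X, "se" in Y)   # True False
```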

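07. Generating a sentence from a template

Implement a function that takes arguments x, y, and z and returns the string "x時のyはz" ("y at x o'clock is z"). Then check the result with x=12, y="気温", z=22.4.

Since the arguments were known in advance, a one-liner would have worked, but to cope with inputs of other types, the function checks the type of each argument and converts anything that is not a string with str().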

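def textmake(x, y, z):
    # Convert any non-string argument (e.g. the int 12 or the float 22.4) to str
    if not isinstance(x, str):
        x = str(x)
    if not isinstance(y, str):
        y = str(y)
    if not isinstance(z, str):
        z = str(z)
    return x + "時の" + y + "は" + z
if __name__ == '__main__':
    print(textmake(12, "気温", 22.4))  # 12時の気温は22.4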
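
The one-liner mentioned above could look like the following sketch, which relies on an f-string to format each value so that no manual type check is needed. The function name make_sentence is hypothetical.

```python
def make_sentence(x, y, z):
    # f-strings format each value themselves, so int and float need no manual str()
    return f"{x}時の{y}は{z}"

print(make_sentence(12, "気温", 22.4))  # 12時の気温は22.4
```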

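08. Cipher text

Implement a function cipher that converts each character of a given string according to the following specification:
a lowercase English letter is replaced with the character whose code is (219 - character code);
any other character is output unchanged.
Use this function to encrypt and decrypt an English message.

The ord() function converts a character to its character code, and chr() converts the resulting code back to a character.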

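def cipher(word):
    result = ""
    for char in word:
        # Restrict to ASCII a-z: islower() is also true for letters such as 'é',
        # for which 219 - ord(char) would be negative and chr() would fail
        if 'a' <= char <= 'z':
            result += chr(219 - ord(char))
        else:
            result += char
    return result
if __name__ == '__main__':
    encrypted = cipher("Hello World")
    print(encrypted)          # Hvool Wliow
    print(cipher(encrypted))  # applying cipher again decrypts: Hello World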
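
The constant 219 equals ord('a') + ord('z'), so the rule swaps each lowercase letter with its mirror in the alphabet (a with z, b with y, and so on), and applying it twice restores the original text. A sketch of the same mapping built with str.maketrans() follows; the names MIRROR and mirror_cipher are hypothetical.

```python
import string

# Map "abc...z" onto "zyx...a"; this is the same mapping as chr(219 - ord(c))
MIRROR = str.maketrans(string.ascii_lowercase, string.ascii_lowercase[::-1])

def mirror_cipher(message):
    # str.translate() leaves characters that are missing from the table unchanged
    return message.translate(MIRROR)

secret = mirror_cipher("Hello World")
print(secret)                 # Hvool Wliow
print(mirror_cipher(secret))  # Hello World
```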

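09. Typoglycemia

Write a program that, for a sequence of words separated by spaces, keeps the first and last characters of each word and randomly rearranges the order of the remaining characters. Words of length 4 or less are not rearranged. Give the program a suitable English sentence (for example, "I couldn't believe that I could actually understand what I was reading : the phenomenal power of the human mind .") and check the result.

The first and last characters are removed with pop() and stored in variables, and the remaining characters are then rearranged at random with random.shuffle().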

import random
text = "I couldn't believe that I could actually understand what I was reading : the phenomenal power of the human mind ."
words = text.split(' ')
result = []
for word in words:
    # Words of length 4 or less are left unchanged
    if len(word) > 4:
        chars = list(word)
        first = chars.pop(0)   # keep the first character
        last = chars.pop()     # keep the last character
        random.shuffle(chars)  # shuffle the middle characters in place
        word = first + ''.join(chars) + last
    result.append(word)
print(' '.join(result))
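
Because the output is random, it is hard to check by eye. The sketch below wraps the same idea in a function and accepts a random number generator so that a seeded run is reproducible. The function name typoglycemia and the rng parameter are assumptions made for this illustration.

```python
import random

def typoglycemia(sentence, rng=random):
    shuffled = []
    for word in sentence.split(" "):
        if len(word) > 4:
            middle = list(word[1:-1])
            rng.shuffle(middle)  # in-place shuffle of the inner characters only
            word = word[0] + "".join(middle) + word[-1]
        shuffled.append(word)
    return " ".join(shuffled)

original = "I could actually understand what I was reading"
result = typoglycemia(original, random.Random(0))  # seeded for reproducibility
print(result)
```

Whatever the seed, every output word keeps its length, its first and last characters, and the same multiset of letters as the input word.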

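Conclusion

Since this chapter was only a warm-up, the problems were relatively easy to solve. My main impression is that Python really is convenient. If I have the energy, I will move on to Chapter 2 as well.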

yama53's tech blog (yama53の技術系ブログ)

A diary-like blog by yama53, where I plan to write up my hobbies and the things I have worked on.
