
Demonstrating the effect of text pre-processing on fastText classification for Japanese

This is Shuochen Wang from the Flect research lab. Today I am going to focus on NLP tasks for the Japanese language. Specifically, I am going to demonstrate:

  1. How to use fastText for text classification

  2. The effect of text pre-processing on text classification.


Introduction

Natural Language Processing (NLP) allows machines to break down and interpret human language. It is used in a variety of tasks, including text classification, spam filtering, machine translation, search engines and chatbots.

What is text classification?

To put it simply, text classification is a type of NLP task that assigns one or more predefined categories to a piece of text. It is a type of supervised learning.

Supervised learning means the machine learning model is trained on input data that has been labeled with a particular output (known as labeled data). The model is trained until it can detect the underlying patterns and relationships between the input data and the output labels. We decide when the model is good enough by monitoring the loss function until the loss falls below a threshold that we choose ourselves. We can then apply the model to never-seen-before problems (unseen data). If the model has truly captured the underlying relationship, it can classify the unseen data with high accuracy.

Now, in the context of text classification, the input can be any type of text. The task is to classify it correctly into one or more applicable labels. The type of labels depends on the nature of the task.

To give a concrete example, I will use the example given on the fastText official page. The example problems use questions from the cooking section of Stack Exchange. For instance, one question is "Which baking dish is best to bake a banana bread ?" The correct tags for this question are "baking", "equipment", "bread", and "bananas". The goal in this case is that, given any question, the model needs to identify all the correct labels, if possible.
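
For reference, in fastText's training-file format (which we will prepare in the pre-processing section below), this question and its tags are stored as a single line, with each label prefixed by __label__, roughly like this:

__label__baking __label__equipment __label__bread __label__bananas Which baking dish is best to bake a banana bread ?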

Text classification works the same way for Japanese. However, there is one major difference between English and Japanese NLP tasks: unlike English, which is already segmented into words and can be processed right away, Japanese text needs to be segmented before applying any machine learning model. I will cover this in more detail in the pre-processing section.

 

What is fastText classification?

There are many possible ways to tackle text classification, both with and without deep learning. Possible approaches include SVM (support vector machine), random forest, logistic regression, CNN, word2vec, fastText and transformer models. There is no such thing as the perfect model choice for text classification. A model may perform better than others in some situations but worse than its alternatives in other cases. In addition, a model with higher precision may not always be desirable because it may be computationally expensive.

The model of our choice in this case is fastText. fastText is a library for efficient learning of word representations and sentence classification. fastText actually offers two different functionalities: word representations and sentence classification. The functionality we will be using is sentence classification. fastText achieved state-of-the-art results when it was first released, although it has since been superseded by transformer models. Another major advantage of fastText is its computation speed: it takes relatively less training time than other machine learning models.
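
fastText provides official Python bindings, which is what the snippets later in this blog use. One common way to install them (assuming a working C++ build environment) is:

$ pip install fasttext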

 

Corpus material

The corpus chosen for this task is the livedoor news corpus. This corpus was chosen because it is free and it is a popular choice for Japanese NLP tasks. It is a collection of "livedoor" news articles published by NHN Japan, and a Creative Commons license applies to it. The articles were collected in 2012 and are divided into 9 categories: 'dokujo-tsushin', 'it-life-hack', 'kaden-channel', 'livedoor-homme', 'movie-enter', 'peachy', 'smax', 'sports-watch', 'topic-news'. You can download it here1:

Each category is grouped into its own folder. Inside the folder, each news entry is stored as one txt file. As for the format, the first line contains the original URL of the entry, the second line contains the date of the entry, the third line contains the title of the entry, and the fourth line onwards contains the article itself.
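
For example, a minimal sketch of reading a single entry and keeping only the article body, based on the layout just described (the file name is one of the files from the directory listing shown later), might look like this:

# Read one news entry and drop the URL, date and title lines
with open("text/dokujo-tsushin/dokujo-tsushin-4778030.txt", "r", encoding="utf-8") as f:
    lines = f.readlines()

url, date, title = lines[0], lines[1], lines[2]
article = "".join(lines[3:])  # the article text starts from the fourth line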

Although the description of the corpus says a lot of effort has been put into removing HTML links, upon careful inspection some files still contain HTML links. Therefore, we need to remove these HTML links in the pre-processing step.

 

Pre-processing

As I have mentioned in the previous sections, pre-processing is necessary for the machine learning model to work properly. It is possible to get away with minimal pre-processing for English, but for Japanese the impact on performance is significant. I will demonstrate the effect in the coming sections.

Before we can perform the actual pre-processing, we need to compile the given data into a form that can be processed by the fastText model. Although machine learning can be implemented in other languages, Python will be the language of choice due to its excellent support for machine learning modules.

Grouping the data

After downloading and extracting the files, the first step of the pre-processing is to group everything into a single text file and then load this file into a pandas DataFrame. Apart from the extra HTML links, we assume there is nothing else that needs to be corrected. In other words, we assume the data entries are complete and correct. The files are arranged in the following structure:

dataset
└ text
  ├ CHANGES.txt
  ├ README.txt
  ├ dokujo-tsushin
  │  ├ LICENSE.txt
  │  ├ dokujo-tsushin-4778030.txt
  │  │ ...  
  │  └ dokujo-tsushin-6915005.txt
  ├ it-life-hack
  │  ├ LICENSE.txt
  │  ├ it-life-hack-6292880.txt
  │  ...  
  ├ kaden-channel
  │  ├ LICENSE.txt
  │  ├ kaden-channel-6054293.txt
  │  ...  
  ├ livedoor-homme
  │  ├ LICENSE.txt
  │  ├ livedoor-homme-4568088.txt
  │  ...  
  ├ movie-enter
  │  ├ LICENSE.txt
  │  ├ movie-enter-5840081.txt
  │  ...  
  ├ peachy
  │  ├ LICENSE.txt
  │  ├ peachy-4289213.txt
  │  ...  
  ├ smax
  │  ├ LICENSE.txt
  │  ├ smax-6507397.txt
  │  ...  
  ├ sports-watch
  │  ├ LICENSE.txt
  │  ├ sports-watch-4597641.txt
  │  ...  
  └ topic-news
     ├ LICENSE.txt
     ├ topic-news-5903225.txt
      ...  

To create the text file that contains all the files (except LICENSE.txt), run the following command (source 2). This assumes you are running on Linux or macOS:

$ echo -e "filename\tarticle"$(for category in $(basename -a `find ./text -type d` | grep -v text | sort); do echo -n "\t"; echo -n $category; done) > ./text/livedoor.tsv

I do not know the equivalent command for Windows, but you can easily achieve the same result using the following Python snippet (as long as the texts end up grouped in a pandas column and you can retrieve them easily, the names and the format of the DataFrame do not matter):

Snippet 1 process and read each file

import glob
import os

for folder in glob.glob("*/"):
    for filename in glob.glob(folder + '*.txt'):
        if filename != folder + "LICENSE.txt":
            with open(os.path.join(os.getcwd(), filename), 'r') as f:  # open in read-only mode
                data = f.readlines()
                # do your task here: read the data from the fourth line onwards and write it to a txt file

Next, we write the content of each text file into the livedoor.tsv file we created before. In my case, I had to correct the first line of the tsv file (a tsv file is similar to a csv file, except it is tab-separated instead of comma-separated) before I could read it into pandas correctly. When you read the tsv file into a DataFrame, if the titles and the contents do not match up correctly, open the tsv file and make sure the first line (the column labels) is tab-separated instead of space-separated.

$ for filename in `basename -a ./text/dokujo-tsushin/dokujo-tsushin-*`; do echo -n "$filename"; echo -ne "\t"; echo -n `sed -e '1,3d' ./text/dokujo-tsushin/$filename`; echo -e "\t1\t0\t0\t0\t0\t0\t0\t0\t0"; done >> ./text/livedoor.tsv
$ for filename in `basename -a ./text/it-life-hack/it-life-hack-*`; do echo -n "$filename"; echo -ne "\t"; echo -n `sed -e '1,3d' ./text/it-life-hack/$filename`; echo -e "\t0\t1\t0\t0\t0\t0\t0\t0\t0"; done >> ./text/livedoor.tsv
$ for filename in `basename -a ./text/kaden-channel/kaden-channel-*`; do echo -n "$filename"; echo -ne "\t"; echo -n `sed -e '1,3d' ./text/kaden-channel/$filename`; echo -e "\t0\t0\t1\t0\t0\t0\t0\t0\t0"; done >> ./text/livedoor.tsv
$ for filename in `basename -a ./text/livedoor-homme/livedoor-homme-*`; do echo -n "$filename"; echo -ne "\t"; echo -n `sed -e '1,3d' ./text/livedoor-homme/$filename`; echo -e "\t0\t0\t0\t1\t0\t0\t0\t0\t0"; done >> ./text/livedoor.tsv
$ for filename in `basename -a ./text/movie-enter/movie-enter-*`; do echo -n "$filename"; echo -ne "\t"; echo -n `sed -e '1,3d' ./text/movie-enter/$filename`; echo -e "\t0\t0\t0\t0\t1\t0\t0\t0\t0"; done >> ./text/livedoor.tsv
$ for filename in `basename -a ./text/peachy/peachy-*`; do echo -n "$filename"; echo -ne "\t"; echo -n `sed -e '1,3d' ./text/peachy/$filename`; echo -e "\t0\t0\t0\t0\t0\t1\t0\t0\t0"; done >> ./text/livedoor.tsv
$ for filename in `basename -a ./text/smax/smax-*`; do echo -n "$filename"; echo -ne "\t"; echo -n `sed -e '1,3d' ./text/smax/$filename`; echo -e "\t0\t0\t0\t0\t0\t0\t1\t0\t0"; done >> ./text/livedoor.tsv
$ for filename in `basename -a ./text/sports-watch/sports-watch-*`; do echo -n "$filename"; echo -ne "\t"; echo -n `sed -e '1,3d' ./text/sports-watch/$filename`; echo -e "\t0\t0\t0\t0\t0\t0\t0\t1\t0"; done >> ./text/livedoor.tsv
$ for filename in `basename -a ./text/topic-news/topic-news-*`; do echo -n "$filename"; echo -ne "\t"; echo -n `sed -e '1,3d' ./text/topic-news/$filename`; echo -e "\t0\t0\t0\t0\t0\t0\t0\t0\t1"; done >> ./text/livedoor.tsv
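
If you cannot run these shell commands (for example on Windows), a rough Python sketch of the same step is shown below. It assumes the directory layout shown earlier and appends the same kind of tab-separated, one-hot encoded rows; it is an illustrative alternative, not the exact commands used above:

import glob
import os

topics = ['dokujo-tsushin', 'it-life-hack', 'kaden-channel', 'livedoor-homme', 'movie-enter',
          'peachy', 'smax', 'sports-watch', 'topic-news']

with open("./text/livedoor.tsv", "a", encoding="utf-8") as out:
    for col, topic in enumerate(topics):
        for path in sorted(glob.glob(f"./text/{topic}/{topic}-*.txt")):
            with open(path, "r", encoding="utf-8") as f:
                body = " ".join(line.strip() for line in f.readlines()[3:])  # skip URL, date, title
            one_hot = ["1" if j == col else "0" for j in range(len(topics))]
            out.write(os.path.basename(path) + "\t" + body + "\t" + "\t".join(one_hot) + "\n")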

By running the commands above, we have successfully compiled the news articles together with their corresponding categories. The correct label is also one-hot encoded. To load the file, all we need to do is run the following code:

Snippet 2 read the tsv file

import pandas as pd
df = pd.read_csv('livedoor.tsv', sep='\t')
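
A quick, optional sanity check that the header was parsed into tab-separated columns:

print(df.shape)             # number of articles x number of columns
print(df.columns.tolist())  # expect 'filename', 'article' and the 9 category columns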

For other machine learning models, we could start training right away. However, fastText needs one special treatment, which I will explain after the normalisation step.

Before performing segmentation, we need to remove any hyperlinks present in the text. The reason for removing hyperlinks is that they interfere with the tokenizer in the next section. Even if the segmentation were done correctly, a hyperlink adds almost no useful information to the segmented text. The following code removes potential hyperlinks using a regular expression. The details of how regular expressions work are omitted in this blog; if you want to learn more, you can learn it here

Snippet 3 remove hyperlinks

import re
import os

# remove hyperlink cell

topics = ['dokujo-tsushin', 'it-life-hack', 'kaden-channel', 'livedoor-homme', 'movie-enter', 'peachy', 'smax', 'sports-watch', 'topic-news']

for topic in topics:
    for i in range(len(df.loc[df[topic]==1]['article'].tolist())):
        test = df.loc[df[topic]==1]['article'].tolist()[i]

        # drop any token that is (or contains) a URL, keep everything else
        test2 = test.split()
        newlist = []
        for word in test2:
            word = re.sub(r'http\S+', '', word)
            if word != "":
                newlist.append(word)
        newlist = " ".join(newlist)

        # write the cleaned article back to its original row
        # (assumes the default RangeIndex, so the first label of the mask equals the row position)
        mask = df[topic]==1
        offset = df[mask].index[0]
        df.iloc[offset + i, df.columns.get_loc("article")] = newlist
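
As a quick standalone check of the regular expression (a hypothetical example string):

import re
print(re.sub(r'http\S+', '', 'サイトはこちら http://example.com をご覧ください'))  # the URL token is removed, only the Japanese text remains
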
Word Segmentation

For the Japanese language, the absolute minimal pre-processing is word segmentation, which is referred to as wakati (分かち書き) in Japanese. For example, this is an unsegmented Japanese sentence:

ちょん掛け(ちょんがけ、丁斧掛け・手斧掛けとも表記 )とは、相撲の決まり手のひとつである 。

This is a segmented sentence:

ちょん␣掛け␣( ちょん␣がけ 、␣丁␣斧 ␣掛け␣・␣手斧␣掛け␣と␣も␣表記 )␣と␣は 、␣相撲␣の␣決まり␣手␣の␣ひとつ␣で␣ある 。 

  • ␣ is added for easier viewing purposes.

Fortunately, there is no need to segment the sentences ourselves. A tool that performs the segmentation task automatically is called a tokenizer. There are many tokenizer options, including MeCab, Janome and Sudachi. We will use Sudachi for this task. The following code uses Sudachi to perform segmentation on each entry cell. This takes a few minutes, so I have included a module called progressbar which shows the progress of the segmentation (this is optional if you do not want to install the module).

Another process that we will perform along with segmentation is normalization of punctuation and other symbols (not to be confused with words). In Japanese, symbols and numerals can be written in either zenkaku (full-width) or hankaku (half-width) form. Although their meanings are the same, they are recognized as different characters, which adds unnecessary information. We normalize the symbols using the neologdn module.

Snippet 4 wakati and normalization

# Sudachi wakati step
import os
from sudachipy import tokenizer
from sudachipy import dictionary
import progressbar
import neologdn

tokenizer_obj = dictionary.Dictionary().create()

def wakati(sentence):
    # split the sentence with Sudachi's longest-unit mode (mode C) and join the tokens with spaces
    mode = tokenizer.Tokenizer.SplitMode.C
    return " ".join([m.surface() for m in tokenizer_obj.tokenize(sentence, mode)])

df2 = df.copy()

topics = ['dokujo-tsushin', 'it-life-hack', 'kaden-channel', 'livedoor-homme', 'movie-enter', 'peachy', 'smax', 'sports-watch', 'topic-news']

for topic in topics:
    for i in progressbar.progressbar(range(len(df2.loc[df2[topic]==1]['article'].tolist()))):
        test = df2.loc[df2[topic]==1]['article'].tolist()[i]

        # normalize full-width/half-width variants first, then segment into space-separated tokens
        processed = wakati(neologdn.normalize(test))
        mask = df2[topic]==1
        offset = df2[mask].index[0]
        df2.iloc[offset + i, df2.columns.get_loc("article")] = processed
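
To see what the normalization and segmentation do, here is a quick check on a short, made-up string, reusing the wakati function and the neologdn import from the snippet above:

sample = "ＩＴライフハック　第１２３回"
normalized = neologdn.normalize(sample)  # full-width alphanumerics are folded to half-width
print(normalized)
print(wakati(normalized))                # Sudachi tokens joined with spaces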
    
Append labels to the file

After segmentation, we need to add the label to each entry. As I have mentioned before, fastText classification requires a label for each entry. This is done by attaching a special token that starts with "__label__", which fastText recognizes as a label instead of a normal word. Therefore we need to attach the correct one of the 9 labels to each row. For example, if the correct category is 'topic-news', then we need to prepend the string "__label__topic-news" to the entry. The position of the label does not matter, so we will put it as the first word of the entry.

This can be easily done using pandas.

Snippet 5 append label

#Append label

topics = ['dokujo-tsushin', 'it-life-hack', 'kaden-channel', 'livedoor-homme', 'movie-enter', 'peachy', 'smax', 'sports-watch', 'topic-news']

for topic in topics: 
    for i in range(len(df2.loc[df2[topic]==1]['article'].tolist())):
        original = df2.loc[df2[topic]==1]['article'].tolist()[i]
        label = "__label__" + topic
        mask = df2[topic]==1
        offset = df2[mask].index[0]
        df2.iloc[offset + i, df2.columns.get_loc("article")] = label + " " + original
    
Removing stopwords and numerals

We have performed the minimal pre-processing required for our model to work. However, there are more pre-processing actions we can perform to further improve model performance. One is to remove stopwords. Stopwords are words that are used frequently but provide little meaning to the context. In English, examples would be "this", "that", "the" and "a". Japanese has stopwords as well. We will use this stopwords list: https://github.com/stopwords-iso/stopwords-ja/blob/master/stopwords-ja.txt

In a similar way, numbers provide little information for text classification and can be removed or replaced as well. We will replace all digits with an arbitrary symbol. The following code removes the stopwords and replaces all the digits:

Snippet 6 Remove stop words

# load the stopword list (one word per line)
with open('stop_words_japanese.txt', 'r') as f:
    my_stopwords = f.read().split('\n')

# translation table that maps every digit to '%'
remove_digits = str.maketrans('0123456789', '%%%%%%%%%%')

def remove_mystopwords(sentence):
    # drop every token that appears in the stopword list
    tokens = sentence.split(" ")
    tokens_filtered = [word for word in tokens if word not in my_stopwords]
    return " ".join(tokens_filtered)

df3 = df2.copy()

for i in range(len(df3)):
    temp = df3['article'].iloc[i]
    temp = remove_mystopwords(temp)
    df3.iloc[i, df3.columns.get_loc("article")] = temp.translate(remove_digits)
    

The model performance will be discussed in the training section.

Training the model

Although there are more potential pre-processing steps that could be done, we can now start training and validating the model to see its initial performance.

First we split our DataFrame into a train and a validation set. We will use the most common ratio, 80-20 (80% train and 20% validation). Strictly speaking, a separate test set that is not used during training and validation should also be set aside for the final evaluation; we will omit this step for simplicity. We use the following code:

Snippet 7 split the data into train and validation set

# Split into train and validation sets
import numpy as np

topics = ['dokujo-tsushin', 'it-life-hack', 'kaden-channel', 'livedoor-homme', 'movie-enter', 'peachy', 'smax', 'sports-watch', 'topic-news']
train = []
valid = []

for topic in topics:
    # first 80% of each category goes to train, the remaining 20% to validation
    a, b = np.split(df2.loc[df2[topic]==1]["article"], [int(.8*len(df2.loc[df2[topic]==1]))])
    for item in a.values:
        train.append(item + "\n")
    for item in b.values:
        valid.append(item + "\n")

with open("train.txt", 'w') as filehandle:
    filehandle.writelines(train)
with open("valid.txt", 'w') as filehandle:
    filehandle.writelines(valid)

Now all that is left is to train the model using fastText's train_supervised function; then we validate the precision using the validation set. The whole training takes only a few seconds to complete.

Snippet 8 train and test the model

import fasttext
model = fasttext.train_supervised(input="train.txt")
model.test("valid.txt")

Using the segmented text, the precision is around 35% to 40%. The main reason the precision is this low is that the parameters are not set properly: this is the result obtained with the default parameters. fastText has many parameters that can be changed; the full list of available parameters is given in the fastText documentation.

One of the parameters that we must change is the epoch. The default value is 5, which means the model sees each training example only 5 times, which is clearly not enough for our model. You can observe the learning rate and loss in the prompt to see whether the model has converged (here the loss can still be improved). We cannot determine in advance at what epoch it will converge, but one approach is to use a sufficiently large number to make sure the model has converged. Therefore we increase the epoch to 100, just in case.

Another parameter that increases the precision is the learning rate (lr). The default value is 0.1; we can increase this to 0.5. Rerunning the validation set, we get around 82.5% for the segmented version. This is already an impressive result for a model with very little processing and hyper-parameter tuning. In fact, the model performance remains almost the same even after further hyper-parameter tuning.
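
Putting the two changes together, the retraining step looks like this:

model = fasttext.train_supervised(input="train.txt", epoch=100, lr=0.5)
model.test("valid.txt")  # returns (number of examples, precision at 1, recall at 1)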

Now we can examine how the stopword-removal pre-processing has affected our precision. Run Snippet 7 again, switching the DataFrame to df3 and changing the names of the train and validation files, as in the sketch below.
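
A sketch of that rerun, reusing topics, np, df3 and fasttext from the earlier snippets and using the hypothetical file names train_stop.txt and valid_stop.txt so the original files are kept:

train_stop, valid_stop = [], []
for topic in topics:
    a, b = np.split(df3.loc[df3[topic]==1]["article"], [int(.8*len(df3.loc[df3[topic]==1]))])
    train_stop.extend(item + "\n" for item in a.values)
    valid_stop.extend(item + "\n" for item in b.values)

with open("train_stop.txt", 'w') as filehandle:
    filehandle.writelines(train_stop)
with open("valid_stop.txt", 'w') as filehandle:
    filehandle.writelines(valid_stop)

model_stop = fasttext.train_supervised(input="train_stop.txt", epoch=100, lr=0.5)
model_stop.test("valid_stop.txt")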

The resulting precision is still around 82.5%, only slightly better than before. Due to the random nature of machine learning, we do not have enough evidence that this model performs better than the previous one. However, if you examine the size of the train and validation files, the file size has been reduced by around 25%. In other words, we have maintained the model performance while reducing the file size. For files of this size, this may not be significant, but for a larger corpus such as Wikipedia, reducing the file size will shorten training time significantly. Furthermore, we can expand the stopwords list by removing more words that do not contribute to the classification task, such as names, places and other words with high occurrence counts. However, the effect is not guaranteed to be positive, as removing too many words could hurt the model performance. If you have time, you can try experimenting with removing more stopwords; do let me know the results.
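
One quick way to confirm the size difference, using the hypothetical file names from the sketch above:

import os
print(os.path.getsize("train.txt"), os.path.getsize("train_stop.txt"))  # sizes in bytes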

Conclusion

In this blog, I have introduced how to use fastText for text classification and demonstrated the effect of segmentation and stopword removal.

In terms of performance, our model performed similarly to this blog (3), which uses a BERT model that is supposed to perform better than ours. However, other blogs have reported over 90% precision with a BERT model4. If such results hold, BERT is superior to fastText classification in terms of precision.

Future work

So far I have examined the effect of text pre-processing on fastText classification. There is in fact another type of pre-processing called data augmentation. Data augmentation means creating artificial data when there is not enough source data. One particular type of data augmentation I am interested in is EDA (Easy Data Augmentation); I will examine the effect of EDA in another blog.

Appendix

The python notebook containing the results has been uploaded to my GitHub repository: https://github.com/wang-shuochen/livedoor