Regex for Japanese Keywords

Recently I came across a great regular expression for picking out Japanese keywords. I had never seen it on Stack Overflow, and it is exactly what I had been searching for. I was honestly surprised to find it, because I didn't think such a thing existed!

The regex

([\((「『]+.*?[\))」』]|\&nbsp;|<br>|[a-zA-Z0-9]+\.[a-z]{2,}|[一-龠々〆ヵヶゝ]+|[ぁ-んゝ]+|[ァ-ヴー]+|[a-zA-Z0-9]+|[a-zA-Z0-9]+)

The same regex as above, written as a JavaScript regex literal, but without the <br> tag and the non-breaking space character entity reference (&nbsp;).

/([\((「『]+.*?[\))」』]|[a-zA-Z0-9]+\.[a-z]{2,}|[一-龠々〆ヵヶゝ]+|[ぁ-んゝ]+|[ァ-ヴー]+|[a-zA-Z0-9]+|[a-zA-Z0-9]+)/g;
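The pattern is easier to read one alternative at a time. The following is a minimal sketch, not part of the original post, that rebuilds it from labeled pieces using named capture groups. The labels (bracketed, domain, kanji, hiragana, katakana, alnum) and the helper name labelTokens are my own assumptions, and the repeated final alternative is listed only once, which does not change what gets matched.

```javascript
// Each alternative of the regex, under a hypothetical label.
// Order matters: earlier alternatives are tried first at each position.
const parts = {
  bracketed: String.raw`[\((「『]+.*?[\))」』]`, // bracketed text, kept as one token
  domain: String.raw`[a-zA-Z0-9]+\.[a-z]{2,}`, // domain-like strings such as example.com
  kanji: "[一-龠々〆ヵヶゝ]+",
  hiragana: "[ぁ-んゝ]+",
  katakana: "[ァ-ヴー]+",
  alnum: "[a-zA-Z0-9]+",
};

// Wrap every piece in a named group and join them with "|".
const labeledSource = Object.entries(parts)
  .map(([name, pattern]) => `(?<${name}>${pattern})`)
  .join("|");
const labeledRegex = new RegExp(labeledSource, "g");

// Return every token together with the label of the alternative that matched it.
function labelTokens(text) {
  return Array.from(text.matchAll(labeledRegex), (m) => {
    const [type] = Object.entries(m.groups).find(([, value]) => value !== undefined);
    return { type, text: m[0] };
  });
}

console.log(labelTokens("東京タワーはexample.comで紹介"));
```

With this input the tokens come out as 東京 (kanji), タワー (katakana), は (hiragana), example.com (domain), で (hiragana) and 紹介 (kanji).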

Example JavaScript code with the regex in action

const regex = /([\((「『]+.*?[\))」』]|\&nbsp;|<br>|[a-zA-Z0-9]+\.[a-z]{2,}|[一-龠々〆ヵヶゝ]+|[ぁ-んゝ]+|[ァ-ヴー]+|[a-zA-Z0-9]+|[a-zA-Z0-9]+)/g;
const str = `【AFP=時事】新型コロナウイルスのパンデミック(世界的な大流行)が7月に入って急加速していることが、各国・機関のデータを基にしたAFPの集計で明らかになった。

`;
let m;

while ((m = regex.exec(str)) !== null) {
    // This is necessary to avoid infinite loops with zero-width matches
    if (m.index === regex.lastIndex) {
        regex.lastIndex++;
    }
    
    // The result can be accessed through the `m`-variable.
    m.forEach((match, groupIndex) => {
        console.log(`Found match, group ${groupIndex}: ${match}`);
    });
}
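If you only need the keywords as an array rather than logged one by one, a shorter variant is possible. This is a sketch under my own assumptions: the function name extractKeywords is hypothetical, and it uses the version of the regex without the <br> and non-breaking space alternatives.

```javascript
// The regex without the <br> and non-breaking space alternatives.
const keywordRegex = /([\((「『]+.*?[\))」』]|[a-zA-Z0-9]+\.[a-z]{2,}|[一-龠々〆ヵヶゝ]+|[ぁ-んゝ]+|[ァ-ヴー]+|[a-zA-Z0-9]+|[a-zA-Z0-9]+)/g;

// With the g flag, String.prototype.match returns all full matches,
// or null when nothing matches, so fall back to an empty array.
function extractKeywords(text) {
  return text.match(keywordRegex) || [];
}

console.log(extractKeywords("パンデミック(世界的な大流行)が7月"));
// → [ 'パンデミック', '(世界的な大流行)', 'が', '7', '月' ]
```

Note that the text in parentheses stays together as a single token, because the bracket alternative is tried before the kanji, hiragana and katakana alternatives.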

Source

I found this on qiita.com, posted by ykkey.
