Recently I came across a great regular expression for picking Japanese keywords out of text. I had never seen it on Stack Overflow, and it is exactly what I had been searching for. I was genuinely surprised to find it, because I didn't think such a thing existed!
The regex is below:
([\((「『]+.*?[\))」』]|\ |<br>|[a-zA-Z0-9]+\.[a-z]{2,}|[一-龠々〆ヵヶゝ]+|[ぁ-んゝ]+|[ァ-ヴー]+|[a-zA-Z0-9]+|[a-zA-Z0-9]+)
The same regex, written as a JavaScript literal with the global flag, but without the alternatives for the <br> tag and the non-breaking space character entity reference:
/([\((「『]+.*?[\))」』]|[a-zA-Z0-9]+\.[a-z]{2,}|[一-龠々〆ヵヶゝ]+|[ぁ-んゝ]+|[ァ-ヴー]+|[a-zA-Z0-9]+|[a-zA-Z0-9]+)/g;
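To make the pattern easier to read, here is a minimal sketch that rebuilds it from labeled pieces. The labels (bracketed, domain, kanji and so on), the helper name tokenize, and the sample sentence are assumptions added for illustration and are only one reading of each alternative. The duplicated final alternative is left out, which does not change what the pattern matches.

```javascript
// Each alternative of the pattern, labeled. The labels are illustrative guesses.
const parts = {
  bracketed: /[\((「『]+.*?[\))」』]/,   // text wrapped in parentheses or Japanese quotes
  domain: /[a-zA-Z0-9]+\.[a-z]{2,}/,     // something shaped like "example.com"
  kanji: /[一-龠々〆ヵヶゝ]+/,            // a run of kanji and iteration marks
  hiragana: /[ぁ-んゝ]+/,                 // a run of hiragana
  katakana: /[ァ-ヴー]+/,                 // a run of katakana, including the long-vowel mark
  alphanumeric: /[a-zA-Z0-9]+/,          // a run of ASCII letters and digits
};

// Join the pieces in the same order as the original pattern; order matters,
// because the first alternative that matches at a position wins.
const keywordRegex = new RegExp(
  Object.values(parts).map((part) => part.source).join("|"),
  "g"
);

// Hypothetical helper: returns all matched tokens, or an empty array if none.
function tokenize(text) {
  return text.match(keywordRegex) ?? [];
}

console.log(tokenize("東京タワーはexample.comで「予約」できます"));
// → [ '東京', 'タワー', 'は', 'example.com', 'で', '「予約」', 'できます' ]
```

Because the bracketed and domain alternatives come first, "「予約」" and "example.com" each come back as a single token instead of being split into smaller runs.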
Example JavaScript code with the regex in action:
const regex = /([\((「『]+.*?[\))」』]|\ |<br>|[a-zA-Z0-9]+\.[a-z]{2,}|[一-龠々〆ヵヶゝ]+|[ぁ-んゝ]+|[ァ-ヴー]+|[a-zA-Z0-9]+|[a-zA-Z0-9]+)/g;
const str = `【AFP=時事】新型コロナウイルスのパンデミック(世界的な大流行)が7月に入って急加速していることが、各国・機関のデータを基にしたAFPの集計で明らかになった。 `;
let m;

while ((m = regex.exec(str)) !== null) {
  // This is necessary to avoid infinite loops with zero-width matches
  if (m.index === regex.lastIndex) {
    regex.lastIndex++;
  }

  // The result can be accessed through the `m`-variable.
  m.forEach((match, groupIndex) => {
    console.log(`Found match, group ${groupIndex}: ${match}`);
  });
}
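Every alternative in this pattern consumes at least one character, so a zero-width match cannot happen here and the lastIndex guard never fires. As a shorter variant, the loop could be sketched with String.prototype.matchAll (ES2020), which handles the iteration itself. The helper name findKeywords and the shortened sample string are assumptions for illustration. The pattern is the shorter one without the <br> and non-breaking space alternatives.

```javascript
// The shorter pattern from the article (no <br> or non-breaking space alternatives).
const regex = /([\((「『]+.*?[\))」』]|[a-zA-Z0-9]+\.[a-z]{2,}|[一-龠々〆ヵヶゝ]+|[ぁ-んゝ]+|[ァ-ヴー]+|[a-zA-Z0-9]+|[a-zA-Z0-9]+)/g;

// Hypothetical helper: returns each keyword together with its start offset.
// matchAll works on a copy of the regex, so regex.lastIndex is left untouched.
function findKeywords(text) {
  return Array.from(text.matchAll(regex), (m) => ({
    keyword: m[0],
    index: m.index,
  }));
}

// Opening of the sample sentence used above.
console.log(findKeywords("【AFP=時事】新型コロナウイルス"));
// → AFP at index 1, 時事 at index 5, 新型 at index 8, コロナウイルス at index 10
```

Characters that belong to none of the alternatives, such as 【, = and 】, are simply skipped.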