I was using article-parser 4.2.1 in my Node.js program to get article data from a news website. It worked perfectly until I realised I couldn't get the line breaks from the article. By line breaks I mean the “\n” characters.
I solved this by going into the plugin and changing some of the options passed to the htmlmin npm package, which is a dependency of article-parser, and then replacing \n with <br> using cheerio.
Open node_modules\article-parser\src\utils\standalizeArticle.js and find where htmlmin is called, at line 32. Change the options so that preserveLineBreaks is set to true:

const minifiedHtml = htmlmin($.html(), {
  removeComments: true,
  removeEmptyElements: true,
  removeEmptyAttributes: true,
  collapseWhitespace: false,
  conservativeCollapse: true,
  removeTagWhitespace: false,
  preserveLineBreaks: true
});

I also made a repo with these changes, which you can install with npm install mosesweb/article-parser.
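To see why a line-break-preserving option matters, here is a rough toy illustration of the general idea. It is not htmlmin's actual implementation, and both function names are hypothetical; it only assumes that a minifier which collapses whitespace would otherwise turn every whitespace run, newlines included, into a single space.

```javascript
// Toy illustration only: NOT htmlmin's real code. Both function names are hypothetical.

// Collapse every run of whitespace into a single space, so newlines are lost.
function collapseAll(html) {
  return html.replace(/\s+/g, ' ');
}

// Collapse whitespace, but keep one "\n" for any run that contained a line break.
function collapseKeepingLineBreaks(html) {
  return html.replace(/\s+/g, (run) => (run.includes('\n') ? '\n' : ' '));
}

const sample = '<p>First line</p>\n\n  <p>Second   line</p>';

// JSON.stringify makes the surviving "\n" visible in the console output.
console.log(JSON.stringify(collapseAll(sample)));
console.log(JSON.stringify(collapseKeepingLineBreaks(sample)));
```

In the first result the two paragraphs end up separated by a space; in the second they are still separated by a single \n, which is what the later <br> replacement relies on.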
In the Node.js application, this means that the next time you fetch data with extract, you will get your content with the lovely \n characters intact.
// Assumed imports: the original snippet did not show them
const { extract } = require('article-parser');
const $ = require('cheerio');

let myUrl = "mycoolurlforanarticle.com";

extract(myUrl).then((article) => {
  // Load the article DOM into a cheerio object
  let con = $.load(article.content);

  // Print the HTML content to the console with HTML line breaks (<br />) instead of \n.
  // A global regex replace is used here, so there is no need to define
  // String.prototype.replaceAll (the original snippet defined it only after calling it).
  console.log(con.html({ decodeEntities: false }).replace(/\n/g, '<br />'));
});
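If you want to reuse the newline-to-<br> step outside the extract callback, it can be pulled into a small standalone function. This is a minimal sketch: the name newlinesToBr is hypothetical, and handling \r\n and \r as well as \n is my own assumption, in case the source page uses Windows-style line endings.

```javascript
// Hypothetical helper: turn line breaks in a string into HTML <br /> tags.
function newlinesToBr(text) {
  // Match Windows (\r\n), lone carriage return (\r) and Unix (\n) line endings.
  // \r\n is listed first so it becomes one <br />, not two.
  return text.replace(/\r\n|\r|\n/g, '<br />');
}

console.log(newlinesToBr('First paragraph.\nSecond paragraph.\r\nThird.'));
```

Because it only uses String.prototype.replace with a global regular expression, it needs no prototype patching and works the same on old and new Node.js versions.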