Problem: line breaks (\n) not preserved with article-parser and sanitizeHtml

I was using article-parser 4.2.1 in my Node.js program to get article data from a news website. It worked perfectly until I realised I couldn’t get the line breaks from the article. By line breaks I mean the “\n” characters.

I solved this in two steps: first by going through the package and changing some of the options passed to the htmlmin npm package, which is a dependency of article-parser, and then by replacing \n with <br> after loading the content with cheerio.

Go to node_modules\article-parser\src\utils\standalizeArticle.js and find where htmlmin is called, at line 32. The options must be changed so that preserveLineBreaks is set to true.

I made a repo with these changes, which you can install with npm install mosesweb/article-parser. The changed htmlmin call looks like this:

  const minifiedHtml = htmlmin($.html(), {
    removeComments: true,
    removeEmptyElements: true,
    removeEmptyAttributes: true,
    collapseWhitespace: false,
    conservativeCollapse: true,
    removeTagWhitespace: false,
    preserveLineBreaks: true
  });

In the Node.js application, this means that the next time you fetch data with extract, you will get your content with the lovely \n characters.

// Assumed import style for article-parser 4.x; adjust if your version differs
const { extract } = require('article-parser');
const $ = require('cheerio');

let myUrl = "mycoolurlforanarticle.com"
extract(myUrl).then((article) => {

    // Load the article DOM into a cheerio object
    let con = $.load(article.content)

    // Print the HTML content to the console with <br /> in place of \n.
    // A global regex replace works on every Node.js version, so there is no
    // need to patch String.prototype.replaceAll (which would also have to be
    // defined before this line runs, not after it).
    console.log(con.html({ decodeEntities: false }).replace(/\n/g, '<br />'))
});
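
The newline-to-<br /> step above can also be pulled out into a small helper, so it can be reused and tested without fetching an article. This is only a minimal sketch: newlinesToBr is a hypothetical name of my own, not something provided by article-parser or cheerio, and it assumes you want Windows-style \r\n endings treated as a single line break too.

```javascript
// Hypothetical helper, not part of article-parser or cheerio.
// Replaces every line break in a string with an HTML <br /> tag.
function newlinesToBr(html) {
  // Match \r\n first so a Windows line ending yields one <br />, not two.
  return String(html).replace(/\r\n|\r|\n/g, '<br />');
}

// Example with mixed line endings
const sample = '<p>First line\nSecond line</p>\r\n<p>Third</p>';
console.log(newlinesToBr(sample));
// prints: <p>First line<br />Second line</p><br /><p>Third</p>
```

Inside the extract callback you could then write console.log(newlinesToBr(con.html({ decodeEntities: false }))) instead of calling replace directly.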