I was using article-parser 4.2.1 in my Node.js program to get article data from a news website. It worked perfectly until I realised I couldn’t get the line breaks from the article. By line breaks I mean the “\n” characters.
I solved this by going into the package and changing some options passed to htmlmin, an npm package that article-parser depends on, and then replacing \n with <br> using cheerio.
In node_modules\article-parser\src\utils\standalizeArticle.js, htmlmin is called at row 32. The options passed there must be changed so that preserveLineBreaks is set to true.
I made a repo with these changes, which you can install with npm install mosesweb/article-parser.
const minifiedHtml = htmlmin($.html(), {
  removeComments: true,
  removeEmptyElements: true,
  removeEmptyAttributes: true,
  collapseWhitespace: false,
  conservativeCollapse: true,
  removeTagWhitespace: false,
  preserveLineBreaks: true
});
In the Node.js application, this means that the next time you fetch data with extract, you will get your content with the lovely \n characters.
const { extract } = require('article-parser');
const cheerio = require('cheerio');

// Placeholder; use the full URL of the article here
const myUrl = "mycoolurlforanarticle.com";

extract(myUrl).then((article) => {
  // Load the article DOM into a cheerio object
  const con = cheerio.load(article.content);
  // Print the HTML content to the console with <br /> instead of \n.
  // replace() with a global regex is used, so no custom replaceAll is needed
  console.log(con.html({ decodeEntities: false }).replace(/\n/g, '<br />'));
}).catch((err) => {
  // Log fetch or parse failures instead of leaving the promise unhandled
  console.error(err);
});
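If you would rather not touch String.prototype at all, the newline-to-<br /> step can live in a small standalone function. This is only a sketch: newlinesToBreaks is a name I made up for illustration, not something from article-parser or cheerio, and I am assuming you also want Windows-style \r\n line endings to count as a single line break.

```javascript
// Hypothetical helper (not part of article-parser or cheerio):
// turns every line break in a string into an HTML <br /> tag.
function newlinesToBreaks(text) {
  // Match Windows (\r\n), lone \r, and Unix (\n) line endings.
  // \r\n is listed first so the pair becomes one <br />, not two.
  return String(text).replace(/\r\n|\r|\n/g, '<br />');
}

const sample = '<p>First line\nSecond line\r\nThird line</p>';
console.log(newlinesToBreaks(sample));
// prints: <p>First line<br />Second line<br />Third line</p>
```

Inside the extract callback you could then pass the cheerio HTML string to newlinesToBreaks instead of calling replace or replaceAll on it directly.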