io.archivesunleashed.spark.matchbox
UDF for extracting raw text content from an HTML page, minus "boilerplate" content (using boilerpipe).
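As a rough illustration of the kind of call such a UDF might wrap, the sketch below passes an HTML string to boilerpipe's `DefaultExtractor` and tidies the result. The object name `BoilerplateSketch`, the method name `extractText`, the choice of `DefaultExtractor`, and the whitespace handling are all assumptions made for this example; they are not taken from the package itself, and the boilerpipe library is assumed to be on the classpath.

```scala
import de.l3s.boilerpipe.extractors.DefaultExtractor

// Hypothetical helper for illustration only; not the package's actual UDF.
object BoilerplateSketch {

  /** Returns the main text of an HTML page, or "" for null or empty input. */
  def extractText(html: String): String = {
    if (html == null || html.isEmpty) {
      ""
    } else {
      // DefaultExtractor is boilerpipe's general-purpose extractor; it drops
      // blocks it classifies as boilerplate (navigation, footers, and so on).
      val text = DefaultExtractor.INSTANCE.getText(html)
      // Assumption: collapse line breaks so each page yields one line of text.
      text.replaceAll("[\\r\\n]+", " ").trim
    }
  }
}
```

A function of this shape could then be applied to the HTML content of each record in a Spark job, for example inside a `map` over an RDD of page bodies.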