io.archivesunleashed.matchbox

ExtractBoilerpipeText

object ExtractBoilerpipeText

Extract raw text content from an HTML page, minus "boilerplate" content (using boilerpipe).

Linear Supertypes
AnyRef, Any

Value Members

  1. def apply(input: String): String

    Uses boilerpipe to extract raw text content from a page.

    ExtractBoilerpipeText removes boilerplate text (e.g. a copyright statement) from an HTML string.

    input

    an HTML string possibly containing boilerplate text

    returns

    text with boilerplate removed, or an empty string if the text is empty.
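
A minimal usage sketch, based only on the `apply(input: String): String` signature documented above. It assumes the Archives Unleashed Toolkit is on the classpath. The sample HTML, the `ExtractBoilerpipeTextExample` object, and the `extract` helper are hypothetical and exist only for illustration. Because the exact result for a page with no extractable content is not spelled out here, the helper defensively wraps the result in an `Option` and treats both `null` and the empty string as "nothing extracted".

```scala
import io.archivesunleashed.matchbox.ExtractBoilerpipeText

object ExtractBoilerpipeTextExample {
  // Hypothetical sample page: the navigation bar and the copyright footer
  // are the kind of boilerplate the extractor is meant to drop.
  val html: String =
    """<html><body>
      |<div id="nav"><a href="/">Home</a> <a href="/about">About</a></div>
      |<p>Web archives preserve pages that would otherwise disappear from the
      |live web. Researchers use them to study how sites change over time,
      |which requires separating the main text from repeated page furniture.</p>
      |<div id="footer">Copyright 2019 Example Org. All rights reserved.</div>
      |</body></html>""".stripMargin

  // Hypothetical helper: returns Some(text) when content was extracted,
  // None when the result is null or empty (a defensive assumption).
  def extract(page: String): Option[String] =
    Option(ExtractBoilerpipeText(page)).filter(_.nonEmpty)

  def main(args: Array[String]): Unit =
    extract(html) match {
      case Some(text) => println(text)
      case None       => println("(no main content extracted)")
    }
}
```

What counts as boilerplate is decided by boilerpipe's heuristics, so very short pages may yield little or no text; checking for an empty result before further processing is a reasonable precaution.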