Extracts plain text from a web archive using DataFrames and Spark SQL.
DataFrame obtained from RecordLoader
Dataset[Row], where the schema is (crawl date, domain, url, text)
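A minimal sketch of how the returned Dataset[Row] could be queried with Spark SQL. The extractor call itself is not shown. The column names (crawl_date, domain, url, text), the view name plain_text, the helper pagesPerDomain, and the sample rows are all assumptions made for illustration; the actual column names should be taken from the real schema.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object PlainTextDataFrameSketch {
  // Counts extracted pages per domain with Spark SQL.
  // The column name `domain` and the view name `plain_text` are assumptions.
  def pagesPerDomain(spark: SparkSession, extracted: DataFrame): DataFrame = {
    extracted.createOrReplaceTempView("plain_text")
    spark.sql(
      """SELECT domain, COUNT(*) AS pages
        |FROM plain_text
        |GROUP BY domain
        |ORDER BY pages DESC, domain""".stripMargin)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("plain-text-dataframe-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Placeholder rows standing in for the extractor's output
    // (crawl date, domain, url, text); not real crawl data.
    val extracted = Seq(
      ("20200101", "example.com", "http://example.com/", "Hello world"),
      ("20200102", "example.com", "http://example.com/about", "About us"),
      ("20200102", "example.org", "http://example.org/", "Another page")
    ).toDF("crawl_date", "domain", "url", "text")

    pagesPerDomain(spark, extracted).show(false)
    spark.stop()
  }
}
```

Registering the result as a temporary view keeps the query in plain SQL; the same aggregation could also be written with groupBy and count on the DataFrame.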
Extracts plain text from a web archive using an RDD.
RDD[ArchiveRecord] obtained from RecordLoader
RDD[(String, String, String, String)], which is (crawl date, domain, url, text)
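A minimal sketch of how the returned RDD of (crawl date, domain, url, text) tuples could be consumed. The extractor call itself is not shown. The helper textLengthsForDomain and the sample tuples are hypothetical and only illustrate pattern matching on the four-element tuple.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

object PlainTextRddSketch {
  // Keeps records from one domain and returns (url, text length) pairs.
  // Tuple positions follow the documented order: crawl date, domain, url, text.
  def textLengthsForDomain(
      records: RDD[(String, String, String, String)],
      domain: String): RDD[(String, Int)] =
    records
      .filter { case (_, recordDomain, _, _) => recordDomain == domain }
      .map { case (_, _, url, text) => (url, text.length) }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("plain-text-rdd-sketch")
      .setMaster("local[*]")
    val sc = new SparkContext(conf)

    // Placeholder tuples standing in for the extractor's output; not real crawl data.
    val records = sc.parallelize(Seq(
      ("20200101", "example.com", "http://example.com/", "Hello world"),
      ("20200102", "example.com", "http://example.com/about", "About us"),
      ("20200102", "example.org", "http://example.org/", "Another page")
    ))

    textLengthsForDomain(records, "example.com").collect().foreach(println)
    sc.stop()
  }
}
```

Destructuring the tuple with case patterns avoids positional accessors such as _2 and _4, which makes the field order explicit at each step.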