Extract plain text from a web archive using DataFrames and Spark SQL.
DataFrame obtained from RecordLoader
Dataset[Row], where the schema is (CrawlDate, Domain, Url, Text)
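As a rough illustration of how a result with the (CrawlDate, Domain, Url, Text) schema can be queried with Spark SQL, here is a minimal sketch. It does not call the extractor itself: the rows are placeholder data, and the object name `PlainTextSchemaSketch`, the helper `countByDomain`, and the view name `pages` are hypothetical names chosen for this example. It assumes Spark is on the classpath.

```scala
import org.apache.spark.sql.SparkSession

object PlainTextSchemaSketch {
  // Hypothetical helper: builds a DataFrame with the documented column names
  // from in-memory rows, then counts pages per domain with Spark SQL.
  def countByDomain(
      spark: SparkSession,
      rows: Seq[(String, String, String, String)]): Array[(String, Long)] = {
    import spark.implicits._

    // Stand-in for the extractor's output: (CrawlDate, Domain, Url, Text).
    val pages = rows.toDF("CrawlDate", "Domain", "Url", "Text")

    // "pages" is an assumed view name, used only in this sketch.
    pages.createOrReplaceTempView("pages")

    spark
      .sql("SELECT Domain, COUNT(*) AS n FROM pages GROUP BY Domain ORDER BY Domain")
      .collect()
      .map(r => (r.getString(0), r.getLong(1)))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("plain-text-schema-sketch")
      .master("local[*]")
      .getOrCreate()

    // Placeholder rows, not real crawl data.
    val rows = Seq(
      ("20200101", "example.com", "http://example.com/", "Hello archive"),
      ("20200102", "example.com", "http://example.com/about", "About page"),
      ("20200102", "example.org", "http://example.org/a", "Another page")
    )

    countByDomain(spark, rows).foreach(println)
    spark.stop()
  }
}
```

Because the result is an ordinary Dataset[Row], the same query could also be written with the DataFrame API (`groupBy("Domain").count()`) instead of a temporary view.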
Extract plain text from a web archive using MapReduce.
RDD[ArchiveRecord] obtained from RecordLoader
RDD[(String, String, String, String)], which holds (CrawlDate, Domain, Url, Text)
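The shape of the returned RDD can be sketched with a simple map over records. This is only an illustration under stated assumptions: `PageRecord` is a hypothetical stand-in for ArchiveRecord, `stripTags` is a naive regex-based tag remover rather than the toolkit's own HTML handling, and the input record is placeholder data. It assumes Spark is on the classpath.

```scala
import java.net.URI
import org.apache.spark.sql.SparkSession

// Hypothetical stand-in for ArchiveRecord, carrying only the fields used here.
final case class PageRecord(crawlDate: String, url: String, html: String)

object PlainTextRddSketch {
  // Naive tag removal for illustration only; real HTML needs a proper parser.
  def stripTags(html: String): String =
    html.replaceAll("<[^>]*>", " ").replaceAll("\\s+", " ").trim

  // Maps one record to the documented tuple: (CrawlDate, Domain, Url, Text).
  def toTuple(r: PageRecord): (String, String, String, String) = {
    // getHost returns null when the URL has no host, hence the Option wrapper.
    val domain = Option(new URI(r.url).getHost).getOrElse("")
    (r.crawlDate, domain, r.url, stripTags(r.html))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("plain-text-rdd-sketch")
      .master("local[*]")
      .getOrCreate()

    // Placeholder input, not real crawl data.
    val records = spark.sparkContext.parallelize(Seq(
      PageRecord("20200101", "http://example.com/index.html", "<p>Hi <b>there</b></p>")
    ))

    // The result has type RDD[(String, String, String, String)].
    val extracted = records.map(toTuple)
    extracted.collect().foreach(println)

    spark.stop()
  }
}
```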