Extracts web pages from a web archive using the DataFrame API and Spark SQL.
Input: a DataFrame obtained from RecordLoader.
Returns: a Dataset[Row] with the schema (crawl_date, url, mime_type_web_server, mime_type_tika, language, content).
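
As a rough illustration of how a result with this schema might be queried through Spark SQL, here is a minimal sketch. It does not call RecordLoader: the rows are hypothetical stand-in data built in memory, the object name WebPagesQuerySketch, the helper englishPages, and the view name webpages are invented for this example, and the column name crawl_date (with an underscore) is an assumption based on the naming of the other columns.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object WebPagesQuerySketch {

  // Hypothetical helper: registers the DataFrame as a temporary view and
  // selects the crawl date and URL of pages whose detected language is English.
  def englishPages(spark: SparkSession, webpages: DataFrame): DataFrame = {
    webpages.createOrReplaceTempView("webpages")
    spark.sql(
      """SELECT crawl_date, url
        |FROM webpages
        |WHERE language = 'en'
        |ORDER BY crawl_date""".stripMargin)
  }

  // Stand-in rows that mimic the documented schema; a real DataFrame
  // would come from the extractor described above instead.
  def sampleWebPages(spark: SparkSession): DataFrame = {
    import spark.implicits._
    Seq(
      ("20200101", "http://example.com/", "text/html", "text/html", "en", "Hello world"),
      ("20200102", "http://example.org/fr", "text/html", "text/html", "fr", "Bonjour")
    ).toDF("crawl_date", "url", "mime_type_web_server", "mime_type_tika", "language", "content")
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("webpages-sketch")
      .master("local[*]") // local mode, for illustration only
      .getOrCreate()

    englishPages(spark, sampleWebPages(spark)).show(false)

    spark.stop()
  }
}
```

Registering a temporary view keeps the query in plain SQL; the same filter could equally be written with DataFrame methods such as filter and select.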