Extract information about word processor files from a web archive using DataFrames and Spark SQL.
DataFrame obtained from RecordLoader
Dataset[Row], where the schema is (crawl_date, url, mime_type_web_server, mime_type_tika, language, content)
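
A minimal sketch of how such an extraction could look using only the Spark SQL API. It is not the actual implementation. The object name WordProcessorSketch, the extract helper, the list of MIME types, and the sample rows are all hypothetical. The column name crawl_date is an assumed spelling of "crawl date". A small in-memory DataFrame stands in for the one obtained from RecordLoader.

```scala
import org.apache.spark.sql.{DataFrame, Dataset, Row, SparkSession}
import org.apache.spark.sql.functions.col

object WordProcessorSketch {
  // Assumption: an illustrative set of word processor MIME types,
  // not the list used by the original extractor.
  val wordProcessorMimeTypes: Seq[String] = Seq(
    "application/msword",
    "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    "application/vnd.oasis.opendocument.text",
    "application/rtf"
  )

  // Keep only rows whose Tika-detected MIME type is in the list above,
  // then project the columns named in the schema description.
  def extract(d: DataFrame): Dataset[Row] =
    d.filter(col("mime_type_tika").isin(wordProcessorMimeTypes: _*))
      .select("crawl_date", "url", "mime_type_web_server",
        "mime_type_tika", "language", "content")

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("word-processor-sketch")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical sample rows standing in for records from a web archive.
    val records = Seq(
      ("20200101", "http://example.com/a.docx", "application/octet-stream",
        "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
        "en", "placeholder text"),
      ("20200101", "http://example.com/index.html", "text/html",
        "text/html", "en", "placeholder text")
    ).toDF("crawl_date", "url", "mime_type_web_server",
      "mime_type_tika", "language", "content")

    // Only the .docx row passes the filter.
    extract(records).show(false)
    spark.stop()
  }
}
```

The sketch filters on mime_type_tika rather than mime_type_web_server because web servers often report a generic type such as application/octet-stream for downloads. Whether the original extractor makes the same choice is not stated in the text.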