Gets all non-empty archive files.
the path to the directory containing archive files
the filesystem on which the directory resides
a String containing the paths of all non-empty archive files.
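The filtering described above can be illustrated with a minimal sketch. It is not the library's implementation: it assumes a local directory read through `java.io.File` in place of the filesystem parameter, and the names `NonEmptyArchives` and `nonEmptyFiles`, the comma separator, and the sorted order are all hypothetical choices made for illustration.

```scala
import java.io.File
import java.nio.file.Path

// Hypothetical helper, not part of the documented API.
object NonEmptyArchives {
  // Returns the paths of all regular files in `dir` whose size is greater
  // than zero, joined into a single String.
  // Assumption: paths are sorted and comma-separated; the original text
  // does not state the separator or the ordering.
  def nonEmptyFiles(dir: Path): String = {
    // listFiles() returns null when `dir` is not a readable directory.
    val entries = Option(dir.toFile.listFiles()).getOrElse(Array.empty[File])
    entries
      .filter(f => f.isFile && f.length() > 0) // skip subdirectories and empty files
      .map(_.getPath)
      .sorted
      .mkString(",")
  }
}
```

Skipping zero-length files up front avoids handing empty archives to a downstream reader, which is presumably why the documented method filters them out.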
Creates an Archive Record RDD from a WARC or ARC file.
the path to the WARC or ARC file(s)
the Apache Spark context
an RDD of ArchiveRecords for mapping.
Loads records from either WARCs or ARCs.