Gets all non-empty archive files.
the path to the directory containing archive files
the filesystem on which the directory resides
a String consisting of the paths of all non-empty archive files.
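As a rough illustration of the behaviour described above, the following is a minimal sketch and not the library's implementation. It uses the JDK's java.nio.file API as a stand-in for the filesystem parameter; the names NonEmptyFiles and nonEmptyFiles, the sorting, and the comma separator are all assumptions made for the example.

```scala
import java.nio.file.{Files, Path}
// Scala 2.13+; on Scala 2.12 use scala.collection.JavaConverters._ instead.
import scala.jdk.CollectionConverters._

object NonEmptyFiles {
  // Hypothetical helper: returns the paths of all non-empty regular files
  // directly inside `dir` as a single String. Sorting and the comma
  // separator are assumptions, not taken from the documented method.
  def nonEmptyFiles(dir: Path): String = {
    val stream = Files.list(dir)
    try {
      stream.iterator().asScala
        .filter(p => Files.isRegularFile(p) && Files.size(p) > 0) // skip empty files
        .map(_.toString)
        .toSeq
        .sorted
        .mkString(",")
    } finally stream.close() // release the directory handle
  }
}
```

Calling NonEmptyFiles.nonEmptyFiles on a directory that holds one empty and two non-empty files yields a String containing only the two non-empty paths.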
Creates an Archive Record RDD from a WARC or ARC file.
the path to the WARC or ARC file(s)
the Apache Spark context
an RDD of ArchiveRecords for mapping.
Creates an RDD of JSON records from tweets.
the path to the tweets file
the Apache Spark context
an RDD of JValue (JSON objects) for mapping.
Loads records from WARCs, ARCs, or Twitter API data (JSON).