Discards records from the RDD whose content matches any of the given regular expressions.
a list of regular expressions
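
To make the direction of this filter concrete, here is a minimal sketch in plain Python rather than in the toolkit's own API. The record layout (dicts with `url` and `content` keys) and the function name `discard_content` are assumptions for illustration only. Whether the real filter matches the whole content or a substring is not stated here; the sketch uses substring search.

```python
import re

def discard_content(records, patterns):
    """Drop every record whose content matches at least one pattern.

    Hypothetical helper: models the filter on a plain list, not on an RDD.
    """
    compiled = [re.compile(p) for p in patterns]
    return [r for r in records
            if not any(c.search(r["content"]) for c in compiled)]

records = [
    {"url": "http://example.com/a", "content": "Election results for 2015"},
    {"url": "http://example.com/b", "content": "Weather forecast"},
]

# Only the record that does not mention an election survives.
remaining = discard_content(records, [r"[Ee]lection"])
```

A record is dropped as soon as one expression matches, so adding more expressions can only shrink the result.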
Discards records from the RDD whose crawl date matches any of the given dates.
a list of dates
Discards records from the RDD whose source domain is one of the given domains.
a list of URLs for the source domains
Discards records from the RDD whose HTTP status code is in the given list.
a list of HTTP status codes
Discards records from the RDD whose MIME type, as specified in the ArchiveRecord, is in the given list.
a list of MIME types
Discards records from the RDD whose detected MIME type is in the given list.
a list of MIME types
Discards records from the RDD whose URL matches any of the given patterns.
a list of regular expressions
Discards records from the RDD whose URL is in the given list.
a list of URLs
Removes all records whose content does not match any of the given regular expressions.
a list of regular expressions to keep
Removes all data that does not have a selected date.
a list of dates
the selected DateComponent enum value
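
The date filter compares only one component of each crawl date against the list. The sketch below models that in plain Python. The member names of `DateComponent`, the 14-digit `YYYYMMDDhhmmss` timestamp layout, and the helper names are assumptions, not taken from the toolkit.

```python
from enum import Enum

class DateComponent(Enum):
    # Hypothetical members: each value is the (start, end) slice of a
    # YYYYMMDDhhmmss timestamp that the component covers.
    YYYY = (0, 4)
    MM = (4, 6)
    DD = (6, 8)
    YYYYMM = (0, 6)
    YYYYMMDD = (0, 8)

def extract(timestamp, component):
    """Return the part of the timestamp selected by the component."""
    start, end = component.value
    return timestamp[start:end]

def keep_date(records, dates, component):
    """Keep records whose selected date component is in `dates`."""
    wanted = set(dates)
    return [r for r in records
            if extract(r["crawl_date"], component) in wanted]

records = [
    {"url": "http://example.com/a", "crawl_date": "20080312142500"},
    {"url": "http://example.com/b", "crawl_date": "20090401080000"},
    {"url": "http://example.com/c", "crawl_date": "20080501000000"},
]

kept_year = keep_date(records, ["2008"], DateComponent.YYYY)
kept_month = keep_date(records, ["200803"], DateComponent.YYYYMM)
```

The entries in the date list have to be written in the same shape as the chosen component, for example `"2008"` for a year and `"200803"` for a year and month.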
Removes all data but selected source domains.
a list of URLs for the source domains
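
Filtering by source domain needs a rule for turning a URL into a domain. The following sketch shows one plausible rule using Python's standard `urllib.parse`: take the host name and drop a leading `www.`. That normalisation, the record layout, and the function names are assumptions made for illustration; the toolkit may derive domains differently.

```python
from urllib.parse import urlsplit

def source_domain(url):
    """Return the host of a URL, lower-cased, without a leading 'www.'."""
    # urlsplit only fills in the hostname when a scheme or "//" is present,
    # so bare hosts such as "example.com" are prefixed first.
    if "//" not in url:
        url = "//" + url
    host = urlsplit(url).hostname or ""
    return host[4:] if host.startswith("www.") else host

def keep_domains(records, domains):
    """Keep only records whose source domain is one of `domains`."""
    wanted = {source_domain(d) for d in domains}
    return [r for r in records if source_domain(r["url"]) in wanted]

records = [
    {"url": "http://www.example.com/page"},
    {"url": "https://archive.org/web/"},
    {"url": "http://example.com:8080/x"},
]

kept = keep_domains(records, ["example.com"])
```

Ports and paths are ignored by this rule, so both `example.com` records are kept.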
Removes all data that does not have selected status codes.
a list of HTTP status codes
Removes all data except images.
Removes all data not in selected language.
a set of ISO 639-2 codes
Removes all data but selected MIME types specified in ArchiveRecord.
a list of MIME types
Removes all data but selected MIME types as detected by Tika.
a list of MIME types
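
The two MIME type filters can disagree because one trusts the type stored with the record and the other inspects the payload. The sketch below illustrates the difference in plain Python. The tiny signature table is only a stand-in for Tika, which does far more, and the record keys `declared_mime` and `payload` are assumed names.

```python
# Well-known leading byte signatures; a stand-in for real content detection.
SIGNATURES = [
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"%PDF-", "application/pdf"),
    (b"GIF89a", "image/gif"),
]

def sniff_mime(payload, default="application/octet-stream"):
    """Guess a MIME type from the first bytes of the payload."""
    for magic, mime in SIGNATURES:
        if payload.startswith(magic):
            return mime
    return default

def keep_mime_types(records, mime_types, detected=False):
    """Keep records by declared type, or by sniffed type if detected=True."""
    wanted = set(mime_types)

    def mime_of(record):
        if detected:
            return sniff_mime(record["payload"])
        return record["declared_mime"]

    return [r for r in records if mime_of(r) in wanted]

records = [
    # A PNG image that the server mislabelled as HTML.
    {"url": "http://example.com/logo", "declared_mime": "text/html",
     "payload": b"\x89PNG\r\n\x1a\n\x00\x00"},
    {"url": "http://example.com/doc.pdf", "declared_mime": "application/pdf",
     "payload": b"%PDF-1.4\n"},
]

by_header = keep_mime_types(records, ["image/png"])
by_content = keep_mime_types(records, ["image/png"], detected=True)
```

The mislabelled image is missed when filtering on the declared type and found when filtering on the detected type.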
Removes all data but selected url patterns.
a list of regular expressions
Removes all data but selected exact URLs.
a list of URLs to keep
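
The URL pattern filter and the exact URL filter differ only in how a URL is compared against the list. A minimal plain-Python sketch of the contrast follows. The function names are hypothetical, and matching each pattern against the whole URL is an assumption.

```python
import re

def keep_url_patterns(urls, patterns):
    """Keep URLs that fully match at least one regular expression."""
    compiled = [re.compile(p) for p in patterns]
    return [u for u in urls if any(c.fullmatch(u) for c in compiled)]

def keep_urls(urls, exact):
    """Keep URLs that are literally present in `exact`."""
    wanted = set(exact)
    return [u for u in urls if u in wanted]

urls = [
    "http://example.com/",
    "http://example.com/news/2015/",
    "http://example.org/news/",
]

by_pattern = keep_url_patterns(urls, [r"http://example\.com/.*"])
by_exact = keep_urls(urls, ["http://example.com/"])
```

The pattern keeps every page under `example.com`, while the exact list keeps only the single URL it names.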
Removes all non-HTML data (images, executables, etc.), leaving only HTML text.
A wrapper class around RDD that allows RDDs of type ARCRecord and WARCRecord to be queried via a fluent API.
To load such an RDD, please see RecordLoader.
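
A fluent API here means that each filter returns a new wrapped collection, so filters can be chained. The sketch below shows the pattern in plain Python over a list. The class `RecordCollection` and its method names are hypothetical stand-ins, not the toolkit's wrapper, and no Spark is involved.

```python
class RecordCollection:
    """Hypothetical stand-in for the RDD wrapper; holds a plain list."""

    def __init__(self, records):
        self._records = list(records)

    def keep_http_status(self, codes):
        """Return a new collection with only the given status codes."""
        wanted = set(codes)
        return RecordCollection(
            r for r in self._records if r["status"] in wanted)

    def discard_urls(self, urls):
        """Return a new collection without the given exact URLs."""
        unwanted = set(urls)
        return RecordCollection(
            r for r in self._records if r["url"] not in unwanted)

    def collect(self):
        """Return the remaining records as a list."""
        return list(self._records)

records = [
    {"url": "http://example.com/", "status": "200"},
    {"url": "http://example.com/missing", "status": "404"},
    {"url": "http://example.com/robots.txt", "status": "200"},
]

# Each call returns a new wrapper, so the filters read as one pipeline.
result = (RecordCollection(records)
          .keep_http_status(["200"])
          .discard_urls(["http://example.com/robots.txt"])
          .collect())
```

Because every method returns a fresh wrapper instead of modifying the original, the starting collection can be reused for other queries.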