Filters detected content (regex).
a list of regular expressions
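As a rough illustration of regex-based discarding, the sketch below runs on a plain Scala collection rather than an RDD. The `Record` type, the `discardContent` name, and the partial-match semantics are assumptions made for this example; the text above does not say whether a pattern must match the whole content or only part of it.

```scala
import scala.util.matching.Regex

object DiscardContentSketch {
  // Hypothetical stand-in for an archive record; not the library's own type.
  final case class Record(url: String, content: String)

  // Drop every record whose content matches at least one pattern.
  // Assumption: a partial match anywhere in the content is enough.
  def discardContent(records: Seq[Record], patterns: Seq[Regex]): Seq[Record] =
    records.filterNot(r => patterns.exists(_.findFirstIn(r.content).isDefined))

  val sample: Seq[Record] = Seq(
    Record("http://example.org/a", "Buy cheap pills now"),
    Record("http://example.org/b", "Archival research notes")
  )

  // Only the second record survives the filter.
  val kept: Seq[Record] = discardContent(sample, Seq("(?i)cheap pills".r))
}
```

Passing several patterns discards a record as soon as any one of them matches, which mirrors the "list of regular expressions" parameter.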
Filters detected dates.
a list of dates
Filters detected domains (regex).
a list of URLs for the source domains
Filters detected HTTP status codes.
a list of HTTP status codes
Filters detected language.
a set of ISO 639-2 codes
Filters ArchiveRecord MIME types (as reported by the web server).
a list of MIME types
Filters detected MIME types (as detected by Tika).
a list of MIME types
Filters detected URL patterns (regex).
a list of regular expressions
Filters detected URLs.
a list of URLs
Removes all content that does not match the given regular expressions.
a list of regular expressions to keep
Removes all data that does not have selected date.
a list of dates
the selected DateComponent enum value
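The interplay between the list of dates and the date component can be sketched as follows. Everything here is an assumption made for illustration: the component names (`YYYY`, `MM`, `DD`, `YYYYMM`, `YYYYMMDD`), the `yyyyMMdd` crawl-date string, and the `Record` and `keepDate` names. The actual `DateComponent` enum may define different values.

```scala
object KeepDateSketch {
  // Hypothetical date components; the real DateComponent enum may differ.
  sealed trait DateComponent
  case object YYYY extends DateComponent
  case object MM extends DateComponent
  case object DD extends DateComponent
  case object YYYYMM extends DateComponent
  case object YYYYMMDD extends DateComponent

  // Hypothetical record; crawlDate is assumed to be a "yyyyMMdd" string.
  final case class Record(url: String, crawlDate: String)

  // Cut the requested component out of a "yyyyMMdd" string.
  def extract(component: DateComponent, date: String): String = component match {
    case YYYY     => date.substring(0, 4)
    case MM       => date.substring(4, 6)
    case DD       => date.substring(6, 8)
    case YYYYMM   => date.substring(0, 6)
    case YYYYMMDD => date.substring(0, 8)
  }

  // Keep only records whose extracted component appears in the given list.
  def keepDate(records: Seq[Record], dates: List[String], component: DateComponent): Seq[Record] =
    records.filter(r => dates.contains(extract(component, r.crawlDate)))

  val sample: Seq[Record] = Seq(
    Record("http://example.org/a", "20080430"),
    Record("http://example.org/b", "20091215")
  )

  // Keeps only the record crawled in 2008.
  val from2008: Seq[Record] = keepDate(sample, List("2008"), YYYY)
}
```

The component decides how each entry in the list is interpreted: with a year component the list holds years, with a month component it holds two-digit months, and so on.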
Removes all data except the selected source domains.
a list of URLs for the source domains
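Domain-based keeping might look like the sketch below, which compares each record's host against a set of wanted domains. The `Record` type, the `domainOf` helper, and the choice to strip a leading `www.` are assumptions for this example, not documented behavior.

```scala
import java.net.URI
import scala.util.Try

object KeepDomainsSketch {
  // Hypothetical stand-in for an archive record.
  final case class Record(url: String)

  // Extract the host from a URL, dropping a leading "www." (an assumption).
  // Malformed URLs and URLs without a host yield an empty string.
  def domainOf(url: String): String =
    Try(Option(new URI(url).getHost).getOrElse("")).getOrElse("").stripPrefix("www.")

  // Keep only records whose domain is in the wanted set.
  def keepDomains(records: Seq[Record], domains: Set[String]): Seq[Record] =
    records.filter(r => domains.contains(domainOf(r.url)))

  val sample: Seq[Record] = Seq(
    Record("http://www.example.org/page"),
    Record("https://archive.org/details/x")
  )

  val onlyExample: Seq[Record] = keepDomains(sample, Set("example.org"))
}
```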
Removes all data that does not have selected HTTP status codes.
a list of HTTP status codes
Removes all data except images.
Removes all data not in selected language.
a set of ISO 639-2 codes
Removes all data except the specified MIME types.
a list of MIME types
Removes all data except the selected MIME types as detected by Tika.
a list of MIME types
Removes all data but selected URL patterns.
a list of regular expressions
Removes all data but selected exact URLs.
a list of URLs to keep
Removes all non-HTML data (images, executables, etc.), leaving only HTML text.
Extracts a webgraph with columns for crawl date, source URL, destination URL, and anchor text.
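To make the four webgraph columns concrete, here is a minimal sketch that pulls links out of HTML strings with a naive regular expression. The `Page` and `Edge` types and the `webgraph` function are hypothetical, and a real implementation would more plausibly rely on an HTML parser than on a regex.

```scala
object WebgraphSketch {
  // Hypothetical input page and output row types.
  final case class Page(crawlDate: String, url: String, html: String)
  final case class Edge(crawlDate: String, src: String, dest: String, anchor: String)

  // Naive pattern for <a href="...">text</a>; it only handles double-quoted hrefs.
  private val Link = """(?is)<a\s[^>]*href\s*=\s*"([^"]*)"[^>]*>(.*?)</a>""".r

  // Emit one row per link: crawl date, source URL, destination URL, anchor text.
  def webgraph(pages: Seq[Page]): Seq[Edge] =
    for {
      p <- pages
      m <- Link.findAllMatchIn(p.html).toSeq
    } yield Edge(p.crawlDate, p.url, m.group(1), m.group(2).trim)

  val sample: Seq[Page] = Seq(
    Page("20080430", "http://example.org/",
      """<p><a href="http://example.com/about">About us</a></p>""")
  )

  val edges: Seq[Edge] = webgraph(sample)
}
```

Each page contributes as many rows as it has links, so a page without links contributes none.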
Extracts webpages with columns for crawl date, URL, MIME type, and content.
A wrapper class around RDD that allows RDDs of type ARCRecord and WARCRecord to be queried through a fluent API.
To load such an RDD, please see RecordLoader.
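The fluent style described above can be sketched with an implicit class that adds chainable filters to a collection. A plain `Seq` stands in for the RDD so the example runs without Spark. The `Record` fields and the method names `keepMimeTypes` and `discardHttpStatus` are illustrative assumptions and are not taken from the text above.

```scala
object FluentSketch {
  // Hypothetical stand-in for an ARC/WARC record.
  final case class Record(url: String, mimeType: String, httpStatus: String)

  // The wrapper: each method returns a Seq[Record], so calls can be chained.
  implicit class RecordOps(val records: Seq[Record]) {
    // Keep only records with one of the given MIME types.
    def keepMimeTypes(types: Set[String]): Seq[Record] =
      records.filter(r => types.contains(r.mimeType))

    // Drop records with one of the given HTTP status codes.
    def discardHttpStatus(codes: Set[String]): Seq[Record] =
      records.filterNot(r => codes.contains(r.httpStatus))
  }

  val sample: Seq[Record] = Seq(
    Record("http://example.org/", "text/html", "200"),
    Record("http://example.org/missing", "text/html", "404"),
    Record("http://example.org/logo.png", "image/png", "200")
  )

  // Filters compose left to right: HTML only, then drop 404 responses.
  val result: Seq[Record] =
    sample.keepMimeTypes(Set("text/html")).discardHttpStatus(Set("404"))
}
```

Because every filter returns the same collection type, keep and discard steps can be combined in any order without intermediate variables.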