Filters MimeTypes from RDDs.
Removes all content that does not match any of the given regular expressions.
a list of Regular expressions to keep
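The content filter can be pictured with plain Scala collections instead of an RDD. This is a minimal sketch under stated assumptions, not the toolkit's implementation: the `Page` case class and the `keepContentSketch` helper are hypothetical names, and treating an unanchored search as "passing" the test is an assumption.

```scala
import scala.util.matching.Regex

// Hypothetical stand-in for an archive record; not a toolkit type.
final case class Page(url: String, content: String)

object KeepContentSketch {
  // Keep a page when at least one pattern is found somewhere in its content.
  // Assumption: an unanchored search counts as passing the test.
  def keepContentSketch(pages: Seq[Page], patterns: Set[Regex]): Seq[Page] =
    pages.filter(p => patterns.exists(_.findFirstIn(p.content).isDefined))

  def main(args: Array[String]): Unit = {
    val pages = Seq(
      Page("http://example.org/a", "Election results for 2015"),
      Page("http://example.org/b", "Recipe for apple pie")
    )
    // Case-insensitive pattern; only the first page mentions an election.
    val kept = keepContentSketch(pages, Set("(?i)election".r))
    kept.foreach(p => println(p.url))
  }
}
```

In this sketch only `http://example.org/a` survives, because it is the only page whose content matches a pattern in the set.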
Removes all data that does not have a selected date.
a list of dates to keep
the selected DateComponent enum value
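The interaction between the list of dates and the date component can be sketched as follows. Everything here is illustrative: the `Component` values are hypothetical stand-ins for the DateComponent enum, and the crawl date is assumed to be a `yyyyMMdd` string.

```scala
object KeepDateSketch {
  // Hypothetical mirror of a date-component selector; the names are assumptions.
  sealed trait Component
  case object Year extends Component
  case object Month extends Component
  case object Day extends Component
  case object YearMonth extends Component
  case object FullDate extends Component

  // Extract the selected component from a crawl date (assumed yyyyMMdd).
  def extract(crawlDate: String, c: Component): String = c match {
    case Year      => crawlDate.substring(0, 4)
    case Month     => crawlDate.substring(4, 6)
    case Day       => crawlDate.substring(6, 8)
    case YearMonth => crawlDate.substring(0, 6)
    case FullDate  => crawlDate.substring(0, 8)
  }

  // Keep only the dates whose selected component appears in the keep list.
  def keepDateSketch(dates: Seq[String], keep: List[String], c: Component): Seq[String] =
    dates.filter(d => keep.contains(extract(d, c)))

  def main(args: Array[String]): Unit = {
    val crawlDates = Seq("20080430", "20090115", "20081201")
    println(keepDateSketch(crawlDates, List("2008"), Year))
  }
}
```

With the `Year` component and the keep list `List("2008")`, the two crawl dates from 2008 are retained and the 2009 date is dropped.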
Removes all data but selected source domains.
a Set of urls for the source domains to keep
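A domain filter of this kind might look like the sketch below. It is a simplified illustration on plain strings: `domainOf` and `keepDomainsSketch` are hypothetical helpers, and stripping a leading `www.` before comparing is an assumption rather than documented behavior.

```scala
import java.net.URI

object KeepDomainsSketch {
  // Host part of a URL, with a leading "www." dropped (an assumption here).
  def domainOf(url: String): String = {
    val host = Option(new URI(url).getHost).getOrElse("")
    host.stripPrefix("www.")
  }

  // Keep only URLs whose domain is in the selected set.
  def keepDomainsSketch(urls: Seq[String], domains: Set[String]): Seq[String] =
    urls.filter(u => domains.contains(domainOf(u)))

  def main(args: Array[String]): Unit = {
    val urls = Seq(
      "http://www.archive.org/about/",
      "http://example.com/index.html"
    )
    keepDomainsSketch(urls, Set("archive.org")).foreach(println)
  }
}
```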
Removes all data except images.
Removes all data not in selected language.
a Set of ISO 639-2 codes
Removes all data but selected mimeTypes.
a Set of Mimetypes to keep
Removes all data but selected url patterns.
a Set of Regular Expressions to keep
Removes all data but selected exact URLs.
a Set of URLs to keep
Removes all non-html-based data (images, executables etc.) from html text.
A Wrapper class around RDD to allow RDDs of type ARCRecord and WARCRecord to be queried via a fluent API.
To load such an RDD, please see RecordLoader.
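The fluent style can be illustrated without Spark by wrapping a plain `Seq` in an implicit class. This is only a sketch of the pattern: `Rec`, `RecOps`, and the `...Sketch` method names are hypothetical, and the real wrapper operates on RDDs of ARCRecord and WARCRecord.

```scala
object FluentSketch {
  // Hypothetical record type standing in for an archive record.
  final case class Rec(url: String, mimeType: String)

  // Wrapper that adds chainable filters to a plain Seq, standing in for an RDD.
  implicit class RecOps(val recs: Seq[Rec]) extends AnyVal {
    def keepMimeTypesSketch(types: Set[String]): Seq[Rec] =
      recs.filter(r => types.contains(r.mimeType))

    def keepUrlsSketch(urls: Set[String]): Seq[Rec] =
      recs.filter(r => urls.contains(r.url))
  }

  def main(args: Array[String]): Unit = {
    val recs = Seq(
      Rec("http://example.org/", "text/html"),
      Rec("http://example.org/logo.png", "image/png"),
      Rec("http://example.org/about", "text/html")
    )
    // Each filter returns a Seq[Rec], so the next filter can be chained onto it.
    val kept = recs
      .keepMimeTypesSketch(Set("text/html"))
      .keepUrlsSketch(Set("http://example.org/"))
    kept.foreach(println)
  }
}
```

Because every filter returns the same collection type it was called on, the implicit wrapper applies again at each step, which is what makes the chained call style possible.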