Filters detected content (regex).
a list of regular expressions
Filters detected dates.
a list of dates
Filters detected domains.
a list of source domains
Filters detected HTTP status codes.
a list of HTTP status codes
Filters detected language.
a set of ISO 639-2 codes
Filters ArchiveRecord MIME types (as reported by the web server).
a list of MIME types
Filters detected URL patterns (regex).
a list of regular expressions
Filters detected URLs.
a list of URLs
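The filters above all follow the same pattern: a record is dropped when one of its fields matches any entry in the supplied list. The sketch below illustrates that pattern in plain Python. It is not the toolkit's API: the record dictionaries, the function names `discard_url_patterns` and `discard_http_status`, and the choice of full-string regex matching are all assumptions made for illustration.

```python
import re

# Hypothetical records: plain dicts standing in for archive records.
records = [
    {"url": "http://example.com/index.html", "status": "200"},
    {"url": "http://example.com/logo.png", "status": "200"},
    {"url": "http://example.org/missing", "status": "404"},
]


def discard_url_patterns(rows, patterns):
    """Drop every row whose URL fully matches any of the given regexes."""
    compiled = [re.compile(p) for p in patterns]
    return [r for r in rows if not any(c.fullmatch(r["url"]) for c in compiled)]


def discard_http_status(rows, codes):
    """Drop every row whose HTTP status code is in the given list."""
    blocked = set(codes)
    return [r for r in rows if r["status"] not in blocked]


# Discard PNG URLs first, then discard 404 responses.
remaining = discard_http_status(discard_url_patterns(records, [r".*\.png"]), ["404"])
print([r["url"] for r in remaining])  # ['http://example.com/index.html']
```

An empty pattern or code list discards nothing in this sketch, since no record can match an entry that is not there.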
Removes all content that does not match the given regular expressions.
a list of regular expressions to keep
Removes all data that does not match a selected date.
a list of dates
the selected DateComponent string
Removes all data but selected source domains.
Removes all data that does not have selected HTTP status codes.
a list of HTTP status codes
Removes all data except images.
Removes all data not in selected language.
a set of ISO 639-2 codes
Removes all data except the specified MIME types.
a list of MIME types
Removes all data except the specified MIME types, as detected by Apache Tika.
Removes all data but selected URL patterns.
a list of regular expressions
Removes all data but selected exact URLs.
a list of URLs to keep
Removes all non-HTML data (images, executables, etc.), leaving only HTML text.
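The "keep" filters above invert the earlier pattern: a record survives only when its field matches one of the supplied values. The plain-Python sketch below shows this for dates and languages, including how a date component string might narrow the comparison to part of the crawl timestamp. Everything here is an assumption for illustration: the record layout, the 14-digit timestamp format, the component names in `COMPONENT_SLICES`, and the function names are hypothetical and not taken from the toolkit.

```python
# Hypothetical records with 14-digit crawl timestamps (YYYYMMDDHHMMSS).
records = [
    {"url": "http://example.com/a", "crawl_date": "20080430120000", "language": "eng"},
    {"url": "http://example.com/b", "crawl_date": "20080512093000", "language": "fra"},
    {"url": "http://example.com/c", "crawl_date": "20150101000000", "language": "eng"},
]

# Assumed mapping from a date-component name to the timestamp slice it covers.
COMPONENT_SLICES = {
    "YYYY": slice(0, 4),
    "MM": slice(4, 6),
    "DD": slice(6, 8),
    "YYYYMM": slice(0, 6),
    "YYYYMMDD": slice(0, 8),
}


def keep_date(rows, dates, component="YYYYMMDD"):
    """Keep rows whose selected date component equals one of the given dates."""
    part = COMPONENT_SLICES[component]
    wanted = set(dates)
    return [r for r in rows if r["crawl_date"][part] in wanted]


def keep_languages(rows, codes):
    """Keep rows whose language code is in the given set of ISO 639-2 codes."""
    wanted = set(codes)
    return [r for r in rows if r["language"] in wanted]


# Keep records crawled in 2008, then keep only the English ones.
from_2008 = keep_date(records, ["2008"], "YYYY")
english_2008 = keep_languages(from_2008, {"eng"})
print([r["url"] for r in english_2008])  # ['http://example.com/a']
```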
A wrapper class around DataFrame (DF) that allows DFs of type ARCRecord and WARCRecord to be queried via a fluent API.
To load such a DF, use RecordLoader and apply .all() to it.
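A fluent API of this kind usually means each filter returns a new wrapped collection, so that calls can be chained. The sketch below shows that design with a plain-Python class. It is only an illustration of the idea: `RecordSet`, its method names, and the record fields are hypothetical, and the class wraps a list rather than a Spark DataFrame.

```python
class RecordSet:
    """Hypothetical wrapper that adds chainable filters to a list of records."""

    def __init__(self, rows):
        self.rows = list(rows)

    def keep_http_status(self, codes):
        """Return a new RecordSet with only the given status codes."""
        allowed = set(codes)
        return RecordSet(r for r in self.rows if r["status"] in allowed)

    def discard_domains(self, domains):
        """Return a new RecordSet without the given source domains."""
        blocked = set(domains)
        return RecordSet(r for r in self.rows if r["domain"] not in blocked)

    def urls(self):
        """Return the URLs of the remaining records."""
        return [r["url"] for r in self.rows]


rows = [
    {"url": "http://example.com/", "domain": "example.com", "status": "200"},
    {"url": "http://ads.example.net/x", "domain": "ads.example.net", "status": "200"},
    {"url": "http://example.com/gone", "domain": "example.com", "status": "404"},
]

# Each call returns a new RecordSet, so filters can be chained in any order.
result = RecordSet(rows).keep_http_status(["200"]).discard_domains(["ads.example.net"]).urls()
print(result)  # ['http://example.com/']
```

Returning a new wrapper from every method, rather than mutating in place, keeps intermediate results reusable and mirrors how DataFrame transformations generally behave.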