Filters out records with the given MIME types from the RDD.

Removes all content that does not match the given regular expressions.
a list of regular expressions to keep
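As a rough illustration of the regular-expression filter described above, here is a minimal sketch on plain Scala collections rather than Spark RDDs. The `Record` case class and the `keepContent` name are hypothetical stand-ins, and treating a record as kept when at least one pattern is found anywhere in its content is an assumption about how multiple patterns combine.

```scala
import scala.util.matching.Regex

object KeepContentSketch {
  // Hypothetical stand-in for an archive record; not the library's own type.
  final case class Record(url: String, content: String)

  // Keep a record when at least one pattern is found somewhere in its content.
  // ("At least one" is an assumption; the source does not say how patterns combine.)
  def keepContent(records: Seq[Record], patterns: Set[Regex]): Seq[Record] =
    records.filter(r => patterns.exists(_.findFirstIn(r.content).isDefined))
}
```

With an RDD the same predicate could be passed to `filter`; the sketch uses `Seq` only so that it runs without a Spark context.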
 
      
    
Removes all data that does not match the selected dates.
a list of dates to keep
the selected DateComponent enum value
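The interplay between the list of dates and the date component can be sketched as follows. Everything here is illustrative: the `DateComponent` values shown, the `Record` type, the `keepDate` name, and the assumption that crawl dates are strings beginning with `yyyyMMdd` are not taken from the source.

```scala
object KeepDateSketch {
  // Hypothetical date components; the real enum may define different values.
  sealed trait DateComponent
  case object YYYY extends DateComponent
  case object MM extends DateComponent
  case object DD extends DateComponent
  case object YYYYMMDD extends DateComponent

  // Hypothetical record whose crawl date is assumed to start with yyyyMMdd.
  final case class Record(url: String, crawlDate: String)

  // Cut the requested component out of a yyyyMMdd string.
  def extract(date: String, component: DateComponent): String = component match {
    case YYYY     => date.substring(0, 4)
    case MM       => date.substring(4, 6)
    case DD       => date.substring(6, 8)
    case YYYYMMDD => date.substring(0, 8)
  }

  // Keep records whose extracted component appears in the list of dates.
  // Records with a date shorter than eight characters are dropped.
  def keepDate(records: Seq[Record], dates: List[String], component: DateComponent): Seq[Record] =
    records.filter(r => r.crawlDate.length >= 8 && dates.contains(extract(r.crawlDate, component)))
}
```

Under these assumptions, `keepDate(records, List("2008"), YYYY)` would keep everything crawled in 2008, while `keepDate(records, List("12"), MM)` would keep every December capture regardless of year.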
 
      
    
Removes all data except records from the selected source domains.
a Set of URLs for the source domains to keep
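A domain filter of this kind might look like the sketch below. The `Record` type, the `keepDomains` and `domainOf` names, and the choice to strip a leading `www.` before comparing are all assumptions made for illustration; the library may normalise domains differently.

```scala
import java.net.URI
import scala.util.Try

object KeepDomainsSketch {
  // Hypothetical stand-in for an archive record.
  final case class Record(url: String)

  // Extract the host of a URL, dropping a leading "www." (an assumption).
  // Returns None for malformed URLs or URLs without a host.
  def domainOf(url: String): Option[String] =
    Try(new URI(url).getHost).toOption
      .flatMap(Option(_))
      .map(_.stripPrefix("www."))

  // Keep only records whose source domain is in the given set.
  def keepDomains(records: Seq[Record], domains: Set[String]): Seq[Record] =
    records.filter(r => domainOf(r.url).exists(domains.contains))
}
```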
 
      
    
      Removes all data except images.
 
      
    
Removes all data not in the selected languages.
a Set of ISO 639-2 codes
 
      
    
Removes all data except the selected MIME types.
a Set of MIME types to keep
 
      
    
Removes all data except URLs matching the selected patterns.
a Set of regular expressions to keep
 
      
    
Removes all data except the selected exact URLs.
a Set of URLs to keep
 
      
    
Removes all non-HTML data (images, executables, etc.), leaving only HTML pages.

A wrapper class around RDD that allows RDDs of type ARCRecord and WARCRecord to be queried via a fluent API.
To load such an RDD, please see RecordLoader.
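To show what a fluent wrapper of this kind can look like, here is a minimal sketch that uses a plain `Seq` in place of an RDD so it runs without Spark. The `Record` type and the method names `keepMimeTypes` and `keepUrlPatterns` are hypothetical, and requiring a URL pattern to match the whole URL is an assumption.

```scala
import scala.util.matching.Regex

object FluentSketch {
  // Hypothetical stand-in for an archive record.
  final case class Record(url: String, mimeType: String)

  // The implicit wrapper adds filter methods to any Seq[Record].
  // Each method returns a Seq[Record], so calls can be chained.
  implicit class RecordOps(records: Seq[Record]) {
    // Keep records whose MIME type is in the given set.
    def keepMimeTypes(types: Set[String]): Seq[Record] =
      records.filter(r => types.contains(r.mimeType))

    // Keep records whose URL fully matches at least one pattern (assumption).
    def keepUrlPatterns(patterns: Set[Regex]): Seq[Record] =
      records.filter(r => patterns.exists(_.pattern.matcher(r.url).matches()))
  }
}
```

After `import FluentSketch._`, a query reads as a chain such as `records.keepMimeTypes(Set("text/html")).keepUrlPatterns(Set("http://a\\.org/.*".r))`. Wrapping an RDD instead of a `Seq` would follow the same pattern, with each method delegating to the RDD's `filter`.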