io.archivesunleashed

WARecordRDD

implicit class WARecordRDD extends Serializable

A wrapper class around RDD that allows RDDs of type ArchiveRecord to be queried via a fluent API.

To load such an RDD, please see RecordLoader.
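
The wrapper pattern described above can be modelled without Spark. This is a minimal sketch, not the toolkit's implementation: Rec, RecOps, and the sample data are hypothetical names, and a plain Seq stands in for RDD[ArchiveRecord]. It only illustrates how an implicit class adds chainable filter methods to a collection type.

```scala
object FluentSketch {
  // Hypothetical stand-in for ArchiveRecord; the real type carries many more fields.
  final case class Rec(url: String, domain: String, status: String)

  // Implicit wrapper: any Seq[Rec] in scope gains the methods below.
  // A plain Seq is used here (an assumption) so the sketch runs without Spark.
  implicit class RecOps(val recs: Seq[Rec]) extends Serializable {
    // Keep only records whose domain is in the given set.
    def keepDomains(domains: Set[String]): Seq[Rec] =
      recs.filter(r => domains.contains(r.domain))

    // Drop records whose HTTP status code is in the given set.
    def discardHttpStatus(codes: Set[String]): Seq[Rec] =
      recs.filter(r => !codes.contains(r.status))
  }

  val sample: Seq[Rec] = Seq(
    Rec("http://example.com/", "example.com", "200"),
    Rec("http://example.com/missing", "example.com", "404"),
    Rec("http://other.org/", "other.org", "200")
  )

  // Each call returns a Seq[Rec], so the implicit conversion applies again
  // and the calls chain in a fluent style.
  val kept: Seq[Rec] =
    sample.keepDomains(Set("example.com")).discardHttpStatus(Set("404"))

  def main(args: Array[String]): Unit = kept.foreach(println)
}
```

Because every filter returns the same collection type it was called on, filters can be combined in any order; in the sketch, only the first sample record survives both steps.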

Linear Supertypes
Serializable, AnyRef, Any

Instance Constructors

  1. new WARecordRDD(rdd: RDD[ArchiveRecord])

Value Members

  1. def all(): DataFrame
  2. def audio(): DataFrame
  3. def css(): DataFrame
  4. def discardContent(contentREs: Set[Regex]): RDD[ArchiveRecord]

    Filters out records whose content matches any of the given regular expressions.

    contentREs

    a set of regular expressions

  5. def discardDate(dates: List[String], component: DateComponent = DateComponent.YYYYMMDD): RDD[ArchiveRecord]

    Filters out records whose crawl date matches any of the given dates.

  6. def discardDomains(urls: Set[String]): RDD[ArchiveRecord]

    Filters out records from the given source domains.

    urls

    a set of URLs for the source domains

  7. def discardHttpStatus(statusCodes: Set[String]): RDD[ArchiveRecord]

    Filters out records with any of the given HTTP status codes.

    statusCodes

    a set of HTTP status codes

  8. def discardLanguages(lang: Set[String]): RDD[ArchiveRecord]

    Filters out records whose detected language is in the given set.

    lang

    a set of ISO 639-2 codes

  9. def discardMimeTypes(mimeTypes: Set[String]): RDD[ArchiveRecord]

    Filters out records whose MIME type, as reported by the web server, is in the given set.

    mimeTypes

    a set of MIME types

  10. def discardMimeTypesTika(mimeTypes: Set[String]): RDD[ArchiveRecord]

    Filters out records whose MIME type, as detected by Tika, is in the given set.

    mimeTypes

    a set of MIME types

  11. def discardUrlPatterns(urlREs: Set[Regex]): RDD[ArchiveRecord]

    Filters out records whose URL matches any of the given regular expressions.

    urlREs

    a set of regular expressions

  12. def discardUrls(urls: Set[String]): RDD[ArchiveRecord]

    Filters out records whose URL is in the given set.

    urls

    a set of URLs

  13. def html(): DataFrame
  14. def imagegraph(): DataFrame
  15. def images(): DataFrame
  16. def js(): DataFrame
  17. def json(): DataFrame
  18. def keepContent(contentREs: Set[Regex]): RDD[ArchiveRecord]

    Removes all records whose content does not match any of the given regular expressions.

    contentREs

    a set of regular expressions to keep

  19. def keepDate(dates: List[String], component: DateComponent = DateComponent.YYYYMMDD): RDD[ArchiveRecord]

    Removes all data that does not have one of the selected dates.

    dates

    a list of dates

    component

    the selected DateComponent enum value

  20. def keepDomains(urls: Set[String]): RDD[ArchiveRecord]

    Removes all data except that from the selected source domains.

    urls

    a set of URLs for the source domains

  21. def keepHttpStatus(statusCodes: Set[String]): RDD[ArchiveRecord]

    Removes all data that does not have one of the selected HTTP status codes.

    statusCodes

    a set of HTTP status codes

  22. def keepImages(): RDD[ArchiveRecord]

    Removes all data except images.

  23. def keepLanguages(lang: Set[String]): RDD[ArchiveRecord]

    Removes all data not in one of the selected languages.

    lang

    a set of ISO 639-2 codes

  24. def keepMimeTypes(mimeTypes: Set[String]): RDD[ArchiveRecord]

    Removes all data except records with the specified MIME types (as reported by the web server).

    mimeTypes

    a set of MIME types

  25. def keepMimeTypesTika(mimeTypes: Set[String]): RDD[ArchiveRecord]

    Removes all data except records with the selected MIME types, as detected by Tika.

    mimeTypes

    a set of MIME types

  26. def keepUrlPatterns(urlREs: Set[Regex]): RDD[ArchiveRecord]

    Removes all data except records whose URL matches one of the selected patterns.

    urlREs

    a set of regular expressions

  27. def keepUrls(urls: Set[String]): RDD[ArchiveRecord]

    Removes all data except records with the selected exact URLs.

    urls

    a set of URLs to keep

  28. def keepValidPages(): RDD[ArchiveRecord]

    Removes all non-HTML-based data (images, executables, etc.), keeping only HTML text.

  29. def pdfs(): DataFrame
  30. def plainText(): DataFrame
  31. def presentationProgramFiles(): DataFrame
  32. def removeFiledesc(): RDD[ArchiveRecord]

    Filters out filedesc:// and dns: records.

  33. def spreadsheets(): DataFrame
  34. def videos(): DataFrame
  35. def webgraph(): DataFrame

    Extracts a webgraph with columns for crawl date, source URL, destination URL, and anchor text.

  36. def webpages(): DataFrame

    Extracts webpages with columns for crawl date, URL, MIME type, and content.

  37. def wordProcessorFiles(): DataFrame
  38. def xml(): DataFrame
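
The keep and discard members listed above come in complementary pairs, and the date filters take a component that selects how much of the crawl date is compared. The sketch below models both ideas on a plain Seq. It is an illustration under stated assumptions, not the toolkit's code: Rec, Component, and the helper functions are hypothetical, crawl dates are assumed to be digit strings beginning with YYYYMMDD, and a pattern is assumed to have to match the whole URL.

```scala
import scala.util.matching.Regex

object KeepDiscardSketch {
  // Hypothetical stand-in for ArchiveRecord with only the fields used here.
  final case class Rec(url: String, crawlDate: String)

  // Hypothetical subset of date components; each one names a prefix length
  // of a crawl date assumed to start with YYYYMMDD.
  sealed abstract class Component(val length: Int)
  case object YYYY extends Component(4)
  case object YYYYMM extends Component(6)
  case object YYYYMMDD extends Component(8)

  // Assumption: a pattern must match the entire URL, not just a substring.
  private def matchesAny(url: String, res: Set[Regex]): Boolean =
    res.exists(re => re.pattern.matcher(url).matches())

  // keep retains matching records; discard retains exactly the rest.
  def keepUrlPatterns(recs: Seq[Rec], res: Set[Regex]): Seq[Rec] =
    recs.filter(r => matchesAny(r.url, res))

  def discardUrlPatterns(recs: Seq[Rec], res: Set[Regex]): Seq[Rec] =
    recs.filterNot(r => matchesAny(r.url, res))

  // Compare only the selected component (a prefix) of each crawl date.
  def keepDate(recs: Seq[Rec], dates: List[String],
               component: Component = YYYYMMDD): Seq[Rec] =
    recs.filter(r => dates.contains(r.crawlDate.take(component.length)))

  val sample: Seq[Rec] = Seq(
    Rec("http://example.com/a.html", "20080430"),
    Rec("http://example.com/b.png", "20080501"),
    Rec("http://other.org/index.html", "20091218")
  )

  val htmlPattern: Set[Regex] = Set(""".*\.html""".r)
  val htmlOnly: Seq[Rec] = keepUrlPatterns(sample, htmlPattern)
  val noHtml: Seq[Rec] = discardUrlPatterns(sample, htmlPattern)
  val from2008: Seq[Rec] = keepDate(sample, List("2008"), YYYY)
}
```

In this model, keeping and discarding with the same argument partition the input, so the two results together contain every original record exactly once.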