io.archivesunleashed

WARecordRDD

implicit class WARecordRDD extends Serializable

A wrapper class around RDD that allows RDDs of type ArchiveRecord to be queried through a fluent API.

To load such an RDD, please see RecordLoader.
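
The fluent style comes from the implicit-class mechanism: wrapping a collection in an implicit class makes the wrapper's methods callable directly on the collection. The following is a minimal, self-contained sketch of that pattern only. It uses a plain Seq and a hypothetical Record case class in place of RDD and ArchiveRecord, and the two method bodies are assumptions made for illustration, not the library's implementation.

```scala
object FluentSketch {
  // Hypothetical stand-in for ArchiveRecord (assumption: the real type differs).
  final case class Record(url: String, domain: String, httpStatus: String)

  // Any Seq[Record] in scope gains the methods below, the same way an
  // RDD[ArchiveRecord] gains the methods of WARecordRDD.
  implicit class RecordSeqOps(records: Seq[Record]) {
    // Assumed semantics: keep only records whose domain is in the set.
    def keepDomains(domains: Set[String]): Seq[Record] =
      records.filter(r => domains.contains(r.domain))

    // Assumed semantics: drop records whose status code is in the set.
    def discardHttpStatus(codes: Set[String]): Seq[Record] =
      records.filterNot(r => codes.contains(r.httpStatus))
  }

  val sample: Seq[Record] = Seq(
    Record("http://example.com/", "example.com", "200"),
    Record("http://example.com/missing", "example.com", "404"),
    Record("http://other.org/", "other.org", "200")
  )

  // Each call returns the collection type again, so calls chain.
  val kept: Seq[Record] =
    sample.keepDomains(Set("example.com")).discardHttpStatus(Set("404"))
}
```

Because every filter returns the same collection type it was called on, the implicit conversion applies again at each step, which is what lets the calls be chained.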

Linear Supertypes
Serializable, AnyRef, Any

Instance Constructors

  1. new WARecordRDD(rdd: RDD[ArchiveRecord])

Value Members

  1. final def !=(arg0: Any): Boolean
    Definition Classes
    AnyRef → Any
  2. final def ##(): Int
    Definition Classes
    AnyRef → Any
  3. final def ==(arg0: Any): Boolean
    Definition Classes
    AnyRef → Any
  4. def all(): DataFrame
  5. final def asInstanceOf[T0]: T0
    Definition Classes
    Any
  6. def audio(): DataFrame
  7. def clone(): AnyRef
    Attributes
    protected[lang]
    Definition Classes
    AnyRef
    Annotations
    @throws( ... ) @native() @HotSpotIntrinsicCandidate()
  8. def css(): DataFrame
  9. def discardContent(contentREs: Set[Regex]): RDD[ArchiveRecord]

    Filters out records whose content matches any of the given regular expressions.

    contentREs

    a set of regular expressions

  10. def discardDate(dates: List[String], component: DateComponent = DateComponent.YYYYMMDD): RDD[ArchiveRecord]

    Filters out records that have any of the given dates.

  11. def discardDomains(urls: Set[String]): RDD[ArchiveRecord]

    Filters out records from the given source domains.

    urls

    a set of URLs for the source domains

  12. def discardHttpStatus(statusCodes: Set[String]): RDD[ArchiveRecord]

    Filters out records that have any of the given HTTP status codes.

    statusCodes

    a set of HTTP status codes

  13. def discardLanguages(lang: Set[String]): RDD[ArchiveRecord]

    Filters out records whose detected language is in the given set.

    lang

    a set of ISO 639-2 codes

  14. def discardMimeTypes(mimeTypes: Set[String]): RDD[ArchiveRecord]

    Filters out records with any of the given MIME types, as reported by the web server.

    mimeTypes

    a set of MIME types

  15. def discardMimeTypesTika(mimeTypes: Set[String]): RDD[ArchiveRecord]

    Filters out records with any of the given MIME types, as detected by Tika.

    mimeTypes

    a set of MIME types

  16. def discardUrlPatterns(urlREs: Set[Regex]): RDD[ArchiveRecord]

    Filters out records whose URL matches any of the given patterns (regex).

    urlREs

    a set of regular expressions

  17. def discardUrls(urls: Set[String]): RDD[ArchiveRecord]

    Filters out records that have any of the given URLs.

    urls

    a set of URLs

  18. final def eq(arg0: AnyRef): Boolean
    Definition Classes
    AnyRef
  19. def equals(arg0: Any): Boolean
    Definition Classes
    AnyRef → Any
  20. final def getClass(): Class[_]
    Definition Classes
    AnyRef → Any
    Annotations
    @native() @HotSpotIntrinsicCandidate()
  21. def hashCode(): Int
    Definition Classes
    AnyRef → Any
    Annotations
    @native() @HotSpotIntrinsicCandidate()
  22. def html(): DataFrame
  23. def imagegraph(): DataFrame
  24. def images(): DataFrame
  25. final def isInstanceOf[T0]: Boolean
    Definition Classes
    Any
  26. def js(): DataFrame
  27. def json(): DataFrame
  28. def keepContent(contentREs: Set[Regex]): RDD[ArchiveRecord]

    Removes all content that does not match the given regular expressions.

    contentREs

    a set of regular expressions to keep

  29. def keepDate(dates: List[String], component: DateComponent = DateComponent.YYYYMMDD): RDD[ArchiveRecord]

    Removes all data that does not have one of the selected dates.

    dates

    a list of dates

    component

    the selected DateComponent enum value

  30. def keepDomains(urls: Set[String]): RDD[ArchiveRecord]

    Removes all data except that from the selected source domains.

    urls

    a set of URLs for the source domains

  31. def keepHttpStatus(statusCodes: Set[String]): RDD[ArchiveRecord]

    Removes all data that does not have one of the selected HTTP status codes.

    statusCodes

    a set of HTTP status codes

  32. def keepImages(): RDD[ArchiveRecord]

    Removes all data except images.

  33. def keepLanguages(lang: Set[String]): RDD[ArchiveRecord]

    Removes all data not in the selected languages.

    lang

    a set of ISO 639-2 codes

  34. def keepMimeTypes(mimeTypes: Set[String]): RDD[ArchiveRecord]

    Removes all data except the specified MIME types.

    mimeTypes

    a set of MIME types

  35. def keepMimeTypesTika(mimeTypes: Set[String]): RDD[ArchiveRecord]

    Removes all data except the selected MIME types, as detected by Tika.

    mimeTypes

    a set of MIME types

  36. def keepUrlPatterns(urlREs: Set[Regex]): RDD[ArchiveRecord]

    Removes all data except the selected URL patterns.

    urlREs

    a set of regular expressions

  37. def keepUrls(urls: Set[String]): RDD[ArchiveRecord]

    Removes all data except the selected exact URLs.

    urls

    a set of URLs to keep

  38. def keepValidPages(): RDD[ArchiveRecord]

    Removes all non-HTML-based data (images, executables, etc.), keeping only HTML text.

  39. final def ne(arg0: AnyRef): Boolean
    Definition Classes
    AnyRef
  40. final def notify(): Unit
    Definition Classes
    AnyRef
    Annotations
    @native() @HotSpotIntrinsicCandidate()
  41. final def notifyAll(): Unit
    Definition Classes
    AnyRef
    Annotations
    @native() @HotSpotIntrinsicCandidate()
  42. def pdfs(): DataFrame
  43. def plainText(): DataFrame
  44. def presentationProgramFiles(): DataFrame
  45. def removeFiledesc(): RDD[ArchiveRecord]

    Filters out filedesc:// and dns: records.

  46. def spreadsheets(): DataFrame
  47. final def synchronized[T0](arg0: ⇒ T0): T0
    Definition Classes
    AnyRef
  48. def toString(): String
    Definition Classes
    AnyRef → Any
  49. def videos(): DataFrame
  50. final def wait(arg0: Long, arg1: Int): Unit
    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  51. final def wait(arg0: Long): Unit
    Definition Classes
    AnyRef
    Annotations
    @throws( ... ) @native()
  52. final def wait(): Unit
    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  53. def webgraph(): DataFrame

    Extracts a webgraph with columns for crawl date, source URL, destination URL, and anchor text.

  54. def webpages(): DataFrame

    Extracts webpages with columns for crawl date, URL, MIME type, and content.

  55. def wordProcessorFiles(): DataFrame
  56. def xml(): DataFrame
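
Since the keep and discard methods listed above return RDD[ArchiveRecord] while the extraction methods return a DataFrame, a typical call chain narrows the records first and extracts last. The sketch below is one possible arrangement and not a prescribed workflow. It assumes Spark and this library are on the classpath and that importing io.archivesunleashed._ brings the implicit class into scope. The object name WebpagesSketch, the helper pagesForDomains, and the choice of "404" are hypothetical.

```scala
import io.archivesunleashed._
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.DataFrame

object WebpagesSketch {
  // Hypothetical helper: takes an already-loaded RDD (see RecordLoader),
  // applies a few of the filters listed above, then extracts webpages.
  def pagesForDomains(records: RDD[ArchiveRecord], domains: Set[String]): DataFrame =
    records
      .removeFiledesc()               // drop filedesc:// and dns: records
      .keepDomains(domains)           // restrict to the given source domains
      .discardHttpStatus(Set("404"))  // status codes are passed as strings
      .webpages()                     // DataFrame of webpages
}
```

Taking the RDD as a parameter keeps the sketch independent of how the archive files are loaded.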

Deprecated Value Members

  1. def finalize(): Unit
    Attributes
    protected[lang]
    Definition Classes
    AnyRef
    Annotations
    @throws( classOf[java.lang.Throwable] ) @Deprecated @deprecated
    Deprecated

    (Since version ) see corresponding Javadoc for more information.
