Class

io.archivesunleashed

WARecordRDD

implicit class WARecordRDD extends Serializable

A wrapper class around RDD that allows RDDs of ARC and WARC records to be queried via a fluent API.

To load such an RDD, please see RecordLoader.
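
As a rough illustration of the fluent style, the sketch below reproduces the implicit-class pattern in plain Scala. `Record`, `RecordOps`, `FluentSketch`, and the sample data are hypothetical stand-ins (a `Seq` replaces the RDD so the snippet runs without Spark). Only the method names `keepDomains` and `discardHttpStatus` mirror this class, and their bodies are assumed semantics, not the toolkit's implementation.

```scala
// Hypothetical stand-in for ArchiveRecord; not part of the toolkit.
final case class Record(url: String, domain: String, httpStatus: String)

object FluentSketch {
  // The implicit class adds filter methods to any Seq[Record] in scope,
  // much as WARecordRDD adds them to an RDD[ArchiveRecord].
  implicit class RecordOps(records: Seq[Record]) {
    // Assumed semantics: keep records whose domain is in the set.
    def keepDomains(domains: Set[String]): Seq[Record] =
      records.filter(r => domains.contains(r.domain))

    // Assumed semantics: drop records whose status code is in the set.
    def discardHttpStatus(statusCodes: Set[String]): Seq[Record] =
      records.filterNot(r => statusCodes.contains(r.httpStatus))
  }

  val sample: Seq[Record] = Seq(
    Record("http://example.com/", "example.com", "200"),
    Record("http://example.com/missing", "example.com", "404"),
    Record("http://other.org/", "other.org", "200")
  )

  // Each call returns a plain collection, so the filters chain fluently.
  val kept: Seq[Record] =
    sample.keepDomains(Set("example.com")).discardHttpStatus(Set("404"))
}
```

Because every filter returns the underlying collection type rather than the wrapper, the implicit conversion is applied again at each step of the chain.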

Linear Supertypes
Serializable, AnyRef, Any

Instance Constructors

  1. new WARecordRDD(rdd: RDD[ArchiveRecord])


Value Members

  1. final def !=(arg0: Any): Boolean

    Definition Classes
    AnyRef → Any
  2. final def ##(): Int

    Definition Classes
    AnyRef → Any
  3. final def ==(arg0: Any): Boolean

    Definition Classes
    AnyRef → Any
  4. final def asInstanceOf[T0]: T0

    Definition Classes
    Any
  5. def clone(): AnyRef

    Attributes
    protected[java.lang]
    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  6. def discardContent(contentREs: Set[Regex]): RDD[ArchiveRecord]

    Removes records whose content matches any of the given regular expressions.

    contentREs

    a set of regular expressions

  7. def discardDate(date: String): RDD[ArchiveRecord]

    Removes records with the given date.

    date

    the date to discard

  8. def discardDomains(urls: Set[String]): RDD[ArchiveRecord]

    Removes records from the given source domains.

    urls

    a set of URLs for the source domains

  9. def discardHttpStatus(statusCodes: Set[String]): RDD[ArchiveRecord]

    Removes records with any of the given HTTP status codes.

    statusCodes

    a set of HTTP status codes

  10. def discardMimeTypes(mimeTypes: Set[String]): RDD[ArchiveRecord]

    Removes records whose MIME type, as specified in the ArchiveRecord, is in the given set.

    mimeTypes

    a set of MIME types

  11. def discardMimeTypesTika(mimeTypes: Set[String]): RDD[ArchiveRecord]

    Removes records whose MIME type, as detected by Tika, is in the given set.

    mimeTypes

    a set of MIME types

  12. def discardUrlPatterns(urlREs: Set[Regex]): RDD[ArchiveRecord]

    Removes records whose URL matches any of the given patterns.

    urlREs

    a set of regular expressions

  13. def discardUrls(urls: Set[String]): RDD[ArchiveRecord]

    Removes records whose URL is in the given set.

    urls

    a set of URLs

  14. final def eq(arg0: AnyRef): Boolean

    Definition Classes
    AnyRef
  15. def equals(arg0: Any): Boolean

    Definition Classes
    AnyRef → Any
  16. def extractAudioDetailsDF(): DataFrame

  17. def extractHyperlinksDF(): DataFrame

  18. def extractImageDetailsDF(): DataFrame

  19. def extractImageLinksDF(): DataFrame

  20. def extractPDFDetailsDF(): DataFrame

  21. def extractPresentationProgramDetailsDF(): DataFrame

  22. def extractSpreadsheetDetailsDF(): DataFrame

  23. def extractTextFilesDetailsDF(): DataFrame

  24. def extractValidPagesDF(): DataFrame

  25. def extractVideoDetailsDF(): DataFrame

  26. def extractWordProcessorDetailsDF(): DataFrame

  27. def finalize(): Unit

    Attributes
    protected[java.lang]
    Definition Classes
    AnyRef
    Annotations
    @throws( classOf[java.lang.Throwable] )
  28. final def getClass(): Class[_]

    Definition Classes
    AnyRef → Any
  29. def hashCode(): Int

    Definition Classes
    AnyRef → Any
  30. final def isInstanceOf[T0]: Boolean

    Definition Classes
    Any
  31. def keepContent(contentREs: Set[Regex]): RDD[ArchiveRecord]

    Keeps only records whose content matches at least one of the given regular expressions.

    contentREs

    a set of regular expressions to keep

  32. def keepDate(dates: List[String], component: DateComponent = DateComponent.YYYYMMDD): RDD[ArchiveRecord]

    Keeps only records whose date matches one of the selected dates.

    dates

    a list of dates

    component

    the DateComponent enum value to match on (defaults to DateComponent.YYYYMMDD)

  33. def keepDomains(urls: Set[String]): RDD[ArchiveRecord]

    Keeps only records from the selected source domains.

    urls

    a set of URLs for the source domains

  34. def keepHttpStatus(statusCodes: Set[String]): RDD[ArchiveRecord]

    Keeps only records with one of the selected HTTP status codes.

    statusCodes

    a set of HTTP status codes

  35. def keepImages(): RDD[ArchiveRecord]

    Removes all data except images.

  36. def keepLanguages(lang: Set[String]): RDD[ArchiveRecord]

    Keeps only records in the selected languages.

    lang

    a set of ISO 639-2 codes

  37. def keepMimeTypes(mimeTypes: Set[String]): RDD[ArchiveRecord]

    Keeps only records whose MIME type, as specified in the ArchiveRecord, is in the given set.

    mimeTypes

    a set of MIME types

  38. def keepMimeTypesTika(mimeTypes: Set[String]): RDD[ArchiveRecord]

    Keeps only records whose MIME type, as detected by Tika, is in the given set.

    mimeTypes

    a set of MIME types

  39. def keepUrlPatterns(urlREs: Set[Regex]): RDD[ArchiveRecord]

    Keeps only records whose URL matches one of the selected patterns.

    urlREs

    a set of regular expressions

  40. def keepUrls(urls: Set[String]): RDD[ArchiveRecord]

    Keeps only records with one of the selected exact URLs.

    urls

    a set of URLs to keep

  41. def keepValidPages(): RDD[ArchiveRecord]

    Keeps only HTML pages, removing all non-HTML data (images, executables, etc.).

  42. final def ne(arg0: AnyRef): Boolean

    Definition Classes
    AnyRef
  43. final def notify(): Unit

    Definition Classes
    AnyRef
  44. final def notifyAll(): Unit

    Definition Classes
    AnyRef
  45. final def synchronized[T0](arg0: ⇒ T0): T0

    Definition Classes
    AnyRef
  46. def toString(): String

    Definition Classes
    AnyRef → Any
  47. final def wait(): Unit

    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  48. final def wait(arg0: Long, arg1: Int): Unit

    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  49. final def wait(arg0: Long): Unit

    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
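
The `component` parameter of `keepDate` listed among the value members suggests matching on only part of a date. A minimal sketch of that idea follows, assuming dates are `YYYYMMDD` strings compared on their leading characters. `DateSketch`, `Component`, its three values, and `Record` are hypothetical stand-ins for illustration, not the toolkit's `DateComponent` or `ArchiveRecord`, and the function operates on a `Seq` rather than an RDD.

```scala
object DateSketch {
  // Hypothetical stand-in for the toolkit's DateComponent enum.
  sealed trait Component { def length: Int }
  case object YYYY extends Component { val length = 4 }
  case object YYYYMM extends Component { val length = 6 }
  case object YYYYMMDD extends Component { val length = 8 }

  // Hypothetical record with a crawl date assumed to be a YYYYMMDD string.
  final case class Record(url: String, crawlDate: String)

  // Keep records whose date, truncated to the chosen component,
  // appears in the list of selected dates.
  def keepDate(
      records: Seq[Record],
      dates: List[String],
      component: Component = YYYYMMDD
  ): Seq[Record] =
    records.filter(r => dates.contains(r.crawlDate.take(component.length)))

  val sample: Seq[Record] = Seq(
    Record("a", "20080430"),
    Record("b", "20091215")
  )
}
```

Under these assumptions, `DateSketch.keepDate(DateSketch.sample, List("2008"), DateSketch.YYYY)` keeps only record `a`, while omitting the component requires a full eight-digit match.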
