Class

io.archivesunleashed

WARecordDF

implicit class WARecordDF extends Serializable

A wrapper class around DataFrame that allows DataFrames of ARCRecord and WARCRecord data to be queried via a fluent API.

To load such a DataFrame, use RecordLoader and apply .all() to the result.
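
The wrapper pattern described above can be illustrated without Spark. The following is a minimal sketch, not the library's implementation: `Record`, `RecordOps`, the method names without the `DF` suffix, and the sample data are hypothetical stand-ins, and a plain `Seq` takes the place of the DataFrame. It also assumes that a URL pattern matches when it is found anywhere in the URL; the library's actual matching rule may differ.

```scala
import scala.util.matching.Regex

object FluentFilterSketch {
  // Hypothetical stand-in for one row of an archive DataFrame.
  final case class Record(url: String, domain: String, httpStatus: String)

  // The implicit class adds keep*/discard* methods to any Seq[Record],
  // in the same way an implicit class can add methods to a DataFrame.
  implicit class RecordOps(records: Seq[Record]) {
    // Keep only records whose domain is in the given set.
    def keepDomains(domains: Set[String]): Seq[Record] =
      records.filter(r => domains.contains(r.domain))

    // Drop records whose status code is in the given set.
    def discardHttpStatus(statusCodes: Set[String]): Seq[Record] =
      records.filterNot(r => statusCodes.contains(r.httpStatus))

    // Assumption: a record is kept when any pattern occurs anywhere in its URL.
    def keepUrlPatterns(urlREs: Set[Regex]): Seq[Record] =
      records.filter(r => urlREs.exists(_.findFirstIn(r.url).isDefined))
  }

  val records: Seq[Record] = Seq(
    Record("http://example.org/a.html", "example.org", "200"),
    Record("http://example.org/missing", "example.org", "404"),
    Record("http://other.net/index.html", "other.net", "200")
  )

  // Each call returns a new collection, so the filters chain fluently.
  val kept: Seq[Record] = records
    .keepDomains(Set("example.org"))
    .discardHttpStatus(Set("404"))
    .keepUrlPatterns(Set("""\.html$""".r))
}
```

Because every method returns the same collection type it was called on, filters can be chained in any order; the real class achieves the same effect by returning a DataFrame from each method.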

Linear Supertypes
Serializable, AnyRef, Any

Instance Constructors

  1. new WARecordDF(df: DataFrame)

Value Members

  1. final def !=(arg0: Any): Boolean

    Definition Classes
    AnyRef → Any
  2. final def ##(): Int

    Definition Classes
    AnyRef → Any
  3. final def ==(arg0: Any): Boolean

    Definition Classes
    AnyRef → Any
  4. final def asInstanceOf[T0]: T0

    Definition Classes
    Any
  5. def clone(): AnyRef

    Attributes
    protected[java.lang]
    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  6. def discardContentDF(contentREs: Set[Regex]): DataFrame

    Removes records whose content matches any of the given regular expressions.

    contentREs

    a set of regular expressions

  7. def discardDateDF(date: String): DataFrame

    Removes records that match the given date.

    date

    the date to discard, as a string

  8. def discardDomainsDF(domains: Set[String]): DataFrame

    Removes records from the given source domains.

    domains

    a set of source domains to discard

  9. def discardHttpStatusDF(statusCodes: Set[String]): DataFrame

    Removes records with any of the given HTTP status codes.

    statusCodes

    a set of HTTP status codes

  10. def discardLanguagesDF(lang: Set[String]): DataFrame

    Removes records whose detected language is in the given set.

    lang

    a set of ISO 639-2 codes

  11. def discardMimeTypesDF(mimeTypes: Set[String]): DataFrame

    Removes records whose MIME type, as reported by the web server, is in the given set.

    mimeTypes

    a set of MIME types

  12. def discardUrlPatternsDF(urlREs: Set[Regex]): DataFrame

    Removes records whose URL matches any of the given regular expressions.

    urlREs

    a set of regular expressions

  13. def discardUrlsDF(urls: Set[String]): DataFrame

    Removes records with any of the given URLs.

    urls

    a set of URLs

  14. final def eq(arg0: AnyRef): Boolean

    Definition Classes
    AnyRef
  15. def equals(arg0: Any): Boolean

    Definition Classes
    AnyRef → Any
  16. def finalize(): Unit

    Attributes
    protected[java.lang]
    Definition Classes
    AnyRef
    Annotations
    @throws( classOf[java.lang.Throwable] )
  17. final def getClass(): Class[_]

    Definition Classes
    AnyRef → Any
  18. def hashCode(): Int

    Definition Classes
    AnyRef → Any
  19. final def isInstanceOf[T0]: Boolean

    Definition Classes
    Any
  20. def keepContentDF(contentREs: Set[Regex]): DataFrame

    Removes all records whose content does not match any of the given regular expressions.

    contentREs

    a set of regular expressions to keep

  21. def keepDateDF(dates: List[String], component: String = "YYYYMMDD"): DataFrame

    Removes all records that do not have one of the selected dates.

    dates

    a list of dates

    component

    the date component to compare, given as a DateComponent string (default "YYYYMMDD")

  22. def keepDomainsDF(domains: Set[String]): DataFrame

    Removes all records except those from the selected source domains.

  23. def keepHttpStatusDF(statusCodes: Set[String]): DataFrame

    Removes all records that do not have one of the selected HTTP status codes.

    statusCodes

    a set of HTTP status codes

  24. def keepImagesDF(): DataFrame

    Removes all data except images.

  25. def keepLanguagesDF(lang: Set[String]): DataFrame

    Removes all records that are not in one of the selected languages.

    lang

    a set of ISO 639-2 codes

  26. def keepMimeTypesDF(mimeTypes: Set[String]): DataFrame

    Removes all records except those with one of the specified MIME types.

    mimeTypes

    a set of MIME types

  27. def keepMimeTypesTikaDF(mimeTypes: Set[String]): DataFrame

    Removes all records except those whose Tika-detected MIME type is one of those specified.

  28. def keepUrlPatternsDF(urlREs: Set[Regex]): DataFrame

    Removes all records except those whose URL matches one of the selected patterns.

    urlREs

    a set of regular expressions

  29. def keepUrlsDF(urls: Set[String]): DataFrame

    Removes all records except those with one of the selected exact URLs.

    urls

    a set of URLs to keep

  30. def keepValidPagesDF(): DataFrame

    Removes all non-HTML data (images, executables, etc.), keeping only HTML pages.

  31. final def ne(arg0: AnyRef): Boolean

    Definition Classes
    AnyRef
  32. final def notify(): Unit

    Definition Classes
    AnyRef
  33. final def notifyAll(): Unit

    Definition Classes
    AnyRef
  34. val spark: SparkSession

  35. final def synchronized[T0](arg0: ⇒ T0): T0

    Definition Classes
    AnyRef
  36. def toString(): String

    Definition Classes
    AnyRef → Any
  37. final def wait(): Unit

    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  38. final def wait(arg0: Long, arg1: Int): Unit

    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  39. final def wait(arg0: Long): Unit

    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
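
The `component` argument of `keepDateDF` (member 21 above) selects which part of a date is compared against the `dates` list. The sketch below shows one plausible reading of that behaviour on plain strings. It rests on several assumptions: dates are digit strings of at least eight characters beginning with `yyyyMMdd`; component names other than the documented default `"YYYYMMDD"` (`"YYYY"`, `"MM"`, `"DD"`, `"YYYYMM"`) are illustrative; and `extractComponent` and `keepDate` are hypothetical helpers, not library functions.

```scala
object DateComponentSketch {
  // Hypothetical helper: cut the requested component out of a yyyyMMdd... string.
  def extractComponent(crawlDate: String, component: String): String =
    component match {
      case "YYYY"     => crawlDate.substring(0, 4)
      case "MM"       => crawlDate.substring(4, 6)
      case "DD"       => crawlDate.substring(6, 8)
      case "YYYYMM"   => crawlDate.substring(0, 6)
      case "YYYYMMDD" => crawlDate.substring(0, 8)
      case other =>
        throw new IllegalArgumentException(s"Unknown date component: $other")
    }

  // Keep only the dates whose selected component appears in `dates`.
  // The default mirrors the documented default of keepDateDF.
  def keepDate(
      crawlDates: Seq[String],
      dates: List[String],
      component: String = "YYYYMMDD"
  ): Seq[String] =
    crawlDates.filter(d => dates.contains(extractComponent(d, component)))

  val crawlDates: Seq[String] = Seq("20091218", "20100301", "20091027")

  // Compare on the year only.
  val in2009: Seq[String] = keepDate(crawlDates, List("2009"), "YYYY")

  // Compare on the full date (the default component).
  val exact: Seq[String] = keepDate(crawlDates, List("20100301"))
}
```

Under these assumptions, passing `"YYYY"` with `List("2009")` keeps every record crawled in 2009, while the default component requires a full eight-digit match.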
