Statistical functions for DataFrames.

Methods

  • Calculates the sample covariance of two numerical columns of a DataFrame.

    Parameters

    • col1: string

      the name of the first column

    • col2: string

      the name of the second column

    Returns Promise<number>

    the covariance of the two columns
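To make the returned value concrete, here is a minimal sketch of the sample covariance formula applied to two plain arrays. It does not call this library; `sampleCovariance` and `covExample` are hypothetical names used only for illustration.

```typescript
// Sample covariance: sum((x - meanX) * (y - meanY)) / (n - 1).
// Hypothetical helper for illustration; not part of the DataFrame API.
function sampleCovariance(xs: number[], ys: number[]): number {
  if (xs.length !== ys.length) {
    throw new Error("both columns must have the same number of rows");
  }
  const n = xs.length;
  if (n < 2) {
    throw new Error("sample covariance needs at least two rows");
  }
  const meanX = xs.reduce((acc, x) => acc + x, 0) / n;
  const meanY = ys.reduce((acc, y) => acc + y, 0) / n;
  let sum = 0;
  for (let i = 0; i < n; i++) {
    sum += (xs[i] - meanX) * (ys[i] - meanY);
  }
  // Dividing by n - 1 (not n) is what makes this the *sample* covariance.
  return sum / (n - 1);
}

// Deviation products are 4.5, 0.5, 0.5, 4.5, so the result is 10 / 3.
const covExample = sampleCovariance([1, 2, 3, 4], [2, 4, 6, 8]);
console.log(covExample);
```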

  • Calculates the correlation of two columns of a DataFrame. Currently only supports the Pearson Correlation Coefficient.

    Parameters

    • col1: string

      the name of the first column

    • col2: string

      the name of the second column

    • Optional method: string

      Optional. Currently only supports 'pearson'

    Returns Promise<number>

    the Pearson Correlation Coefficient as a double
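As a point of reference for the value this method returns, the Pearson correlation coefficient can be sketched on plain arrays as follows. This is an illustration of the formula only, not the library's implementation; `pearsonCorrelation` is a hypothetical name.

```typescript
// Pearson correlation: cov(x, y) / (stddev(x) * stddev(y)).
// Hypothetical helper for illustration; not part of the DataFrame API.
function pearsonCorrelation(xs: number[], ys: number[]): number {
  if (xs.length !== ys.length) {
    throw new Error("both columns must have the same number of rows");
  }
  const n = xs.length;
  if (n < 2) {
    throw new Error("correlation needs at least two rows");
  }
  const meanX = xs.reduce((acc, x) => acc + x, 0) / n;
  const meanY = ys.reduce((acc, y) => acc + y, 0) / n;
  let sxy = 0;
  let sxx = 0;
  let syy = 0;
  for (let i = 0; i < n; i++) {
    const dx = xs[i] - meanX;
    const dy = ys[i] - meanY;
    sxy += dx * dy;
    sxx += dx * dx;
    syy += dy * dy;
  }
  // The (n - 1) factors cancel, so raw sums of squares are enough.
  // If either column is constant the denominator is 0 and the result is NaN.
  return sxy / Math.sqrt(sxx * syy);
}

const perfectlyCorrelated = pearsonCorrelation([1, 2, 3, 4], [2, 4, 6, 8]);
const perfectlyAnticorrelated = pearsonCorrelation([1, 2, 3], [3, 2, 1]);
console.log(perfectlyCorrelated, perfectlyAnticorrelated);
```

The result always lies in [-1, 1] for non-constant columns, with the sign giving the direction of the linear relationship.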

  • Computes a pair-wise frequency table of the given columns. Also known as a contingency table. The first column of each row will be the distinct values of col1 and the column names will be the distinct values of col2. The name of the first column will be col1_col2. Counts will be returned as Longs. Pairs that have no occurrences will have zero as their counts.

    Parameters

    • col1: string

      The name of the first column. Distinct items will make the first item of each row.

    • col2: string

      The name of the second column. Distinct items will make the column names.

    Returns DataFrame

    A DataFrame containing the contingency table.
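The layout of the contingency table described above can be sketched on an in-memory list of (col1, col2) pairs. This does not use the library; `crossTabulate` and `TableRow` are hypothetical names, and the first-seen ordering of rows and columns is an assumption of this sketch rather than a documented guarantee.

```typescript
// One output row: the key column plus one count column per distinct col2 value.
type TableRow = Record<string, string | number>;

// Hypothetical helper for illustration; not part of the DataFrame API.
function crossTabulate(
  pairs: Array<[string, string]>,
  col1: string,
  col2: string
): TableRow[] {
  const rowKeys: string[] = []; // distinct col1 values, in first-seen order
  const colKeys: string[] = []; // distinct col2 values, in first-seen order
  const counts = new Map<string, number>();
  for (const [a, b] of pairs) {
    if (rowKeys.indexOf(a) === -1) rowKeys.push(a);
    if (colKeys.indexOf(b) === -1) colKeys.push(b);
    const key = JSON.stringify([a, b]);
    counts.set(key, (counts.get(key) ?? 0) + 1);
  }
  // The first column is named col1_col2, as the description states.
  const keyColumn = `${col1}_${col2}`;
  return rowKeys.map((a) => {
    const row: TableRow = { [keyColumn]: a };
    for (const b of colKeys) {
      // Pairs that never occur get a count of zero.
      row[b] = counts.get(JSON.stringify([a, b])) ?? 0;
    }
    return row;
  });
}

const table = crossTabulate(
  [["a", "x"], ["a", "y"], ["b", "x"], ["a", "x"]],
  "key",
  "value"
);
console.log(table);
// [ { key_value: 'a', x: 2, y: 1 }, { key_value: 'b', x: 1, y: 0 } ]
```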

  • Finds frequent items for columns, possibly with false positives, using the frequent element count algorithm proposed by Karp, Schenker, and Papadimitriou (https://doi.org/10.1145/762471.762473). The support should be greater than 1e-4.

    Parameters

    • cols: string[]

      the names of the columns to search frequent items in

    • Optional support: number

      Optional. The minimum frequency for an item to be considered frequent. Should be greater than 1e-4. Default is 1% (0.01).

    Returns DataFrame

    A local DataFrame with the frequent items in each column.
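To show where the false positives come from, here is a minimal single-column sketch of a counter-based frequent-items pass in the style of the cited algorithm. It is not the library's implementation: `frequentItemCandidates` is a hypothetical name, and sizing the counter table as floor(1 / support) is an assumption of this sketch.

```typescript
// Keeps at most `capacity` counters. Every item whose true frequency exceeds
// `support` is guaranteed to survive, but infrequent items may survive too.
// Hypothetical helper for illustration; not part of the DataFrame API.
function frequentItemCandidates<T>(items: T[], support: number = 0.01): T[] {
  if (!(support > 1e-4 && support <= 1)) {
    throw new Error("support must be greater than 1e-4 and at most 1");
  }
  const capacity = Math.floor(1 / support); // assumption of this sketch
  const counters = new Map<T, number>();
  for (const item of items) {
    const current = counters.get(item);
    if (current !== undefined) {
      counters.set(item, current + 1);
    } else if (counters.size < capacity) {
      counters.set(item, 1);
    } else {
      // Table is full: decrement every counter and drop those that reach zero.
      const exhausted: T[] = [];
      counters.forEach((count, key) => {
        if (count === 1) exhausted.push(key);
        else counters.set(key, count - 1);
      });
      for (const key of exhausted) counters.delete(key);
    }
  }
  const result: T[] = [];
  counters.forEach((_count, key) => result.push(key));
  return result;
}

// "a" occurs in 4 of 8 rows (0.5 > 0.4) and is the only survivor.
const candidates = frequentItemCandidates(
  ["a", "b", "a", "c", "a", "d", "a", "b"],
  0.4
);
console.log(candidates); // [ 'a' ]

// "b" occurs in only 1 of 4 rows (0.25 < 0.4) but is still reported:
// a false positive, because the counter table never filled up.
const withFalsePositive = frequentItemCandidates(["a", "a", "a", "b"], 0.4);
console.log(withFalsePositive); // [ 'a', 'b' ]
```

If exact results matter, the candidates can be verified with a second pass that counts each surviving item.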

  • Returns a stratified sample without replacement based on the fraction given on each stratum.

    Parameters

    • col: Column

      column that defines strata

    • fractions: Map<any, number>

      sampling fraction for each stratum. If a stratum is not specified, we treat its fraction as zero.

    • Optional seed: number

      random seed

    Returns DataFrame

    a new DataFrame that represents the stratified sample
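One common way to draw such a sample is an independent keep-or-drop decision per row, using the fraction of the row's stratum. The sketch below illustrates that idea on an in-memory array; it is an assumption about the general technique, not a description of this library's internals. `stratifiedSample`, `seededRandom`, and the default seed are hypothetical.

```typescript
// Small seeded generator (mulberry32 style) so the sketch is reproducible.
function seededRandom(seed: number): () => number {
  let state = seed >>> 0;
  return () => {
    state = (state + 0x6d2b79f5) >>> 0;
    let t = state;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296; // value in [0, 1)
  };
}

// Hypothetical helper for illustration; not part of the DataFrame API.
function stratifiedSample<R, K>(
  rows: R[],
  stratumOf: (row: R) => K,
  fractions: Map<K, number>,
  seed: number = 42
): R[] {
  const random = seededRandom(seed);
  return rows.filter((row) => {
    // A stratum missing from `fractions` is treated as fraction zero.
    const fraction = fractions.get(stratumOf(row)) ?? 0;
    // Each row is considered once, so the sample is without replacement.
    return random() < fraction;
  });
}

// Strata are the remainders 0, 1, 2; stratum 2 is deliberately left out.
const rows = Array.from({ length: 30 }, (_unused, i) => ({ id: i, key: i % 3 }));
const fractions = new Map<number, number>([
  [0, 1.0],
  [1, 0.5],
]);
const sample = stratifiedSample(rows, (row) => row.key, fractions, 7);
console.log(sample.length);
```

With per-row decisions, the number of rows kept from each stratum is only approximately fraction × stratum size, except for fractions of exactly 0 or 1.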

  • Calculates the approximate quantiles of numerical columns of a DataFrame.

    The result will be a DataFrame with the same number of columns as cols, where each column contains the approximate quantiles for the corresponding input column.

    Parameters

    • cols: string[]

      the names of the numerical columns

    • probabilities: number[]

      a list of quantile probabilities. Each number must belong to [0, 1]. For example 0 is the minimum, 0.5 is the median, 1 is the maximum.

    • relativeError: number

      The relative target precision to achieve (greater than or equal to 0). If set to zero, the exact quantiles are computed, which could be very expensive. Note that values greater than 1 are accepted but give the same result as 1.

    Returns DataFrame

    a DataFrame with the approximate quantiles
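For orientation, the exact case (relativeError of zero) can be sketched for a single column by sorting and picking ranks. This is an illustration only: `exactQuantiles` is a hypothetical name, and the rank convention used here (1-based rank ceil(p × n), clamped to at least 1) is an assumption; the library's convention and its approximate algorithm may differ.

```typescript
// Exact quantiles of one numerical column by sorting and rank lookup.
// Hypothetical helper for illustration; not part of the DataFrame API.
function exactQuantiles(values: number[], probabilities: number[]): number[] {
  if (values.length === 0) {
    throw new Error("cannot compute quantiles of an empty column");
  }
  for (const p of probabilities) {
    if (!(p >= 0 && p <= 1)) {
      throw new Error("each probability must belong to [0, 1]");
    }
  }
  // Copy before sorting so the caller's array is left untouched.
  const sorted = values.slice().sort((a, b) => a - b);
  const n = sorted.length;
  return probabilities.map((p) => {
    const rank = Math.max(1, Math.ceil(p * n)); // 1-based rank, assumption
    return sorted[rank - 1];
  });
}

// 0 gives the minimum, 0.5 the median, 1 the maximum.
const quantiles = exactQuantiles([5, 1, 4, 2, 3], [0, 0.5, 1]);
console.log(quantiles); // [ 1, 3, 5 ]
```

Sorting the whole column is what makes the exact case expensive on large data, which is why a nonzero relativeError is usually preferable.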