Calculates the sample covariance of two numerical columns of a DataFrame.
the name of the first column
the name of the second column
the covariance of the two columns
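To make the quantity concrete, here is a minimal sketch of the sample covariance on two plain arrays. The helper `sampleCovariance` is hypothetical and not part of this API; it only illustrates the n - 1 normalisation that "sample" implies.

```typescript
// Hypothetical helper (not part of the library): sample covariance of two
// equal-length numeric arrays, normalised by n - 1.
function sampleCovariance(xs: number[], ys: number[]): number {
  if (xs.length !== ys.length) {
    throw new Error("both columns must have the same number of values");
  }
  const n = xs.length;
  if (n < 2) return NaN; // undefined for fewer than two observations

  const meanX = xs.reduce((acc, v) => acc + v, 0) / n;
  const meanY = ys.reduce((acc, v) => acc + v, 0) / n;

  let sum = 0;
  for (let i = 0; i < n; i++) {
    sum += (xs[i] - meanX) * (ys[i] - meanY);
  }
  return sum / (n - 1);
}
```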
Calculates the correlation of two columns of a DataFrame. Currently only supports the Pearson Correlation Coefficient.
the name of the first column
the name of the second column
method: string (optional). Currently only supports 'pearson'.
the Pearson Correlation Coefficient as a double
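As an illustration of the returned value, the Pearson coefficient can be sketched on plain arrays as the co-deviation sum divided by the square root of the product of the two squared-deviation sums. `pearsonCorrelation` is a hypothetical helper, not the library's implementation.

```typescript
// Hypothetical helper (not part of the library): Pearson correlation
// coefficient of two equal-length numeric arrays.
function pearsonCorrelation(xs: number[], ys: number[]): number {
  if (xs.length !== ys.length) {
    throw new Error("both columns must have the same number of values");
  }
  const n = xs.length;
  const meanX = xs.reduce((acc, v) => acc + v, 0) / n;
  const meanY = ys.reduce((acc, v) => acc + v, 0) / n;

  let sxy = 0; // sum of co-deviations
  let sxx = 0; // sum of squared deviations of xs
  let syy = 0; // sum of squared deviations of ys
  for (let i = 0; i < n; i++) {
    const dx = xs[i] - meanX;
    const dy = ys[i] - meanY;
    sxy += dx * dy;
    sxx += dx * dx;
    syy += dy * dy;
  }
  // Yields NaN when either column has zero variance (0 / 0).
  return sxy / Math.sqrt(sxx * syy);
}
```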
Computes a pair-wise frequency table of the given columns. Also known as a contingency table.
The first column of each row will be the distinct values of col1 and the column names will
be the distinct values of col2. The name of the first column will be col1_col2. Counts
will be returned as Longs. Pairs that have no occurrences will have zero as their counts.
The name of the first column. Distinct items will make the first item of each row.
The name of the second column. Distinct items will make the column names.
A DataFrame containing the contingency table.
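The layout described above can be sketched on in-memory rows as follows. `crosstabSketch` and `CrosstabRow` are hypothetical names, and sorting the distinct values is an assumption made only to keep the output deterministic; the library may order them differently.

```typescript
// Hypothetical in-memory row type: column name -> string value.
type CrosstabRow = Record<string, string>;

// Hypothetical helper (not part of the library): builds a contingency table
// whose first header cell is `${col1}_${col2}` and whose missing pairs are 0.
function crosstabSketch(
  rows: CrosstabRow[],
  col1: string,
  col2: string
): { header: string[]; rows: (string | number)[][] } {
  // Distinct values, sorted here only for a stable output order (assumption).
  const rowKeys = Array.from(new Set(rows.map((r) => r[col1]))).sort();
  const colKeys = Array.from(new Set(rows.map((r) => r[col2]))).sort();

  // Count each (col1, col2) pair.
  const counts = new Map<string, number>();
  for (const r of rows) {
    const key = JSON.stringify([r[col1], r[col2]]);
    counts.set(key, (counts.get(key) ?? 0) + 1);
  }

  const header = [`${col1}_${col2}`, ...colKeys];
  const table = rowKeys.map((rk) => {
    const cells: (string | number)[] = [rk];
    for (const ck of colKeys) {
      cells.push(counts.get(JSON.stringify([rk, ck])) ?? 0);
    }
    return cells;
  });
  return { header, rows: table };
}
```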
Finds frequent items for columns, possibly with false positives, using the
frequent element count algorithm proposed by Karp, Schenker, and Papadimitriou
(https://doi.org/10.1145/762471.762473). The support should be greater than 1e-4.
the names of the columns to search frequent items in
support: number (optional). The minimum frequency for an item to be considered frequent.
Should be greater than 1e-4. Default is 1% (0.01).
A local DataFrame with the frequent items in each column.
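To show why false positives can occur, here is a minimal single-pass sketch of a counter-based frequent-items scheme in the spirit of the cited algorithm. `frequentItemsSketch` and the choice of `ceil(1 / support)` counters are assumptions for illustration; the library's actual implementation may differ in its details.

```typescript
// Hypothetical helper (not part of the library): keeps at most
// ceil(1 / support) counters. Every item whose relative frequency is at
// least `support` survives, but infrequent items may survive too.
function frequentItemsSketch<T>(items: T[], support: number): T[] {
  if (!(support > 0 && support <= 1)) {
    throw new Error("support must be in (0, 1]");
  }
  const capacity = Math.ceil(1 / support);
  const counters = new Map<T, number>();

  for (const item of items) {
    const current = counters.get(item);
    if (current !== undefined) {
      counters.set(item, current + 1);
    } else if (counters.size < capacity) {
      counters.set(item, 1);
    } else {
      // No free counter: decrement every counter and drop those reaching zero.
      for (const [key, count] of Array.from(counters.entries())) {
        if (count === 1) counters.delete(key);
        else counters.set(key, count - 1);
      }
    }
  }
  return Array.from(counters.keys());
}
```

Because only one pass is made and counts are never verified against the data, a candidate that happens to hold a counter at the end of the stream is reported even if its true frequency is below the support.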
Returns a stratified sample without replacement based on the fraction given on each stratum.
column that defines strata
sampling fraction for each stratum. If a stratum is not specified, we treat its fraction as zero.
seed: number (optional). The random seed.
a new DataFrame that represents the stratified sample
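One way to picture stratified sampling without replacement is an independent keep-or-drop decision per row, using the fraction of that row's stratum. The sketch below assumes that per-row scheme; `sampleBySketch`, `SampleRow`, and the small seeded generator `mulberry32` are hypothetical and not part of the library, and the resulting sample sizes are only approximately proportional to the fractions.

```typescript
// Hypothetical in-memory row type.
type SampleRow = Record<string, string | number>;

// Small seeded PRNG (mulberry32) returning values in [0, 1).
function mulberry32(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Hypothetical helper (not part of the library): keeps each row with the
// probability configured for its stratum; unspecified strata use fraction 0.
function sampleBySketch(
  rows: SampleRow[],
  col: string,
  fractions: Record<string, number>,
  seed: number
): SampleRow[] {
  const rng = mulberry32(seed);
  return rows.filter((row) => {
    const fraction = fractions[String(row[col])] ?? 0;
    return rng() < fraction;
  });
}
```

Passing the same seed reproduces the same sample, which is the purpose of the optional seed parameter.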
Calculates the approximate quantiles of numerical columns of a DataFrame.
The result will be a DataFrame with the same number of columns as cols, where each
column contains the approximate quantiles for the corresponding input column.
the names of the numerical columns
a list of quantile probabilities. Each number must belong to [0, 1]. For example 0 is the minimum, 0.5 is the median, 1 is the maximum.
The relative target precision to achieve (greater than or equal to 0). If set to zero, the exact quantiles are computed, which could be very expensive. Note that values greater than 1 are accepted but give the same result as 1.
a DataFrame with the approximate quantiles
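For intuition about the zero-error case, exact quantiles of a single column can be sketched as a sort followed by a rank lookup. `exactQuantiles` is a hypothetical helper, and the rank convention used here (the element at 1-based rank `ceil(p * n)`, with the minimum for `p = 0`) is an assumption; the library's approximate algorithm and its tie-breaking may differ.

```typescript
// Hypothetical helper (not part of the library): exact quantiles of one
// numeric column, returning actual elements of the data.
function exactQuantiles(values: number[], probabilities: number[]): number[] {
  if (values.length === 0) return [];
  // Sort a copy so the caller's array is left untouched.
  const sorted = values.slice().sort((a, b) => a - b);
  return probabilities.map((p) => {
    if (p < 0 || p > 1) {
      throw new Error("each probability must belong to [0, 1]");
    }
    // 1-based rank; p = 0 maps to the minimum, p = 1 to the maximum.
    const rank = Math.max(1, Math.ceil(p * sorted.length));
    return sorted[rank - 1];
  });
}
```

The full sort is what makes the exact case expensive on large data, which is why a non-zero relative error is usually preferred.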
Statistic functions for DataFrames.