Astraea

Astraea is a package for training Random Forest (RF) models on datasets. It provides tools to train RF classifiers and regressors, perform simple cross-validation tests, and create performance plots for the test set.

It was first developed to calculate rotation periods of stars from various provided stellar properties, and is intended to predict long rotation periods (e.g. those of M-dwarfs) from short (27-day) TESS light curves.

We provide access to models trained on stars from the catalogs of McQuillan et al. (2014), Garcia et al. (2014), and Santos et al. (2019). Users can predict whether a rotation period can be recovered, and predict the recoverable rotation periods, for stars in the Kepler field using their temperatures, colors, kinematics, and other stellar parameters.

User Guide

Installation

From GitHub source:

brew install wget
git clone https://github.com/lyx12311/Astraea.git
cd Astraea
python setup.py install

Dependencies

The dependencies of Astraea are NumPy, matplotlib, pandas, Astropy, and scikit-learn.

These can be installed using pip:

pip install numpy matplotlib pandas astropy scikit-learn

Running tests

You can test Astraea from within the base directory using pytest.

pytest

API documentation

Astraea.FLICKERinstall()[source]

Installs the FLICKER software to calculate flicker values from light curves. Documentation: https://flicker.readthedocs.io.

Astraea.getFlicker(t, sig)[source]

Calculates the Flicker value

Parameters:
  • t – Time [days]
  • sig – Flux
Returns:
  Flicker ([float]): Flicker value
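
For intuition, the 8-hour "flicker" (F8) statistic of Bastien et al. (2013) is commonly computed as the RMS of the light curve after removing an 8-hour boxcar smooth. Below is a minimal sketch under that assumed definition; FLICKER's actual implementation may differ.

import numpy as np

def flicker_sketch(t, sig, window_days=8.0/24.0):
    # smooth the flux with an 8-hour moving average (boxcar)
    smooth = np.array([sig[np.abs(t - ti) <= window_days/2.0].mean() for ti in t])
    # flicker ~ RMS of the residual short-timescale variations
    return np.std(sig - smooth)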

Astraea.getKeplerProt(X_pred)[source]

Predict rotation period from trained models.

This function predicts rotation periods for stars in the Kepler field. The models are trained on rotation periods from McQuillan et al. (2014), Santos et al. (2019), and Garcia et al. (2014). If the models are not already downloaded, this tool will download them, which might take a couple of minutes. It first passes the stars through a classifier, which identifies which stars have measurable rotation periods. Then it uses two regressor models (one with 1 estimator and one with 100 estimators) to predict the rotation periods. If a column “Prot” exists, it will also output the true periods alongside the predicted periods. The light curve feature “flicker” can be calculated using the FLICKER software.

Parameters:
  • X_pred ([Pandas DataFrame]) – DataFrame containing all variables needed; run Astraea.getTrainF() to print out the requirements
Returns:
  DataFrame containing:
  TrueProt: True rotation periods (if available)
  Prot_prediction_1est: Period predictions with 1 estimator
  Prot_prediction_100est: Period predictions with 100 estimators
Return type:
  <pandas.DataFrame> or <pandas.Series>
Astraea.getRvar(Flux)[source]

Calculates light curve Rvar

Parameters:
  • Flux – The light curve flux in ppm
Returns:
  Rvar ([float]): The variability of the light curve
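
Rvar is often defined as the range between the 5th and 95th percentiles of the flux. A minimal sketch under that assumed definition (Astraea's exact implementation may differ):

import numpy as np

def rvar_sketch(flux):
    # variability amplitude: 5th-to-95th percentile range of the flux [ppm]
    return np.percentile(flux, 95) - np.percentile(flux, 5)
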
Astraea.getLGpeak(t, sig, sig_err)[source]

Calculates the location of the highest Lomb-Scargle peak and the maximum power of that peak

Parameters:
  • t – Time [days]
  • sig – Flux
  • sig_err – Flux error
Returns:
  LG_Prot ([float]): The period calculated from Lomb-Scargle
  LG_peaks ([float]): The maximum peak height
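
A minimal sketch of this kind of measurement, using astropy's LombScargle as an assumed stand-in (not necessarily Astraea's exact implementation):

import numpy as np
from astropy.timeseries import LombScargle

def lg_peak_sketch(t, sig, sig_err):
    # Lomb-Scargle periodogram with measurement errors
    frequency, power = LombScargle(t, sig, sig_err).autopower()
    ipeak = np.argmax(power)
    LG_Prot = 1.0 / frequency[ipeak]  # period of the highest peak [days]
    LG_peaks = power[ipeak]           # height of the highest peak
    return LG_Prot, LG_peaks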

Astraea.getTrainF()[source]

Print out the features needed for the model, in order.

Astraea.getVs(df)[source]

Calculates the tangential velocity (v_tan) and an approximation of the vertical velocity (v_b).

Parameters:
  • df ([Pandas DataFrame]) – DataFrame containing the columns ‘parallax’, ‘pmra’, ‘pmdec’, ‘ra’, ‘dec’: the parallax, proper motion in RA, proper motion in Dec, right ascension, and declination, respectively
Returns:
  v_t ([array-like]): Tangential velocity
  v_b ([array-like]): Proxy for the vertical velocity
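
A minimal sketch of the kind of calculation involved, assuming the standard relation v [km/s] = 4.74 mu [mas/yr] / parallax [mas] and Gaia-style columns; the actual implementation may differ in detail:

import numpy as np
import astropy.units as u
from astropy.coordinates import SkyCoord

def vs_sketch(ra, dec, pmra, pmdec, parallax):
    # ra, dec in deg; pmra (= mu_ra*cos(dec)) and pmdec in mas/yr; parallax in mas
    # tangential velocity from the total proper motion
    v_tan = 4.74 * np.hypot(pmra, pmdec) / parallax  # km/s
    # rotate the proper motions into Galactic coordinates to get mu_b
    c = SkyCoord(ra=ra*u.deg, dec=dec*u.deg,
                 pm_ra_cosdec=pmra*u.mas/u.yr, pm_dec=pmdec*u.mas/u.yr)
    pm_b = c.galactic.pm_b.to(u.mas/u.yr).value
    # latitudinal component as a proxy for the vertical velocity
    v_b = 4.74 * pm_b / parallax  # km/s
    return v_tan, v_b
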
Astraea.RFclassifier(df, testF, modelout=False, traind=0.8, ID_on='KID', X_train_ind=[], X_test_ind=[], target_var='Prot_flag', n_estimators=100, criterion='gini', max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features='auto', max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, bootstrap=True, oob_score=False, n_jobs=None, random_state=None, verbose=0, warm_start=False, class_weight=None)[source]

Train RF classifier model and predict values for cross-validation dataset.

It uses the scikit-learn Random Forest classifier. All default hyperparameters are taken from the scikit-learn model and can be changed through the optional inputs. For more details on the hyperparameters, see https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html. To use this module to train an RF model to predict rotation periods, input a pandas DataFrame with column names as well as a list of feature names.

Parameters:
  • df ([Pandas DataFrame]) – DataFrame containing all variables needed
  • testF ([string list]) – List of feature names used for training
  • modelout (Optional [bool]) – Whether to output only the trained model
  • traind (Optional [float]) – Fraction of the data used to train; the rest is used for the cross-validation test (default 0.8)
  • ID_on (Optional [string]) – Name of the star identifier column (default ‘KID’). If the specified ID column does not exist, the index is used as the ID
  • X_train_ind (Optional [list]) – List of ID_on values for the training set; if not specified, a random traind fraction of the ID_on column is used
  • X_test_ind (Optional [list]) – List of ID_on values for the testing set; if not specified, the remaining (1-traind) fraction of the ID_on column not in the training set (X_train_ind) is used
  • target_var (Optional [string]) – Label column name (default ‘Prot_flag’)
Returns:

regr:

Sklearn RF classifier model (for attributes see https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html)

<pandas.Series> containing:

actrualF ([string list]):
 Actual features used
importance ([float list]):
 Impurity-based feature importances, in the same order as actrualF
ID_train ([list]):
 List of ID_on values used for the training set
ID_test ([list]):
 List of ID_on values used for the testing set
prediction ([float list]):
 List of predictions for the testing set
X_test ([matrix]):
 Matrix used to predict label values for the testing set
y_test ([array-like]):
 Array of true label values for the testing set
X_train ([matrix]):
 Matrix used to predict label values for the training set
y_train ([array-like]):
 Array of true label values for the training set

Return type:

<RF model>, <pandas.Series>
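
For example, with a DataFrame df and a feature-name list testF as described above:

# train a classifier on df and run the cross-validation test
clf, clf_outs = Astraea.RFclassifier(df, testF, target_var='Prot_flag')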

Astraea.RFregressor(df, testF, modelout=False, traind=0.8, ID_on='KID', X_train_ind=[], X_test_ind=[], target_var='Prot', target_var_err='Prot_err', chisq_out=False, MREout=False, n_estimators=100, criterion='mse', max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features='auto', max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, bootstrap=True, oob_score=False, n_jobs=1, random_state=None, verbose=0, warm_start=False)[source]

Train RF regression model and perform cross-validation test.

It uses the scikit-learn Random Forest regressor. All default hyperparameters are taken from the scikit-learn model and can be changed through the optional inputs. For more details on the hyperparameters, see https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html. To use this module to train an RF model to predict rotation periods, input a pandas DataFrame with column names as well as a list of feature names.

Parameters:
  • df ([Pandas DataFrame]) – DataFrame containing all variables needed
  • testF ([string list]) – List of feature names used for training
  • modelout (Optional [bool]) – Whether to output only the trained model
  • traind (Optional [float]) – Fraction of the data used to train; the rest is used for the cross-validation test (default 0.8)
  • ID_on (Optional [string]) – Name of the star identifier column (default ‘KID’). If the specified ID column does not exist, the index is used as the ID
  • X_train_ind (Optional [list]) – List of ID_on values for the training set; if not specified, a random traind fraction of the ID_on column is used
  • X_test_ind (Optional [list]) – List of ID_on values for the testing set; if not specified, the remaining (1-traind) fraction of the ID_on column not in the training set (X_train_ind) is used
  • target_var (Optional [string]) – Label column name (default ‘Prot’)
  • target_var_err (Optional [string]) – Label error column name (default ‘Prot_err’)
  • chisq_out (Optional [bool]) – If True, only output the average chisq value
  • MREout (Optional [bool]) – If True, only output the median relative error. If both chisq_out and MREout are True, output only these two values
Returns:

regr:

Sklearn RF regressor model (for attributes see https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html)

<pandas.Series> containing:

actrualF ([string list]):
 Actual features used
importance ([float list]):
 Impurity-based feature importances, in the same order as actrualF
ID_train ([list]):
 List of ID_on values used for the training set
ID_test ([list]):
 List of ID_on values used for the testing set
prediction ([float list]):
 List of predictions for the testing set
ave_chi2 ([float]):
 Average chisq on the cross-validation (testing) set
MRE ([float]):
 Median relative error on the cross-validation (testing) set
X_test ([matrix]):
 Matrix used to predict label values for the testing set
y_test ([array-like]):
 Array of true label values for the testing set
X_train ([matrix]):
 Matrix used to predict label values for the training set
y_train ([array-like]):
 Array of true label values for the training set

Return type:

<RF model>, <pandas.Series>
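
For example, with a DataFrame df and a feature-name list testF as described above:

# train a regressor and get the cross-validation outputs
regr, regr_outs = Astraea.RFregressor(df, testF, target_var='Prot', target_var_err='Prot_err')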

Astraea.load_RF()[source]

Load the random forest classifier and regressors from zenodo.org.

Two regressors will be loaded, one with 1 estimator and one with 100 estimators. The 1-estimator model minimizes bias (systematic offset) at the cost of high variance (scatter), while the 100-estimator model minimizes variance at the cost of higher bias. To predict rotation periods from Kepler light curves, it is best to use the model with 100 estimators; otherwise, use the model with 1 estimator. Models trained on TESS light curves are still being developed.

Astraea.plot_corr(df, y_vars, x_var='Prot', logplotarg=[], logarg=[], MS=1)[source]

Plot correlations of one variable against other variables specified by the user

Parameters:
  • df ([Pandas DataFrame]) – DataFrame containing all variables needed
  • y_vars ([string list]) – Names of the variables to plot against x_var
  • x_var (Optional [string]) – Name of the variable on the x-axis (default ‘Prot’)
  • MS (Optional [int]) – Marker size for plotting
Returns:
  Plots of the feature correlations
Return type:
  <matplotlib.plot>
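
For example, as in the tutorial below (TrainF_reg is a list of feature names):

Astraea.plot_corr(star_data, TrainF_reg, MS=10)
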
Astraea.plot_result(actrualF, importance, prediction, y_test, y_test_err=[], topn=20, MS=3, labelName='Period')[source]

Plot the impurity-based feature importances as well as the predicted vs. true values for a random forest model

Parameters:
  • actrualF ([array-like]) – Features used (from the output of RFregressor())
  • importance ([array-like]) – Feature importances of the model (from the output of RFregressor())
  • prediction ([array-like]) – Predicted values (from the output of RFregressor())
  • y_test ([array-like]) – True values (from the output of RFregressor())
  • y_test_err (Optional [array-like]) – Errors on the true values (from the output of RFregressor())
  • topn (Optional [int]) – Number of most important features to plot
  • MS (Optional [int]) – Marker size for plotting true vs. predicted values
  • labelName (Optional [string]) – Label name
Returns:

Feature importance plot as well as a true vs. predicted values plot

Return type:

<matplotlib.plot>
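
For example, using the outputs of Astraea.RFregressor() directly (as in the tutorial below):

Astraea.plot_result(regr_outs['actrualF'], regr_outs['importance'],
                    regr_outs['prediction'], regr_outs['y_test'])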

Tutorials

Note

This tutorial was generated from an IPython notebook that can be downloaded here.

Predict rotation period for Kepler stars using existing model

Load the trained random forest models and predict rotation periods from provided features or light curve(s).

Calculate rotation period(s) from features

Below is a tutorial for calculating rotation period(s) for single or multiple stars using the existing model. Currently, this model has only been tested on stars in the Kepler field. To achieve the best results, the light curve statistics should be calculated from Kepler light curves. For stars outside of the Kepler field, it is best to use the model with 1 estimator to minimize the model bias.

import Astraea
import pandas as pd
import numpy as np

# print out needed features in order
TrainF_class, TrainF_reg = Astraea.getTrainF()

# load in existing testing data
KeplerTest = Astraea.load_KeplerTest()
>>> classification features are: ['LG_peaks [Lomb-Scargle peak height]', 'Rvar [ppm]', 'parallax [gaia]', 'radius_percentile_lower [gaia]', 'radius_percentile_upper [gaia]', 'phot_g_mean_flux_over_error [gaia]', 'bp_g [gaia]']


>>> regression features are: ['teff [gaia]','bp_g [gaia]','lum_val [gaia]','v_tan [getVs()]','phot_g_mean_flux_over_error [gaia]','v_b [getVs()]','radius_val [gaia]','b [gaia]','Rvar [ppm]','flicker [FLICKER]']

If the data contains all features, first create a dictionary with required columns, then convert it into a <pd.DataFrame>,

# construct pd.DataFrame that contains all the features
starStat = {'LG_peaks': KeplerTest.LG_peaks.values, 'Rvar': KeplerTest.Rvar.values,
          'parallax': KeplerTest.parallax.values,
          'radius_percentile_lower': KeplerTest.radius_percentile_lower.values,
          'radius_val': KeplerTest.radius_val.values,
          'radius_percentile_upper': KeplerTest.radius_percentile_upper.values,
          'phot_g_mean_flux_over_error': KeplerTest.phot_g_mean_flux_over_error.values,
          'bp_g': KeplerTest.bp_g.values, 'teff': KeplerTest.teff.values,
          'lum_val': KeplerTest.lum_val.values, 'v_tan':KeplerTest.v_tan.values,
          'v_b': KeplerTest.v_b.values, 'b': KeplerTest.b.values,
          'flicker':KeplerTest.flicker.values, 'Prot': KeplerTest.Prot.values,
          'Prot_err': KeplerTest.Prot_err.values}

# dictionary -> dataframe
star_data = pd.DataFrame(starStat)

# only display 6 columns
pd.set_option("display.max_columns", 6)

star_data
LG_peaks Rvar parallax ... flicker Prot Prot_err
0 0.589608 5053.753125 12.472423 ... 0.001139 22.736980 2.369760
1 0.162442 2232.375000 5.677683 ... 0.000471 5.480000 0.500000
2 0.654339 1021.548438 9.395085 ... 0.001802 28.779253 4.690020
3 0.434910 12080.606250 6.609612 ... 0.001328 2.002000 0.014000
4 0.301622 6154.762500 6.351729 ... 0.000479 4.733000 0.059000
... ... ... ... ... ... ... ...
200 5.709757 19928.750000 13.199997 ... 0.001050 9.939000 0.015000
201 0.610268 203.850000 5.504599 ... 0.001476 19.567000 1.062000
202 0.777997 169.729492 8.506015 ... 0.006036 30.843687 2.999815
203 0.644850 384.446289 9.621642 ... 0.008710 34.280000 0.083000
204 0.359622 47.170898 6.945455 ... 0.013822 41.021000 0.080000

205 rows × 16 columns

Plot correlations between features and rotation period,

Astraea.plot_corr(star_data,TrainF_reg,MS=10)
[Figure: correlations between each feature and rotation period]

You can now feed the <pd.DataFrame> into the function to predict rotation periods (note: this will take a while if you are running it for the first time),

predics = Astraea.getKeplerProt(star_data)
>>> classification features are: ['LG_peaks [Lomb-Scargle peak height]', 'Rvar [ppm]', 'parallax [gaia]', 'radius_percentile_lower [gaia]', 'radius_percentile_upper [gaia]', 'phot_g_mean_flux_over_error [gaia]', 'bp_g [gaia]']


>>> regression features are: ['teff [gaia]','bp_g [gaia]','lum_val [gaia]','v_tan [getVs()]','phot_g_mean_flux_over_error [gaia]','v_b [getVs()]','radius_val [gaia]','b [gaia]','Rvar [ppm]','flicker [FLICKER]']


Total 205 stars!
Classifing 205 stars!
205.0 stars have predictable rotation periods (100.0%)
Predicting rotation periods!
Finished!
predics
True Prot True Prot_err Prot prediction w/ 1 est Prot prediction w/ 100 est
0 22.736980 2.369760 18.041 16.096785
1 5.480000 0.500000 4.138 4.644040
2 28.779253 4.690020 22.334 18.535857
3 2.002000 0.014000 9.548 5.751400
4 4.733000 0.059000 4.138 3.841780
... ... ... ... ...
200 9.939000 0.015000 6.426 11.572000
201 19.567000 1.062000 13.459 7.922410
202 30.843687 2.999815 35.814 15.489082
203 34.280000 0.083000 34.239 19.050126
204 41.021000 0.080000 18.347 23.735942

205 rows × 4 columns
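
For a quick look at the results, you can compare the predictions against the true periods (assuming the column names match the table header above):

import matplotlib.pyplot as plt

# true vs. predicted periods for the 100-estimator model
plt.scatter(predics['True Prot'], predics['Prot prediction w/ 100 est'], s=10)
plt.plot([0, 45], [0, 45], 'r--')  # one-to-one line
plt.xlabel('True Prot [days]')
plt.ylabel('Predicted Prot [days]')
plt.show()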

Download a light curve and calculate the variability, Lomb-Scargle peak, and flicker needed for the trained model

Here is a basic tutorial for calculating the light curve statistics needed for the trained model. Other statistics can be found by cross-matching any Kepler star with Gaia (a useful website for cross-matching Kepler and Gaia, made by Megan Bedell: https://gaia-kepler.fun). v_tan and v_b can be calculated after cross-matching Kepler with Gaia, using the function Astraea.getVs().

lightkurve is used to download the light curve (https://docs.lightkurve.org).

import Astraea
import pandas as pd
import numpy as np
from lightkurve import search_targetpixelfile

# download light curve and plot it
tpf = search_targetpixelfile('KIC 2157356', quarter=9).download()
lc = tpf.to_lightcurve(aperture_mask=tpf.pipeline_mask)
lc.plot()
<matplotlib.axes._subplots.AxesSubplot at 0x1a198a6b70>
[Figure: Kepler light curve of KIC 2157356 (quarter 9)]

Normalize the light curve and get time, flux and flux error,

t = lc.time # get time in days
sig = (lc.flux/np.median(lc.flux) - 1)*1e6 # get normalized flux in ppm
sig_err = lc.flux_err/np.median(lc.flux)*1e6 # get normalized flux error in ppm

Get the variability of the light curve (Rvar),

Rvar = Astraea.getRvar(sig)
Rvar
42303.15312499995

Get Lomb-Scargle peak height (LG_peaks),

LG_Prot, LG_peaks = Astraea.getLGpeak(t,sig,sig_err)
LG_Prot, LG_peaks
(13.275704451936583, 0.20658748911841823)

Get flicker value (flicker). It will download FLICKER if not already installed,

flicker = Astraea.getFlicker(t,sig)
flicker
12609.067447067864

Train a regressor model and test its performance

Here is a simple example of training a regressor model and testing its performance by calculating \(\chi^2\) and the median relative error, plotting the impurity-based feature importances, and plotting the true vs. predicted values from the cross-validation test.
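
Assuming the usual definitions (Astraea's exact expressions may differ slightly), the two statistics over the \(N\) cross-validation stars are \(\bar{\chi}^2 = \frac{1}{N}\sum_i (y_{\mathrm{pred},i} - y_{\mathrm{true},i})^2 / \sigma_i^2\) and \(\mathrm{MRE} = \mathrm{median}_i\left(|y_{\mathrm{pred},i} - y_{\mathrm{true},i}| / y_{\mathrm{true},i}\right)\), where \(\sigma_i\) is the label error (here “y_err”).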

Train a regressor model

Generate a <pd.DataFrame> of random features and labels to test Astraea.RFregressor.

Normally the user will create a <pd.DataFrame> containing all the features and labels. Below is an example of creating a DataFrame with 20 feature columns of random numbers, named “X0”, “X1”, …, “X19”, a label column named “y” that is a linear combination of the features, and a randomly generated label error column named “y_err”. Note that for this problem the coefficients of the linear combination are in decreasing order, so that \(y = 20 X_0 + 19 X_1 + \dots + 1 X_{19}\). By doing so, we can validate the feature importance output later on.

import Astraea
import pandas as pd
import numpy as np

# create random feature matrix with 20 features and 5000 total data points
X = np.random.rand(5000, 20)

# create labels from features
y = sum([X[:,i] * (20-i) for i in range(np.shape(X)[1])])

# put features and labels into one pandas dataFrame
X_y = pd.DataFrame(np.hstack((X,np.reshape(y, (5000, 1)))),
                   columns = np.append(['X'+str(i) for i in range(np.shape(X)[1])], ['y']))

# assign random errors
X_y['y_err'] = np.random.rand(5000)

# only display 9 columns
pd.set_option("display.max_columns", 9)

X_y
X0 X1 X2 X3 ... X18 X19 y y_err
0 0.606529 0.062959 0.720791 0.981319 ... 0.635287 0.644116 98.640853 0.159159
1 0.777415 0.010855 0.785878 0.399668 ... 0.335854 0.830296 120.593567 0.662260
2 0.702256 0.269779 0.331504 0.521144 ... 0.611254 0.015499 103.314942 0.199658
3 0.068799 0.276065 0.507314 0.956416 ... 0.007095 0.035967 87.488786 0.440642
4 0.941370 0.447072 0.666370 0.061702 ... 0.768792 0.505517 98.850524 0.571305
... ... ... ... ... ... ... ... ... ...
4995 0.850935 0.866016 0.912287 0.974070 ... 0.411835 0.353871 114.993805 0.380209
4996 0.387948 0.155788 0.178811 0.680044 ... 0.327828 0.442220 90.433563 0.173283
4997 0.927812 0.779648 0.412766 0.406887 ... 0.534086 0.906392 114.112874 0.505117
4998 0.791832 0.091674 0.730668 0.880411 ... 0.363859 0.788126 126.543979 0.753451
4999 0.954219 0.475264 0.029344 0.419582 ... 0.384495 0.934769 95.863079 0.415013

5000 rows × 22 columns

Train the regressor model with the <pd.DataFrame> generated above.

To use the regressor (Astraea.RFregressor) with default settings, input the <pd.DataFrame> combining all the features, the label, and the label error, along with the feature column names in a list, the label column name, and the label error column name. The default label column name is “Prot” and the default label error column name is “Prot_err”. Here we also use 3 estimators instead of the default 100, which (as described for Astraea.load_RF()) reduces the model bias at the cost of higher variance. Users can tune any of the hyperparameters described in https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html.

# train the model with default settings
regr, regr_outs = Astraea.RFregressor(X_y, ['X'+str(i) for i in range(np.shape(X)[1])],
                                      target_var='y', target_var_err='y_err', n_estimators=3)
Simpliest example:
 regr,regr_outs = RFregressor(df,testF)

Fraction of data used to train: 0.8
# of Features attempt to train: 20
Features attempt to train: ['X0', 'X1', 'X2', 'X3', 'X4', 'X5', 'X6', 'X7', 'X8', 'X9', 'X10', 'X11', 'X12', 'X13', 'X14', 'X15', 'X16', 'X17', 'X18', 'X19']
ID column not found, using index as ID!
5000 stars in dataframe!
5000 total stars used for RF!
4000 training stars!
Finished training! Making predictions!
Finished predicting! Calculating statistics!
Median Relative Error is: 0.06410001201586864
Average chi^2 is: 1016.3934151688192
Finished!

Print out the model statistics. For a description of the outputs, see https://astraea.readthedocs.io/en/latest/user/api.html.

regr_outs
importance    [0.12945057256415657, 0.1158549055147204, 0.13...
actrualF      [X0, X1, X2, X3, X4, X5, X6, X7, X8, X9, X10, ...
ID_train      [2209, 56, 4923, 3246, 2584, 2332, 3790, 1088,...
ID_test       [0, 5, 11, 14, 24, 26, 27, 29, 32, 34, 44, 46,...
prediction    [101.27825467776267, 106.76937748535295, 120.9...
ave_chi2                                                1016.39
MRE                                                      0.0641
X_test        [[0.6065293441789202, 0.06295926764117177, 0.7...
y_test        [98.64085252981158, 106.29389418685801, 131.77...
X_train       [[0.706831266028258, 0.14969010436464858, 0.06...
y_train       [106.74227914082933, 117.09292189496267, 141.1...
dtype: object
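
Because the coefficients decrease from X0 to X19, the impurity importances should roughly decrease in the same order. A quick sanity check:

# sort the features by importance (descending); expect roughly X0, X1, ..., X19
order = np.argsort(regr_outs['importance'])[::-1]
print([regr_outs['actrualF'][i] for i in order])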

Plot the feature importances and the predicted vs. true values

The function Astraea.plot_result plots the basic impurity-based feature importances in descending order. The article at https://towardsdatascience.com/explaining-feature-importance-by-example-of-a-random-forest-d9166011959e explains what impurity feature importance is and describes some other ways of determining importance that users can implement.

To use this function, users can pass the outputs from the Astraea.RFregressor function directly, or specify the required inputs themselves.

# plot cross-validation result
Astraea.plot_result(regr_outs['actrualF'], regr_outs['importance'], regr_outs['prediction'],
                    regr_outs['y_test'], labelName='y', MS=10)
[Figures: impurity feature importances; true vs. predicted values]

Use the trained model to predict a new label value

Users can now use the trained model to predict new label values based on the trained features. To do so, simply pass the feature values in order.

# generate new random data
X_test_matrix = np.random.rand(5000, 20)

# put into dataframe so we can call the feature names in order
X_test = pd.DataFrame(X_test_matrix, columns = ['X'+str(i) for i in range(np.shape(X)[1])])

# predict using the trained model
y_test = regr.predict(X_test[regr_outs['actrualF']])

y_test
array([ 93.65384246,  92.54149011, 103.50259242, ..., 120.4926051 ,
        86.8585833 , 102.70998368])

Train a classifier model and plot the Receiver operating characteristic (ROC) curve

Here is an example of training a classifier and plotting the ROC curve.

Train a classification model

The process is very similar to training a regressor. Users first need to generate a <pd.DataFrame> with all the features and labels and feed it to the model.

import Astraea
import pandas as pd
import numpy as np
from sklearn.datasets import make_classification

# use sklearn.datasets to generate a dataset to test
X, y = make_classification(n_samples=1000, n_features=4, n_informative=2, n_redundant=0,
                           random_state=0, shuffle=False)

# put features and labels into one pandas dataFrame
X_y = pd.DataFrame(np.hstack((X,np.reshape(y,(1000,1)))),
                   columns=np.append(['X'+str(i) for i in range(np.shape(X)[1])], ['y']))

# print out DataFrame
X_y
X0 X1 X2 X3 y
0 -1.668532 -1.299013 0.274647 -0.603620 0.0
1 -2.972883 -1.088783 0.708860 0.422819 0.0
2 -0.596141 -1.370070 -3.116857 0.644452 0.0
3 -1.068947 -1.175057 -1.913743 0.663562 0.0
4 -1.305269 -0.965926 -0.154072 1.193612 0.0
... ... ... ... ... ...
995 -0.383660 0.952012 -1.738332 0.707135 1.0
996 -0.120513 1.172387 0.030386 0.765002 1.0
997 0.917112 1.105966 0.867665 -2.256250 1.0
998 0.100277 1.458758 -0.443603 -0.670023 1.0
999 1.041523 -0.019871 0.152164 -1.940533 1.0

1000 rows × 5 columns

# train the model with default settings
regr, regr_outs = Astraea.RFclassifier(X_y, ['X'+str(i) for i in range(np.shape(X)[1])],
                                       target_var='y', n_jobs=1)
Simpliest example:
 regr,regr_outs = RFregressor(df,testF)

Fraction of data used to train: 0.8
# of Features attempt to train: 4
Features attempt to train: ['X0', 'X1', 'X2', 'X3']
ID column not found, using index as ID!
1000 stars in dataframe!
1000 total stars used for RF!
800 training stars!
Finished training! Making predictions!
Finished predicting!
Finished!

Plot the ROC curve

import sklearn.metrics as metrics
import matplotlib.pyplot as plt

# predict the probability for testing set using the trained model
probs = regr.predict_proba(regr_outs.X_test)
preds = probs[:,1]

# calculate the fpr and tpr for all thresholds of the classification
fpr, tpr, threshold = metrics.roc_curve(regr_outs.y_test, preds)

# get the accuracy
roc_auc = metrics.auc(fpr, tpr)

# plot the ROC curve
plt.figure(figsize=(10,8))
plt.title('Receiver Operating Characteristic',fontsize=25)
plt.plot(fpr, tpr, 'b', label = 'AUC = %0.2f' % roc_auc)
plt.legend(loc = 'lower right')
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0, 1])
plt.ylim([0, 1])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.tight_layout()
plt.savefig('ROC.png')
[Figure: ROC curve]

License & attribution

Copyright 2020, Yuxi Lu.

The source code is made available under the terms of the MIT license.

If you make use of this code, please cite this package and its dependencies. You can find more information about how and what to cite in the citation documentation (not yet complete).