Note

This tutorial was generated from an IPython notebook that can be downloaded here.

Predict rotation periods for Kepler stars using the existing model

Load the trained random forest models and predict rotation periods from provided features or light curves.

Calculate rotation period(s) from features

Below is a tutorial to calculate rotation periods for a single star or multiple stars using the existing model. Currently, this model has only been tested on stars in the Kepler field. To achieve the best results, the light curve statistics should be calculated from Kepler light curves. For any stars outside of the Kepler field, it is best to use the model with 1 estimator to minimize model bias.

import Astraea
import pandas as pd
import numpy as np

# print out needed features in order
TrainF_class, TrainF_reg = Astraea.getTrainF()

# load in existing testing data
KeplerTest = Astraea.load_KeplerTest()
>>> classification features are: ['LG_peaks [Lomb-Scargle peak height]', 'Rvar [ppm]', 'parallax [gaia]', 'radius_percentile_lower [gaia]', 'radius_percentile_upper [gaia]', 'phot_g_mean_flux_over_error [gaia]', 'bp_g [gaia]']


>>> regression features are: ['teff [gaia]','bp_g [gaia]','lum_val [gaia]','v_tan [getVs()]','phot_g_mean_flux_over_error [gaia]','v_b [getVs()]','radius_val [gaia]','b [gaia]','Rvar [ppm]','flicker [FLICKER]']

If the data contain all the features, first create a dictionary with the required columns, then convert it into a pd.DataFrame,

# construct pd.DataFrame that contains all the features
starStat = {'LG_peaks': KeplerTest.LG_peaks.values, 'Rvar': KeplerTest.Rvar.values,
          'parallax': KeplerTest.parallax.values,
          'radius_percentile_lower': KeplerTest.radius_percentile_lower.values,
          'radius_val': KeplerTest.radius_val.values,
          'radius_percentile_upper': KeplerTest.radius_percentile_upper.values,
          'phot_g_mean_flux_over_error': KeplerTest.phot_g_mean_flux_over_error.values,
          'bp_g': KeplerTest.bp_g.values, 'teff': KeplerTest.teff.values,
          'lum_val': KeplerTest.lum_val.values, 'v_tan':KeplerTest.v_tan.values,
          'v_b': KeplerTest.v_b.values, 'b': KeplerTest.b.values,
          'flicker':KeplerTest.flicker.values, 'Prot': KeplerTest.Prot.values,
          'Prot_err': KeplerTest.Prot_err.values}

# dictionary -> dataframe
star_data = pd.DataFrame(starStat)

# only display 6 columns
pd.set_option("display.max_columns", 6)

star_data
LG_peaks Rvar parallax ... flicker Prot Prot_err
0 0.589608 5053.753125 12.472423 ... 0.001139 22.736980 2.369760
1 0.162442 2232.375000 5.677683 ... 0.000471 5.480000 0.500000
2 0.654339 1021.548438 9.395085 ... 0.001802 28.779253 4.690020
3 0.434910 12080.606250 6.609612 ... 0.001328 2.002000 0.014000
4 0.301622 6154.762500 6.351729 ... 0.000479 4.733000 0.059000
... ... ... ... ... ... ... ...
200 5.709757 19928.750000 13.199997 ... 0.001050 9.939000 0.015000
201 0.610268 203.850000 5.504599 ... 0.001476 19.567000 1.062000
202 0.777997 169.729492 8.506015 ... 0.006036 30.843687 2.999815
203 0.644850 384.446289 9.621642 ... 0.008710 34.280000 0.083000
204 0.359622 47.170898 6.945455 ... 0.013822 41.021000 0.080000

205 rows × 16 columns

Plot correlations between features and rotation period,

Astraea.plot_corr(star_data,TrainF_reg,MS=10)
../_images/Tutorial_7_0.png

You can now feed the pd.DataFrame into Astraea.getKeplerProt to predict rotation periods,

predics = Astraea.getKeplerProt(star_data)
>>> classification features are: ['LG_peaks [Lomb-Scargle peak height]', 'Rvar [ppm]', 'parallax [gaia]', 'radius_percentile_lower [gaia]', 'radius_percentile_upper [gaia]', 'phot_g_mean_flux_over_error [gaia]', 'bp_g [gaia]']


>>> regression features are: ['teff [gaia]','bp_g [gaia]','lum_val [gaia]','v_tan [getVs()]','phot_g_mean_flux_over_error [gaia]','v_b [getVs()]','radius_val [gaia]','b [gaia]','Rvar [ppm]','flicker [FLICKER]']


Total 205 stars!
Classifing 205 stars!
205.0 stars have predictable rotation periods (100.0%)
Predicting rotation periods!
Finished!
predics
True Prot True Prot_err Prot prediction w/ 1 est Prot prediction w/ 100 est
0 22.736980 2.369760 18.041 16.096785
1 5.480000 0.500000 4.138 4.644040
2 28.779253 4.690020 22.334 18.535857
3 2.002000 0.014000 9.548 5.751400
4 4.733000 0.059000 4.138 3.841780
... ... ... ... ...
200 9.939000 0.015000 6.426 11.572000
201 19.567000 1.062000 13.459 7.922410
202 30.843687 2.999815 35.814 15.489082
203 34.280000 0.083000 34.239 19.050126
204 41.021000 0.080000 18.347 23.735942

205 rows × 4 columns
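As noted at the start of this tutorial, the single-estimator predictions may be preferable for stars outside the Kepler field. A minimal sketch of selecting them, assuming the column labels match the header printed above,

# select the single-estimator predictions (column names as printed above)
prot_1est = predics['Prot prediction w/ 1 est']

# residuals against the catalogue rotation periods
residuals = prot_1est - predics['True Prot']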

Download a light curve and calculate the variability, Lomb-Scargle peak and flicker needed for the trained model

Here is a basic tutorial for calculating the light curve statistics needed for the trained model. Other statistics can be found by cross-matching any Kepler star with Gaia (a useful Kepler-Gaia crossmatch by Megan Bedell is available at https://gaia-kepler.fun). v_tan and v_b can be calculated after crossmatching Kepler with Gaia using the function Astraea.getVs().
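Astraea.getVs() performs this calculation; its exact call signature is described in the API documentation. Purely for orientation, here is a rough, hypothetical sketch of how the tangential velocity and the velocity along Galactic latitude are typically derived from Gaia astrometry (this is not Astraea's implementation, and the input values below are placeholders),

import numpy as np
import astropy.units as u
from astropy.coordinates import SkyCoord

# placeholder Gaia astrometry for one star (replace with crossmatched columns):
# ra, dec in deg; pmra, pmdec in mas/yr; parallax in mas
ra, dec = 291.0, 37.7
pmra, pmdec = -5.2, -8.1
parallax = 9.4

coords = SkyCoord(ra=ra*u.deg, dec=dec*u.deg,
                  distance=(1000.0/parallax)*u.pc,
                  pm_ra_cosdec=pmra*u.mas/u.yr, pm_dec=pmdec*u.mas/u.yr)

# tangential velocity in km/s: v_tan = 4.74 * mu_total [mas/yr] / parallax [mas]
v_tan = 4.74047 * np.hypot(pmra, pmdec) / parallax

# velocity along Galactic latitude, from the proper motion in b
pm_b = coords.galactic.pm_b.to(u.mas/u.yr).value
v_b = 4.74047 * pm_b / parallax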

lightkurve is used to download the light curve (https://docs.lightkurve.org).

import Astraea
import pandas as pd
import numpy as np
from lightkurve import search_targetpixelfile

# download light curve and plot it
tpf = search_targetpixelfile('KIC 2157356', quarter=9).download()
lc = tpf.to_lightcurve(aperture_mask=tpf.pipeline_mask)
lc.plot()
<matplotlib.axes._subplots.AxesSubplot at 0x1a198a6b70>
../_images/Tutorial_12_1.png

Normalize the light curve and get time, flux and flux error,

t = lc.time  # time in days
sig = (lc.flux/np.median(lc.flux) - 1)*1e6  # median-normalized flux in ppm
sig_err = lc.flux_err/np.median(lc.flux)*1e6  # flux error in ppm

Get the variability of the light curve (Rvar),

Rvar = Astraea.getRvar(sig)
Rvar
42303.15312499995
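For reference, Rvar is commonly defined as the range between the 5th and 95th percentiles of the normalized flux; a minimal sketch of that convention (Astraea.getRvar may differ in detail),

# illustrative only: Rvar as the 5th-to-95th percentile range of the flux (ppm)
Rvar_manual = np.percentile(sig, 95) - np.percentile(sig, 5)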

Get Lomb-Scargle peak height (LG_peaks),

LG_Prot, LG_peaks = Astraea.getLGpeak(t,sig,sig_err)
LG_Prot, LG_peaks
(13.275704451936583, 0.20658748911841823)
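For comparison, a similar peak can be obtained directly with astropy's Lomb-Scargle implementation; this is only a sketch, and Astraea.getLGpeak may use a different frequency grid or normalization,

from astropy.timeseries import LombScargle

# standard Lomb-Scargle periodogram from time, flux and flux error
frequency, power = LombScargle(t, sig, sig_err).autopower()

# period (days) of the highest peak and its height
LS_Prot_manual = 1.0/frequency[np.argmax(power)]
LS_peak_manual = power.max()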

Get the flicker value (flicker). This will download FLICKER if it is not already installed,

flicker = Astraea.getFlicker(t,sig)
flicker
12609.067447067864

Train a regressor model and test its performance

Here is a simple example that trains a regressor model and tests its performance by calculating \(\chi^2\) and the median relative error, plotting the impurity feature importance, and plotting the true vs. predicted values from the cross-validation test.
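The exact definitions Astraea uses are given in the API documentation; for orientation, commonly used forms of these statistics are \(\chi^2 = \frac{1}{N}\sum_i (\hat{y}_i - y_i)^2/\sigma_{y,i}^2\) and \(\mathrm{MRE} = \mathrm{median}\left(|\hat{y}_i - y_i|/y_i\right)\), where \(y_i\) are the true labels, \(\hat{y}_i\) the cross-validation predictions and \(\sigma_{y,i}\) the label errors.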

Train a regressor model

Generate a pd.DataFrame of random features and labels to test Astraea.RFregressor.

Normally, the user will create a pd.DataFrame that includes all the features and labels. Below is an example of creating a DataFrame with 20 feature columns of random numbers, named “\(X0\)”, “\(X1\)”, …, “\(X19\)”, a label column that is a linear combination of the features, named “\(y\)”, and a randomly generated label error column named “\(y\_err\)”. Note that for this problem the coefficients of the linear combination are in decreasing order, so that \(y = 20*X0 + 19*X1 + \dots + 1*X19\). By doing so, we can validate the feature importance output later on.

import Astraea
import pandas as pd
import numpy as np

# create random feature matrix with 20 features and 5000 total data points
X = np.random.rand(5000, 20)

# create labels from features
y = sum([X[:,i] * (20-i) for i in range(np.shape(X)[1])])

# put features and labels into one pandas dataFrame
X_y = pd.DataFrame(np.hstack((X,np.reshape(y, (5000, 1)))),
                   columns = np.append(['X'+str(i) for i in range(np.shape(X)[1])], ['y']))

# assign random errors
X_y['y_err'] = np.random.rand(5000)

# only display 9 columns
pd.set_option("display.max_columns", 9)

X_y
X0 X1 X2 X3 ... X18 X19 y y_err
0 0.606529 0.062959 0.720791 0.981319 ... 0.635287 0.644116 98.640853 0.159159
1 0.777415 0.010855 0.785878 0.399668 ... 0.335854 0.830296 120.593567 0.662260
2 0.702256 0.269779 0.331504 0.521144 ... 0.611254 0.015499 103.314942 0.199658
3 0.068799 0.276065 0.507314 0.956416 ... 0.007095 0.035967 87.488786 0.440642
4 0.941370 0.447072 0.666370 0.061702 ... 0.768792 0.505517 98.850524 0.571305
... ... ... ... ... ... ... ... ... ...
4995 0.850935 0.866016 0.912287 0.974070 ... 0.411835 0.353871 114.993805 0.380209
4996 0.387948 0.155788 0.178811 0.680044 ... 0.327828 0.442220 90.433563 0.173283
4997 0.927812 0.779648 0.412766 0.406887 ... 0.534086 0.906392 114.112874 0.505117
4998 0.791832 0.091674 0.730668 0.880411 ... 0.363859 0.788126 126.543979 0.753451
4999 0.954219 0.475264 0.029344 0.419582 ... 0.384495 0.934769 95.863079 0.415013

5000 rows × 22 columns

Train the regressor model with the pd.DataFrame generated above.

To use the regressor (Astraea.RFregressor) with the default settings, pass in the pd.DataFrame combining all the features, the label and the label error, the feature column names as a list, the label column name and the label error column name. The default label column name is “Prot” and the default label error column name is “Prot_err”. Here we also use 3 estimators instead of the default of 100 estimators to minimize the variance. The user can tune any of the hyper-parameters described in https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html.

# train the model with default settings
regr, regr_outs = Astraea.RFregressor(X_y, ['X'+str(i) for i in range(np.shape(X)[1])],
                                      target_var='y', target_var_err='y_err', n_estimators=3)
Simpliest example:
 regr,regr_outs = RFregressor(df,testF)

Fraction of data used to train: 0.8
# of Features attempt to train: 20
Features attempt to train: ['X0', 'X1', 'X2', 'X3', 'X4', 'X5', 'X6', 'X7', 'X8', 'X9', 'X10', 'X11', 'X12', 'X13', 'X14', 'X15', 'X16', 'X17', 'X18', 'X19']
ID column not found, using index as ID!
5000 stars in dataframe!
5000 total stars used for RF!
4000 training stars!
Finished training! Making predictions!
Finished predicting! Calculating statistics!
Median Relative Error is: 0.06410001201586864
Average chi^2 is: 1016.3934151688192
Finished!
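As a hypothetical example of tuning a further hyper-parameter, and assuming extra keyword arguments are forwarded to sklearn's RandomForestRegressor (as n_estimators is above), a call limiting the tree depth might look like this,

# hypothetical: pass an extra sklearn hyper-parameter (max_depth), assuming
# keyword arguments are forwarded to sklearn.ensemble.RandomForestRegressor
regr_shallow, regr_shallow_outs = Astraea.RFregressor(
    X_y, ['X'+str(i) for i in range(np.shape(X)[1])],
    target_var='y', target_var_err='y_err', n_estimators=3, max_depth=10)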

Print out the model statistics. See https://astraea.readthedocs.io/en/latest/user/api.html for a description of the outputs.

regr_outs
importance    [0.12945057256415657, 0.1158549055147204, 0.13...
actrualF      [X0, X1, X2, X3, X4, X5, X6, X7, X8, X9, X10, ...
ID_train      [2209, 56, 4923, 3246, 2584, 2332, 3790, 1088,...
ID_test       [0, 5, 11, 14, 24, 26, 27, 29, 32, 34, 44, 46,...
prediction    [101.27825467776267, 106.76937748535295, 120.9...
ave_chi2                                                1016.39
MRE                                                      0.0641
X_test        [[0.6065293441789202, 0.06295926764117177, 0.7...
y_test        [98.64085252981158, 106.29389418685801, 131.77...
X_train       [[0.706831266028258, 0.14969010436464858, 0.06...
y_train       [106.74227914082933, 117.09292189496267, 141.1...
dtype: object
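Because the labels were constructed with decreasing coefficients, the impurity importances should roughly follow the column order, with X0 near the top. A quick check using the 'actrualF' and 'importance' entries shown above,

# rank features by impurity importance; X0 should come out near the top
importance = np.array(regr_outs['importance'])
features = np.array(regr_outs['actrualF'])
for i in np.argsort(importance)[::-1]:
    print(features[i], round(importance[i], 3))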

Plot the feature importances and the predicted vs. true values

The user can use the function Astraea.plot_result to plot the basic impurity feature importance in descending order. https://towardsdatascience.com/explaining-feature-importance-by-example-of-a-random-forest-d9166011959e explains what impurity feature importance is, along with other ways of determining importance that the user can implement (a permutation-importance sketch follows the plots below).

To use this function, the user can either pass in the outputs from Astraea.RFregressor directly or specify the required inputs.

# plot cross-validation result
Astraea.plot_result(regr_outs['actrualF'], regr_outs['importance'], regr_outs['prediction'],
                    regr_outs['y_test'], labelName='y', MS=10)
../_images/Tutorial_30_0.png ../_images/Tutorial_30_1.png
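One alternative importance measure the user can implement is scikit-learn's permutation importance, evaluated here on the held-out test set stored in regr_outs; a minimal sketch (assuming X_test and y_test can be cast to plain arrays),

from sklearn.inspection import permutation_importance

# permutation importance on the held-out test set
perm = permutation_importance(regr, np.array(regr_outs['X_test']),
                              np.array(regr_outs['y_test']),
                              n_repeats=10, random_state=0)

# mean decrease in score per feature, highest first
for i in np.argsort(perm.importances_mean)[::-1]:
    print(regr_outs['actrualF'][i], round(perm.importances_mean[i], 3))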

Use the trained model to predict a new label value

The user can now use the trained model to predict new label values from the trained features. To do so, simply pass the feature values in order.

# generate new random data
X_test_matrix = np.random.rand(5000, 20)

# put into dataframe so we can call the feature names in order
X_test = pd.DataFrame(X_test_matrix, columns = ['X'+str(i) for i in range(np.shape(X)[1])])

# predict using the trained model
y_test = regr.predict(X_test[regr_outs['actrualF']])

y_test
array([ 93.65384246,  92.54149011, 103.50259242, ..., 120.4926051 ,
        86.8585833 , 102.70998368])
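Because the labels in this toy problem are a known linear combination of the features, we can sanity-check the new predictions against the exact values (a quick extra check, not part of the original workflow),

# exact labels for the new random data, using the same construction as before
y_true_new = sum([X_test_matrix[:,i] * (20-i) for i in range(np.shape(X_test_matrix)[1])])

# median relative error of the predictions on the new data
print(np.median(np.abs(y_test - y_true_new)/y_true_new))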

Train a classifier model and plot the Receiver Operating Characteristic (ROC) curve

Here is an example to train a classifier and plot the ROC curve.

Train a classification model

The process is very similar to training a regressor. The user first needs to generate a pd.DataFrame with all the features and labels and feed it to the model.

import Astraea
import pandas as pd
import numpy as np
from sklearn.datasets import make_classification

# use sklearn.datasets to generate a dataset to test
X, y = make_classification(n_samples=1000, n_features=4, n_informative=2, n_redundant=0,
                           random_state=0, shuffle=False)

# put features and labels into one pandas dataFrame
X_y = pd.DataFrame(np.hstack((X,np.reshape(y,(1000,1)))),
                   columns=np.append(['X'+str(i) for i in range(np.shape(X)[1])], ['y']))

# print out DataFrame
X_y
X0 X1 X2 X3 y
0 -1.668532 -1.299013 0.274647 -0.603620 0.0
1 -2.972883 -1.088783 0.708860 0.422819 0.0
2 -0.596141 -1.370070 -3.116857 0.644452 0.0
3 -1.068947 -1.175057 -1.913743 0.663562 0.0
4 -1.305269 -0.965926 -0.154072 1.193612 0.0
... ... ... ... ... ...
995 -0.383660 0.952012 -1.738332 0.707135 1.0
996 -0.120513 1.172387 0.030386 0.765002 1.0
997 0.917112 1.105966 0.867665 -2.256250 1.0
998 0.100277 1.458758 -0.443603 -0.670023 1.0
999 1.041523 -0.019871 0.152164 -1.940533 1.0

1000 rows × 5 columns

# train the model with default settings
regr, regr_outs = Astraea.RFclassifier(X_y, ['X'+str(i) for i in range(np.shape(X)[1])],
                                       target_var='y', n_jobs=1)
Simpliest example:
 regr,regr_outs = RFregressor(df,testF)

Fraction of data used to train: 0.8
# of Features attempt to train: 4
Features attempt to train: ['X0', 'X1', 'X2', 'X3']
ID column not found, using index as ID!
1000 stars in dataframe!
1000 total stars used for RF!
800 training stars!
Finished training! Making predictions!
Finished predicting!
Finished!

Plot the ROC curve

import sklearn.metrics as metrics
import matplotlib.pyplot as plt

# predict the probability for testing set using the trained model
probs = regr.predict_proba(regr_outs.X_test)
preds = probs[:,1]

# calculate the fpr and tpr for all thresholds of the classification
fpr, tpr, threshold = metrics.roc_curve(regr_outs.y_test, preds)

# get the accuracy
roc_auc = metrics.auc(fpr, tpr)

# plot the ROC curve
plt.figure(figsize=(10,8))
plt.title('Receiver Operating Characteristic',fontsize=25)
plt.plot(fpr, tpr, 'b', label = 'AUC = %0.2f' % roc_auc)
plt.legend(loc = 'lower right')
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0, 1])
plt.ylim([0, 1])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.tight_layout()
plt.savefig('ROC.png')
../_images/Tutorial_39_0.png
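Beyond the ROC curve, the same held-out set gives a quick overall accuracy number (a small extra check, not part of the original tutorial),

# hard class predictions and accuracy on the held-out test set
class_preds = regr.predict(regr_outs.X_test)
print(metrics.accuracy_score(regr_outs.y_test, class_preds))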