Note
This tutorial was generated from an IPython notebook that can be downloaded here.
Predict rotation period for Kepler stars using existing model¶
Load trained random forest models and predict rotation periods from provided features or light curve(s).
Calculate rotation period(s) from features¶
Below is a tutorial to calculate rotation period(s) for single or multiple stars using the existing model. Currently, this model has only been tested on stars in the Kepler field. To achieve the best results, the light curve statistics should be calculated from Kepler light curves. For any stars outside of the Kepler field, it is best to use the model with 1 estimator to minimize model bias.
import Astraea
import pandas as pd
import numpy as np
# print out needed features in order
TrainF_class, TrainF_reg = Astraea.getTrainF()
# load in existing testing data
KeplerTest = Astraea.load_KeplerTest()
>>> classification features are: ['LG_peaks [Lomb-Scargle peak height]', 'Rvar [ppm]', 'parallax [gaia]', 'radius_percentile_lower [gaia]', 'radius_percentile_upper [gaia]', 'phot_g_mean_flux_over_error [gaia]', 'bp_g [gaia]']
>>> regression features are: ['teff [gaia]','bp_g [gaia]','lum_val [gaia]','v_tan [getVs()]','phot_g_mean_flux_over_error [gaia]','v_b [getVs()]','radius_val [gaia]','b [gaia]','Rvar [ppm]','flicker [FLICKER]']
If the data contains all of the features, first create a dictionary with the required columns, then convert it into a <pd.DataFrame>,
# construct pd.DataFrame that contains all the features
starStat = {'LG_peaks': KeplerTest.LG_peaks.values, 'Rvar': KeplerTest.Rvar.values,
'parallax': KeplerTest.parallax.values,
'radius_percentile_lower': KeplerTest.radius_percentile_lower.values,
'radius_val': KeplerTest.radius_val.values,
'radius_percentile_upper': KeplerTest.radius_percentile_upper.values,
'phot_g_mean_flux_over_error': KeplerTest.phot_g_mean_flux_over_error.values,
'bp_g': KeplerTest.bp_g.values, 'teff': KeplerTest.teff.values,
'lum_val': KeplerTest.lum_val.values, 'v_tan':KeplerTest.v_tan.values,
'v_b': KeplerTest.v_b.values, 'b': KeplerTest.b.values,
'flicker':KeplerTest.flicker.values, 'Prot': KeplerTest.Prot.values,
'Prot_err': KeplerTest.Prot_err.values}
# dictionary -> dataframe
star_data = pd.DataFrame(starStat)
# only display 6 columns
pd.set_option("display.max_columns", 6)
star_data
  | LG_peaks | Rvar | parallax | ... | flicker | Prot | Prot_err
---|---|---|---|---|---|---|---
0 | 0.589608 | 5053.753125 | 12.472423 | ... | 0.001139 | 22.736980 | 2.369760 |
1 | 0.162442 | 2232.375000 | 5.677683 | ... | 0.000471 | 5.480000 | 0.500000 |
2 | 0.654339 | 1021.548438 | 9.395085 | ... | 0.001802 | 28.779253 | 4.690020 |
3 | 0.434910 | 12080.606250 | 6.609612 | ... | 0.001328 | 2.002000 | 0.014000 |
4 | 0.301622 | 6154.762500 | 6.351729 | ... | 0.000479 | 4.733000 | 0.059000 |
... | ... | ... | ... | ... | ... | ... | ... |
200 | 5.709757 | 19928.750000 | 13.199997 | ... | 0.001050 | 9.939000 | 0.015000 |
201 | 0.610268 | 203.850000 | 5.504599 | ... | 0.001476 | 19.567000 | 1.062000 |
202 | 0.777997 | 169.729492 | 8.506015 | ... | 0.006036 | 30.843687 | 2.999815 |
203 | 0.644850 | 384.446289 | 9.621642 | ... | 0.008710 | 34.280000 | 0.083000 |
204 | 0.359622 | 47.170898 | 6.945455 | ... | 0.013822 | 41.021000 | 0.080000 |
205 rows × 16 columns
Plot correlations between features and rotation period,
Astraea.plot_corr(star_data,TrainF_reg,MS=10)

You can now feed the <pd.DataFrame> into Astraea.getKeplerProt to predict rotation periods,
predics = Astraea.getKeplerProt(star_data)
>>> classification features are: ['LG_peaks [Lomb-Scargle peak height]', 'Rvar [ppm]', 'parallax [gaia]', 'radius_percentile_lower [gaia]', 'radius_percentile_upper [gaia]', 'phot_g_mean_flux_over_error [gaia]', 'bp_g [gaia]']
>>> regression features are: ['teff [gaia]','bp_g [gaia]','lum_val [gaia]','v_tan [getVs()]','phot_g_mean_flux_over_error [gaia]','v_b [getVs()]','radius_val [gaia]','b [gaia]','Rvar [ppm]','flicker [FLICKER]']
Total 205 stars!
Classifing 205 stars!
205.0 stars have predictable rotation periods (100.0%)
Predicting rotation periods!
Finished!
predics
  | True Prot | True Prot_err | Prot prediction w/ 1 est | Prot prediction w/ 100 est
---|---|---|---|---
0 | 22.736980 | 2.369760 | 18.041 | 16.096785 |
1 | 5.480000 | 0.500000 | 4.138 | 4.644040 |
2 | 28.779253 | 4.690020 | 22.334 | 18.535857 |
3 | 2.002000 | 0.014000 | 9.548 | 5.751400 |
4 | 4.733000 | 0.059000 | 4.138 | 3.841780 |
... | ... | ... | ... | ... |
200 | 9.939000 | 0.015000 | 6.426 | 11.572000 |
201 | 19.567000 | 1.062000 | 13.459 | 7.922410 |
202 | 30.843687 | 2.999815 | 35.814 | 15.489082 |
203 | 34.280000 | 0.083000 | 34.239 | 19.050126 |
204 | 41.021000 | 0.080000 | 18.347 | 23.735942 |
205 rows × 4 columns
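As a quick sanity check, you can compare the predictions against the catalogue periods in the same table. Below is a minimal sketch, using the column names shown above, that computes the median relative error of the 100-estimator predictions,
# median relative error of the 100-estimator predictions vs. the catalogue periods
rel_err = np.abs(predics['Prot prediction w/ 100 est'] - predics['True Prot'])/predics['True Prot']
np.median(rel_err)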
Download a light curve and calculate the variability, Lomb-Scargle peak and flicker needed for the trained model¶
Here is a basic tutorial for calculating the light curve statistics needed for the trained model. The other statistics can be found by cross-matching any Kepler stars with Gaia (a useful site by Megan Bedell for crossmatching Kepler and Gaia: https://gaia-kepler.fun). v_tan and v_b can be calculated after crossmatching Kepler with Gaia, using the function Astraea.getVs().
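For reference, the tangential velocity itself follows from a simple relation between proper motion and parallax, \(v_{tan} = 4.74\,\mu/\varpi\) km/s with \(\mu\) in mas/yr and \(\varpi\) in mas. The sketch below only illustrates that relation with hypothetical Gaia values; use Astraea.getVs() to obtain the actual v_tan and v_b inputs for the model,
import numpy as np
# hypothetical Gaia astrometry for one star (Gaia pmra already includes the cos(dec) factor)
pmra, pmdec = 10.5, -3.2  # proper motions [mas/yr]
parallax = 6.3            # parallax [mas]
mu = np.hypot(pmra, pmdec)  # total proper motion [mas/yr]
v_tan = 4.74*mu/parallax    # tangential velocity [km/s]
v_tan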
lightkurve is used to download the light curve (https://docs.lightkurve.org).
import Astraea
import pandas as pd
import numpy as np
from lightkurve import search_targetpixelfile
# download light curve and plot it
tpf = search_targetpixelfile('KIC 2157356', quarter=9).download()
lc = tpf.to_lightcurve(aperture_mask=tpf.pipeline_mask)
lc.plot()
<matplotlib.axes._subplots.AxesSubplot at 0x1a198a6b70>

Normalize the light curve and get time, flux and flux error,
t = lc.time # get time in days
sig = (lc.flux/np.median(lc.flux) - 1)*1e6 # get normalized flux in ppm
sig_err = lc.flux_err/np.median(lc.flux)*1e6 # get normalized flux error in ppm
Get the variability of the light curve (Rvar),
Rvar = Astraea.getRvar(sig)
Rvar
42303.15312499995
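For context, Rvar is commonly defined as the range of the flux between its 5th and 95th percentiles; a one-line version of that definition (which may differ in detail from what Astraea.getRvar implements) is
# 5th-to-95th percentile range of the flux, in ppm
np.percentile(sig, 95) - np.percentile(sig, 5)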
Get Lomb-Scargle peak height (LG_peaks),
LG_Prot, LG_peaks = Astraea.getLGpeak(t,sig,sig_err)
LG_Prot, LG_peaks
(13.275704451936583, 0.20658748911841823)
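For comparison, a rough equivalent using astropy's Lomb-Scargle implementation (not necessarily the exact frequency grid or normalization used inside Astraea.getLGpeak) is
from astropy.timeseries import LombScargle
# build the periodogram and pick out the highest peak
frequency, power = LombScargle(t, sig, sig_err).autopower()
best = np.argmax(power)
1.0/frequency[best], power[best]  # period [days] and peak height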
Get the flicker value (flicker). This will download FLICKER if it is not already installed,
flicker = Astraea.getFlicker(t,sig)
flicker
12609.067447067864
Train a regressor model and test its performance¶
Here is a simple example that trains a regressor model and tests its performance by calculating \(\chi^2\) and the median relative error, plotting the impurity feature importance, and plotting the true vs. predicted values from the cross-validation test.
Train a regressor model¶
Generate a <pd.DataFrame> of random features and labels to test Astraea.RFregressor.
Normally the user will create a <pd.DataFrame> that includes all the features and labels. Below is an example of creating a DataFrame with 20 feature columns of random numbers, named “X0”, “X1”, …, “X19”, a label column named “y” that is a linear combination of the features, and a randomly generated label error column named “y_err”. Note that for this problem the coefficients of the linear combination are in decreasing order, so that \(y = 20*X0 + 19*X1 + ... + 1*X19\). By doing so, we can validate the feature importance output later on.
import Astraea
import pandas as pd
import numpy as np
# create random feature matrix with 20 features and 5000 total data points
X = np.random.rand(5000, 20)
# create labels from features
y = sum([X[:,i] * (20-i) for i in range(np.shape(X)[1])])
# put features and labels into one pandas dataFrame
X_y = pd.DataFrame(np.hstack((X,np.reshape(y, (5000, 1)))),
columns = np.append(['X'+str(i) for i in range(np.shape(X)[1])], ['y']))
# assign random errors
X_y['y_err'] = np.random.rand(5000)
# only display 9 columns
pd.set_option("display.max_columns", 9)
X_y
  | X0 | X1 | X2 | X3 | ... | X18 | X19 | y | y_err
---|---|---|---|---|---|---|---|---|---
0 | 0.606529 | 0.062959 | 0.720791 | 0.981319 | ... | 0.635287 | 0.644116 | 98.640853 | 0.159159 |
1 | 0.777415 | 0.010855 | 0.785878 | 0.399668 | ... | 0.335854 | 0.830296 | 120.593567 | 0.662260 |
2 | 0.702256 | 0.269779 | 0.331504 | 0.521144 | ... | 0.611254 | 0.015499 | 103.314942 | 0.199658 |
3 | 0.068799 | 0.276065 | 0.507314 | 0.956416 | ... | 0.007095 | 0.035967 | 87.488786 | 0.440642 |
4 | 0.941370 | 0.447072 | 0.666370 | 0.061702 | ... | 0.768792 | 0.505517 | 98.850524 | 0.571305 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
4995 | 0.850935 | 0.866016 | 0.912287 | 0.974070 | ... | 0.411835 | 0.353871 | 114.993805 | 0.380209 |
4996 | 0.387948 | 0.155788 | 0.178811 | 0.680044 | ... | 0.327828 | 0.442220 | 90.433563 | 0.173283 |
4997 | 0.927812 | 0.779648 | 0.412766 | 0.406887 | ... | 0.534086 | 0.906392 | 114.112874 | 0.505117 |
4998 | 0.791832 | 0.091674 | 0.730668 | 0.880411 | ... | 0.363859 | 0.788126 | 126.543979 | 0.753451 |
4999 | 0.954219 | 0.475264 | 0.029344 | 0.419582 | ... | 0.384495 | 0.934769 | 95.863079 | 0.415013 |
5000 rows × 22 columns
Train the regressor model with the <pd.DataFrame> generated above.
To use the regressor (Astraea.RFregressor) with default settings, pass in the <pd.DataFrame> combining all the features, the label and the label error, together with the feature column names in a list, the label column name and the label error column name. The default label column name is “Prot” and the default label error column name is “Prot_err”. Here we also use 3 estimators instead of the default of 100, which keeps this example fast. The user can tune any of the hyper-parameters described in https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html.
# train the model with default settings
regr, regr_outs = Astraea.RFregressor(X_y, ['X'+str(i) for i in range(np.shape(X)[1])],
target_var='y', target_var_err='y_err', n_estimators=3)
Simpliest example:
regr,regr_outs = RFregressor(df,testF)
Fraction of data used to train: 0.8
# of Features attempt to train: 20
Features attempt to train: ['X0', 'X1', 'X2', 'X3', 'X4', 'X5', 'X6', 'X7', 'X8', 'X9', 'X10', 'X11', 'X12', 'X13', 'X14', 'X15', 'X16', 'X17', 'X18', 'X19']
ID column not found, using index as ID!
5000 stars in dataframe!
5000 total stars used for RF!
4000 training stars!
Finished training! Making predictions!
Finished predicting! Calculating statistics!
Median Relative Error is: 0.06410001201586864
Average chi^2 is: 1016.3934151688192
Finished!
Print out the model statistics. For a description of the outputs, see https://astraea.readthedocs.io/en/latest/user/api.html.
regr_outs
importance [0.12945057256415657, 0.1158549055147204, 0.13...
actrualF [X0, X1, X2, X3, X4, X5, X6, X7, X8, X9, X10, ...
ID_train [2209, 56, 4923, 3246, 2584, 2332, 3790, 1088,...
ID_test [0, 5, 11, 14, 24, 26, 27, 29, 32, 34, 44, 46,...
prediction [101.27825467776267, 106.76937748535295, 120.9...
ave_chi2 1016.39
MRE 0.0641
X_test [[0.6065293441789202, 0.06295926764117177, 0.7...
y_test [98.64085252981158, 106.29389418685801, 131.77...
X_train [[0.706831266028258, 0.14969010436464858, 0.06...
y_train [106.74227914082933, 117.09292189496267, 141.1...
dtype: object
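Since the label was built with coefficients that decrease from X0 to X19, the impurity feature importances should roughly follow the same ordering. A quick check using the outputs above,
# rank the features by impurity importance, highest first
ranked = sorted(zip(regr_outs['actrualF'], regr_outs['importance']),
                key=lambda pair: pair[1], reverse=True)
ranked[:5]  # should be dominated by X0, X1, X2, ...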
Plot the feature importances and the predicted vs. true values¶
The user can use the function Astraea.plot_result to plot the basic impurity feature importance in descending order. https://towardsdatascience.com/explaining-feature-importance-by-example-of-a-random-forest-d9166011959e explains what impurity feature importance is, along with some other ways of determining importance that the user can implement.
To use this function, the user can either pass in the outputs from the Astraea.RFregressor function directly or specify the required inputs.
# plot cross-validation result
Astraea.plot_result(regr_outs['actrualF'], regr_outs['importance'], regr_outs['prediction'],
regr_outs['y_test'], labelName='y', MS=10)


Use the trained model to predict a new label value¶
The user can now use the trained model to predict new label values based on the trained features. To do so, simply pass in the feature values in order.
# generate new random data
X_test_matrix = np.random.rand(5000, 20)
# put into dataframe so we can call the feature names in order
X_test = pd.DataFrame(X_test_matrix, columns = ['X'+str(i) for i in range(np.shape(X)[1])])
# predict using the trained model
y_test = regr.predict(X_test[regr_outs['actrualF']])
y_test
array([ 93.65384246, 92.54149011, 103.50259242, ..., 120.4926051 ,
86.8585833 , 102.70998368])
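Because the label in this toy problem is just a fixed linear combination of the features, you can also compute the true values for the new random data and check the predictions directly,
# true labels for the new random features, from the same linear combination as before
y_true = sum([X_test_matrix[:,i]*(20-i) for i in range(np.shape(X_test_matrix)[1])])
# median relative error of the predictions
np.median(np.abs(y_test - y_true)/y_true)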
Train a classifier model and plot the Receiver operating characteristic (ROC) curve¶
Here is an example that trains a classifier and plots the ROC curve.
Train a classification model¶
The process is very similar to training a regressor. The user first needs to generate a <pd.DataFrame> with all the features and labels and feed it to the model.
import Astraea
import pandas as pd
import numpy as np
from sklearn.datasets import make_classification
# use sklearn.datasets to generate a dataset to test
X, y = make_classification(n_samples=1000, n_features=4, n_informative=2, n_redundant=0,
random_state=0, shuffle=False)
# put features and labels into one pandas dataFrame
X_y = pd.DataFrame(np.hstack((X,np.reshape(y,(1000,1)))),
columns=np.append(['X'+str(i) for i in range(np.shape(X)[1])], ['y']))
# print out DataFrame
X_y
  | X0 | X1 | X2 | X3 | y
---|---|---|---|---|---
0 | -1.668532 | -1.299013 | 0.274647 | -0.603620 | 0.0 |
1 | -2.972883 | -1.088783 | 0.708860 | 0.422819 | 0.0 |
2 | -0.596141 | -1.370070 | -3.116857 | 0.644452 | 0.0 |
3 | -1.068947 | -1.175057 | -1.913743 | 0.663562 | 0.0 |
4 | -1.305269 | -0.965926 | -0.154072 | 1.193612 | 0.0 |
... | ... | ... | ... | ... | ... |
995 | -0.383660 | 0.952012 | -1.738332 | 0.707135 | 1.0 |
996 | -0.120513 | 1.172387 | 0.030386 | 0.765002 | 1.0 |
997 | 0.917112 | 1.105966 | 0.867665 | -2.256250 | 1.0 |
998 | 0.100277 | 1.458758 | -0.443603 | -0.670023 | 1.0 |
999 | 1.041523 | -0.019871 | 0.152164 | -1.940533 | 1.0 |
1000 rows × 5 columns
# train the model with default settings
regr, regr_outs = Astraea.RFclassifier(X_y, ['X'+str(i) for i in range(np.shape(X)[1])],
target_var='y', n_jobs=1)
Simpliest example:
regr,regr_outs = RFregressor(df,testF)
Fraction of data used to train: 0.8
# of Features attempt to train: 4
Features attempt to train: ['X0', 'X1', 'X2', 'X3']
ID column not found, using index as ID!
1000 stars in dataframe!
1000 total stars used for RF!
800 training stars!
Finished training! Making predictions!
Finished predicting!
Finished!
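Before plotting the ROC curve, you can also check the overall accuracy on the held-out test set; assuming the returned regr is a standard scikit-learn classifier (it is used with predict_proba below), its score method gives the mean accuracy,
# mean accuracy of the classifier on the held-out test set
regr.score(regr_outs.X_test, regr_outs.y_test)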
Plot the ROC curve¶
import sklearn.metrics as metrics
import matplotlib.pyplot as plt
# predict the probability for testing set using the trained model
probs = regr.predict_proba(regr_outs.X_test)
preds = probs[:,1]
# calculate the fpr and tpr for all thresholds of the classification
fpr, tpr, threshold = metrics.roc_curve(regr_outs.y_test, preds)
# compute the area under the ROC curve (AUC)
roc_auc = metrics.auc(fpr, tpr)
# plot the ROC curve
plt.figure(figsize=(10,8))
plt.title('Receiver Operating Characteristic',fontsize=25)
plt.plot(fpr, tpr, 'b', label = 'AUC = %0.2f' % roc_auc)
plt.legend(loc = 'lower right')
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0, 1])
plt.ylim([0, 1])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.tight_layout()
plt.savefig('ROC.png')
