Predicting the temperature of a superconductor: Regression

By Joe Ganser

ABSTRACT

Data on superconductors was studied with the goal of predicting the critical temperature at which a material becomes superconducting; hence, a regression task. The data initially consisted of 168 features and 21,263 rows. The goal was not to predict whether or not a material was superconductive, because all of the data was on superconductive materials. Another researcher had achieved a root mean squared error (RMSE) of ±9.5 Kelvin on this data set, and this project's goal was to outperform that metric while comparing modeling techniques and analyzing features. This study achieved an RMSE of ±9.4 Kelvin. The data is originally sourced from the superconducting material database. This paper is oriented towards an audience of scientists, and was coded in Python 3.6.

PROJECT OUTLINE

Introduction

This project had several objectives:

  • Predict the temperature of superconductivity with the highest possible R2 score and lowest possible RMSE
  • Compare multiple model techniques and find the one that produces the best metrics
  • Identify relevant and irrelevant features
  • Ensure model validity

The approach was to first identify and discard the irrelevant features, and then feed the remaining features into a range of models for a wide comparison.

What is a superconductor?

A superconductor is a material that, at a very low temperature, allows for infinite conductivity (zero electrical resistance). Essentially, this means that all the electrons/ions flow through the material without any disturbance whatsoever. One of the most interesting things about superconductors is that at their critical temperature they can create magnetic levitation! This is called the Meissner effect. Lexus, the car company, used this phenomenon to create a hoverboard.

Superconducting temperature

The temperature at which a material becomes superconductive was what I was trying to predict in this analysis. So why is temperature relevant to superconductivity? To understand this, consider a familiar example of something that's the complete opposite.

A toaster works using electrical resistance. In the circuit of a toaster, the electrons are forced to slow down and collide with each other, which in turn causes an accumulation of thermal energy. That thermal energy raises the temperature, which cooks the toast.

The physics of a superconductor is (sort of) the opposite of a toaster. Instead of electrons bumping into each other chaotically (creating heat), the cold causes them all to move uniformly. When the temperature is low enough, the uniform motion of the electrons becomes very coherent, producing powerful magnetic fields.

There are many materials that can support superconductivity. Each one is a combination of different elements and has many intrinsic properties. Thus, by creating a table of each superconductor's properties, along with its critical temperature, a regression analysis can be done.

Preprocessing the target variable: critical temperature

In any regression problem, you want the target variable to be as close to a normal distribution as possible. Sometimes this is possible, other times not. In this case I was able to transform the critical temperature to be approximately normal.

Originally the data was strongly positively skewed. After a lot of trial and error, I used an exponential transformation to make the distribution approximately normal.

Unfortunately, even in the best fit a slight skew remained. The highest peak wasn't simply an outlier and couldn't be dropped or imputed. Despite this, the results observed at the end were quite good.
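
To illustrate, a minimal sketch of this kind of power/exponential transform (the file name and exponent below are placeholders, not the values actually used in the project):

import numpy as np
import pandas as pd
from scipy import stats

df = pd.read_csv('train.csv')            # hypothetical file name for the superconductor data
y_raw = df['critical_temp']

exponent = 0.5                            # placeholder exponent, found by trial and error
y_transformed = y_raw ** exponent         # power transform to reduce the positive skew

print('skew before:', stats.skew(y_raw))
print('skew after: ', stats.skew(y_transformed))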

Filtering out irrelevant features

Initially the data shape was 21,263 rows by 168 features. The head of the original dataframe looked like this:

(First five rows of the raw dataframe, 168 columns wide: engineered statistics such as number_of_elements, mean_atomic_mass, wtd_mean_atomic_mass, ..., range_Valence, std_Valence, wtd_std_Valence; the target critical_temp; and one composition column per chemical element from H through Rn.)

Thus, several steps were taken to eliminate the irrelevant features. Specifically:

  • Dropping features that varied by 5% or less across all rows (using VarianceThreshold)
  • Dropping duplicated features (using .transpose().drop_duplicates(keep='first').transpose())
  • Dropping features with a very small correlation with the target variable (|r| < 0.1)
  • Dropping features that were highly correlated with each other (|r| > 0.8)

It was concluded that 132 of the original 168 features had very little relevance.

The data that remained, which would be used for modelling, had 36 features and 21,263 rows.
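
A minimal sketch of these filtering steps (assuming `data` is the raw dataframe and interpreting the 5% rule as a VarianceThreshold of 0.05; see the linked notebooks for the actual code):

import numpy as np
from sklearn.feature_selection import VarianceThreshold

X = data.drop(columns=['critical_temp'])          # `data` is the raw dataframe (assumption)
y = data['critical_temp']

# 1. Drop near-constant features (the 5% variance rule)
selector = VarianceThreshold(threshold=0.05)
X = X.loc[:, selector.fit(X).get_support()]

# 2. Drop duplicated features
X = X.T.drop_duplicates(keep='first').T

# 3. Drop features weakly correlated with the target (|r| < 0.1)
corr_with_target = X.corrwith(y).abs()
X = X[corr_with_target[corr_with_target >= 0.1].index]

# 4. For each pair of highly inter-correlated features (|r| > 0.8), drop one
corr_matrix = X.corr().abs()
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))
to_drop = [col for col in upper.columns if (upper[col] > 0.8).any()]
X = X.drop(columns=to_drop)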

The head of the cleaned dataframe was:

(First five rows of the cleaned dataframe. Its 36 columns are: number_of_elements, mean_atomic_mass, range_atomic_mass, wtd_range_atomic_mass, mean_fie, wtd_mean_fie, wtd_entropy_fie, range_fie, wtd_range_fie, mean_atomic_radius, wtd_range_atomic_radius, mean_Density, range_Density, mean_ElectronAffinity, wtd_mean_ElectronAffinity, range_ElectronAffinity, wtd_range_ElectronAffinity, mean_FusionHeat, range_FusionHeat, mean_ThermalConductivity, wtd_mean_ThermalConductivity, gmean_ThermalConductivity, wtd_entropy_ThermalConductivity, range_ThermalConductivity, range_Valence, wtd_range_Valence, the target critical_temp, and the element-composition columns O, S, Ca, Cu, Sr, Y, Ba, Tl, and Bi.)

Selecting the right model

Seven different types of regression were performed on the data, with the intention of simply estimating the performance of each one without tuning hyperparameters. The metrics of evaluation for each one were the R2 score and the root mean squared error (RMSE). I also measured the time required for each model to fit and predict the data. After all the models and metrics were observed, the best performing one was then selected for hyperparameter tuning. The seven models compared were (a sketch of instantiating them follows the list):

  • Ordinary Least Squares
  • Ridge Regression
  • Lasso Regression
  • Elastic Net regression
  • Bayesian Ridge
  • K-nearest neighbors regression
  • Random Forest Regression
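
For reference, a minimal sketch of how these seven models could be instantiated with scikit-learn (default parameters assumed here; the notebooks may configure them differently):

from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet, BayesianRidge
from sklearn.neighbors import KNeighborsRegressor
from sklearn.ensemble import RandomForestRegressor

# one untuned instance of each model type, keyed by the names used in the results table below
models = {
    'OLS': LinearRegression(),
    'Ridge': Ridge(),
    'Lasso': Lasso(),
    'ElasticNet': ElasticNet(),
    'BayesianRidge': BayesianRidge(),
    'KNN': KNeighborsRegressor(),
    'RandomForestRegressor': RandomForestRegressor(),
}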

The models were evaluated in a for loop. At each iteration, a sequentially increasing number of features from the data was fed to all of the models. The order of the features fed into the models was based on the magnitude of each feature's correlation with critical_temp. Hence the first feature used was range_ThermalConductivity, which had a correlation of 0.687654, and the last feature was mean_fie, which had a correlation of 0.102268. The loop passed through all 36 features: the first iteration used the first feature, the second used the first two features, the third used the first three, and so on. Each model was initially fit without tuning hyperparameters.

In code, the structure was:

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

# `data` is the cleaned dataframe, `y` is the (transformed) critical_temp target,
# and `model` is one of the seven regressors being compared

# get the correlation of each feature with the target variable
corr = pd.DataFrame(data.corr()['critical_temp'])
# get the absolute value of the correlation
corr['abs'] = np.abs(corr['critical_temp'])
# sort the values by their absolute value, in descending order, and drop the
# target's (perfect) correlation with itself so it isn't used as a feature
corr = corr.sort_values(by='abs', ascending=False).drop('critical_temp')

# loop through the rows of the corr dataframe, getting each feature sequentially
for row in corr.index:
    # add more features iteratively & cumulatively, in order of correlation magnitude
    features = list(corr.loc[:row].index)
    # use all those features up to that point
    X = data[features]
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=10)
    # fit/train the model
    model.fit(X_train, y_train)
    # then do RMSE and R2 scoring
    predictions = model.predict(X_test)
    rmse = np.sqrt(mean_squared_error(y_test, predictions))
    r2 = r2_score(y_test, predictions)

See the links to coded notebooks for the actual code.

At each iteration the seven models were fit on a 70-30 train test split. The R2 and RMSE were then measured.

Graphing the evaluation metrics of each model as a function of the number of features used, we have:

The best performance metrics for each model (without tuning hyperparameters) were:

model                    RMSE      R2 score   # of features used   time (s)
RandomForestRegressor    10.3001   0.9104     30                   1.6523
KNN                      15.5705   0.7917     14                   0.0596
BayesianRidge            20.5437   0.6363     27                   0.0167
Ridge                    20.5642   0.6356     27                   0.0061
OLS                      20.5645   0.6356     27                   0.0123
ElasticNet               22.9051   0.5479     27                   0.0153
Lasso                    23.5355   0.5227     27                   0.0145

Thus, the best performing model (the random forest regressor) was then fine-tuned via its hyperparameters.

Random Forest: Tuning hyperparameters

Because the main goal is to draw the most accurate conclusions about superconductors, accuracy, in this context, is more important than efficiency. Hence, despite the random forest regressor being the slowest model to fit the data (over 1 second), it was chosen because it had the lowest error.

The hyperparameter that was tuned was the number of decision trees in the random forest. To do this, I again used a trial and error approach where I cycled through a range of forest sizes. In conjunction, I also tried different max_features settings, specifically 'log2', 'sqrt', and the default.
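
A minimal sketch of this search (the forest sizes below are placeholders, and X_train, X_test, y_train, y_test are assumed from the earlier split):

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

best = {'rmse': np.inf}
for n_trees in [50, 100, 200, 400]:                # placeholder range of forest sizes
    for max_feats in ['log2', 'sqrt', None]:       # None uses all features (the default behavior)
        rf = RandomForestRegressor(n_estimators=n_trees, max_features=max_feats, n_jobs=-1)
        rf.fit(X_train, y_train)
        rmse = np.sqrt(mean_squared_error(y_test, rf.predict(X_test)))
        if rmse < best['rmse']:
            best = {'rmse': rmse, 'n_estimators': n_trees, 'max_features': max_feats}
print(best)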

After lots of trial and error, an RMSE of ±9.396 Kelvin was reached; before tuning it was ±10.3 Kelvin.

Checking for a good fit

As in any regression analysis, the R2 score and RMSE are not the only standards of evaluation. Examining the plot of the predicted values versus the actual values is important, as is checking the errors for normality and homoscedasticity.

The plot of the predicted temperatures versus the actual temperatures was:

Normality of errors & homoscedasticity

Using seaborn's distplot and scipy's probplot functions, I plotted the distribution of the errors as well as their Q-Q plots (for both the train set and the test set). Approximate normality was observed.
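
A minimal sketch of these checks (assuming y_test and predictions from the model evaluation above):

import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats

errors = y_test - predictions                       # test-set residuals

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
sns.distplot(errors, ax=axes[0])                    # distribution of errors (sns.histplot on newer seaborn)
stats.probplot(errors, plot=axes[1])                # Q-Q plot against a normal distribution
plt.show()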

Feature Importance

One of the more convenient things about the random forest package is that it performs feature importance analysis automatically. Here, we can see the relative importance of each feature.
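
A minimal sketch of extracting these importances (assuming rf is the tuned forest and X_train holds the training features):

import pandas as pd
import matplotlib.pyplot as plt

# pair each feature name with its importance score from the fitted forest
importances = pd.Series(rf.feature_importances_, index=X_train.columns)
importances.sort_values().plot(kind='barh', figsize=(6, 10))
plt.xlabel('relative importance')
plt.show()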

Conclusions

The other researchers who analyzed this data concluded that the important extracted features were those based on thermal conductivity, atomic radius, valence, electron affinity, and atomic mass. In the graph above we can see the same results, with a few extra important features such as the presence of copper, oxygen, and barium, and the number of elements in the material. Perhaps it is these few extras that allowed me to reach an RMSE of 9.4 Kelvin, which is lower than the other researcher's RMSE of 9.5 Kelvin.

Links to coded notebooks

Sources/Links
