Title
Predicting Gas-Particle Partitioning Coefficients of Atmospheric Molecules with Machine Learning
Authors
Abstract
The formation, properties and lifetime of secondary organic aerosols in the atmosphere are largely determined by the gas-particle partitioning coefficients of the participating organic vapours. Since these coefficients are often difficult to measure or compute, we developed a machine learning (ML) model that predicts them from molecular structure. Our data-driven approach is based on the dataset by Wang et al. (Atmos. Chem. Phys., 17, 7529 (2017)), who used the COSMOtherm program to compute the partitioning coefficients and saturation vapour pressures of 3414 atmospheric oxidation products from the Master Chemical Mechanism. We train a kernel ridge regression (KRR) ML model on the saturation vapour pressure ($P_{sat}$) and on two equilibrium partitioning coefficients: between a water-insoluble organic matter phase and the gas phase ($K_{WIOM/G}$), and between an infinitely dilute solution with pure water and the gas phase ($K_{W/G}$). We test different descriptors as the input representation of each organic molecule's atomic structure. Our best ML model predicts $P_{sat}$ and $K_{WIOM/G}$ to within 0.3 logarithmic units, and $K_{W/G}$ to within 0.4 logarithmic units, of the original COSMOtherm calculations. This is equal to or better than the typical accuracy of COSMOtherm predictions compared to experimental data. We then apply our ML model to a dataset of 35,383 molecules that we generated from a 10-carbon backbone functionalized with 0 to 6 carboxyl, carbonyl or hydroxyl groups, to evaluate its performance for polyfunctional compounds with potentially low $P_{sat}$. The resulting $P_{sat}$ and partitioning coefficient distributions were physico-chemically reasonable, and the volatility predictions for the most highly oxidized compounds were in qualitative agreement with experimentally inferred volatilities of atmospheric oxidation products with similar elemental composition.
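To illustrate the kernel ridge regression step named in the abstract, the following is a minimal sketch rather than the authors' implementation. It assumes a Gaussian kernel, a hypothetical three-component descriptor (random stand-ins for functional-group counts) and a synthetic $\log_{10} P_{sat}$ target; the abstract specifies none of these, and the hyperparameters `sigma` and `lam` are arbitrary illustrative values. It fits the model by solving $(K + \lambda I)\alpha = y$ and reports the mean absolute error in logarithmic units, the same kind of error measure quoted above.

```python
import numpy as np


def gaussian_kernel(A, B, sigma):
    """Gaussian (RBF) kernel matrix between the rows of A and the rows of B."""
    sq_dist = (A**2).sum(axis=1)[:, None] + (B**2).sum(axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-np.maximum(sq_dist, 0.0) / (2.0 * sigma**2))


def krr_fit(X, y, sigma, lam):
    """Solve (K + lam * I) alpha = y for the regression weights alpha."""
    K = gaussian_kernel(X, X, sigma)
    return np.linalg.solve(K + lam * np.eye(len(X)), y)


def krr_predict(X_train, alpha, X_new, sigma):
    """Predict targets for new descriptors from the fitted weights."""
    return gaussian_kernel(X_new, X_train, sigma) @ alpha


# Synthetic stand-in data (assumption): three descriptor components per
# "molecule" and a made-up log10(P_sat) that decreases with each component.
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 6.0, size=(300, 3))
y = -1.0 - 1.5 * X[:, 0] - 0.8 * X[:, 1] - 2.0 * X[:, 2] + rng.normal(0.0, 0.05, 300)

X_train, y_train = X[:240], y[:240]
X_test, y_test = X[240:], y[240:]

sigma, lam = 2.0, 1e-3          # illustrative hyperparameters, not tuned
y_mean = y_train.mean()         # centre the target before fitting
alpha = krr_fit(X_train, y_train - y_mean, sigma, lam)
pred = krr_predict(X_train, alpha, X_test, sigma) + y_mean

mae = np.abs(pred - y_test).mean()
print(f"test MAE: {mae:.3f} log10 units")
```

In practice the kernel width and regularization strength would be chosen by cross-validation, and the descriptor would encode the actual molecular structure rather than random numbers.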