Elevated seismic noise in teleseismic recordings of moderate-size earthquakes has limited our ability to resolve their source complexity. We develop a machine-learning-based algorithm to separate noise and earthquake signals that overlap in frequency. The multi-task encoder-decoder model is built around a kernel pre-trained on local (i.e., short-distance) earthquake data \cite{yin_multitask_2022} and is modified by continued learning with high-quality teleseismic data. We denoise teleseismic P waves of deep Mw5.0+ earthquakes and use the clean P waves to estimate the source characteristics of these understudied earthquakes with reduced uncertainties. We find a scaling of moment with duration of $M_0\simeq \tau^{4.16}$, and a resulting strong scaling of stress drop and radiated energy with magnitude ($\sigma\simeq M_0^{0.2}$ and $E_R \simeq M_0^{1.23}$). The median radiation efficiency is 5\%, a low value compared to that of shallow earthquakes. Overall, we show that deep earthquakes have weak rupture directivity and few subevents, suggesting that a simple model of a circular crack with radial rupture propagation is appropriate. After accounting for their respective scaling with earthquake size, we find no systematic depth variations of duration, stress drop, or radiated energy within the 100--700 km depth range. Our study supports the findings of \citeA{poli_global_2016} with twice as many earthquakes investigated, including earthquakes of lower magnitudes.