For the introductory, less math-y post that explains more about what this project is, click here.
The concept of reliability comes from classical test theory, which was developed for psychological and educational testing. Classical test theory models a measurement with three components: a true score, error (or noise), and an observed score, where the observed score is the sum of the true score and the error.
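In symbols, writing the observed score as X, the true score as T, and the error as E, the model and the usual definition of reliability look like this (assuming, as classical test theory does, that the error is uncorrelated with the true score):

\[
X = T + E, \qquad \text{reliability} = \frac{\sigma_T^2}{\sigma_X^2} = \frac{\sigma_T^2}{\sigma_T^2 + \sigma_E^2}
\]

In words: reliability is the share of the variance in what we observe that comes from real differences in true scores rather than from noise.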
To adapt this to baseball, the true “score” is the true talent level we are seeking to find, and the observed “score” is the player’s actual production. Unfortunately, true talent level can’t be measured directly. There are several methods of estimating it by accounting for different factors; this is, to an extent, what projection systems try to do. For our purposes, we define the true talent level as the player’s actual talent, not the value the player provides adjusted for park, competition, etc. The observed score, of course, is easy to measure: it’s the recorded outcomes from the games the player in question has played. It’s the stat you see on our leaderboards.
The error term contains everything that can cause a discrepancy between the true score and the observed score, which is to say almost everything that affects the observed outcome in the stat: weather, opposing pitchers, defenses, park factors, injuries, and so on. This analysis isn’t interested in accounting for those factors individually but rather in measuring the noise they impart, in aggregate, to our observed stat.
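To make that noise concrete, here is a minimal simulation sketch. The player pool and talent distribution below are hypothetical, chosen purely for illustration: each simulated hitter gets a fixed true batting average, observed averages scatter around it through sampling luck, and the ratio of true-talent variance to observed variance (the reliability) climbs toward 1 as the sample of at-bats grows.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical league: 1,000 hitters whose true BA talent is
# normally distributed around .260 (illustrative numbers only).
n_players = 1000
true_talent = rng.normal(0.260, 0.020, n_players)

for n_ab in (50, 200, 600):
    # Observed BA = hits / at-bats. Hits are drawn binomially around
    # each player's true talent, so the only "error" here is sampling luck.
    hits = rng.binomial(n_ab, true_talent)
    observed = hits / n_ab

    # Reliability: share of observed variance that is true-talent variance.
    reliability = true_talent.var() / observed.var()
    print(f"{n_ab:>3} AB: reliability ~ {reliability:.2f}")
```

Even though no individual noise source (weather, pitcher, park) is modeled here, their aggregate effect shows up the same way: as extra variance in the observed stat, whose share shrinks as playing time accumulates.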