Use of electronic health record (EHR) content for comparative effectiveness research (CER) and population health management requires significant data configuration. A retrospective cohort study was conducted using patients with diabetes followed longitudinally (N = 36,353) in the EHR deployed at the outpatient practice networks of 2 health care systems. A data extraction and classification algorithm targeting identification of patients with a new diagnosis of type 2 diabetes mellitus (T2DM) was applied; its main criterion was a minimum 30-day window between the first visit documented in the EHR and the entry of T2DM on the EHR problem list. Chart reviews (N = 144) were used to validate the performance gained by refining this EHR classification algorithm with external administrative data. Extraction using EHR data alone designated 3205 patients as newly diagnosed with T2DM, with a classification accuracy of 70.1%. Applying external administrative data to that preselected population improved the classification accuracy of cases identified as a new T2DM diagnosis (positive predictive value was 91.9% with that step). Laboratory and medication data did not help case classification. The final cohort produced by this 2-stage classification process comprised 1972 patients with a new diagnosis of T2DM. Using data from current EHR systems for CER and disease management requires substantial tailoring. The quality of EHR clinical data generated in daily care differs from that required for population health research. As evidenced by this process for classification of newly diagnosed T2DM cases, validation of EHR data with external sources can be a valuable step.
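The 2-stage classification described above can be illustrated with a minimal sketch. Only the 30-day window between first documented visit and T2DM problem-list entry comes from the abstract. The record layout, the field names, and in particular the form of the administrative-data check (a single flag for a diabetes claim predating the EHR entry) are assumptions made for illustration, since the abstract does not specify how the external data were applied. The positive predictive value helper uses the standard definition, true positives divided by all cases flagged as positive.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

# From the abstract: minimum gap between first EHR visit and T2DM problem-list entry.
MIN_WINDOW_DAYS = 30


@dataclass
class PatientRecord:
    """Hypothetical record layout; field names are illustrative, not the study's schema."""
    patient_id: str
    first_visit: date
    t2dm_problem_list_entry: Optional[date]
    # Assumption: external administrative data reduced to one flag indicating
    # a diabetes claim dated before the EHR problem-list entry.
    prior_diabetes_claim: bool = False


def is_new_t2dm_by_ehr(rec: PatientRecord, min_days: int = MIN_WINDOW_DAYS) -> bool:
    """Stage 1: EHR-only rule based on the 30-day window criterion."""
    if rec.t2dm_problem_list_entry is None:
        return False
    gap = (rec.t2dm_problem_list_entry - rec.first_visit).days
    return gap >= min_days


def is_new_t2dm_two_stage(rec: PatientRecord) -> bool:
    """Stage 2 (assumed form): drop stage-1 cases with earlier evidence of diabetes."""
    return is_new_t2dm_by_ehr(rec) and not rec.prior_diabetes_claim


def positive_predictive_value(true_positives: int, false_positives: int) -> float:
    """PPV = TP / (TP + FP), as estimated from chart review of flagged cases."""
    flagged = true_positives + false_positives
    if flagged == 0:
        raise ValueError("PPV is undefined when no cases were flagged")
    return true_positives / flagged


if __name__ == "__main__":
    # Made-up example patients, not study data.
    patients = [
        PatientRecord("A", date(2010, 1, 1), date(2010, 3, 1)),
        PatientRecord("B", date(2010, 1, 1), date(2010, 1, 10)),
        PatientRecord("C", date(2010, 1, 1), date(2010, 6, 1), prior_diabetes_claim=True),
    ]
    for p in patients:
        print(p.patient_id, is_new_t2dm_by_ehr(p), is_new_t2dm_two_stage(p))
```

In this sketch, the second stage only filters cases already selected by the EHR rule, mirroring the abstract's description of applying administrative data to the preselected population rather than to the full diabetes cohort.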