Projective drawing tests are widely used to assess children's mental health, yet interpreting them requires scarce clinical expertise. While multimodal large language models (MLLMs) offer potential for automating such assessments, their current use is limited by interpretive noise, subjectivity, and superficial analysis. To address this, we introduce a benchmark for rigorously evaluating MLLMs on projective drawing-based diagnosis. Motivated by the critical need for early mental health intervention, we collected the dataset in collaboration with institutional partners using the Drawing-A-Person test; it comprises 559 annotated drawings from K-12 students aged 6-17. The benchmark employs a multiple-choice question format to objectively assess three diagnostic capabilities: determination (whether a drawing is abnormal), detection (which abnormality is present), and comparison (across multiple drawings). Results reveal a substantial performance gap between MLLMs and human experts. MLLMs perform better on determination tasks and on abnormalities related to structural cues (e.g., adjustment and dependence problems), yet struggle with abnormalities related to affective or stylistically subtle cues (e.g., anxiety and depression). These findings suggest that MLLMs could rely on some surface patterns to assist human experts in initial screening, yet show limited competence in integrating emotion and reasoning for further diagnosis. This work provides a principled framework for developing interpretable and equitable AI tools for child and adolescent mental health screening.