UniF$^2$ace is the first unified multimodal model (UMM) specialized in fine-grained face understanding and generation. Unlike prior work, which focuses mainly on coarse-grained understanding of facial attributes and largely neglects generation, UniF$^2$ace is designed to both understand and synthesize fine-grained facial features. We build a large-scale facial dataset, UniF$^2$ace-130K, consisting of 130K image-text pairs and one million question-answer pairs, and train the model with two mutually reinforcing diffusion techniques and a two-level mixture-of-experts architecture. In doing so, we establish a theoretical connection between discrete diffusion score matching and masked generative models, optimizing both evidence lower bounds simultaneously to improve the synthesis of facial details. By introducing token-level and sequence-level mixture-of-experts, we enable efficient fine-grained representation learning for both understanding and generation tasks. Extensive experiments on UniF$^2$ace-130K show that UniF$^2$ace outperforms existing UMMs and generative models.
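
The abstract only names the two-level mixture-of-experts design; as a rough illustration of the token-level half, the sketch below shows top-1 gated routing of individual tokens to expert MLPs in PyTorch. The class name, gating scheme, and dimensions are assumptions for illustration and are not the paper's implementation.

```python
# Illustrative sketch (not from the paper): a token-level mixture-of-experts
# layer with top-1 gating, routing each token to one expert feed-forward
# network. All names and hyperparameters here are hypothetical.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TokenLevelMoE(nn.Module):
    def __init__(self, d_model: int = 512, num_experts: int = 4, d_hidden: int = 2048):
        super().__init__()
        # One feed-forward expert per routing slot.
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(), nn.Linear(d_hidden, d_model))
            for _ in range(num_experts)
        )
        # Router producing a distribution over experts for each token.
        self.gate = nn.Linear(d_model, num_experts)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model); flatten tokens for per-token routing.
        b, s, d = x.shape
        tokens = x.reshape(-1, d)
        probs = F.softmax(self.gate(tokens), dim=-1)  # (b*s, num_experts)
        weight, idx = probs.max(dim=-1)               # top-1 expert per token
        out = torch.zeros_like(tokens)
        for e, expert in enumerate(self.experts):
            mask = idx == e
            if mask.any():
                # Scale each expert's output by its gate probability so the
                # router receives gradients through the gating weights.
                out[mask] = weight[mask, None] * expert(tokens[mask])
        return out.reshape(b, s, d)


if __name__ == "__main__":
    layer = TokenLevelMoE()
    y = layer(torch.randn(2, 16, 512))
    print(y.shape)  # torch.Size([2, 16, 512])
```

A sequence-level variant would instead pool the whole sequence (e.g., mean over tokens) before gating, so every token in a sample is processed by the same expert; the token-level form above specializes experts at finer granularity.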