Autism Spectrum Disorder (ASD) is characterized by persistent difficulties in interpersonal communication, social interaction, and behavioral flexibility. One salient challenge among children with ASD is a disrupted capacity to recognize and express emotions appropriately, which contributes to heightened emotional dysregulation. These deficits often manifest as frequent episodes of acute distress, commonly known as meltdowns, that impose significant physical and psychological demands. Traditional emotion recognition models, which are typically trained on neurotypical data, are inadequate for this population because they fail to capture the atypical and heterogeneous affective presentations of children with ASD. Deep learning tools have increasingly been applied to identify specific autistic symptoms. This paper develops a personalized multimodal neural network that aims to identify the affective states of children with autism using information from the facial and vocal expression modalities. The design includes a personalized facial feature extraction module that uses a distance metric to aggregate embeddings sharing the same label while separating dissimilar representations. In parallel, a convolutional neural network (CNN)-based audio feature extractor is applied to speech samples to capture vocal cues associated with emotional expression in children with autism. Finally, we propose a multimodal data fusion strategy for emotion recognition and construct a feature fusion model based on ensemble techniques.
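
As a rough illustration of the pipeline described above, the sketch below assumes a PyTorch implementation: a facial embedding branch trained with a triplet-margin distance objective (one common choice of distance metric; the paper's exact metric is not specified here), a CNN audio branch over speech spectrograms, and feature-level fusion followed by an averaged ensemble of classifier heads. All layer sizes, module names, and the ensemble strategy are illustrative assumptions, not the authors' exact design.

```python
# Minimal sketch (assumed PyTorch implementation); layer widths, the distance
# metric, and the ensemble strategy are illustrative assumptions only.
import torch
import torch.nn as nn


class FacialEmbeddingNet(nn.Module):
    """Personalized facial branch: maps face crops to an embedding space where
    a distance-based (triplet) loss pulls same-emotion samples together."""
    def __init__(self, embed_dim=128):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, embed_dim),
        )

    def forward(self, x):
        # L2-normalized embeddings so distances are comparable across samples.
        return nn.functional.normalize(self.backbone(x), dim=-1)


class AudioCNN(nn.Module):
    """CNN audio branch over log-mel spectrograms of speech samples."""
    def __init__(self, feat_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, feat_dim),
        )

    def forward(self, spec):
        return self.net(spec)


class FusionEnsemble(nn.Module):
    """Feature-level fusion followed by a small ensemble of classifier heads
    whose logits are averaged (one plausible ensemble choice)."""
    def __init__(self, face_dim=128, audio_dim=128, num_emotions=4, heads=3):
        super().__init__()
        self.heads = nn.ModuleList(
            nn.Sequential(nn.Linear(face_dim + audio_dim, 64), nn.ReLU(),
                          nn.Linear(64, num_emotions))
            for _ in range(heads)
        )

    def forward(self, face_feat, audio_feat):
        fused = torch.cat([face_feat, audio_feat], dim=-1)
        return torch.stack([h(fused) for h in self.heads]).mean(dim=0)


face_net, audio_net, fusion = FacialEmbeddingNet(), AudioCNN(), FusionEnsemble()

# Distance-metric objective for the facial branch: anchor and positive share an
# emotion label, the negative does not (dummy tensors stand in for real crops).
triplet_loss = nn.TripletMarginLoss(margin=0.3)
anchor, pos, neg = (torch.randn(8, 3, 64, 64) for _ in range(3))
metric_loss = triplet_loss(face_net(anchor), face_net(pos), face_net(neg))

# Multimodal forward pass on dummy face crops and spectrograms.
faces = torch.randn(8, 3, 64, 64)
specs = torch.randn(8, 1, 64, 128)
logits = fusion(face_net(faces), audio_net(specs))  # shape: (8, num_emotions)
```

The fusion step here simply concatenates the two feature vectors before the ensemble heads; other feature-fusion schemes would fit the same interface.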