Media server interfaces rely heavily on stimulus signaling, which encompasses a wide range of mechanisms, from clicking on hyperlinks, to pressing buttons, to traditional dual-tone multi-frequency input. Request for Comments (RFC) 5629 describes a framework for the interaction between users and Session Initiation Protocol (SIP)-based applications. Application interaction can be either functional or stimulus. Functional interaction requires the user device to understand the semantics of the application, whereas stimulus interaction does not: stimulus signaling allows a user agent to interact with an application without knowledge of that application's semantics, so applications can be built without requiring modifications to the user device. The services invoked from a media server are examples of SIP-based applications created using a kind of stimulus signaling. The details of session establishment and termination, media negotiation using the Session Description Protocol offer–answer model, the return of data (e.g., collected utterances or digit information) from the media server to the application server, and other services using SIP interfaces are discussed. We have taken the SIP interface to the VoiceXML media server specified in RFC 5552 as the example in this chapter. Similarly, audio, video, or data media bridging/mixing servers for multimedia conferencing in SIP also need to be distributed for scalability, especially in large-scale SIP networks. It should be noted that media services are a large area that needs further investigation in order to create more scalable multimedia services by making them distributed for the next-generation large-scale SIP network.
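As a brief illustration of the RFC 5552 interface summarized above, the following minimal sketch shows how an application server might form the Request-URI that asks a media server to run a VoiceXML dialog. The "dialog" user part and the "voicexml" and "maxage" parameter names come from RFC 5552; the function name, the host names, and the choice of Python are assumptions made only for this illustration, and the sketch does not build or send a complete SIP INVITE.

```python
from urllib.parse import quote

# Characters permitted unescaped in a SIP URI parameter value (RFC 3261
# "paramchar"), beyond the letters, digits, and "_.-~" that quote() always keeps.
PARAM_SAFE = "[]/:&+$!*'()"


def voicexml_request_uri(media_server_host, vxml_url, maxage=None):
    """Build a Request-URI that invokes the VoiceXML dialog service.

    media_server_host and vxml_url are supplied by the caller; the values
    used in the example below are hypothetical.
    """
    params = [("voicexml", vxml_url)]
    if maxage is not None:
        # Optional cache-control hint used when the document is fetched.
        params.append(("maxage", str(maxage)))
    # Escape anything that is not legal inside a SIP URI parameter value,
    # for example "?" and "=" in the query part of the HTTP URL.
    encoded = "".join(
        f";{name}={quote(value, safe=PARAM_SAFE)}" for name, value in params
    )
    return f"sip:dialog@{media_server_host}{encoded}"


uri = voicexml_request_uri(
    "ms.example.com", "http://as.example.com/menu.vxml?lang=en"
)
print(uri)
# sip:dialog@ms.example.com;voicexml=http://as.example.com/menu.vxml%3Flang%3Den
```

The application server would place such a URI in the INVITE it sends to the media server; because the user device only supplies speech or digits, it needs no knowledge of the VoiceXML application itself, which is the stimulus model described in this chapter.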