A new multi-sensor fusion framework based on a Convolutional Neural Network (CNN) and a Dynamic Bayesian Network (DBN) is proposed for Sign Language Recognition (SLR). In this framework, a Microsoft Kinect, a low-cost RGB-D sensor, serves as the Human-Computer Interaction (HCI) device. In our method, color and depth videos are first captured with the Kinect, and features are then extracted from each image sequence using the CNN. The color and depth feature sequences are fed into the DBN as observation data. Based on the graphical-model fusion mechanism, the hidden-state sequence with the maximum probability is computed and taken as the recognition result for dynamic isolated sign language. For comparison, the dataset is also evaluated with existing SLR methods. With the proposed DBN+CNN SLR framework, the recognition rate reaches up to 99.40%. The test results show that our approach is effective.
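The fusion-and-decoding step described above can be sketched as follows. This is an illustrative toy example, not the authors' implementation: per-frame CNN scores for the color and depth streams are stand-in random values, the two streams are fused additively in log space under an independence assumption, and the maximum-probability hidden-state sequence is decoded with the Viterbi algorithm, a standard inference routine for chain-structured DBNs. The transition and prior matrices are hypothetical uniform placeholders.

```python
import numpy as np

def viterbi(log_obs, log_trans, log_prior):
    """Return the max-probability state path for a T x S log-likelihood matrix."""
    T, S = log_obs.shape
    delta = log_prior + log_obs[0]          # best log-score ending in each state
    back = np.zeros((T, S), dtype=int)      # backpointers for path recovery
    for t in range(1, T):
        scores = delta[:, None] + log_trans  # S x S: prev state -> next state
        back[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + log_obs[t]
    path = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t][path[-1]]))
    return path[::-1], float(delta.max())

rng = np.random.default_rng(0)
T, S = 6, 3                          # 6 frames, 3 hidden states (toy sizes)
color_ll = rng.normal(size=(T, S))   # stand-ins for CNN color-stream log-scores
depth_ll = rng.normal(size=(T, S))   # stand-ins for CNN depth-stream log-scores
fused = color_ll + depth_ll          # independent-stream fusion in log space

log_trans = np.log(np.full((S, S), 1.0 / S))  # uniform transitions (placeholder)
log_prior = np.log(np.full(S, 1.0 / S))       # uniform prior (placeholder)

path, score = viterbi(fused, log_trans, log_prior)
print(path)  # most likely hidden-state sequence over the 6 frames
```

In a real system the decoded path (or the sequence-level likelihood under each sign's model) would be compared across candidate signs, and the sign with the highest probability would be reported as the recognition result.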