Mobile eye tracking is beneficial for analysing human-machine interactions with tangible products, as it tracks eye movements reliably in natural environments and provides insights into human behaviour and the associated cognitive processes. However, current methods require manual screening of the video footage, which is time-consuming and subjective. This work aims to automatically detect cognitively demanding phases in mobile eye tracking recordings. The presented approach combines the user's perception (gaze) and action (hand) to isolate demanding interactions by means of a multi-modal feature-level fusion. It was validated in a usability study of a 3D printer with 40 participants by comparing the usability problems it found with those from a thorough manual analysis. The new approach detected 17 out of 19 problems, while the time for manual analysis was reduced by 63 percent. Adding hand information enriches the insights into human behaviour beyond what eye tracking alone provides. The field of AI could significantly advance our approach by improving the hand tracking through region proposal CNNs, by detecting the parts of a product and mapping the demanding interactions to these parts, or even by a fully automated end-to-end detection of demanding interactions via deep learning. This could lay the basis for machines that provide real-time assistance to their users when they are struggling.