
Hadoop TaskTracker - EOF Exception (Missing File Handles)


Description

We have started to see this local port issue crop up in the past 2 months during our ever-increasing unit testing. The latest spate of problems was, however, fixed by simply updating our Linux instances with a higher number of file descriptors. On one of the systems showing the local port issue, we noticed that the default limit for open files was a mere 2,000:

    [joakim@lapetus jetty]$ lsb_release -a
    No LSB modules are available.
    Distributor ID: Ubuntu
    Description:    Ubuntu 10.04 LTS
    Release:        10.04
    Codename:       lucid

    [joakim@lapetus jetty]$ ulimit -n
    2000

So we updated /etc/security/limits.conf to bump this number up to 20,000, which solved our bad local port issues (after a reboot):

    [joakim@lapetus jetty]$ grep nofile /etc/security/limits.conf
    #        - nofile - max number of open files
    *        hard    nofile    40000
    *        soft    nofile    20000

    [joakim@lapetus jetty]$ ulimit -a | grep -i file
    core file size          (blocks, -c) 0
    file size               (blocks, -f) unlimited
    open files                      (-n) 20000
    file locks                      (-x) unlimited

This new ulimit helped our unit testing on the systems that were having issues.
Our analysis shows that the aggressive unit testing we do, which repeatedly starts and stops a Jetty server on a system-assigned port (the special port #0), consumes sockets faster than they can be recycled back under the "open files" ulimit, eventually causing all of our unit tests to fail with a "-1" local port.
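For reference, the port #0 trick works as in the minimal sketch below. It uses plain java.net.ServerSocket rather than the actual Jetty test code, so the class name and comments are illustrative only:

    import java.io.IOException;
    import java.net.ServerSocket;

    public class EphemeralPortDemo {
        public static void main(String[] args) throws IOException {
            // Binding to the special port 0 asks the OS for a free ephemeral port,
            // the same pattern the unit tests use when starting an embedded server.
            try (ServerSocket socket = new ServerSocket(0)) {
                System.out.println("System-assigned local port: " + socket.getLocalPort());
            }
            // Each start/stop cycle consumes a file descriptor while the socket is open.
            // In the scenario described above, rapid cycles eventually hit the process's
            // "open files" ulimit and further binds fail ("Too many open files").
        }
    }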
See examples of the error messages at https://bugs.eclipse.org/bugs/show_bug.cgi?id=310634. Having a loop that keeps attempting to start the server while looking for a valid port number (as seen in the Hadoop codebase) will not help once this "open files" ulimit has been exhausted.
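The retry pattern referred to looks roughly like the sketch below (illustrative only, not the actual Hadoop code; bindWithRetry is a made-up helper). Once the file-descriptor limit is exhausted, every attempt fails the same way, so retrying never recovers:

    import java.io.IOException;
    import java.net.ServerSocket;

    public class RetryBindDemo {
        // Keep retrying until a bind yields a valid (positive) local port.
        static int bindWithRetry(int attempts) throws IOException {
            IOException last = null;
            for (int i = 0; i < attempts; i++) {
                try (ServerSocket socket = new ServerSocket(0)) {
                    int port = socket.getLocalPort();
                    if (port > 0) {
                        return port; // got a valid system-assigned port
                    }
                } catch (IOException e) {
                    last = e; // e.g. "Too many open files" once the ulimit is hit
                }
            }
            throw last != null ? last : new IOException("no valid port after " + attempts + " attempts");
        }

        public static void main(String[] args) throws IOException {
            System.out.println("Bound to port " + bindWithRetry(10));
        }
    }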
The best choice we've been able to come up with is to simply increase the "open files" ulimit.

Users

  • @macek
