LightOi High-Performance Server Framework (Part 2)

I/O Multiplexing

Preface
select
poll
epoll
Differences among the three
Summary

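Preface

This chapter introduces the I/O multiplexing used in the LightOi project. The common I/O multiplexing mechanisms are select, poll, and epoll (the focus here).

select

Within a specified time, select can monitor one or more file descriptors for readable, writable, and exceptional events.

The select API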

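       /* According to POSIX.1-2001, POSIX.1-2008 */
       #include <sys/select.h>

       /* According to earlier standards */
       #include <sys/time.h>
       #include <sys/types.h>
       #include <unistd.h>
       // Function prototype
       int select(int nfds, fd_set *readfds, fd_set *writefds,
                  fd_set *exceptfds, struct timeval *timeout);
       Parameters:
       int nfds: the highest-numbered file descriptor in any of the three sets, plus 1 (not the count of descriptors)
       fd_set *readfds: set of descriptors watched for readability
       fd_set *writefds: set of descriptors watched for writability
       fd_set *exceptfds: set of descriptors watched for exceptional conditions
       struct timeval *timeout: how long to wait; NULL blocks indefinitely, a timeval of zero returns immediately
       struct timeval {
               long    tv_sec;         /* seconds */
               long    tv_usec;        /* microseconds */
       };
       Return value:
       Success: total number of ready descriptors across all monitored sets (0 if the timeout expired)
       Failure: -1
       // Commonly used macros
       void FD_CLR(int fd, fd_set *set);   // remove fd from the set
       int  FD_ISSET(int fd, fd_set *set); // test whether fd is in the set
       void FD_SET(int fd, fd_set *set);   // add fd to the set
       void FD_ZERO(fd_set *set);          // clear the set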

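Characteristics of select:
1. The number of file descriptors select can monitor is limited by FD_SETSIZE (typically 1024). The limit can be changed, but doing so may have unexpected consequences.

2. An fd_set is only a set of file descriptors (in practice a bit array indexed by fd) with no room for per-descriptor event types. Readable, writable, and exceptional events therefore have to be passed in through three separate fd_set arguments, and no other event types can be expressed.

3. select returns the number of ready descriptors. The kernel modifies the three sets in place, removing the descriptors on which no event occurred, to report readiness. The user therefore needs O(n) time to find the ready descriptors, testing every monitored descriptor with FD_ISSET().

4. The fd_sets must be reinitialized before every call to select, because the kernel modifies them in place.

5. select works by polling: on each call the kernel scans every monitored descriptor.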
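The points above (nfds is the highest fd plus 1, the sets are rebuilt before every call, and readiness is tested with FD_ISSET) can be sketched as follows. This is a minimal illustration on a pipe, not LightOi code; the helper names wait_readable and select_demo are made up for this example.

```c
#include <assert.h>
#include <sys/select.h>
#include <unistd.h>

/* Hypothetical helper: wait up to timeout_ms for fd to become readable.
 * Returns 1 if readable, 0 on timeout, -1 on error. */
int wait_readable(int fd, int timeout_ms)
{
    fd_set readfds;
    struct timeval tv;

    /* select cannot watch descriptors numbered FD_SETSIZE or higher. */
    assert(fd >= 0 && fd < FD_SETSIZE);

    /* The kernel overwrites the set (and, on Linux, the timeout),
     * so both are rebuilt before every call. */
    FD_ZERO(&readfds);
    FD_SET(fd, &readfds);
    tv.tv_sec = timeout_ms / 1000;
    tv.tv_usec = (timeout_ms % 1000) * 1000;

    /* nfds is the highest-numbered descriptor plus 1. */
    int n = select(fd + 1, &readfds, NULL, NULL, &tv);
    if (n <= 0)
        return n;                       /* 0: timeout, -1: error */
    return FD_ISSET(fd, &readfds) ? 1 : 0;
}

/* Demo on a pipe: the read end is not readable while the pipe is empty
 * and becomes readable after one byte is written.
 * Returns 0 if both observations hold, -1 otherwise. */
int select_demo(void)
{
    int p[2];

    if (pipe(p) == -1)
        return -1;

    int before = wait_readable(p[0], 0);
    int wrote = (write(p[1], "x", 1) == 1);
    int after = wrote ? wait_readable(p[0], 100) : -1;

    close(p[0]);
    close(p[1]);
    return (before == 0 && after == 1) ? 0 : -1;
}
```

Calling select_demo() from main is enough to try it. With many descriptors, the FD_ISSET test would run in a loop over every monitored fd, which is where the O(n) cost comes from.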

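poll

poll differs little from select in essence: both manage file descriptors by polling, and both monitor readable, writable, exceptional, and similar events.

The poll API:

int poll(struct pollfd *fds, nfds_t nfds, int timeout);
fds: pointer to an array in which every element is a struct pollfd
nfds: number of elements in the fds array; it is not limited by FD_SETSIZE, so it can be far larger than select's 1024 (for example 65535), subject to the process's open-file limit
timeout: in milliseconds; unlike select, -1 means block indefinitely, and 0 means return immediately

struct pollfd{
	int fd;			// file descriptor
	short events;	// events to wait for, set by the user
	short revents;	// events that actually occurred, filled in by the kernel
};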
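As a rough sketch of how events and revents are used, the function below polls the read ends of two pipes when only the second one has data. The name poll_demo and the two-pipe setup are assumptions made for illustration.

```c
#include <poll.h>
#include <unistd.h>

/* Hypothetical demo: poll two pipe read ends; only the second has data.
 * Returns the index of the entry reported readable, or -1 on failure. */
int poll_demo(void)
{
    int a[2], b[2];
    int result = -1;

    if (pipe(a) == -1)
        return -1;
    if (pipe(b) == -1) {
        close(a[0]);
        close(a[1]);
        return -1;
    }

    if (write(b[1], "x", 1) == 1) {
        /* events is the request; revents is filled in by the kernel. */
        struct pollfd fds[2] = {
            { .fd = a[0], .events = POLLIN },
            { .fd = b[0], .events = POLLIN },
        };

        /* Second argument is the number of array elements. */
        int n = poll(fds, 2, 100);
        if (n == 1) {
            /* The caller still scans the whole array: O(n). */
            for (int i = 0; i < 2; i++) {
                if (fds[i].revents & POLLIN)
                    result = i;
            }
        }
    }

    close(a[0]);
    close(a[1]);
    close(b[0]);
    close(b[1]);
    return result;
}
```

Because the request (events) and the result (revents) are separate fields, the array does not have to be rebuilt before each call, unlike the fd_sets passed to select.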

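epoll

epoll is an I/O multiplexing facility specific to Linux. It can monitor multiple file descriptors for readable, writable, exceptional, and similar events, and it differs substantially from select and poll.

It provides a group of system calls.
For example, epoll_ctl() registers the events of interest on a file descriptor in an event table inside the kernel, so the kernel does not have to fetch the whole interest set from user space on every wait. The kernel places ready events on a ready list and copies them into the events argument of epoll_wait when that call is made.

The epoll API:
epoll_create()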

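//epoll_create
NAME
       epoll_create, epoll_create1 - open an epoll file descriptor

SYNOPSIS
       #include <sys/epoll.h>
// Function prototype: creates an epoll instance
       int epoll_create(int size);
Parameters:
       int size: a hint for the number of fds to be monitored; ignored since Linux 2.6.8, but it must still be greater than 0
Return value:
       Success: a file descriptor epfd referring to the new epoll instance, whose kernel object holds the root of the red-black tree
       Failure: -1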

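epoll_ctl()

NAME
       epoll_ctl - control interface for an epoll descriptor

SYNOPSIS
       #include <sys/epoll.h>
// Function prototype: registration function
       // This function is thread-safe; it may be called while another thread is blocked in epoll_wait on the same epfd
       int epoll_ctl(int epfd, int op, int fd, struct epoll_event *event);
// Parameters:
       int epfd: the epoll file descriptor returned by epoll_create
       int op: the operation to perform on fd (add, modify, delete)
       EPOLL_CTL_ADD: add
       EPOLL_CTL_MOD: modify
       EPOLL_CTL_DEL: delete
       int fd: the file descriptor to operate on
       struct epoll_event *event:
       typedef union epoll_data {	// union
          void        *ptr; // generic pointer, typically used to carry arbitrary user data such as a struct
          int          fd;  // the file descriptor being monitored; ptr and fd share storage and cannot be used at the same time
          uint32_t     u32;
          uint64_t     u64;
      } epoll_data_t;

      struct epoll_event {
          uint32_t     events;      /* Epoll events */
          epoll_data_t data;        /* User data variable */
      };

      events: the event types of interest on the file descriptor
      EPOLLIN: readable
      EPOLLOUT: writable
      EPOLLRDHUP: the peer closed the connection or shut down its writing half (error conditions are reported as EPOLLERR)
// Return value:
      Success: 0
      Failure: -1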
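Since ptr and fd share storage, code that uses ptr usually stores the descriptor inside the structure it points to. The sketch below shows that pattern; struct conn and epoll_ptr_demo are hypothetical names, and the ready event is fetched with epoll_wait, which is described in the next section.

```c
#include <sys/epoll.h>
#include <unistd.h>

/* Hypothetical per-connection context. Because data.ptr and data.fd
 * overlap, the descriptor is kept inside the structure itself. */
struct conn {
    int fd;
    int id;
};

/* Registers a pipe read end with data.ptr pointing at a struct conn and
 * returns the id recovered from the ready event, or -1 on failure. */
int epoll_ptr_demo(void)
{
    int p[2];
    int id = -1;

    if (pipe(p) == -1)
        return -1;

    int epfd = epoll_create1(0);
    if (epfd == -1) {
        close(p[0]);
        close(p[1]);
        return -1;
    }

    struct conn c = { .fd = p[0], .id = 42 };
    struct epoll_event ev = { .events = EPOLLIN, .data.ptr = &c };

    if (epoll_ctl(epfd, EPOLL_CTL_ADD, c.fd, &ev) == 0 &&
        write(p[1], "x", 1) == 1) {
        struct epoll_event out;

        if (epoll_wait(epfd, &out, 1, 100) == 1) {
            /* The kernel hands back exactly the data that was registered. */
            struct conn *ready = out.data.ptr;
            id = ready->id;
        }
    }

    close(epfd);
    close(p[0]);
    close(p[1]);
    return id;
}
```

In a real server the struct conn would normally be heap-allocated and outlive the function that registers it; a stack object is used here only to keep the sketch short.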

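epoll_wait()

NAME
       epoll_wait,  epoll_pwait  -  wait  for  an  I/O  event on an epoll file
       descriptor

SYNOPSIS
       #include <sys/epoll.h>
// Function prototype
       int epoll_wait(int epfd, struct epoll_event *events,
                      int maxevents, int timeout);
// Parameters:
       int epfd: the epoll file descriptor
       struct epoll_event *events: the same struct type as the epoll_ctl argument, but here it points to an array supplied by the caller, whereas epoll_ctl takes the address of a single variable.
              This is an output parameter: the kernel fills it with the epoll_event structures of the ready file descriptors.
       int maxevents: capacity of the events array; must be greater than 0 (for example 65535)
       int timeout: timeout in milliseconds; 0 returns immediately, -1 blocks indefinitely
// Return value:
       Success: number of ready file descriptors (0 if the timeout expired)
       Failure: -1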
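Putting the three calls together, a minimal sketch might look like the following. It uses epoll_create1(0), which appears in the NAME line above and behaves like epoll_create without the size hint. MAX_EVENTS and epoll_basic_demo are names chosen for this example only.

```c
#include <sys/epoll.h>
#include <unistd.h>

#define MAX_EVENTS 8    /* capacity of the output array; arbitrary here */

/* Hypothetical demo: returns 0 if epoll_wait reports the pipe's read end
 * as readable after one byte is written, -1 otherwise. */
int epoll_basic_demo(void)
{
    int p[2];
    int ok = -1;

    if (pipe(p) == -1)
        return -1;

    int epfd = epoll_create1(0);
    if (epfd == -1) {
        close(p[0]);
        close(p[1]);
        return -1;
    }

    /* epoll_ctl takes the address of a single epoll_event ... */
    struct epoll_event ev = { .events = EPOLLIN, .data.fd = p[0] };
    /* ... while epoll_wait fills a caller-supplied array. */
    struct epoll_event events[MAX_EVENTS];

    if (epoll_ctl(epfd, EPOLL_CTL_ADD, p[0], &ev) == 0 &&
        write(p[1], "x", 1) == 1) {
        int n = epoll_wait(epfd, events, MAX_EVENTS, 100);

        /* Only the first n entries are valid, and all of them are ready. */
        for (int i = 0; i < n; i++) {
            if (events[i].data.fd == p[0] && (events[i].events & EPOLLIN))
                ok = 0;
        }
    }

    close(epfd);
    close(p[0]);
    close(p[1]);
    return ok;
}
```

Unlike select and poll, the loop after epoll_wait visits only ready descriptors, never the full set of monitored ones.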

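LT (level-triggered) mode: when epoll_wait() detects an event on a descriptor, it notifies the process. The process does not have to handle the event immediately; the next call to epoll_wait() will report it again.

LT is the default mode and works with both blocking and non-blocking descriptors.

ET (edge-triggered) mode: unlike LT, the process must handle the event fully once it has been notified (typically by reading or writing until EAGAIN), because later calls to epoll_wait() will not report it again until a new event arrives.

This greatly reduces the number of times the same epoll event is triggered repeatedly, so ET can be more efficient than LT.

ET should only be used with non-blocking descriptors, so that a blocking read or write on one file descriptor cannot starve the task that is handling many descriptors.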
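The usual ET pattern is to make the descriptor non-blocking and, on each notification, read until read() fails with EAGAIN. The sketch below illustrates this on a pipe with a deliberately small buffer; set_nonblocking, drain, and et_demo are hypothetical helper names, not part of any library.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Put fd into non-blocking mode. Returns 0 on success, -1 on error. */
static int set_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL, 0);
    if (flags == -1)
        return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK);
}

/* Read until the descriptor has no more data (EAGAIN) or hits EOF.
 * Returns the number of bytes consumed, or -1 on a real error. */
static long drain(int fd)
{
    char buf[4];                /* tiny on purpose: forces several reads */
    long total = 0;

    for (;;) {
        ssize_t n = read(fd, buf, sizeof buf);
        if (n > 0) {
            total += n;
            continue;
        }
        if (n == 0)
            return total;       /* EOF */
        if (errno == EINTR)
            continue;
        if (errno == EAGAIN || errno == EWOULDBLOCK)
            return total;       /* fully drained for this edge */
        return -1;
    }
}

/* Hypothetical demo: one ET notification, then drain all 10 bytes.
 * Returns the number of bytes read, or -1 on failure. */
long et_demo(void)
{
    int p[2];
    long total = -1;

    if (pipe(p) == -1)
        return -1;

    int epfd = epoll_create1(0);
    if (epfd == -1) {
        close(p[0]);
        close(p[1]);
        return -1;
    }

    struct epoll_event ev = { .events = EPOLLIN | EPOLLET, .data.fd = p[0] };
    struct epoll_event out;

    if (set_nonblocking(p[0]) == 0 &&
        epoll_ctl(epfd, EPOLL_CTL_ADD, p[0], &ev) == 0 &&
        write(p[1], "0123456789", 10) == 10 &&
        epoll_wait(epfd, &out, 1, 100) == 1) {
        total = drain(out.data.fd);
    }

    close(epfd);
    close(p[0]);
    close(p[1]);
    return total;
}
```

If the descriptor were left in blocking mode, the final read() in drain would block instead of returning EAGAIN, which is the starvation problem described above.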

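The EPOLLONESHOT event:
It is usually combined with ET mode. Even in ET mode, events on one socket may still fire more than once. For example, after one thread has read the data from a socket and started processing it, new data may arrive on that socket and make it readable again, and another thread will then read from the same socket. Two threads end up operating on one socket at the same time, which is clearly undesirable.

With EPOLLONESHOT, a file descriptor reports at most one event and is then disabled. Note that once the thread has finished processing the socket's data, it must re-arm the socket with epoll_ctl(EPOLL_CTL_MOD), including EPOLLONESHOT again, so that the next readable event can be delivered.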
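The disable-then-re-arm behavior can be observed with a small experiment. The sketch below is an illustration under simplified assumptions (a single thread and a pipe instead of a socket); rearm_oneshot and oneshot_demo are hypothetical names.

```c
#include <sys/epoll.h>
#include <unistd.h>

/* Hypothetical helper: re-enable a one-shot descriptor after its event
 * has been handled. Returns 0 on success, -1 on error. */
static int rearm_oneshot(int epfd, int fd)
{
    struct epoll_event ev = {
        .events = EPOLLIN | EPOLLET | EPOLLONESHOT,
        .data.fd = fd,
    };
    return epoll_ctl(epfd, EPOLL_CTL_MOD, fd, &ev);
}

/* Encodes three epoll_wait results as first*100 + second*10 + third:
 *   first  : after the first write (event expected)
 *   second : after a second write, before re-arming (none expected)
 *   third  : after re-arming (event expected again)
 * Returns -1 if the setup fails. */
int oneshot_demo(void)
{
    int p[2];
    int result = -1;

    if (pipe(p) == -1)
        return -1;

    int epfd = epoll_create1(0);
    if (epfd == -1) {
        close(p[0]);
        close(p[1]);
        return -1;
    }

    struct epoll_event ev = {
        .events = EPOLLIN | EPOLLET | EPOLLONESHOT,
        .data.fd = p[0],
    };
    struct epoll_event out;

    if (epoll_ctl(epfd, EPOLL_CTL_ADD, p[0], &ev) == 0 &&
        write(p[1], "a", 1) == 1) {
        int first = epoll_wait(epfd, &out, 1, 100);
        int second = -1;
        int third = -1;

        /* New data arrives, but the descriptor is disabled. */
        if (write(p[1], "b", 1) == 1)
            second = epoll_wait(epfd, &out, 1, 100);

        /* Re-arming makes the pending data visible again. */
        if (rearm_oneshot(epfd, p[0]) == 0)
            third = epoll_wait(epfd, &out, 1, 100);

        result = first * 100 + second * 10 + third;
    }

    close(epfd);
    close(p[0]);
    close(p[1]);
    return result;
}
```

In a multithreaded server, the worker that handled the event would call the re-arm helper when it finishes, so at most one thread owns the socket at any time.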

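Characteristics of epoll:
1. epoll is an I/O multiplexing facility specific to Linux.

2. In the kernel, epoll keeps an event table (a red-black tree) and a ready list (a linked list). The epoll_ctl system call registers, modifies, and deletes file descriptors in the event table, so the kernel does not need to read the whole interest set from the call arguments each time. The kernel places ready events on the ready list and writes them into the user-supplied events array when epoll_wait is called.

3. The underlying data structure of epoll is a red-black tree, so lookups are fast.

4. epoll detects ready events through callbacks, at O(1) cost per event, instead of scanning all n monitored descriptors.

5. The user finds each ready file descriptor in O(1) time, because epoll_wait returns only the ready ones.

Why epoll is efficient: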

/*
 * Each file descriptor added to the eventpoll interface will
 * have an entry of this type linked to the "rbr" RB tree.
 * Avoid increasing the size of this struct, there can be many thousands
 * of these on a server and we do not want this to take another cache line.
 */
struct epitem {
    union {
        /* RB tree node links this structure to the eventpoll RB tree */
        struct rb_node rbn;
        /* Used to free the struct epitem */
        struct rcu_head rcu;
    };

    /* List header used to link this structure to the eventpoll ready list */
    struct list_head rdllink;

    /*
     * Works together "struct eventpoll"->ovflist in keeping the
     * single linked chain of items.
     */
    struct epitem *next;

    /* The file descriptor information this item refers to */
    struct epoll_filefd ffd;

    /* Number of active wait queue attached to poll operations */
    int nwait;

    /* List containing poll wait queues */
    struct list_head pwqlist;

    /* The "container" of this item */
    struct eventpoll *ep;

    /* List header used to link this item to the "struct file" items list */
    struct list_head fllink;

    /* wakeup_source used when EPOLLWAKEUP is set */
    struct wakeup_source __rcu *ws;

    /* The structure that describe the interested events and the source fd */
    struct epoll_event event;
};

/*
 * This structure is stored inside the "private_data" member of the file
 * structure and represents the main data structure for the eventpoll
 * interface.
 */
struct eventpoll {
    /* Protect the access to this structure */
    spinlock_t lock;

    /*
     * This mutex is used to ensure that files are not removed
     * while epoll is using them. This is held during the event
     * collection loop, the file cleanup path, the epoll file exit
     * code and the ctl operations.
     */
    struct mutex mtx;

    /* Wait queue used by sys_epoll_wait() */
    wait_queue_head_t wq;

    /* Wait queue used by file->poll() */
    wait_queue_head_t poll_wait;

    /* List of ready file descriptors */
    struct list_head rdllist;

    /* RB tree root used to store monitored fd structs */
    struct rb_root rbr;

    /*
     * This is a single linked list that chains all the "struct epitem" that
     * happened while transferring ready events to userspace w/out
     * holding ->lock.
     */
    struct epitem *ovflist;

    /* wakeup_source used when ep_scan_ready_list is running */
    struct wakeup_source *ws;

    /* The user that created the eventpoll descriptor */
    struct user_struct *user;

    struct file *file;

    /* used to optimize loop detection check */
    int visited;
    struct list_head visited_list_link;
};

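epoll uses a red-black tree (RB-Tree) to track and maintain all monitored file descriptors; the root of the RB-Tree is the rbr member of struct eventpoll.

When epoll_create is called, the kernel creates a file node in the epoll file system and a red-black tree in kernel memory to store the sockets later passed in by epoll_ctl. It also creates a linked list to hold ready events.

When epoll_wait is called, it only needs to check whether this list contains any entries.

If there are entries, it returns. If not, it sleeps, and it returns once the timeout expires even if the list is still empty.

That is why epoll_wait is very efficient.

Moreover, even when monitoring on the order of a million handles, usually only a small number of them are ready at any one time, so epoll_wait only has to copy a few handles from kernel space to user space.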
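The effect described above can be seen from user space with a small experiment: register many descriptors, make only a few of them ready, and check how many entries epoll_wait returns. The sketch below assumes 64 pipes with data written to two of them; NPIPES and ready_count_demo are illustrative names.

```c
#include <sys/epoll.h>
#include <unistd.h>

#define NPIPES 64   /* number of monitored pipes; arbitrary for the demo */

/* Hypothetical demo: monitors NPIPES pipe read ends, writes to two of
 * them, and returns how many events epoll_wait reports (-1 on failure). */
int ready_count_demo(void)
{
    int pipes[NPIPES][2];
    int created = 0;
    int nready = -1;

    int epfd = epoll_create1(0);
    if (epfd == -1)
        return -1;

    /* Register every read end once; the kernel keeps the interest set. */
    for (; created < NPIPES; created++) {
        if (pipe(pipes[created]) == -1)
            break;

        struct epoll_event ev = {
            .events = EPOLLIN,
            .data.fd = pipes[created][0],
        };
        if (epoll_ctl(epfd, EPOLL_CTL_ADD, pipes[created][0], &ev) == -1) {
            close(pipes[created][0]);
            close(pipes[created][1]);
            break;
        }
    }

    if (created == NPIPES &&
        write(pipes[3][1], "x", 1) == 1 &&
        write(pipes[40][1], "x", 1) == 1) {
        struct epoll_event events[NPIPES];

        /* Only the ready descriptors are copied into events. */
        nready = epoll_wait(epfd, events, NPIPES, 100);
    }

    for (int i = 0; i < created; i++) {
        close(pipes[i][0]);
        close(pipes[i][1]);
    }
    close(epfd);
    return nready;
}
```

Although 64 descriptors are monitored, the array returned to user space holds just the two that have data.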

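So how is this ready list maintained?

When we call epoll_ctl, besides placing the socket in the red-black tree that belongs to the epoll file object, the kernel also registers a callback function that is invoked, typically from interrupt handling, when the handle becomes ready, and that puts the handle on the ready list.

So when data arrives on a socket, the kernel copies the data from the network card into kernel memory and then inserts the socket into the ready list.

epoll is not more efficient than select in every case. For example, when fewer than 1024 file descriptors are monitored and most of the sockets are active and busy, select can be more efficient than epoll, because epoll involves more system calls and more frequent switches between user mode and kernel mode.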

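The essence of epoll's efficiency:

It reduces the copying of file handles between user space and kernel space.
It reduces the traversal needed to find readable and writable file handles.
Only ready events are passed from the kernel to user space. It is often claimed that epoll avoids copying by having the kernel and the user process mmap the same memory, but the mainline Linux implementation does not do this; it copies the ready events into the caller's events array, which is normally a small amount of data.
I/O performance does not degrade as the number of monitored file descriptors grows.
It stores fds and their callbacks in a red-black tree, whose insertion, lookup, and deletion perform well; compared with a hash table, it does not need to preallocate a lot of space.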

Differences among the three

Summary

The LightOi project uses epoll + ET + EPOLLONESHOT.

Original article: http://outofmemory.cn/langs/584983.html
