
Kernel/VM Advent Calendar Day 31.

[OS] This article is for Kernel/VM Advent Calendar Day 31.
Yeah, I know, Christmas 2010 is already over, but never mind :)

Last year I wrote a simple operating system kernel. Actually, it doesn't have any useful features. Still, it implements hardware/software interrupt handlers, paging, device drivers, a (very limited) file system, and it can execute an ELF program (also very limited).
I wish I had done a new hack, but I didn't, so I will recap what I did as far as I can remember. It may help someone who wants to write their own kernel.

My development machine is x86_64, but the kernel supports only 32-bit x86. I mostly use Bochs as a test machine, and I also use QEMU and a bare-metal machine. OK, let's start.

First, you need a boot loader that boots your kernel. You have two choices: write your own boot loader or use GNU GRUB. I'm not sure which is the better way; it depends on what you are interested in. If you want to know everything about how a kernel starts, writing a boot loader might be a good choice. If you are more interested in memory management, process scheduling, and so forth, using GRUB might be better because it gets you there faster.
I chose GRUB, so I will describe how to use it. GRUB has a good boot protocol called Multiboot, and you need to read its specification.
The specification contains sample code: a header file, a main() written in C, and bootstrap code. Those files are distributed under GPL2, so if you use them, you need to follow the licence.
This is my bootstrap code. I removed the parts I didn't need. It pushes the arguments for cmain() and then calls it. cmain() is the main function of the kernel. When the bootstrap code is called, the CPU is already in protected mode, so you don't have to switch from real mode to protected mode yourself, but you should still do some protected-mode setup because GRUB doesn't do everything for you.

#define ASM 1
#include <mikoOS/multiboot.h>
	
.section .entry 
.code32
	
.globl _start
_start:
	jmp	entry
	/* align for 32bit mode */
	.align 	4

	/* This is the multiboot header for GRUB */
multiboot_header:
	.long 	MULTIBOOT_HEADER_MAGIC
	.long 	MULTIBOOT_HEADER_FLAGS
	.long 	CHECKSUM

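entry:
	.comm stack, KERNEL_STACK_SIZE, 32     /* reserve the kernel stack (16k), aligned to 32 bytes */
	/* the Multiboot spec leaves %esp undefined, so point it at the top of our own stack */
	movl	$(stack + KERNEL_STACK_SIZE), %esp
	push 	%ebx /* ebx holds the 32-bit address of the multiboot information structure */
	push 	%eax /* eax must be 0x2BADB002 */
	call 	cmain  /* call the C entry point; it never returns */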

cmain() is declared like this. This function never returns.

void cmain(unsigned long magic, unsigned long addr)
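
As a rough sketch of what the first lines of cmain() might check: the Multiboot specification says the boot loader leaves 0x2BADB002 in EAX, and that the three header fields (magic, flags, checksum) must add up to zero modulo 2^32. The helper names mb_checksum() and booted_by_multiboot() are hypothetical and are not taken from my kernel.

```c
#include <assert.h>
#include <stdint.h>

/* Values from the Multiboot (version 1) specification. */
#define MB_HEADER_MAGIC     0x1BADB002u  /* stored in the header of the kernel image */
#define MB_BOOTLOADER_MAGIC 0x2BADB002u  /* passed to the kernel in EAX */

/* Hypothetical helper: the checksum that makes magic + flags + checksum == 0. */
static uint32_t mb_checksum(uint32_t flags)
{
        uint32_t sum = (uint32_t)(0u - (MB_HEADER_MAGIC + flags));

        /* The three fields must wrap around to exactly zero. */
        assert((uint32_t)(MB_HEADER_MAGIC + flags + sum) == 0u);
        return sum;
}

/* Hypothetical helper: cmain() could refuse to go on when this returns 0. */
static int booted_by_multiboot(unsigned long magic)
{
        return magic == MB_BOOTLOADER_MAGIC;
}
```

With a check like this, cmain() can stop early instead of dereferencing addr when the kernel was started by something other than a Multiboot loader.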

Once you have written the bootstrap code and the main function, you can build your kernel. Before that, though, you need a linker script. Then compile the bootstrap code and the C files, and link them using the linker script. Mine is really simple.

ENTRY(_start)
SECTIONS
{
       . = 0x100000;
       .text :{
               *(.entry);
               *(.text);
       }

       .data ALIGN (0x1000) : {
               *(.data)
       }

       .rodata ALIGN (0x1000) : {
           *(.rodata)
       }

       .bss :{
               *(.bss);
       }
       end = .;
}

So, now you should be able to boot your kernel from GRUB :)

Second, you must set up the GDT, because the GDT that GRUB sets up is only temporary. I did it with the following code. Setting up the GDT is pretty simple: prepare the data structures, fill in each descriptor, fill in the GDTR image, and load it into the GDTR with the lgdt instruction.
This structure represents a segment descriptor.

struct segment_descriptor {
        u_int16_t limit_l;
        u_int16_t base_l;
        u_int8_t base_m;
        u_int8_t type;
        u_int8_t limit_h;
        u_int8_t base_h;
} __attribute__ ((packed));

This structure represents the GDTR. One important point is the "__attribute__ ((packed))" line. Without it, gcc aligns the 32-bit base field and inserts 16 bits of padding between limit and base, so the structure would be 64 bits, but the GDTR is 48 bits. So you need to tell gcc not to add any padding.

struct descriptor_table {
        u_int16_t limit;
        u_int32_t base;
} __attribute__ ((packed));
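
To make the "set value to each data" step concrete, here is a minimal sketch that fills one descriptor and checks the structure sizes at compile time. The structures mirror the two above but use stdint types so the sketch compiles on its own; set_seg_desc() and its access/flags parameters are names I made up for this example.

```c
#include <assert.h>
#include <stdint.h>

struct seg_desc {            /* same layout as struct segment_descriptor */
        uint16_t limit_l;    /* limit bits 0-15 */
        uint16_t base_l;     /* base bits 0-15 */
        uint8_t  base_m;     /* base bits 16-23 */
        uint8_t  type;       /* access byte: present, DPL, segment type */
        uint8_t  limit_h;    /* low nibble: limit bits 16-19, high nibble: flags */
        uint8_t  base_h;     /* base bits 24-31 */
} __attribute__((packed));

struct desc_table {          /* same layout as struct descriptor_table */
        uint16_t limit;      /* size of the table in bytes, minus 1 */
        uint32_t base;       /* linear address of the table */
} __attribute__((packed));

/* Without packed, the second check fails because of the padding. */
static_assert(sizeof(struct seg_desc) == 8, "descriptor must be 8 bytes");
static_assert(sizeof(struct desc_table) == 6, "GDTR image must be 48 bits");

/* Hypothetical helper: scatter base and limit into the descriptor fields. */
static void set_seg_desc(struct seg_desc *d, uint32_t base, uint32_t limit,
                         uint8_t access, uint8_t flags)
{
        d->limit_l = limit & 0xffff;
        d->base_l  = base & 0xffff;
        d->base_m  = (base >> 16) & 0xff;
        d->type    = access;
        d->limit_h = (uint8_t)(((limit >> 16) & 0x0f) | ((flags & 0x0f) << 4));
        d->base_h  = (base >> 24) & 0xff;
}
```

For example, a flat 4GB ring-0 code segment would be set_seg_desc(&d, 0, 0xFFFFF, 0x9A, 0xC): access 0x9A means present, ring 0, executable and readable, and flags 0xC means 4KB granularity and 32-bit.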

Third, now that the basic protected-mode setup is done, I think the next step is to set up interrupt handlers. You could set up paging first, but you shouldn't enable it until you have interrupt handlers. After all, if the page fault handler isn't installed, who handles a page fault? By the way, my handler catches the fault but doesn't fix it up :( Still, that isn't bad, because you can at least notice when the kernel accesses an invalid memory address.
I wrote four files to set up interrupt handlers.
1.interrupt.c
This file sets up the IDT and IDTR.
2.interrupt_handler.c
This file defines the interrupt handler table.
3.interrupt_handler.h
This file declares the interrupt handler functions.
4.intr_gate.S
This file contains the entry code that runs when a hardware/software interrupt occurs.

Setting up the IDT and IDTR is much the same as setting up the GDT. The IDTR has exactly the same layout as the GDTR, so I reuse the same structure (struct descriptor_table). One difference is that each IDT entry holds the address of an interrupt handler function. The IDT is an array of the following structure, and IDTR.base holds the array's start address.

struct gate_descriptor {
        u_int16_t offsetL;
        u_int16_t selector;
        u_int8_t count;
        u_int8_t type;
        u_int16_t offsetH;
} __attribute__((packed));

I set up each entry like this.

static void set_handler(int idx, u_int32_t base,
                        u_int16_t selector, u_int8_t type)
{
        struct gate_descriptor *p = &intr_table[idx];

        p->offsetL = base & 0xffff;
        p->offsetH = (base >> 16) & 0xffff;
        p->selector = selector;
        p->type = type;
        p->count = 0; // unused.
}
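
As a small sketch of what goes into those fields: the handler address is split across offsetL and offsetH, and the type byte encodes "present", the privilege level and the gate kind. The two GATE_* constants are the usual x86 encodings for 32-bit interrupt gates; gate_offset() and idt_limit() are hypothetical helpers that do not exist in my source.

```c
#include <assert.h>
#include <stdint.h>

struct gate {                 /* same layout as struct gate_descriptor */
        uint16_t offsetL;     /* handler address bits 0-15 */
        uint16_t selector;    /* code segment selector */
        uint8_t  count;       /* unused for interrupt gates */
        uint8_t  type;        /* P, DPL and gate type */
        uint16_t offsetH;     /* handler address bits 16-31 */
} __attribute__((packed));

static_assert(sizeof(struct gate) == 8, "gate descriptor must be 8 bytes");

/* P=1, DPL=0, type 0xE: 32-bit interrupt gate reachable from ring 0 only. */
#define GATE_INTR32_RING0 0x8e
/* P=1, DPL=3, type 0xE: same gate, but "int n" from ring 3 is allowed. */
#define GATE_INTR32_RING3 0xee

/* Reassemble the handler address from the two halves. */
static uint32_t gate_offset(const struct gate *g)
{
        return ((uint32_t)g->offsetH << 16) | g->offsetL;
}

/* Value for IDTR.limit: size of the table in bytes, minus 1. */
static uint16_t idt_limit(unsigned int entries)
{
        assert(entries > 0 && entries <= 256);
        return (uint16_t)(entries * sizeof(struct gate) - 1);
}
```

A vector used for system calls from user mode would need the ring 3 variant; otherwise the CPU raises a fault instead of calling the handler.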

Then, set up paging. mm.c does it, and it also has a simple page allocator. It is important that the page directory is aligned on a 4KB boundary, so I use "__attribute__((aligned(0x1000)))" to do so.
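
Here is a minimal sketch of the idea, not the code from mm.c: it identity-maps the first 4MB with one page table and shows how a 32-bit virtual address splits into directory and table indexes. All names are made up for this example, and the physical address of the page table is passed in as a parameter so that the sketch also runs on a 64-bit host.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SIZE   4096u
#define PTE_PRESENT 0x1u
#define PTE_WRITE   0x2u

/* 1024 entries each; both tables must start on a 4KB boundary. */
static uint32_t page_directory[1024] __attribute__((aligned(0x1000)));
static uint32_t first_page_table[1024] __attribute__((aligned(0x1000)));

/* Bits 22-31 of a virtual address select the page directory entry. */
static unsigned int pd_index(uint32_t vaddr) { return vaddr >> 22; }
/* Bits 12-21 select the entry inside the page table. */
static unsigned int pt_index(uint32_t vaddr) { return (vaddr >> 12) & 0x3ff; }

/*
 * Identity-map the first 4MB (virtual address == physical address).
 * table_phys is the physical address of first_page_table.
 */
static void identity_map_first_4mb(uint32_t table_phys)
{
        unsigned int i;

        assert((table_phys & (PAGE_SIZE - 1)) == 0);
        for (i = 0; i < 1024; i++)
                first_page_table[i] = (i * PAGE_SIZE) | PTE_PRESENT | PTE_WRITE;
        page_directory[0] = table_phys | PTE_PRESENT | PTE_WRITE;
}
```

In a real kernel the next steps would be loading the physical address of page_directory into CR3 and setting the PG bit in CR0, which needs inline assembly and is left out here.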

Now the basic protected-mode setup is done :) So what's next? I decided to add a file system, but before that I needed device drivers to use the hard disk drive. I wrote two drivers: a PCI driver and an ATA driver. I also wrote some library functions such as kmalloc, memset, strcmp, and so on. These libraries are in there. As you may know, dynamic memory allocation is useful when writing a program, so it's better to have it.
Anyway, back to the driver. I wrote this function to find PCI devices.

/**
 * Find PCI device by bus number and device number.
 * @param bus bus number.
 * @param dev device number.
 * @return always 0.
 */
static u_int32_t find_pci_data(u_int8_t bus, u_int8_t dev)
{
        u_int32_t data;
        u_int32_t status;
        u_int32_t class;
        u_int32_t header;
        u_int32_t subsystem;

        int i;
        struct pci_configuration_register reg;
        bool b;

        // At first, check function number zero.
        memset(&reg, 0, sizeof(reg));
        reg.bus_num = bus;
        reg.dev_num = dev;

        // Check all function numbers.
        for (i = 0; i < PCI_FUNCTION_MAX; i++) {
                reg.func_num = i;               
                data = read_pci_reg00(&reg);
                if (data != 0xffffffff) {

                        class = read_pci_class(&reg);
                        header = read_pci_header_type(&reg);
                        subsystem = read_pci_sub_system(&reg);
                        status = read_pci_command_register(&reg);

                        b = store_pci_device_to_list(bus, dev, data, i, class, header, subsystem);
                        if (!b)
                                KERN_ABORT("kmalloc failed");

                        // if function 0 is not a multi-function device, we don't need to search the other functions.
                        if (i == 0 && !is_multi_function(header))
                                return 0;
                }
        }
        
        return 0;
}

PCI allows up to 256 buses, 32 devices per bus, and 8 functions per device. I checked 256 * 256 bus/device pairs, which is more than needed because only device numbers 0 to 31 are valid. You also need to check whether a device is multi-function or not. When function number 0 has bit 7 of its header type register set, the device is multi-function and functions 1 to 7 may exist too.
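
A sketch of that check, assuming the caller has already read the 32-bit configuration registers at offsets 0x00 and 0x0C. The header type byte sits in bits 16-23 of the dword at 0x0C. The function names are hypothetical; my own is_multi_function() may look different.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PCI_BUS_MAX    256
#define PCI_DEVICE_MAX 32
#define PCI_FUNC_MAX   8

/* A full scan touches at most this many bus/device/function triples. */
static_assert(PCI_BUS_MAX * PCI_DEVICE_MAX * PCI_FUNC_MAX == 65536,
              "PCI configuration space has 65536 functions");

/* reg0c is the dword at offset 0x0C; bit 7 of the header type marks multi-function. */
static bool header_is_multi_function(uint32_t reg0c)
{
        uint8_t header_type = (reg0c >> 16) & 0xff;

        return (header_type & 0x80) != 0;
}

/* How many functions are worth probing, given function 0's registers. */
static unsigned int functions_to_probe(uint32_t reg00, uint32_t reg0c)
{
        if ((reg00 & 0xffff) == 0xffff)   /* vendor ID 0xFFFF: empty slot */
                return 0;
        return header_is_multi_function(reg0c) ? PCI_FUNC_MAX : 1;
}
```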

You can see that "Bus 0:Devfn 1" is a multi-function device.
In this function, each read_pci_foobar() just sets the register number and calls read_pci_data(), so they are not important. By the way, PCI configuration access uses two I/O registers: one is PCI CONFIG ADDRESS and the other is PCI CONFIG DATA. First, you write the target (bus, device, function, and register, with the enable bit set) to CONFIG ADDRESS. Then you can read the data from CONFIG DATA. When I finish reading, I write CONFIG ADDRESS again with the enable bit cleared.
So, I needed these four functions and one data structure. The struct pci_configuration_register represents the PCI CONFIG ADDRESS register.
Complete code is here.
Oh, the PCI CONFIG ADDRESS register is 32 bits wide, but this structure has a separate variable per field because that is easier to use. When I access the register, I convert the structure to a 32-bit integer.

// This structure represents PCI's CONFIG_ADDRESS register.
struct pci_configuration_register {
        u_int32_t enable_bit;      // 31: enable bit.
        u_int32_t reserved;        // 24-30: reserved.
        u_int32_t bus_num;         // 16-23: bus number.
        u_int32_t dev_num;         // 11-15: device number.
        u_int32_t func_num;        // 8-10: function number.
        u_int32_t reg_num;         // 2-7: register number.
        u_int32_t bit0;            // 0-1: always 0.
};

/**
 * Set ENABLE bit to 0 and write data to CONFIG_ADDRESS.
 */
static inline void finish_access_to_config_data(struct pci_configuration_register *reg)
{
        reg->enable_bit = 0;
        write_pci_config_address(reg);
}

/**
 * Read CONFIG_DATA.
 * @param reg it should be set bus, device, function and so forth.
 * @return data from CONFIG_DATA.
 */
static u_int32_t read_pci_data(struct pci_configuration_register *reg)
{
        u_int32_t data;

        // Enable bit should be 1 before read PCI_DATA.
        reg->enable_bit = 1;

        // write data to CONFIG_ADDRESS.
        write_pci_config_address(reg);
        
        data = inl(CONFIG_DATA_1);

        finish_access_to_config_data(reg);

        return data;
}

/**
 * Write to CONFIG_DATA.
 * @param reg must have bus, device, function and so forth set.
 * @param data value to write to CONFIG_DATA.
 */
static void write_pci_data(struct pci_configuration_register *reg, u_int32_t data)
{
        // Enable bit should be 1 before writing PCI_DATA.
        reg->enable_bit = 1;

        // write data to CONFIG_ADDRESS.
        write_pci_config_address(reg);
        
        outl(CONFIG_DATA_1, data);
        finish_access_to_config_data(reg);
}


/**
 * Write data to CONFIG_ADDRESS.
 * @param reg it should be set bus, device, function and so forth.
 */
static inline void write_pci_config_address(const struct pci_configuration_register *reg)
{
        u_int32_t data = 0;

        data = (reg->enable_bit << 31) |
                (reg->reserved << 24) | 
                (reg->bus_num << 16) | 
                (reg->dev_num << 11) | 
                (reg->func_num << 8) |
                reg->reg_num;

        outl(PCI_CONFIG_ADDRESS, data); 
}
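
The same packing can be sketched without the intermediate structure. This version takes the register as a byte offset and masks the low two bits, which is an assumption about how the offset is passed, not a copy of my code. 0xCF8 and 0xCFC are the standard ports of PCI configuration mechanism #1; the macro and function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Standard I/O ports of PCI configuration mechanism #1. */
#define PCI_CONFIG_ADDRESS_PORT 0x0cf8
#define PCI_CONFIG_DATA_PORT    0x0cfc

/* Build the 32-bit value to write to CONFIG_ADDRESS before touching CONFIG_DATA. */
static uint32_t pci_config_address(uint8_t bus, uint8_t dev, uint8_t func,
                                   uint8_t offset)
{
        assert(dev < 32 && func < 8);
        return (1u << 31)               /* bit 31: enable */
             | ((uint32_t)bus << 16)    /* bits 16-23: bus number */
             | ((uint32_t)dev << 11)    /* bits 11-15: device number */
             | ((uint32_t)func << 8)    /* bits 8-10: function number */
             | (offset & 0xfcu);        /* bits 2-7: register, dword aligned */
}
```

For bus 0, device 1, function 0 and the class code register at offset 0x08, this gives 0x80000808.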

Now you can find devices, so the next step is to find an ATA device and, if one is found, set it up :) You can see all the code. There are some important functions such as initialization and sector read/write.
Initializing an ATA disk goes like this. This code supports only ATA devices; ATAPI isn't supported. First, it checks whether the device is ATA or not.

/**
 * Main initialize routine.
 * @param device is master or slave.
 * @return true or false.
 */
static bool initialize_common(int device)
{
        u_int8_t high, low;
        sector_t buf[256];
        int dev = 0;
        bool ret = false;

        memset(buf, 0x0, sizeof(buf));

        high = low = 0;

        do_soft_reset(device);

        // Read Cylinder register high and low before use.
        low = get_cylinder_regster(CYLINDER_LOW_REGISTER);
        high = get_cylinder_regster(CYLINDER_HIGH_REGISTER);

        // Is this device ATA?
        dev = get_device_type(high, low);
        switch (dev) {
        case DEV_TYPE_ATA: // Only supports ATA device yet.
                break;
        case DEV_TYPE_ATAPI:
                printk("ATAPI device is not supported yet.\n");
                return false;
        default:
                printk("Unknown device\n");
                return false;
        }

        ret = do_identify_device(device, buf);
        if (!ret) {
                printk("identify device failed\n");
                return false;
        }

        max_logical_sector_num = ((u_int32_t) buf[61] << 16) | buf[60];
        
        return true;
}
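
The get_device_type() call above is not shown, so here is a sketch of how such a check usually works. After a reset, a device leaves a signature in the cylinder registers: 0x00/0x00 for ATA and 0xEB (high) / 0x14 (low) for ATAPI. The enum and function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

enum ata_dev_type { ATA_DEV_ATA, ATA_DEV_ATAPI, ATA_DEV_UNKNOWN };

/* Classify a device from the signature left in the cylinder registers. */
static enum ata_dev_type classify_signature(uint8_t cyl_high, uint8_t cyl_low)
{
        if (cyl_high == 0x00 && cyl_low == 0x00)
                return ATA_DEV_ATA;
        if (cyl_high == 0xeb && cyl_low == 0x14)
                return ATA_DEV_ATAPI;
        return ATA_DEV_UNKNOWN;   /* e.g. 0xff/0xff when nothing is attached */
}

/* Printable name, handy for boot messages. */
static const char *dev_type_name(enum ata_dev_type t)
{
        static const char *const names[] = { "ATA", "ATAPI", "unknown" };

        assert((unsigned int)t <= ATA_DEV_UNKNOWN);
        return names[t];
}
```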

Next, it executes the IDENTIFY DEVICE command to initialize the device. Mainly, this function polls the status register and, once the device is ready, moves on to the next step.

/**
 * Execute Identify Device command.
 * @param device number.
 * @param buf is to store data.
 * @return false if the Identify Device command failed.
 */
static bool do_identify_device(int device, sector_t *buf)
{
        bool ret = false;
        u_int8_t data;
        int i;

        do_device_selection_protocol(device);
        ret = get_DRDY();
        if (ret) {

                write_command(0xec);

//      read_bsy: // unused.
                wait_loop_usec(5);

                if (!wait_until_BSY_is_zero(STATUS_REGISTER)) {
                        printk("wait failed\n");
                        return false;
                }

                data = inb(ALTERNATE_STATUS_REGISTER);

        read_status_register:
                data = inb(STATUS_REGISTER);

                if (is_error(data)) {
                        printk("error occured:0x%x\n", data);
                        print_error_register(device);
                        return false;
                }

                if (!is_drq_active(data))
                        goto read_status_register;

                if (is_device_fault()) {
                        printk("some error occured\n");
                        return false;
                }

                for (i = 0; i < 256; i++)
                        buf[i] = inw(DATA_REGISTER);
        } 

        return ret; // false when the device never reported DRDY.

}
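
Once the 256 words are in the buffer, a few of them are immediately useful. The sketch below pulls out the 28-bit LBA sector count (words 60-61, the same words used for max_logical_sector_num) and the model name (words 27-46, where the first character of each pair is in the high byte). The function names are made up for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Words 60-61: number of sectors addressable with 28-bit LBA. */
static uint32_t identify_lba28_sectors(const uint16_t id[256])
{
        return ((uint32_t)id[61] << 16) | id[60];
}

/* Words 27-46: model name, 40 ASCII characters padded with spaces. */
static void identify_model(const uint16_t id[256], char out[41])
{
        size_t i;
        int end;

        assert(id != NULL && out != NULL);
        for (i = 0; i < 20; i++) {
                out[2 * i]     = (char)(id[27 + i] >> 8);    /* first char: high byte */
                out[2 * i + 1] = (char)(id[27 + i] & 0xff);  /* second char: low byte */
        }
        out[40] = '\0';

        /* Trim the trailing space padding. */
        for (end = 39; end >= 0 && out[end] == ' '; end--)
                out[end] = '\0';
}
```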

Reading and writing go through the same steps, so they use a common function. In PIO mode, reading and writing are almost the same. The command that is sent is of course different, but the steps are the same.

/**
 * Sector Read/Write common function for PIO data R/W.
 * @param cmd is read or write command.
 * @param device number.
 * @param sector number.
 * @return true if success.
 */
static bool sector_rw_common(u_int8_t cmd, int device, u_int32_t sector)
{
        bool b = false;
        u_int8_t status;
        int loop = 0;

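        // sector number needs to be checked: IDENTIFY words 60-61 hold the
        // sector count, so valid LBAs are 0 .. max_logical_sector_num - 1.
        if (sector >= max_logical_sector_num) {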
                printk("Invalid Sector number 0x%lx\n", sector);
                return false;
        }

        b = wait_until_device_is_ready(device);
        if (!b) {
                printk("Device wasn't ready.\n");
                return false;
        }

        // nIEN bit should be enable and other bits are disable.
        outb(DEVICE_CONTROL_REGISTER, 0x02);

        // Features register should be 0.
        outb(FEATURES_REGISTER, 0x00);

        // Set Logical Sector.
        outb(SECTOR_NUMBER_REGISTER, sector & 0xff);
        outb(CYLINDER_LOW_REGISTER, (sector >> 8) & 0xff);
        outb(CYLINDER_HIGH_REGISTER, (sector >> 16) & 0xff);
        outb(DEVICE_HEAD_REGISTER, ((sector >> 24) & 0x1f) | 0x40);
        outb(SECTOR_COUNT_REGISTER, 1);

#if 0
        printk("device:0x%x secnum:0x%x low:0x%x high:0x%x head:0x%x\n",
               device,
               sector & 0xff,
               (sector >> 8) & 0xff,
               (sector >> 16) & 0xff,
               (((sector >> 24) & 0x1f) | 0x40));
#endif

        // Execute command.
        outb(COMMAND_REGISTER, cmd);

        wait_loop_usec(4);

        inb(ALTERNATE_STATUS_REGISTER);

read_status_register_again:
        status = inb(STATUS_REGISTER);

        if (is_error(status)) {
                printk("error occured:0x%x\n", status);
                print_error_register(device);
                return false;
        }
        
        if (!is_drq_active(status)) {
                if (loop > 5) {
                        printk("DRQ didn't be active\n");
                        return false;
                }
                loop++;
                goto read_status_register_again;
        }

        return true;
}
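
The "Set Logical Sector" step is just a 28-bit number cut into four registers. Here is a sketch of the split. It differs a little from the function above: it masks the top part with 0x0f, because only LBA bits 24-27 belong in the Device/Head register, and it puts the device number in bit 4. The structure and function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

struct lba28_regs {
        uint8_t sector_number;  /* LBA bits 0-7 */
        uint8_t cylinder_low;   /* LBA bits 8-15 */
        uint8_t cylinder_high;  /* LBA bits 16-23 */
        uint8_t device_head;    /* LBA bits 24-27, DEV in bit 4, LBA mode in bit 6 */
};

/* Split a 28-bit LBA into the values for the four task file registers. */
static struct lba28_regs lba28_split(uint32_t lba, int device)
{
        struct lba28_regs r;

        assert(lba < (1u << 28));
        assert(device == 0 || device == 1);

        r.sector_number = lba & 0xff;
        r.cylinder_low  = (lba >> 8) & 0xff;
        r.cylinder_high = (lba >> 16) & 0xff;
        r.device_head   = (uint8_t)(0x40 | (device << 4) | ((lba >> 24) & 0x0f));
        return r;
}
```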

The reading and writing functions are below. They're pretty much the same, aren't they?

/**
 * Writing one sector.
 * @param device number.
 * @param sector number.
 * @param data to write.
 * @param data size. it should be 256.
 * @return negative if fail to write data.
 */
int write_sector(int device, u_int32_t sector, 
                  sector_t *buf, size_t buf_size)
{
        bool ret;
        size_t i;

        ret = sector_rw_common(PIO_SECTOR_WRITE_CMD, device, sector);
        if (!ret)
                return -1;

        finish_sector_rw();

        for (i = 0; i < buf_size; i++) 
                outw(DATA_REGISTER, buf[i]);

        return 0;
}

/**
 * Reading one sector.
 * @param device number.
 * @param sector number.
 * @param data to store..
 * @param data size. it should be 256.
 * @return  0 if process is success. if something wrong it returns negative value
 */
int read_sector(int device, u_int32_t sector, 
                 sector_t *buf, size_t buf_size)
{
        bool ret;
        size_t i;

        if (buf_size != 256) {
                printk("buf_size isn't 256\n");
                while (1);
        }
                
        ret = sector_rw_common(PIO_SECTOR_READ_CMD, device, sector);
        if (!ret)
                return -1;

        for (i = 0; i < buf_size; i++)
                buf[i] = inw(DATA_REGISTER);

        finish_sector_rw();

        return 0;
}

Once you have an ATA driver, you can write data to the disk and read data from it without a file system! The next step is writing a file system!!
I decided to use Minix's file system because it is documented (I have the Minix book) and it's quite simple. It uses a super block, directory entries (dentry), and inodes. However, I didn't support the write system call, and a file must be smaller than the sector size.
To read a file from a Minix file system, you should read the super block first. The Minix fs super block is 1KB in size and starts at address 0x400, so reading it isn't difficult.
# header files and C source are here.

/**
 * Read minix file system's super block.
 * @param mount point.
 * @return 0 is success, negative numbers are fail.
 */ 
static int minix_get_sb(struct vfs_mount *vmount)
{
        int ret;

        printk("%s\n", __FUNCTION__);

        ret = read_one_sector(vmount->driver, &sblock, 0x400 / BLOCK_SIZE);
        if (ret) {
                printk("read sector error\n");
                return -1;
        }
        memcpy(&minix_sb, sblock.data, sizeof(minix_sb));
#if 0
        printk("Superblock info\n");
        printk("s_ninodes: 0x%x\n", minix_sb.s_ninodes);
        printk("s_nzones:  0x%x\n", minix_sb.s_nzones);
        printk("s_imap_blocks: 0x%x\n", minix_sb.s_imap_blocks);
        printk("s_zmap_blocks: 0x%x\n", minix_sb.s_zmap_blocks);
        printk("s_firstdatazone: 0x%x\n", minix_sb.s_firstdatazone);
        printk("s_log_zone_size: 0x%x\n", minix_sb.s_log_zone_size);
        printk("s_max_size: 0x%x\n", minix_sb.s_max_size);
        printk("s_magic: 0x%x\n", minix_sb.s_magic);
        printk("s_pad: 0x%x\n", minix_sb.s_pad);
        printk("s_zones: 0x%x\n", minix_sb.s_zones);
#endif
        return 0;
}
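
After the super block is in memory, two small calculations follow from it. The sketch below assumes a Minix version 1 file system with 1KB blocks and 512-byte sectors: the magic number tells you the file name length, and the inode table starts right after the boot block, the super block and the two bitmaps. The function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define MINIX_BLOCK_SIZE 1024u
#define SECTOR_SIZE      512u

static_assert(MINIX_BLOCK_SIZE % SECTOR_SIZE == 0,
              "a block must be a whole number of sectors");

#define MINIX_V1_MAGIC_14 0x137f   /* version 1, 14-character file names */
#define MINIX_V1_MAGIC_30 0x138f   /* version 1, 30-character file names */

/* File name length for a version 1 magic, or -1 for anything else. */
static int minix_v1_name_len(uint16_t magic)
{
        switch (magic) {
        case MINIX_V1_MAGIC_14: return 14;
        case MINIX_V1_MAGIC_30: return 30;
        default:                return -1;
        }
}

/* Block 0 is the boot block, block 1 the super block, then the two bitmaps. */
static uint32_t minix_inode_table_block(uint16_t imap_blocks, uint16_t zmap_blocks)
{
        return 2u + imap_blocks + zmap_blocks;
}

/* Convert a file system block number to a disk sector number. */
static uint32_t block_to_sector(uint32_t block)
{
        return block * (MINIX_BLOCK_SIZE / SECTOR_SIZE);
}
```

block_to_sector(1) is 2, which matches the super block at byte offset 0x400.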

Reading a dentry and an inode isn't difficult either.

/**
 * read inode.
 * @param mount point.
 * @param inode number.
 * @param inode structure.
 * @param data zone.
 * @return 0 is success.
 */
static int read_inode(struct vfs_mount *vmount, u_int16_t inode_num, 
                       struct minix_inode *inode, unsigned long addr)
{
        int ret;

        ret = read_one_sector(vmount->driver, &sblock, addr);
        if (ret)
                return -1;

        memcpy(inode, sblock.data + ((inode_num - 1) * sizeof(*inode)), sizeof(*inode));
        
        return 0;
}

/**
 * Read directory entry.
 * @param dentry structure to store the result.
 * @param address of data zone.
 * @param offset from data zone.
 * @return 0 is success.
 */ 
static int read_dentry(struct vfs_mount *vmount, struct minix_dentry *dentry, 
                        unsigned long address, unsigned long offset)
{
        int ret;

        ret = read_one_sector(vmount->driver, &sblock, address);
        if (ret) {
                printk("read sector error\n");
                return ret;
        }

        // dentry->name is 15 bytes; the last byte is reserved for '\0' and is not stored on disk.
        memcpy(dentry, sblock.data + offset, sizeof(*dentry) - 1);

        return 0;
}
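
To show what a directory looks like on disk, here is a sketch that searches a directory block already read into memory. It assumes the 16-byte version 1 entry: a 2-byte little-endian inode number followed by a 14-byte name that is NUL-padded but not necessarily NUL-terminated. Unlike find_file() below, it skips entries whose inode number is 0 instead of stopping there. dir_lookup() is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MINIX_NAME_LEN    14
#define MINIX_DENTRY_SIZE (2 + MINIX_NAME_LEN)

/* Return the inode number for name, or 0 when it is not in this block. */
static uint16_t dir_lookup(const unsigned char *block, size_t size,
                           const char *name)
{
        size_t off, want;

        assert(block != NULL && name != NULL);
        want = strlen(name);
        if (want == 0 || want > MINIX_NAME_LEN)
                return 0;

        for (off = 0; off + MINIX_DENTRY_SIZE <= size; off += MINIX_DENTRY_SIZE) {
                uint16_t ino = (uint16_t)(block[off] | (block[off + 1] << 8));
                const char *ent = (const char *)block + off + 2;
                const void *nul = memchr(ent, '\0', MINIX_NAME_LEN);
                size_t len = nul ? (size_t)((const char *)nul - ent)
                                 : MINIX_NAME_LEN;

                if (ino == 0)           /* unused or deleted slot */
                        continue;
                if (len == want && memcmp(ent, name, want) == 0)
                        return ino;
        }
        return 0;
}
```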

If you can read the super block, dentries, and inodes, you can find and read a file. read_file() looks for the file whose name is passed as an argument. If the file exists, it reads the data from the disk. find_file() is a recursive function.

/**
 * Read file and store data to buf.
 * @param vfs mount point.
 * @param super block.
 * @param file name.
 * @param output buffer.
 * @param maximum read bytes
 * @return size of the file in bytes (0 if the file is not found).
 */
static ssize_t read_file(struct vfs_mount *vmount, struct minix_superblock *sb, 
                         const char *fname, char *buf, size_t num)
{
        u_int16_t ino;
        struct minix_inode inode;
        unsigned long inode_tbl_bass = get_inode_table_address(*sb);
        int ret;

        ino = find_file(vmount, sb, get_first_data_zone(*sb), fname);
        if (!ino) {
                printk("file %s not found\n", fname);
                return 0;
        }

        ret = read_inode(vmount, ino, &inode, inode_tbl_bass);
        if (ret)
                KERN_ABORT("read inode error\n");

        ret = read_one_sector(vmount->driver, &sblock, get_data_zone(inode.i_zone[0]));
        if (ret) {
                printk("read sector error\n");
                return ret;
        }

        if (num > inode.i_size)
                num = inode.i_size;

        memcpy(buf, sblock.data, num);

        return inode.i_size;
}

/**
 * Find file by file name.
 * @param mount point.
 * @param super block.
 * @param Address of data zone.
 * @param file name.
 * @return if find a file, it returns inode number.
 */ 
static u_int16_t find_file(struct vfs_mount *vmount, struct minix_superblock *sb, unsigned long address, const char *fname)
{
        unsigned long offset = 0;
        struct minix_dentry dentry;
        struct minix_inode inode;
        unsigned long inode_tbl_bass = get_inode_table_address(*sb);
        const char *tmp;
        u_int16_t ret = 0;
        
        int result;
        int len = 0;
        int ftype;
        while (1) {
                // read first entry.
                result = read_dentry(vmount, &dentry, address, offset);
                if (result) 
                        KERN_ABORT("read hdd error\n");

                if (dentry.inode == 0)
                        break;

                result = read_inode(vmount, dentry.inode, &inode, inode_tbl_bass);
                if (result) 
                        KERN_ABORT("read hdd error\n");

                tmp = fname;
                if (tmp[0] == '/') 
                        tmp = tmp + 1;

                ftype = get_file_type(&inode); 
                if (ftype == I_FT_DIR) {
                        len = count_delimita_length(tmp, '/');
                        if (len == -1) {
                                if (!strcmp(tmp, dentry.name))
                                        return dentry.inode;
                        } else if (!strncmp(tmp, dentry.name, len)) {
                                ret = find_file(vmount, sb, get_data_zone(inode.i_zone[0]), tmp + len);
                        } else {
                                // if final character is '/', finish searching.
                                if (!strcmp(tmp + len, "/"))
                                        return dentry.inode;
                        }
                } else if (ftype == I_FT_REGULAR) {
                        if (!strcmp(dentry.name, tmp))
                                return dentry.inode;
                }
                if (ret)
                        return ret;

                offset += sizeof(dentry) - 1;
        }

        return 0;

}
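
The path handling in find_file() is the fiddly part. As an alternative sketch, not the code I used, the path can be walked one component at a time with a small helper, which makes an iterative lookup possible: look up the component in the current directory, move to that inode, and repeat. next_component() is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Skip leading '/' characters, store the start of the next component in
 * *comp, advance *path past it, and return the component length.
 * Returns 0 when the path is exhausted.
 */
static size_t next_component(const char **path, const char **comp)
{
        const char *p;
        size_t len = 0;

        assert(path != NULL && *path != NULL && comp != NULL);

        p = *path;
        while (*p == '/')
                p++;
        *comp = p;
        while (p[len] != '\0' && p[len] != '/')
                len++;
        *path = p + len;
        return len;
}
```

For "/dir_a/dir_b/foobar.txt" the helper yields "dir_a", "dir_b" and "foobar.txt" in turn, and then 0.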

OK, being able to read a file means you can read an ELF file, so let's execute one.
This is a test program. As you can see, it is really small. It makes a system call using a software interrupt.

#define UNUSED __attribute__((unused))

int main(UNUSED int argc, UNUSED char **argv)
{
        __asm__("mov $1, %eax\n\t"
                "int $0x80\n\t"
                );

        return 0;
}
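
On the kernel side, the handler for vector 0x80 has to look at EAX to decide what to do. The following is only a sketch of one common way to do that, a table indexed by the system call number. The register structure, the table and both sys_* functions are hypothetical and much simpler than a real handler.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical snapshot of the registers saved by the interrupt entry code. */
struct syscall_regs {
        uint32_t eax;   /* system call number */
        uint32_t ebx;   /* first argument */
        uint32_t ecx;
        uint32_t edx;
};

typedef int32_t (*syscall_fn)(struct syscall_regs *);

/* Hypothetical system call 0: does nothing. */
static int32_t sys_nop(struct syscall_regs *r) { (void)r; return 0; }
/* Hypothetical system call 1: just echoes its first argument. */
static int32_t sys_echo(struct syscall_regs *r) { return (int32_t)r->ebx; }

static const syscall_fn syscall_table[] = { sys_nop, sys_echo };
#define NR_SYSCALLS (sizeof(syscall_table) / sizeof(syscall_table[0]))

/* Called from the int 0x80 handler; returns -1 for unknown numbers. */
static int32_t syscall_dispatch(struct syscall_regs *r)
{
        assert(r != NULL);
        if (r->eax >= NR_SYSCALLS)
                return -1;
        return syscall_table[r->eax](r);
}
```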

The code that reads an ELF program is here. It reads the sections needed to execute the test program. The test program doesn't have local variables, which keeps the source code simple. Even so, elf.c reads the program header, section header, string table, symbol table, symbol string table, text section, and bss section.
I defined these structures to store the ELF data.

struct string_tables {
        unsigned char *string_tbl;
        unsigned char *symbol_str_tbl;
};

struct section {
        unsigned char *data;
        size_t size;
};
struct sections {
        struct section *text;
        struct section *bss;
};

struct elf {
        Elf32_Ehdr e_hdr;
        Elf32_Phdr *p_hdr;
        Elf32_Shdr *s_hdr;
        Elf32_Sym  *sym;
        int sym_count;
        Elf32_Rel  *rel;
        Elf32_Rela *rela;
        struct string_tables str_table;
        struct sections section_data;
};

Checking whether a file is in ELF format is easy, but to read ELF data you need to calculate where each section is. For example, if you want to get the .text section, the call flow looks like this.
1.read_text_section()
This is the starting point.
2.read_section()
This is the main part. It looks up the section header by section name, then gets its size and returns the section.
3.get_section_header()
This is the main function for finding a section header by the given section name.
4.get_section_name()
It gets the section name from the ELF data.

static const Elf32_Shdr *get_section_header(const struct elf *data, const char *section)
{
        const Elf32_Shdr *sym = NULL;
        unsigned char *str_tbl = data->str_table.string_tbl;
        int i;
        
        for (i = 0; i < data->e_hdr.e_shnum; i++) {
                const Elf32_Shdr *p = data->s_hdr + i;
                char buf[64] = { 0 };
                get_section_name(p, str_tbl, buf);
                if (!strcmp(buf, section)) {
                        sym = p;
                        break;
                }
        }

        return sym;

}

static inline unsigned long get_program_table_size(const struct elf *data)
{
        return data->e_hdr.e_phentsize * data->e_hdr.e_phnum;
}


static inline unsigned long get_section_size(const struct elf *data)
{
        return data->e_hdr.e_shentsize * data->e_hdr.e_shnum;
}

static int read_section_header(struct elf *data, const unsigned char *file)
{
        unsigned long section_size = get_section_size(data);
        
        // Is there any section header?
        if (!data->e_hdr.e_shnum)
                return 1;
        
        data->s_hdr = kmalloc(section_size);
        if (!data->s_hdr)
                return -1;
        
        memcpy(data->s_hdr, file + data->e_hdr.e_shoff, section_size);
        
        return 0;
}

static struct section *read_section(struct elf *data, const unsigned char *file, const char *name)
{
        const Elf32_Shdr *sym = get_section_header(data, name);
        struct section *ret;

        if (!sym)
                return NULL;

        if (!sym->sh_size) {
                printk("%s section size is 0\n", name);
                return NULL;
        }

        // Allocate only after the size check so nothing leaks on that path.
        ret = kmalloc(sizeof(*ret));
        if (!ret)
                return NULL;
        ret->size = sym->sh_size;

        ret->data = kmalloc(sym->sh_size);
        if (!ret->data)
                return NULL;

        memcpy(ret->data, file + sym->sh_offset, sym->sh_size);

        return ret;
}

static int read_text_section(struct elf *data, const unsigned char *file)
{
        data->section_data.text = read_section(data, file, ".text");
        if (!data->section_data.text)
                return -1;

        return 0;
}

static int read_bss_section(struct elf *data, const unsigned char *file)
{
        data->section_data.bss = read_section(data, file, ".bss");
        if (!data->section_data.bss)
                return -1;

        return 0;
}


static int read_program_header(struct elf *data, const unsigned char *file)
{
        unsigned long size = get_program_table_size(data);
        // Is there program header.
        if (!data->e_hdr.e_phnum)
                return 1;
        
        data->p_hdr = kmalloc(size);
        if (!data->p_hdr)
                return -1;
        
        memcpy(data->p_hdr, file + data->e_hdr.e_phoff, size);
        
        return 0;
}

static int read_string_table(struct elf *data, const unsigned char *file)
{
        unsigned long offset = data->s_hdr[data->e_hdr.e_shstrndx].sh_offset;
        unsigned long size = data->s_hdr[data->e_hdr.e_shstrndx].sh_size;

        data->str_table.string_tbl = kmalloc(size);
        if (!data->str_table.string_tbl)
                return -1;

        memcpy(data->str_table.string_tbl, file + offset, size);

        return 0;
}

static int read_symbol_string_table(struct elf *data, const unsigned char *file)
{
        const Elf32_Shdr *sym = get_section_header(data, ".strtab");

        if (!sym)
                return -2;

        data->str_table.symbol_str_tbl = kmalloc(sym->sh_size);
        if (!data->str_table.symbol_str_tbl)
                return -1;

        memcpy(data->str_table.symbol_str_tbl, file + sym->sh_offset, sym->sh_size);
        return 0;
        
}


static int is_elf(const struct elf *data)
{
        if (data->e_hdr.e_ident[0] == 0x7f &&
                data->e_hdr.e_ident[1] == 'E' &&
                data->e_hdr.e_ident[2] == 'L' &&
                data->e_hdr.e_ident[3] == 'F')
                return 1;
                
        return 0;
}

static int read_header(struct elf *data, const unsigned char *file)
{
        memcpy(&data->e_hdr, file, sizeof(data->e_hdr));
        
        return is_elf(data) == 1 ? 0 : -1;
}

static int read_symbol_table(struct elf *data, const unsigned char *file)
{
        const Elf32_Shdr *sym = get_section_header(data, ".symtab");

        if (!sym)
                return -2;
        
        data->sym = kmalloc(sym->sh_size);
        if (!data->sym)
                return -1;

        memcpy(data->sym, file + sym->sh_offset, sym->sh_size); 
        data->sym_count = sym->sh_size / sizeof(Elf32_Sym);

        return 0;
}

static void get_section_name(const Elf32_Shdr *data, const unsigned char *table, char *buf)
{
        strcpy(buf, (const char *) table + data->sh_name);
}

static void get_symbol_string_name(const Elf32_Sym *data, const unsigned char *table, char *buf)
{
        strcpy(buf, (const char *) table + data->st_name);
}
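
is_elf() above checks only the four magic bytes. Before jumping into a loaded program it may be worth checking a bit more of the header. The sketch below works on the raw bytes and uses offsets and values from the ELF specification (class, byte order, file type, machine) to accept only 32-bit little-endian i386 executables. The function name and the error codes are made up for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Byte offsets inside the 52-byte ELF32 header. */
enum { EI_CLASS_OFF = 4, EI_DATA_OFF = 5, E_TYPE_OFF = 16, E_MACHINE_OFF = 18,
       ELF32_EHDR_SIZE = 52 };
/* Values from the ELF specification. */
enum { ELFCLASS32_V = 1, ELFDATA2LSB_V = 1, ET_EXEC_V = 2, EM_386_V = 3 };

static uint16_t read_le16(const unsigned char *p)
{
        return (uint16_t)(p[0] | (p[1] << 8));
}

/* Return 0 for a 32-bit little-endian i386 executable, negative otherwise. */
static int check_elf32_i386(const unsigned char *file, size_t size)
{
        assert(file != NULL);
        if (size < ELF32_EHDR_SIZE)
                return -1;                       /* too small for a header */
        if (memcmp(file, "\x7f" "ELF", 4) != 0)
                return -2;                       /* bad magic */
        if (file[EI_CLASS_OFF] != ELFCLASS32_V ||
            file[EI_DATA_OFF] != ELFDATA2LSB_V)
                return -3;                       /* not 32-bit little-endian */
        if (read_le16(file + E_TYPE_OFF) != ET_EXEC_V)
                return -4;                       /* not an executable */
        if (read_le16(file + E_MACHINE_OFF) != EM_386_V)
                return -5;                       /* not built for i386 */
        return 0;
}
```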

Once you have read the ELF data, you can execute it. Since the test program is small and simple, I needed only the text section to execute it.
The code that runs a process is written in process.c.
This function is really ugly, but it runs two processes. However, I didn't have a scheduler, which means that when the test program finishes, a general protection fault occurs, because nothing cleans up after the program exits.

int setup_tss(unsigned char *data)
{
        u_int16_t cs, ds, ss;
        u_int32_t esp;

        printk("Setup TSS\n");
 
        __asm__ __volatile__("movw %%cs, %0\n\t;"
                             "movw %%ds, %1\n\t;"
                             "movw %%ss, %2\n\t;"
                             "movl %%esp, %3\n\t;"
                             :"=g"(cs), "=g"(ds), "=g"(ss), "=g"(esp)
                );
 
//      printk("cs:0x%x ds:0x%x ss:0x%x esp:0x%x\n", cs, ds, ss, esp);

        set_tss(cs, ds, (u_int32_t) &test_task1, 0x202,
                (u_int32_t) &process_stack[0], ss,
                esp, ss);
        
        set_tss(cs, ds, (u_int32_t) data, 0x202,
                (u_int32_t) &process_stack[1], ss,
                esp, ss);
 
        
        set_gdt_values(0x28, (u_int32_t) &tss[0], sizeof(struct tss_struct), SEG_TYPE_TSS); 
        set_gdt_values(0x30, (u_int32_t) &tss[1], sizeof(struct tss_struct), SEG_TYPE_TSS); 

        ltr(0x28);

        return 0;
}
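
Two magic numbers in this function deserve a note. 0x28 and 0x30 are segment selectors, that is, a GDT index shifted left by 3 with the table indicator and privilege level in the low bits, so they refer to GDT entries 5 and 6. 0x202 is an EFLAGS value with the interrupt flag set. The helpers in this sketch are hypothetical and only show the arithmetic.

```c
#include <assert.h>
#include <stdint.h>

/* selector = index << 3 | TI << 2 | RPL */
static uint16_t make_selector(unsigned int index, unsigned int ti, unsigned int rpl)
{
        assert(index < 8192 && ti <= 1 && rpl <= 3);
        return (uint16_t)((index << 3) | (ti << 2) | rpl);
}

/* Which descriptor table entry a selector points at. */
static unsigned int selector_index(uint16_t sel)
{
        return sel >> 3;
}

#define EFLAGS_RESERVED1 (1u << 1)   /* bit 1 always reads as 1 */
#define EFLAGS_IF        (1u << 9)   /* interrupts enabled */

static_assert((EFLAGS_RESERVED1 | EFLAGS_IF) == 0x202,
              "0x202 means: interrupts enabled, nothing else set");
```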

This is the result. I got a general protection fault :P

By the way, this is the file system layout I used.

/mnt/test:
.  ..  dir_a  dir_A  hello  test.txt

/mnt/test/dir_a:
.  ..  dir_b

/mnt/test/dir_a/dir_b:
.  ..  foobar.txt

/mnt/test/dir_A:
.  ..  dir_B

/mnt/test/dir_A/dir_B:
.  ..

Finally, that is what I did last year. I wrote the bootstrap code, GDT setup, interrupt handlers, paging, a PCI driver, an ATA driver, a file system, an ELF reader, and process creation. It has the basic kernel features.
