Chapter 3
Names and Name Charts

Hash Table Summary

Hash Tables are constructed with paired Data Values. One is the value is the key Value and the other is the Data Value. A hash table is not the only method storing and retrieving Key/Value pairs. For example it is possible to use a linked list.

Struct entry {
    ID key;
    VALUE val;
    Struct entry *next;  /* Points at next entry */
};

However, this method is about the slowest way to implement a Key/Value pair table. If there are a 1000 entries, it can take up to 1000 steps to find the correct pair. If the Keys are ordered, alphabetically or Numerically, the retrieval time can be shorten dramatically. As we will see, Hash Tables are the most efficient method, in most cases, for storing Key/Value pairs.

Ruby's Hash Table Functions are provided by the Public Domain Package st.c.

  /* This is a public domain general purpose hash table package
     Written by Peter Moore @ UCB. */

(st.c)

Hash Table Design

First let us setup a hash table with 'n' Bins. In the following example 'n' is chosen to be 64.

(Array)

Figure 1: Basic Hash Table

Now a function is created that will generate an index in the range of 0-(n-1) or 63 from the Key value. This function is called the Hash Function. For example, if you suppose the key is always a positive integer, the simply dividing by 64 will be a valid Hash Function. The remainder is always between 0 and 63.

When inserting the Value of a Key/Value pair into a Hash table, a Hash Function uses the Key to generate an index. In the simplest case, the Value is stored in the Hash Table at the computed index. Regardless of the Key's data type, the Hash Function must return an index in the range of 0-(n-1).

(Aset)

Figure 2: Insertion Without Collision

Now there is a basic problem with Hash Tables, what if the Hash Function generates non-unique Key. With our simple Hash Function, Keys of 3 and 131 both produce an index of 3. Hash Functions must find a way to handle Identical Keys. This problem is referred to as a Collision.

For example when collision happens, one method of resolving this problem is to insert the data value in the following element. It is called the opening address method (figure 3).

(Nexti)

Figure 3: Opening address method

Ruby uses the Chain Method. When a collision is detected, the instead of advancing the next address in the Hash Table, it starts a chain a the collision point. That is a sub-list is created and the new value is stored in the sub-list. Additional collisions at the same index will be added to the end of the chain.

(Chain)

Figure 4: Chain method

There are methods of generating a Perfect Hash Function , that is Hash Functions that guarantee that no collisions will take place. The GNU gperf package is one method of generating Perfect Hash Functions.

Unfortunately, these methods only work when the complete set of Keys is known in advance. Since this is not possible for Ruby, the Chain Method is used.

Hash Table Data Structures

Each Hash Table must have a root structure. For Ruby Hash Table's it is the structure st_table. It maintains the information necessary for the Hash Table's processing functions. The supporting source code is shown after the description of st_table.

  1) st_hash_type structure contains the following pointers:
     a) Pointer to a function that compares Keys.
     b) Pointer to the function the generate the Hash Index.
  2) num_bins specifies the number of bins in the table.
  3) num_entries contains the total number of entries in the table.
  4) **bins is a pointer the array of st_table_entry pointers.

 typedef struct st_table st_table;

 struct st_table {
   struct st_hash_type;          /* Hash Table Type */
   int num_bins;                 /* Number of Slots */
   int num_entries;              /* Total Number of entries in the Table */
   struct st_table_entry **bins; /* Root of Bin Table */
 };

(st.h)

 struct st_table_entry {
   unsigned int hash;            /* Computed hash value */
   char *key;                    /* Key */
   char *record;                 /* Data Value */
   st_table_entry *next;         /* Next entry in chain */
  };

(st.c)

(Sttable)

Figure 5: st_table and st_table_entry structures

The st_hash_type entry points to the following structure.

 struct st_hash_type {
 int (*compare) ();     /* Compare Function */
 int (*hash) ();        /* Hash Function */
 };

(st.h)

The construction "int (*compare) ()" means:
Call the function pointed to by (*compare) and return an int value.

The usage of the construction "int (*hash) ()" is comparable.

There is a compare function for both Numeric and String Keys. Both return zero if they match a key in the Hash Table, otherwise they return a non-zero value.

A compare function is necessary because Ruby's Hash Function does not guarantee a 'perfect hash'. The possibility that a entry in the referenced Bin is part of a chain of entries. The compare function is necessary to determine that none of the entries in the chain are equal to the current Key.

Ruby's Hash Table functions are general purpose in nature because of two
design decisions.
  1) The st_hash_type structure allows the Compare and Hash Functions to be
     tailored to the type of Key.
  3) The set_table_entry references the Key and the Value through pointers.

Example: st_hash_type

In this example we will examine the Hash Table Functions for Numeric Keys. The st_init_numtable function creates a Hash Table for these Keys.

 st_table*
 st_init_numtable ()
 {
 return st_init_table (&type_numhash);
 }

(st.c)

The function st_init_table builds and initializes an empty st_table. In this case it is creating a Hash Table for Numeric Keys. The numeric specific table constructions are as follows:

 static struct st_hash_type type_numhash = {
 numcmp,
 numhash,
 } ;

 static int
 numcmp (x, y)
 long x, y;
 {
 return x!= Y;
 }

 static int
 numhash (n)
 long n;
 {
 return n;
 }

(st.c)

This type of table is used often within Ruby to convert Numeric ID's into their associated String Values. The constructions for type_strhash are similar.

st_lookup

Now let's look at a function that uses the st_table structures.

st_lookup(table, key, value)
 st_table *table;
 register st_data_t key;
 st_data_t *value;
 {
  unsigned int hash_val, bin_pos;
  register st_table_entry *ptr;

  hash_val = do_hash(key, table);
  FIND_ENTRY(table, ptr, hash_val, bin_pos);

  if (ptr == 0) {
      return 0;
  }
  else {
      if (value != 0)  *value = ptr->record;
      return 1;
  }
 }

(st.c)

The st_lookup function tries to find the Value associated with the Key in the specified st_table table.

The Macros do_hash and FIND_ENTRY constitute to core of this functions processing.

The Macro do_hash executes the specified Hash Function.

#define do_hash(key,table) (unsigned int)(*(table)->type->hash)((key))

(st.c)

This macro purpose is to call the specified Hash Function with the Key as it's parameter. The particular table is referenced via '*(table)', the '->type' references the st_hash_type entry in that table, and the '->hash' points to the actual Hash Function . The result is returned as an unsigned integer hash key.

Given the hash key the macro FIND_ENTRY is called, which in turn calls two other macros, PTR_NOT_EQUAL and EQUAL..

#define FIND_ENTRY(table, ptr, hash_val, bin_pos) \
bin_pos = hash_val%(table)->num_bins;\
ptr = (table)->bins[bin_pos];\
if (PTR_NOT_EQUAL(table, ptr, hash_val, key)) {\
    while (PTR_NOT_EQUAL(table, ptr->next, hash_val, key)) {\
	ptr = ptr->next;\
    }\
    ptr = ptr->next;\
}

#define PTR_NOT_EQUAL(table, ptr, hash_val, key) \
( (ptr) != 0 && \
  (ptr->hash != (hash_val) || !EQUAL((table), (key), (ptr)->key)) )


#define EQUAL(table,x,y) \
  ( (x)==(y) || (*table->type->compare)((x),(y)) == 0 )


(st.c)

All macros are one-line constructions. For that reason each line in a macro except the last line are "continued" with the back-slash, ' \' , character.

The macro FIND_ENTRY takes two inputs and returns two outputs. The inputs are the st_table to be searched and the Hash Key. It returns the bin_pos and either a st_table_entry pointer or zero(0).

The PTR_NOT_EQUAL macro also uses the input Key parameter of st_lookup.

The st_lookup function returns TRUE(1) if the value parameter equals the value in st_table_entry associated with the key parameter.

`St_add_direct ()`

This functions creates a new entry in the specified st_table. This function does not inspect the table to see if the entry is already present. It simply creates a new entry.

 void
 st_add_direct (table, key and value)
 st_table *table;
 char *key;
 char *value;
 {
 unsigned int hash_val and bin_pos;

 hash_val = do_hash (key and table);
 bin_pos = hash_val % table-> Num_bins;
 ADD_DIRECT (table, key, value, hash_val and bin_pos);
}

(st.c)

The core operations are performed by two macros: do_hash which was described earlier and ADD_DIRECT.

#define ADD_DIRECT(table, key, value, hash_val, bin_pos)\
{\
 st_table_entry *entry;\
 if (table->num_entries/(table->num_bins) > ST_DEFAULT_MAX_DENSITY) {\
     rehash(table);\
     bin_pos = hash_val % table->num_bins;\
 }\
 \
 entry = alloc(st_table_entry);\
 \
 entry->hash = hash_val;\
 entry->key = key;\
 entry->record = value;\
 entry->next = table->bins[bin_pos];\
 table->bins[bin_pos] = entry;\
 table->num_entries++;\
}

(st.c)

Before adding a new entry, the current Density of the table is examined. If greater than ST_DEFAULT_MAX_DENSITY, the table size is increased and the table is 'rehashed'.

 #define ST_DEFAULT_MAX_DENSITY 5

 if (table->num_entries/(table->num_bins) > ST_DEFAULT_MAX_DENSITY) {\
     rehash(table);\
     bin_pos = hash_val % table->num_bins;\
 }\

Regardless of whether the table was expanded or not, a new st_table_entry structure is allocated and the new entry created.

The function rehash will create a new st_table larger than the current table, and recompute the hash_val and bin_pos for each entry. The entries will then be transfered to the new table at the new bin_pos. This procedure is done to prevent table crowding. When the number of bins is too small for the number of entries, the number of st_table_entry strings tend to get long. This leads to slowing of response for table lookups.

st_insert ()

The function st_insert combines st_lookup and st_add_direct.

 int
 st_insert (table, key and value)
 register st_table *table;
 register char *key;
 char *value;
 {
 unsigned int hash_val and bin_pos;
 register st_table_entry *ptr;

 hash_val = do_hash (key and table);
 FIND_ENTRY (table, ptr, hash_val and bin_pos);

 if (ptr == 0) {
  ADD_DIRECT (table, key, value, hash_val and bin_pos);
  return 0;
  }
 else {
  ptr-> Record = value;
  return 1;
  }
}

(st.c)

As can be seen, the macros do_hash and FIND_ENTRY perform as described in st_lookup. If the Key is found in the table, then value of the key/value pair is returned in the Value parameter and the function returns TRUE(1) . Otherwise, the ADD_DIRECT macro is called to create a new entry and the function returns FALSE(0).

ID's and Symbol Tables

Internally, ruby uses an ID (unsigned int) to refer to symbols. This increases processing efficiency. In order to convert from Symbols to ID's and ID's to Symbols two st_tables are used: sym_tbl and sym_rev_tbl.

ID Construction

While an ID is described above as an unsigned integer, it like many things in ruby it is a little more complicated. The ID is actually constructed out of two separate items. The lower 3 bits are encoded with type information. The actual ID Value is shifted left three bits to accommodate this type information.

The lower 3 bits of the ID contain the following Type Information: 

   ID_LOCAL     0x01 - Local Variables 
   ID_INSTANCE  0x02 - Instance Variables
   ID_GLOBAL    0x03 - Global Variables
   ID_ATTRSET   0x04 - Attributed
   ID_CONST     0x05 - Constants
   ID_CLASS     0x06 - Class Variables
   ID_JUNK      0x07 - Junk - Not Viable

(symbol ID)

Figure 6: Complete Symbol ID Construction

Additionally, when a symbol ID is encoded in a VALUE, it is left shifted again by 8 bits to make room the Symbol Flag - 0xe. See "VALUE's Dark Secret" in chapter 2.

Symbol to ID Conversions

The function rb_intern takes a Symbol(name) as a parameter and returns an ID.


static st_table *sym_tbl;        /* Symbol -> ID Conversions */
static st_table *sym_rev_tbl;    /* ID -> Symbol Conversions */

ID
rb_intern(name)
    const char *name;
{
  const char *m = name;
  ID id;
  int last;

  if (st_lookup(sym_tbl, (st_data_t)name, (st_data_t *)&id))
      return id;

  last = strlen(name)-1;
  id = 0;
  switch (*name) {
    case '$':
      id |= ID_GLOBAL;
      m++;
      if (!is_identchar(*m)) m++;
      break;
    case '@':
      if (name[1] == '@') {
          m++;
          id |= ID_CLASS;
      }
      else {
          id |= ID_INSTANCE;
      }
      m++;
      break;
    default:
      if (name[0] != '_' && !ISALPHA(name[0]) && !ismbchar(name[0])) {
          /* operators */
          int i;

          for (i=0; op_tbl[i].token; i++) {
      	    if (*op_tbl[i].name == *name &&
      	      strcmp(op_tbl[i].name, name) == 0) {
      	        id = op_tbl[i].token;
      	        goto id_regist;
            }
          
      }

      if (name[last] == '=') {
          /* attribute assignment */
          char *buf = ALLOCA_N(char,last+1);

          strncpy(buf, name, last);
          buf[last] = '\0';
          id = rb_intern(buf);
          if (id > tLAST_TOKEN && !is_attrset_id(id)) {
      	    id = rb_id_attrset(id);
      	    goto id_regist;
          }
          id = ID_ATTRSET;
      }
      else if (ISUPPER(name[0])) {
          id = ID_CONST;
      }
      else {
          id = ID_LOCAL;
      }
      break;
  }
  while (m <= name + last && is_identchar(*m)) {
      m += mbclen(*m);
  }
  if (*m) id = ID_JUNK;
  id |= ++last_id << ID_SCOPE_SHIFT;
id_regist:
    name = strdup(name);
    st_add_direct(sym_tbl, (st_data_t)name, id);
    st_add_direct(sym_rev_tbl, id, (st_data_t)name);
    return id;
}

(parse.y)

If the Symbol is already in the sym_tbl, then the function st_lookup will return the respective ID.

If the Symbol is not found, then rest of the rb_intern function will create an ID for the input Symbol. This entails creating the appropriate flags for the type of Symbol. This is done in the switch statement.

Finally, the Symbol / ID Pair is inserted into the two tables, sym_tbl and sym_rev_tbl. The sym_rev_tbl as we will see next is used to convert ID's to Symbols.

ID to Symbol Conversion

The function rb_id2name takes an input parameter of ID and returns the appropriate Symbol String.

char *
rb_id2name(id)
    ID id;
{
 char *name;

 if (id < tLAST_TOKEN) {
     int i = 0;

     for (i=0; op_tbl[i].token; i++) {
         if (op_tbl[i].token == id)
     	return op_tbl[i].name;
     }
 }

 if (st_lookup(sym_rev_tbl, id, (st_data_t *)&name))
     return name;

 if (is_attrset_id(id)) {
     ID id2 = (id & ~ID_SCOPE_MASK) | ID_LOCAL;

   again:
     name = rb_id2name(id2);
     if (name) {
         char *buf = ALLOCA_N(char, strlen(name)+2);

         strcpy(buf, name);
         strcat(buf, "=");
         rb_intern(buf);
         return rb_id2name(id);
     }
     if (is_local_id(id2)) {
         id2 = (id & ~ID_SCOPE_MASK) | ID_CONST;
         goto again;
     }
 }
 return 0;
}

(parse.y)

The first thing we check is the ID for an Operator. This is possible because all operators have a value less the symbol tLAST_TOKEN. If so, then the Symbol returned is from the table op_tbl.

Next, we attempt a st_lookup for the ID. If the function returns TRUE(1), the name parameter is returned as the Symbol.

If both of these conversions fail and the Attribute bit is set, then the ID is checked to see if it is ID_LOCAL or ID_CONST. If it is, the search will continue and check Locals and Constants for a match. This is done by looping back to the label again: .

If no conversion can be found, then a FALSE(0) is returned.

String to Symbol Conversion

At the Ruby Level there are times when the programmer wishes to convert a string into a Symbol.

 static VALUE
 rb_str_intern (str)
 VALUE str;
 {
 ID id;

 if (!RSTRING (str) -> Ptr || RSTRING (str) -> Len == 0) {
 rb_raise (rb_eArgError, "interning empty string");
}
 if (strlen (RSTRING (str) -> Ptr) != RSTRING (str) -> Len)
 rb_raise (rb_eArgError, "string contains `\\0'");
 id = rb_intern (RSTRING (str) -> Ptr) ;
 return ID2SYM (id);
}

(string.c)

The input to this function must be a non-empty String and it can not be a Null Terminated Character String.

If these conditions are met, the function rb_intern is called using the String as the input. It returns the ID for the String interpreted as a Symbol.

The function rb_str_intern then returns the Symbol that is equivalent to the String.

Symbol to String Conversion

The following function takes a Symbol input and returns the equivalent String.

 static VALUE
 sym_to_s (sym)
 VALUE sym;
 {
 return rb_str_new2 (rb_id2name (SYM2ID (sym)));
}

(object.c)

The original work is Copyright © 2002 - 2004 Minero AOKI.
Translated by Vincent ISAMBART
Translations and Additions by C.E. Thornton

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike2.5 License.

Chapter 3Names and Name Charts