String tokenization using PHP


Most of us deal with a lot of string operations when programming. One common task is to read some content and build a list of the unique words in it. Splitting the text into words is called tokenization (please correct me if I am wrong here). The string you are reading might contain text from a book, with punctuation marks and line breaks. The question is: how can we quickly create a list of unique words from such text using PHP?

Let's say we have the following text:

This is a test string, which is used for

demonstrating the tokenization using PHP. PHP is a very (strong) scripting-language

We will store this text in a PHP variable, $content.

There are two ways to get the list of unique words from the text.

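1. PHP provides a function called strtok(), which takes the string and a set of delimiter characters as parameters. The first call passes both; each later call passes only the delimiters and returns the next token, or false when none are left. By calling it in a loop, you can read each word and store it in an array. After the loop completes, you can use array_unique() to remove the duplicates. Note that array_unique() compares case-sensitively and keeps the original array keys, so the printed indexes may have gaps. OK, here is the code: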

<?php
error_reporting(E_ALL);
ini_set("display_errors", 1);
ini_set("log_errors", 0);

$content = "This is a test string, which is used for

demonstrating the tokenization using PHP. PHP is a very (strong) scripting-language";

$words = array();
$delim = " \n.,;-()";
$tok = strtok($content, $delim);
while ($tok !== false) {
  $words[] = $tok;
  $tok = strtok($delim);
}
$unique_words = array_unique($words);

print "<pre>";
print_r($unique_words);
print "</pre>";
?>

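2. Another way to implement it is to use the str_word_count() function. With 1 as the second argument, it returns an array of the words found in the string. Because str_word_count() treats a hyphen inside a word as part of that word, the hyphens are replaced with spaces first, so that "scripting-language" is split into two words just as in the first approach.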

<?php
error_reporting(E_ALL);
ini_set("display_errors", 1);
ini_set("log_errors", 0);

$content = "This is a test string, which is used for

demonstrating the tokenization using PHP. PHP is a very (strong) scripting-language";

$words = array_unique(str_word_count(preg_replace('/-/', ' ', $content), 1));

print "<pre>";
print_r($words);
print "</pre>";
?>
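To check that the two approaches agree on the sample text, here is a minimal sketch. The function names uniqueWordsStrtok() and uniqueWordsWordCount() are my own and purely hypothetical; they only wrap the code shown above. array_values() is added so that both results are reindexed from 0 and can be compared directly.

```php
<?php
// Hypothetical helper names; they wrap the two approaches from the post.

function uniqueWordsStrtok(string $content, string $delim = " \n.,;-()"): array
{
    $words = array();
    $tok = strtok($content, $delim);
    while ($tok !== false) {
        $words[] = $tok;
        $tok = strtok($delim);
    }
    // array_unique() keeps the original keys, so reindex with array_values().
    return array_values(array_unique($words));
}

function uniqueWordsWordCount(string $content): array
{
    // str_word_count() treats "-" inside a word as part of the word,
    // so hyphens are turned into spaces first.
    $words = str_word_count(preg_replace('/-/', ' ', $content), 1);
    return array_values(array_unique($words));
}

$content = "This is a test string, which is used for\n\n"
         . "demonstrating the tokenization using PHP. PHP is a very (strong) scripting-language";

$a = uniqueWordsStrtok($content);
$b = uniqueWordsWordCount($content);

// Compare the two word lists and report how many unique words were found.
print ($a === $b) ? "Same result\n" : "Different result\n";
print count($a) . " unique words\n";
```

Keep in mind that the comparison is case-sensitive, so "PHP" and "php" would be counted as two different words in both versions.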

Run both versions and compare their execution time and memory consumption. I would prefer to go with #2. What are your thoughts?
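If you want to measure the difference yourself, something like the following rough sketch could be used. The timeIt() helper and the iteration count are assumptions of mine, not PHP built-ins, and the numbers will vary with the machine and PHP version.

```php
<?php
// Hypothetical timing helper: runs $fn $iterations times and
// returns the elapsed wall-clock time in seconds.
function timeIt(callable $fn, int $iterations): float
{
    $start = microtime(true);
    for ($i = 0; $i < $iterations; $i++) {
        $fn();
    }
    return microtime(true) - $start;
}

$content = "This is a test string, which is used for\n\n"
         . "demonstrating the tokenization using PHP. PHP is a very (strong) scripting-language";
$delim = " \n.,;-()";

// Approach 1: strtok() in a loop, then array_unique().
$viaStrtok = function () use ($content, $delim) {
    $words = array();
    $tok = strtok($content, $delim);
    while ($tok !== false) {
        $words[] = $tok;
        $tok = strtok($delim);
    }
    return array_unique($words);
};

// Approach 2: str_word_count() after replacing hyphens with spaces.
$viaWordCount = function () use ($content) {
    return array_unique(str_word_count(preg_replace('/-/', ' ', $content), 1));
};

$iterations = 10000; // arbitrary value, chosen only for illustration

printf("strtok():         %.4f s\n", timeIt($viaStrtok, $iterations));
printf("str_word_count(): %.4f s\n", timeIt($viaWordCount, $iterations));
printf("peak memory:      %d bytes\n", memory_get_peak_usage());
```

memory_get_peak_usage() reports the peak for the whole script, so to compare memory consumption fairly it is probably better to run each approach in its own script. A larger input than this short sample would also give more meaningful numbers.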
