Jul 23, 2007

Word-segmentation search in PHP with a bag-of-words algorithm

Time to reveal the algorithm.
This time the idea is word extraction that leaves out connecting words and the like.
What follows is written as a set of functions, in PHP.
Note that I have pulled out only the core of the program,
so there may be mistakes; my apologies in advance.
The code may also look confusing at first. When I have some free time
I will come back and comment each line
to explain what it does and how.

If you have any thoughts, or you find a bug, please let me know
so that I can improve it.

For now, I have to slip away and get back to work.



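function indexCountText($text, $word)
{
    // Count non-overlapping substring matches of $word in $text.
    // $text is expected to be lowercased already by the caller.
    $word = strtolower($word);
    if ($word === "") {
        return 0; // an empty search string cannot be counted
    }
    return substr_count($text, $word);
}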
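
One thing to keep in mind about the function above: it counts substring matches, so a short word is also counted inside longer words. The sketch below shows the difference against a whole-word count. `countWholeWord` and the sample sentence are hypothetical and are not part of the original program.

```php
<?php
// Hypothetical helper (not from the original post): count $word only where it
// stands alone as a whole word, using regex word boundaries.
function countWholeWord($text, $word)
{
    if ($word === "") {
        return 0;
    }
    return preg_match_all('/\b' . preg_quote($word, '/') . '\b/', $text);
}

$sample = "the other day then the mother";

// Substring counting, as indexCountText does: also matches inside
// "other", "then" and "mother".
echo substr_count($sample, "the"), "\n"; // prints 5

// Whole-word counting: only the two standalone occurrences.
echo countWholeWord($sample, "the"), "\n"; // prints 2
```

Whether substring or whole-word counting is the better choice depends on the data; for ranking frequent words in English text, whole-word counts are usually closer to what a reader expects.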


function cutText($file, $num)
{
    // Read the whole file; give up early if it cannot be read.
    $contents = file_get_contents($file);
    if ($contents === false) {
        return "";
    }

    // Escape HTML special characters and lowercase the text. Escaping turns
    // & and " into &amp; and &quot;, which is presumably why "amp" and "quot"
    // appear in the stop-word list below.
    $text = strtolower(htmlspecialchars($contents));

    // Punctuation and digits to delete outright.
    $notChar = array(":", ",", ";", "?", ".", "0", "1", "2",
        "3", "4", "5", "6", "7", "8", "9", "]", "[", "(", ")", "/",
        "-", "&", "*", "$", "_", "'");

    // Single letters standing alone.
    $az = array(" a ", " b ", " c ", " d ", " e ", " f ", " g ", " h ",
        " i ", " j ", " k ", " l ", " m ", " n ", " o ", " p ",
        " q ", " r ", " s ", " t ", " u ", " v ", " w ", " x ", " y ", " z ");

    // Words to ignore. Entries padded with spaces match whole words only;
    // "http" and "www" are removed wherever they appear.
    $notCheck = array(" then ", " i ", " the ", " is ", " am ", " are ", " a ",
        " what ", " why ", " when ", " where ", " who ", " which ", " with ",
        " and ", " but ", " will ", " be ", " to ", " or ", " on ", " in ",
        " if ", " do ", " of ", " any ", " by ", " at ", " for ", " as ",
        " its ", "http", "www", " that ", " not ", " some ",
        " it ", " was ", " than ", " can ", " an ", " all ",
        " also ", " yes ", " you ", " your ", " after ", " has ", " have ",
        " how ", " into ", " like ", " may ", " often ", " other ", " such ",
        " so ", " they ", " this ", " those ", " use ", " used ", " well ",
        " were ", " would ", " vs ", " about ", " eg ", " ie ", " ed ", " ma ",
        " quot ", " he ", " amp ", " one ", " two ", " age ", " no ", " from ",
        " see ", " ms ", " form ", " rd ", " eda ", " low ", " length ",
        " large ", " now ", " up ", " more ", " very ", " new ", " between ",
        " over ", " text ", " out ", " take ", " these ", " only ", " etc ",
        " there ", " however ", " order ", " same ", " review ", " wants ",
        " while ", " until ", " us ", " we ", " id ", " our ", " own ",
        " her ", " de ", " ca ");

    // Turn line breaks and tabs into spaces, and pad both ends so that the
    // first and last words are also surrounded by spaces.
    $text = " " . str_replace(array("\r\n", "\r", "\n", "\t"), " ", $text) . " ";

    // Repeat the filters until nothing changes: one pass can leave a stop
    // word behind when the space it shares with a neighbouring match has
    // already been consumed.
    do {
        $before = $text;
        $text = str_replace($notChar, "", $text);
        $text = str_replace($az, " ", $text);
        $text = str_replace($notCheck, " ", $text);
    } while ($text !== $before);

    // Split on runs of spaces, dropping empty strings, then remove duplicates.
    $word = preg_split('/ +/', trim($text), -1, PREG_SPLIT_NO_EMPTY);
    $word = array_values(array_unique($word));
    //array_multisort($word,SORT_ASC);

    switch ($num) {
        case 0:
            if (count($word) === 0) {
                return "";
            }

            // Count how often each distinct word occurs in the cleaned text.
            $wordNum = array();
            foreach ($word as $i => $w) {
                $wordNum[$i] = indexCountText($text, $w);
            }

            // Sort by count, highest first, keeping $word aligned with $wordNum.
            array_multisort($wordNum, SORT_DESC, $word);

            // Report at most the ten most frequent words.
            $wordAll = "";
            $limit = min(10, count($word));
            for ($i = 0; $i < $limit; $i++) {
                $wordAll .= $word[$i] . " " . $wordNum[$i] . " time(s)\n\n";
            }
            return $wordAll;

        case 1:
            // Return the distinct words as one space-separated string.
            return implode(" ", $word);

        default:
            return "";
    } // end case
}
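
For comparison, the same idea can be sketched on whole tokens instead of string replacement: split the text into words first, drop the ones in a stop list, then count what is left. This is only a minimal sketch under my own assumptions, not the program above; `bagOfWords`, its parameters, the shortened stop list and the sample sentence are all hypothetical, and it handles plain ASCII English text only.

```php
<?php
// Hypothetical sketch: token-based bag of words with stop-word removal.
function bagOfWords($text, array $stopWords, $top = 10)
{
    // Lowercase, then split on anything that is not a letter a-z.
    $tokens = preg_split('/[^a-z]+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);

    // Drop single letters and stop words; whole tokens are compared,
    // so a stop word never matches inside a longer word.
    $stop = array_flip($stopWords);
    $kept = array();
    foreach ($tokens as $token) {
        if (strlen($token) > 1 && !isset($stop[$token])) {
            $kept[] = $token;
        }
    }

    // Count each remaining word and sort by frequency, highest first.
    $counts = array_count_values($kept);
    arsort($counts);

    // Keep only the $top most frequent words (word => count).
    return array_slice($counts, 0, $top, true);
}

// Shortened stop list, for illustration only.
$stopWords = array("the", "is", "a", "and", "of", "on");
$bag = bagOfWords("The cat and the dog: the cat is on the mat.", $stopWords);
print_r($bag);
```

Here "cat" comes out on top with a count of 2, while "dog" and "mat" each appear once. Because each token is counted once as a whole word, there is no need for repeated replacement passes or for padding the text with spaces.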

3 comments:

  1. Tomon, please come and add those comments; I happen to be studying IR right now.
    Keep the updates coming. I'll come back and read when I have time.

  2. Who is this? The blog you linked to can't be opened either.
    Who, who, who?

  3. For a word-segmentation algorithm, which is better: cutting from back to front, or from front to back? Also, is there a word-segmentation algorithm that can handle words that are not in the database, covering unknown words | known words | ambiguous words | English words or numbers | special characters? I'm using PHP. If anyone has one, please send it to samulaiza@hotmail.com

