Saturday, November 01, 2008

How to Replace Strings in Java - Using java.util.regex package

Replacing a character in a String is just a matter of a single line of code.

originalString.replace(oldChar, newChar)

ex.
String originalString = "This/is/my/string";
System.out.println(originalString.replace('/', '|'));
Then the output will be "This|is|my|string"


But how can we replace a character or a number of characters (a substring) with another string? This can be done using the java.util.regex package.


If we need to replace "my" with "your" we can do it in the following way.
import java.util.regex.*;  

String originalString = "This is my string";
Pattern pat = Pattern.compile("my"); 
Matcher mat = pat.matcher(originalString);
System.out.println(mat.replaceAll("your")); 
mat.reset();
The output will be "This is your string"
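Since Java 1.4 the same result can be had in one call with String.replaceAll, which compiles the regex internally. One caveat worth noting: the first argument is a regular expression, so metacharacters such as "." must be quoted, for example with Pattern.quote.

```java
import java.util.regex.Pattern;

public class ReplaceDemo {
    public static void main(String[] args) {
        String originalString = "This is my string";
        // replaceAll compiles its first argument as a regex internally
        System.out.println(originalString.replaceAll("my", "your")); // This is your string

        // '.' is a regex metacharacter; Pattern.quote makes it a literal match
        String path = "a.b.c";
        System.out.println(path.replaceAll(Pattern.quote("."), "/")); // a/b/c
    }
}
```

Without Pattern.quote, "." would match every character and replace the whole string.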

Thursday, October 30, 2008

How to Telnet to a Web Server - HTTP Requests through Telnet

A web server, responding to an HTTP GET request, sends the requested content to the client. Normally we don't see what is sent between the web client and the server. However, we can easily see what the server sends by telnetting to that web server.

To connect to the web server - open a command line and type the command
telnet host port
eg. telnet www.wso2.com 80

Then you'll get connected to the particular web server, and you can enter any HTTP command you want, such as GET or HEAD.

If you need to request a web page from the web server, you can type the HTTP request as follows.

GET pageName HTTP/1.0
eg. GET /products HTTP/1.0

Hit enter twice. Then you will get a response (if the page exists) as follows.

HTTP/1.0 200 OK
Date: Thu, 30 Oct 2008 18:17:16 GMT
Server: Apache/2.2.9
X-Powered-By: PHP/5.2.6
Set-Cookie: PHPSESSID=5322082bf473207961031e3df1f45a22; path=/
Expires: Thu, 19 Nov 1981 08:52:00 GMT
Cache-Control: no-store, no-cache, must-revalidate, post-check=0, pre-check=0
Pragma: no-cache
X-Pingback: http://wso2.com/xmlrpc.php
Connection: close
Content-Type: text/html; charset=UTF-8

content goes here

.
.
.


You'll get a 404 status if the page is not found, 301 if the page has moved permanently, and 401 if you are not authorized to access the page. More HTTP status codes can be found here.

You can issue other HTTP commands in a similar way, with the relevant content.
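The same exchange can also be scripted. Below is a minimal sketch using java.net.Socket that builds the request lines typed into the telnet session above and parses the status line of the response; the host name and path are placeholders, and fetchStatus needs network access to actually run.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.net.Socket;

public class HttpGetDemo {

    // Builds the same request lines you would type into the telnet session;
    // the trailing blank line corresponds to hitting enter twice
    static String buildGetRequest(String host, String path) {
        return "GET " + path + " HTTP/1.0\r\n"
             + "Host: " + host + "\r\n"
             + "\r\n";
    }

    // Extracts the numeric code from a status line like "HTTP/1.0 200 OK"
    static int parseStatusCode(String statusLine) {
        return Integer.parseInt(statusLine.split(" ")[1]);
    }

    // Opens a raw TCP connection on port 80 and returns the status code,
    // just like the telnet session above (requires network access)
    static int fetchStatus(String host, String path) throws Exception {
        try (Socket socket = new Socket(host, 80)) {
            PrintWriter out = new PrintWriter(socket.getOutputStream());
            out.print(buildGetRequest(host, path));
            out.flush();
            BufferedReader in = new BufferedReader(
                    new InputStreamReader(socket.getInputStream()));
            return parseStatusCode(in.readLine());
        }
    }

    public static void main(String[] args) {
        // Show the raw request bytes and parse a sample status line
        System.out.print(buildGetRequest("www.example.com", "/products"));
        System.out.println(parseStatusCode("HTTP/1.0 404 Not Found")); // prints 404
    }
}
```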

Monday, October 06, 2008

How to Play .ram Files in Ubuntu

Files with the .ram extension don't really give any meaning to the word "Play", as they are only RealPlayer metafiles referring to the actual audio files. To play .ram files you should be connected to the Internet, as the referenced content is streamed and played.

In Ubuntu, you can get a .ram file to play by following the steps below.

1. Install RealPlayer and the plugin for Mozilla Firefox. (The way to install them is well described in the Ubuntu Hardy user guide.)

2. Grab the URL inside the .ram file.

3. Create a .html page including a single line containing that URL as a link. (ex. <a href="URL">My file</a>, where URL is the address grabbed from the .ram file.)

4. Open the html page and just click on the link - My file.

Tuesday, September 30, 2008

Installing Fonts in Ubuntu 8.04 - Hardy Heron

The way of installing fonts is a bit different (must say it is pretty simple too) in Ubuntu 8.04.

You just have to create a .fonts folder (if it does not exist) in your home folder and copy the TTF files to that folder.

Well, that's it :)

Monday, September 22, 2008

Writing a Simple Atompub Client Using Apache Abdera

The Atom Publishing Protocol (similar to RSS but more capable) is an application-level protocol for editing and publishing web resources of periodically updated web sites, using HTTP. An Atom document, which adheres to the Atom Syndication Format spec, is used as an Atom feed or entry.

Apache Abdera implements this protocol and exposes a simple API that makes things easy. The way to create an Atom feed and add an entry is shown below.
Abdera abdera = new Abdera();
Feed fd = abdera.newFeed();

fd.setId("unique feed id");
fd.setTitle("feed title");
fd.setUpdated(new Date());
fd.addAuthor("username");

//adding the entry
Entry entry = fd.addEntry();
entry.setId("unique id");
entry.setTitle("entry title");
entry.setSummary("summary");
entry.setContent("The content goes here");
entry.addSimpleExtension(new QName(namespace, "elementName"), "elementContent");//To add an extra element to the entry
entry.setUpdated(new Date());
Now let's see how to post this feed to the Abdera server, which contains all the business logic.
AbderaClient client = new AbderaClient(abdera);
ClientResponse response = client.post("http://www.somesite.com/collection",fd);
Retrieving feeds is just a matter of calling the Abdera client's get method.
ClientResponse response = client.get("http://www.somesite.com/collection/atom1.php");
Document doc = response.getDocument();
The Abdera client can call the put and delete methods as well. Meanwhile, the server side should implement the relevant business logic for these methods. This is a good tutorial with further information to follow.


Tuesday, August 26, 2008

How to Index Microsoft Format Documents (Word, Excel, Powerpoint) - Lucene

As my previous post shows how to index PDF documents with Lucene, I thought it would be worth posting how to index Microsoft format files too, because those file types are very commonly used. Lucene always requires a String in order to index the content, and therefore we need to extract the text from the document before giving it to Lucene for indexing. To parse the documents we can use Apache POI, which provides a Java API for Microsoft format files.

The ways to extract text from Word, Excel and Powerpoint documents are shown below.
//Word text extraction
POIFSFileSystem fs = new POIFSFileSystem(new FileInputStream("filename.doc"));
WordExtractor extractor = new WordExtractor(fs);
String wordText = extractor.getText();

//Excel text extraction
POIFSFileSystem fs = new POIFSFileSystem(new FileInputStream("filename.xls"));
ExcelExtractor extractor = new ExcelExtractor(fs);
String excelText = extractor.getText();

//Powerpoint extraction
POIFSFileSystem fs = new POIFSFileSystem(new FileInputStream("filename.ppt"));
PowerPointExtractor extractor  = new PowerPointExtractor(fs);
String powerText = extractor.getText();

However, POI is not yet compatible with the Office 2007 file formats such as .docx, .xlsx and .pptx, though support is planned.

Wednesday, August 06, 2008

How to Index PDF Documents with Lucene

There is no built-in support in Lucene for indexing PDF documents. Therefore the text should be extracted from the document before indexing. A tool which can be used for this purpose is PDFBox, an open source project under the BSD license. Although there are many other PDF tools, I found that this one fits perfectly with Lucene. The little extra work needed here is extracting the text from the document. The following code snippet shows how to do it.
FileInputStream fi = new FileInputStream(new File("sample.pdf"));

PDFParser parser = new PDFParser(fi);
parser.parse();
COSDocument cd = parser.getDocument();
PDFTextStripper stripper = new PDFTextStripper();
String text = stripper.getText(new PDDocument(cd));
cd.close(); //release resources once the text has been extracted

Now this extracted text can be used to build the Lucene index.

Likewise there are various tools to extract text from Word documents and so on. Therefore any kind of document can be added to the Lucene index if the text can be extracted using an external tool. Even in XML indexing, you should extract the text from the XML document if you need to index text values only.

Monday, July 28, 2008

Extracting the Text from XML Documents for Indexing Purposes

In the process of creating a Lucene index for content searching, I had to index XML documents without the XML tags. In simple terms, I had to extract every text node from the document. I used the SAX API for this, and it was just a matter of writing an event handler for character data. The following piece of code shows the way to do it.
final StringBuffer sb = new StringBuffer();

try {
    SAXParserFactory factory = SAXParserFactory.newInstance();
    SAXParser saxParser = factory.newSAXParser();

    DefaultHandler handler = new DefaultHandler() {
        //other event handlers (for startElement and endElement) can be implemented similarly
        public void characters(char ch[], int start, int length)
                throws SAXException {
            sb.append(new String(ch, start, length));
        }
    };

    saxParser.parse("fileName.xml", handler);
    System.out.println(sb.toString());
} catch (Exception e) {
    e.printStackTrace();
}

This can be done using the StAX API too, but it is only bundled with Java from version 6 onwards; the following code works with Java 6. However, you may be able to use the Woodstox parser on older versions without changing the code.
try {
    XMLInputFactory inputFactory = XMLInputFactory.newInstance();
    InputStream in = new FileInputStream("fileName.xml");
    XMLEventReader eventReader = inputFactory.createXMLEventReader(in);
    StringBuffer bf = new StringBuffer();

    while (eventReader.hasNext()) {
        XMLEvent event = eventReader.nextEvent();
        //here we only collect character data, so nested elements are handled correctly
        if (event.isCharacters()) {
            bf.append(event.asCharacters().getData() + " ");
        }
    }

    System.out.println(bf.toString());
} catch (Exception e) {
    e.printStackTrace();
}

Monday, June 30, 2008

Loading Lucene Index to the RAM and Flushing Lucene Updates Periodically - Apache Lucene

As my previous post, Creating Lucene Index in a Database, shows, storing the Lucene index in the database is a solution for applications that run on clustered environments. But there is a performance hit, as we read/write from/to the database whenever the index is updated, which is time consuming.

Therefore we can simply load the Lucene index into RAM (Lucene supports RAMDirectory) and flush the changes to the database periodically. It can be done as follows.

RAMDirectory ramDir = new RAMDirectory();
JdbcDirectory jdbcDir = new JdbcDirectory(dataSource, new MySQLDialect(), "indexTable");

byte[] buffer = new byte[100];
LuceneUtils.copy(jdbcDir, ramDir, buffer); //copying the JdbcDirectory to the RAMDirectory

//After this point we can simply deal with the RAMDirectory without bothering about the index in the database

//After a convenient time period we can flush the changes in the RAMDirectory to the database
timer.schedule(new FlushTimer(10000, ramDir, jdbcDir), 0, 10000);



public class FlushTimer extends TimerTask {

    private int interval;
    RAMDirectory ramDir;
    JdbcDirectory jdbcDir;
    byte[] buffer = new byte[100];

    public FlushTimer(int interval, RAMDirectory ramDir, JdbcDirectory jdbcDir) {
        this.interval = interval;
        this.ramDir = ramDir;
        this.jdbcDir = jdbcDir;
    }

    public void run() {
        try {
            jdbcDir.deleteContent();
            LuceneUtils.copy(ramDir, jdbcDir, buffer);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}

Tuesday, June 24, 2008

Creating Lucene Index in a Database - Apache Lucene

My previous post, Indexing a database and searching the content using Lucene, shows how to index records (or stored files) in a database. In that case the index is created in the local file system. However, in real scenarios most applications run on clustered environments. Then the problem arises of where to create the search index.

Creating the index in the local file system is not a solution in this situation, as the index should be synchronized and shared by every node. One solution is clustering the JVM while using a Lucene RAMDirectory (keep in mind it disappears after a node failure) instead of an FSDirectory. The Terracotta framework can be used to cluster the JVM. This blog entry shows a code snippet.

Anyway I thought not to go that far and decided to create the index in the database so that it can be shared by every node. Lucene contains the JdbcDirectory interface for this purpose. However, an implementation of this interface is not shipped with Lucene itself; I found a third party implementation. The Compass project provides an implementation of JdbcDirectory. (No need to worry about Compass configuration etc.; JdbcDirectory can be used with pure Lucene without bothering about the rest of the Compass Lucene stuff.)

Here is a simple example
//you need to include the lucene and jdbc jars
import org.apache.lucene.store.jdbc.JdbcDirectory;
import org.apache.lucene.store.jdbc.dialect.MySQLDialect;
import com.mysql.jdbc.jdbc2.optional.MysqlDataSource;

//code snippet to create the index
MysqlDataSource dataSource = new MysqlDataSource();
dataSource.setUser("root");
dataSource.setPassword("password");
dataSource.setDatabaseName("test");
dataSource.setEmulateLocators(true); //important because we are dealing with a blob type data field

JdbcDirectory jdbcDir = new JdbcDirectory(dataSource, new MySQLDialect(), "indexTable");
jdbcDir.create(); //creates indexTable in the test database; no need to create it manually

//code snippet for indexing
StandardAnalyzer analyzer = new StandardAnalyzer();
IndexWriter writer = new IndexWriter(jdbcDir, analyzer, true);
indexDocs(writer, dataSource.getConnection());
System.out.println("Optimizing...");
writer.optimize();
writer.close();


static void indexDocs(IndexWriter writer, Connection conn)
throws Exception {
    String sql = "select id, name, color from pet";
    Statement stmt = conn.createStatement();
    ResultSet rs = stmt.executeQuery(sql);

    while (rs.next()) {
        Document d = new Document();
        d.add(new Field("id", rs.getString("id"), Field.Store.YES, Field.Index.NO));
        d.add(new Field("name", rs.getString("name"), Field.Store.YES, Field.Index.TOKENIZED));
        d.add(new Field("color", rs.getString("color"), Field.Store.YES, Field.Index.TOKENIZED));
        writer.addDocument(d);
    }
}

This is the indexing part. The searching part is the same as in my previous post.

Thursday, June 05, 2008

Apache Lucene - Indexing a Database and Searching the Content

Here is a Java code sample using Apache Lucene to create an index from a database. (I am using Lucene 2.3.2 and MySQL.)
final File INDEX_DIR = new File("index");

try{
   Class.forName("com.mysql.jdbc.Driver").newInstance();
   Connection conn = DriverManager.getConnection("jdbc:mysql://localhost/test", "root", "password");
   StandardAnalyzer analyzer = new StandardAnalyzer();
   IndexWriter writer = new IndexWriter(INDEX_DIR, analyzer, true);
   System.out.println("Indexing to directory '" + INDEX_DIR + "'...");
   indexDocs(writer, conn);
   writer.optimize();
   writer.close();
} catch (Exception e) {
   e.printStackTrace();
}

void indexDocs(IndexWriter writer, Connection conn) throws Exception {
  String sql = "select id, name, color from pet";
  Statement stmt = conn.createStatement();
  ResultSet rs = stmt.executeQuery(sql);
  while (rs.next()) {
     Document d = new Document();
     d.add(new Field("id", rs.getString("id"), Field.Store.YES, Field.Index.NO));
     d.add(new Field("name", rs.getString("name"), Field.Store.NO, Field.Index.TOKENIZED));
     d.add(new Field("color", rs.getString("color"),Field.Store.NO, Field.Index.TOKENIZED));
     writer.addDocument(d);
 }
}
I assumed that there is a table named pet in the "test" database with the fields "id", "name" and "color". After running this, a folder named index is created in the working directory containing the indexed content.


The following code (the Lucene searcher) shows how to search for a record containing a particular keyword using the created Lucene index.

StandardAnalyzer analyzer = new StandardAnalyzer(); //same analyzer as at index time
Searcher searcher = new IndexSearcher(IndexReader.open("index"));
Query query = new QueryParser("color", analyzer).parse("white");
Hits hits = searcher.search(query);
String sql = "select * from pet where id = ?";

PreparedStatement pstmt = conn.prepareStatement(sql); //conn as created in the indexing code
for (int i = 0; i < hits.length(); i++) {
   String id = hits.doc(i).get("id");
   pstmt.setString(1, id);
   displayResults(pstmt);
}

void displayResults(PreparedStatement pstmt) {
   try {
      ResultSet rs = pstmt.executeQuery();
      while (rs.next()) {
         System.out.println(rs.getString("name"));
         System.out.println(rs.getString("color")+"\n");
      }
   } catch (SQLException e) {
      e.printStackTrace();
   }
}

Thursday, May 29, 2008

Apache Rampart2 - High Performance Security Module for Apache Axis2

Apache Rampart2, a high performance security module for Apache Axis2, was our final year project.

The reason behind the idea was that the existing security module (Apache Rampart) performs poorly in both memory consumption and processor time. This is because Rampart depends on two other Apache projects (WSS4J and XMLSec) which use DOM to parse XML; therefore Rampart needs another layer (DOOM) for the conversion between DOM and Axiom. Another drawback of Rampart is post-policy validation: the policy validation is done only after completely processing the message.

We removed the above mentioned drawbacks by completely reimplementing the XML Security and SOAP Security layers using Axiom, and by making the SOAP Security layer policy aware to avoid the post-policy validation.

According to our performance tests, Rampart2 is nearly 6 times faster than Rampart, and its memory consumption is much lower. Therefore we can conclude that Rampart2 is far better than Rampart in terms of both memory and processor time.

Team includes
Saliya Ekanayake
Sameera Jayasoma
Kalani Ruwanpathirana
Isuru Suriarachchi

Thursday, April 03, 2008

Final exam is over

I finished my final exams on 18th March. No more exams :)