Cybercriminals exploit the anonymity of the online world to carry out malicious activities such as identity theft, harassment and phishing. In a phishing scam, the scammers create websites and send out emails designed to trick account holders into revealing sensitive information such as passwords and account numbers. Investigators usually approach these crimes by tracing back the IP addresses recorded in the headers of the anonymous emails. However, the information gathered from an IP address is sometimes not enough to identify the culprit, for example when the message was sent through a proxy server or when the computer used to send the email has more than one user (Fouss et al., 2010).
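To make the header-based tracing step concrete, the following is a minimal sketch in Python that reads the Received headers of a raw email and lists the IPv4 addresses they mention, starting from the relay closest to the sender. The function name relay_ips, the sample message and the addresses (taken from the ranges reserved for documentation) are assumptions made for illustration and do not come from the cited work. Received headers can be forged or can point to a proxy, so the output is a lead to follow up rather than proof of origin.

```python
import re
from email import message_from_string

# Matches dotted-quad IPv4 addresses; IPv6 is deliberately ignored in this sketch.
IPV4 = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def relay_ips(raw_message):
    """Return the IPv4 addresses found in Received headers, origin first."""
    msg = message_from_string(raw_message)
    received = msg.get_all("Received") or []
    ips = []
    # Each relay prepends its own Received header, so the last header in the
    # message is the one written closest to the original sender.
    for header in reversed(received):
        for ip in IPV4.findall(header):
            if ip not in ips:
                ips.append(ip)
    return ips


# Hypothetical message used only to exercise the function.
SAMPLE = (
    "Received: from mail.example.net (mail.example.net [198.51.100.7])\n"
    "\tby mx.example.org with ESMTP; Mon, 1 Jan 2024 10:00:05 +0000\n"
    "Received: from laptop (unknown [203.0.113.25])\n"
    "\tby mail.example.net with ESMTP; Mon, 1 Jan 2024 10:00:01 +0000\n"
    "From: anon@example.net\n"
    "Subject: hello\n"
    "\n"
    "body text\n"
)

if __name__ == "__main__":
    print(relay_ips(SAMPLE))
```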
Authorship analysis techniques are used to address this problem of anonymity in online communication. The study of authorship has a long history in resolving authorial disputes over poetic and historical works, but its application to online textual communication remains very limited. The reason is that traditional written works are long and well structured, following grammatical rules and common syntactic conventions. Online documents such as instant messages and emails, by contrast, are short and poorly structured, are written mostly in informal language, and contain many grammatical and spelling mistakes. Because of these differences, some of the features used in authorship analysis cannot be applied to online textual data (Abbassi and Chen, 2009).
This paper addresses the following three authorship analysis problems.
Firstly, authorship identification with large training samples: consider a situation in which a cybercrime investigator wants to pinpoint the most likely author of a particular anonymous text message from among a group of suspects. We assume that a large collection of messages previously written by the suspects is available to the investigator. In actual investigations, such sample messages can be obtained, with a warrant, from the chat logs or email archives on a suspect's personal computer, and they provide a sample of each suspect's writing style. Much of the previous work on authorship identification assumes that every suspect has only one writing style. We argue that a person's writing style changes depending on the nature of the topic. The challenge is to identify these stylistic variations and to use them to improve the accuracy of authorship identification (Fouss et al., 2010).
Secondly, authorship identification with small training samples: a cybercrime investigator is given a number of anonymous messages and a group of suspects, and wants to correctly identify the author of each anonymous message. The assumption in this problem is that the investigator has access to only a small number of training samples. The challenge is to make the identification from this inadequate training data by finding specific patterns (Fouss et al., 2010).
Thirdly, authorship characterization: in this scenario the cybercrime investigator has a collection of anonymous text messages but no idea who the probable suspects are, so no training samples from suspects are available either. The investigator would still like to infer some characteristics of the authors, such as age group, ethnicity and gender, by observing their writing styles. The assumption is that the investigator has access to external sources of text, such as social networking websites or blog postings. The challenge is how to use these external sources to deduce the characteristics of the authors (Abbassi and Chen, 2009).
Authorship analysis is the study of the linguistic and computational characteristics of documents written by individuals. The writing traits, or writing styles, extracted from a person's documents can be used to distinguish one person from another. Writing styles, also called stylometric features, are commonly grouped into five categories, namely lexical, syntactic, structural, content-specific and idiosyncratic features (Hu et al., 2010).
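As a rough illustration of what extracting writing traits can look like in practice, the sketch below computes a handful of simple measures from a piece of text: a few lexical and structural ones, plus function-word and punctuation ratios, which the stylometry literature usually treats as syntactic features. The function name, the short function-word list and the particular measures chosen are assumptions made for this example; published feature sets are far larger.

```python
import re
import string
from collections import Counter

# A tiny, illustrative function-word list; real studies use much longer lists.
FUNCTION_WORDS = ("the", "of", "and", "to", "in", "a", "is", "that", "it", "for")


def stylometric_features(text):
    """Return a small dictionary of writing-style measures for one document."""
    words = re.findall(r"[a-z']+", text.lower())
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    paragraphs = [p for p in text.split("\n\n") if p.strip()]
    counts = Counter(words)
    n_words = max(len(words), 1)  # guard against division by zero
    n_chars = max(len(text), 1)
    return {
        # Lexical measures
        "avg_word_length": sum(len(w) for w in words) / n_words,
        "vocabulary_richness": len(counts) / n_words,  # type-token ratio
        "digit_ratio": sum(c.isdigit() for c in text) / n_chars,
        # Function-word and punctuation usage
        "function_word_ratio": sum(counts[w] for w in FUNCTION_WORDS) / n_words,
        "punctuation_ratio": sum(c in string.punctuation for c in text) / n_chars,
        # Structural measures
        "avg_sentence_length": len(words) / max(len(sentences), 1),
        "paragraph_count": len(paragraphs),
    }


if __name__ == "__main__":
    for name, value in stylometric_features("The cat sat. The dog ran!\n\nIt is 4 pm.").items():
        print(name, round(value, 3))
```

Each document becomes a fixed-length vector of numbers, which is the form that classifiers and similarity measures expect.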
Authorship analysis has played a very important role in resolving authorship disputes over conventional and literary writings. Analysing online documents is much more challenging because they are shorter and provide too little training data from which to learn an author's writing patterns. Online documents are also written in an informal style and contain many grammatical and spelling errors. As a result, techniques that are very effective for traditional and literary works are often not applicable to online documents, and it is important to develop analytical techniques that are better suited to them. Authorship analysis has so far been applied to emails, chat logs, web forums and web postings (Hu et al., 2010).
A survey of authorship research found that an authorship problem can be viewed from three main perspectives: authorship identification, authorship characterization and authorship similarity detection (Abbassi et al., 2010).
Authorship identification is used to identify the most probable author from among a group of suspects. In many studies, a classification model is built from stylometric features extracted from a set of sample documents written by the suspects, and the model is then applied to the anonymous document to find the most probable suspect. These studies implicitly assume that each suspect has a single, consistent writing style. We argue that this assumption may well not hold, because an individual's writing style can change depending on the topic or situation (Abbassi et al., 2010).
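The general workflow described here, training a model on the suspects' known writing and then scoring an anonymous document against it, can be sketched as follows. The example uses a multinomial naive Bayes classifier over word counts with add-one smoothing, chosen only because it is short enough to write out in full; it is not the method of any of the cited studies, and word counts are a crude stand-in for a full stylometric feature set. All names and sample messages are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lower-case the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def train(samples):
    """Build per-suspect word counts from {suspect: [known messages]}."""
    model = {}
    vocab = set()
    for suspect, messages in samples.items():
        counts = Counter(tok for m in messages for tok in tokenize(m))
        model[suspect] = counts
        vocab.update(counts)
    return model, vocab


def most_probable_author(model, vocab, anonymous_text):
    """Return (best suspect, log-likelihood per suspect) for the anonymous text."""
    # Words never seen in training carry no evidence for any suspect, so skip them.
    tokens = [t for t in tokenize(anonymous_text) if t in vocab]
    scores = {}
    for suspect, counts in model.items():
        total = sum(counts.values())
        # Add-one (Laplace) smoothing with a uniform prior over suspects.
        scores[suspect] = sum(
            math.log((counts[t] + 1) / (total + len(vocab))) for t in tokens
        )
    return max(scores, key=scores.get), scores


if __name__ == "__main__":
    # Hypothetical training messages for two suspects.
    samples = {
        "alice": ["hey there, gonna be late lol", "lol gonna grab coffee first"],
        "bob": ["I shall arrive shortly.", "I shall therefore send the report shortly."],
    }
    model, vocab = train(samples)
    print(most_probable_author(model, vocab, "gonna send it later lol")[0])
```

Because the model holds one set of counts per suspect, it embodies exactly the single-style assumption questioned above: messages written by the same person on very different topics are pooled into one profile.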
Authorship similarity detection is used to determine whether two objects were produced by the same entity, without knowing who that entity actually is. Similarity detection has many applications, in areas such as online marketplaces and plagiarism detection. The digital revolution has made it very easy to copy and distribute creative works, and as a result many copyright violations are taking place all over the world (Abbassi et al., 2010).
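One simple way to compare two texts without any labelled suspects is to build a character n-gram profile of each and measure the cosine similarity between the profiles. The sketch below illustrates that idea only; the function names and the choice of trigrams are assumptions, and any threshold for declaring two texts to be by the same author would have to be calibrated on real data.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Count overlapping character n-grams after normalising case and whitespace."""
    text = " ".join(text.lower().split())
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse count vectors (0.0 if either is empty)."""
    dot = sum(a[g] * b[g] for g in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def same_author_score(text_a, text_b, n=3):
    """Similarity in [0, 1]; higher means the two texts share more n-gram habits."""
    return cosine_similarity(char_ngrams(text_a, n), char_ngrams(text_b, n))


if __name__ == "__main__":
    # Hypothetical feedback comments from two marketplace accounts.
    print(same_author_score("gr8 seller!! fast shippin, thx", "gr8 buyer!! fast payin, thx"))
    print(same_author_score("gr8 seller!! fast shippin, thx", "Excellent transaction. Recommended."))
```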
Similarly, the reputation systems of online marketplaces, which are built from customer feedback, are often manipulated by users who enter the system under multiple names. Researchers have developed techniques that detect such aliases in online systems such as eBay by studying the users' feedback. Researchers have also developed similarity detection techniques for identifying fraudulent and malicious websites (Abbassi et al., 2010).
Authorship characterization is used to collect sociolinguistic attributes, such as occupation, gender, age and educational level, of the probable author of a particular anonymous document. Some researchers have studied the effects of gender-preferential attributes on authorship analysis. Other factors discussed in profiling studies include neuroticism, educational level, language background and age (Abbassi et al., 2010).
The machine learning techniques employed in many authorship analysis studies fall into three categories:
1. Probabilistic classifiers,
2. Decision trees and
3. Support vector machines and their variants (Iqbal et al., 2010).
Each of these techniques has limitations with regard to interpretability, accuracy and scalability.
Feature selection is considered a very important processing step in data mining and machine learning. Various feature selection techniques can also be used in authorship studies to determine a subset of stylometric features that discriminates between authors. There are two general approaches to feature selection: forward selection and backward selection. Forward selection starts from an empty set and adds features one at a time, whereas backward selection starts from the full set and removes them. It is important to note that feature selection does not guarantee that the writeprints of the suspects are unique (Iqbal et al., 2010).
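Forward selection can be sketched as a greedy loop that repeatedly adds whichever feature most improves a scoring function and stops when no addition helps. In the example below the score is the leave-one-out accuracy of a 1-nearest-neighbour classifier, which is an assumption made to keep the code self-contained; the feature names and the toy data are likewise hypothetical. Backward selection would mirror this loop, starting from all features and removing one at a time.

```python
def forward_selection(features, score, max_features=None):
    """Greedily grow a feature subset while score(subset) keeps improving."""
    selected = []
    best = score(selected)
    remaining = list(features)
    while remaining and (max_features is None or len(selected) < max_features):
        candidate, candidate_score = None, best
        for f in remaining:
            s = score(selected + [f])
            if s > candidate_score:  # strict improvement only
                candidate, candidate_score = f, s
        if candidate is None:
            break  # no remaining feature improves the score
        selected.append(candidate)
        remaining.remove(candidate)
        best = candidate_score
    return selected, best


def loo_accuracy(rows, labels, subset):
    """Leave-one-out accuracy of a 1-nearest-neighbour classifier on `subset`."""
    if not subset:
        return 0.0
    correct = 0
    for i, row in enumerate(rows):
        nearest = min(
            (j for j in range(len(rows)) if j != i),
            key=lambda j: sum((row[f] - rows[j][f]) ** 2 for f in subset),
        )
        correct += labels[nearest] == labels[i]
    return correct / len(rows)


# Hypothetical feature vectors for four messages by two authors, A and B.
ROWS = [
    {"avg_word_length": 3.9, "punctuation_ratio": 0.02, "digit_ratio": 0.01},
    {"avg_word_length": 4.0, "punctuation_ratio": 0.09, "digit_ratio": 0.02},
    {"avg_word_length": 5.6, "punctuation_ratio": 0.03, "digit_ratio": 0.01},
    {"avg_word_length": 5.5, "punctuation_ratio": 0.08, "digit_ratio": 0.02},
]
LABELS = ["A", "A", "B", "B"]

if __name__ == "__main__":
    chosen, accuracy = forward_selection(
        list(ROWS[0]), lambda subset: loo_accuracy(ROWS, LABELS, subset)
    )
    print(chosen, accuracy)
```

The selected subset is the one that best separates the authors in the training data; as noted above, that does not make the resulting profile of any one suspect unique.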