Handwriting Detection Model Based on Four-Dimensional Vector Space Model

Handwriting detection is mainly used in criminal investigation. This article builds a handwriting detection model on a four-dimensional vector space model. We select four feature quantities from the texts (word frequency, language style, average word length, and sentence structure) and quantify them, so that comparing texts becomes comparing vectors. By quantifying and normalizing the features of an author's known articles in advance, we obtain a standard reference vector. We then process the target text database in the same way and compare the resulting vector with the standard reference vector in terms of modulus and included angle, which lets us estimate whether the author wrote the texts in the database. The simulation results show that the model is accurate enough to identify the author of particular texts.


Introduction
With the popularity of e-mails, mobile phone messages, and social networking messages, more and more anonymous texts have become materials involved in cases. Identifying the origin of such texts is crucial in the detection of criminal cases. From the viewpoint of functional linguistics, language reflects the inner world of human beings at all times, and the world we know is realized through language. Language identification has therefore become a highly regarded investigative method. This gave birth to handwriting detection, that is, matching articles to their authors through specific features of the articles.
At present, there is little research on handwriting detection models. Most of it focuses on a single field and is difficult to apply universally. The Vector Space Model (VSM), proposed by Gerard Salton and McGill in 1988, uses vectors to represent texts and takes the weights of text features as the vector components. Through the calculation of word frequency and dimension reduction of the vectors, this model can compute the similarity of texts. Li Xuelei and Zhang Dongmo established a model that combines text characters, term frequency, hypertext markup language (HTML) tag information from web pages, and semantic analysis of question sentences to calculate an adjustable term frequency weighting parameter and to increase the separability of feature word vectors (Xuelei Li & Dongmo Zhang, 2003). Turney also used a vector-based model of semantic relations to attain a score of 56% on multiple-choice questions from the SAT (Turney, Peter D., 2006). This paper proposes a handwriting detection model based on a four-dimensional vector space model, which can be applied to many types of text to solve the problem of author identification in criminal investigation.

Handwriting Detection Model Based on Four-dimensional Vector Space Model
The vector space model is an algebraic model that applies to information filtering, information selection, indexing, and relevance assessment. It simplifies the processing of text content into operations in a vector space and expresses the similarity of language features as spatial similarity, so the similarity of texts is measured by calculating the similarity of their vectors. In this article, we likewise select feature quantities from the texts and quantify them.

Selecting Features
(1) We select the word with the highest frequency in each text as one of the recognition criteria, because that word not only represents the subject of the text but also reflects a person's writing habits. The highest word frequency across several articles by one person stays within a limited range of fluctuation. (Huiling Wang et al., 2001)
(2) Language style is another criterion of great importance, since writing style differs considerably from person to person. The aspect we choose here is the use of exclamatory sentences: we calculate the frequency of sentences that express strong emotion and use it as a criterion.
(3) The average word length is also important evidence of whether an article was written by a particular person. Some people prefer short, informal words, while others prefer long ones, and people also tend to use long, formal words when writing formal documents. The average word length is therefore a good measure for judging the author's identity, educational level, and so on. (Qiang Li & Jianhua Li, 2006)
(4) Sentence structure is the fourth criterion. Here we use the frequency of linking verbs as the object of study, because the sophistication and difficulty of the sentences in an article not only indicate a person's educational level but also deeply reflect that person's writing habits. Everyone has habitual sentence structures, and each structure type makes up a characteristic proportion of that person's writing.
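The four feature quantities above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not the original Java program: the function name `extract_features`, the `LINKING_VERBS` list, the regular expressions used to split words and sentences, and the choice to express every count as a relative frequency are all hypothetical.

```python
import re
from collections import Counter

# Hypothetical list of linking verbs; the source does not say which verbs it counts.
LINKING_VERBS = {"am", "is", "are", "was", "were", "be", "been", "being",
                 "seem", "seems", "seemed", "become", "becomes", "became"}


def extract_features(text):
    """Return (highest word frequency, exclamatory-sentence frequency,
    average word length, linking-verb frequency) for one text."""
    # Words are runs of letters, optionally with one internal apostrophe.
    words = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    # Split on whitespace that follows sentence-ending punctuation.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    if not words or not sentences:
        raise ValueError("text must contain at least one word and one sentence")

    counts = Counter(words)
    # Assumption: "highest word frequency" is the top count divided by total words.
    highest_freq = counts.most_common(1)[0][1] / len(words)
    exclaim_freq = sum(s.endswith("!") for s in sentences) / len(sentences)
    avg_word_len = sum(len(w) for w in words) / len(words)
    linking_freq = sum(w in LINKING_VERBS for w in words) / len(words)
    return highest_freq, exclaim_freq, avg_word_len, linking_freq
```

Calling `extract_features` on each text in a library yields one four-component tuple per text, which is the raw material for the averaging steps that follow.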

Extracting the Highest Frequency and Establishing a Database
In processing the word frequency feature, only the value of the highest word frequency is considered. This value captures the writer's habits: some people have a wordy writing style, while others are sparing with words. The highest word frequency can therefore be used as an indicator for observation and analysis.
(1) The initial data is simply a collection of texts. We wrote a Java program that finds the highest word frequency in each article and stores the values in a new database.
(2) We compute the arithmetic average of the highest word frequencies in the new database and use it as a component of the standard parameter vector. Writing $f_i$ for the highest word frequency of the $i$-th text and $m$ for the total number of texts, the formula is:
$$f_{av} = \frac{1}{m}\sum_{i=1}^{m} f_i$$
(3) We extract the largest and smallest values, $f_{max}$ and $f_{min}$, from the database, subtract $f_{av}$ from each, and compare the two absolute values. If $|f_{max} - f_{av}| > |f_{min} - f_{av}|$, then $f_{max}$ deviates more from the average highest frequency; if not, $f_{min}$ deviates more.
These results are used to calculate the dynamic range in the model.
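Steps (2) and (3) can be sketched as follows. This is a minimal illustration, not the original program: the function name `average_and_range` and the sample values are made up, and taking the larger of the two absolute deviations as the dynamic range is an assumption, since the source does not state how the range is derived from the comparison.

```python
def average_and_range(values):
    """Return (average, dynamic range, which extreme deviates more).

    Assumption: the dynamic range is the larger of the two absolute
    deviations of the maximum and minimum from the average.
    """
    if not values:
        raise ValueError("need at least one value")
    avg = sum(values) / len(values)
    dev_max = abs(max(values) - avg)
    dev_min = abs(min(values) - avg)
    if dev_max > dev_min:
        return avg, dev_max, "max"
    return avg, dev_min, "min"


# Made-up highest word frequencies for four texts by one author.
highest_freqs = [0.05, 0.06, 0.04, 0.09]
f_av, f_range, extreme = average_and_range(highest_freqs)
```

The same function applies unchanged to the other three feature quantities, because each is processed by averaging and then comparing the deviations of its extremes.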

Analyzing the Style of Article Sentences and Building a Quantitative Database
Language styles vary widely. The number of exclamations and the frequency of modal particles usually represent a person's writing style best, so when quantifying style we mainly look at the number and frequency of exclamatory sentences.
(1) As with the highest word frequency database above, the initial data is just a simple text library. We analyzed the frequency of exclamations and modal particles in each text with a Java program to form a new database.
(2) We compute the arithmetic average of the frequency values in the new database and use it as a component of the standard parameter vector. Writing $y_i$ for the frequency in the $i$-th text and assuming the total number of texts is $m$, the formula is:
$$y_{av} = \frac{1}{m}\sum_{i=1}^{m} y_i$$
(3) We extract $y_{max}$ and $y_{min}$ from the database, subtract $y_{av}$ from each, and compare the two absolute values. If $|y_{max} - y_{av}| > |y_{min} - y_{av}|$, then $y_{max}$ deviates more from the average frequency; if not, $y_{min}$ deviates more.
These results are used to calculate the dynamic range in the model.

Quantifying the Average Word Length
(1) As above, the initial data is the text library. We compute the average word length of each text to form a new database.
(2) We compute the arithmetic average of the average word lengths in the new database and use it as a component of the standard parameter vector. Writing $L_i$ for the average word length of the $i$-th text and $m$ for the total number of texts, the formula is:
$$L_{av} = \frac{1}{m}\sum_{i=1}^{m} L_i$$
(3) We extract $L_{max}$ and $L_{min}$ across the e-mails in the database, subtract $L_{av}$ from each, and compare the two absolute values. If $|L_{max} - L_{av}| > |L_{min} - L_{av}|$, then $L_{max}$ deviates more from the overall average word length; if not, $L_{min}$ deviates more.
These results are used to calculate the dynamic range in the model.

The Analysis and Quantification of Sentence Structure
Sentence structure also reflects a person's writing habits. Many people like to use the subject-linking verb-predicative structure or the subject-verb-object structure. Here we choose the subject-linking verb-predicative structure and quantify its proportion.
(1) The original data is still the text library. We use a Java program to find the linking verbs in each article and thereby obtain the proportion of sentences with the subject-linking verb-predicative structure, which forms a new database.
(2) We compute the arithmetic average of the proportions in the new database and use it as a component of the standard parameter vector. Writing $S_i$ for the proportion in the $i$-th text and assuming the total number of texts is $m$, the formula is:
$$S_{av} = \frac{1}{m}\sum_{i=1}^{m} S_i$$
(3) We extract $S_{max}$ and $S_{min}$ from the database, subtract $S_{av}$ from each, and compare the two absolute values. If $|S_{max} - S_{av}| > |S_{min} - S_{av}|$, then $S_{max}$ deviates more from the average proportion; if not, $S_{min}$ deviates more.
These results are used to calculate the dynamic range in the model.

Weight Calculation of Feature Quantities
Since the four feature quantities contribute differently to identifying the author of a text, we need to quantify how much each one contributes. The weight is calculated by formula (5), in which $tf_{ik}$ is the frequency at which the feature appears in the text and the denominator is the normalization factor.
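Formula (5) itself is not reproduced here, so the following is only a hedged sketch of one common weighting that matches the description above (a feature frequency divided by a normalization factor). Cosine normalization, the function name `normalized_weights`, and the sample frequencies are assumptions for illustration and should not be read as the paper's formula.

```python
import math


def normalized_weights(tf):
    """Cosine-normalize raw feature frequencies.

    Assumption: the normalization factor is the Euclidean length of the
    frequency vector, so the squares of the returned weights sum to 1.
    """
    norm = math.sqrt(sum(x * x for x in tf))
    if norm == 0:
        raise ValueError("at least one frequency must be non-zero")
    return [x / norm for x in tf]


# Made-up frequencies for the four features of one text.
weights = normalized_weights([0.06, 0.10, 0.25, 0.12])
```

Normalizing in this way keeps the weights comparable between short and long texts, which is the usual motivation for a normalization factor in a vector space model.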

Four-dimensional Space Vector Formula
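The reference vector is formed from the four averaged feature components, and the tolerance scope is derived from their dynamic ranges. The target text is quantified in the same way to give a target vector. We then proceed as follows. First, we calculate the difference between the moduli of the target vector and the reference vector. Second, we calculate the included angle between the two vectors.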
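The two comparison steps can be sketched as below, using the standard definitions of the Euclidean modulus and of the angle obtained from the dot product. This is a minimal illustration under assumptions: the function names, the decision rule in `matches_reference`, and its two threshold parameters are hypothetical, because the source does not reproduce its tolerance formulas.

```python
import math


def modulus(v):
    """Euclidean length of a vector."""
    return math.sqrt(sum(x * x for x in v))


def included_angle(u, v):
    """Angle in radians between two non-zero vectors of equal length."""
    cos_theta = sum(a * b for a, b in zip(u, v)) / (modulus(u) * modulus(v))
    # Clamp to guard against rounding slightly outside [-1, 1].
    return math.acos(max(-1.0, min(1.0, cos_theta)))


def matches_reference(target, reference, max_modulus_diff, max_angle):
    """Hypothetical decision rule: accept the target only if both the
    modulus difference and the included angle are within tolerance."""
    diff = abs(modulus(target) - modulus(reference))
    return diff <= max_modulus_diff and included_angle(target, reference) <= max_angle
```

Because raw average word length is numerically much larger than the three frequency features, the components would normally be normalized or weighted before this comparison, in line with the normalization step the model describes.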

Simulation Results
We use texts of more than 1000 words for the simulation to ensure the accuracy of the model. We selected three sets of samples, each with a sample size of 15. (Yiping Zeng & Xiaowen Zhu, 2006)

Data Results
The numerical data obtained after processing are shown in the following tables (sheets 4-1 to 4-3):

Conclusion
The vector space model turns articles into vectors, so the concept of our model is relatively simple. The more texts are available and the more complete the data, the more accurate the results of the model. Our model can be used to analyze the handwriting of different people rather than only in one particular situation: by establishing a specific database, it can be applied to each situation, which shows that the model is flexible and reliable. The model requires a large amount of computation, so it still has disadvantages that need to be optimized, but it is of practical significance and provides realistic guidance for criminal cases based on e-mail.