
HOW TO PARSE AN EMAIL USING THE JAVAMAIL API


 

Some notes on parsing mail content with the JavaMail API; a well-written article.

 

Original article: http://techforum4u.com/content.php/177-HOW-TO-PARSE-AN-EMAIL-USING-THE-JAVAMAIL-API

 

 

 

 

======================Main Content=======================

By Pankaj

 

 

AIM
The aim of this document is to elaborate on how to retrieve an email from the mail server and parse it using the JavaMail API.

PREFACE
The first section of the article explains the architecture used for Mail Parsing. It elaborates the process of how a request to parse the mail is initiated after the mail is sent from the external world to the SMTP server used by the application. This is followed by an explanation for parsing the mail file using the JavaMail API. The practical complications that were encountered, known issues and solutions have also been detailed below.

ARCHITECTURE
The architecture that was used in our project is briefly described below.
The architecture explained can be used to receive mails from the external world, parse them and archive them and deliver the mail to the recipients. It must be mentioned here that this approach can be used when the database design is such that only one content part is archived along with the metadata irrespective of the number of recipients.

DOMAIN NAME CONFIGURATION FOR SMTP SERVER
The mails sent in the external world are relayed and delivered to the SMTP server based on the domain name that is configured. Hence, when the SMTP server is configured, it must be given the list of domain names that the end users will use in their email addresses. Once this is done, mails sent to those configured domain names will be delivered to the SMTP server.

CREATE REQUEST FOR MAIL PARSING
The mails received by the mail server will be available in the email pool.

When the mail server receives a mail, it sends an HTTP request (based on the domain) to the listener with the name of the mail file and the resent flag as parameters.

The listener is a servlet that picks up the file and sends it for further processing. Also, if a mail lies in the email pool beyond a certain time limit, a failover script should send an HTTP request to the listener with the name of the mail file and the resent flag as parameters.

The resent flag is used to indicate whether the mail has been sent for the first time or not. If the mail is sent for the first time from email server to http listener, the resent flag is N, else Y.

On receiving the request, the listener sends a request to the mail parser with the location, name of the mail file and the resent flag as parameters.



The file is then parsed by retrieving the MimeMessage object which contains information about the recipients, content, and attachments. This information is then stored to the database and delivered to the recipients.
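
The hand-off from the listener to the mail parser could be modelled as a small value object. The following is only a minimal sketch: the class name ParseRequest, the parameter names "file" and "resent", and the pool location are assumptions made for illustration and are not taken from the original project.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical value object for the request the listener forwards to the mail parser. */
final class ParseRequest {
    final String location;  // directory of the email pool (assumed)
    final String fileName;  // name of the mail file
    final boolean resent;   // true when the resent flag is "Y"

    ParseRequest(String location, String fileName, boolean resent) {
        this.location = location;
        this.fileName = fileName;
        this.resent = resent;
    }

    /**
     * Builds a request from a raw query string such as "file=mail-0001.eml&resent=N".
     * URL decoding is left out to keep the sketch short.
     */
    static ParseRequest fromQuery(String location, String query) {
        Map<String, String> params = new HashMap<>();
        for (String pair : query.split("&")) {
            int eq = pair.indexOf('=');
            if (eq > 0) {
                params.put(pair.substring(0, eq), pair.substring(eq + 1));
            }
        }
        String file = params.get("file");
        if (file == null || file.isEmpty()) {
            throw new IllegalArgumentException("missing mail file name");
        }
        // Anything other than "Y" is treated as a first-time delivery.
        boolean resent = "Y".equalsIgnoreCase(params.get("resent"));
        return new ParseRequest(location, file, resent);
    }
}
```

In a real servlet the parameters would come from the request object rather than from a hand-parsed query string; the sketch only shows which values travel with the request.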

USE OF JAVAMAIL API TO PARSE EMAIL FILES
This section will describe in detail the processes involved in parsing a mail file.
Almost every mail file that is received by the SMTP server has a header called the Message-Id. There may still be a strange case where a mail file will not have the Message-Id. Such mail files need to be ignored when this approach is used, as the message id plays a significant role in deciding the kind of processing required.

The message-id along with the timestamp of the email can be used as a unique identification for a mail that can have one or more recipients and hence should be stored in the database as they will come in handy when duplicates have to be avoided.
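
As a rough illustration of the duplicate check, the combination of message id and timestamp can be treated as a composite key. This is a sketch under assumptions: the class names MailKey and ProcessedMailLog are hypothetical, and an in-memory set stands in for the database table.

```java
import java.util.HashSet;
import java.util.Objects;
import java.util.Set;

/** Hypothetical composite key: Message-Id header plus mail timestamp (epoch milliseconds). */
final class MailKey {
    final String messageId;
    final long sentMillis;

    MailKey(String messageId, long sentMillis) {
        this.messageId = messageId;
        this.sentMillis = sentMillis;
    }

    @Override
    public boolean equals(Object other) {
        if (!(other instanceof MailKey)) {
            return false;
        }
        MailKey that = (MailKey) other;
        return sentMillis == that.sentMillis && messageId.equals(that.messageId);
    }

    @Override
    public int hashCode() {
        return Objects.hash(messageId, sentMillis);
    }
}

/** In-memory stand-in for the database table that stores processed keys. */
final class ProcessedMailLog {
    private final Set<MailKey> seen = new HashSet<>();

    /** Returns true the first time a key is recorded and false for a duplicate. */
    boolean recordIfNew(String messageId, long sentMillis) {
        if (messageId == null || messageId.trim().isEmpty()) {
            // Mail files without a Message-Id are ignored in this approach,
            // so callers are expected to skip them before reaching this point.
            throw new IllegalArgumentException("mail file has no Message-Id");
        }
        return seen.add(new MailKey(messageId.trim(), sentMillis));
    }
}
```

In a database the same idea would typically be expressed as a unique constraint on the two columns.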

RECIPIENTS
To retrieve the information about the recipients, use the getRecipients(Message.RecipientType type) method of MimeMessage.
The mapping between the type and the corresponding RFC 822 header is as follows:

Message.RecipientType.TO - "To"
Message.RecipientType.CC - "Cc"
Message.RecipientType.BCC - "Bcc"
MimeMessage.RecipientType.NEWSGROUPS - "Newsgroups"

An external server may send out one mail file with all the recipients of the mail or an individual mail file for each recipient. If the server sends a single mail file for all recipients and there are multiple BCC recipients, that file alone does not allow accurate delivery to all intended recipients, since BCC addresses are normally not listed in its headers.

X-ENVELOPE-TO HEADER

• A header called X-Envelope-To was used to store the information about the recipient. This header can store only one value at a time.
• For each recipient of the mail, an individual mail file will be sent to the listener with one of the recipients in the X-Envelope-To header.
• All mail files will anyway have information about To and Cc recipients and it will not be required to process the files repeatedly for these two types of recipients.
• While processing the mail file, the program should check if the recipient in the X-Envelope-To header is also present in the To or Cc fields.
o If yes, the mail file can be ignored if it has been processed at least once before.
o If not, the recipient in the X-Envelope-To header will be considered as a BCC recipient.
• The unique combination of message id and email timestamp can be used to detect if the mail was processed earlier as these two values will be stored in the database.
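
The decision described in the list above can be sketched as a small function. The class name, the enum and its values are hypothetical, and comparing addresses case-insensitively is a simplification assumed here.

```java
import java.util.Collection;

/** Hypothetical helper that decides what to do with one per-recipient mail file. */
final class EnvelopeRecipientCheck {
    enum Action { PROCESS, IGNORE, PROCESS_AS_BCC }

    /**
     * @param envelopeTo       the single address from the X-Envelope-To header
     * @param to               addresses from the To header
     * @param cc               addresses from the Cc header
     * @param alreadyProcessed true if the message id and timestamp were already logged
     */
    static Action decide(String envelopeTo, Collection<String> to,
                         Collection<String> cc, boolean alreadyProcessed) {
        boolean visible = containsAddress(to, envelopeTo) || containsAddress(cc, envelopeTo);
        if (!visible) {
            // Not listed in To or Cc: treat the envelope recipient as a BCC recipient.
            return Action.PROCESS_AS_BCC;
        }
        // To and Cc recipients are covered by the first file that was processed.
        return alreadyProcessed ? Action.IGNORE : Action.PROCESS;
    }

    private static boolean containsAddress(Collection<String> addresses, String candidate) {
        for (String address : addresses) {
            if (address.trim().equalsIgnoreCase(candidate.trim())) {
                return true;
            }
        }
        return false;
    }
}
```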

AUTHOR
The information about the author can be obtained using the getFrom() method in the MimeMessage class.
The name of the author, if specified, can be retrieved using the getPersonal() method of the InternetAddress class. The Address objects returned by getFrom() for internet mail are InternetAddress instances.

The diagram below is a pictorial representation of the process involved in the parsing of a mail file.


PARSE CONTENT
The content of a mail file can belong to any of a large number of MIME types. No fixed list covers every type met in practice, because senders can also use non-standard types. Hence parsing the content is mainly a matter of trial and error.
This section explains how parsing of certain known MIME types can be handled.
The content of the mail can be retrieved using the getContent() method of the MimeMessage class. In the cases covered here, the content is an instance of one of the following two types:
• String
• MimeMultipart

(a) When the content is an instance of String, the following are the known possibilities of content.
(i) UUENCODED MAIL
Definition: A mail file is said to be uuencoded when it has an attachment that is preceded by the characters “begin 666” and ends with “`end”. The mail can also have content.
Parsing: To parse such a mail file, you have to take the substring of the content from the beginning till you encounter “begin 666”.

Content: Depending on the content type, the content has to be treated as plain text or HTML content. This information has to be stored in order to preserve the formatting of the content for exact visualization of the source at the target.

Attachment: The attachment name is normally present after the header “begin 666” which is followed by a return key. If the attachment string does not end with the footer “`end”, append the same to the string. The UUDecoder class has to be used to decode the string into a byte array. Following it, the input stream is created from byte array. This is used to create the attachment.
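
The splitting and decoding steps above might look roughly like the following. This is a sketch under assumptions: the class and method names are hypothetical, and the small hand-written decoder stands in for the UUDecoder class mentioned above so that the example depends only on the JDK. It tolerates a missing footer instead of appending one.

```java
import java.io.ByteArrayOutputStream;

/** Hypothetical helper for String content that carries a uuencoded attachment. */
final class UuencodedMail {
    static final String HEADER = "begin 666 ";

    /** Text before the "begin 666" header, or the whole content if there is no header. */
    static String bodyText(String content) {
        int start = content.indexOf(HEADER);
        return start < 0 ? content : content.substring(0, start);
    }

    /** Attachment name: the rest of the header line after "begin 666 ", or null. */
    static String attachmentName(String content) {
        int start = content.indexOf(HEADER);
        if (start < 0) {
            return null;
        }
        int nameStart = start + HEADER.length();
        int lineEnd = content.indexOf('\n', nameStart);
        String name = lineEnd < 0 ? content.substring(nameStart)
                                  : content.substring(nameStart, lineEnd);
        return name.trim();
    }

    /** Decodes the uuencoded lines that follow the header into a byte array. */
    static byte[] attachmentBytes(String content) {
        int start = content.indexOf(HEADER);
        if (start < 0) {
            return new byte[0];
        }
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        String[] lines = content.substring(start).split("\r?\n");
        for (int i = 1; i < lines.length; i++) {   // lines[0] is the header line
            String line = lines[i];
            if (line.isEmpty()) {
                continue;
            }
            if (line.equals("end")) {
                break;
            }
            int count = value(line, 0);            // number of decoded bytes on this line
            if (count == 0) {
                break;                             // "`" marks the zero-length final line
            }
            int written = 0;
            for (int pos = 1; written < count; pos += 4) {
                int c0 = value(line, pos);
                int c1 = value(line, pos + 1);
                int c2 = value(line, pos + 2);
                int c3 = value(line, pos + 3);
                int[] triple = {
                    (c0 << 2) | (c1 >> 4),
                    ((c1 & 0x0F) << 4) | (c2 >> 2),
                    ((c2 & 0x03) << 6) | c3
                };
                for (int k = 0; k < 3 && written < count; k++, written++) {
                    out.write(triple[k]);
                }
            }
        }
        return out.toByteArray();
    }

    /** Maps one uuencode character to its 6-bit value; missing characters count as 0. */
    private static int value(String line, int index) {
        return index < line.length() ? (line.charAt(index) - 32) & 0x3F : 0;
    }
}
```

The resulting byte array can then be wrapped in a ByteArrayInputStream to create the attachment.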

(ii)SIGNED MAIL
Definition: The content type of the mail file not only indicates whether the content is plain text or HTML and which charset it uses; it also indicates whether the mail is a signed mail. A signature lets the recipient verify the sender and check that the content was not altered; it does not by itself hide the content from other users.
Parsing: When the content type has the string “signed-data”, the mail is treated as a signed email. An email attachment can be created using the input stream of the mail file and delivered to the recipients.

(iii) UNKNOWN MIME TYPE
Parsing: When the mime type is not known for a communication, the content is treated as either plain text or HTML text, based on the content type, and stored in the database.

(b) When the content is an instance of MimeMultipart, the following are the known possibilities of content.

(i) multipart/mixed or multipart/related
When the mime type is “multipart/mixed” or “multipart/related”, the mail file contains a combination of two or more of the remaining mime types. Hence, it’s necessary to get the count of parts in the multipart content and parse them individually based on the mime type of each part.
(ii) RFC 822 headers (multipart/report)
Normally, delivery receipt mails have the mime type as multipart/report. These mails normally have an input stream with the mime type as “text/rfc822-headers”. The input stream should be converted into a string and stored with the content type as either text/plain or text/html.
(iii) text/*
The mime type text/* implies that the content can be either text/html or text/plain. A part with this mime type can also be an attachment; whether it is one can be determined from whether a file name is defined for the part. If there is a file name, create an attachment with the input stream. If there is no file name, add the string as the content of the communication and set the content type as either text/plain or text/html.
(iv) multipart/alternative
• The mime type “multipart/alternative” implies that the content of the communication is a combination of mime types.
• Normally the content will be of both text/plain and text/html types.
• The html version is given more preference keeping the visualization of the mail in mind.
• Apart from these two types, the content can also belong to any one of the remaining mime types.
(v) multipart/*
This means that the content of the mail file has multiple parts but they are neither multipart/mixed nor multipart/related. Hence, we must retrieve each part of the multipart content and parse it individually based on the mime type of each part.
(vi) message/rfc822
The mails with this mime type are added as an attachment to the mail with the extension as “eml”.
(vii) Handle attachments & inline images
• When the mail has attachments, the part will have a file name or the content’s encoding will be “base64”, and the part will be an instance of MimeBodyPart.
• The input stream is used to create an attachment for the mail.
• When the mail has inline images, store the content id of the image in the database too.
• When the mail is displayed to the recipient, the content has to be parsed again to search for the match between the content id that was stored and the one available in the content.
• When a match is found, place the inline image in that section of the content. In order to ensure that no information is lost, the inline image can also be added as an attachment for the mail. When it is not possible to find a match for the stored content id, the user can at least view it as an attachment.
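
The matching of stored content ids against the content at display time can be sketched with a regular expression. The class name, the map of content ids to URLs, and the URL used in the example are assumptions for illustration. The sketch relies on the usual convention that HTML refers to an inline image as cid:some-id, while the Content-ID header wraps the same id in angle brackets.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper that replaces cid: references with URLs of stored images. */
final class InlineImageResolver {
    // Matches references such as cid:logo@example inside the HTML content.
    private static final Pattern CID_REF = Pattern.compile("cid:([^\"'\\s>)]+)");

    /**
     * @param html           the stored HTML content of the mail
     * @param urlByContentId stored content ids (without angle brackets) mapped to image URLs
     */
    static String resolve(String html, Map<String, String> urlByContentId) {
        Matcher matcher = CID_REF.matcher(html);
        StringBuffer out = new StringBuffer();
        while (matcher.find()) {
            String url = urlByContentId.get(normalize(matcher.group(1)));
            // Leave the reference untouched when no stored content id matches;
            // the image is then still reachable as an ordinary attachment.
            String replacement = (url != null) ? url : matcher.group();
            matcher.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        matcher.appendTail(out);
        return out.toString();
    }

    /** Strips the angle brackets that surround a Content-ID header value. */
    static String normalize(String contentId) {
        String id = contentId.trim();
        if (id.length() >= 2 && id.startsWith("<") && id.endsWith(">")) {
            id = id.substring(1, id.length() - 1);
        }
        return id;
    }
}
```

Storing the normalized id, as returned by normalize, keeps the lookup at display time a plain map access.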

ATTENTION POINTS
1. Charset Issues:
i. When there is interaction with the external world, there are lots of possibilities that the format of the content that is viewed by the receiver is totally distorted from what was sent.
ii. One of the main reasons can be the charset in which the content is stored, especially in the case of blobs for attachments. Make it a point to store it in the charset used by the database in order to ensure better visualization.

2. Encoded characters in subject, attachment names:
i. When the subject or attachment’s name contains accented characters, they will most probably appear distorted to the end user. To overcome this scenario, use the decodeText and decodeWord methods in MimeUtility.
ii. Also try setting the following system property in your server: System.setProperty("mail.mime.decodetext.strict", "false");

3. Encoding problems in mails sent by MAC users:

When the encoding of the keywords or attachment names contains the words “CSMACINTOSH”, “MAC” or “MACINTOSH”, replace the prefix of the encoded word, from “=?” up to and including “?Q?”, with the string “=?macroman?Q?” and then use the decodeText method in MimeUtility.
4. It is always possible for a mail to not have an author or have an author with only the name and no email address. Ensure that you create a dummy address whenever you figure out that there is no information about the author. The encoding problems mentioned above can also occur in the name of the author.
5. Some charsets are not supported by the mail.jar. Please include jcharset.jar to handle some usual charsets that are unsupported by the version provided by Sun.
6. When a mail contains only BCC recipients, most mail servers send “undisclosed-recipients:;( undisclosed-recipients;:)” in the To header. There is a possibility of this value being in an illegal format. The exception for the same needs to be handled.
7. There is a possibility of encountering the following exception while parsing the content of the mail:
sun.io.MalformedInputException at sun.io.ByteToCharUTF8.convert(ByteToCharUTF8.java(Compiled Code)).
This is an inherent bug in the jar provided by Sun itself.
(Ref.: http://bugs.sun.com/bugdatabase/view...bug_id=6274255)
8. When the SMTP server and the application’s database are not up to date with the existing domains of the users it is possible that some mails may be received, but may not match with the defined domain list. Hence such mails will be ignored. Therefore, it is a must to ensure that the domain names and email addresses of users are updated on a daily basis in both the database as well as the SMTP server. Every time the SMTP server is updated, it must be restarted for the changes to take effect.
9. Error handling: Whenever there is an error, if the recipients are known, try to create an email attachment with the original mail file and send the mail to the recipients. This will ensure that the content is not lost. To analyze as to why parsing failed, program it in such a way that whenever an exception occurs, the mail file is moved to an error folder.
The files in this folder can be analyzed to improve the parsing functionality by trial and error.

10. Use of log tables:
i. Three or more log tables would be needed for the mail parser.
ii. One log table must store the unique combination of message-id and email date.
iii. The other log tables can be used to store the log of mail files that were processed successfully and those that failed while processing.
iv. If some mail files were ignored because they were duplicates i.e. same mail for existing recipient, we need to store that information too.
v. Keep the mail files in separate folders such as parsed mails, ignored mails and failed mails so that you can use them for analyzing and validating records at any time.
vi. You can retain the data for a pre-defined period of time and then delete the files.
vii. It is a good practice to store the date of parsing the mail file too in the log tables. A batch cycle can be written to clean the table after a pre-determined number of days.

11. Testing:
• Testing has to be performed with as many external mail servers as possible to ensure completeness of coverage of all possibilities of content and attachment formats and charsets.
• Also, please focus on the mail providers used by the majority of people at your client’s place.
• Make it a practice to always use a different combination of recipients so that you can test if the delivery to BCC is successful.
• Also test with accented characters and for other error scenarios.
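
The error-folder step from attention point 9 can be sketched with java.nio.file. The class name ErrorFolder, the MailParser interface and the folder layout are hypothetical; the real parser call would take the place of the interface.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

/** Hypothetical helper that keeps unparseable mail files for later analysis. */
final class ErrorFolder {
    /** Placeholder for the actual parsing step. */
    interface MailParser {
        void parse(Path mailFile) throws Exception;
    }

    /** Moves a mail file into errorDir and returns its new path. */
    static Path quarantine(Path mailFile, Path errorDir) {
        try {
            Files.createDirectories(errorDir);
            Path target = errorDir.resolve(mailFile.getFileName());
            return Files.move(mailFile, target, StandardCopyOption.REPLACE_EXISTING);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Runs the parser; on any exception the file is moved to errorDir. Returns true on success. */
    static boolean parseOrQuarantine(Path mailFile, Path errorDir, MailParser parser) {
        try {
            parser.parse(mailFile);
            return true;
        } catch (Exception e) {
            quarantine(mailFile, errorDir);
            return false;
        }
    }
}
```

The same pattern extends to the parsed and ignored folders mentioned under the log tables: only the target directory changes.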

GLOSSARY
SMTP:
• The Simple Mail Transfer Protocol (SMTP) is the mechanism for delivery of email. In the context of the JavaMail API, a JavaMail-based program will communicate with the SMTP server of a particular company or Internet Service Provider (ISP).
• That SMTP server will relay the message on to the SMTP server of the recipient(s) to eventually be acquired by the user(s) through POP or IMAP.
• This does not require your SMTP server to be an open relay, as authentication is supported, but it is your responsibility to ensure the SMTP server is configured properly.
• There is nothing in the JavaMail API for tasks like configuring a server to relay messages or to add and remove email accounts.

MIME:
• MIME stands for Multipurpose Internet Mail Extensions. It is not a mail transfer protocol.
• Instead, it defines the content of what is transferred: the format of the messages, attachments, and so on. There are many different documents that take effect here: RFC 822, RFC 2045, RFC 2046, and RFC 2047.
• As a user of the JavaMail API, you usually do not need to concern yourself with these formats. However, they do exist and are used by your programs.

SESSION:
The Session class defines a basic mail session. It is through this session that everything else works. The Session object takes advantage of a java.util.Properties object to get information like mail server, username, password, and other information that can be shared across your entire application.

MESSAGE:
• Message is an abstract class, so you must work with a subclass, in most cases javax.mail.internet.MimeMessage.
• A MimeMessage is an email message that understands MIME types and headers, as defined in the different RFCs.
• Message headers are restricted to US-ASCII characters only, though non-ASCII characters can be encoded in certain header fields.

CONCLUSION
The parsing process explained above will never be a perfect replacement for existing thick client applications like Lotus Notes, Outlook Express, etc. It only provides an alternative approach when you create a web-mail application. The Internet is a source of immense information when it comes to dealing with issues in JavaMail parsing.
